[Spark-Avro] Question related to the Avro data generated by Spark-Avro

2015-11-16 Thread java8964
Hi, I have one question related to Spark-Avro; I am not sure if this is the best place to ask. I have the following Scala case classes, populated with data in the Spark application, and I tried to save them in Avro format in HDFS: case class Claim ( ..) case class Coupon ( account_id: Long

RE: In Spark application, how to get the passed in configuration?

2015-11-12 Thread java8964
ix. So try something like --conf spark.runtime.environment=passInValue . Regards, Varun On Thu, Nov 12, 2015 at 9:51 PM, java8964 <java8...@hotmail.com> wrote: In my Spark application, I want to access the passed-in configuration, but it doesn't work. How should I do that? object myCode

In Spark application, how to get the passed in configuration?

2015-11-12 Thread java8964
In my Spark application, I want to access the passed-in configuration, but it doesn't work. How should I do that? object myCode extends Logging { // starting point of the application def main(args: Array[String]): Unit = { val sparkContext = new SparkContext() val runtimeEnvironment =
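
A minimal sketch of the approach suggested in this thread: pass the value with a spark.-prefixed key (for example spark-submit --conf spark.runtime.environment=passInValue ...) and read it back through the SparkConf; keys without the spark. prefix are silently dropped by spark-submit. The key name follows the thread; everything else is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object MyCode {
      def main(args: Array[String]): Unit = {
        val sparkContext = new SparkContext(new SparkConf())
        // Reads the value passed as --conf spark.runtime.environment=passInValue
        val runtimeEnvironment =
          sparkContext.getConf.get("spark.runtime.environment", "default")
        println(s"runtime environment = $runtimeEnvironment")
        sparkContext.stop()
      }
    }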

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-11 Thread java8964
Any reason that Spark Cassandra connector won't work for you? Yong To: bryan.jeff...@gmail.com; user@spark.apache.org From: bryan.jeff...@gmail.com Subject: RE: Cassandra via SparkSQL/Hive JDBC Date: Tue, 10 Nov 2015 22:42:13 -0500 Anyone have thoughts or a similar use-case for SparkSQL /

RE: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread java8964
Wouldn't you be able to use a CASE statement to generate a virtual column (like partition_num), then use analytic SQL partitioned by this virtual column? That way the full dataset is scanned only once. Yong Date: Thu, 29 Oct 2015 10:51:53 -0700 Subject: RDD's filter() or using 'where'
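
A hedged sketch of that CASE-plus-window idea, assuming a HiveContext (window functions need Hive support in Spark 1.4/1.5) and hypothetical table/column names (events, id, amount):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // Bucket rows into a virtual partition with CASE, then run an analytic
    // function partitioned by that bucket -- one scan over the data.
    val result = hiveContext.sql("""
      SELECT id,
             amount,
             SUM(amount) OVER (PARTITION BY partition_num) AS bucket_total
      FROM (
        SELECT id,
               amount,
               CASE WHEN id % 3 = 0 THEN 0
                    WHEN id % 3 = 1 THEN 1
                    ELSE 2 END AS partition_num
        FROM events
      ) t
    """)
    result.show()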

RE: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread java8964
? Thanks Anfernee On Thu, Oct 29, 2015 at 11:07 AM, java8964 <java8...@hotmail.com> wrote: Won't you be able to use case statement to generate a virtual column (like partition_num), then use analytic SQL partition by this virtual column? In this case, the full dataset will be just scanned

RE: Problem with make-distribution.sh

2015-10-26 Thread java8964
Maybe you need the Hive part? Yong Date: Mon, 26 Oct 2015 11:34:30 -0400 Subject: Problem with make-distribution.sh From: yana.kadiy...@gmail.com To: user@spark.apache.org Hi folks, building spark instructions (http://spark.apache.org/docs/latest/building-spark.html) suggest that

RE: Spark SQL running totals

2015-10-15 Thread java8964
My mistake. I didn't notice that "UNBOUNDED PRECEDING" is already supported. So a cumulative sum should work then. Thanks Yong From: java8...@hotmail.com To: mich...@databricks.com; deenar.toras...@gmail.com CC: spanayo...@msn.com; user@spark.apache.org Subject: RE: Spark SQL running totals Date: Thu, 15

RE: Spark SQL running totals

2015-10-15 Thread java8964
Not sure the window function can work for his case. If you do a "sum() over (partition by)", that will return a total sum per partition, instead of the cumulative sum wanted in this case. I saw there is a "cume_dist", but no "cume_sum". Do we really have a "cume_sum" in the Spark window functions, or
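
For reference, a hedged sketch of how a running total can be written with a window frame in Spark 1.4+ through a HiveContext; the table and columns (transactions, account_id, ts, amount) are hypothetical. The ORDER BY plus the frame clause is what turns a per-partition total into a cumulative sum:

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val runningTotals = hiveContext.sql("""
      SELECT account_id,
             ts,
             amount,
             SUM(amount) OVER (PARTITION BY account_id
                               ORDER BY ts
                               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
      FROM transactions
    """)
    runningTotals.show()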

RE: Spark DataFrame GroupBy into List

2015-10-14 Thread java8964
My guess is to use the same UDAF as in Hive (collect_set). https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF) Yong From: sliznmail...@gmail.com Date: Wed, 14 Oct 2015 02:45:48 + Subject: Re: Spark DataFrame GroupBy into List To:
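
A minimal sketch of the collect_set idea through HiveQL, assuming a HiveContext and hypothetical table/column names (orders, user_id, product):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // Group rows by key and collapse a column into a list; collect_set drops
    // duplicates (collect_list, where available, keeps them).
    val grouped = hiveContext.sql("""
      SELECT user_id, collect_set(product) AS products
      FROM orders
      GROUP BY user_id
    """)
    grouped.show()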

How to handle the UUID in Spark 1.3.1

2015-10-09 Thread java8964
Hi, Sparkers: In this case, I want to use Spark as an ETL engine to load the data from Cassandra and save it into HDFS. Here is the environment information: Spark 1.3.1, Cassandra 2.1, HDFS/Hadoop 2.2. I am using the Cassandra Spark Connector 1.3.x, with which I have no problem querying the C*

RE: How to handle the UUID in Spark 1.3.1

2015-10-09 Thread java8964
This is related: SPARK-10501 On Fri, Oct 9, 2015 at 7:28 AM, java8964 <java8...@hotmail.com> wrote: Hi, Sparkers: In this case, I want to use Spark as an ETL engine to load the data from Cassandra and save it into HDFS. Here is the environment information: Spark 1.3.1, Cassandra 2

RE: Building RDD for a Custom MPP Database

2015-10-05 Thread java8964
You want to implement a custom InputFormat for your MPP, which can provide the location preference information to Spark. Yong > Date: Mon, 5 Oct 2015 10:53:27 -0700 > From: vjan...@sankia.com > To: user@spark.apache.org > Subject: Building RDD for a Custom MPP Database > > Hi > I have to build
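
The thread recommends a custom Hadoop InputFormat whose splits carry location information. The same location-preference idea can also be sketched on the Spark side as a custom RDD overriding getPreferredLocations; this is a hedged sketch, not the poster's code, and MppShardPartition and queryShard are hypothetical names:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // One partition per MPP shard, remembering the host that owns the shard.
    case class MppShardPartition(index: Int, shardId: Int, host: String) extends Partition

    class MppRDD(sc: SparkContext,
                 shards: Seq[(Int, String)],              // (shardId, host)
                 queryShard: Int => Iterator[String])     // hypothetical MPP client call
        extends RDD[String](sc, Nil) {

      override def getPartitions: Array[Partition] =
        shards.zipWithIndex.map { case ((id, host), i) =>
          MppShardPartition(i, id, host): Partition
        }.toArray

      // This is what gives the scheduler the location preference.
      override def getPreferredLocations(split: Partition): Seq[String] =
        Seq(split.asInstanceOf[MppShardPartition].host)

      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        queryShard(split.asInstanceOf[MppShardPartition].shardId)
    }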

RE: Problem understanding spark word count execution

2015-10-02 Thread java8964
tes sent to driver is the final output aggregated on the reducers end, and merged back to the driver." , which part of our word count code takes care of this part ? And yes there are only 273 distinct words in the text so that's not a surprise. Thanks again, Hope to get a reply. --Kartik On T

RE: Problem understanding spark word count execution

2015-10-02 Thread java8964
if every shuffle write always writes to disk, what is the meaning of these properties: spark.shuffle.memoryFraction and spark.shuffle.spill? Thanks, Kartik On Fri, Oct 2, 2015 at 6:22 AM, java8964 <java8...@hotmail.com> wrote: No problem. From the mapper side, Spark

RE: Problem understanding spark word count execution

2015-10-01 Thread java8964
I am not sure about the original explanation of shuffle write. In the word count example, the shuffle is needed, as Spark has to group by the word (reduceByKey is more accurate here). Imagine that you have 2 mappers to read the data; then each mapper will generate the (word, count) tuple output in
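
For context, a minimal word count in the shape this thread discusses; the reduceByKey call is where the shuffle (and the map-side combine) happens. The paths are hypothetical:

    val counts = sc.textFile("hdfs:///data/input.txt")
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // shuffle boundary: combines within each partition, then across the cluster
    counts.saveAsTextFile("hdfs:///data/wordcount-output")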

RE: Setting executors per worker - Standalone

2015-09-29 Thread java8964
I don't think you can do that in Standalone mode before 1.5. The best you can do is to have multiple workers per box. One worker can and will only start one executor before Spark 1.5. What you can do is set "SPARK_WORKER_INSTANCES", which controls how many worker instances you can start per

RE: nested collection object query

2015-09-29 Thread java8964
You have 2 options. Option 1: Use lateral view explode, as you did below; but if you want to remove the duplicates, then use distinct after that. For example, starting from col1, col2, ArrayOf(Struct), after explode you get:
col1, col2, employee0
col1, col2, employee1
col1, col2, employee0
Then select distinct col1, col2
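
A hedged sketch of option 1 through a HiveContext, with hypothetical table/column names (member, id, and employee as an array of structs with a name field):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // Explode the array of structs into one row per element, then deduplicate.
    val exploded = hiveContext.sql("""
      SELECT DISTINCT m.id, e.name
      FROM member m
      LATERAL VIEW explode(m.employee) t AS e
    """)
    exploded.show()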

RE: nested collection object query

2015-09-28 Thread java8964
Your employee is in fact an array of structs, not just a struct. If you are using HiveContext, then you can refer to it like the following: select id from member where employee[0].name = 'employee0' Here employee[0] points to the 1st element of the array. If you want to query all the elements in the

RE: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-28 Thread java8964
ary fields :) Cheng On 9/25/15 2:03 PM, java8964 wrote: Hi, Spark Users: I have a problem where Spark cannot recognize the string type in the Parquet schema generated by Hive. Versions of all

Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread java8964
Hi, Spark Users: I have a problem where Spark cannot recognize the string type in the Parquet schema generated by Hive. Versions of all components: Spark 1.3.1, Hive 0.12.0, Parquet 1.3.2. I generated a detailed low-level table in Parquet format using MapReduce Java code. This table can be read

RE: Java Heap Space Error

2015-09-24 Thread java8964
nding where dt='2015-9' and userid != '' and userid is not null and userid is not NULL and pagetype = 'productDetail' group by userid """.stripMargin) @java8964 I tried with sql.shuffle.partitions = 1 but no luck. It’s again one of the partitions shuffle size is huge and th

RE: Java Heap Space Error

2015-09-24 Thread java8964
and day >= '${day}' and userid != '' and userid is not null and userid is not NULL and pagetype = 'productDetail' group by userid """.stripMargin) On 24 Sep 2015, at 16:52, java8964 <java8...@hotmail.com> wrote: This is interesting. So you mean that query as "select userid

RE: Java Heap Space Error

2015-09-23 Thread java8964
(regexp_replace(regexp_replace(substr(productcategory,2,length(productcategory)-2),'\"',''),\",\",' ') inputlist from landing where dt='2015-9' and userid != '' and userid is not null and userid is not NULL and pagetype = 'productDetail' group by userid On 23 Sep 2015, at 23:55,

RE: Debugging too many files open exception issue in Spark shuffle

2015-09-23 Thread java8964
That is interesting. I don't have any Mesos experience, but just want to know the reason why it does so. Yong > Date: Wed, 23 Sep 2015 15:53:54 -0700 > Subject: Debugging too many files open exception issue in Spark shuffle > From: dbt...@dbtsai.com > To: user@spark.apache.org > > Hi, > >

RE: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread java8964
Or at least tell us how many partitions you are using. Yong > Date: Tue, 22 Sep 2015 02:06:15 -0700 > From: belevts...@gmail.com > To: user@spark.apache.org > Subject: Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables > > Could it be that your data is skewed? Do you have

RE: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread java8964
Your performance problem sounds like it is in the driver, which is trying to broadcast 10k files all by itself, and that becomes the bottleneck. What you want is just to transfer the data per file from Avro format to another format. In MR, most likely each mapper processes one file, and you utilized the

RE: application failed on large dataset

2015-09-16 Thread java8964
ch the block, and after several retries, the executor just dies with such an error. And for your question, I did not see any executor restart during the job. PS: the operator I am using during that stage is rdd.glom().mapPartitions(). java8964 <java8...@hotmail.com> wrote on Tue, Sep 15, 2015 at

RE: application failed on large dataset

2015-09-16 Thread java8964
? sun.nio.ch.selectionkeyi...@3011c7c9 java.nio.channels.CancelledKeyException at org.apache.spark.network.nio.ConnectionManager.run(ConnectionManager.scala:461) at org.apache.spark.network.nio.ConnectionManager$$anon$7.run(ConnectionManager.scala:193) java8964 <java8...@hotmail.com> wrote on Wed, Sep 16, 2015 at

RE: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread java8964
to package all my custom classes and their dependencies, including protobuf 3. The problem is how to configure spark-shell to use my uber jar first. java8964 -- appreciate the link and I will try the configuration. Looks promising. However, the "user classpath first" attribute does not apply

RE: application failed on large dataset

2015-09-15 Thread java8964
When you saw this error, did any executor die, for whatever reason? Did you check whether any executors restarted during your job? It is hard to help you with just the stack trace. You need to tell us the whole picture of what happens when your jobs are running. Yong From: qhz...@apache.org Date: Tue, 15 Sep

RE: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread java8964
It is a bad idea to use a major version change of protobuf, as it most likely won't work. But if you really want to give it a try, set "user classpath first", so the protobuf 3 coming with your jar will be used. The setting depends on your deployment mode; check this for the parameter:
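
For reference, a hedged sketch of the "user classpath first" settings as they exist (marked experimental) in Spark 1.3+; they can also be passed as --conf flags to spark-submit, and, as the reply above notes, they may not apply in every shell/deployment mode:

    import org.apache.spark.{SparkConf, SparkContext}

    // Prefer classes from the application's uber jar over Spark's own
    // dependencies, so the bundled protobuf 3 wins on the classpath.
    val conf = new SparkConf()
      .set("spark.driver.userClassPathFirst", "true")
      .set("spark.executor.userClassPathFirst", "true")
    val sc = new SparkContext(conf)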

RE: Best way to merge final output part files created by Spark job

2015-09-14 Thread java8964
For text files, this merge works fine, but for binary formats like "ORC", "Parquet" or "Avro", I am not sure this will work. These kinds of formats are in fact not appendable, as they write detailed metadata either at the head or at the tail of the file. You have to use the format-specific

RE: Calculating Min and Max Values using Spark Transformations?

2015-08-28 Thread java8964
Or would RDD.max() and RDD.min() not work for you? Yong Subject: Re: Calculating Min and Max Values using Spark Transformations? To: as...@wso2.com CC: user@spark.apache.org From: jfc...@us.ibm.com Date: Fri, 28 Aug 2015 09:28:43 -0700 If you already loaded the csv data into a dataframe, why not
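
A minimal sketch of the RDD.max()/RDD.min() suggestion, assuming the values of interest have already been extracted into a numeric RDD; the path and column position are hypothetical:

    // Take the third CSV column as a Double and compute its extremes.
    val values = sc.textFile("hdfs:///data/input.csv")
      .map(_.split(","))
      .filter(_.length > 2)
      .map(_(2).toDouble)

    val minValue = values.min()   // uses the implicit Ordering[Double]
    val maxValue = values.max()
    println(s"min = $minValue, max = $maxValue")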

RE: How to avoid shuffle errors for a large join ?

2015-08-28 Thread java8964
There are several possibilities here. 1) Keep in mind that 7GB of data will need way more than a 7G heap, as deserialized Java objects need much more space than the data itself. A rule of thumb is a multiple of 6 to 8 times, so 7G of data needs about 50G of heap space. 2) You should monitor the Spark UI to check how many

RE: query avro hive table in spark sql

2015-08-27 Thread java8964
What version of Hive are you using? And did you compile against the right version of Hive when you built Spark? BTW, spark-avro works great in our experience, but still, some non-technical people just want to use a SQL shell in Spark, like the Hive CLI. Yong From: mich...@databricks.com Date: Wed,

RE: query avro hive table in spark sql

2015-08-27 Thread java8964
if this issue might be because of querying across different schema versions of the data? Thanks, Giri On Thu, Aug 27, 2015 at 5:39 AM, java8964 java8...@hotmail.com wrote: What version of Hive are you using? And did you compile against the right version of Hive when you built Spark? BTW, spark-avro works great

RE: Protobuf error when streaming from Kafka

2015-08-25 Thread java8964
Did you build your Spark with Hive? I ran into the same problem before because the hive-exec jar in Maven itself includes protobuf classes, which end up included in the Spark jar. Yong Date: Tue, 25 Aug 2015 12:39:46 -0700 Subject: Re: Protobuf error when streaming from Kafka From: lcas...@gmail.com

SparkSQL problem with IBM BigInsight V3

2015-08-25 Thread java8964
Hi, In our production environment, we have a unique problem related to Spark SQL, and I wonder if anyone can give me some ideas on the best way to handle it. Our production Hadoop cluster is IBM BigInsight Version 3, which comes with Hadoop 2.2.0 and Hive 0.12. Right now, we build Spark

RE: Protobuf error when streaming from Kafka

2015-08-25 Thread java8964
need to build Spark from source code? On Tue, Aug 25, 2015 at 1:06 PM, Cassa L lcas...@gmail.com wrote: I downloaded the binary version of Spark below: spark-1.4.1-bin-cdh4 On Tue, Aug 25, 2015 at 1:03 PM, java8964 java8...@hotmail.com wrote: Did you build your Spark with Hive? I ran into the same problem

RE: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread java8964
I believe spark-shell -i scriptFile is there. We also use it, at least in Spark 1.3.1. dse spark just wraps the spark-shell command; underneath, it is just invoking spark-shell. I don't know much about the original problem though. Yong Date: Fri, 21 Aug 2015 18:19:49 +0800 Subject: Re:

RE: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread java8964
What version of Spark are you using, or which one comes with DSE 4.7? We just cannot reproduce it in Spark.
yzhang@localhost$ more test.spark
val pairs = sc.makeRDD(Seq((0,1),(0,2),(1,20),(1,30),(2,40)))
pairs.reduceByKey((x,y) => x + y).collect
yzhang@localhost$ ~/spark/bin/spark-shell --master local -i

How frequently should full gc we expect

2015-08-21 Thread java8964
In the test job I am running in Spark 1.3.1 on our stage cluster, I can see the following information on the application's stage page:
Metric     Min     25th percentile   Median    75th percentile   Max
Duration   0 ms    1.1 min           1.5 min   1.7 min           3.4 min
GC Time    11 s    16 s              21 s      25 s              54 s
From the GC output log, I can see it is

RE: Any suggestion about sendMessageReliably failed because ack was not received within 120 sec

2015-08-20 Thread java8964
The closest information I could find online related to this error is https://issues.apache.org/jira/browse/SPARK-3633 But our case is quite different. In our case, we never saw the (Too many open files) error; the log just simply shows the 120 sec timeout. I checked all the GC output from all

Any suggestion about sendMessageReliably failed because ack was not received within 120 sec

2015-08-20 Thread java8964
Hi, Sparkers: After the first 2 weeks of Spark in our production cluster, and being more familiar with Spark, we are more confident about avoiding lost executors due to memory issues. So far, most of our jobs won't fail or slow down due to a lost executor. But sometimes I observed that individual tasks failed due

RE: Failed to fetch block error

2015-08-19 Thread java8964
From the log, it looks like the OS user who is running Spark cannot open any more files. Check the ulimit setting for that user:
ulimit -a
open files (-n) 65536
Date: Tue, 18 Aug 2015 22:06:04 -0700 From: swethakasire...@gmail.com To: user@spark.apache.org Subject: Failed

RE: Spark Job Hangs on our production cluster

2015-08-18 Thread java8964
classes is not responsive. I'd try running outside of the repl and see if that works. sorry not a full diagnosis but maybe this'll help a bit. On Tue, Aug 11, 2015 at 3:19 PM, java8964 java8...@hotmail.com wrote: Currently we have a IBM BigInsight cluster with 1 namenode + 1 JobTracker + 42 data

Spark Job Hangs on our production cluster

2015-08-17 Thread java8964
I am comparing the Spark logs line by line between the hanging case (big dataset) and the non-hanging case (small dataset). In the hanging case, Spark's log looks identical to the non-hanging case for reading the first block of data from HDFS. But after that, starting from line 438 in the

RE: Spark Job Hangs on our production cluster

2015-08-14 Thread java8964
I still want to check if anyone can provide any help related to Spark 1.2.2 hanging on our production cluster when reading big HDFS data (7800 Avro blocks), while it looks fine for small data (769 Avro blocks). I enabled the debug level in the Spark log4j and attached the log file if it

Spark 1.2.2 build problem with Hive 0.12, bringing in wrong version of avro-mapred

2015-08-12 Thread java8964
Hi, This email is sent to both the dev and user lists; I just want to see if someone familiar with the Spark/Maven build procedure can provide any help. I am building Spark 1.2.2 with the following command: mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Phive -Phive-0.12.0 The spark-assembly-1.2.2-hadoop2.2.0.jar

Spark Job Hangs on our production cluster

2015-08-11 Thread java8964
Currently we have an IBM BigInsight cluster with 1 namenode + 1 JobTracker + 42 data/task nodes, which runs BigInsight V3.0.0.2, corresponding to Hadoop 2.2.0 with MR1. Since IBM BigInsight doesn't come with Spark, we built Spark 1.2.2 with Hadoop 2.2.0 + Hive 0.12 ourselves, and

Spark SQL query AVRO file

2015-08-07 Thread java8964
Hi, Spark users: We are currently using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our production cluster, which has 42 data/task nodes. There is one dataset of about 3T stored as Avro files. Our business has a complex query running against the dataset, which is stored in a nested structure with an Array of

RE: Spark SQL query AVRO file

2015-08-07 Thread java8964
...@databricks.com Date: Fri, 7 Aug 2015 11:32:21 -0700 Subject: Re: Spark SQL query AVRO file To: java8...@hotmail.com CC: user@spark.apache.org Have you considered trying Spark SQL's native support for avro data? https://github.com/databricks/spark-avro On Fri, Aug 7, 2015 at 11:30 AM, java8964 java8

RE: Spark SQL query AVRO file

2015-08-07 Thread java8964
it using HiveQL CREATE TEMPORARY TABLE episodes USING com.databricks.spark.avro OPTIONS (path src/test/resources/episodes.avro) On Fri, Aug 7, 2015 at 11:42 AM, java8964 java8...@hotmail.com wrote: Hi, Michael: I am not sure how spark-avro can help in this case. My understanding is that to use

RE: Use rank with distribute by in HiveContext

2015-07-16 Thread java8964
Yes. Hive UDFs and distribute by are both supported by Spark SQL. If you are using Spark 1.4, you can try the Hive analytics window functions (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics), most of which are already supported in Spark 1.4, so you don't need the
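
A hedged sketch of the window-function route in Spark 1.4 through a HiveContext, with hypothetical table/column names (sales, region, amount); the PARTITION BY clause plays the role that distribute by played in the plain Hive formulation:

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val ranked = hiveContext.sql("""
      SELECT region,
             amount,
             rank() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
      FROM sales
    """)
    ranked.show()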

RE: How does spark manage the memory of executor with multiple tasks

2015-05-27 Thread java8964
Like you, there are lots of people coming from the MapReduce world and trying to understand the internals of Spark. Hope the below helps you in some way. End users only have the concept of a job: I want to run a word count job on this one big file; that is the job I want to run. How many

RE: Re: Re: RE: Re: Re: sparksql running slow while joining 2 tables.

2015-05-06 Thread java8964
It looks like you have data in these 24 partitions, or more. How many unique names are in your data set? Enlarging the shuffle partitions only makes sense if you have large partition groups in your data. What you described looks like either your dataset has data in these 24 partitions, or you have

RE: Expert advise needed. (POC is at crossroads)

2015-04-30 Thread java8964
Really not an expert here, but try the following ideas: 1) I assume you are using YARN; this blog is very good on resource tuning: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ 2) If 12G is a hard limit in this case, then you have no option but to lower

RE: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread java8964
If it is really due to data skew, wouldn't the hanging task have a much bigger shuffle write size in this case? Here, the shuffle write size for that task is 0, and the rest of the I/O for this task is not much larger than that of the quickly finished tasks; is that normal? I am also interested in this case, as

RE: EC2 spark-submit --executor-memory

2015-04-08 Thread java8964
If you are using a Spark Standalone deployment, make sure you set the worker memory (SPARK_WORKER_MEMORY) to over 20G, and that you actually have 20G of physical memory. Yong Date: Tue, 7 Apr 2015 20:58:42 -0700 From: li...@adobe.com To: user@spark.apache.org Subject: EC2 spark-submit --executor-memory Dear Spark team, I'm

RE: Reading file with Unicode characters

2015-04-08 Thread java8964
Spark uses the Hadoop TextInputFormat to read the file. Since Hadoop almost only supports Linux, UTF-8 is the only encoding supported, as it is the one used on Linux. If you have data in another encoding, you may want to vote for this JIRA: https://issues.apache.org/jira/browse/MAPREDUCE-232
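
As a hedged workaround sketch (not from the thread itself): read the raw Text bytes through the old-API hadoopFile and decode each line with the actual charset; the path and charset here are hypothetical:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    // TextInputFormat splits on newline bytes without decoding the content,
    // so each line's bytes can still be decoded with a non-UTF-8 charset.
    val lines = sc
      .hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/latin1.txt")
      .map { case (_, text) => new String(text.getBytes, 0, text.getLength, "ISO-8859-1") }
    lines.take(5).foreach(println)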

RE: Incremently load big RDD file into Memory

2015-04-07 Thread java8964
cartesian is an expensive operation. If you have 'M' records in locations, then locations.cartesian(locations) will generate M x M results. If locations is a big RDD, it is hard to do locations.cartesian(locations) efficiently. Yong Date: Tue, 7 Apr 2015 10:04:12 -0700 From:

RE: 'Java heap space' error occured when query 4G data file from HDFS

2015-04-07 Thread java8964
It is hard to guess why the OOM happens without knowing your application's logic and the data size. Without knowing that, I can only guess based on some common experience: 1) Increase spark.default.parallelism 2) Increase your executor-memory; maybe 6g is just not enough 3) Your environment is kind

RE: Reading a large file (binary) into RDD

2015-04-03 Thread java8964
@deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 6:53 PM, java8964 java8...@hotmail.com wrote: I think implementing your own InputFormat and using SparkContext.hadoopFile() is the best option for your case. Yong From: kvi...@vt.edu Date: Thu, 2 Apr 2015 17:31:30 -0400 Subject: Re

RE: Spark SQL. Memory consumption

2015-04-02 Thread java8964
It is hard to say what the reason could be without more detailed information. If you provide some more information, maybe people here can help you better. 1) What is your worker's memory setting? It looks like your nodes have 128G of physical memory each, but what do you specify for the worker's

Cannot run the example in the Spark 1.3.0 following the document

2015-04-02 Thread java8964
I tried to check out Spark SQL 1.3.0. I installed it and followed the online document here: http://spark.apache.org/docs/latest/sql-programming-guide.html In the example, it shows something like this: // Select everybody, but increment the age by 1 df.select("name", df("age") + 1).show() //

RE: Cannot run the example in the Spark 1.3.0 following the document

2015-04-02 Thread java8964
The import command was already run. Forgot to mention, the rest of the examples related to df all work; just this one caused a problem. Thanks Yong Date: Fri, 3 Apr 2015 10:36:45 +0800 From: fightf...@163.com To: java8...@hotmail.com; user@spark.apache.org Subject: Re: Cannot run the example in the

RE: Reading a large file (binary) into RDD

2015-04-02 Thread java8964
I think implementing your own InputFormat and using SparkContext.hadoopFile() is the best option for your case. Yong From: kvi...@vt.edu Date: Thu, 2 Apr 2015 17:31:30 -0400 Subject: Re: Reading a large file (binary) into RDD To: freeman.jer...@gmail.com CC: user@spark.apache.org The file has a

RE: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available

2015-04-02 Thread java8964
Hmm, I just tested my own Spark 1.3.0 build. I have the same problem, but I cannot reproduce it on Spark 1.2.1. If we check the code change below: Spark 1.3 branch https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala vs Spark

RE: SparkSql - java.util.NoSuchElementException: key not found: node when access JSON Array

2015-03-31 Thread java8964
You can use HiveContext instead of SQLContext, which should support all of HiveQL, including lateral view explode. SQLContext does not support that yet. BTW, nice code formatting in the email. Yong Date: Tue, 31 Mar 2015 18:18:19 -0400 Subject: Re: SparkSql -

RE: java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-30 Thread java8964
I think the jar file has to be local. A jar in HDFS is not supported yet in Spark. See this answer: http://stackoverflow.com/questions/28739729/spark-submit-not-working-when-application-jar-is-in-hdfs Date: Sun, 29 Mar 2015 22:34:46 -0700 From: n.e.trav...@gmail.com To: user@spark.apache.org

RE: 2 input paths generate 3 partitions

2015-03-27 Thread java8964
The files sound too small to be 2 blocks in HDFS. Did you set the defaultParallelism to be 3 in your spark? Yong Subject: Re: 2 input paths generate 3 partitions From: zzh...@hortonworks.com To: rvern...@gmail.com CC: user@spark.apache.org Date: Fri, 27 Mar 2015 23:15:38 + Hi Rares,

RE: Why I didn't see the benefits of using KryoSerializer

2015-03-20 Thread java8964
the performance a little, though I dunno how much. It might be worth running your experiments again with slightly more complicated objects and see what you observe. Imran On Thu, Mar 19, 2015 at 12:57 PM, java8964 java8...@hotmail.com wrote: I read the Spark code a little bit, trying

RE: com.esotericsoftware.kryo.KryoException: java.io.IOException: File too large vs FileNotFoundException (Too many open files) on spark 1.2.1

2015-03-20 Thread java8964
Did you check the ulimit for the user running Spark on your nodes? Can you run ulimit -a as the user who is running Spark on the executor node? Does the result make sense for the data you are trying to process? Yong From: szheng.c...@gmail.com To: user@spark.apache.org Subject:

RE: Why I didn't see the benefits of using KryoSerializer

2015-03-19 Thread java8964
I read the Spark code a little bit, trying to understand my own question. It looks like the difference is really between org.apache.spark.serializer.JavaSerializer and org.apache.spark.serializer.KryoSerializer, both having a method named writeObject. In my test case, for each line of my text

RE: mapPartitions - How Does it Works

2015-03-18 Thread java8964
Here is what I think: mapPartitions is a specialized map that is called only once for each partition. The entire content of the respective partition is available as a sequential stream of values via the input argument (Iterator[T]). The combined result iterators are automatically
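
A minimal sketch of a typical mapPartitions use, where per-partition setup (here a date parser standing in for any expensive-to-create resource) is done once per partition instead of once per record; the path and record layout are hypothetical:

    val parsed = sc.textFile("hdfs:///data/input.txt").mapPartitions { iter =>
      // Created once per partition, not once per line.
      val dateFormat = new java.text.SimpleDateFormat("yyyy-MM-dd")
      iter.map { line =>
        val fields = line.split(",")
        (fields(0), dateFormat.parse(fields(1)).getTime)
      }
    }
    parsed.take(5).foreach(println)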

Why I didn't see the benefits of using KryoSerializer

2015-03-17 Thread java8964
Hi, I am new to Spark. I am trying to understand the memory benefits of using KryoSerializer. I have a one-box standalone test environment, which has 24 cores and 24G of memory. I installed Hadoop 2.2 plus Spark 1.2.0. I put one text file of about 1.2G in HDFS. Here are the settings in the

RE: can spark take advantage of ordered data?

2015-03-11 Thread java8964
RangePartitioner? At least for join, you can implement your own partitioner, to utilize the sorted data. Just my 2 cents. Date: Wed, 11 Mar 2015 17:38:04 -0400 Subject: can spark take advantage of ordered data? From: jcove...@gmail.com To: User@spark.apache.org Hello all, I am wondering if spark

RE: Spark SQL using Hive metastore

2015-03-11 Thread java8964
You need to include the Hadoop native library in your spark-shell/spark-sql, assuming your Hadoop native library includes the native snappy library: spark-sql --driver-library-path point_to_your_hadoop_native_library In spark-sql, you can just use any command as you would in the Hive CLI. Yong Date: Wed,

RE: sc.textFile() on windows cannot access UNC path

2015-03-10 Thread java8964
for sc.textFile(…)? Ningjun From: java8964 [mailto:java8...@hotmail.com] Sent: Monday, March 09, 2015 5:33 PM To: Wang, Ningjun (LNG-NPV); user@spark.apache.org Subject: RE: sc.textFile() on windows cannot access UNC path This is a Java problem, not really Spark. From this page

RE: Compilation error

2015-03-10 Thread java8964
Or another option is to use Scala-IDE, which is built on top of Eclipse, instead of pure Eclipse, so Scala comes with it. Yong From: so...@cloudera.com Date: Tue, 10 Mar 2015 18:40:44 + Subject: Re: Compilation error To: mohitanch...@gmail.com CC: t...@databricks.com;

RE: sc.textFile() on windows cannot access UNC path

2015-03-09 Thread java8964
This is a Java problem, not really a Spark one. From this page: http://stackoverflow.com/questions/18520972/converting-java-file-url-to-file-path-platform-independent-including-u you can see that using java.nio.* on JDK 7 will fix this issue. But the Path class in Hadoop uses java.io.*, instead of

From Spark web ui, how to prove the parquet column pruning working

2015-03-09 Thread java8964
Hi, Currently most of the data in our production uses Avro + Snappy. I want to test the benefits if we store the data in Parquet format. I changed our ETL to generate Parquet format instead of Avro, and want to test a simple SQL query in Spark SQL to verify the benefits from Parquet. I

RE: Help me understand the partition, parallelism in Spark

2015-02-26 Thread java8964
Can anyone share any thoughts related to my questions? Thanks From: java8...@hotmail.com To: user@spark.apache.org Subject: Help me understand the partition, parallelism in Spark Date: Wed, 25 Feb 2015 21:58:55 -0500 Hi, Sparkers: I come from the Hadoop MapReduce world and am trying to understand

RE: Help me understand the partition, parallelism in Spark

2015-02-26 Thread java8964
other possible sources of OOM, so this is definitely not the *only* solution. Sorry I can't comment in particular about Spark SQL -- hopefully somebody more knowledgeable can comment on that. On Wed, Feb 25, 2015 at 8:58 PM, java8964 java8...@hotmail.com wrote: Hi, Sparkers: I come from

Help me understand the partition, parallelism in Spark

2015-02-25 Thread java8964
Hi, Sparkers: I come from the Hadoop MapReduce world and am trying to understand some of the internals of Spark. From the web and this list, I keep seeing people talk about increasing the parallelism if you get an OOM error. I tried to read as much documentation as possible to understand the RDD

RE: Spark performance tuning

2015-02-21 Thread java8964
Can someone share some ideas about how to tune the GC time? Thanks From: java8...@hotmail.com To: user@spark.apache.org Subject: Spark performance tuning Date: Fri, 20 Feb 2015 16:04:23 -0500 Hi, I am new to Spark, and I am trying to test the Spark SQL performance vs Hive. I setup a

Spark performance tuning

2015-02-20 Thread java8964
Hi, I am new to Spark, and I am trying to test the Spark SQL performance vs Hive. I set up a standalone box with 24 cores and 64G of memory. We have one SQL query in mind to test. Here is the basic setup on this one box for the SQL we are trying to run: 1) Dataset 1, a 6.6G Avro file with snappy

RangePartitioner in Spark 1.2.1

2015-02-17 Thread java8964
Hi, Sparkers: I just happened to search on Google for something related to the RangePartitioner of Spark, and found an old thread on this email list here: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html I followed the code example mentioned in that email thread

RE: spark left outer join with java.lang.UnsupportedOperationException: empty collection

2015-02-12 Thread java8964
OK. I think I have to use None instead of null; then it works. Still switching over from Java. I can also just use the field name as I assumed. Great experience. From: java8...@hotmail.com To: user@spark.apache.org Subject: spark left outer join with java.lang.UnsupportedOperationException: empty

spark left outer join with java.lang.UnsupportedOperationException: empty collection

2015-02-12 Thread java8964
Hi, I am using Spark 1.2.0 with Hadoop 2.2. Now I have 2 CSV files, each with 8 fields. I know that the first field in both files is an ID. I want to find all the IDs that exist in the first file but NOT in the 2nd file. I came up with the following code in spark-shell. case class origAsLeft
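
A hedged sketch of the pattern this thread is after (IDs in the first file that are missing from the second), keyed on the first CSV field; the file paths are hypothetical. As the follow-up above notes, the missing right side shows up as None, not null:

    val left  = sc.textFile("hdfs:///data/file1.csv").map(_.split(",")).map(f => (f(0), f))
    val right = sc.textFile("hdfs:///data/file2.csv").map(_.split(",")).map(f => (f(0), f))

    // leftOuterJoin keeps every left key; the right side is an Option.
    val onlyInFirst = left.leftOuterJoin(right)
      .filter { case (_, (_, rightMatch)) => rightMatch.isEmpty }
      .keys
    onlyInFirst.take(10).foreach(println)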

Spark concurrency question

2015-02-08 Thread java8964
Hi, I have some questions about how Spark runs jobs concurrently. For example, I set up Spark on one standalone test box, which has 24 cores and 64G of memory. I set the worker memory to 48G and the executor memory to 4G, and use spark-shell to run some jobs. Here is something confusing

My first experience with Spark

2015-02-05 Thread java8964
I am evaluating Spark for our production usage. Our production cluster is Hadoop 2.2.0 without YARN. So I want to test Spark with a Standalone deployment running alongside Hadoop. What I have in mind is to test a very complex Hive query, which joins 6 tables, with lots of nested structures with

RE: My first experience with Spark

2015-02-05 Thread java8964
Finally I gave up after too many failed retries. From the log on the worker side, it looks like it failed with a JVM OOM, as below: 15/02/05 17:02:03 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Driver Heartbeater,5,main] java.lang.OutOfMemoryError: Java heap

RE: My first experience with Spark

2015-02-05 Thread java8964
, or pass in a level of parallelism as second parameter to a suitable operation in your code. Deb On Thu, Feb 5, 2015 at 1:03 PM, java8964 java8...@hotmail.com wrote: I am evaluating Spark for our production usage. Our production cluster is Hadoop 2.2.0 without Yarn. So I want to test Spark

Problem to run spark as standalone

2014-10-27 Thread java8964
Hi, Spark Users: I tried to test Spark on a standalone box, but faced an issue whose root cause I don't know. I basically followed exactly the document on deploying Spark in a standalone environment. 1) I checked out the Spark source code of release 1.1.0 2) I built Spark with the following

RE: Problem to run spark as standalone

2014-10-27 Thread java8964
I did a little more research on this. It looks like the worker started successfully, but on port 40294. This is shown in both the log and the master web UI. The question is that in the log, the master's akka.tcp is trying to connect to a different port (44017). Why? Yong From: