need info on Spark submit on yarn-cluster mode

2015-04-08 Thread sachin Singh
Hi, I observed that we have only one cluster installed, and when submitting a job as yarn-cluster I get the error below. Is the cause that the installation is only one cluster? Please correct me; if this is not the cause, then why am I not able to run in cluster mode? The spark-submit command is - spark-submit

Re: Need subscription process

2015-04-08 Thread ๏̯͡๏
Check your spam folder or any filters. On Wed, Apr 8, 2015 at 2:17 PM, Jeetendra Gangele gangele...@gmail.com wrote: Hi All how can I subscribe myself in this group so that every mail sent to this group comes to me as well. I already sent request to user-subscr...@spark.apache.org, still I am not

partition by category

2015-04-08 Thread SiMaYunRui
Hi folks, I am writing to ask how to filter and partition a set of files through Spark. The situation is that I have N big files (they cannot fit on a single machine), and each line of the files starts with a category (say Sport, Food, etc.), while there are actually fewer than 100 categories. I need a
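
A minimal sketch of one way to approach this, assuming the category is the first tab-separated field on each line (the format, paths, and partition count are hypothetical):

    import org.apache.spark.HashPartitioner

    // Key each line by its category, then co-locate each category's lines.
    val lines = sc.textFile("hdfs:///logs/input")
    val byCategory = lines.map(line => (line.split("\t", 2)(0), line))
                          .partitionBy(new HashPartitioner(100))
    byCategory.values.saveAsTextFile("hdfs:///logs/by-category")

Note that a HashPartitioner can still hash two categories into the same partition; a custom Partitioner over the known category list would give exactly one output file per category.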

Spark Tasks failing with Cannot find address

2015-04-08 Thread ๏̯͡๏
I have a spark stage that has 8 tasks. 7/8 have completed. However, 1 task is failing with "Cannot find address". Aggregated Metrics by Executor: Executor ID | Address | Task Time | Total Tasks | Failed Tasks | Succeeded Tasks | Shuffle Read Size / Records | Shuffle Write Size / Records | Shuffle Spill (Memory) | Shuffle Spill

Need subscription process

2015-04-08 Thread Jeetendra Gangele
Hi all, how can I subscribe to this group so that every mail sent to this group comes to me as well? I already sent a request to user-subscr...@spark.apache.org, but still I am not getting mail sent to this group by other persons. Regards Jeetendra

Re: Parquet Hive table become very slow on 1.3?

2015-04-08 Thread Zheng, Xudong
Hi Cheng, I tried both of these patches, and they still do not seem to resolve my issue. I found that most of the time is spent on this line in newParquet.scala: ParquetFileReader.readAllFootersInParallel(sparkContext.hadoopConfiguration, seqAsJavaList(leaves), taskSideMetaData), which needs to read all the files

Re: RDD collect hangs on large input data

2015-04-08 Thread Zsolt Tóth
I use EMR 3.3.1, which comes with Java 7. Do you think that this may cause the issue? Did you test it with Java 8?

RE: Difference between textFile Vs hadoopFile (textInputFormat) on HDFS data

2015-04-08 Thread Puneet Kumar Ojha
Thanks From: Nick Pentreath [mailto:nick.pentre...@gmail.com] Sent: Tuesday, April 07, 2015 5:52 PM To: Puneet Kumar Ojha Cc: user@spark.apache.org Subject: Re: Difference between textFile Vs hadoopFile (textInputFormat) on HDFS data There is no difference - textFile calls hadoopFile with a

Re: Spark Tasks failing with Cannot find address

2015-04-08 Thread ๏̯͡๏
Spark Version 1.3 Command: ./bin/spark-submit -v --master yarn-cluster --driver-class-path

Re: The difference between SparkSQL/DataFrame join and RDD join

2015-04-08 Thread Hao Ren
Hi Michael, In fact, I find that all workers are hanging when the SQL/DF join is running, so I picked the master and one of the workers. The jstack output is the following: Master

Re: need info on Spark submit on yarn-cluster mode

2015-04-08 Thread Steve Loughran
This means the spark workers exited with code 15; probably nothing YARN related itself (unless there are classpath-related problems). Have a look at the logs of the app/container via the resource manager. You can also increase the time that logs get kept on the nodes themselves to something
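
For reference, the usual way to pull those container logs once the application finishes, assuming YARN log aggregation is enabled (the application id is a placeholder):

    yarn logs -applicationId application_1428000000000_0001

To keep logs on the nodes themselves longer, yarn.nodemanager.delete.debug-delay-sec in yarn-site.xml delays cleanup of container directories, in seconds:

    <property>
      <name>yarn.nodemanager.delete.debug-delay-sec</name>
      <value>86400</value>
    </property>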

Re: Issue with pyspark 1.3.0, sql package and rows

2015-04-08 Thread Davies Liu
I will look into this today. On Wed, Apr 8, 2015 at 7:35 AM, Stefano Parmesan parme...@spaziodati.eu wrote: Did anybody by any chance have a look at this bug? It keeps on happening to me, and it's quite blocking. I would like to understand if there's something wrong in what I'm doing, or

Exception in thread "main" java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds] when creating context

2015-04-08 Thread Shuai Zheng
Hi All, In some cases, I get the exception below when I run Spark in local mode (I haven't seen this on a cluster). This is weird, but it also affects my local unit test cases (it does not always happen, but usually once per 4-5 runs). From the stack, it looks like the error happens when creating the context,

[ThriftServer] User permissions warning

2015-04-08 Thread Yana Kadiyska
Hi folks, I am noticing a pesky and persistent warning in my logs (this is from Spark 1.2.1): 15/04/08 15:23:05 WARN ShellBasedUnixGroupsMapping: got exception trying to get groups for user anonymous org.apache.hadoop.util.Shell$ExitCodeException: id: anonymous: No such user at

RE: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Mohammed Guller
+1 Interestingly, I ran into exactly the same issue yesterday. I couldn't find any documentation about which project to include as a dependency in build.sbt to use HiveThriftServer2. Would appreciate help. Mohammed From: Todd Nist [mailto:tsind...@gmail.com] Sent: Wednesday, April 8,

Re: Issue with pyspark 1.3.0, sql package and rows

2015-04-08 Thread Stefano Parmesan
Did anybody by any chance have a look at this bug? It keeps on happening to me, and it's quite blocking. I would like to understand if there's something wrong in what I'm doing, or whether there's a workaround or not. Thank you all, -- Dott. Stefano Parmesan Backend Web Developer and Data Lover

Re: Opening many Parquet files = slow

2015-04-08 Thread Ted Yu
You may have seen this thread: http://search-hadoop.com/m/JW1q5SlRpt1 Cheers On Wed, Apr 8, 2015 at 6:15 AM, Eric Eijkelenboom eric.eijkelenb...@gmail.com wrote: Hi guys, I’ve got: - 180 days of log data in Parquet. - Each day is stored in a separate folder in S3. - Each day

Re: Spark 1.3 on CDH 5.3.1 YARN

2015-04-08 Thread Sean Owen
Yes, should be fine since you are running on YARN. This is probably more appropriate for the cdh-user list. On Apr 8, 2015 9:35 AM, roy rp...@njit.edu wrote: Hi, We have a cluster running on CDH 5.3.2 and Spark 1.2 (which is the current version in CDH 5.3.2), but we want to try Spark 1.3 without

Re: Error running Spark on Cloudera

2015-04-08 Thread Marcelo Vanzin
spark.eventLog.dir should contain the full HDFS URL. In general, this should be sufficient: spark.eventLog.dir=hdfs:/user/spark/applicationHistory On Wed, Apr 8, 2015 at 6:45 AM, Vijayasarathy Kannan kvi...@vt.edu wrote: I am trying to run a Spark application using spark-submit on a cluster
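
A sketch of the corresponding spark-defaults.conf entries (the namenode host and port are placeholders for your cluster):

    spark.eventLog.enabled  true
    spark.eventLog.dir      hdfs://namenode:8020/user/spark/applicationHistory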

Re: Timeout errors from Akka in Spark 1.2.1

2015-04-08 Thread Tathagata Das
There are a couple of options. Increase the timeout (see Spark configuration). Also see past mails in the mailing list. Another option you may try (I have a gut feeling that it may work, but I am not sure) is calling GC on the driver periodically. The cleaning up of stuff is tied to GCing of RDD objects

Reading file with Unicode characters

2015-04-08 Thread Arun Lists
Hi, Does SparkContext's textFile() method handle files with Unicode characters? How about files in UTF-8 format? Going further, is it possible to specify encodings to the method? If not, what should one do if the files to be read are in some other encoding? Thanks, arun

Spark SQL Avro Library for 1.2

2015-04-08 Thread roy
How do I build the Spark SQL Avro library for Spark 1.2? I was following this https://github.com/databricks/spark-avro and was able to build spark-avro_2.10-1.0.0.jar by simply running sbt/sbt package from the project root, but we are on Spark 1.2 and need a compatible spark-avro jar. Any idea how

Spark 1.3 on CDH 5.3.1 YARN

2015-04-08 Thread roy
Hi, We have a cluster running on CDH 5.3.2 and Spark 1.2 (which is the current version in CDH 5.3.2), but we want to try Spark 1.3 without breaking the existing setup. Is it possible to have Spark 1.3 on the existing setup? Thanks

RE: EC2 spark-submit --executor-memory

2015-04-08 Thread java8964
If you are using the Spark Standalone deployment, make sure you set WORKER_MEMORY over 20G, and that you do have 20G of physical memory. Yong Date: Tue, 7 Apr 2015 20:58:42 -0700 From: li...@adobe.com To: user@spark.apache.org Subject: EC2 spark-submit --executor-memory Dear Spark team, I'm

Subscribe

2015-04-08 Thread Idris Ali

Re: Subscribe

2015-04-08 Thread Ted Yu
Please email user-subscr...@spark.apache.org On Apr 8, 2015, at 6:28 AM, Idris Ali psychid...@gmail.com wrote:

Opening many Parquet files = slow

2015-04-08 Thread Eric Eijkelenboom
Hi guys, I’ve got: 180 days of log data in Parquet. Each day is stored in a separate folder in S3. Each day consists of 20-30 Parquet files of 256 MB each. Spark 1.3 on Amazon EMR. This makes approximately 5000 Parquet files with a total size of 1.5 TB. My code: val in =

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Todd Nist
To use HiveThriftServer2.startWithContext, I thought one would use the following artifact in the build: "org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0". But I am unable to resolve the artifact. I do not see it in Maven Central or any other repo. Do I need to build Spark and

Error running Spark on Cloudera

2015-04-08 Thread Vijayasarathy Kannan
I am trying to run a Spark application using spark-submit on a cluster using Cloudera manager. I get the error Exception in thread main java.io.IOException: Error in creating log directory: file:/user/spark/applicationHistory//app-20150408094126-0008 Adding the below lines in

Maintaining state

2015-04-08 Thread boston2004_williams
It should be noted I'm a newbie to Spark, so please have patience ... I'm trying to convert an existing application over to Spark and am running into some high-level questions that I can't seem to resolve, possibly because what I'm trying to do is not supported. In a nutshell, as I process

Spark Streaming and SQL

2015-04-08 Thread Vadim Bichutskiy
Hi all, I am using Spark Streaming to monitor an S3 bucket for objects that contain JSON. I want to import that JSON into a Spark SQL DataFrame. Here's my current code: from pyspark import SparkContext, SparkConf; from pyspark.streaming import StreamingContext; import json; from pyspark.sql

Re: Opening many Parquet files = slow

2015-04-08 Thread Michael Armbrust
Thanks for the report. We improved the speed here in 1.3.1, so it would be interesting to know if this helps. You should also try disabling schema merging if you do not need that feature (i.e. all of your files have the same schema): sqlContext.load(path, "parquet", Map("mergeSchema" -> "false")) On Wed,

Re: The difference between SparkSQL/DataFrame join and RDD join

2015-04-08 Thread Michael Armbrust
I think your thread dump for the master is actually just a thread dump for SBT that is waiting on a forked driver program. ... java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x7fed624ff528 (a java.lang.UNIXProcess) at

Re: Incremently load big RDD file into Memory

2015-04-08 Thread Guillaume Pitel
Hi Muhammad, There are lots of ways to do it. My company actually develops a text mining solution which embeds a very fast Approximate Neighbours solution (a demo with real-time queries on the wikipedia dataset can be seen at wikinsights.org). For the record, we now prepare a dataset of 4.5

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Michael Armbrust
Sorry guys. I didn't realize that https://issues.apache.org/jira/browse/SPARK-4925 was not fixed yet. You can publish locally in the meantime (sbt/sbt publishLocal). On Wed, Apr 8, 2015 at 8:29 AM, Mohammed Guller moham...@glassbeam.com wrote: +1 Interestingly, I ran into exactly

Re: parquet partition discovery

2015-04-08 Thread Michael Armbrust
Back to the user list so everyone can see the result of the discussion... Ah. It all makes sense now. The issue is that when I created the parquet files, I included an unnecessary directory name (data.parquet) below the partition directories. It’s just a leftover from when I started with

start-slave.sh not starting

2015-04-08 Thread Mohit Anchlia
I am trying to start the worker by: sbin/start-slave.sh spark://ip-10-241-251-232:7077 In the logs it's complaining about: Master must be a URL of the form spark://hostname:port I also have this in spark-defaults.conf spark.master spark://ip-10-241-251-232:7077 Did I miss

Re: org.apache.spark.ml.recommendation.ALS

2015-04-08 Thread Jay Katukuri
Some additional context: since I am using features of Spark 1.3.0, I have downloaded Spark 1.3.0 and used spark-submit from there. The cluster is still on Spark 1.2.0. So it looks to me like, at runtime, the executors could not find some libraries of Spark 1.3.0, even though I ran

RE: Reading file with Unicode characters

2015-04-08 Thread java8964
Spark uses the Hadoop TextInputFormat to read the file. Since Hadoop almost exclusively supports Linux, UTF-8 is the only encoding supported, as it is the one used on Linux. If you have data in another encoding, you may want to vote for this JIRA: https://issues.apache.org/jira/browse/MAPREDUCE-232
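
Until then, a common workaround is to read the raw bytes and decode them yourself. A sketch, assuming ISO-8859-1 data (the path and charset are hypothetical):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/latin1")
    val decoded = raw.map { case (_, text) =>
      // Text.getBytes returns the backing array; only the first getLength bytes are valid
      new String(text.getBytes, 0, text.getLength, "ISO-8859-1")
    }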

RE: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Mohammed Guller
Michael, Thank you! Looks like the sbt build is broken for 1.3. I downloaded the source code for 1.3, but I get the following error a few minutes after I run “sbt/sbt publishLocal” [error] (network-shuffle/*:update) sbt.ResolveException: unresolved dependency:

Class incompatible error

2015-04-08 Thread Mohit Anchlia
I am seeing the following; is this because of my Maven version? 15/04/08 15:42:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-10-241-251-232.us-west-2.compute.internal): java.io.InvalidClassException: org.apache.spark.Aggregator; local class incompatible: stream classdesc

Unit testing with HiveContext

2015-04-08 Thread Daniel Siegmann
I am trying to unit test some code which takes an existing HiveContext and uses it to execute a CREATE TABLE query (among other things). Unfortunately I've run into some hurdles trying to unit test this, and I'm wondering if anyone has a good approach. The metastore DB is automatically created in

Empty RDD?

2015-04-08 Thread Vadim Bichutskiy
When I call transform or foreachRDD on a DStream, I keep getting an error that I have an empty RDD, which makes sense since my batch interval may be smaller than the rate at which new data are coming in. How do I guard against it? Thanks, Vadim

Re: Timeout errors from Akka in Spark 1.2.1

2015-04-08 Thread N B
Since we are running in local mode, won't all the executors be in the same JVM as the driver? Thanks NB On Wed, Apr 8, 2015 at 1:29 PM, Tathagata Das t...@databricks.com wrote: It does take effect on the executors, not on the driver, which is okay because executors have all the data and

Pyspark query by binary type

2015-04-08 Thread jmalm
I am loading some Avro data into Spark using the following code: sqlContext.sql("CREATE TEMPORARY TABLE foo USING com.databricks.spark.avro OPTIONS (path 'hdfs://*.avro')") The Avro data contains some binary fields that get translated to the BinaryType data type. I am struggling with how to use

Re: Add row IDs column to data frame

2015-04-08 Thread olegshirokikh
A more generic version of the question below: is it possible to append a column to an existing DataFrame at all? I understand that this is not an easy task in the Spark environment, but is there any workaround?

Re: Spark Streaming and SQL

2015-04-08 Thread Vadim Bichutskiy
Hi all, I figured it out! The DataFrames and SQL example in the Spark Streaming docs was useful. Best, Vadim On Wed, Apr 8, 2015 at 2:38 PM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: Hi all, I am using Spark Streaming to monitor an S3 bucket for objects that contain JSON. I want

sortByKey with multiple partitions

2015-04-08 Thread Tom
Hi, If I perform a sortByKey(true, 2).saveAsTextFile(filename) on a cluster, will the data be sorted per partition, or in total? (And is this guaranteed?) Example: Input 4,2,3,6,5,7 Sorted per partition: part-0: 2,3,7 part-1: 4,5,6 Sorted in total: part-0: 2,3,4 part-1: 5,6,7

Re: Add row IDs column to data frame

2015-04-08 Thread Bojan Kostic
You could convert the DF to an RDD, add the new column in a map phase or in a join, and then convert back to a DF. I know this is not an elegant solution, and maybe it is not a solution at all. :) But this is the first thing that popped into my mind. I am also new to the DF API. Best Bojan On Apr 9, 2015 00:37,
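
A rough sketch of that round trip on the Spark 1.3 API, assuming an existing DataFrame df and its sqlContext (the new column name is made up):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // zipWithIndex assigns a stable 0-based id; prepend it to every row
    val rowsWithId = df.rdd.zipWithIndex().map { case (row, id) => Row.fromSeq(id +: row.toSeq) }
    val schema = StructType(StructField("rowId", LongType, nullable = false) +: df.schema.fields)
    val dfWithId = sqlContext.createDataFrame(rowsWithId, schema)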

Re: Timeout errors from Akka in Spark 1.2.1

2015-04-08 Thread N B
Thanks TD. I believe that might have been the issue. Will try for a few days after passing in the GC option on the java command line when we start the process. Thanks for your timely help. NB On Wed, Apr 8, 2015 at 6:08 PM, Tathagata Das t...@databricks.com wrote: Yes, in local mode they the

Re: Timeout errors from Akka in Spark 1.2.1

2015-04-08 Thread Tathagata Das
Yes, in local mode the driver and executor will be the same process, and in that case the Java options in the SparkConf configuration will not work. On Wed, Apr 8, 2015 at 1:44 PM, N B nb.nos...@gmail.com wrote: Since we are running in local mode, won't all the executors be in the same JVM

Re: sortByKey with multiple partitions

2015-04-08 Thread Ted Yu
See the scaladoc from OrderedRDDFunctions.scala : * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling * `collect` or `save` on the resulting RDD will return or output an ordered list of records * (in the `save` case, they will be written to
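
In other words, the output is totally ordered across part files, because sortByKey range-partitions the keys. A small illustration using the example from the question (the exact split point depends on range sampling, hence "roughly"):

    // Keys are range-partitioned, so concatenating the part files yields a sorted list,
    // roughly: part-00000 -> 2,3,4 and part-00001 -> 5,6,7
    val sorted = sc.parallelize(Seq(4, 2, 3, 6, 5, 7)).map(x => (x, x)).sortByKey(true, 2)
    sorted.keys.saveAsTextFile("out")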

Re: Class incompatible error

2015-04-08 Thread Ted Yu
bq. one is Oracle and the other is OpenJDK I don't have experience with mixed JDKs. Can you try using a single JDK? Cheers On Wed, Apr 8, 2015 at 3:26 PM, Mohit Anchlia mohitanch...@gmail.com wrote: For the build I am using java version 1.7.0_65, which seems to be the same as the one on

Re: Opening many Parquet files = slow

2015-04-08 Thread Cheng Lian
Hi Eric - Would you mind trying either disabling schema merging as Michael suggested, or disabling the new Parquet data source via sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")? Cheng On 4/9/15 2:43 AM, Michael Armbrust wrote: Thanks for the report. We improved the speed

RE: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Mohammed Guller
Hey Patrick, Michael and Todd, Thank you for your help! As you guys recommended, I did a local install and got my code to compile. As an FYI, on my local machine the sbt build fails even if I add -DskipTests. So I used mvn. Mohammed From: Patrick Wendell [mailto:patr...@databricks.com] Sent:

Re: parquet partition discovery

2015-04-08 Thread Cheng Lian
On 4/9/15 3:09 AM, Michael Armbrust wrote: Back to the user list so everyone can see the result of the discussion... Ah. It all makes sense now. The issue is that when I created the parquet files, I included an unnecessary directory name (data.parquet) below the partition

Re: Empty RDD?

2015-04-08 Thread Tathagata Das
Aah yes. The jsonRDD method needs to walk through the whole RDD to understand the schema, and does not work if there is no data in it. Making sure there is data in it using take(1) should work. TD

Re: Cannot run unit test.

2015-04-08 Thread Mike Trienis
It's because your tests are running in parallel and you can only have one context running at a time.

Regarding GroupBy

2015-04-08 Thread Jeetendra Gangele
I wanted to run groupBy(partition) but this is not working. Here the first part of pairvendorData will be repeated across multiple second parts. Both are objects; do I need to override equals and hashCode? Is groupBy fast enough? JavaPairRDD<VendorRecord, VendorRecord> pairvendorData
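
A Scala sketch of the usual fix (the thread itself uses the Java API): either make the key type a case class, which generates equals and hashCode for you, or group by a stable field. The field names here are hypothetical:

    // Assuming vendorRecords: RDD[VendorRecord]; grouping by an id field
    // sidesteps object-identity comparisons entirely.
    case class VendorRecord(id: Long, name: String)
    val grouped = vendorRecords.groupBy(_.id)   // RDD[(Long, Iterable[VendorRecord])]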

Re: Reading file with Unicode characters

2015-04-08 Thread Arun Lists
Thanks! arun On Wed, Apr 8, 2015 at 10:51 AM, java8964 java8...@hotmail.com wrote: Spark uses the Hadoop TextInputFormat to read the file. Since Hadoop almost exclusively supports Linux, UTF-8 is the only encoding supported, as it is the one used on Linux. If you have data in another encoding,

Re: Empty RDD?

2015-04-08 Thread Vadim Bichutskiy
Thanks TD! On Apr 8, 2015, at 9:36 PM, Tathagata Das t...@databricks.com wrote: Aah yes. The jsonRDD method needs to walk through the whole RDD to understand the schema, and does not work if there is no data in it. Making sure there is data in it using take(1) should work. TD

Re: Opening many Parquet files = slow

2015-04-08 Thread Prashant Kommireddi
We noticed similar perf degradation using Parquet (outside of Spark), and it happened due to merging of multiple schemas. It would be good to know if disabling schema merging (if the schema is the same), as Michael suggested, helps in your case. On Wed, Apr 8, 2015 at 11:43 AM, Michael Armbrust

Re: function to convert to pair

2015-04-08 Thread Ted Yu
Please take a look at zipWithIndex() of RDD. Cheers On Wed, Apr 8, 2015 at 3:40 PM, Jeetendra Gangele gangele...@gmail.com wrote: Hi All I have an RDD[SomeObject] and I want to convert it to RDD[(sequenceNumber, SomeObject)]; this sequence number can be 1 for the first SomeObject, 2 for the second SomeObject
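
A minimal sketch of that suggestion, assuming rdd is an RDD[SomeObject]:

    // zipWithIndex is 0-based, so shift by one to start the sequence at 1
    val numbered = rdd.zipWithIndex().map { case (obj, idx) => (idx + 1, obj) }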

Re: [ThriftServer] User permissions warning

2015-04-08 Thread Cheng Lian
The Thrift server doesn't support authentication or Hadoop doAs yet, so you can simply ignore this warning. To avoid it, when connecting via JDBC you may set the user to the same user who starts the Thrift server process. For Beeline, use -n user. On 4/8/15 11:49 PM, Yana Kadiyska
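
For example (the JDBC URL and user name are placeholders):

    beeline -u jdbc:hive2://localhost:10000 -n sparkuser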

Re: Empty RDD?

2015-04-08 Thread Tathagata Das
What is the computation you are doing in the foreachRDD that is throwing the exception? One way to guard against it is to do a take(1) to see if you get back any data. If there is none, then don't do anything with the RDD. TD On Wed, Apr 8, 2015 at 1:08 PM, Vadim Bichutskiy
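
A sketch of that guard, assuming a DStream of JSON strings and the 1.3-era jsonRDD API (the stream, context, and table names are made up):

    stream.foreachRDD { rdd =>
      // Skip empty batches so jsonRDD never sees an empty RDD
      if (rdd.take(1).nonEmpty) {
        val df = sqlContext.jsonRDD(rdd)
        df.registerTempTable("events")
      }
    }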

function to convert to pair

2015-04-08 Thread Jeetendra Gangele
Hi all, I have an RDD[SomeObject] and I want to convert it to RDD[(sequenceNumber, SomeObject)]. This sequence number can be 1 for the first SomeObject, 2 for the second SomeObject. Regards jeet

Re: Class incompatible error

2015-04-08 Thread Mohit Anchlia
For the build I am using java version 1.7.0_65 which seems to be the same as the one on the spark host. However one is Oracle and the other is OpenJDK. Does that make any difference? On Wed, Apr 8, 2015 at 1:24 PM, Ted Yu yuzhih...@gmail.com wrote: What version of Java do you use to build ?

Re: Class incompatible error

2015-04-08 Thread Ted Yu
What version of Java do you use to build ? Cheers On Wed, Apr 8, 2015 at 12:43 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am seeing the following, is this because of my maven version? 15/04/08 15:42:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,

Re: Support for Joda

2015-04-08 Thread Ted Yu
Which version of Joda are you using ? Here is snippet of dependency:tree out w.r.t. Joda : [INFO] +- org.apache.flume:flume-ng-core:jar:1.4.0:compile ... [INFO] | +- joda-time:joda-time:jar:2.1:compile FYI On Wed, Apr 8, 2015 at 12:53 PM, Patrick Grandjean p.r.grandj...@gmail.com wrote: Hi,

Re: Unit testing with HiveContext

2015-04-08 Thread Ted Yu
Please take a look at sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala : protected def configure(): Unit = { warehousePath.delete(); metastorePath.delete(); setConf("javax.jdo.option.ConnectionURL", s"jdbc:derby:;databaseName=$metastorePath;create=true")
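
The same trick can be borrowed in your own tests: point the metastore at a throwaway Derby database before the HiveContext first touches it. A sketch, assuming an existing hiveContext (paths are made up):

    // Create and immediately remove a temp dir: Derby wants to create it itself
    val metastorePath = java.nio.file.Files.createTempDirectory("metastore").toFile
    metastorePath.delete()
    hiveContext.setConf("javax.jdo.option.ConnectionURL",
      s"jdbc:derby:;databaseName=$metastorePath;create=true")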

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Todd Nist
Hi Mohammed, I think you just need to add -DskipTests to your build. Here is how I built it: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests clean package install build/sbt does, however, fail even when only doing package, which should skip tests. I am able to