Eclipse Spark plugin and sample Scala projects

2014-07-15 Thread buntu
Hi -- I tried searching for an Eclipse Spark plugin setup for developing with Spark, and there seems to be some information I can go with, but I have not seen a starter app or project to import into Eclipse and try out. Can anyone please point me to any Scala projects to import into Scala Eclipse

Spark SQL throws ClassCastException on first try; works on second

2014-07-15 Thread Nick Chammas
I’m running this query against RDD[Tweet], where Tweet is a simple case class with 4 fields. sqlContext.sql("SELECT user, COUNT(*) as num_tweets FROM tweets GROUP BY user ORDER BY num_tweets DESC, user ASC;").take(5) The first time I run this, it throws the following:
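For context, a minimal sketch of the kind of setup being described, on the Spark 1.0.x API; the Tweet fields, input file, and parsing here are illustrative assumptions, not the poster's actual code:

    // Sketch only: a 4-field case class and the GROUP BY query from the post.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Tweet(user: String, text: String, retweets: Int, timestamp: Long)

    object TweetQuery {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("TweetQuery"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD   // implicit RDD[Tweet] -> SchemaRDD

        val tweets = sc.textFile("tweets.tsv").map(_.split("\t")).collect {
          case Array(user, text, rt, ts) => Tweet(user, text, rt.toInt, ts.toLong)
        }
        tweets.registerAsTable("tweets")

        sqlContext.sql(
          "SELECT user, COUNT(*) AS num_tweets FROM tweets " +
          "GROUP BY user ORDER BY num_tweets DESC, user ASC"
        ).take(5).foreach(println)
      }
    }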

Re: SparkR failed to connect to the master

2014-07-15 Thread Shivaram Venkataraman
You'll need to build SparkR to match the Spark version deployed on the cluster. You can do that by changing the Spark version in SparkR's build.sbt [1]. If you are using the Maven build you'll need to edit pom.xml Thanks Shivaram [1]

Re: RACK_LOCAL Tasks Failed to finish

2014-07-15 Thread 洪奇
I am just running PageRank (included in GraphX) on a dataset which has 55876487 edges. I submit the application to YARN with the options `--num-executors 30 --executor-memory 30g --driver-memory 10g --executor-cores 8`. Thanks -- From: Ankur

Re: Spark SQL throws ClassCastException on first try; works on second

2014-07-15 Thread Michael Armbrust
You might be hitting SPARK-1994 https://issues.apache.org/jira/browse/SPARK-1994, which is fixed in 1.0.1. On Mon, Jul 14, 2014 at 11:16 PM, Nick Chammas nicholas.cham...@gmail.com wrote: I’m running this query against RDD[Tweet], where Tweet is a simple case class with 4 fields.

Re: hdfs replication on saving RDD

2014-07-15 Thread Andrew Ash
In general it would be nice to be able to configure replication on a per-job basis. Is there a way to do that without changing the config values in the Hadoop conf/ directory between jobs? Maybe by modifying OutputFormats or the JobConf ? On Mon, Jul 14, 2014 at 11:12 PM, Matei Zaharia
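One approach that is sometimes suggested (a sketch, not a confirmed answer to Andrew's question): set dfs.replication on the SparkContext's Hadoop configuration for that job, rather than editing the cluster-wide conf/ directory. Whether this reaches every OutputFormat depends on how the output path is written.

    // Hedged sketch: per-job replication via the job's Hadoop configuration.
    import org.apache.spark.{SparkConf, SparkContext}

    object ReplicationDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ReplicationDemo"))
        // Applies to files this job writes through the Hadoop output formats.
        sc.hadoopConfiguration.set("dfs.replication", "2")
        sc.parallelize(1 to 1000).saveAsTextFile("hdfs:///tmp/replication-demo")
      }
    }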

Re: Nested Query With Spark SQL(1.0.1)

2014-07-15 Thread anyweil
I mean the query on the nested data such as JSON, not the nested query, sorry for the misunderstanding. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Query-the-nested-JSON-data-With-Spark-SQL-1-0-1-tp9544p9726.html Sent from the Apache Spark User List

Re: Spark SQL throws ClassCastException on first try; works on second

2014-07-15 Thread Nicholas Chammas
Ah, good catch, that seems to be it. I'd use 1.0.1, except I've been hitting up against SPARK-2471 https://issues.apache.org/jira/browse/SPARK-2471 with that version, which doesn't let me access my data in S3. :( OK, at least I know this has probably already been fixed. Nick On Tue, Jul 15,

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-15 Thread Nicholas Chammas
Hmm, I'd like to clarify something from your comments, Tathagata. Going forward, is Twitter Streaming functionality not supported from the shell? What should users do if they'd like to process live Tweets from the shell? Nick On Mon, Jul 14, 2014 at 11:50 PM, Nicholas Chammas

Re: Nested Query With Spark SQL(1.0.1)

2014-07-15 Thread Michael Armbrust
In general this should be supported using [] to access array data and . to access nested fields. Is there something you are trying that isn't working? On Mon, Jul 14, 2014 at 11:25 PM, anyweil wei...@gmail.com wrote: I mean the query on the nested data such as JSON, not the nested query,
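For readers following along, a small sketch of the syntax Michael is referring to, against a hypothetical nested JSON file (as later messages in this thread note, the plain SQL parser in 1.0.1 still had trouble with repeated nested fields, SPARK-2096):

    // Assumes a SQLContext named sqlContext, as in the spark-shell.
    // people.json (one object per line): {"name": "A", "schools": [{"name": "X", "time": 3}]}
    val people = sqlContext.jsonFile("people.json")
    people.registerAsTable("people")
    // "." reaches into nested records, "[n]" indexes into arrays:
    sqlContext.sql("SELECT schools[0].name FROM people").collect().foreach(println)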

Re: ALS on EC2

2014-07-15 Thread Xiangrui Meng
Could you share the code of RecommendationALS and the complete spark-submit command line options? Thanks! -Xiangrui On Mon, Jul 14, 2014 at 11:23 PM, Srikrishna S srikrishna...@gmail.com wrote: Using properties file: null Main class: RecommendationALS Arguments: _train.csv _validation.csv

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread anyweil
Thank you so much for the reply, here is my code. 1. val conf = new SparkConf().setAppName("Simple Application") 2. conf.setMaster("local") 3. val sc = new SparkContext(conf) 4. val sqlContext = new org.apache.spark.sql.SQLContext(sc) 5. import sqlContext.createSchemaRDD 6. val path1 =

Re: KMeansModel Construtor error

2014-07-15 Thread Xiangrui Meng
I don't think MLlib supports model serialization/deserialization. You got the error because the constructor is private. I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-2488 and we'll try to make sure it is implemented in v1.1. For now, you can modify the KMeansModel and remove
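Until SPARK-2488 lands, a rough workaround sketch is to persist the cluster centers yourself and rebuild the nearest-center lookup by hand rather than constructing a KMeansModel (this is an illustration, not an official API):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Hypothetical: centers previously saved as arrays of doubles and re-read.
    def nearestCenter(centers: Array[Vector], point: Vector): Int = {
      centers.zipWithIndex.minBy { case (c, _) =>
        c.toArray.zip(point.toArray).map { case (a, b) => (a - b) * (a - b) }.sum
      }._2
    }

    // Usage sketch: val cluster = nearestCenter(centers, Vectors.dense(1.0, 2.0))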

Re: Nested Query With Spark SQL(1.0.1)

2014-07-15 Thread anyweil
Yes, just as in my last post, using [] to access array data and . to access nested fields does not seem to work. BTW, I have dug into the code of the current master branch: spark / sql / catalyst / src / main / scala / org / apache / spark / sql / catalyst / plans / logical / LogicalPlan.scala from

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Michael Armbrust
Sorry for the trouble. There are two issues here: - Parsing of repeated nested fields (i.e. something[0].field) is not supported in the plain SQL parser. SPARK-2096 https://issues.apache.org/jira/browse/SPARK-2096 - Resolution is broken in the HiveQL parser. SPARK-2483

Re: jsonRDD: NoSuchMethodError

2014-07-15 Thread SK
I am running this in standalone mode on a single machine. I built the spark jar from scratch (sbt assembly) and then included that in my application (the same process I have done for earlier versions). thanks. -- View this message in context:

Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Walrus theCat
This is (obviously) Spark Streaming, by the way. On Mon, Jul 14, 2014 at 8:27 PM, Walrus theCat walrusthe...@gmail.com wrote: Hi, I've got a socketTextStream through which I'm reading input. I have three DStreams, all of which are the same window operation over that socketTextStream. I

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-15 Thread Praveen Seluka
If you want to make Twitter* classes available in your shell, I believe you could do the following: 1. Change the parent pom module ordering - move external/twitter before assembly 2. In assembly/pom.xml, add the external/twitter dependency - this will package twitter* into the assembly jar Now when

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-15 Thread Pei-Lun Lee
Hi Michael, Good to know it is being handled. I tried the master branch (9fe693b5) and got another error: scala> sqlContext.parquetFile("/tmp/foo") java.lang.RuntimeException: Unsupported parquet datatype optional fixed_len_byte_array(4) b at scala.sys.package$.error(package.scala:27) at

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread anyweil
Thank you so much for the information. Now I have merged the fix of #1411 and it seems the HiveQL works with: SELECT name FROM people WHERE schools[0].time > 2. But one more question: is it possible or planned to support the schools.time format to filter the records where there is an element inside

Re: jsonRDD: NoSuchMethodError

2014-07-15 Thread SK
The problem is resolved. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/jsonRDD-NoSuchMethodError-tp9688p9742.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-15 Thread Nick Pentreath
You could try the following: create a minimal project using sbt or Maven, add spark-streaming-twitter as a dependency, run sbt assembly (or mvn package) on that to create a fat jar (with Spark as provided dependency), and add that to the shell classpath when starting up. On Tue, Jul 15, 2014 at
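A minimal build.sbt along the lines Nick describes (the artifact coordinates are real, but the project name, versions and jar path are illustrative, and sbt-assembly must be added in project/plugins.sbt):

    name := "twitter-shell-deps"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"              % "1.0.1" % "provided",
      "org.apache.spark" %% "spark-streaming"         % "1.0.1" % "provided",
      "org.apache.spark" %% "spark-streaming-twitter" % "1.0.1"
    )

    // After `sbt assembly`, add the fat jar when starting the shell, e.g.:
    //   bin/spark-shell --jars target/scala-2.10/twitter-shell-deps-assembly-0.1.jar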

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-15 Thread Michael Armbrust
Oh, maybe not. Please file another JIRA. On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee pl...@appier.com wrote: Hi Michael, Good to know it is being handled. I tried master branch (9fe693b5) and got another error: scala sqlContext.parquetFile(/tmp/foo) java.lang.RuntimeException:

Spark Streaming w/ tshark exception problem on EC2

2014-07-15 Thread Gianluca Privitera
Hi, I’ve got a problem with Spark Streaming and tshark. While I’m running locally I have no problems with this code, but when I run it on an EC2 cluster I get the exception shown just under the code. def dissection(s: String): Seq[String] = { try { Process(hadoop command to create

Kryo NoSuchMethodError on Spark 1.0.0 standalone

2014-07-15 Thread jfowkes
Hi there, I've been successfully using the precompiled Spark 1.0.0 Java API on a small cluster in standalone mode. However, when I try to use the Kryo serializer by adding conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"); as suggested, Spark crashes out with the following error:
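For comparison, the usual Kryo setup looks like the following (the registrator class here is hypothetical; the fully-qualified serializer name must be spelled exactly, and a Kryo/Chill version mismatch on the classpath is another common cause of NoSuchMethodError):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kryo-demo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyRegistrator") // optional, hypothetical class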

Re: Catalyst dependency on Spark Core

2014-07-15 Thread Sean Owen
Agree. You end up with a core and a corer core to distinguish between and it ends up just being more complicated. This sounds like something that doesn't need a module. On Tue, Jul 15, 2014 at 5:59 AM, Patrick Wendell pwend...@gmail.com wrote: Adding new build modules is pretty high overhead, so

Re: the default GraphX graph-partition strategy on multicore machine?

2014-07-15 Thread Yifan LI
Dear Ankur, Thanks so much! Btw, is there any possibility to customise the partition strategy as we expect? Best, Yifan On Jul 11, 2014, at 10:20 PM, Ankur Dave ankurd...@gmail.com wrote: Hi Yifan, When you run Spark on a single machine, it uses a local mode where one task per core can

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-15 Thread Pei-Lun Lee
Filed SPARK-2446 2014-07-15 16:17 GMT+08:00 Michael Armbrust mich...@databricks.com: Oh, maybe not. Please file another JIRA. On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee pl...@appier.com wrote: Hi Michael, Good to know it is being handled. I tried master branch (9fe693b5) and got

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-15 Thread Pei-Lun Lee
Sorry, should be SPARK-2489 2014-07-15 19:22 GMT+08:00 Pei-Lun Lee pl...@appier.com: Filed SPARK-2446 2014-07-15 16:17 GMT+08:00 Michael Armbrust mich...@databricks.com: Oh, maybe not. Please file another JIRA. On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee pl...@appier.com wrote:

Re: Need help on spark Hbase

2014-07-15 Thread Jerry Lam
Hi Rajesh, can you describe your spark cluster setup? I saw localhost:2181 for zookeeper. Best Regards, Jerry On Tue, Jul 15, 2014 at 9:47 AM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi Team, Could you please help me to resolve the issue. *Issue *: I'm not able to connect

Re: Need help on spark Hbase

2014-07-15 Thread Krishna Sankar
One vector to check is the HBase libraries in the --jars, as in: spark-submit --class <your class> --master <master url> --jars

Re: NotSerializableException in Spark Streaming

2014-07-15 Thread Nicholas Chammas
Hey Diana, Did you ever figure this out? I’m running into the same exception, except in my case the function I’m calling is a KMeans model.predict(). In regular Spark it works, and Spark Streaming without the call to model.predict() also works, but when put together I get this serialization

Driver cannot receive StatusUpdate message for FINISHED

2014-07-15 Thread 林武康
Hi all, I've got a strange problem: I submit a reduce job (with only one split), and it finishes normally on the Executor; the log is: 14/07/15 21:08:56 INFO Executor: Serialized size of result for 0 is 10476031 14/07/15 21:08:56 INFO Executor: Sending result for 0 directly to driver 14/07/15 21:08:56 INFO Executor:

Re: Spark-Streaming collect/take functionality.

2014-07-15 Thread jon.burns
It works perfectly, thanks! I feel like I should have figured that out; I'll chalk it up to inexperience with Scala. Thanks again. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-collect-take-functionality-tp9670p9772.html Sent from the

Re: Need help on spark Hbase

2014-07-15 Thread Jerry Lam
Hi Rajesh, I have a feeling that this is not directly related to spark but I might be wrong. The reason why is that when you do: Configuration configuration = HBaseConfiguration.create(); by default, it reads the configuration files hbase-site.xml in your classpath and ... (I don't remember
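What Jerry describes, roughly, in code (the quorum hosts are placeholders): HBaseConfiguration.create() reads hbase-site.xml from the classpath, and the ZooKeeper quorum can be set explicitly if the default localhost:2181 is wrong for the cluster.

    import org.apache.hadoop.hbase.HBaseConfiguration

    val hbaseConf = HBaseConfiguration.create() // picks up hbase-site.xml if it is on the classpath
    hbaseConf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")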

Re: Iteration question

2014-07-15 Thread Matei Zaharia
Hi Nathan, I think there are two possible reasons for this. One is that even though you are caching RDDs, their lineage chain gets longer and longer, and thus serializing each RDD takes more time. You can cut off the chain by using RDD.checkpoint() periodically, say every 5-10 iterations. The
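A sketch of the periodic-checkpointing pattern Matei mentions, for an iterative job that keeps replacing a cached RDD (the computation and the interval are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setAppName("iterate"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    var current: RDD[Double] = sc.parallelize(Seq.fill(1000)(math.random)).cache()
    for (i <- 1 to 50) {
      val next = current.map(_ * 0.9).cache()
      if (i % 10 == 0) {
        next.checkpoint() // cut the lineage chain every ~10 iterations
        next.count()      // force materialization so the checkpoint is actually written
      }
      current.unpersist()
      current = next
    }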

Re: Store one to many relation ship in parquet file with spark sql

2014-07-15 Thread Michael Armbrust
Make the Array a Seq. On Tue, Jul 15, 2014 at 7:12 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, How should I store a one to many relationship using spark sql and parquet format. For example I the following case class case class Person(key: String, name: String, friends:
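i.e., something along these lines (the Friend class is an assumption filling in the truncated original):

    // One-to-many relationship stored as a nested Seq, suitable for Parquet via Spark SQL.
    case class Friend(key: String, name: String)
    case class Person(key: String, name: String, friends: Seq[Friend])

    // Assuming sc and sqlContext as in the shell:
    // import sqlContext.createSchemaRDD
    // val people = sc.parallelize(Seq(Person("1", "Ann", Seq(Friend("2", "Bob")))))
    // people.saveAsParquetFile("people.parquet")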

Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
Hi folks, I'm running into the following error when trying to perform a join in my code: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.catalyst.types.LongType$ I see similar errors for StringType$ and also: scala.reflect.runtime.ReflectError: value apache is

How does Spark speculation prevent duplicated work?

2014-07-15 Thread Mingyu Kim
Hi all, I was curious about the details of Spark speculation. So, my understanding is that, when "speculated" tasks are newly scheduled on other machines, the original tasks are still running until the entire stage completes. This seems to leave some room for duplicated work because some spark
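For reference, speculation is off by default and is enabled with configuration along these lines (key names as of Spark 1.x; the interval/multiplier values here are just examples):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("speculation-demo")
      .set("spark.speculation", "true")
      .set("spark.speculation.interval", "100")   // how often to check for slow tasks, in ms
      .set("spark.speculation.multiplier", "1.5") // how much slower than the median counts as slow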

Re: How to kill running spark yarn application

2014-07-15 Thread hsy...@gmail.com
Interesting. I run on my local one-node cluster using Apache Hadoop. On Tue, Jul 15, 2014 at 7:55 AM, Jerry Lam chiling...@gmail.com wrote: For your information, the SparkSubmit runs at the host you executed the spark-submit shell script (which in turn invokes the SparkSubmit program). Since

Spark Performance Bench mark

2014-07-15 Thread Malligarjunan S
Hello All, I am a newbie to Apache Spark and I would like to know about Apache Spark performance benchmarks. My current requirement is as follows: I have a few files in 2 S3 buckets. Each file may have a minimum of 1 million records. The file data are tab separated. I have to compare a few columns and filter the

MLLib - Regularized logistic regression in python

2014-07-15 Thread fjeg
Hi All, I am trying to perform regularized logistic regression with mllib in python. I have seen that this is possible in the following scala example: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala But I do not see

Count distinct with groupBy usage

2014-07-15 Thread buntu
Hi -- New to Spark and trying to figure out how to generate unique counts per page by date given this raw data: timestamp,page,userId 1405377264,google,user1 1405378589,google,user2 1405380012,yahoo,user1 .. I can do a groupBy on a field and get the count: val lines = sc.textFile("data.csv") val

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Michael Armbrust
Are you registering multiple RDDs of case classes as tables concurrently? You are possibly hitting SPARK-2178 https://issues.apache.org/jira/browse/SPARK-2178 which is caused by SI-6240 https://issues.scala-lang.org/browse/SI-6240. On Tue, Jul 15, 2014 at 10:49 AM, Keith Simmons

Re: Count distinct with groupBy usage

2014-07-15 Thread Nick Pentreath
You can use .distinct.count on your user RDD. What are you trying to achieve with the time group by? — Sent from Mailbox On Tue, Jul 15, 2014 at 8:14 PM, buntu buntu...@gmail.com wrote: Hi -- New to Spark and trying to figure out how to do a generate unique counts per page by date given

Re: Count distinct with groupBy usage

2014-07-15 Thread Zongheng Yang
Sounds like a job for Spark SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html ! On Tue, Jul 15, 2014 at 11:25 AM, Nick Pentreath nick.pentre...@gmail.com wrote: You can use .distinct.count on your user RDD. What are you trying to achieve with the time group by? — Sent from

Spark Performance issue

2014-07-15 Thread Malligarjunan S
Hello all, I am a newbie to Spark, just analyzing the product. I am facing a performance problem with Hive and am trying to analyse whether Spark will solve it or not, but it seems that Spark is also taking a lot of time. Let me know if I miss anything. shark> select count(time) from table2; OK 6050 Time

Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
We have CDH 5.0.2 which doesn't include Spark SQL yet; it may only be available in CDH 5.1, which is yet to be released. If Spark SQL is the only option then I might need to hack around to add it into the current CDH deployment, if that's possible. -- View this message in context:

Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
Thanks Nick. All I'm attempting is to report number of unique visitors per page by date. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781p9786.html Sent from the Apache Spark User List mailing list archive at

Re: Large Task Size?

2014-07-15 Thread Aaron Davidson
Ah, I didn't realize this was non-MLLib code. Do you mean to be sending stochasticLossHistory in the closure as well? On Sun, Jul 13, 2014 at 1:05 AM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: It uses the standard SquaredL2Updater, and I also tried to broadcast it as well. The input is a
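The general pattern Aaron is hinting at, as a sketch: broadcast large read-only state instead of letting it be captured in the task closure (the names are illustrative, and sc is assumed to exist as in the shell):

    val history = Array.fill(100000)(0.0)   // stand-in for a large object like stochasticLossHistory
    val historyBc = sc.broadcast(history)   // shipped once per executor, not once per task
    val result = sc.parallelize(1 to 10).map(i => historyBc.value.length + i).collect()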

Re: Count distinct with groupBy usage

2014-07-15 Thread Sean Owen
If you are counting per time and per page, then you need to group by time and page, not just page. Something more like: csv.groupBy(csv => (csv(0),csv(1))) ... This gives a list of users per (time,page). As Nick suggests, then you count the distinct values for each key: ...
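Spelled out as a sketch (field positions follow the timestamp,page,userId layout from the original post; converting the epoch timestamp to a calendar date is left out):

    // Assumes sc from the shell.
    val csv = sc.textFile("data.csv").map(_.split(","))
    val uniques = csv
      .map(cols => ((cols(0), cols(1)), cols(2)))   // ((timestamp, page), userId)
      .distinct()                                    // drop duplicate (time, page, user) triples
      .map { case (key, _) => (key, 1) }
      .reduceByKey(_ + _)                            // distinct users per (time, page)
    uniques.collect().foreach(println)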

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Jerry Lam
Hi guys, Sorry, I'm also interested in this nested json structure. I have a similar SQL query in which I need to query a nested field in a json. Does the above query work if it is used with sql(sqlText), assuming the data is coming directly from hdfs via sqlContext.jsonFile? The SPARK-2483

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Zongheng Yang
FWIW, I am unable to reproduce this using the example program locally. On Tue, Jul 15, 2014 at 11:56 AM, Keith Simmons keith.simm...@gmail.com wrote: Nope. All of them are registered from the driver program. However, I think we've found the culprit. If the join column between two tables is

count vs countByValue in for/yield

2014-07-15 Thread Ognen Duzlevski
Hello, I am curious about something: val result = for { (dt,evrdd) <- evrdds val ct = evrdd.count } yield (dt -> ct) works. val result = for { (dt,evrdd) <- evrdds val ct = evrdd.countByValue } yield (dt -> ct) does not work. I get: 14/07/15 16:46:33 WARN

Re: the default GraphX graph-partition strategy on multicore machine?

2014-07-15 Thread Ankur Dave
On Jul 15, 2014, at 12:06 PM, Yifan LI iamyifa...@gmail.com wrote: Btw, is there any possibility to customise the partition strategy as we expect? I'm not sure I understand. Are you asking about defining a custom

Re: getting ClassCastException on collect()

2014-07-15 Thread _soumya_
Not sure I can help, but I ran into the same problem. Basically my use case is that I have a List of strings - which I then convert into an RDD using sc.parallelize(). This RDD is then operated on by the foreach() function. Same as you, I get a runtime exception: java.lang.ClassCastException:

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Andrew Ash
Hi Nan, Great digging in -- that makes sense to me for when a job is producing some output handled by Spark like a .count or .distinct or similar. For the other part of the question, I'm also interested in side effects like an HDFS disk write. If one task is writing to an HDFS path and another

parallel stages?

2014-07-15 Thread Wei Tan
Hi, I wonder if I do wordcount on two different files, like this: val file1 = sc.textFile("/...") val file2 = sc.textFile("/...") val wc1 = file1.flatMap(..).reduceByKey(_ + _, 1) val wc2 = file2.flatMap(...).reduceByKey(_ + _, 1) wc1.saveAsTextFile("titles.out") wc2.saveAsTextFile("tables.out") Would the

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
To give a few more details of my environment in case that helps you reproduce: * I'm running spark 1.0.1 downloaded as a tar ball, not built myself * I'm running in stand alone mode, with 1 master and 1 worker, both on the same machine (though the same error occurs with two workers on two

Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
Thanks Sean!! That's what I was looking for -- group by on multiple fields. I'm gonna play with it now. Thanks again! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781p9803.html Sent from the Apache Spark User List

Re: Possible bug in Spark Streaming :: TextFileStream

2014-07-15 Thread Tathagata Das
On second thought I am not entirely sure whether that bug is the issue. Are you continuously appending to the file that you have copied to the directory? Because filestream works correctly when the files are atomically moved to the monitored directory. TD On Mon, Jul 14, 2014 at 9:08 PM,

Re: SQL + streaming

2014-07-15 Thread Tathagata Das
Could you run it locally first to make sure it works, and you see output? Also, I recommend going through the previous step-by-step approach to narrow down where the problem is. TD On Mon, Jul 14, 2014 at 9:15 PM, hsy...@gmail.com hsy...@gmail.com wrote: Actually, I deployed this on yarn

Re: Spark Streaming Json file groupby function

2014-07-15 Thread Tathagata Das
I see you have the code to convert to a Record class but commented it out. That is the right way to go. When you are converting it to a 4-tuple with (data(type),data(name),data(score),data(school)) ... it's of type (Any, Any, Any, Any) as data(xyz) returns Any. And registerAsTable probably doesn't

Help with Json array parsing

2014-07-15 Thread SK
Hi, I have a json file where the object definition in each line includes an array component "obj" that contains 0 or more elements, as shown by the example below. {"name": "16287e9cdf", "obj": [{"min": 50, "max": 59}, {"min": 20, "max": 29}]}, {"name": "17087e9cdf", "obj": [{"min": 30, "max": 39}, {"min": 10, "max": 19},
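A hedged sketch of parsing that nested "obj" array with json4s (the library the MappingException in the follow-up message comes from); the case class names are illustrative:

    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    case class MinMax(min: Int, max: Int)
    case class Record(name: String, obj: List[MinMax])

    implicit val formats = DefaultFormats

    val line = """{"name": "16287e9cdf", "obj": [{"min": 50, "max": 59}, {"min": 20, "max": 29}]}"""
    val rec = parse(line).extract[Record]
    println(rec.obj.map(r => s"${r.min}-${r.max}").mkString(", "))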

Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Tathagata Das
This sounds really really weird. Can you give me a piece of code that I can run to reproduce this issue myself? TD On Tue, Jul 15, 2014 at 12:02 AM, Walrus theCat walrusthe...@gmail.com wrote: This is (obviously) spark streaming, by the way. On Mon, Jul 14, 2014 at 8:27 PM, Walrus theCat

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Michael Armbrust
Can you print out the queryExecution? (i.e. println(sql().queryExecution)) On Tue, Jul 15, 2014 at 12:44 PM, Keith Simmons keith.simm...@gmail.com wrote: To give a few more details of my environment in case that helps you reproduce: * I'm running spark 1.0.1 downloaded as a tar ball,

Re: Ideal core count within a single JVM

2014-07-15 Thread lokesh.gidra
What you said makes sense. But when I proportionately reduce the heap size, the problem persists. For instance, if I use 160 GB heap for 48 cores and 80 GB heap for 24 cores, then with 24 cores the performance is still better (although worse than 160 GB with 24 cores) than

Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Walrus theCat
Will do. On Tue, Jul 15, 2014 at 12:56 PM, Tathagata Das tathagata.das1...@gmail.com wrote: This sounds really really weird. Can you give me a piece of code that I can run to reproduce this issue myself? TD On Tue, Jul 15, 2014 at 12:02 AM, Walrus theCat walrusthe...@gmail.com wrote:

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Bertrand Dechoux
I haven't looked at the implementation, but what you would do with any filesystem is write to a file inside the workspace directory of the task. And then only the attempt of the task that should be kept will perform a move to the final path. The other attempts are simply discarded. For most

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
Sure thing. Here you go: == Logical Plan == Sort [key#0 ASC] Project [key#0,value#1,value#2] Join Inner, Some((key#0 = key#3)) SparkLogicalPlan (ExistingRdd [key#0,value#1], MapPartitionsRDD[2] at mapPartitions at basicOperators.scala:176) SparkLogicalPlan (ExistingRdd [value#2,key#3],

Re: parallel stages?

2014-07-15 Thread Sean Owen
The last two lines are what trigger the operations, and they will each block until the result is computed and saved. So if you execute this code as-is, no. You could write a Scala program that invokes these two operations in parallel, like: Array((wc1,"titles.out"), (wc2,"tables.out")).par.foreach {
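Completing that sketch (paths and RDD names follow the earlier wordcount example): the two save jobs are submitted from separate threads via the parallel collection, so their stages can be scheduled concurrently.

    Array((wc1, "titles.out"), (wc2, "tables.out")).par.foreach {
      case (rdd, path) => rdd.saveAsTextFile(path)
    }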

Re: Spark Streaming Json file groupby function

2014-07-15 Thread srinivas
I am still getting the error... even if I convert it to a Record. object KafkaWordCount { def main(args: Array[String]) { if (args.length < 4) { System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>") System.exit(1) }

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread gorenuru
Just my few cents on this. I'm having the same problems with v1.0.1, but this bug is sporadic and looks like it is related to object initialization. Even more, I'm not using any SQL or anything like that. I just have a utility class like this: object DataTypeDescriptor { type DataType = String val

Multiple streams at the same time

2014-07-15 Thread gorenuru
Hi everyone. I have some problems running multiple streams at the same time. What i am doing is: object Test { import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.api.java.function._ import org.apache.spark.streaming._

Re: Help with Json array parsing

2014-07-15 Thread SK
To add to my previous post, the error at runtime is the following: Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 1 times, most recent failure: Exception failure in TID 0 on host localhost: org.json4s.package$MappingException:

Retrieve dataset of Big Data Benchmark

2014-07-15 Thread Tom
Hi, I would like to use the dataset used in the Big Data Benchmark https://amplab.cs.berkeley.edu/benchmark/ on my own cluster, to run some tests between Hadoop and Spark. The dataset should be available at s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix],

Re: Retrieve dataset of Big Data Benchmark

2014-07-15 Thread Burak Yavuz
Hi Tom, If you wish to load the file in Spark directly, you can use sc.textFile("s3n://big-data-benchmark/pavlo/...") where sc is your SparkContext. This can be done because the files should be publicly available and you don't need AWS Credentials to access them. If you want to download the

Re: Recommended pipeline automation tool? Oozie?

2014-07-15 Thread Dean Wampler
If you're already using Scala for Spark programming and you hate Oozie XML as much as I do ;), you might check out Scoozie, a Scala DSL for Oozie: https://github.com/klout/scoozie On Thu, Jul 10, 2014 at 5:52 PM, Andrei faithlessfri...@gmail.com wrote: I used both - Oozie and Luigi - but found

can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-15 Thread Matt Work Coarr
Hello spark folks, I have a simple spark cluster setup but I can't get jobs to run on it. I am using standalone mode. One master, one slave. Both machines have 32GB ram and 8 cores. The slave is set up with one worker that has 8 cores and 24GB memory allocated. My application requires 2

Re: Large Task Size?

2014-07-15 Thread Kyle Ellrott
Yes, this is a proposed patch to MLLib so that you can use 1 RDD to train multiple models at the same time. I am hoping that multiplexing several models in the same RDD will be more efficient than trying to get the Spark scheduler to manage a few 100 tasks simultaneously. I don't think I see

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-15 Thread Marcelo Vanzin
Have you looked at the slave machine to see if the process has actually launched? If it has, have you tried peeking into its log file? (That error is printed whenever the executors fail to report back to the driver. Insufficient resources to launch the executor is the most common cause of that,

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Tathagata Das
The way the HDFS file writing works at a high level is that each attempt to write a partition to a file starts writing to unique temporary file (say, something like targetDirectory/_temp/part-X_attempt-). If the writing into the file successfully completes, then the temporary file is moved

Re: Error when testing with large sparse svm

2014-07-15 Thread crater
I made a bit of progress. I think the problem is with the BinaryClassificationMetrics: as long as I comment out all the prediction-related metrics, I can run the svm example with my data. So the problem should be there, I guess. -- View this message in context:

Re: NotSerializableException in Spark Streaming

2014-07-15 Thread Tathagata Das
I am very curious though. Can you post a concise code example which we can run to reproduce this problem? TD On Tue, Jul 15, 2014 at 2:54 PM, Tathagata Das tathagata.das1...@gmail.com wrote: I am not entire sure off the top of my head. But a possible (usually works) workaround is to define

Spark misconfigured? Small input split sizes in shark query

2014-07-15 Thread David Rosenstrauch
Got a spark/shark cluster up and running recently, and have been kicking the tires on it. However, been wrestling with an issue on it that I'm not quite sure how to solve. (Or, at least, not quite sure about the correct way to solve it.) I ran a simple Hive query (select count ...) against

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Matei Zaharia
Yeah, this is handled by the commit call of the FileOutputFormat. In general Hadoop OutputFormats have a concept called committing the output, which you should do only once per partition. In the file ones it does an atomic rename to make sure that the final output is a complete file. Matei On

Re: Need help on spark Hbase

2014-07-15 Thread Tathagata Das
Also, it helps if you post us logs, stacktraces, exceptions, etc. TD On Tue, Jul 15, 2014 at 10:07 AM, Jerry Lam chiling...@gmail.com wrote: Hi Rajesh, I have a feeling that this is not directly related to spark but I might be wrong. The reason why is that when you do: Configuration

Re: Can we get a spark context inside a mapper

2014-07-15 Thread Rahul Bhojwani
Thanks a lot Sean, Daniel, Matei and Jerry. I really appreciate your reply. And I also understand that I should be a little more patient. When I myself am not able to reply within the next 5 hours, how can I expect a question to be answered in that time? And yes, the idea of using a separate

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Zongheng Yang
- user@incubator Hi Keith, I did reproduce this using local-cluster[2,2,1024], and the errors look almost the same. Just wondering, despite the errors did your program output any result for the join? On my machine, I could see the correct output. Zongheng On Tue, Jul 15, 2014 at 1:46 PM,

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Michael Armbrust
hql and sql are just two different dialects for interacting with data. After parsing is complete and the logical plan is constructed, the execution is exactly the same. On Tue, Jul 15, 2014 at 2:50 PM, Jerry Lam chiling...@gmail.com wrote: Hi Michael, I don't understand the difference

Re: Multiple streams at the same time

2014-07-15 Thread Tathagata Das
Creating multiple StreamingContexts using the same SparkContext is currently not supported. :) Guess it was not clear in the docs. Note to self. TD On Tue, Jul 15, 2014 at 1:50 PM, gorenuru goren...@gmail.com wrote: Hi everyone. I have some problems running multiple streams at the same

Re: Multiple streams at the same time

2014-07-15 Thread gorenuru
Oh, sad to hear that :( From my point of view, creating a separate spark context for each stream is too expensive. Also, it's annoying because we have to be responsible for proper akka and UI port determination for each context. Do you know about any plans to support it? -- View this message in

Re: Multiple streams at the same time

2014-07-15 Thread Tathagata Das
Why do you need to create multiple streaming contexts at all? TD On Tue, Jul 15, 2014 at 3:43 PM, gorenuru goren...@gmail.com wrote: Oh, sad to hear that :( From my point of view, creating separate spark context for each stream is to expensive. Also, it's annoying because we have to be

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Zongheng Yang
Hi Keith & gorenuru, This patch (https://github.com/apache/spark/pull/1423) solves the errors for me in my local tests. If possible, can you guys test this out to see if it solves your test programs? Thanks, Zongheng On Tue, Jul 15, 2014 at 3:08 PM, Zongheng Yang zonghen...@gmail.com wrote: -

Re: Client application that calls Spark and receives an MLlib *model* Scala Object, not just result

2014-07-15 Thread Aris
Thanks Soumya - I guess the next step from here is to move the MLlib model from the Spark application which simply does the training, and give it to the client application which simply does the predictions. I will try the Kryo library to physically serialize the object and trade it across machines /

Re: Submitting to a cluster behind a VPN, configuring different IP address

2014-07-15 Thread Aris Vlasakakis
Hello! Just curious if anybody could respond to my original message, if anybody knows how to set the configuration variables that are handled by Jetty and not Spark's native framework... which is Akka, I think? Thanks On Thu, Jul 10, 2014 at 4:04 PM, Aris Vlasakakis a...@vlasakakis.com

Re: SQL + streaming

2014-07-15 Thread hsy...@gmail.com
Hi Tathagata, I could see the output of count, but no sql results. Running it standalone is meaningless for me; I just run it in my local single-node yarn cluster. Thanks On Tue, Jul 15, 2014 at 12:48 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Could you run it locally first to make

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-15 Thread Tathagata Das
Yes, what Nick said is the recommended way. In most use cases, a spark streaming program in production is not usually run from the shell. Hence, we chose not to make the external stuff (twitter, kafka, etc.) available to the spark shell to avoid the dependency conflicts brought in by them with spark's

Re: Multiple streams at the same time

2014-07-15 Thread gorenuru
Because I want to have different streams with different durations. For example, one triggers snapshot analysis every 5 minutes and another every 10 seconds. On Tue, Jul 15, 2014 at 3:59 pm, Tathagata Das [via Apache Spark User List] <ml-node+s1001560n984...@n3.nabble.com> wrote:

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
Cool. So Michael's hunch was correct, it is a thread issue. I'm currently using a tarball build, but I'll do a spark build with the patch as soon as I have a chance and test it out. Keith On Tue, Jul 15, 2014 at 4:14 PM, Zongheng Yang zonghen...@gmail.com wrote: Hi Keith gorenuru, This
