Re: Parallelize on spark context

2014-11-07 Thread _soumya_
Naveen, don't be worried - you're not the only one to be bitten by this. A little inspection of the Javadoc told me you have this other option: JavaRDD<Integer> distData = sc.parallelize(data, 100); -- now the RDD is split into 100 partitions.
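Spark's own slicing code isn't shown in the thread, but the effect of the numSlices argument can be sketched in plain Java (a stand-alone illustration, not Spark's actual internals): each partition receives a contiguous slice of the input list.

```java
import java.util.ArrayList;
import java.util.List;

public class SliceDemo {
    // Split a list into numSlices contiguous slices, mirroring how
    // sc.parallelize(data, numSlices) distributes a local collection.
    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        List<List<T>> slices = new ArrayList<>();
        int n = data.size();
        for (int i = 0; i < numSlices; i++) {
            int start = (int) ((long) i * n / numSlices);
            int end = (int) ((long) (i + 1) * n / numSlices);
            slices.add(new ArrayList<>(data.subList(start, end)));
        }
        return slices;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10; i++) data.add(i);
        List<List<Integer>> parts = slice(data, 4);
        System.out.println(parts.size());  // 4
        System.out.println(parts.get(0));  // [0, 1]
    }
}
```

Note that slice sizes differ by at most one element, so a larger numSlices mainly buys more parallelism per element, not balance problems.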

Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread _soumya_
Hi, my programming model requires me to generate multiple RDDs for various datasets across a single run and then run an action on each - e.g. MyFunc myFunc = ... // It implements VoidFunction // set some extra variables - all serializable ... for (JavaRDD<String> rdd : rddList) { ...
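Spark ships the function object to executors using Java serialization, so the function and every field it holds must be Serializable; a non-serializable field is what typically triggers the java.io.NotSerializableException in the subject. A minimal plain-Java sketch of that constraint (no Spark dependency; the class names are made up):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializableDemo {
    // Stand-in for the VoidFunction Spark would ship to executors:
    // the function itself AND every field it holds must be Serializable.
    static class GoodFunc implements Serializable {
        String outputPath = "/tmp/out";  // String is Serializable: fine
    }

    static class BadFunc implements Serializable {
        Thread worker = new Thread();    // Thread is not Serializable: fails
    }

    // Attempt the same Java serialization Spark performs on task closures.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new GoodFunc())); // true
        System.out.println(canSerialize(new BadFunc()));  // false
    }
}
```

A check like canSerialize is a quick way to find the offending field before handing the function to an RDD action.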

Re: Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread _soumya_
Excuse me - the line inside the loop should read: rdd.foreach(myFunc) - not sc. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-an-action-inside-a-loop-across-multiple-RDDs-java-io-NotSerializableException-tp16580p16581.html Sent from the Apache

Re: Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread _soumya_
Sorry - I'll furnish some details below. However, union is not an option for the business logic I have. The function will generate a specific file based on a variable passed in via the setter on the function. This variable changes with each RDD. I annotated the log line where the first run
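The pattern described - mutating a setter on the function object before each action - works because each action serializes a fresh copy of the function, capturing the field's value at that moment. A plain-Java sketch of that round trip (hypothetical names, no Spark dependency):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SetterFunc implements Serializable {
    private String outputFile;  // changes per RDD, travels with the closure

    public void setOutputFile(String f) { this.outputFile = f; }
    public String getOutputFile() { return outputFile; }

    // Round-trip through Java serialization, the way a task closure
    // is copied to an executor: the copy sees the value set at ship time.
    static SetterFunc roundTrip(SetterFunc f) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new ObjectOutputStream(buf).writeObject(f);
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray()));
        return (SetterFunc) in.readObject();
    }

    public static void main(String[] args) throws Exception {
        SetterFunc func = new SetterFunc();
        for (String file : new String[] {"a.out", "b.out"}) {
            func.setOutputFile(file);              // mutate before each "action"
            SetterFunc shipped = roundTrip(func);  // the copy an executor would see
            System.out.println(shipped.getOutputFile());
        }
    }
}
```

Each serialized copy is independent, which is why reusing one function instance across a loop of RDDs is safe as long as the setter is called before each action.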

Running Spark/YARN on AWS EMR - Issues finding file on hdfs?

2014-07-18 Thread _soumya_
I'm stumped with this one. I'm using YARN on EMR to distribute my Spark job. While the job seems to start up fine initially, the Spark executor nodes are having trouble pulling the jars from the location on HDFS where the master just put the files. [hadoop@ip-172-16-2-167 ~]$

Re: getting ClassCastException on collect()

2014-07-15 Thread _soumya_
Not sure I can help, but I ran into the same problem. Basically my use case is that I have a List of strings, which I then convert into an RDD using sc.parallelize(). This RDD is then operated on by the foreach() function. Same as you, I get a runtime exception: java.lang.ClassCastException:
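The snippet doesn't show where the cast fails, but one common way collect() or foreach() surfaces a ClassCastException only at runtime is an unchecked generic cast: the list actually holds a different element type than declared, and the failure appears only when an element is used. A plain-Java illustration (hypothetical, not the poster's code):

```java
import java.util.ArrayList;
import java.util.List;

public class CastDemo {
    public static void main(String[] args) {
        List<Object> raw = new ArrayList<>();
        raw.add(42);  // an Integer sneaks into the list

        // Unchecked cast: compiles with only a warning, no runtime check yet.
        @SuppressWarnings("unchecked")
        List<String> strings = (List<String>) (List<?>) raw;

        try {
            String s = strings.get(0);  // checkcast happens here, at use time
            System.out.println(s);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: " + e.getMessage());
        }
    }
}
```

The same deferred-check behavior means a type mismatch created on the driver can go unnoticed until an action materializes the elements.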

Linkage error - duplicate class definition

2014-07-11 Thread _soumya_
Facing a funny issue with the Spark class loader. Testing out basic functionality on a Vagrant VM with Spark running - it looks like Spark is attempting to ship the jar to a remote instance (in this case, local) and is somehow encountering the jar twice? 14/07/11 23:27:59 INFO DAGScheduler: Got job 0

spark-submit script and spark.files.userClassPathFirst

2014-07-01 Thread _soumya_
Hi, I'm trying to get rid of an error (NoSuchMethodError) while using Amazon's S3 client on Spark. I'm using the spark-submit script to run my code. Reading about my options and other threads, it seemed the most logical way would be to make sure my jar is loaded first. Spark submit in debug mode shows
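If userClassPathFirst is the route taken, the property from the subject line would be passed via --conf; a hypothetical invocation (the class and jar names are made up, and in Spark 1.x this flag was marked experimental):

```shell
# Ask Spark to prefer classes from the user jar over Spark's own
# bundled dependencies, to dodge the NoSuchMethodError version clash.
spark-submit \
  --class com.example.MyJob \
  --conf spark.files.userClassPathFirst=true \
  my-job-assembly.jar
```

The usual alternative, when this flag misbehaves, is shading the conflicting dependency into the assembly jar under a relocated package name.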

Issues starting up Spark on mesos - akka.version

2014-06-29 Thread _soumya_
I'm new to Spark and not very experienced with Scala issues. I'm facing this error message while trying to start up Spark on Mesos on a Vagrant box. vagrant@mesos:~/installs/spark-1.0.0$ java -cp rickshaw-spark-0.0.1-SNAPSHOT.jar com.evocalize.rickshaw.spark.applications.GenerateSEOContent -m