Broadcasting Large Objects Fails?
Hi,

I am trying to broadcast large objects (on the order of a few hundred MB). However, I keep getting errors when doing so:

    Traceback (most recent call last):
      File "/LORM_experiment.py", line 510, in <module>
        broadcast_gradient_function = sc.broadcast(gradient_function)
      File "/scratch/users/213444/spark/python/pyspark/context.py", line 643, in broadcast
        return Broadcast(self, value, self._pickled_broadcast_vars)
      File "/scratch/users/213444/spark/python/pyspark/broadcast.py", line 65, in __init__
        self._path = self.dump(value, f)
      File "/scratch/users/213444/spark/python/pyspark/broadcast.py", line 82, in dump
        cPickle.dump(value, f, 2)
    SystemError: error return without exception set

    15/02/22 04:52:14 ERROR Utils: Uncaught exception in thread delete Spark local dirs
    java.lang.IllegalStateException: Shutdown in progress

Any idea how to prevent that? I have plenty of RAM, so that shouldn't be the problem.

Thanks,
Tassilo
Running Example Spark Program
Hello All,

I am new to Apache Spark and am trying to run JavaKMeans.java from the Spark examples on my Ubuntu system. I downloaded spark-1.2.1-bin-hadoop2.4.tgz (http://www.apache.org/dyn/closer.cgi/spark/spark-1.2.1/spark-1.2.1-bin-hadoop2.4.tgz) and started sbin/start-master.sh.

After starting Spark, http://localhost:8080/ shows the following status for my Spark instance:

- URL: spark://vm:7077
- Workers: 0
- Cores: 0 Total, 0 Used
- Memory: 0.0 B Total, 0.0 B Used
- Applications: 0 Running, 4 Completed
- Drivers: 0 Running, 0 Completed
- Status: ALIVE

The number of cores is 0 and the memory is 0.0 B. I think this is why I get the following error when I try to run JavaKMeans.java:

    Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

Am I missing any configuration before running sbin/start-master.sh?

Regards,
Surendran
Re: Running Example Spark Program
If you would like a more detailed walkthrough, I wrote one recently: https://dataissexy.wordpress.com/2015/02/03/apache-spark-standalone-clusters-bigdata-hadoop-spark/

Regards,
Jason Bell

On 22 Feb 2015 14:16, VISHNU SUBRAMANIAN johnfedrickena...@gmail.com wrote:
> Try restarting your Spark cluster:
>     ./sbin/stop-all.sh
>     ./sbin/start-all.sh
RE: Spark SQL odbc on Windows
Hi Francisco,

While I haven't tried this, have a look at the contents of start-thriftserver.sh: all it's doing is setting up a few variables and calling

    ./bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

and passing some additional parameters. Perhaps doing the same would work? Note also that this hosts a JDBC server (not ODBC), but there's a free ODBC connector from Databricks, built by Simba, with which I've been able to connect to a Spark cluster hosted on Linux.

-Ashic.
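For Windows readers, a sketch of what the equivalent invocation might look like, based on what start-thriftserver.sh does in Spark 1.2 ("spark-internal" is the placeholder resource the shell script passes; the master URL is an illustrative assumption, and this is untested on Windows):

    bin\spark-submit.cmd --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 ^
        --master spark://vm:7077 ^
        spark-internal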
Re: Running Example Spark Program
Try restarting your Spark cluster:

    ./sbin/stop-all.sh
    ./sbin/start-all.sh

Thanks,
Vishnu
Re: Broadcasting Large Objects Fails?
Hi Akhil,

Thanks for your reply. I am using the latest version of Spark, 1.2.1 (I also tried the 1.3 development branch). If I am not mistaken, TorrentBroadcast is the default there, isn't it?

Thanks,
Tassilo

On Sun, Feb 22, 2015 at 10:59 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
> Did you try with the torrent broadcast factory?
Re: Running Example Spark Program
Thank you Jason. I got the program working after setting SPARK_WORKER_CORES and SPARK_WORKER_MEMORY.

While running the program from Eclipse, I got a strange ClassNotFoundException. In JavaKMeans.java, ParsePoint is a static inner class, and when running the program I got ClassNotFoundException for ParsePoint. To get the program working, I had to either set

    String jars[] = { "/home/surendran/workspace/SparkKMeans/target/SparkKMeans-0.0.1-SNAPSHOT.jar" };
    sparkConf.setJars(jars);

or set SPARK_CLASSPATH to my jar. What if I modify ParsePoint? Do I need to rebuild the jar to pick up the changes when running from Eclipse?

Regards,
Surendran
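For later readers, a minimal sketch of the worker settings mentioned above, assuming a standalone deployment; the values are illustrative placeholders, not recommendations:

    # conf/spark-env.sh on each worker machine, before running sbin/start-all.sh
    export SPARK_WORKER_CORES=2     # cores this worker offers to applications
    export SPARK_WORKER_MEMORY=2g   # memory this worker offers to applications

With at least one worker registered, the master UI should report non-zero cores and memory.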
Re: Spark SQL odbc on Windows
Hi Francisco,

Out of curiosity, why ROLAP using multidimensional mode (vs. tabular) from SSAS to Spark? As a past SSAS guy, you've definitely piqued my interest.

One thing you may run into is that the SQL generated by SSAS can be quite convoluted. When we were doing the same thing to try to get SSAS to connect to Hive (see the paper at http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26-FEF9550EFD44/MOLAP2HIVE_KLOUT.docx), that was definitely a blocker. Note that Spark SQL is different from HiveQL, but you may run into the same issue. If so, the trick you may want to use is similar to the paper: use a SQL Server linked server connection and have SQL Server be your translator for the SQL generated by SSAS.

HTH!
Denny
Re: Spark performance tuning
You can simply follow these: http://spark.apache.org/docs/1.2.0/tuning.html

Thanks
Best Regards

On Sun, Feb 22, 2015 at 1:14 AM, java8964 java8...@hotmail.com wrote:

Can someone share some ideas about how to tune the GC time? Thanks

From: java8...@hotmail.com
To: user@spark.apache.org
Subject: Spark performance tuning
Date: Fri, 20 Feb 2015 16:04:23 -0500

Hi, I am new to Spark, and I am trying to test Spark SQL performance vs. Hive. I set up a standalone box with 24 cores and 64 GB of memory. We have one SQL query in mind to test. Here is the basic setup on this one box for the query we are trying to run:

1) Dataset 1: a 6.6 GB AVRO file with snappy compression, which contains a nested structure of 3 arrays of structs in AVRO
2) Dataset 2: a 5 GB AVRO file with snappy compression
3) Dataset 3: a 2.3 MB AVRO file with snappy compression

The basic structure of the query is like this:

    (select xxx from dataset1 lateral view outer explode(struct1) lateral view outer explode(struct2) where x)
    left outer join
    (select xxx from dataset2 lateral view explode(xxx) where xxx) on xxx
    left outer join
    (select xxx from dataset3 where xxx) on x

So overall it does 2 outer explodes on dataset1, a left outer join with an explode of dataset2, and finally a left outer join with dataset3.

On this standalone box I installed Hadoop 2.2, Hive 0.12, and Spark 1.2.0. As a baseline, the above query finishes in around 50 minutes in Hive 0.12, with 6 mappers and 3 reducers, each with a 1 GB max heap, in 3 rounds of MR jobs. This is a very expensive query running in our production every day, of course with a much bigger data set.

Now I want to see how fast Spark can run the same query. I am using the following settings, based on my understanding of Spark, for a fair test between it and Hive:

    export SPARK_WORKER_MEMORY=32g
    export SPARK_DRIVER_MEMORY=2g
    --executor-memory 9g --total-executor-cores 9

I am trying to run one executor with 9 cores and a max 9 GB heap, so that Spark uses almost the same resources we gave to MapReduce. Here is the result without any additional configuration changes, running under Spark 1.2.0, using HiveContext in Spark SQL to run exactly the same query. Spark SQL generated 5 stages of tasks, shown below:

    4  collect at SparkPlan.scala:84        2015/02/20 10:48:46  26 s     200/200
    3  mapPartitions at Exchange.scala:64   2015/02/20 10:32:07  16 min   200/200  1112.3 MB
    2  mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  9 min    40/40    4.7 GB   22.2 GB
    1  mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  1.9 min  50/50    6.2 GB   2.8 GB
    0  mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  6 s      2/2      2.3 MB   156.6 KB

So the wall time of the whole query is 26 s + 16 min + 9 min + 2 min + 6 s, around 28 minutes. That is about 56% of the original time. Not bad, but I want to know what tuning of Spark can make it even faster.

For stages 2 and 3, I observed that GC time became more and more expensive, especially in stage 3, shown below:

    Metric         Min     25th pct  Median  75th pct  Max
    Duration       20 s    30 s      35 s    39 s      2.4 min
    GC Time        9 s     17 s      20 s    25 s      2.2 min
    Shuffle Write  4.7 MB  4.9 MB    5.2 MB  6.1 MB    8.3 MB

So at the median, GC took 20s/35s = 57% of the task time.

The first change I made was to add the following line in spark-defaults.conf:

    spark.serializer org.apache.spark.serializer.KryoSerializer

My assumption was that using KryoSerializer instead of the default Java serializer would lower the memory footprint and thus lower GC pressure at runtime.

I know I changed the correct spark-defaults.conf, because if I add

    spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

in the same file, I see the GC activity in the stdout file. Of course, in this test I didn't add that, as I wanted to make only one change at a time. The result was almost the same as with the standard Java serializer: the wall time was still 28 minutes, and in stage 3 the GC still took around 50 to 60% of the time, with almost the same min, median, and max figures, and no noticeable performance gain.

Next, based on my understanding, I think the default spark.storage.memoryFraction is too high for this query: there is no reason to reserve so much memory for caching data, because we don't reuse any dataset within this one query. So I added --conf spark.storage.memoryFraction=0.3 at the end of the spark-shell command, to reserve roughly half the memory for caching compared with the first run. Of course, this time I rolled back the first change (KryoSerializer). The result looks almost the same; the whole query finished around 28s + 14m
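For readers following along, here are the settings discussed in this thread collected into one sketch of spark-defaults.conf (Spark 1.2 property names; the values are the thread's own, not general recommendations):

    spark.serializer                 org.apache.spark.serializer.KryoSerializer
    spark.storage.memoryFraction     0.3
    # surfaces GC activity in the executor stdout logs
    spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps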
Re: Broadcasting Large Objects Fails?
Did you try with the torrent broadcast factory?

Thanks
Best Regards
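For anyone searching later, a minimal sketch of selecting the broadcast implementation explicitly from PySpark; the property names are from the Spark 1.2 configuration docs, while the app name and block size are illustrative assumptions:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("broadcast-test")
            # TorrentBroadcastFactory is the default in 1.2; HttpBroadcastFactory
            # is the alternative to try when torrent broadcast misbehaves
            .set("spark.broadcast.factory",
                 "org.apache.spark.broadcast.HttpBroadcastFactory")
            # size of each broadcast piece, in KB
            .set("spark.broadcast.blockSize", "4096"))
    sc = SparkContext(conf=conf)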
Re: Broadcasting Large Objects Fails?
I see, thanks. Yes, I have already tried all sorts of changes to these parameters. Unfortunately, none of them had any impact.

Thanks,
Tassilo

On Sun, Feb 22, 2015 at 1:24 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
> Yes it is. You have some more customizable options over here:
> http://spark.apache.org/docs/1.2.0/configuration.html#compression-and-serialization
Re: Spark SQL odbc on Windows
Back to Thrift: there was an earlier thread on this topic at http://mail-archives.apache.org/mod_mbox/spark-user/201411.mbox/%3CCABPQxsvXA-ROPeXN=wjcev_n9gv-drqxujukbp_goutvnyx...@mail.gmail.com%3E that may be useful as well.
Re: Broadcasting Large Objects Fails?
Yes it is. You have some more customizable options over here: http://spark.apache.org/docs/1.2.0/configuration.html#compression-and-serialization

Thanks
Best Regards
Spark SQL odbc on Windows
Hello,

I work at an MS consulting company and we are evaluating including Spark in our big data offering. We are particularly interested in testing Spark as a ROLAP engine for SSAS, but we cannot find a way to activate the ODBC server (Thrift) on a Windows cluster. There is no start-thriftserver.sh command available for Windows. Does anybody know if there is a way to make this work?

Thanks in advance!!
Francisco
Re: Missing shuffle files
Do you guys have dynamic allocation turned on for YARN?

Anders, was task 450 in your job acting like a reducer and fetching the map spill output data from a different node? If a reducer task can't read the remote data it needs, that could cause the stage to fail. Sometimes this forces the previous stage to also be re-computed if it's a wide dependency.

But like Petar said, if you turn the external shuffle service on, the YARN NodeManager process on the slave machines will serve out the map spill data instead of the executor JVMs (by default, unless you turn external shuffle on, the executor JVM itself serves out the shuffle data, which causes problems if an executor dies).

Corey, how often are executors crashing in your app? How many executors do you have in total, and what is the memory size for each? You can change what fraction of the executor heap is used for your user code vs. the shuffle vs. RDD caching with the spark.storage.memoryFraction setting.

On Sat, Feb 21, 2015 at 2:58 PM, Petar Zecevic petar.zece...@gmail.com wrote:

Could you try to turn on the external shuffle service?

    spark.shuffle.service.enabled = true

On 21.2.2015. 17:50, Corey Nolet wrote:

I'm experiencing the same issue. Upon closer inspection I'm noticing that executors are being lost as well. Thing is, I can't figure out how they are dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3 TB of memory allocated to the application. I was thinking perhaps it was possible that a single executor was getting a single or a couple of large partitions, but shouldn't disk persistence kick in at that point?

On Sat, Feb 21, 2015 at 11:20 AM, Anders Arpteg arp...@spotify.com wrote:

For large jobs, the following error message is shown, which seems to indicate that shuffle files are missing for some reason. It's a rather large job with many partitions. If the data size is reduced, the problem disappears. I'm running a build from Spark master post-1.2 (built 2015-01-16) and running on YARN 2.2. Any idea how to resolve this problem?
    User class threw exception: Job aborted due to stage failure: Task 450 in stage 450.1 failed 4 times, most recent failure: Lost task 450.3 in stage 450.1 (TID 167370, lon4-hadoopslave-b77.lon4.spotify.net): java.io.FileNotFoundException: /disk/hd06/yarn/local/usercache/arpteg/appcache/application_1424333823218_21217/spark-local-20150221154811-998c/03/rdd_675_450 (No such file or directory)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
        at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:76)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:786)
        at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:637)
        at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:149)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:74)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

TIA,
Anders
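A sketch of the external shuffle service setup Petar suggested, for Spark on YARN around the 1.2/1.3 era; the yarn-site.xml wiring below assumes the Spark YARN shuffle jar is on the NodeManager classpath:

    # spark-defaults.conf
    spark.shuffle.service.enabled  true

    <!-- yarn-site.xml on every NodeManager -->
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle,spark_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>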
Re: How to send user variables from Spark client to custom InputFormat or RecordReader ?
The SparkConf doesn't allow you to set arbitrary variables. You can use SparkContext's HadoopRDD and create a JobConf (with whatever variables you want), and then grab them out of the JobConf in your RecordReader.
Re: Posting to the list
I'm also facing the same issue. This is the third time: whenever I post anything, it is never accepted by the community, and at the same time I get a failure mail at my registered mail ID. And when I click the "subscribe to this mailing list" link, I don't get any new subscription mail in my inbox. Can anyone please suggest the best way to subscribe this email ID?
Re: [Spark SQL]: Convert SchemaRDD back to RDD
I haven't found the method in http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD

The new DataFrame has this method:

    /**
     * Returns the content of the [[DataFrame]] as an [[RDD]] of [[Row]]s.
     * @group rdd
     */
    def rdd: RDD[Row] = {

FYI

On Sun, Feb 22, 2015 at 11:51 AM, stephane.collot stephane.col...@gmail.com wrote:
> I think that the feature (convert a SchemaRDD to a structured class RDD) is now available.
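A minimal sketch of converting back to a structured-class RDD with the 1.3-style API; the Person case class and the column layout are illustrative assumptions:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    case class Person(name: String, age: Int)

    // df: a DataFrame with columns (name: String, age: Int)
    val people: RDD[Person] = df.rdd.map {
      case Row(name: String, age: Int) => Person(name, age)
    }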
How to send user variables from Spark client to custom InputFormat or RecordReader ?
Hi,

I have written a custom InputFormat and RecordReader for Spark, and I need to use user variables from the Spark client program. I added them in SparkConf:

    val sparkConf = new SparkConf().setAppName(args(0)).set("developer", "MyName")

and in the InputFormat class:

    protected boolean isSplitable(JobContext context, Path filename) {
      System.out.println("# Developer " + context.getConfiguration().get("developer"));
      return false;
    }

but it returns null. Is there any way I can pass user variables to my custom code?

Thanks!!
Re: [Spark SQL]: Convert SchemaRDD back to RDD
Hi Michael,

I think that the feature (converting a SchemaRDD to a structured class RDD) is now available. But I didn't understand from the PR how exactly to do this. Can you give an example or doc links?

Best regards
Re: Posting to the list
bq. i didnt get any new subscription mail in my inbox

Have you checked your Spam folder?

Cheers
How to integrate HBASE on Spark
Hi,

I have installed Spark on a 3-node cluster. The Spark services are up and running, but I want to integrate HBase with Spark. Do I need to install HBase on the Hadoop cluster or on the Spark cluster? Please let me know asap.

Regards,
Sandeep.v
Submitting jobs to Spark EC2 cluster remotely
I've set up an EC2 cluster with Spark. Everything works; all master/slaves are up and running. I'm trying to submit a sample job (SparkPi). When I ssh to the cluster and submit it from there, everything works fine. However, when the driver is created on a remote host (my laptop), it doesn't work. I've tried both modes for --deploy-mode:

--deploy-mode=client:

From my laptop:

    ./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar

This results in the following indefinite warnings/errors:

    WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
    15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
    15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1

...and failed drivers appear in the Spark Web UI under "Completed Drivers" with State=ERROR. I've tried to pass limits for cores and memory to the submit script, but it didn't help.

--deploy-mode=cluster:

From my laptop:

    ./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar

The result is:

    Driver successfully submitted as driver-20150223023734-0007
    ... waiting before polling master for driver state
    ... polling master for driver state
    State of driver-20150223023734-0007 is ERROR
    Exception from cluster was: java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
    java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
        at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
        at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)

So, I'd appreciate any pointers on what is going wrong and some guidance on how to deploy jobs from a remote client. Thanks.
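One thing worth checking for the cluster-mode failure: with --deploy-mode cluster the driver is started on a worker, and that worker tries to download the application jar from the path given to spark-submit, so a path that only exists on the laptop cannot resolve. A sketch of one workaround, with the S3 bucket and credentials setup as assumptions (any URL the workers can read, such as hdfs:// or http://, should work the same way):

    # upload the jar somewhere the workers can reach
    aws s3 cp ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar s3://my-bucket/jars/

    # submit with a cluster-visible URL as the application jar
    # (s3n access requires S3 credentials in the workers' Hadoop config)
    ./bin/spark-submit \
      --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 \
      --deploy-mode cluster \
      --class SparkPi \
      s3n://my-bucket/jars/ec2test_2.10-0.0.1.jar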
Re: cannot run spark shell in yarn-client mode
Did anyone fix this error?
Re: How to integrate HBASE on Spark
If both clusters are on the same network, then I'd suggest installing it on the Hadoop cluster. If you install it on the Spark cluster itself, HBase may take up a few CPU cycles, and there's a chance the job will lag.

Thanks
Best Regards
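Whichever cluster ends up hosting HBase, reading a table from Spark usually goes through HBase's TableInputFormat. A minimal sketch, assuming the HBase client jars and hbase-site.xml are on the Spark classpath and that a table named "mytable" exists (both assumptions):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "mytable")

    // one (row key, row contents) pair per HBase row
    val rows = sc.newAPIHadoopRDD(hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])
    println(rows.count())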
Re: Posting to the list
I checked it, but I didn't see any mail from the user list. Let me try one more time.

--
{{{H2N}}}-(@:
Re: Use Spark Streaming for Batch?
Hi,

On Sat, Feb 21, 2015 at 1:05 AM, craigv craigvanderbo...@gmail.com wrote:
> Might it be possible to perform large batch processing on HDFS time-series data using Spark Streaming?
> 1. I understand that there is not currently an InputDStream that could do what's needed. I would have to create such a thing.
> 2. Time is a problem. I would have to use the timestamps on our events for any time-based logic and state management.
> 3. The batch duration would become meaningless in this scenario. Could I just set it to something really small (say 1 second) and then let it fall behind, processing the data as quickly as it could?

So, if it is not an issue for you that everything is processed in one batch, you can use streamingContext.textFileStream(hdfsDirectory). This will create a DStream that has one huge RDD with all the data in the first batch and empty batches afterwards. You can have a small batch size; that should not be a problem.

An alternative would be to write some code that creates one RDD per file in your HDFS directory, create a Queue of those RDDs, and then use streamingContext.queueStream(), possibly with the oneAtATime=true parameter (which will process only one RDD per batch).

However, doing window computations etc. with the timestamps embedded *in* your data will be a major effort, as in: you cannot use the existing windowing functionality from Spark Streaming. If you want to read more about that, there have been a number of discussions on that topic on this list; maybe you can look them up.

Tobias
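A rough sketch of the queueStream alternative described above, assuming the HDFS file list has already been gathered (the paths and the batch size are illustrative):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))

    // one RDD per input file, queued in processing order
    val rddQueue = new mutable.Queue[RDD[String]]()
    val files = Seq("hdfs:///events/part-00000", "hdfs:///events/part-00001")
    files.foreach(f => rddQueue += sc.textFile(f))

    // oneAtATime = true: each batch dequeues exactly one RDD
    val stream = ssc.queueStream(rddQueue, oneAtATime = true)
    stream.foreachRDD(rdd => println(rdd.count()))
    ssc.start()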
Re: Any sample code for Kafka consumer
Spark Streaming already directly supports Kafka: http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources

Is there any reason why that is not sufficient?

TD

On Sun, Feb 22, 2015 at 5:18 PM, mykidong mykid...@gmail.com wrote:
> In Java, you can see this example: https://github.com/mykidong/spark-kafka-simple-consumer-receiver
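For completeness, a minimal sketch of the built-in receiver that guide documents; the ZooKeeper quorum, consumer group, and topic name are illustrative assumptions:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(2))

    // topic name -> number of receiver threads
    val topics = Map("events" -> 1)
    val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group", topics)

    stream.map(_._2).print()   // print message values; the keys are the Kafka message keys
    ssc.start()
    ssc.awaitTermination()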
Re: Launching Spark cluster on EC2 with Ubuntu AMI
bq. bash: git: command not found

Looks like the AMI doesn't have git pre-installed.

Cheers
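If that is the cause, one plausible fix is to bake git (which spark-ec2 needs in order to clone its setup scripts onto the master) into the custom AMI before imaging it, sketched here for Ubuntu:

    # on the instance used to build the custom AMI
    sudo apt-get update
    sudo apt-get install -y git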
Re: How to send user variables from Spark client to custom InputFormat or RecordReader ?
Thanks. I extracted the Hadoop configuration, set my arbitrary variable, and was able to read it inside the InputFormat from JobContext.getConfiguration().

--
{{{H2N}}}-(@:
Re: How to send user variables from Spark client to custom InputFormat or RecordReader ?
Instead of setting it in SparkConf, set it via SparkContext.hadoopConfiguration.set(key, value), and extract the same key from the JobContext.

--Harihar
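Putting the thread's resolution together, a minimal sketch; "developer" is the thread's example key, while MyInputFormat and the paths are hypothetical:

    // driver side (Scala): stash the variable in the Hadoop configuration
    sc.hadoopConfiguration.set("developer", "MyName")

    // create the RDD over the custom input format with that configuration
    val rdd = sc.newAPIHadoopFile(
      "hdfs:///input/path",
      classOf[MyInputFormat],   // hypothetical custom InputFormat
      classOf[org.apache.hadoop.io.LongWritable],
      classOf[org.apache.hadoop.io.Text],
      sc.hadoopConfiguration)

    // InputFormat side (Java): read it back from the JobContext
    //   String developer = context.getConfiguration().get("developer");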
Launching Spark cluster on EC2 with Ubuntu AMI
I'm trying to launch a Spark cluster on AWS EC2 with a custom AMI (Ubuntu) using the following:

    ./ec2/spark-ec2 --key-pair=*** --identity-file='/home/***.pem' --region=us-west-2 --zone=us-west-2b --spark-version=1.2.1 --slaves=2 --instance-type=t2.micro --ami=ami-29ebb519 --user=ubuntu launch spark-ubuntu-cluster

Everything starts OK and the instances are launched:

    Found 1 master(s), 2 slaves
    Waiting for all instances in cluster to enter 'ssh-ready' state.
    Generating cluster's SSH key on master.

But then I get the following SSH errors until it stops trying and quits:

    bash: git: command not found
    Connection to ***.us-west-2.compute.amazonaws.com closed.
    Error executing remote command, retrying after 30 seconds: Command '['ssh', '-o', 'StrictHostKeyChecking=no', '-i', '/home/***t.pem', '-o', 'UserKnownHostsFile=/dev/null', '-t', '-t', u'ubuntu@***.us-west-2.compute.amazonaws.com', 'rm -rf spark-ec2 && git clone https://github.com/mesos/spark-ec2.git -b v4']' returned non-zero exit status 127

I know that the Spark EC2 scripts are not guaranteed to work with custom AMIs, but still, it should work... Any advice would be greatly appreciated!
Re: Any sample code for Kafka consumer
In Java, you can see this example: https://github.com/mykidong/spark-kafka-simple-consumer-receiver

- Kidong.

From: icecreamlc
Subject: Any sample code for Kafka consumer
> Dear all, do you have any sample code that consumes data from a Kafka server using Spark Streaming? Thanks a lot!!!
Re: Use Spark Streaming for Batch?
See if https://issues.apache.org/jira/browse/SPARK-3660 helps you. My patch has been accepted, and this enhancement is scheduled for 1.3.0. It lets you specify an initial RDD for the updateStateByKey operation. Let me know if you need any information.
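A sketch of what the new overload enables, based on the SPARK-3660 description; the word-count state, the input DStream, and the values are illustrative assumptions:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // state carried over from a previous run: word -> count
    val initialRDD: RDD[(String, Int)] = sc.parallelize(Seq(("hello", 10), ("world", 4)))

    def updateFunc(newValues: Seq[Int], state: Option[Int]): Option[Int] =
      Some(newValues.sum + state.getOrElse(0))

    // words: a DStream[(String, Int)] of new observations
    val counts = words.updateStateByKey[Int](
      updateFunc _,
      new HashPartitioner(sc.defaultParallelism),
      initialRDD)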