laziness in textFile reading from HDFS?

2015-09-28 Thread davidkl
Hello, I need to process a significant amount of data every day, about 4TB. This will be processed in batches of about 140GB. The cluster this will be running on doesn't have enough memory to hold the dataset at once, so I am trying to understand how this works internally. When using textFile to
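A minimal Scala sketch of why this can work without holding a whole 140GB batch in memory (the path and partition count below are hypothetical): textFile and the transformations are lazy, and each task streams through its own HDFS block only when an action runs.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lazy-textfile"))

// Nothing is read here: textFile only records the path and how to split it.
val lines = sc.textFile("hdfs:///data/batch-140g/*", minPartitions = 1000)

// Still lazy: this just extends the lineage.
val parsed = lines.map(_.split('\t')).filter(_.length > 3)

// Only the action triggers reading; each task streams its own block, so the
// full batch is never materialized in memory unless it is explicitly cached.
println(parsed.count())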

Re: ClassCastException: BlockManagerId cannot be cast to [B

2015-06-12 Thread davidkl
Hello, Just in case someone finds the same issue: it was caused by running the streaming app with a version of the jars different from the cluster's (the uber jar contained both yarn and spark). Regards, J
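For reference, a hedged build sketch of that kind of fix (artifact names and versions are assumptions): marking the Spark dependencies as "provided" keeps them out of the uber jar so the cluster's own versions are used.

// build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.2.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.2.1" % "provided"
)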

ClassCastException: BlockManagerId cannot be cast to [B

2015-06-11 Thread davidkl
Hello, I am running a streaming app on Spark 1.2.1. When running locally everything works fine. When I try on yarn-cluster it fails and I see a ClassCastException in the log (see below). I can run Spark (non-streaming) apps in the cluster with no problem. Any ideas here? Thanks in advance! WARN

Re: Custom partitioning of DStream

2015-04-23 Thread davidkl
Hello Evo, Ranjitiyer, I am also looking for the same thing. Using foreach is not useful for me, as processing the RDD as a whole won't be distributed across workers, and that would kill performance in my application :-/ Let me know if you find a solution for this. Regards
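A possible sketch, not taken from the thread itself, of applying per-batch partitioning without collapsing work onto the driver: transform() exposes each batch's RDD, so a Partitioner can be applied there. The key type and partition count are assumptions.

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

def withCustomPartitioning(stream: DStream[(String, Int)]): DStream[(String, Int)] =
  stream.transform(rdd => rdd.partitionBy(new HashPartitioner(8)))

// Downstream operators (mapPartitions, foreachRDD + rdd.foreachPartition, ...)
// then run distributed over the repartitioned batches on the workers.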

Re: Identify the performance bottleneck from hardware prospective

2015-03-05 Thread davidkl
Hello Julaiti, Maybe I am just asking the obvious :-) but did you check disk IO? Depending on what you are doing, that could be the bottleneck. In my case none of the HW resources was the bottleneck; instead it was some distributed features that were blocking execution (e.g. Hazelcast). Could that be

Re: Extra output from Spark run

2015-03-05 Thread davidkl
If you do not want those progress indications to appear, just set spark.ui.showConsoleProgress to false, e.g.: System.setProperty("spark.ui.showConsoleProgress", "false"); Regards
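A short sketch of the same setting applied on the SparkConf before the context is created (the app name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("quiet-console")
  .set("spark.ui.showConsoleProgress", "false")  // must be set before the SparkContext exists
val sc = new SparkContext(conf)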

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-19 Thread davidkl
Hi Jon, I am looking for an answer to a similar question in the docs now, so far no clue. I would need to know what Spark's behaviour is in a situation like the example you provided, but also taking into account that there are multiple partitions/workers. I could imagine it's possible that
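One hedged way to observe job-level concurrency explicitly, independent of how stages inside a single job are scheduled: SparkContext is thread-safe, so two independent actions submitted from separate threads become separate jobs that the scheduler may run at the same time, subject to free cores and the scheduling mode. The RDDs here are toy placeholders.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.SparkContext

def runIndependently(sc: SparkContext): (Long, Long) = {
  // two actions with no dependency on each other, submitted concurrently
  val a = Future { sc.parallelize(1 to 1000000).map(_ * 2).count() }
  val b = Future { sc.parallelize(1 to 1000000).filter(_ % 3 == 0).count() }
  (Await.result(a, Duration.Inf), Await.result(b, Duration.Inf))
}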

Re: Newbie Question on How Tasks are Executed

2015-01-19 Thread davidkl
Hello Mixtou, if you want to look at the partition ID, I believe you want to use mapPartitionsWithIndex -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Newbie-Question-on-How-Tasks-are-Executed-tp21064p21228.html
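A minimal sketch of that call (element values and partition count are arbitrary):

import org.apache.spark.SparkContext

def tagWithPartitionId(sc: SparkContext): Array[(Int, Int)] =
  sc.parallelize(1 to 10, numSlices = 3)
    .mapPartitionsWithIndex { (partitionId, iter) =>
      // pair every element with the ID of the partition it lives in
      iter.map(x => (partitionId, x))
    }
    .collect()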

Re: Spark Streaming scheduling control

2014-10-20 Thread davidkl
Thanks Akhil Das-2: actually I tried setting spark.default.parallelism, but it had no effect :-/ I am running standalone and performing a mix of map/filter/foreachRDD. I had to force parallelism with repartition to get both workers to process tasks, but I do not think this should be required (and I am
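A sketch of the workaround described (the stream type and partition count are assumptions): repartition the received data before the map/filter stages so tasks land on both workers.

import org.apache.spark.streaming.dstream.DStream

def spread(lines: DStream[String], partitions: Int = 4): DStream[String] =
  lines.repartition(partitions)   // shuffle received blocks across the cluster
       .filter(_.nonEmpty)        // downstream stages now get tasks on all workers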

Re: Spark Streaming scheduling control

2014-10-20 Thread davidkl
One detail: even forcing partitions (repartition), Spark is still holding some tasks; if I increase the load of the system (increasing spark.streaming.receiver.maxRate), even if all workers are used, the one with the receiver gets twice as many tasks compared with the other workers. Total

Spark Streaming scheduling control

2014-10-19 Thread davidkl
Hello, I have a cluster with 1 master and 2 slaves running 1.1.0. I am having problems getting both slaves working at the same time. When I launch the driver on the master, one of the slaves is assigned the receiver task, and initially both slaves start processing tasks. After a few tens of batches,
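One commonly suggested mitigation, sketched here as an assumption rather than something taken from this thread: create more than one receiver and union the streams, then repartition, so receiving and processing are not pinned to a single slave. The host and ports are placeholders.

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

def multiReceiverStream(ssc: StreamingContext): DStream[String] = {
  // one receiver per input stream, so receiving is not tied to a single worker
  val streams = (1 to 2).map(i => ssc.socketTextStream("source-host", 9000 + i))
  ssc.union(streams).repartition(4)
}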

bug with MapPartitions?

2014-10-17 Thread davidkl
Hello, Maybe there is something I do not understand, but I believe this code should not throw any serialization error when I run it in the spark shell. Using similar code with map instead of mapPartitions works just fine. import java.io.BufferedInputStream import java.io.FileInputStream
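For comparison, a hedged sketch of the pattern that normally avoids this kind of error: an InputStream is not Serializable, so instead of capturing one in the closure, create it inside mapPartitions on the executor. The file path and the per-partition logic are placeholders, not the code from the thread.

import java.io.{BufferedInputStream, FileInputStream}
import org.apache.spark.SparkContext

def bytesPerPartition(sc: SparkContext, path: String): Array[Int] =
  sc.parallelize(1 to 4, numSlices = 4).mapPartitions { iter =>
    // the stream is created on the executor, so it never has to be serialized
    val in = new BufferedInputStream(new FileInputStream(path))
    try {
      val buf = new Array[Byte](1024)
      Iterator.single(in.read(buf))
    } finally in.close()
  }.collect()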