Re: RDD collect hangs on large input data

2015-04-17 Thread Zsolt Tóth
Thanks for your answer, Imran. I haven't tried your suggestions yet, but setting spark.shuffle.blockTransferService=nio solved my issue. There is a JIRA for this: https://issues.apache.org/jira/browse/SPARK-6962. Zsolt
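For reference, a minimal sketch of how that property could be set when building the SparkConf (the same effect as passing --conf spark.shuffle.blockTransferService=nio to spark-submit). The app name below is a placeholder, not from the thread:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Sketch only: switch the Spark 1.x shuffle to the NIO block transfer
    // service instead of the Netty-based one, as in the workaround above.
    SparkConf conf = new SparkConf()
        .setAppName("collect-example")                        // placeholder name
        .set("spark.shuffle.blockTransferService", "nio");
    JavaSparkContext sc = new JavaSparkContext(conf);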

Re: RDD collect hangs on large input data

2015-04-08 Thread Zsolt Tóth
I use EMR 3.3.1, which comes with Java 7. Do you think this may cause the issue? Did you test it with Java 8?

Re: RDD collect hangs on large input data

2015-04-07 Thread Jon Chase
Zsolt - what version of Java are you running?

Re: RDD collect hangs on large input data

2015-03-30 Thread Zsolt Tóth
Thanks for your answer! I don't call .collect() in order to trigger the execution; I call it because I need the RDD on the driver. This is not a huge RDD, and it's not larger than the one returned with 50GB input data. The end of the stack trace: The two IPs are the two worker nodes, I think

Re: RDD collect hangs on large input data

2015-03-29 Thread Akhil Das
Don't call .collect() if your data size is huge; you can simply do a count() to trigger the execution. Can you paste your exception stack trace so that we'll know what's happening? Thanks, Best Regards
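If the point is only to force evaluation, a count-based variant could look roughly like this (assuming a pair RDD named counts, as in the job described below; the names are illustrative):

    // count() runs the whole job but returns only a single long to the driver.
    long numPairs = counts.count();
    System.out.println("number of output pairs: " + numPairs);

    // By contrast, collect() ships every element of the RDD back to the driver:
    // List<Tuple2<String, Integer>> everything = counts.collect();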

RDD collect hangs on large input data

2015-03-27 Thread Zsolt Tóth
Hi, I have a simple Spark application: it creates an input RDD with sc.textFile, and it calls flatMapToPair, reduceByKey and map on it. The output RDD is small, a few MB. Then I call collect() on the output. If the text file is ~50GB, it finishes in a few minutes. However, if it's larger
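A minimal Java sketch of the kind of job described above, written against the Spark 1.x Java API with Java 8 lambdas for brevity (the thread itself mentions Java 7, where anonymous inner classes would be used instead). The input path and the word-count-style pair logic are assumptions for illustration, not the original code:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class CollectExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("collect-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Large text input; the path is a placeholder.
        JavaRDD<String> lines = sc.textFile("hdfs:///path/to/input");

        // flatMapToPair -> reduceByKey, mirroring the steps named in the mail.
        JavaPairRDD<String, Integer> counts = lines
            .flatMapToPair(line -> {
              List<Tuple2<String, Integer>> pairs = new ArrayList<>();
              for (String word : line.split("\\s+")) {
                pairs.add(new Tuple2<>(word, 1));
              }
              return pairs;                      // Iterable, as Spark 1.x expects
            })
            .reduceByKey((a, b) -> a + b);

        // The final output RDD is small, so collecting it to the driver is cheap.
        List<String> result = counts
            .map(t -> t._1() + "\t" + t._2())
            .collect();

        result.forEach(System.out::println);
        sc.stop();
      }
    }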