Re: Using Spark on Data size larger than Memory size

2014-06-11 Thread Allen Chang
Thanks. We've run into timeout issues at scale as well. We were able to work around them by setting the following JVM options: -Dspark.akka.askTimeout=300 -Dspark.akka.timeout=300 -Dspark.worker.timeout=300. NOTE: these JVM options *must* be set on the worker nodes (and not just the driver/master) for ...
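
For reference, a minimal sketch of the driver-side equivalent of those flags using SparkConf (the app name is illustrative). As the note above stresses, the worker JVMs still need the flags themselves, e.g. via SPARK_JAVA_OPTS in conf/spark-env.sh on each worker:

    import org.apache.spark.{SparkConf, SparkContext}

    // Driver-side equivalent of the -D flags above; values are in seconds.
    // The worker JVMs must also receive these flags (e.g. SPARK_JAVA_OPTS in
    // conf/spark-env.sh on every worker node), per the note in the post.
    val conf = new SparkConf()
      .setAppName("timeout-tuning") // illustrative name
      .set("spark.akka.askTimeout", "300")
      .set("spark.akka.timeout", "300")
      .set("spark.worker.timeout", "300")
    val sc = new SparkContext(conf)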

Re: Using Spark on Data size larger than Memory size

2014-06-10 Thread Allen Chang
Thanks for the clarification. What is the proper way to configure RDDs when your aggregate data size exceeds your available working memory? In particular, in addition to typical operations, I'm performing cogroups, joins, and coalesces/shuffles. I see that the default storage level for RDD ...
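
A minimal sketch of one common approach, assuming the standard persist API: pick a storage level that can spill to disk before the wide operations run. The RDD names, paths, and partition count below are illustrative, not from the post:

    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK_SER lets partitions that don't fit in memory spill to
    // disk (serialized) instead of being dropped and recomputed. Persisting
    // before the join avoids recomputing the input on shuffle retries.
    // The HDFS paths here are placeholders.
    val events = sc.textFile("hdfs:///logs/events").map(line => (line.split("\t")(0), line))
    val users  = sc.textFile("hdfs:///data/users").map(line => (line.split("\t")(0), line))

    val joined = events
      .persist(StorageLevel.MEMORY_AND_DISK_SER)
      .join(users)
      .coalesce(64) // illustrative partition count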

Monitoring Spark disassociated workers

2014-06-10 Thread Allen Chang
We're running into an issue where the master periodically loses connectivity with workers in the Spark cluster. We believe this tends to manifest when the cluster is under heavy load, but we're not entirely sure when it happens. I've seen one or two other messages to this list about this issue ...
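
One way to watch for this, as a sketch: the standalone master's web UI also serves its state as JSON at /json (port 8080 by default), including each worker's state, so polling it can flag workers the master no longer sees as ALIVE. The hostname and the crude substring match below are illustrative:

    import scala.io.Source

    // Poll the standalone master's JSON status page and warn if any worker
    // is no longer ALIVE. A real check would parse the JSON; the plain
    // substring test is just to keep the sketch dependency-free.
    val json = Source.fromURL("http://spark-master:8080/json").mkString
    if (json.contains("DEAD"))
      System.err.println("WARNING: master reports at least one DEAD worker")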

RDD values and defensive copying

2014-05-23 Thread Allen Chang
Hi all, I had a question: if I have an RDD containing mutable values, and I run a function over the RDD that mutates those values in place, what happens? What happens in the case of a cogroup? e.g.: inputRdd.cogroup(inputRdd2).flatMapValues(functionThatModifiesValues()) Will this result in un ...
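
For what it's worth, a hedged sketch of the defensive-copying alternative being asked about. The mutable value type (Counter) and the surrounding names are illustrative, not from the post; copying each value before mutating it inside flatMapValues sidesteps the aliasing question entirely:

    // Illustrative mutable value type.
    case class Counter(var n: Int)

    val inputRdd  = sc.parallelize(Seq(("a", Counter(0)), ("b", Counter(0))))
    val inputRdd2 = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // Defensive copying: mutate a fresh copy rather than the value that came
    // out of the cogroup, so any other reference to the original object
    // (e.g. a cached block) is left untouched.
    val result = inputRdd.cogroup(inputRdd2).flatMapValues { case (counters, nums) =>
      counters.map { c =>
        val fresh = Counter(c.n)
        fresh.n += nums.size
        fresh
      }
    }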