Re: strange HashPartitioner behavior in Spark

2016-04-18 Thread Raghava Mutharaju
No. We specify it as a configuration option to spark-submit. Does that make a difference? Regards, Raghava. On Mon, Apr 18, 2016 at 9:56 AM, Sonal Goyal wrote: > Are you specifying your spark master in the scala program? > > Best Regards, > Sonal > Founder, Nube
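[Editor's note: for context, a minimal sketch of the two ways the master can be set, with hypothetical application and host names. A master hard-coded in SparkConf takes precedence over the --master option passed to spark-submit, which can explain differing behavior between a shell session and a submitted program.]

import org.apache.spark.{SparkConf, SparkContext}

object PartitionTest {
  def main(args: Array[String]): Unit = {
    // Option 1: let spark-submit supply the master, e.g.
    //   spark-submit --master spark://master-host:7077 --class PartitionTest app.jar
    val conf = new SparkConf().setAppName("PartitionTest")

    // Option 2: hard-code it in the program. This overrides the --master
    // flag, because SparkConf settings take highest precedence.
    // conf.setMaster("spark://master-host:7077")

    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}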

Re: strange HashPartitioner behavior in Spark

2016-04-18 Thread Sonal Goyal
Are you specifying your spark master in the scala program? Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: strange HashPartitioner behavior in Spark

2016-04-18 Thread Raghava Mutharaju
Mike, We tried that. This map task is actually part of a larger set of operations. I pointed out this map task since it involves partitionBy() and we always use partitionBy() whenever partition-unaware shuffle operations are performed (such as distinct). We in fact do not notice a change in the
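[Editor's note: a hedged sketch of the pattern being described, with a hypothetical pair RDD standing in for the job's real data and `sc` an existing SparkContext (e.g. the one spark-shell provides): apply partitionBy explicitly, and re-apply it after a partition-unaware shuffle such as distinct, which does not preserve the partitioner.]

import org.apache.spark.HashPartitioner

// Hypothetical pair RDD standing in for the job's real data.
val pairs = sc.parallelize(0L until 1000L).map(k => (k, k * 2))

// Explicit hash partitioning into 8 partitions.
val partitioned = pairs.partitionBy(new HashPartitioner(8))

// distinct() shuffles without preserving the partitioner,
// so partitionBy is re-applied afterwards.
val deduped = partitioned.distinct().partitionBy(new HashPartitioner(8))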

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Mike Hynes
A HashPartitioner will indeed partition based on the key, but you cannot know on *which* node that key will appear. Again, the RDD partitions will not necessarily be distributed evenly across your nodes because of the greedy scheduling of the first wave of tasks, particularly if those tasks have
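[Editor's note: one way to see where the partitions actually land is to tag each partition with the hostname of the executor that processes it. A diagnostic sketch, assuming `sc` is an existing SparkContext and the data is illustrative:]

import java.net.InetAddress
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(0L until 10000L).map(k => (k, 1L))
  .partitionBy(new HashPartitioner(8))

// For each partition, record (partition index, executor hostname, record count).
val placement = pairs.mapPartitionsWithIndex { (idx, iter) =>
  Iterator((idx, InetAddress.getLocalHost.getHostName, iter.size))
}.collect()

placement.foreach(println)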

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Anuj Kumar
Good point, Mike. +1 On Mon, Apr 18, 2016 at 9:47 AM, Mike Hynes <91m...@gmail.com> wrote: > When submitting a job with spark-submit, I've observed delays (up to > 1-2 seconds) for the executors to respond to the driver in order to > receive tasks in the first stage. The delay does not persist


Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Raghava Mutharaju
We are testing with 52MB, but it would go to 20GB and more later on. The cluster size is also not static; we would be growing it. But the issue here is the behavior of HashPartitioner -- from what I understand, it should be partitioning the data based on the hash of the key irrespective of the RAM
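[Editor's note: that matches the partitioner's contract. A simplified sketch of the assignment rule; the real implementation lives in org.apache.spark.HashPartitioner, which applies a non-negative modulo to the key's hashCode, so executor memory plays no role in which partition a key maps to.]

// Simplified sketch of how a HashPartitioner with numPartitions partitions
// picks a partition index for a key.
def partitionFor(key: Any, numPartitions: Int): Int = {
  val rawMod = key.hashCode % numPartitions
  // Keep the result non-negative, as Spark's Utils.nonNegativeMod does.
  rawMod + (if (rawMod < 0) numPartitions else 0)
}

partitionFor("someKey", 8)   // always the same partition index for the same key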

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Mike Hynes
When submitting a job with spark-submit, I've observed delays (up to 1-2 seconds) for the executors to respond to the driver in order to receive tasks in the first stage. The delay does not persist once the executors have been synchronized. When the tasks are very short, as may be your case
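[Editor's note: if the first wave of tasks is scheduled before all executors have registered, one knob is to make the driver wait for resources before scheduling. A hedged example; the values are only illustrative.]

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("PartitionTest")
  // Wait until (roughly) all expected executor resources have registered ...
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
  // ... but give up waiting after 30 seconds.
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "30s")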

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Anuj Kumar
A few params like spark.task.cpus and spark.cores.max will help. Also, for 52MB of data you need not have 12GB allocated to the executors. Better to assign 512MB or so and increase the number of executors per worker node. Try reducing the executor memory to 512MB or so for this case. On Mon, Apr 18,
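[Editor's note: roughly how those settings might be passed; the specific numbers are only illustrative for a 52MB input and should be adjusted for the real workload.]

import org.apache.spark.SparkConf

// Illustrative only: small executors, more of them per worker node.
// The same settings can be passed on the command line, e.g.
//   spark-submit --executor-memory 512m --conf spark.cores.max=8 ...
val conf = new SparkConf()
  .setAppName("PartitionTest")
  .set("spark.executor.memory", "512m")
  .set("spark.cores.max", "8")      // total cores the app may use (standalone mode)
  .set("spark.task.cpus", "1")      // cores requested per task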

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Raghava Mutharaju
Yes, it's the same data. 1) The number of partitions is the same (8, which is the argument to the HashPartitioner). In the first case, these partitions are spread across both the worker nodes. In the second case, all the partitions are on the same node. 2) What resources would be of interest here?
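[Editor's note: a quick sanity check on the partition count and key balance. A sketch, with `pairs` standing in for the job's 8-way hash-partitioned RDD (hypothetical name); with a healthy hash distribution the per-partition counts should be roughly equal regardless of which node holds them.]

// Should print 8 in both runs.
println(pairs.partitions.length)

// Number of records per partition.
pairs.glom().map(_.length).collect().foreach(println)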

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Anuj Kumar
If the data file is the same, then it should have a similar distribution of keys. A few queries: 1. Did you compare the number of partitions in both cases? 2. Did you compare the resource allocation for the Spark shell vs the Scala program being submitted? Also, can you please share the details of Spark
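[Editor's note: one way to compare the two environments directly is to run the same lines in both the spark-shell and the submitted program and diff the output. A sketch; `pairs` is a hypothetical name for the RDD in question.]

// Effective Spark configuration as the application actually sees it.
sc.getConf.getAll.sorted.foreach { case (k, v) => println(s"$k = $v") }

// A couple of settings likely to differ between the two runs.
println(sc.getConf.getOption("spark.executor.memory"))
println(sc.getConf.getOption("spark.cores.max"))

// Partition count of the RDD in question.
println(pairs.getNumPartitions)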