The article is interesting but doesn't really help. It has only one
sentence about data distribution in partitions.
How can I diagnose skewed data distribution?
How could evenly sized blocks in HDFS lead to skewed data anyway?
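The only check I can think of (a rough sketch - assuming the input is
already loaded as an RDD called rdd) is to count the records in each
partition and eyeball the spread:

val counts = rdd
  .mapPartitionsWithIndex { (idx, it) =>
    // emit one (partition index, record count) pair per partition
    Iterator((idx, it.size))
  }
  .collect()

counts.sortBy(_._2).foreach { case (idx, n) =>
  println(s"partition $idx: $n records")
}

Is there a better way?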
On 9 Sep 2015 2:29 pm, "Akhil Das" wrote:
This post here has a bit of information:
http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/
Thanks
Best Regards
On Wed, Sep 9, 2015 at 6:44 AM, mark wrote:
As I understand things (maybe naively), my input data are stored in
equal-sized blocks in HDFS, and each block represents a partition within
Spark when read from HDFS, therefore each block should hold roughly the
same number of records.
So something is missing in my understanding - what can cause the records
to be skewed across partitions?
Try using a custom partitioner for the keys so that they get evenly
distributed across tasks, for example:
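A minimal sketch (the RDD name pairRdd and the partition count are just
placeholders - adjust for your job):

import org.apache.spark.Partitioner

// Spread keys by hash instead of relying on the input's block layout.
class EvenPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    val h = key.hashCode % numPartitions
    // hashCode can be negative; keep the result in [0, numPartitions)
    if (h < 0) h + numPartitions else h
  }
}

val evenRdd = pairRdd.partitionBy(new EvenPartitioner(200))

This is essentially what HashPartitioner already does; if a few keys are
much hotter than the rest, you would also need to salt those keys before
partitioning.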
Thanks
Best Regards
On Fri, Sep 4, 2015 at 7:19 PM, mark wrote:
I am trying to tune a Spark job and have noticed some strange behavior -
tasks in a stage vary in execution time, ranging from 2 seconds to 20
seconds. I assume tasks should all run in roughly the same amount of time
in a well-tuned job.
So I did some investigation - the fast tasks appear to have far fewer
records to process than the slow ones.