You can also do something similar to what is mentioned in [1].
The basic idea is to use two hash functions for each key and assign it
to the less loaded of the two hashed workers.
Cheers,
Anis
[1] https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancin
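For what it's worth, here is a rough sketch of the power-of-two-choices idea in Python. The hash functions and the in-memory load counters are purely illustrative; in a real Spark deployment the load information would have to come from the workers themselves:

```python
import hashlib

def two_choices_assign(key, loads):
    """Route a key using the power of two choices: hash the key with two
    independent hash functions to pick two candidate workers, then send
    the tuple to whichever of the two is currently less loaded."""
    n = len(loads)
    # Two different hash functions give two independent candidate workers.
    h1 = int(hashlib.md5(key.encode()).hexdigest(), 16) % n
    h2 = int(hashlib.sha1(key.encode()).hexdigest(), 16) % n
    chosen = h1 if loads[h1] <= loads[h2] else h2
    loads[chosen] += 1  # illustrative load counter: one unit per tuple
    return chosen

# Example: distribute 20 keys over 4 workers.
loads = [0] * 4
for i in range(20):
    two_choices_assign("key-%d" % i, loads)
```

Because each key always hashes to the same two candidates, aggregating a key's partial results only requires merging two partitions, unlike routing to the globally least-loaded worker.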
> transformation reduces the size of the huge partition, making it
> tenable for Spark, as long as you can figure out logic for aggregating the
> results of the seeded partitions together again.
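The seeding/re-aggregation step described above can be sketched in plain Python, assuming a simple sum aggregation (the function name and `num_seeds` parameter are just illustrative):

```python
import random
from collections import defaultdict

def salted_aggregate(records, num_seeds=4):
    """Two-phase aggregation with key salting: spread a hot key across
    num_seeds sub-partitions, aggregate each, then merge the partials."""
    # Phase 1: aggregate per (key, seed). This is what shrinks the huge
    # partition, since each salted key holds roughly 1/num_seeds of the data.
    partial = defaultdict(int)
    for key, value in records:
        salt = random.randrange(num_seeds)
        partial[(key, salt)] += value
    # Phase 2: strip the salt and merge the partial results per key.
    final = defaultdict(int)
    for (key, _salt), value in partial.items():
        final[key] += value
    return dict(final)

# Example: one hot key dominating the input.
records = [("hot", 1)] * 100 + [("cold", 2)] * 3
totals = salted_aggregate(records)
```

This only works directly for aggregations that are associative and commutative (sums, counts, max); anything else needs custom merge logic, as the quoted reply warns.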
>
> On Tue, Feb 14, 2017 at 12:01 PM, Anis Nasir wrote:
Dear All,
I have a few use cases for Spark Streaming where the Spark cluster consists
of heterogeneous machines.
Additionally, there is skew present in both the input distribution (e.g.,
each tuple is drawn from a Zipf distribution) and the service time (e.g., the
service time required for each tuple comes f