Hi All,

I am running a Spark program where one part uses Spark as a scheduler
rather than as a data management framework. That is, the job can be
described as an RDD[String], where each string describes an operation to
perform that may be cheap or expensive (process an object which may have a
small or large number of records associated with it).

Leaving things at the defaults, I get poor job balancing. I am wondering
which approach I should take (rough sketches of both below):
1. Write a custom partitioner and use partitionBy to balance partitions
ahead of time by the number of records each string needs to process.
2. repartition into many small partitions (I have ~1700 strings acting
as jobs to run, so maybe 1-5 per partition). My question here is: does
Spark re-schedule/steal tasks if some executors/worker processes aren't
doing any work?
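
To make the two options concrete, here is roughly what I have in mind. This
is just a sketch: estimatedRecords, runJob, the job strings, and the
partition counts are stand-ins for my real code, and it assumes I can
estimate each job's cost up front on the driver.

import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Stand-ins for my real code: the cost estimate and the actual expensive operation.
def estimatedRecords(job: String): Long = job.length.toLong
def runJob(job: String): String = job

// Option 1: custom Partitioner that bin-packs jobs by estimated cost,
// assuming every job string's cost can be computed on the driver.
class BalancedPartitioner(assignments: Map[String, Int], val numPartitions: Int)
    extends Partitioner {
  def getPartition(key: Any): Int = assignments(key.asInstanceOf[String])
}

val sc = new SparkContext(new SparkConf().setAppName("job-scheduling-sketch"))

val jobs: Seq[String] = (1 to 1700).map(i => s"job-$i")  // stand-in for my job strings
val numPartitions = 64

// Greedy bin packing on the driver: heaviest job goes into the currently lightest bin.
val loads = Array.fill(numPartitions)(0L)
val assignments = jobs.sortBy(j => -estimatedRecords(j)).map { j =>
  val p = loads.zipWithIndex.minBy(_._1)._2
  loads(p) += estimatedRecords(j)
  j -> p
}.toMap

val balancedResults = sc.parallelize(jobs)
  .map(j => (j, ()))                                     // partitionBy needs key/value pairs
  .partitionBy(new BalancedPartitioner(assignments, numPartitions))
  .map { case (j, _) => runJob(j) }

// Option 2: many small partitions (~1-5 jobs each), relying on the scheduler
// to hand tasks to executors as they free up.
val smallPartitionResults = sc.parallelize(jobs, numSlices = 400).map(runJob)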

The second option would be easier, and since I am not shuffling much data
around it would work just fine for me, but I can't find a definitive answer
on whether Spark does task re-scheduling/stealing.

Thanks
-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience
