Hello, I am seeing disappointing performance from my Spark jobs as I scale up the number of instances.
The task is trivial: I am loading large (compressed) text files from S3, filtering out the lines that do not match a regex, counting the number of remaining lines, and saving the resulting datasets back to S3 as (compressed) text files. Nothing that a simple grep couldn't do, except that the files are too large to be downloaded and processed locally.

On a single instance, I can process X GB per hour. When I scale up to 10 instances, I noticed that processing the /same/ amount of data actually takes /longer/. This is quite surprising for such a simple task: I was expecting a significant speed-up.

My naive mental model was that each executor would process a fraction of the input files, count its remaining lines /locally/, and save its part of the output /independently/, so no data shuffling would occur. That is clearly not what is happening. Can anyone shed some light on this, or point me to relevant information?

Regards,
Jeroen