On 28 Sep 2017, at 09:41, Jeroen Miller <bluedasya...@gmail.com> wrote:

Hello,

I am experiencing a disappointing performance issue with my Spark jobs
as I scale up the number of instances.

The task is trivial: I am loading large (compressed) text files from S3,
filtering out lines that do not match a regex, counting the number
of remaining lines and saving the resulting datasets as (compressed)
text files on S3. Nothing that a simple grep couldn't do, except that
the files are too large to be downloaded and processed locally.
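
For reference, a minimal sketch of a job along these lines (the bucket
names, paths and regex below are placeholders, not the actual values):

import org.apache.spark.sql.SparkSession

// Rough sketch of the job as described; paths and regex are placeholders.
object S3GrepJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-grep").getOrCreate()
    val sc = spark.sparkContext

    val pattern = "some-regex".r                             // placeholder regex
    val lines = sc.textFile("s3a://input-bucket/logs/*.gz")  // compressed text input
    val kept = lines.filter(l => pattern.findFirstIn(l).isDefined)

    kept.cache()                                             // reused by count() and the save
    println(s"Matching lines: ${kept.count()}")
    kept.saveAsTextFile("s3a://output-bucket/filtered/",
      classOf[org.apache.hadoop.io.compress.GzipCodec])      // compressed output
    spark.stop()
  }
}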

On a single instance, I can process X GBs per hour. When scaling up
to 10 instances, I noticed that processing the /same/ amount of data
actually takes /longer/.

This is quite surprising as the task is really simple: I was expecting
a significant speed-up. My naive idea was that each executor would
process a fraction of the input file, count the remaining lines /locally/,
and save its part of the processed file /independently/, so no data
shuffling would occur.

Obviously, this is not what is happening.

Can anyone shed some light on this or provide pointers to relevant
information?


Expect the bandwidth of file input to be shared across all workers trying to
read different parts of the same file: two workers reading a file get half the
bandwidth each. HDFS avoids this by splitting files into blocks and replicating
each block three times, so the maximum read bandwidth of a file is
3 * blocks(file). For S3, if one reader gets bandwidth B, two readers get B/2
each.


There are some subtleties related to multipart upload: if you do a multipart
upload/copy with an upload size M, and set the emulated block size on the
clients to the same value, then you should get better parallelism, as the read
bandwidth is really per uploaded block:


fs.s3a.block.size 64M
fs.s3a.multipart.size 64M
fs.s3a.multipart.threshold 64M
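
One way to set these from a Spark job is via the spark.hadoop.* prefix;
a sketch, using plain byte values (67108864 = 64M) in case the Hadoop
version in use does not parse the "64M" suffix for these properties:

import org.apache.spark.sql.SparkSession

// Sketch only: align the emulated block size with the multipart upload size.
val spark = SparkSession.builder()
  .appName("s3a-block-size-example")
  .config("spark.hadoop.fs.s3a.block.size", "67108864")
  .config("spark.hadoop.fs.s3a.multipart.size", "67108864")
  .config("spark.hadoop.fs.s3a.multipart.threshold", "67108864")
  .getOrCreate()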


The other scale issue is that AWS throttles S3 IO per shard of the data: it
sends 503 responses back to the clients, which then sleep a bit to back off.
Add more clients: more throttling, more sleeping, and the same per-shard
bandwidth.

A single bucket can have data on more than one shard, with different parts of
the key tree on different shards:

http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html

Sadly, the normal layout of directory trees in big data sources
(year/month/day) biases all the clients towards working on the same shard.
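
The AWS page linked above suggests spreading load by introducing some
randomness at the front of the key names. A purely illustrative sketch of
one way to do that when choosing an output prefix (the helper and hashing
scheme here are hypothetical, not from this thread or the AWS docs):

import java.security.MessageDigest

// Illustrative only: prepend a short hash to a date-based path so output
// keys do not all share the same year/month/day prefix.
def shardedPrefix(datePath: String): String = {
  val digest = MessageDigest.getInstance("MD5").digest(datePath.getBytes("UTF-8"))
  val hex = digest.take(2).map(b => "%02x".format(b & 0xff)).mkString
  s"$hex/$datePath"
}

// e.g. shardedPrefix("2017/09/28") returns "<4 hex chars>/2017/09/28"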

The currently shipping s3a: client doesn't handle throttling, so that won't be 
the cause of your problem. If you are using EMR and its s3: connector, it may 
be.
