On 5 Jul 2017, at 14:40, Vadim Semenov <vadim.seme...@datadoghq.com> wrote:

Are you sure that you are using S3A?
I ask because EMR says that they do not support S3A:

> Amazon EMR does not currently support use of the Apache Hadoop S3A file 
> system.

I think that the HEAD requests come from the `createBucketIfNotExists` in the 
AWS S3 library, which checks whether the bucket exists every time you do a PUT 
request, i.e. it issues a HEAD request.

You can disable that by setting `fs.s3.buckets.create.enabled` to `false`.
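
One way to pass that setting to a Spark job is sketched below. It is only an illustration: the helper name `spark_hadoop_conf_args` is hypothetical, and it relies on Spark copying any `spark.hadoop.*` property into the job's Hadoop configuration. Whether `fs.s3.buckets.create.enabled` has any effect on the filesystem client actually in use is not confirmed in this thread.

```python
def spark_hadoop_conf_args(hadoop_props):
    """Build spark-submit arguments that forward Hadoop properties.

    Hypothetical helper: each property is prefixed with 'spark.hadoop.'
    so that Spark copies it into the Hadoop configuration of the job.
    """
    args = []
    for key, value in sorted(hadoop_props.items()):
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


if __name__ == "__main__":
    # Property name and value taken from the message above.
    props = {"fs.s3.buckets.create.enabled": "false"}
    print(" ".join(["spark-submit"] + spark_hadoop_conf_args(props)))
```

The resulting arguments go in front of the application jar or script on the `spark-submit` command line.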

Yeah, I'd like to see the stack traces before blaming S3A and the ASF codebase.

One thing I do know is that the shipping S3A client doesn't have any explicit 
handling of 503/retry events.

There is some retry logic in bits of the AWS SDK related to file upload, which 
may log and retry, but the operations that list files, get their details, etc. 
have no resilience to throttling.
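
For readers unfamiliar with what "resilience to throttling" usually means, here is a minimal sketch of retrying with capped exponential backoff and jitter. It is not the S3A or AWS SDK implementation: `SlowDownError` and `call_with_backoff` are hypothetical names, and the delays are arbitrary assumptions chosen for illustration.

```python
import random
import time


class SlowDownError(Exception):
    """Hypothetical stand-in for an S3 '503 SlowDown' response."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying on SlowDownError with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except SlowDownError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttling error
            # The ceiling doubles on each attempt, up to max_delay. The
            # actual wait is random below it, so that parallel tasks do
            # not all retry at the same instant.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
```

A caller would wrap each throttled request, for example `call_with_backoff(lambda: list_objects(prefix))`, where `list_objects` stands for whatever function issues the request.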

If it is surfacing against s3a, there isn't anything which can immediately be 
done to fix it, other than "spread your data around more buckets". Do attach 
the stack trace you get under 
https://issues.apache.org/jira/browse/HADOOP-14381 though: I'm about half-way 
through the resilience code (and the fault injection needed to test it). The 
more places where I can see problems arise, the more confident I can be that 
those codepaths will be resilient.

On Thu, Jun 29, 2017 at 4:56 PM, Everett Anderson 
<ever...@nuna.com.invalid> wrote:

We're using Spark 2.0.2 + Hadoop 2.7.3 on AWS EMR with S3A for direct I/O 
from/to S3 from our Spark jobs. We set 
mapreduce.fileoutputcommitter.algorithm.version=2 and are using encrypted S3 

This has been working fine for us, but, perhaps because we've been running more 
jobs in parallel, we've started getting errors like:

Status Code: 503, AWS Service: Amazon S3, AWS Request ID: ..., AWS Error Code: 
SlowDown, AWS Error Message: Please reduce your request rate., S3 Extended 
Request ID: ...

We enabled CloudWatch S3 request metrics for one of our buckets and I was a 
little alarmed to see spikes of over 800k S3 requests over a minute or so, with 
the bulk of them HEAD requests.

We read and write Parquet files, and most tables have around 50 shards/parts, 
though some have up to 200. I imagine there's additional parallelism when 
reading a shard in Parquet, though.

Has anyone else encountered this? How did you solve it?

I'd sure prefer to avoid copying all our data in and out of HDFS for each job, 
if possible.

