If you've missed the announcement, AWS S3 storage is now strongly consistent: https://aws.amazon.com/s3/consistency/
That's full CRUD consistency, consistent listing, and no 404 caching.

You don't get: rename, or an atomic create-no-overwrite. Applications need to know that and code for it.

This is enabled for all S3 buckets; no need to change endpoints or any other settings. No extra cost, no performance impact. This is the biggest change in S3 semantics since it launched.

What does this mean for the Hadoop S3A connector?

1. We've been testing it for a while; no problems have surfaced.

2. There's no need for S3Guard; leave the default settings alone. If you were using it, turn it off, restart *everything*, and then you can delete the DDB table.

3. Without S3Guard, listings may get a bit slower. There's been a lot of work in branch-3.3 on speeding up listings against raw S3, especially for code which uses listStatusIterator() and listFiles() (HADOOP-17400).

It'll be time to get Hadoop 3.3.1 out the door for people to play with; it's got a fair few other S3A-side enhancements.

People are still using S3Guard and it needs to be maintained for now, but we'll have to be fairly ruthless about closing anything which isn't going to get fixed as WONTFIX. I'm worried here about anyone using S3Guard against non-AWS consistent stores. If you are, send me an email.

And so for releases/PRs, doing test runs with and without S3Guard is important. I've added an optional backwards-incompatible change recently for better scalability: HADOOP-13230, "S3A to optionally retain directory markers", which adds markers=keep/delete to the test matrix. This is a pain, though as you can only choose two options at a time it's manageable.

Apache HBase
============

You still need the HBoss extension in front of the S3A connector to use Zookeeper to lock files during compaction.

Apache Spark
============

Any workflows which chained together reads directly after writes/overwrites of files should now work reliably with raw S3.
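As a minimal sketch of the kind of chained write-then-read that previously needed S3Guard to be reliable (the bucket name s3a://mybucket is a placeholder; assumes hadoop-aws on the classpath and AWS credentials configured):

```shell
# Write a file through the S3A connector, then read it straight back.
# With strong consistency the read sees the new object immediately;
# against the old eventually-consistent S3 this could 404 or return stale data.
hadoop fs -put ./results.csv s3a://mybucket/output/results.csv
hadoop fs -cat s3a://mybucket/output/results.csv

# Listings are consistent too: the new file shows up straight away.
hadoop fs -ls s3a://mybucket/output/
```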
The classic FileOutputCommitter commit-by-rename algorithms are no longer going to fail with FileNotFoundException during task commit.

- They will still use copy to rename work, so take O(data) time to commit files.
- Without atomic directory rename, the v1 commit algorithm can't isolate the commit operations of two task attempts. So it's unsafe and very slow.
- The v2 commit algorithm is slow, and doesn't have isolation between task attempt commits against any filesystem.
- If different task attempts are generating unique filenames (possibly to work around S3 update inconsistencies), it's not safe. Turn that option off.

The S3A committers' algorithms are happy talking directly to S3. But: SPARK-33402 is needed to fix a race condition in the staging committer. The "Magic" committer, which has relied on a consistent store, is now safe. There's a fix in HADOOP-17318 for the staging committer; hadoop-aws builds with that in will work safely with older Spark versions.

Any formats which commit work by writing a file with a unique name and updating a reference to it in a consistent store (Iceberg &c) are still going to work great. Naming is irrelevant, and commit-by-writing-a-file is S3's best story.

Distcp
======

There'll be no cached 404s to break uploads, even if you don't have the relevant fixes to stop HEAD requests before creating files (HADOOP-16932 and the revert of HADOOP-8143) or update inconsistency (HADOOP-16775).

- If your distcp version supports -direct, use it to avoid rename performance penalties.
- If your distcp version doesn't have HADOOP-15209 it can issue needless DELETE calls to S3 after a big update, and end up being throttled badly. Upgrade if you can.

If people are seeing problems: issues.apache.org + component HADOOP is where to file JIRAs; please tag the version of the hadoop libraries you've been running with.

thanks,
-Steve