[
https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212876#comment-16212876
]
Steve Loughran commented on HADOOP-13786:
-----------------------------------------
Aaron, thanks for all your review comments; I'll try to go through them here.
I'm about to create a sub-JIRA and apply my work as a PR against it, so that we
can try doing reviews on GitHub without the noise overloading this JIRA. For
now, though, here's my response to your comments.
h3. first set of comments
findbugs.xml: corrected. The issue is about a stream not being closed; the
library being called closes it.
bq. why copy/paste JsonSerDeser versus import?
The previous one is in a YARN JAR which isn't on the classpath for
hadoop-common. The existing class is now a subclass of the new one, with some
extra logic to handle parsing data off ZooKeeper records.
bq. fs.s3a.retry.throttle.interval defaults
Good point. Let me seek some advice. I'm less worried about thundering herds
than about the total number of requests/second, though on different S3
implementations lock-step clients may be an issue. I'll make it 500 for now.
It's used with a RetryUpToMaximumCountWithProportionalSleep policy, so we could
go to, say, 250 and have it do sleep(250), sleep(500), sleep(750)... until
max-retries. There's no jitter in that class, though.
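For reference, a minimal sketch (not patch code) of what that policy does,
using the existing {{org.apache.hadoop.io.retry.RetryPolicies}} helper; the
250ms base matches the number above, while the retry count is just an
illustrative value:
{code:java}
import java.io.IOException;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public class ProportionalSleepExample {
  public static void main(String[] args) throws Exception {
    // Base sleep of 250ms; each retry sleeps proportionally longer:
    // ~250ms, 500ms, 750ms... until the maximum count is hit. No jitter.
    RetryPolicy policy =
        RetryPolicies.retryUpToMaximumCountWithProportionalSleep(
            7, 250, TimeUnit.MILLISECONDS);

    for (int retries = 0; retries < 3; retries++) {
      RetryPolicy.RetryAction action = policy.shouldRetry(
          new IOException("throttled"), retries, 0, true);
      System.out.println("retry " + retries + ": " + action.action
          + " after " + action.delayMillis + "ms");
    }
  }
}
{code}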
bq. PathOutputCommitterFactory & else {} clause with nothing but comments.
cut
bq. Update comment: "Verify that if the output path is null."
done.
bq. fs.s3a.retry.limit vs fs.s3a.attempts.maximum
Actually, there's no good reason for the difference, is there, other than that
the first was only used by the AWS SDK? I'll use attempts.maximum and pick up
the same default value (20). Updated: code, docs, core-default.xml.
bq. What if SDK bug yields NPE, for example? Then you NPE in your code in
translateException() guts, no?
bq. Retries stuff great to see, but the change is invasive and needs separate
careful consideration IMO. How painful would this be to pull out into separate
patch?
Pretty major. I started with it for all the new commit operations, so a commit
wouldn't fail from a transient event, and then added fault injection of
throttle events. At that point all the tests got into a mess, as setup/teardown
and every existing call (getFileStatus()) failed. Essentially: you can't have a
robust committer without making all the operations fault tolerant.
It also goes with the move of WriteOperationsHelper. That said, the DDB retry
logic is something separate; it's not something I've got fault injection for
here yet.
bq. First question here is, what about SDK retries? I know the SDK
documentation and behavior is spotty at best, but there are some known things
(i.e. DynamoDB does retry in SDK for most operations, exclusive of batched
operations which are application responsibility).
I don't see that. In fact, I've seen DDB fail on throttling. Use the CLI to
crank back the IO capacity to, say, 5 read + 5 write and then run the
integration tests in parallel with -Ds3guard -Ddynamodb. You will get
overloaded. That's why I added the s3guard CLI commands to list/change IOPs:
I'm bored of going near the web UI, and I expect this to be a recurrent support
call.
I do know that the AWS S3 transfer manager does retries, using mark/reset to
handle retry of part uploads, and that older SDKs didn't handle failure of the
final POST of an MPU (HADOOP-14028). The code as it stands assumes that the
transfer manager handles failures of async PUT/POST itself, but retries all
other failures, with the annotations making clear to methods calling other
methods when they should *not* be doing any more retry or translation
themselves. If you wrap retry with retry, you end up in a situation where
permanent failures take too long to surface.
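To make that last point concrete, here's a tiny self-contained sketch (not
patch code) of the nested-retry problem: an outer retry of three around a call
that already retries three times internally turns one permanent failure into
nine raw attempts, plus all the sleeps, before anything surfaces to the caller.
{code:java}
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class NestedRetryExample {

  static <T> T retry(int attempts, Callable<T> op) throws Exception {
    Exception last = null;
    for (int i = 0; i < attempts; i++) {
      try {
        return op.call();
      } catch (Exception e) {
        last = e;           // a real policy would sleep/backoff here
      }
    }
    throw last;
  }

  public static void main(String[] args) {
    AtomicInteger rawCalls = new AtomicInteger();
    try {
      // Outer retry of 3 around an inner retry of 3: the permanently
      // failing raw operation is attempted 9 times before failing.
      retry(3, () -> retry(3, () -> {
        rawCalls.incrementAndGet();
        throw new IOException("permanent failure");
      }));
    } catch (Exception expected) {
      System.out.println("raw attempts = " + rawCalls.get()); // prints 9
    }
  }
}
{code}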
bq. (re idempotent delete ) "interesting. Is the argument that delete is racy
anyways, even if clients externally synchronize, thus we treat it as idempotent
even though it is not?"
There's a past WONTFIX JIRA from when someone (Konstantin?) wanted to make
delete idempotent on HDFS, and it was agreed: no, never. Here I have made it
idempotent on the basis that object stores are odd anyway. In particular, code
which uses create(overwrite=false) can get into conflict with delete, but as
create-no-overwrite isn't atomic in S3, do we care? Then there's recursive
delete.
Here, I don't know what to do. It is a lot easier if there is retry around
delete, but it may be something better done at a higher level than the FS APIs.
bq. re: LOG.info() of failed to purge.
That log comes from HADOOP-13208 and the failure you'd hit if you ever tried to
start up against landsat-pds with purge enabled. I'll bump it up to INFO for
now and see what people say. Purging is turned off for landsat in the
test/resources/core-site.xml file too.
bq. @Retries.RetryRaw
Thanks for the appreciation. I started off with javadocs, but that wasn't
reliable enough...annotations are clearer. We have them in
{{org.apache.hadoop.io.retry}} too, but those actually declare method
semantics for IPC to choose how to react; these are just documentation. I
think they may make it into the javadocs, but I haven't checked. Biggest risk:
code changes make the annotations obsolete, and that is only picked up by
manual review (and, ideally, fault injection).
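To show what I mean, a self-contained sketch of the idea; the names here
({{Retries.RetryTranslated}}, {{Retries.OnceRaw}}) are modelled on the patch
but may not match it exactly, and the annotations carry no runtime behaviour,
they only tell callers whether a method already retries/translates:
{code:java}
import java.lang.annotation.Documented;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public final class RetriesDocExample {

  /** Container for the documentation-only annotations. */
  public static final class Retries {
    @Documented @Retention(RetentionPolicy.SOURCE)
    public @interface RetryTranslated { }

    @Documented @Retention(RetentionPolicy.SOURCE)
    public @interface OnceRaw { }
  }

  /** Retries internally and translates exceptions: callers must not retry. */
  @Retries.RetryTranslated
  String statusOf(String path) throws java.io.IOException {
    // a retry loop wrapping rawStatusOf(path) would live here
    return rawStatusOf(path);
  }

  /** A single raw attempt, called from inside a retry loop: no retry here. */
  @Retries.OnceRaw
  String rawStatusOf(String path) throws java.io.IOException {
    return "status:" + path;
  }
}
{code}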
bq. org.apache.hadoop.fs.s3a.commit.Tasks
Ryan's code; I've not changed it other than moving to lambda expressions in
invocations. Some of the tests do exercise parallel vs. sequential execution
within committers though, in particular TestStagingCommitter, which explores
the various failure paths.
But yes, I could do a standalone test for more low-level validation...let me
add that to my todo list.
bq. How about passing a Logger into DurationInfo instead of it using its own?
Good idea. Done.
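For illustration, a rough sketch of the shape of that change (this is not the
patch's class, just an assumed equivalent): the caller hands its own logger to
the try-with-resources timer, so the duration shows up under the caller's log
category rather than a shared one.
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DurationInfoExample {
  private static final Logger LOG =
      LoggerFactory.getLogger(DurationInfoExample.class);

  /** try-with-resources timer which logs on close to the supplied logger. */
  static final class DurationInfo implements AutoCloseable {
    private final Logger log;
    private final String operation;
    private final long started = System.currentTimeMillis();

    DurationInfo(Logger log, String operation) {
      this.log = log;
      this.operation = operation;
      log.info("Starting: {}", operation);
    }

    @Override
    public void close() {
      log.info("{}: duration {} ms", operation,
          System.currentTimeMillis() - started);
    }
  }

  public static void main(String[] args) throws Exception {
    try (DurationInfo d = new DurationInfo(LOG, "commit job")) {
      Thread.sleep(100);   // stand-in for the timed operation
    }
  }
}
{code}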
BTW, I've been thinking about HTrace support here. I won't do it in this patch,
but it's the obvious place to declare spans. For that, though: start in S3AFS,
have consistent span names across operations in hadoop-common, then move up to
commit, where you also need your execution engine to embrace it.
bq. cleanup/cleanupStagingDirs
Made abstract for you. I do need to do some more walkthroughs to be happy that
we are doing as much cleanup as possible.
Interesting issue there: should we do a best-effort cleanup on failure of, say,
task commit, or rely on abortTask/abortJob to force that?
bq. Failure callback (in abortPendingUploads) is same as intended operation. Is
this in effect a single retry?
Good point. It's probably something taken from the Netflix code. I'll cut it,
as there is retry logic in the abort now anyway.
bq. javadocs on AbstractS3GuardCommitterFactory
Fixed to be consistent with the code.
bq. innerCommit & AtomicInteger
It gets passed into a closure which is then handed off to the retry logic to
call on every retry. It's not that an atomic is needed, just something to hold
an integer which can be incremented inside closures. You can see that pattern
elsewhere too, e.g. for local variables you want closures to update. If there's
a non-atomic version, especially something in the JDK (and *not* Guava), then
I'd be happy to switch.
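For anyone following along, a minimal sketch of the pattern in question
(illustrative only, not the committer code): the atomic is just a mutable int
holder a lambda can capture, and a one-element array is the plain-JDK
alternative, since captured locals must be effectively final.
{code:java}
import java.util.concurrent.atomic.AtomicInteger;

public class ClosureCounterExample {
  public static void main(String[] args) {
    AtomicInteger retries = new AtomicInteger();
    Runnable onRetry = retries::incrementAndGet;   // captured by the closure

    // Plain-JDK alternative without atomics: a one-element array holder.
    int[] retries2 = new int[1];
    Runnable onRetry2 = () -> retries2[0]++;

    onRetry.run();
    onRetry2.run();
    System.out.println(retries.get() + " " + retries2[0]);   // prints: 1 1
  }
}
{code}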
bq. CommitOperations.ls & "robust in JDK"
It used to be {{invoke.retry(()-> fs.listFiles(path.recursive))}}, and while
the listFiles call was retried, the hasNext/next calls weren't. With the move
of retries into the FS, both operations are robust, so the call has been
simplified. I'll change the javadocs.
bq. Paths & Unit tests
There's some coverage in {{TestStagingPartitionedFileListing}}; otherwise
nothing specific except in the mock and integration tests for the staging
committers.
bq. retries around dynamoDB.batchWriteItemUnprocessed()
OK, removed the retry logic. There are still aspects of its failures that I
think I need to see more of before I understand them.
------------------
h3. docs
Thanks for looking at the docs. I am still going through them, as I think I've
got some duplication. I concur with your statement that this is better than the
MR docs; I spent a lot of time stepping through tests to work out what
FileOutputCommitter really does, and how MapReduce uses it. It's made me a
strong enthusiast for deleting the v1 APIs entirely. **It is time**.
bq. I do like "successed" though, colloquially
Once you become a CS professor, you can start using it in enough talks that
people will think it's a real term, especially if you add some extra semantics
around observed state. A process may *succeed* locally, but it is only
*successed* when the rest of the system observes this and moves its state model
into a succeeded state. Yes, we could do this...
For now, I've fixed the docs.
bq. /S3Guard committers/S3A committers/
Dunno. The magic one utterly depends on S3Guard for consistency, so using the
term keeps that clear and also helps with the branding. The staging ones don't
have that dependency.
bq. Nice Trick (re setting `mapreduce.fileoutputcommitter.algorithm.version` =
10 to fail fast)
Yeah, guess how I came up with that.
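For anyone who wants to reproduce the trick, it's just a config setting: the
classic FileOutputCommitter only accepts algorithm versions 1 or 2 at
construction time, so an impossible value makes any job which still
instantiates it fail immediately, which tells you whether the new committer is
actually in use. A minimal sketch:
{code:java}
import org.apache.hadoop.conf.Configuration;

public class FailFastCommitterCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Any code path which falls back to the classic FileOutputCommitter
    // should now fail fast in its constructor instead of silently
    // committing via renames.
    conf.setInt("mapreduce.fileoutputcommitter.algorithm.version", 10);
    System.out.println(
        conf.get("mapreduce.fileoutputcommitter.algorithm.version"));
  }
}
{code}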
bq. Multiple clients racing to write same path cannot detect which one actually
succeeded via return code (IIRC this is determined by S3 service-side timestamp
and not visible to clients). Am I right or mistaken here?
It's not that more than one client "succeeds" so much as that they both
succeed; you just don't know which one came last, and even checking afterwards
for timestamp or etag doesn't guarantee there's not an overwrite in progress.
bq. clarity nit: /implementations of the S3 protocol/S3 protocol-compatible
storage systems/
changed to "Some S3-compatible object stores are fully consistent;"
bq. > It requires a path element, such as `__magic` which cannot be used for
any purpose other than for the storage of pending commit data.
bq. Aside: We could make this string configurable in the future, no?
No, it will only cause chaos, confusion and support calls. I chose a path that
nobody appears to use today; if you ever get a bug report with __magic in the
path, you can see what the issue is related to. Whereas if something could
declare that _temporary was "magic", there'd be chaos and you'd never work out
why. Your support team will never forgive you for even suggesting this, and QE
will not be sending you Christmas cards.
bq. > skipDuringFaultInjection(fs);
bq. Nice.
You can run the entire suite with throttling cranked up; that's how I got both
the fault injection and retry coverage right.
bq. > log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
bq. Seems a little heavy.
Noted; cut back.
Final bit of the architecture doc: yeah, it's out of date. I've already updated
the Spark bit, but it needs a review and edit. Like you noted, it's got design
work in there too.
h3. Other issues.
Writing up this stuff here and elsewhere makes me think the term "directory
committer" is a bit confusing. What if we call it, at least in docs and config,
the "staging" committer, with the partitioned committer as its sibling? No end
users should be seeing the superclass, after all. Similarly, rename the
DynamicFactory to S3GuardCommitterFactory (or S3ACommitterFactory), and only
write up the use of that. It's the easiest to set up and explain. In
particular, we could make it the default for the s3a scheme, just using the
default fs.s3a.committer=file for a bucket for today's (flawed) behaviour.
Change the committer name globally or per bucket, and you've got fewer config
options to get wrong or try to debug.
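To sketch what that would look like for users (the option name fs.s3a.committer
and the values follow the proposal above and may differ in the final patch):
one option, settable globally or via the existing S3A per-bucket override
pattern fs.s3a.bucket.<name>.<option>.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class CommitterSelectionExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Global default: keep today's (flawed) file committer behaviour.
    conf.set("fs.s3a.committer", "file");
    // Per-bucket override: use the staging committer for one bucket only.
    // "example-data" is a placeholder bucket name.
    conf.set("fs.s3a.bucket.example-data.committer", "staging");

    System.out.println(conf.get("fs.s3a.committer"));
    System.out.println(conf.get("fs.s3a.bucket.example-data.committer"));
  }
}
{code}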
The next patch will have more logging at init time, primarily related to
debugging things in some new bulk Spark integration. I've filed HADOOP-14965
for one key issue, but expect some more logging at debug level in the
committers too.
> Add S3Guard committer for zero-rename commits to S3 endpoints
> -------------------------------------------------------------
>
> Key: HADOOP-13786
> URL: https://issues.apache.org/jira/browse/HADOOP-13786
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs/s3
> Affects Versions: 3.0.0-beta1
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13786-036.patch, HADOOP-13786-037.patch,
> HADOOP-13786-038.patch, HADOOP-13786-039.patch,
> HADOOP-13786-HADOOP-13345-001.patch, HADOOP-13786-HADOOP-13345-002.patch,
> HADOOP-13786-HADOOP-13345-003.patch, HADOOP-13786-HADOOP-13345-004.patch,
> HADOOP-13786-HADOOP-13345-005.patch, HADOOP-13786-HADOOP-13345-006.patch,
> HADOOP-13786-HADOOP-13345-006.patch, HADOOP-13786-HADOOP-13345-007.patch,
> HADOOP-13786-HADOOP-13345-009.patch, HADOOP-13786-HADOOP-13345-010.patch,
> HADOOP-13786-HADOOP-13345-011.patch, HADOOP-13786-HADOOP-13345-012.patch,
> HADOOP-13786-HADOOP-13345-013.patch, HADOOP-13786-HADOOP-13345-015.patch,
> HADOOP-13786-HADOOP-13345-016.patch, HADOOP-13786-HADOOP-13345-017.patch,
> HADOOP-13786-HADOOP-13345-018.patch, HADOOP-13786-HADOOP-13345-019.patch,
> HADOOP-13786-HADOOP-13345-020.patch, HADOOP-13786-HADOOP-13345-021.patch,
> HADOOP-13786-HADOOP-13345-022.patch, HADOOP-13786-HADOOP-13345-023.patch,
> HADOOP-13786-HADOOP-13345-024.patch, HADOOP-13786-HADOOP-13345-025.patch,
> HADOOP-13786-HADOOP-13345-026.patch, HADOOP-13786-HADOOP-13345-027.patch,
> HADOOP-13786-HADOOP-13345-028.patch, HADOOP-13786-HADOOP-13345-028.patch,
> HADOOP-13786-HADOOP-13345-029.patch, HADOOP-13786-HADOOP-13345-030.patch,
> HADOOP-13786-HADOOP-13345-031.patch, HADOOP-13786-HADOOP-13345-032.patch,
> HADOOP-13786-HADOOP-13345-033.patch, HADOOP-13786-HADOOP-13345-035.patch,
> cloud-intergration-test-failure.log, objectstore.pdf, s3committer-master.zip
>
>
> A goal of this code is "support O(1) commits to S3 repositories in the
> presence of failures". Implement it, including whatever is needed to
> demonstrate the correctness of the algorithm. (that is, assuming that s3guard
> provides a consistent view of the presence/absence of blobs, show that we can
> commit directly).
> I consider ourselves free to expose the blobstore-ness of the s3 output
> streams (ie. not visible until the close()), if we need to use that to allow
> us to abort commit operations.