[ https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212876#comment-16212876 ]

Steve Loughran commented on HADOOP-13786:
-----------------------------------------


Aaron, thanks for all your reviews; I'll try to go through them here.

I'm about to create a sub-JIRA & apply my work as a PR to it, so that we can 
try doing reviews on GitHub without the noise overloading this JIRA. For now 
though, here are my responses to your comments.

h3. first set of comments

findbugs.xml: corrected. The issue is about a stream not being closed; the 
library being called closes it.

bq. why copy/paste JsonSerDeser versus import?

The previous one is in a YARN JAR which isn't on the classpath for hadoop-common. 
The existing class is now a subclass of the new one, with some extra logic to 
handle parsing data off ZooKeeper records.

bq. fs.s3a.retry.throttle.interval defaults

Good point. Let me seek some advice. I'm less worried about thundering herds 
than the total # of requests/second, though on different S3 implementations 
lock-step clients may be an issue. I'll make it 500 for now. It's used with a 
RetryUpToMaximumCountWithProportionalSleep policy, so we could go to, say, 250 
and have it sleep(250), sleep(500), sleep(750)... until max-retries. There's no 
jitter in that class though.
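Roughly what that looks like as code, using the {{RetryPolicies}} factory in 
{{org.apache.hadoop.io.retry}} (the numbers are just the ones discussed above, 
not final):

{code:java}
// Illustrative sketch only: attempt n sleeps n * base interval (250, 500, 750... ms)
// up to the retry limit. This policy applies no jitter.
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public class ThrottleRetrySketch {
  public static void main(String[] args) {
    RetryPolicy throttlePolicy =
        RetryPolicies.retryUpToMaximumCountWithProportionalSleep(
            20,                       // maximum retries
            250,                      // base sleep interval
            TimeUnit.MILLISECONDS);
    System.out.println("throttle retry policy: " + throttlePolicy);
  }
}
{code}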


bq. PathOutputCommitterFactory & else {} clause with nothing but comments.

cut

bq. Update comment: "Verify that if the output path is null."

done.

bq. fs.s3a.retry.limit vs fs.s3a.attempts.maximum

Actually, there's no good reason for the difference, is there, other than that 
the first was used in the AWS SDK only? I'll use attempts.maximum and pick up 
the same default value (20). Updated: code, docs, core-default.xml.
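Minimal sketch of picking that limit up from the config (key and default as 
above; the class itself is purely illustrative):

{code:java}
// Illustrative sketch only: read the single attempts/retry limit from the Hadoop config.
import org.apache.hadoop.conf.Configuration;

public class RetryLimitSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // fs.s3a.attempts.maximum defaults to 20 in core-default.xml
    int attempts = conf.getInt("fs.s3a.attempts.maximum", 20);
    System.out.println("maximum attempts: " + attempts);
  }
}
{code}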

bq. What if SDK bug yields NPE, for example? Then you NPE in your code in 
translateException() guts, no?


bq. Retries stuff great to see, but the change is invasive and needs separate 
careful consideration IMO. How painful would this be to pull out into separate 
patch?

Pretty major. I started with it for all the new commit operations, so a commit 
wouldn't fail from a transient event, & then added fault injection of throttle 
events. At which point all the tests got into a mess, as setup/teardown and 
every existing call (getFileStatus()) failed. Essentially: you can't have a 
robust committer without making all the ops fault tolerant.
It also goes with the move of WriteOperationsHelper. That said, the DDB retry 
logic is something separate; it's not something I've got fault injection for 
here yet.

bq. First question here is, what about SDK retries? I know the SDK 
documentation and behavior is spotty at best, but there are some known things 
(i.e. DynamoDB does retry in SDK for most operations, exclusive of batched 
operations which are application responsibility).

I don't see that. In fact, I've seen DDB fail on throttling. Use the CLI to 
crank back the IOP capacity to, say, 5 read + 5 write and then run the 
integration tests in parallel with -Ds3guard -Ddynamodb. You will get 
overloaded. That's why I added the s3guard CLI commands to list/change IOPs: 
I'm bored of going near the web UI, and expect this to be a recurrent support 
call.

I do know that the AWS S3 transfer manager does retries, using mark/reset to 
handle retry of part uploads, and older SDKs didn't handle failure of the final 
POST of an MPU (HADOOP-14028). The code, as it stands, assumes that the transfer 
manager handles failures of async PUT/POST itself, but retries all other 
failures, with the @Retries attributes making clear to methods calling other 
methods when they should *not* be doing any more retry or translation 
themselves. If you wrap retry with retry you end up in a situation where 
permanent failures take too long to surface.
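Back-of-the-envelope sketch of why that matters (numbers illustrative, not from 
the patch):

{code:java}
// Illustrative sketch only: if an outer retry loop wraps an inner one, a permanent
// failure costs outer * inner attempts (and sleeps) before it finally surfaces.
public class NestedRetryCost {
  public static void main(String[] args) {
    int innerAttempts = 20;        // e.g. low-level client retries
    int outerAttempts = 20;        // e.g. committer-level retries wrapped around them
    long sleepMillis = 500;        // fixed sleep between attempts

    long totalAttempts = (long) innerAttempts * outerAttempts;
    long worstCaseSleepMillis = totalAttempts * sleepMillis;

    System.out.println("attempts before a permanent failure surfaces: " + totalAttempts);
    System.out.println("worst-case time spent sleeping: " + worstCaseSleepMillis + " ms");
  }
}
{code}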

bq. (re idempotent delete ) "interesting. Is the argument that delete is racy 
anyways, even if clients externally synchronize, thus we treat it as idempotent 
even though it is not?"

There's a past WONTFIX JIRA where someone (Konstantin?) wanted to make delete 
idempotent on HDFS and it was agreed: no, never. Here I have made it idempotent 
on the basis that object stores are odd anyway. In particular, code which uses 
create(overwrite=false) can get into conflict with delete, but as 
create-no-overwrite isn't atomic in S3, do we care? Then there's recursive 
delete.

Here, I don't know what to do. It is a lot easier if there is retry round 
delete, but it may be something done at a higher level than the FS APIs.

bq. re: LOG.info() of failed to purge. 

That log comes from HADOOP-13208 and the failure you'd get if you ever tried to 
start up against landsat-pds with purge enabled. I'll bump it up to INFO for now 
& see what people say. Purging is turned off for landsat in the 
test/resources/core-site.xml file too.

bq.  @Retries.RetryRaw

Thanks for the appreciation. I started off with javadocs but that wasn't 
reliable enough...annotations are clearer. We have them in 
{{org.apache.hadoop.io.retry}} too, but those actually declare method semantics 
for IPC to choose how to react. These are just docs. I think they may make it 
into the javadocs, but I haven't checked. Biggest risk: code changes make the 
annotations obsolete, and that only gets picked up by manual review (and ideally 
fault injection).
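For illustration, the pattern looks something like this; the annotation and 
method below are stand-ins, not code lifted from the patch:

{code:java}
// Illustrative sketch only: a documentation-only annotation in the style of the
// @Retries.* markers discussed above. It records retry/translation semantics for
// human readers; nothing at runtime reacts to it.
import java.lang.annotation.Documented;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class RetryAnnotationSketch {

  @Documented
  @Retention(RetentionPolicy.SOURCE)   // source-only: documentation, no runtime behaviour
  @interface RetryTranslated { }

  @RetryTranslated                      // "this method retries and translates exceptions itself"
  static String getObjectMetadata(String key) {
    return "metadata-for-" + key;       // stand-in for a real S3 call
  }

  public static void main(String[] args) {
    System.out.println(getObjectMetadata("example"));
  }
}
{code}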

bq. org.apache.hadoop.fs.s3a.commit.Tasks

Ryan's code; I've not changed it other than to move to lambda expressions in 
invocations. Some of the tests exercise the parallel vs sequential execution within 
committers though, in particular TestStagingCommitter, which does explore the 
various failure paths.

But yes, I could do a standalone test for more low-level validation...let me 
add that to my todo list.

bq. How about passing a Logger into DurationInfo instead of it using its own?

Good idea. Done. 
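The pattern, sketched with a stand-in class rather than the actual DurationInfo 
from the patch:

{code:java}
// Illustrative sketch only: a try-with-resources duration logger which reports to
// whichever Logger the caller passes in, rather than one of its own.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DurationLoggingSketch {
  private static final Logger LOG = LoggerFactory.getLogger(DurationLoggingSketch.class);

  // Stand-in for DurationInfo: logs the elapsed time on close().
  static final class MiniDuration implements AutoCloseable {
    private final Logger log;
    private final String name;
    private final long start = System.currentTimeMillis();

    MiniDuration(Logger log, String name) {
      this.log = log;
      this.name = name;
    }

    @Override
    public void close() {
      log.info("{} took {} ms", name, System.currentTimeMillis() - start);
    }
  }

  public static void main(String[] args) throws Exception {
    try (MiniDuration d = new MiniDuration(LOG, "commit job")) {
      Thread.sleep(100);   // stand-in for the timed operation
    }
  }
}
{code}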

BTW, I've been thinking about HTrace support here. I won't do it in this patch, 
but it's the obvious place to declare spans. For that though: start in S3AFS, 
have consistent span names across operations in hadoop-common, then move up to 
commit, where you also need your execution engine to embrace it too.


bq. cleanup/cleanupStagingDirs

Made abstract for you. I do need to do some more walkthroughs to be happy that 
we are doing as much cleanup as possible.
Interesting issue there: should we do a best-effort cleanup on failure of, say, 
task commit, or rely on abortTask/abortJob to force that?

bq. Failure callback (in abortPendingUploads) is same as intended operation. Is 
this in effect a single retry?

Good point. It's probably something taken from the Netflix code. I'll cut it, as 
there is retry logic in the abort now anyway.


bq. javadocs on AbstractS3GuardCommitterFactory

Fixed to be consistent with the code.

bq. innerCommit & AtomicInteger

It gets passed into a closure which is then handed off to the retry logic to 
call on every retry. It's not that an atomic is needed, just something to hold 
an integer which can be incremented inside closures. You can see that in other 
places/code elsewhere too, like local vars you want closures to update. If 
there's a non-atomic version, especially something in the JDK (and *not* Guava), 
then I'd be happy to switch.
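A minimal example of the pattern (nothing to do with the actual committer code; 
it just shows why an AtomicInteger is convenient here):

{code:java}
// Illustrative sketch only: local variables captured by lambdas must be effectively
// final, so a mutable holder like AtomicInteger is used to count invocations.
import java.util.concurrent.atomic.AtomicInteger;

public class ClosureCounterSketch {
  public static void main(String[] args) {
    AtomicInteger attempts = new AtomicInteger(0);

    Runnable operation = () -> {
      // a plain int local could not be reassigned from inside this lambda
      System.out.println("attempt " + attempts.incrementAndGet());
    };

    operation.run();   // attempt 1
    operation.run();   // attempt 2, e.g. after a retry
  }
}
{code}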

bq. CommitOperations.ls & "robust in JDK"

It used to be {{invoke.retry(() -> fs.listFiles(path, recursive))}}, and while 
the listFiles call was retried, the hasNext/next calls weren't. With the move of 
retries into the FS, both ops are robust, so the call has been simplified. 
I'll change the javadocs.
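For reference, the simplified usage is just a plain listing loop; a 
self-contained sketch (the path argument is illustrative):

{code:java}
// Illustrative sketch only: with retries inside the filesystem, both the listFiles()
// call and the hasNext()/next() iteration are resilient, so no wrapper is needed.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListFilesSketch {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);                        // e.g. s3a://bucket/prefix
    FileSystem fs = path.getFileSystem(new Configuration());
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(path, true /* recursive */);
    while (files.hasNext()) {                             // retried inside the FS now
      System.out.println(files.next().getPath());
    }
  }
}
{code}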


bq. Paths & Unit tests

There's some stuff in {{TestStagingPartitionedFileListing}}, otherwise nothing 
specific except in the mock & integration tests for the staging committers.

bq. retries around dynamoDB.batchWriteItemUnprocessed()

OK, removed the retry logic. There are still aspects of its failure modes that I 
think I need to see more of to understand.



------------------

h3. docs

Thanks for looking at the docs. I am still going through them, as I think I've 
got some duplication. I concur with your statement that this is better than the 
MR docs; I spent a lot of time stepping through tests to work out WTF 
FileOutputCommitter really does, and how MapReduce uses it. It's made me a 
strong enthusiast for deleting the v1 APIs entirely. *It is time*.

bq.  I do like "successed" though, colloquially

Once you become a CS professor you can start using it in enough talks that 
people will think it's a real term, especially if you add some extra semantics 
around observed state. A process may *succeed* locally, but it is only 
*successed* when the rest of the system observes this and moves its state model 
into a succeeded state. Yes, we could do this...

For now, fixed the docs.

bq. /S3Guard committers/S3A committers/ 

Dunno. The magic one utterly depends on S3Guard for consistency, so using the 
term keeps that clear, and also helps with the branding. The staging ones don't 
have that dependency.



bq. Nice Trick (re setting `mapreduce.fileoutputcommitter.algorithm.version` = 
10 to fail fast)

yeah, guess how I came up with that. 


bq. Multiple clients racing to write same path cannot detect which one actually 
succeeded via return code (IIRC this is determined by S3 service-side timestamp 
and not visible to clients). Am I right or mistaken here?

It's not that >1 client "succeeds" so much as that they both succeed; you just 
don't know which one came last, and even checking afterwards for timestamp or 
etag doesn't guarantee there's not an overwrite in progress.

bq. clarity nit: /implementations of the S3 protocol/S3 protocol-compatible 
storage systems/

changed to "Some S3-compatible object stores are fully consistent;"


bq. > It requires a path element, such as `__magic` which cannot be used for 
any purpose other than for the storage of pending commit data.
bq. Aside: We could make this string configurable in the future, no?

No, it will only cause chaos, confusion and support calls. I chose a path that 
nobody appears to use today; if you ever get a bug report with __magic in the 
path you can see what the issue is related to. Whereas if something could 
declare that _temporary was "magic", there'd be chaos and you'd never work out 
why. Your support team will never forgive you for even suggesting this, and QE 
will not be sending you Christmas cards.


bq. > skipDuringFaultInjection(fs);
bq. Nice.

You can run the entire suite with throttling cranked up; that's how I got both 
the fault injection & retry coverage right.

bq. > log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
bq. Seems a little heavy.

noted; cut back


Final bit of the architecture doc: yeah, it's out of date. I've already updated 
the Spark bit, but it needs a review and edit. Like you noted, it's got design 
work in there too.

h3. Other issues.

Writing up this stuff here and elsewhere makes me think the term "directory 
committer" is a bit confusing. What if we call it, at least in docs & config, 
the "staging" committer, with the partitioned committer as its sibling? No end 
users should be seeing the superclass, after all. Similarly, rename the 
DynamicFactory to S3GuardCommitterFactory (or S3ACommitterFactory), and only 
write up the use of that. It's the easiest to set up and explain. In particular, 
we could make it the default for the s3a schema, just using the default 
fs.s3a.committer=file for a bucket for today's (flawed) behaviour. Change the 
committer name globally or per bucket, and you've got fewer config options to 
get wrong/try to debug.
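A hedged sketch of what that would look like for a user; the {{fs.s3a.committer}} 
key and the "staging" value are taken from the proposal above rather than 
anything shipped, while the per-bucket {{fs.s3a.bucket.<name>.<option>}} override 
mechanism already exists:

{code:java}
// Illustrative sketch only: committer chosen globally, overridden per bucket.
import org.apache.hadoop.conf.Configuration;

public class CommitterSelectionSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // global default: today's (flawed) rename-based behaviour
    conf.set("fs.s3a.committer", "file");
    // per-bucket override (bucket name illustrative): use the staging committer here
    conf.set("fs.s3a.bucket.analytics-output.committer", "staging");

    System.out.println("default committer: " + conf.get("fs.s3a.committer"));
  }
}
{code}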

The next patch will have more logging at init time, primarily related to 
debugging things in some new bulk Spark integration. Filed HADOOP-14965 for one 
key issue, but expect some more logging @debug in the committers too.

> Add S3Guard committer for zero-rename commits to S3 endpoints
> -------------------------------------------------------------
>
>                 Key: HADOOP-13786
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13786
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs/s3
>    Affects Versions: 3.0.0-beta1
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>         Attachments: HADOOP-13786-036.patch, HADOOP-13786-037.patch, 
> HADOOP-13786-038.patch, HADOOP-13786-039.patch, 
> HADOOP-13786-HADOOP-13345-001.patch, HADOOP-13786-HADOOP-13345-002.patch, 
> HADOOP-13786-HADOOP-13345-003.patch, HADOOP-13786-HADOOP-13345-004.patch, 
> HADOOP-13786-HADOOP-13345-005.patch, HADOOP-13786-HADOOP-13345-006.patch, 
> HADOOP-13786-HADOOP-13345-006.patch, HADOOP-13786-HADOOP-13345-007.patch, 
> HADOOP-13786-HADOOP-13345-009.patch, HADOOP-13786-HADOOP-13345-010.patch, 
> HADOOP-13786-HADOOP-13345-011.patch, HADOOP-13786-HADOOP-13345-012.patch, 
> HADOOP-13786-HADOOP-13345-013.patch, HADOOP-13786-HADOOP-13345-015.patch, 
> HADOOP-13786-HADOOP-13345-016.patch, HADOOP-13786-HADOOP-13345-017.patch, 
> HADOOP-13786-HADOOP-13345-018.patch, HADOOP-13786-HADOOP-13345-019.patch, 
> HADOOP-13786-HADOOP-13345-020.patch, HADOOP-13786-HADOOP-13345-021.patch, 
> HADOOP-13786-HADOOP-13345-022.patch, HADOOP-13786-HADOOP-13345-023.patch, 
> HADOOP-13786-HADOOP-13345-024.patch, HADOOP-13786-HADOOP-13345-025.patch, 
> HADOOP-13786-HADOOP-13345-026.patch, HADOOP-13786-HADOOP-13345-027.patch, 
> HADOOP-13786-HADOOP-13345-028.patch, HADOOP-13786-HADOOP-13345-028.patch, 
> HADOOP-13786-HADOOP-13345-029.patch, HADOOP-13786-HADOOP-13345-030.patch, 
> HADOOP-13786-HADOOP-13345-031.patch, HADOOP-13786-HADOOP-13345-032.patch, 
> HADOOP-13786-HADOOP-13345-033.patch, HADOOP-13786-HADOOP-13345-035.patch, 
> cloud-intergration-test-failure.log, objectstore.pdf, s3committer-master.zip
>
>
> A goal of this code is "support O(1) commits to S3 repositories in the 
> presence of failures". Implement it, including whatever is needed to 
> demonstrate the correctness of the algorithm. (that is, assuming that s3guard 
> provides a consistent view of the presence/absence of blobs, show that we can 
> commit directly).
> I consider ourselves free to expose the blobstore-ness of the s3 output 
> streams (ie. not visible until the close()), if we need to use that to allow 
> us to abort commit operations.


