loquisgon opened a new pull request #11941:
URL: https://github.com/apache/druid/pull/11941
### Description
The `S3Entity` reads objects (i.e. files) stored in an AWS bucket using a
stream. These files are often large, and S3 may drop the connection mid-read.
One symptom of a dropped connection is an exception from the AWS client SDK
like this:
```
2021-10-13T01:04:21,019 ERROR [task-runner-0-priority-0]
org.apache.druid.indexing.common.task.batch.parallel.SinglePhaseSubTask -
Encountered exception in parallel sub task.
java.lang.IllegalStateException: java.io.IOException:
com.amazonaws.SdkClientException: Data read has a different length than the
expected: dataLength=589456583; expectedLength=4288561114; includeSkipped=true;
in.getClass()=class com.amazonaws.services.s3.AmazonS3Client$2;
markedSupported=false; marked=0; resetSinceLastMarked=false; markCount=0;
resetCount=0
at org.apache.commons.io.LineIterator.hasNext(LineIterator.java:108)
~[commons-io-2.11.0.jar:2.11.0]
at org.apache.druid.data.input.TextReader$1.hasNext(TextReader.java:73)
```
Today the code does not treat this exception as retryable. In addition,
the current code in `org.apache.druid.data.input.impl.RetryingInputStream` has a
private "reset condition". The retry condition simply retries, sleeping between
attempts, whereas the reset condition first reopens the input stream at the
offset where it left off and then retries.
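To illustrate the difference, here is a minimal sketch of the "reset" idea. It is not the actual `RetryingInputStream` code: the names `OffsetOpener`, `ResumingInputStream` and `demo` are hypothetical, the sleep between attempts is omitted, and the flaky stream only simulates a dropped connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Predicate;

public class ResettableStreamSketch
{
  /** Hypothetical opener: returns a stream positioned at the given byte offset. */
  interface OffsetOpener
  {
    InputStream open(long offset) throws IOException;
  }

  /** Reopens at the last good offset when a read fails with a "resettable" error. */
  static class ResumingInputStream extends InputStream
  {
    private final OffsetOpener opener;
    private final Predicate<Throwable> resetCondition;
    private final int maxTries;
    private InputStream delegate;
    private long offset = 0;

    ResumingInputStream(OffsetOpener opener, Predicate<Throwable> resetCondition, int maxTries)
        throws IOException
    {
      this.opener = opener;
      this.resetCondition = resetCondition;
      this.maxTries = maxTries;
      this.delegate = opener.open(0);
    }

    @Override
    public int read() throws IOException
    {
      // The attempt counter is per read() call in this sketch.
      for (int attempt = 1; ; attempt++) {
        try {
          int b = delegate.read();
          if (b >= 0) {
            offset++; // remember how far we got
          }
          return b;
        }
        catch (IOException e) {
          if (attempt >= maxTries || !resetCondition.test(e)) {
            throw e;
          }
          // Reset: drop the broken stream and reopen where we left off.
          try {
            delegate.close();
          }
          catch (IOException ignored) {
            // the old stream is already broken
          }
          delegate = opener.open(offset);
        }
      }
    }

    @Override
    public void close() throws IOException
    {
      delegate.close();
    }
  }

  /** Reads "hello world" through an opener whose first stream fails after 5 bytes. */
  static String demo() throws IOException
  {
    byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);
    int[] opens = {0};
    OffsetOpener flaky = offset -> {
      boolean first = opens[0]++ == 0;
      InputStream in = new ByteArrayInputStream(data, (int) offset, data.length - (int) offset);
      if (!first) {
        return in;
      }
      return new InputStream()
      {
        private int served = 0;

        @Override
        public int read() throws IOException
        {
          if (served++ == 5) {
            throw new IOException("simulated dropped connection");
          }
          return in.read();
        }
      };
    };
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (InputStream in = new ResumingInputStream(flaky, t -> t instanceof IOException, 3)) {
      int b;
      while ((b = in.read()) >= 0) {
        out.write(b);
      }
    }
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException
  {
    System.out.println(demo());
  }
}
```

A plain retry on the same stream would not help here, because the underlying connection is gone; the reopen at the saved offset is what lets the read continue.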
For the exception above we want to add a custom "reset condition" (similar
to the already existing custom "retry condition") that the `S3Entity` can use to
signal that the above exception should be resettable.
This PR implements that custom reset condition and adds the appropriate
condition to the `S3Entity` so that it can recover from such an exception.
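As a rough illustration of what such a condition could look like, the sketch below matches the length-mismatch error by its message text anywhere in the cause chain. This is an assumption made to keep the example free of the AWS SDK dependency; the class name `LengthMismatchResetCondition` and the constant names are hypothetical, and the condition in this PR may well check the exception type instead.

```java
import java.util.function.Predicate;

public class LengthMismatchResetCondition
{
  // Message prefix taken from the stack trace in this description.
  static final String LENGTH_MISMATCH_PREFIX =
      "Data read has a different length than the expected";

  // Upper bound on how far we walk the cause chain (guards against cycles).
  private static final int MAX_DEPTH = 16;

  /** Returns true if any throwable in the cause chain reports a length mismatch. */
  static final Predicate<Throwable> RESET_CONDITION = t -> {
    int depth = 0;
    for (Throwable cur = t; cur != null && depth < MAX_DEPTH; cur = cur.getCause(), depth++) {
      String msg = cur.getMessage();
      if (msg != null && msg.contains(LENGTH_MISMATCH_PREFIX)) {
        return true;
      }
    }
    return false;
  };
}
```

Walking the cause chain matters because, as the trace shows, the SDK exception arrives wrapped in an `IOException` and then an `IllegalStateException`.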
### Algorithmic choices
Another choice would have been to use [Amazon's SDK
TransferManager](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManager.html)
or `org.apache.druid.data.input.InputEntity#fetch` to download the S3 object
completely to local storage and then read from there. We decided to leave
this for future work, if needed, since reset and retry may be enough.
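For comparison, the download-first alternative could be sketched roughly as follows. This uses only the JDK and does not show `TransferManager` or `InputEntity#fetch` themselves; `FetchThenReadSketch` and `fetchToLocal` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class FetchThenReadSketch
{
  /** Copies the whole remote stream to a temp file and returns its path. */
  static Path fetchToLocal(InputStream remote) throws IOException
  {
    Path tmp = Files.createTempFile("fetched-object-", ".tmp");
    try (InputStream in = remote) {
      // createTempFile already created the file, so it must be replaced.
      Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
    }
    return tmp;
  }
}
```

The trade-off is local disk usage proportional to the object size, and a failed download still needs its own retry handling.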
<hr>
##### Key changed/added classes in this PR
* `InputEntity`
* `RetryingInputStream`
<hr>
This PR has:
- [x] been self-reviewed.
- [ ] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [X] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]