Adithya-Shetty100 opened a new issue, #19616:
URL: https://github.com/apache/druid/issues/19616
### Affected Version
30.0.1 (the `S3RETRY` predicate is unchanged on `master` as of this writing).
### Description
`S3Utils.S3RETRY` treats a transient, retriable TLS/transport read failure
as non-retriable when the `SSLException` wraps a non-`IOException` cause.
A segment pull from S3 (`S3DataSegmentPuller.getSegmentFiles` ->
`CompressionUtils.unzip`, retried via `RetryUtils.retry(..., S3Utils.S3RETRY,
...)`) can fail mid-stream with:
```
javax.net.ssl.SSLException: Tag mismatch!
... SSLCipher$T13GcmReadCipherGenerator$GcmReadCipher.decrypt ...
Caused by: javax.crypto.AEADBadTagException: Tag mismatch!
at com.sun.crypto.provider.GaloisCounterMode.decryptFinal(...)
```
i.e. a corrupted TLS record failed AES-GCM authentication during the HTTPS
GET. This is a transient transport error; a fresh `getObject` (which
`S3DataSegmentPuller` already opens per retry via its `ByteSource`) would
almost certainly succeed.
It is not retried. In `S3RETRY`:
```java
} else if (e instanceof IOException) {
if (e.getCause() != null) {
// Recurse with the underlying cause to see if it's retriable.
return apply(e.getCause());
}
return true;
}
```
`SSLException` is an `IOException`, but it has a cause, so the predicate
recurses into the cause (`javax.crypto.AEADBadTagException`, a
`GeneralSecurityException`). That matches none of the retriable branches, so
the final `else` returns `apply(getCause())` -> `null` -> `false`. Zero
retries; the `SegmentLoadingException` propagates.
### Impact
Most severe under MSQ: a leaf worker's segment download throws
`SegmentLoadingException`, which surfaces as a non-retriable `UnknownError` and
aborts the entire multi-stage query. A single transient corrupted TLS record on
one worker (after many GB already streamed successfully) can kill a large
multi-worker SELECT with no retry. Probability scales with bytes streamed, so
it appears on large scans.
### Redacted stack trace
```
org.apache.druid.java.util.common.RE: Failed to download segment
[<datasource>_<interval>_<version>]
at
org.apache.druid.msq.exec.TaskDataSegmentProvider.fetchSegmentInternal(TaskDataSegmentProvider.java:139)
Caused by: org.apache.druid.segment.loading.SegmentLoadingException: Tag
mismatch!
at
org.apache.druid.storage.s3.S3DataSegmentPuller.getSegmentFiles(S3DataSegmentPuller.java:138)
at org.apache.druid.utils.CompressionUtils.unzip(CompressionUtils.java:239)
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:153)
at
org.apache.druid.storage.s3.S3DataSegmentPuller.getSegmentFiles(S3DataSegmentPuller.java:107)
Caused by: javax.net.ssl.SSLException: Tag mismatch!
Caused by: javax.crypto.AEADBadTagException: Tag mismatch!
at com.sun.crypto.provider.GaloisCounterMode.decryptFinal(...)
```
### Proposed fix
Classify `SSLException` (transport-level read failures) as retriable instead
of recursing into a non-`IOException` security cause. This mirrors the
targeted-signature approach of #11941 (retry for `SdkClientException` "Data
read has a different length than the expected"). PR to follow.
### Note
This is a transient transport-layer corruption that TLS correctly detects
(the GCM auth tag fails on a corrupted record); it is not a JDK or crypto
defect. The gap is purely the retry classification.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]