Adithya-Shetty100 opened a new issue, #19616:
URL: https://github.com/apache/druid/issues/19616

   ### Affected Version
   30.0.1 (the `S3RETRY` predicate is unchanged on `master` as of this writing).
   
   ### Description
   `S3Utils.S3RETRY` treats a transient, retriable TLS/transport read failure 
as non-retriable when the `SSLException` wraps a non-`IOException` cause.
   
   A segment pull from S3 (`S3DataSegmentPuller.getSegmentFiles` -> 
`CompressionUtils.unzip`, retried via `RetryUtils.retry(..., S3Utils.S3RETRY, 
...)`) can fail mid-stream with:
   
   ```
   javax.net.ssl.SSLException: Tag mismatch!
     ... SSLCipher$T13GcmReadCipherGenerator$GcmReadCipher.decrypt ...
   Caused by: javax.crypto.AEADBadTagException: Tag mismatch!
     at com.sun.crypto.provider.GaloisCounterMode.decryptFinal(...)
   ```
   
   i.e. a corrupted TLS record failed AES-GCM authentication during the HTTPS 
GET. This is a transient transport error; a fresh `getObject` (which 
`S3DataSegmentPuller` already opens per retry via its `ByteSource`) would 
almost certainly succeed.
   
   It is not retried. In `S3RETRY`:
   
   ```java
   } else if (e instanceof IOException) {
     if (e.getCause() != null) {
       // Recurse with the underlying cause to see if it's retriable.
       return apply(e.getCause());
     }
     return true;
   }
   ```
   
   `SSLException` is an `IOException`, but it has a cause, so the predicate 
recurses into the cause (`javax.crypto.AEADBadTagException`, a 
`GeneralSecurityException`). That matches none of the retriable branches, so 
the final `else` returns `apply(getCause())` -> `null` -> `false`. Zero 
retries; the `SegmentLoadingException` propagates.
   
   ### Impact
   Most severe under MSQ: a leaf worker's segment download throws 
`SegmentLoadingException`, which surfaces as a non-retriable `UnknownError` and 
aborts the entire multi-stage query. A single transient corrupted TLS record on 
one worker (after many GB already streamed successfully) can kill a large 
multi-worker SELECT with no retry. Probability scales with bytes streamed, so 
it appears on large scans.
   
   ### Redacted stack trace
   ```
   org.apache.druid.java.util.common.RE: Failed to download segment 
[<datasource>_<interval>_<version>]
     at 
org.apache.druid.msq.exec.TaskDataSegmentProvider.fetchSegmentInternal(TaskDataSegmentProvider.java:139)
   Caused by: org.apache.druid.segment.loading.SegmentLoadingException: Tag 
mismatch!
     at 
org.apache.druid.storage.s3.S3DataSegmentPuller.getSegmentFiles(S3DataSegmentPuller.java:138)
     at org.apache.druid.utils.CompressionUtils.unzip(CompressionUtils.java:239)
     at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:153)
     at 
org.apache.druid.storage.s3.S3DataSegmentPuller.getSegmentFiles(S3DataSegmentPuller.java:107)
   Caused by: javax.net.ssl.SSLException: Tag mismatch!
   Caused by: javax.crypto.AEADBadTagException: Tag mismatch!
     at com.sun.crypto.provider.GaloisCounterMode.decryptFinal(...)
   ```
   
   ### Proposed fix
   Classify `SSLException` (transport-level read failures) as retriable instead 
of recursing into a non-`IOException` security cause. This mirrors the 
targeted-signature approach of #11941 (retry for `SdkClientException` "Data 
read has a different length than the expected"). PR to follow.
   
   ### Note
   This is a transient transport-layer corruption that TLS correctly detects 
(the GCM auth tag fails on a corrupted record); it is not a JDK or crypto 
defect. The gap is purely the retry classification.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to