dang-stripe opened a new issue, #8809:
URL: https://github.com/apache/pinot/issues/8809

   We recently ran into an issue after performing a fleetwide restart of our 
realtime servers. They initially caught up on consumption, then fell behind 
due to S3 rate limiting on the S3 table directory. This gradually recovered 
once the servers were no longer being rate limited.
   
   We observed the following log messages: 
https://github.com/apache/pinot/blob/master/pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/realtime/PinotLLCRealtimeSegmentManager.java#L484
   
   ```
   [2022-04-14 18:29:20.176066] 2022/04/14 18:29:20.175 WARN 
[PinotLLCRealtimeSegmentManager] [grizzly-http-server-59] Caught exception 
while deleting temporary segment files for segment: 
table1__37__4760__20220414T1717Z
   [2022-04-14 18:29:20.176125] java.io.IOException: 
software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your 
request rate. (Service: S3, Status Code: 503, Request ID: redacted, Extended 
Request ID: redacted)
   ```
   
   I'm wondering if making the table segment tmp dir configurable would help 
here, since a separate S3 prefix would give the tmp files a separate rate 
limit. It's not entirely clear to me whether the contention was caused by all 
the realtime servers hitting this particular listFiles call, or whether other 
calls were happening asynchronously.
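   To make the proposal concrete, the idea is something like the sketch below. 
Note this is purely hypothetical: `segmentTmpDir` is not an existing Pinot 
config key, and the exact placement in the table config is an assumption; the 
point is only that the tmp prefix would differ from the deep store's table 
directory prefix, so S3 partitions the request rate separately.
   
   ```json
   {
     "tableName": "table1_REALTIME",
     "segmentsConfig": {
       "__comment": "hypothetical key; today tmp files live under the table dir",
       "segmentTmpDir": "s3://my-bucket/pinot-tmp/table1/"
     }
   }
   ```
   
   With the tmp files under a distinct prefix, the listFiles/delete traffic for 
temporary segment files would no longer count against the same S3 request-rate 
partition as regular segment reads and writes.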


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

