dang-stripe opened a new issue, #8809: URL: https://github.com/apache/pinot/issues/8809
We recently ran into an issue after performing a fleet-wide restart of our realtime servers. They initially caught up on consumption, then fell behind due to S3 rate limiting on the S3 table directory. This gradually recovered once the servers were no longer being rate limited. We observed the following log messages, emitted from https://github.com/apache/pinot/blob/master/pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/realtime/PinotLLCRealtimeSegmentManager.java#L484:

```
[2022-04-14 18:29:20.176066] 2022/04/14 18:29:20.175 WARN [PinotLLCRealtimeSegmentManager] [grizzly-http-server-59] Caught exception while deleting temporary segment files for segment: table1__37__4760__20220414T1717Z
[2022-04-14 18:29:20.176125] java.io.IOException: software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: redacted, Extended Request ID: redacted)
```

I'm wondering whether making the table segment tmp dir configurable would help here: since S3 applies request-rate limits per prefix, placing tmp files under a separate prefix would give them a separate rate limit. It's not entirely clear to me whether the contention was caused by all the realtime servers hitting this particular listFiles call, or whether other requests were happening asynchronously.
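Independent of where the tmp dir lives, the deletion path could also tolerate transient 503s by retrying with capped exponential backoff and jitter. Below is a minimal sketch of that idea; this is not Pinot code, and the class and method names (`S3RetryBackoff`, `computeBackoffMillis`, `runWithRetry`) are illustrative assumptions:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

public class S3RetryBackoff {

    /** Capped exponential delay: baseMillis * 2^attempt, bounded by capMillis. */
    static long computeBackoffMillis(int attempt, long baseMillis, long capMillis) {
        // Limit the shift so the multiplication cannot overflow for sane inputs.
        long delay = baseMillis << Math.min(attempt, 20);
        return Math.min(delay, capMillis);
    }

    /**
     * Runs the operation, retrying up to maxAttempts times while it reports a
     * retryable failure (e.g. an S3 503 "Please reduce your request rate").
     * Returns the number of attempts used, or throws if all attempts fail.
     */
    static int runWithRetry(Supplier<Boolean> op, int maxAttempts) throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (op.get()) {
                return attempt + 1;
            }
            // Full jitter: sleep a random duration up to the capped exponential delay.
            long cap = computeBackoffMillis(attempt, 10, 200);
            Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
        }
        throw new RuntimeException("operation still rate limited after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate an S3 call that is throttled twice before succeeding.
        int[] calls = {0};
        int attempts = runWithRetry(() -> ++calls[0] >= 3, 5);
        System.out.println("succeeded after " + attempts + " attempts");
    }
}
```

The jitter matters for the failure mode described above: after a fleet-wide restart, all servers retry on the same schedule and re-collide on the same prefix unless their retry times are randomized.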
