ahmedabu98 opened a new pull request, #31837: URL: https://github.com/apache/beam/pull/31837
After fixing the concurrent connections issue (#31710), the only remaining blocker to making Storage API batch scalable is managing the AppendRows throughput quota. The Storage API backend enforces this quota with a short-term (cell) quota and a long-term (region) quota:
- the short-term quota can take up to 10s to refill
- the long-term quota is an aggregate of multiple cells and can take up to 10min to refill

It's important to note that all append operations are rejected while a quota is being refilled.

The standard throughput quota is not sufficient for large writes. A large pipeline will typically exhaust the long-term quota quickly, leading to consistent failures for 10 min. With enough failures (10 failures per bundle, 4 failed bundles per Dataflow pipeline), the pipeline eventually gives up and fails.

### To deal with this, we can increase the retry backoff so that pipelines can survive long enough for the throughput quota to be refilled.

## Disclaimer:
Before this change, in the worst case where all append operations fail, each bundle will retry for:
- 13 seconds for non-quota errors
- 66 seconds for quota errors

With 4 bundle failures, the total wait time goes up to 52s (non-quota errors) and 4.4min (quota errors) before pipeline failure.

----------------

With this change, the worst-case wait time per bundle goes up to:
- 113 seconds (1.9 min) for non-quota errors
- 340 seconds (5.7 min) for quota errors

A Dataflow pipeline will fail after about 7.5min (non-quota errors) and about 22.5min (quota errors).
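The pipeline-level figures in the disclaimer follow from multiplying the worst-case per-bundle retry time by the 4 bundle failures Dataflow tolerates. A minimal sketch of that arithmetic, using only the numbers quoted in this description; the names `PER_BUNDLE_WAIT_S` and `worst_case_pipeline_wait_s` are hypothetical and do not appear in the Beam codebase:

```python
# Worst-case per-bundle retry durations in seconds, as quoted in the PR description.
# The dictionary layout and all names here are illustrative assumptions.
PER_BUNDLE_WAIT_S = {
    "before": {"non_quota": 13, "quota": 66},
    "after": {"non_quota": 113, "quota": 340},
}

# The description states that a Dataflow pipeline fails after 4 failed bundles.
BUNDLE_FAILURES = 4


def worst_case_pipeline_wait_s(per_bundle_wait_s, bundle_failures=BUNDLE_FAILURES):
    """Total time spent retrying before the pipeline fails, assuming every
    append fails and each bundle exhausts its full retry budget."""
    return per_bundle_wait_s * bundle_failures


if __name__ == "__main__":
    for change, waits in PER_BUNDLE_WAIT_S.items():
        for error_kind, seconds in waits.items():
            total_s = worst_case_pipeline_wait_s(seconds)
            print(f"{change:>6} {error_kind:>9}: {total_s} s ({total_s / 60:.1f} min)")
```

For quota errors after the change, this simple product gives 1360 s, slightly above the 22.5min quoted, so the figures in the description appear to be rounded. Either way, the new worst-case window comfortably exceeds the 10min the long-term quota can take to refill, which the previous 4.4min did not.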
