razvanculea commented on PR #32805: URL: https://github.com/apache/beam/pull/32805#issuecomment-2418849218
Why the change:
- BigQueryIO.write using the Storage Write API can control the number of streams in streaming.
- In batch, the number of connections is proportional to the parallelism of the job (which can vary based on the source).

Users can hit the [CreateWriteStreams quota](https://cloud.google.com/bigquery/quotas#write-api-limits) (10,000 streams every hour, per project per region) on stream creation. This is visible in monitoring as 4xx responses for google.cloud.bigquery.storage.v1.BigQueryWrite.CreateWriteStream. The quota depletion might not impact the job that used a lot of it, but it can impact the jobs that follow during the 1h window.

The modified BigQueryIO will inject a redistribute step in batch if withNumStorageWriteApiStreams > 0, which limits the number of CreateWriteStreams calls made by the StorageApiWriteUnsharded step. This makes the same pipeline behave similarly in both streaming and batch.

PS: using StorageApiWriteSharded (made for streaming) in batch is an unsupported workaround that has even lower performance (and higher cost) in my tests, due to the multiple shuffles it performs.

BigQueryIOLT has been modified to expose the parameters needed to test quota depletion. Setting withNumStorageWriteApiStreams very high will deplete the quota fast on a large test.

An extra redistribute step comes with a cost in speed, but gives control over the quota consumption.
- Line 7 is a test where the pipeline is modified by the user to have a redistribute step.
- Line 8 is a test where the pipeline is modified by BQIO.write using withNumStorageWriteApiStreams = 4096.

<img width="874" alt="Screenshot 2024-10-17 at 09 47 10" src="https://github.com/user-attachments/assets/23de0c22-96f8-455e-b9a0-2c4ea620e10a">
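To make the quota arithmetic in the comment concrete, here is a minimal back-of-the-envelope sketch in plain Java. It is not part of the PR and does not call any Beam or BigQuery API. The class and method names are hypothetical, and the model is an assumption: without a redistribute step, stream creations per run are taken to equal the job parallelism, and with withNumStorageWriteApiStreams > 0 they are taken to be capped at that value. Only the 10,000 per hour figure comes from the comment.

```java
public class StreamQuotaSketch {
    // From the comment: 10,000 CreateWriteStream calls per hour, per project per region.
    public static final long CREATE_WRITE_STREAM_QUOTA_PER_HOUR = 10_000L;

    /**
     * Estimated stream creations for one batch run (assumed model, not measured).
     * A value of 0 for numStorageWriteApiStreams means no redistribute step is injected,
     * so creations follow the job parallelism. A positive value caps them.
     */
    public static long estimatedStreamCreations(long parallelism, long numStorageWriteApiStreams) {
        if (parallelism < 0 || numStorageWriteApiStreams < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        if (numStorageWriteApiStreams > 0) {
            return Math.min(parallelism, numStorageWriteApiStreams);
        }
        return parallelism;
    }

    /** Number of complete runs that fit in one hourly quota window. */
    public static long runsPerQuotaWindow(long streamCreationsPerRun) {
        if (streamCreationsPerRun <= 0) {
            throw new IllegalArgumentException("streamCreationsPerRun must be positive");
        }
        return CREATE_WRITE_STREAM_QUOTA_PER_HOUR / streamCreationsPerRun;
    }

    public static void main(String[] args) {
        long parallelism = 20_000L; // hypothetical batch job parallelism

        long uncapped = estimatedStreamCreations(parallelism, 0);
        long capped = estimatedStreamCreations(parallelism, 4096);

        System.out.println("uncapped creations per run: " + uncapped);
        System.out.println("runs per hour, uncapped: " + runsPerQuotaWindow(uncapped));
        System.out.println("capped creations per run: " + capped);
        System.out.println("runs per hour, capped: " + runsPerQuotaWindow(capped));
    }
}
```

Under these assumptions, a single uncapped run at a parallelism of 20,000 would already exceed the hourly quota, while a run capped at 4096 leaves room for one more run in the same window. Real consumption depends on how the runner schedules work and on retries, so treat the numbers as illustrative only.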
