razvanculea commented on PR #32805:
URL: https://github.com/apache/beam/pull/32805#issuecomment-2418849218

   Why the change:
   
   - In streaming, BigQueryIO.write using StorageWrite can control the number 
of streams.
   - In batch, the number of connections is proportional to the parallelism 
of the job (which can vary based on the source). Users can hit the 
[CreateWriteStreams 
quota](https://cloud.google.com/bigquery/quotas#write-api-limits) (10,000 
streams every hour, per project per region) on stream creation (visible in 
monitoring as google.cloud.bigquery.storage.v1.BigQueryWrite.CreateWriteStream 
4xx responses). The quota depletion might not impact the job that consumed 
most of it, but it can impact the jobs that follow during the 1h window.
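
   To make the quota arithmetic concrete, here is a rough plain-Java sketch. 
It only uses the 10,000 streams/hour figure quoted above; the class and method 
names are hypothetical and nothing here calls BigQuery or Beam.

```java
// Hypothetical helper, not part of Beam: plain arithmetic on the quota quoted above.
class StreamQuotaBudget {
    // 10,000 CreateWriteStream calls per hour, per project per region (from the comment).
    static final int STREAMS_PER_HOUR = 10_000;

    /** How many jobs that each create streamsPerJob streams fit in one 1h window. */
    static int jobsThatFit(int streamsPerJob) {
        if (streamsPerJob <= 0) {
            throw new IllegalArgumentException("streamsPerJob must be positive");
        }
        return STREAMS_PER_HOUR / streamsPerJob;
    }

    /** Streams left in the window after streamsUsed were created (never negative). */
    static int remainingAfter(int streamsUsed) {
        return Math.max(0, STREAMS_PER_HOUR - streamsUsed);
    }

    public static void main(String[] args) {
        // Two jobs capped at 4096 streams fit in a window; a third would not.
        System.out.println(jobsThatFit(4096));
        // A batch job that created 9,500 streams leaves 500 for whatever runs next.
        System.out.println(remainingAfter(9_500));
    }
}
```

   The second call shows why the jobs that follow are the ones that suffer: 
the large job itself may finish fine, while the next one finds the window 
almost empty.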
   
   
   The modified BigQueryIO will inject a redistribute step in batch if 
withNumStorageWriteApiStreams > 0, which limits the number of 
CreateWriteStream calls made by the StorageApiWriteUnsharded step. This makes 
the same pipeline behave similarly in both streaming & batch.
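
   The capping effect can be sketched in plain Java. This is a simulation, 
not the code in this PR: it assumes one write stream per shard after the 
redistribute, uses a simple modulo as a stand-in for the bucketing, and the 
class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical simulation, not Beam code: assumes one write stream per shard.
class RedistributeSketch {
    /**
     * Number of distinct shards (and so, under the assumption above, streams)
     * when `parallelism` upstream workers are bucketed into `numStreams` shards.
     * numStreams <= 0 models "no redistribute": one stream per upstream worker.
     */
    static int streamsAfterRedistribute(int parallelism, int numStreams) {
        if (numStreams <= 0) {
            return parallelism;
        }
        Set<Integer> shards = new HashSet<>();
        for (int worker = 0; worker < parallelism; worker++) {
            // Modulo bucketing stands in for the redistribute step.
            shards.add(Math.floorMod(worker, numStreams));
        }
        return shards.size();
    }

    public static void main(String[] args) {
        System.out.println(streamsAfterRedistribute(20_000, 0));    // uncapped
        System.out.println(streamsAfterRedistribute(20_000, 4096)); // capped
        System.out.println(streamsAfterRedistribute(100, 4096));    // cap not reached
    }
}
```

   With the cap in place the stream count is bounded by the configured value 
no matter how wide the source makes the job, which is the behavior the 
redistribute step is meant to provide.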
   
   PS: using the StorageApiWriteSharded (made for streaming) in batch is an 
unsupported workaround that has even lower performance (and higher cost) in my 
tests due to multiple shuffles that are done.
   
   BigQueryIOLT has been modified to expose the parameters needed to test quota 
depletion. Setting withNumStorageWriteApiStreams very high will deplete the 
quota fast on a large test.
   
   An extra redistribute step comes with a cost in speed, but gives control 
over the quota consumption. 
   
   - Line 7 is a test where the user modifies the pipeline to add a 
redistribute step.
   - Line 8 is a test where BQIO.write modifies the pipeline, using 
withNumStorageWriteApiStreams = 4096.
   <img width="874" alt="Screenshot 2024-10-17 at 09 47 10" 
src="https://github.com/user-attachments/assets/23de0c22-96f8-455e-b9a0-2c4ea620e10a">


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
