razvanculea commented on code in PR #32805:
URL: https://github.com/apache/beam/pull/32805#discussion_r1819501942
##########
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StorageApiLoads.java:
##########
@@ -354,25 +355,34 @@ public WriteResult expandUntriggered(
rowUpdateFn,
badRecordRouter));
+    PCollection<KV<DestinationT, StorageApiWritePayload>> successfulConvertedRows =
+        convertMessagesResult.get(successfulConvertedRowsTag);
+
+    if (numShards > 0) {
Review Comment:
This gives more control over the batching behaviour, so it should be an
improvement over today. Currently, when the API limits are hit, the fix is to
change the graph by adding a redistribute step before the write in batch (as
you rightly noticed in the previous comment); since that step is not needed in
streaming, you end up maintaining two different pipelines for batch and
streaming. A sketch of that batch-only workaround is below.
Setting it to 0 keeps the behaviour exactly as before. Setting it to > 0 will
have a similar effect in batch and streaming, which is the difference from
today, where the parameter takes effect in streaming but is ignored in batch.
And if the parameter already works well for a streaming pipeline, it should
work just as well in batch, since the same limits on the number of connections
apply.
The problem with leaving it unset in batch today is that some batch jobs
happen to work because their parallelism stays under the API limits, while
others (or even re-runs of the same or similar jobs) hit the limits.
I'm in favour of more consistent behaviour between streaming and batch, and I
do understand that this might affect existing batch jobs that set this
parameter and currently have it ignored.
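Concretely, the opt-in would look something like the fragment below (same
imports as the sketch above; the table name and stream count are
placeholders). Today this setting only shards the streaming path; my reading
of this change is that the same setting would shard the batch path too, while
0 keeps the previous behaviour:

```java
// Illustrative only: the same setting that shards streaming writes today
// would, after this change, also shard batch writes; 0 keeps the old
// unsharded batch behaviour.
BigQueryIO.Write<TableRow> write =
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table") // placeholder table
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        .withNumStorageWriteApiStreams(20); // placeholder shard count
```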
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]