razvanculea commented on code in PR #32805:
URL: https://github.com/apache/beam/pull/32805#discussion_r1819501942
##########
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StorageApiLoads.java:
##########
@@ -354,25 +355,34 @@ public WriteResult expandUntriggered(
rowUpdateFn,
badRecordRouter));
+    PCollection<KV<DestinationT, StorageApiWritePayload>> successfulConvertedRows =
+        convertMessagesResult.get(successfulConvertedRowsTag);
+
+    if (numShards > 0) {
Review Comment:
This gives more control over the batching behaviour, so it should be an
improvement over today. Currently, when the API limits are hit, the fix is to
change the graph by adding a redistribute step before the write in batch (as
you rightly noticed in the previous comment); since that step is not needed in
streaming, you end up maintaining two different pipelines for batch and
streaming. A sketch of that batch-only workaround is below.
Setting it to 0 keeps the behaviour exactly as before. Setting it to > 0 will
have a similar effect in batch and streaming, which is the difference from
today, where the parameter takes effect in streaming but is ignored in batch.
And if the parameter already works well for a streaming pipeline, it should
work just as well in batch, since the same limits on the number of connections
apply.
The problem with leaving it unset in batch today is that some batch jobs
happen to work because their parallelism stays under the API limits, while
others (or even re-runs of the same or similar jobs) hit the limits.
I'm in favour of more consistent behaviour between streaming and batch, and I
do understand that this might affect existing batch jobs that set this
parameter and currently have it ignored.
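Concretely, the opt-in would look something like the fragment below (same
imports as the sketch above; the table name and stream count are
placeholders). Today this setting only shards the streaming path; my reading
of this change is that the same setting would shard the batch path too, while
0 keeps the previous behaviour:

```java
// Illustrative only: the same setting that shards streaming writes today
// would, after this change, also shard batch writes; 0 keeps the old
// unsharded batch behaviour.
BigQueryIO.Write<TableRow> write =
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table") // placeholder table
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        .withNumStorageWriteApiStreams(20); // placeholder shard count
```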
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]