nbali commented on code in PR #22953:
URL: https://github.com/apache/beam/pull/22953#discussion_r976932415
########## sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java: ##########
@@ -113,6 +113,9 @@
   // If user triggering is supplied, we will trigger the file write after this many records are
   // written.
   static final int FILE_TRIGGERING_RECORD_COUNT = 500000;
+  // If user triggering is supplied, we will trigger the file write after this many bytes are
+  // written.
+  static final long FILE_TRIGGERING_BYTE_COUNT = 100 * (1L << 20); // 100MiB

Review Comment:
   @lukecwik On second thought, I think there is a problem with using this 64MB default. We only flush the batch inside `GroupIntoBatches` once `storedBatchSizeBytes` is greater than or equal to the limit. So if we make the limit 64MB, we will most likely flush slightly more than 64MB and therefore not fit into the 64MB buffer. So either the triggering byte count should be x% smaller than the 64MB default, or `GroupIntoBatches` has to be modified so that if the current element would push the batch over the byte size limit, the batch is fired without that element being added to it first. The second seems like the better solution, but I assume doing the `storedBatchSizeBytes.read()` sooner would have a performance impact.
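   To make the two options concrete, here is a minimal standalone sketch (not the actual `GroupIntoBatches` implementation; the class name, method names, and the 64MiB constant are made up for illustration) contrasting "flush after the limit has been exceeded" with "flush before adding the element that would exceed it":

   import java.util.ArrayList;
   import java.util.List;

   // Illustrative sketch only; not Beam code.
   public class BatchFlushSketch {

     static final long BYTE_LIMIT = 64L * (1 << 20); // assumed 64MiB buffer limit

     private final List<byte[]> batch = new ArrayList<>();
     private long storedBatchSizeBytes = 0;

     // Current behaviour: append first, then flush once the limit is reached or passed.
     // The emitted batch is therefore typically a bit larger than BYTE_LIMIT.
     void addFlushAfter(byte[] element) {
       batch.add(element);
       storedBatchSizeBytes += element.length;
       if (storedBatchSizeBytes >= BYTE_LIMIT) {
         flush();
       }
     }

     // Proposed behaviour: read the stored size first and flush if the incoming element
     // would push the batch over the limit, so the emitted batch stays <= BYTE_LIMIT.
     // This is the earlier size read whose performance impact the comment above mentions.
     void addFlushBefore(byte[] element) {
       if (!batch.isEmpty() && storedBatchSizeBytes + element.length > BYTE_LIMIT) {
         flush();
       }
       batch.add(element);
       storedBatchSizeBytes += element.length;
     }

     private void flush() {
       // Emit the batch downstream; here we just clear it.
       batch.clear();
       storedBatchSizeBytes = 0;
     }
   }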
