nbali commented on code in PR #22953:
URL: https://github.com/apache/beam/pull/22953#discussion_r976932415
########## sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java: ##########
@@ -113,6 +113,9 @@
   // If user triggering is supplied, we will trigger the file write after this many records are
   // written.
   static final int FILE_TRIGGERING_RECORD_COUNT = 500000;
+  // If user triggering is supplied, we will trigger the file write after this many bytes are
+  // written.
+  static final long FILE_TRIGGERING_BYTE_COUNT = 100 * (1L << 20); // 100MiB

Review Comment:
   @lukecwik On second thought, I think there is a problem with using this 64MB default. We only flush the batch inside `GroupIntoBatches` once `storedBatchSizeBytes` is greater than or equal to the limit. So if we make the limit 64MB, we will most likely flush slightly more than 64MB and therefore not fit into the 64MB buffer. So either the triggering byte count should be x% smaller than the 64MB default, or `GroupIntoBatches` has to be modified so that if the current element would push the batch over the byte size limit, the batch is fired without that element being added to it first. The second seems like the better solution, but I assume doing the `storedBatchSizeBytes.read()` sooner would have a performance impact.
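   To make the two options concrete, here is a minimal standalone sketch (not the actual `GroupIntoBatches` implementation; the class name, method names, and the 64MiB constant are made up for illustration) contrasting "flush after the limit has been exceeded" with "flush before adding the element that would exceed it":

   import java.util.ArrayList;
   import java.util.List;

   // Illustrative sketch only; not Beam code.
   public class BatchFlushSketch {

     static final long BYTE_LIMIT = 64L * (1 << 20); // assumed 64MiB buffer limit

     private final List<byte[]> batch = new ArrayList<>();
     private long storedBatchSizeBytes = 0;

     // Current behaviour: append first, then flush once the limit is reached or passed.
     // The emitted batch is therefore typically a bit larger than BYTE_LIMIT.
     void addFlushAfter(byte[] element) {
       batch.add(element);
       storedBatchSizeBytes += element.length;
       if (storedBatchSizeBytes >= BYTE_LIMIT) {
         flush();
       }
     }

     // Proposed behaviour: read the stored size first and flush if the incoming element
     // would push the batch over the limit, so the emitted batch stays <= BYTE_LIMIT.
     // This is the earlier size read whose performance impact the comment above mentions.
     void addFlushBefore(byte[] element) {
       if (!batch.isEmpty() && storedBatchSizeBytes + element.length > BYTE_LIMIT) {
         flush();
       }
       batch.add(element);
       storedBatchSizeBytes += element.length;
     }

     private void flush() {
       // Emit the batch downstream; here we just clear it.
       batch.clear();
       storedBatchSizeBytes = 0;
     }
   }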
