[GitHub] [beam] anantdamle commented on pull request #14852: [BEAM-12378] GroupIntoBatches improvements


anantdamle commented on pull request #14852:
URL: https://github.com/apache/beam/pull/14852#issuecomment-846066857



   @reuvenlax This is quite interesting. I was working on something similar, 
with one key difference:
   This PR assumes that adding a new record does not change the size of the 
records already accumulated. 
   For example, say I'm accumulating records produced by flattening a 
nested-repeated field and computing a table from them: 
   record 1 contains 3 fields, whereas record 2 contains 4 (because the 
repeated/array elements differ), 
   so the system should recompute the size of the whole accumulated batch. How 
do we handle that situation with this PR?
   
   I am using something like: 
[BatchBySize](https://github.com/GoogleCloudPlatform/auto-data-tokenize/blob/main/dlp/src/main/java/com/google/cloud/solutions/autotokenize/pipeline/dlp/GroupByBatchSize.java)
   which accepts a 
[BatchAccumulator](https://github.com/GoogleCloudPlatform/auto-data-tokenize/blob/main/dlp/src/main/java/com/google/cloud/solutions/autotokenize/pipeline/dlp/BatchAccumulator.java)
 and offloads the batch size computation to it.
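
   To make the concern concrete, here is a minimal sketch of the idea. It is 
not the linked BatchAccumulator API and not anything in this PR: the names 
`SizeAwareAccumulator`, `TableCellAccumulator`, `tryAdd`, and `cells` are 
hypothetical, and batch size is assumed to be (distinct columns) x (rows) of 
the resulting table, so a wider record grows the cost of rows already held.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical contract: the accumulator, not the batching transform, owns the size math. */
interface SizeAwareAccumulator<T> {
  /** Adds the element if the batch still fits afterwards; returns false otherwise. */
  boolean tryAdd(T element);

  /** Read-only view of the elements accepted so far. */
  List<T> batch();
}

/** Assumed size model: a table whose cell count is (distinct columns) x (rows). */
final class TableCellAccumulator implements SizeAwareAccumulator<Map<String, String>> {
  private final int maxCells;
  private final Set<String> columns = new LinkedHashSet<>();
  private final List<Map<String, String>> rows = new ArrayList<>();

  TableCellAccumulator(int maxCells) {
    this.maxCells = maxCells;
  }

  @Override
  public boolean tryAdd(Map<String, String> row) {
    // Recompute the size of the WHOLE batch as if the row were added:
    // a new column widens every row that was accumulated earlier.
    Set<String> widened = new LinkedHashSet<>(columns);
    widened.addAll(row.keySet());
    int cellsIfAdded = widened.size() * (rows.size() + 1);

    // The first row is always accepted so an oversized record cannot stall the batch.
    if (!rows.isEmpty() && cellsIfAdded > maxCells) {
      return false;
    }
    columns.addAll(row.keySet());
    rows.add(row);
    return true;
  }

  /** Current size of the batch under the assumed table model. */
  int cells() {
    return columns.size() * rows.size();
  }

  @Override
  public List<Map<String, String>> batch() {
    return Collections.unmodifiableList(rows);
  }
}
```

   With a limit of 8 cells, a 3-field record followed by a 4-field record 
gives 4 x 2 = 8 cells rather than 3 + 4 = 7, which is the case a per-element 
size function cannot express.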
   
   Does this sound interesting? Happy to provide commits if you feel my 
approach makes sense.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
