[jira] [Commented] (BEAM-12378) GroupIntoBatches should support byte-size batches

Anant Damle (Jira) Fri, 21 May 2021 09:08:12 -0700


    [ 
https://issues.apache.org/jira/browse/BEAM-12378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349358#comment-17349358
 ]


Anant Damle commented on BEAM-12378:
------------------------------------

@reuvenlax This is quite interesting. I was working on something similar, with 
one key difference:
 This PR assumes that adding a new record does not change the size of the 
records already accumulated. 
 For example, let's say I'm accumulating records created by flattening a 
nested-repeated field and building a table from them: 
 record 1 contains 3 fields, whereas record 2 contains 4 (due to a difference 
in the repeated/array elements), 
 so the system should recompute the size of the whole accumulated batch. How 
do we handle such a situation with this PR?

I am using something like 
[BatchBySize|https://github.com/GoogleCloudPlatform/auto-data-tokenize/blob/main/dlp/src/main/java/com/google/cloud/solutions/autotokenize/pipeline/dlp/GroupByBatchSize.java],
 which accepts a 
[BatchAccumulator|https://github.com/GoogleCloudPlatform/auto-data-tokenize/blob/main/dlp/src/main/java/com/google/cloud/solutions/autotokenize/pipeline/dlp/BatchAccumulator.java]
 and delegates the batch size computation to it.
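
To make the concern concrete, here is a minimal sketch in plain Java. It is 
not the code from the linked repository and does not use any Beam API; the 
names SizeAwareAccumulator, TableAccumulator and batchAll are hypothetical, and 
"cells" (rows x distinct columns) is only a stand-in for a real byte-size 
estimate. The point it illustrates is that the accumulator re-evaluates the 
size of the whole batch on every add, because one wider record also widens 
every row already accumulated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class TableBatchSketch {

  /** Hypothetical contract: the accumulator, not the caller, decides whether an element fits. */
  public interface SizeAwareAccumulator<T> {
    /** Returns false, leaving the batch unchanged, if adding the element would exceed the limit. */
    boolean tryAdd(T element);

    /** Returns the accumulated elements and resets the accumulator. */
    List<T> drain();
  }

  /** Sizes a batch as rows x distinct columns, so a wider record grows every earlier row too. */
  public static class TableAccumulator implements SizeAwareAccumulator<Map<String, String>> {
    private final int maxCells;
    private final List<Map<String, String>> rows = new ArrayList<>();
    private final TreeSet<String> columns = new TreeSet<>();

    public TableAccumulator(int maxCells) {
      this.maxCells = maxCells;
    }

    @Override
    public boolean tryAdd(Map<String, String> record) {
      // Recompute the size of the whole batch as it would look after the add.
      TreeSet<String> merged = new TreeSet<>(columns);
      merged.addAll(record.keySet());
      int cellsAfterAdd = (rows.size() + 1) * merged.size();
      // An empty accumulator always accepts, so an oversized record still gets emitted alone.
      if (!rows.isEmpty() && cellsAfterAdd > maxCells) {
        return false;
      }
      rows.add(record);
      columns.addAll(record.keySet());
      return true;
    }

    @Override
    public List<Map<String, String>> drain() {
      List<Map<String, String>> out = new ArrayList<>(rows);
      rows.clear();
      columns.clear();
      return out;
    }
  }

  /** Splits the input into batches, closing a batch whenever the accumulator rejects an element. */
  public static <T> List<List<T>> batchAll(List<T> input, SizeAwareAccumulator<T> acc) {
    List<List<T>> batches = new ArrayList<>();
    for (T element : input) {
      if (!acc.tryAdd(element)) {
        batches.add(acc.drain());
        acc.tryAdd(element); // accepted: the accumulator is empty after drain()
      }
    }
    List<T> tail = acc.drain();
    if (!tail.isEmpty()) {
      batches.add(tail);
    }
    return batches;
  }

  public static void main(String[] args) {
    Map<String, String> r1 = Map.of("a", "1", "b", "2", "c", "3");
    Map<String, String> r2 = Map.of("a", "4", "b", "5", "c", "6");
    Map<String, String> r3 = Map.of("a", "7", "b", "8", "c", "9", "d", "10");
    List<List<Map<String, String>>> batches =
        batchAll(List.of(r1, r2, r3), new TableAccumulator(10));
    for (List<Map<String, String>> batch : batches) {
      System.out.println(batch.size() + " record(s)");
    }
  }
}
```

With a limit of 10 cells, summing per-record sizes (3 + 3 + 4 = 10) would put 
all three records in one batch, whereas the table they form has 3 rows x 4 
columns = 12 cells, so this sketch closes the batch after the second record.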

Does this sound interesting? Happy to provide commits if you feel my approach 
makes sense.

> GroupIntoBatches should support byte-size batches
> -------------------------------------------------
>
>                 Key: BEAM-12378
>                 URL: https://issues.apache.org/jira/browse/BEAM-12378
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core
>            Reporter: Reuven Lax
>            Priority: P2
>          Time Spent: 40m
>  Remaining Estimate: 0h
>




