[
https://issues.apache.org/jira/browse/BEAM-12378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349358#comment-17349358
]
Anant Damle commented on BEAM-12378:
------------------------------------
@reuvenlax This is quite interesting. I was working on something similar, with
one key difference:
This PR assumes that a new record will not modify the size of the records
already accumulated.
E.g., let's say I'm accumulating records created by flattening a nested-repeated
field and computing a table:
record 1 contains 3 fields, whereas record 2 contains 4 (due to a difference in
repeated/array elements),
so the system should recompute the size of the accumulated batch again. How
do we handle such a situation with this PR?
I am using something like
[BatchBySize|https://github.com/GoogleCloudPlatform/auto-data-tokenize/blob/main/dlp/src/main/java/com/google/cloud/solutions/autotokenize/pipeline/dlp/GroupByBatchSize.java],
which accepts a
[BatchAccumulator|https://github.com/GoogleCloudPlatform/auto-data-tokenize/blob/main/dlp/src/main/java/com/google/cloud/solutions/autotokenize/pipeline/dlp/BatchAccumulator.java]
that offloads the batch size computation.
Does this sound interesting? Happy to provide commits if you feel my approach
makes sense.
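
To make the concern concrete, here is a minimal, self-contained sketch in plain
Java (no Beam dependency). It assumes the batch is measured in table cells, i.e.
rows times the union of all column names seen so far; that metric, the
SizeAwareAccumulator interface, TableCellAccumulator and batchRows are all
hypothetical names for illustration and are not the actual BatchAccumulator API
linked above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TableBatchSketch {

  /** Hypothetical contract: the accumulator, not the batching code, owns the size computation. */
  interface SizeAwareAccumulator<T> {
    /** Adds the element if the batch stays within its limit; otherwise leaves state unchanged. */
    boolean tryAdd(T element);

    List<T> elements();

    int currentSize();
  }

  /** Assumed metric: size = rows x (union of column names across all rows). */
  static class TableCellAccumulator implements SizeAwareAccumulator<Map<String, String>> {
    private final int maxCells;
    private final List<Map<String, String>> rows = new ArrayList<>();
    private final Set<String> columns = new LinkedHashSet<>();

    TableCellAccumulator(int maxCells) {
      this.maxCells = maxCells;
    }

    @Override
    public boolean tryAdd(Map<String, String> row) {
      Set<String> merged = new LinkedHashSet<>(columns);
      merged.addAll(row.keySet());
      // A new column widens every previously accumulated row, so the whole
      // batch size is recomputed instead of adding the new row's own size.
      int newSize = (rows.size() + 1) * merged.size();
      if (newSize > maxCells) {
        return false;
      }
      rows.add(row);
      columns.addAll(row.keySet());
      return true;
    }

    @Override
    public List<Map<String, String>> elements() {
      return rows;
    }

    @Override
    public int currentSize() {
      return rows.size() * columns.size();
    }
  }

  /** Splits rows into batches, starting a new batch whenever the accumulator rejects a row. */
  static List<List<Map<String, String>>> batchRows(List<Map<String, String>> rows, int maxCells) {
    List<List<Map<String, String>>> batches = new ArrayList<>();
    TableCellAccumulator acc = new TableCellAccumulator(maxCells);
    for (Map<String, String> row : rows) {
      if (acc.tryAdd(row)) {
        continue;
      }
      if (!acc.elements().isEmpty()) {
        batches.add(acc.elements());
        acc = new TableCellAccumulator(maxCells);
      }
      if (!acc.tryAdd(row)) {
        // The row alone exceeds the limit; emit it as its own batch.
        batches.add(List.of(row));
      }
    }
    if (!acc.elements().isEmpty()) {
      batches.add(acc.elements());
    }
    return batches;
  }

  public static void main(String[] args) {
    Map<String, String> record1 = Map.of("a", "1", "b", "2", "c", "3");
    Map<String, String> record2 = Map.of("a", "1", "b", "2", "c", "3", "d", "4");

    TableCellAccumulator acc = new TableCellAccumulator(100);
    acc.tryAdd(record1);
    System.out.println("size after record 1: " + acc.currentSize()); // 3
    acc.tryAdd(record2);
    System.out.println("size after record 2: " + acc.currentSize()); // 8, not 3 + 4 = 7

    System.out.println("batches with limit 7: " + batchRows(List.of(record1, record2), 7).size());
  }
}
```

With a limit of 7 cells, a purely additive estimate (3 + 4) would keep both
records in one batch, while the recomputed size (2 rows x 4 columns = 8) forces
a split. That gap is the case I am asking about.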
> GroupIntoBatches should support byte-size batches
> -------------------------------------------------
>
> Key: BEAM-12378
> URL: https://issues.apache.org/jira/browse/BEAM-12378
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-core
> Reporter: Reuven Lax
> Priority: P2
> Time Spent: 40m
> Remaining Estimate: 0h
>