[
https://issues.apache.org/jira/browse/BEAM-12378?focusedWorklogId=600467&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-600467
]
ASF GitHub Bot logged work on BEAM-12378:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 21/May/21 16:06
Start Date: 21/May/21 16:06
Worklog Time Spent: 10m
Work Description: anantdamle commented on pull request #14852:
URL: https://github.com/apache/beam/pull/14852#issuecomment-846066857
@reuvenlax This is quite interesting. I was working on something similar,
with one key difference:
This PR assumes that adding a new record does not change the size of the
records already accumulated.
For example, say I'm accumulating records created by flattening a
nested-repeated field and computing a table:
record 1 contains 3 fields, whereas record 2 contains 4 (due to a difference
in the repeated/array elements),
so the system should recompute the size of the whole accumulated batch. How
do we handle such a situation with this PR?
I am using something like
[BatchBySize](https://github.com/GoogleCloudPlatform/auto-data-tokenize/blob/main/dlp/src/main/java/com/google/cloud/solutions/autotokenize/pipeline/dlp/GroupByBatchSize.java),
which accepts a
[BatchAccumulator](https://github.com/GoogleCloudPlatform/auto-data-tokenize/blob/main/dlp/src/main/java/com/google/cloud/solutions/autotokenize/pipeline/dlp/BatchAccumulator.java)
and offloads the batch size computation to it.
Does this sound interesting? Happy to provide commits if you feel my
approach makes sense.
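
To make the concern concrete, here is a minimal sketch of the idea, not the code from either repository. It assumes a batch is sized as a table of rows x distinct columns, so a wider record changes the size of every row already accumulated. The names `BatchSizeSketch`, `Accumulator`, `TableCellAccumulator`, `tryAdd`, `drain` and `maxCells` are hypothetical and chosen only for illustration; they are not Beam APIs and not the linked `BatchAccumulator` interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class BatchSizeSketch {

  /** Hypothetical contract: the accumulator owns the batch and decides what fits. */
  interface Accumulator<T> {
    /** Returns false (and leaves the batch unchanged) if the element does not fit. */
    boolean tryAdd(T element);

    /** Returns the accumulated elements and resets the accumulator. */
    List<T> drain();
  }

  /** Assumed sizing rule: batch size = rows x distinct column names. */
  static class TableCellAccumulator implements Accumulator<List<String>> {
    private final int maxCells;
    private final List<List<String>> rows = new ArrayList<>();
    private final Set<String> columns = new LinkedHashSet<>();

    TableCellAccumulator(int maxCells) {
      this.maxCells = maxCells;
    }

    @Override
    public boolean tryAdd(List<String> recordColumns) {
      // Recompute the size of the whole batch as if the record were added,
      // because a new column widens every existing row as well.
      Set<String> merged = new LinkedHashSet<>(columns);
      merged.addAll(recordColumns);
      int candidateCells = (rows.size() + 1) * merged.size();

      // An empty batch always accepts its first record, even an oversized one,
      // so that a single large record cannot block the accumulator forever.
      if (!rows.isEmpty() && candidateCells > maxCells) {
        return false;
      }
      rows.add(recordColumns);
      columns.addAll(recordColumns);
      return true;
    }

    int currentCells() {
      return rows.size() * columns.size();
    }

    @Override
    public List<List<String>> drain() {
      List<List<String>> out = new ArrayList<>(rows);
      rows.clear();
      columns.clear();
      return out;
    }
  }

  public static void main(String[] args) {
    TableCellAccumulator acc = new TableCellAccumulator(8);
    acc.tryAdd(Arrays.asList("a", "b", "c"));
    System.out.println(acc.currentCells());
    acc.tryAdd(Arrays.asList("a", "b", "c", "d"));
    System.out.println(acc.currentCells());
    // A third record would need 3 rows x 4 columns = 12 cells, so it is rejected.
    System.out.println(acc.tryAdd(Arrays.asList("a")));
  }
}
```

Under this assumed rule, a 3-field record followed by a 4-field record yields a batch of 8 cells rather than 3 + 4 = 7, which is why summing per-element sizes would undercount here. A caller would flush with `drain()` whenever `tryAdd` returns false and then retry the rejected element on the empty accumulator.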
Issue Time Tracking
-------------------
Worklog Id: (was: 600467)
Time Spent: 40m (was: 0.5h)
> GroupIntoBatches should support byte-size batches
> -------------------------------------------------
>
> Key: BEAM-12378
> URL: https://issues.apache.org/jira/browse/BEAM-12378
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-core
> Reporter: Reuven Lax
> Priority: P2
> Time Spent: 40m
> Remaining Estimate: 0h
>