[GitHub] [beam] anantdamle commented on pull request #14852: [BEAM-12378] GroupIntoBatches improvements

GitBox Sun, 23 May 2021 07:44:30 -0700


anantdamle commented on pull request #14852:
URL: https://github.com/apache/beam/pull/14852#issuecomment-846574298



   @reuvenlax thanks for the heads-up on the time complexity of reading BagState (I 
honestly didn't know). 
   Could you suggest the best way to handle the following scenario?
   
   Context: I need to represent nested, repeated data as flat tables.
   For example, let's say I have two input records that need to be accumulated:
   <table>
   <thead>
   <tr>
   <th>record-1</th>
   <th>record-2</th>
   </tr>
   </thead>
   <tbody>
   <tr>
   <td><pre>{ id: 1, num: 1.23, arr: ["a", "b"] }</pre></td>
   <td><pre>{ id: 2, num: 4.56, arr: ["d", "e", "f"] }</pre></td>
   </tr>
   </tbody>
   </table>
   
   The flattened output is the table below. The size of the accumulated batch 
is not the sum of the serialized sizes of the individual elements: every row is 
laid out against the combined header list, so the accumulator needs access to 
at least that header list to compute the effective size of the accumulated batch.
   <table>
   <thead>
   <tr>
   <th>id</th>
   <th>num</th>
   <th>arr[0]</th>
   <th>arr[1]</th>
   <th>arr[2]</th>
   </tr>
   </thead>
   <tbody>
   <tr>
   <td>1</td>
   <td>1.23</td>
   <td>"a"</td>
   <td>"b"</td>
   <td>[empty]</td>
   </tr>
   <tr>
   <td>2</td>
   <td>4.56</td>
   <td>"d"</td>
   <td>"e"</td>
   <td>"f"</td>
   </tr>
   </tbody>
   </table>
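
To make the size mismatch concrete, here is a minimal plain-Python sketch of the idea. It does not use any Beam API. The names `flatten` and `BatchAccumulator` are hypothetical, and counting table cells is only an assumed stand-in for whatever "effective size" means in the real pipeline.

```python
from dataclasses import dataclass, field


def flatten(record: dict) -> dict:
    """Flatten one level of repeated fields: {"arr": ["a"]} -> {"arr[0]": "a"}."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, list):
            for i, item in enumerate(value):
                flat[f"{key}[{i}]"] = item
        else:
            flat[key] = value
    return flat


@dataclass
class BatchAccumulator:
    """Hypothetical accumulator: keeps only the header set and the row count."""
    headers: set = field(default_factory=set)
    rows: int = 0

    def add(self, record: dict) -> None:
        # Merge this record's flattened column names into the batch header set.
        self.headers.update(flatten(record))
        self.rows += 1

    def effective_cells(self) -> int:
        # In the flat table every row is padded out to the full header list.
        return self.rows * len(self.headers)


records = [
    {"id": 1, "num": 1.23, "arr": ["a", "b"]},
    {"id": 2, "num": 4.56, "arr": ["d", "e", "f"]},
]

acc = BatchAccumulator()
for r in records:
    acc.add(r)

# Summing per-record sizes undercounts: 4 + 5 = 9 cells,
# while the accumulated table has 2 rows x 5 columns = 10 cells.
per_record_cells = sum(len(flatten(r)) for r in records)
print(per_record_cells, acc.effective_cells())
```

Under these assumptions, adding a record with a longer `arr` also grows the padded size of every earlier row, which is why a per-element size cannot simply be summed.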
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
