lostluck opened a new issue, #22900:
URL: https://github.com/apache/beam/issues/22900

   ### What would you like to happen?
   
   The Go SDK currently reads an *entire* value before permitting a 
ProcessElement call to occur. This is fine for single small elements, but for 
KV<K, Iter<V>> elements (the post-GBK type), it can lead to extreme heap usage, 
depending on the runner behaviour and the size of each Iter<V>.
   
   Instead of pre-decoding every V in the iterator before processing, we can 
decode each V as the user iterates through the stream, and process on 
demand. This moves the burden of, and control over, heap usage after a GBK to 
the user.
   
   Note that for this approach to be effective, we can't add any caching to 
allow for re-iteration without re-decoding. This prevents the approach from 
being used with general CoGBK types due to their typical implementation in Beam 
SDKs (implementing them in terms of a KV<K,Iter<colID+value>> coder), which 
requires the SDK to re-iterate through values, depending on the iterator in 
question, in order to filter them as necessary. It also prevents the approach 
from being used in a fused stage where a single GBK result PCollection is read 
by multiple PTransforms.
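
   The CoGBK limitation can be sketched as follows. This is a simplified 
model, not Beam's code: `tagged`, `onePass`, `filterCol`, and `collect` are 
hypothetical names, and `onePass` stands in for a stream that is decoded on 
demand with no caching. Two per-column iterators share the one-pass source, and 
the first one to run drains it.

```go
package main

import "fmt"

// tagged is a hypothetical stand-in for a decoded colID+value pair.
type tagged struct {
	col   int
	value string
}

// onePass returns a single-pass source over elems, mimicking a stream that
// is decoded on demand and cannot be rewound.
func onePass(elems []tagged) func(*tagged) bool {
	i := 0
	return func(out *tagged) bool {
		if i >= len(elems) {
			return false
		}
		*out = elems[i]
		i++
		return true
	}
}

// filterCol exposes only the values tagged with col, skipping (and thereby
// discarding) every other element it reads from src.
func filterCol(src func(*tagged) bool, col int) func(*string) bool {
	return func(out *string) bool {
		var t tagged
		for src(&t) {
			if t.col == col {
				*out = t.value
				return true
			}
		}
		return false
	}
}

// collect drains an iterator into a slice.
func collect(iter func(*string) bool) []string {
	var out []string
	var s string
	for iter(&s) {
		out = append(out, s)
	}
	return out
}

func main() {
	src := onePass([]tagged{{0, "a"}, {1, "x"}, {0, "b"}, {1, "y"}})
	left := collect(filterCol(src, 0))  // consumes the whole stream
	right := collect(filterCol(src, 1)) // finds nothing left to read
	fmt.Println(left, right)            // prints [a b] []
}
```

   Recovering the column 1 values would require either caching the skipped 
elements or re-reading the stream, which is exactly what the on-demand approach 
gives up.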
   
   However, such an approach does work for post lifted combines, reshuffles, 
and all single GBKs, which makes it a worthwhile specialized optimization.
   
   A prototype of this approach reduced pipeline RAM usage to between 50% and 
33% of previous usage with int streams, with one user reporting as little as 
5% of previous usage (though they had a very hefty decoded value relative to 
the read-in data). The magnitude of the benefit will be strongly 
data-dependent, but in all cases it should improve garbage collection 
behavior, with a commensurate reduction in GC overhead.
   
   ### Issue Priority
   
   Priority: 2
   
   ### Issue Component
   
   Component: sdk-go

