lostluck opened a new issue, #22900: URL: https://github.com/apache/beam/issues/22900
### What would you like to happen?

The Go SDK currently decodes an *entire* element before invoking ProcessElement. This is fine for small single elements, but for KV<K, Iter<V>> elements (the post-GBK type) it can lead to extreme heap usage, depending on runner behaviour and the size of each Iter<V>.

Instead of pre-decoding every V in the iterator before processing, we can decode each V on demand as the user iterates through the stream (see the sketch at the end of this issue). This moves the burden of, and control over, post-GBK heap usage to the user.

Note that for this approach to be effective, we can't add any caching to allow re-iteration without re-decoding. That rules out general CoGBK types, because of how they are typically implemented in Beam SDKs (in terms of a KV<K, Iter<colID+value>> coder), which requires the SDK to re-iterate through the values and filter them according to the requested iterator. It also rules out fused stages where a single GBK result PCollection is read by multiple PTransforms. However, the approach does work for post-lifted combines, reshuffles, and all single GBKs, making it a worthwhile specialization.

A prototype of this approach yielded pipeline RAM usage of between 33% and 50% of previous usage with int streams, with one user reporting as little as 5% of previous usage (though their decoded values were very large relative to the read-in data). The magnitude of the benefit will be strongly data dependent, but in all cases it should improve garbage collection behaviour, with a commensurate reduction in GC overhead.

### Issue Priority

Priority: 2

### Issue Component

Component: sdk-go
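For illustration, here is a minimal sketch (not the actual SDK implementation) of what on-demand decoding of an iterable's values could look like, assuming the SDK's usual `func(*V) bool` iterator shape and a hypothetical length-prefixed value stream:

```go
// Minimal sketch of decoding each V on demand as the user iterates,
// instead of materializing the whole Iter<V> up front. The stream format
// here (uvarint length prefix + bytes) is illustrative only.
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
	"strings"
)

// lazyIter decodes one value from the underlying stream per call to next.
// Nothing is cached, so re-iteration is not possible; that restriction is
// what rules out CoGBK and multiply-consumed GBK results above.
type lazyIter struct {
	r *bufio.Reader
}

// next decodes the next length-prefixed string into val, returning false
// when the stream is exhausted. Only one decoded value is live at a time.
func (it *lazyIter) next(val *string) bool {
	n, err := binary.ReadUvarint(it.r)
	if err != nil {
		return false // io.EOF: no more values.
	}
	buf := make([]byte, n)
	if _, err := io.ReadFull(it.r, buf); err != nil {
		return false
	}
	*val = string(buf)
	return true
}

// encode builds a length-prefixed value stream, standing in for the wire
// form an iterable's values would arrive in.
func encode(vals ...string) io.Reader {
	var b strings.Builder
	tmp := make([]byte, binary.MaxVarintLen64)
	for _, v := range vals {
		b.Write(tmp[:binary.PutUvarint(tmp, uint64(len(v)))])
		b.WriteString(v)
	}
	return strings.NewReader(b.String())
}

func main() {
	it := &lazyIter{r: bufio.NewReader(encode("a", "bb", "ccc"))}
	var v string
	for it.next(&v) { // Heap usage stays bounded by one decoded value at a time.
		fmt.Println(v)
	}
}
```

Because nothing is cached, each decoded value becomes collectible as soon as the user advances to the next one, which is where the GC improvement described above comes from.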
