echauchot commented on a change in pull request #11055: [BEAM-9436] Improve GBK in spark structured streaming runner URL: https://github.com/apache/beam/pull/11055#discussion_r397207312
########## File path: runners/spark/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/functions/GroupAlsoByWindowViaOutputBufferFn.java ########## @@ -65,9 +65,15 @@ public GroupAlsoByWindowViaOutputBufferFn( @Override public Iterator<WindowedValue<KV<K, Iterable<InputT>>>> call( - KV<K, Iterable<WindowedValue<InputT>>> kv) throws Exception { - K key = kv.getKey(); - Iterable<WindowedValue<InputT>> values = kv.getValue(); + K key, Iterator<WindowedValue<KV<K, InputT>>> iterator) throws Exception { + + // we have to meterialize the Iterator because ReduceFnRunner.processElements expects + // ArrayList<WindowedValue<InputT>> and not Iterator<WindowedValue<KV<K, InputT>>> + ArrayList<WindowedValue<InputT>> values = new ArrayList<>(); + while (iterator.hasNext()) { + WindowedValue<KV<K, InputT>> wv = iterator.next(); + values.add(wv.withValue(wv.getValue().getValue())); Review comment: I know but as stated in my comment in the code, it is unavoidable due to the reduceFnRunner needing a list as input. Anyway I measured during the load test (see results above) in a tiny JVM and I got no OOM, I only got spill to disc of GB of data ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services