[
https://issues.apache.org/jira/browse/BEAM-9436?focusedWorklogId=408814&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-408814
]
ASF GitHub Bot logged work on BEAM-9436:
----------------------------------------
Author: ASF GitHub Bot
Created on: 24/Mar/20 14:45
Start Date: 24/Mar/20 14:45
Worklog Time Spent: 10m
Work Description: echauchot commented on pull request #11055: [BEAM-9436] Improve GBK in spark structured streaming runner
URL: https://github.com/apache/beam/pull/11055#discussion_r397207312
##########
File path: runners/spark/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/functions/GroupAlsoByWindowViaOutputBufferFn.java
##########
@@ -65,9 +65,15 @@ public GroupAlsoByWindowViaOutputBufferFn(
@Override
public Iterator<WindowedValue<KV<K, Iterable<InputT>>>> call(
- KV<K, Iterable<WindowedValue<InputT>>> kv) throws Exception {
- K key = kv.getKey();
- Iterable<WindowedValue<InputT>> values = kv.getValue();
+ K key, Iterator<WindowedValue<KV<K, InputT>>> iterator) throws Exception {
+
+ // we have to materialize the Iterator because ReduceFnRunner.processElements expects
+ // ArrayList<WindowedValue<InputT>> and not Iterator<WindowedValue<KV<K, InputT>>>
+ ArrayList<WindowedValue<InputT>> values = new ArrayList<>();
+ while (iterator.hasNext()) {
+ WindowedValue<KV<K, InputT>> wv = iterator.next();
+ values.add(wv.withValue(wv.getValue().getValue()));
Review comment:
In the previous implementation, the materialization to a list was already there.
I know about the javadoc, but as stated in my comment in the code, it is
unavoidable because ReduceFnRunner needs a list as input.
In any case, I measured this during the load test (see results above) in a tiny
JVM and got no OOM; I only saw GBs of data spilled to disk.
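
For readers without the Beam sources at hand, the materialization step in the hunk above can be sketched in plain Java. This is only an illustration under stated assumptions: `Map.Entry` stands in for Beam's `WindowedValue<KV<K, InputT>>`, and the class and method names (`MaterializeSketch`, `stripKeys`) are hypothetical and do not exist in the PR.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class MaterializeSketch {

  // Drains an iterator of (key, value) pairs into an in-memory list of bare values.
  // Assumption: Map.Entry is a stand-in for WindowedValue<KV<K, InputT>>; this is not Beam's API.
  public static <K, V> List<V> stripKeys(Iterator<Map.Entry<K, V>> iterator) {
    List<V> values = new ArrayList<>();
    while (iterator.hasNext()) {
      // Keep only the value; the key is already known to the caller.
      values.add(iterator.next().getValue());
    }
    return values;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> grouped = new ArrayList<>();
    grouped.add(new SimpleEntry<>("k", 1));
    grouped.add(new SimpleEntry<>("k", 2));
    grouped.add(new SimpleEntry<>("k", 3));

    // All values for the key are held in memory at once after this call.
    List<Integer> values = stripKeys(grouped.iterator());
    System.out.println(values); // prints [1, 2, 3]
  }
}
```

The whole group for one key ends up on the heap at the same time, which is the memory cost being discussed in this thread.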
----------------------------------------------------------------
Issue Time Tracking
-------------------
Worklog Id: (was: 408814)
Time Spent: 9h 20m (was: 9h 10m)
> Try to avoid elements list materialization in GBK
> -------------------------------------------------
>
> Key: BEAM-9436
> URL: https://issues.apache.org/jira/browse/BEAM-9436
> Project: Beam
> Issue Type: Improvement
> Components: runner-spark
> Reporter: Etienne Chauchot
> Assignee: Etienne Chauchot
> Priority: Major
> Labels: structured-streaming
> Time Spent: 9h 20m
> Remaining Estimate: 0h
>