[ 
https://issues.apache.org/jira/browse/BEAM-12335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344290#comment-17344290
 ] 

Bashir Sadjad commented on BEAM-12335:
--------------------------------------

Thanks for filing this; but I think the issue is not just with high-fanout 
transforms. If the DirectRunner does not do any fusion, then a very simple 
pipeline that for example reads record from a DB (S1), makes one transformation 
on each record (S2), and write each transformed record to a file (S3) needs to 
load the whole DB in memory before running S3. While each record can be 
processed from start to the end without needing to load any other records, if 
S1, S2, and S3 were fused.

Now, I think actually some sort of fusion happens in DirectRunner, but I don't 
know exactly where and how as described 
[here|https://lists.apache.org/thread.html/r2feeb4b61b8d1ce2a526c37f9520b93f5b62c256ef3984d94c76e12d%40%3Cuser.beam.apache.org%3E].

> Apply basic fusion to Java DirectRunner to avoid keeping all intermittent 
> results in memory 
> --------------------------------------------------------------------------------------------
>
>                 Key: BEAM-12335
>                 URL: https://issues.apache.org/jira/browse/BEAM-12335
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-direct
>            Reporter: Boyuan Zhang
>            Priority: P2
>
> Current java direct runner doesn't fuse transforms into steps. Instead, it 
> almost executes each transform one by one. It results in memory pressure when 
> any transform is high-fanout.
> We already have a simple fusion logic in Java 
> SDK(https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/graph/GreedyPipelineFuser.java).
>  Work remaining here might be:
> * Apply such fusion into DirectRunner
> * Change the DirectRunner to be able run the fused steps.
> I understand that DirectRunner doesn't expect processing large volume data 
> and changing DirectRunner execution might be a fair amount of work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to