[
https://issues.apache.org/jira/browse/BEAM-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15867722#comment-15867722
]
Chinmay Kolhatkar commented on BEAM-831:
----------------------------------------
[~thw], [~jkff], I wet through the links provided and understood
producer-consumer fusion and sibling fusion concepts.
I'm currently focusing on producer-consumer fusion optimization.
To do that here is the approach I'm considering (Still working on a POC yet, so
might change):
1. Majority of the changes would go in TranslateContext of apex runner.
2. The streams variable can hold information about Locality as well which will
later be used in TranslationContext.populateDAG method.
3. In populateDAG Api, before the streams are connected, We can traverse the
streams/operators in topological order and find out adjacent ParDo stages for
putting them in either thread local OR container local and update hte field in
streams variable with right Locality.
Only thing that I'm not sure about is when to stop the merging of ParDos...
i.e. if the DAG is like ParDo A -> ParDo B -> ParDo C -> ParDo D.
Then at time it might be efficient to merge only B & C and not merge all of
them... Any thoughts on this?
Also, has any runner already done this?
I'm also considering to update ApexFlattenOperator and put the streams in
ThreadLocal instead of Default locality. Might be a different Jira for that.
Please share your opinion.
> ParDo Chaining
> --------------
>
> Key: BEAM-831
> URL: https://issues.apache.org/jira/browse/BEAM-831
> Project: Beam
> Issue Type: Improvement
> Components: runner-apex
> Reporter: Thomas Weise
>
> Current state of Apex runner creates a plan that will place each operator in
> a separate container (which would be processes when running on a YARN
> cluster). Often the ParDo operators can be collocated in same thread or
> container. Use Apex affinity/stream locality attributes for more efficient
> execution plan.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)