Amit Sela created BEAM-1250:
-------------------------------

             Summary: Remove leaf when materializing PCollection to avoid 
re-evaluation.
                 Key: BEAM-1250
                 URL: https://issues.apache.org/jira/browse/BEAM-1250
             Project: Beam
          Issue Type: Bug
          Components: runner-spark
            Reporter: Amit Sela
            Assignee: Amit Sela


When the runner materializes a {{PCollection}} (implemented as an {{RDD}}), 
for example to create a {{PCollectionView}}, it should remove the materialized 
{{RDD}} from the "leaves" set.
The runner keeps track of leaves left unhandled in the DAG so that it can force 
an action on them. {{Write}}, for one, is implemented as a sequence of 
{{ParDo}}s, which the runner implements via {{mapPartitions}}, a lazy 
transformation, so we need to force an action.
Materializing an {{RDD}} is done via the action {{collect()}}, so there is no 
reason to keep it in the "leaves" set.
Currently, it remains in the "leaves" set, so it is forced a second time, and 
unless the {{RDD}} happens to be cached, its lineage is executed twice.
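
To illustrate the double evaluation, here is a minimal sketch in plain Java. It 
does not use Spark or the runner's real classes: {{LazyDataset}}, 
{{EvaluationContext}}, {{materialize}}, {{forceLeaves}} and the 
{{removeLeafOnMaterialize}} flag are all hypothetical names invented for this 
example. {{LazyDataset}} stands in for an uncached {{RDD}} whose lineage is 
recomputed on every action, and the flag only toggles between the current and 
the proposed behaviour.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical model of the issue; none of these types exist in Beam or Spark.
class LeafTrackingSketch {

    // Stand-in for an uncached RDD: every action recomputes the whole lineage.
    static final class LazyDataset {
        private final Supplier<List<Integer>> lineage;
        int evaluations = 0;

        LazyDataset(Supplier<List<Integer>> lineage) {
            this.lineage = lineage;
        }

        // An "action": runs the lineage and counts how often that happened.
        List<Integer> collect() {
            evaluations++;
            return lineage.get();
        }
    }

    // Stand-in for the runner state that tracks unhandled leaves.
    static final class EvaluationContext {
        final Set<LazyDataset> leaves = new LinkedHashSet<>();
        final boolean removeLeafOnMaterialize;

        EvaluationContext(boolean removeLeafOnMaterialize) {
            this.removeLeafOnMaterialize = removeLeafOnMaterialize;
        }

        LazyDataset register(LazyDataset dataset) {
            leaves.add(dataset);
            return dataset;
        }

        // Materializing already runs an action, so the leaf needs no forcing later.
        List<Integer> materialize(LazyDataset dataset) {
            List<Integer> values = dataset.collect();
            if (removeLeafOnMaterialize) {
                leaves.remove(dataset);
            }
            return values;
        }

        // Force an action on every leaf that is still tracked.
        void forceLeaves() {
            for (LazyDataset leaf : leaves) {
                leaf.collect();
            }
            leaves.clear();
        }
    }

    // Returns how many times the lineage ran for one materialize + force cycle.
    static int evaluationsFor(boolean removeLeafOnMaterialize) {
        EvaluationContext ctx = new EvaluationContext(removeLeafOnMaterialize);
        LazyDataset dataset = ctx.register(new LazyDataset(() -> {
            List<Integer> out = new ArrayList<>();
            for (int i = 1; i <= 3; i++) {
                out.add(i * 2);
            }
            return out;
        }));
        ctx.materialize(dataset); // e.g. to build a view
        ctx.forceLeaves();        // end-of-pipeline forcing of leftover leaves
        return dataset.evaluations;
    }

    public static void main(String[] args) {
        System.out.println("leaf kept:    " + evaluationsFor(false) + " evaluation(s)");
        System.out.println("leaf removed: " + evaluationsFor(true) + " evaluation(s)");
    }
}
```

In this sketch, keeping the leaf yields two evaluations of the lineage and 
removing it yields one, which is the saving this issue is after. Caching the 
dataset would also hide the second evaluation, but it would not remove the 
redundant forced action.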



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
