[ https://issues.apache.org/jira/browse/BEAM-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807144#comment-15807144 ]
ASF GitHub Bot commented on BEAM-1250:
--------------------------------------
Github user asfgit closed the pull request at:
https://github.com/apache/beam/pull/1747
> Remove leaf when materializing PCollection to avoid re-evaluation.
> ------------------------------------------------------------------
>
> Key: BEAM-1250
> URL: https://issues.apache.org/jira/browse/BEAM-1250
> Project: Beam
> Issue Type: Bug
> Components: runner-spark
> Reporter: Amit Sela
> Assignee: Amit Sela
>
> When materializing a {{PCollection}} (implemented as {{RDD}}), to create a
> {{PCollectionView}} for example, the runner should remove the materialized
> {{RDD}} from the "leaves" set.
> The runner keeps track of leaves left un-handled in the DAG so that it can
> force an action on them. {{Write}}, for one, is implemented via a sequence of
> ParDos, which the runner implements via {{mapPartitions}}, a lazy
> transformation, so an action has to be forced.
> Materializing an {{RDD}} is done via the action {{collect()}}, so there is no
> reason to keep it in the "leaves" set.
> Currently, the materialized {{RDD}} remains in the "leaves" set, so an action
> is forced on it and its lineage is evaluated again. Unless the {{RDD}} happens
> to be cached, the lineage is executed twice.
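
The double evaluation described above can be illustrated with a small, self-contained sketch. It is a toy model in plain Python, not the Spark runner's code: `LazyDataset`, `EvaluationContext`, `materialize`, `force_leaves` and the `remove_materialized_leaf` flag are all hypothetical names chosen for illustration, and `collect()` here only imitates an action on an uncached RDD by counting how often the lineage is run.

```python
class LazyDataset:
    """Toy stand-in for an uncached RDD: every action re-runs the lineage."""

    def __init__(self, source):
        self.source = source
        self.evaluations = 0  # how many times the lineage was executed

    def collect(self):
        # An "action": executes the lineage and returns the elements.
        self.evaluations += 1
        return list(self.source)


class EvaluationContext:
    """Hypothetical context tracking the un-handled leaves of the DAG."""

    def __init__(self, remove_materialized_leaf):
        self.leaves = []
        self.remove_materialized_leaf = remove_materialized_leaf

    def register(self, dataset):
        # Every new dataset starts out as a leaf.
        self.leaves.append(dataset)
        return dataset

    def materialize(self, dataset):
        # Materializing (e.g. to build a view) already runs an action.
        values = dataset.collect()
        if self.remove_materialized_leaf and dataset in self.leaves:
            # Proposed behaviour: the leaf was handled, so stop tracking it.
            self.leaves.remove(dataset)
        return values

    def force_leaves(self):
        # At the end of translation, force an action on whatever is left.
        for leaf in self.leaves:
            leaf.collect()
        self.leaves.clear()


def run(remove_materialized_leaf):
    """Return how many times the lineage ran for one materialized dataset."""
    ctx = EvaluationContext(remove_materialized_leaf)
    dataset = ctx.register(LazyDataset([1, 2, 3]))
    ctx.materialize(dataset)
    ctx.force_leaves()
    return dataset.evaluations
```

In this model, `run(False)` reports two evaluations of the lineage (one from materializing, one from forcing the leftover leaf), while `run(True)` reports one.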
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)