[ https://issues.apache.org/jira/browse/BEAM-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amit Sela updated BEAM-1250:
----------------------------
Description:
When materializing a {{PCollection}} (implemented as {{RDD}}), to create a
{{PCollectionView}} for example, the runner should remove the materialized
{{RDD}} from the "leaves" set.
The runner keeps track of leaves left un-handled in the DAG to force an action
on them. {{Write}}, for one, is implemented via a sequence of ParDos, which the
runner implements via {{mapPartitions}} (a lazy transformation), so we need to
force an action.
Materializing an {{RDD}} is done via the action {{collect()}}, so there is no
reason to keep it in the "leaves" set.
Currently, it remains in the "leaves" set, so an action is forced on it again,
and unless the {{RDD}} is cached its lineage is executed twice.
was:
When materializing a {{PCollection}} (implemented as {{RDD}}), to create a
{{PCollectionView}} for example, the runner should remove the materialized
{{RDD}} from the "leaves" set.
The runner keeps track of leaves left un-handled in the DAG to force action on
them - {{Write}} for one is implemented via a sequence of {{ParDo}}s which are
implemented by the runner via {{mapPartitions}} so we need to force an action.
Materializing an {{RDD}} is done via the action {{collect()}} so no reason to
keep in "leaves" set.
Currently, it remains in the "leaves" set and so it is forced and evaluates the
lineage and if not cached it will execute twice the lineage twice (unless
caches are applied for some reason).
> Remove leaf when materializing PCollection to avoid re-evaluation.
> ------------------------------------------------------------------
>
> Key: BEAM-1250
> URL: https://issues.apache.org/jira/browse/BEAM-1250
> Project: Beam
> Issue Type: Bug
> Components: runner-spark
> Reporter: Amit Sela
> Assignee: Amit Sela
>
> When materializing a {{PCollection}} (implemented as {{RDD}}), to create a
> {{PCollectionView}} for example, the runner should remove the materialized
> {{RDD}} from the "leaves" set.
> The runner keeps track of leaves left un-handled in the DAG to force an action
> on them. {{Write}}, for one, is implemented via a sequence of ParDos, which the
> runner implements via {{mapPartitions}} (a lazy transformation), so we need to
> force an action.
> Materializing an {{RDD}} is done via the action {{collect()}}, so there is no
> reason to keep it in the "leaves" set.
> Currently, it remains in the "leaves" set, so an action is forced on it again,
> and unless the {{RDD}} is cached its lineage is executed twice.
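
To illustrate the double evaluation described above, here is a minimal sketch that does not use Spark or the runner's actual classes. {{Dataset}}, {{Context}}, {{register}}, {{materialize}} and {{computeOutputs}} are hypothetical names invented for this example; {{Dataset}} stands in for an un-cached {{RDD}} and simply counts how many times an action evaluates its lineage.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LeavesSketch {

    /** Hypothetical stand-in for an un-cached RDD: every action re-runs the lineage. */
    static final class Dataset {
        final String name;
        int evaluations = 0;

        Dataset(String name) {
            this.name = name;
        }

        /** An action that returns the contents (empty here) and evaluates the lineage. */
        List<String> collect() {
            evaluations++;
            return new ArrayList<>();
        }

        /** An action forced only for its side effects; it also evaluates the lineage. */
        void forceAction() {
            evaluations++;
        }
    }

    /** Hypothetical stand-in for the runner's bookkeeping of un-handled leaves. */
    static final class Context {
        private final Set<Dataset> leaves = new LinkedHashSet<>();
        private final boolean removeOnMaterialize;

        Context(boolean removeOnMaterialize) {
            this.removeOnMaterialize = removeOnMaterialize;
        }

        /** Every new dataset starts out as a leaf. */
        Dataset register(String name) {
            Dataset dataset = new Dataset(name);
            leaves.add(dataset);
            return dataset;
        }

        /** Materialize via collect(); the proposed fix drops the dataset from the leaves. */
        List<String> materialize(Dataset dataset) {
            List<String> values = dataset.collect();
            if (removeOnMaterialize) {
                leaves.remove(dataset);
            }
            return values;
        }

        /** Force an action on every leaf that is still un-handled. */
        void computeOutputs() {
            for (Dataset leaf : leaves) {
                leaf.forceAction();
            }
            leaves.clear();
        }
    }

    /** Returns how many times the lineage of a materialized dataset gets evaluated. */
    static int evaluationsFor(boolean removeOnMaterialize) {
        Context context = new Context(removeOnMaterialize);
        Dataset viewSource = context.register("viewSource");
        context.materialize(viewSource); // e.g. to build a side-input view
        context.computeOutputs();        // end of pipeline translation
        return viewSource.evaluations;
    }

    public static void main(String[] args) {
        System.out.println("leaf kept:    " + evaluationsFor(false)); // prints 2
        System.out.println("leaf removed: " + evaluationsFor(true));  // prints 1
    }
}
```

Under these assumptions, keeping the materialized dataset in the leaves set evaluates it twice, while removing it at materialization time evaluates it once.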