[
https://issues.apache.org/jira/browse/BEAM-8335?focusedWorklogId=352181&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352181
]
ASF GitHub Bot logged work on BEAM-8335:
----------------------------------------
Author: ASF GitHub Bot
Created on: 02/Dec/19 20:36
Start Date: 02/Dec/19 20:36
Worklog Time Spent: 10m
Work Description: KevinGG commented on pull request #10236: [BEAM-8335]
Add method to PipelineInstrument to create background caching pipeline
URL: https://github.com/apache/beam/pull/10236#discussion_r352809707
##########
File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
##########
@@ -65,25 +78,37 @@ def __init__(self, pipeline, options=None):
         pipeline.to_runner_api(use_fake_coders=True),
         pipeline.runner,
         options)
+
+    self._background_caching_pipeline = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+
     # Snapshot of original pipeline information.
     (self._original_pipeline_proto,
      self._original_context) = self._pipeline_snap.to_runner_api(
          return_context=True, use_fake_coders=True)
     # All compute-once-against-original-pipeline fields.
-    self._has_unbounded_source = has_unbounded_source(self._pipeline_snap)
+    self._unbounded_sources = unbounded_sources(
+        self._background_caching_pipeline)
     # TODO(BEAM-7760): once cache scope changed, this is not needed to manage
     # relationships across pipelines, runners, and jobs.
     self._pcolls_to_pcoll_id = pcolls_to_pcoll_id(self._pipeline_snap,
                                                   self._original_context)
+    # A list of all the unbounded PCollections in the pipeline. These unbounded
+    # pcollections will be cached.
+    self._unbounded_pcolls = unbounded_pcolls(self._unbounded_sources)
+
     # A mapping from PCollection id to python id() value in user defined
     # pipeline instance.
     (self._pcoll_version_map,
      self._cacheables,
      # A dict from pcoll_id to variable name of the referenced PCollection.
      # (Dict[str, str])
-     self._cacheable_var_by_pcoll_id) = cacheables(self.pcolls_to_pcoll_id)
+     self._cacheable_var_by_pcoll_id) = cacheables(self.pcolls_to_pcoll_id,
+                                                   self._unbounded_pcolls)
Review comment:
Hi Sam, the `cacheables` are analyzed from the static global scope, so all
PCollections should have already been included. There is no need to find and
include them here.
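
As a rough illustration of what "analyzed from the static global scope" can mean, here is a minimal sketch. It is not Beam's actual `cacheables` implementation: `FakePCollection`, `find_cacheables`, and `user_scope` are hypothetical names invented for this example, and the real code works on Beam PCollections in the user's main module. The idea is that scanning every variable in the user-defined scope already finds all named PCollections, bounded or unbounded, so no extra list needs to be passed in.

```python
class FakePCollection:
    """Hypothetical stand-in for a Beam PCollection (not the real class)."""

    def __init__(self, label):
        self.label = label


def find_cacheables(namespace, pcoll_type=FakePCollection):
    """Map id(obj) -> variable name for each pcoll_type instance in namespace."""
    return {
        id(value): name
        for name, value in namespace.items()
        if isinstance(value, pcoll_type)
    }


# Simulated user-defined global scope (an assumption for this sketch).
user_scope = {
    'words': FakePCollection('read'),
    'counts': FakePCollection('count'),
    'threshold': 10,  # not a PCollection, so it is skipped
}

found = find_cacheables(user_scope)
print(sorted(found.values()))
```

Because the scan is driven by the variables in scope rather than by the pipeline graph, a PCollection derived from an unbounded source is picked up the same way as any other, provided it is assigned to a variable.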
Issue Time Tracking
-------------------
Worklog Id: (was: 352181)
Time Spent: 37h 10m (was: 37h)
> Add streaming support to Interactive Beam
> -----------------------------------------
>
> Key: BEAM-8335
> URL: https://issues.apache.org/jira/browse/BEAM-8335
> Project: Beam
> Issue Type: Improvement
> Components: runner-py-interactive
> Reporter: Sam Rohde
> Assignee: Sam Rohde
> Priority: Major
> Time Spent: 37h 10m
> Remaining Estimate: 0h
>
> This issue tracks the work items to introduce streaming support to the
> Interactive Beam experience. This will allow users to:
> * Write and run a streaming job in IPython
> * Automatically cache records from unbounded sources
> * Add a replay experience that replays all cached records to simulate the
> original pipeline execution
> * Add controls to play/pause/stop/step individual elements from the cached
> records
> * Add ability to inspect/visualize unbounded PCollections