[ https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=318688&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318688 ]

ASF GitHub Bot logged work on BEAM-7760:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 26/Sep/19 01:22
            Start Date: 26/Sep/19 01:22
    Worklog Time Spent: 10m 
      Work Description: aaltay commented on pull request #9619: [BEAM-7760] Added pipeline_instrument module
URL: https://github.com/apache/beam/pull/9619#discussion_r328400449
 
 

 ##########
 File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
 ##########
 @@ -0,0 +1,470 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module to instrument interactivity to the given pipeline.
+
+For internal use only; no backwards-compatibility guarantees.
+This module accesses the current interactive environment and analyzes the
+given pipeline to transform the original pipeline into a one-shot pipeline
+with interactivity.
+"""
+from __future__ import absolute_import
+
+import logging
+
+import apache_beam as beam
+from apache_beam.pipeline import PipelineVisitor
+from apache_beam.runners.interactive import cache_manager as cache
+from apache_beam.runners.interactive import interactive_environment as ie
+
+READ_CACHE = "_ReadCache_"
+WRITE_CACHE = "_WriteCache_"
+
+
+class PipelineInstrument(object):
+  """A pipeline instrument for a pipeline to be executed by the interactive runner.
+
+  This module should never depend on the underlying runner that the
+  interactive runner delegates to. It instruments the original pipeline
+  instance directly by appending or replacing transforms with the help of
+  cache. It provides interfaces to recover the states of the original
+  pipeline. It's the interactive runner's responsibility to coordinate
+  supported underlying runners to run the instrumented pipeline and recover
+  the original pipeline states if needed.
+  """
+
+  def __init__(self, pipeline, options=None):
+    self._pipeline = pipeline
+    # The cache manager should be initiated outside of this module and
+    # outside of run_pipeline() from the interactive runner so that its
+    # lifespan can cover multiple runs in the interactive environment. Owned
+    # by the interactive_environment module. Not owned by this module.
+    # TODO(BEAM-7760): change the scope of cache to be owned by runner or
+    # pipeline result instances because a pipeline is not 1:1 correlated to a
+    # running job. Only complete and read-only cache is valid across multiple
+    # jobs. Other cache instances should have their own scopes. Some design
+    # change should support only runner.run(pipeline) pattern rather than
+    # pipeline.run([runner]) and a runner can only run at most one pipeline
+    # at a time. Otherwise, the result returned by run() is the only 1:1
+    # anchor.
+    self._cache_manager = ie.current_env().cache_manager()
+
+    # Invoke a round trip through the runner API. This makes sure the Pipeline
+    # proto is stable. The snapshot of pipeline will not be mutated within this
+    # module and can be used to recover original pipeline if needed.
+    self._pipeline_snap = beam.pipeline.Pipeline.from_runner_api(
+        pipeline.to_runner_api(use_fake_coders=True),
+        pipeline.runner,
+        options)
+    # Snapshot of original pipeline information.
+    (self._original_pipeline_proto,
+     self._original_context) = self._pipeline_snap.to_runner_api(
+         return_context=True, use_fake_coders=True)
+
+    # All compute-once-against-original-pipeline fields.
+    self._has_unbounded_source = has_unbounded_source(self._pipeline_snap)
+    # TODO(BEAM-7760): once cache scope changed, this is not needed to manage
+    # relationships across pipelines, runners, and jobs.
+    self._pcolls_to_pcoll_id = pcolls_to_pcoll_id(self._pipeline_snap,
+                                                  self._original_context)
+
+    # A mapping from PCollection id to python id() value in user defined
+    # pipeline instance.
+    (self._pcoll_version_map,
+     self._cacheables) = cacheables(self.pcolls_to_pcoll_id())
+
+    # A dict from cache key to PCollection that is read from cache.
+    # If exists, caller should reuse the PCollection read. If not, caller
+    # should create new transform and track the PCollection read from cache.
+    # (Dict[str, AppliedPTransform]).
+    self._cached_pcoll_read = {}
+
+  def instrumented_pipeline_proto(self):
+    """Always returns a new instance of portable instrumented proto."""
+    return self._pipeline.to_runner_api(use_fake_coders=True)
+
+  def has_unbounded_source(self):
+    """Checks if a given pipeline has any source that is unbounded.
+
+    The function directly checks the source transform definitions instead
+    of pvalues in the pipeline. Thus, manually setting the is_bounded field
+    of a PCollection or switching streaming mode will not affect this
+    function's result. The result is deterministic for a given pipeline
+    definition.
+    """
+    return self._has_unbounded_source
+
+  def cacheables(self):
+    """Finds cacheable PCollections from the pipeline.
+
+    The function only treats the result as cacheables since there is no
+    guarantee whether the cache-desired PCollection has been cached or
+    not. A PCollection desires caching when it's bound to a user-defined
+    variable in the source code. Otherwise, the PCollection is neither
+    reusable nor introspectable, which nullifies the need for caching.
+    """
+    return self._cacheables
+
+  def pcolls_to_pcoll_id(self):
+    """Returns a dict mapping str(PCollection)s to IDs."""
+    return self._pcolls_to_pcoll_id
+
+  def original_pipeline_proto(self):
+    """Returns the portable proto representation of the pipeline before
+    instrumentation."""
+    return self._original_pipeline_proto
+
+  def original_pipeline(self):
+    """Returns a snapshot of the pipeline before instrumentation."""
+    return self._pipeline_snap
+
+  def instrument(self):
+    """Instruments original pipeline with cache.
+
+    For a cacheable output PCollection, if a cache for the key doesn't exist,
+    do _write_cache(); for a cacheable input PCollection, if a cache for the
+    key exists, do _read_cache(). No instrumentation in any other situation.
+
+    Modifies:
+      self._pipeline
+    """
+    self._preprocess()
+    cacheable_inputs = set()
+
+    class InstrumentVisitor(PipelineVisitor):
+      """Visitor utilizes cache to instrument the pipeline."""
+
+      def __init__(self, pin):
+        self._pin = pin
+
+      def enter_composite_transform(self, transform_node):
+        self.visit_transform(transform_node)
+
+      def visit_transform(self, transform_node):
+        cacheable_inputs.update(self._pin._cacheable_inputs(transform_node))
+
+    v = InstrumentVisitor(self)
+    self._pipeline.visit(v)
+    # Create ReadCache transforms.
+    for cacheable_input in cacheable_inputs:
+      self._read_cache(cacheable_input)
+    # Replace/wire inputs w/ cached PCollections from ReadCache transforms.
+    self._replace_with_cached_inputs()
+    # Write cache for all cacheables.
+    for _, cacheable in self.cacheables().items():
+      self._write_cache(cacheable['pcoll'])
+    # TODO(BEAM-7760): prune subgraphs that don't need to be executed.
+
+  def _preprocess(self):
+    """Pre-processes the pipeline.
+
+    Since the pipeline instance in the class might not be the same instance
+    defined in the user code, the pre-process figures out the relationship
+    between cacheable PCollections of the two instances by replacing the
+    'pcoll' fields in the cacheable dictionary with ones from the running
+    instance.
+    """
+
+    class PreprocessVisitor(PipelineVisitor):
+
+      def __init__(self, pin):
+        self._pin = pin
+
+      def enter_composite_transform(self, transform_node):
+        self.visit_transform(transform_node)
+
+      def visit_transform(self, transform_node):
+        for in_pcoll in transform_node.inputs:
+          self._process(in_pcoll)
+        for out_pcoll in transform_node.outputs.values():
+          self._process(out_pcoll)
+
+      def _process(self, pcoll):
+        pcoll_id = self._pin.pcolls_to_pcoll_id().get(str(pcoll), '')
+        if pcoll_id in self._pin._pcoll_version_map:
+          cacheable_key = self._pin._cacheable_key(pcoll)
+          if (cacheable_key in self._pin.cacheables() and
+              self._pin.cacheables()[cacheable_key]['pcoll'] != pcoll):
+            self._pin.cacheables()[cacheable_key]['pcoll'] = pcoll
+
+    v = PreprocessVisitor(self)
+    self._pipeline.visit(v)
+
+  def _write_cache(self, pcoll):
+    """Caches a cacheable PCollection.
+
+    For the given PCollection, appends a sub-transform that materializes the
+    PCollection through a sink into the cache implementation. The cache write
+    is not immediate. It happens when the runner runs the transformed
+    pipeline, and is thus not usable for this run as intended. It's the
+    caller's responsibility to make sure the PCollection is indeed cacheable.
+    Otherwise, cache resources might be wasted. If a cache with the
+    corresponding key exists, this is a no-op, since a cache write is only
+    needed when the last cache is invalidated. And if a cache is invalidated,
+    the PCollection's new key is guaranteed to not exist in the current cache.
+
+    Modifies:
+      self._pipeline
+    """
+    if pcoll.pipeline is not self._pipeline:
+      return
+    key = self.cache_key(pcoll)
+    if not self._cache_manager.exists('full', key):
 
 Review comment:
   What is being checked here? Could you also clarify the log message below?
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 318688)
    Time Spent: 9h 40m  (was: 9.5h)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> ----------------------------------------------------------------------------
>
>                 Key: BEAM-7760
>                 URL: https://issues.apache.org/jira/browse/BEAM-7760
>             Project: Beam
>          Issue Type: New Feature
>          Components: examples-python
>            Reporter: Ning Kang
>            Assignee: Ning Kang
>            Priority: Major
>          Time Spent: 9h 40m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user-defined variables in a pipeline when
> running the pipeline with the interactive runner in Jupyter notebooks.
> [Interactive Beam|https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]
> has been caching and using caches of "leaf" PCollections for interactive
> execution in Jupyter notebooks.
> The interactive execution is currently supported so that when appending new
> transforms to an existing pipeline for a new run, the executed part of the
> pipeline doesn't need to be re-executed.
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and the pipeline to execute around "leaf"
> PCollections is that when a PCollection is consumed by a sink with no
> output, the pipeline built for execution will miss the subgraph generating
> and consuming that PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty
> pipeline.
> Caching around PCollections bound to user-defined variables, and replacing
> transforms with the source and sink of caches, could resolve the pipeline
> to execute properly under the interactive execution scenario. Also, a
> cached PCollection can now be traced back to user code and used for user
> data visualization if the user wants to do it.
> E.g.,
> {code:java}
> // ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
>                   options=pipeline_options)
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(topic_path)
> result = p.run()
> // ...
> visualize(messages){code}
>  The interactive runner automatically figures out that PCollection
> {code:java}
> messages{code}
> created by
> {code:java}
> p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
> should be cached and reused if the notebook user appends more transforms.
>  And once the pipeline gets executed, the user could use any
> visualize(PCollection) module to visualize the data statically (batch) or
> dynamically (stream).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
