[
https://issues.apache.org/jira/browse/BEAM-8457?focusedWorklogId=339081&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339081
]
ASF GitHub Bot logged work on BEAM-8457:
----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Nov/19 00:14
Start Date: 06/Nov/19 00:14
Worklog Time Spent: 10m
Work Description: KevinGG commented on pull request #9885: [BEAM-8457]
Label Dataflow jobs from Notebook
URL: https://github.com/apache/beam/pull/9885#discussion_r342858336
##########
File path: sdks/python/apache_beam/runners/dataflow/dataflow_runner.py
##########
@@ -360,6 +360,16 @@ def visit_transform(self, transform_node):
def run_pipeline(self, pipeline, options):
"""Remotely executes entire pipeline or parts reachable from node."""
+ # Label goog-dataflow-notebook if pipeline is initiated from interactive
+ # runner.
+ if pipeline.interactive:
Review comment:
I see your point! Yes, I have the
[capability](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/interactive_environment.py#L131)
to check whether the currently interpreted code is running in a notebook. This
branch will need a rebase against master to pick up those changes.
To clarify the process:
When a DataflowRunner tries to run a job from a given pipeline,
1. Check if the module `interactive_environment` is imported by checking the
`sys.modules` dictionary;
2. Check if `current_env().is_in_notebook`;
3. If yes, label the job.
I think there is a bit of a trade-off here:
1. What we have now: determine whether the job was started from a pipeline
that was originally bundled with an Interactive Runner.
This is done with a string comparison.
2. Deduce whether the job was started from a notebook environment.
This introduces the [interactive] dependencies, including ipython, into
DataflowRunner.
It will label Dataflow jobs from any pipeline, originally bundled with an
arbitrary runner, in any kind of ipython notebook, as long as the
`interactive_environment` module in the `interactive` package has been
(transitively) imported.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 339081)
Time Spent: 7h (was: 6h 50m)
> Instrument Dataflow jobs that are launched from Notebooks
> ---------------------------------------------------------
>
> Key: BEAM-8457
> URL: https://issues.apache.org/jira/browse/BEAM-8457
> Project: Beam
> Issue Type: Improvement
> Components: runner-py-interactive
> Reporter: Ning Kang
> Assignee: Ning Kang
> Priority: Major
> Fix For: 2.17.0
>
> Time Spent: 7h
> Remaining Estimate: 0h
>
> Dataflow needs the capability to tell how many Dataflow jobs are launched
> from the Notebook environment, i.e., the Interactive Runner.
> # Change the pipeline.run() API to allow supplying a runner and an option
> parameter so that a pipeline initially bundled w/ an interactive runner can
> be directly run by other runners from notebook.
> # Implicitly add the necessary source information through user labels when
> the user does p.run(runner=DataflowRunner()).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)