[
https://issues.apache.org/jira/browse/BEAM-10603?focusedWorklogId=485432&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-485432
]
ASF GitHub Bot logged work on BEAM-10603:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 16/Sep/20 22:16
Start Date: 16/Sep/20 22:16
Worklog Time Spent: 10m
Work Description: rohdesamuel commented on a change in pull request #12799:
URL: https://github.com/apache/beam/pull/12799#discussion_r489784307
##########
File path: sdks/python/apache_beam/runners/interactive/utils.py
##########
@@ -34,7 +34,8 @@ def to_element_list(
     reader,  # type: Generator[Union[TestStreamPayload.Event, WindowedValueHolder]]
     coder,  # type: Coder
     include_window_info,  # type: bool
-    n=None  # type: int
+    n=None,  # type: int
+    include_teststream_events=False,  # type: bool
Review comment:
Gotcha, changed to `include_time_events`.
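For context, a rough sketch of how a flag like this can behave (illustrative
only, not the actual utils.to_element_list implementation; it assumes
TestStreamPayload from Beam's runner API protos):

    from apache_beam.portability.api.beam_runner_api_pb2 import TestStreamPayload

    def to_element_list(reader, coder, include_window_info, n=None,
                        include_time_events=False):
      # Sketch only: TestStream events carry processing-time information. With
      # include_time_events=True they are passed through so the caller can
      # drive a duration-based limiter; otherwise they are skipped.
      count = 0
      for item in reader:
        if isinstance(item, TestStreamPayload.Event):
          if include_time_events:
            yield item
          continue
        if n is not None and count >= n:
          break
        # Decoding and window-info handling are elided in this sketch.
        yield item
        count += 1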
##########
File path: sdks/python/apache_beam/runners/interactive/recording_manager.py
##########
@@ -114,14 +113,19 @@ def read(self, tail=True):
     # all elements from the cache were read. In the latter situation, it may be
     # the case that the pipeline was still running. Thus, another invocation of
     # `read` will yield new elements.
+    count_limiter = CountLimiter(self._n)
+    time_limiter = ProcessingTimeLimiter(self._duration_secs)
+    limiters = (count_limiter, time_limiter)
     for e in utils.to_element_list(reader,
                                    coder,
                                    include_window_info=True,
-                                   n=self._n):
-      for l in limiters:
-        l.update(e)
-
-      yield e
+                                   n=self._n,
+                                   include_teststream_events=True):
+      if isinstance(e, TestStreamPayload.Event):
+        time_limiter.update(e)
+      else:
+        count_limiter.update(e)
+        yield e
Review comment:
Yep, it's to make sure we only count decoded elements. I added a comment to
make it clearer.
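As a minimal sketch of the counting side, assuming a limiter interface of
update()/is_triggered() (not the actual capture_limiters implementation):

    class CountLimiter(object):
      """Sketch: triggers once a maximum number of elements has been seen."""
      def __init__(self, max_count):
        self._max_count = max_count
        self._count = 0

      def update(self, e):
        # In the read loop above this is called only for decoded elements, so
        # raw TestStream events never inflate the count.
        self._count += 1

      def is_triggered(self):
        return self._max_count is not None and self._count >= self._max_count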
##########
File path: sdks/python/apache_beam/runners/interactive/recording_manager.py
##########
@@ -256,7 +259,7 @@ def describe(self):
     size = sum(
         cache_manager.size('full', s.cache_key) for s in self._streams.values())
-    return {'size': size, 'start': self._start}
+    return {'size': size}
Review comment:
Because the start time wasn't correct when only a background caching job is
started. In that case there wouldn't be a new `Recording`, so the start time
would be 0. I think this also cleans up the logic a bit (no more `min`-ing over
all the start times).
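For illustration only, the failure mode being avoided looks roughly like this
(hypothetical names, not code from the PR):

    # Hypothetical: if a background caching job starts without creating a new
    # Recording, that recording's start stays 0, and aggregating with min()
    # drags the reported start time down to 0.
    start = min(r.start for r in recordings)  # evaluates to 0, which is misleading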
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 485432)
Time Spent: 31h (was: 30h 50m)
> Large Source Recording for Interactive Runner
> ---------------------------------------------
>
> Key: BEAM-10603
> URL: https://issues.apache.org/jira/browse/BEAM-10603
> Project: Beam
> Issue Type: Improvement
> Components: runner-py-interactive
> Reporter: Sam Rohde
> Assignee: Sam Rohde
> Priority: P1
> Time Spent: 31h
> Remaining Estimate: 0h
>
> This changes the Interactive Runner to create a long-running background
> caching job that is decoupled from the user pipeline. When a user invokes
> collect() or show(), the runner reads from the cache to compute the requested
> PCollections. Previously, the user had to wait for the cache to be fully
> written. This allows the user to start experimenting immediately.
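> A minimal usage sketch of the workflow this enables (the source, options, and
> recording settings shown here are assumptions for illustration, not part of
> this issue):
>
>     import apache_beam as beam
>     from apache_beam.options.pipeline_options import PipelineOptions
>     from apache_beam.runners.interactive import interactive_beam as ib
>     from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
>
>     p = beam.Pipeline(InteractiveRunner(),
>                       options=PipelineOptions(streaming=True))
>     words = p | 'Read' >> beam.io.ReadFromPubSub(
>         topic='projects/<project>/topics/<topic>')
>     upper = words | 'Upper' >> beam.Map(lambda msg: msg.decode('utf-8').upper())
>
>     # The background caching job records the unbounded source independently of
>     # the user pipeline; show()/collect() read whatever has been cached so far
>     # instead of blocking until the recording finishes.
>     ib.options.recording_duration = '60s'
>     ib.show(upper)          # displays the cached elements computed so far
>     df = ib.collect(upper)  # the same data as a pandas DataFrame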
--
This message was sent by Atlassian Jira
(v8.3.4#803005)