[ https://issues.apache.org/jira/browse/BEAM-11629?focusedWorklogId=549975&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-549975 ]
ASF GitHub Bot logged work on BEAM-11629:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 08/Feb/21 23:54
Start Date: 08/Feb/21 23:54
Worklog Time Spent: 10m
Work Description: dmkozh commented on pull request #13739:
URL: https://github.com/apache/beam/pull/13739#issuecomment-775544426
> Even with the latest changes, this is still not writing the windowing
> information (including timestamps) to the cache.
That's exactly the intent of the change - we don't want to cache trivial
windowing information.
> Maybe it would be helpful to understand what the objective of this change
> is?
The objective is described in the attached ticket - basically, we don't want
to cache redundant information at all, as it adds a huge overhead of ~500
bytes/record. It can be somewhat reduced, but it's still hundreds of bytes.
There may be some terminology confusion - by 'batch' pipelines I initially
meant the pipelines which don't ever care about windowing as they process all
the data at once.
If there is a better way to figure out if the pipeline doesn't care about
windowing, I could use that instead. Also, since this is an environment setting
now, it should be pretty hard to get unexpected results (though for users who
don't care about windowing there won't be an immediate benefit either...)
----------------------------------------------------------------
Issue Time Tracking
-------------------
Worklog Id: (was: 549975)
Time Spent: 2h 40m (was: 2.5h)
> Optimize the cache storage for InteractiveRunner
> ------------------------------------------------
>
> Key: BEAM-11629
> URL: https://issues.apache.org/jira/browse/BEAM-11629
> Project: Beam
> Issue Type: Improvement
> Components: runner-py-interactive
> Reporter: Dmytro Kozhevin
> Assignee: Dmytro Kozhevin
> Priority: P2
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> Currently InteractiveRunner wraps every record of the cached PCollection in a
> WindowedValue. There are two problems with this:
> 1) The windowing information is unnecessary for the batch-mode runs
> (everything is in the same global window).
> 2) Since the cache is stored as text, we pickle the WindowedValue, which adds
> ~500 bytes of data to every record (e.g. a cache of just 1000000 integers
> would take ~500MB instead of ~4MB).
> These issues significantly slow down the interactive runs for data with lots
> of small rows.
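
The per-record overhead described above can be illustrated with a rough sketch. It uses only the Python standard library and a plain dict as a hypothetical stand-in for a windowed record; it is not Beam's WindowedValue, and the byte counts it produces will differ from the ~500 bytes reported in the issue. The helper names are assumptions made for illustration only.

```python
import pickle


def pickled_size(obj) -> int:
    """Return the number of bytes produced by pickling obj."""
    return len(pickle.dumps(obj))


def wrap_with_window_metadata(value) -> dict:
    """Hypothetical stand-in for a windowed record (NOT Beam's WindowedValue).

    It carries the kind of metadata that is identical for every record
    when all data sits in a single global window.
    """
    return {
        "value": value,
        "timestamp_micros": 0,
        "windows": ("GlobalWindow",),
        "pane_info": {"is_first": True, "is_last": True, "index": 0},
    }


def overhead_per_record(value) -> int:
    """Extra pickled bytes caused by wrapping a value in window metadata."""
    return pickled_size(wrap_with_window_metadata(value)) - pickled_size(value)


def estimated_cache_bytes(num_records: int, bytes_per_record: int) -> int:
    """Back-of-the-envelope cache size: record count times bytes per record."""
    return num_records * bytes_per_record


if __name__ == "__main__":
    print("raw pickled size:", pickled_size(42))
    print("overhead per record:", overhead_per_record(42))
    # Figures taken from the issue text: 1000000 records at ~500 bytes each.
    print("estimated cache bytes:", estimated_cache_bytes(1_000_000, 500))
```

With the figures quoted in the issue, 1000000 records at ~500 bytes each come to roughly 500MB, which is why dropping the redundant wrapper matters for data with many small rows.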