[ 
https://issues.apache.org/jira/browse/BEAM-11629?focusedWorklogId=554556&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-554556
 ]

ASF GitHub Bot logged work on BEAM-11629:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Feb/21 02:00
            Start Date: 19/Feb/21 02:00
    Worklog Time Spent: 10m 
      Work Description: dmkozh commented on pull request #13739:
URL: https://github.com/apache/beam/pull/13739#issuecomment-781761183


   IIUC the elements just get pickled, at least for e.g. DirectRunner (I don't 
think this should change between runners though). I agree that in theory there 
is no need to have more than a few bytes per record. In practice we have the 
test_stream.WindowedValueHolder, which holds a WindowedValue, which in turn 
holds a list of windows (I suppose it can have multiple elements), a timestamp 
(which may have a different type) and pane info. Most of the space is taken by 
the type/field names. I was able to cut some of this, but still got 400+ bytes 
per record. Ultimately, I lack knowledge/docs on the possible values of 
WindowedValues to optimize this further - I'm not sure how to come up with a 
schema for all the windowing information.
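
   To illustrate why the names dominate, here is a rough sketch that compares 
the pickled size of a bare integer with the same integer wrapped in nested 
metadata. The wrap() helper and all of its field names are hypothetical 
stand-ins built from plain dicts; this is not Beam's WindowedValue, so the 
exact byte counts will differ from the real 400-500 bytes.

```python
import pickle


def wrap(value):
    """Hypothetical stand-in for a cached record (NOT Beam's real classes).

    Nested dicts keyed by field name mimic "value + windows + timestamp +
    pane info"; every field name is an assumption made for illustration.
    """
    return {
        "windowed_value": {
            "value": value,
            "timestamp_micros": 0,
            "windows": [{"start_micros": -(2**63), "end_micros": 2**63 - 1}],
            "pane_info": {
                "is_first": True,
                "is_last": True,
                "timing": "UNKNOWN",
                "index": 0,
                "nonspeculative_index": 0,
            },
        }
    }


def pickled_size(obj):
    """Number of bytes pickle needs for a single object."""
    return len(pickle.dumps(obj))


raw_size = pickled_size(42)
wrapped_size = pickled_size(wrap(42))
overhead = wrapped_size - raw_size

# Each record is pickled on its own, so the names are repeated per record.
num_records = 1_000_000
print(f"raw: {raw_size} B, wrapped: {wrapped_size} B, overhead: {overhead} B")
print(f"extra bytes for {num_records} records: {overhead * num_records}")
```

   Because each record is pickled separately, the field names are written out 
again for every record, which is why the overhead scales linearly with the 
number of rows.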
   
   In the end, I still think it's useful to strip the windowing information 
when it's not needed, even if it were much smaller. Interactive analysis with 
windowing seems a bit exotic outside of the tutorials, and any user who does 
not need windowing could gain a significant performance boost.
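
   A minimal sketch of that idea, under the assumption that a batch run gives 
every record identical default metadata: store only the bare values when that 
holds, and re-wrap them on read. DEFAULT_META, wrap(), can_strip(), encode() 
and decode() are hypothetical names invented for this illustration, not the 
code in the pull request and not Beam APIs.

```python
import pickle

# Hypothetical default metadata shared by every record of a batch run
# (single global window, default timestamp, no pane firing).
DEFAULT_META = {"timestamp_micros": 0, "windows": ["global"], "pane_info": "no_firing"}


def wrap(value, meta=None):
    """Attach windowing metadata to a value (illustrative stand-in only)."""
    return {"value": value, "meta": dict(DEFAULT_META) if meta is None else meta}


def can_strip(records):
    """True when no record carries anything beyond the default metadata."""
    return all(r["meta"] == DEFAULT_META for r in records)


def encode(records):
    """Pickle each record separately; drop the metadata when it is redundant."""
    strip = can_strip(records)
    blobs = [pickle.dumps(r["value"] if strip else r) for r in records]
    return strip, blobs


def decode(strip, blobs):
    """Inverse of encode(): re-wrap bare values with the default metadata."""
    payloads = [pickle.loads(b) for b in blobs]
    return [wrap(p) for p in payloads] if strip else payloads


records = [wrap(i) for i in range(1000)]
stripped, blobs = encode(records)
full_bytes = sum(len(pickle.dumps(r)) for r in records)
stripped_bytes = sum(len(b) for b in blobs)
print(f"stripped={stripped}: {stripped_bytes} B instead of {full_bytes} B")
```

   Records with non-default windowing fall back to the full encoding, so the 
round trip stays lossless in both cases.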




Issue Time Tracking
-------------------

    Worklog Id:     (was: 554556)
    Time Spent: 3h 20m  (was: 3h 10m)

> Optimize the cache storage for InteractiveRunner
> ------------------------------------------------
>
>                 Key: BEAM-11629
>                 URL: https://issues.apache.org/jira/browse/BEAM-11629
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-py-interactive
>            Reporter: Dmytro Kozhevin
>            Assignee: Dmytro Kozhevin
>            Priority: P2
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Currently InteractiveRunner wraps every record of the cached PCollection in 
> a WindowedValue. There are 2 problems with this:
> 1) The windowing information is unnecessary for the batch-mode runs 
> (everything is in the same global window).
> 2) Since the cache is stored as text, we pickle the WindowedValue, which adds 
> ~500 bytes of data to every record (e.g. a cache of just 1000000 integers 
> would take ~500MB instead of ~4MB).
> These issues significantly slow down the interactive runs for data with lots 
> of small rows.



