[ 
https://issues.apache.org/jira/browse/BEAM-5428?focusedWorklogId=317561&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-317561
 ]

ASF GitHub Bot logged work on BEAM-5428:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 24/Sep/19 14:58
            Start Date: 24/Sep/19 14:58
    Worklog Time Spent: 10m 
      Work Description: mxm commented on issue #9418: [BEAM-5428] Implement 
cross-bundle user state caching in the Python SDK
URL: https://github.com/apache/beam/pull/9418#issuecomment-534598857
 
 
   Benchmark with a synthetic source with parallelism of 1 on my local machine:
   
   ```python
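   import apache_beam as beam
   from apache_beam.io.flink.flink_streaming_impulse_source import FlinkStreamingImpulseSource
   from apache_beam.transforms import userstate
   from apache_beam.typehints import KV

   max_items = 100000
   num_keys = 10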
   
   class GenerateInput(beam.DoFn):
   
     def process(self, byte_array):
       key_string = byte_array
       key_int = int(key_string)
       yield (key_int % num_keys, "value" + key_string)
   
   class StatefulProcessing(beam.DoFn):
     count_state_spec = userstate.CombiningValueStateSpec(
       'count', beam.coders.IterableCoder(beam.coders.VarIntCoder()), sum)
   
     def process(self, kv, count=beam.DoFn.StateParam(count_state_spec)):
       k, v = kv
       count.add(len(v))
   
   # `p` is a beam.Pipeline constructed elsewhere (not shown in the original comment).
   (p
    | "Generate data" >> FlinkStreamingImpulseSource()
        .set_message_count(max_items)
        .set_interval_ms(0)
        .with_output_types(bytes)
    | "Reshuffle" >> beam.Reshuffle()
    | "Format Data" >> beam.ParDo(GenerateInput()).with_output_types(KV[int, str])
    | "Stateful Processing" >> beam.ParDo(StatefulProcessing())
   )
   
   ```
   
   ```
   # elements: 100,000 
   # bundle size: 10
   # number of unique keys: 10
   Took 205.37934804  with experiments=state_cache_size=0
   Took 129.933783054 with experiments=state_cache_size=10
   ```
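
The ratio between the two timings reported above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the variable names are made up for illustration, and the numbers are copied verbatim from the output above, in whatever unit the benchmark reported.

```python
# Timings copied from the benchmark output above.
no_cache = 205.37934804      # experiments=state_cache_size=0
with_cache = 129.933783054   # experiments=state_cache_size=10

# How many times faster the cached run is, and the relative runtime saved.
speedup = no_cache / with_cache
reduction = 1 - with_cache / no_cache

print(f"speedup: {speedup:.2f}x, runtime reduction: {reduction:.1%}")
```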
   
   The version with caching is about 1.58 times faster (205.4 vs. 129.9).
   These results are consistent across multiple runs. With only 10 unique
   keys this is admittedly an extreme case, but it shows the potential of
   the caching.
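
To illustrate why so few unique keys is a favorable case, here is a toy model that counts how many state reads would have to go to the runner with and without a cache that survives across bundles. It is not the implementation from the PR: the function name, the LRU eviction policy, and the assumption that state is otherwise only remembered within a single bundle are all hypothetical simplifications.

```python
from collections import OrderedDict


def count_backend_reads(num_elements, num_keys, bundle_size, cache_size):
    """Count simulated state reads that reach the runner (toy model only)."""
    cross_bundle = OrderedDict()  # hypothetical LRU cache kept across bundles
    reads = 0
    for start in range(0, num_elements, bundle_size):
        in_bundle = set()  # assumption: state is always reused within a bundle
        for i in range(start, min(start + bundle_size, num_elements)):
            key = i % num_keys  # same key assignment as in the benchmark
            if key in in_bundle:
                continue  # already fetched earlier in this bundle
            in_bundle.add(key)
            if key in cross_bundle:
                cross_bundle.move_to_end(key)  # cache hit: refresh LRU position
                continue
            reads += 1  # cache miss: the state must be fetched from the runner
            if cache_size > 0:
                cross_bundle[key] = True
                if len(cross_bundle) > cache_size:
                    cross_bundle.popitem(last=False)  # evict least recently used
    return reads


reads_without_cache = count_backend_reads(100000, 10, 10, cache_size=0)
reads_with_cache = count_backend_reads(100000, 10, 10, cache_size=10)
print(reads_without_cache, reads_with_cache)
```

In this model every key fits into the cache, so only the first access per key misses. The model counts reads only and says nothing about wall-clock time, which also includes work that caching does not remove.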
 


Issue Time Tracking
-------------------

    Worklog Id:     (was: 317561)
    Time Spent: 19.5h  (was: 19h 20m)

> Implement cross-bundle state caching.
> -------------------------------------
>
>                 Key: BEAM-5428
>                 URL: https://issues.apache.org/jira/browse/BEAM-5428
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-harness
>            Reporter: Robert Bradshaw
>            Assignee: Maximilian Michels
>            Priority: Major
>          Time Spent: 19.5h
>  Remaining Estimate: 0h
>
> Tech spec: 
> [https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.7ghoih5aig5m]
> Relevant document: 
> [https://docs.google.com/document/d/1ltVqIW0XxUXI6grp17TgeyIybk3-nDF8a0-Nqw-s9mY/edit]
> Mailing list link: 
> [https://lists.apache.org/thread.html/caa8d9bc6ca871d13de2c5e6ba07fdc76f85d26497d95d90893aa1f6@%3Cdev.beam.apache.org%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
