[ 
https://issues.apache.org/jira/browse/BEAM-5519?focusedWorklogId=226829&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-226829
 ]

ASF GitHub Bot logged work on BEAM-5519:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Apr/19 18:00
            Start Date: 12/Apr/19 18:00
    Worklog Time Spent: 10m 
      Work Description: kyle-winkelman commented on issue #6511: [BEAM-5519] 
Remove call to groupByKey in Spark Streaming.
URL: https://github.com/apache/beam/pull/6511#issuecomment-482667518
 
 
   I have spent some time trying to understand the Nexmark performance 
tests, and I believe I have narrowed down the issue. The Nexmark 
BoundedEventSource splits based on numEventsGenerator (which is 100). Before 
this PR, the first GroupByKey in these queries would repartition the RDD to 
the default parallelism (not 100). With this PR, the RDD stays partitioned at 
100 through the first GroupByKey. I believe this is what is affecting the 
performance. I would say we could accept this degradation in performance, 
because in real use, when we ask a Source to split itself, we expect it to 
split in the way we asked. The exception would be if there are a large number 
of sources that tend not to split well.
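 
   To make the partition-count effect concrete, here is a toy model of the 
behavior described above (plain Python, not the Beam or Spark API; the 
constants `NUM_SOURCE_SPLITS` and `DEFAULT_PARALLELISM` and the helper 
`group_by_key` are illustrative, chosen only to mirror the two scenarios):

```python
# Toy model of the partitioning behavior discussed above.
# Not Beam/Spark code; names and constants are illustrative.

NUM_SOURCE_SPLITS = 100    # BoundedEventSource splits (numEventsGenerator)
DEFAULT_PARALLELISM = 8    # e.g. a small spark.default.parallelism

def group_by_key(records, num_partitions):
    """Hash-partition (key, value) pairs into num_partitions buckets,
    grouping values per key within each bucket."""
    partitions = [dict() for _ in range(num_partitions)]
    for key, value in records:
        bucket = partitions[hash(key) % num_partitions]
        bucket.setdefault(key, []).append(value)
    return partitions

# 1000 events spread over only 10 distinct keys.
records = [(i % 10, i) for i in range(1000)]

# Before the PR: the first GroupByKey repartitions to default parallelism.
before = group_by_key(records, DEFAULT_PARALLELISM)

# With the PR: the upstream partitioning (100 splits) is retained, so at
# most 10 of the 100 partitions are non-empty and the rest are overhead.
after = group_by_key(records, NUM_SOURCE_SPLITS)

print(len(before), len(after))  # 8 100
```

   With only 10 distinct keys, keeping 100 partitions leaves at least 90 of 
them empty, which would mean many near-empty Spark tasks and could account 
for the measured degradation.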
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 226829)
    Time Spent: 4.5h  (was: 4h 20m)

> Spark Streaming Duplicated Encoding/Decoding Effort
> ---------------------------------------------------
>
>                 Key: BEAM-5519
>                 URL: https://issues.apache.org/jira/browse/BEAM-5519
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Kyle Winkelman
>            Assignee: Kyle Winkelman
>            Priority: Major
>              Labels: spark, spark-streaming, triaged
>             Fix For: 2.13.0
>
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> When using the SparkRunner in streaming mode, there is a call to groupByKey 
> followed by a call to updateStateByKey. BEAM-1815 fixed an issue where this 
> used to cause 2 shuffles, but it still causes 2 encode/decode cycles.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
