[
https://issues.apache.org/jira/browse/BEAM-5519?focusedWorklogId=226829&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-226829
]
ASF GitHub Bot logged work on BEAM-5519:
----------------------------------------
Author: ASF GitHub Bot
Created on: 12/Apr/19 18:00
Start Date: 12/Apr/19 18:00
Worklog Time Spent: 10m
Work Description: kyle-winkelman commented on issue #6511: [BEAM-5519]
Remove call to groupByKey in Spark Streaming.
URL: https://github.com/apache/beam/pull/6511#issuecomment-482667518
I have spent a little bit of time trying to understand the Nexmark
performance tests. I believe I have narrowed down the issue a little bit. The
Nexmark BoundedEventSource splits based on the numEventsGenerator (which is
100). Before this PR, the first time there is a GroupByKey in these queries the
RDD would be partitioned to the default parallelism (not 100). With this PR,
the first time there is a GroupByKey in these queries the RDD would stay
partitioned at 100. I believe this is what is affecting the performance. I
would say we could accept this degradation in performance, because in real use
we should honor the split a Source produces when we ask it to split itself,
unless there are a large number of sources that tend not to split well.
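To make the partitioning difference concrete, here is an illustrative sketch (plain Python, not Beam or Spark code; all names are hypothetical) of how the number of source splits can dictate post-shuffle partitioning:

```python
# Sketch: hash-partitioning records by key, mimicking what a Spark
# HashPartitioner does during a groupByKey shuffle.

def hash_partition(records, num_partitions):
    """Assign each (key, value) record to a partition by key hash."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

# 100 records spread over a handful of keys, standing in for the
# events produced by the Nexmark BoundedEventSource.
records = [(f"key{i % 7}", i) for i in range(100)]

# Before the PR: the first groupByKey repartitions the RDD down to
# the default parallelism (say, 4 executor cores).
before = hash_partition(records, 4)

# After the PR: the RDD keeps the 100 partitions created by the 100
# source splits, many of them tiny, so per-partition overhead grows.
after = hash_partition(records, 100)

print(len(before), len(after))  # 4 100
```

The point of the sketch is only that the partition count, not the data volume, changes between the two behaviors.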
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 226829)
Time Spent: 4.5h (was: 4h 20m)
> Spark Streaming Duplicated Encoding/Decoding Effort
> ---------------------------------------------------
>
> Key: BEAM-5519
> URL: https://issues.apache.org/jira/browse/BEAM-5519
> Project: Beam
> Issue Type: Bug
> Components: runner-spark
> Reporter: Kyle Winkelman
> Assignee: Kyle Winkelman
> Priority: Major
> Labels: spark, spark-streaming, triaged
> Fix For: 2.13.0
>
> Time Spent: 4.5h
> Remaining Estimate: 0h
>
> When using the SparkRunner in streaming mode, there is a call to groupByKey
> followed by a call to updateStateByKey. BEAM-1815 fixed an issue where this
> used to cause 2 shuffles, but it still causes 2 encode/decode cycles.
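A conceptual sketch of the duplicated effort described above (plain Python, not the SparkRunner implementation; the counting coder is hypothetical): groupByKey pays one encode/decode cycle for the shuffle, and updateStateByKey pays a second one for its state.

```python
import pickle

class CountingCoder:
    """Toy coder that counts how often values are encoded/decoded."""
    def __init__(self):
        self.encodes = 0
        self.decodes = 0
    def encode(self, value):
        self.encodes += 1
        return pickle.dumps(value)
    def decode(self, data):
        self.decodes += 1
        return pickle.loads(data)

coder = CountingCoder()
element = ("user1", [1, 2, 3])

# Cycle 1: groupByKey shuffles elements, encoding on the map side
# and decoding on the reduce side.
shuffled = coder.decode(coder.encode(element))

# Cycle 2: updateStateByKey serializes its state again, and the next
# micro-batch decodes it.
restored = coder.decode(coder.encode(shuffled))

print(coder.encodes, coder.decodes)  # 2 2
```

Eliminating the groupByKey, as this PR does, would leave only the second cycle.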
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)