[ 
https://issues.apache.org/jira/browse/BEAM-5519?focusedWorklogId=196381&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-196381
 ]

ASF GitHub Bot logged work on BEAM-5519:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Feb/19 20:08
            Start Date: 08/Feb/19 20:08
    Worklog Time Spent: 10m 
      Work Description: kyle-winkelman commented on issue #6511: [BEAM-5519] 
Remove call to groupByKey in Spark Streaming.
URL: https://github.com/apache/beam/pull/6511#issuecomment-461930712
 
 
   I would like to finish this because I think it's an improvement, especially 
to readability. In my simple tests, as well as those in validatesRunner, 
everything appears to be working well.
   
   One question: I moved from mapPartitions to map in most places to improve 
readability, and because preserving partitioning is no longer required (we are 
no longer doing groupByKey followed by updateStateByKey). The need to preserve 
partitioning was noted in these comments: `// using mapPartitions allows to 
preserve the partitioner and avoid unnecessary shuffle downstream.` and `// we 
use mapPartitions with the RDD API because its the only available API that 
allows to preserve partitioning`. From my research, this will not impact 
performance because we don't do any costly initialization in any of the 
`CoderHelpers` or `TranslationUtils` functions. Does anyone have a reason not 
to do this?
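
   To illustrate the performance point above, here is a minimal sketch in plain 
Java. It is not Beam or Spark code: partitions are modeled as nested lists, and 
`newEncoder` is a hypothetical stand-in for a cheap, stateless helper. It only 
shows that a per-element (map-style) and a per-partition (mapPartitions-style) 
transformation give the same results and differ only in how often any setup 
runs, which matters only when that setup is costly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class MapVsMapPartitions {
    // Counts how many times the (hypothetical) setup step runs.
    public static int setupCalls = 0;

    // Hypothetical stand-in for a cheap, stateless helper; not a Beam API.
    public static Function<Integer, String> newEncoder() {
        setupCalls++;
        return i -> "v" + i;
    }

    // map-style: setup happens inside the per-element function.
    public static List<List<String>> mapStyle(List<List<Integer>> partitions) {
        List<List<String>> out = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            List<String> encoded = new ArrayList<>();
            for (Integer element : partition) {
                encoded.add(newEncoder().apply(element));
            }
            out.add(encoded);
        }
        return out;
    }

    // mapPartitions-style: setup happens once per partition.
    public static List<List<String>> mapPartitionsStyle(List<List<Integer>> partitions) {
        List<List<String>> out = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            Function<Integer, String> encoder = newEncoder();
            List<String> encoded = new ArrayList<>();
            for (Integer element : partition) {
                encoded.add(encoder.apply(element));
            }
            out.add(encoded);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two modeled partitions holding three elements in total.
        List<List<Integer>> partitions =
            Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3));

        setupCalls = 0;
        List<List<String>> perElement = mapStyle(partitions);
        int perElementSetups = setupCalls;

        setupCalls = 0;
        List<List<String>> perPartition = mapPartitionsStyle(partitions);
        int perPartitionSetups = setupCalls;

        System.out.println(perElement.equals(perPartition)); // prints true
        System.out.println(perElementSetups + " vs " + perPartitionSetups); // prints 3 vs 2
    }
}
```

   In this model the setup runs once per element instead of once per 
partition, so the two styles cost about the same whenever the setup is trivial.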
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 196381)
    Time Spent: 2.5h  (was: 2h 20m)

> Spark Streaming Duplicated Encoding/Decoding Effort
> ---------------------------------------------------
>
>                 Key: BEAM-5519
>                 URL: https://issues.apache.org/jira/browse/BEAM-5519
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Kyle Winkelman
>            Assignee: Kyle Winkelman
>            Priority: Major
>              Labels: spark, spark-streaming, triaged
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> When using the SparkRunner in streaming mode, there is a call to groupByKey 
> followed by a call to updateStateByKey. BEAM-1815 fixed an issue where this 
> used to cause 2 shuffles, but it still causes 2 encode/decode cycles.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
