[ 
https://issues.apache.org/jira/browse/BEAM-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vaclav Plajt updated BEAM-6332:
-------------------------------
    Description: The Combine transformation is translated into Spark's RDD API in the 
[GroupCombineFunctions|https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/GroupCombineFunctions.java]
 `combinePerKey` and `combineGlobally` methods. Both methods use byte arrays as 
the intermediate state of the aggregation so that it can be transferred over the 
network. As a result, the intermediate aggregation value is serialized and 
deserialized every time a new element is added to the aggregation. This is 
unnecessary and should be avoided.  (was: Combine transformation is translated into Spark's RDD API in 
[GroupCombineFunctions](https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/GroupCombineFunctions.java)
 `combinePerKey` and `combineGlobally` methods. Both methods use byte arrays as 
intermediate state of aggregation so they can be transferred over network. That 
leads to serialization and de-serialization of intermediate aggregation value 
every time new element is added to aggregation. That is unnecessary and should 
be avoided.)

> Avoid unnecessary serialization steps when executing combine transform
> ----------------------------------------------------------------------
>
>                 Key: BEAM-6332
>                 URL: https://issues.apache.org/jira/browse/BEAM-6332
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-spark
>            Reporter: Vaclav Plajt
>            Assignee: Vaclav Plajt
>            Priority: Major
>
> The Combine transformation is translated into Spark's RDD API in the 
> [GroupCombineFunctions|https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/GroupCombineFunctions.java]
>  `combinePerKey` and `combineGlobally` methods. Both methods use byte arrays 
> as the intermediate state of the aggregation so that it can be transferred 
> over the network. As a result, the intermediate aggregation value is 
> serialized and deserialized every time a new element is added to the 
> aggregation. This is unnecessary and should be avoided.
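
The cost described above can be illustrated with a minimal, self-contained
sketch. It is not the Beam or Spark code: the class and method names are
hypothetical, a `long` sum stands in for a real accumulator, and `ByteBuffer`
stands in for a coder. It only contrasts keeping the intermediate state as
bytes with keeping it as an object and encoding once at the boundary.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Beam or Spark.
public class CombineSerializationSketch {
    // Counts every encode/decode call so the two strategies can be compared.
    static int serdeCalls = 0;

    // Stand-in for a coder: accumulator -> bytes.
    static byte[] encode(long acc) {
        serdeCalls++;
        return ByteBuffer.allocate(Long.BYTES).putLong(acc).array();
    }

    // Stand-in for a coder: bytes -> accumulator.
    static long decode(byte[] bytes) {
        serdeCalls++;
        return ByteBuffer.wrap(bytes).getLong();
    }

    // State is held as a byte array, so each added element costs
    // one decode and one encode.
    static byte[] combineWithByteState(List<Long> values) {
        byte[] state = encode(0L);
        for (long v : values) {
            long acc = decode(state);
            state = encode(acc + v);
        }
        return state;
    }

    // State is held as an in-memory value and encoded only once,
    // at the point where it would be handed over for transfer.
    static byte[] combineWithObjectState(List<Long> values) {
        long acc = 0L;
        for (long v : values) {
            acc += v;
        }
        return encode(acc);
    }

    public static void main(String[] args) {
        List<Long> values = Arrays.asList(1L, 2L, 3L, 4L);

        serdeCalls = 0;
        combineWithByteState(values);
        System.out.println("byte-array state, encode/decode calls: " + serdeCalls);

        serdeCalls = 0;
        combineWithObjectState(values);
        System.out.println("object state, encode/decode calls: " + serdeCalls);
    }
}
```

In this sketch the number of encode/decode calls grows with the number of
elements in the first variant and stays constant in the second. How the
runner should actually defer serialization to the shuffle boundary is left
to the fix for this issue.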



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
