[
https://issues.apache.org/jira/browse/BEAM-6214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ismaël Mejía resolved BEAM-6214.
--------------------------------
Resolution: Duplicate
Assignee: (was: David Moravek)
Fix Version/s: Not applicable
> Spark: CombineByKey performance
> -------------------------------
>
> Key: BEAM-6214
> URL: https://issues.apache.org/jira/browse/BEAM-6214
> Project: Beam
> Issue Type: Improvement
> Components: runner-spark
> Affects Versions: 2.8.0
> Reporter: David Moravek
> Priority: Major
> Labels: performance
> Fix For: Not applicable
>
> Original Estimate: 5h
> Remaining Estimate: 5h
>
> Right now, the Spark runner's implementation of combineByKey causes at least two
> serializations/deserializations of each element, because combine accumulators are
> byte-based.
> We can do much better by letting accumulators work on user-defined Java types and
> only serializing them when we need to send them over the network.
> In order to do this, we need the following:
> * Accumulator wrapper -> holds a transient `T` value plus a byte payload that is
> filled in during serialization
> * Java serialization: the accumulator wrapper needs to implement Serializable,
> override the writeObject and readObject methods, and use the Beam coder there
> * Kryo serialization: we need a custom Kryo serializer for the accumulator wrapper
> This should be enough to hook into all possible Spark serialization interfaces; a
> rough sketch of the wrapper is included below.
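>
> As a minimal sketch only (the class and method names below are hypothetical, not the
> actual runner code, and it assumes Beam's CoderUtils helpers and the Kryo 3/4
> Serializer API), the wrapper could look like this:
> {code:java}
> import java.io.IOException;
> import java.io.ObjectInputStream;
> import java.io.ObjectOutputStream;
> import java.io.Serializable;
>
> import org.apache.beam.sdk.coders.Coder;
> import org.apache.beam.sdk.coders.CoderException;
> import org.apache.beam.sdk.util.CoderUtils;
>
> import com.esotericsoftware.kryo.Kryo;
> import com.esotericsoftware.kryo.Serializer;
> import com.esotericsoftware.kryo.io.Input;
> import com.esotericsoftware.kryo.io.Output;
>
> /**
>  * Hypothetical wrapper: keeps the accumulator as a live Java object and only
>  * encodes it with the Beam coder when Spark has to ship it over the network.
>  */
> public class ValueAndCoderLazySerializable<T> implements Serializable {
>
>   private transient T value; // live accumulator, never serialized directly
>   private Coder<T> coder;    // Beam coder, used only at (de)serialization time
>
>   public ValueAndCoderLazySerializable(T value, Coder<T> coder) {
>     this.value = value;
>     this.coder = coder;
>   }
>
>   public T getValue() {
>     return value;
>   }
>
>   // Java serialization hook: write the coder via default serialization,
>   // then the value as a coder-encoded byte payload.
>   private void writeObject(ObjectOutputStream out) throws IOException {
>     out.defaultWriteObject();
>     byte[] payload = CoderUtils.encodeToByteArray(coder, value);
>     out.writeInt(payload.length);
>     out.write(payload);
>   }
>
>   // Java deserialization hook: restore the coder, then decode the payload.
>   private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
>     in.defaultReadObject();
>     byte[] payload = new byte[in.readInt()];
>     in.readFully(payload);
>     value = CoderUtils.decodeFromByteArray(coder, payload);
>   }
>
>   /** Custom Kryo serializer to register for the wrapper class. */
>   public static class KryoAccumulatorSerializer extends Serializer<ValueAndCoderLazySerializable<?>> {
>
>     @Override
>     @SuppressWarnings("unchecked")
>     public void write(Kryo kryo, Output output, ValueAndCoderLazySerializable<?> wrapper) {
>       try {
>         // The coder itself goes through Kryo; the value goes through the Beam coder.
>         kryo.writeClassAndObject(output, wrapper.coder);
>         byte[] payload = CoderUtils.encodeToByteArray((Coder<Object>) wrapper.coder, wrapper.value);
>         output.writeInt(payload.length, true);
>         output.writeBytes(payload);
>       } catch (CoderException e) {
>         throw new RuntimeException("Could not encode accumulator", e);
>       }
>     }
>
>     @Override
>     @SuppressWarnings("unchecked")
>     public ValueAndCoderLazySerializable<?> read(
>         Kryo kryo, Input input, Class<ValueAndCoderLazySerializable<?>> type) {
>       try {
>         Coder<Object> coder = (Coder<Object>) kryo.readClassAndObject(input);
>         byte[] payload = input.readBytes(input.readInt(true));
>         return new ValueAndCoderLazySerializable<>(
>             CoderUtils.decodeFromByteArray(coder, payload), coder);
>       } catch (CoderException e) {
>         throw new RuntimeException("Could not decode accumulator", e);
>       }
>     }
>   }
> }
> {code}
> The Kryo serializer would then be registered for the wrapper class (for example via
> Spark's KryoRegistrator mechanism), so on the hot path the accumulator stays a plain
> object and is only encoded once when it is actually shuffled.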