[ https://issues.apache.org/jira/browse/BEAM-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ankit Jhalaria updated BEAM-6812:
---------------------------------
Comment: was deleted
(was: * When running a Combine.perKey transform on Spark, we noticed duplicate
results in the output for the same key.
* On further investigation, it turns out that `StreamingTransformTranslator`
uses Spark's `HashPartitioner`, which does not work correctly when
partitioning on a key that is an array.
* Note from the Spark code:
/**
 * A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
 * Java's `Object.hashCode`.
 *
 * Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
 * so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
 * produce an unexpected or incorrect result.
 */)
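
The behaviour described in the deleted comment can be reproduced outside Spark. The following is a minimal sketch, not the code from the Beam fix: the class `ArrayKeyHashDemo` and its helpers `partitionFor` and `contentKey` are hypothetical names, `partitionFor` is a simplified stand-in for hash-based partitioning, and `java.nio.ByteBuffer` is used here only as one example of a wrapper whose `equals` and `hashCode` depend on the bytes rather than on object identity.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ArrayKeyHashDemo {

    // Simplified stand-in for hash partitioning (an assumption, not Spark's code):
    // a non-negative modulo of the key's hashCode.
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Hypothetical helper: wrap the encoded key so that equality and hashing
    // are computed from the byte contents.
    static Object contentKey(byte[] encodedKey) {
        return ByteBuffer.wrap(encodedKey);
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {1, 2, 3};

        // Arrays inherit Object.equals and Object.hashCode, so two arrays with
        // the same contents are still treated as different keys.
        System.out.println("a.equals(b):          " + a.equals(b));
        System.out.println("Arrays.equals(a, b):  " + Arrays.equals(a, b));

        // Identity-based hash codes usually differ, so equal keys may be
        // routed to different partitions.
        System.out.println("raw partitions:       "
            + partitionFor(a, 8) + " vs " + partitionFor(b, 8));

        // Content-based keys always hash alike and land in the same partition.
        System.out.println("wrapped partitions:   "
            + partitionFor(contentKey(a), 8) + " vs " + partitionFor(contentKey(b), 8));
    }
}
```

With raw `byte[]` keys, the two partition numbers printed for `a` and `b` may differ from run to run, which is how the same logical key can end up combined separately. With the wrapped keys they always match.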
> Convert keys to ByteArray in Combine.perKey for Spark
> -----------------------------------------------------
>
> Key: BEAM-6812
> URL: https://issues.apache.org/jira/browse/BEAM-6812
> Project: Beam
> Issue Type: Bug
> Components: runner-spark
> Reporter: Ankit Jhalaria
> Assignee: Ankit Jhalaria
> Priority: Critical
> Time Spent: 2h
> Remaining Estimate: 0h
>