umartin opened a new pull request, #748: URL: https://github.com/apache/sedona/pull/748
## Did you read the Contributor Guide?

- Yes, I have read [Contributor Rules](https://sedona.apache.org/community/rule/) and [Contributor Development Guide](https://sedona.apache.org/community/develop/)

## Is this PR related to a JIRA ticket?

- Yes, the URL of the associated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-233. The PR name follows the format `[SEDONA-XXX] my subject`.

## What changes were proposed in this PR?

This patch changes how the deduplication gets its partition id. The previous method of getting it from TaskContext was unreliable. Now it uses mapPartitionsWithIndex. The documentation clearly states that it uses the _original_ partition id.

https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html#mapPartitionsWithIndex[U](f:(Int,Iterator[T])=%3EIterator[U],preservesPartitioning:Boolean)(implicitevidence$9:scala.reflect.ClassTag[U]):org.apache.spark.rdd.RDD[U]

Deduplication is refactored out of the join judgement into a separate DuplicatesFilter. Deduplication code that is used in sedona-flink is moved to common.

## How was this patch tested?

Unit test added

## Did this PR include necessary documentation updates?

- No, this PR does not affect any public API so no need to change the docs.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
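For readers unfamiliar with the change described under "What changes were proposed in this PR?", the following is a minimal plain-Java sketch of the idea, not the code from this patch. It has no Spark dependency: the `mapPartitionsWithIndex` helper below is a hypothetical stand-in that only mimics the shape of the Spark method, and `Box`, `Pair`, `isReportedBy` and `dedup` are illustrative names. The reference-point rule (a result pair is kept only by the partition whose half-open extent contains the lower-left corner of the envelope intersection) is an assumption about how the deduplication works, not something stated in this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical, Spark-free sketch: the partition index is handed to the
// per-partition function instead of being read from ambient task state.
final class PartitionIndexSketch {

    /** Axis-aligned rectangle; an illustrative stand-in for an envelope. */
    static final class Box {
        final double minX, minY, maxX, maxY;
        Box(double minX, double minY, double maxX, double maxY) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        }
        // Half-open test, so a point on a shared edge belongs to one cell only.
        boolean containsHalfOpen(double x, double y) {
            return x >= minX && x < maxX && y >= minY && y < maxY;
        }
    }

    /** A candidate join result: the envelopes of the two matched geometries. */
    static final class Pair {
        final String id; final Box left; final Box right;
        Pair(String id, Box left, Box right) { this.id = id; this.left = left; this.right = right; }
    }

    // Stand-in for the Spark method: applies f to each partition together
    // with that partition's index in the original partitioning.
    static <T, U> List<U> mapPartitionsWithIndex(
            List<List<T>> partitions, BiFunction<Integer, Iterator<T>, Iterator<U>> f) {
        List<U> out = new ArrayList<>();
        for (int i = 0; i < partitions.size(); i++) {
            f.apply(i, partitions.get(i).iterator()).forEachRemaining(out::add);
        }
        return out;
    }

    // Assumed rule: the reference point is the lower-left corner of the
    // intersection of the two envelopes.
    static boolean isReportedBy(Box extent, Pair p) {
        double refX = Math.max(p.left.minX, p.right.minX);
        double refY = Math.max(p.left.minY, p.right.minY);
        return extent.containsHalfOpen(refX, refY);
    }

    // Keeps each pair only in the partition that owns its reference point.
    static List<String> dedup(List<Box> extents, List<List<Pair>> partitions) {
        return mapPartitionsWithIndex(partitions, (index, it) -> {
            Box extent = extents.get(index); // extent looked up by the index passed in
            List<String> kept = new ArrayList<>();
            it.forEachRemaining(p -> { if (isReportedBy(extent, p)) kept.add(p.id); });
            return kept.iterator();
        });
    }

    // Two grid cells; pair "a" straddles both and therefore appears twice.
    static List<String> demo() {
        List<Box> extents = Arrays.asList(new Box(0, 0, 10, 10), new Box(10, 0, 20, 10));
        Pair a = new Pair("a", new Box(8, 2, 12, 4), new Box(9, 3, 13, 5));
        Pair b = new Pair("b", new Box(14, 1, 15, 2), new Box(14.5, 1.5, 16, 3));
        List<List<Pair>> partitions = Arrays.asList(Arrays.asList(a), Arrays.asList(a, b));
        return dedup(extents, partitions);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [a, b]
    }
}
```

In this sketch the duplicate of pair `a` in the second partition is dropped because its reference point (9, 3) lies in the first partition's extent. The filter is only correct if `index` really identifies the partition whose extent is being consulted, which is why the source of the partition id matters.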
