[jira] [Commented] (SEDONA-233) Incorrect results for several joins in a single stage

Jia Yu (Jira) Mon, 16 Jan 2023 15:14:06 -0800


    [ 
https://issues.apache.org/jira/browse/SEDONA-233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677538#comment-17677538
 ]


Jia Yu commented on SEDONA-233:
-------------------------------

[~umartin] Is there a way to fix this? Maybe a naive solution could be:

use cache to force Spark finish the first join, then re-use cache?

 

E.g., 
joined_df = geo_df.alias("a").join(geo_df.alias("b"), 
f.expr("st_intersects(a.geom, b.geom)")).cache()

joined_df.union(joined_df).count()

> Incorrect results for several joins in a single stage
> -----------------------------------------------------
>
>                 Key: SEDONA-233
>                 URL: https://issues.apache.org/jira/browse/SEDONA-233
>             Project: Apache Sedona
>          Issue Type: Bug
>            Reporter: Martin Andersson
>            Priority: Major
>         Attachments: image-2023-01-16-17-38-00-132.png
>
>
> Queries with several joins in a single stage leads to warning logs and 
> possibly incorrect results. One way to trigger the error is to use the union 
> operator.
> {code:java}
> joined_df = geo_df.alias("a").join(geo_df.alias("b"), 
> f.expr("st_intersects(a.geom, b.geom)"))
> joined_df.union(joined_df).count()
> {code}
> Logs:
> {code:java}
> 23/01/16 17:22:58 WARN JudgementBase: Didn't find partition extent for this 
> partition: 8
> 23/01/16 17:22:58 WARN JudgementBase: Didn't find partition extent for this 
> partition: 11
> 23/01/16 17:22:58 WARN JudgementBase: Didn't find partition extent for this 
> partition: 12
> ...
> {code}
> Partitioned joins in Sedona assumes that TaskContext.partitionId is the same 
> as the grid id used for partitioning (JudgementBase::initPartition). That 
> isn't true if Spark runs several joins in a single stage.
> In the example above, if 10 partitions are used in each join, Spark will run 
> the two joins in a single stage with 20 tasks. The second join will have 
> partition id 10-19 instead of the expected 0-9. The second join could produce 
> incorrect results. If the partition extent isn't found there is no 
> deduplication. If it maps to the wrong extent it could eliminate rows that 
> shouldn't be eliminated.
> From spark-ui. Two joins in a single stage:
> !image-2023-01-16-17-38-00-132.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (SEDONA-233) Incorrect results for several joins in a single stage

Reply via email to