[ 
https://issues.apache.org/jira/browse/SPARK-52873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18016406#comment-18016406
 ] 

Dongjoon Hyun commented on SPARK-52873:
---------------------------------------

This is backported to branch-3.5 via https://github.com/apache/spark/pull/52135

> Hint causes semi join results to vary
> -------------------------------------
>
>                 Key: SPARK-52873
>                 URL: https://issues.apache.org/jira/browse/SPARK-52873
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.4.1, 3.5.3, 3.5.5
>            Reporter: Peter Connolly
>            Assignee: Bruce Robbins
>            Priority: Major
>              Labels: correctness, pull-request-available
>             Fix For: 4.1.0, 4.0.1, 3.5.7
>
>
> I was running a query in production that was not getting the expected 
> results. I suspect that this query is using an improper join type for a semi 
> join with a non-equal predicate. 
> While my production code has no query hint I am able to reproduce the issue 
> consistently by doing a semi join with a hint in some sample code. In this 
> sample code I create two dataframes, one of which has strings and another 
> that I want to match on the first x characters using a startswith clause. I 
> realize that the substring equality join is logically redundant and could be 
> theoretically removed, but it helps greatly with performance in production. I 
> then run two queries that are nearly identical to find records that match. 
> The only difference between the queries is that the second query has a join 
> hint. Despite being the same logical query they produce different output.
> {code:java}
> import spark.implicits._
> val geohashes = Seq(
>   ("9ykchgz95z"),
>   ("abckd3kdf1"),
> ).toDF("geohash")
> .repartition(2)
> geohashes.createOrReplaceTempView("geohashes")
> val geohashesToInclude = Seq(
>   ("9ykchgz91"),
>   ("9ykchgz92"),
>   ("9ykchgz93"),
>   ("9ykchgz94"),
>   ("9ykchgz95"),
>   ("9ykchgz96"),
>   ("9ykchgz97"),
>   ("9ykchgz98"),
>   ("9ykchgz99"),
>   ("9ykchgz90"),
> ).toDF("geohash_prefix")
> .repartition(10)
> geohashesToInclude.createOrReplaceTempView("geohashes_to_include")
> spark.sql("SELECT * FROM geohashes g LEFT SEMI JOIN geohashes_to_include i ON 
> SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND 
> STARTSWITH(g.geohash, i.geohash_prefix)").show()
> +----------+
> |   geohash|
> +----------+
> |9ykchgz95z|
> +----------+
> spark.sql("SELECT /*+ SHUFFLE_HASH(i) */ * FROM geohashes g LEFT SEMI JOIN 
> geohashes_to_include i ON SUBSTRING(g.geohash, 1, 7) = 
> SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, 
> i.geohash_prefix)").show()
> +-------+
> |geohash|
> +-------+
> +-------+ {code}
> I produced the above sample on spark 3.5.3, but I think I have observed 
> similar issues in production code in spark 3.4.1 and 3.5.5 (the latest 
> version on EMR).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to