[ https://issues.apache.org/jira/browse/SPARK-52873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18016406#comment-18016406 ]
Dongjoon Hyun commented on SPARK-52873: --------------------------------------- This is backported to branch-3.5 via https://github.com/apache/spark/pull/52135 > Hint causes semi join results to vary > ------------------------------------- > > Key: SPARK-52873 > URL: https://issues.apache.org/jira/browse/SPARK-52873 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL > Affects Versions: 3.4.1, 3.5.3, 3.5.5 > Reporter: Peter Connolly > Assignee: Bruce Robbins > Priority: Major > Labels: correctness, pull-request-available > Fix For: 4.1.0, 4.0.1, 3.5.7 > > > I was running a query in production that was not getting the expected > results. I suspect that this query is using an improper join type for a semi > join with a non-equal predicate. > While my production code has no query hint I am able to reproduce the issue > consistently by doing a semi join with a hint in some sample code. In this > sample code I create two dataframes, one of which has strings and another > that I want to match on the first x characters using a startswith clause. I > realize that the substring equality join is logically redundant and could be > theoretically removed, but it helps greatly with performance in production. I > then run two queries that are nearly identical to find records that match. > The only difference between the queries is that the second query has a join > hint. Despite being the same logical query they produce different output. > {code:java} > import spark.implicits._ > val geohashes = Seq( > ("9ykchgz95z"), > ("abckd3kdf1"), > ).toDF("geohash") > .repartition(2) > geohashes.createOrReplaceTempView("geohashes") > val geohashesToInclude = Seq( > ("9ykchgz91"), > ("9ykchgz92"), > ("9ykchgz93"), > ("9ykchgz94"), > ("9ykchgz95"), > ("9ykchgz96"), > ("9ykchgz97"), > ("9ykchgz98"), > ("9ykchgz99"), > ("9ykchgz90"), > ).toDF("geohash_prefix") > .repartition(10) > geohashesToInclude.createOrReplaceTempView("geohashes_to_include") > spark.sql("SELECT * FROM geohashes g LEFT SEMI JOIN geohashes_to_include i ON > SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND > STARTSWITH(g.geohash, i.geohash_prefix)").show() > +----------+ > | geohash| > +----------+ > |9ykchgz95z| > +----------+ > spark.sql("SELECT /*+ SHUFFLE_HASH(i) */ * FROM geohashes g LEFT SEMI JOIN > geohashes_to_include i ON SUBSTRING(g.geohash, 1, 7) = > SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, > i.geohash_prefix)").show() > +-------+ > |geohash| > +-------+ > +-------+ {code} > I produced the above sample on spark 3.5.3, but I think I have observed > similar issues in production code in spark 3.4.1 and 3.5.5 (the latest > version on EMR). -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org