Josh Rosen created SPARK-19093:
----------------------------------

             Summary: LeftAntiJoin doesn't seem to resolve cached tables on 
right side
                 Key: SPARK-19093
                 URL: https://issues.apache.org/jira/browse/SPARK-19093
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Josh Rosen


See reproduction at 
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1903098128019500/2699761537338853/1395282846718893/latest.html

Consider the following:

{code}
Seq(("a", "b"), ("c", "d"))
  .toDS
  .write
  .parquet("/tmp/rows")

val df = spark.read.parquet("/tmp/rows")
df.cache()
df.count()
df.createOrReplaceTempView("rows")

spark.sql("""
  select * from rows cross join rows
""").explain(true)

spark.sql("""
  select * from rows where not exists (select * from rows)
""").explain(true)
{code}

In both plans, I'd expect that both sides of the joins would read from the 
cached table for both the cross join and anti join, but the left anti join 
produces the following plan which only reads the left side from cache and reads 
the right side via a regular non-cahced scan:

{code}
== Parsed Logical Plan ==
'Project [*]
+- 'Filter NOT exists#3994
   :  +- 'Project [*]
   :     +- 'UnresolvedRelation `rows`
   +- 'UnresolvedRelation `rows`

== Analyzed Logical Plan ==
_1: string, _2: string
Project [_1#3775, _2#3776]
+- Filter NOT predicate-subquery#3994 []
   :  +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
   :     +- Project [_1#3775, _2#3776]
   :        +- SubqueryAlias rows
   :           +- Relation[_1#3775,_2#3776] parquet
   +- SubqueryAlias rows
      +- Relation[_1#3775,_2#3776] parquet

== Optimized Logical Plan ==
Join LeftAnti
:- InMemoryRelation [_1#3775, _2#3776], true, 10000, StorageLevel(disk, memory, 
deserialized, 1 replicas)
:     +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
+- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
   +- Relation[_1#3775,_2#3776] parquet

== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, LeftAnti
:- InMemoryTableScan [_1#3775, _2#3776]
:     +- InMemoryRelation [_1#3775, _2#3776], true, 10000, StorageLevel(disk, 
memory, deserialized, 1 replicas)
:           +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: 
Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
+- BroadcastExchange IdentityBroadcastMode
   +- *Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
      +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to