[ https://issues.apache.org/jira/browse/SPARK-52339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-52339:
-----------------------------------
    Labels: correctness pull-request-available  (was: correctness)

> Relations may appear equal even though they are different
> ---------------------------------------------------------
>
>                 Key: SPARK-52339
>                 URL: https://issues.apache.org/jira/browse/SPARK-52339
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.0, 3.5.6
>            Reporter: Bruce Robbins
>            Priority: Major
>              Labels: correctness, pull-request-available
>
> For example:
> {noformat}
> // create test data
> val data = Seq((1, 2), (2, 3)).toDF("a", "b")
> data.write.mode("overwrite").csv("/tmp/test")
> val fileList1 = List.fill(2)("/tmp/test")
> val fileList2 = List.fill(3)("/tmp/test")
> val df1 = spark.read.schema("a int, b int").csv(fileList1: _*)
> val df2 = spark.read.schema("a int, b int").csv(fileList2: _*)
> df1.count() // correctly returns 4
> df2.count() // correctly returns 6
> // the following is the same as above, except df1 is persisted
> val df1 = spark.read.schema("a int, b int").csv(fileList1: _*).persist
> val df2 = spark.read.schema("a int, b int").csv(fileList2: _*)
> df1.count() // correctly returns 4
> df2.count() // incorrectly returns 4!!
> {noformat}
> df1 and df2 were created with a different number of paths: df1 has 2, and df2
> has 3. But since the distinct set of root paths is the same (e.g.,
> {{Set("/tmp/test") == Set("/tmp/test")}}), the two dataframes are considered
> equal. Thus, when df1 is persisted, df2 uses df1's cached plan.
> The same bug also causes inappropriate exchange reuse.
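
The equality problem described in the issue can be illustrated without Spark. The following is a minimal sketch in plain Scala, not Spark's actual code: `FileRelationSketch`, `sameBySet`, and `sameBySeq` are hypothetical names introduced only for illustration. It shows how a comparison based on the distinct set of root paths treats a 2-path and a 3-path relation as equal, while a comparison over the full path list does not. It is not meant to represent the fix in the linked pull request.

```scala
// Hypothetical stand-in for a file-based relation; this is NOT Spark's class.
final case class FileRelationSketch(rootPaths: Seq[String]) {

  // Comparison keyed on the distinct set of root paths: repeated paths
  // collapse, so the number of times a path was listed is lost.
  def sameBySet(other: FileRelationSketch): Boolean =
    rootPaths.toSet == other.rootPaths.toSet

  // Comparison keyed on the full path list: repeated paths are preserved,
  // so lists of different length are never equal.
  def sameBySeq(other: FileRelationSketch): Boolean =
    rootPaths == other.rootPaths
}

object RelationEqualitySketch {
  def main(args: Array[String]): Unit = {
    val rel1 = FileRelationSketch(List.fill(2)("/tmp/test"))
    val rel2 = FileRelationSketch(List.fill(3)("/tmp/test"))

    println(rel1.sameBySet(rel2)) // true: both reduce to Set("/tmp/test")
    println(rel1.sameBySeq(rel2)) // false: 2 paths vs. 3 paths
  }
}
```

With a set-based comparison, any cache or reuse lookup that depends on relation equality would match `rel2` against an entry created for `rel1`, which is consistent with the incorrect count of 4 reported above.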