Hi all,

I found a strange bug related to reading data from an updated path combined with the cache operation. Please consider the following code:
import org.apache.spark.sql.DataFrame

def f(data: DataFrame): DataFrame = {
  val df = data.filter("id>10")
  df.cache
  df.count
  df
}

f(spark.range(100).asInstanceOf[DataFrame]).count    // returns 89, which is correct
f(spark.range(1000).asInstanceOf[DataFrame]).count   // returns 989, which is correct

val dir = "/tmp/test"
spark.range(100).write.mode("overwrite").parquet(dir)
val df = spark.read.parquet(dir)
df.count      // returns 100, which is correct
f(df).count   // returns 89, which is correct

spark.range(1000).write.mode("overwrite").parquet(dir)
val df1 = spark.read.parquet(dir)
df1.count     // returns 1000, which is correct; in fact every operation except df1.filter("id>10") returns the correct result
f(df1).count  // returns 89, which is incorrect

It looks like when we call df1.filter("id>10"), Spark reuses the old cached DataFrame instead of reading the updated path.
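In case it helps anyone reproduce or narrow this down, below is a minimal sketch, assuming the stale cached filter("id>10") plan from the old 100-row data is indeed what f(df1) hits. spark.catalog.clearCache() and Dataset.unpersist() are the standard cache-management calls; the counts in the comments are what I would expect if that assumption holds, not observed output.

// Sketch: drop the cached plan before re-running f on the re-read path,
// assuming the stale cache is the cause of the wrong 89.
spark.catalog.clearCache()          // removes all cached data in this SparkSession
val df2 = spark.read.parquet(dir)   // re-read the overwritten path
f(df2).count                        // expected 989 if the stale cache was the problem

// Or, since f returns the cached DataFrame, unpersist just that one
// before the path is overwritten again:
val cachedDf = f(df2)
cachedDf.unpersist()                // evicts only this DataFrame's cached data

clearCache() is coarse, since it drops every cached dataset in the session, so unpersisting only the DataFrame returned by f would be the more targeted option if this turns out to be the issue.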
Any ideas? Thanks a lot.

Cheers
Gen