[
https://issues.apache.org/jira/browse/SPARK-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15877513#comment-15877513
]
Gen TANG edited comment on SPARK-15678 at 2/22/17 5:41 AM:
-----------------------------------------------------------
Hi, I found a bug which is probably related to this issue. [~sameerag]
Please consider the following code.
{code}
import org.apache.spark.sql.DataFrame
def f(data: DataFrame): DataFrame = {
  val df = data.filter("id>10")
  df.cache()
  df.count()
  df
}
f(spark.range(100).toDF()).count()  // outputs 89, which is correct
f(spark.range(1000).toDF()).count() // outputs 989, which is correct
val dir = "/tmp/test"
spark.range(100).write.mode("overwrite").parquet(dir)
val df = spark.read.parquet(dir)
df.count() // outputs 100, which is correct
f(df).count() // outputs 89, which is correct
spark.range(1000).write.mode("overwrite").parquet(dir)
val df1 = spark.read.parquet(dir)
df1.count() // outputs 1000, which is correct; in fact every operation except
            // df1.filter("id>10") returns a correct result
f(df1).count() // outputs 89, which is incorrect
{code}
In fact, when we call df1.filter("id>10"), Spark reuses the old cached DataFrame instead of reading the overwritten data.
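A possible workaround (my addition, not part of the original report) is to invalidate the stale cache entry before re-reading the path. The sketch below uses the standard Dataset.unpersist and Catalog APIs; note that refreshByPath is only available from Spark 2.1 on.
{code}
// Workaround sketch, assuming the same "dir" and f as above.

// Option 1: keep a handle on the cached DataFrame and evict it explicitly.
val cached = f(spark.read.parquet(dir))
cached.unpersist()

// Option 2: invalidate every cache entry built from this path (Spark 2.1+).
spark.catalog.refreshByPath(dir)

// Option 3: clear the entire cache (heavy-handed, but always works).
spark.catalog.clearCache()
{code}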
> Not use cache on appends and overwrites
> ---------------------------------------
>
> Key: SPARK-15678
> URL: https://issues.apache.org/jira/browse/SPARK-15678
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Sameer Agarwal
> Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> SparkSQL currently doesn't drop caches if the underlying data is overwritten.
> {code}
> val dir = "/tmp/test"
> sqlContext.range(1000).write.mode("overwrite").parquet(dir)
> val df = sqlContext.read.parquet(dir).cache()
> df.count() // outputs 1000
> sqlContext.range(10).write.mode("overwrite").parquet(dir)
> sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 <---- We are still using the cached dataset
> {code}
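For comparison, here is what manually dropping the cache looks like in the scenario above. This is my sketch (using the same sqlContext API as the description), showing the result one would expect once appends and overwrites invalidate caches automatically.
{code}
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000

df.unpersist() // manually evict the cached data before the path is overwritten
sqlContext.range(10).write.mode("overwrite").parquet(dir)
sqlContext.read.parquet(dir).count() // outputs 10, as expected
{code}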