[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634344#comment-16634344 ]

Kazuaki Ishizaki commented on SPARK-25538:
------------------------------------------

This test case does not print {{63}}.

{code}
  test("test2") {
    val df = spark.read.parquet("file:///SPARK-25538-repro")
    val c1 = df.distinct.count
    val c2 = df.sort("col_0").distinct.count
    val c3 = df.withColumnRenamed("col_0", "new").distinct.count
    val c0 = df.count
    print(s"c1=$c1, c2=$c2, c3=$c3, c0=$c0\n")
  }

// output: c1=64, c2=73, c3=64, c0=123
{code}
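The counts above violate a basic invariant: the number of distinct rows is a property of the data alone, so sorting the rows or renaming a column must never change it. A minimal model of that invariant in plain Python (not Spark; the helper names are hypothetical, for illustration only):

```python
# Model a DataFrame as a list of dicts mapping column name -> value.
rows = [
    {"col_0": 1, "col_1": "a"},
    {"col_0": 1, "col_1": "a"},  # duplicate of the first row
    {"col_0": 2, "col_1": "b"},
]

def distinct_count(rows):
    # A row's identity is its (column, value) pairs, order-independent.
    return len({tuple(sorted(r.items())) for r in rows})

def rename(rows, old, new):
    # Renaming a column relabels keys but leaves every value untouched.
    return [{(new if k == old else k): v for k, v in r.items()} for r in rows]

# The distinct count is invariant under row order and column renaming.
assert distinct_count(rows) == 2
assert distinct_count(sorted(rows, key=lambda r: r["col_0"])) == 2
assert distinct_count(rename(rows, "col_0", "new")) == 2
```

In the Spark repro above, the analogous three counts (c1, c2, c3) disagree with each other, which is what marks this as a correctness bug rather than an expected behavior change.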

> incorrect row counts after distinct()
> -------------------------------------
>
>                 Key: SPARK-25538
>                 URL: https://issues.apache.org/jira/browse/SPARK-25538
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>         Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>            Reporter: Steven Rand
>            Priority: Blocker
>              Labels: correctness
>         Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one we noticed. I believe this issue 
> was introduced by SPARK-23713: I can't reproduce it on commits before that 
> change, but I can reproduce it on commits after it, as well as with 
> {{tags/v2.4.0-rc1}}.
>
> Below are example spark-shell sessions that illustrate the problem. 
> Unfortunately, the data used in these examples can't be uploaded to this 
> Jira ticket. I'll try to create test data that also reproduces the issue, 
> and will upload it if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = [<redacted>]
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different (and mutually 
> inconsistent) results:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = [<redacted>]
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
