[
https://issues.apache.org/jira/browse/SPARK-39467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-39467:
---------------------------------
Component/s: (was: Spark Core)
> Count on distinct asterisk not equals to the count with column names provided
> -----------------------------------------------------------------------------
>
> Key: SPARK-39467
> URL: https://issues.apache.org/jira/browse/SPARK-39467
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.3
> Environment: Spark 3.1.3 vanilla
> Reporter: Michael Taranov
> Priority: Minor
>
> Hi everyone,
> We came across a case where count distinct with an asterisk produces an
> incorrect result compared to count distinct with all columns listed explicitly.
> An example is provided below:
> {noformat}
> scala> val df = Seq(
> | (1655172,1463032,"PHON","US",null,1),
> | (1655172,1061329,"DESK","AU",null,3),
> | (1655172,1334977,"MOBILE","US",null,23),
> | (1655172,1165470,"PHON","CR",null,12),
> | (1655172,1021215,"PHON","CA","USD",11)).toDF
> df: org.apache.spark.sql.DataFrame = [_1: int, _2: int ... 4 more fields]
> scala> df.printSchema
> root
> |-- _1: integer (nullable = false)
> |-- _2: integer (nullable = false)
> |-- _3: string (nullable = true)
> |-- _4: string (nullable = true)
> |-- _5: string (nullable = true)
> |-- _6: integer (nullable = false)
> scala> df.createOrReplaceTempView("a_table")
> scala> spark.sql("select count(1), count(distinct(*)), count(distinct(_1, _2, _3, _4, _5, _6)) from a_table").show(false)
> +--------+--------------------------------------+----------------------------------------------------------------------------+
> |count(1)|count(DISTINCT _1, _2, _3, _4, _5, _6)|count(DISTINCT named_struct(_1, _1, _2, _2, _3, _3, _4, _4, _5, _5, _6, _6))|
> +--------+--------------------------------------+----------------------------------------------------------------------------+
> |5       |1                                     |5                                                                           |
> +--------+--------------------------------------+----------------------------------------------------------------------------+
> {noformat}
> We understand that this is somehow related to null values, but in our
> understanding the asterisk should behave the same as listing all columns.
> If there is any documentation about this, it would be nice to read.
> Any help would be appreciated.
> Michael
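One plausible reading of the output above, offered as a sketch rather than a confirmed diagnosis: the column headers show that count(distinct(*)) was expanded to count(DISTINCT _1, ..., _6), while count(distinct(_1, ..., _6)) was parsed as a count over a single named_struct. Spark's function documentation describes count(DISTINCT expr[, expr...]) as counting rows whose expressions are unique and non-null, so the four rows with a null _5 would be skipped in the first form, whereas a struct is itself non-null even when one of its fields is null. The plain Scala model below imitates the two rules on the reported data. It does not call Spark; the names CountDistinctModel, countDistinctColumns and countDistinctStruct are hypothetical and exist only for illustration.

```scala
// Simplified model of the two counting rules; this is NOT Spark code.
// All names here are hypothetical and chosen for illustration only.
object CountDistinctModel {
  // One row of the table; None stands in for SQL NULL.
  type Row = Seq[Option[Any]]

  // Same five rows as in the report.
  val rows: Seq[Row] = Seq(
    Seq(Some(1655172), Some(1463032), Some("PHON"),   Some("US"), None,        Some(1)),
    Seq(Some(1655172), Some(1061329), Some("DESK"),   Some("AU"), None,        Some(3)),
    Seq(Some(1655172), Some(1334977), Some("MOBILE"), Some("US"), None,        Some(23)),
    Seq(Some(1655172), Some(1165470), Some("PHON"),   Some("CR"), None,        Some(12)),
    Seq(Some(1655172), Some(1021215), Some("PHON"),   Some("CA"), Some("USD"), Some(11))
  )

  // Models count(DISTINCT c1, ..., cn): a row with a NULL in any listed
  // column is skipped before distinct values are counted.
  def countDistinctColumns(rs: Seq[Row]): Int =
    rs.filter(_.forall(_.isDefined)).distinct.size

  // Models count(DISTINCT named_struct(...)): the struct is a single
  // non-NULL value even if some fields are NULL, so every row is counted.
  def countDistinctStruct(rs: Seq[Row]): Int =
    rs.distinct.size

  def main(args: Array[String]): Unit = {
    println(countDistinctColumns(rows)) // 1
    println(countDistinctStruct(rows))  // 5
  }
}
```

If the goal is the number of distinct rows with nulls included, df.distinct().count() on the DataFrame from the report is the usual way to get it, since distinct() keeps rows that contain nulls.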