[
https://issues.apache.org/jira/browse/SPARK-35676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362700#comment-17362700
]
Hyukjin Kwon commented on SPARK-35676:
--------------------------------------
{{distinct().count()}} includes nulls but {{countDistinct}} doesn't, I guess?
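A minimal sketch of that difference, written as a plain-Python model rather than actual Spark code so it runs without a cluster. The sample data and both helper names are hypothetical; `None` stands in for SQL NULL.

```python
# Plain-Python model of the two counting semantics (not Spark itself).
# Hypothetical sample column: two real ids, one duplicate, two NULLs.
ids = ["a", "b", "b", None, None]


def distinct_count(values):
    """Models df.select('id').distinct().count().

    distinct() keeps NULL as one distinct row, so it is counted once.
    """
    return len(set(values))


def count_distinct(values):
    """Models F.countDistinct('id').

    Like SQL COUNT(DISTINCT id), NULLs are skipped before counting.
    """
    return len({v for v in values if v is not None})


print(distinct_count(ids))  # 3  ("a", "b", NULL)
print(count_distinct(ids))  # 2  ("a", "b")
```

Under this model the two results for a single column differ by at most one, since all NULLs collapse into a single distinct row.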
> pyspark.sql.functions GroupBy agg CountDistinct() return bad value
> ------------------------------------------------------------------
>
> Key: SPARK-35676
> URL: https://issues.apache.org/jira/browse/SPARK-35676
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 3.0.1
> Reporter: carlosgv
> Priority: Minor
>
> from pyspark.sql import functions as F
> gr_month = (
>     list.groupBy(F.year('date').alias('year'), F.month('date').alias('month'))
>     .agg(F.countDistinct('id').alias("n_id"))
>     .orderBy(F.col("year").desc(), F.col("month").desc())
>     .persist()
> )
> gr_month.show()
>
> |year|month|n_id|
> |2021|6|58801|
> |2021|5|93699|
> list.filter(F.year('date') == "2021").filter(F.month('date') == "6").select('id').distinct().count()
> 98916
>
> 98916 != 58801