[I] Inconsistency with count distinct on NaN values [datafusion]

via GitHub Wed, 04 Jun 2025 15:12:39 -0700


andygrove opened a new issue, #16254:
URL: https://github.com/apache/datafusion/issues/16254


   ### Describe the bug
   
   I have this csv file:
   
   ```
   a,b
   x,NaN
   x,NaN
   x,NaN
   ```
   
   With a simple select query, DataFusion reports only 1 distinct value for 
column b (which I think is correct).
   
   ```
   > select count(distinct b) from 'nan.csv';
   +---------------------------+
   | count(DISTINCT nan.csv.b) |
   +---------------------------+
   | 1                         |
   +---------------------------+
   ```
   
   However, in a grouped aggregate query, DataFusion reports 3 distinct values:
   
   ```
   > select a, count(distinct b) from 'nan.csv' group by 1 order by 1;
   +---+---------------------------+
   | a | count(DISTINCT nan.csv.b) |
   +---+---------------------------+
   | x | 3                         |
   +---+---------------------------+
   ```
   
   This behavior seems inconsistent. I would expect the aggregate query to also 
report that there is one distinct value (in Spark, the behavior is consistent 
between the two queries).
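
For illustration only, here is a minimal Python sketch of how two distinct-counting strategies can disagree on NaN. It is not a diagnosis of DataFusion's internals. The function names are hypothetical, and the assumption is simply that one code path compares floats with IEEE 754 equality (where NaN never equals NaN) while the other treats all NaN values as one key.

```python
import math
import struct


def count_distinct_by_equality(values):
    """Count distinct values using only ==, so a NaN never matches an earlier NaN."""
    seen = []
    for v in values:
        if not any(v == s for s in seen):
            seen.append(v)
    return len(seen)


def count_distinct_nan_as_one_key(values):
    """Count distinct values by 64-bit pattern, mapping every NaN to one canonical key.

    Note: keying by bit pattern also treats 0.0 and -0.0 as different values.
    """
    canonical_nan = struct.pack(">d", float("nan"))
    keys = set()
    for v in values:
        keys.add(canonical_nan if math.isnan(v) else struct.pack(">d", v))
    return len(keys)


# Same data as column b in nan.csv above.
column_b = [float("nan"), float("nan"), float("nan")]

print(count_distinct_by_equality(column_b))     # 3
print(count_distinct_nan_as_one_key(column_b))  # 1
```

Either convention can be defended on its own. The problem reported here is that the two queries appear to follow different conventions for the same column.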
   
   
   
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

