Thanks Abdeali! Please find details below:

df.agg(countDistinct(col('col1'))).show() --> 450089
df.agg(countDistinct(col('col1'))).show() --> 450076
df.filter(col('col1').isNull()).count() --> 0
df.filter(col('col1').isNotNull()).count() --> 450063
col1 is a string
Spark version: 2.4.0
data size: ~500GB

On Sat, Jun 29, 2019 at 5:33 AM Abdeali Kothari <abdealikoth...@gmail.com> wrote:

> How large is the data frame, and what data type are you counting distinct
> for? I use countDistinct quite a bit and haven't noticed anything peculiar.
>
> Also, which exact version in 2.3.x? And are you performing any operations
> on the DF before the countDistinct?
>
> I recall there was a bug when I did countDistinct(PythonUDF(x)) in the
> same query, which was resolved in one of the minor versions of 2.3.x.
>
> On Sat, Jun 29, 2019, 10:32 Rishi Shah <rishishah.s...@gmail.com> wrote:
>
>> Hi All,
>>
>> Just wanted to check in to see if anyone has any insight into this
>> behavior. Any pointers would help.
>>
>> Thanks,
>> Rishi
>>
>> On Fri, Jun 14, 2019 at 7:05 AM Rishi Shah <rishishah.s...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> Recently we noticed that countDistinct on a larger dataframe doesn't
>>> always return the same value. Any idea? If this is the case, then what
>>> is the difference between countDistinct and approx_count_distinct?
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>
>> --
>> Regards,
>>
>> Rishi Shah

--
Regards,

Rishi Shah