Re: [pyspark 2.3+] CountDistinct

2019-07-01 Thread Abdeali Kothari
I can't exactly reproduce this. Here is what I tried quickly:

import uuid

import findspark
findspark.init()  # noqa
import pyspark
from pyspark.sql import functions as F  # noqa: N812

spark = pyspark.sql.SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [[str(uuid.uuid4())] for _ in range(45)],  # 45 single-column rows
    ['col1'])

print(' Spark version:', spark.sparkContext.version)
print(' Null count:', df.filter(F.col('col1').isNull()).count())
print(' Value count:', df.filter(F.col('col1').isNotNull()).count())
print(' Distinct Count 1:',
df.agg(F.countDistinct(F.col('col1'))).collect()[0][0])
print(' Distinct Count 2:',
df.agg(F.countDistinct(F.col('col1'))).collect()[0][0])

This always returns:
 Spark version: 2.4.0
 Null count: 0
 Value count: 45
 Distinct Count 1: 45
 Distinct Count 2: 45
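
One hypothesis for the discrepancy (assuming col1 is derived rather than
read as-is): if the lineage contains a non-deterministic step, such as a
Python UDF that calls random() or a source table that changes between
actions, each action recomputes that lineage and can see different data.
A plain-Python analogy of the recomputation, with made-up sizes — not
Spark code, just a sketch:

```python
import random

def build_col1(n=450000):
    # Stands in for a DataFrame lineage containing a non-deterministic
    # step: every "action" re-runs it and may yield different values.
    return [random.randrange(10**6) for _ in range(n)]

# Two separate "actions" each re-evaluate the lineage:
count1 = len(set(build_col1()))
count2 = len(set(build_col1()))

# Materializing once (the analogue of df.cache() plus an action
# before the repeated counts) makes the counts agree:
col1 = build_col1()
print(len(set(col1)) == len(set(col1)))  # True: the data no longer moves
```

In Spark terms, persisting the DataFrame before running countDistinct
twice is a cheap way to test this hypothesis: if the counts stabilize
after caching, the input was changing between actions, not countDistinct
itself.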






Re: [pyspark 2.3+] CountDistinct

2019-06-29 Thread Rishi Shah
Thanks Abdeali! Please find details below:

df.agg(countDistinct(col('col1'))).show() --> 450089
df.agg(countDistinct(col('col1'))).show() --> 450076
df.filter(col('col1').isNull()).count() --> 0
df.filter(col('col1').isNotNull()).count() --> 450063

col1 is a string
Spark version: 2.4.0
data size: ~500 GB



-- 
Regards,

Rishi Shah


Re: [pyspark 2.3+] CountDistinct

2019-06-29 Thread Abdeali Kothari
How large is the DataFrame, and what data type are you counting distinct
values for?
I use countDistinct quite a bit and haven't noticed anything peculiar.

Also, which exact version in 2.3.x?
And are you performing any operations on the DataFrame before the
countDistinct?

I recall there was a bug with countDistinct(PythonUDF(x)) in the same
query, which was resolved in one of the minor 2.3.x releases.



Re: [pyspark 2.3+] CountDistinct

2019-06-28 Thread Rishi Shah
Hi All,

Just wanted to check in to see if anyone has any insight about this
behavior. Any pointers would help.

Thanks,
Rishi


-- 
Regards,

Rishi Shah


[pyspark 2.3+] CountDistinct

2019-06-14 Thread Rishi Shah
Hi All,

Recently we noticed that countDistinct on a large DataFrame doesn't always
return the same value. Any ideas? If countDistinct can vary like this, what
is then the difference between countDistinct & approx_count_distinct?
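
For context on the last question: countDistinct is an exact, deterministic
aggregate, while approx_count_distinct uses a HyperLogLog++ sketch that
trades a small, bounded error for constant memory per group. A toy
illustration of that trade-off in plain Python, using a k-minimum-values
sketch as a stand-in for HyperLogLog — not Spark code:

```python
import hashlib

def exact_distinct(values):
    # What countDistinct does: exact, and deterministic for a fixed input.
    return len(set(values))

def kmv_estimate(values, k=256):
    # K-minimum-values sketch: keep only the k smallest normalized hash
    # values and estimate the cardinality from the k-th one. Like the
    # HyperLogLog++ sketch behind approx_count_distinct, it uses bounded
    # memory and has a small relative error (roughly 1/sqrt(k)).
    hashes = sorted({int(hashlib.sha1(str(v).encode()).hexdigest(), 16) / 2**160
                     for v in values})[:k]
    if len(hashes) < k:
        return len(hashes)          # sketch not full: the count is exact
    return round(k / hashes[-1]) - 1

values = [f"user-{i % 5000}" for i in range(50000)]
print(exact_distinct(values))       # exactly 5000, on every run
print(kmv_estimate(values))         # close to 5000, within the error bound
```

So approximate counters are allowed to be off by a few percent, but an
exact countDistinct that returns different values across runs points at
the input data changing between actions, not at the aggregate itself.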

-- 
Regards,

Rishi Shah