Thanks Abdeali! Please find details below:

df.agg(countDistinct(col('col1'))).show() --> 450089
df.agg(countDistinct(col('col1'))).show() --> 450076
df.filter(col('col1').isNull()).count() --> 0
df.filter(col('col1').isNotNull()).count() --> 450063
col1 is a string
Spark version: 2.4.0
data size: ~500GB

On Sat, Jun 29, 2019 at 5:33 AM Abdeali Kothari <abdealikoth...@gmail.com> wrote:

> How large is the data frame, and what data type are you counting distinct
> for? I use countDistinct quite a bit and haven't noticed anything peculiar.
>
> Also, which exact version in 2.3.x? And are you performing any operations
> on the DF before the countDistinct?
>
> I recall there was a bug when I did countDistinct(PythonUDF(x)) in the
> same query, which was resolved in one of the minor versions of 2.3.x.
>
> On Sat, Jun 29, 2019, 10:32 Rishi Shah <rishishah.s...@gmail.com> wrote:
>
>> Hi All,
>>
>> Just wanted to check in to see if anyone has any insight into this
>> behavior. Any pointers would help.
>>
>> Thanks,
>> Rishi
>>
>> On Fri, Jun 14, 2019 at 7:05 AM Rishi Shah <rishishah.s...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> Recently we noticed that countDistinct on a larger dataframe doesn't
>>> always return the same value. Any idea? If this is the case, then what
>>> is the difference between countDistinct and approx_count_distinct?
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>
>> --
>> Regards,
>>
>> Rishi Shah

--
Regards,

Rishi Shah