Hi there!

I have a potentially large dataset (in both number of rows and columns).

I want to find the fastest way to drop the columns that are useless to me, i.e. columns containing only a single value.

What do you think would be the fastest way to do this with Spark?


I already have a solution using distinct().count() or approxCountDistinct(), but they may not be the best choice, as they require going through all the data, even when the first two values tested in a column already differ (in which case I know I can keep the column).
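To illustrate the early-exit behavior I'm after, here is a minimal plain-Python sketch (the data and function names are hypothetical, just for illustration): scanning a column stops as soon as a second distinct value appears, so non-constant columns are cheap to rule in. In Spark, a per-column analogue might be something like `df.select(c).distinct().limit(2).count() < 2`, though I'm not sure whether the plan actually short-circuits.

```python
def is_constant(values):
    """Return True if the column holds at most one distinct value.

    Stops at the first value that differs from the first one seen,
    so a non-constant column is detected without a full scan.
    """
    it = iter(values)
    try:
        first = next(it)
    except StopIteration:
        return True  # empty column: nothing to distinguish
    for v in it:
        if v != first:
            return False  # second distinct value found: keep the column
    return True

# Hypothetical "DataFrame" as a dict of column name -> values.
data = {
    "id": [1, 2, 3, 4],
    "flag": ["a", "a", "a", "a"],  # constant -> droppable
    "score": [0.5, 0.5, 0.7, 0.5],
}

droppable = [name for name, col in data.items() if is_constant(col)]
print(droppable)  # -> ['flag']
```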


Thanks for your ideas!

Julien

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
