One thing that we do on our datasets is:
1. Take 'n' random samples of equal size.
2. Check whether the distribution is heavily skewed for a key across your
samples. The way we define "heavy skewness" is: the mean is more than one
standard deviation away from the median (a sketch follows below).
In your case, you can then drop such a column.
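For what it's worth, here is a rough sketch of what that test could look like in Spark (Scala). It is only illustrative: `df`, the column name, and the helper name `isHeavilySkewed` are placeholders, and approxQuantile stands in for an exact median. You would run it on each of the 'n' samples from step 1.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, mean, stddev}

// Heuristic from above: a column is "heavily skewed" when its mean sits
// more than one standard deviation away from its (approximate) median.
def isHeavilySkewed(df: DataFrame, c: String): Boolean = {
  val stats = df.agg(mean(col(c)), stddev(col(c))).first()
  if (stats.isNullAt(0) || stats.isNullAt(1)) return false // too few rows to judge
  val (m, sd) = (stats.getDouble(0), stats.getDouble(1))
  // Approximate median with 1% relative error.
  val median = df.stat.approxQuantile(c, Array(0.5), 0.01).head
  math.abs(m - median) > sd
}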
I believe this only works when we need to drop duplicate ROWS.
Here I want to drop cols which contain only one unique value.
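For reference, the direct version of that check is a single aggregation pass that counts distinct values per column. This is only a sketch; `df` and the helper name `dropConstantCols` are stand-ins:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.countDistinct

// Count distinct values for every column in one pass, then drop the
// columns that hold exactly one value.
def dropConstantCols(df: DataFrame): DataFrame = {
  val counts = df
    .agg(countDistinct(df.columns.head), df.columns.tail.map(countDistinct(_)): _*)
    .first()
  val constant = df.columns.zipWithIndex
    .collect { case (name, i) if counts.getLong(i) == 1 => name }
  df.drop(constant: _*)
}

Note that an all-null column comes back with a distinct count of 0, so you may want <= 1 depending on how you treat those.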
On 2018-05-31 11:16, Divya Gehlot wrote:
you can try the dropDuplicates function
https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala
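A minimal, self-contained illustration of that function, with toy data; note that it deduplicates ROWS:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("dedup-demo").getOrCreate()
import spark.implicits._

// Toy data: the second row is an exact duplicate of the first.
val df = Seq(("a", 1), ("a", 1), ("b", 2)).toDF("key", "value")

df.dropDuplicates().show()      // drops the duplicate row
df.dropDuplicates("key").show() // dedup on a subset of columns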
On 31 May 2018 at 16:34, wrote:
> Hi there!
>
> I have a potentially large dataset (in terms of both rows and cols)
>
> And I want to find the fastest way
Hi Julien,
One quick and easy-to-implement idea is to use sampling on your dataset,
i.e., sample a large enough subset of your data and test whether some
columns contain no more than one distinct value. Repeat the process a few
times and then run the full test only on the surviving columns.
This will allow you to avoid the expensive full scan for most columns; a
sketch of the idea follows below.
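Here is a sketch of that sampling pre-filter, assuming a DataFrame `df`; the helper name `sampleSurvivors`, the sample fraction, and the number of rounds are illustrative guesses you would tune:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.countDistinct

// Narrow the candidate columns on small samples before the expensive
// full scan: any column that shows 2+ distinct values in a sample
// cannot be constant and is eliminated early.
def sampleSurvivors(df: DataFrame, fraction: Double = 0.01, rounds: Int = 3): Seq[String] = {
  var candidates: Seq[String] = df.columns.toSeq
  for (_ <- 1 to rounds if candidates.nonEmpty) {
    val sample = df.sample(withReplacement = false, fraction)
    val counts = sample
      .agg(countDistinct(candidates.head), candidates.tail.map(countDistinct(_)): _*)
      .first()
    candidates = candidates.zipWithIndex
      .collect { case (name, i) if counts.getLong(i) <= 1 => name }
  }
  candidates // run the exact distinct-count check only on these columns
}

The full test (an exact countDistinct over the whole dataset) then only has to touch the columns this returns.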
Hi there!
I have a potentially large dataset (in terms of both rows and cols),
and I want to find the fastest way to drop the cols that are useless to me,
i.e. cols containing only a single unique value!
I want to know what you think I could do to achieve this as fast as
possible using Spark.