Re: [R] Filtering an Entire Dataset based on Several Conditions

Avi Gross via R-help Mon, 09 May 2022 14:58:29 -0700

Paul,

I read through the public replies you received and clearly some of us were not 
too clear on what you asked. Your subject line was not helpful as my first 
thought was that you wanted a single column examined for two conditions, as in 
EITHER less than 3 standard deviations above the mean OR more than three 
standard deviations below the mean. As someone else noted, using abs(whatever) 
makes it easy to do with a single condition.
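
As a small illustration of the abs() point, here is a sketch using a made-up 
vector z (not your data) showing that the two-condition test and the single 
abs() test select the same values:

```r
# Hypothetical z-scores, invented purely for illustration
z <- c(-3.5, -1.2, 0, 2.9, 3.1)

# Two conditions joined with &
two_conditions <- z > -3 & z < 3

# One condition using abs()
one_condition <- abs(z) < 3

# Both logical vectors agree, so either can be used as the index
identical(two_conditions, one_condition)
z[one_condition]
```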


You could have simply said you want to remove outliers more than 3 standard 
deviations from the mean! I note that in normally distributed data, the values 
you want to exclude may be only about 0.3% of each column, but across 39 such 
columns quite a few rows will contain at least one of them.
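
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes 
the columns really are normally distributed and independent of each other, 
which may well not hold for your data, so treat the result as an order of 
magnitude only:

```r
# Probability that a single standard normal value lies more than
# 3 standard deviations from the mean (both tails)
p_out <- 2 * pnorm(-3)

# Assumption: 39 independent, normally distributed columns
n_cols <- 39

# Probability that a row has at least one such value among its columns
p_row_flagged <- 1 - (1 - p_out)^n_cols

round(100 * p_out, 2)          # percent of values flagged per column
round(100 * p_row_flagged, 1)  # percent of rows with at least one flagged value
```

Under those assumptions, roughly a tenth of the rows would be dropped if a 
single outlier is enough to disqualify a row.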

The voluminous data you shared makes it clear you have 39 columns of this. So 
it seems what you meant was you wanted to apply the same logic to all 39 
columns and maybe that is what you meant by multiple conditions.

But what result do you want? Do you want to remove only rows in which every 
value is out of bounds, or any row with even a single value outside? Do you 
want to remove the row or just mark the outlier, perhaps with an NA?
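
The three outcomes differ quite a bit, so here is a minimal sketch of each on a 
tiny made-up matrix (three columns for brevity; the names m, out, m_any, m_all 
and m_na are just mine for the example):

```r
# Small invented matrix of z-scores
m <- rbind(c( 0.5, -1.0,  2.0),   # no outliers
           c( 3.4,  0.2, -0.7),   # one outlier
           c(-3.2,  4.1,  3.6))   # every value is an outlier

# Logical matrix marking values more than 3 SD from the mean
out <- abs(m) > 3

# Option 1: drop rows containing at least one outlier
m_any <- m[rowSums(out) == 0, , drop = FALSE]

# Option 2: drop only rows in which every value is an outlier
m_all <- m[rowSums(out) < ncol(m), , drop = FALSE]

# Option 3: keep every row but replace the outliers with NA
m_na <- m
m_na[out] <- NA
```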

Some have suggested you convert to a matrix where many tools are available, and 
back to a data.frame if needed. I note your solution becomes fairly trivial if 
you convert any values whose absolute value exceeds 3.0 to NA and then use 
complete.cases() to remove any rows containing an NA. This assumes, of course, 
you have no NA to start with.

There are all kinds of ways to do things and if you were using the dplyr 
package from the tidyverse, which we are discouraged from talking about here, I 
can see possibilities including the rowwise() function.

One idea to consider is that you can use the max() and min() functions applied 
to rows or columns. With your data as a matrix (transposed with t() if you 
prefer to work column-wise), you can remove any rows where the max of the 
absolute value of all row elements exceeds 3.
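
A minimal sketch of the row-maximum idea, using the built-in iris data as a 
stand-in for yours (as in an earlier reply) and apply() over rows rather than 
t(); the names m, row_max and m_kept are only for illustration:

```r
# Assumption: iris stands in for the real data; scale() converts to z-scores
m <- scale(as.matrix(iris[-5]))

# Largest absolute z-score found in each row
row_max <- apply(abs(m), 1, max)

# Keep only rows where no value is more than 3 SD from the mean
m_kept <- m[row_max <= 3, , drop = FALSE]

# Number of rows dropped
nrow(m) - nrow(m_kept)
```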

One thought is to re-scale your data again using a function that is a bit like 
a truncated normal distribution but, instead of tossing outliers, returns an 
NA. As noted, complete.cases() then handles your need.

But in your case, all your columns are numeric and of the same kind, so fairly 
trivial code like: 

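# mark every value more than 3 SD from the mean as NA
mydf[abs(mydf) > 3] <- NA

# complete.cases() returns a logical vector, so use it to index the rows
mydf <- mydf[complete.cases(mydf), ]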

just might do it for you.

Good luck!

-----Original Message-----
From: Paul Bernal <paulberna...@gmail.com>
To: Rui Barradas <ruipbarra...@sapo.pt>
Cc: R <r-help@r-project.org>
Sent: Mon, May 9, 2022 12:44 pm
Subject: Re: [R] Filtering an Entire Dataset based on Several Conditions

Dear Rui,

I was trying to dput() the datasets I am working on, but since they are a bit
large (42,000 rows by 60 columns) I couldn't retrieve the full structure of
the data to include it here, so I am attaching a couple of files. One is
the raw data (called trainFeatures42k), which is the data I need to
normalize, and the other is normalized_Data, which is the normalized data
(or at least I think I managed to normalize it).

 Normalized_Data.csv
<https://drive.google.com/file/d/143I1O710gAqWjzx48Gt1bwUbrG0mbpfa/view?usp=drive_web>
 trainFeatures42k.xls
<https://drive.google.com/file/d/1deMzGMkJyeVsnRzTKirmm4VqIBRzbvzV/view?usp=drive_web>

I have tried some of the code you and other friends from the community have
kindly shared, but have not been able to filter values > -3 and < 3.

Thank you all for your valuable help always.
Best,
Paul

On Mon, 9 May 2022 at 4:22, Rui Barradas (<ruipbarra...@sapo.pt>)
wrote:

> Hello,
>
> Something like this?
> First normalize the data.
> Then an apply loop creates a logical matrix giving which numbers are in
> the range -3 to 3.
> If they are all TRUE then their sum by rows is equal to the number of
> columns. This creates a logical index i.
> Use that index i to subset the scaled data set.
>
> # test data set, remove the Species column (not numeric)
> df1 <- iris[-5]
>
> df1_norm <- scale(df1)
> i <- rowSums(apply(df1_norm, 2, \(x) x > -3 & x < 3)) == ncol(df1_norm)
>
> # returns a matrix
> df1_norm[i, ]
>
> # returns a data.frame
> as.data.frame(df1_norm[i,])
>
>
> Hope this helps,
>
> Rui Barradas
>
> At 09:23 on 09/05/2022, Paul Bernal wrote:
> > Dear friends,
> >
> > I have a dataframe in which every single (i,j) entry (i standing for ith
> > row, j for jth column) has been normalized (converted to z-scores).
> >
> > Now I want to filter or subset the dataframe so that I only end up with a
> > dataframe containing only entries greater than -3 or less than 3.
> >
> > How could I accomplish this?
> >
> > Best,
> > Paul
> >
> >      [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>

