Hello,

My code seems to work with your data, except that the first column is not to be scaled.



# file names
xlsfile <- file.path("~/dados", "trainFeatures42k.xls")
csvfile <- file.path("~/dados", "Normalized_Data.csv")
# read in the data files
df1 <- readxl::read_excel(xlsfile, col_names = FALSE)
df2 <- read.csv(csvfile)
# assign names to make all.equal happy
names(df1) <- sprintf("X%d", seq_len(ncol(df1)))
names(df2) <- sprintf("X%d", seq_len(ncol(df2)))

# the first column is not to be scaled
df1_norm <- scale(df1[-1])
# compare to the already scaled data from the Google Drive
# the data.frames are equal up to floating-point precision
identical(df2[-1], as.data.frame(df1_norm))
#[1] FALSE
all.equal(df2[-1], as.data.frame(df1_norm))
#[1] TRUE

# see if all values in each row are between -3 and 3
i <- rowSums(apply(df1_norm, 2, \(x) x > -3 & x < 3)) == ncol(df1_norm)

# filter and return a data.frame
df1_clean <- as.data.frame(df1_norm[i,])
dim(df1_clean)
#[1] 32494    60
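
The row filter above can also be written without the apply loop, because comparisons on a matrix are already vectorized. This is only a minimal sketch: it assumes the built-in iris data as a stand-in for the real files, and the names z, keep_apply, keep_abs and z_clean are illustrative, not part of the original code.

```r
# stand-in data: the four numeric columns of iris, converted to z-scores
z <- scale(iris[-5])

# apply-based index, as in the code above
keep_apply <- rowSums(apply(z, 2, \(x) x > -3 & x < 3)) == ncol(z)

# vectorized equivalent: abs(z) < 3 is the same test as z > -3 & z < 3
keep_abs <- rowSums(abs(z) < 3) == ncol(z)

# both indices select the same rows
stopifnot(all(keep_apply == keep_abs))

# drop = FALSE keeps the matrix shape even if only one row survives
z_clean <- as.data.frame(z[keep_abs, , drop = FALSE])
dim(z_clean)
```

If the data contain missing values, both versions give an NA index for the affected rows. Using rowSums(abs(z) < 3, na.rm = TRUE) instead would count an NA as failing the test, so those rows would be dropped.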


See if you get the same results.

Hope this helps,

Rui Barradas


At 17:44 on 09/05/2022, Paul Bernal wrote:
Dear Rui,

I was trying to dput() the datasets I am working on, but since they are a bit large (42,000 rows by 60 columns) I couldn't include the full structure here, so I am attaching a couple of files. One is the raw data (trainFeatures42k), which is the data I need to normalize, and the other is Normalized_Data, which is the normalized data (or at least I think I managed to normalize it).

Normalized_Data.csv <https://drive.google.com/file/d/143I1O710gAqWjzx48Gt1bwUbrG0mbpfa/view?usp=drive_web>
trainFeatures42k.xls <https://drive.google.com/file/d/1deMzGMkJyeVsnRzTKirmm4VqIBRzbvzV/view?usp=drive_web>

I have tried some of the code you and other friends from the community have kindly shared, but have not been able to filter the data so that only values > -3 and < 3 remain.

Thank you all for your valuable help always.
Best,
Paul

On Mon, 9 May 2022 at 4:22, Rui Barradas (<ruipbarra...@sapo.pt>) wrote:

    Hello,

    Something like this?
    First normalize the data.
    Then an apply loop creates a logical matrix showing which numbers are in
    the range -3 to 3.
    If they are all TRUE then their sum by rows is equal to the number of
    columns. This creates a logical index i.
    Use that index i to subset the scaled data set.

    # test data set, remove the Species column (not numeric)
    df1 <- iris[-5]

    df1_norm <- scale(df1)
    i <- rowSums(apply(df1_norm, 2, \(x) x > -3 & x < 3)) == ncol(df1_norm)

    # returns a matrix
    df1_norm[i, ]

    # returns a data.frame
    as.data.frame(df1_norm[i,])


    Hope this helps,

    Rui Barradas

    At 09:23 on 09/05/2022, Paul Bernal wrote:
     > Dear friends,
     >
     > I have a dataframe in which every single (i,j) entry (i standing for the
     > ith row, j for the jth column) has been normalized (converted to z-scores).
     >
     > Now I want to filter or subset the dataframe so that I end up with a
     > dataframe containing only entries greater than -3 and less than 3.
     >
     > How could I accomplish this?
     >
     > Best,
     > Paul
     >
     > ______________________________________________
     > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
     > https://stat.ethz.ch/mailman/listinfo/r-help
     > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
     > and provide commented, minimal, self-contained, reproducible code.

