Dear all:

Described below is a problem with a large data set (more than 2 GB after
unzipping, tab-delimited). I know R is not the appropriate tool for such
a task, but I ran it on a server anyway and hit some basic problems.

1. First, count.fields can count the fields in all the rows. However, when I
tried to remove rows located beyond about 3/5 of the data, R says
"subscript out of bounds". Is there an option that limits the maximal
size of data R can read in?
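
One possible cause, stated here as an assumption rather than a diagnosis, is that the row indices derived from count.fields do not line up with the rows that read.table actually produced, so it may help to compare the two counts before subsetting. Note also that subsetting with an empty negative index returns zero rows. The sketch below uses a made-up toy file; the names txt, tmp and d are purely illustrative.

```r
# Toy tab-delimited input (hypothetical data); the third data line is short.
txt <- c("V1\tV2\tV3", "1\ta\tx", "2\tb\ty", "3\tc", "4\td\tz")
tmp <- tempfile()
writeLines(txt, tmp)

d  <- read.table(tmp, header = TRUE, sep = "\t", fill = TRUE)
cf <- count.fields(tmp, sep = "\t")

# Diagnostics: do both functions agree on the number of data rows?
anyNA(cf)          # NA counts would make which() silently skip lines
length(cf) - 1L    # data lines seen by count.fields (header excluded)
nrow(d)            # rows produced by read.table

# Row numbers to drop; cf[1] belongs to the header line, hence the -1.
del <- which(cf != 3L) - 1L

# Check the indices before using them, and guard against an empty 'del':
# d[-integer(0), ] would return a data frame with zero rows.
stopifnot(all(del >= 1L), all(del <= nrow(d)))
if (length(del) > 0L) d <- d[-del, ]
```

If the two counts differ on the real file, the indices in del cannot be trusted, whatever the size of the data.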

2. Because of careless coding, I wrote the original data back out to a new
file and found that the rewritten tab-delimited file does not match the
original file.
I tried the code on the small dataset attached at the end, and there was no
problem at all with such a small dataset.
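
One reason the field counts can differ after a round trip, offered only as a possibility for the real file: read.table with fill=TRUE pads short rows, so write.table then writes those rows with the full number of fields. The sketch below shows this on a made-up two-row file (tmp_in, tmp_out and d are illustrative names).

```r
# Toy input (hypothetical data): the second data line has only 2 fields.
txt <- c("V1\tV2\tV3", "1\ta\t10", "2\tb")
tmp_in  <- tempfile()
tmp_out <- tempfile()
writeLines(txt, tmp_in)

# fill = TRUE pads the short row; the missing numeric V3 becomes NA.
d <- read.table(tmp_in, header = TRUE, sep = "\t", fill = TRUE)

# Write the data back exactly as in the post.
write.table(d, file = tmp_out, eol = "\n", sep = "\t",
            quote = FALSE, row.names = FALSE)

cf_in  <- count.fields(tmp_in,  sep = "\t")
cf_out <- count.fields(tmp_out, sep = "\t")

table(cf_in)    # one line with 2 fields, two lines with 3
table(cf_out)   # all three lines now have 3 fields
```

Fields containing quote characters or tabs may cause further differences when quote=FALSE is used, but that is not shown here.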

I would appreciate any tips and suggestions on how to remove the unwanted
rows from such a large dataset.
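
A minimal sketch of one way to drop the unwanted rows without loading the whole file: read the text in chunks and copy only the lines with the expected number of tab-separated fields. The function filter_by_field_count and its arguments are hypothetical names, and the sketch assumes that every tab is a field separator (quotes are not interpreted), which may not hold for the real data.

```r
# Hypothetical helper: copy the lines of 'infile' that have exactly 'n'
# tab-separated fields to 'outfile', reading 'chunk' lines at a time.
# Returns the number of lines kept (the header counts if it has n fields).
filter_by_field_count <- function(infile, outfile, n, chunk = 100000L) {
  con_in  <- file(infile, open = "r")
  con_out <- file(outfile, open = "w")
  on.exit({ close(con_in); close(con_out) })
  kept <- 0L
  repeat {
    lines <- readLines(con_in, n = chunk)
    if (length(lines) == 0L) break
    # Fields per line = number of tabs + 1; counting bytes avoids problems
    # with badly encoded text. Assumption: no tabs inside quoted fields.
    no_tabs  <- gsub("\t", "", lines, fixed = TRUE, useBytes = TRUE)
    n_fields <- nchar(lines, type = "bytes") -
                nchar(no_tabs, type = "bytes") + 1L
    good <- lines[n_fields == n]
    writeLines(good, con_out)
    kept <- kept + length(good)
  }
  kept
}

# Usage on a small made-up file
tmp_in  <- tempfile()
tmp_out <- tempfile()
writeLines(c("a\tb\tc", "1\t2\t3", "4\t5", "6\t\t7", "8\t9\t10\t11"), tmp_in)
kept <- filter_by_field_count(tmp_in, tmp_out, n = 3L, chunk = 2L)
```

The filtered file can then be read with read.table without fill=TRUE, since every remaining line has the same number of fields.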

Finally, thanks to everyone who answered the tab-delimited problem I raised yesterday.

### code as follows ###

data.mm <- read.table(file, header=TRUE, sep="\t", fill=TRUE)  # read in the large file
cf <- count.fields(file, sep="\t")    # count the fields on each line
n <- 23                               # the CORRECT number of fields per row,
                                      # i.e., the number of variable names
del <- which(cf != n)                 # lines whose number of fields is not
                                      # equal to 23
del <- del - 1                        # cf also counts the header line, so
                                      # subtracting 1 gives the data rows
                                      # I want to remove

data.mm <- data.mm[-del, ]            # try to remove the rows whose number of
                                      # fields is not equal to 23
                                      ### PROBLEM: R says "subscript out of bounds"

write.table(data.mm, file="mm_0206.txt",
            eol="\n", sep="\t",
            quote=FALSE, row.names=FALSE)  # since data.mm <- data.mm[-del, ] aborted,
                                           # this writes the original data to mm_0206.txt
                                  ### PROBLEM: so the following two calls should
                                  ### give the same output

table(cf)                                      # maximal number of fields is 23
table(count.fields("mm_0206.txt", sep="\t"))   # maximal number of fields is larger
                                               # than 23, and the other counts are
                                               # unequal as well:
                                               # for example, the original data has
                                               # x rows with 10 fields, while the
                                               # rewritten data has y rows with 10 fields.
                                               # If the original file is not rewritten
                                               # correctly, then a file with rows of
                                               # equal length would probably not be
                                               # written properly either, even if
                                               # data.mm <- data.mm[-del, ]
                                               # had executed successfully.



####  experimental data set as follows        ###

V1      V2      V3      v4      v5      v6      v7      v8      v9
11      1       desc    A       1       34      1-Sep-00        1       first mid last
12      2       desc    B       6       56      2-Sep-00        1       First last
13      3       desc    A       7       32      3-Sep-00        1       last
14      4       desc    4-Sep-00        0       first mid last
15      5       desc    A       2       .       5-Sep-00        1       first mid last
16      6       desc    B       9       3       6-Sep-00        0       last
17      7               A       6       65      7-Sep-00        first last
18      8       desc    B       2       .       8-Sep-00        0       last
19      9       desc    A       8       56      9-Sep-00        1       first last
20      10      desc    B       5       89      10-Sep-00       0       first last

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
