How long is it taking on your files, and how many files do you have to process? I tried it with your data duplicated so that I had 57K lines, and it took 27 seconds to process. How much faster do you want it to be?
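If the loop does turn out to be too slow, here is a minimal loop-free sketch of the same idea. It assumes, as the loop version does, that every record ends with a "//" line, and additionally that each field line is a key followed by a single value with no leading whitespace. The function name parse_records and its keys argument are made up for illustration. cumsum() numbers the records and a two-column index matrix fills in the values, so fields missing from a record stay NA.

```r
# Loop-free sketch. parse_records() and `keys` are hypothetical names,
# not part of the original thread.
parse_records <- function(lines, keys = c("x", "y", "id1", "id2", "w")) {
  is_end <- lines == "//"
  # Record number of each line: delimiters seen strictly before it, plus one
  rec <- cumsum(is_end) - is_end + 1L
  n_rec <- sum(is_end)
  # Assumption: first token is the key, second token is the value
  key <- sub("\\s.*$", "", lines)
  val <- sub("^\\S+\\s+(\\S+).*$", "\\1", lines)
  # Keep wanted keys only; drop any trailing record with no closing "//"
  keep <- !is_end & key %in% keys & rec <= n_rec
  # Character matrix pre-filled with NA, one row per record
  out <- matrix(NA_character_, nrow = n_rec, ncol = length(keys),
                dimnames = list(NULL, keys))
  # Fill all cells at once using (row, column) index pairs
  out[cbind(rec[keep], match(key[keep], keys))] <- val[keep]
  as.data.frame(out, stringsAsFactors = FALSE)
}
```

To try it on a real file, something like system.time(res <- parse_records(readLines("file.txt"))) would show whether it is fast enough for your data; it has not been timed here.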
On Wed, Jul 9, 2008 at 10:57 AM, Paolo Sonego <[EMAIL PROTECTED]> wrote:
> Thanks so much Jim! It works without a glitch!
> My only problem is that the text files to be parsed are quite big, up to
> several thousand rows (my apologies for the incomplete information in my
> former post), so loops are not my first choice. I'll take a look at 'lapply'
> using your code as a model. Thanks again!
>
> Sincerely,
> Paolo
>
> jim holtman wrote:
>>
>> This should do what you want: (it uses loops; you can work at
>> replacing those with 'lapply' and such -- it all depends on if it is
>> going to take you more time to rewrite the code than to process a set
>> of data; you never did say how large the data was). This also "grows"
>> a data.frame, but you have not indicated how efficient it has to be.
>> So this could be used as a model.
>>
>> > x <- readLines(textConnection("x x_string
>> + y y_string
>> + id1 id1_string
>> + id2 id2_string
>> + z z_string
>> + w w_string
>> + stuff stuff stuff
>> + stuff stuff stuff
>> + stuff stuff stuff
>> + //
>> + x x_string1
>> + y y_string1
>> + z z_string1
>> + w w_string1
>> + stuff stuff stuff
>> + stuff stuff stuff
>> + stuff stuff stuff
>> + //
>> + x x_string2
>> + y y_string2
>> + id1 id1_string1
>> + id2 id2_string1
>> + z z_string2
>> + w w_string2
>> + stuff stuff stuff
>> + stuff stuff stuff
>> + stuff stuff stuff
>> + //"))
>> > # I assume that each group is delimited by "//"
>> > # initialize data.frame with desired values
>> > .keys <- data.frame(x=NA, y=NA, id1=NA, id2=NA, w=NA)
>> > .out <- .keys  # for the first pass
>> > .save <- NULL
>> > for (i in seq_along(x)){
>> +     if (x[i] == "//"){  # output the current data
>> +         .save <- rbind(.save, .out)
>> +         .out <- .keys  # setup for the next pass
>> +     } else {
>> +         .split <- strsplit(x[i], "\\s+")
>> +         if (.split[[1]][1] %in% names(.out)){
>> +             .out[[.split[[1]][1]]] <- .split[[1]][2]
>> +         }
>> +     }
>> + }
>> > .save
>>           x         y         id1         id2         w
>> 1  x_string  y_string  id1_string  id2_string  w_string
>> 2 x_string1 y_string1        <NA>        <NA> w_string1
>> 3 x_string2 y_string2 id1_string1 id2_string1 w_string2
>>
>>
>> On Wed, Jul 9, 2008 at 5:33 AM, Paolo Sonego <[EMAIL PROTECTED]> wrote:
>>>
>>> Dear R users,
>>>
>>> I have a big text file formatted like this:
>>>
>>> x x_string
>>> y y_string
>>> id1 id1_string
>>> id2 id2_string
>>> z z_string
>>> w w_string
>>> stuff stuff stuff
>>> stuff stuff stuff
>>> stuff stuff stuff
>>> //
>>> x x_string1
>>> y y_string1
>>> z z_string1
>>> w w_string1
>>> stuff stuff stuff
>>> stuff stuff stuff
>>> stuff stuff stuff
>>> //
>>> x x_string2
>>> y y_string2
>>> id1 id1_string1
>>> id2 id2_string1
>>> z z_string2
>>> w w_string2
>>> stuff stuff stuff
>>> stuff stuff stuff
>>> stuff stuff stuff
>>> //
>>> ...
>>> ...
>>>
>>>
>>> I'd like to parse this file and retrieve the x, y, id1, id2, z, w fields
>>> and save them into a matrix object:
>>>
>>> x         y         id1         id2         z         w
>>> x_string  y_string  id1_string  id2_string  z_string  w_string
>>> x_string1 y_string1 NA          NA          z_string1 w_string1
>>> x_string2 y_string2 id1_string1 id2_string1 z_string2 w_string2
>>> ...
>>> ...
>>>
>>> The id1 and id2 fields are not always present within a section (the
>>> interval between x and the last stuff), and I'd like to insert an NA
>>> when they are absent (see above) so that
>>> length(x)==length(y)==length(id1)==... .
>>>
>>> Without the id1, id2 fields the task is easily solved by importing the
>>> text file with readLines and retrieving the single fields with grep:
>>>
>>> input = readLines("file.txt")
>>> x = grep("^x\\s", input, value = T)
>>> id1 = grep("^id1\\s", input, value = T)
>>> ...
>>>
>>> I'd like to accomplish this task entirely in R (no SQL, no perl script),
>>> possibly without using loops.
>>>
>>> Any suggestions are quite welcome!
>>>
>>> Regards,
>>> Paolo
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?