FYI. I used your script on a Windows machine with a 1.5GHz CPU, using the Cygwin software that provides the UNIX utilities. The file has 1000 lines with 10,000 fields on each line. Here is what it reported:
gawk 'BEGIN{FS=","}{print $(1) "," $(1000) "," $(1275) "," $(5678)}' < tempxx.txt > newdata.csv

real    0m0.806s
user    0m0.640s
sys     0m0.124s

So it took less than a second to process the file; it should still be pretty fast on Windows. BTW, the first run took 30 seconds of real time due to the slow disk that I have. The run above had the data already cached in memory.

On 7/30/07, Ted Harding <[EMAIL PROTECTED]> wrote:
> On 30-Jul-07 11:40:47, Eric Doviak wrote:
> > [...]
>
> Sympathies for the constraints you are operating in!
>
> > The "Introduction to R" manual suggests modifying input files with
> > Perl. Any tips on how to get started? Would Perl Data Language (PDL)
> > be a good choice? http://pdl.perl.org/index_en.html
>
> I've not used SIPP files, but it seems that they are available in
> "delimited" format, including CSV.
>
> For extracting a subset of fields (especially when large datasets may
> stretch RAM resources) I would use awk rather than perl, since it is
> a much lighter program, transparent to code for, efficient, and it
> will do that job.
>
> On a Linux/Unix system (see below), say I wanted to extract fields
> 1, 1000, 1275, ..., 5678 from a CSV file. Then the awk line that
> would do it would look like
>
>   awk 'BEGIN{FS=","}{print $(1) "," $(1000) "," $(1275) "," ... $(5678)}' < sippfile.csv > newdata.csv
>
> (with the "..." filled in by the remaining fields you want).
>
> Awk reads one line at a time, and does with it what you tell it to
> do. It will not be overcome by a file with an enormous number of
> lines. Perl would be similar. So long as one line fits comfortably
> into RAM, you would not be limited by file size (unless you're
> running out of disk space), and operation will be quick, even for
> very long lines. (As an experiment, I just set up a file with 10,000
> fields and 35 lines; awk output 6 selected fields from all 35 lines
> in about 1 second, on the 366MHz 128MB RAM machine I'm on at the
> moment.
> After transferring it to a 733MHz 512MB RAM machine, it was too
> quick to estimate; so I duplicated the lines to get a 363-line file,
> and now got those same fields out in a bit less than 1 second. So
> that's over 300 lines/second, about 20,000 lines a minute, a million
> lines in under an hour; and all on rather puny hardware.)
>
> In practice, you might want to write a separate script which would
> automatically create the necessary awk script (say if you supply the
> field names, having already coded the field positions corresponding
> to field names). You could exploit R's system() command to run the
> scripts from within R, and then load in the filtered data.
>
> > I wrote a script which loads large datasets a few lines at a time,
> > writes the dozen or so variables of interest to a CSV file,
> > removes the loaded data and then (via a "for" loop) loads the next
> > few lines.... I managed to get it to work with one of the SIPP
> > core files, but it's SLOOOOW.
>
> See above ...
>
> > Worse, if I discover later that I omitted a relevant variable,
> > then I'll have to run the whole script all over again.
>
> If the script worked quickly (as with awk), presumably you wouldn't
> mind so much?
>
> Regarding Linux/Unix versus Windows: it is general experience that
> Linux/Unix works faster, more cleanly and efficiently, and often
> more reliably, for similar tasks; and can do so on low-grade
> hardware. Also, these systems come with dozens of file-processing
> utilities (including perl and awk, and many others), each of which
> has been written to be efficient at precisely the repertoire of
> tasks it was designed for. A lot of Windows software carries a huge
> overhead of either cosmetic dross, or a pantechnicon of
> functionality of which you are only going to need 0.01% at any one
> time.
>
> The Unix utilities have been ported to Windows, long since, but I
> have no experience of using them in that environment. Others, who
> have, can advise!
> But I'd seriously suggest getting hold of them.
>
> Hoping this helps,
> Ted.
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
> Fax-to-email: +44 (0)870 094 0861
> Date: 30-Jul-07  Time: 18:24:41
> ------------------------------ XFMail ------------------------------

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
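[Editor's note] Ted's suggestion of a separate script that automatically creates the necessary awk program can be sketched in shell. This is a minimal, hypothetical helper (the function name and interface are my own, not from the thread): given a list of field positions, it assembles the awk print expression and hands it to awk.

```shell
#!/bin/sh
# Hypothetical helper sketching Ted's idea: generate the awk program
# from a list of field positions (the interface is assumed, not from
# the original thread).

# build_awk_prog "1 1000 1275" -> prints an awk program that extracts
# those comma-separated fields from each input line.
build_awk_prog() {
    prog='BEGIN{FS=","}{print '
    sep=''
    for f in $1; do
        prog="$prog$sep\$($f)"   # append $(N) for each field position
        sep=' "," '              # join extracted fields with commas
    done
    printf '%s}\n' "$prog"
}

# Example: extract fields 1 and 3 from a small CSV stream.
printf 'a,b,c,d\ne,f,g,h\n' | awk "$(build_awk_prog "1 3")"
# prints:
# a,c
# e,g
```

From R one could then run the generated command via system(), as Ted suggests, and read the filtered CSV back in with read.csv().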