On 30-Jul-07 11:40:47, Eric Doviak wrote:
> [...]

Sympathies for the constraints you are operating in!
> The "Introduction to R" manual suggests modifying input files with > Perl. Any tips on how to get started? Would Perl Data Language (PDL) be > a good choice? http://pdl.perl.org/index_en.html I've not used SIPP files, but itseems that they are available in "delimited" format, including CSV. For extracting a subset of fields (especially when large datasets may stretch RAM resources) I would use awk rather than perl, since it is a much lighter program, transparent to code for, efficient, and it will do that job. On a Linux/Unix system (see below), say I wanted to extract fields 1, 1000, 1275, .... , 5678 from a CSV file. Then the 'awk' line that would do it would look like awk ' BEGIN{FS=","}{print $(1) "," $(1000) "," $(1275) "," ... $(5678) ' < sippfile.csv > newdata.csv Awk reads one line at a tine, and does with it what you tell it to do. It will not be overcome by a file with an enormous number of lines. Perl would be similar. So long as one line fits comfortably into RAM, you would not be limited by file size (unless you're running out of disk space), and operation will be quick, even for very long lines (as an experiment, I just set up a file with 10,000 fields and 35 lines; awk output 6 selected fields from all 35 lines in about 1 second, on the 366MHz 128MB RAM machine I'm on at the moment. After transferring it to a 733MHz 512MB RAM machine, it was too quick to estimate; so I duplicated the lines to get a 363-line file, and now got those same fields out in a bit less than 1 second. So that's over 300 lines/second, 200,000 lines a minute, a million lines in 5 minutes; and all on rather puny hardware.). In practice, you might want to write a separate script which woould automatically create the necessary awk script (say if you supply the filed names, haing already coded the filed positions corresponding to filed names). You could exploit R's system() command to run the scripts from within R, and then load in the filtered data. > I wrote a script which loads large datasets a few lines at a time, > writes the dozen or so variables of interest to a CSV file, removes > the loaded data and then (via a "for" loop) loads the next few lines > .... I managed to get it to work with one of the SIPP core files, > but it's SLOOOOW. See above ... > Worse, if I discover later that I omitted a relevant variable, > then I'll have to run the whole script all over again. If the script worked quickly (as with awk), presumably you wouldn't mind so much? Regarding Linux/Unix versus Windows. It is general experience that Linux/Unix works faster, more cleanly and efficiently, and often more reliably, for similar tasks; and cam do so on low grade hardware. Also, these systems come with dozens of file-processing utilities (including perl and awk; also many others), each of which has been written to be efficient at precisely the repertoire of tasks it was designed for. A lot of Windows sotware carries a huge overhead of either cosmetic dross, or a pantechnicon of functionality of which you are only going to need 0.01% at any one time. The Unix utilities have been ported to Windows, long since, but I have no experience of using them in that environment. Others, who have, can advise! But I'd seriously suggest getting hold of them. Hoping this helps, Ted. 
--------------------------------------------------------------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 30-Jul-07  Time: 18:24:41
------------------------------ XFMail ------------------------------