On Jan 9, 2008 2:01 PM, Derek Stephen Elmerick <[EMAIL PROTECTED]> wrote:

> Hello –
>
> I am trying to write code that will read in multiple datasets;
> however, I would like to skip any dataset where the read-in process
> takes longer than some fixed cutoff. A generic version of the function
> is the following:
>
> for (k in 1:number.of.datasets)
> {
>   X[[k]] <- read.table(...)
> }
>
> The issue is that I cannot find a way to embed logic that will abort
> the read-in process of a specific dataset without manual intervention.
> I scanned the help manual and other postings, but no luck based on my
> search. Any thoughts?


A simple solution is to use nrows = 1000000 or so (whatever makes sense
for your data); any dataset larger than that will simply be truncated.
If you read from a connection, you can even check after read.table()
completes whether more rows are available; if so, the dataset was not
read in full.
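
For example (a minimal sketch; the file name, header = TRUE, and the
1000000-row limit are placeholder assumptions, so adjust the
read.table() arguments to match your actual data):

  maxrows <- 1000000
  con <- file("mydata.txt", open = "r")
  X1 <- read.table(con, header = TRUE, nrows = maxrows)
  ## if one more line can still be read, the dataset was truncated
  truncated <- length(readLines(con, n = 1)) > 0
  close(con)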

A slightly more complicated solution is to read 1000 lines or so at a
time (the right chunk size depends a bit on the data) and rbind() the
results of the individual read.table() calls at the end.  If you capture
the colClasses from the first chunk and reuse them, this can even be
faster than a single read.table() on the whole file.  Reading from a
connection means the file does not need to be reopened and the
connection need not be reset between chunks, and you can check the
elapsed time after each chunk to see whether you have exceeded your
threshold.
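
A rough sketch of that idea (untested; the chunk size, the 30-second
cutoff, and the assumption of a single header row are all illustrative):

  read.with.cutoff <- function(file, chunk = 1000, cutoff = 30) {
    con <- file(file, open = "r")
    on.exit(close(con))
    start <- proc.time()["elapsed"]
    ## the first chunk establishes the column names and classes
    first <- read.table(con, header = TRUE, nrows = chunk)
    classes <- sapply(first, class)
    pieces <- list(first)
    repeat {
      if (proc.time()["elapsed"] - start > cutoff)
        return(NULL)                             # over budget: skip dataset
      nxt <- tryCatch(read.table(con, header = FALSE, nrows = chunk,
                                 colClasses = classes,
                                 col.names = names(first)),
                      error = function(e) NULL)  # end of file
      if (is.null(nxt)) break
      pieces[[length(pieces) + 1L]] <- nxt
    }
    do.call(rbind, pieces)
  }

In the loop you would then do something like
X[[k]] <- read.with.cutoff(files[k]) and test the result for NULL to
detect the skipped datasets.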

There may, of course, be more clever solutions that I haven't thought of.

Sean

