On Fri, 10 Feb 2023 23:38:55 -0600 Spencer Graves <spencer.gra...@prodsyse.com> wrote:
> I have a 4.54 GB file that I'm trying to read in chunks using
> "scan(..., skip=__)". It works as expected for small values of
> "skip" but goes into an infinite loop for "skip=1e11" and similar
> large values of skip: I cannot even interrupt it; I must kill R.

Skipping lines is done by two nested loops. The outer loop counts the
lines to skip; the inner loop reads characters until it encounters a
newline or end of file. The outer loop doesn't check for EOF and keeps
asking for more characters until the inner loop runs at least once for
every line it wants to skip.

The following patch should avoid the wait in such cases:

--- src/main/scan.c	(revision 83797)
+++ src/main/scan.c	(working copy)
@@ -835,7 +835,7 @@
 attribute_hidden SEXP do_scan(SEXP call, SEXP op, SEXP args, SEXP rho)
 {
     SEXP ans, file, sep, what, stripwhite, dec, quotes, comstr;
-    int c, flush, fill, blskip, multiline, escapes, skipNul;
+    int c = 0, flush, fill, blskip, multiline, escapes, skipNul;
     R_xlen_t nmax, nlines, nskip;
     const char *p, *encoding;
     RCNTXT cntxt;
@@ -952,7 +952,7 @@
         if(!data.con->canread)
             error(_("cannot read from this connection"));
     }
-    for (R_xlen_t i = 0; i < nskip; i++) /* MBCS-safe */
+    for (R_xlen_t i = 0; i < nskip && c != R_EOF; i++) /* MBCS-safe */
         while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF);
     }

Making it interruptible is a bit more work: we need to ensure that a
valid context is set up and check regularly for an interrupt.

--- src/main/scan.c	(revision 83797)
+++ src/main/scan.c	(working copy)
@@ -835,7 +835,7 @@
 attribute_hidden SEXP do_scan(SEXP call, SEXP op, SEXP args, SEXP rho)
 {
     SEXP ans, file, sep, what, stripwhite, dec, quotes, comstr;
-    int c, flush, fill, blskip, multiline, escapes, skipNul;
+    int c = 0, flush, fill, blskip, multiline, escapes, skipNul;
     R_xlen_t nmax, nlines, nskip;
     const char *p, *encoding;
     RCNTXT cntxt;
@@ -952,8 +952,6 @@
         if(!data.con->canread)
             error(_("cannot read from this connection"));
     }
-    for (R_xlen_t i = 0; i < nskip; i++) /* MBCS-safe */
-        while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF);
     }

     ans = R_NilValue; /* -Wall */
@@ -966,6 +964,10 @@
     cntxt.cend = &scan_cleanup;
     cntxt.cenddata = &data;

+    if (ii) for (R_xlen_t i = 0, j = 0; i < nskip && c != R_EOF; i++) /* MBCS-safe */
+        while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF)
+            if (j++ % 10000 == 9999) R_CheckUserInterrupt();
+
     switch (TYPEOF(what)) {
     case LGLSXP:
     case INTSXP:

This way, even if you pour a Decanter of Endless Lines (e.g. mkfifo
LINES; perl -E'print "A"x42 while 1;' > LINES) into scan(), it can
still be interrupted, even if neither newline nor EOF ever arrives.

(We never skip lines when reading from the console? I suppose it makes
sense. I think this needs to be documented, and I can write a
documentation patch.)

If you actually have 1e11 lines in your file and would like to read it
in chunks, it may help to use

    # open the connection explicitly: scan() opens and closes an
    # unopened connection on every call, which restarts from the top
    f <- file('...', open = 'r')
    chunk1 <- scan(f, n = n1, skip = nskip1)
    # the following will continue reading where chunk1 had ended
    chunk2 <- scan(f, n = n2, skip = nskip2)

...in order to avoid having to skip over chunks you have already read,
which otherwise makes the algorithm quadratic in the number of lines
instead of linear. (I couldn't determine whether you're already doing
this, sorry.)

Skipping a fixed number of lines is hard: since they have variable
length, it's required to read every character in order to determine
whether it starts a new line.
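
To illustrate the character-by-character skipping described above, here
is a minimal standalone sketch using plain stdio rather than R's
connection code. The name skip_lines and its return convention are
hypothetical and are not part of R or of the patches; the sketch only
shows the outer loop giving up at EOF instead of counting all the way to
nskip.

```c
#include <stdio.h>

/* Hypothetical helper, not R's code: skip up to nskip lines from fp.
 * Every character has to be read, because line lengths are unknown.
 * Returns the number of lines actually skipped; a final line without a
 * trailing newline is counted if it contains at least one character. */
static long long skip_lines(FILE *fp, long long nskip)
{
    long long skipped = 0;
    int c = 0;

    /* The outer loop also stops at EOF, so a huge nskip cannot keep it
     * running after the input is exhausted. */
    while (skipped < nskip && c != EOF) {
        long long nchars = 0;

        /* The inner loop consumes one line, newline included. */
        while ((c = getc(fp)) != '\n' && c != EOF)
            nchars++;

        if (c == '\n' || nchars > 0)
            skipped++;
    }
    return skipped;
}
```

Called on a three-line file with nskip = 100000000000, this returns 3
after reading the file once; a return value smaller than nskip tells the
caller that the input ended early.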
With byte ranges, it would have been possible to use seek(), but not
here.

--
Best regards,
Ivan

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel