On Fri, 10 Feb 2023 23:38:55 -0600
Spencer Graves <spencer.gra...@prodsyse.com> wrote:
> I have a 4.54 GB file that I'm trying to read in chunks using
> "scan(..., skip=__)". It works as expected for small values of
> "skip" but goes into an infinite loop for "skip=1e11" and similar
> large values of skip: I cannot even interrupt it; I must kill R.
Skipping lines is done by two nested loops. The outer loop counts the
lines to skip; the inner loop reads characters until it encounters a
newline or end of file. The outer loop doesn't check for EOF, so once
the file is exhausted it still runs the inner loop once for every
remaining line it wants to skip. With skip=1e11 that is not an
infinite loop, just a very long one. The following patch should avoid
the wait in such cases:
--- src/main/scan.c (revision 83797)
+++ src/main/scan.c (working copy)
@@ -835,7 +835,7 @@
attribute_hidden SEXP do_scan(SEXP call, SEXP op, SEXP args, SEXP rho)
{
SEXP ans, file, sep, what, stripwhite, dec, quotes, comstr;
- int c, flush, fill, blskip, multiline, escapes, skipNul;
+ int c = 0, flush, fill, blskip, multiline, escapes, skipNul;
R_xlen_t nmax, nlines, nskip;
const char *p, *encoding;
RCNTXT cntxt;
@@ -952,7 +952,7 @@
if(!data.con->canread)
error(_("cannot read from this connection"));
}
- for (R_xlen_t i = 0; i < nskip; i++) /* MBCS-safe */
+ for (R_xlen_t i = 0; i < nskip && c != R_EOF; i++) /* MBCS-safe */
while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF);
}
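
For anyone who wants to try the idea outside of R, here is a minimal
standalone sketch of the same EOF-aware skip loop using plain stdio.
The function name skip_lines() and the use of getc()/EOF in place of
scanchar()/R_EOF are my own stand-ins for illustration, not anything
in scan.c:

```c
#include <stdio.h>

/* Hypothetical stand-in for the skip loop in do_scan(): consume up to
 * nskip lines from fp and return how many newline-terminated lines were
 * actually skipped. The outer loop stops as soon as EOF has been seen,
 * so the cost is bounded by the size of the input, not by nskip. */
long long skip_lines(FILE *fp, long long nskip)
{
    int c = 0;              /* initialised so the first loop test is defined */
    long long skipped = 0;

    for (long long i = 0; i < nskip && c != EOF; i++) {
        /* inner loop: read to the end of the current line or to EOF */
        while ((c = getc(fp)) != '\n' && c != EOF)
            ;
        if (c == '\n')
            skipped++;
    }
    return skipped;
}
```

With a three-line file and nskip of 1e11, the outer loop here runs four
times (three lines plus the iteration that sees EOF) instead of 1e11
times.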
Making it interruptible is a bit more work: we need to ensure that a
valid context is set up and check regularly for an interrupt.
--- src/main/scan.c (revision 83797)
+++ src/main/scan.c (working copy)
@@ -835,7 +835,7 @@
attribute_hidden SEXP do_scan(SEXP call, SEXP op, SEXP args, SEXP rho)
{
SEXP ans, file, sep, what, stripwhite, dec, quotes, comstr;
- int c, flush, fill, blskip, multiline, escapes, skipNul;
+ int c = 0, flush, fill, blskip, multiline, escapes, skipNul;
R_xlen_t nmax, nlines, nskip;
const char *p, *encoding;
RCNTXT cntxt;
@@ -952,8 +952,6 @@
if(!data.con->canread)
error(_("cannot read from this connection"));
}
- for (R_xlen_t i = 0; i < nskip; i++) /* MBCS-safe */
- while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF);
}
ans = R_NilValue; /* -Wall */
@@ -966,6 +964,10 @@
cntxt.cend = &scan_cleanup;
cntxt.cenddata = &data;
+ if (ii) for (R_xlen_t i = 0, j = 0; i < nskip && c != R_EOF; i++) /* MBCS-safe */
+ while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF)
+ if (j++ % 10000 == 9999) R_CheckUserInterrupt();
+
switch (TYPEOF(what)) {
case LGLSXP:
case INTSXP:
This way, even if you pour a Decanter of Endless Lines (e.g. mkfifo
LINES; perl -E'print "A"x42 while 1;' > LINES) into scan(), it can
still be interrupted when neither newline nor EOF ever arrives.
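
The periodic check can also be sketched outside of R. In the version
below, a flag set by a signal handler is a hypothetical stand-in for
R_CheckUserInterrupt(), and skip_lines_interruptible(), on_sigint() and
interrupt_pending are names I made up for the example. Unlike
R_CheckUserInterrupt(), which does not return if an interrupt is
pending, the sketch simply returns -1:

```c
#include <signal.h>
#include <stdio.h>

/* Hypothetical stand-in for R's pending-interrupt state; a SIGINT
 * handler such as on_sigint() below would set it. */
volatile sig_atomic_t interrupt_pending = 0;

void on_sigint(int sig) { (void)sig; interrupt_pending = 1; }

/* Skip up to nskip lines from fp. Returns the number of
 * newline-terminated lines skipped, or -1 if an interrupt was noticed.
 * As in the patch, the flag is only polled once per 10000 non-newline
 * characters read, to keep the check cheap. */
long long skip_lines_interruptible(FILE *fp, long long nskip)
{
    int c = 0;
    long long skipped = 0;
    unsigned long long j = 0;   /* characters read, across all lines */

    for (long long i = 0; i < nskip && c != EOF; i++) {
        while ((c = getc(fp)) != '\n' && c != EOF) {
            if (j++ % 10000 == 9999 && interrupt_pending)
                return -1;
        }
        if (c == '\n')
            skipped++;
    }
    return skipped;
}
```

To use it from a terminal, one would install the handler with
signal(SIGINT, on_sigint) before calling the function; the loop then
gives up within 10000 characters of Ctrl-C even on input that never
contains a newline.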