On 2/11/23 09:33, Ivan Krylov wrote:
On Fri, 10 Feb 2023 23:38:55 -0600
Spencer Graves <spencer.gra...@prodsyse.com> wrote:

I have a 4.54 GB file that I'm trying to read in chunks using
"scan(..., skip=__)".  It works as expected for small values of
"skip" but goes into an infinite loop for "skip=1e11" and similar
large values of skip:  I cannot even interrupt it;  I must kill R.

Skipping lines is done by two nested loops. The outer loop counts the
lines to skip; the inner loop reads characters until it encounters a
newline or end of file. The outer loop doesn't check for EOF and keeps
asking for more characters until the inner loop runs at least once for
every line it wants to skip. The following patch should avoid the
wait in such cases:

--- src/main/scan.c     (revision 83797)
+++ src/main/scan.c     (working copy)
@@ -835,7 +835,7 @@
  attribute_hidden SEXP do_scan(SEXP call, SEXP op, SEXP args, SEXP rho)
  {
      SEXP ans, file, sep, what, stripwhite, dec, quotes, comstr;
-    int c, flush, fill, blskip, multiline, escapes, skipNul;
+    int c = 0, flush, fill, blskip, multiline, escapes, skipNul;
      R_xlen_t nmax, nlines, nskip;
      const char *p, *encoding;
      RCNTXT cntxt;
@@ -952,7 +952,7 @@
            if(!data.con->canread)
                error(_("cannot read from this connection"));
        }
-       for (R_xlen_t i = 0; i < nskip; i++) /* MBCS-safe */
+       for (R_xlen_t i = 0; i < nskip && c != R_EOF; i++) /* MBCS-safe */
            while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF);
      }
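
For readers without the R sources at hand, the loop structure and the added EOF check can be sketched in standalone C. This is only an illustration under stated assumptions: `skip_lines` is a hypothetical helper, and `getc()`/`EOF` stand in for R's `scanchar()`/`R_EOF`.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of R): skip up to nskip lines from fp.
 * Returns the number of lines actually skipped. */
long long skip_lines(FILE *fp, long long nskip)
{
    assert(fp != NULL);
    int c = 0;               /* initialised so the first loop test is defined */
    long long skipped = 0;
    /* The outer loop also stops at EOF, so a huge nskip cannot spin. */
    for (long long i = 0; i < nskip && c != EOF; i++) {
        int seen = 0;        /* did this line contain any characters? */
        while ((c = getc(fp)) != '\n' && c != EOF)
            seen = 1;
        if (c == '\n' || seen)
            skipped++;       /* count complete lines and a final unterminated one */
    }
    return skipped;
}
```

With a three-line file, asking for 1e11 skipped lines returns after the third line rather than iterating 1e11 times.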

Making it interruptible is a bit more work: we need to ensure that a
valid context is set up and check regularly for an interrupt.

--- src/main/scan.c     (revision 83797)
+++ src/main/scan.c     (working copy)
@@ -835,7 +835,7 @@
  attribute_hidden SEXP do_scan(SEXP call, SEXP op, SEXP args, SEXP rho)
  {
      SEXP ans, file, sep, what, stripwhite, dec, quotes, comstr;
-    int c, flush, fill, blskip, multiline, escapes, skipNul;
+    int c = 0, flush, fill, blskip, multiline, escapes, skipNul;
      R_xlen_t nmax, nlines, nskip;
      const char *p, *encoding;
      RCNTXT cntxt;
@@ -952,8 +952,6 @@
            if(!data.con->canread)
                error(_("cannot read from this connection"));
        }
-       for (R_xlen_t i = 0; i < nskip; i++) /* MBCS-safe */
-           while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF);
      }

      ans = R_NilValue; /* -Wall */
@@ -966,6 +964,10 @@
      cntxt.cend = &scan_cleanup;
      cntxt.cenddata = &data;
+    if (ii) for (R_xlen_t i = 0, j = 0; i < nskip && c != R_EOF; i++) /* MBCS-safe */
+       while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF)
+           if (j++ % 10000 == 9999) R_CheckUserInterrupt();
+
      switch (TYPEOF(what)) {
      case LGLSXP:
      case INTSXP:

This way, even if you pour a Decanter of Endless Lines (e.g. mkfifo
LINES; perl -E'print "A"x42 while 1;' > LINES) into scan(), it can
still be interrupted, even if neither newline nor EOF ever arrives.

Thanks, I've updated the implementation of scan() in R-devel to be interruptible while skipping lines.

I did it slightly differently because I found an existing memory leak, which could be fixed by creating the context a bit earlier.

I also avoided the modulo on the fast path, as I saw a 13% performance overhead on my mailbox file. Decrementing a counter and checking it against zero had no measurable overhead.

Best
Tomas
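
The decrement-and-compare idea can be sketched in standalone C as below. This is not the code committed to R-devel, which is not shown in this thread; `skip_lines_polling` and `CHECK_EVERY` are hypothetical names, `getc()`/`EOF` stand in for `scanchar()`/`R_EOF`, and the place where R would call `R_CheckUserInterrupt()` only increments a counter here.

```c
#include <assert.h>
#include <stdio.h>

#define CHECK_EVERY 10000   /* same interval as in the patch quoted above */

/* Hypothetical helper: skip up to nskip lines, "polling" once every
 * CHECK_EVERY characters. Returns the number of polls performed. */
long skip_lines_polling(FILE *fp, long long nskip)
{
    assert(fp != NULL);
    int c = 0;
    int countdown = CHECK_EVERY;
    long polls = 0;
    for (long long i = 0; i < nskip && c != EOF; i++)
        while ((c = getc(fp)) != '\n' && c != EOF)
            if (--countdown == 0) {      /* decrement and test against zero, no modulo */
                polls++;                 /* R would call R_CheckUserInterrupt() here */
                countdown = CHECK_EVERY;
            }
    return polls;
}
```

The counter runs across line boundaries, as `j` does in the patch, so a stream with no newline at all is still polled regularly.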

(We never skip lines when reading from the console? I suppose it makes
sense. I think this needs to be documented, and I can write a
documentation patch.)

If you actually have 1e11 lines in your file and would like to read it
in chunks, it may help to use

f <- file('...')
chunk1 <- scan(f, n = n1, skip = nskip1)
# the following will continue reading where chunk1 had ended
chunk2 <- scan(f, n = n2, skip = nskip2)

...in order to avoid having to skip over chunks you have already read,
which otherwise makes the algorithm quadratic in the number of lines
instead of linear. (I couldn't determine whether you're already doing
this, sorry.)
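
The quadratic-versus-linear difference can be made concrete by counting the characters read under each strategy. The following is a rough C sketch, not R code; the function names are hypothetical, and `rewind()` plays the role of reopening the file for every call to scan().

```c
#include <assert.h>
#include <stdio.h>

/* Read one line from fp, adding the characters consumed to *nread.
 * Returns 0 if nothing could be read (end of file), 1 otherwise. */
static int read_line(FILE *fp, long *nread)
{
    int c, any = 0;
    while ((c = getc(fp)) != EOF) {
        (*nread)++;
        any = 1;
        if (c == '\n') break;
    }
    return any;
}

/* Strategy A: start from the top for each chunk and re-skip earlier chunks. */
long chars_read_rewinding(FILE *fp, int nchunks, int lines_per_chunk)
{
    assert(fp != NULL);
    long nread = 0;
    for (int k = 0; k < nchunks; k++) {
        rewind(fp);
        for (int i = 0; i < (k + 1) * lines_per_chunk; i++)
            if (!read_line(fp, &nread)) break;
    }
    return nread;
}

/* Strategy B: keep the stream open and continue where the last chunk ended. */
long chars_read_continuing(FILE *fp, int nchunks, int lines_per_chunk)
{
    assert(fp != NULL);
    long nread = 0;
    rewind(fp);
    for (int k = 0; k < nchunks; k++)
        for (int i = 0; i < lines_per_chunk; i++)
            if (!read_line(fp, &nread)) break;
    return nread;
}
```

For 100 two-character lines read as 10 chunks of 10, strategy A reads 2 * 10 * (1 + 2 + ... + 10) = 1100 characters, while strategy B reads 200.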

Skipping a fixed number of lines is hard: since they have variable
length, it's required to read every character in order to determine
whether it starts a new line. With byte ranges, it would have been
possible to use seek(), but not here.
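
The contrast between byte offsets and line numbers can be sketched in standalone C, with `fseek()` standing in for R's seek(). Both function names are hypothetical and the example is only meant to show where the cost comes from.

```c
#include <assert.h>
#include <stdio.h>

/* Jump straight to a known byte offset: no intermediate bytes are read. */
int first_char_at_offset(FILE *fp, long offset)
{
    assert(fp != NULL);
    if (fseek(fp, offset, SEEK_SET) != 0)
        return EOF;
    return getc(fp);
}

/* Reach the start of line `lineno` (0-based): every byte before it must be
 * read to find the newlines. *bytes_scanned reports how many were read. */
int first_char_of_line(FILE *fp, long lineno, long *bytes_scanned)
{
    assert(fp != NULL);
    int c = 0;
    *bytes_scanned = 0;
    rewind(fp);
    for (long i = 0; i < lineno && c != EOF; i++)
        while ((c = getc(fp)) != EOF) {
            (*bytes_scanned)++;
            if (c == '\n') break;
        }
    return c == EOF ? EOF : getc(fp);
}
```

Note that `fseek()` takes a `long` offset, so on platforms where `long` is 32 bits a file of several gigabytes would need a wider offset interface.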


______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
