Re: [Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

Duncan Murdoch Mon, 29 Feb 2016 10:33:08 -0800

I have just committed your first patch (the strlen() replacement) to R-devel, and will soon put it in R-patched as well. I won't have time to look at this again before the 3.2.4 release, so your file.show() patch isn't going to make it unless someone else gets to it.

There's still a faint chance that I'll do more in R-devel before 3.3.0, but I think it's best if there were bug reports about both of these problems so they don't get forgotten. Since the first one is mainly a Windows problem, I'll write that one up; I'd appreciate it if you could write up the file.show() issue, after checking against R-devel rev 70247 or higher.


Duncan Murdoch

On 25/02/2016 5:54 AM, Mikko Korpela wrote:

On 25.02.2016 11:31, Mikko Korpela wrote:

On 23.02.2016 14:06, Mikko Korpela wrote:

On 23.02.2016 11:37, Martin Maechler wrote:

nospam@altfeld-im de <[email protected]>
     on Mon, 22 Feb 2016 18:45:59 +0100 writes:


     > Dear R developers
     > I think I have found a bug that can be reproduced with two lines of code
     > and I am very thankful to get your first assessment or feed-back on my
     > report.

     > If this is the wrong mailing list or I did something wrong
     > (e. g. semi "anonymous" email address to protect my privacy and defend
     > unwanted spam) please let me know since I am new here.

     > Thank you very much :-)

     > J. Altfeld

Dear J.,
(yes, a bit less anonymity would be very welcome here!),

You are right, this is a bug, at least in the documentation, but
probably "all real", indeed,

but read on.

     > On Tue, 2016-02-16 at 18:25 +0100, [email protected] wrote:
     >>
     >>
     >> If I execute the code from the "?write.table" examples section
     >>
     >> x <- data.frame(a = I("a \" quote"), b = pi)
     >> # (omitted code)
     >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
     >>
     >> the resulting CSV file has a size of 6 bytes which is too short
     >> (truncated):
     >>
     >> """,3

reproducibly, yes.
If you look at what write.csv does
and then simplify, you can get a similar wrong result by

   write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")

which results in a file with one line

""" 3

and if you debug  write.table() you see that its building blocks
here are
         file <- file(........, encoding = fileEncoding)

a        writeLines(*, file=file)  for the column headers,

and then "deeper down" C code which I did not investigate.


I took a look at connections.c. There is a call to strlen() that gets
confused by null characters. I think the obvious fix is to avoid the
call to strlen() as the size is already known:

Index: src/main/connections.c
===================================================================
--- src/main/connections.c      (revision 70213)
+++ src/main/connections.c      (working copy)
@@ -369,7 +369,7 @@
                /* is this safe? */
                warning(_("invalid char string in output conversion"));
            *ob = '\0';
-           con->write(outbuf, 1, strlen(outbuf), con);
+           con->write(outbuf, 1, ob - outbuf, con);
        } while(again && inb > 0);  /* it seems some iconv signal -1 on
                                       zero-length input */
      } else


But just looking a bit at such a file() object with writeLines()
seems slightly revealing, as e.g., 'eol' does not seem to
"work" for this encoding:

     > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
     > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
     > close(ff)
     > file.show(fn)
     CBA|>
     > file.size(fn)
     [1] 5
     >


With the patch applied:

     > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
     [1] "C"  "B"  "A"  "|"  ">a"
     > file.size(fn)
     [1] 22

I just realized that I was misusing the encoding argument of
readLines(). The code above works by accident, but the following would
be more appropriate:

     > ff <- file(fn, open="r", encoding="UTF-16LE")
     > readLines(ff)
     [1] "C"  "B"  "A"  "|"  ">a"
     > close(ff)

Testing on Linux, with the patch applied. (As noted by Duncan Murdoch,
the patch is incomplete on Windows.)

Before inspecting the file with readLines() I tried file.show() but it
did not work as expected. On Linux using a UTF-8 locale, the result of
trying to show the truly UTF-16LE encoded file with

     > file.show(fn, encoding="UTF-16LE")

was a pager showing "<43>" (quotes not included) followed by several
empty lines.

With the following patch, the command works correctly (in this case, on
this platform, not tested comprehensively). The idea is to read the
input file "raw" in order to avoid problems with null characters. The
input then needs to be split into lines after iconv(), or it could be
written to the output file with cat() if the style of line termination
characters does not matter. The 'perl = TRUE' is for assumed performance
advantage only. It can be removed, or one might want to test if there is
a significant difference one way or the other.

- Mikko

Index: src/library/base/R/files.R
===================================================================
--- src/library/base/R/files.R  (revision 70217)
+++ src/library/base/R/files.R  (working copy)
@@ -50,10 +50,13 @@
          for(i in seq_along(files)) {
              f <- files[i]
              tf <- tempfile()
-            tmp <- readLines(f, warn = FALSE)
+            tmp <- list(readBin(f, "raw", file.size(f)))
              tmp2 <- try(iconv(tmp, encoding, "", "byte"))
              if(inherits(tmp2, "try-error")) file.copy(f, tf)
-            else writeLines(tmp2, tf)
+            else {
+                tmp2 <- strsplit(tmp2, "\r\n?|\n", perl = TRUE)[[1L]]
+                writeLines(tmp2, tf)
+            }
              files[i] <- tf
              if(delete.file) unlink(f)
          }


______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
