Now fixed in R-devel revision 72650.

Duncan Murdoch

On 02/05/2017 4:11 AM, Duncan Murdoch wrote:
On 01/05/2017 8:49 PM, Jack Kelley wrote:
Thanks for looking into this.

A few notes regarding all the UTF encodings on Windows 10 ...

This all stems from the ancient bad decision by Microsoft to translate
LF characters to CR LF when writing text files.  Depending on the
encoding, R passes 0A, 0A 00 or 0A 00 00 00 to the output routine (part
of the C run-time), and that routine needs to know how wide each
character is in order to add a CR of the matching width.

The run-time's default is 8-bit characters, so current versions of R
write 0D 0A regardless of the encoding.
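
To make the width problem concrete, here is a minimal R sketch.  The
helper names lf_bytes and naive_crlf are made up for illustration, and
naive_crlf is only a rough model of 8-bit text-mode translation, not the
C run-time's actual code.

```r
# Bytes of a single LF in a given encoding (toRaw keeps embedded nulls).
lf_bytes <- function(encoding) {
  iconv("\n", "UTF-8", encoding, toRaw = TRUE)[[1]]
}

# Hypothetical model of 8-bit translation: one 0D byte is inserted in
# front of every 0A byte, whatever the character width.
naive_crlf <- function(bytes) {
  pieces <- lapply(bytes, function(b) {
    if (b == as.raw(0x0a)) as.raw(c(0x0d, 0x0a)) else b
  })
  do.call(c, pieces)
}

# Compare the 8-bit result with a correctly encoded CR LF.
for (enc in c("UTF-8", "UTF-16LE", "UTF-32LE")) {
  got  <- naive_crlf(lf_bytes(enc))
  want <- iconv("\r\n", "UTF-8", enc, toRaw = TRUE)[[1]]
  cat(sprintf("%-9s 8-bit: %-15s correct: %s\n", enc,
              paste(got, collapse = " "), paste(want, collapse = " ")))
}
```

For UTF-8 the two columns should agree; for the wide encodings the 8-bit
version comes out shorter, because the inserted CR is a single byte.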

There are ways to declare UTF-16LE (see
https://msdn.microsoft.com/en-us/library/yeby3zcb.aspx, or Google
"Windows fopen" if that moves), but no other wide encoding.  That's what
I'm putting in place if you ask for UTF-16LE or UCS-2LE.  So far I'm not
planning to handle UTF-16BE or UTF-32, because doing those would mean R
would have to handle the translation of LF itself, and I'm too lazy to
do that.

So far this is working for writes, but not reads.  I still have to track
down what's going wrong there.

Duncan Murdoch


The default eol for write.csv (via write.table) is "\n", which on Windows
is always written as as.raw (c (0x0d, 0x0a)), that is, <Carriage Return>
<Line Feed> as adjacent bytes. This is fine for UTF-8 but wrong for UTF-16
and UTF-32.

EXAMPLE: Using UTF-32 for exaggeration (note also that each CR is a single
byte, so 3 nul bytes are missing from every CR+LF):

df <- data.frame (x = 1:2, y = 3:4)

$`UTF-32LE`$default.eol$raw
 [1] 22 00 00 00 78 00 00 00 22 00 00 00 2c 00 00 00 22 00 00 00 79 00 00 00 22
[26] 00 00 00 0d 0a 00 00 00 31 00 00 00 2c 00 00 00 33 00 00 00 0d 0a 00 00 00
[51] 32 00 00 00 2c 00 00 00 34 00 00 00 0d 0a 00 00 00

$`UTF-32BE`$default.eol$raw
 [1] 00 00 00 22 00 00 00 78 00 00 00 22 00 00 00 2c 00 00 00 22 00 00 00 79 00
[26] 00 00 22 00 00 00 0d 0a 00 00 00 31 00 00 00 2c 00 00 00 33 00 00 00 0d 0a
[51] 00 00 00 32 00 00 00 2c 00 00 00 34 00 00 00 0d 0a

(Nevertheless, Microsoft Excel 2013 tolerates these CSVs!)

One trick/solution is to use eol = "\r" (that is, <Carriage Return> only).
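
A minimal sketch of that trick, assuming a scratch file from tempfile()
and UTF-16LE as the target encoding (both chosen only for illustration):

```r
df <- data.frame(x = 1:2, y = 3:4)
out <- tempfile(fileext = ".csv")   # hypothetical scratch file

# CR-only line endings: there is no LF for text mode to expand.
write.csv(df, out, fileEncoding = "UTF-16LE", row.names = FALSE, eol = "\r")

bytes <- readBin(out, "raw", 1000)
print(bytes)

# No 0a byte should appear anywhere in the file.
any(bytes == as.raw(0x0a))
```

Not every program treats a bare CR as a line break, so it is worth
checking the result against whatever will read the file.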

Regards -- Jack Kelley

----------------------------------------------------------------------------

remove (list = objects())
print (sessionInfo())
cat ("##########################################################\n\n")

ENCODING <- c (
  "UTF-8",
  "UTF-16LE", "UTF-16BE", "UTF-16",
  "UTF-32LE", "UTF-32BE", "UTF-32"
)

df <- data.frame (x = 1:2, y = 3:4)

# Write df once per encoding with the default eol, then read the bytes back.
csv <- structure (lapply (ENCODING, function (encoding) {
  csv <- sprintf ("df_%s.csv", encoding)
  write.csv (df, csv, fileEncoding = encoding, row.names = FALSE)
  list (default.eol = list (
    csv = csv, raw = readBin (csv, "raw", 1000))
  )
}), .Names = ENCODING)

EOL <- c (LF = "\n", CR = "\r", "CR+LF" = "\r\n")

# Repeat for every combination of encoding and explicit eol.
CSV <- structure (lapply (ENCODING, function (encoding) {
  structure (
    lapply (names (EOL), function (EOL.name) {
      csv <- sprintf ("df_%s_eol=%s.csv", encoding, EOL.name)
      write.csv (
        df, csv, fileEncoding = encoding, row.names = FALSE,
        eol = EOL [EOL.name]
      )
      list (csv = csv, raw = readBin (csv, "raw", 1000))
  }), .Names = names (EOL))
}), .Names = ENCODING)

print (csv)
print (CSV)
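
When scanning the printed raw vectors, it can help to flag the bad line
endings automatically.  The helper below (narrow_crlf_at is a made-up
name) reports the positions where a 0d byte is immediately followed by
0a.  In a correctly written UTF-16 or UTF-32 file the CR and LF are
separated by nul bytes, so an adjacent 0d 0a pair is suspicious.  This is
only a heuristic: some non-ASCII characters also contain those two bytes.

```r
# Positions of a 0d byte directly followed by a 0a byte.
narrow_crlf_at <- function(bytes) {
  n <- length(bytes)
  if (n < 2) return(integer(0))
  which(bytes[-n] == as.raw(0x0d) & bytes[-1] == as.raw(0x0a))
}

# "1" plus line ending in UTF-16LE: first with a 1-byte CR, then correct.
bad  <- as.raw(c(0x31, 0x00, 0x0d, 0x0a, 0x00))
good <- as.raw(c(0x31, 0x00, 0x0d, 0x00, 0x0a, 0x00))

narrow_crlf_at(bad)    # position of the stray 1-byte CR
narrow_crlf_at(good)   # nothing flagged

# With the script above one could try, for example:
#   narrow_crlf_at (CSV$`UTF-32LE`$LF$raw)
```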

----------------------------------------------------------------------------

-----Original Message-----
From: Duncan Murdoch [mailto:murdoch.dun...@gmail.com]
Sent: Tuesday, 2 May 2017 04:22
To: Jack Kelley <jack.kel...@bigpond.com>; r-devel@r-project.org
Subject: Re: [Rd] Any progress on write.csv fileEncoding for UTF-16 and
UTF-32 ?

On 30/04/2017 12:23 PM, Duncan Murdoch wrote:
No, I don't think anyone is working on this.

There's a fairly simple workaround for the UTF-16 and UTF-32 iconv
issues:  don't attempt to produce character vectors, produce raw vectors
instead. (The "toRaw" argument to iconv() asks for this.) Raw vectors
can contain embedded nulls.  Character vectors can't, because
internally, R is using 8 bit C strings, and the nulls are string
terminators.
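
A rough sketch of that workaround applied to a CSV, assuming UTF-16LE
output, CR LF line endings, and a scratch file from tempfile() (all three
are choices made here for illustration, not part of the original
suggestion):

```r
df <- data.frame(x = 1:2, y = 3:4)

# Render the CSV as ordinary text lines first.
lines <- capture.output(write.csv(df, row.names = FALSE))

# Join with explicit CR LF so no text-mode translation is involved.
text <- paste0(paste(lines, collapse = "\r\n"), "\r\n")

# Convert to a raw vector; iconv() returns a list, one element per string.
bytes <- iconv(text, "UTF-8", "UTF-16LE", toRaw = TRUE)[[1]]

# Write the bytes verbatim (binary mode) and read them back to check.
out <- tempfile(fileext = ".csv")   # hypothetical scratch file
writeBin(bytes, out)
identical(readBin(out, "raw", 1000), bytes)
```

Because writeBin() writes the bytes as they are, each CR and LF keeps
its full two-byte width.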

I don't know how difficult it would be to fix the write.table problems.

I've now taken a look, and it appears as if it's not too hard.  I'll see
if I can work out a patch that I trust.

Duncan Murdoch


Duncan Murdoch

On 29/04/2017 7:53 PM, Jack Kelley wrote:
"R version 3.4.0 (2017-04-21)"  on "x86_64-w64-mingw32" platform
... [rest omitted]




______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
