Hi Jeff / all

On 02/10/2024, 08.54, Jeff Newmiller wrote:
> The Unicode FAQ does. If you specify endian-ness and a BOM is present and
> these specifications agree then it would seem no harm no foul. The problem is
> that if they conflict, then there is no clearly correct behavior: if the BOM
> is valid then the user spec must be incorrectly specified and favoring the
> user specification forces incorrect decoding. If the BOM is erroneous, then
> you would want the user to be able to override the incorrect BOM... but these
> two cases amount to defeating the BOM's purpose... it might as well not be
> there. So the compliant handling of data with a BOM is for the user to make a
> standard practice of not specifying endianness _unless they must override an
> invalid BOM_ (which ought to be highly unusual)... save the sledgehammer for
> unusual cases, and let the BOM be the "only" specification if it is present.
> This lets the BOM serve its intended purpose of reducing how often users have
> to guess.

Actually, the Unicode FAQ (https://unicode.org/faq/utf_bom.html, under "Q: Why wouldn’t I always use a protocol that requires a BOM?") says:

"In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted. Any U+FEFF would be interpreted as a ZWNBSP."

So, my interpretation of the Unicode recommendation is that specifying *LE/*BE takes precedence: if both are provided, then the BOM should be interpreted as a zero-width non-breaking space, i.e. not treated as a byte-order mark.

Therefore, it would seem sensible for defensive programmers to specify *LE/*BE manually, safe in the knowledge that any BOM (correct or otherwise) becomes irrelevant, which is what I believe Tomas and Simon are suggesting.

Although it is possible I misunderstood something...

Best wishes,
Matt

On 02/10/2024, 08.54, "R-SIG-Mac on behalf of Jeff Newmiller via R-SIG-Mac" <r-sig-mac-boun...@r-project.org on behalf of r-sig-mac@r-project.org> wrote:

[SNIP]

>>I don't find anything inappropriate about the *LE/*BE specifications.

> The Unicode FAQ does. If you specify endian-ness and a BOM is present and
> these specifications agree then it would seem no harm no foul. The problem is
> that if they conflict, then there is no clearly correct behavior: if the BOM
> is valid then the user spec must be incorrectly specified and favoring the
> user specification forces incorrect decoding. If the BOM is erroneous, then
> you would want the user to be able to override the incorrect BOM... but these
> two cases amount to defeating the BOM's purpose... it might as well not be
> there. So the compliant handling of data with a BOM is for the user to make a
> standard practice of not specifying endianness _unless they must override an
> invalid BOM_ (which ought to be highly unusual)... save the sledgehammer for
> unusual cases, and let the BOM be the "only" specification if it is present.
> This lets the BOM serve its intended purpose of reducing how often users have
> to guess.

On October 1, 2024 1:50:25 PM MST, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>On 10/1/24 15:31, Jeff Newmiller wrote:
>>> This is a problem in macOS libiconv. When converting from "UTF-16" with a
>>> BOM, it correctly learns the byte-order from the BOM, but later forgets it
>>> in some cases. This is not a problem in R, but could be worked-around in R.
>> So, buggy system code on one system...
>>
>>> As Simon wrote, to avoid running into these problems (in released versions
>>> of R), one should use "UTF-16LE", so explicitly specify the byte-order in
>>> the encoding name.
>> ... leads to institutionalized non-compliance.
>>
>>> This is useful also because it is not clear what should be the default when
>>> no BOM is present and different systems have different defaults.
>> This is nonsense, for reasons previously provided. You are calling a bug a
>> feature. The BOM is supposed to prevent you from having to know this detail,
>> and what you do when no BOM is present should have no bearing on this case.
>
>I will try to explain this differently. The handling of BOMs in existing iconv
>implementations is unreliable (one issue is documented in R documentation, one
>issue is the one we have run into now). Because it is unreliable, people who
>want to be defensive and avoid problems are advised to use *LE (or *BE)
>specifications. What is the default byte-order when no BOM is specified is not
>reliable, either (defaults differ between systems and the standard is open to
>interpretation - e.g. my Linux and Windows builds of R default to
>little-endian, while my macOS build defaults to big-endian). It is thus not
>advisable to depend on the default order, either, and a defensive solution is
>again to use *LE or *BE specifications. So, in principle, simply always use
>*LE or *BE.
>
>This advice is not a feature, it is a work-around that works for two problems:
>that the byte order for specifications like "UTF-16" is unknown (bug in the
>standard) and that specifying the byte-order by a BOM is unreliable (bugs in
>implementations of iconv).
>
>> If Apple is intransigent (which would not be out of character) you could
>> avoid institutionalized non-compliance at the user level by recognizing the
>> buggy system and replacing the generic specification with this inappropriate
>> LE or BE specification as directed by the BOM in the Mac-specific R code.
>
>Yes, indeed, the work-around for the libiconv bug can be implemented in future
>versions of R and an experimental version is already in R-devel (still subject
>to change), so that at user level, specifying say "UTF-16" on an input with
>BOM will correctly use the byte-order of the BOM.
>
>I don't find anything inappropriate about the *LE/*BE specifications.
>
>Best
>Tomas
>
>>
>>
>> On October 1, 2024 4:34:41 AM MST, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>>> On 9/9/24 12:53, Tomas Kalibera wrote:
>>>> On 9/9/24 10:53, peter dalgaard wrote:
>>>>> I am confused, and maybe I should just butt out of this, but:
>>>>>
>>>>> (a) BOM are designed to, um, mark the byte order...
>>>>>
>>>>> (b) in connections.c we have
>>>>>
>>>>>     if(checkBOM && con->inavail >= 2 &&
>>>>>        ((int)con->iconvbuff[0] & 0xff) == 255 &&
>>>>>        ((int)con->iconvbuff[1] & 0xff) == 254) {
>>>>>         con->inavail -= (short) 2;
>>>>>         memmove(con->iconvbuff, con->iconvbuff+2, con->inavail);
>>>>>     }
>>>>> which checks for the two first bytes being FF, FE. However, a big-endian
>>>>> BOM would be FE, FF and I see no check for that.
>>>> I think this is correct, it is executed only for encodings declared
>>>> little-endian (UTF-16LE, UCS-2LE) - so, iconv will still know what is the
>>>> byte-order from the name of the encoding, it will just not see the same
>>>> information in the BOM.
>>>>> Duncan's file starts
>>>>>
>>>>> > readBin('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt', what="raw", n=10)
>>>>> [1] ff fe 74 00 69 00 6d 00 65 00
>>>>>
>>>>> so the BOM does indeed indicate little-endian, but apparently we proceed
>>>>> to discard it and read the file with system (big-)endianness, which
>>>>> strikes me as just plain wrong...
>>>> I've tested we are not discarding it by the code above and that iconv gets
>>>> to see the BOM bytes.
>>>>> I see no Mac-specific code for this, only win_iconv.c, so presumably we
>>>>> have potential issues on everything non-Windows?
>>>> I can reproduce the problem and will have a closer look, it may still be
>>>> there is a bug in R. We have some work-arounds for recent iconv issues on
>>>> macOS in sysutils.c.
>>> This is a problem in macOS libiconv. When converting from "UTF-16" with a
>>> BOM, it correctly learns the byte-order from the BOM, but later forgets it
>>> in some cases. This is not a problem in R, but could be worked-around in R.
>>>
>>> As Simon wrote, to avoid running into these problems (in released versions
>>> of R), one should use "UTF-16LE", so explicitly specify the byte-order in
>>> the encoding name. This is useful also because it is not clear what should
>>> be the default when no BOM is present and different systems have different
>>> defaults.
>>>
>>> Best
>>> Tomas
>>>
>>>> Tomas
>>>>
>>>>> -pd
>>>>>
>>>>>> On 9 Sep 2024, at 01:11 , Simon Urbanek <simon.urba...@r-project.org> wrote:
>>>>>>
>>>>>> From the help page:
>>>>>>
>>>>>> The encodings ‘"UCS-2LE"’ and ‘"UTF-16LE"’ are treated specially,
>>>>>> as they are appropriate values for Windows ‘Unicode’ text files.
>>>>>> If the first two bytes are the Byte Order Mark ‘0xFEFF’ then these
>>>>>> are removed as some implementations of ‘iconv’ do not accept BOMs.
>>>>>>
>>>>>> so "UTF-16LE" is the documented way to reliably read such files.
>>>>>>
>>>>>> Cheers,
>>>>>> Simon
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On 8 Sep 2024, at 21:23, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
>>>>>>>
>>>>>>> To R-SIG-Mac, with a copy to Jeff Newmiller:
>>>>>>>
>>>>>>> On R-help there's a thread about reading a remote file that is coded in
>>>>>>> UTF-16LE with a byte-order mark. Jeff Newmiller pointed out
>>>>>>> (https://stat.ethz.ch/pipermail/r-help/2024-September/479933.html)
>>>>>>> that it would be better to declare the encoding as "UTF-16", because
>>>>>>> the BOM will indicate little endian.
>>>>>>>
>>>>>>> I tried this on my Mac running R 4.4.1, and it didn't work. I get the
>>>>>>> same incorrect result from all of these commands:
>>>>>>>
>>>>>>> # Automatically recognizing a URL and using fileEncoding:
>>>>>>> read.delim(
>>>>>>>   'https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>>>>   fileEncoding = "UTF-16"
>>>>>>> )
>>>>>>>
>>>>>>> # Using explicit url() with encoding:
>>>>>>> read.delim(
>>>>>>>   url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>>>>       encoding = "UTF-16")
>>>>>>> )
>>>>>>>
>>>>>>> # Specifying the endianness incorrectly:
>>>>>>> read.delim(
>>>>>>>   url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>>>>       encoding = "UTF-16BE")
>>>>>>> )
>>>>>>>
>>>>>>> The only way I get the correct result is if I specify "UTF-16LE"
>>>>>>> explicitly, whereas Jeff got correct results on several different
>>>>>>> systems using "UTF-16".
>>>>>>>
>>>>>>> Is this a MacOS bug or an R for MacOS bug?
>>>>>>>
>>>>>>> Duncan Murdoch
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> R-SIG-Mac mailing list
>>>>>>> R-SIG-Mac@r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac

--
Sent from my phone. Please excuse my brevity.
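
For reference, the BOM-directed choice of an explicit *LE/*BE name discussed in the thread can be sketched at user level in R roughly as follows. This is only a minimal sketch: the helper name `utf16_encoding_from_bom` and its `default` argument are hypothetical (not part of R), and it is not the work-around that is being tried in R-devel. It only peeks at the first two bytes and returns an encoding name to pass on.

```r
# Hypothetical helper (not part of R): pick an explicit UTF-16 encoding name
# from the first two bytes of a file, so the byte order never depends on how
# the platform's iconv treats a plain "UTF-16".
utf16_encoding_from_bom <- function(path, default = "UTF-16LE") {
  bom <- readBin(path, what = "raw", n = 2L)
  if (identical(bom, as.raw(c(0xff, 0xfe)))) {
    "UTF-16LE"   # FF FE: little-endian BOM
  } else if (identical(bom, as.raw(c(0xfe, 0xff)))) {
    "UTF-16BE"   # FE FF: big-endian BOM
  } else {
    default      # no BOM found: the caller has to decide
  }
}

# Usage sketch on a local temporary file holding the ten bytes quoted in the
# thread (ff fe 74 00 69 00 6d 00 65 00).
tmp <- tempfile()
writeBin(as.raw(c(0xff, 0xfe, 0x74, 0x00, 0x69, 0x00, 0x6d, 0x00, 0x65, 0x00)), tmp)

enc <- utf16_encoding_from_bom(tmp)   # chooses the explicit name from the BOM
con <- file(tmp, encoding = enc)
txt <- readLines(con, warn = FALSE)   # warn = FALSE: the file has no final newline
close(con)
```

With a little-endian BOM the helper returns "UTF-16LE", which is the case the help page excerpt describes as having the BOM removed before iconv sees it. The excerpt documents no such special treatment for "UTF-16BE", so for big-endian input the decoded text may still begin with U+FEFF and might need trimming.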