Re: [R-SIG-Mac] Bug in reading UTF-16LE file?

Matt Denwood Wed, 02 Oct 2024 02:05:37 -0700

Hi Jeff / all

On 02/10/2024, 08.54, Jeff Newmiller wrote:
> The Unicode FAQ does. If you specify endian-ness and a BOM is present and 
> these specifications agree then it would seem no harm no foul. The problem is 
> that if they conflict, then there is no clearly correct behavior: if the BOM 
> is valid then the user spec must be incorrectly specified and favoring the 
> user specification forces incorrect decoding. If the BOM is erroneous, then 
> you would want the user to be able to override the incorrect BOM... but these 
> two cases amount to defeating the BOM's purpose... it might as well not be 
> there. So the compliant handling of data with a BOM is for the user to make a 
> standard practice of not specifying endianness _unless they must override an 
> invalid BOM_ (which ought to be highly unusual)... save the sledgehammer for 
> unusual cases, and let the BOM be the "only" specification if it is present. 
> This lets the BOM serve its intended purpose of reducing how often users have 
> to guess.


Actually, the Unicode FAQ (https://unicode.org/faq/utf_bom.html, under "Q: Why 
wouldn’t I always use a protocol that requires a BOM?") says:  "In particular, 
if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a 
BOM is neither necessary nor permitted. Any U+FEFF would be interpreted as a 
ZWNBSP."

So, my interpretation of the Unicode recommendation is that specifying *LE/*BE 
takes precedence - and if both are provided, then the BOM should be interpreted 
as a zero-width non-breaking space i.e. ignored.  Therefore, it would seem 
sensible for defensive programmers to specify *LE/*BE manually, safe in the 
knowledge that any BOM (correct or otherwise) becomes irrelevant - which is 
what I believe Tomas and Simon are suggesting.  Although it is possible I 
misunderstood something...
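
For concreteness, here is a minimal sketch of that defensive approach in R. 
The helper names bom_to_encoding() and read_utf16_delim() are hypothetical 
(they are not part of R), and the little-endian fallback is only an 
assumption. The idea is to peek at the first two bytes with readBin() and then 
always pass an explicit *LE/*BE name, so that nothing depends on how iconv 
treats a BOM under plain "UTF-16".

```r
# Hypothetical helper (not part of R): map a leading UTF-16 BOM to an
# explicit encoding name, or NA if the first two bytes are not a BOM.
bom_to_encoding <- function(bytes) {
  stopifnot(is.raw(bytes))
  if (length(bytes) < 2L) return(NA_character_)
  first_two <- as.integer(bytes[1:2])
  if (identical(first_two, c(255L, 254L))) return("UTF-16LE")  # FF FE
  if (identical(first_two, c(254L, 255L))) return("UTF-16BE")  # FE FF
  NA_character_
}

# Hypothetical wrapper for a local file: inspect the BOM ourselves, then
# hand read.delim() an explicit byte order. The default used when no BOM
# is found is an assumption, not something the standard settles.
read_utf16_delim <- function(path, default = "UTF-16LE", ...) {
  enc <- bom_to_encoding(readBin(path, what = "raw", n = 2L))
  if (is.na(enc)) enc <- default
  read.delim(path, fileEncoding = enc, ...)
}

# The first four bytes of Duncan's file, as shown further down the thread:
bom_to_encoding(as.raw(c(0xff, 0xfe, 0x74, 0x00)))
```

Per the help page quoted further down, R itself strips a leading BOM only for 
"UTF-16LE" and "UCS-2LE", so with big-endian input the BOM may well come 
through as a leading U+FEFF that has to be removed by hand.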

Best wishes,

Matt



On 02/10/2024, 08.54, "R-SIG-Mac on behalf of Jeff Newmiller via R-SIG-Mac" 
<r-sig-mac-boun...@r-project.org on behalf of r-sig-mac@r-project.org> wrote:

[SNIP]

>>I don't find anything inappropriate about the *LE/*BE specifications.


> The Unicode FAQ does. If you specify endian-ness and a BOM is present and 
> these specifications agree then it would seem no harm no foul. The problem is 
> that if they conflict, then there is no clearly correct behavior: if the BOM 
> is valid then the user spec must be incorrectly specified and favoring the 
> user specification forces incorrect decoding. If the BOM is erroneous, then 
> you would want the user to be able to override the incorrect BOM... but these 
> two cases amount to defeating the BOM's purpose... it might as well not be 
> there. So the compliant handling of data with a BOM is for the user to make a 
> standard practice of not specifying endianness _unless they must override an 
> invalid BOM_ (which ought to be highly unusual)... save the sledgehammer for 
> unusual cases, and let the BOM be the "only" specification if it is present. 
> This lets the BOM serve its intended purpose of reducing how often users have 
> to guess.




On October 1, 2024 1:50:25 PM MST, Tomas Kalibera <tomas.kalib...@gmail.com> 
wrote:
>On 10/1/24 15:31, Jeff Newmiller wrote:
>>> This is a problem in macOS libiconv. When converting from "UTF-16" with a 
>>> BOM, it correctly learns the byte-order from the BOM, but later forgets it 
>>> in some cases. This is not a problem in R, but could be worked-around in R.
>> So, buggy system code on one system...
>> 
>>> As Simon wrote, to avoid running into these problems (in released versions 
>>> of R), one should use "UTF-16LE", so explicitly specify the byte-order in 
>>> the encoding name.
>> ... leads to institutionalized non-compliance.
>> 
>>> This is useful also because it is not clear what should be the default when 
>>> no BOM is present and different systems have different defaults.
>> This is nonsense, for reasons previously provided. You are calling a bug a 
>> feature. The BOM is supposed to prevent you from having to know this detail, 
>> and what you do when no BOM is present should have no bearing on this case.
>
>I will try to explain this differently. The handling of BOMs in existing iconv 
>implementations is unreliable (one issue is documented in R documentation, one 
>issue is the one we have run into now). Because it is unreliable, people who 
>want to be defensive and avoid problems are advised to use *LE (or *BE) 
>specifications. The default byte-order when no BOM is present is not 
>reliable, either (defaults differ between systems and the standard is open to 
>interpretation - e.g. my Linux and Windows builds of R default to 
>little-endian, while my macOS build defaults to big-endian). It is thus not 
>advisable to depend on the default order, either, and a defensive solution is 
>again to use *LE or *BE specifications. So, in principle, simply always use 
>*LE or *BE.
>
>This advice is not a feature, it is a work-around that works for two problems: 
>that the byte order for specifications like "UTF-16" is unknown (bug in the 
>standard) and that specifying the byte-order by a BOM is unreliable (bugs in 
>implementations of iconv).
>
>> If Apple is intransigent (which would not be out of character) you could 
>> avoid institutionalized non-compliance at the user level by recognizing the 
>> buggy system and replacing the generic specification with this inappropriate 
>> LE or BE specification as directed by the BOM in the Mac-specific R code.
>
>Yes, indeed, the work-around for the libiconv bug can be implemented in future 
>versions of R and an experimental version is already in R-devel (still subject 
>to change), so that at user level, specifying say "UTF-16" on an input with 
>BOM will correctly use the byte-order of the BOM.
>
>I don't find anything inappropriate about the *LE/*BE specifications.
>
>Best
>Tomas
>
>> 
>> 
>> On October 1, 2024 4:34:41 AM MST, Tomas Kalibera <tomas.kalib...@gmail.com> 
>> wrote:
>>> On 9/9/24 12:53, Tomas Kalibera wrote:
>>>> On 9/9/24 10:53, peter dalgaard wrote:
>>>>> I am confused, and maybe I should just butt out of this, but:
>>>>> 
>>>>> (a) BOM are designed to, um, mark the byte order...
>>>>> 
>>>>> (b) in connections.c we have
>>>>> 
>>>>> if(checkBOM && con->inavail >= 2 &&
>>>>>    ((int)con->iconvbuff[0] & 0xff) == 255 &&
>>>>>    ((int)con->iconvbuff[1] & 0xff) == 254) {
>>>>>     con->inavail -= (short) 2;
>>>>>     memmove(con->iconvbuff, con->iconvbuff+2, con->inavail);
>>>>> }
>>>>> which checks for the two first bytes being FF, FE. However, a big-endian 
>>>>> BOM would be FE, FF and I see no check for that.
>>>> I think this is correct, it is executed only for encodings declared 
>>>> little-endian (UTF-16LE, UCS2-LE) - so, iconv will still know what is the 
>>>> byte-order from the name of the encoding, it will just not see the same 
>>>> information in the BOM.
>>>>> Duncan's file starts
>>>>> 
>>>>>> readBin('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>>>         what="raw", n=10)
>>>>> [1] ff fe 74 00 69 00 6d 00 65 00
>>>>> 
>>>>> so the BOM does indeed indicate little-endian, but apparently we proceed 
>>>>> to discard it and read the file with system (big-)endianness, which 
>>>>> strikes me as just plain wrong...
>>>> I've tested we are not discarding it by the code above and that iconv gets 
>>>> to see the BOM bytes.
>>>>> I see no Mac-specific code for this, only win_iconv.c, so presumably we 
>>>>> have potential issues on everything non-Windows?
>>>> I can reproduce the problem and will have a closer look, it may still be 
>>>> there is a bug in R. We have some work-arounds for recent iconv issues on 
>>>> macOS in sysutils.c.
>>> This is a problem in macOS libiconv. When converting from "UTF-16" with a 
>>> BOM, it correctly learns the byte-order from the BOM, but later forgets it 
>>> in some cases. This is not a problem in R, but could be worked-around in R.
>>> 
>>> As Simon wrote, to avoid running into these problems (in released versions 
>>> of R), one should use "UTF-16LE", so explicitly specify the byte-order in 
>>> the encoding name. This is useful also because it is not clear what should 
>>> be the default when no BOM is present and different systems have different 
>>> defaults.
>>> 
>>> Best
>>> Tomas
>>> 
>>>> Tomas
>>>> 
>>>>> -pd
>>>>> 
>>>>>> On 9 Sep 2024, at 01:11, Simon Urbanek <simon.urba...@r-project.org> 
>>>>>> wrote:
>>>>>> 
>>>>>> From the help page:
>>>>>> 
>>>>>> The encodings ‘"UCS-2LE"’ and ‘"UTF-16LE"’ are treated specially,
>>>>>> as they are appropriate values for Windows ‘Unicode’ text files.
>>>>>> If the first two bytes are the Byte Order Mark ‘0xFEFF’ then these
>>>>>> are removed as some implementations of ‘iconv’ do not accept BOMs.
>>>>>> 
>>>>>> so "UTF-16LE" is the documented way to reliably read such files.
>>>>>> 
>>>>>> Cheers,
>>>>>> Simon
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 8 Sep 2024, at 21:23, Duncan Murdoch <murdoch.dun...@gmail.com> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>> To R-SIG-Mac, with a copy to Jeff Newmiller:
>>>>>>> 
>>>>>>> On R-help there's a thread about reading a remote file that is coded in 
>>>>>>> UTF-16LE with a byte-order mark. Jeff Newmiller pointed out 
>>>>>>> (https://stat.ethz.ch/pipermail/r-help/2024-September/479933.html) 
>>>>>>> that it would be better to declare the encoding as "UTF-16", because 
>>>>>>> the BOM will indicate little endian.
>>>>>>> 
>>>>>>> I tried this on my Mac running R 4.4.1, and it didn't work. I get the 
>>>>>>> same incorrect result from all of these commands:
>>>>>>> 
>>>>>>> # Automatically recognizing a URL and using fileEncoding:
>>>>>>> read.delim(
>>>>>>>   'https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>>>>   fileEncoding = "UTF-16"
>>>>>>> )
>>>>>>> 
>>>>>>> # Using explicit url() with encoding:
>>>>>>> read.delim(
>>>>>>>   url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>>>>       encoding = "UTF-16")
>>>>>>> )
>>>>>>> 
>>>>>>> # Specifying the endianness incorrectly:
>>>>>>> read.delim(
>>>>>>>   url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>>>>       encoding = "UTF-16BE")
>>>>>>> )
>>>>>>> 
>>>>>>> The only way I get the correct result is if I specify "UTF-16LE" 
>>>>>>> explicitly, whereas Jeff got correct results on several different 
>>>>>>> systems using "UTF-16".
>>>>>>> 
>>>>>>> Is this a macOS bug or an R for macOS bug?
>>>>>>> 
>>>>>>> Duncan Murdoch
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> R-SIG-Mac mailing list
>>>>>>> R-SIG-Mac@r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>>>>>>> 




