Re: [Development] utf-8 BOM and parsers

Thiago Macieira Mon, 14 Apr 2014 10:34:44 -0700

On Mon 14 Apr 2014, at 09:59:18, Thiago Macieira wrote:
> Also, the Unix philosophy is that UTF-8 BOMs should not be used. This
> started on Windows, with tools like Notepad, where changing the system
> locale is not an option.


To be clear: BOMs are to be used to determine that the content *is* UTF-8. 
Once you know that it is UTF-8, you can strip the BOM and pass the rest to the 
decoder. Passing the BOM to the decoder sounds wrong because you'd be expecting 
it to choose the codec when decoding. That's what Notepad does: if there's a 
BOM, it decodes as UTF-8; otherwise it decodes as ANSI.
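
To illustrate the "detect, strip, then decode" order described above, here is a 
minimal sketch in plain C++. The helper names hasUtf8Bom and stripUtf8Bom are 
hypothetical and are not Qt API; the only fact assumed is that U+FEFF encodes 
in UTF-8 as the bytes EF BB BF.

```cpp
#include <string>

// Hypothetical helper (not a Qt API): true if the buffer starts with the
// UTF-8 encoding of U+FEFF, i.e. the bytes EF BB BF.
bool hasUtf8Bom(const std::string &bytes)
{
    return bytes.size() >= 3
        && static_cast<unsigned char>(bytes[0]) == 0xEF
        && static_cast<unsigned char>(bytes[1]) == 0xBB
        && static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Hypothetical helper: returns the payload to hand to the UTF-8 decoder.
// Only a leading BOM is dropped; any later U+FEFF is left in place as data.
std::string stripUtf8Bom(const std::string &bytes)
{
    return hasUtf8Bom(bytes) ? bytes.substr(3) : bytes;
}
```

The caller decides the codec from hasUtf8Bom() and only then decodes the 
stripped payload, so the decoder itself never has to interpret the BOM.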

Having the BOM there also breaks roundtrip:

        QString bom = u"\ufeff" "any string goes here";
        QCOMPARE(QString::fromUtf8(bom.toUtf8()), bom);

QString::toUtf8 does not, cannot and will never add the BOM. It would break 
concatenation.
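
A sketch of why adding a BOM on encode would break concatenation, using plain 
std::string as a stand-in for the encoded bytes. encodeWithBom and countBoms are 
hypothetical names made up for this illustration; this is explicitly NOT what 
QString::toUtf8 does.

```cpp
#include <cstddef>
#include <string>

// UTF-8 encoding of U+FEFF.
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// Hypothetical encoder that prepends a BOM to every result.
// QString::toUtf8 does not behave like this.
std::string encodeWithBom(const std::string &utf8Payload)
{
    return kUtf8Bom + utf8Payload;
}

// Counts how many times the BOM byte sequence occurs in the buffer.
std::size_t countBoms(const std::string &bytes)
{
    std::size_t count = 0;
    std::size_t pos = bytes.find(kUtf8Bom);
    while (pos != std::string::npos) {
        ++count;
        pos = bytes.find(kUtf8Bom, pos + kUtf8Bom.size());
    }
    return count;
}
```

With such an encoder, encodeWithBom("foo") + encodeWithBom("bar") contains two 
BOM sequences, and the second one sits in the middle of the data, where a 
decoder has to treat it as an ordinary U+FEFF character.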

I know this is a behaviour change. But I repeat that it is an *intentional* 
change.

Anywhere other than the start of the stream, the U+FEFF character is the 
"zero width no-break space" (ZWNBSP), so it's valid for it to appear there. 
That includes the very next character in a file, right after the BOM.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center

