Re: [Development] utf-8 BOM and parsers
On 22/04/14 16:36, Thiago Macieira thiago.macie...@intel.com wrote:

> On Tue 22 Apr 2014, at 12:35:33, Knoll Lars wrote:
>> Hi,
>>
>> Just came back from vacation today. Unfortunately, BOMs at the beginning of files still seem to be used quite a bit, especially in the Windows world. So I would actually vote for option 1 and rather keep compatibility. The reason is that stripping the BOM will not break anything, but leaving it in will.
>>
>> We could also consider using our built-in UTF-8 decoder for all UTF-8 locales, so that we don’t use iconv or ICU if the locale is UTF-8 (and thus always strip the BOM). That would at least give us consistent cross-platform behaviour.
>
> I'll send the update to the release branch in the next few hours.

Thanks!

Lars
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development
Re: [Development] utf-8 BOM and parsers
On Wed 23 Apr 2014, at 08:14:41, Knoll Lars wrote:
>> I'll send the update to the release branch in the next few hours.
>
> Thanks!

I will do that today. I spent my Qt time yesterday on the changelog and the header diff. Running the scripts took about 5 seconds each. Editing the changelog took a lot of time, as did configuring msmtp to send the emails for the header diffs -- new computer, and I forgot to copy the config file from the old one.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center
Re: [Development] utf-8 BOM and parsers
On Wed 23 Apr 2014, at 07:38:30, Thiago Macieira wrote:
> On Wed 23 Apr 2014, at 08:14:41, Knoll Lars wrote:
>>> I'll send the update to the release branch in the next few hours.
>>
>> Thanks!
>
> I will do that today. I spent my Qt time yesterday on the changelog and the header diff. Running the scripts took about 5 seconds each. Editing the changelog took a lot of time, as did configuring msmtp to send the emails for the header diffs -- new computer, and I forgot to copy the config file from the old one.

https://codereview.qt-project.org/83980
https://codereview.qt-project.org/83981

I managed to write the code so that the ASCII case isn't impacted by the presence of the BOM. I had been meaning to move that simdDecodeAscii call out of the loop for some time, to improve code generation...

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center
Re: [Development] utf-8 BOM and parsers
On Tue 22 Apr 2014, at 12:35:33, Knoll Lars wrote:
> Hi,
>
> Just came back from vacation today. Unfortunately, BOMs at the beginning of files still seem to be used quite a bit, especially in the Windows world. So I would actually vote for option 1 and rather keep compatibility. The reason is that stripping the BOM will not break anything, but leaving it in will.
>
> We could also consider using our built-in UTF-8 decoder for all UTF-8 locales, so that we don’t use iconv or ICU if the locale is UTF-8 (and thus always strip the BOM). That would at least give us consistent cross-platform behaviour.

I'll send the update to the release branch in the next few hours.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center
Re: [Development] utf-8 BOM and parsers
On 14 Apr 2014, at 2:26 PM, Simon Hausmann wrote:
> We have various parsers in Qt that parse source code and do things with it, such as the QML parser…

I was baffled by this issue for a couple of hours this morning: I tried to port some QML code that was working fine with Qt 5.2.1 to 5.3.0 and got a nonsense error message saying that it expected a numeric value at row 1 col 1 in the file. I didn't suspect the BOM until I showed the message to someone who happened to have seen this before.

So I can say that as of this moment we are definitely not compliant with Postel's Law (http://en.wikipedia.org/wiki/Robustness_principle): we should ignore the BOM rather than being surprised by it, but also not require it. I suppose others have a better idea of where to make this change, but if the lower layers insist on keeping the BOM, then the parser will need to ignore it (the first part of Postel's Law). This needs to be fixed before we ship, otherwise it will be a show-stopper for anyone who can't figure out what this impertinent error message really means.

I never intentionally put that BOM there, so some old version of Creator or some other editor must have done it. I do agree with the principle that BOMs should be avoided, because UTF-8 is a good default assumption about the code page if it's otherwise unknown.

Are the parsers still generated by qlalr? Then maybe the fix could go there.

It might also be a good idea for Creator to strip the BOM, or at least warn about it, if it's really an inadvisable thing to ever want in a source file (the second part of Postel's Law).
Re: [Development] utf-8 BOM and parsers
Hi,

this is tracked by https://bugreports.qt-project.org/browse/QTBUG-37423 - keeping an eye on the release blocking bugs https://bugreports.qt-project.org/browse/QTBUG-37065 occasionally helps to minimize surprises ;-) .

Friedemann

--
Friedemann Kleint
Digia, Qt
Re: [Development] utf-8 BOM and parsers
On Mon 14 Apr 2014, at 10:33:48, Thiago Macieira wrote:
> On Mon 14 Apr 2014, at 09:59:18, Thiago Macieira wrote:
>> Also, the Unix philosophy is that UTF-8 BOMs should not be used. This started on Windows, with tools like Notepad, where changing the system locale is not an option.
>
> To be clear: BOMs are to be used to determine that the content *is* UTF-8. Once you know that it is UTF-8, you can strip it and pass the rest to the decoder. Passing the BOM to the decoder sounds wrong because you'd be expecting it to choose the codec when decoding.
>
> That's what Notepad does: if there's a BOM, it decodes as UTF-8; otherwise it decodes as ANSI.
>
> Having the BOM there also breaks roundtrip:
>
>     QString bom = u"\ufeff any string goes here";
>     QCOMPARE(QString::fromUtf8(bom.toUtf8()), bom);
>
> QString::toUtf8 does not, cannot and will never add the BOM. It would break concatenation.
>
> I know this is a behaviour change. But I repeat that it is an *intentional* change. The U+FEFF character is called zero-width non-breaking space (ZWNBSP) anywhere else, so it's valid to appear there. Including the next character in a file.

Lars, can you make a call? Options are:

1) revert to the old behaviour, change the content creators to never add a BOM
2) same as above, but fix the parsers now and change the behaviour in QString in Qt 5.4 or 5.5
3) keep the new behaviour, document it in the changelog, change the content creators as above, and fix the parsers

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center
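The concatenation argument quoted above can be sketched in a few lines. This is only an illustration in plain standard C++, with std::string standing in for QByteArray; `encodeWithBom`, `countBoms` and `checkConcatenation` are hypothetical helpers invented for this sketch, not Qt API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Byte sequence EF BB BF: U+FEFF encoded as UTF-8.
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// Hypothetical encoder that always prepends a BOM.
// This is the behaviour the message argues against, shown only for contrast.
std::string encodeWithBom(const std::string &utf8Text)
{
    return kUtf8Bom + utf8Text;
}

// Counts non-overlapping occurrences of the BOM byte sequence in `bytes`.
std::size_t countBoms(const std::string &bytes)
{
    std::size_t count = 0;
    std::size_t pos = bytes.find(kUtf8Bom);
    while (pos != std::string::npos) {
        ++count;
        pos = bytes.find(kUtf8Bom, pos + kUtf8Bom.size());
    }
    return count;
}

// Encoding two pieces separately and joining the bytes is not the same as
// encoding the joined text: the second BOM ends up in the middle of the data.
void checkConcatenation()
{
    const std::string joinedBytes = encodeWithBom("ab") + encodeWithBom("cd");
    assert(joinedBytes != encodeWithBom("abcd"));
    assert(countBoms(joinedBytes) == 2);
}
```

With an encoder that never adds a BOM, the joined bytes and the bytes of the joined text are identical, which is the property the message wants to preserve.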
Re: [Development] utf-8 BOM and parsers
> -----Original Message-----
> From: development-bounces+kai.koehne=digia@qt-project.org [mailto:development-bounces+kai.koehne=digia@qt-project.org] On Behalf Of Thiago Macieira
> Sent: Monday, April 14, 2014 7:34 PM
> To: development@qt-project.org
> Subject: Re: [Development] utf-8 BOM and parsers

Hi Thiago,

Thanks for listing the reasons here in detail!

> On Mon 14 Apr 2014, at 09:59:18, Thiago Macieira wrote:
>> Also, the Unix philosophy is that UTF-8 BOMs should not be used. This started on Windows, with tools like Notepad, where changing the system locale is not an option.

It's mostly an issue with (files edited on) Windows, indeed.

> To be clear: BOMs are to be used to determine that the content *is* UTF-8. Once you know that it is UTF-8, you can strip it and pass the rest to the decoder. Passing the BOM to the decoder sounds wrong because you'd be expecting it to choose the codec when decoding.
>
> That's what Notepad does: if there's a BOM, it decodes as UTF-8; otherwise it decodes as ANSI.

Right. But the issue is that the 'easiest' way to get a file into a QString so far is

    QFile file;
    // ...
    QString::fromUtf8(file.readAll());

We're using that pattern, by the way, in both Qt and Qt Creator, too. This now breaks in ways that can be pretty subtle (given that it only affects files starting with a BOM, and that the BOM usually isn't displayed).

> Having the BOM there also breaks roundtrip:
>
>     QString bom = u"\ufeff any string goes here";
>     QCOMPARE(QString::fromUtf8(bom.toUtf8()), bom);
>
> QString::toUtf8 does not, cannot and will never add the BOM. It would break concatenation.

So you'd have to add a BOM explicitly to the file before writing, if you really want it.

> I know this is a behaviour change. But I repeat that it is an *intentional* change. The U+FEFF character is called zero-width non-breaking space (ZWNBSP) anywhere else, so it's valid to appear there. Including the next character in a file.

Right, though I understood this use has been deprecated since Unicode 3.2 (released in 2002).

All in all, I see a lot of code breaking with this change ... Given that, I'd like to give a +1 from my side for reverting to the old behavior for 5.3.

My 2 cents

Kai
Re: [Development] utf-8 BOM and parsers
On Tuesday 15 April 2014, Koehne Kai wrote:
>> -----Original Message-----
>> From: development-bounces+kai.koehne=digia@qt-project.org [mailto:development-bounces+kai.koehne=digia@qt-project.org] On Behalf Of Thiago Macieira
>> Sent: Monday, April 14, 2014 7:34 PM
>> To: development@qt-project.org
>> Subject: Re: [Development] utf-8 BOM and parsers
>
> Hi Thiago,
>
> Thanks for listing the reasons here in detail!
>
>> On Mon 14 Apr 2014, at 09:59:18, Thiago Macieira wrote:
>>> Also, the Unix philosophy is that UTF-8 BOMs should not be used. This started on Windows, with tools like Notepad, where changing the system locale is not an option.
>
> It's mostly an issue with (files edited on) Windows, indeed.
>
>> To be clear: BOMs are to be used to determine that the content *is* UTF-8. Once you know that it is UTF-8, you can strip it and pass the rest to the decoder. Passing the BOM to the decoder sounds wrong because you'd be expecting it to choose the codec when decoding.
>>
>> That's what Notepad does: if there's a BOM, it decodes as UTF-8; otherwise it decodes as ANSI.
>
> Right. But the issue is that the 'easiest' way to get a file into a QString so far is
>
>     QFile file;
>     // ...
>     QString::fromUtf8(file.readAll());
>
> We're using that pattern, by the way, in both Qt and Qt Creator, too. This now breaks in ways that can be pretty subtle (given that it only affects files starting with a BOM, and that the BOM usually isn't displayed).
>
>> Having the BOM there also breaks roundtrip:
>>
>>     QString bom = u"\ufeff any string goes here";
>>     QCOMPARE(QString::fromUtf8(bom.toUtf8()), bom);
>>
>> QString::toUtf8 does not, cannot and will never add the BOM. It would break concatenation.
>
> So you'd have to add a BOM explicitly to the file before writing, if you really want it.

In UTF-8, the BOM has no official meaning or function other than as a zero-width non-breaking space. It was only meant as a byte-order marker in 16- and 32-bit Unicode. If you add it to Unix files, it breaks other magic markers at the beginning of the file.

The UTF-8 BOM is a Windows-specific, non-standard hack that is recommended against. So yes, anyone who wants it needs to add it themselves, as it becomes part of the text content on any other platform.

Allan
[Development] utf-8 BOM and parsers
Hi,

We have various parsers in Qt that parse source code and do things with it, such as the QML parser, the CSS parser and others. We make the assumption that their input is UTF-8 encoded and have therefore simply used

    QString code = QString::fromUtf8(byteArray);

in some form or other, and then passed the code variable to the lexer. The lexers often check for white space using QChar::isSpace() and act accordingly.

When the input file started with a byte order mark, previous versions of QString::fromUtf8 removed that mark and nothing happened. In Qt 5.3 the behavior was changed and the byte order mark is present in the resulting QString, which causes issues in parsers that do not expect that mark to appear. (This has been reported by early testers of Qt 5.3 in various places in Jira.)

Since this affects not just one place but many (for example, we have many copies of the QML lexer around), I'd like to determine what the _correct_ fix for this issue is, because frankly speaking I don't know :). However, I have an interest in the same fix being applied to qtbase, qtdeclarative, qtscript, qtcreator and other affected modules.

So I have some questions:

1) Should the character be treated as a white-space character (one that doesn't consume any column in the line/column reporting later)? If yes, what is the right way to fix the parsers?
   1.1) Should any char.isSpace() condition be extended to check for such markers?
   1.2) Or should isSpace() be changed?

2) Alternatively, do we need a function somewhere else in Qt that removes a leading byte order mark from the QString, and do we change all parsers in Qt to use that function?

3) I noticed that QString::fromUtf8() differs from QTextCodec in this aspect. Is that intentional?

Simon
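The function asked about in question 2 could look roughly like the helper below. This is a minimal sketch in plain standard C++, with std::u16string standing in for QString; `withoutLeadingBom` and `kByteOrderMark` are hypothetical names, not existing Qt functions. It removes at most one U+FEFF, and only when it is the very first character, so a zero-width non-breaking space elsewhere in the text is left alone.

```cpp
#include <string>

// U+FEFF, used as a byte order mark when it is the first character of a text.
constexpr char16_t kByteOrderMark = u'\uFEFF';

// Returns `text` without a byte order mark at position 0.
// Any later U+FEFF is kept, since there it is ordinary content.
std::u16string withoutLeadingBom(std::u16string text)
{
    if (!text.empty() && text.front() == kByteOrderMark)
        text.erase(0, 1);
    return text;
}
```

A parser entry point would call such a helper once on the decoded source before handing it to the lexer, so the isSpace() checks inside the lexer would not need to change.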
Re: [Development] utf-8 BOM and parsers
On 14 Apr 2014, at 14:26, Simon Hausmann simon.hausm...@digia.com wrote:
> Since this affects not just one place but many (for example, we have many copies of the QML lexer around), I'd like to determine what the _correct_ fix for this issue is, because frankly speaking I don't know :). However, I have an interest in the same fix being applied to qtbase, qtdeclarative, qtscript, qtcreator and other affected modules.

Even more critically, this behavioural change won’t only affect Qt modules, but also a lot of customer code, which we cannot fix. That makes me wonder whether such a big change between 5.2 and 5.3 is acceptable at all. Was it intentional or an unintended side effect? I can’t find any discussion about the issue.

> 3) I noticed that QString::fromUtf8() differs from QTextCodec in this aspect. Is that intentional?

That inconsistency makes it even more confusing to me.

--
Qt Developer Days 2014: October 6 - 8 at BCC, Berlin
Frank Osterfeld | frank.osterf...@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH & Co KG, a KDAB Group company
Tel. Germany +49-30-521325470, Sweden (HQ) +46-563-540090
KDAB - Qt Experts - Platform-independent software solutions
Re: [Development] utf-8 BOM and parsers
On Mon 14 Apr 2014, at 15:13:53, Frank Osterfeld wrote:
> On 14 Apr 2014, at 14:26, Simon Hausmann simon.hausm...@digia.com wrote:
>> Since this affects not just one place but many (for example, we have many copies of the QML lexer around), I'd like to determine what the _correct_ fix for this issue is, because frankly speaking I don't know :). However, I have an interest in the same fix being applied to qtbase, qtdeclarative, qtscript, qtcreator and other affected modules.
>
> Even more critically, this behavioural change won’t only affect Qt modules, but also a lot of customer code, which we cannot fix. That makes me wonder whether such a big change between 5.2 and 5.3 is acceptable at all. Was it intentional or an unintended side effect? I can’t find any discussion about the issue.

It was intentional, as part of the UTF-8 codec rewrite.

>> 3) I noticed that QString::fromUtf8() differs from QTextCodec in this aspect. Is that intentional?
>
> That inconsistency makes it even more confusing to me.

QTextCodec is stateful and allows you to choose, as one of the options, whether to ignore the BOM or not. QString::fromUtf8 is stateless.

Anyway, I don't want to change the behaviour back, but if the consensus is that it should be done, I'll prepare a patch and send it to the release branch.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center
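The stateful/stateless distinction above can be illustrated with a small sketch. It is written in plain standard C++ and does not reproduce how QTextCodec or QString::fromUtf8 are actually implemented; `stripBomStateless` and `BomAwareStream` are hypothetical names, and std::string stands in for raw UTF-8 bytes. The stateful class remembers whether the start of the stream has been passed and has an option for keeping the BOM, which a one-shot function cannot offer without extra parameters.

```cpp
#include <string>

// Stateless variant: looks only at the bytes it is given, so it would also
// strip a U+FEFF that merely happens to start a later chunk of the same stream.
std::string stripBomStateless(const std::string &bytes)
{
    static const std::string bom = "\xEF\xBB\xBF";
    if (bytes.compare(0, bom.size(), bom) == 0)
        return bytes.substr(bom.size());
    return bytes;
}

// Stateful variant (hypothetical class): strips the BOM only at the very
// start of the stream, even if it is split across chunks, and only if asked to.
class BomAwareStream
{
public:
    explicit BomAwareStream(bool ignoreBom = true) : m_ignoreBom(ignoreBom) {}

    // Accepts the next chunk and returns the bytes to pass on to the decoder.
    std::string feed(const std::string &chunk)
    {
        if (m_started || !m_ignoreBom)
            return chunk;
        static const std::string bom = "\xEF\xBB\xBF";
        m_pending += chunk;
        const bool tooShortToTell = m_pending.size() < bom.size()
                && bom.compare(0, m_pending.size(), m_pending) == 0;
        if (tooShortToTell)
            return std::string(); // could still be a split BOM: wait for more
        m_started = true;
        std::string out = stripBomStateless(m_pending);
        m_pending.clear();
        return out;
    }

    // Call at end of input to release bytes held back while waiting.
    std::string finish()
    {
        m_started = true;
        std::string out;
        out.swap(m_pending);
        return out;
    }

private:
    bool m_ignoreBom;
    bool m_started = false;
    std::string m_pending;
};
```

Feeding the bytes EF BB and then BF followed by text yields only the text, while a BOM sequence arriving in a later chunk passes through unchanged as content.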
Re: [Development] utf-8 BOM and parsers
On Monday 14 April 2014 07:14:44 Thiago Macieira wrote:
> On Mon 14 Apr 2014, at 15:13:53, Frank Osterfeld wrote:
>> On 14 Apr 2014, at 14:26, Simon Hausmann simon.hausm...@digia.com wrote:
>>> Since this affects not just one place but many (for example, we have many copies of the QML lexer around), I'd like to determine what the _correct_ fix for this issue is, because frankly speaking I don't know :). However, I have an interest in the same fix being applied to qtbase, qtdeclarative, qtscript, qtcreator and other affected modules.
>>
>> Even more critically, this behavioural change won’t only affect Qt modules, but also a lot of customer code, which we cannot fix. That makes me wonder whether such a big change between 5.2 and 5.3 is acceptable at all. Was it intentional or an unintended side effect? I can’t find any discussion about the issue.
>
> It was intentional, as part of the UTF-8 codec rewrite.
>
>>> 3) I noticed that QString::fromUtf8() differs from QTextCodec in this aspect. Is that intentional?
>>
>> That inconsistency makes it even more confusing to me.
>
> QTextCodec is stateful and allows you to choose, as one of the options, whether to ignore the BOM or not. QString::fromUtf8 is stateless.
>
> Anyway, I don't want to change the behaviour back, but if the consensus is that it should be done, I'll prepare a patch and send it to the release branch.

What was the reason for changing that behaviour?

Personally, I think it's safer to keep the 5.2 behaviour and avoid breaking users' code.

--
Olivier
Woboq - Qt services and support - http://woboq.com - http://code.woboq.org
Re: [Development] utf-8 BOM and parsers
On Mon 14 Apr 2014, at 09:59:18, Thiago Macieira wrote:
> Also, the Unix philosophy is that UTF-8 BOMs should not be used. This started on Windows, with tools like Notepad, where changing the system locale is not an option.

To be clear: BOMs are to be used to determine that the content *is* UTF-8. Once you know that it is UTF-8, you can strip it and pass the rest to the decoder. Passing the BOM to the decoder sounds wrong because you'd be expecting it to choose the codec when decoding.

That's what Notepad does: if there's a BOM, it decodes as UTF-8; otherwise it decodes as ANSI.

Having the BOM there also breaks roundtrip:

    QString bom = u"\ufeff any string goes here";
    QCOMPARE(QString::fromUtf8(bom.toUtf8()), bom);

QString::toUtf8 does not, cannot and will never add the BOM. It would break concatenation.

I know this is a behaviour change. But I repeat that it is an *intentional* change. The U+FEFF character is called zero-width non-breaking space (ZWNBSP) anywhere else, so it's valid to appear there. Including the next character in a file.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center
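The "detect, strip, then decode" order described above can be sketched as follows. This is an illustration in plain standard C++ under the assumption that only two candidate encodings exist, as in the Notepad description; `GuessedEncoding`, `SniffResult` and `sniffAndStrip` are hypothetical names, and the actual decoding step is left out.

```cpp
#include <string>

enum class GuessedEncoding { Utf8, Ansi };

struct SniffResult {
    GuessedEncoding encoding; // what the leading bytes suggest
    std::string payload;      // bytes to hand to the chosen decoder
};

// Uses the BOM only to pick the codec; the decoder itself never sees it.
SniffResult sniffAndStrip(const std::string &raw)
{
    static const std::string bom = "\xEF\xBB\xBF"; // U+FEFF encoded as UTF-8
    if (raw.compare(0, bom.size(), bom) == 0)
        return { GuessedEncoding::Utf8, raw.substr(bom.size()) };
    return { GuessedEncoding::Ansi, raw };
}
```

In this arrangement the choice of codec happens before decoding, so the decoder does not need any BOM handling of its own.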