Re: [Development] utf-8 BOM and parsers

2014-04-23 Thread Knoll Lars
On 22/04/14 16:36, Thiago Macieira thiago.macie...@intel.com wrote:

On Tue 22 Apr 2014, at 12:35:33, Knoll Lars wrote:
 Hi,
 
 Just came back from vacation today.
 
 Unfortunately BOMs at the beginning of files seem to still be used quite
 a bit, esp. in the Windows world. So I would actually vote for option 1 and
 rather keep compatibility. Reason is that stripping the BOM will not break
 anything, but leaving it in will.
 
 We could also consider using our builtin utf8 decoder for all utf8
 locales, so that we don’t use iconv or ICU if the locale is utf-8 (and
 thus always strip the BOM). That would at least give us consistent cross
 platform behaviour.

I'll send the update to the release branch in the next few hours.

Thanks!

Lars

___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development


Re: [Development] utf-8 BOM and parsers

2014-04-23 Thread Thiago Macieira
On Wed 23 Apr 2014, at 08:14:41, Knoll Lars wrote:
 I'll send the update to the release branch in the next few hours.
 
 Thanks!

I will do that today. I spent my Qt time yesterday with the changelog and the 
header diff.

Running the scripts took about 5 seconds each. Editing the changelog took a
lot of time, as did configuring msmtp to send the emails for the header diffs:
new computer, and I forgot to copy the config file from the old one.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center



Re: [Development] utf-8 BOM and parsers

2014-04-23 Thread Thiago Macieira
On Wed 23 Apr 2014, at 07:38:30, Thiago Macieira wrote:
 On Wed 23 Apr 2014, at 08:14:41, Knoll Lars wrote:
  I'll send the update to the release branch in the next few hours.
  
  Thanks!
 
 I will do that today. I spent my Qt time yesterday with the changelog and
 the header diff.
 
 Running the scripts took about 5 seconds each. Editing the changelog took
 a lot of time, as did configuring msmtp to send the emails for the header
 diffs: new computer, and I forgot to copy the config file from the old one.

https://codereview.qt-project.org/83980
https://codereview.qt-project.org/83981

I managed to write the code so that the ASCII case isn't impacted by the 
presence of the BOM. I had been meaning to add that simdDecodeAscii call out 
of the loop for some time, to improve code generation...
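
A rough sketch of how an ASCII fast path can stay unaffected by BOM handling, in standard C++ rather than Qt. This is not the code from the reviews above: decodeAsciiPrefix is a scalar stand-in for a SIMD routine, and both function names are made up for illustration. The idea shown is that the BOM test only runs when the very first byte was not ASCII.

```cpp
#include <cstddef>
#include <string>

// Scalar stand-in for a SIMD ASCII scan (hypothetical helper, not Qt's
// simdDecodeAscii): copies leading bytes below 0x80 and returns the count.
std::size_t decodeAsciiPrefix(const std::string &in, std::u16string &out)
{
    std::size_t i = 0;
    while (i < in.size() && static_cast<unsigned char>(in[i]) < 0x80) {
        out.push_back(static_cast<char16_t>(in[i]));
        ++i;
    }
    return i;
}

// Returns the offset where a general multi-byte decoder would take over.
// The BOM comparison is reached only if no ASCII byte was consumed, so
// input that starts with ASCII never performs it.
std::size_t asciiThenBom(const std::string &in, std::u16string &out)
{
    std::size_t pos = decodeAsciiPrefix(in, out); // called once, outside any loop
    if (pos == 0 && in.compare(0, 3, "\xEF\xBB\xBF") == 0)
        pos = 3; // skip the UTF-8 encoded U+FEFF
    return pos;
}
```

A real decoder would continue from the returned offset with its multi-byte loop; that part is omitted here.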
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center



Re: [Development] utf-8 BOM and parsers

2014-04-22 Thread Thiago Macieira
On Tue 22 Apr 2014, at 12:35:33, Knoll Lars wrote:
 Hi,
 
 Just came back from vacation today.
 
 Unfortunately BOMs at the beginning of files seem to still be used quite
 a bit, esp. in the Windows world. So I would actually vote for option 1 and
 rather keep compatibility. Reason is that stripping the BOM will not break
 anything, but leaving it in will.
 
 We could also consider using our builtin utf8 decoder for all utf8
 locales, so that we don’t use iconv or ICU if the locale is utf-8 (and
 thus always strip the BOM). That would at least give us consistent cross
 platform behaviour.

I'll send the update to the release branch in the next few hours.
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center



Re: [Development] utf-8 BOM and parsers

2014-04-16 Thread Rutledge Shawn

On 14 Apr 2014, at 2:26 PM, Simon Hausmann wrote:

 We have various parsers in Qt that parse source code and do things with it, 
 such as the QML parser…

I was just baffled by this issue this morning for a couple of hours:  tried to 
port some QML code that was working fine with Qt 5.2.1 to 5.3.0 and got a 
nonsense error message that it expected a numeric value at row 1 col 1 in the 
file.  I didn't suspect the BOM until I showed the message to someone who 
happened to have seen this happen before.  So I can say that as of this moment 
we definitely are not compliant with Postel's Law  
(http://en.wikipedia.org/wiki/Robustness_principle): we should ignore the BOM 
rather than being surprised by it, but also not require it.  I suppose others 
have a better idea where to make this change, but if the lower layers insist on 
keeping it, then it will need to mean the parser should ignore it (the first 
part of Postel's Law).  This needs to be fixed before we ship, otherwise it 
will be a show-stopper for someone if they can't figure out what this 
impertinent error message really means.  I never intentionally put that BOM 
there, so it must be that some old version of Creator or some other editor did 
it.  I do agree with the principle that BOMs should be avoided, because UTF-8 
is a good default assumption about the code page if it's otherwise unknown.
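
To illustrate what "the parser should ignore it" could look like, here is a toy lexer step in standard C++. It is only a sketch: Token and firstToken are hypothetical names, and the real QML lexer is structured differently. A U+FEFF at index 0 is skipped without advancing the column, so the first real token is still reported at row 1, col 1.

```cpp
#include <cstddef>
#include <string>

struct Token {
    std::u16string text;
    int line;
    int column;
};

// Hypothetical toy lexer step: returns the first whitespace-delimited token
// and the position where it starts.
Token firstToken(const std::u16string &src)
{
    std::size_t i = 0;
    int line = 1, column = 1;
    if (!src.empty() && src[0] == u'\uFEFF')
        ++i; // leading BOM: ignored, consumes no column
    while (i < src.size() && (src[i] == u' ' || src[i] == u'\t' || src[i] == u'\n')) {
        if (src[i] == u'\n') {
            ++line;
            column = 1;
        } else {
            ++column;
        }
        ++i;
    }
    const std::size_t start = i;
    while (i < src.size() && src[i] != u' ' && src[i] != u'\t' && src[i] != u'\n')
        ++i;
    return Token{src.substr(start, i - start), line, column};
}
```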

Are the parsers still generated by qlalr?  Then maybe the fix could go there.

It might also be a good idea for Creator to strip the BOM, or at least warn 
about it, if it's really an inadvisable thing to ever want in a source file 
(second part of Postel's law).



Re: [Development] utf-8 BOM and parsers

2014-04-16 Thread Friedemann Kleint
Hi,

this is tracked by https://bugreports.qt-project.org/browse/QTBUG-37423 
- keeping an eye on the release blocking bugs 
https://bugreports.qt-project.org/browse/QTBUG-37065 occasionally helps 
to minimize surprises ;-) .

Friedemann

-- 
Friedemann Kleint
Digia, Qt



Re: [Development] utf-8 BOM and parsers

2014-04-16 Thread Thiago Macieira
On Mon 14 Apr 2014, at 10:33:48, Thiago Macieira wrote:
 On Mon 14 Apr 2014, at 09:59:18, Thiago Macieira wrote:
  Also, the Unix philosophy is that UTF-8 BOMs should not be used. This
  started  on Windows, with tools like Notepad, where changing the system
  locale is not an option.
 
 To be clear: BOMs are to be used to determine that the content *is* UTF-8.
 Once you know that it is UTF-8, you can strip it and pass to the decoder.
 Passing the BOM to the decoder sounds wrong because you'd be expecting it to
 choose the codec when decoding. That's what Notepad does: if there's a BOM,
 it decodes as UTF-8; otherwise it decodes as ANSI.
 
 Having the BOM there also breaks roundtrip:
 
   QString bom = u"\ufeff any string goes here";
   QCOMPARE(QString::fromUtf8(bom.toUtf8()), bom);
 
 QString::toUtf8 does not, cannot and will never add the BOM. It would break
 concatenation.
 
 I know this is a behaviour change. But I repeat that it is an *intentional*
 change.
 
 The U+FEFF character is called zero-width non-breaking space (ZWNBSP)
 anywhere else, so it's valid to appear there. Including the next character
 in a file.

Lars, can you make a call?

Options are:
1) revert to old behaviour, change the content creators to never add a BOM

2) same as above, but fix the parsers now and change the behaviour in QString 
in Qt 5.4 or 5.5

3) keep the new behaviour, document it in the changelog, change the content 
creators as above, and fix the parsers

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center



Re: [Development] utf-8 BOM and parsers

2014-04-15 Thread Koehne Kai


 -Original Message-
 From: development-bounces+kai.koehne=digia@qt-project.org
 [mailto:development-bounces+kai.koehne=digia@qt-project.org] On
 Behalf Of Thiago Macieira
 Sent: Monday, April 14, 2014 7:34 PM
 To: development@qt-project.org
 Subject: Re: [Development] utf-8 BOM and parsers

Hi Thiago,

Thanks for listing the reasons here in detail!

 On Mon 14 Apr 2014, at 09:59:18, Thiago Macieira wrote:
  Also, the Unix philosophy is that UTF-8 BOMs should not be used. This
  started  on Windows, with tools like Notepad, where changing the
  system locale is not an option.

It's mostly an issue with (files edited on) Windows, indeed. 

 To be clear: BOMs are to be used to determine that the content *is* UTF-8.
 Once you know that it is UTF-8, you can strip it and pass to the decoder.
 Passing the BOM to the decoder sounds wrong because you'd be expecting
 it to choose the codec when decoding. That's what Notepad does: if there's a
 BOM, it decodes as UTF-8; otherwise it decodes as ANSI.

Right. But the issue is that the 'easiest' way to get a file into a qstring so 
far is

QFile file;
// ...
QString::fromUtf8(file.readAll());

We're using that pattern btw in both Qt and Qt Creator, too. This breaks now in 
ways that can be pretty subtle (given that it only affects files starting with 
a BOM, and that the BOM isn't displayed usually).

 Having the BOM there also breaks roundtrip:
 
   QString bom = u"\ufeff any string goes here";
   QCOMPARE(QString::fromUtf8(bom.toUtf8()), bom);
 
 QString::toUtf8 does not, cannot and will never add the BOM. It would break
 concatenation.

So you'd have to add a BOM explicitly to the file before writing, if you really 
want it.

 I know this is a behaviour change. But I repeat that it is an *intentional*
 change.

 The U+FEFF character is called zero-width non-breaking space (ZWNBSP)
 anywhere else, so it's valid to appear there. Including the next character
 in a file.

Right, though I understood this is deprecated since Unicode 3.2 (released in 
2002).

All in all, I see a lot of code breaking with this change ... Given that, I'd 
like to give a +1 for reverting to the behavior for 5.3 from my side. 

My 2 cents

Kai


Re: [Development] utf-8 BOM and parsers

2014-04-15 Thread Allan Sandfeld Jensen
On Tuesday 15 April 2014, Koehne Kai wrote:
  -Original Message-
  From: development-bounces+kai.koehne=digia@qt-project.org
  [mailto:development-bounces+kai.koehne=digia@qt-project.org] On
  Behalf Of Thiago Macieira
  Sent: Monday, April 14, 2014 7:34 PM
  To: development@qt-project.org
  Subject: Re: [Development] utf-8 BOM and parsers
 
 Hi Thiago,
 
 Thanks for listing the reasons here in detail!
 
  On Mon 14 Apr 2014, at 09:59:18, Thiago Macieira wrote:
   Also, the Unix philosophy is that UTF-8 BOMs should not be used. This
   started  on Windows, with tools like Notepad, where changing the
   system locale is not an option.
 
 It's mostly an issue with (files edited on) Windows, indeed.
 
  To be clear: BOMs are to be used to determine that the content *is*
  UTF-8. Once you know that it is UTF-8, you can strip it and pass to the
  decoder. Passing the BOM to the decoder sounds wrong because you'd be
  expecting it to choose the codec when decoding. That's what Notepad does:
  if there's a BOM, it decodes as UTF-8; otherwise it decodes as ANSI.
 
 Right. But the issue is that the 'easiest' way to get a file into a qstring
 so far is
 
 QFile file;
 // ...
 QString::fromUtf8(file.readAll());
 
 We're using that pattern btw in both Qt and Qt Creator, too. This breaks
 now in ways that can be pretty subtle (given that it only affects files
 starting with a BOM, and that the BOM isn't displayed usually).
 
  Having the BOM there also breaks roundtrip:
  QString bom = u"\ufeff any string goes here";
  QCOMPARE(QString::fromUtf8(bom.toUtf8()), bom);
  
  QString::toUtf8 does not, cannot and will never add the BOM. It would
  break concatenation.
 
 So you'd have to add a BOM explicitly to the file before writing, if you
 really want it.
 
BOM has no official meaning or function in UTF-8 other than as a zero-width
non-breaking space. It was only meant as a byte-order marker in 16- and
32-bit Unicode. If you add it to Unix files it breaks other magic markers at
the beginning of the file. The UTF-8 BOM is a Windows-specific non-standard
hack that is recommended against. So yes, anyone who wants it needs to add it
themselves, as it becomes part of the text content on any other platform.
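
As a small illustration of the magic-marker point, consider the "#!" line of a Unix script. The check below is a simplified stand-in for what the kernel does (startsWithShebang is a made-up name); once three BOM bytes precede it, the file no longer begins with "#!".

```cpp
#include <string>

// Simplified version of the check made on Unix executables: the file must
// begin with the two bytes "#!".
bool startsWithShebang(const std::string &bytes)
{
    return bytes.compare(0, 2, "#!") == 0;
}
```

With "#!/bin/sh" as input this returns true; with the same text prefixed by EF BB BF it returns false.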

`Allan



[Development] utf-8 BOM and parsers

2014-04-14 Thread Simon Hausmann
Hi,

We have various parsers in Qt that parse source code and do things with it, 
such as the QML parser, the CSS parser and others. We do make the assumption 
that their input is UTF-8 encoded and therefore have simply used

QString code = QString::fromUtf8(byteArray);

in some form or other, and then passed the code variable to the lexer. The 
lexers often check for white space using QChar::isSpace() and act 
accordingly.

When the input file started with a byte order mark, previous versions of 
QString::fromUtf8 used to remove that mark and nothing happened.

In Qt 5.3 the behavior was changed and the byte order mark is present in the 
resulting QString, which causes issues in parsers that do not expect that mark 
to appear. (This has been reported by early testers of Qt 5.3 in various 
places in Jira)

Since this affects not just one place but many (and for example we have many 
copies of the QML lexer around), I'd like to determine what the _correct_ fix 
for this issue is, because frankly speaking I don't know :). However I have an 
interest in the same fix being applied to qtbase, qtdeclarative, qtscript, 
qtcreator and other affected modules.

So I have some questions:

1) Should the character be treated as a white-space character? (one that 
doesn't consume any column in the line/column reporting later) If yes, what is 
the right way to fix the parsers? 
  1.1) Should any char.isSpace() condition be extended to check for such 
markers?
  1.2) Or should isSpace() be changed?

2) Alternatively, do we need a function somewhere else in Qt that removes a 
leading byte order mark from the QString and we change all parsers in Qt to 
use that function?

3) I noticed that QString::fromUtf8() differs from QTextCodec in this aspect. 
Is that intentional?
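
For question 2, a helper of that kind could be as small as the following sketch. It uses std::u16string in place of QString, and withoutLeadingBom is a hypothetical name, not an existing Qt function.

```cpp
#include <string>

// Hypothetical helper: drops one leading U+FEFF from already-decoded text.
// A U+FEFF anywhere else is left alone, since there it is a real character.
std::u16string withoutLeadingBom(const std::u16string &text)
{
    if (!text.empty() && text.front() == u'\uFEFF')
        return text.substr(1);
    return text;
}
```

Each parser would call it once on the decoded source before handing the text to its lexer.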


Simon


Re: [Development] utf-8 BOM and parsers

2014-04-14 Thread Frank Osterfeld

On 14 Apr 2014, at 14:26, Simon Hausmann simon.hausm...@digia.com wrote:
 
 Since this affects not just one place but many (and for example we have many 
 copies of the QML lexer around), I'd like to determine what the _correct_ fix 
 for this issue is, because frankly speaking I don't know :). However I have 
 an interest in the same fix being applied to qtbase, qtdeclarative, qtscript, 
 qtcreator and other affected modules.

Even more critical, this behavioural change won’t only affect Qt modules, but 
also a lot of customer code, which cannot be fixed by us.
Which makes me wonder if such a change between 5.2 and 5.3 is acceptable 
at all. Was it intentional or an unintended side-effect? I can’t find any 
discussion about the issue.

 3) I noticed that QString::fromUtf8() differs from QTextCodec in this aspect. 
 Is that intentional?

That inconsistency makes it even more confusing to me.

-- 
Qt Developer Days 2014: October 6 - 8 at BCC, Berlin

Frank Osterfeld | frank.osterf...@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH & Co KG, a KDAB Group company
Tel. Germany +49-30-521325470, Sweden (HQ)  +46-563-540090
KDAB - Qt Experts - Platform-independent software solutions



Re: [Development] utf-8 BOM and parsers

2014-04-14 Thread Thiago Macieira
On Mon 14 Apr 2014, at 15:13:53, Frank Osterfeld wrote:
 On 14 Apr 2014, at 14:26, Simon Hausmann simon.hausm...@digia.com wrote:
  Since this affects not just one place but many (and for example we have
  many copies of the QML lexer around), I'd like to determine what the
  _correct_ fix for this issue is, because frankly speaking I don't know
  :). However I have an interest in the same fix being applied to qtbase,
  qtdeclarative, qtscript, qtcreator and other affected modules.
 
 Even more critical, this behavioural change won’t only affect Qt modules,
 but also a lot of customer code, which cannot be fixed by us. Which makes
 me wonder if such a change between 5.2 and 5.3 is acceptable at all.
 Was it intentional or an unintended side-effect? I can’t find any
 discussion about the issue.

It was intentional as part of the UTF-8 codec rewrite.

  3) I noticed that QString::fromUtf8() differs from QTextCodec in this
  aspect. Is that intentional?
 
 That inconsistency makes it even more confusing to me.

QTextCodec is stateful and allows you to choose, as one of the options, 
whether to ignore the BOM or not. QString::fromUtf8 is stateless.
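
The stateful/stateless distinction can be sketched in standard C++ as follows. This is not QTextCodec's implementation; ChunkedBomFilter and its ignoreBom flag are made-up names for illustration. Because the object remembers whether the start of the stream has been seen, it can drop a BOM only at the very beginning, even when the three bytes arrive in separate chunks. A stateless function sees each call in isolation and cannot tell a first chunk from a later one.

```cpp
#include <string>

// Hypothetical stateful reader: strips a BOM only at the start of the stream.
class ChunkedBomFilter {
public:
    explicit ChunkedBomFilter(bool ignoreBom) : m_ignoreBom(ignoreBom) {}

    // Returns the bytes of this chunk that should go on to the decoder.
    std::string feed(const std::string &chunk)
    {
        if (m_started)
            return chunk;
        m_pending += chunk;
        static const std::string bom = "\xEF\xBB\xBF";
        // Still undecided: everything so far is a proper prefix of the BOM.
        if (m_ignoreBom && m_pending.size() < bom.size()
            && bom.compare(0, m_pending.size(), m_pending) == 0)
            return std::string();
        m_started = true;
        if (m_ignoreBom && m_pending.compare(0, bom.size(), bom) == 0)
            m_pending.erase(0, bom.size());
        std::string out;
        out.swap(m_pending);
        return out;
    }

    // Call at end of input to release bytes held back while undecided.
    std::string finish()
    {
        m_started = true;
        std::string out;
        out.swap(m_pending);
        return out;
    }

private:
    bool m_ignoreBom;
    bool m_started = false;
    std::string m_pending;
};
```

Constructed with ignoreBom set to false, the filter passes every byte through untouched, which mirrors the option of keeping the BOM.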

Anyway, I don't want to change the behaviour back, but if the consensus is 
that it should be done, I'll prepare a patch and send to release.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center



Re: [Development] utf-8 BOM and parsers

2014-04-14 Thread Olivier Goffart
On Monday 14 April 2014 07:14:44 Thiago Macieira wrote:
 On Mon 14 Apr 2014, at 15:13:53, Frank Osterfeld wrote:
  On 14 Apr 2014, at 14:26, Simon Hausmann simon.hausm...@digia.com wrote:
   Since this affects not just one place but many (and for example we have
   many copies of the QML lexer around), I'd like to determine what the
   _correct_ fix for this issue is, because frankly speaking I don't know
   :). However I have an interest in the same fix being applied to qtbase,
   qtdeclarative, qtscript, qtcreator and other affected modules.
  
  Even more critical, this behavioural change won’t only affect Qt modules,
  but also a lot of customer code, which cannot be fixed by us. Which makes
  me wonder if such a change between 5.2 and 5.3 is acceptable at all.
  Was it intentional or an unintended side-effect? I can’t find any
  discussion about the issue.
 
 It was intentional as part of the UTF-8 codec rewrite.
 
   3) I noticed that QString::fromUtf8() differs from QTextCodec in this
   aspect. Is that intentional?
  
  That inconsistency makes it even more confusing to me.
 
 QTextCodec is stateful and allows you to choose, as one of the options,
 whether to ignore the BOM or not. QString::fromUtf8 is stateless.
 
 Anyway, I don't want to change the behaviour back, but if the consensus is
 that it should be done, I'll prepare a patch and send to release.


What were the reasons for changing that behaviour?
Personally, I think it's safer to keep the 5.2 behaviour and avoid breaking 
users' code.

-- 
Olivier 

Woboq - Qt services and support - http://woboq.com - http://code.woboq.org


Re: [Development] utf-8 BOM and parsers

2014-04-14 Thread Thiago Macieira
On Mon 14 Apr 2014, at 09:59:18, Thiago Macieira wrote:
 Also, the Unix philosophy is that UTF-8 BOMs should not be used. This
 started  on Windows, with tools like Notepad, where changing the system
 locale is not an option.

To be clear: BOMs are to be used to determine that the content *is* UTF-8. 
Once you know that it is UTF-8, you can strip it and pass to the decoder. 
Passing the BOM to the decoder sounds wrong because you'd be expecting it to 
choose the codec when decoding. That's what Notepad does: if there's a BOM, it 
decodes as UTF-8; otherwise it decodes as ANSI.
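
The detect-then-strip flow described above can be sketched like this in standard C++. Codec, Detected and sniffEncoding are names invented for the example, and the two-way choice is a simplification of what an editor such as Notepad actually does.

```cpp
#include <string>

enum class Codec { Utf8, Ansi };

struct Detected {
    Codec codec;
    std::string payload; // bytes to hand to the chosen decoder
};

// BOM sniffing as described above: a BOM means UTF-8 and is stripped before
// decoding; anything else is treated as the legacy "ANSI" code page.
Detected sniffEncoding(const std::string &bytes)
{
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0)
        return Detected{Codec::Utf8, bytes.substr(3)};
    return Detected{Codec::Ansi, bytes};
}
```

The decoder itself never sees the BOM: the decision about which codec to use is made one layer above it.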

Having the BOM there also breaks roundtrip:

QString bom = u"\ufeff any string goes here";
QCOMPARE(QString::fromUtf8(bom.toUtf8()), bom);

QString::toUtf8 does not, cannot and will never add the BOM. It would break 
concatenation.
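
The concatenation problem can be shown with a hypothetical encoder that does prepend a BOM (which, as stated, toUtf8 does not do). The sketch uses plain std::string bytes; encodeWithBom and countBoms are made-up names.

```cpp
#include <cstddef>
#include <string>

static const std::string kBom = "\xEF\xBB\xBF";

// Hypothetical encoder that always prepends a BOM.
std::string encodeWithBom(const std::string &utf8Text)
{
    return kBom + utf8Text;
}

// Counts occurrences of the BOM byte sequence at any position.
int countBoms(const std::string &bytes)
{
    int n = 0;
    for (std::size_t pos = bytes.find(kBom); pos != std::string::npos;
         pos = bytes.find(kBom, pos + kBom.size()))
        ++n;
    return n;
}
```

Joining encodeWithBom("foo") and encodeWithBom("bar") yields two BOMs, the second of which sits in the middle of the data, where a decoder has to treat it as an ordinary U+FEFF character.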

I know this is a behaviour change. But I repeat that it is an *intentional* 
change.

The U+FEFF character is called zero-width non-breaking space (ZWNBSP) 
anywhere else, so it's valid to appear there. Including the next character in 
a file.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
