[Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value

2015-06-19 Thread redi at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

--- Comment #7 from Jonathan Wakely <redi at gcc dot gnu.org> ---
No, it will be in 5.1.1-4.


[Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value

2015-06-18 Thread lcarreon at bigpond dot net.au
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

--- Comment #6 from Leo Carreon <lcarreon at bigpond dot net.au> ---
Has this fix been included in the recent gcc-5.1.1-3 update on Fedora 22?


[Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value

2015-06-12 Thread redi at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

--- Comment #3 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Author: redi
Date: Fri Jun 12 10:26:05 2015
New Revision: 224415

URL: https://gcc.gnu.org/viewcvs?rev=224415&root=gcc&view=rev
Log:
PR libstdc++/66464
* src/c++11/codecvt.cc (__codecvt_utf16_base<char32_t>::do_max_length):
Return 4 not 3.

Modified:
trunk/libstdc++-v3/ChangeLog
trunk/libstdc++-v3/src/c++11/codecvt.cc


[Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value

2015-06-12 Thread redi at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

Jonathan Wakely <redi at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
   Target Milestone|---                         |5.2

--- Comment #5 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Leo Carreon from comment #2)
> Just clarifying that my comments are to do with codecvt_utf16<char32_t> and
> codecvt_utf8<char32_t>.
>
> The way I understand it, codecvt_utf16<char32_t> should be converting
> between UTF-16 and UCS-4.  UTF-16 uses 2 bytes for characters in the BMP
> (characters in the range 0x0000 to 0xFFFF) and 4 bytes (surrogate pairs) for
> characters above the BMP (0x010000 to 0x10FFFF).  UCS-4 uses 4 byte values.
> Therefore, codecvt_utf16<char32_t>::max_length() should be returning 4 if
> the BOM is not taken into account.

Yes, that's now fixed.

> codecvt_utf8<char32_t> converts between UTF-8 and UCS-4.  UTF-8 can use up
> to 4 bytes for characters up to the range 0x10FFFF.  Therefore,
> codecvt_utf8<char32_t>::max_length() should be returning 4 if the BOM is not
> taken into account.
>
> As I said in my previous post, I'm not sure if the BOM should be accounted
> for in max_length().

I've raised that question with the C++ committee.

> If I'm not mistaken, the purpose of this function is
> to allow a user to estimate how many bytes are required to fit a UCS-4
> string when converted to either UTF-16 or UTF-8.  And my guess, the BOM can
> be taken into account separately when doing the estimation.  For example,
> when wstring_convert estimates the length of the std::string to be generated
> by wstring_convert::to_bytes().  It should be the number of UCS-4 characters
> multiplied by max_length() and then add the size of the BOM if required.
> The resulting std::string can be resized after the conversion to eliminate
> the unused bytes.

I believe that's the usual use case for max_length, and agree it's better to
calculate N * max_length() + length(BOM) rather than have max_length() include
the BOM. However, the way max_length() is specified in the standard does suggest
it should be including the BOM. We'll discuss it in the committee and process
it as a defect report against the standard if necessary.

> Note that the comment you mentioned in your reply probably only applies to
> codecvt_utf8_utf16 which converts between UTF-8 and UTF-16 directly without
> going thru the UCS-4 conversion.

Agreed.
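
The N * max_length() + length(BOM) estimate discussed in this comment can be
illustrated with a small, library-free sketch. This is not how
wstring_convert::to_bytes() is implemented; to_utf8, kMaxLength and kBomLength
are hypothetical names chosen for the example, and the input is assumed to
contain only valid Unicode scalar values.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins for max_length() and the size of the UTF-8 BOM.
constexpr std::size_t kMaxLength = 4;  // longest UTF-8 sequence for one code point
constexpr std::size_t kBomLength = 3;  // EF BB BF

// Illustrative UCS-4 -> UTF-8 conversion using "allocate worst case, then shrink".
// Assumes every element of `in` is a valid Unicode scalar value.
std::string to_utf8(const std::u32string& in, bool with_bom)
{
  // Worst-case size: N * max_length() + length(BOM).
  std::string out(in.size() * kMaxLength + (with_bom ? kBomLength : 0), '\0');
  std::size_t n = 0;
  auto put = [&](char32_t b) {
    out[n++] = static_cast<char>(static_cast<unsigned char>(b));
  };

  if (with_bom) { put(0xEF); put(0xBB); put(0xBF); }

  for (char32_t cp : in) {
    if (cp < 0x80) {
      put(cp);
    } else if (cp < 0x800) {
      put(0xC0 | (cp >> 6));
      put(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      put(0xE0 | (cp >> 12));
      put(0x80 | ((cp >> 6) & 0x3F));
      put(0x80 | (cp & 0x3F));
    } else {
      put(0xF0 | (cp >> 18));
      put(0x80 | ((cp >> 12) & 0x3F));
      put(0x80 | ((cp >> 6) & 0x3F));
      put(0x80 | (cp & 0x3F));
    }
  }

  out.resize(n);  // drop the bytes the estimate over-allocated
  return out;
}
```

Keeping the BOM as a separate additive term means the per-character bound
stays tight: a one-character string with a BOM needs at most 4 + 3 = 7 bytes,
while N characters need at most 4 * N + 3 rather than 7 * N.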


[Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value

2015-06-12 Thread redi at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

--- Comment #4 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Author: redi
Date: Fri Jun 12 11:22:01 2015
New Revision: 224417

URL: https://gcc.gnu.org/viewcvs?rev=224417&root=gcc&view=rev
Log:
PR libstdc++/66464
* src/c++11/codecvt.cc (__codecvt_utf16_base<char32_t>::do_max_length):
Return 4 not 3.

Modified:
branches/gcc-5-branch/libstdc++-v3/ChangeLog
branches/gcc-5-branch/libstdc++-v3/src/c++11/codecvt.cc


[Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value

2015-06-09 Thread redi at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

Jonathan Wakely <redi at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2015-06-09
           Assignee|unassigned at gcc dot gnu.org |redi at gcc dot gnu.org
     Ever confirmed|0                             |1

--- Comment #1 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Leo Carreon from comment #0)
> I just noticed that codecvt_utf16<char32_t>::max_length() is returning 3.
>
> This appears to be the wrong value because a surrogate pair is composed of 4
> bytes therefore max_length() should at least be returning 4.

Agreed, I think that's just a mistake.

I wrote this comment in the code:

int
codecvt<char16_t, char, mbstate_t>::do_max_length() const throw()
{
  // Any valid UTF-8 sequence of 3 bytes fits in a single 16-bit code unit,
  // whereas 4 byte sequences require two 16-bit code units.
  return 3;
}

But that reasoning (even if it's correct!) doesn't apply to
codecvt_utf16<char32_t>.

> I'm also wondering whether the BOM should be taken into account.  If it so
> happens that at the beginning of a UTF-16 string which has a BOM and it so
> happens to start with a surrogate pair, 6 bytes have to be consumed to
> generate a single UCS-4 character.
>
> Should the same thing be considered with
> codecvt_utf8<char32_t>::max_length() which currently returns 4.  Taking into
> account the BOM and the longest UTF-8 character below 0x10FFFF, shouldn't
> max_length() return 7.
>
> I'm not really sure if the BOM should be taken into account because the
> standard's definition for do_max_length() simply says the maximum number of
> input characters that needs to be consumed to generate a single output
> character.

That's a very good question.
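
To make the 6-byte case from the quoted question concrete, here is a minimal
sketch that encodes one code point as big-endian UTF-16 by hand and lets you
count the bytes. encode_utf16be is a hypothetical helper written for this
illustration, not part of libstdc++, and it assumes the input is a valid
Unicode scalar value.

```cpp
#include <cstddef>
#include <string>

// Encode a single code point as big-endian UTF-16, optionally preceded by a
// BOM (U+FEFF). Assumes cp is a valid Unicode scalar value.
std::string encode_utf16be(char32_t cp, bool with_bom)
{
  std::string out;
  // Append one 16-bit code unit, most significant byte first.
  auto put16 = [&out](char32_t u) {
    out.push_back(static_cast<char>(static_cast<unsigned char>((u >> 8) & 0xFF)));
    out.push_back(static_cast<char>(static_cast<unsigned char>(u & 0xFF)));
  };

  if (with_bom)
    put16(0xFEFF);                  // 2 bytes

  if (cp < 0x10000) {
    put16(cp);                      // BMP: one code unit, 2 bytes
  } else {
    const char32_t v = cp - 0x10000;
    put16(0xD800 + (v >> 10));      // high surrogate
    put16(0xDC00 + (v & 0x3FF));    // low surrogate, 4 bytes in total
  }
  return out;
}
```

With a BOM, encode_utf16be(0x10000, true) produces 2 + 4 = 6 bytes for a
single UCS-4 character. The UTF-8 case is analogous: a 3-byte BOM (EF BB BF)
followed by a 4-byte sequence gives the 7 mentioned in the question.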


[Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value

2015-06-09 Thread lcarreon at bigpond dot net.au
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

--- Comment #2 from Leo Carreon <lcarreon at bigpond dot net.au> ---
Just clarifying that my comments are to do with codecvt_utf16<char32_t> and
codecvt_utf8<char32_t>.

The way I understand it, codecvt_utf16<char32_t> should be converting between
UTF-16 and UCS-4.  UTF-16 uses 2 bytes for characters in the BMP (characters in
the range 0x0000 to 0xFFFF) and 4 bytes (surrogate pairs) for characters above
the BMP (0x010000 to 0x10FFFF).  UCS-4 uses 4 byte values.  Therefore,
codecvt_utf16<char32_t>::max_length() should be returning 4 if the BOM is not
taken into account.

codecvt_utf8<char32_t> converts between UTF-8 and UCS-4.  UTF-8 can use up to 4
bytes for characters up to the range 0x10FFFF.  Therefore,
codecvt_utf8<char32_t>::max_length() should be returning 4 if the BOM is not
taken into account.

As I said in my previous post, I'm not sure if the BOM should be accounted for
in max_length().  If I'm not mistaken, the purpose of this function is to allow
a user to estimate how many bytes are required to fit a UCS-4 string when
converted to either UTF-16 or UTF-8.  And my guess, the BOM can be taken into
account separately when doing the estimation.  For example, when
wstring_convert estimates the length of the std::string to be generated by
wstring_convert::to_bytes().  It should be the number of UCS-4 characters
multiplied by max_length() and then add the size of the BOM if required.  The
resulting std::string can be resized after the conversion to eliminate the
unused bytes.

Note that the comment you mentioned in your reply probably only applies to
codecvt_utf8_utf16 which converts between UTF-8 and UTF-16 directly without
going thru the UCS-4 conversion.
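
The per-character byte counts described in this comment can be written down
as two small functions. This is only a sketch of the encoding rules, not code
from libstdc++; utf8_length and utf16_length are hypothetical names, and both
assume a valid Unicode scalar value as input.

```cpp
#include <cstddef>

// Bytes needed to encode one code point in UTF-8.
// Assumes cp is a valid Unicode scalar value (at most 0x10FFFF).
constexpr std::size_t utf8_length(char32_t cp)
{
  return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed to encode one code point in UTF-16:
// one 16-bit code unit in the BMP, a surrogate pair above it.
constexpr std::size_t utf16_length(char32_t cp)
{
  return cp < 0x10000 ? 2 : 4;
}

// The maxima over the whole code space are what max_length() would report
// if the BOM is not taken into account.
static_assert(utf16_length(0xFFFF) == 2, "BMP: one 16-bit code unit");
static_assert(utf16_length(0x10000) == 4, "above the BMP: surrogate pair");
static_assert(utf16_length(0x10FFFF) == 4, "largest code point, UTF-16");
static_assert(utf8_length(0x10FFFF) == 4, "largest code point, UTF-8");
```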