[Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464 --- Comment #7 from Jonathan Wakely redi at gcc dot gnu.org --- No, it will be in 5.1.1-4
[Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464 --- Comment #6 from Leo Carreon lcarreon at bigpond dot net.au --- Has this fix been included in the recent gcc-5.1.1-3 update on Fedora 22?
[Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464 --- Comment #3 from Jonathan Wakely redi at gcc dot gnu.org ---
Author: redi
Date: Fri Jun 12 10:26:05 2015
New Revision: 224415

URL: https://gcc.gnu.org/viewcvs?rev=224415&root=gcc&view=rev
Log:
PR libstdc++/66464
* src/c++11/codecvt.cc (codecvt_utf16_base<char32_t>::do_max_length):
Return 4 not 3.

Modified:
    trunk/libstdc++-v3/ChangeLog
    trunk/libstdc++-v3/src/c++11/codecvt.cc
[Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

Jonathan Wakely redi at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
   Target Milestone|---                         |5.2

--- Comment #5 from Jonathan Wakely redi at gcc dot gnu.org ---
(In reply to Leo Carreon from comment #2)
> Just clarifying that my comments are to do with codecvt_utf16<char32_t> and
> codecvt_utf8<char32_t>.
>
> The way I understand it, codecvt_utf16<char32_t> should be converting between
> UTF-16 and UCS-4. UTF-16 uses 2 bytes for characters in the BMP (characters
> in the range 0x0000 to 0xFFFF) and 4 bytes (surrogate pairs) for characters
> above the BMP (0x010000 to 0x10FFFF). UCS-4 uses 4 byte values. Therefore,
> codecvt_utf16<char32_t>::max_length() should be returning 4 if the BOM is not
> taken into account.

Yes, that's now fixed.

> codecvt_utf8<char32_t> converts between UTF-8 and UCS-4. UTF-8 can use up to
> 4 bytes for characters up to 0x10FFFF. Therefore,
> codecvt_utf8<char32_t>::max_length() should be returning 4 if the BOM is not
> taken into account.
>
> As I said in my previous post, I'm not sure if the BOM should be accounted
> for in max_length().

I've raised that question with the C++ committee.

> If I'm not mistaken, the purpose of this function is to allow a user to
> estimate how many bytes are required to fit a UCS-4 string when converted to
> either UTF-16 or UTF-8. My guess is that the BOM can be taken into account
> separately when doing the estimation. For example, when wstring_convert
> estimates the length of the std::string to be generated by
> wstring_convert::to_bytes(), it should be the number of UCS-4 characters
> multiplied by max_length(), plus the size of the BOM if required. The
> resulting std::string can be resized after the conversion to eliminate the
> unused bytes.

I believe that's the usual use case for max_length, and agree it's better to
calculate N * max_length() + length(BOM) rather than have max_length() include
the BOM. However, the way max_length() is specified in the standard does
suggest it should include the BOM. We'll discuss it in the committee and
process it as a defect report against the standard if necessary.

> Note that the comment you mentioned in your reply probably only applies to
> codecvt_utf8_utf16, which converts between UTF-8 and UTF-16 directly without
> going through the UCS-4 conversion.

Agreed.
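The N * max_length() + length(BOM) estimate discussed above can be sketched as follows. This is only an illustration under stated assumptions, not how wstring_convert is implemented: the helper name to_utf16_bytes is hypothetical, the BOM size is hard-coded to 2 bytes for UTF-16, and the facet is asked to emit a BOM via std::generate_header. Note that the <codecvt> facets are deprecated since C++17.

```cpp
#include <cassert>
#include <codecvt>    // std::codecvt_utf16 (deprecated since C++17)
#include <cstddef>
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of any library): convert a UCS-4 string to
// UTF-16 bytes, sizing the buffer as N * max_length() + length(BOM) and
// trimming the unused tail after the conversion.
inline std::string to_utf16_bytes(const std::u32string& in)
{
    // generate_header asks the facet to write a BOM before the first character.
    std::codecvt_utf16<char32_t, 0x10FFFF, std::generate_header> cvt;

    const int per_char = cvt.max_length();
    assert(per_char > 0);
    const std::size_t bom_bytes = 2;  // assumption: UTF-16 BOM is one 16-bit unit

    // Upper bound on the output size; the real size is usually smaller.
    std::string out(in.size() * static_cast<std::size_t>(per_char) + bom_bytes, '\0');

    std::mbstate_t state{};
    const char32_t* from_next = nullptr;
    char* to_next = nullptr;
    const std::codecvt_base::result res =
        cvt.out(state, in.data(), in.data() + in.size(), from_next,
                &out[0], &out[0] + out.size(), to_next);
    if (res != std::codecvt_base::ok)
        throw std::range_error("UCS-4 to UTF-16 conversion failed");

    // Drop the bytes that the estimate reserved but the conversion did not use.
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Calling to_utf16_bytes(U"A\U0001F600") reserves room for two worst-case characters plus the BOM and then shrinks the result to the bytes actually written (BOM, one BMP character, one surrogate pair).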
[Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464 --- Comment #4 from Jonathan Wakely redi at gcc dot gnu.org ---
Author: redi
Date: Fri Jun 12 11:22:01 2015
New Revision: 224417

URL: https://gcc.gnu.org/viewcvs?rev=224417&root=gcc&view=rev
Log:
PR libstdc++/66464
* src/c++11/codecvt.cc (codecvt_utf16_base<char32_t>::do_max_length):
Return 4 not 3.

Modified:
    branches/gcc-5-branch/libstdc++-v3/ChangeLog
    branches/gcc-5-branch/libstdc++-v3/src/c++11/codecvt.cc
[Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464

Jonathan Wakely redi at gcc dot gnu.org changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2015-06-09
           Assignee|unassigned at gcc dot gnu.org |redi at gcc dot gnu.org
     Ever confirmed|0                             |1

--- Comment #1 from Jonathan Wakely redi at gcc dot gnu.org ---
(In reply to Leo Carreon from comment #0)
> I just noticed that codecvt_utf16<char32_t>::max_length() is returning 3.
> This appears to be the wrong value because a surrogate pair is composed of
> 4 bytes, therefore max_length() should at least be returning 4.

Agreed, I think that's just a mistake. I wrote this comment in the code:

  int
  codecvt<char16_t, char, mbstate_t>::do_max_length() const throw()
  {
    // Any valid UTF-8 sequence of 3 bytes fits in a single 16-bit code unit,
    // whereas 4 byte sequences require two 16-bit code units.
    return 3;
  }

But that reasoning (even if it's correct!) doesn't apply to
codecvt_utf16<char32_t>.

> I'm also wondering whether the BOM should be taken into account. If a UTF-16
> string has a BOM and happens to start with a surrogate pair, 6 bytes have to
> be consumed to generate a single UCS-4 character.
>
> Should the same thing be considered with
> codecvt_utf8<char32_t>::max_length(), which currently returns 4? Taking into
> account the BOM and the longest UTF-8 character below 0x10FFFF, shouldn't
> max_length() return 7?
>
> I'm not really sure if the BOM should be taken into account because the
> standard's definition for do_max_length() simply says the maximum number of
> input characters that needs to be consumed to generate a single output
> character.

That's a very good question.
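To make the 6-byte case concrete, here is a minimal sketch of a decoder that counts how many input bytes it consumes to produce the first UCS-4 character. It is an illustration only and unrelated to the libstdc++ implementation: the names Decoded and decode_first_utf16be are hypothetical, and the input is assumed to be big-endian UTF-16 with an optional FE FF BOM.

```cpp
#include <cstddef>

// Hypothetical result type: the decoded code point and the number of input
// bytes consumed to produce it (consumed == 0 means the input was rejected).
struct Decoded {
    char32_t code_point;
    std::size_t consumed;
};

// Decode the first code point of a big-endian UTF-16 byte sequence,
// skipping a leading BOM (FE FF) if one is present.
constexpr Decoded decode_first_utf16be(const unsigned char* p, std::size_t n)
{
    std::size_t i = 0;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        i = 2;                                   // BOM: 2 bytes
    if (n < i + 2)
        return Decoded{0, 0};                    // truncated input
    const char32_t first = static_cast<char32_t>((p[i] << 8) | p[i + 1]);
    i += 2;
    if (first >= 0xDC00 && first <= 0xDFFF)
        return Decoded{0, 0};                    // unpaired low surrogate
    if (first < 0xD800 || first > 0xDBFF)
        return Decoded{first, i};                // BMP character: 2 bytes
    if (n < i + 2)
        return Decoded{0, 0};                    // truncated surrogate pair
    const char32_t second = static_cast<char32_t>((p[i] << 8) | p[i + 1]);
    if (second < 0xDC00 || second > 0xDFFF)
        return Decoded{0, 0};                    // invalid low surrogate
    i += 2;                                      // surrogate pair: 4 bytes
    return Decoded{static_cast<char32_t>(0x10000 + ((first - 0xD800) << 10)
                                         + (second - 0xDC00)), i};
}
```

For the bytes FE FF D8 3D DE 00 this yields U+1F600 with 6 bytes consumed; without the BOM the same character consumes 4 bytes, which is the value the fix makes max_length() return.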
[Bug libstdc++/66464] codecvt_utf16 max_length returning incorrect value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66464 --- Comment #2 from Leo Carreon lcarreon at bigpond dot net.au ---
Just clarifying that my comments are to do with codecvt_utf16<char32_t> and
codecvt_utf8<char32_t>.

The way I understand it, codecvt_utf16<char32_t> should be converting between
UTF-16 and UCS-4. UTF-16 uses 2 bytes for characters in the BMP (characters in
the range 0x0000 to 0xFFFF) and 4 bytes (surrogate pairs) for characters above
the BMP (0x010000 to 0x10FFFF). UCS-4 uses 4 byte values. Therefore,
codecvt_utf16<char32_t>::max_length() should be returning 4 if the BOM is not
taken into account.

codecvt_utf8<char32_t> converts between UTF-8 and UCS-4. UTF-8 can use up to
4 bytes for characters up to 0x10FFFF. Therefore,
codecvt_utf8<char32_t>::max_length() should be returning 4 if the BOM is not
taken into account.

As I said in my previous post, I'm not sure if the BOM should be accounted for
in max_length(). If I'm not mistaken, the purpose of this function is to allow
a user to estimate how many bytes are required to fit a UCS-4 string when
converted to either UTF-16 or UTF-8. My guess is that the BOM can be taken
into account separately when doing the estimation. For example, when
wstring_convert estimates the length of the std::string to be generated by
wstring_convert::to_bytes(), it should be the number of UCS-4 characters
multiplied by max_length(), plus the size of the BOM if required. The
resulting std::string can be resized after the conversion to eliminate the
unused bytes.

Note that the comment you mentioned in your reply probably only applies to
codecvt_utf8_utf16, which converts between UTF-8 and UTF-16 directly without
going through the UCS-4 conversion.
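The per-character sizes described above can be written down as two small functions. This is a sketch for illustration; utf16_bytes and utf8_bytes are hypothetical names, not library functions, and they return 0 for values that are not valid Unicode scalar values (surrogates and anything above 0x10FFFF).

```cpp
#include <cstddef>

// Bytes needed to encode one UCS-4 character in UTF-16:
// 2 for the BMP, 4 (a surrogate pair) above it, 0 if not encodable.
constexpr std::size_t utf16_bytes(char32_t c)
{
    return ((c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF) ? 0
         : (c <= 0xFFFF) ? 2
         : 4;
}

// Bytes needed to encode one UCS-4 character in UTF-8:
// between 1 and 4 depending on the code point, 0 if not encodable.
constexpr std::size_t utf8_bytes(char32_t c)
{
    return ((c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF) ? 0
         : (c <= 0x7F) ? 1
         : (c <= 0x7FF) ? 2
         : (c <= 0xFFFF) ? 3
         : 4;
}

// The worst case for both encodings is 4 bytes per character, BOM excluded.
static_assert(utf16_bytes(0x10FFFF) == 4, "surrogate pair is 4 bytes");
static_assert(utf8_bytes(0x10FFFF) == 4, "longest UTF-8 sequence is 4 bytes");
```

Both functions top out at 4 for the highest code point, which matches the argument that max_length() should return 4 for codecvt_utf16<char32_t> and codecvt_utf8<char32_t> when the BOM is not counted.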