[Bug c++/112652] g++.dg/cpp26/literals2.C FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112652

Jakub Jelinek changed:

           What    |Removed |Added
   Target Milestone|14.2    |14.3

--- Comment #12 from Jakub Jelinek ---
GCC 14.2 is being released, retargeting bugs to GCC 14.3.
[Bug c++/112652] g++.dg/cpp26/literals2.C FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112652

Richard Biener changed:

           What    |Removed |Added
   Target Milestone|14.0    |14.2

--- Comment #11 from Richard Biener ---
GCC 14.1 is being released, retargeting bugs to GCC 14.2.
[Bug c++/112652] g++.dg/cpp26/literals2.C FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112652

--- Comment #10 from ro at CeBiTec dot Uni-Bielefeld.DE ---
> --- Comment #9 from Jakub Jelinek ---
> (In reply to r...@cebitec.uni-bielefeld.de from comment #8)
>> FWIW, the iconv conversion tables in /usr/lib/iconv can be regenerated
>> from the OpenSolaris sources, modified not to do that '?' conversion.
>> Worked for a quick check for the UTF-8 -> ASCII example, but the '?' is
>> more prevalent and would need to be eradicated upstream.
>
> If it is always '?' used instead of unknown character, we could also have
> some hack on the libcpp side for it.

It took me a bit to get back to you here since I had to both check with
Solaris engineering and dig up our old Solaris 9 sources (which, unlike
OpenSolaris, have no relevant parts missing due to copyright issues).

Both what I found in the iconv conversion tables and what's documented
in unicode_iconv(7) confirm the consistent use of '?'. The manpage has

  If the source character code value is not within a range defined by
  the source codeset standard, it is considered as an illegal character.
  If the source character code value is within the range(s) defined by
  the standard, it will be considered as non-identical, even if the
  source character code value maps to an undefined or a reserved
  location within the valid range. The non-identical character will map
  to either ? (0x3f in ASCII-compatible codesets) if the target codeset
  is a non-Unicode codeset or to Unicode replacement character (U+FFFD)
  if the target codeset is an Unicode codeset.

It will of course be in the respective charset's encoding (0x3f for
ASCII, 0x6f for EBCDIC), but that's all I could find. This is not a
complete guarantee (I may well have missed something), but seems
plausible enough...

> Like (but limited to Solaris hosts) in convert_using_iconv when converting
> from SOURCE_CHARSET to some other character set don't try to convert the
> whole UTF-8 string at once, but split it into chunks at u'?' characters, so
>   foo???bar?baz?qux
> would be iconv converted as
>   foo
>   ???
>   bar
>   ?
>   baz
>   ?
>   qux
> chunks. And when converting the non-? chunks, it would after the conversion
> check for the '?' character (in the destination character set - that is
> something that perhaps could be queried during initialization after
> iconv_open) and treat it as an error if it appeared there. Or always convert
> also back to UTF-8 and check if it has more '?' characters than the source.

Unless we want to take the easy way out and just require GNU libiconv on
Solaris, that seems like a plausible way of handling the issue.
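For illustration only, the round-trip alternative from the end of the quoted proposal (convert back to UTF-8 and check whether the result has more '?' characters than the source) might be sketched as below. This is not libcpp code: count_qmarks, convert and roundtrip_gained_qmarks are hypothetical names, the fixed 512-byte buffers are an arbitrary simplification, and the sketch assumes the host iconv accepts the charset names passed in.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: number of '?' bytes in S (LEN bytes, UTF-8).
   '?' is 0x3f and never occurs inside a UTF-8 multibyte sequence,
   so counting bytes is enough.  */
static size_t
count_qmarks (const char *s, size_t len)
{
  size_t n = 0;
  for (size_t i = 0; i < len; i++)
    n += s[i] == '?';
  return n;
}

/* Hypothetical helper: one iconv conversion of IN (INLEN bytes) into OUT.
   Returns the number of bytes written, or (size_t) -1 on any failure.  */
static size_t
convert (const char *to, const char *from, const char *in, size_t inlen,
         char *out, size_t outsize)
{
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return (size_t) -1;
  char *inp = (char *) in;
  char *outp = out;
  size_t outleft = outsize;
  size_t r = iconv (cd, &inp, &inlen, &outp, &outleft);
  iconv_close (cd);
  return r == (size_t) -1 ? r : outsize - outleft;
}

/* Returns 1 if converting SRC to CHARSET and back to UTF-8 produced more
   '?' characters than SRC had (i.e. something was silently replaced),
   0 if not, and -1 if either conversion failed outright.  */
static int
roundtrip_gained_qmarks (const char *charset, const char *src, size_t len)
{
  assert (charset && src);
  char mid[512], back[512];
  size_t midlen = convert (charset, "UTF-8", src, len, mid, sizeof mid);
  if (midlen == (size_t) -1)
    return -1;
  size_t backlen = convert ("UTF-8", charset, mid, midlen, back, sizeof back);
  if (backlen == (size_t) -1)
    return -1;
  return count_qmarks (back, backlen) > count_qmarks (src, len);
}
```

The appeal of this variant is that it never needs to know how '?' is encoded in the destination character set; the cost is a second conversion for every string.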
[Bug c++/112652] g++.dg/cpp26/literals2.C FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112652

--- Comment #9 from Jakub Jelinek ---
(In reply to r...@cebitec.uni-bielefeld.de from comment #8)
> FWIW, the iconv conversion tables in /usr/lib/iconv can be regenerated
> from the OpenSolaris sources, modified not to do that '?' conversion.
> Worked for a quick check for the UTF-8 -> ASCII example, but the '?' is
> more prevalent and would need to be eradicated upstream.

If it is always '?' used instead of unknown character, we could also have some
hack on the libcpp side for it.
Like (but limited to Solaris hosts) in convert_using_iconv when converting from
SOURCE_CHARSET to some other character set don't try to convert the whole UTF-8
string at once, but split it into chunks at u'?' characters, so
  foo???bar?baz?qux
would be iconv converted as
  foo
  ???
  bar
  ?
  baz
  ?
  qux
chunks. And when converting the non-? chunks, it would after the conversion
check for the '?' character (in the destination character set - that is
something that perhaps could be queried during initialization after iconv_open)
and treat it as an error if it appeared there. Or always convert also back to
UTF-8 and check if it has more '?' characters than the source.
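A rough sketch of the chunking idea described above, for illustration only. It is not the libcpp implementation: next_chunk and convert_chunked are hypothetical names, and the sketch assumes a stateless, single-byte destination character set in which '?' is the single byte dst_qmark (how that byte would be discovered after iconv_open is left open, as in the comment).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper.  Returns the length of the next chunk of S (LEN > 0
   bytes remain): a maximal run of '?' bytes or a maximal run of non-'?'
   bytes.  *IS_QMARKS says which.  Splitting bytewise is safe for UTF-8
   because 0x3f never occurs inside a multibyte sequence.  */
static size_t
next_chunk (const char *s, size_t len, int *is_qmarks)
{
  assert (s && len > 0);
  *is_qmarks = s[0] == '?';
  size_t n = 1;
  while (n < len && (s[n] == '?') == *is_qmarks)
    n++;
  return n;
}

/* Hypothetical helper.  Converts the UTF-8 string SRC (LEN bytes) chunk by
   chunk through CD into DST.  Returns 0 on success and stores the output
   length in *DSTLEN.  Returns -1 if iconv fails, or if DST_QMARK shows up
   in the output of a chunk that contained no '?' in the source, which is
   taken to mean that iconv silently replaced an unencodable character.  */
static int
convert_chunked (iconv_t cd, const char *src, size_t len, char dst_qmark,
                 char *dst, size_t dstsize, size_t *dstlen)
{
  char *outp = dst;
  size_t outleft = dstsize;
  while (len > 0)
    {
      int is_qmarks;
      size_t n = next_chunk (src, len, &is_qmarks);
      char *inp = (char *) src;
      size_t inleft = n;
      char *chunk_start = outp;
      if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        return -1;
      if (!is_qmarks
          && memchr (chunk_start, dst_qmark, (size_t) (outp - chunk_start)))
        return -1;
      src += n;
      len -= n;
    }
  *dstlen = (size_t) (outp - dst);
  return 0;
}
```

With an iconv that rejects unencodable input, the memchr check never fires and the function behaves like a plain conversion; it only matters on hosts where the replacement is silent.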
[Bug c++/112652] g++.dg/cpp26/literals2.C FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112652

--- Comment #8 from ro at CeBiTec dot Uni-Bielefeld.DE ---
> --- Comment #7 from Jakub Jelinek ---
> (In reply to r...@cebitec.uni-bielefeld.de from comment #6)
>> > --- Comment #5 from ro at CeBiTec dot Uni-Bielefeld.DE ---
>> >> --- Comment #4 from Jakub Jelinek ---
>> >> Given that C++ says e.g. in https://eel.is/c++draft/lex.ccon#3.1
>> >> that program is ill-formed if some character lacks encoding in the
>> >> execution character set, I'm afraid the Solaris iconv behavior results
>> >> in violation of
>>
>> Although I can barely wrap my head around the standardese there, I had a
>> look at n4928 (the last? C++23 draft), which has a different wording
>> here (p.25, 5.13.3):
>
> The testcase is for a C++26 feature, which made those ill-formed.

Should have been obvious from the pathname ;-( N4971 has that wording...

>> The current Solaris iconv behaviour certainly isn't particularly
>> intuitive and I'll ask the Solaris engineers about it. However, there's
>> the question what to do about the testcase? Just xfail it on Solaris or
>> omit just the two affected subtests there?
>
> xfailing is one possibility, but then on Solaris we'll never support C++26
> properly.

I guess it's the best solution in the short term (GCC 14), though.

> Or require using GNU libiconv rather than Solaris iconv if it can't deal
> with that?

At least document the suggestion in install.texi; I wouldn't make it a
hard requirement yet. I'll also wait what the Solaris engineers can
provide on background for the current behaviour.

FWIW, the iconv conversion tables in /usr/lib/iconv can be regenerated
from the OpenSolaris sources, modified not to do that '?' conversion.
Worked for a quick check for the UTF-8 -> ASCII example, but the '?' is
more prevalent and would need to be eradicated upstream.
[Bug c++/112652] g++.dg/cpp26/literals2.C FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112652

--- Comment #7 from Jakub Jelinek ---
(In reply to r...@cebitec.uni-bielefeld.de from comment #6)
> > --- Comment #5 from ro at CeBiTec dot Uni-Bielefeld.DE ---
> >> --- Comment #4 from Jakub Jelinek ---
> >> Given that C++ says e.g. in https://eel.is/c++draft/lex.ccon#3.1
> >> that program is ill-formed if some character lacks encoding in the
> >> execution character set, I'm afraid the Solaris iconv behavior results
> >> in violation of
>
> Although I can barely wrap my head around the standardese there, I had a
> look at n4928 (the last? C++23 draft), which has a different wording
> here (p.25, 5.13.3):

The testcase is for a C++26 feature, which made those ill-formed.

> The current Solaris iconv behaviour certainly isn't particularly
> intuitive and I'll ask the Solaris engineers about it. However, there's
> the question what to do about the testcase? Just xfail it on Solaris or
> omit just the two affected subtests there?

xfailing is one possibility, but then on Solaris we'll never support C++26
properly.
Or require using GNU libiconv rather than Solaris iconv if it can't deal with
that?
[Bug c++/112652] g++.dg/cpp26/literals2.C FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112652

--- Comment #6 from ro at CeBiTec dot Uni-Bielefeld.DE ---
> --- Comment #5 from ro at CeBiTec dot Uni-Bielefeld.DE ---
>> --- Comment #4 from Jakub Jelinek ---
>> Given that C++ says e.g. in https://eel.is/c++draft/lex.ccon#3.1
>> that program is ill-formed if some character lacks encoding in the execution
>> character set, I'm afraid the Solaris iconv behavior results in violation of

Although I can barely wrap my head around the standardese there, I had a
look at n4928 (the last? C++23 draft), which has a different wording
here (p.25, 5.13.3):

  (3.1) — A character-literal with a c-char-sequence consisting of a
  single basic-c-char, simple-escape-sequence, or
  universal-character-name is the code unit value of the specified
  character as encoded in the literal’s associated character encoding.
  [Note 2 : If the specified character lacks representation in the
  literal’s associated character encoding or if it cannot be encoded as
  a single code unit, then the literal is a non-encodable character
  literal. —end note]

> I've not yet tried to understand what either iconv(3) has to say on the
> matter.

Digging further, Solaris iconv(3C) has

  If iconv() encounters a character in the input buffer that is legal,
  but for which an identical character does not exist in the target code
  set, iconv() performs an implementation-defined conversion on this
  character.

which exactly matches XPG7, so the behaviour seems to be in line with
the standards.

I've also found that Solaris 11 has iconvctl(3C) (obviously patterned
after GNU libiconv) with

  ICONV_SET_TRANSLITERATE
      With this request and a pointer to a const int with a non-zero
      value, caller can instruct the current conversion to transliterate
      non-identical characters from the input buffer during the code
      conversion as much as it can. The value of zero, on the other
      hand, turns it off.

However,

  int transliterate = 0;
  iconvctl (cd, ICONV_SET_TRANSLITERATE, &transliterate);

doesn't make a difference.

The current Solaris iconv behaviour certainly isn't particularly
intuitive and I'll ask the Solaris engineers about it. However, there's
the question what to do about the testcase? Just xfail it on Solaris or
omit just the two affected subtests there?
[Bug c++/112652] g++.dg/cpp26/literals2.C FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112652

--- Comment #5 from ro at CeBiTec dot Uni-Bielefeld.DE ---
> --- Comment #4 from Jakub Jelinek ---
> Given that C++ says e.g. in https://eel.is/c++draft/lex.ccon#3.1
> that program is ill-formed if some character lacks encoding in the execution
> character set, I'm afraid the Solaris iconv behavior results in violation of
> the C++ standard requirements, it is hard to argue that in the Solaris case
> e.g. ISO-8859-1 execution charset would be some special character set where ?
> character represents all Unicode characters which don't have a representation
> in the character set in addition to ?.

I've now started digging into this myself.

* Solaris iconv(1) says

    output. If no conversion exists for a particular character, an
    implementation-defined conversion is performed on this character.

* This seems to at least partially match with XPG7:

    -s  Suppress any messages written to standard error concerning
        invalid characters. When -s is not used, the results of
        encountering invalid characters in the input stream (either
        those that are not valid characters in the codeset of the input
        file or that have no corresponding character in the codeset of
        the output file) shall be specified in the system documentation.
        The presence or absence of -s shall not affect the exit status
        of iconv.

  AFAIU that's related to what Solaris iconv(1) does, although they
  don't specify the output '?' and produce no message. However, they
  still exit with 0, which seems wrong to me.

I've not yet tried to understand what either iconv(3) has to say on the
matter.

> I'm afraid we don't want to maintain iconv replacement inside of libcpp
> though.

Agreed.
[Bug c++/112652] g++.dg/cpp26/literals2.C FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112652

Jakub Jelinek changed:

           What    |Removed |Added
                 CC|        |jason at gcc dot gnu.org

--- Comment #4 from Jakub Jelinek ---
Given that C++ says e.g. in https://eel.is/c++draft/lex.ccon#3.1
that program is ill-formed if some character lacks encoding in the execution
character set, I'm afraid the Solaris iconv behavior results in violation of
the C++ standard requirements, it is hard to argue that in the Solaris case
e.g. ISO-8859-1 execution charset would be some special character set where ?
character represents all Unicode characters which don't have a representation
in the character set in addition to ?.
I'm afraid we don't want to maintain iconv replacement inside of libcpp though.
[Bug c++/112652] g++.dg/cpp26/literals2.C FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112652

--- Comment #3 from Jakub Jelinek ---
(In reply to r...@cebitec.uni-bielefeld.de from comment #2)
> > --- Comment #1 from Jakub Jelinek ---
> > Strange. On cfarm211 which is
> > SunOS gcc-solaris11 5.11 11.3 sun4u sparc SUNW,SPARC-Enterprise
> > the test passes.
>
> Can you check which libiconv got picked up there? The non-standard
> OpenCSW packages on that system may include GNU libiconv and install
> into default system directories, so they are picked up by default.

/opt/csw/lib/libiconv.so.2

> > You get no diagnostics for those lines at all? Buggy libiconv?
>
> No. There's no separate libiconv on Solaris; the iconv* functions are
> included in libc.

On Linux I get:
echo á | iconv -f UTF-8 -t ASCII -; echo 😁 | iconv -f UTF-8 -t ISO-8859-1 -
iconv: illegal input sequence at position 0
iconv: illegal input sequence at position 0
while on Solaris
echo á | iconv -f UTF-8 -t ASCII -; echo 😁 | iconv -f UTF-8 -t ISO-8859-1 -
?
?
If it maps all characters which do not have representation in the destination
character set into ?, then it is useless for the test in question.
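The same difference can be observed at the iconv(3) level with a small probe along the following lines. This is only a sketch for illustration: probe_conversion is a hypothetical name, the 256-byte output buffer is arbitrary, and the '?' check assumes an ASCII-compatible destination character set where '?' is the byte 0x3f.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper.  Converts the NUL-terminated UTF-8 string IN to the
   character set TO and classifies the outcome:
      0  converted, no new '?' in the output;
      1  iconv rejected a character with EILSEQ (the Linux behaviour shown
         above);
      2  iconv reported success but the output has more '?' bytes than the
         input (the silent replacement seen on Solaris above);
     -1  any other failure (unknown charset, buffer too small, ...).  */
static int
probe_conversion (const char *to, const char *in)
{
  assert (to && in);
  iconv_t cd = iconv_open (to, "UTF-8");
  if (cd == (iconv_t) -1)
    return -1;

  char out[256];
  char *inp = (char *) in;
  char *outp = out;
  size_t inleft = strlen (in);
  size_t outleft = sizeof out;

  size_t r = iconv (cd, &inp, &inleft, &outp, &outleft);
  int saved_errno = errno;
  iconv_close (cd);

  if (r == (size_t) -1)
    return saved_errno == EILSEQ ? 1 : -1;

  /* Compare the number of '?' bytes before and after the conversion.  */
  size_t outlen = sizeof out - outleft;
  size_t q_in = 0, q_out = 0;
  for (const char *p = in; *p; p++)
    q_in += *p == '?';
  for (size_t i = 0; i < outlen; i++)
    q_out += out[i] == '?';
  return q_out > q_in ? 2 : 0;
}
```

Going by the command-line output above, a call such as probe_conversion ("ASCII", "\xc3\xa1") would presumably yield 1 with GNU iconv and 2 with the Solaris libc iconv, but that has not been verified here.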
[Bug c++/112652] g++.dg/cpp26/literals2.C FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112652

--- Comment #2 from ro at CeBiTec dot Uni-Bielefeld.DE ---
> --- Comment #1 from Jakub Jelinek ---
> Strange. On cfarm211 which is
> SunOS gcc-solaris11 5.11 11.3 sun4u sparc SUNW,SPARC-Enterprise
> the test passes.

Can you check which libiconv got picked up there? The non-standard
OpenCSW packages on that system may include GNU libiconv and install
into default system directories, so they are picked up by default.

> You get no diagnostics for those lines at all? Buggy libiconv?

No. There's no separate libiconv on Solaris; the iconv* functions are
included in libc.

> I mean the emojis certainly aren't in ISO-8859-1...

Probably not ;-)

FWIW, I've just built trunk with GNU libiconv 1.17 on
i386-pc-solaris2.11. The test PASSes now with both LANG=C and
LANG=en_US.UTF-8.

I'll dig further into Solaris iconv functions here...
[Bug c++/112652] g++.dg/cpp26/literals2.C FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112652

--- Comment #1 from Jakub Jelinek ---
Strange. On cfarm211 which is
SunOS gcc-solaris11 5.11 11.3 sun4u sparc SUNW,SPARC-Enterprise
the test passes.
/export/home/jakub/gcc/gcc/testsuite/g++.dg/cpp26/literals2.C:7:9: warning: multi-character character constant [-Wmultichar]
/export/home/jakub/gcc/gcc/testsuite/g++.dg/cpp26/literals2.C:8:9: warning: multi-character character constant [-Wmultichar]
/export/home/jakub/gcc/gcc/testsuite/g++.dg/cpp26/literals2.C:10:9: error: converting to execution character set: Illegal byte sequence
/export/home/jakub/gcc/gcc/testsuite/g++.dg/cpp26/literals2.C:11:9: error: named universal character escapes are only valid in C++23
/export/home/jakub/gcc/gcc/testsuite/g++.dg/cpp26/literals2.C:11:9: error: converting UCN to execution character set: Illegal byte sequence
/export/home/jakub/gcc/gcc/testsuite/g++.dg/cpp26/literals2.C:13:9: error: converting UCN to execution character set: Illegal byte sequence
...
You get no diagnostics for those lines at all? Buggy libiconv?
I mean the emojis certainly aren't in ISO-8859-1...
[Bug c++/112652] g++.dg/cpp26/literals2.C FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112652

Rainer Orth changed:

           What    |Removed |Added
   Target Milestone|---     |14.0