[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499 --- Comment #7 from Jerry DeLisle ---
There are two issues going on here. We do not interpret source code that is UTF-8 encoded. This is why our current tests for UTF-8 encoding of data files use hexadecimal codes. I will have to see what the standard says about non-ASCII character sets in source code.

If I get around this by using something like this:

    char1 = 4_"Test without local char"
    char2 = 4_"Test with local char "
    char2(22:22) = 4_"Ã"
    char2(23:23) = 4_"Ã"

    $ ./a.out
    23
    23
    1234567890123456789012345678901234567890
    Test without local char 10.
    Test with local char ÃÃ10.

The string lengths now match correctly. One can see the tabbing is still off. This is because the format buffer seek functions are byte oriented, and when using UTF-8 encoding we need to seek the buffer differently. In fact we have to allocate it differently as well to maintain the four-byte characters.
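To illustrate the difference between byte-oriented and character-oriented seeking, here is a minimal C sketch. It is not libgfortran code: the function names `utf8_columns` and `utf8_offset_of_column` are hypothetical, and the input is assumed to be well-formed UTF-8.

```c
#include <stddef.h>

/* Number of characters in the first n bytes of a UTF-8 buffer.
   Every byte that is not a continuation byte (10xxxxxx) starts a character. */
size_t utf8_columns(const unsigned char *buf, size_t n)
{
    size_t cols = 0;
    for (size_t i = 0; i < n; i++)
        if ((buf[i] & 0xC0) != 0x80)
            cols++;
    return cols;
}

/* Byte offset at which the 0-based character column `col` starts.
   Returns n if the buffer holds `col` characters or fewer. */
size_t utf8_offset_of_column(const unsigned char *buf, size_t n, size_t col)
{
    size_t seen = 0;
    for (size_t i = 0; i < n; i++) {
        if ((buf[i] & 0xC0) != 0x80) {
            if (seen == col)
                return i;
            seen++;
        }
    }
    return n;
}
```

For the four bytes `61 c3 83 62`, column 2 starts at byte offset 3, whereas a byte-oriented seek to position 2 would land in the middle of the two-byte character.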
[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499 --- Comment #6 from Jerry DeLisle ---
This is an interesting puzzle. I took the -fdump-tree-original output of compiling the test case and edited out all except the initialization of the two variables char1 and char2. I lined these up so we could see what each 4-byte character looks like. The last two characters should be two characters of c383. We are generating c300 8300 for each character.

---

__builtin_memmove ((void *) , (void *) &" T\x00\x00\x00 e\x00\x00\x00 s\x00\x00\x00 t\x00\x00\x00 \x00\x00\x00 w\x00\x00\x00 i\x00\x00\x00 t\x00\x00\x00 h\x00\x00\x00 o\x00\x00\x00 u\x00\x00\x00 t\x00\x00\x00 \x00\x00\x00 l\x00\x00\x00 o\x00\x00\x00 c\x00\x00\x00 a\x00\x00\x00 l\x00\x00\x00 \x00\x00\x00 c\x00\x00\x00 h\x00\x00\x00 a\x00\x00\x00 r\x00\x00"[1]{lb: 1 sz: 4}, 92);

__builtin_memmove ((void *) , (void *) &" T\x00\x00\x00 e\x00\x00\x00 s\x00\x00\x00 t\x00\x00\x00 \x00\x00\x00 w\x00\x00\x00 i\x00\x00\x00 t\x00\x00\x00 h\x00\x00\x00 \x00\x00\x00 l\x00\x00\x00 o\x00\x00\x00 c\x00\x00\x00 a\x00\x00\x00 l\x00\x00\x00 \x00\x00\x00 c\x00\x00\x00 h\x00\x00\x00 a\x00\x00\x00 r\x00\x00\x00 \x00\x00\x00 \xc3\x00\x00\x00 \x83\x00\x00\x00 \xc3\x00\x00\x00 \x83\x00\x00"[1]{lb: 1 sz: 4}, 100);
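The dump is consistent with each byte of the UTF-8 source being widened to its own 4-byte character rather than each UTF-8 sequence being decoded to one code point. The C sketch below contrasts the two approaches. It is only an illustration, not the front end's code: `widen_bytes` and `decode_utf8` are hypothetical names, and the decoder assumes well-formed input.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive widening: every source byte becomes its own 4-byte character.
   Returns the number of characters written to out. */
size_t widen_bytes(const unsigned char *src, size_t n, uint32_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = src[i];
    return n;
}

/* Decoding: each UTF-8 sequence becomes one code point.
   Assumes well-formed input; no validation is attempted.
   Returns the number of code points written to out. */
size_t decode_utf8(const unsigned char *src, size_t n, uint32_t *out)
{
    size_t count = 0;
    size_t i = 0;
    while (i < n) {
        unsigned char b = src[i++];
        uint32_t cp;
        size_t extra;                 /* continuation bytes still expected */
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        while (extra > 0 && i < n) {
            cp = (cp << 6) | (src[i++] & 0x3Fu);
            extra--;
        }
        out[count++] = cp;
    }
    return count;
}
```

Given the two source bytes `c3 83`, widening yields the two characters 0xC3 and 0x83 (the pattern visible at the end of the second string above), while decoding yields the single code point 0xC3.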
[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499 Jerry DeLisle changed: What|Removed |Added Status|NEW |ASSIGNED CC||jvdelisle at gcc dot gnu.org Assignee|unassigned at gcc dot gnu.org |jvdelisle at gcc dot gnu.org --- Comment #5 from Jerry DeLisle --- assigning to myself
[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499 Jerry DeLisle changed: What|Removed |Added Status|ASSIGNED|NEW
[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499 Jerry DeLisle changed: What|Removed |Added Status|NEW |ASSIGNED CC||jvdelisle at gcc dot gnu.org Assignee|unassigned at gcc dot gnu.org |jvdelisle at gcc dot gnu.org --- Comment #4 from Jerry DeLisle --- The case in the original report is likely not valid without setting the encoding for the output unit as Dominique has done in Comment 1. The problem here is likely similar to 99210.
[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499 Jerry DeLisle changed: What|Removed |Added Status|ASSIGNED|NEW
[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499 Dominique d'Humieres changed: What|Removed |Added Status|NEW |ASSIGNED
[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499 --- Comment #3 from Jerry DeLisle ---
The trimmed length is incorrect. With this test:

    program test_character
      real:: a
      character(len=2, kind=4):: char1, char2
      char2 = 4_"Ã"
      open(6, encoding="utf-8")
      write(*,'(a)') trim(char2)
      !print *, len(trim(char2),4)
    end program

The length computed for len(trim(char2),4) is 2.

    $ ./a.out >test.out
    [jerry@amda8 pr66499]$ xxd test.out
    : c383 c283 0a

We have an extra word being emitted. Two extra bytes.
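One possible reading of the xxd output: the four bytes before the newline are exactly what a UTF-8 encoder produces for the two code points U+00C3 and U+0083, which would fit a char2 that holds the two source bytes as separate kind=4 characters. The C sketch below is a generic UTF-8 encoder written for illustration, not the libgfortran routine, and `encode_utf8` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point (up to U+10FFFF) as UTF-8 into out, which must
   hold at least 4 bytes. Returns the number of bytes written.
   Surrogates and out-of-range values are not rejected. */
size_t encode_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Encoding 0xC3 gives `c3 83` and encoding 0x83 gives `c2 83`, matching the `c383 c283` seen in test.out.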
[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499 Dominique d'Humieres dominiq at lps dot ens.fr changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2015-06-11 Ever confirmed|0 |1 --- Comment #1 from Dominique d'Humieres dominiq at lps dot ens.fr ---
Confirmed from 4.8.4 up to trunk (6.0). If I add the lines

    print *, len(trim(char1))
    print *, len(trim(char2))

I get

    23
    25

So each à counts as two characters, while it is printed as only one. This makes me wonder if the code is valid. However, the following variant

    program test_character
      real:: a
      character(len=30, kind=4):: char1, char2
      a = 10
      char1 = 4_"Test without local char"
      char2 = 4_"Test with local char Ãà"
    10 format(2X, A, T40, f10.4)
      open(6, encoding="utf-8")
      print *, len(trim(char1))
      print *, len(trim(char2))
      write(*,10) char1, a
      write(*,10) char2, a
    end program

gives

    23
    25
    Test without local char 10.
    Test with local char Ãà 10.

???
[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499 Jerry DeLisle jvdelisle at gcc dot gnu.org changed: What|Removed |Added CC||jvdelisle at gcc dot gnu.org Assignee|unassigned at gcc dot gnu.org |jvdelisle at gcc dot gnu.org --- Comment #2 from Jerry DeLisle jvdelisle at gcc dot gnu.org --- UTF-8 is a variable length encoding. That explains the 6 character difference. The tabbing code assumes a fixed length per character. I will have to investigate this further. I suspect we are counting bytes, assuming the position from where we left off.
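A small C sketch of the suspected mechanism, under the assumption that the T descriptor pads with blanks up to the target column: if the record position is tracked in bytes, every UTF-8 continuation byte already written shortens the padding by one column. The names `continuation_bytes` and `pad_for_tab` are hypothetical and do not come from libgfortran.

```c
#include <stddef.h>

/* Count UTF-8 continuation bytes (10xxxxxx) in the first n bytes of buf.
   These bytes occupy storage but no output column. */
size_t continuation_bytes(const unsigned char *buf, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if ((buf[i] & 0xC0) == 0x80)
            count++;
    return count;
}

/* Blanks needed so that the next character lands in 1-based column
   `target`, given how many positions of the record count as used.
   Returns 0 if the record is already at or past the target. */
size_t pad_for_tab(size_t used, size_t target)
{
    return (target > used + 1) ? target - 1 - used : 0;
}
```

For a record of 8 bytes holding 6 characters, `pad_for_tab(8, 40)` gives 31 blanks while `pad_for_tab(6, 40)` gives 33, so byte counting leaves the next field two columns short, one per continuation byte.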