[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.

2024-02-24 Thread jvdelisle at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499

--- Comment #7 from Jerry DeLisle ---
There are two issues going on here. We do not interpret source code that is
UTF-8 encoded.  This is why our current tests for UTF-8 encoding of data files
use hexadecimal codes.

I will have to see what the standard says about non-ASCII character sets in
source code.

I can get around this by using something like this:

char1 = 4_"Test without local char"
char2 = 4_"Test with local char "

char2(22:22) = 4_"Ã"
char2(23:23) = 4_"Ã"

$ ./a.out 
  23
  23
1234567890123456789012345678901234567890
  Test without local char  10.
  Test with local char ÃÃ10.

The string lengths now match correctly.  One can see the tabbing is still off.
This is because the format buffer seek functions are byte-oriented, and when
using UTF-8 encoding we need to seek the buffer differently.  In fact, we have
to allocate it differently as well to hold the four-byte characters.
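
The effect of byte-oriented positioning on a tab can be illustrated outside
the runtime library.  The Python sketch below is not libgfortran code: the
helper pad_to_column and its byte_oriented flag are hypothetical names, and the
record is assumed to be the two blanks from 2X followed by the 23-character
string.  It computes how many blanks are needed to reach column 40 when the
current position is measured in UTF-8 bytes versus in characters.

```python
def pad_to_column(text, column, byte_oriented):
    """Return the blanks needed to move from the end of `text` to `column`.

    Hypothetical helper: byte_oriented=True measures the current position
    in UTF-8 bytes, False measures it in characters (code points).
    """
    position = len(text.encode("utf-8")) if byte_oriented else len(text)
    # Columns are 1-based, so column N starts after N - 1 positions.
    return max(column - 1 - position, 0)


# Assumed record: 2X (two blanks), then the 23-character string.
record = "  Test with local char \u00c3\u00c3"

print(pad_to_column(record, 40, byte_oriented=False))  # measured in characters
print(pad_to_column(record, 40, byte_oriented=True))   # measured in bytes
```

The byte-oriented result is two blanks short, one for each two-byte
character, which is the kind of misalignment seen in the output above.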

[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.

2024-02-23 Thread jvdelisle at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499

--- Comment #6 from Jerry DeLisle  ---
This is an interesting puzzle. I took the -fdump-tree-original output of
compiling the test case and edited out all except the initialization of the two
variables char1 and char2.

I lined these up so we could see what each 4-byte character looks like.  The
last two characters should be two characters of c383.  We are generating c300
8300 for each character.

---

__builtin_memmove ((void *) , (void *) &"
T\x00\x00\x00
e\x00\x00\x00
s\x00\x00\x00
t\x00\x00\x00
 \x00\x00\x00
w\x00\x00\x00
i\x00\x00\x00
t\x00\x00\x00
h\x00\x00\x00
o\x00\x00\x00
u\x00\x00\x00
t\x00\x00\x00
 \x00\x00\x00
l\x00\x00\x00
o\x00\x00\x00
c\x00\x00\x00
a\x00\x00\x00
l\x00\x00\x00
 \x00\x00\x00
c\x00\x00\x00
h\x00\x00\x00
a\x00\x00\x00
r\x00\x00"[1]{lb: 1 sz: 4}, 92);

__builtin_memmove ((void *) , (void *) &"
T\x00\x00\x00
e\x00\x00\x00
s\x00\x00\x00
t\x00\x00\x00
 \x00\x00\x00
w\x00\x00\x00
i\x00\x00\x00
t\x00\x00\x00
h\x00\x00\x00
 \x00\x00\x00
l\x00\x00\x00
o\x00\x00\x00
c\x00\x00\x00
a\x00\x00\x00
l\x00\x00\x00
 \x00\x00\x00
c\x00\x00\x00
h\x00\x00\x00
a\x00\x00\x00
r\x00\x00\x00
 \x00\x00\x00
\xc3\x00\x00\x00
\x83\x00\x00\x00
\xc3\x00\x00\x00
\x83\x00\x00"[1]{lb: 1 sz: 4}, 100);
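
The layout in the second dump is what one would get if each byte of the UTF-8
source were widened to its own 4-byte character.  The Python sketch below
reproduces that under two assumptions taken from the dump: characters are four
bytes wide and little-endian.  The helper names widen_bytes and to_ucs4 are
hypothetical and are not compiler functions.

```python
import struct


def widen_bytes(raw):
    """Store each input byte as its own 4-byte little-endian character."""
    return b"".join(struct.pack("<I", b) for b in raw)


def to_ucs4(text):
    """Store each code point as one 4-byte little-endian character."""
    return text.encode("utf-32-le")


# U+00C3 as it sits in a UTF-8 encoded source file: bytes c3 83.
source_bytes = "\u00c3".encode("utf-8")

print(widen_bytes(source_bytes).hex(" "))  # c3 00 00 00 83 00 00 00
print(to_ucs4("\u00c3").hex(" "))          # c3 00 00 00
```

The first line matches the two trailing characters per accented letter in
the dump; the second is a single 4-byte character for the same code point.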

[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.

2023-10-13 Thread jvdelisle at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499

Jerry DeLisle changed:

            What|Removed                      |Added
---------------------------------------------------------------------------
          Status|NEW                          |ASSIGNED
              CC|                             |jvdelisle at gcc dot gnu.org
        Assignee|unassigned at gcc dot gnu.org|jvdelisle at gcc dot gnu.org

--- Comment #5 from Jerry DeLisle ---
Assigning to myself.

[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.

2021-04-16 Thread jvdelisle at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499

Jerry DeLisle changed:

            What|Removed                      |Added
---------------------------------------------------------------------------
          Status|ASSIGNED                     |NEW

[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.

2021-02-28 Thread jvdelisle at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499

Jerry DeLisle changed:

            What|Removed                      |Added
---------------------------------------------------------------------------
          Status|NEW                          |ASSIGNED
              CC|                             |jvdelisle at gcc dot gnu.org
        Assignee|unassigned at gcc dot gnu.org|jvdelisle at gcc dot gnu.org

--- Comment #4 from Jerry DeLisle ---
The case in the original report is likely not valid without setting the
encoding for the output unit, as Dominique has done in Comment 1.  The problem
here is likely similar to PR 99210.

[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.

2018-10-05 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499

Jerry DeLisle changed:

            What|Removed                      |Added
---------------------------------------------------------------------------
          Status|ASSIGNED                     |NEW

[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.

2018-03-03 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499

Dominique d'Humieres changed:

            What|Removed                      |Added
---------------------------------------------------------------------------
          Status|NEW                          |ASSIGNED

[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.

2017-05-18 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499

--- Comment #3 from Jerry DeLisle  ---
The trimmed length is incorrect.

With this test:

program test_character
  real:: a
  character(len=2, kind=4):: char1, char2

  char2 = 4_"Ã"

  open(6, encoding="utf-8")

  write(*,'(a)') trim(char2)
  !print *, len(trim(char2),4)
end program


The length computed for len(trim(char2),4) is 2.

$ ./a.out >test.out
[jerry@amda8 pr66499]$ xxd test.out
00000000: c383 c283 0a

We have an extra word being emitted: two extra bytes (c283) follow the
expected c383.
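
One reading that is consistent with these bytes, offered as an assumption
rather than something the xxd output proves, is that the two UTF-8 source
bytes were each taken as a separate character and then each encoded to UTF-8
again on output.  The Python sketch below models that; reencode_bytewise is a
hypothetical name, and Latin-1 decoding is used only as a convenient way to
get one character per byte.

```python
def reencode_bytewise(text):
    """Encode to UTF-8, misread one byte per character, then encode again."""
    source_bytes = text.encode("utf-8")
    misread = source_bytes.decode("latin-1")  # one character per byte
    return misread.encode("utf-8")


print("\u00c3".encode("utf-8").hex())      # c383
print(reencode_bytewise("\u00c3").hex())   # c383c283
print(len("\u00c3".encode("utf-8").decode("latin-1")))  # 2
```

Under that assumption the output is c383 c283, and the misread string has
two characters, which would also account for the trimmed length of 2.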

[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.

2015-06-11 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499

Dominique d'Humieres <dominiq at lps dot ens.fr> changed:

            What|Removed                      |Added
---------------------------------------------------------------------------
          Status|UNCONFIRMED                  |NEW
Last reconfirmed|                             |2015-06-11
  Ever confirmed|0                            |1

--- Comment #1 from Dominique d'Humieres <dominiq at lps dot ens.fr> ---
Confirmed from 4.8.4 up to trunk (6.0). If I add the lines

print *, len(trim(char1))
print *, len(trim(char2))

I get

  23
  25

So each à counts as two characters, while it is printed as only one. This makes
me wonder if the code is valid. However, the following variant

program test_character
real:: a
character(len=30, kind=4):: char1, char2

a = 10
char1 = 4_"Test without local char"
char2 = 4_"Test with local char ÃÃ"

10 format(2X, A, T40, f10.4)

open(6, encoding="utf-8")

print *, len(trim(char1))
print *, len(trim(char2))

write(*,10) char1, a
write(*,10) char2, a

end program

gives

  23
  25
  Test without local char 10.
  Test with local char ÃÃ   10.

???
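
The 23 versus 25 result is what one gets when the second string is measured
in UTF-8 bytes rather than in characters.  The Python sketch below only does
that arithmetic; it says nothing about where gfortran does the counting, and
the helper name lengths is hypothetical.

```python
def lengths(text):
    """Return (number of characters, number of UTF-8 bytes) for `text`."""
    return len(text), len(text.encode("utf-8"))


char1 = "Test without local char"
char2 = "Test with local char \u00c3\u00c3"

print(lengths(char1))  # (23, 23)
print(lengths(char2))  # (23, 25)
```

Both strings have 23 characters; only the byte count of the second one is 25,
because each accented letter takes two bytes in UTF-8.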

[Bug fortran/66499] Letters with accents change format behavior for X and T descriptors.

2015-06-11 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66499

Jerry DeLisle <jvdelisle at gcc dot gnu.org> changed:

            What|Removed                      |Added
---------------------------------------------------------------------------
              CC|                             |jvdelisle at gcc dot gnu.org
        Assignee|unassigned at gcc dot gnu.org|jvdelisle at gcc dot gnu.org

--- Comment #2 from Jerry DeLisle <jvdelisle at gcc dot gnu.org> ---
UTF-8 is a variable-length encoding.  That explains the 6 character difference.
The tabbing code assumes a fixed length per character.  I will have to
investigate this further.  I suspect we are counting bytes, assuming the
position from where we left off.