Re: c32width gives incorrect return values in C locale

2023-11-11 Thread Eli Zaretskii
> From: Bruno Haible 
> Cc: bug-libunistr...@gnu.org
> Date: Sat, 11 Nov 2023 23:54:52 +0100
> 
> [CCing bug-libunistring]
> Gavin Smith wrote:
> > I did not understand why uc_width was said to be "locale dependent":
> > 
> >   "These functions are locale dependent."
> > 
> > - from 
> > .
> 
> That's because some Unicode characters have "ambiguous width": width 1 in
> Western locales, width 2 in East Asian locales (for historical and font-choice
> reasons).

I think this should be explained in the documentation, if it isn't
already.  This "ambiguous width" issue is very subtle and unknown to
many (most?) people, so not having it explicit in the documentation is
not user-friendly, IMO.

> > I also don't understand the purpose of the "encoding" argument -- can this
> > always be "UTF-8"?
> 
> Yes, it can always be "UTF-8"; then uc_width will always choose width 1 for
> these characters.

Regardless of the locale?  Is there an assumption that UTF-8 means
"not CJK" or something?

> > I'm also unclear on the exact relationship between the types char32_t,
> > ucs4_t and uint32_t.  For example, uc_width takes a ucs4_t argument
> > but u8_mbtouc writes to a char32_t variable.  In the code I committed,
> > I used a cast to ucs4_t when calling uc_width.
> 
> These types are all identical. Therefore you don't even need to cast.
> 
>   - char32_t comes from <uchar.h> (ISO C 11 or newer).
>   - ucs4_t comes from GNU libunistring.
>   - uint32_t comes from <stdint.h>.

AFAIU, char32_t is identical to uint_least32_t (which is also from
stdint.h).
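
The relationship between these types can be checked at compile time. The following is a minimal sketch using only standard C11 headers; the function name char32_is_uint32 is made up for illustration, and ucs4_t is left out so that the snippet does not depend on libunistring.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <uchar.h>

/* C11 defines char32_t as the same type as uint_least32_t, so this
   assertion is expected to hold with any conforming C11 compiler.  */
static_assert (_Generic ((char32_t) 0, uint_least32_t: 1, default: 0),
               "char32_t is expected to be uint_least32_t");

/* Return 1 if char32_t and uint32_t are the same type on this platform.
   This is usually true, but the C standard does not promise it.  */
static int
char32_is_uint32 (void)
{
  return _Generic ((char32_t) 0, uint32_t: 1, default: 0);
}
```

Where char32_is_uint32 () returns 1, a value can be passed between the types without a cast; elsewhere the implicit integer conversion would still apply.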



Re: c32width gives incorrect return values in C locale

2023-11-11 Thread Bruno Haible
[CCing bug-libunistring]
Gavin Smith wrote:
> I did not understand why uc_width was said to be "locale dependent":
> 
>   "These functions are locale dependent."
> 
> - from 
> .

That's because some Unicode characters have "ambiguous width": width 1 in
Western locales, width 2 in East Asian locales (for historical and font-choice
reasons).

> I also don't understand the purpose of the "encoding" argument -- can this
> always be "UTF-8"?

Yes, it can always be "UTF-8"; then uc_width will always choose width 1 for
these characters.
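
To make the effect of the encoding argument concrete, here is a rough sketch of the idea. It is not libunistring's code. It assumes that width 2 is chosen only when the encoding names a legacy East Asian charset, which is one reading of the answer above and is not stated in this thread. The helper names, the short list of encodings and the five sample code points are all illustrative.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true for a few legacy East Asian encoding names.
   The list is illustrative, not the one libunistring uses.  */
static bool
is_legacy_cjk_encoding (const char *encoding)
{
  static const char *const names[] = { "EUC-JP", "GBK", "BIG5", "EUC-KR" };
  for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
    if (strcmp (encoding, names[i]) == 0)
      return true;
  return false;
}

/* A few code points whose East_Asian_Width property is "A" (ambiguous):
   section sign, plus-minus sign, multiplication sign, Greek small alpha,
   circled digit one.  Far from a complete table.  */
static bool
is_ambiguous_sample (uint32_t uc)
{
  return uc == 0x00A7 || uc == 0x00B1 || uc == 0x00D7
         || uc == 0x03B1 || uc == 0x2460;
}

/* Width of UC under the assumption above; 1 for everything else.
   NOT a replacement for uc_width: wide, combining and control
   characters are ignored here.  */
static int
sketch_width (uint32_t uc, const char *encoding)
{
  if (is_ambiguous_sample (uc) && is_legacy_cjk_encoding (encoding))
    return 2;
  return 1;
}
```

With this sketch, sketch_width (0x00B1, "UTF-8") gives 1 and sketch_width (0x00B1, "EUC-JP") gives 2, while an ASCII letter gets 1 under either encoding.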

> I'm also unclear on the exact relationship between the types char32_t,
> ucs4_t and uint32_t.  For example, uc_width takes a ucs4_t argument
> but u8_mbtouc writes to a char32_t variable.  In the code I committed,
> I used a cast to ucs4_t when calling uc_width.

These types are all identical. Therefore you don't even need to cast.

  - char32_t comes from <uchar.h> (ISO C 11 or newer).
  - ucs4_t comes from GNU libunistring.
  - uint32_t comes from <stdint.h>.

Bruno



Re: c32width gives incorrect return values in C locale

2023-11-11 Thread Gavin Smith
On Sat, Nov 11, 2023 at 09:06:41PM +0100, Bruno Haible wrote:
> [CCing bug-gnulib]
> Indeed, the c32* functions by design work only on those Unicode characters
> that can be represented as multibyte sequences in the current locale.
> 
> I'll document this better in the Gnulib manual.
> 
> Since you want texinfo to work on UTF-8 encoded text with characters outside
> the repertoire of the current locale, you'll need the libunistring functions,
> documented in
> .
> Namely, replace c32width with uc_width.

Thanks, that seems to work perfectly.

I also changed c32isupper to uc_is_upper.  The gnulib manual stated
(node "isupper"):

  ‘c32isupper’
   This function operates in a locale dependent way, on 32-bit wide
   characters.  In order to use it, you first have to convert from
   multibyte to 32-bit wide characters, using the ‘mbrtoc32’ function.
   It is provided by the Gnulib module ‘c32isupper’.
  
  ...
  
  ‘uc_is_upper’
   This function operates in a locale independent way, on Unicode
   characters.  It is provided by the Gnulib module
   ‘unictype/ctype-upper’.

- and we wanted the "locale independent way".

I did not understand why uc_width was said to be "locale dependent":

  "These functions are locale dependent."

- from 
.

I also don't understand the purpose of the "encoding" argument -- can this
always be "UTF-8"?

I'm also unclear on the exact relationship between the types char32_t,
ucs4_t and uint32_t.  For example, uc_width takes a ucs4_t argument
but u8_mbtouc writes to a char32_t variable.  In the code I committed,
I used a cast to ucs4_t when calling uc_width.



Re: c32width gives incorrect return values in C locale

2023-11-11 Thread Bruno Haible
[CCing bug-gnulib]
Gavin Smith wrote:
> > I guess you will need to look at the Unicode characters that you pass to
> > c32width, and whether you get return values < 1 for some of them.
> 
> It is locale-dependent!
> 
> It looks like c32width is simply being redirected to wcwidth which then
> doesn't work properly with LC_ALL=C.  This is from the gnulib module
> c32width.
> 
> I don't know if there is an easy way to make a self-contained example
> to show the difference, because it needs all the gnulib Makefile machinery,
> but the difference shows up for any non-ASCII character.  If I add a line
> like
> 
>  fprintf (stderr, "width of [%4.0lx] is %d (remaining %s)\n",
>           (unsigned long) wc, width, q);
> 
> in the right place in the code, where width is the result of c32width,
> then the output looks like
> 
> width of [  40] is 1 (remaining @)
> width of [  4f] is 1 (remaining OE )
> width of [  45] is 1 (remaining E )
> width of [ 152] is -1 (remaining Œ)
> width of [  28] is 1 (remaining (Œ)
> 
> for LC_ALL=C, but
> 
> width of [  40] is 1 (remaining @)
> width of [  4f] is 1 (remaining OE )
> width of [  45] is 1 (remaining E )
> width of [ 152] is 1 (remaining Œ)
> width of [  28] is 1 (remaining (Œ)
> 
> otherwise (LC_ALL=en_GB.UTF-8).

Indeed, the c32* functions by design work only on those Unicode characters
that can be represented as multibyte sequences in the current locale.

I'll document this better in the Gnulib manual.

Since you want texinfo to work on UTF-8 encoded text with characters outside
the repertoire of the current locale, you'll need the libunistring functions,
documented in
.
Namely, replace c32width with uc_width.
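
The reason this route does not depend on the locale is that the UTF-8 bytes are decoded directly, without going through the locale's multibyte conversion. The sketch below illustrates that step with a hand-written decoder. The name decode_utf8 is hypothetical and the function is only a stand-in for a libunistring decoder such as u8_mbtouc; in particular, it returns -1 on invalid input, which is a choice made for this example and not a description of libunistring's behaviour.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from S (N bytes available) into *PUC.
   Return the number of bytes consumed, or -1 if the sequence is
   invalid or truncated.  No locale is consulted.  */
static int
decode_utf8 (uint32_t *puc, const unsigned char *s, size_t n)
{
  if (n == 0)
    return -1;

  unsigned char c = s[0];
  int len;
  uint32_t uc, min;

  if (c < 0x80)
    {
      *puc = c;
      return 1;
    }
  else if ((c & 0xE0) == 0xC0) { len = 2; uc = c & 0x1F; min = 0x80; }
  else if ((c & 0xF0) == 0xE0) { len = 3; uc = c & 0x0F; min = 0x800; }
  else if ((c & 0xF8) == 0xF0) { len = 4; uc = c & 0x07; min = 0x10000; }
  else
    return -1;

  if (n < (size_t) len)
    return -1;

  /* Each continuation byte contributes six bits.  */
  for (int i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return -1;
      uc = (uc << 6) | (s[i] & 0x3F);
    }

  /* Reject overlong forms, surrogates and values past U+10FFFF.  */
  if (uc < min || (uc >= 0xD800 && uc <= 0xDFFF) || uc > 0x10FFFF)
    return -1;

  *puc = uc;
  return len;
}
```

For the bytes 0xC5 0x92 this yields U+0152, the character from the report, regardless of LC_ALL. The resulting code point is what would then be handed to uc_width.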

Bruno



c32width gives incorrect return values in C locale

2023-11-11 Thread Gavin Smith
On Fri, Nov 10, 2023 at 07:39:43PM +, Gavin Smith wrote:
> Is the expected output
> 
>    å å (å) Å Å (Å) æ æ (æ) œ œ (œ) Æ Æ (Æ) Œ Œ (Œ) ø ø (ø) Ø Ø (Ø) ß ß (ß)
> 
> (width 74) or
> 
>    @aa å (å) @AA Å (Å) @ae æ (æ) @oe œ (œ) @AE Æ (Æ) @OE Œ (Œ) @o ø (ø) @O Ø (Ø) @ss ß (ß)
> 
> (width 90)?
> 
> I guess you will need to look at the Unicode characters that you pass to
> c32width, and whether you get return values < 1 for some of them.

It is locale-dependent!

It looks like c32width is simply being redirected to wcwidth which then
doesn't work properly with LC_ALL=C.  This is from the gnulib module
c32width.

I don't know if there is an easy way to make a self-contained example
to show the difference, because it needs all the gnulib Makefile machinery,
but the difference shows up for any non-ASCII character.  If I add a line
like

 fprintf (stderr, "width of [%4.0lx] is %d (remaining %s)\n",
          (unsigned long) wc, width, q);

in the right place in the code, where width is the result of c32width,
then the output looks like

width of [  40] is 1 (remaining @)
width of [  4f] is 1 (remaining OE )
width of [  45] is 1 (remaining E )
width of [ 152] is -1 (remaining Œ)
width of [  28] is 1 (remaining (Œ)

for LC_ALL=C, but

width of [  40] is 1 (remaining @)
width of [  4f] is 1 (remaining OE )
width of [  45] is 1 (remaining E )
width of [ 152] is 1 (remaining Œ)
width of [  28] is 1 (remaining (Œ)

otherwise (LC_ALL=en_GB.UTF-8).
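
A reduced test without the gnulib machinery might look like the sketch below. It calls plain wcwidth in place of gnulib's c32width, on the assumption (taken from the observation above) that the latter is redirected to the former, and it assumes that wchar_t holds Unicode code points, as on glibc. The function names width_in_locale and compare_locales are made up for this example.

```c
#define _XOPEN_SOURCE 700 /* asks <wchar.h> to declare wcwidth */
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Normally declared by <wchar.h>; repeated here in case the feature
   macros were already fixed before the #define above took effect.  */
int wcwidth (wchar_t);

/* Return wcwidth (WC) after switching to LOCALE_NAME, or -2 if that
   locale cannot be selected (for example, because it is not installed).  */
static int
width_in_locale (const char *locale_name, wchar_t wc)
{
  if (setlocale (LC_ALL, locale_name) == NULL)
    return -2;
  return wcwidth (wc);
}

/* Print the width of WC under the two locales from the report.  */
static void
compare_locales (wchar_t wc)
{
  static const char *const names[] = { "C", "en_GB.UTF-8" };
  for (int i = 0; i < 2; i++)
    fprintf (stderr, "%-12s width of [%4lx] is %d\n", names[i],
             (unsigned long) wc, width_in_locale (names[i], wc));
}
```

Calling compare_locales (0x152) prints one line per locale. If the behaviour matches the output above, the two lines would differ for U+0152; this has not been verified here and depends on the C library and on which locales are installed.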

Should this be reported as a bug to bug-gnulib or bug-libunistring?

In the context of the input from the test, the following is the contents
of a simplified test file "test.texi":

@@aa @aa{} (å)
@@AA @AA{} (Å)
@@ae @ae{} (æ)
@@oe @oe{} (œ)
@@AE @AE{} (Æ)
@@OE @OE{} (Œ)
@@o @o{} (ø)
@@O @O{} (Ø)
@@ss @ss{} (ß)
@@l @l{} (ł)
@@L @L{} (Ł)
@@DH @DH{} (Ð)
@@TH @TH{} (Þ)
@@dh @dh{} (ð)
@@th @th{} (þ)


Then, in a UTF-8 locale:

$ ../tp/texi2any.pl test.texi && cat test.info
test.texi: warning: document without nodes
This is test.info, produced by texi2any version 7.1dev+dev from
test.texi.

@aa å (å) @AA Å (Å) @ae æ (æ) @oe œ (œ) @AE Æ (Æ) @OE Œ (Œ) @o ø (ø) @O
Ø (Ø) @ss ß (ß) @l ł (ł) @L Ł (Ł) @DH Ð (Ð) @TH Þ (Þ) @dh ð (ð) @th þ
(þ)



Tag Table:

End Tag Table


Local Variables:
coding: utf-8
End:

However:

$ LC_ALL=C ../tp/texi2any.pl test.texi && cat test.info
test.texi: warning: document without nodes
This is test.info, produced by texi2any version 7.1dev+dev from
test.texi.

@aa å (å) @AA Å (Å) @ae æ (æ) @oe œ (œ) @AE Æ (Æ) @OE Œ (Œ) @o ø (ø) @O Ø (Ø) @ss ß (ß) @l
ł(ł) @L Ł (Ł) @DH Ð (Ð) @TH Þ (Þ) @dh ð (ð) @th þ (þ)



Tag Table:

End Tag Table


Local Variables:
coding: utf-8
End:

In the latter case, the first line of output is much longer.