On Fri, Apr 24, 2026 at 07:20:35PM +0000, Vincent Belaïche wrote:
> Having said that, what I was trying to highlight in my previous email is
> that:
> 
> - this fix requires the Texinfo maintainers to add customization for
>   each language, either to remove accents or to surround with {} the
>   non-ASCII letters that have multi-byte UTF-8 representations.
> 
> - this fix would not work for non-Latin scripts (e.g. Russian or Japanese)

Russian is not supported by texinfo.tex because we don't load any Cyrillic
fonts.

I don't know how indices should be sorted for Japanese.

> PS-2: to Gavin, actually if we forget about correct key sorting and just
> want the compilation not to be broken, we could have a simpler fix:
> 
> 1. pass the real encoding from texi2dvi to texindex through the command
>    line (or via some custom envvar)
> 
> 2. in the texindex AWK script, make the @initial{...} starting letter
>    never break a UTF-8 byte sequence when the current locale encoding
>    is 8-bit but the document encoding is UTF-8. What this would require
>    is just
> 
>    - set some flag to true when the current locale encoding is 8-bit
>      and the document encoding is UTF-8
>    - if the flag is false, no change: take the first char into @initial{...}
>    - if the flag is true, replace the first-char extraction with some
>      function that interprets chars as bytes and takes 1, 2 or 3 chars
>      depending on how these bytes fit into the Unicode char thus encoded.

Apparently this last point is hard to achieve in awk, especially in a way
that will work portably across different awk implementations such as gawk
or mawk.
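
For reference, here is a minimal sketch of the byte-level rule being
described, written in Python rather than awk and not taken from texindex.
The function names are hypothetical. It assumes the key is available as raw
bytes and reads the sequence length from the lead byte. Note that a UTF-8
sequence can be up to four bytes long, so the sketch handles lengths 1 to 4.

```python
def first_utf8_char_len(data: bytes) -> int:
    """Return the byte length of the first UTF-8 sequence in data.

    Hypothetical helper: the length is read from the lead byte only.
    """
    if not data:
        return 0
    lead = data[0]
    if lead < 0x80:            # 0xxxxxxx: plain ASCII
        return 1
    if 0xC2 <= lead <= 0xDF:   # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead <= 0xF4:   # 11110xxx: four-byte sequence
        return 4
    return 1                   # stray or invalid byte: take it alone


def initial_bytes(key: bytes) -> bytes:
    """Return the bytes that would go inside @initial{...}."""
    return key[:first_utf8_char_len(key)]


if __name__ == "__main__":
    print(initial_bytes("école".encode("utf-8")))  # two bytes for the e-acute
    print(initial_bytes(b"apple"))                 # one byte
```

This only shows that the initial is never cut in the middle of a sequence;
it does nothing about sort order.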


> My understanding is that the compilation is broken by texindex producing
> some @initial{<1st byte of a multibyte char>}.
> 
> PS-3: to Gavin again, in the same vein as PS-2, my original patch could
>  be improved, still with the sole objective of not breaking the
>  compilation and accepting bad sorting when the locale is not installed,
>  by doing the following:
>  - gather <document-locale> from the texinfo document (already in my patch)
>  - check if the <document-locale> is installed, if yes proceed as already
>    done in my patch
>  - otherwise if <current locale> and <document-locale> use the same
>    encoding, do not change the locale when calling texindex
>  - otherwise, if <derived locale> uses XX encoding, set LC_ALL=C.XX when
>    calling texindex.
> 
>  This way, the patch does not require that the <document-locale> is
>  installed, but only that C.XX is installed for the encoding of
>  <document-locale>, which is much more likely to be true.
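
To make sure the proposal is read the same way by everyone, the fallback
order can be sketched as a small decision function. This is Python for
illustration only, not the texi2dvi patch. The function names are
hypothetical, the set of installed locales is assumed to be given (for
instance collected from `locale -a`), and "XX" is taken here to mean the
encoding of <document-locale>, which is an assumption on my part.

```python
from typing import Optional, Set


def codeset_of(locale_name: str) -> str:
    """Return the codeset part of a name such as 'fr_FR.UTF-8@euro'."""
    codeset = locale_name.partition(".")[2]
    return codeset.split("@")[0]


def same_encoding(a: str, b: str) -> bool:
    """Compare codesets loosely, so 'UTF-8' and 'utf8' count as equal."""
    def norm(name: str) -> str:
        return codeset_of(name).replace("-", "").lower()
    return norm(a) == norm(b)


def texindex_locale(document_locale: str,
                    current_locale: str,
                    installed: Set[str]) -> Optional[str]:
    """Return the LC_ALL value for texindex, or None to leave it unchanged."""
    if document_locale in installed:
        return document_locale          # locale available: use it
    if same_encoding(document_locale, current_locale):
        return None                     # same encoding: keep current locale
    return "C." + codeset_of(document_locale)  # fall back to C.XX


if __name__ == "__main__":
    print(texindex_locale("fr_FR.UTF-8", "en_US.ISO-8859-1", {"C.UTF-8"}))
```

The membership test is deliberately naive: real systems may spell the same
locale in more than one way, which an actual implementation would have to
handle.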

Another point, which may have been mentioned already in this discussion, is
that setting the locale probably won't make much of a difference to the
order in which texindex sorts index entries, apart from changing the
character encoding:

    The POSIX standard used to say that all string comparisons are performed
    based on the locale's "collating order".  This is the order in which
    characters sort, as defined by the locale (for more discussion, *note
    Locales::).  This order is usually very different from the results
    obtained when doing straight byte-by-byte comparison.(1)
    
       Because this behavior differs considerably from existing practice,
    'gawk' only implemented it when in POSIX mode (*note Options::).

(Info node "(gawk)POSIX String Comparison".)
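
As an illustration of what straight byte-by-byte comparison does to accented
entries, here is a small Python sketch (not awk, and with a hypothetical
function name) that sorts strings by their UTF-8 bytes. Upper-case ASCII
sorts before lower-case ASCII, and any entry starting with a non-ASCII
letter lands after all ASCII entries.

```python
from typing import List


def byte_order_sort(entries: List[str]) -> List[str]:
    """Sort entries by comparing their UTF-8 encodings byte by byte."""
    return sorted(entries, key=lambda entry: entry.encode("utf-8"))


if __name__ == "__main__":
    for entry in byte_order_sort(["zebra", "école", "apple", "Zulu"]):
        print(entry)
```

With these four entries the order is Zulu, apple, zebra, école, whereas a
French collating order would be expected to put école among the other
entries rather than at the end.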
