Re: mbcel module for Gnulib?, incomplete multibyte sequences

2023-07-17 Thread Paul Eggert

On 2023-07-16 15:18, Bruno Haible wrote:

Paul Eggert wrote:

Although I'm sure mbiter can be improved I
don't see how it could catch up to mbcel so long as it continues to
solve a harder problem than mbcel solves.


I don't know exactly what you mean by "harder problem".


I meant that it solves a harder porting problem because it worries about 
more issues, e.g., it worries about mbrtoc32 returning (size_t) -3, or 
returning (size_t) -1 in the C locale. I guess it also worries about 
column counting (something I hadn't thought about but your email raised 
the issue). There are probably other things that it worries about, that 
mbcel does not. The more things mbiter needs to worry about the slower 
it gets.




The other significant difference that I see is the handling of multibyte
sequences. When there 2 or 3 bytes (of, say, UTF-8) that constitute an
incomplete multibyte character at the end of the string,


This isn't a problem for programs like grep and diff, where there's 
always a newline at the end the input buffer.




   - mbcel returns each byte, one by one, as a character without a
 char32_t code.


(A nit: it's not a character; it's an encoding error.)



   - ISO 10646 says ([1] section R.7) "it shall interpret that malformed
 sequence in the same way that it interprets a character that is outside
 the adopted subset".


If I understand this requirement correctly mbcel satisfies it, as mbcel 
treats those two things in the same way, namely, as sequences of 
encoding error bytes.




   - Markus Kuhn's example ([2] section 3) has a section where
   "All bytes of an incomplete sequence should be signalled as a single
malformed sequence, i.e., you should see only a single replacement
character in each of the next 10 tests."


Kuhn is talking about programs that display characters to users and that 
need some way to signal encoding errors. But diff is not such a program: 
it doesn't need to display a signal for an incomplete sequence, because 
it's not responsible for display.


Even for the class of programs that Kuhn is talking about it's not clear 
that the practice he recommends is a good one. It's certainly not 
typical practice in the GNU/Linux world. It's not true of the first five 
applications that I tested on Ubuntu 23.04: Emacs, Chrome, Firefox, 
less, and gnome-terminal.


Even if Kuhn's suggestion were good for display programs, programs like 
diff should not treat differing encoding error byte sequences as if they 
were equivalent. If two files A and B contain different encoding errors 
I expect most users would prefer "diff A B" to report the differences.


I take the point that diff's column counting disagrees with Kuhn's 
suggestion. However, there's no standard for columnar display of 
encoding errors. Some programs display each encoding byte as a 
single-column character. Some do it as a two-column character. Emacs by 
default uses four columns. xterm, the program that you mention, is 
glitchy: sometimes it displays a UTF-8-like sequence as a single-column 
U+FFFD REPLACEMENT CHARACTER but sometimes it doesn't, and on my 
platform, when I cat Kuhn's test to standard output, two of the four 
tests in the last screenful fail to line up their columns. There's not 
even a standard column width for U+FFFD itself: Kuhn recommends 1 in 
, but 2 is more common in 
my experience.


In short, in practice there's no way for diff to tell how encoding 
errors are displayed. diff currently guesses 1 column per encoding error 
byte, as that's an easy guess. It's not clear that complicating this 
guess would be a net win for diff users. Which means mbcel is good 
enough for diff.


(Composing this email prompted me to document this issue better in the 
diffutils manual, so I installed the attached patch there.)




This may be acceptable as a corner case for 'diff'. But for a module offered
by Gnulib, we should IMO continue to follow the best practice here.


Although Kuhn's suggestion may be best practice for some applications, 
it's not best for applications like diff, and it would be helpful if 
Gnulib could support these applications.
From 856f72409b62d9d3459270971dbca9d19e31 Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Mon, 17 Jul 2023 12:48:30 -0700
Subject: [PATCH] doc: document tab behavior better

* doc/diffutils.texi (Tabs): Document issues with tabs,
encoding errors, and non-ASCII characters.
---
 doc/diffutils.texi | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/doc/diffutils.texi b/doc/diffutils.texi
index 6f2657f..c2d42bd 100644
--- a/doc/diffutils.texi
+++ b/doc/diffutils.texi
@@ -1904,6 +1904,10 @@ These adjustments can be applied to any output format.
 @cindex tab stop alignment
 @cindex aligning tab stops
 
+The tab character moves the cursor to the next tab stop.
+Tab stops are normally every 8 display columns;
+this can be altered by the 

[PATCH] announce-gen: Allow using local git user.name.

2023-07-17 Thread Simon Josefsson via Gnulib discussion list
Hi.  I think announce-gen should use the username provided by the
per-repo .git/config rather than the global ~/.gitconfig.

I hit this corner on build farms where I don't want to modify files
outside of the repository.

/Simon
From 6928b3b86169bf5d265f745ff5f93eb21181a59e Mon Sep 17 00:00:00 2001
From: Simon Josefsson 
Date: Mon, 17 Jul 2023 22:07:57 +0200
Subject: [PATCH] announce-gen: Allow using local git user.name.

* build-aux/announce-gen (readable_interval): Remove --global
parameter to 'git config' call.
---
 ChangeLog  | 6 ++
 build-aux/announce-gen | 4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 88de38a2ae..a19162d1bf 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,9 @@
+2023-07-17  Simon Josefsson  
+
+	announce-gen: Allow using local git user.name.
+	* build-aux/announce-gen (readable_interval): Remove --global
+	parameter to 'git config' call.
+
 2023-07-17  Bruno Haible  
 
 	mbuiter: Optimize.
diff --git a/build-aux/announce-gen b/build-aux/announce-gen
index 850619a121..4056d443b0 100755
--- a/build-aux/announce-gen
+++ b/build-aux/announce-gen
@@ -35,7 +35,7 @@
 eval 'exec perl -wSx "$0" "$@"'
  if 0;
 
-my $VERSION = '2023-02-26 17:15'; # UTC
+my $VERSION = '2023-07-17 20:05'; # UTC
 # The definition above must lie within the first 8 lines in order
 # for the Emacs time-stamp write hook (at end) to update it.
 # If you change this file with Emacs, please let the write hook
@@ -545,7 +545,7 @@ EOF
   my $v0 = $prev_version;
   my $v1 = $curr_version;
 
-  (my $first_name = `git config --global user.name|cut -d' ' -f1`)
+  (my $first_name = `git config user.name|cut -d' ' -f1`)
 =~ m{\S} or die "no name? set user.name in ~/.gitconfig\n";
 
   chomp (my $n_ci = `git rev-list "v$v0..v$v1" | wc -l`);
-- 
2.34.1



signature.asc
Description: PGP signature


mbuiter: Optimize

2023-07-17 Thread Bruno Haible
This patch moves the MB_CUR_MAX reference out of the mbuiter loop, and
thereby provides a speedup between 1.5% and 2% (according to the
bench-mbuiter benchmark).


2023-07-17  Bruno Haible  

mbuiter: Optimize.
* lib/mbuiter.h (struct mbuiter_multi): Add cur_max field.
(mbui_init): Initialize it.
(mbuiter_multi_next): Use it instead of MB_CUR_MAX.
(mbuiter_multi_copy): Update.

diff --git a/lib/mbuiter.h b/lib/mbuiter.h
index 6a06032941..9e2e90a235 100644
--- a/lib/mbuiter.h
+++ b/lib/mbuiter.h
@@ -121,6 +121,7 @@ struct mbuiter_multi
before and after every mbuiter_multi_next 
invocation.
  */
   bool next_done;   /* true if mbui_avail has already filled the following 
*/
+  unsigned int cur_max; /* A cache of MB_CUR_MAX.  */
   struct mbchar cur;/* the current character:
 const char *cur.ptr  pointer to current character
 The following are only valid after mbui_avail.
@@ -160,7 +161,7 @@ mbuiter_multi_next (struct mbuiter_multi *iter)
 with_shift:
   #endif
   iter->cur.bytes = mbrtoc32 (>cur.wc, iter->cur.ptr,
-  strnlen1 (iter->cur.ptr, MB_CUR_MAX),
+  strnlen1 (iter->cur.ptr, iter->cur_max),
   >state);
   if (iter->cur.bytes == (size_t) -1)
 {
@@ -225,6 +226,7 @@ mbuiter_multi_copy (struct mbuiter_multi *new_iter, const 
struct mbuiter_multi *
   #endif
 mbszero (_iter->state);
   new_iter->next_done = old_iter->next_done;
+  new_iter->cur_max = old_iter->cur_max;
   mb_copy (_iter->cur, _iter->cur);
 }
 
@@ -234,13 +236,15 @@ typedef struct mbuiter_multi mbui_iterator_t;
 #define mbui_init(iter, startptr) \
   ((iter).cur.ptr = (startptr), \
(iter).in_shift = false, mbszero (&(iter).state), \
-   (iter).next_done = false)
+   (iter).next_done = false, \
+   (iter).cur_max = MB_CUR_MAX)
 #else
 /* Optimized: no in_shift.  */
 #define mbui_init(iter, startptr) \
   ((iter).cur.ptr = (startptr), \
mbszero (&(iter).state), \
-   (iter).next_done = false)
+   (iter).next_done = false, \
+   (iter).cur_max = MB_CUR_MAX)
 #endif
 #define mbui_avail(iter) \
   (mbuiter_multi_next (&(iter)), !mb_isnul ((iter).cur))


mbchar: Reduce size of 'struct mbchar'

2023-07-17 Thread Bruno Haible
This is a preparation for the modules 'mbitern' and 'mbuitern':


2023-07-17  Bruno Haible  

mbchar: Reduce size of 'struct mbchar'.
* modules/mbfile (configure.ac): Define GNULIB_MBFILE as an indicator.
* lib/mbchar.h (MBCHAR_BUF_SIZE): Set to 4.
(struct mbchar): Disable member 'buf' if the module 'mbfile' is not in
use.

diff --git a/lib/mbchar.h b/lib/mbchar.h
index 82c373f47e..d1900b5a51 100644
--- a/lib/mbchar.h
+++ b/lib/mbchar.h
@@ -156,7 +156,9 @@ _GL_INLINE_HEADER_BEGIN
 # define MBCHAR_INLINE _GL_INLINE
 #endif
 
-#define MBCHAR_BUF_SIZE 24
+/* The longest multibyte characters, nowadays, are 4 bytes long.
+   Regardless of the values of MB_CUR_MAX and MB_LEN_MAX.  */
+#define MBCHAR_BUF_SIZE 4
 
 struct mbchar
 {
@@ -164,7 +166,9 @@ struct mbchar
   size_t bytes; /* number of bytes of current character, > 0 */
   bool wc_valid;/* true if wc is a valid 32-bit wide character */
   char32_t wc;  /* if wc_valid: the current character */
+#if defined GNULIB_MBFILE
   char buf[MBCHAR_BUF_SIZE]; /* room for the bytes, used for file input only */
+#endif
 };
 
 /* EOF (not a real character) is represented with bytes = 0 and
diff --git a/modules/mbfile b/modules/mbfile
index af3674ea79..09593fcc3f 100644
--- a/modules/mbfile
+++ b/modules/mbfile
@@ -18,6 +18,10 @@ stdbool
 
 configure.ac:
 gl_MBFILE
+dnl Do not use gl_MODULE_INDICATOR([mbfile]) here: we don't want 'struct 
mbchar'
+dnl to have a different size in lib/ than in tests/.
+AC_DEFINE([GNULIB_MBFILE], [1],
+  [Define to 1 if the gnulib module 'mbfile' is in use.])
 
 Makefile.am:
 lib_SOURCES += mbfile.h mbfile.c






Re: From wchar_t to char32_t, new module mbszero

2023-07-17 Thread Bruno Haible
Running the test suite on Minix shows test failures. There, like on NetBSD,
the source code investigation was incomplete. This patch fixes the failures.


2023-07-17  Bruno Haible  

mbszero: Fix for Minix.
* lib/wchar.in.h: (_GL_MBSTATE_INIT_SIZE): Don't define on Minix.
(_GL_MBSTATE_ZERO_SIZE): Define to 4 on Minix.

diff --git a/lib/wchar.in.h b/lib/wchar.in.h
index 430fa6fcec..510f202537 100644
--- a/lib/wchar.in.h
+++ b/lib/wchar.in.h
@@ -451,8 +451,9 @@ _GL_WARN_ON_USE (mbsinit, "mbsinit is unportable - "
lib/libc/citrus/citrus_*.c.  */
 /* File   INIT_SIZE  ZERO_SIZE
citrus_none.c  0  0 */
-#  define _GL_MBSTATE_INIT_SIZE 1
-#  define _GL_MBSTATE_ZERO_SIZE 1
+/* But 1 is not the correct value for _GL_MBSTATE_ZERO_SIZE: we get test
+   failures for values < 4.  */
+#  define _GL_MBSTATE_ZERO_SIZE 4
 # elif defined __sun  /* Solaris */
 /* On Solaris, mbstate_t is defined in .
It is an opaque aligned 24-byte or 32-byte struct, of which at most the 
first






Re: From wchar_t to char32_t, new module mbszero

2023-07-17 Thread Bruno Haible
Paul Eggert wrote:
> > However, after implementing mbszero with this data and enabling its use
> > in many places, I got test failures on NetBSD and Solaris.
> >- On NetBSD, the minimum we need to clear is 28 bytes.
> >- On Solaris OmniOS and OpenIndiana, the minimum we need to clear is 16 
> > bytes.
> >- On proprietary Solaris, the minimum we need to clear is 20 or 28 bytes
> >  (depending on 32-bit or 64-bit mode).
> > So, clearly this is fragile stuff.
> 
> Were the test failures for single calls to mbrtoc32, or were they for 
> something else?

Find attached the test failures on NetBSD 9.3 and Solaris 11.4, when I define
_GL_MBSTATE_ZERO_SIZE to a value that is 4 bytes smaller than the working one.

In all three cases, there were
  assertion "mbsinit (>state)" failed
failures, which means that mbsinit takes into account more initial bytes than
we expected.

On Solaris 11.4, there were also many test failures of other kinds.

> If the former, what went wrong with the source-code analysis?

On NetBSD, I apparently did not locate the right source code of the mbsinit
function, due to the complexity of the citrus code. And did not want to debug
it, because debugging in libc code without debugging information is often
a waste of time.

On Solaris 11.4, we don't have source code, and it apparently behaves
differently than the OpenSolaris derivatives for which we have source code.

>   > +/* _GL_MBSTATE_INIT_SIZE describes how mbsinit() behaves: It is the 
> number of
> > +   bytes at the beginning of an mbstate_t that need to be zero, for 
> > mbsinit()
> > +   to return true.
> 
> This macro is not used anywhere. How about adding a comment explaining 
> why it's defined but not used? Or if it's not needed we can remove it.

It's needed, namely as lower bound for _GL_MBSTATE_ZERO_SIZE:
0 < _GL_MBSTATE_INIT_SIZE <= _GL_MBSTATE_ZERO_SIZE <= sizeof (mbstate_t).

The test suite failures on NetBSD clearly showed what happens when we ignore
what mbsinit() does.

> > +# elif __GLIBC__ >= 2 /* glibc */
> 
> Should be glibc 2.2 not glibc 2, if I read the history right.

Strictly speaking, yes. But we have long stopped supporting glibc 2.0 and 2.1
as portability targets. We removed m4/glibc21.m4 in 2020.

> Also should check that __GLIBC__ is defined, for the benefit of picky 
> compilers.

We have more than 300 uses of '# if/elif __GLIBC__ >= 2' in lib/*.h, and
haven't received reports about it being a problem. So I guess people are clever
enough to remove these extra-pickiness compiler options when they compile
GNU code.

> > # elif defined MUSL_LIBC  /* musl libc */
> > /* mbstate_t is defined in .
> >It is an opaque aligned 8-byte struct, of which at most the first
> >4 bytes are used.
> >For more details, see src/multibyte/mbrtowc.c.  */
> > #  define _GL_MBSTATE_INIT_SIZE 4
> > #  define _GL_MBSTATE_ZERO_SIZE 4
> 
> Better to say 'sizeof (unsigned)' instead of '4', as the source code 
> uses 'unsigned'. This wouldn't affect the machine code on existing 
> platforms, would be better documentation, and would be theoretically 
> better on oddball future platforms.

OK for documentation purposes.

Regarding oddball future platforms, they can write 'unsigned long' instead
of 'unsigned' in their definition of mbstate_t; therefore it does not really
matter whether we write 4 or sizeof (unsigned).

And I won't write
  sizeof (((mbstate_t) {0}).__count)
outside of comments, since the less identifiers with double-underscore
we use, the better.

> On 64-bit Solaris 10 sparc with either GCC or Oracle's compiler, it's 
> the same number of insns to initialize 28 bytes vs 32 bytes. So if 28 is 
> needed let's drop the optimization for Solaris as not worth the 
> aggravation of maintaining and worrying about its brittleness.

OK, done. Also because it's a proprietary platform and without source
inspection, we cannot really know whether there's one locale encoding that
needs the 32 bytes actually.

> > +#  define _GL_MBSTATE_ZERO_SIZE sizeof (mbstate_t)
> 
> Might be better to have this sort of thing at the end, as the default, 
> rather than sprinkle lots of copies of it elsewhere.

OK, done.

> Assuming the problem was not with single calls to mbrtoc32, how about if 
> we define another function mbszerotoc32 that is the minimum number of 
> bytes to clear so that a single call to mbrtoc32 will work?

The problem with this proposal is that we then have two different functions,
used in various places, and one of them will have at most half of the unit
test coverage than what we have now.

We are already far in the land of using undocumented internals and relying
on accidental facts (such as the order of the fields in the mbstate_t).
Only the unit tests give me a certain trust in what we do. In this situation,
I would not like to bet that a random half of the unit tests will give us
the same reliability.

What would the advantage of