Re: mbcel module for Gnulib?, incomplete multibyte sequences

2023-07-20 Thread Bruno Haible
Hi Paul,

> >> Although I'm sure mbiter can be improved I
> >> don't see how it could catch up to mbcel so long as it continues to
> >> solve a harder problem than mbcel solves.
> > 
> > I don't know exactly what you mean by "harder problem".
> 
> I meant that it solves a harder porting problem because it worries about 
> more issues, e.g., it worries about mbrtoc32 returning (size_t) -3,

This makes only for a small performance difference. I could measure it
by doing the benchmarks of
  mbiterf-bench-tests mbuiterf-bench-tests
versus
  mbiterf-bench-tests mbuiterf-bench-tests mbrtoc32-regular

In the latter case, the lines marked with
  #if !GNULIB_MBRTOC32_REGULAR
are optimized away.

> or returning (size_t) -1 in the C locale.

Indeed, this shows as a difference between mbiterf and mbcel in the
test cases c, f:

mbiterf mbcel   mbuiterf
c1.145  0.6701.179
f   13.028  5.714   14.654

But since the glibc people are already working on resolving this issue,
I won't spend time optimizing it one way or the other.

> > The other significant difference that I see is the handling of multibyte
> > sequences. When there 2 or 3 bytes (of, say, UTF-8) that constitute an
> > incomplete multibyte character at the end of the string,
> 
> This isn't a problem for programs like grep and diff, where there's 
> always a newline at the end the input buffer.

I disagree: Any program can run into it when the input is
  
My screenshot from the 'src/diff -y -t' output in an xterm also shows
that there is an issue.

> >- mbcel returns each byte, one by one, as a character without a
> >  char32_t code.
> 
> (A nit: it's not a character; it's an encoding error.)

Sure. Some programs then treat that error as if an U+FFFD character
had been read.

> >- ISO 10646 says ([1] section R.7) "it shall interpret that malformed
> >  sequence in the same way that it interprets a character that is outside
> >  the adopted subset".
> 
> If I understand this requirement correctly mbcel satisfies it, as mbcel 
> treats those two things in the same way, namely, as sequences of 
> encoding error bytes.

No, I don't think mbcel satisfies it, since mbcel interprets the
"malformed sequence" not like "a character" but like multiple characters.

> >- Markus Kuhn's example ([2] section 3) has a section where
> >"All bytes of an incomplete sequence should be signalled as a single
> > malformed sequence, i.e., you should see only a single replacement
> > character in each of the next 10 tests."
> 
> Kuhn is talking about programs that display characters to users and that 
> need some way to signal encoding errors. But diff is not such a program: 
> it doesn't need to display a signal for an incomplete sequence, because 
> it's not responsible for display.

Kuhn's writeup is generally about UTF-8 decoding. In the year 2000, when it
was written, the most important decoders were in display engines of terminal
emulators. Nowadays, we have UTF-8 decoders in many many programs.

The Unicode Standard has several pages about this topic:
Unicode 15.0 section 3.9
  https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf  pages 124..129
It is also referenced by section 5.22
  https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf  page 255.

The relevant text starts at page 127:
  "U+FFFD Substitution of Maximal Subparts

   An increasing number of implementations are adopting the handling of
   ill-formed subsequences as specified in the W3C standard for encoding
   to achieve consistent U+FFFD replacements. ...
   Although the Unicode Standard does not require this practice for conformance
   ..."
See also the table 3-11 on page 128.

So, clearly, this is not a *requirement* for a conforming UTF-8 decoder.
But the Unicode Standard's authors would not describe it in this great length
if it wasn't a good practice.

> It's certainly not 
> typical practice in the GNU/Linux world. It's not true of the first five 
> applications that I tested on Ubuntu 23.04: Emacs, Chrome, Firefox, 
> less, and gnome-terminal.

Well, then I'll have to write a couple of QoI (quality of implementation)
reports...

(xterm does it right, but you are right that nowadays gnome-terminal and other
vte-based terminal emulators are the majority.)

> Even if Kuhn's suggestion were good for display programs, programs like 
> diff should not treat differing encoding error byte sequences as if they 
> were equivalent. If two files A and B contain different encoding errors 
> I expect most users would prefer "diff A B" to report the differences.

Sure. If you were to use the 'mbiterf' module instead of mbcel, the
mb_equal macro from mbchar.h does the right thing. Yes, an mb_equal
call is a bit more complicated than the same_ch_err definition that
you have in diffutils/src/io.c. That's the unavoidable consequence of
treating a sequence of 2 or 3 bytes as *one* error.

> There's not 
> even a standard column width for U+FFFD itself: Kuhn reco

Re: mbcel module for Gnulib?

2023-07-20 Thread Bruno Haible
I wrote:
> So, clearly, the struct-as-return-value approach nowadays generates
> more efficient code than when mbiter was designed (in 2001-2005).
> 
> I will thus create new variants of mbiter, mbuiter, called mbitern, mbuitern
> ('n' for "new"), that allow better optimization by current GCC.

I've called these variants 'mbiterf' and 'mbuiterf'. The benchmarks
now show decent performance for ASCII text (tests a and b):

$ ./gnulib-tool --create-testdir --dir=../testdir1 --single-configure --symlink 
mbuiterf-bench-tests mbiterf-bench-tests mbrtoc32-regular
$ cd ../testdir1
$ patch -p1 < .../y.diff

mbiterf mbcel   mbuiterf
a0.226  0.2200.206
b0.231  0.2210.205
c1.145  0.6701.179
d0.850  0.7150.882
e0.894  0.7200.939
f   13.028  5.714   14.654
g8.855  6.6949.601
h8.618  6.596   10.072
i3.955  2.8284.631
j4.169  3.0274.756

Thanks Paul for the insights!

Bruno






Re: error: possibly undefined macro: gl_CHECK_NEXT_HEADERS

2023-07-20 Thread Bruno Haible
cbh34680 wrote:
> In gnulib I'm seeing this errors:
> 
> configure:4438: error: possibly undefined macro: gl_CHECK_NEXT_HEADERS
>   If this token and others are legitimate, please use m4_pattern_allow.
>   See the Autoconf documentation.
> 
> configure:4839: error: possibly undefined macro: gl_ABSOLUTE_HEADER_ONE
>   If this token and others are legitimate, please use m4_pattern_allow.
>   See the Autoconf documentation.

Thanks for the report. I can reproduce it with
  ./gnulib-tool --create-testdir --dir=../testdir0 --single-configure error-h

> This patch fixes it.

Thanks; I prefer this patch instead.


2023-07-20  Bruno Haible  

error-h: Fix dependencies.
Reported by  in
.
* modules/error-h (Depends-on): Add include_next.

diff --git a/modules/error-h b/modules/error-h
index b6618d3c76..eeb5801732 100644
--- a/modules/error-h
+++ b/modules/error-h
@@ -7,6 +7,7 @@ m4/error_h.m4
 
 Depends-on:
 gen-header
+include_next
 snippet/c++defs
 
 configure.ac:






error: possibly undefined macro: gl_CHECK_NEXT_HEADERS

2023-07-20 Thread cbh34680
In gnulib I'm seeing this errors:

configure:4438: error: possibly undefined macro: gl_CHECK_NEXT_HEADERS
  If this token and others are legitimate, please use m4_pattern_allow.
  See the Autoconf documentation.

configure:4839: error: possibly undefined macro: gl_ABSOLUTE_HEADER_ONE
  If this token and others are legitimate, please use m4_pattern_allow.
  See the Autoconf documentation.

This patch fixes it.

diff --git a/modules/error-h b/modules/error-h
index b6618d3c76..31749b3f92 100644
--- a/modules/error-h
+++ b/modules/error-h
@@ -4,6 +4,8 @@ Functions for error reporting.
 Files:
 lib/error.in.h
 m4/error_h.m4
+m4/include_next.m4
+m4/absolute-header.m4

 Depends-on:
 gen-header


Re: Debugging du v9.3

2023-07-20 Thread Bruno Haible
Pádraig Brady wrote:
> With the attached gnulib patch applied you should be able to build with:

The patch looks good, except that in this .h file, function parameter names
are typically omitted. So, the declaration should better read:

void fts_cross_check (FTS const *);

Bruno






Re: Debugging du v9.3

2023-07-20 Thread Pádraig Brady

On 19/07/2023 19:27, Saulius Krasuckas wrote:

Hello,

I am trying to build Coreutils-9.3 from source. [1]

If I run ./configure && make, it builds.

But I noticed that it calculates some things wrong.  So I decided to
enable some debugging.  Any suggestions?
Since I found none, I looked up at the source and noticed the DU_DEBUG
define: [2]

Now if I clean the dir and run make CFLAGS="-DDU_DEBUG", it fails:

--- snip ---
CC   src/dircolors.o
CCLD src/dircolors
CC   src/dirname.o
CCLD src/dirname
CC   src/du.o
In file included from src/du.c:26:
src/du.c: In function 'du_files':
./lib/config.h:4385:25: warning: implicit declaration of function
'rpl_fts_cross_check'; did you mean 'fts_cross_check'?
[-Wimplicit-function-declaration]
   4385 | #define fts_cross_check rpl_fts_cross_check
| ^~~
src/du.c:60:31: note: in expansion of macro 'fts_cross_check'
 60 | # define FTS_CROSS_CHECK(Fts) fts_cross_check (Fts)
|   ^~~
src/du.c:707:11: note: in expansion of macro 'FTS_CROSS_CHECK'
707 |   FTS_CROSS_CHECK (fts);
|   ^~~
CCLD src/du
/usr/bin/ld: src/du.o: in function `du_files':
du.c:(.text+0x14b5): undefined reference to `rpl_fts_cross_check'
/usr/bin/ld: src/du.o: in function `main':
du.c:(.text+0x16fa): undefined reference to `fts_debug'
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:10638: src/du] Error 1
make[2]: Leaving directory '/home/sskras/debug/src/coreutils-9.3'
make[1]: *** [Makefile:21297: all-recursive] Error 1
make[1]: Leaving directory '/home/sskras/debug/src/coreutils-9.3'
make: *** [Makefile:8434: all] Error 2
--- snip ---

Is this a bug and/how should I report it?


It looks like an issue with gnulib.
With the attached gnulib patch applied you should be able to build with:

  make CFLAGS="-DDU_DEBUG -DGNULIB_FTS_DEBUG=1 -O2"

Note -O2 is specified to avoid compiler warnings at default optimization levels.

It would be useful to know if fts_cross_check() is useful to you.

cheers,
Pádraig


diff --git a/lib/fts.c b/lib/fts.c
index 884b84a5a1..875fe05793 100644
--- a/lib/fts.c
+++ b/lib/fts.c
@@ -1736,11 +1736,11 @@ fd_ring_print (FTS const *sp, FILE *stream, char const *msg)
 {
   int fd = fd_ring->ir_data[i];
   if (fd < 0)
-fprintf (stream, "%d: %d:\n", i, fd);
+fprintf (stream, "%u: %d:\n", i, fd);
   else
 {
   struct devino wd = getdevino (fd);
-  fprintf (stream, "%d: %d: "PRINT_DEVINO"\n", i, fd, wd.dev, wd.ino);
+  fprintf (stream, "%u: %d: "PRINT_DEVINO"\n", i, fd, wd.dev, wd.ino);
 }
   if (i == fd_ring->ir_back)
 break;
diff --git a/lib/fts_.h b/lib/fts_.h
index fa3d4146e2..82da038a53 100644
--- a/lib/fts_.h
+++ b/lib/fts_.h
@@ -271,6 +271,10 @@ _GL_ATTRIBUTE_NODISCARD
 FTSENT  *fts_read (FTS *) __THROW;
 
 int  fts_set (FTS *, FTSENT *, int) __THROW;
+
+#if GNULIB_FTS_DEBUG
+void fts_cross_check (FTS const *sp);
+#endif
 __END_DECLS
 
 #endif /* fts.h */