Your message dated Fri, 25 Oct 2024 04:04:11 +0000
with message-id <[email protected]>
and subject line Bug#1027414: fixed in luit 2.0.20240910-1
has caused the Debian Bug report #1027414,
regarding luit: Luit does not handle Unicode beyond BMP
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)


-- 
1027414: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1027414
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Package: luit
Version: 2.0.20221028-1
Severity: normal
Tags: patch
X-Debbugs-Cc: [email protected]

Dear Maintainer,

It appears that luit does not handle UTF-8 beyond U+FFFF. For example:

    printf "Nabla (U+2207): \U2207\nBold Nabla (U+1D6C1): \U1D6C1\n" \
        | luit -encoding UTF-8 -c

The output expected is:

        Nabla (U+2207): ∇
        Bold Nabla (U+1D6C1): 𝛁

The output actually produced by luit is:

        Nabla (U+2207): ∇
        Bold Nabla (U+1D6C1): 훁

Note that luit generates U+D6C1 (훁) instead of U+1D6C1 (𝛁).

I believe the bug is in iso2022.c:outbufUTF8() which looks like this:

    if (c <= 0x7F) {
        OUTBUF_MAKE_FREE(is, fd, 1);
        is->outbuf[is->outbuf_count++] = UChar(c);
    } else if (c <= 0x7FF) {
        OUTBUF_MAKE_FREE(is, fd, 2);
        is->outbuf[is->outbuf_count++] = UChar(0xC0 | ((c >> 6) & 0x1F));
        is->outbuf[is->outbuf_count++] = UChar(0x80 | (c & 0x3F));
    } else {
        OUTBUF_MAKE_FREE(is, fd, 3);
        is->outbuf[is->outbuf_count++] = UChar(0xE0 | ((c >> 12) & 0x0F));
        is->outbuf[is->outbuf_count++] = UChar(0x80 | ((c >> 6) & 0x3F));
        is->outbuf[is->outbuf_count++] = UChar(0x80 | (c & 0x3F));
    }

As you can see, the final else branch emits a three-byte sequence for
every code point above 0x7FF, so only 0x000000 to 0x00FFFF are encoded
correctly. A fourth byte is needed to cover the supplementary planes
up to 0x10FFFF (the limit of Unicode).
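
To make the missing branch concrete, here is a minimal standalone
sketch of an encoder with the four-byte case added. It is not luit's
code: the name utf8_encode and its buffer interface are hypothetical,
and it assumes the caller supplies at least four bytes of space.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of luit): encode code point c as UTF-8
 * into out, which must have room for 4 bytes.  Returns the number of
 * bytes written, or 0 if c is a surrogate or lies above U+10FFFF.
 */
size_t utf8_encode(unsigned long c, unsigned char *out)
{
    assert(out != NULL);
    if (c <= 0x7F) {
        out[0] = (unsigned char) c;
        return 1;
    } else if (c <= 0x7FF) {
        out[0] = (unsigned char) (0xC0 | ((c >> 6) & 0x1F));
        out[1] = (unsigned char) (0x80 | (c & 0x3F));
        return 2;
    } else if (c <= 0xFFFF) {
        if (c >= 0xD800 && c <= 0xDFFF)
            return 0;           /* surrogates are not valid in UTF-8 */
        out[0] = (unsigned char) (0xE0 | ((c >> 12) & 0x0F));
        out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (c & 0x3F));
        return 3;
    } else if (c <= 0x10FFFF) {
        /* Supplementary planes: lead byte 1111 0xxx plus three more */
        out[0] = (unsigned char) (0xF0 | ((c >> 18) & 0x07));
        out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (c & 0x3F));
        return 4;
    }
    return 0;                   /* beyond the Unicode range */
}
```

For U+1D6C1 this produces F0 9D 9B 81. The three-byte branch alone
would produce ED 9B 81, which is the encoding of U+D6C1 and matches
the wrong output shown above. The sketch stops at 0x10FFFF as a design
choice; the attached patch also emits the older five- and six-byte
forms.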

        *   *   *   


I created a patch for the above problem and, in testing it, found
another bug. Certain valid Unicode characters were not being read if
they were in the final plane (0x100000 to 0x10FFFF). I tracked it down
to other.c:stack_utf8():

        u = ((s->utf8.buf[0] & 0x03) << 18)
            | ((s->utf8.buf[1] & 0x3F) << 12)
            | ((s->utf8.buf[2] & 0x3F) << 6)
            | ((s->utf8.buf[3] & 0x3F));

The first byte of a four-byte UTF-8 sequence has the form 1111 0xxx
and carries three payload bits, but it gets ANDed with 0x03, which
keeps only the low two. Changing 0x03 to 0x07 fixes the problem.
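
The effect of the mask can be checked in isolation. The following is
only a sketch under my own assumptions: utf8_decode4 is a hypothetical
helper, not a function in luit, and it takes the lead-byte mask as a
parameter so that both variants can be compared.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of luit): combine the four bytes of a
 * UTF-8 sequence in buf into a code point, masking the lead byte with
 * lead_mask (0x03 in the original code, 0x07 in the fix).  No
 * validation of the continuation bytes is done here.
 */
unsigned long utf8_decode4(const unsigned char *buf, unsigned lead_mask)
{
    assert(buf != NULL);
    return ((unsigned long) (buf[0] & lead_mask) << 18)
        | ((unsigned long) (buf[1] & 0x3F) << 12)
        | ((unsigned long) (buf[2] & 0x3F) << 6)
        | (unsigned long) (buf[3] & 0x3F);
}
```

With the bytes F4 8F BF BF (U+10FFFF), a mask of 0x07 gives 0x10FFFF
while 0x03 gives 0x00FFFF. Sequences whose lead byte is F0 to F3
decode identically with either mask, which is why only the final
plane was affected.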

I have attached patches for both issues. 


-- System Information:
Debian Release: bookworm/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 6.0.0-6-amd64 (SMP w/8 CPU threads; PREEMPT)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages luit depends on:
ii  libc6  2.36-6

luit recommends no packages.

luit suggests no packages.

-- no debconf information
--- iso2022.c.orig      2018-06-27 15:46:34.000000000 -0700
+++ iso2022.c   2022-12-30 19:22:53.774355814 -0800
@@ -134,11 +134,35 @@
        OUTBUF_MAKE_FREE(is, fd, 2);
        is->outbuf[is->outbuf_count++] = UChar(0xC0 | ((c >> 6) & 0x1F));
        is->outbuf[is->outbuf_count++] = UChar(0x80 | (c & 0x3F));
-    } else {
+    } else if (c <= 0xFFFF) {
        OUTBUF_MAKE_FREE(is, fd, 3);
        is->outbuf[is->outbuf_count++] = UChar(0xE0 | ((c >> 12) & 0x0F));
        is->outbuf[is->outbuf_count++] = UChar(0x80 | ((c >> 6) & 0x3F));
        is->outbuf[is->outbuf_count++] = UChar(0x80 | (c & 0x3F));
+    } else if (c <= 0x1FFFFF) {
+       OUTBUF_MAKE_FREE(is, fd, 4);
+       is->outbuf[is->outbuf_count++] = UChar(0xF0 | ((c >> 18) & 0x07));
+       is->outbuf[is->outbuf_count++] = UChar(0x80 | ((c >> 12) & 0x3F));
+       is->outbuf[is->outbuf_count++] = UChar(0x80 | ((c >> 6) & 0x3F));
+       is->outbuf[is->outbuf_count++] = UChar(0x80 | (c & 0x3F));
+    } else if (c <= 0x03FFFFFF) {
+       OUTBUF_MAKE_FREE(is, fd, 5);
+       is->outbuf[is->outbuf_count++] = UChar(0xF8 | ((c >> 24) & 0x03));
+       is->outbuf[is->outbuf_count++] = UChar(0x80 | ((c >> 18) & 0x3f));
+       is->outbuf[is->outbuf_count++] = UChar(0x80 | ((c >> 12) & 0x3F));
+       is->outbuf[is->outbuf_count++] = UChar(0x80 | ((c >> 6) & 0x3F));
+       is->outbuf[is->outbuf_count++] = UChar(0x80 | (c & 0x3F));
+    } else if (c <= 0x7FFFFFFF) {
+       OUTBUF_MAKE_FREE(is, fd, 6);
+       is->outbuf[is->outbuf_count++] = UChar(0xFC | ((c >> 30) & 0x01));
+       is->outbuf[is->outbuf_count++] = UChar(0x80 | ((c >> 24) & 0x3f));
+       is->outbuf[is->outbuf_count++] = UChar(0x80 | ((c >> 18) & 0x3f));
+       is->outbuf[is->outbuf_count++] = UChar(0x80 | ((c >> 12) & 0x3F));
+       is->outbuf[is->outbuf_count++] = UChar(0x80 | ((c >> 6) & 0x3F));
+       is->outbuf[is->outbuf_count++] = UChar(0x80 | (c & 0x3F));
+    } else {
+      /* "21 bits ought to be enough for anybody!" -- The Unicode Consortium */
+      Warning("ignoring character beyond UTF-8's 31-bit range: %'X.\n", c);
     }
 }
 
--- other.c.orig        2013-02-02 13:50:30.000000000 -0800
+++ other.c     2022-12-30 20:11:22.391737595 -0800
@@ -122,26 +122,26 @@
        return (int) c;
     }
     if (s->utf8.buf_ptr == 0) {
-       if ((c & 0x40) == 0)
+       if ((c & 0x40) == 0)    /* Skip continuation bytes 10xx xxxx */
            return -1;
        s->utf8.buf[s->utf8.buf_ptr++] = UChar(c);
-       if ((c & 0x60) == 0x40)
+       if ((c & 0x60) == 0x40)                 /* Starts with 110x xxxx */
            s->utf8.len = 2;
-       else if ((c & 0x70) == 0x60)
+       else if ((c & 0x70) == 0x60)            /* Starts with 1110 xxxx */
            s->utf8.len = 3;
-       else if ((c & 0x78) == 0x70)
+       else if ((c & 0x78) == 0x70)            /* Starts with 1111 0xxx */
            s->utf8.len = 4;
        else
            s->utf8.buf_ptr = 0;
        return -1;
     }
-    if ((c & 0x40) != 0) {
+    if ((c & 0x40) != 0) {     /* Resync if not a continuation 10xx xxxx */
        s->utf8.buf_ptr = 0;
        return -1;
     }
     s->utf8.buf[s->utf8.buf_ptr++] = UChar(c);
     if (s->utf8.buf_ptr < s->utf8.len)
-       return -1;
+       return -1;              /* Get the next continuation byte */
     switch (s->utf8.len) {
     case 2:
        u = ((s->utf8.buf[0] & 0x1F) << 6) | (s->utf8.buf[1] & 0x3F);
@@ -160,7 +160,7 @@
        else
            return u;
     case 4:
-       u = ((s->utf8.buf[0] & 0x03) << 18)
+       u = ((s->utf8.buf[0] & 0x07) << 18)
            | ((s->utf8.buf[1] & 0x3F) << 12)
            | ((s->utf8.buf[2] & 0x3F) << 6)
            | ((s->utf8.buf[3] & 0x3F));

--- End Message ---
--- Begin Message ---
Source: luit
Source-Version: 2.0.20240910-1
Done: Thomas E. Dickey <[email protected]>

We believe that the bug you reported is fixed in the latest version of
luit, which is due to be installed in the Debian FTP archive.

A summary of the changes between this version and the previous one is
attached.

Thank you for reporting the bug, which will now be closed.  If you
have further comments please address them to [email protected],
and the maintainer will reopen the bug report if appropriate.

Debian distribution maintenance software
pp.
Thomas E. Dickey <[email protected]> (supplier of updated luit package)

(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing [email protected])


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Format: 1.8
Date: Sat, 14 Sep 2024 07:21:24 -0400
Source: luit
Architecture: source
Version: 2.0.20240910-1
Distribution: unstable
Urgency: low
Maintainer: Thomas E. Dickey <[email protected]>
Changed-By: Thomas E. Dickey <[email protected]>
Closes: 1027414
Changes:
 luit (2.0.20240910-1) unstable; urgency=low
 .
   * New upstream release
     - Luit now handles Unicode beyond BMP (Closes: #1027414)
   * Update years in debian/copyright.
   * Update build dependencies and standards version
   * Use current SPDX license name in debian/copyright; DEP 5 is inaccurate (see
     https://invisible-island.net/ncurses/ncurses-license.html#issues_expat)
Checksums-Sha1:
 ec00981fef9895978483c769124c84b1994d02c2 1907 luit_2.0.20240910-1.dsc
 dcba0765575a443b119d1282ab65a018900ed857 212641 luit_2.0.20240910.orig.tar.gz
 7bd4671872a2802efa5300a2383b553fd87bbc07 5872 luit_2.0.20240910-1.debian.tar.xz
 d667c25f059596f719fe8fbc55f5ba4d608ddbf5 6594 luit_2.0.20240910-1_i386.buildinfo
Checksums-Sha256:
 9ce1ad4acb23906d6017641b55373ab40da03af011e7acc307e096f53904128f 1907 luit_2.0.20240910-1.dsc
 a15d7fcbfc25ae1453d61aec23ff6ba04145d6e7b7b3b0071eb5cfda3a3a49d5 212641 luit_2.0.20240910.orig.tar.gz
 936ca8ce5aff754b7fd7d8ec2c69a3468b2d41a095eb1043aa77817c5f7fbe77 5872 luit_2.0.20240910-1.debian.tar.xz
 29355fd9e76b8595ef66e18aef52b62066e89a9ba54d34a9b36e7ec8e1a5fa75 6594 luit_2.0.20240910-1_i386.buildinfo
Files:
 6f3cf0321646c0b9b8e9fc903b275445 1907 utils optional luit_2.0.20240910-1.dsc
 c9db8c12a3ad697a075179f07b099eaf 212641 utils optional luit_2.0.20240910.orig.tar.gz
 987583e0dda873b084c64d09a1b303ea 5872 utils optional luit_2.0.20240910-1.debian.tar.xz
 482dcfd6a0f319db699ec5ab10c8c5ce 6594 utils optional luit_2.0.20240910-1_i386.buildinfo
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEEUtWxWT1/2RRqWmMHHHxB7evdu2AFAmcbE0UACgkQHHxB7evd
u2B+qA//ZxKjBO/DWQGjswGUHsy7Pbgp0ruK8A8cPdKMBDGYZFP0aX/p0H5rzJNi
xRedgpaO51pKg+SA+la4LcrQkQqCm9fOy0kBhVAaM1S8cINM4vFDg4tgon4l8vF4
2vVvJavi6Nj39xdUHLZCd7uyq14YtKRBC+rC7WsyJp66x8nn1ozPmIFAKJRVkYex
xABI0UCaLxmUFVTlaOjXmibnfpMhoPeS+i5rSvSVf3me7WLLsgMKhRqAV6JIxn6D
GL0wWVfOh1oz1SlTFhbeNLkK2aipCj8EijxG+vKmbCQYc4D5VxyYYbhyDzYpS7XP
BsDo8rhkaiRaA8xsHCFtlEd05vQKHD9xQ0Ll8JZL84OdShMY9e8OJXGLW6ENilZ2
XprASgSyCczpFuouwx48dls6maoE/UZQCtpluMCpN+hW6M1bpLUnd/tjXgqH8E7+
kUAVNWNH2DRqT1dVDV3crT8aKAApW6xCT7iDyStdQPtqjFKrjLpgF2CTcuxZfGOy
5i2dCg9uYdL62iwwkBDcYcysa2qYU86Et4JNhZKOVY7K8Te1JJLqO+SCCU0VBaUu
j9B2zTX2XwjtWMtTh4VXf/DKYREsTZPWa2Iyonsy2oTJkxVNIPJR1aGk8x4YihkA
mYN0ymtK8w69GF4RGnZUqSKoTx/AsO8CKX80bfTCxjPNKHvZjJc=
=gv9M
-----END PGP SIGNATURE-----



--- End Message ---
