Control: clone -1 -2
Control: reassign -2 wml 2.12.2~ds1-2
Control: retitle -2 wml: Regression in "htmlstrip -O2" (default) with Chinese language
Hi,
Boyuan Yang wrote:
> Thanks for raising this issue.
Thanks from me, too. I wasn't aware of such a regression, sorry.
> These build errors might have multiple causes,
> but I stripped the issue down to a (possible) regression of wml. Let's fix
> this issue first before talking about others.
>
> =======================================
> $ wml --version
> This is WML Version 2.12.2
> Copyright (c) 1996-2001 Ralf S. Engelschall.
> Copyright (c) 1999-2001 Denis Barbier.
>
> This program is distributed in the hope that it will be useful,
> but WITHOUT ANY WARRANTY; without even the implied warranty of
> MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> GNU General Public License for more details.
> $ cat /etc/issue
> Debian GNU/Linux bullseye/sid \n \l
>
> $ cat a.wml
> <p>
> 包
> </p>
> $ hexdump -C a.wml
> 00000000 3c 70 3e 0a e5 8c 85 0a 3c 2f 70 3e 0a |<p>.....</p>.|
> 0000000d
> $ wml a.wml > test.txt
> $ cat test.txt
> <p>
> �
> </p>
> $ hexdump -C test.txt
> 00000000 3c 70 3e 0a e5 8c 0a 3c 2f 70 3e 0a |<p>....</p>.|
> 0000000c
> $
[…]
> I am using Debian Unstable but similar things also happen in Buster.
Can confirm that this is a regression between Stretch and Buster. :-(
> The single character in the a.wml above is U+5305 [1], namely "CJK Unified
> Ideograph-5305", a commonly-used Chinese character. Its UTF-8 encoding is
> "0xE5 0x8C 0x85". However after wml transformation, only "0xE5 0x8C" was kept
> and the "0x85" was dropped. That's surely a regression.
Ack. I figured out that pass 8 of WML's 9 passes is responsible:
→ cat a.wml | wml -p1-8
<p>
�
</p>
→ cat a.wml | wml -p1-7
<p>
包
</p>
→ cat a.wml | wml -p1-7,9
<p>
包
</p>
→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip
�
→
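The manual bisection above could also be scripted. The following is only a rough sketch: it assumes that wml and iconv are installed, that a damaged page can be recognised by no longer being valid UTF-8, and that a single pass number such as "-p1" is accepted in the same way as the ranges used above. The function names are made up for illustration.

```shell
# Hypothetical helper: succeed only if stdin decodes cleanly as UTF-8.
# iconv exits non-zero on an invalid or truncated byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Print the number N of the first pass range 1..N whose output is no
# longer valid UTF-8. Print nothing and return 1 if all nine are fine.
first_bad_pass() {
    input=$1
    n=1
    while [ "$n" -le 9 ]; do
        # Assumption: "-p1" selects pass 1 alone, "-p1-N" passes 1 to N.
        if [ "$n" -eq 1 ]; then spec=1; else spec="1-$n"; fi
        if ! wml -p"$spec" < "$input" | is_valid_utf8; then
            echo "$n"
            return 0
        fi
        n=$((n + 1))
    done
    return 1
}

# Usage: first_bad_pass a.wml
```

This only catches damage that yields invalid UTF-8, which is the case for the truncated character here; other kinds of breakage would need a different check.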
Pass 8 is htmlstrip, something similar to uglifyjs, but for HTML.
Since that pass only serves delivery performance and disk space,
it can likely be left out easily.
So I see multiple ways to more or less quickly fix this issue in the
Debian web:
* Always call wml with "-p1-7,9".
* Call wml with "-p1-7,9" if any of the affected languages is built.
* Add <nostrip>…</nostrip> containers in the header and footer
templates for the affected languages.
To be more precise, it's the optimisation level 2 of htmlstrip:
→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 0
包
→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 1
包
→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 2
�
→
The man page says:
Level 2:
Good stripping: Same as level 1 plus compression of
multiple whitespaces (more then one in sequence) to single
whitespaces [txt,tag] and stripping of trailing whitespaces
at the of of a line [txt,tag,pre].
This level is the default because while providing good
optimization the HTML markup is not destroyed and remains
human readable.
So instead of skipping htmlstrip completely, "-O1" could be passed to
wml everywhere I suggested passing "-p1-7,9", as wml hands this
option on to htmlstrip:
→ cat a.wml | wml -O1
<p>
包
</p>
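If the "-O1" workaround should only apply to the affected languages, the choice could be made by a small helper along these lines. This is a sketch under assumptions: the function and variable names are invented, and "chinese" is merely a placeholder for the list of affected languages, which is not spelled out in this mail.

```shell
# Space-separated list of affected languages.
# "chinese" is only a placeholder; the real list still has to be determined.
AFFECTED_LANGS=${AFFECTED_LANGS:-chinese}

# Hypothetical helper: print the extra wml option needed for the given
# language, or nothing if the language is not in the list.
wml_extra_opts() {
    for lang in $AFFECTED_LANGS; do
        if [ "$lang" = "$1" ]; then
            echo "-O1"
            return 0
        fi
    done
}

# Usage (file names are examples only):
#   wml $(wml_extra_opts chinese) index.wml > index.html
```

Keeping the list in one variable makes it easy to drop the workaround again once htmlstrip itself is fixed.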
> I cc-ed the wml maintainer in Debian. Axel, is there any possibility to solve
> this regression in both Sid/Testing and Stable?
I think the above is a good first workaround on buster. With this
mail, I clone the bug report and will try to figure out what change in
htmlstrip caused the regression and/or how it can be fixed.
However, I currently have issues building more recent upstream versions
of WML, which is the reason why wml in Unstable hasn't seen an update
yet. A more recent version is in git, but IIRC there was another
release or two recently which I haven't looked at yet.
Regards, Axel
--
,''`. | Axel Beckert <[email protected]>, https://people.debian.org/~abe/
: :' : | Debian Developer, ftp.ch.debian.org Admin
`. `' | 4096R: 2517 B724 C5F6 CA99 5329 6E61 2FF9 CD59 6126 16B5
`- | 1024D: F067 EA27 26B9 C3FC 1486 202E C09E 1D89 9593 0EDE

