On 2015-09-02 16:50:21 +0200, Olaf Hering wrote:
> Is there no better way to translate manual.xml into manual.txt besides
> using a web browser to dump manual.html? Perhaps there is none.

There would be XSLT.

> The current state for me is odd-looking output. Example:
> 
> ...
> 7. Forwarding and Bouncing Mail
> ...
> function bound to ?b? and ?f? respectively.
> ...
> 
> xxd says:
> 00017db0: 3c62 6f75 6e63 653e 2066 756e 6374 696f  <bounce> functio
> 00017dc0: 6e20 616e 6420 666f 7277 6172 6469 6e67  n and forwarding
> 00017dd0: 2075 7369 6e67 2074 6865 203c 666f 7277   using the <forw
> 00017de0: 6172 643e 0a66 756e 6374 696f 6e20 626f  ard>.function bo
> 00017df0: 756e 6420 746f 20e2 809c 62e2 809d 2061  und to ...b... a
> 00017e00: 6e64 20e2 809c 66e2 809d 2072 6573 7065  nd ...f... respe
> 00017e10: 6374 6976 656c 792e 0a0a 466f 7277 6172  ctively...Forwar
> 00017e20: 6469 6e67 2063 616e 2062 6520 646f 6e65  ding can be done
> 
> Is 0xe2 0x80 0x9c valid UTF-8 for '“'?

Yes:

zira:~> unicode “
U+201C LEFT DOUBLE QUOTATION MARK
UTF-8: e2 80 9c  UTF-16BE: 201c  Decimal: &#8220;
“
Category: Pi (Punctuation, Initial quote)
Bidi: ON (Other Neutrals)
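
The same check can be done without the "unicode" tool. A minimal sketch
in Python, using only the standard library (the variable names are just
for illustration): a strict UTF-8 decode of the three bytes from the xxd
dump either fails or yields the code point.

```python
import unicodedata

# The three bytes seen in the xxd dump before "b" and "f".
raw = bytes.fromhex("e2809c")

# A strict decode raises UnicodeDecodeError if the sequence is not valid UTF-8.
ch = raw.decode("utf-8")

codepoint = f"U+{ord(ch):04X}"
name = unicodedata.name(ch)

print(codepoint, name)
```

So the bytes themselves are fine; the question is only which encoding
the pager or editor assumes when it reads them.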

> Apparently it is, because that's what Firefox gives with copy&paste,
> and it's looking fine in this vim session. So that means that less(1)
> and even vim(1) are unable to cope with manual.txt. Is there perhaps
> a mix of encodings in manual.txt that confuses the pager?! Does it
> fail just for me?

No problems with "less" on my machine on text with this character. But
it is not present in the Mutt manual: I get the ASCII double-quotes
'"'. The reason may be that I compile Mutt with LC_ALL=C, which is the
portable locale. If the result of the build depends on the locales, I
would see this as a bug.

> After some debugging it turned out that mutt has a bug:
> LC_ALL=C w3m -dump doc/manual.html > bad.txt
> LC_ALL=C.UTF-8 w3m -dump doc/manual.html > good.txt
> 
> I suggest forcing UTF-8 instead of plain ASCII.

Since the build shouldn't change the locale except by setting LC_ALL=C
(C.UTF-8 is unfortunately not standard and broken when used with
glibc[*]), forcing UTF-8 would mean using a tool that can transform
XML to text in a way that does not depend on the locale (e.g.
something based on XSLT). Or stick with ASCII (but do not use w3m,
which cannot transcode non-ASCII characters).

[*] https://sourceware.org/bugzilla/show_bug.cgi?id=16621
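
To illustrate what "does not depend on the locale" means here, a rough
sketch in Python. This is not XSLT and not what the Mutt build does; the
function name xml_to_text, the ASCII_MAP table and the sample string are
made up for the example. The point is only that the output encoding is
chosen explicitly in the code, so LC_ALL has no influence on the bytes
produced.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from typographic quotes to their ASCII counterparts.
ASCII_MAP = str.maketrans({
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
})


def xml_to_text(xml_source: str, ascii_only: bool = False) -> bytes:
    """Return the text content of an XML fragment as bytes.

    The encoding is fixed here (UTF-8 or ASCII) and never taken
    from the process locale.
    """
    root = ET.fromstring(xml_source)
    text = "".join(root.itertext())
    if ascii_only:
        # Transliterate known quotes; anything else non-ASCII becomes "?".
        return text.translate(ASCII_MAP).encode("ascii", errors="replace")
    return text.encode("utf-8")


# Made-up sample resembling the sentence from the manual.
sample = "<para>bound to \u201cb\u201d and \u201cf\u201d respectively.</para>"

utf8_out = xml_to_text(sample)
ascii_out = xml_to_text(sample, ascii_only=True)
print(ascii_out.decode("ascii"))
```

Either variant gives the same result under LC_ALL=C and under a UTF-8
locale, which is the property the build needs.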

-- 
Vincent Lefèvre <[email protected]> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
