On 2015-09-02 16:50:21 +0200, Olaf Hering wrote:
> Is there no better way to translate manual.xml into manual.txt besides
> using a web browser to dump manual.html? Perhaps there is none.

There would be XSLT.

> The current state for me is oddly looking output. Example:
>
> ...
> 7. Forwarding and Bouncing Mail
> ...
> function bound to ?b? and ?f? respectively.
> ...
>
> xxd says:
> 00017db0: 3c62 6f75 6e63 653e 2066 756e 6374 696f  <bounce> functio
> 00017dc0: 6e20 616e 6420 666f 7277 6172 6469 6e67  n and forwarding
> 00017dd0: 2075 7369 6e67 2074 6865 203c 666f 7277   using the <forw
> 00017de0: 6172 643e 0a66 756e 6374 696f 6e20 626f  ard>.function bo
> 00017df0: 756e 6420 746f 20e2 809c 62e2 809d 2061  und to ...b... a
> 00017e00: 6e64 20e2 809c 66e2 809d 2072 6573 7065  nd ...f... respe
> 00017e10: 6374 6976 656c 792e 0a0a 466f 7277 6172  ctively...Forwar
> 00017e20: 6469 6e67 2063 616e 2062 6520 646f 6e65  ding can be done
>
> Is 0xe2 0x80 0x9c valid UTF-8 for '“'?

Yes:

zira:~> unicode “
U+201C LEFT DOUBLE QUOTATION MARK
UTF-8: e2 80 9c UTF-16BE: 201c Decimal: &#8220;
“
Category: Pi (Punctuation, Initial quote)
Bidi: ON (Other Neutrals)

> Apparently it is, because that's what Firefox gives with copy&paste,
> and it's looking fine in this vim session. So that means that less(1)
> and even vim(1) is unable to cope with manual.txt. Is there perhaps
> a mix of encodings in manual.txt that confuses the pager?! Does it
> fail just for me?

No problems with "less" on my machine on text with this character.
But it is not present in the Mutt manual: I get the ASCII double-quotes
'"'. The reason may be that I compile Mutt with LC_ALL=C, which is the
portable locale. If the result of the build depends on the locales,
I would see this as a bug.

> After some debugging it turned out that mutt has a bug:
> LC_ALL=C w3m -dump doc/manual.html > bad.txt
> LC_ALL=C.UTF-8 w3m -dump doc/manual.html > good.txt
>
> I suggest to force UTF-8 instead of plain ASCII.
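
The byte sequence from the xxd dump above can also be checked without
relying on the terminal or the pager. What follows is only a minimal
sketch: the helper name is_valid_utf8 is made up for this example and
is not part of Mutt's build, and it assumes an iconv(1) that exits
with a non-zero status on an invalid input sequence.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mutt): succeed only if the data read
# from stdin is well-formed UTF-8. Converting UTF-8 to UTF-8 forces
# iconv to decode every byte; it is assumed to fail on invalid input.
is_valid_utf8() {
    LC_ALL=C iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Octal escapes are used because \x escapes are not portable in printf(1).
# \342\200\234 is e2 80 9c (U+201C), \342\200\235 is e2 80 9d (U+201D).
if printf 'bound to \342\200\234b\342\200\235\n' | is_valid_utf8; then
    echo "valid UTF-8"
else
    echo "invalid UTF-8"
fi
```

Running the check under LC_ALL=C is deliberate: the encodings are named
explicitly on the iconv command line, so the result should not depend
on the user's locale.
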

As one shouldn't change the locales except by setting LC_ALL=C
(C.UTF-8 is unfortunately not standard and broken when used with
glibc[*]), this would mean using a tool that can transform XML
to text in a way that does not depend on the locales (e.g. something
based on XSLT). Or stick with ASCII (but do not use w3m, which
cannot transcode non-ASCII characters).

[*] https://sourceware.org/bugzilla/show_bug.cgi?id=16621

-- 
Vincent Lefèvre <[email protected]> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
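
As a rough illustration of the "stick with ASCII" option, a byte-level
filter run under LC_ALL=C could map the typographic quotes back to '"'.
This is only a sketch under assumptions: the function name ascii_quotes
is invented here, nothing like it exists in the Mutt build as far as
this thread shows, and it handles only U+201C and U+201D, while the
manual may contain other non-ASCII characters.

```shell
#!/bin/sh
# Hypothetical filter (not part of Mutt): replace the UTF-8 encodings of
# U+201C and U+201D with the ASCII double quote. In the C locale, sed
# matches these as plain byte strings, so the user's locale is irrelevant.
lq=$(printf '\342\200\234')   # e2 80 9c, LEFT DOUBLE QUOTATION MARK
rq=$(printf '\342\200\235')   # e2 80 9d, RIGHT DOUBLE QUOTATION MARK

ascii_quotes() {
    LC_ALL=C sed -e "s/$lq/\"/g" -e "s/$rq/\"/g"
}

# Sample input modelled on the line from the xxd dump.
printf 'bound to \342\200\234b\342\200\235 and \342\200\234f\342\200\235\n' | ascii_quotes
```

Such a filter would have to be applied to UTF-8 text produced by some
locale-independent tool; it does not by itself remove the dependency on
how w3m is invoked.
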
