retitle 514963 manconv fails to distinguish between "text not in input
encoding" and "characters not representable in output encoding"
found 514963 2.5.3-3
user [email protected]
usertags 514963 target-2.5.5
thanks
On Thu, Feb 12, 2009 at 12:03:01PM +0100, Michal Čihař wrote:
> I noticed this issue when some translated man pages from the gammu
> package (currently in experimental) did not display properly. They are
> all properly encoded in utf-8 and man has no problem showing them
> locally. But once they get installed into /usr/share/man/cs/, some
> iso-8859-2 detection sometimes fails and manconv starts to think that
> some of the pages are in iso-8859-2 instead of utf-8.
>
> From the debug logs, I found out that /usr/lib/man-db/manconv -f
> utf-8:iso-8859-2 -t ISO-8859-2//IGNORE is called and, on some of the
> pages, it thinks the man page is in iso-8859-2 instead of utf-8. If the
> man page is not in /usr/share/man/cs/, iso-8859-2 is absent from the
> source charsets and the man page is shown correctly.
>
> I'm attaching an example of such a man page.
I tried to warn about this problem in the policy manual:

  Due to limitations in current implementations, all characters in the
  manual page source should be representable in the usual legacy
  encoding for that language, even if the file is actually encoded in
  UTF-8.

... and you can see the problem like this:

  $ iconv -f UTF-8 -t ISO-8859-2 gammu-smsd-mysql.7 >/dev/null
  iconv: illegal input sequence at position 325
In other words, what's happening here is that the middle dot (U+00B7) at
position 325 isn't representable in ISO-8859-2. Unfortunately, manconv
isn't currently smart enough to distinguish between "conversion failed
because this isn't valid UTF-8" and "conversion failed because this bit
of UTF-8 isn't available in the target encoding", and therefore it falls
back to recoding from ISO-8859-2 to ISO-8859-2 (i.e. a no-op) and then
you see the mess when it tries to interpret UTF-8 as if it were
ISO-8859-2.
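To make the failure mode concrete, here is a rough Python sketch of the
fallback logic described above. It is only an illustration, not manconv's
actual code (manconv is written in C on top of iconv); the function name
naive_convert is hypothetical, and Python's codecs stand in for iconv.

```python
def naive_convert(data, candidates, target):
    """Try each candidate input encoding in turn.

    Any failure, whether in decoding or in encoding, is treated as
    "the input is not in this encoding", which is the conflation
    described above.
    """
    for source in candidates:
        try:
            # Decoding and encoding are done in one step, so the caller
            # cannot tell which half of the conversion failed.
            return data.decode(source).encode(target)
        except UnicodeError:
            continue
    raise ValueError("no candidate encoding worked")


# A UTF-8 page containing a middle dot (U+00B7), which ISO-8859-2 lacks.
page = "a\u00b7b \u010d".encode("utf-8")
result = naive_convert(page, ["utf-8", "iso-8859-2"], "iso-8859-2")
```

In this sketch the UTF-8 attempt fails on the middle dot, the ISO-8859-2
attempt then "succeeds" as a no-op, and result is the original UTF-8
bytes unchanged, ready to be misread as ISO-8859-2.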
I think it might be possible to fix this, albeit more slowly, by
recoding the page to UCS-4, which should always succeed as long as the
text matches the input encoding being tried, and then recoding from
there to ISO-8859-2 and just throwing away characters that don't fit.
Alternatively, by the time we've done that we might have a groff that
supports UTF-8 input!
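The two-stage idea could look roughly like the following Python sketch.
This is an assumption-laden illustration rather than a patch: Python's
str type stands in for the UCS-4 intermediate form, errors="ignore"
stands in for iconv's //IGNORE, and two_stage_convert is a hypothetical
name.

```python
def two_stage_convert(data, candidates, target):
    """Convert via an intermediate Unicode form.

    Stage 1 (decode) fails only if the bytes are not valid in the
    candidate input encoding, so that is the only failure that moves
    on to the next candidate. Stage 2 (encode) silently drops
    characters that the target encoding cannot represent.
    """
    for source in candidates:
        try:
            text = data.decode(source)
        except UnicodeDecodeError:
            continue  # genuinely not in this input encoding
        return text.encode(target, errors="ignore")
    raise ValueError("input matches none of the candidate encodings")


# The middle dot is dropped; the rest of the page converts normally.
page = "Gammu \u00b7 p\u0159\u00edklad".encode("utf-8")
converted = two_stage_convert(page, ["utf-8", "iso-8859-2"], "iso-8859-2")
```

The cost is the extra pass through the intermediate form, which is the
"more slowly" part; the benefit is that a page which really is
ISO-8859-2 still falls through to the second candidate, since it will
normally not decode as UTF-8.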
In the meantime, you can work around this problem by ensuring that your
manual page passes:

  iconv -f UTF-8 -t ISO-8859-2 gammu-smsd-mysql.7 >/dev/null
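If you would rather see every offending character at once than fix them
one iconv error at a time, something like this Python sketch may help.
The function name find_unrepresentable is hypothetical, and the offsets
it reports are byte offsets into the UTF-8 input as computed here; I have
not checked them against iconv's own position numbers.

```python
def find_unrepresentable(data, target="iso-8859-2"):
    """Return (byte_offset, character) pairs for every character in a
    UTF-8 byte string that the target encoding cannot represent."""
    text = data.decode("utf-8")
    problems = []
    offset = 0
    for ch in text:
        try:
            ch.encode(target)
        except UnicodeEncodeError:
            problems.append((offset, ch))
        # Advance by the number of bytes this character uses in UTF-8.
        offset += len(ch.encode("utf-8"))
    return problems


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as page:
        for offset, ch in find_unrepresentable(page.read()):
            print(f"byte {offset}: U+{ord(ch):04X} {ch!r}")
```

Run it with the manual page as its only argument; an empty result means
the page should convert cleanly to the legacy encoding.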
Thanks,
--
Colin Watson [[email protected]]