Re: tr is handling bytes not characters

Nick Demou Tue, 10 Feb 2009 08:06:11 -0800

On Tue, Feb 10, 2009 at 12:59 PM, Jim Meyering <[email protected]> wrote:
> Nick Demou <[email protected]> wrote:
>> [...]
>> Thanks for the info Eric. I was almost sure this would be the case. In
>> fact I don't consider this as the main topic of my bug report. The
>> main topic for me is the documentation. The man and info page don't
>> make it clear that utf-8 is not supported. I believe that others after
>> me will spend a lot of time just to realize that "it's just a missing
>> feature".  Do you have any thoughts regarding my suggestions on the
>> documentation?
>
> The "real" documentation is in coreutils.texi (generated to
> coreutils.info and available via "info coreutils").  There,
> under "tr invocation", it already has this caveat:


Oops, mea culpa.
I did read the man page carefully, and then I searched the coreutils
info pages before submitting this bug report. However, I only searched
for "utf" and "unicode", so I missed the warning, which contains neither
of those two strings.

> and since "man tr" does point to the authoritative source [the info pages]:
> [...]
> that may be enough.

I think it is enough for English-speaking users, but not for
non-English-speaking ones, who often have to deal with actual[1] UTF-8
text. I would suggest the following small changes:

A. for the info page
====================

add a direct reference to UTF-8 and Unicode like this:

from:
#   Currently `tr' fully supports only single-byte characters.
# Eventually it will support multibyte characters;

to:
#   Currently `tr' fully supports only single-byte characters.
# Eventually it will support multibyte characters (e.g. UTF-8
# or UTF-16 encoded Unicode characters);

B. for the man page
===================

add a reference like this:

#  Currently `tr' fully supports only single-byte characters.
# (Notable examples of multibyte characters that are not
# supported are UTF-8 and UTF-16 encoded Unicode characters.)

C. for the coreutils FAQ
========================

add a Question like this one:

# Q: What's the status of Unicode support?

(for which I cannot suggest a thorough answer, although I could try to
dig something out of the current documentation if no one else is able
to help at the moment)

or

# Q: I get funny/no/wrong results when dealing with
#    UTF-8/Unicode input

# A: The UTF-8 and UTF-16 encodings of Unicode text are made up
#    of multibyte characters, which are not well supported
#    by some coreutils programs.
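
To illustrate the kind of "funny" result such a FAQ entry would be
about, here is a rough Python sketch. It is not taken from coreutils:
`bytewise_tr` and `charwise_tr` are hypothetical helpers that only
imitate a byte-oriented translation (with the second set padded by
repeating its last byte) and a character-aware one, assuming UTF-8
encoded input.

```python
def bytewise_tr(text, set1, set2, encoding="utf-8"):
    """Hypothetical helper: translate byte by byte, the way a
    byte-oriented tool would, instead of character by character."""
    data = text.encode(encoding)
    src = set1.encode(encoding)
    dst = set2.encode(encoding)
    # Pad the second set with its last byte so both sets have equal length.
    if len(dst) < len(src):
        dst += dst[-1:] * (len(src) - len(dst))
    table = bytes.maketrans(src, dst[:len(src)])
    return data.translate(table)


def charwise_tr(text, set1, set2):
    """Character-aware counterpart, using Python's own str.translate."""
    return text.translate(str.maketrans(set1, set2))


word = "caf\u00e9"      # "cafe" with an acute accent on the final letter
accented = "\u00e9"     # that accented letter: one character, two UTF-8 bytes

print(len(accented.encode("utf-8")))      # 2
print(bytewise_tr(word, accented, "e"))   # b'cafee'
print(charwise_tr(word, accented, "e"))   # cafe
```

The byte-wise version treats the two bytes of the accented letter as two
separate "characters" and maps each of them to "e", so the word grows by
one letter. A user who expects character semantics will read that as a
bug unless the documentation mentions UTF-8 explicitly.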


___________________
[1] i.e. UTF-8 text containing characters outside the ASCII range

--
"The software is licensed, not sold" -- MICROSOFT LICENSE TERMS


_______________________________________________
Bug-coreutils mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/bug-coreutils
