On Tue, Sep 8, 2009 at 6:44 PM, Hilko Bengen<[email protected]> wrote:
> * Martin-Éric Racine:
>
>> On Tue, Sep 8, 2009 at 12:51 PM, Hilko Bengen<[email protected]> wrote:
>>> As far as I know, using non-ASCII characters in the GECOS field of
>>> /etc/passwd is not specified at all. So far, I haven't found anything in
>>> Debian's main policy file, passwd(5), the adduser(8) and useradd(8)
>>> manpages, nor the documentation of base-passwd. (If you have found more
>>> than I, let me know.)
>>
>> While it is not specified, it has become a de-facto standard in Debian
>> and its derivatives to use UTF-8 for everything, including the real
>> name that appears in the GECOS field of /etc/passwd.
>
> I had also thought about UTF-8 becoming the standard encoding in many
> places in Debian, be it de jure or de facto. But I am not going to
> assume that this extends to /etc/passwd.
>
> And how should non-ASCII characters in other kinds of user databases be
> treated, such as NIS or LDAP?

That's of course slightly more complicated. As far as /etc/passwd is
concerned, though, testing its content with 'file' would be a rather
easy way to determine whether it is UTF-8 or something else. In any
case, the current approach in the global config of trying iso-8859-1
first and then utf-8 is broken, because it only works for non-euro
Western languages. The only correct assumption to make is utf-8; if
that fails, then parse 'env' for whatever deprecated locale the user
currently has.
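
A minimal sketch of that fallback order, in Python rather than mailx's
C. The helper name decode_gecos and its fallback parameter are
hypothetical, and using the locale charset as the second guess is an
assumption, not something mailx does today. The order matters because
every byte string decodes as iso-8859-1 without error, so trying it
first can never fail over to utf-8, whereas non-ASCII ISO-8859 text is
rarely valid UTF-8.

```python
import locale


def decode_gecos(raw, fallback=None):
    """Decode raw GECOS bytes: strict UTF-8 first, then a fallback charset."""
    try:
        # Strict decoding raises on byte sequences that are not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Assumption: the user's locale charset is the next best guess.
    charset = fallback or locale.getpreferredencoding(False)
    # errors="replace" keeps a wrong guess from raising a second time.
    return raw.decode(charset, errors="replace")


if __name__ == "__main__":
    legacy = "Martin-Éric Racine".encode("iso-8859-1")
    print(decode_gecos(legacy, fallback="iso-8859-1"))
```

This only guesses; it cannot prove which charset the administrator's
terminal used when the account was created.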

>>> From an application's standpoint, I'd tend to assume the GECOS field
>>> either to be a comma-separated string of ASCII characters or a
>>> comma-separated string of byte values.
>> We cannot assume that anymore now that Debian uses UTF-8 for everything.
>
> If you can point me to a text passage in the policy (or any relevant
> discussion on the mailing lists), I will be happy to reconsider my
> opinion.

Every default installation of Debian or Ubuntu writes GECOS content in
UTF-8 based on the full name that is given when creating the account,
because a Debian or Ubuntu system nowadays defaults to UTF-8 locales
and uses that to produce the GECOS info.
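
For reference, the field under discussion can be pulled out of a passwd
line as raw bytes without assuming any encoding. The sketch below is
Python, not mailx code; the function name gecos_full_name and the sample
entry (login "mer", uid 1000) are made up for illustration. It relies
only on the seven colon-separated fields described in passwd(5) and on
the convention that the real name is the first comma-separated GECOS
subfield.

```python
def gecos_full_name(passwd_line):
    """Return the first GECOS subfield (the real name) as raw bytes."""
    fields = passwd_line.rstrip(b"\n").split(b":")
    if len(fields) != 7:
        raise ValueError("expected 7 colon-separated fields")
    gecos = fields[4]  # name:password:UID:GID:GECOS:directory:shell
    return gecos.split(b",")[0]


# Hypothetical entry, as it would be written on a system with a UTF-8 locale.
sample = "mer:x:1000:1000:Martin-Éric Racine,,,:/home/mer:/bin/bash\n".encode("utf-8")

if __name__ == "__main__":
    print(gecos_full_name(sample))
```

On such a system the É ends up as the two bytes 0xC3 0x89; written from
an iso-8859-1 terminal, the same name would contain the single byte 0xC9.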

>>> Basing mailx' interpretation of the GECOS field on the sendcharset
>>> variable, as you suggested is probably not a good idea.
>> Why not?
>
> (it's sendcharsets, sorry for the typo)
>
> sendcharsets is about the target charset.

OK, what should we parse then to make a correct guess?

>>>> Message-Id: <[email protected]>
>>>> From: Martin-?ric Racine <[email protected]>
>>> As a workaround, please try setting your real name to a pre-encoded
>>> string in the .mailrc.
>> Do you really expect all users on a given system to start doing that,
>> just because their name includes non-ascii characters?
>
> Not at all. I just thought that this workaround might be helpful for you
> until the larger issues get sorted out. Feel free to ignore my
> suggestion. :-)

Well, I'd indeed hope we can sort this out. Besides, I have a dozen
different environment variables that already set my real name.
Can't we parse any of those? Oh, but that would probably fail too,
because they don't state the encoding either, right?
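
For anyone who does want the workaround in the meantime, a "pre-encoded
string" here presumably means an RFC 2047 encoded-word. A small sketch
that produces one with Python's standard email.header module follows;
the helper name rfc2047_name is hypothetical, and where exactly the
result goes in .mailrc is not shown because that depends on the mailx
variant in use.

```python
from email.header import Header, decode_header


def rfc2047_name(real_name, charset="utf-8"):
    """Return real_name as an ASCII-only RFC 2047 encoded-word string."""
    return Header(real_name, charset).encode()


if __name__ == "__main__":
    encoded = rfc2047_name("Martin-Éric Racine")
    print(encoded)
    # Round-trip to check that the name survives the encoding.
    raw, used_charset = decode_header(encoded)[0]
    print(raw.decode(used_charset))
```

The encoded form names its own charset, which is exactly the piece of
information that neither the GECOS field nor the environment variables
carry.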

>> Please remember that both Debian and Ubuntu nowadays allow non-ascii
>> GECOS content under the presumption that it will be in UTF-8.
>
> They have always allowed non-ascii content in the GECOS field, but I see
> no such presumption.
>
> From the sources I have seen, existing tools for manipulating
> /etc/passwd will happily accept *any* byte sequence from the terminal.
> If an administrator has still set his console to iso-8859-1, that's what
> is used, without conversion.

It indeed accepts anything and, these days, with the locales using
UTF-8 variants by default, it really does get anything. :-)

Martin-Éric


