After looking at this for a while, I think:
1) Python should be fixed to work properly in the C locale. In fact, it is fixed by 
PEP 538 (which roughly sets LC_CTYPE to C.UTF-8 when LC_ALL is not set and 
LC_CTYPE is C) and PEP 540 (which effectively enforces UTF-8 for everything 
when the locale is C).
2) Unfortunately, this means the only 'sane' Python 3 versions are 3.7 and 
later. It's too late for us to try replacing Python 3.5 with Python 3.7 (we don't 
have the manpower to do it properly in the 2020.04 time frame).
3) I'm trying to backport PEP 538 to our Python 3.5, based on the Fedora Python 3.6 
PEP 538 backport. This could make things a bit better (at least when you don't 
have an explicit LC_ALL=C in your environment): 
https://github.com/OpenIndiana/oi-userland/pull/5595
4) 
https://github.com/OpenIndiana/pkg5/commit/d3187bbf6614769114482c8823a6b1adae05ea3a
seems to be good enough on the pkg side for the time being, given that Python has to be 
updated in any case (likely no additional fixes will be required with Python 3.7).
5) I understand that wrappers that set the locale, or pkg commands re-executing 
themselves with the correct locale, would fix the problem, but I find this awkward 
enough to avoid implementing it.
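
To see which of the mechanisms from point 1 is actually in effect for a given
interpreter, a small check along these lines may help. It is only a sketch:
the function name describe_text_environment is made up for illustration, and
sys.flags.utf8_mode exists only on Python 3.7 and later, so it is read
defensively here.

```python
import locale
import os
import sys


def describe_text_environment():
    """Collect the settings that decide how Python decodes OS-provided text."""
    return {
        # Locale-related variables as seen by this process. If PEP 538
        # coercion happened, LC_CTYPE may show up here as C.UTF-8.
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        "LANG": os.environ.get("LANG"),
        # 1 when PEP 540 UTF-8 mode is active; None on interpreters older
        # than 3.7, where the flag does not exist.
        "utf8_mode": getattr(sys.flags, "utf8_mode", None),
        # Encoding used for file names, argv and environment variables.
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Encoding derived from the locale; normally the default for open().
        "preferred_encoding": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for key, value in describe_text_environment().items():
        print("{}: {}".format(key, value))
```

Running it once with LC_ALL=C set and once with LC_ALL unset shows whether
coercion or UTF-8 mode kicks in on a particular build.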


Best regards,
Александр Пыхалов,
programmer, Telecommunications Infrastructure Department,
Information and Communication Infrastructure Directorate, Southern Federal University


________________________________________
From: Joshua M. Clulow <[email protected]>
Sent: 12 March 2020, 8:50
To: illumos-discuss
Subject: Re: [discuss] pkg, python3 and unicode

On Wed, 11 Mar 2020 at 01:31, Till Wegmüller <[email protected]> wrote:
> AFAIK only C.UTF-8 is sane for languages such as Python.
> Everything else will cause a failure somewhere in the code, as it is
> simply too many calls / type conversions. We should ensure that zlogin
> (or in general zone enter code) enforces C.UTF-8 as locale and not C.

I don't think that's right.  The "zlogin" program is really no
different to "ssh" or any other interhost login mechanism; it should
generally inherit and respect whatever locale the user was using when
invoked.

Software in any language, including Python, needs to be written with
the data format in mind.  If IPS is processing data which is defined
to be UTF-8, then it really needs to use UTF-8 aware data types and
library routines for accessing that data rather than depend on the
locale being correct or not.  If that UTF-8 data is subsequently
rendered for the user, it should then be converted to the active
locale for display.
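
A minimal Python sketch of that separation, assuming the input is a byte
string that the format defines as UTF-8. The helper names decode_utf8_field
and encode_for_display are hypothetical and are not taken from pkg; the point
is only that the data encoding is named explicitly, while the locale is
consulted solely at output time.

```python
import locale


def decode_utf8_field(raw):
    """Decode bytes that the data format defines as UTF-8.

    The encoding is spelled out, so the result does not depend on the
    locale. Invalid input raises UnicodeDecodeError.
    """
    return raw.decode("utf-8")


def encode_for_display(text, encoding=None):
    """Convert text to the active locale's encoding for output.

    Characters the locale cannot represent are written as backslash
    escapes instead of aborting the program.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return text.encode(encoding, errors="backslashreplace")
```

The bytes returned by encode_for_display can be written with
sys.stdout.buffer.write(), which bypasses any implicit re-encoding.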

This same property comes up a lot in Rust, where strings provided by
the OS (e.g., argv and the environment) are treated as byte arrays
until explicitly converted to a native UTF-8 string type for string
handling.  This conversion can fail if the input is not actually
UTF-8; or, it can optionally be lossy and replace invalid UTF-8
sequences with placeholders.  Either way, you are forced to choose a
policy and handle the different cases.
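
A rough Python analogue of the two policies, as a sketch only (the function
names are hypothetical). On POSIX systems Python 3 decodes argv and the
environment with the surrogateescape error handler, so os.fsencode() can
recover the original bytes, which can then be converted under an explicit
policy.

```python
import os


def os_string_to_bytes(value):
    """Recover the original bytes of an argv or environment string."""
    return os.fsencode(value)


def to_text_strict(raw):
    """Strict policy: raise UnicodeDecodeError if raw is not valid UTF-8."""
    return raw.decode("utf-8", errors="strict")


def to_text_lossy(raw):
    """Lossy policy: replace invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")
```

For example, to_text_lossy(b"ab\xff") yields "ab" followed by the U+FFFD
replacement character, while to_text_strict() raises on the same input.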


Cheers.

--
Joshua M. Clulow
http://blog.sysmgr.org

------------------------------------------
illumos: illumos-discuss
Permalink: 
https://illumos.topicbox.com/groups/discuss/T6f78aa7809ef6ec3-Mc738661e7fdb1b6b26a1dab3
Delivery options: https://illumos.topicbox.com/groups/discuss/subscription
