After looking at this for a while, I think:

1) Python should be fixed to work properly in the C locale. In fact, this is fixed by PEP 538 (which, roughly, coerces LC_CTYPE to C.UTF-8 when LC_ALL is not set and the effective LC_CTYPE locale is C) and PEP 540 (which effectively enforces UTF-8 for everything when the locale is C).

2) Unfortunately, this means the only 'sane' Python 3 versions are 3.7 and later. It's too late for us to try replacing Python 3.5 with Python 3.7 (we don't have the manpower to do it properly in the 2020.04 time frame).

3) I'm trying to backport PEP 538 to our Python 3.5, based on Fedora's PEP 538 backport for Python 3.6. This could make things a bit better (at least when you don't have an explicit LC_ALL=C in your environment): https://github.com/OpenIndiana/oi-userland/pull/5595

4) https://github.com/OpenIndiana/pkg5/commit/d3187bbf6614769114482c8823a6b1adae05ea3a seems to be good enough on the pkg side for now, given that Python should be updated in any case (most likely no additional fixes will be required with Python 3.7).

5) I understand that wrappers that set the locale, or pkg commands that re-execute themselves with the correct locale, would fix the problem, but I find this awkward enough to avoid implementing it...
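
For reference, the re-execution approach from point 5 could look roughly like the sketch below. This is only an illustration, not what pkg does: the helper names (effective_ctype, utf8_env, reexec_if_needed) are hypothetical, and it assumes a C.UTF-8 locale is actually installed on the system.

```python
import os
import sys

C_LOCALES = {"C", "POSIX"}


def effective_ctype(env):
    """Return the locale name governing LC_CTYPE (POSIX precedence order)."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:  # an empty value counts as unset
            return value
    return "C"


def utf8_env(env, target="C.UTF-8"):
    """Return a copy of env with a UTF-8 LC_CTYPE, or None if no change is needed.

    'target' is an assumption: C.UTF-8 must exist on the system.
    """
    if effective_ctype(env) not in C_LOCALES:
        return None
    new_env = dict(env)
    if new_env.get("LC_ALL"):
        # LC_ALL overrides every category, so it has to be replaced itself.
        new_env["LC_ALL"] = target
    else:
        # Least invasive change: touch only the character-encoding category.
        new_env["LC_CTYPE"] = target
    return new_env


def reexec_if_needed():
    """Re-execute the current script once under a UTF-8 locale if running in C."""
    new_env = utf8_env(os.environ)
    if new_env is not None:
        # After the exec the effective locale is no longer C, so this cannot loop.
        os.execve(sys.executable, [sys.executable] + sys.argv, new_env)
```

A command would call reexec_if_needed() before doing anything else. Note that this silently overrides an explicit LC_ALL=C, which is part of why it feels awkward.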

Best regards,
Alexander Pyhalov,
programmer, Telecommunications Infrastructure Department,
Directorate of Information and Communication Infrastructure, SFedU (Southern Federal University)

________________________________________
From: Joshua M. Clulow <[email protected]>
Sent: 12 March 2020, 8:50
To: illumos-discuss
Subject: Re: [discuss] pkg, python3 and unicode

On Wed, 11 Mar 2020 at 01:31, Till Wegmüller <[email protected]> wrote:
> AFAIK only C.UTF-8 is sane for languages such as Python.
> Everything else will cause a failure somewhere in the code, as it is
> simply too many calls / type conversions. We should ensure that zlogin
> (or in general zone enter code) enforces C.UTF-8 as locale and not C.

I don't think that's right. The "zlogin" program is really no different to "ssh" or any other interhost login mechanism; it should generally inherit and respect whatever locale the user was using when invoked.

Software in any language, including Python, needs to be written with the data format in mind. If IPS is processing data which is defined to be UTF-8, then it really needs to use UTF-8 aware data types and library routines for accessing that data rather than depend on the locale being correct or not. If that UTF-8 data is subsequently rendered for the user, it should then be converted to the active locale for display.

This same property comes up a lot in Rust, where strings provided by the OS (e.g., argv and the environment) are treated as byte arrays until explicitly converted to a native UTF-8 string type for string handling. This conversion can fail if the input is not actually UTF-8; or, it can optionally be lossy and replace invalid UTF-8 sequences with placeholders. Either way, you are forced to choose a policy and handle the different cases.

Cheers.

--
Joshua M. Clulow
http://blog.sysmgr.org

------------------------------------------
illumos: illumos-discuss
Permalink: https://illumos.topicbox.com/groups/discuss/T6f78aa7809ef6ec3-Mc738661e7fdb1b6b26a1dab3
Delivery options: https://illumos.topicbox.com/groups/discuss/subscription
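
The policy described in the quoted message (treat the data as bytes, decode it explicitly as UTF-8, and pick either a strict or a lossy policy) can be sketched in Python roughly as follows. The function names decode_utf8 and render_for_locale are hypothetical and are not taken from the pkg code; this is only a minimal illustration of the idea.

```python
def decode_utf8(raw, lossy=False):
    """Decode bytes that are defined to be UTF-8, independent of the locale.

    Strict mode raises UnicodeDecodeError on invalid input; lossy mode
    substitutes U+FFFD for invalid sequences.
    """
    return raw.decode("utf-8", errors="replace" if lossy else "strict")


def render_for_locale(text, encoding):
    """Encode text for display in the given output encoding.

    Characters the encoding cannot represent are replaced with '?'.
    """
    return text.encode(encoding, errors="replace")


# Strict policy: the caller has to handle the failure case explicitly.
try:
    decode_utf8(b"caf\xe9")  # Latin-1 bytes, not valid UTF-8
except UnicodeDecodeError:
    name = decode_utf8(b"caf\xe9", lossy=True)  # fall back to the lossy policy
```

The encoding passed to render_for_locale would normally come from the active locale, for example locale.getpreferredencoding(), so that only the display step depends on the user's environment.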
