Victor Stinner wrote:
> Hi,
>
> On Thursday 04 December 2008 21:02:19, Toshio Kuratomi wrote:
>
>> These mixed encodings can occur for a variety of reasons.  Here's an
>> example that isn't too contrived :-)
>> (...)
>> Furthermore, they don't want to suffer from the space loss of using
>> utf-8 to encode Japanese so they use shift-jis everywhere.
>
> "space loss"? Really? If you configure your server correctly, you should get
> UTF-8 even if the file system is Shift-JIS. But it would be much easier to
> use UTF-8 everywhere.
>
> Hum... I don't think that the discussion is about one specific server, but
> the lack of bytes environment variables in Python3 :-)
>
Yep.  I can't change the logicalness of the policies of a different
organization, only code my application to deal with it :-)

>> 1) return mixed unicode and byte types in ...
>
> NO!
>
It's nice that we agree... but I would prefer if you leave enough
context so that others can see that we agree as well :-)

>> 2) return only byte types in os.environ
>
> Hum... Most users have UTF-8 everywhere (eg. all Windows users ;-)), and
> Python3 already use Unicode everywhere (input(), open(), filenames, ...).
>
We're also in agreement here.

>> 3) silently ignore non-decodable value when accessing os.environ['PATH']
>> as we do now but allow access to the full information via
>> os.environ[b'PATH'] and os.getenvb()
>
> I don't like os.environ[b'PATH']. I prefer to always get the same result
> type... But os.listdir() doesn't respect that :-(
>
>    os.listdir(str) -> list of str
>    os.listdir(bytes) -> list of bytes
>
> I would prefer a similar API for easier migration from Python2/Python3
> (unicode). os.environb sounds like the best choice for me.
>
<nod>.  After thinking about how it would be used in subprocess calls I
agree.  os.environb would allow us to retrieve the full dict as bytes.
os.environ[b''] only works on individual keys.  Also os.getenv serves
the same purpose as os.environ[b''] would whereas os.environb would have
its own uses.

>
> But they are open questions (already asked in the bug tracker):
>
I answered these in the bug tracker.  Here are the answers for the
mailing list:

> (a) Should os.environ be updated if os.environb is changed? If yes, how?
>    os.environb['PATH'] = '\xff' (or any invalid string in the system
>                                  default encoding)
>    => os.environ['PATH'] = ???
>
The underlying environment that both variables reflect should be updated
but what is displayed by os.environ should continue to follow the same
rules.  So if we follow option #3::

    os.environb['PATH'] = b'\xff'
    os.environ['PATH'] => raises KeyError because PATH is not a key in
                          the unicode decoded environment.

(option #4 would issue a UnicodeDecodeError instead of a KeyError)

Similarly, if you start with a variable in os.environb that can only be
represented as bytes and your program transforms it into something that
is decodable it should then show up in os.environ.

> (b) Should os.environb be updated if os.environ is changed? If yes, how?
>
> The problem comes with non-Unicode locale (eg. latin-1 or ASCII): most
> charset are unable to encode the whole Unicode charset (eg. codes >= 65535).
>
>    os.environ['PATH'] = chr(0x10000)
>    => os.environb['PATH'] = ???
>
Ah, this is a good question.  I misunderstood what you were getting at
when you posted this to the bug report.  I see several options but the
one that seems the most sane is to raise UnicodeEncodeError when setting
the value.  With that, proper code to set an environment variable might
look like this::

    LANG=C python3.0
    >>> import os
    >>> variable = chr(0x10000)
    >>> try:
    ...     # Unicode aware locales
    ...     os.environ['MYVAR'] = variable
    ... except UnicodeEncodeError:
    ...     # Non-Unicode locales
    ...     os.environb['MYVAR'] = bytes(variable, encoding='utf8')

> (c) Same question when a key is deleted (del os.environ['PATH']).
>
Update the underlying env so both os.environ and os.environb reflect the
change.  Deleting should not hold the problems that updating does.

> If Python 3.1 will have os.environ and os.environb, I'm quite sure that some
> modules will user os.environ and other will prefer os.environb. If both
> environments are differents, the two modules set will work differently :-/
>
Exactly.  So making sure they hold the same information is a priority.

> It would be maybe easier if os.environ supports bytes and unicode keys. But
> we have to keep these assertions:
>    os.environ[bytes] -> bytes
>    os.environ[str] -> str
>
I think the same choices have to be made here.  If LANG=C, we still have
to decide what to do when os.environ[str] is set to a non-ASCII string.
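
For illustration only, the answers to (a), (b) and (c) can be modelled as
two views over one bytes store.  This is a rough sketch, not a proposed
implementation: the EnvModel class and its method names are hypothetical,
and the `encoding` argument merely stands in for the locale encoding.

```python
class EnvModel:
    """Toy model (hypothetical): one bytes store with a str view on top."""

    def __init__(self, raw, encoding="ascii"):
        self.raw = dict(raw)      # stands in for the real process environment
        self.encoding = encoding  # stands in for the locale encoding

    # Bytes view: roughly what an os.environb would expose.
    def getb(self, key):
        return self.raw[key]

    def setb(self, key, value):
        self.raw[key] = value

    # Str view: roughly what os.environ would expose under option #3.
    def get(self, key):
        raw_value = self.raw[key.encode(self.encoding)]
        try:
            return raw_value.decode(self.encoding)
        except UnicodeDecodeError:
            # Option #3: an undecodable value is not visible as str.
            # (Option #4 would let the UnicodeDecodeError propagate.)
            raise KeyError(key) from None

    def set(self, key, value):
        # Raises UnicodeEncodeError when the encoding cannot represent value.
        self.raw[key.encode(self.encoding)] = value.encode(self.encoding)

    def delete(self, key):
        # Deleting goes to the shared store, so both views see the change.
        del self.raw[key.encode(self.encoding)]


env = EnvModel({b"PATH": b"/usr/bin"})

# (a) a bytes update that cannot be decoded hides the key from the str view
env.setb(b"PATH", b"\xff")

# (b) a str update that cannot be encoded raises; fall back to bytes
try:
    env.set("MYVAR", chr(0x10000))
except UnicodeEncodeError:
    env.setb(b"MYVAR", chr(0x10000).encode("utf-8"))
```

After this runs, env.getb(b"PATH") still returns the raw byte while
env.get("PATH") raises KeyError, and MYVAR exists only in the bytes view.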

Additionally, the subprocess question makes using the key value
undesirable compared with having a separate os.environb that accesses
the same underlying data.

>> 4) raise an exception when non-decodable values are *accessed* and
>> continue as in #3.
>
> I like os.listdir() behaviour: just *ignore* non-decodable files. If you
> really want to access these files, use a bytes directory name ;-)
>
Since you wrote the code for that I would hope so ;-)

Here's my problem with it, though.  With these semantics any program
that works on arbitrary files and runs on *NIX has to check
os.listdir(b'') and do the conversion manually.  The only code that
doesn't have to care is code that is working on files that the program
created and thus controls.  Since it is not obvious that this has to be
done, most programs won't do it by default, and there will be subtle
bugs in a lot of code that individual application authors will have to
discover and change when a user realizes something is wrong.  Since
there's no traceback being issued, the process of discovery and
debugging will be longer.

>> I think that the ease of debugging is lost when we silently ignore an error.
>
> Guido gave a good example. If your directory contains an non decodable
> filename (eg. "???.txt"): glob('*.py') will fail because of the evil
> filename. With the current behaviour, you're unable to list all files but
> glob('*.py') will list all Python scripts!
>
Current behaviour is this:

    os.listdir('.')   => Only decodable filenames
    glob.glob('*')    => Only decodable filenames
    os.listdir(b'.')  => All filenames as bytes
    glob.glob(b'*')   => All filenames as bytes

I think the desired behaviour, assuming the existence of a non-decodable
file, is this:

    os.listdir('.')   => traceback
    glob.glob('*')    => traceback
    os.listdir(b'.')  => All filenames as bytes
    glob.glob(b'*')   => All filenames as bytes

Both of these approaches are internally consistent.  Why do you think
that glob.glob('*.py') is special and should not traceback?
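
To make the difference between the two behaviours concrete, here is a
minimal sketch.  It works on a plain list of bytes names rather than a
real directory, and the helper names decode_skipping and decode_strict
are made up for this example; neither is an existing os or glob function.

```python
def decode_skipping(raw_names, encoding="utf-8"):
    """Silent policy: drop every name that cannot be decoded."""
    decoded = []
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError:
            continue  # the caller never learns that a name was skipped
    return decoded


def decode_strict(raw_names, encoding="utf-8"):
    """Traceback policy: the first undecodable name raises."""
    return [raw.decode(encoding) for raw in raw_names]


# Stand-in for what a bytes directory listing might return.
raw_names = [b"setup.py", b"\xff\xfe.txt", b"README"]

print(decode_skipping(raw_names))  # ['setup.py', 'README']

try:
    decode_strict(raw_names)
except UnicodeDecodeError as exc:
    # The failure is loud and points at the offending bytes.
    print("undecodable name:", exc.object)
```

With the silent policy the second file simply vanishes from the result;
with the strict policy the program stops at the point where the problem
is, which is the debugging advantage argued for above.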

> And Python3 is released, it's maybe a bad idea to change the behaviour (of
> os.environ) in Python 3.1 :-/
>
As you've pointed out, os.environ will have to change slightly.  But
others have already said that this is on the agenda to fix in 3.1.  The
current state is just broken as the environment is currently only
partially readable from python.

>> The bug report I opened suggests creating a PEP to address this issue.
>
> Please, try to answer to my questions about os.environ and os.environb
> consistency.
>
I have.  Twice now :-)

> I also like bytes environment variables. I need them for my fuzzing program.
> The lack of bytes variables is a regression from Python2 (for my program). On
> UNIX, filenames are bytes and the environment variables are bytes. For the
> best interoperability, Python3 should support bytes. But the default choice
> should always be characters (unicode) and to never mix the bytes and str
> types ;-)
>
I agree 100%.

* Never mixing bytes and str is a *huge* benefit of python3 over python2.
* Unicode str everywhere possible is a python3 benefit that helps to get
  conversion done at the border.

I just differ in that I think lack of tracebacks when
UnicodeDecodeErrors are encountered is a wart in python3 that did not
exist in python2.

-Toshio

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com