On 9/18/07, James Y Knight <[EMAIL PROTECTED]> wrote:
>
> On Sep 18, 2007, at 11:11 AM, Guido van Rossum wrote:
> > If they contain
> > non-ASCII bytes I am currently in favor of doing a best-effort
> > decoding using the default locale encoding, replacing errors with '?'
> > rather than throwing an exception.
>
> One of the more common things to do with command line arguments is
> open them. So, it'd really be nice if:
>
> python -c 'import sys; open(sys.argv[1])' [some filename]

I'd like this too, but it isn't easy.

> would always work, regardless of the current system encoding and what
> characters make up the filename.  Note that filenames are essentially
> random binary gunk in most Unix systems; the encoding is unspecified,
> and there can in fact be multiple encodings, even for different
> directories making up a single file's path.
>
> I'd like to propose that python simply assume the external world is
> likely to be UTF-8, and always decode command-line arguments (and
> environment vars), and encode for filesystem operations using the
> roundtrip-able UTF-8b. Even if the system says its encoding is
> iso-2022 or some other abomination. This has upsides (simple, doesn't
> trample on PUA codepoints, only needs one new codec, never throws an
> exception in the above example, and really is correct much of the
> time), and downsides (if the system locale is iso-2022, and all the
> filenames you're dealing with really are also properly encoded in
> iso-2022, it might be nice if they decoded into the sensible unicode
> string, instead of a nonsensical (but still round-trippable) one).
>
> I think the advantages outweigh the disadvantages, but the world I
> live in, using anything other than UTF-8 or ASCII is grounds for entry
> into an insane asylum. ;)

You seem to be contradicting yourself. The world *isn't* using
UTF-8(b) predominantly yet, so assuming UTF-8(b) everywhere will break
your first requirement.
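
To make the round-trip idea concrete, here is a minimal sketch. It
assumes the `surrogateescape` error handler that later Python 3
releases provide as a stand-in for the proposed UTF-8b codec, and the
filename is a made-up example. It only shows that the bytes survive a
decode/encode cycle, not that the decoded string is meaningful.

```python
# Hypothetical filename: "café.txt" encoded as Latin-1, so it is not valid UTF-8.
raw = b"caf\xe9.txt"

# Round-trippable decode: the undecodable byte 0xE9 is mapped to the lone
# surrogate U+DCE9 instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")
roundtrip = text.encode("utf-8", errors="surrogateescape")

# Lossy decode for comparison: the bad byte becomes U+FFFD, so the
# original byte can no longer be recovered.
lossy = raw.decode("utf-8", errors="replace")
lossy_bytes = lossy.encode("utf-8")

print(ascii(text))
print("round-trip preserved bytes:", roundtrip == raw)
print("lossy decode preserved bytes:", lossy_bytes == raw)
```

The decoded string is not the name a user would expect to see, which
is the downside James mentions, but the bytes handed back to the
filesystem are unchanged.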

Two encodings are more likely (though not guaranteed) to produce
success: the locale encoding or the filesystem encoding. I'm thinking
that the locale encoding is probably the one to use for argv and
environ, since at least the user can change it in order to make things
work.
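
A rough sketch of what best-effort decoding with the locale encoding
could look like. The helper name `decode_arg` is hypothetical, and
getting at the raw bytes of argv is assumed to happen elsewhere; this
only shows the decode step with errors replaced by '?'.

```python
import locale


def decode_arg(raw, encoding=None):
    """Best-effort decode of one raw argument; never raises on bad bytes.

    `decode_arg` is a hypothetical helper, not an existing API.
    """
    if encoding is None:
        # The user's locale encoding, which the user can change (e.g. via
        # LANG or LC_ALL) to make things work.
        encoding = locale.getpreferredencoding(False)
    # errors="replace" yields U+FFFD for undecodable bytes; map that to '?'.
    # Note this also rewrites any U+FFFD that was legitimately in the input.
    return raw.decode(encoding, errors="replace").replace("\ufffd", "?")


print(decode_arg(b"caf\xe9", "latin-1"))
print(decode_arg(b"caf\xe9", "utf-8"))
```

Unlike a round-trippable scheme, the '?' result cannot be turned back
into the original bytes, so opening such an argument as a filename
would still fail.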

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)