Re: [Python-Dev] Python-3.0, unicode, and os.environ

Adam Olsen Thu, 04 Dec 2008 15:16:08 -0800

On Thu, Dec 4, 2008 at 3:47 PM, André Malo <[EMAIL PROTECTED]> wrote:
> * Adam Olsen wrote:
>
>> On Thu, Dec 4, 2008 at 2:09 PM, André Malo <[EMAIL PROTECTED]> wrote:
>
>> > Here's an example which will become popular soon, I guess: CGI scripts
>> > and, of course WSGI applications. All those get their environment in an
>> > unknown encoding. In the worst case one can blow up the application by
>> > simply sending strange header lines over the wire. But there's more:
>> > consider running the server in C locale, then probably even a single 8
>> > bit char might break something (?).
>>
>> I think that's an argument that the framework should reencode all
>> input text into the correct system encoding before passing it on to
>> the CGI script or WSGI app.  If the framework doesn't have a clear way
>> to determine the client's encoding then it's all just gibberish
>> anyway.  A HTTP 400 or 500 range error code is appropriate here.
>
> Duh.
> See, you're already mixing different encodings and creating issues here!
> You're talking about client encoding (whatever that is) with correct system
> encoding (whatever that is, too) in the same paragraph and assume they are
> the same or compatible.


Mixing can work so long as each encoding is clearly specified and
unambiguous.  It limits your character set to the common subset of both
encodings, but that's a lesser problem.
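
As a rough illustration of that common-subset limit (the helper name
reencode is made up for this sketch, and it assumes the source encoding
really is known):

```python
def reencode(data: bytes, source: str, target: str) -> bytes:
    """Decode from the declared source encoding, then encode to the target."""
    return data.decode(source).encode(target)


# "é" exists in both Latin-1 and UTF-8, so this conversion succeeds.
utf8_name = reencode(b"Andr\xe9", "latin-1", "utf-8")

# ASCII has no "é", so the same input falls outside the common subset.
try:
    reencode(b"Andr\xe9", "latin-1", "ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The failure in the second call is the point: the conversion is refused
loudly instead of passing mangled text along.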


> There are several points here:
>
> - there is no clear way to get a single client encoding for the whole HTTP
>  transaction (headers + body), because *there is none*. If the whole
>  header set matches the same encoding, it's more or less luck.

If there is no way to determine the encoding, via official standards
or de facto standards, you should assume ASCII and blow up if anything
else is given.  At that point it's meaningless garbage anyway.


> - there is no correct system encoding either. As said, I prefer running my
>  servers in C locale, so it's all ascii. In fact, it shouldn't matter. The
>  locale should not have anything to do with an application called over the
>  network.

I half agree: the network should be unaffected by the C locale.
However, using a C locale should limit you to ASCII file names and
environment variables.


> - A 400 or 500 response for a header containing something like my name is
>  not appropriate.
>
> - Octets in HTTP headers are allowed. And they are what they are -
>  octets. The interpretation has to be left to the application, not the
>  framework.

If there is no clear interpretation then they're garbage.  If there is
a clear interpretation it could be done just as well in the framework,
which also lets all the apps benefit from a single implementation,
rather than trying to reimplement it for each one.
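
A minimal sketch of what I mean by doing it once in the framework.
Everything here is hypothetical: decode_cgi_environ and handle_request
are invented names, the environment is modelled as a plain dict of
bytes, and this is not how any existing CGI or WSGI server is written.

```python
def decode_cgi_environ(raw_env, encoding="ascii"):
    """Decode every name and value once; raise ValueError if any fails."""
    decoded = {}
    for key, value in raw_env.items():
        try:
            decoded[key.decode(encoding)] = value.decode(encoding)
        except UnicodeDecodeError:
            raise ValueError(f"undecodable variable: {key!r}") from None
    return decoded


def handle_request(raw_env, app, encoding="ascii"):
    """Return (status, body); the app only ever sees decoded text."""
    try:
        env = decode_cgi_environ(raw_env, encoding)
    except ValueError:
        # No clear interpretation of the octets: reject the request.
        return "400 Bad Request", "malformed request header"
    return "200 OK", app(env)
```

With the default of ASCII, a header value of b"Andr\xe9" is rejected;
if the framework has a real reason to pass encoding="latin-1", the same
octets reach the app as text and no app has to repeat the decoding.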


>> >> However, some pragmatism is also possible.  Many uses of PATH may
>> >> allow it to be treated as black-box bytes, rather than text.  The
>> >> minimal solution I see is to make os.getenv() and os.putenv() switch
>> >> to byte modes when given byte arguments, as os.listdir() does.  This
>> >> use case doesn't require the ability to iterate over all environment
>> >> variables, as os.environb would allow.
>> >>
>> >> I do wonder if controlling the environment given to a subprocess
>> >> requires os.environb, but it may be too obscure to really matter.
>> >
>> > IMHO, environment variables are no text. They are bytes by definition
>> > and should be treated as such.
>> > I know, there's windows having unicode enabled env vars on demand, but
>> > there's only trouble with those over there in apache's httpd (when
>> > passing them to CGI scripts, oh well...).
>>
>> Environment variables have textual names, are set via text, frequently
>
> Well, think about my example again. The friendly way to maintain them is not
> the issue. The problems arise at least when the variables are set by an
> attacker.

Maintaining them *IS* the issue.  The whole reason they're text in the
first place is to display them to and receive them back from the user.
Otherwise we'd use meaningless serial numbers for directories or
something.

It may not seem to matter in this use case, but that's only because
they're communicated to/from the user on another system.


>> contain textual file names or paths, and my shell (bash in
>> gnome-terminal on ubuntu) lets me put unicode text in just fine.  The
>> underlying APIs may use bytes, but they're *intended* to be encoded
>> text.
>
> Yes, encoded text == bytes. No, they're intended to be c-strings. And well,
> even if we assume that they should contain text (as in encoded unicode),
> their meaning is application specific and so is the encoding (even if it's
> mixed).
>
> What I'm saying is: I don't see much use for unicode APIs for the
> environment at all, because I don't know what's in there before inspecting
> them. And apparently the only reliable way to inspect them is via a byte
> oriented API.

If you don't think your paths should contain text then please alter
your other systems to stop using Japanese names.  Standardize on ASCII
serial numbers or something equally meaningless.

Treating it as bytes is a bodge.  It's worth getting your use case to
"just work", but in the end it is text, and the *only* broad solution
to text is Unicode.
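
For illustration only, here is one way a text API could still carry
undecodable bytes through unharmed.  It assumes the "surrogateescape"
error handler, which is not in Python 3.0 (it arrived with PEP 383 in
Python 3.1), and the two helper names are invented for this sketch.

```python
def env_bytes_to_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode, smuggling undecodable bytes through as lone surrogates."""
    return raw.decode(encoding, "surrogateescape")


def env_text_to_bytes(text: str, encoding: str = "utf-8") -> bytes:
    """Encode, turning smuggled surrogates back into the original bytes."""
    return text.encode(encoding, "surrogateescape")


# Properly encoded text comes out as ordinary, displayable text.
good = env_bytes_to_text("日本語".encode("utf-8"))

# A stray Latin-1 byte is invalid UTF-8, yet it survives the round trip.
junk = b"/home/andr\xe9"
restored = env_text_to_bytes(env_bytes_to_text(junk))
```

The value is handled as text everywhere, and the black-box case still
gets its original bytes back when it is handed to the OS again.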


-- 
Adam Olsen, aka Rhamphoryncus
