2008/9/30 Toshio Kuratomi <[EMAIL PROTECTED]>:
>
>
>
> On Sep 29, 4:33 pm, "Graham Dumpleton" <[EMAIL PROTECTED]>
> wrote:
>> 2008/9/30 Toshio Kuratomi <[EMAIL PROTECTED]>:
>>
>>
>>
>> >> For response headers and content, the application can either generate
>> >> bytes and thus control the encoding, or it will fallback to trying to
>> >> convert it as latin-1 if Unicode supplied, so like wsgi.input, no
>> >> problem there.
>>
>> > Unlike wsgi.input where the application *must* decide how to decode
>> > the data, you are trying to do automatic encoding of data in the wsgi
>> > server here.  This will cause tracebacks on some unicode string input
>> > but not others (which is one of the reasons that people hate unicode
>> > handling in python-2).  The tracebacks occur because latin-1
>> > characters are a subset of Unicode characters (note that we're not
>> > dealing with code-point to byte mapping here, we're dealing with
>> > character mapping).  So you can always convert latin-1 to unicode.
>> > But you can't always convert Unicode to latin-1 (which is what this
>> > automatic conversion would attempt). It's much better for the
>> > application layer to always hand mod_wsgi byte types, never unicode.
>>
>> The amendment page says:
>>
>>   When running under Python 3, applications SHOULD produce bytes
>> output and headers
>>
>> So, the ideal situation is that the application would always produce
>> bytes and so it is the application which is supposed to deal with it.
>>
>> That mod_wsgi falls back to converting any Unicode strings to bytes is
>> a fail safe as dictated by:
>>
>>   When running under Python 3, servers and gateways MUST accept
>>   strings as application output or headers, under the existing rules (i.e.,
>>   s.encode('latin-1') must convert the string to bytes without an
>>   exception)
>>
>>  and is more to protect lazy programmers, plus make it easier to port
>> WSGI applications for Python 2.X.
>>
> So there's two things here:
> 1) Maybe I'm misunderstanding some code but I thought mod_wsgi was
> decoding bytes going out to the app.  If that's not the case and
> mod_wsgi is only handing byte strings to the apps then that's fine.
> (I note that this interaction isn't specified in the Amendment which
> goes along with your general feeling on the problems with the WSGI-
> spec writing process.)

I thought I had made it clear enough and that the proposed amendments
were also clear on this.

The wsgi.input stream which contains the request content is 'bytes'.
Thus it is not touched by mod_wsgi. The amendments say:

  When running under Python 3, servers MUST make wsgi.input a
  binary (byte) stream

The amendments do, though, also say:

  When running under Python 3, servers MUST provide CGI HTTP variables
  as strings, decoded from the headers using HTTP standard encodings
  (i.e. latin-1 + RFC 2047) (Open question: are there any CGI or WSGI
  variables that should NOT be strings?)

Thus, mod_wsgi does convert the CGI variables (i.e., translated
HTTP headers) in the WSGI environment dictionary into Unicode strings,
decoding them as latin-1.
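
As a rough illustration of that conversion (a sketch only, not mod_wsgi's
actual code; decode_cgi_variables and the sample values are made up for
the example, and RFC 2047 handling is ignored), decoding raw header bytes
as latin-1 can never fail, because every byte value maps to a code point:

```python
def decode_cgi_variables(raw_headers):
    """Hypothetical helper: decode raw CGI variable bytes into str values."""
    environ = {}
    for name, value in raw_headers.items():
        # latin-1 maps each byte 0x00-0xFF to the code point with the same
        # number, so this decode cannot raise UnicodeDecodeError.
        environ[name] = value.decode('latin-1')
    return environ


# Sample input: PATH_INFO holds the UTF-8 bytes for "/caf\u00e9".
raw = {'PATH_INFO': b'/caf\xc3\xa9', 'HTTP_HOST': b'example.com'}
environ = decode_cgi_variables(raw)

# The original bytes are always recoverable by encoding back as latin-1.
recovered = environ['PATH_INFO'].encode('latin-1')
```

The point of the sketch is only that the step is lossless: whatever bytes
arrived can be recovered from the string the application sees.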

As I pointed out there were only a few variables in there which were
of concern. Brian has pointed out that request URI has to be ascii
characters but there possibly still is an open question there on how
encoding of non ascii characters works in practice. We just need to do
some actual tests to see what happens and whether there is a problem.

Thus we are possibly down to SCRIPT_FILENAME given that it is
reflecting a file system path. Again, we just need to do some actual
tests to see what happens. Remember that Apache is, in the main, going
to dictate how things work.

> 2) pje said that accepting unicode str here would make it easier to
> port WSGI applications but that's actually not true.  In python-2.x,
> you are only supposed to pass byte strings (py-2.x str) so there's no
> problems.  When those str's are converted to unicode str in py3.x, you
> have to rewrite your code so you aren't passing non-latin-1
> characters.  At that point, there's zero incentive to pass a sanitized
> unicode string to the wsgi server as you had to go through the byte
> type in order to get there (unless you misunderstand the WSGI spec and
> think it wants you to send py-3.x str type.)
>
> As for protecting lazy programmers... I'd argue that it's much better
> to throw an exception immediately upon receiving a unicode type rather
> than waiting until your app starts getting popular and you suddenly
> have transient errors due to people occasionally submitting data with
> non-latin-1 characters.

My feeling was that fallback to converting to bytes using latin-1 was
so that simple applications would still work. For example, the hello
world application:

def application(environ, start_response):
    status = '200 OK'
    output = 'Hello World!'

    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)

    return [output]

works in both Python 2.X and 3.0 without change.
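
A minimal sketch of what such a fallback amounts to under Python 3
(to_bytes is a hypothetical name, not a mod_wsgi function): bytes pass
through untouched, strings are encoded as latin-1, and a string holding a
character outside latin-1 raises.

```python
def to_bytes(chunk):
    """Hypothetical server-side fallback for application output."""
    if isinstance(chunk, bytes):
        # The application already chose an encoding; leave it alone.
        return chunk
    # Fallback: only works if every character is in the latin-1 range.
    return chunk.encode('latin-1')


hello = to_bytes('Hello World!')

# U+20AC (euro sign) is not in latin-1, so the fallback fails here.
try:
    to_bytes('\u20ac5')
    euro_failed = False
except UnicodeEncodeError:
    euro_failed = True
```

So the hello world case works, while the failure for non-latin-1 output
is the transient error being objected to above.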

Larger applications such as Django already internally deal with all
response content as Unicode and convert it to string objects at the last
minute. The 2to3 converter would presumably pick that up automatically
and make it produce bytes instead.

Request headers in Django are a bit different and more interesting. At
the moment, it will do things like:

  path_info = force_unicode(environ.get('PATH_INFO', u'/'))

where force_unicode is:

  def force_unicode(s, encoding='utf-8', strings_only=False,
                    errors='strict'): ...

Thus, Django was converting Python 2.X string objects to Unicode but
as UTF-8, which technically may not be correct.

In Python 3.0, because this conversion will likely still be applied
after the 2to3 conversion is done, they may well end up converting a
Unicode string created as latin-1 into a Unicode string as UTF-8, albeit
possibly by going back through the bytes type to do it, if I read the
code correctly.

So, the issue there is whether treating them as UTF-8 is right, given
that the amendment is suggesting CGI variables are supposed to be
handled as latin-1.
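
For what it is worth, going back through bytes would look roughly like
the sketch below. This is not Django's code; reinterpret_as_utf8 is a
hypothetical helper, and it assumes the value was produced by decoding
the raw bytes as latin-1.

```python
def reinterpret_as_utf8(value):
    """Hypothetical helper: undo a latin-1 decode, then decode as UTF-8."""
    # Encoding as latin-1 recovers the original bytes exactly; decoding
    # those as UTF-8 raises UnicodeDecodeError if they are not valid UTF-8.
    return value.encode('latin-1').decode('utf-8')


# What the server would hand over for the UTF-8 bytes of "/caf\u00e9":
path_info = b'/caf\xc3\xa9'.decode('latin-1')

# What an application assuming UTF-8 would want instead:
fixed = reinterpret_as_utf8(path_info)
```

Whether UTF-8 is the right guess for any given variable is exactly the
open question; the sketch only shows that the round trip is possible.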

Anyway, that is getting a bit off topic.

>> In other words, your application is the one who should be dealing with
>> it in the first place if you want to be sure about what is being
>> produced.
>
> +100
>
>> It only becomes an issue where the WSGI application hasn't
>> done what it really should have done.
>>
> As long as mod_wsgi is only converting unicode to bytes and not
> converting bytes to unicode, this is true.

I have already explained that, for CGI variables (translated HTTP
headers) in the WSGI environment dictionary, mod_wsgi does convert
bytes to Unicode.

Graham
