Re: Request data encoding

Malcolm Tredinnick Fri, 10 Aug 2007 20:16:41 -0700

On Thu, 2007-08-02 at 19:33 -0500, Jacob Kaplan-Moss wrote:
> On 8/2/07, Simon Willison <[EMAIL PROTECTED]> wrote:
> > This is a totally ridiculous flaw with the HTTP spec - you literally
> > have no reliable way of telling what encoding a request coming in to
> > your site uses, since you can't be absolutely sure that the user-agent
> > read a page from your site to find out your character encoding!
> 
> W3C FTW!
> 
> > One really smart trick you can do is this: attempt to decode as UTF-8
> > (which is nice and strict and will fail noisily for pretty much
> > anything that isn't either UTF-8 or ASCII, a UTF-8 subset). If
> > decoding fails, assume ISO-8859-1 which will decode absolutely
> > anything without ever throwing an error (although if the content isn't
> > ISO-8859-1 you'll end up with garbage). I tend to call this the Flickr
> > trick, because of the lovely big letters here:
> > http://www.flickr.com/services/api/misc.encoding.html
> 
> Yeah, fooling around with it that's been pretty much the conclusion
> I've come to.
> 
> I'd like to wait for Malcolm to weigh in since he wrote much of this
> code (and I think he's on his way back to AU so it might be a bit
> before he's over jetlag and back on the list), but I think this is the
> right approach:
> 
> * Try to decode the form data using ``settings.DEFAULT_CHARSET``. In
> most cases this'll be UTF-8, but when it's not we can try to assume
> that data's being POSTed back in the same encoding we're serving it up
> in.
> * If that fails and ``DEFAULT_CHARSET`` isn't UTF-8, try UTF-8.
> That'll deal with relatively sane automated clients (i.e.
> ``WWW::Mechanize`` and all its clones).
> * If that fails, use ISO-WTFBBQNAMBLA-1.
> 
> How's that sound?
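
For reference, the three-step chain proposed in the quoted text can be sketched roughly as follows. This is only an illustration: `decode_with_fallbacks` is a hypothetical helper, not part of Django or of the patch under discussion, and the joke name in the last bullet is read here as plain ISO-8859-1.

```python
import codecs


def decode_with_fallbacks(raw: bytes, default_charset: str = "utf-8") -> str:
    """Hypothetical helper: try default_charset, then UTF-8, then ISO-8859-1."""
    candidates = [default_charset]
    # Only add UTF-8 as a second attempt when it is not already the default.
    if codecs.lookup(default_charset).name != "utf-8":
        candidates.append("utf-8")
    for charset in candidates:
        try:
            return raw.decode(charset)
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 maps every byte value to a code point, so this never raises.
    return raw.decode("iso-8859-1")
```

Note that the final step always "succeeds", whether or not the bytes were really ISO-8859-1.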


I dislike it. Various fallbacks were discussed in past threads and I
read them all again when doing that work. They all sounded flawed for
the same reason: Different people have different expectations about how
fallbacks should work (after all, if you're cursed to receive data from
Windows clients, cp-1252 is the best first fallback). Setting false
expectations feels wrong here.

We make one attempt at a default -- using the commonly applicable case
that Django will have generated the form. We provide a one-line way to
change the encoding if this isn't the case. Note that you can set
request.encoding *at any point* in the process, even after you've
already tried to access GET and POST. All it does is reset those
properties and redecode the data with your new encoding setting the
next time you try to access them. So it's not a burden in practice.
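
A minimal sketch of that reset-and-redecode behaviour, using only the standard library. `SketchRequest` is a hypothetical stand-in written for illustration; it is not Django's request class and does not reproduce its internals, only the idea that assigning to `encoding` drops the cached data so the next access decodes again.

```python
from urllib.parse import parse_qs


class SketchRequest:
    """Hypothetical stand-in for a request object (not Django's own class)."""

    def __init__(self, raw_query: str, encoding: str = "utf-8"):
        self._raw_query = raw_query  # percent-encoded query string as received
        self._encoding = encoding
        self._get = None             # cache of decoded parameters

    @property
    def encoding(self) -> str:
        return self._encoding

    @encoding.setter
    def encoding(self, value: str) -> None:
        self._encoding = value
        self._get = None             # drop the cache; redecode on next access

    @property
    def GET(self) -> dict:
        if self._get is None:
            self._get = parse_qs(
                self._raw_query, encoding=self._encoding, errors="replace"
            )
        return self._get


request = SketchRequest("name=caf%E9")  # byte 0xE9 is "e acute" in ISO-8859-1
first = request.GET["name"][0]          # decoded as UTF-8: 0xE9 alone is invalid
request.encoding = "iso-8859-1"         # changed after GET was already accessed
second = request.GET["name"][0]         # decoded again with the new encoding
```

Here `first` contains a replacement character where the invalid byte was, while `second` is the intended word, even though the encoding was changed only after the first access.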

Receiving genuinely bad/invalid data is not uncommon either, as is
obvious as soon as you start running a really anal comment sanitisation
feature or looking at uploads from corporate systems. Trying to silently
change the encoding just to minimise the errors isn't a solution here --
you'll often end up in the wrong encoding altogether, when you should
have been ignoring bad data (because things like cp-1252 and iso-8859-1
understand more single byte values in all contexts than UTF-8, for
example). Change the encoding deliberately or not at all.
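
To make the point concrete, here is a small standard-library illustration. The byte strings are invented sample data, not taken from any real request.

```python
# Curly quotes as cp1252 encodes them; these bytes are not valid UTF-8.
raw = b"\x93quoted\x94"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 accepts the bytes without complaint but yields C1 control
# characters, while cp1252 yields the quotation marks the sender meant.
as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("cp1252")

# Valid UTF-8 followed by one stray byte.
mostly_utf8 = "caf\u00e9".encode("utf-8") + b"\xff"

# A silent switch to ISO-8859-1 garbles the valid part as well.
silent_fallback = mostly_utf8.decode("iso-8859-1")

# Staying in UTF-8 and replacing the bad byte keeps the valid text intact.
deliberate = mostly_utf8.decode("utf-8", errors="replace")
```

Neither single-byte decode raises an error, so nothing signals that `as_latin1` and `silent_fallback` are wrong.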

I'm -1 on the proposed patch and the change in general.

Regards,
Malcolm

-- 
If Barbie is so popular, why do you have to buy her friends? 
http://www.pointy-stick.com/blog/


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---
