Re: Django should not use `force_unicode(..., errors='replace')` when parsing POST data.

Tai Lee Thu, 29 Mar 2012 15:42:53 -0700

I agree that it is not an ideal user experience to raise an exception while 
decoding POST data. However, I think the alternatives are arguably just as bad, 
or even worse.

I consider replacing characters in user data, just to avoid an exception 
while decoding, to be a silent data loss / corruption issue.
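
To make the data loss concrete, here is a small standalone illustration in plain Python 3 (not Django code). It assumes a form value that was really sent as Latin-1 but is decoded as UTF-8; the sample bytes are made up for the example.

```python
# "café" encoded as Latin-1; the trailing byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9"

# errors='replace' never raises: the bad byte is swapped for U+FFFD and the
# original character can no longer be recovered from the decoded string.
lossy = raw.decode("utf-8", errors="replace")

# errors='strict' (the default) surfaces the problem instead.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

After this runs, `lossy` ends in the replacement character rather than "é", and nothing in the result tells the application that anything was changed.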

Depending on the actual data being submitted and the web site in question, this 
may be a non-issue, it may be inconvenient but acceptable, it may be critical, 
and it may cause errors down the line (like the `DatabaseError` exception).

However, neither users nor developers currently have any choice in the matter, 
and most won't even know it is happening.

If simply switching to `errors='strict'` is not acceptable, what do people 
think about adding a `dirty` or `raw` property to `request.FILES` and 
`request.POST` when a decoding error occurs, and storing the original byte 
strings in there while keeping the forcibly decoded strings in `request.FILES` 
and `request.POST` as they are now?

This would give developers a chance to deal with this in a way that suits them. 
They could use middleware to re-raise the silenced exception, or they could try 
to decode with common alternative encodings before falling back to forcibly 
decoding (or raising an exception), or they could return a form validation 
error in their view, or they could still use the forcibly decoded data but warn 
users that their data was altered.
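
As a rough sketch of the "try common alternative encodings first" option, something like the following could work on the original byte strings. The helper name `decode_with_fallbacks` and the choice of `cp1252` as the fallback are my own assumptions for illustration, not anything Django provides.

```python
def decode_with_fallbacks(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used, was_lossy) for a byte string.

    Hypothetical helper: tries each encoding strictly, in order, and only
    falls back to lossy replacement if every one of them fails.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc, False
        except UnicodeDecodeError:
            continue
    # Last resort: same behaviour as today, but the caller is told about it.
    return raw.decode(encodings[0], errors="replace"), encodings[0], True
```

The third element of the result is what lets a view decide between accepting the data, returning a validation error, or warning the user that their input was altered.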

If we were to go down this route, I would still prefer to see Django ship with 
such a middleware that re-raises the `UnicodeDecodeError` exception and have it 
enabled by default, simply because this issue involves silent changes to user 
supplied data. If it causes an error for anyone, and it shouldn't normally as 
this appears to be an edge case, the issue will be easily diagnosed and the 
developer can then choose if they want to silently replace data, or attempt 
alternative decoding, or display an error to users.
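
A minimal sketch of what such a middleware might look like, assuming the `raw` property proposed above existed. To be clear, `request.POST.raw` is hypothetical (here taken to be a mapping of field name to original bytes, set only when decoding failed), and the class name is made up; only the `process_request` hook and `request.encoding` are things Django has today.

```python
class StrictPostDecodingMiddleware:
    """Re-raise decoding errors that were silenced while parsing POST data."""

    def process_request(self, request):
        # Hypothetical attribute from the proposal: absent or empty when
        # all POST data decoded cleanly.
        raw = getattr(request.POST, "raw", None)
        if not raw:
            return None
        # Assumption: fall back to UTF-8 when no encoding is set on the request.
        encoding = getattr(request, "encoding", None) or "utf-8"
        for name, value in raw.items():
            # Strict decoding raises UnicodeDecodeError for the offending field.
            value.decode(encoding)
        return None
```

Developers who prefer the current behaviour would simply remove this middleware from their settings.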

The Django docs for file uploads say:

> The content-type header uploaded with the file (e.g. text/plain or 
> application/pdf). Like any data supplied by the user, you shouldn't trust 
> that the uploaded file is actually this type. You'll still need to validate 
> that the file contains the content that the content-type header claims -- 
> "trust but verify."

I think Django should follow the "trust but verify" principle when decoding all 
POST data. It's true that we can't accurately detect what character encoding is 
being used for supplied data, and this is precisely why we shouldn't forcibly 
decode POST data. We can at least inform developers (if not users) when the 
specified character encoding is proven to be wrong, and allow them to choose 
how to handle it.

Cheers.
Tai.

On 30/03/2012, at 5:01 AM, Waylan Limberg wrote:

> I'm not sure which approach is the way to go here. However, forcing
> users to deal with encodings is generally a bad idea. Besides, you
> never can trust a browser to give you what it says it is giving you.
> In other words, the user may not be able to get the browser to send
> the correct encoding anyway. For those reasons I'm leaning toward #1.
> Of course, that begs the question: should Django be doing a better job
> escaping the data used to build the SQL statement? I guess we won't
> know unless we get the bad SQL statement. Which takes us back to #1.

