I agree that it is not an ideal user experience to raise an exception while decoding POST data. However, I think the alternatives are arguably just as bad, or even worse.
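To make the trade-off concrete, here is a minimal sketch of the two behaviours in plain Python. It is not Django's actual request-parsing code. The sample bytes and variable names are made up for illustration, and it assumes a client that claims UTF-8 but actually sends Latin-1.

```python
# Hypothetical payload: "café" encoded as Latin-1, although the request
# claims to be UTF-8 (an assumed scenario, not taken from a real report).
raw = "café".encode("latin-1")  # b'caf\xe9'

# Forced decoding: the invalid byte is silently replaced with U+FFFD,
# so the original character is lost and nothing signals the problem.
forced = raw.decode("utf-8", errors="replace")

# Strict decoding: the same bytes raise UnicodeDecodeError instead,
# which makes the bad input visible to the caller.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

After forced decoding, the application only ever sees the replacement character and cannot recover the original byte.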
I consider silently replacing characters in user data to avoid an exception while decoding to be a silent data loss / corruption issue. Depending on the actual data being submitted and the web site in question, this may be a non-issue, it may be inconvenient but acceptable, it may be critical, or it may cause errors down the line (like the `DatabaseError` exception). However, neither users nor developers currently have any choice in the matter, and most won't even know it is happening.

If simply switching to `errors='strict'` is not acceptable, what do people think about adding a `dirty` or `raw` property to `request.FILES` and `request.POST` when a decoding error occurs, and storing the original byte strings in there while keeping the forcibly decoded strings in `request.FILES` and `request.POST` as they are now?

This would give developers a chance to deal with this in a way that suits them. They could use middleware to re-raise the silenced exception, try to decode with common alternative encodings before falling back to forcibly decoding (or raising an exception), return a form validation error in their view, or still use the forcibly decoded data but warn users that their data was altered.

If we were to go down this route, I would still prefer to see Django ship with a middleware that re-raises the `UnicodeDecodeError` exception, enabled by default, simply because this issue involves silent changes to user-supplied data. If it causes an error for anyone (and it normally shouldn't, as this appears to be an edge case), the issue will be easily diagnosed and the developer can then choose whether to silently replace data, attempt alternative decoding, or display an error to users.

The Django docs for file uploads say:

> The content-type header uploaded with the file (e.g. text/plain or
> application/pdf). Like any data supplied by the user, you shouldn't trust
> that the uploaded file is actually this type. You'll still need to validate
> that the file contains the content that the content-type header claims --
> "trust but verify."

I think Django should follow the "trust but verify" principle when decoding all POST data. It's true that we can't accurately detect what character encoding is being used for supplied data, and this is precisely why we shouldn't forcibly decode POST data. We can at least inform developers (if not users) when the specified character encoding is proven to be wrong, and allow them to choose how to handle it.

Cheers.
Tai.

On 30/03/2012, at 5:01 AM, Waylan Limberg wrote:

> I'm not sure which approach is the way to go here. However, forcing
> users to deal with encodings is generally a bad idea. Besides, you
> never can trust a browser to give you what it says it is giving you.
> In other words, the user may not be able to get the browser to send
> the correct encoding anyway. For those reasons I'm leaning toward #1.
> Of course, that begs the question: should Django be doing a better job
> escaping the data used to build the SQL statement? I guess we won't
> know unless we get the bad SQL statement. Which takes us back to #1.
