Re: Unicode branch - last call for comments(?)

Malcolm Tredinnick Tue, 03 Jul 2007 19:49:13 -0700

On Wed, 2007-07-04 at 10:19 +0800, Russell Keith-Magee wrote:
> On 7/3/07, Malcolm Tredinnick <[EMAIL PROTECTED]> wrote:
> >
> > Russell, Adrian, Jacob: anybody want to read through a couple of
> > thousand lines of diffs for review or do you feel like trusting me
> > and/or the userbase (diffstat tells me the changes are: "190 files
> > changed, 2880 insertions(+), 1643 deletions(-)", although 11 files are
> > doc files and 60 files are test files, so the core changes aren't quite
> > that bad).
> 
> Oh my, Grandma, what a big patch you have... :-)
> 
> I've had a quick run through the diffs. While there are lot of
> changes, the only technical risk I can see is that there might be a
> few places where a str<->unicode conversion has been
> missed/misapplied.


Unsurprisingly, that's been the main area where early adopters have been
reporting bugs, too. I'm fairly confident we've nailed 98% of them,
since more than once I have literally done a grep for str() usage and
audited them all and similarly for *_unicode() to check for
inappropriate conversions.

The other non-trivial risk that doesn't show up from just reading the
patch is Python2.3 compatibility. There is a bug in 2.3 such that

        u'%s' % foo
        
does not call foo.__unicode__() when "foo" is an object with a
__unicode__ method; it calls foo.__str__() instead. So we have to write
that as 

        u'%s' % unicode(foo)
        
everywhere for class instances "foo", which took a bit of tracking down.
I think I've got them all and test against 2.3 routinely, but it's a
pain to have to remember.

> I suspect we could spend a lifetime auditing the patches and still not
> catch all the edge cases. IMHO, this is a time where we should be
> trusting the test suite, and adding tests where we have any doubt that
> the test suite is inadequate.

As those problems show up, they've generally been trivial to fix, too.

By the way, the principle is to move towards Unicode objects, rather
than the other direction, if you see any tickets reported like that.
Some early testers attached patches to unicode tickets that converted
things back to bytestrings to fix a problem (such as comparison issues)
which is almost always the wrong thing to do.

> Most of the common code paths for ASCII data should already be covered
> by the test suite. You appear to have added tests in all the areas I
> would have flagged (in particular query strings and serialization).
> 
> However, we should probably make sure that there are checks for
> situations where the user hasn't done any unicode preparations. The
> 'str' modeltest has been modified to demonstrate correct usage of
> __str__() in the unicode world, but IMHO there should be a regression
> test to validate that models with old __str__() definitions will still
> work for ASCII-only data.

True. I'll add that. It gets tested a lot in my offline tests, so it
works (I have some unported apps I test against), but a regression test
is needed, you're right.

> 
> The only other gotcha that caught my eye was a few places in the wsgi
> and modpython handlers where you have used str(), rather than
> smart_str(). Is this intentional?

Yes. The data arriving at those points must (in the RFC sense) already
be bytestrings and ASCII encoded. By using str() there, it is both
faster -- str is a builtin, smart_str is more complex Python code -- and
a sanity check because it will fail explosively if we have screwed up
and let Unicode slip through. That's actually caught a couple of
problems in the past.

Thanks for taking the time to look over this, Russ.

Cheers,
Malcolm

-- 
Despite the cost of living, have you noticed how popular it remains? 
http://www.pointy-stick.com/blog/


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Re: Unicode branch - last call for comments(?)

Reply via email to