On Tue, 29 Nov 2005 10:14:26 +0000, Neil Hodgson wrote:

> David Siroky:
>
>> output = ''
>
> I suspect you really want "output = u''" here.
>
>> for c in line:
>>     if not unicodedata.combining(c):
>>         output += c
>
> This is creating as many as 50000 new string objects of increasing
> size. To build large strings, some common faster techniques are to
> either create a list of characters and then use join on the list or use
> a cStringIO to accumulate the characters.

That is the answer I wanted, now I'm finally enlightened! :-)

> This is about 10 times faster for me:
>
> def no_diacritics(line):
>     if type(line) != unicode:
>         line = unicode(line, 'utf-8')
>
>     line = unicodedata.normalize('NFKD', line)
>
>     output = []
>     for c in line:
>         if not unicodedata.combining(c):
>             output.append(c)
>     return u''.join(output)
>
> Neil

Thanx!

David
-- 
http://mail.python.org/mailman/listinfo/python-list
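
For anyone reading this thread on Python 3, here is a rough sketch of the same two techniques Neil describes (join on collected characters, and a string buffer). It is an adaptation, not the code from the thread: the `encoding` parameter and the name `no_diacritics_buffer` are my own additions for illustration, and `io.StringIO` stands in for `cStringIO`, which does not exist in Python 3.

```python
import io
import unicodedata


def no_diacritics(line, encoding="utf-8"):
    """Strip combining marks after NFKD decomposition (Python 3 sketch)."""
    # Assumption: byte input is UTF-8 by default, as in the quoted version.
    if isinstance(line, bytes):
        line = line.decode(encoding)
    decomposed = unicodedata.normalize("NFKD", line)
    # Join the kept characters once instead of growing a string with +=.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def no_diacritics_buffer(line, encoding="utf-8"):
    """Same filter, accumulating into an in-memory text buffer."""
    if isinstance(line, bytes):
        line = line.decode(encoding)
    buf = io.StringIO()
    for c in unicodedata.normalize("NFKD", line):
        if not unicodedata.combining(c):
            buf.write(c)
    return buf.getvalue()
```

Both functions should return the same result for the same input; which one is faster probably depends on the interpreter version and input size, so it is worth timing them on real data before picking one.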