* Stefan Behnel:

> The nice thing about Py_UNICODE is that it basically gives you native
> Unicode code points directly, without needing to decode UTF-8 byte
> runs and the like. In Cython, it allows you to do things like this:
>
>     def test_for_those_characters(unicode s):
>         for c in s:
>             # warning: randomly chosen Unicode escapes ahead
>             if c in u"\u0356\u1012\u3359\u4567":
>                 return True
>         else:
>             return False
>
> The loop runs in plain C, using the somewhat obvious implementation
> with a loop over Py_UNICODE characters and a switch statement for the
> comparison. This would look a *lot* more ugly with UTF-8 encoded byte
> strings.

Not really, because UTF-8 is quite search-friendly. (The if would have
to invoke a memmem()-like primitive.) Random subscripts are
problematic, though.

However, why would one want to write loops like the above? Don't you
have to take combining characters (comprising multiple codepoints)
into account most of the time when you look at individual characters?
Then UTF-32 does not offer much of a simplification.

-- 
Florian Weimer                <fwei...@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
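
To make the two points above concrete, here is a minimal Python sketch (plain Python, not Cython, and not taken from any existing implementation). The names contains_any_utf8 and NEEDLES are hypothetical, and the bytes `in` operator merely stands in for a memmem()-like primitive. The first part searches UTF-8 bytes directly, which should be safe on valid UTF-8 because an encoded code point can only match at a character boundary. The second part shows a combining sequence that a per-code-point loop would treat as two separate characters.

```python
import unicodedata

# The same arbitrary code points as in the quoted Cython example.
NEEDLES = "\u0356\u1012\u3359\u4567"


def contains_any_utf8(haystack, needles=NEEDLES):
    """Return True if the UTF-8 encoded haystack contains any needle.

    Each needle is encoded to UTF-8 and located with a plain byte
    substring search; no decoding of the haystack is needed.
    """
    return any(ch.encode("utf-8") in haystack for ch in needles)


# Combining characters: one user-perceived character, two code points.
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single U+00E9
```

A loop over individual code points, whatever the storage width, sees `decomposed` as two items, so a test for U+00E9 misses it unless the text is normalized first.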