Bug#442392: python-unac -- code is of poor quality and can be done as easily in native Python

Lukáš Lalinský Sat, 15 Sep 2007 12:39:46 -0700

On Sat, 2007-09-15 at 11:17 -0700, Joe Wreschnig wrote:
> All libunac does is run a decomposition filter on the unicode strings
> passed in. This can be done natively in Python without resort to a
> third-party C library, in a trivial amount of code.
> 
> Additionally, libunac hardcodes the Unicode consortium data into itself
> and furthermore changes that data based on non-standard proposals from
> its users.
> 
> The libunac Python wrapper cannot properly handle subclasses of string
> or unicode, since it compares the name of the class to 'str' or
> 'unicode' rather than checking the types per se (Which is also unsafe in
> other ways, since someone might have named some other class 'str'.)
> 
> I've attached a Python module that should be basically compatible with
> python-unac, except for the fact that Python's Unicode data does not
> include the non-standard decomposition forms present in libunac, and it
> works properly with subclasses of str or unicode.
> 
> There are two small differences; I made the default encoding utf-8
> instead of nothing (which always returned nothing), and I let the user
> pass in alternate error handling behavior if they want. I would argue
> both of these make it better, but the former is technically an API
> change.
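
To make the subclass point in the quoted report concrete, here is a minimal sketch of the difference between comparing class names and checking types. It is not the actual wrapper code: TaggedText, Impostor, accepts_by_name and accepts_by_type are hypothetical names used only for illustration, and the sketch assumes Python 3, where str is the Unicode string type.

```python
class TaggedText(str):
    """Hypothetical str subclass, used only for illustration."""


def accepts_by_name(value):
    # Compare the class name, as the report describes the wrapper doing.
    return type(value).__name__ in ("str", "unicode")


def accepts_by_type(value):
    # Check the type itself; this also accepts subclasses of str.
    return isinstance(value, str)


# An unrelated class that merely happens to be named "str".
Impostor = type("str", (object,), {})

print(accepts_by_name(TaggedText("abc")))  # False: the name is "TaggedText"
print(accepts_by_type(TaggedText("abc")))  # True
print(accepts_by_name(Impostor()))         # True: only the name matches
print(accepts_by_type(Impostor()))         # False
```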


There is one functional difference: NFKD normalization and the filtering
used in libunac do not produce the same result (the sample is some
random text from http://www.bbc.co.uk/vietnamese/):

>>> print unac.unac_string(u'Khoảng một triệu người châu Phi đang chịu
ảnh hưởng của lũ lụt do mưa lớn gây mất mùa, vỡ đê và hàng chục người
chết')
Khoang mot trieu nguoi chau Phi dang chiu anh huong cua lu lut do mua
lon gay mat mua, vo de va hang chuc nguoi chet
>>> print unac2.unac_string(u'Khoảng một triệu người châu Phi đang chịu
ảnh hưởng của lũ lụt do mưa lớn gây mất mùa, vỡ đê và hàng chục người
chết')
Khoang mot trieu nguoi chau Phi đang chiu anh huong cua lu lut do mua
lon gay mat mua, vo đe va hang chuc nguoi chet

(notice the 'd' from libunac and the 'đ' from NFKD)
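
The reason for the leftover 'đ' can be shown with a small sketch. This is not the attached module, only my assumption of what an NFKD-based filter looks like; strip_accents is a hypothetical name and the code uses Python 3 syntax. U+0111 (đ) has no decomposition in the Unicode data, so normalization leaves it alone, while a letter such as U+1EA3 (ả) decomposes into a base letter plus a combining mark that the filter can drop.

```python
import unicodedata


def strip_accents(text):
    # Decompose, then drop every combining mark (non-zero combining class).
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Khoảng một triệu người"))  # Khoang mot trieu nguoi
print(strip_accents("đang"))                    # đang (the stroke survives)

# U+0111 (d with stroke) has no decomposition mapping at all ...
print(repr(unicodedata.decomposition("\u0111")))  # ''
# ... while U+1EA3 (a with hook above) maps to 'a' + U+0309.
print(unicodedata.decomposition("\u1ea3"))        # 0061 0309
```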

But there is another, more important, difference: performance. Normalizing
Unicode at run time is slow, and so is filtering the result in Python. I
use the code in a Lucene index builder with a custom unaccenting analyzer,
and the Python code would increase the running time significantly:

>>> timeit.Timer("unac.unac_string(u'Khoảng một triệu người châu Phi
đang chịu ảnh hưởng của lũ lụt do mưa lớn gây mất mùa, vỡ đê và hàng
chục người chết')", "from unac import unac").timeit(100000)
1.7533831596374512
>>> timeit.Timer("unac.unac_string(u'Khoảng một triệu người châu Phi
đang chịu ảnh hưởng của lũ lụt do mưa lớn gây mất mùa, vỡ đê và hàng
chục người chết')", "from unac2 import unac").timeit(100000)
19.089791059494019
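
For anyone who wants to repeat the measurement without either module installed, the following is a rough sketch of timing a pure-Python NFKD filter on its own. The strip_accents function is the same hypothetical filter as assumed above, not the attached module, the sample is a shortened version of the text used here, and the code uses Python 3 syntax. The figures depend entirely on the machine and the Python version, so no expected output is given.

```python
import timeit
import unicodedata

# Shortened sample, written with escapes so the source file stays ASCII.
SAMPLE = ("Kho\u1ea3ng m\u1ed9t tri\u1ec7u ng\u01b0\u1eddi ch\u00e2u Phi "
          "\u0111ang ch\u1ecbu \u1ea3nh h\u01b0\u1edfng c\u1ee7a l\u0169 l\u1ee5t")


def strip_accents(text):
    # Decompose, then drop every combining mark (non-zero combining class).
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


runs = 10000
# timeit.Timer accepts a callable as the statement to time.
elapsed = timeit.Timer(lambda: strip_accents(SAMPLE)).timeit(runs)
print("%d calls: %.3f s (%.1f us per call)"
      % (runs, elapsed, elapsed / runs * 1e6))
```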

I know that the wrapper code is not nice and could be done much better,
but neither the functionality nor the speed of the code based on Python's
unicodedata module is comparable.

Lukas
