Package: python-unac
Version: 1.7.0-1
Priority: wishlist

All libunac does is run a decomposition filter on the Unicode strings passed in. This can be done natively in Python without resort to a third-party C library, in a trivial amount of code.
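As a rough illustration of what "decomposition filter" means here, the following sketch assumes Python 3, where str is already Unicode text. The helper name strip_accents is hypothetical and is not part of libunac or python-unac; it only shows NFKD decomposition followed by removal of combining marks using the standard unicodedata module.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: decompose text, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# NFKD splits U+00E9 (e with acute) into 'e' plus the combining acute U+0301.
decomposed = unicodedata.normalize("NFKD", "\u00e9")
print([hex(ord(ch)) for ch in decomposed])  # ['0x65', '0x301']

# Dropping the combining marks leaves the base letters.
print(strip_accents("\u00e9t\u00e9"))  # ete
```

This uses only Python's bundled Unicode data, so it will not reproduce any non-standard decompositions.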
Additionally, libunac hardcodes the Unicode Consortium data into itself, and furthermore changes that data based on non-standard proposals from its users.

The libunac Python wrapper cannot properly handle subclasses of str or unicode, since it compares the name of the class to 'str' or 'unicode' rather than checking the types per se. (This is also unsafe in other ways, since someone might have named some other class 'str'.)

I've attached a Python module that should be basically compatible with python-unac, except that Python's Unicode data does not include the non-standard decomposition forms present in libunac, and that it works properly with subclasses of str or unicode.

There are two small differences: I made the default encoding utf-8 instead of nothing (with which it always returned nothing), and I let the user pass in alternate error-handling behavior if they want. I would argue both of these make it better, but the former is technically an API change.

-- 
Joe Wreschnig <[EMAIL PROTECTED]>
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# By Joe Wreschnig, to show how stupid python-unac is.
# Released into the public domain.
"""unac - Remove accents from a string
This module contains one function. It removes accents and otherwise
decomposes Unicode characters within a Python str or unicode object.
"""
import unicodedata


def _notcomb(char):
    # True for characters that are not combining marks.
    return not unicodedata.combining(char)


def unac_string(text, charset='utf-8', errors='strict'):
    """Unaccent a string.

    Pass in a unicode or a str object. For str objects the second
    argument specifies the encoding of the string (default 'utf-8'),
    and the third the error handling used when decoding and encoding.
    This function returns an unaccented unicode or string object,
    depending on what was passed in.
    """
    was_str = False
    if isinstance(text, str):
        was_str = True
        text = text.decode(charset, errors)
    # Decompose, then drop the combining marks left behind.
    text = unicodedata.normalize("NFKD", text)
    # Python 2: filter() on a unicode object returns a unicode object.
    text = filter(_notcomb, text)
    if was_str:
        text = text.encode(charset, errors)
    return text


if __name__ == "__main__":
    def assert_equal(a, b, enc='utf-8'):
        assert unac_string(a, enc) == b, "%r != %r" % (unac_string(a, enc), b)

    assert_equal("test", "test")
    assert_equal("\xe9t\xe9", "ete", 'latin1')
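The subclass point from the report can be illustrated with a small sketch. It assumes Python 3 (so only str is tested), and the name-based check below is modelled on the report's description of the wrapper, not copied from its source. Title, Impostor, is_text_by_name and is_text_by_type are hypothetical names used only for this example.

```python
class Title(str):
    """Hypothetical str subclass, used only for this illustration."""


def is_text_by_name(obj):
    # Fragile: compares the class name instead of the type itself.
    return type(obj).__name__ in ("str", "unicode")


def is_text_by_type(obj):
    # Robust: isinstance() accepts subclasses and ignores class names.
    return isinstance(obj, str)


title = Title("\u00e9t\u00e9")
print(is_text_by_name(title))  # False: the class is named 'Title'
print(is_text_by_type(title))  # True: Title is a subclass of str

# An unrelated class that merely happens to be named 'str'.
Impostor = type("str", (object,), {})
print(is_text_by_name(Impostor()))  # True: fooled by the name
print(is_text_by_type(Impostor()))  # False
```

The attached module takes the isinstance() route, which is why it copes with subclasses.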

