Bug#442392: python-unac -- code is of poor quality and can be done as easily in native Python

Joe Wreschnig Sat, 15 Sep 2007 11:26:08 -0700

Package: python-unac
Version: 1.7.0-1
Priority: wishlist

All libunac does is run a decomposition filter on the Unicode strings
passed in. This can be done natively in Python, without resorting to a
third-party C library, in a trivial amount of code.
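
As a rough illustration of that claim, the decomposition filter can be sketched with nothing but the standard unicodedata module. The sketch assumes Python 3, where str is always Unicode, and the name strip_accents is made up for this example; it is not part of python-unac.

```python
import unicodedata

def strip_accents(text):
    """Decompose text with NFKD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    # combining() returns 0 for base characters and non-zero for marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("\u00e9t\u00e9"))  # prints "ete"
```

NFKD is a compatibility decomposition, so it also expands characters such as the "fi" ligature (U+FB01) into plain letters.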


Additionally, libunac hardcodes the Unicode consortium data into itself
and furthermore changes that data based on non-standard proposals from
its users.

The libunac Python wrapper cannot properly handle subclasses of str or
unicode, since it compares the name of the class to 'str' or 'unicode'
rather than checking the types themselves. (This is also unsafe in
other ways, since someone might have named some other class 'str'.)
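
The difference between the two checks can be shown with a small sketch. The wrapper's actual source is not reproduced in this report, so is_text_by_name is only a stand-in for the kind of check described above; Title and Impostor are hypothetical classes, and the code assumes Python 3 (where there is no separate unicode type).

```python
class Title(str):
    """A hypothetical str subclass, for illustration only."""

def is_text_by_name(obj):
    # Stand-in for the criticized approach: compare the class name.
    return type(obj).__name__ in ("str", "unicode")

def is_text_by_type(obj):
    # Check the type itself; subclasses of str are accepted.
    return isinstance(obj, str)

# An unrelated class that merely happens to be called "str".
Impostor = type("str", (object,), {})

title = Title("caf\u00e9")
print(is_text_by_name(title), is_text_by_type(title))            # False True
print(is_text_by_name(Impostor()), is_text_by_type(Impostor()))  # True False
```

The name comparison rejects a genuine string subclass and accepts an unrelated object, while isinstance gets both cases right.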

I've attached a Python module that should be basically compatible with
python-unac, except for the fact that Python's Unicode data does not
include the non-standard decomposition forms present in libunac, and it
works properly with subclasses of str or unicode.
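
For reference, the standard decomposition data that Python ships can be inspected directly. This is a small sketch assuming Python 3. Which extra mappings libunac adds is not listed in this report, so U+00F8 below is only an example of a character that has no decomposition in the standard data, not a statement about libunac.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this Python.
print(unicodedata.unidata_version)

# U+00E9 (e with acute) decomposes into "e" plus a combining acute accent.
print(unicodedata.decomposition("\u00e9"))        # prints "0065 0301"

# U+00F8 (o with stroke) has no decomposition, so NFKD leaves it alone.
print(repr(unicodedata.decomposition("\u00f8")))  # prints ''
print(unicodedata.normalize("NFKD", "\u00f8") == "\u00f8")  # prints True
```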

There are two small differences: the default encoding is utf-8 rather
than unset (an unset encoding always returned nothing), and the user
can pass in alternate error handling behavior if they want. I would
argue both of these make it better, but the former is technically an
API change.
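
The effect of the error handling argument can be sketched with the standard codec error handlers that decode and encode accept. The example assumes Python 3 and uses a bytes literal for the input; the variable names are made up for the illustration.

```python
# "ete" with acute accents, encoded as Latin-1; not valid UTF-8.
data = b"\xe9t\xe9"

# 'strict' (the default) raises on undecodable bytes.
try:
    data.decode("utf-8", "strict")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'replace' substitutes U+FFFD for each bad byte; 'ignore' drops them.
replaced = data.decode("utf-8", "replace")
ignored = data.decode("utf-8", "ignore")

print(strict_ok, repr(replaced), repr(ignored))
```

With 'strict' the caller learns that the declared encoding was wrong; the other two handlers trade that signal for a result that always comes back.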
-- 
Joe Wreschnig <[EMAIL PROTECTED]>

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# By Joe Wreschnig, to show how stupid python-unac is.
# Released into the public domain.

"""unac - Remove accents from a string

This module contains one function. It removes accents and otherwise
decomposes Unicode characters within a Python str or unicode object.
"""

import unicodedata

def _notcomb(char):
    # True for characters that are not combining marks (class 0).
    return not unicodedata.combining(char)

def unac_string(text, charset='utf-8', errors='strict'):
    """Unaccent a string.

    Pass in a unicode or a str object. For str objects, the optional
    charset argument gives the encoding of the string (utf-8 by
    default), and errors is passed through to decode and encode.
    This function returns an unaccented unicode or str object,
    depending on what was passed in.
    """
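    was_str = False
    if isinstance(text, str):
        # Byte strings (and subclasses of str) are decoded first.
        was_str = True
        text = text.decode(charset, errors)
    # NFKD splits each accented character into base + combining marks.
    text = unicodedata.normalize("NFKD", text)
    # Drop the combining marks; in Python 2, filter() on a unicode
    # object returns a unicode object.
    text = filter(_notcomb, text)
    if was_str:
        # Re-encode so that a str input yields a str result.
        text = text.encode(charset, errors)
    return text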

if __name__ == "__main__":
    def assert_equal(a, b, enc='utf-8'):
        assert unac_string(a, enc) == b, "%r != %r" % (unac_string(a, enc), b)
    assert_equal("test", "test")
    assert_equal("\xe9t\xe9", "ete", 'latin1')
