I added support for the new character set detection APIs in ICU 3.6.
These APIs are documented here:
    http://icu.sourceforge.net/apiref/icu4c/ucsdet_8h.html

The PyICU wrappers include two Python classes, CharsetDetector and CharsetMatch, which wrap the ICU APIs:

  detector = CharsetDetector()
  detector = CharsetDetector(string)   # string must be a Python str
  detector = CharsetDetector(string, declaredEncoding)

  match = detector.detect()
  matches = detector.detectAll()       # returns a tuple of all matches

  detector.setText(string)             # string must be a Python str
  detector.setDeclaredEncoding(encoding)

  detector.enableInputFilter(bool)
  bool = detector.isInputFilterEnabled()

  stringEnumeration = detector.getAllDetectableCharsets()

  string = match.getName()
  number = match.getConfidence()
  string = match.getLanguage()

  string = unicode(match)             # returns a unicode string
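
To illustrate how the CharsetMatch accessors listed above fit together, here is a minimal sketch that summarizes the result of detectAll(). The helper name summarize_matches and its min_confidence parameter are hypothetical and not part of PyICU; the sketch only assumes that each match exposes getName(), getConfidence() and getLanguage(), and that confidence is an integer from 0 to 100 as described in the ICU documentation.

```python
def summarize_matches(matches, min_confidence=10):
    """Return (name, confidence, language) tuples for confident matches.

    `matches` is any iterable of objects exposing getName(),
    getConfidence() and getLanguage(), such as the tuple returned by
    detector.detectAll(). This helper is illustrative, not a PyICU API.
    """
    rows = []
    for match in matches:
        confidence = match.getConfidence()
        # Skip candidates below the (hypothetical) confidence threshold.
        if confidence >= min_confidence:
            rows.append((match.getName(), confidence, match.getLanguage()))
    # Sort by descending confidence so the result does not depend on
    # the order in which the matches were supplied.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

It would presumably be called as summarize_matches(CharsetDetector(data).detectAll()).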

A simple way to take attachment or feed data and convert it to Unicode is therefore:

  >>> unicode(CharsetDetector(data).detect())
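
For code that cannot rely on PyICU being installed, the one-liner above could be wrapped roughly as follows. This is only a sketch under stated assumptions: the function name to_text and its fallback parameter are hypothetical, it is written for Python 3 (where it assumes str(match) plays the role of unicode(match)), and it falls back to a plain decode when the icu module is missing or no match is found.

```python
def to_text(data, declared_encoding=None, fallback="utf-8"):
    """Decode bytes to text, using PyICU detection when it is available.

    Hypothetical helper, not part of PyICU. `declared_encoding` is passed
    to the detector as a hint; `fallback` is used when PyICU is not
    installed or detection yields no match.
    """
    try:
        from icu import CharsetDetector  # provided by PyICU, optional here
    except ImportError:
        CharsetDetector = None

    if CharsetDetector is not None:
        if declared_encoding is None:
            detector = CharsetDetector(data)
        else:
            detector = CharsetDetector(data, declared_encoding)
        match = detector.detect()
        if match is not None:
            # Assumption: under Python 3, str(match) returns the decoded text.
            return str(match)

    # No detector or no match: decode with the fallback encoding and
    # replace undecodable bytes rather than raising.
    return data.decode(fallback, errors="replace")
```

The fallback keeps callers working without ICU, at the cost of silently replacing bytes that are not valid in the fallback encoding.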

Andi..
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Open Source Applications Foundation "chandler-dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/chandler-dev
