I added support for the new character set detection APIs in ICU 3.6.
These APIs are documented here:
http://icu.sourceforge.net/apiref/icu4c/ucsdet_8h.html

The PyICU wrappers include two Python classes, CharsetDetector and
CharsetMatch, which wrap the ICU APIs:

    detector = CharsetDetector()
    detector = CharsetDetector(string)   # string must be a Python str
    detector = CharsetDetector(string, declaredEncoding)

    match = detector.detect()
    matches = detector.detectAll()       # returns a tuple of all matches

    detector.setText(string)             # string must be a Python str
    detector.setDeclaredEncoding(encoding)
    detector.enableInputFilter(bool)
    bool = detector.isInputFilterEnabled()
    stringEnumeration = detector.getAllDetectableCharsets()

    string = match.getName()
    number = match.getConfidence()
    string = match.getLanguage()
    string = unicode(match)              # returns a unicode string

In other words, a simple way to take attachment or feed data and convert it
to Unicode is:

    >>> unicode(CharsetDetector(data).detect())
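
The one-liner above can be wrapped in a small helper. This is only a
sketch under a few assumptions that are not part of the announcement: it
targets Python 3 (where str() plays the role of unicode() and the input is
a bytes object), it assumes the module is importable as "icu", and the
names to_text and fallback are made up for illustration. If PyICU is not
installed, or no match comes back, it falls back to a plain decode.

```python
try:
    # Assumption: PyICU is installed and exposes its classes in the "icu" module.
    from icu import CharsetDetector
except ImportError:
    CharsetDetector = None


def to_text(data, fallback="utf-8"):
    """Decode a bytes object to text (hypothetical helper, not part of PyICU).

    Uses ICU charset detection when PyICU is available; otherwise, or when
    no match is returned, decodes with the fallback codec and replaces any
    bytes that cannot be decoded.
    """
    if CharsetDetector is not None:
        match = CharsetDetector(data).detect()
        if match is not None:
            # Converting the match to a string yields the decoded text.
            return str(match)
    return data.decode(fallback, errors="replace")


if __name__ == "__main__":
    print(to_text(b"This is a plain English sentence taken from a feed."))
```

Detection is a statistical guess, so for short inputs it is worth checking
match.getConfidence() before trusting the result.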
Andi..