On Tue, 1 Nov 2011, Robert Muir wrote:
> Well, as an alternative to them committing the EBCDIC detection, perhaps we could look at the charset detection APIs and propose some API additions so that users (like Tika) can plug in custom detectors?

In theory it should be pluggable, but I seem to recall we needed to tweak a few core bits to get the detector working (around negative matches for control characters).

Looking at the svn version history, the ICU4J team don't appear to have done any work on their charset detectors in several years. From the lack of responses when I asked on their list about extending them, I fear there may not be anyone left in their project who's interested in charset detectors any more. I'd love to be proved wrong, though. If anyone has any personal contacts on the project, could they prod them about it?

Nick
