Hi, I'm wondering if there is a way to turn off character set detection when parsing with the AutoDetectParser, or if there is a way to speed up character set detection.
I ran a test that converted 52,717 documents to text. The documents were emails embedded in a .tar file. With character set detection, the test took 220 seconds. Without character set detection, the test took 21 seconds, and only 6% of that time was spent in Tika.

According to a profiler, the following methods took the bulk of the runtime when character set detection was used:

  61.7%  org.apache.tika.parser.txt.CharsetRecog_sbcs$NGramParser.parse
   4.3%  org.apache.tika.parser.txt.CharsetRecog_sbcs$CharsetRecog_IBM420_ar.isLamAlef
   3.1%  org.apache.tika.parser.txt.CharsetRecog_sbcs$CharsetRecog_IBM420_ar.unshapeLamAlef
   2.6%  org.apache.tika.parser.txt.CharsetDetector.setText(byte[])
   2.3%  org.apache.tika.parser.txt.CharsetRecog_mbcs.match

One problem that seems to contribute to this is that every character set is tested for each document, instead of starting with common character sets and stopping as soon as an adequate character set is found.

To turn off character set detection, I created a new class that is essentially the TXTParser with character set detection removed. I then replaced every instance of TXTParser in AutoDetectParser's map of parsers with this text parser, which does not determine the character set.

I'm left with the following questions:

- Can character set detection be sped up?
- If it can't be sped up, is there an easier way to turn it off?
- If neither is possible, could an easier way to turn it off be added?

Thanks for your help,
Paul
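
P.S. To illustrate the "start with common character sets and stop early" idea mentioned above, here is a minimal sketch using only the JDK. It is not Tika's API and not how CharsetDetector works: the class name EarlyStopCharsetGuess, the candidate list, and the fallback are all assumptions made for illustration. It relies on strict decoding rather than n-gram statistics, so it can only distinguish encodings that reject invalid byte sequences.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of Tika.
public class EarlyStopCharsetGuess {

    // Assumed ordering: cheapest and most common candidates first.
    static final List<Charset> CANDIDATES = List.of(
            StandardCharsets.US_ASCII,
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1);

    // Returns the first candidate that decodes the bytes without error.
    static Charset guess(byte[] data) {
        for (Charset cs : CANDIDATES) {
            // REPORT makes the decoder throw instead of silently
            // substituting replacement characters.
            CharsetDecoder decoder = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(data));
                return cs; // stop as soon as one candidate fits
            } catch (CharacterCodingException e) {
                // Not valid in this charset; try the next candidate.
            }
        }
        // Assumed fallback: ISO-8859-1 accepts any byte sequence.
        return StandardCharsets.ISO_8859_1;
    }
}
```

With an ordering like this, a plain ASCII email would be accepted by the first candidate and never reach the more expensive checks.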
