Hi Paul,
Thanks for providing some interesting statistics.
On Aug 12, 2010, at 12:37pm, Paul Jakubik wrote:
I'm wondering if there is a way to turn off character set detection when
parsing with the AutoDetectParser, or if there is a way to speed up
character set detection.
There are ways to make it faster, yes, mostly by changing the
underlying algorithm, which needs to process a significant amount
of text (currently it processes all of the text). Some related issues:
https://issues.apache.org/jira/browse/TIKA-322
https://issues.apache.org/jira/browse/TIKA-369
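As a rough illustration of the "process less text" idea, here is a minimal sketch that caps how many bytes are handed to a detector. The class DetectionSample and the method head are hypothetical names made up for this example; they are not part of Tika, and this is not necessarily what the issues above propose.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tika: bounds the detector's input size.
final class DetectionSample {
    private DetectionSample() {}

    // Returns a copy of at most maxBytes leading bytes of data.
    static byte[] head(byte[] data, int maxBytes) {
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must be >= 0");
        }
        return Arrays.copyOf(data, Math.min(data.length, maxBytes));
    }
}
```

The sample would then be passed to the detector in place of the full content, for example to the CharsetDetector.setText(byte[]) call that shows up in the profile. One caveat: cutting at an arbitrary byte offset can split a multi-byte character, so a smaller sample may trade some accuracy for speed.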
-- Ken
I ran a test that converted 52,717 documents to text. The documents were
emails embedded in a .tar file.
With character set detection, the test took 220 seconds. Without
character set detection, the test took 21 seconds, and only 6% of that
time was spent in Tika.
According to a profiler, the following methods took the bulk of the
runtime when character set detection was used:
61.7%  org.apache.tika.parser.txt.CharsetRecog_sbcs$NGramParser.parse
 4.3%  org.apache.tika.parser.txt.CharsetRecog_sbcs$CharsetRecog_IBM420_ar.isLamAlef
 3.1%  org.apache.tika.parser.txt.CharsetRecog_sbcs$CharsetRecog_IBM420_ar.unshapeLamAlef
 2.6%  org.apache.tika.parser.txt.CharsetDetector.setText(byte[])
 2.3%  org.apache.tika.parser.txt.CharsetRecog_mbcs.match
One problem that seems to contribute to this is that every character set
is tested for each document, instead of starting with common character
sets and stopping as soon as an adequate character set is found.
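The "start with common character sets and stop early" idea can be sketched with the JDK alone. This is an assumption-laden illustration, not Tika's algorithm: the class FirstFitCharset and its methods are hypothetical, and "adequate" is reduced here to "the bytes decode without errors", which is a much weaker test than the n-gram matching the profiled recognizers perform.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch, not Tika code: first-fit charset selection.
final class FirstFitCharset {
    private FirstFitCharset() {}

    // True if every byte sequence in data is valid in the given charset.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Tries candidates in the given order and stops at the first one
    // that decodes cleanly; later candidates are never examined.
    static Optional<Charset> firstFit(byte[] data, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            if (decodesCleanly(data, candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

The order of the candidate list matters: a charset such as ISO-8859-1 accepts any byte sequence, so it only makes sense as the last entry, after stricter candidates such as US-ASCII and UTF-8.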
To turn off character set detection, I created a new class that is
essentially the TXTParser with character set detection removed. I then
replaced every instance of TXTParser in AutoDetectParser's map of parsers
with a text parser that does not determine the character set.
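The replacement step described above can be sketched generically. The code below uses a stand-in type parameter P instead of Tika's Parser type, and the class ParserMapSwap and method replaceInstances are hypothetical; how the parser map is read from and written back to AutoDetectParser depends on the Tika version and is not shown.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Tika code: swap every map value of one class
// for a single replacement instance.
final class ParserMapSwap {
    private ParserMapSwap() {}

    // Returns a copy of parsers in which each value that is an instance
    // of target has been replaced; the input map is left unchanged.
    static <P> Map<String, P> replaceInstances(
            Map<String, P> parsers, Class<?> target, P replacement) {
        Map<String, P> result = new HashMap<>(parsers);
        for (Map.Entry<String, P> entry : result.entrySet()) {
            if (target.isInstance(entry.getValue())) {
                entry.setValue(replacement);
            }
        }
        return result;
    }
}
```

With Tika's types substituted in, target would be TXTParser.class and replacement would be the detection-free text parser. Working on a copy keeps the original map intact, so the default behavior can be restored if needed.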
I'm left with the following questions:
- Can character set detection be sped up?
- If character set detection can't be sped up, is there an easier way to
  turn it off?
- If character set detection can't be sped up and there isn't an easier
  way to turn off character set detection, could an easier way to turn
  off character set detection be added?
Thanks for your help,
Paul
--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g