Tim Allison created TIKA-4659:
---------------------------------
Summary: Add tika-eval-lite for embedded junk detection
Key: TIKA-4659
URL: https://issues.apache.org/jira/browse/TIKA-4659
Project: Tika
Issue Type: Task
Reporter: Tim Allison
tika-eval already has an out-of-vocabulary (OOV) statistic, but it requires a
list of 20k words for each of 120+ languages. It would be useful to have
something lighter weight for use in charset detectors and/or parsers.
If we use a simple bigram model, we'd be able to run comparative stats: does
this text run score better as RTL or LTR in a PDF (at parse time), or the same
kind of comparison between candidates during encoding detection. We couldn't
easily get a standalone "this is junk" score, but the comparison part would be
really useful.
We can generate bigram stats from the original tika-eval word lists trivially.
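As a rough illustration of that generation step, the sketch below derives
bigram counts from word-list lines. The input layout is an assumption (one
word per line, optionally followed by a tab and an integer count, with "#"
comment lines); the actual tika-eval files may be laid out differently, and
the function names are hypothetical.

```python
from collections import Counter


def bigram_counts_from_lines(lines):
    """Accumulate character-bigram counts from word-list lines.

    Assumed line format: "word" or "word<TAB>count"; blank lines and
    lines starting with '#' are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        word, _, freq = line.partition("\t")
        # Weight each bigram by the word's count when one is given.
        weight = int(freq) if freq.strip().isdigit() else 1
        padded = "^" + word + "$"  # '^' and '$' mark word boundaries
        for i in range(len(padded) - 1):
            counts[padded[i:i + 2]] += weight
    return counts


def to_relative_freqs(counts):
    """Normalize raw counts to relative frequencies (empty dict if no data)."""
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}
```

For example, `bigram_counts_from_lines(open(path, encoding="utf-8"))` would
process a file at a caller-supplied `path`. The resulting table is much
smaller than the word list it was derived from.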