Tim Allison created TIKA-4659:
---------------------------------

             Summary: Add tika-eval-lite for embedded junk detection
                 Key: TIKA-4659
                 URL: https://issues.apache.org/jira/browse/TIKA-4659
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


tika-eval has an out-of-vocabulary (OOV) statistic, which requires a list of 
20k words for each of 120+ languages. It would be useful to have something 
lighter weight for use in charset detectors and/or parsers. 

If we use a simple bigram model, we'd be able to run comparative stats. For 
example: is this text run better as RTL or LTR in a PDF (at parse time), or 
which candidate decoding looks more plausible during encoding detection? We 
couldn't easily get a "this is junk" score by itself, but the comparison part 
would be really useful.
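
To illustrate the comparison idea, here is a minimal sketch, assuming a 
character-bigram table of log probabilities is already available. The class 
name BigramScorer, its methods, and the fixed penalty for unseen bigrams are 
all hypothetical and not part of tika-eval; the point is only that two 
candidate strings are ranked against each other rather than against an 
absolute "junk" threshold.

```java
import java.util.Map;

/** Hypothetical sketch, not an existing Tika class. */
final class BigramScorer {
    private final Map<String, Double> logProbs; // bigram -> log probability
    private final double unseenLogProb;         // assumed penalty for bigrams not in the table

    BigramScorer(Map<String, Double> logProbs, double unseenLogProb) {
        this.logProbs = logProbs;
        this.unseenLogProb = unseenLogProb;
    }

    /** Mean log probability per character bigram; higher means more plausible. */
    double score(String text) {
        int[] cps = text.codePoints().toArray();
        if (cps.length < 2) {
            // Too short to contain a bigram: treat as maximally implausible.
            return unseenLogProb;
        }
        double sum = 0;
        for (int i = 0; i + 1 < cps.length; i++) {
            String bigram = new String(cps, i, 2);
            sum += logProbs.getOrDefault(bigram, unseenLogProb);
        }
        // Normalize by bigram count so candidates of different lengths compare fairly.
        return sum / (cps.length - 1);
    }

    /** Returns whichever candidate the model finds more plausible (ties go to a). */
    String pickBetter(String a, String b) {
        return score(a) >= score(b) ? a : b;
    }
}
```

A caller would pass the same text run in both orders, or the same bytes 
decoded with two candidate charsets, to pickBetter and keep the winner. The 
scores are only meaningful relative to each other.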

We can generate bigram stats from the original tika-eval word lists trivially. 
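
As a rough sketch of that derivation, assuming the word list is available as 
a plain list of strings with every word weighted equally (the real tika-eval 
file format and any frequency information are not modeled here), bigram 
counts could be collected and turned into log relative frequencies as 
follows. BigramTableBuilder, its method names, and the "_" boundary padding 
are hypothetical choices for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not an existing Tika class. */
final class BigramTableBuilder {

    /** Counts character bigrams per word, padding with '_' to capture word starts and ends. */
    static Map<String, Integer> countBigrams(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            int[] cps = ("_" + word + "_").codePoints().toArray();
            for (int i = 0; i + 1 < cps.length; i++) {
                counts.merge(new String(cps, i, 2), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Converts raw counts to log relative frequencies (no smoothing). */
    static Map<String, Double> toLogProbs(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Double> logProbs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            logProbs.put(e.getKey(), Math.log((double) e.getValue() / total));
        }
        return logProbs;
    }
}
```

Working on code points rather than chars keeps supplementary characters 
intact. Bigrams that never occur in the word list get no entry, so a consumer 
of the table needs its own fallback value for them.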



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
