[ https://issues.apache.org/jira/browse/TIKA-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063660#comment-18063660 ]
ASF GitHub Bot commented on TIKA-4659:
--------------------------------------
tballison closed pull request #2604: TIKA-4659
> Add tika-eval-lite for embedded junk detection
> ----------------------------------------------
>
> Key: TIKA-4659
> URL: https://issues.apache.org/jira/browse/TIKA-4659
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
>
> tika-eval has an out-of-vocabulary (OOV) statistic, but it requires a list of
> 20k words for each of 120+ languages. It would be useful to have something
> lighter weight for use in charset detectors and/or parsers.
> If we use a simple bigram model, we'd be able to run comparative stats: does
> this text score better as rtl or ltr in a PDF (at parse time), or under one
> candidate charset versus another in encoding detection? We couldn't easily get
> a "this is junk" score by itself, but the comparison would be really useful.
> We can trivially generate the bigram stats from the original tika-eval word
> lists.
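
To make the comparison idea concrete, here is a minimal sketch of a character-bigram scorer in Python. It is not the tika-eval-lite implementation: the function names (train_bigrams, avg_log_prob, pick_better), the "^"/"$" word-boundary markers, and the add-one smoothing are all assumptions chosen for illustration, and the tiny word list stands in for a real tika-eval word list.

```python
import math
from collections import Counter


def train_bigrams(words):
    """Count character bigrams over a word list.

    "^" and "$" are hypothetical word-boundary markers, so that
    word-initial and word-final letters also contribute bigrams.
    """
    counts = Counter()
    for word in words:
        padded = "^" + word.lower() + "$"
        counts.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return counts


def avg_log_prob(text, counts):
    """Average add-one-smoothed log probability per bigram in text.

    Only meaningful relative to another candidate scored against the
    same counts; it is not a standalone "this is junk" score.
    """
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra bucket for unseen bigrams
    score = 0.0
    n = 0
    for token in text.lower().split():
        padded = "^" + token + "$"
        for i in range(len(padded) - 1):
            bigram = padded[i:i + 2]
            score += math.log((counts[bigram] + 1) / (total + vocab))
            n += 1
    return score / n if n else float("-inf")


def pick_better(candidate_a, candidate_b, counts):
    """Return whichever candidate looks more like the training words."""
    if avg_log_prob(candidate_a, counts) >= avg_log_prob(candidate_b, counts):
        return candidate_a
    return candidate_b


# Toy stand-in for a per-language word list.
WORDS = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog",
         "there", "then"]
COUNTS = train_bigrams(WORDS)

# Same characters read in two directions, as in the rtl/ltr question.
BEST = pick_better("the fox", "eht xof", COUNTS)
```

With this toy list, every bigram of "the fox" has been seen and none of the bigrams of "eht xof" have, so the first candidate scores higher. The same comparison would apply to two decodings of one byte sequence under different candidate charsets.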