[
https://issues.apache.org/jira/browse/TIKA-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982597#comment-14982597
]
Tim Allison commented on TIKA-1443:
-----------------------------------
[~kkrugler], have you by chance looked at how Optimaize performs on garbled
text? With the default settings and models in lang-detect, we're seeing very
high confidence for 'bn' (for example) on entirely garbled text. Short
of using a common word lookup list (see
[discussion|https://issues.apache.org/jira/browse/PDFBOX-3058?focusedCommentId=14981928&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14981928]),
I'd hope that uni/bi/trigram character models would offer some insight into
whether something went horribly wrong during text extraction.
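To make the idea concrete, here is a minimal sketch (not Tika or Optimaize code) of how character trigram models could flag junk: build trigram counts from a small reference corpus of clean text, then score a candidate by the fraction of its trigrams the model has never seen. The corpus, function names, and threshold here are all illustrative assumptions; a real detector would train on large per-language corpora and likely use smoothed log-probabilities rather than a raw unseen-trigram ratio.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Yield character n-grams from whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    for i in range(len(text) - n + 1):
        yield text[i:i + n]


def build_model(corpus, n=3):
    """Count n-gram frequencies over a reference corpus of clean text."""
    return Counter(g for doc in corpus for g in char_ngrams(doc, n))


def junk_score(text, model, n=3):
    """Fraction of the text's n-grams never seen in the reference model.

    Near 0.0 for text resembling the corpus; near 1.0 for garbled bytes.
    """
    grams = list(char_ngrams(text, n))
    if not grams:
        return 1.0
    unseen = sum(1 for g in grams if g not in model)
    return unseen / len(grams)


# Tiny stand-in corpus; a real model would be trained on far more text.
corpus = ["the quick brown fox jumps over the lazy dog",
          "pack my box with five dozen liquor jugs"]
model = build_model(corpus)

print(junk_score("the lazy brown dog jumps", model))  # low: English-like
print(junk_score("x7#qz@@kv 9!!pj zzqx", model))      # high: garbled
```

A detector along these lines would report a score rather than a hard yes/no, leaving the cutoff to the caller (parser regression testing vs. indexing decisions, per the use cases below).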
> Add a junk text detector to Tika
> --------------------------------
>
> Key: TIKA-1443
> URL: https://issues.apache.org/jira/browse/TIKA-1443
> Project: Tika
> Issue Type: Wish
> Reporter: Tim Allison
> Priority: Minor
>
> It would be helpful to have a detector that flags documents whose extracted
> text is junk. This could be used as a component of TIKA-1332 or as a
> standalone detector. See TIKA-1332 for some initial ideas of what statistics
> we might use for such a detector.
> Two use cases:
> * Parser developers could quickly see whether code changes lead to fewer or
> more "junky" documents. This would also aid in prioritizing manual review of
> output comparison (see discussion in TIKA-1419).
> * Search system integrators could use that information to set
> document-specific relevancy rankings or to avoid indexing a document at all.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)