[ https://issues.apache.org/jira/browse/TIKA-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982597#comment-14982597 ]
Tim Allison commented on TIKA-1443:
-----------------------------------

[~kkrugler], have you looked at how Optimaize performs on garbled text, by chance? With the default settings and models in lang-detect, we're getting very high confidence for 'bn' (for example) on entirely garbled text. Short of using a common-word lookup list (see [discussion|https://issues.apache.org/jira/browse/PDFBOX-3058?focusedCommentId=14981928&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14981928]), I'd hope that uni/bi/trigram character models would offer some insight into whether something went horribly wrong during text extraction.

> Add a junk text detector to Tika
> --------------------------------
>
>           Key: TIKA-1443
>           URL: https://issues.apache.org/jira/browse/TIKA-1443
>       Project: Tika
>    Issue Type: Wish
>      Reporter: Tim Allison
>      Priority: Minor
>
> It would be helpful to have a detector that flags documents whose extracted
> text is junk. This could be used as a component of TIKA-1332 or as a
> standalone detector. See TIKA-1332 for some initial ideas of what statistics
> we might use for such a detector.
> Two use cases:
> * Parser developers could quickly see whether changes in code lead to less
> "junky" documents or more "junky" documents. This would also aid in
> prioritizing manual review of output comparison (see discussion in TIKA-1419).
> * Search system integrators could use that information to set document-specific
> relevancy rankings or to avoid indexing a document.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
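The character-n-gram idea in the comment above could be sketched roughly as follows. This is hypothetical illustration code, not part of Tika or Optimaize: it trains character-trigram counts on known-good text and scores candidate text by the average (add-one-smoothed) log-probability of its trigrams, on the theory that garbled extraction output shares few trigrams with clean text and therefore scores much lower. The class name `TrigramJunkScorer` and the threshold choice are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a trigram-based junk-text scorer.
public class TrigramJunkScorer {

    private final Map<String, Integer> counts = new HashMap<>();
    private int total = 0;

    // Accumulate character-trigram counts from known-good text.
    public void train(String text) {
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
            total++;
        }
    }

    // Average log-probability per trigram, with add-one smoothing.
    // Lower (more negative) scores suggest junk / garbled extraction.
    public double score(String text) {
        int n = 0;
        double logProb = 0.0;
        for (int i = 0; i + 3 <= text.length(); i++) {
            int c = counts.getOrDefault(text.substring(i, i + 3), 0);
            logProb += Math.log((c + 1.0) / (total + counts.size()));
            n++;
        }
        return n == 0 ? Double.NEGATIVE_INFINITY : logProb / n;
    }

    public static void main(String[] args) {
        TrigramJunkScorer scorer = new TrigramJunkScorer();
        scorer.train("the quick brown fox jumps over the lazy dog "
                + "and the cat sat on the mat");

        double clean = scorer.score("the cat and the dog");
        double junk = scorer.score("qzxw vkjq zzqx wvvk");

        // Clean text reuses trigrams seen in training; garbled text does not.
        System.out.println("clean=" + clean + " junk=" + junk);
    }
}
```

In practice a detector like this would need per-language (or per-script) models and a calibrated threshold; the point here is only that trigram likelihood separates plausible text from extraction garbage even when a language identifier still reports high confidence.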