[ https://issues.apache.org/jira/browse/TIKA-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865789#comment-15865789 ]
Tim Allison commented on TIKA-1443:
-----------------------------------
Interesting work with references:
https://ryanfb.github.io/etc/2015/03/16/automatic_evaluation_of_ocr_quality.html
> Add a junk text detector to Tika
> --------------------------------
>
> Key: TIKA-1443
> URL: https://issues.apache.org/jira/browse/TIKA-1443
> Project: Tika
> Issue Type: Wish
> Reporter: Tim Allison
> Priority: Minor
>
> It would be helpful to have a detector that flags documents whose extracted
> text is junk. This could be used as a component of TIKA-1332 or as a
> standalone detector. See TIKA-1332 for some initial ideas of what statistics
> we might use for such a detector.
> Two use cases:
> * Parser developers could quickly see whether changes in code lead to less
> "junky" documents or more "junky" documents. This would also aid in
> prioritizing manual review of output comparison (see discussion in TIKA-1419).
> * Search system integrators could use that information to set
> document-specific relevancy rankings or to avoid indexing a document.
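
To make the idea in the description concrete, here is a minimal sketch of one possible junk statistic: the fraction of code points in the extracted text that are letters or whitespace. This is an illustration only. The class name, method names, and the 0.5 threshold are all hypothetical, none of this is part of Tika's API, and the statistic is not necessarily one of those proposed in TIKA-1332.

```java
// Hypothetical sketch, not Tika code: flag extracted text as "junky"
// when too few of its code points are letters or whitespace.
final class JunkTextSketch {

    /** Fraction of code points that are letters or whitespace; 0.0 for empty text. */
    static double plainFraction(String text) {
        int total = 0;
        int plain = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);   // step over surrogate pairs correctly
            total++;
            if (Character.isLetter(cp) || Character.isWhitespace(cp)) {
                plain++;
            }
        }
        return total == 0 ? 0.0 : (double) plain / total;
    }

    /** True when the plain fraction falls below the caller-chosen threshold (an assumption). */
    static boolean looksJunky(String text, double threshold) {
        return plainFraction(text) < threshold;
    }

    public static void main(String[] args) {
        String clean = "The quick brown fox jumps over the lazy dog";
        String garbled = "\uFFFD\uFFFD#@!%^&*()";
        System.out.println(looksJunky(clean, 0.5));
        System.out.println(looksJunky(garbled, 0.5));
    }
}
```

A parser developer could compute such a score before and after a code change to rank documents for manual review, and a search integrator could use it to skip or down-weight a document. A real detector would likely need language-aware statistics, since a single character-class ratio will misjudge number-heavy or symbol-heavy documents.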