[ 
https://issues.apache.org/jira/browse/TIKA-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865789#comment-15865789
 ] 

Tim Allison commented on TIKA-1443:
-----------------------------------

Interesting work, with references, on automatically evaluating OCR quality: 
https://ryanfb.github.io/etc/2015/03/16/automatic_evaluation_of_ocr_quality.html

> Add a junk text detector to Tika
> --------------------------------
>
>                 Key: TIKA-1443
>                 URL: https://issues.apache.org/jira/browse/TIKA-1443
>             Project: Tika
>          Issue Type: Wish
>            Reporter: Tim Allison
>            Priority: Minor
>
> It would be helpful to have a detector that flags documents whose extracted 
> text is junk.  This could be used as a component of TIKA-1332 or as a 
> standalone detector.  See TIKA-1332 for some initial ideas of what statistics 
> we might use for such a detector.
> Two use cases:
> * Parser developers could quickly see whether changes in code lead to less 
> "junky" documents or more "junky" documents.  This would also aid in 
> prioritizing manual review of output comparison (see discussion in TIKA-1419).
> * Search system integrators could use that information to set 
> document-specific relevancy rankings or to avoid indexing a document.
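
As a rough illustration of the kind of detector described above, here is a minimal sketch in Python. It is not Tika code and does not use the statistics proposed in TIKA-1332; the function names, the two character-level signals, and the 0.5 threshold are all assumptions chosen only for illustration.

```python
import unicodedata

# Hypothetical cutoff, picked for illustration; a real detector would tune it.
JUNK_THRESHOLD = 0.5


def junk_score(text: str) -> float:
    """Return a value in [0, 1]; higher means the text looks more like junk."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # nothing was extracted, so treat the result as junk

    # Signal 1 (assumed): share of non-whitespace characters that are
    # neither letters nor digits.
    non_alnum = sum(not c.isalnum() for c in chars) / len(chars)

    # Signal 2 (assumed): share that are U+FFFD replacement characters or
    # control / private-use / unassigned code points, which often indicate
    # a decoding or font-mapping failure.
    bad = sum(
        c == "\ufffd" or unicodedata.category(c) in {"Cc", "Co", "Cn"}
        for c in chars
    ) / len(chars)

    # Suspicious code points count in both signals, so they weigh double;
    # clamp the sum to 1.0.
    return min(1.0, non_alnum + bad)


def is_junk(text: str) -> bool:
    """Flag text whose score exceeds the hypothetical threshold."""
    return junk_score(text) > JUNK_THRESHOLD
```

A parser developer could compare `junk_score` on the same document before and after a code change, and a search integrator could skip documents for which `is_junk` returns true. Both uses depend on thresholds and signals that would need tuning against real corpora.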



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
