[ https://issues.apache.org/jira/browse/TIKA-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170462#comment-14170462 ]

Chris A. Mattmann commented on TIKA-1443:
-----------------------------------------

#love

> Add a junk text detector to Tika
> --------------------------------
>
>                 Key: TIKA-1443
>                 URL: https://issues.apache.org/jira/browse/TIKA-1443
>             Project: Tika
>          Issue Type: Wish
>            Reporter: Tim Allison
>            Priority: Minor
>
> It would be helpful to have a detector that flags documents whose extracted 
> text is junk.  This could be used as a component of TIKA-1332 or as a 
> standalone detector.  See TIKA-1332 for some initial ideas of what statistics 
> we might use for such a detector.
> Two use cases:
> * Parser developers could quickly see whether changes in code lead to less 
> "junky" documents or more "junky" documents.  This would also aid in 
> prioritizing manual review of output comparison (see discussion in TIKA-1419).
> * Search system integrators could use that information to set 
> document-specific relevancy rankings or to avoid indexing a document.



