[jira] [Commented] (TIKA-1443) Add a junk text detector to Tika

Wouter De Borger (JIRA) Thu, 30 Mar 2017 05:35:56 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15948974#comment-15948974
 ]


Wouter De Borger commented on TIKA-1443:
----------------------------------------

This feature would be very interesting. 

I'm working on a system that extracts text with PDFBox and falls back to OCR 
when PDFBox fails. When the PDFBox output is even slightly suspicious, OCR is 
the preferred solution, so for my use case the junk detector can be made very 
sensitive.
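
As a rough illustration of how a deliberately sensitive check could gate the 
OCR fallback, here is a minimal sketch. The class name JunkTextHeuristic, its 
method names and the threshold value are hypothetical and not part of Tika or 
PDFBox. The statistic used (the share of replacement, control, private-use or 
unassigned characters) is only one assumed signal, not the set of statistics 
discussed in TIKA-1332. Treating empty output as junk is also an assumption.

```java
// Hypothetical helper, not a Tika or PDFBox API.
final class JunkTextHeuristic {

    private JunkTextHeuristic() {}

    /**
     * Returns the fraction of non-whitespace code points that look like
     * extraction debris: U+FFFD, control characters, private-use code
     * points, or unassigned code points. Empty or all-whitespace input
     * returns 1.0 (assumption: no text at all is treated as junk).
     */
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // whitespace carries no signal here
            }
            total++;
            boolean replacement = cp == 0xFFFD;
            boolean control = Character.isISOControl(cp);
            boolean privateUse = Character.getType(cp) == Character.PRIVATE_USE;
            boolean unassigned = !Character.isDefined(cp);
            if (replacement || control || privateUse || unassigned) {
                suspicious++;
            }
        }
        return total == 0 ? 1.0 : (double) suspicious / total;
    }

    /** True when the ratio exceeds the threshold; a low threshold means a sensitive detector. */
    static boolean shouldFallBackToOcr(String extractedText, double threshold) {
        return suspiciousRatio(extractedText) > threshold;
    }
}
```

With a low threshold such as 0.01, a call like 
JunkTextHeuristic.shouldFallBackToOcr(text, 0.01) would send a page to OCR as 
soon as about one character in a hundred looks like debris, which matches the 
"prefer OCR when in doubt" policy described above.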

If I make any progress on this issue, I'll let you know.

> Add a junk text detector to Tika
> --------------------------------
>
>                 Key: TIKA-1443
>                 URL: https://issues.apache.org/jira/browse/TIKA-1443
>             Project: Tika
>          Issue Type: Wish
>            Reporter: Tim Allison
>            Priority: Minor
>
> It would be helpful to have a detector that flags documents whose extracted 
> text is junk.  This could be used as a component of TIKA-1332 or as a 
> standalone detector.  See TIKA-1332 for some initial ideas of what statistics 
> we might use for such a detector.
> Two use cases:
> * Parser developers could quickly see whether changes in code lead to less 
> "junky" documents or more "junky" documents.  This would also aid in 
> prioritizing manual review of output comparison (see discussion in TIKA-1419).
> * Search system integrators could use that information to set 
> document-specific relevancy rankings or to avoid indexing a document.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
