[ 
https://issues.apache.org/jira/browse/TIKA-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982597#comment-14982597
 ] 

Tim Allison commented on TIKA-1443:
-----------------------------------

[~kkrugler], have you looked at how Optimaize handles garbled text, by 
chance?  With the default settings and models in lang-detect, we're getting 
very high confidence for 'bn' (for example) on entirely garbled text.  Short 
of using a common-word lookup list (see 
[discussion|https://issues.apache.org/jira/browse/PDFBOX-3058?focusedCommentId=14981928&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14981928]),
 I'd hope that uni/bi/trigram character models would offer some insight into 
whether something went horribly wrong during text extraction.
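To illustrate the idea (this is a hypothetical sketch, not Tika or Optimaize code): a character-trigram model trained on known-good text assigns a much lower average log-probability to garbled extraction output than to real language, so the score itself can serve as a junk signal. The class and method names below are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical junk-text scorer: trains character-trigram counts on
// known-good text, then scores candidate strings by average trigram
// log-probability (add-one smoothing). Garbled text, whose trigrams
// were never seen in training, scores much lower than real language.
public class TrigramJunkScorer {
    private final Map<String, Integer> counts = new HashMap<>();
    private int total = 0;

    public void train(String text) {
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
            total++;
        }
    }

    // Average log-probability per trigram; lower means "junkier".
    public double score(String text) {
        double logProb = 0.0;
        int n = 0;
        for (int i = 0; i + 3 <= text.length(); i++) {
            int c = counts.getOrDefault(text.substring(i, i + 3), 0);
            logProb += Math.log((c + 1.0) / (total + counts.size()));
            n++;
        }
        return n == 0 ? Double.NEGATIVE_INFINITY : logProb / n;
    }
}
```

A real detector would need a larger training corpus and a calibrated threshold, but even this toy model separates clean English from byte soup.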

> Add a junk text detector to Tika
> --------------------------------
>
>                 Key: TIKA-1443
>                 URL: https://issues.apache.org/jira/browse/TIKA-1443
>             Project: Tika
>          Issue Type: Wish
>            Reporter: Tim Allison
>            Priority: Minor
>
> It would be helpful to have a detector that flags documents whose extracted 
> text is junk.  This could be used as a component of TIKA-1332 or as a 
> standalone detector.  See TIKA-1332 for some initial ideas of what statistics 
> we might use for such a detector.
> Two use cases:
> * Parser developers could quickly see whether changes in code lead to less 
> "junky" documents or more "junky" documents.  This would also aid in 
> prioritizing manual review of output comparison (see discussion in TIKA-1419).
> * Search system integrators could use that information to set document-
> specific relevancy rankings or to avoid indexing a document altogether.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)