[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044682#comment-14044682
 ] 

Tim Allison commented on TIKA-1332:
-----------------------------------

To my mind, there are three families of things that can go wrong:

1) Parser can fail
    1a) throw an exception
    1b) hang forever

2) Fail to extract text and/or metadata from documents
    2a) nothing is extracted
    2b) some document components or attachments are not extracted: TIKA-1317 
and TIKA-1228

3) Extract junk (mojibake, too many spaces in PDFs, failure to add a space 
between runs in .docx, etc.), in which case there are two options:
      3a) We can do better.
      3b) We can't...the document is just plain broken.

We can easily count and compare 1).   By easily, I mean that I haven't fully 
worked it out, but it should be fairly straightforward.
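As a rough sketch of what counting 1) might look like: assuming each batch run produces records of (file type, exception name or None for a clean parse), a tally per file type and per exception type is a few lines. The record shape and names here are hypothetical, not anything Tika emits today.

```python
from collections import Counter

def tally_exceptions(results):
    """Count exceptions from (file_type, exception_name) records.

    `results` is assumed to be an iterable of (file_type, exception_name)
    pairs, where exception_name is None for a clean parse.
    Returns exception counts per file type, and counts per
    (file_type, exception_name) pair for the "most common" breakdown.
    """
    per_type = Counter()
    per_type_and_exception = Counter()
    for file_type, exc in results:
        if exc is not None:
            per_type[file_type] += 1
            per_type_and_exception[(file_type, exc)] += 1
    return per_type, per_type_and_exception

# Hypothetical run records:
runs = [
    ("pdf", None),
    ("pdf", "IOException"),
    ("docx", "SAXException"),
    ("pdf", "IOException"),
]
per_type, per_type_and_exception = tally_exceptions(runs)
```

Comparing two runs would then just be a diff of the two Counter maps.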

Without a truth set or a comparison parser, we cannot easily measure 2a or 2b.  
For 2a, if there is no text, maybe there really is no text (image-only PDFs or 
a .docx that contains only images).  For 2b, we're really out of luck without 
other resources.
  
For 3), there's lots of room for work.  In short, I'd think we'd want to 
calculate how "languagey" the extracted text is.  Some indicators that occur to 
me:

 a) Type/token ratio or token entropy
 b) Average word length (with an exception for non-whitespace languages)
 c) Ratio of alphanumerics to total string length
 d) Analysis of language id confidence scores...if the string is long enough, 
you'd expect a langid component to return a very high score for the best 
language and then far lower scores for the 2nd and 3rd best languages.  If the 
langid component returns flat scores, then that might be an indicator that 
something didn't go well.  
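To make indicators a) through d) concrete, here's a minimal sketch. The tokenization is naive whitespace/word splitting, so it would need rework for non-whitespace languages, and `scores_look_flat` assumes the langid component hands back a descending list of per-language confidence scores (a made-up interface for illustration, with an arbitrary gap threshold):

```python
import math
import re
from collections import Counter

def text_quality_metrics(text):
    """Rough 'languagey-ness' indicators for extracted text.

    Covers a) type/token ratio and token entropy, b) average word
    length, and c) ratio of alphanumerics to total string length.
    Returns None if no tokens were found.
    """
    tokens = re.findall(r"\w+", text)
    if not tokens:
        return None
    counts = Counter(tokens)
    total = len(tokens)
    # a) type/token ratio and token entropy (bits)
    ttr = len(counts) / total
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in counts.values())
    # b) average word length
    avg_word_len = sum(len(t) for t in tokens) / total
    # c) alphanumeric characters over total string length
    alnum_ratio = sum(ch.isalnum() for ch in text) / len(text)
    return {"ttr": ttr, "entropy": entropy,
            "avg_word_len": avg_word_len, "alnum_ratio": alnum_ratio}

def scores_look_flat(scores, min_gap=0.2):
    """d) Flag a flat language-id confidence distribution.

    `scores` is assumed sorted best-first; a small gap between the
    best and second-best score may indicate junk text.
    """
    if len(scores) < 2:
        return False
    return (scores[0] - scores[1]) < min_gap
```

For junk text we'd expect the type/token ratio and entropy to drift toward the extremes (near 1.0 TTR for random bytes, near 0 for a repeated run of the same token) and `scores_look_flat` to fire.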

What do you think?  Are there other things that can go wrong?  What else should 
we try to measure, in a supervised (not ideal), semi-supervised (better), or 
unsupervised (best) way? 

> Create "eval" code
> ------------------
>
>                 Key: TIKA-1332
>                 URL: https://issues.apache.org/jira/browse/TIKA-1332
>             Project: Tika
>          Issue Type: Sub-task
>          Components: cli, general, server
>            Reporter: Tim Allison
>
> For this issue, we can start with code to gather statistics on each run (# of 
> exceptions per file type, most common exceptions per file type, number of 
> metadata items, total text extracted, etc).  We should also be able to 
> compare one run against another.  Going forward, there's plenty of room to 
> improve.



--
This message was sent by Atlassian JIRA
(v6.2#6252)