[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178361#comment-14178361
 ] 

Andrew Jackson edited comment on TIKA-1302 at 10/21/14 12:59 PM:
-----------------------------------------------------------------

Okay, so the c.300,000 exceptions are here: 
https://www.dropbox.com/s/ka19fguaxflp725/parse_errors.csv.gz?dl=0 - let me 
know if you'd like it placed elsewhere (it's 14MB of compressed CSV).

This conversation has helped me spot a gap in our code. We currently call 
Tika.detect() before Tika.parse(), and only run the latter if the former 
succeeded. Sadly, the version of the code that generated this data recorded 
the Tika exception for the .parse() step only, not for the .detect() step. 
That explains why there are no hung-thread events in this result set: the 
interrupted .detect() calls were not recorded properly. We'll be re-running 
this scan soonish, so I'll make sure the next version records all the 
exceptions. IIRC, from looking at the MIME types, the permanent hangs were 
mostly ZIPs, Office documents, and maybe some PDFs.
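
As a rough illustration of the gap described above (not the actual scanning 
code), the two-stage flow could record failures from both steps along these 
lines. The `detect` and `parse` callables are hypothetical stand-ins for 
Tika.detect() and Tika.parse(), and the sketch assumes an interrupted or 
timed-out call surfaces as an ordinary exception.

```python
from typing import Callable, Dict, Optional


def characterise(
    resource: bytes,
    detect: Callable[[bytes], str],  # hypothetical stand-in for Tika.detect()
    parse: Callable[[bytes], str],   # hypothetical stand-in for Tika.parse()
) -> Dict[str, Optional[str]]:
    """Run detect, then parse, recording any exception from either stage."""
    record: Dict[str, Optional[str]] = {
        "content_type": None,
        "text": None,
        "detect_error": None,
        "parse_error": None,
    }
    try:
        record["content_type"] = detect(resource)
    except Exception as exc:
        # Parsing is still skipped, but the detect failure is no longer lost.
        record["detect_error"] = f"{type(exc).__name__}: {exc}"
        return record
    try:
        record["text"] = parse(resource)
    except Exception as exc:
        record["parse_error"] = f"{type(exc).__name__}: {exc}"
    return record
```

Keeping a separate error field per stage means a hang during detection would 
show up in the output instead of silently dropping the resource.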

Note that the CSV includes the Content-Type from the .detect() step, and this 
should indicate which module was run on the resource (i.e. whatever the Tika 
1.5 mapping was for that MIME type). I don't think we changed the parse 
configuration significantly, so HTML, XHTML and XML should all have gone 
through the HtmlParser (I'm not 100% sure about this, and will try to check).

I'm not sure it's worth giving you all the SAX exceptions, as there are a lot 
of repeats of the same problems. I think a random sample of about 50,000 should 
be plenty. Does that sound okay to you?
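
One way such a random sample could be drawn without loading every row into 
memory is reservoir sampling. The sketch below is only an assumption about 
how it might be done: the function names, the file path, and the UTF-8 
encoding of the CSV are all hypothetical.

```python
import csv
import gzip
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int,
                     seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)      # fill the reservoir first
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                sample[j] = row     # replace with probability k / (i + 1)
    return sample


def sample_gzipped_csv(path: str, k: int,
                       seed: Optional[int] = None) -> List[List[str]]:
    """Sample k rows from a gzip-compressed CSV (encoding is assumed UTF-8)."""
    with gzip.open(path, "rt", newline="", encoding="utf-8",
                   errors="replace") as fh:
        return reservoir_sample(csv.reader(fh), k, seed)
```

For example, `sample_gzipped_csv("sax_errors.csv.gz", 50000)` would return 
50,000 rows, where the file name is a placeholder.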

EDIT: Oh, and I meant to say, I'm glad to hear about [~gostep] and 
[~talli...@apache.org]'s efforts to run this on GovDocs, and would be 
interested in comparing results. We already publish format profile data about 
web archives, and would love to have more data to refer to.



> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, general, server
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
