[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

Tim Allison (JIRA) Mon, 03 Nov 2014 06:01:14 -0800

    [ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194543#comment-14194543 ]


Tim Allison commented on TIKA-1302:
-----------------------------------

[~anjackson], the Google Docs link is down at the moment, so I can't see the 
full doc.  If there is any way to capture the full stack traces so that we can 
compare with our govdocs1 runs, that would be fantastic.  You can see our 
current output format comparing two versions of PDFBox over on TIKA-1442. This 
is ongoing work (from my perspective), and there's no need to rush.  Whichever 
option is easier for you...thank you for sharing!
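
In case it helps, here is a minimal sketch of one way to capture a full stack 
trace as a string, using only the JDK. The class name StackTraceCapture, the 
method fullStackTrace, and the file name example.pdf are made up for 
illustration; this is not what our batch code does, just an assumption about 
what would be easy to drop into a catch block.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper; the names here are illustrative and not part of Tika.
public class StackTraceCapture {

    // Renders the complete stack trace, including any "Caused by" chain,
    // into a single string that can be written to a log or report.
    public static String fullStackTrace(Throwable t) {
        StringWriter sw = new StringWriter();
        try (PrintWriter pw = new PrintWriter(sw)) {
            t.printStackTrace(pw);
        }
        return sw.toString();
    }

    public static void main(String[] args) {
        try {
            // Stand-in for a parser failure; a real run would call the parser here.
            throw new IllegalStateException("outer", new IOException("inner"));
        } catch (Exception e) {
            // "example.pdf" is a placeholder file name.
            System.out.println("example.pdf");
            System.out.println(fullStackTrace(e));
        }
    }
}
```

Keeping the whole trace, rather than just the exception message, is what would 
let us line up your exceptions against the ones we see on govdocs1.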

{quote}
 I don't think we changed the parse configuration significantly, so it seems 
HTML and XHTML and XML should all have gone through the HtmlParser (I'm not 
100% sure about this, and will try to check).
{quote}

Yes, if you could check, I'd be interested.  I think the default behavior would 
be to send XML through the DcXMLParser, which is far stricter than the default 
HtmlParser.  You can see from our choice in tika-server, though, that at least 
one dev prefers to have our HtmlParser handle XML. :)

Thank you, again!


> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, general, server
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
