[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194543#comment-14194543 ]
Tim Allison commented on TIKA-1302:
-----------------------------------

[~anjackson], the google docs link is down at the moment, so I can't see the full doc. If there is any way to capture the full stacktrace so that we can compare with our govdocs1 runs, that would be fantastic. You can see our current output format comparing two versions of PDFBox over on TIKA-1442.

This is ongoing work (from my perspective), and there's no need to rush. Whichever option is easier for you...thank you for sharing!

{quote}
I don't think we changed the parse configuration significantly, so it seems HTML and XHTML and XML should all have gone through the HtmlParser (I'm not 100% sure about this, and will try to check).
{quote}

Y, if you could check, I'd be interested. I think the default behavior would be to send XML through the DcXMLParser, which is far stricter than the default HtmlParser. You can see by our choice on tika-server, though, that at least one dev prefers to have our HtmlParser handle xml. :)

Thank you, again!

> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, general, server
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and
> running again, it might be fun to run Tika regularly against a large set of
> docs and report metrics.
> One excellent candidate corpus is govdocs1:
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?
> [~willp-bl], have anything handy you'd like to contribute?
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
> ;)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)