[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226415#comment-14226415 ]
Andrew Jackson commented on TIKA-1302:
--------------------------------------

We have two more sets of data. One is the same kind of material as the 1996-2010 set, but covers 2010 to April 2013, and a copy of each item can generally be accessed via the Internet Archive. We are planning to extend our indexing to the entire 1996-2013 dataset soon, but in reality it's going to be a few months yet due to technical difficulties and other priorities.

The second set of data runs from 2013 onwards and, due to the legal constraints on that material, cannot be made available. However, for the next year or two, most of it will still be available on the live web, so that's the fallback option. That material has been indexed (although with an older Tika version), but we're going to re-index that too shortly, so we should also be able to make the results available. (n.b. 'shortly' still means weeks or months!)

Both of these data sets are large and contain more large files. There were c. 2 billion resources in the 1996-2010 chunk, there are 1.5-2 billion in the 2010-2013 chunk, and there have been over 2 billion per year since then. In contrast to the early material, we do not limit the size per resource. So that should be interesting.

However, it would be good to run against a broader range of material, given that I stop Tika from recursively processing ZIPs etc. and that web archives are rather weak on A/V files, system files, software, etc. I'm not aware of a good A/V corpus, but on the systems and software side, there are the system images [also held at digitalcorpora.org|http://digitalcorpora.org/] and the [various files used by a RedHat dev to regression test the 'file' command|https://fedorahosted.org/file-tests/].
There is also [this small corpus of example files|https://github.com/openpreserve/format-corpus] that I have been contributing to lately, the [evolt browser archive|http://browsers.evolt.org/] and the [disktype filesystem image samples|http://disktype.cvs.sourceforge.net/viewvc/disktype/file-system-sampler/].

> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, general, server
>            Reporter: Tim Allison
>         Attachments: wayback_exception_summaries.xlsx
>
>
> Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and
> running again, it might be fun to run Tika regularly against a large set of
> docs and report metrics.
> One excellent candidate corpus is govdocs1:
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?
> [~willp-bl], have anything handy you'd like to contribute?
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
> ;)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)