[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226415#comment-14226415
 ] 

Andrew Jackson commented on TIKA-1302:
--------------------------------------

We have two more sets of data. The first is like the 1996-2010 material, but 
covers 2010 to April 2013, and a copy of each item can generally be accessed 
via the Internet Archive. We plan to extend our indexing to the entire 
1996-2013 dataset soon, but in reality it's going to be a few months yet due to 
technical difficulties and other priorities. The second set runs from 2013 
onwards and, due to the legal constraints on that material, cannot be made 
available. However, for the next year or two, most of it will still be 
available on the live web, so that's the fallback option. That material has 
been indexed (although with an older Tika version), but we're going to re-index 
it shortly, so we should also be able to make that available. (N.B. 
'shortly' still means weeks or months!)

Both of these data sets are large and contain more large files. There were c. 2 
billion resources in the 1996-2010 chunk, there are 1.5-2 billion in the 
2010-2013 chunk, and there have been over 2 billion per year since then. In 
contrast to the early material, we do not limit the size per resource, so that 
should be interesting.

However, it would be good to run against a broader range of material, given 
that I stop Tika from recursively processing ZIPs etc. and that web archives 
are rather weak on A/V files, systems files, software, etc. I'm not aware of a 
good A/V corpus, but on the systems and software side, there are the system 
images [also held at digitalcorpora.org|http://digitalcorpora.org/] and the 
[various files used by a Red Hat dev to regression test the 'file' 
command|https://fedorahosted.org/file-tests/]. There is also [this small corpus 
of example files|https://github.com/openpreserve/format-corpus] that I have 
been contributing to lately, the [evolt browser 
archive|http://browsers.evolt.org/] and the [disktype filesystem image 
samples|http://disktype.cvs.sourceforge.net/viewvc/disktype/file-system-sampler/].

> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, general, server
>            Reporter: Tim Allison
>         Attachments: wayback_exception_summaries.xlsx
>
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
