We were thinking something around 2TB of data with a good mix of Excel, images,
PDFs, text, and PowerPoints. So I guess a mix of everything.

From: Tim Allison <talli...@apache.org>
Date: Tuesday, July 26, 2022 at 9:19 AM
To: u...@tika.apache.org <u...@tika.apache.org>
Cc: Oscar Rieken Jr <oscar.rieke...@cofense.com>, corpora-dev@tika.apache.org 
<corpora-dev@tika.apache.org>
Subject: Re: Datasets for testing large number of attachments
What Nick said...

cc_large is a sample of some of the larger documents from 
commoncrawl3_refetched.

If you want to give your pipeline a workout, I also recommend using the
MockParser that is available in the tika-core tests jar.  That allows you to
simulate OOMs, timeouts, system exits, and all sorts of other mayhem.
Obviously, don't put the tika-core tests jar on your classpath in production.
See the files in
https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock
 for examples of how to trigger bad behavior with the MockParser.
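For instance, a mock document along these lines tells the MockParser to emit
a bit of text and then throw an OutOfMemoryError. This is a sketch from memory
of the files in that directory; treat the element names as illustrative and
check the linked examples for the authoritative format:

    <?xml version="1.0" encoding="UTF-8"?>
    <mock>
      <!-- illustrative only: verify element names against the linked
           test-documents/mock files -->
      <write element="p">some normal text before the mayhem</write>
      <throw class="java.lang.OutOfMemoryError">not another oom</throw>
    </mock>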

On the corpora, as Nick said, let us know what you want and we can help you 
select files.

Cheers,

        Tim


On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <apa...@gagravarr.org> wrote:
On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> I am currently trying to validate our Tika setup and was looking for a
> set of example data I could use

If you want a small number of files of lots of different types, the test
files in the Tika source tree will work. The main set is in
tika-parsers/src/test/resources/test-documents/
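If the goal is to validate an end-to-end setup against those files, a minimal
smoke test along these lines walks a directory and runs every file through
AutoDetectParser. This is just a hedged sketch, not from the thread: the class
name is made up, and the directory argument and the -1 (unlimited) write limit
are choices you may want to adjust.

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaSmokeTest {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            // e.g. tika-parsers/src/test/resources/test-documents/
            Path dir = Paths.get(args[0]);
            try (Stream<Path> files = Files.walk(dir)) {
                files.filter(Files::isRegularFile).forEach(p -> {
                    try (InputStream is = Files.newInputStream(p)) {
                        // -1 disables BodyContentHandler's default write limit
                        BodyContentHandler handler = new BodyContentHandler(-1);
                        Metadata metadata = new Metadata();
                        parser.parse(is, handler, metadata);
                        System.out.println(p + " -> "
                                + metadata.get(Metadata.CONTENT_TYPE));
                    } catch (Exception e) {
                        System.err.println(p + " FAILED: " + e);
                    }
                });
            }
        }
    }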

If you want a very large number of files, then the Tika Corpora collection
is a good source. We have a few different collections, including stuff
from Common Crawl, Govdocs and bug trackers. If you can let us know what
sort of file types and how many you need, we can suggest the best corpora
collection.

Nick
