Awesome, thanks, I'll give this a shot!

From: Nicholas DiPiazza <nicholas.dipia...@gmail.com>
Date: Tuesday, July 26, 2022 at 3:13 PM
To: u...@tika.apache.org <u...@tika.apache.org>, talli...@apache.org <talli...@apache.org>
Cc: Oscar Rieken Jr <oscar.rieke...@cofense.com>, corpora-dev@tika.apache.org <corpora-dev@tika.apache.org>
Subject: Re: Datasets for testing large number of attachments

Script I used back in the day to do what you are looking for:
#!/bin/bash
# Download and unpack the govdocs1 zip files 002.zip through 999.zip.
for i in $(seq -f "%03g" 2 999)
do
    wget "http://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles/$i.zip" -O "$i.zip"
    unzip "$i.zip"
    rm "$i.zip"
done

Not sure if it still works.

On Tue, Jul 26, 2022 at 1:59 PM Tim Allison <talli...@apache.org> wrote:

As a warning, though, Common Crawl truncates files at 1MB, so we have a bunch of truncated files. We refetched some and put those under commoncrawl3_refetched.

On Tue, Jul 26, 2022 at 2:58 PM Tim Allison <talli...@apache.org> wrote:

We have ~1.9TB. But I'd skip cc_large because that's just a copy of some directories under commoncrawl3_refetched.

If you want to pull fresher data out of CommonCrawl, I have undocumented scripts to do that. I could add documentation.

These are the top 100 mime types and counts. This db was generated on a slightly earlier version of the corpus/corpora, but it should be close enough.

MIME_STRING cnt
application/pdf 768490
text/plain 472041
text/html 429707
application/x-tika-msoffice 297990
image/png 190815
application/octet-stream 190645
image/jpeg 179533
application/xhtml+xml 151830
application/x-bzip2 124204
application/x-tika-ooxml 122523
application/x-bzip 107435
application/xml 107003
application/zip 93467
application/x-sh 88712
application/gzip 73535
image/gif 66713
application/zlib 46483
text/calendar 40385
application/postscript 35526
application/rss+xml 34428
application/atom+xml 28950
multipart/appledouble 27602
image/svg+xml 25771
application/vnd.oasis.opendocument.text 25753
application/rdf+xml 24890
application/vnd.google-earth.kml+xml 24049
application/rtf 23915
application/x-matroska 19437
application/x-shockwave-flash 18879
video/quicktime 18546
application/epub+zip 18205
application/vnd.ms-excel 17465
application/x-xz 16869
text/x-vcard 16772
application/java-vm 16761
audio/mpeg 15534
message/rfc822 14405
application/vnd.oasis.opendocument.spreadsheet 12659
application/x-bibtex-text-file 12261
application/x-rar-compressed; version=4 12123
text/x-php 10870
text/x-diff 10080
video/mp4 8281
audio/mp4 8221
application/x-msdownload 8019
application/x-bittorrent 7964
image/vnd.microsoft.icon 7382
application/mbox 6799
application/x-x509-cert; format=der 6597
audio/vnd.wave 6550
image/bmp 6411
application/x-endnote-refer 5922
image/vnd.djvu 5874
text/x-matlab 5734
application/vnd.apple.mpegurl 5511
image/tiff 5430
image/webp 4972
application/vnd.oasis.opendocument.presentation 3989
text/x-jsp 3973
text/x-csrc 3555
video/x-ms-wmv 3453
video/x-m4v 3443
application/x-dbf 3381
text/x-chdr 3263
text/x-perl 3124
application/x-rpm 3023
application/x-mobipocket-ebook 2726
audio/midi 2697
application/vnd.oasis.opendocument.graphics 2675
application/vnd.ms-excel.sheet.4 2591
application/x-font-ttf 2575
application/xspf+xml 2557
text/x-python 2416
audio/vorbis 2354
application/msword 2223
application/ogg 2222
application/x-gtar 2181
audio/x-mpegurl 2067
video/x-flv 1969
audio/x-ms-wma 1874
image/icns 1857
application/x-object 1823
application/x-7z-compressed 1795
application/x-msdownload; format=pe32 1784
application/x-debian-package 1700
application/x-mysql-table-definition 1669
image/vnd.dxf; format=ascii 1664
application/x-sqlite3 1606
application/x-berkeley-db; format=hash 1457
application/x-executable 1455
video/mpeg 1366
application/pkcs7-signature 1359
application/x-ms-asx 1266
image/vnd.zbrush.pcx 1247
image/vnd.dwg 1243
application/fits 1217
application/xslfo+xml 1206
application/x-sharedlib 1185
audio/prs.sid 1173
text/x-vcalendar 1156

On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr <oscar.rieke...@cofense.com> wrote:

We were thinking something around 2TB of data with a good mix of Excel, images, PDFs, text and PowerPoints. So I guess a mix of everything.

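To judge whether the corpus has enough of a given family of types for a mix like the one requested, the MIME counts listed in this thread can be tallied per pattern. The following is only a minimal sketch: the function name `sum_mime` and the handful of sample rows are illustrative assumptions, and the count is taken from the last field because some MIME strings contain spaces.

```shell
#!/bin/bash
# Sum the counts of all rows whose text matches a regular expression.
# Input on stdin: "MIME_STRING cnt" rows; the count is the last field.
# sum_mime is a hypothetical helper name, not part of Tika.
sum_mime() {
  local pattern=$1
  awk -v pat="$pattern" '
    $NF ~ /^[0-9]+$/ && $0 ~ pat { total += $NF }   # skips the header row
    END { print total + 0 }
  '
}

# A few rows copied from the MIME table in this thread.
rows='MIME_STRING cnt
application/pdf 768490
image/png 190815
image/jpeg 179533
image/gif 66713
application/x-rar-compressed; version=4 12123'

# Total for the image types in the sample rows.
printf '%s\n' "$rows" | sum_mime '^image/'
```

Feeding the full table through the same function with patterns such as `pdf` or `ms-excel` gives a rough per-family count. Note that these are file counts, not sizes.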
From: Tim Allison <talli...@apache.org>
Date: Tuesday, July 26, 2022 at 9:19 AM
To: u...@tika.apache.org <u...@tika.apache.org>
Cc: Oscar Rieken Jr <oscar.rieke...@cofense.com>, corpora-dev@tika.apache.org <corpora-dev@tika.apache.org>
Subject: Re: Datasets for testing large number of attachments

What Nick said... cc_large is a sample of some of the larger documents from commoncrawl3_refetched.

If you want to give your pipeline a workout, I also recommend using the MockParser that is available in the tika-core tests jar. That allows you to instrument an OOM and timeouts and system exits and all sorts of other mayhem. Obviously, don't put the tika-core tests jar on your class path in production.

See the files in https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock for examples of how to trigger bad behavior with the MockParser.

On the corpora, as Nick said, let us know what you want and we can help you select files.

Cheers,
Tim

On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <apa...@gagravarr.org> wrote:

On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> I am currently trying to validate our Tika setup and was looking for a
> set of example data I could use

If you want a small number of files of lots of different types, the test files in the Tika source tree will work. The main set is in tika-parsers/src/test/resources/test-documents/

If you want a very large number of files, then the Tika Corpora collection is a good source. We have a few different collections, including stuff from common crawl, govdocs and bug trackers. If you can let us know what sort of file types and how many, we can suggest the best corpora collection.

Nick
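
When deciding which file types and how many to ask for, it can help to see what a local directory (a checkout of the test documents, or an already downloaded corpus slice) contains. The following is a rough sketch that tallies files by extension. The function name `count_by_extension` is hypothetical, and an extension is only a loose stand-in for the MIME type that Tika would actually detect.

```shell
#!/bin/bash
# Print "extension count" pairs for the files under a directory,
# most common first. Extensions are lowercased, and files with no dot
# in their name are skipped. count_by_extension is a hypothetical helper.
count_by_extension() {
  local dir=$1
  find "$dir" -type f -name '*.*' \
    | awk -F. '{ n[tolower($NF)]++ } END { for (e in n) print e, n[e] }' \
    | sort -k2,2nr -k1,1
}

# Defaults to the current directory; pass a path to inspect another tree.
count_by_extension "${1:-.}"
```

For example, running the script with the path of a local test-documents directory as its first argument prints one line per extension with its file count.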