Re: Datasets for testing large number of attachments

Oscar Rieken Jr Tue, 26 Jul 2022 12:50:18 -0700

Awesome, thanks, I'll give this a shot!

From: Nicholas DiPiazza <nicholas.dipia...@gmail.com>
Date: Tuesday, July 26, 2022 at 3:13 PM
To: u...@tika.apache.org <u...@tika.apache.org>, talli...@apache.org 
<talli...@apache.org>
Cc: Oscar Rieken Jr <oscar.rieke...@cofense.com>, corpora-dev@tika.apache.org 
<corpora-dev@tika.apache.org>
Subject: Re: Datasets for testing large number of attachments
Script I used back in the day to do what you are looking for:

#!/bin/bash
# Fetch govdocs1 zip archives 002 through 999, unpack each, then delete the zip.
for i in $(seq -f "%03g" 2 999)
do
  wget "http://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles/$i.zip" -O "$i.zip"
  unzip "$i.zip"
  rm "$i.zip"
done

Not sure if it still works.
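
Since the download location may have changed, it can help to list the URLs first and probe a few before starting a long download. The following is a minimal sketch, not part of the original script: the function name govdocs_urls is made up for illustration, and the base URL is simply the one quoted above, which is assumed (not verified) to still be valid.

```bash
#!/bin/bash
# ASSUMPTION: base URL copied from the script above; it may have moved.
BASE_URL="http://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles"

# Print one govdocs1 zip URL per line for archive numbers START..END.
# govdocs_urls is a hypothetical helper name, not an existing tool.
govdocs_urls() {
  local start="$1" end="$2" i
  # 10# forces base 10 so that arguments like 008 are not read as octal.
  for ((i = 10#$start; i <= 10#$end; i++)); do
    printf '%s/%03d.zip\n' "$BASE_URL" "$i"
  done
}
```

One way to use it is to pipe a small range into a loop that runs `wget --spider -q "$url"` on each line, which checks that the file exists without downloading it.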

On Tue, Jul 26, 2022 at 1:59 PM Tim Allison <talli...@apache.org> wrote:
As a warning, though, Common Crawl truncates files at 1MB, so we have a bunch of 
truncated files.  We refetched some and put those under commoncrawl3_refetched.
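
A rough way to spot files that were probably cut off is to look for files whose size is exactly the truncation limit. This is only a heuristic sketch: it assumes "1MB" means 1048576 bytes and that a truncated file ends up at exactly that size, neither of which is confirmed in this thread. The function name find_possibly_truncated is made up for illustration.

```bash
#!/bin/bash
# List regular files under DIR whose size is exactly LIMIT bytes.
# ASSUMPTION: truncated files are exactly 1048576 bytes (1 MiB); pass a
# different LIMIT as the second argument if the real cutoff differs.
find_possibly_truncated() {
  local dir="$1" limit="${2:-1048576}"
  # The "c" suffix tells find to compare the size in bytes.
  find "$dir" -type f -size "${limit}c"
}
```

For example, `find_possibly_truncated ./commoncrawl3` (a hypothetical local directory) prints the candidates. A file that is legitimately exactly that size would be a false positive.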

On Tue, Jul 26, 2022 at 2:58 PM Tim Allison <talli...@apache.org> wrote:
We have ~1.9TB.  But I'd skip cc_large because that's just a copy of some 
directories under commoncrawl3_refetched.

If you want to pull fresher data out of CommonCrawl, I have undocumented 
scripts to do that.  I could add documentation.

These are the top 100 mime types and counts.  This db was generated on a 
slightly earlier version of the corpus/corpora, but it should be close enough.

MIME_STRING    cnt
application/pdf    768490
text/plain    472041
text/html    429707
application/x-tika-msoffice    297990
image/png    190815
application/octet-stream    190645
image/jpeg    179533
application/xhtml+xml    151830
application/x-bzip2    124204
application/x-tika-ooxml    122523
application/x-bzip    107435
application/xml    107003
application/zip    93467
application/x-sh    88712
application/gzip    73535
image/gif    66713
application/zlib    46483
text/calendar    40385
application/postscript    35526
application/rss+xml    34428
application/atom+xml    28950
multipart/appledouble    27602
image/svg+xml    25771
application/vnd.oasis.opendocument.text    25753
application/rdf+xml    24890
application/vnd.google-earth.kml+xml    24049
application/rtf    23915
application/x-matroska    19437
application/x-shockwave-flash    18879
video/quicktime    18546
application/epub+zip    18205
application/vnd.ms-excel    17465
application/x-xz    16869
text/x-vcard    16772
application/java-vm    16761
audio/mpeg    15534
message/rfc822    14405
application/vnd.oasis.opendocument.spreadsheet    12659
application/x-bibtex-text-file    12261
application/x-rar-compressed; version=4    12123
text/x-php    10870
text/x-diff    10080
video/mp4    8281
audio/mp4    8221
application/x-msdownload    8019
application/x-bittorrent    7964
image/vnd.microsoft.icon    7382
application/mbox    6799
application/x-x509-cert; format=der    6597
audio/vnd.wave    6550
image/bmp    6411
application/x-endnote-refer    5922
image/vnd.djvu    5874
text/x-matlab    5734
application/vnd.apple.mpegurl    5511
image/tiff    5430
image/webp    4972
application/vnd.oasis.opendocument.presentation    3989
text/x-jsp    3973
text/x-csrc    3555
video/x-ms-wmv    3453
video/x-m4v    3443
application/x-dbf    3381
text/x-chdr    3263
text/x-perl    3124
application/x-rpm    3023
application/x-mobipocket-ebook    2726
audio/midi    2697
application/vnd.oasis.opendocument.graphics    2675
application/vnd.ms-excel.sheet.4    2591
application/x-font-ttf    2575
application/xspf+xml    2557
text/x-python    2416
audio/vorbis    2354
application/msword    2223
application/ogg    2222
application/x-gtar    2181
audio/x-mpegurl    2067
video/x-flv    1969
audio/x-ms-wma    1874
image/icns    1857
application/x-object    1823
application/x-7z-compressed    1795
application/x-msdownload; format=pe32    1784
application/x-debian-package    1700
application/x-mysql-table-definition    1669
image/vnd.dxf; format=ascii    1664
application/x-sqlite3    1606
application/x-berkeley-db; format=hash    1457
application/x-executable    1455
video/mpeg    1366
application/pkcs7-signature    1359
application/x-ms-asx    1266
image/vnd.zbrush.pcx    1247
image/vnd.dwg    1243
application/fits    1217
application/xslfo+xml    1206
application/x-sharedlib    1185
audio/prs.sid    1173
text/x-vcalendar    1156
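
To get a quick feel for the overall mix, the list above can be summed by top-level type (application, text, image, and so on). This is a minimal sketch that assumes the table has been saved to a text file with the MIME type first and the count as the last whitespace-separated field on each line. The file name and the function name sum_by_major_type are made up for illustration.

```bash
#!/bin/bash
# Sum counts per top-level MIME type from a "MIME_STRING cnt" listing.
# ASSUMPTION: the count is the last field on each line; the header and any
# other line whose last field is not a plain integer are skipped.
sum_by_major_type() {
  awk '$NF ~ /^[0-9]+$/ {
         split($1, parts, "/")      # parts[1] is e.g. "application"
         total[parts[1]] += $NF
       }
       END { for (t in total) printf "%s\t%d\n", t, total[t] }' "$1" |
    sort -t "$(printf '\t')" -k2,2nr   # largest totals first
}
```

Running `sum_by_major_type mime_counts.txt` (a hypothetical file name) prints one line per top-level type, largest first. Note that this lumps office formats, archives and PDFs together under application, so it is only a coarse view.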

On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr <oscar.rieke...@cofense.com> wrote:
We were thinking of something around 2TB of data with a good mix of Excel files, 
images, PDFs, text and PowerPoint files. So I guess a mix of everything.

From: Tim Allison <talli...@apache.org>
Date: Tuesday, July 26, 2022 at 9:19 AM
To: u...@tika.apache.org <u...@tika.apache.org>
Cc: Oscar Rieken Jr <oscar.rieke...@cofense.com>, corpora-dev@tika.apache.org 
<corpora-dev@tika.apache.org>
Subject: Re: Datasets for testing large number of attachments
What Nick said...

cc_large is a sample of some of the larger documents from 
commoncrawl3_refetched.

If you want to give your pipeline a workout, I also recommend using the 
MockParser that is available in the tika-core tests jar.  That allows you to 
instrument an OOM and timeouts and system exits and all sorts of other mayhem.  
Obviously, don't put the tika-core tests jar on your classpath in production.  See 
the files in 
https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock
 for examples of how to trigger bad behavior with the MockParser.

On the corpora, as Nick said, let us know what you want and we can help you 
select files.

Cheers,

        Tim

On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <apa...@gagravarr.org> wrote:
On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> I am currently trying to validate our Tika setup and was looking for a
> set of example data I could use

If you want a small number of files of lots of different types, the test
files in the Tika source tree will work. The main set is in
tika-parsers/src/test/resources/test-documents/

If you want a very large number of files, then the Tika Corpora collection
is a good source. We have a few different collections, including stuff
from Common Crawl, govdocs and bug trackers. If you can let us know what
sort of file types you need and how many, we can suggest the best corpora
collection.

Nick
