We have ~1.9TB, but I'd skip cc_large: it's just a copy of some
directories under commoncrawl3_refetched.

If you want to pull fresher data out of CommonCrawl, I have undocumented
scripts to do that.  I could add documentation.
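
As a rough illustration only (this is not the undocumented scripts mentioned
above), pulling a single record out of CommonCrawl usually comes down to
looking a URL up in the public URL index and then issuing an HTTP Range
request for that record's bytes. The sketch below only builds the two
requests. The crawl id, the helper names, and the sample record are
assumptions for illustration; the "filename", "offset" and "length" fields
are what the index returns in its JSON output.

```python
from urllib.parse import urlencode

# Public endpoints; the crawl id passed in below is only an example.
INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"


def index_query_url(crawl_id, url_pattern):
    """Build the URL-index query for one crawl (hypothetical helper)."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def range_request(record):
    """Turn one index record into (url, headers) for a ranged GET.

    The index reports offset and length as strings; the Range header
    takes an inclusive end byte, hence the "- 1".
    """
    offset = int(record["offset"])
    length = int(record["length"])
    url = f"{DATA_HOST}/{record['filename']}"
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers


# Made-up record, shaped like one line of the index's JSON output.
sample = {"filename": "crawl-data/example.warc.gz", "offset": "100", "length": "50"}
query = index_query_url("CC-MAIN-2022-27", "example.com/*")
url, headers = range_request(sample)
```

The bytes that come back are a gzipped WARC record, so the payload still has
to be unpacked before it can be handed to Tika.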

Below are the top 100 mime types and their counts.  The db they come from
was generated on a slightly earlier version of the corpora, but it should
be close enough.

MIME_STRING    cnt
application/pdf    768490
text/plain    472041
text/html    429707
application/x-tika-msoffice    297990
image/png    190815
application/octet-stream    190645
image/jpeg    179533
application/xhtml+xml    151830
application/x-bzip2    124204
application/x-tika-ooxml    122523
application/x-bzip    107435
application/xml    107003
application/zip    93467
application/x-sh    88712
application/gzip    73535
image/gif    66713
application/zlib    46483
text/calendar    40385
application/postscript    35526
application/rss+xml    34428
application/atom+xml    28950
multipart/appledouble    27602
image/svg+xml    25771
application/vnd.oasis.opendocument.text    25753
application/rdf+xml    24890
application/vnd.google-earth.kml+xml    24049
application/rtf    23915
application/x-matroska    19437
application/x-shockwave-flash    18879
video/quicktime    18546
application/epub+zip    18205
application/vnd.ms-excel    17465
application/x-xz    16869
text/x-vcard    16772
application/java-vm    16761
audio/mpeg    15534
message/rfc822    14405
application/vnd.oasis.opendocument.spreadsheet    12659
application/x-bibtex-text-file    12261
application/x-rar-compressed; version=4    12123
text/x-php    10870
text/x-diff    10080
video/mp4    8281
audio/mp4    8221
application/x-msdownload    8019
application/x-bittorrent    7964
image/vnd.microsoft.icon    7382
application/mbox    6799
application/x-x509-cert; format=der    6597
audio/vnd.wave    6550
image/bmp    6411
application/x-endnote-refer    5922
image/vnd.djvu    5874
text/x-matlab    5734
application/vnd.apple.mpegurl    5511
image/tiff    5430
image/webp    4972
application/vnd.oasis.opendocument.presentation    3989
text/x-jsp    3973
text/x-csrc    3555
video/x-ms-wmv    3453
video/x-m4v    3443
application/x-dbf    3381
text/x-chdr    3263
text/x-perl    3124
application/x-rpm    3023
application/x-mobipocket-ebook    2726
audio/midi    2697
application/vnd.oasis.opendocument.graphics    2675
application/vnd.ms-excel.sheet.4    2591
application/x-font-ttf    2575
application/xspf+xml    2557
text/x-python    2416
audio/vorbis    2354
application/msword    2223
application/ogg    2222
application/x-gtar    2181
audio/x-mpegurl    2067
video/x-flv    1969
audio/x-ms-wma    1874
image/icns    1857
application/x-object    1823
application/x-7z-compressed    1795
application/x-msdownload; format=pe32    1784
application/x-debian-package    1700
application/x-mysql-table-definition    1669
image/vnd.dxf; format=ascii    1664
application/x-sqlite3    1606
application/x-berkeley-db; format=hash    1457
application/x-executable    1455
video/mpeg    1366
application/pkcs7-signature    1359
application/x-ms-asx    1266
image/vnd.zbrush.pcx    1247
image/vnd.dwg    1243
application/fits    1217
application/xslfo+xml    1206
application/x-sharedlib    1185
audio/prs.sid    1173
text/x-vcalendar    1156
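
To get a feel for the mix, the listing above can be parsed and rolled up by
top-level type. This is a minimal sketch, assuming the listing is pasted in
as whitespace-delimited text; only a handful of rows are copied here, and
the function names are made up for illustration.

```python
from collections import Counter

# A few rows copied from the listing above; paste the full listing
# to cover all 100 types.
LISTING = """\
MIME_STRING    cnt
application/pdf    768490
text/plain    472041
text/html    429707
image/png    190815
image/jpeg    179533
application/x-rar-compressed; version=4    12123
"""


def parse_counts(listing):
    """Map each mime string to its count, skipping the header row."""
    counts = {}
    for line in listing.splitlines():
        line = line.strip()
        if not line or line.startswith("MIME_STRING"):
            continue
        # Split on the last run of whitespace so that mime strings with
        # parameters (e.g. "; version=4") stay in one piece.
        mime, cnt = line.rsplit(None, 1)
        counts[mime] = int(cnt)
    return counts


def by_top_level(counts):
    """Sum counts per top-level type (application, text, image, ...)."""
    totals = Counter()
    for mime, cnt in counts.items():
        totals[mime.split("/", 1)[0]] += cnt
    return totals


counts = parse_counts(LISTING)
totals = by_top_level(counts)
for top, cnt in totals.most_common():
    print(f"{top}\t{cnt}")
```

Note that container types such as application/x-tika-msoffice and
application/x-tika-ooxml lump several Office formats together, so the
roll-up says nothing about how many of those files are Excel vs. PowerPoint.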




On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr <oscar.rieke...@cofense.com>
wrote:

> We were thinking something around 2TB of data with a good mix of excel,
> images, pdfs, text and powerpoints. So I guess a mix of everything.
>
>
>
> *From: *Tim Allison <talli...@apache.org>
> *Date: *Tuesday, July 26, 2022 at 9:19 AM
> *To: *u...@tika.apache.org <u...@tika.apache.org>
> *Cc: *Oscar Rieken Jr <oscar.rieke...@cofense.com>,
> corpora-dev@tika.apache.org <corpora-dev@tika.apache.org>
> *Subject: *Re: Datasets for testing large number of attachments
>
> What Nick said...
>
>
>
> cc_large is a sample of some of the larger documents from
> commoncrawl3_refetched.
>
>
>
> If you want to give your pipeline a workout, I also recommend using the
> MockParser that is available in the tika-core tests jar.  That allows you
> to instrument an OOM and timeouts and system exits and all sorts of other
> mayhem.  Obv, don't put the tika-core tests jar on your class path in
> production.  See the files in
> https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock
> for examples of how to trigger bad behavior with the MockParser.
>
>
>
> On the corpora, as Nick said, let us know what you want and we can help
> you select files.
>
>
>
> Cheers,
>
>
>
>         Tim
>
>
>
>
>
> On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <apa...@gagravarr.org> wrote:
>
> On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> > I am currently trying to validate our Tika setup and was looking for a
> > set of example data I could use
>
> If you want a small number of files of lots of different types, the test
> files in the Tika source tree will work. Main set are in
> tika-parsers/src/test/resources/test-documents/
>
> If you want a very large number of files, then the Tika Corpora collection
> is a good source. We have a few different collections, including stuff
> from common crawl, govdocs and bug trackers. If you can let us know what
> sort of file types and how many, we can suggest the best corpora
> collection
>
> Nick
>
>
