As a warning, though, Common Crawl truncates files at 1MB, so we have a bunch
of truncated files.  We refetched some of them and put those under
commoncrawl3_refetched.
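
If it helps to spot the likely-truncated files before running them through a
pipeline, here is a minimal sketch.  It assumes the 1MB limit means exactly
1 MiB (1,048,576 bytes) and that a truncated file is left at exactly that
size; both are assumptions, not something confirmed in this thread.  The
function name and the root path are hypothetical.  A file can legitimately
be exactly that size, so treat hits as candidates only.

```python
import os

# Assumption: the "1MB" truncation limit is exactly 1 MiB.
ONE_MIB = 1024 * 1024


def find_possibly_truncated(root, limit=ONE_MIB):
    """Return sorted paths under `root` whose size is exactly `limit` bytes.

    This is only a size heuristic: it flags candidates for truncation,
    it does not prove that a file was cut off.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) == limit:
                    hits.append(path)
            except OSError:
                # Skip files that vanish or cannot be stat'ed.
                continue
    return sorted(hits)
```

Usage would be something like find_possibly_truncated("/path/to/corpus"),
where the path is a placeholder for wherever the corpus was downloaded.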

On Tue, Jul 26, 2022 at 2:58 PM Tim Allison <talli...@apache.org> wrote:

> We have ~1.9TB.  But I'd skip cc_large because that's just a copy of some
> directories under commoncrawl3_refetched.
>
> If you want to pull fresher data out of CommonCrawl, I have undocumented
> scripts to do that.  I could add documentation.
>
> These are the top 100 mime types and counts.  This db was generated on a
> slightly earlier version of the corpus/corpora, but it should be close
> enough.
>
> MIME_STRING    cnt
> application/pdf    768490
> text/plain    472041
> text/html    429707
> application/x-tika-msoffice    297990
> image/png    190815
> application/octet-stream    190645
> image/jpeg    179533
> application/xhtml+xml    151830
> application/x-bzip2    124204
> application/x-tika-ooxml    122523
> application/x-bzip    107435
> application/xml    107003
> application/zip    93467
> application/x-sh    88712
> application/gzip    73535
> image/gif    66713
> application/zlib    46483
> text/calendar    40385
> application/postscript    35526
> application/rss+xml    34428
> application/atom+xml    28950
> multipart/appledouble    27602
> image/svg+xml    25771
> application/vnd.oasis.opendocument.text    25753
> application/rdf+xml    24890
> application/vnd.google-earth.kml+xml    24049
> application/rtf    23915
> application/x-matroska    19437
> application/x-shockwave-flash    18879
> video/quicktime    18546
> application/epub+zip    18205
> application/vnd.ms-excel    17465
> application/x-xz    16869
> text/x-vcard    16772
> application/java-vm    16761
> audio/mpeg    15534
> message/rfc822    14405
> application/vnd.oasis.opendocument.spreadsheet    12659
> application/x-bibtex-text-file    12261
> application/x-rar-compressed; version=4    12123
> text/x-php    10870
> text/x-diff    10080
> video/mp4    8281
> audio/mp4    8221
> application/x-msdownload    8019
> application/x-bittorrent    7964
> image/vnd.microsoft.icon    7382
> application/mbox    6799
> application/x-x509-cert; format=der    6597
> audio/vnd.wave    6550
> image/bmp    6411
> application/x-endnote-refer    5922
> image/vnd.djvu    5874
> text/x-matlab    5734
> application/vnd.apple.mpegurl    5511
> image/tiff    5430
> image/webp    4972
> application/vnd.oasis.opendocument.presentation    3989
> text/x-jsp    3973
> text/x-csrc    3555
> video/x-ms-wmv    3453
> video/x-m4v    3443
> application/x-dbf    3381
> text/x-chdr    3263
> text/x-perl    3124
> application/x-rpm    3023
> application/x-mobipocket-ebook    2726
> audio/midi    2697
> application/vnd.oasis.opendocument.graphics    2675
> application/vnd.ms-excel.sheet.4    2591
> application/x-font-ttf    2575
> application/xspf+xml    2557
> text/x-python    2416
> audio/vorbis    2354
> application/msword    2223
> application/ogg    2222
> application/x-gtar    2181
> audio/x-mpegurl    2067
> video/x-flv    1969
> audio/x-ms-wma    1874
> image/icns    1857
> application/x-object    1823
> application/x-7z-compressed    1795
> application/x-msdownload; format=pe32    1784
> application/x-debian-package    1700
> application/x-mysql-table-definition    1669
> image/vnd.dxf; format=ascii    1664
> application/x-sqlite3    1606
> application/x-berkeley-db; format=hash    1457
> application/x-executable    1455
> video/mpeg    1366
> application/pkcs7-signature    1359
> application/x-ms-asx    1266
> image/vnd.zbrush.pcx    1247
> image/vnd.dwg    1243
> application/fits    1217
> application/xslfo+xml    1206
> application/x-sharedlib    1185
> audio/prs.sid    1173
> text/x-vcalendar    1156
>
>
>
>
> On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr <
> oscar.rieke...@cofense.com> wrote:
>
>> We were thinking something around 2TB of data with a good mix of Excel,
>> images, PDFs, text and PowerPoints. So I guess a mix of everything.
>>
>>
>>
>> *From: *Tim Allison <talli...@apache.org>
>> *Date: *Tuesday, July 26, 2022 at 9:19 AM
>> *To: *u...@tika.apache.org <u...@tika.apache.org>
>> *Cc: *Oscar Rieken Jr <oscar.rieke...@cofense.com>,
>> corpora-dev@tika.apache.org <corpora-dev@tika.apache.org>
>> *Subject: *Re: Datasets for testing large number of attachments
>>
>> What Nick said...
>>
>>
>>
>> cc_large is a sample of some of the larger documents from
>> commoncrawl3_refetched.
>>
>>
>>
>> If you want to give your pipeline a workout, I also recommend using the
>> MockParser that is available in the tika-core tests jar.  That allows you
>> to instrument an OOM and timeouts and system exits and all sorts of other
>> mayhem.  Obviously, don't put the tika-core tests jar on your class path in
>> production.  See the files in
>> https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock
>> for examples of how to trigger bad behavior with the MockParser.
>>
>>
>>
>> On the corpora, as Nick said, let us know what you want and we can help
>> you select files.
>>
>>
>>
>> Cheers,
>>
>>
>>
>>         Tim
>>
>>
>>
>>
>>
>> On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <apa...@gagravarr.org> wrote:
>>
>> On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
>> > I am currently trying to validate our Tika setup and was looking for a
>> > set of example data I could use
>>
>> If you want a small number of files of lots of different types, the test
>> files in the Tika source tree will work. Main set are in
>> tika-parsers/src/test/resources/test-documents/
>>
>> If you want a very large number of files, then the Tika Corpora collection
>> is a good source. We have a few different collections, including stuff
>> from Common Crawl, govdocs and bug trackers. If you can let us know what
>> sort of file types and how many, we can suggest the best corpora
>> collection.
>>
>> Nick
>>
>>
