As a warning, though, Common Crawl truncates files at 1 MB, so the corpus contains a bunch of truncated files. We refetched some of them and put those under commoncrawl3_refetched.
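One rough way to spot candidates for truncation in a local copy is to look for files whose size sits exactly at the cutoff. The sketch below is only that, a sketch: it assumes "1MB" means 1 MiB (1,048,576 bytes) and that a truncated file is stored at exactly that size. The function name and the constant are made up for illustration and are not part of any corpus tooling.

```python
import os
from pathlib import Path

# Assumption: "1MB" means 1 MiB; pass a different limit if the real cutoff differs.
ASSUMED_TRUNCATION_BYTES = 1024 * 1024


def find_possibly_truncated(root, limit=ASSUMED_TRUNCATION_BYTES):
    """Return sorted paths under `root` whose size equals `limit` exactly."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # A file sitting exactly at the cutoff was probably cut there.
            if path.is_file() and path.stat().st_size == limit:
                hits.append(path)
    return sorted(hits)
```

A file that just happens to be exactly that size would show up as a false positive, so treat the result as a list of candidates to check rather than a definitive answer.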
On Tue, Jul 26, 2022 at 2:58 PM Tim Allison <talli...@apache.org> wrote:
> We have ~1.9TB. But I'd skip cc_large because that's just a copy of some
> directories under commoncrawl3_refetched.
>
> If you want to pull fresher data out of CommonCrawl, I have undocumented
> scripts to do that. I could add documentation.
>
> These are the top 100 mime types and counts. This db was generated on a
> slightly earlier version of the corpus/corpora, but it should be close
> enough.
>
> MIME_STRING cnt
> application/pdf 768490
> text/plain 472041
> text/html 429707
> application/x-tika-msoffice 297990
> image/png 190815
> application/octet-stream 190645
> image/jpeg 179533
> application/xhtml+xml 151830
> application/x-bzip2 124204
> application/x-tika-ooxml 122523
> application/x-bzip 107435
> application/xml 107003
> application/zip 93467
> application/x-sh 88712
> application/gzip 73535
> image/gif 66713
> application/zlib 46483
> text/calendar 40385
> application/postscript 35526
> application/rss+xml 34428
> application/atom+xml 28950
> multipart/appledouble 27602
> image/svg+xml 25771
> application/vnd.oasis.opendocument.text 25753
> application/rdf+xml 24890
> application/vnd.google-earth.kml+xml 24049
> application/rtf 23915
> application/x-matroska 19437
> application/x-shockwave-flash 18879
> video/quicktime 18546
> application/epub+zip 18205
> application/vnd.ms-excel 17465
> application/x-xz 16869
> text/x-vcard 16772
> application/java-vm 16761
> audio/mpeg 15534
> message/rfc822 14405
> application/vnd.oasis.opendocument.spreadsheet 12659
> application/x-bibtex-text-file 12261
> application/x-rar-compressed; version=4 12123
> text/x-php 10870
> text/x-diff 10080
> video/mp4 8281
> audio/mp4 8221
> application/x-msdownload 8019
> application/x-bittorrent 7964
> image/vnd.microsoft.icon 7382
> application/mbox 6799
> application/x-x509-cert; format=der 6597
> audio/vnd.wave 6550
> image/bmp 6411
> application/x-endnote-refer 5922
> image/vnd.djvu 5874
> text/x-matlab 5734
> application/vnd.apple.mpegurl 5511
> image/tiff 5430
> image/webp 4972
> application/vnd.oasis.opendocument.presentation 3989
> text/x-jsp 3973
> text/x-csrc 3555
> video/x-ms-wmv 3453
> video/x-m4v 3443
> application/x-dbf 3381
> text/x-chdr 3263
> text/x-perl 3124
> application/x-rpm 3023
> application/x-mobipocket-ebook 2726
> audio/midi 2697
> application/vnd.oasis.opendocument.graphics 2675
> application/vnd.ms-excel.sheet.4 2591
> application/x-font-ttf 2575
> application/xspf+xml 2557
> text/x-python 2416
> audio/vorbis 2354
> application/msword 2223
> application/ogg 2222
> application/x-gtar 2181
> audio/x-mpegurl 2067
> video/x-flv 1969
> audio/x-ms-wma 1874
> image/icns 1857
> application/x-object 1823
> application/x-7z-compressed 1795
> application/x-msdownload; format=pe32 1784
> application/x-debian-package 1700
> application/x-mysql-table-definition 1669
> image/vnd.dxf; format=ascii 1664
> application/x-sqlite3 1606
> application/x-berkeley-db; format=hash 1457
> application/x-executable 1455
> video/mpeg 1366
> application/pkcs7-signature 1359
> application/x-ms-asx 1266
> image/vnd.zbrush.pcx 1247
> image/vnd.dwg 1243
> application/fits 1217
> application/xslfo+xml 1206
> application/x-sharedlib 1185
> audio/prs.sid 1173
> text/x-vcalendar 1156
>
> On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr <oscar.rieke...@cofense.com> wrote:
>
>> We were thinking something around 2TB of data with a good mix of excel,
>> images, pdfs, text and powerpoints. So I guess a mix of everything.
>>
>> *From: *Tim Allison <talli...@apache.org>
>> *Date: *Tuesday, July 26, 2022 at 9:19 AM
>> *To: *u...@tika.apache.org <u...@tika.apache.org>
>> *Cc: *Oscar Rieken Jr <oscar.rieke...@cofense.com>,
>> corpora-dev@tika.apache.org <corpora-dev@tika.apache.org>
>> *Subject: *Re: Datasets for testing large number of attachments
>>
>> External Email
>>
>> What Nick said...
>>
>> cc_large is a sample of some of the larger documents from
>> commoncrawl3_refetched.
>>
>> If you want to give your pipeline a workout, I also recommend using the
>> MockParser that is available in the tika-core tests jar. That allows you
>> to instrument an OOM and timeouts and system exits and all sorts of other
>> mayhem. Obv, don't put the tika-core tests jar on your class path in
>> production. See the files in
>> https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/mock
>> for examples of how to trigger bad behavior with the MockParser.
>>
>> On the corpora, as Nick said, let us know what you want and we can help
>> you select files.
>>
>> Cheers,
>>
>> Tim
>>
>> On Tue, Jul 26, 2022 at 7:06 AM Nick Burch <apa...@gagravarr.org> wrote:
>>
>> On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
>> > I am currently trying to validate our Tika setup and was looking for a
>> > set of example data I could use
>>
>> If you want a small number of files of lots of different types, the test
>> files in the Tika source tree will work. Main set are in
>> tika-parsers/src/test/resources/test-documents/
>>
>> If you want a very large number of files, then the Tika Corpora collection
>> is a good source. We have a few different collections, including stuff
>> from common crawl, govdocs and bug trackers. If you can let us know what
>> sort of file types and how many, we can suggest the best corpora collection
>>
>> Nick
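
For anyone who wants to work with the MIME table quoted earlier in this thread: each row is a type followed by a count, and some types contain spaces (for example `application/x-rar-compressed; version=4`), so the count has to be split off from the right. A minimal Python sketch follows, using a handful of rows copied from the table. The function names and the grouping by top-level type are illustrative assumptions, not part of any Tika tooling.

```python
from collections import Counter

# A few rows copied from the quoted table, header included.
SAMPLE_ROWS = """\
MIME_STRING cnt
application/pdf 768490
text/plain 472041
image/png 190815
image/jpeg 179533
application/x-rar-compressed; version=4 12123
"""


def parse_mime_counts(text):
    """Map each MIME string to its count, skipping the header and blank lines."""
    counts = {}
    for line in text.splitlines():
        # Tolerate leading email quote markers such as "> " or ">> ".
        line = line.strip().lstrip(">").strip()
        if not line or line.startswith("MIME_STRING"):
            continue
        # Split on the last run of whitespace so "; version=4" stays in the type.
        mime, cnt = line.rsplit(None, 1)
        counts[mime] = int(cnt)
    return counts


def totals_by_top_level(counts):
    """Sum counts per top-level type, e.g. 'image' for 'image/png'."""
    totals = Counter()
    for mime, cnt in counts.items():
        totals[mime.split("/", 1)[0]] += cnt
    return totals
```

Grouping by top-level type is only a first cut for picking a "mix of everything": container types such as application/x-tika-msoffice and application/x-tika-ooxml would still need to be broken down further to tell Excel from PowerPoint.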