8 million PDFs (8 TB) from one month of Common Crawl. We refetched the ~2 million files that were truncated.
Zips of the PDFs are available here:
https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/

A writeup by Peter Wyatt (PDF Association) is here:
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/
