All,

Tilman Hausherr mentioned that we might want to update the Common Crawl PDFs in our regression corpus. This proposal leaves the bug tracker PDFs as they are.
For the CC-based PDFs, we could:

1) remove the existing truncated PDFs
2) fold in newer untruncated PDFs from: https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/

What do you think?

Best,
Tim