Refreshing the common crawl-derived PDFs in our regression corpus?

Tim Allison Fri, 19 May 2023 08:25:41 -0700

All,

  Tilman Hausherr mentioned that we might want to update the
Common Crawl PDFs in our regression corpus.  This proposal leaves the
bug tracker PDFs as they are.


For the CC-based PDFs, we could:

1) remove the existing truncated PDFs

2) fold in newer untruncated PDFs from:
https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/
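
As a rough sketch of how step 1 might be approached: the snippet below
flags files that look truncated.  It is a heuristic only, and it rests on
assumptions that are not part of this proposal: that the older Common Crawl
fetches were capped at 1 MiB, and that an intact PDF has an %%EOF marker
near its end.  The names looks_truncated and find_truncated are
hypothetical.

```python
from pathlib import Path

# Assumption: truncated Common Crawl payloads are exactly 1 MiB long.
CC_TRUNCATION_LIMIT = 1024 * 1024
# How many trailing bytes to scan for the PDF end-of-file marker.
TAIL_BYTES = 2048


def looks_truncated(path: Path) -> bool:
    """Heuristically decide whether a file looks like a truncated PDF."""
    size = path.stat().st_size
    # A file sitting exactly at the assumed cap is suspicious on its own.
    if size == CC_TRUNCATION_LIMIT:
        return True
    # Otherwise, check the tail of the file for the %%EOF marker.
    with path.open("rb") as f:
        f.seek(max(0, size - TAIL_BYTES))
        tail = f.read()
    return b"%%EOF" not in tail


def find_truncated(corpus_dir: Path) -> list[Path]:
    """Return all files under corpus_dir that look truncated, sorted by path."""
    return sorted(
        p for p in corpus_dir.rglob("*") if p.is_file() and looks_truncated(p)
    )
```

The results would be candidates for review rather than a final removal
list, since a damaged but complete PDF can also lack the marker.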

What do you think?

Best,

      Tim
