Hi Andi,

It is easy to change CommonCrawlDocumentDownload to fetch other mime-types, see https://github.com/centic9/CommonCrawlDocumentDownload/tree/download_emf
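
To illustrate the kind of check involved, here is a rough Python sketch. It is not the code from CommonCrawlDocumentDownload (which is a Java tool). It assumes index lines in the Common Crawl CDX layout (`urlkey timestamp {json}`) with `url` and `mime` fields in the JSON part. The function name `is_emf_entry`, the list of EMF mime-types, and the sample lines are assumptions made up for this example.

```python
import json
from urllib.parse import urlparse

# Assumed values for illustration only; not taken from CommonCrawlDocumentDownload.
EMF_MIME_TYPES = {"image/emf", "image/x-emf", "application/emf", "application/x-emf"}
EMF_EXTENSIONS = (".emf",)


def is_emf_entry(index_line):
    """Return True if an index line matches EMF by mime-type or by URL extension."""
    # Assumed line layout: "<urlkey> <timestamp> <json>"
    parts = index_line.split(" ", 2)
    if len(parts) != 3:
        return False
    try:
        record = json.loads(parts[2])
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    mime = str(record.get("mime", "")).lower()
    # Look only at the path so query strings do not hide the extension.
    path = urlparse(str(record.get("url", ""))).path.lower()
    return mime in EMF_MIME_TYPES or path.endswith(EMF_EXTENSIONS)


# Hypothetical sample lines, hand-written for this sketch.
sample_lines = [
    'com,example)/logo.emf 20180922000000 {"url": "http://example.com/logo.emf", "mime": "application/octet-stream", "status": "200"}',
    'com,example)/index.html 20180922000000 {"url": "http://example.com/index.html", "mime": "text/html", "status": "200"}',
    'com,example)/img?id=7 20180922000000 {"url": "http://example.com/img?id=7", "mime": "image/x-emf", "status": "200"}',
]
matches = [line for line in sample_lines if is_emf_entry(line)]
print(len(matches), "of", len(sample_lines), "sample lines match")
```

Matching on either the extension or the mime-type is deliberate here, since servers often deliver rare formats with a generic type such as application/octet-stream.
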

However, .emf files don't appear in the top-100 mime-types of the crawls, so they are likely included only very rarely, if at all. I started a download run, but the first two of the 300 index files do not contain any matching extension or mime-type.

See https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes for mime-type statistics of the crawl.

Dominik.

On Sat, Oct 6, 2018 at 8:14 PM Andreas Beeker <[email protected]> wrote:

> Hi Tim / Dominik,
>
> please give me a few pointers, how I could access a pool of EMF files,
> e.g. (not only) within the common crawl corpus. My focus is currently on
> rendering, but as I extend the supported records, I also like to validate
> the parsing.
> As the EMF parsing is relatively new, you still might have a corpus for
> it, Tim?
>
> I have a few old mails about the common crawl corpus [2], but I guess
> there has been some restructuring taken place and there might be an easier
> option than downloading the whole index.
>
> Of course office files which I parse for embedded EMFs are also ok.
>
> I have to admit, that I haven't yet tested Dominiks tool [1].
>
> Alternatively I can use the govdocs1 corpus [3]
>
> Best wishes,
> Andi
>
>
> [1] https://github.com/centic9/CommonCrawlDocumentDownload
>
> [2]
> http://apache-poi.1045710.n5.nabble.com/Using-CommonCrawl-for-POI-regression-mass-testing-td5721585.html
>
> [3] http://downloads.digitalcorpora.org/corpora/files/govdocs1/by_type/
