Hi Andi

It is easy to change CommonCrawlDocumentDownload to fetch other mime-types,
see https://github.com/centic9/CommonCrawlDocumentDownload/tree/download_emf

However, .emf files don't appear in the top-100 mime-types of the crawls,
so they are likely included very rarely, if at all. I started a download
run, but the first two of the 300 index files do not contain any entry
with a matching extension or mime-type.
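
For illustration, the per-entry check amounts to roughly the sketch below.
This is not the tool's actual code (the tool is written in Java). It assumes
each index line ends in a JSON object with "url", "mime" and "mime-detected"
fields, and the set of EMF mime-types is my guess; the names is_emf_entry
and filter_emf are made up for this example.

```python
import json

# Assumed EMF markers: these mime-type strings are guesses, not taken from the tool.
EMF_MIME_TYPES = {"image/x-emf", "image/emf", "application/x-emf", "application/emf"}
EMF_EXTENSIONS = (".emf",)


def is_emf_entry(line):
    """Return True if one index line looks like it points at an EMF file."""
    # Assumed layout: "<url key> <timestamp> <json>", so the JSON starts at the first "{".
    start = line.find("{")
    if start < 0:
        return False
    try:
        record = json.loads(line[start:])
    except ValueError:
        return False
    url = str(record.get("url") or "").lower()
    # Drop query string and fragment before checking the extension.
    path = url.split("?", 1)[0].split("#", 1)[0]
    mimes = {
        str(record.get("mime") or "").lower(),
        str(record.get("mime-detected") or "").lower(),
    }
    return path.endswith(EMF_EXTENSIONS) or bool(mimes & EMF_MIME_TYPES)


def filter_emf(lines):
    """Keep only the index lines that match by extension or mime-type."""
    return [line for line in lines if is_emf_entry(line)]
```

Running filter_emf over the lines of one decompressed index file would give
the candidate entries; an empty result corresponds to what I saw for the
first two files.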

See https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes for
mimetype-statistics in the crawl.

Dominik.

On Sat, Oct 6, 2018 at 8:14 PM Andreas Beeker <[email protected]> wrote:

> Hi Tim / Dominik,
>
> please give me a few pointers on how I could access a pool of EMF files,
> e.g. (but not only) within the Common Crawl corpus. My focus is currently
> on rendering, but as I extend the supported records, I'd also like to
> validate the parsing.
> As the EMF parsing is relatively new, you still might have a corpus for
> it, Tim?
>
> I have a few old mails about the Common Crawl corpus [2], but I guess
> some restructuring has taken place since then and there might be an easier
> option than downloading the whole index.
>
> Of course office files which I parse for embedded EMFs are also ok.
>
> I have to admit that I haven't yet tested Dominik's tool [1].
>
> Alternatively, I can use the govdocs1 corpus [3].
>
> Best wishes,
> Andi
>
>
> [1] https://github.com/centic9/CommonCrawlDocumentDownload
>
> [2]
> http://apache-poi.1045710.n5.nabble.com/Using-CommonCrawl-for-POI-regression-mass-testing-td5721585.html
>
> [3] http://downloads.digitalcorpora.org/corpora/files/govdocs1/by_type/
>
>
