Yes. It turns out I extracted a bunch of EMFs a while ago. See the 'emfs'
directory in this tar.bz2 file:
http://162.242.228.174/embedded_files/xmfs.tar.bz2
Let me know if you have any questions and/or if I can make that any
more useful for you.
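
In case it helps, here is a minimal Python sketch for pulling only the
'emfs' directory out of the archive once it has been downloaded. The
function name extract_dir, the local file name and the destination folder
are all hypothetical, and the internal layout of the archive (a directory
called 'emfs' somewhere in each member path) is an assumption based on the
description above, so adjust as needed.

```python
import shutil
import tarfile
from pathlib import Path, PurePosixPath


def extract_dir(archive_path, dest, dir_name="emfs"):
    """Extract regular files that sit under a directory named dir_name.

    Assumption: the archive is bzip2-compressed and stores the files of
    interest below a path component equal to dir_name.
    Returns the list of paths written below dest.
    """
    dest = Path(dest)
    extracted = []
    with tarfile.open(archive_path, "r:bz2") as tar:
        for member in tar:
            path = PurePosixPath(member.name)
            parts = path.parts
            # Keep only regular files with dir_name among their parent folders.
            if not member.isfile() or dir_name not in parts[:-1]:
                continue
            # Skip suspicious names instead of writing outside dest.
            if path.is_absolute() or ".." in parts:
                continue
            target = dest.joinpath(*parts)
            target.parent.mkdir(parents=True, exist_ok=True)
            # Copy the member's bytes by hand rather than trusting tar metadata.
            with tar.extractfile(member) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            extracted.append(target)
    return extracted
```

Something like extract_dir("xmfs.tar.bz2", "corpus") would then leave only
the EMF files under corpus/, with their original relative paths preserved.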
Cheers,
Tim
On Mon, Oct 8, 2018 at 7:37 AM Tim Allison <[email protected]> wrote:
>
> At some point I extracted all EMFs from our corpus. I’ll see if that data is
> still around and/or re-extract. I'll probably have time tomorrow or Wednesday.
>
> On Sun, Oct 7, 2018 at 5:01 PM Dominik Stadler <[email protected]> wrote:
>>
>> Hi Andi
>>
>> It is easy to change CommonCrawlDocumentDownload to fetch other mime-types,
>> see https://github.com/centic9/CommonCrawlDocumentDownload/tree/download_emf
>>
>> However .emf files don't appear in the top-100 mimetypes of the crawls and
>> thus are likely very rarely included if at all. I started a download-run,
>> but the first two of the 300 index-files do not contain any matching
>> extension or mime-type.
>>
>> See https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes for
>> mimetype-statistics in the crawl.
>>
>> Dominik.
>>
>> On Sat, Oct 6, 2018 at 8:14 PM Andreas Beeker <[email protected]> wrote:
>>
>> > Hi Tim / Dominik,
>> >
>> > please give me a few pointers on how I could access a pool of EMF files,
>> > e.g. (but not only) within the common crawl corpus. My focus is currently
>> > on rendering, but as I extend the supported records, I would also like to
>> > validate the parsing.
>> > As the EMF parsing is relatively new, you still might have a corpus for
>> > it, Tim?
>> >
>> > I have a few old mails about the common crawl corpus [2], but I guess
>> > some restructuring has taken place since then and there might be an
>> > easier option than downloading the whole index.
>> >
>> > Of course office files which I parse for embedded EMFs are also ok.
>> >
>> > I have to admit that I haven't yet tested Dominik's tool [1].
>> >
>> > Alternatively, I can use the govdocs1 corpus [3].
>> >
>> > Best wishes,
>> > Andi
>> >
>> >
>> > [1] https://github.com/centic9/CommonCrawlDocumentDownload
>> >
>> > [2]
>> > http://apache-poi.1045710.n5.nabble.com/Using-CommonCrawl-for-POI-regression-mass-testing-td5721585.html
>> >
>> > [3] http://downloads.digitalcorpora.org/corpora/files/govdocs1/by_type/
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [email protected]
>> > For additional commands, e-mail: [email protected]
>> >
>> >