Re: EMF corpus

Tim Allison Tue, 09 Oct 2018 06:08:56 -0700

Turns out that's a subset.  It looks like there should be ~200k emfs.
I'll try to dig up the extraction code and re-run.
On Tue, Oct 9, 2018 at 8:55 AM Tim Allison <[email protected]> wrote:
>
> Y.  Turns out I extracted a bunch a while ago.  See the 'emfs'
> directory in this tar.bz2 file:
> http://162.242.228.174/embedded_files/xmfs.tar.bz2
>
> Let me know if you have any questions and/or if I can make that any
> more useful for you.
>
> Cheers,
>
>            Tim
> On Mon, Oct 8, 2018 at 7:37 AM Tim Allison <[email protected]> wrote:
> >
> > At some point I extracted all emfs from our corpus. I’ll see if that data
> > is still around and/or re-extract. I'll probably have time tomorrow or Wednesday.
> >
> > On Sun, Oct 7, 2018 at 5:01 PM Dominik Stadler <[email protected]> wrote:
> >>
> >> Hi Andi
> >>
> >> It is easy to change CommonCrawlDocumentDownload to fetch other mime-types,
> >> see https://github.com/centic9/CommonCrawlDocumentDownload/tree/download_emf
> >>
> >> However .emf files don't appear in the top-100 mimetypes of the crawls and
> >> thus are likely very rarely included if at all. I started a download-run,
> >> but the first two of the 300 index-files do not contain any matching
> >> extension or mime-type.
> >>
> >> See https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes for
> >> mimetype-statistics in the crawl.
> >>
> >> Dominik.
> >>
> >> On Sat, Oct 6, 2018 at 8:14 PM Andreas Beeker <[email protected]> wrote:
> >>
> >> > Hi Tim / Dominik,
> >> >
> >> > please give me a few pointers on how I could access a pool of EMF files,
> >> > e.g. (but not only) within the Common Crawl corpus. My focus is currently on
> >> > rendering, but as I extend the supported records, I would also like to validate
> >> > the parsing.
> >> > As the EMF parsing is relatively new, you still might have a corpus for
> >> > it, Tim?
> >> >
> >> > I have a few old mails about the Common Crawl corpus [2], but I guess
> >> > some restructuring has taken place and there might be an easier
> >> > option than downloading the whole index.
> >> >
> >> > Of course, office files that I can parse for embedded EMFs are also OK.
> >> >
> >> > I have to admit that I haven't yet tested Dominik's tool [1].
> >> >
> >> > Alternatively I can use the govdocs1 corpus [3]
> >> >
> >> > Best wishes,
> >> > Andi
> >> >
> >> >
> >> > [1] https://github.com/centic9/CommonCrawlDocumentDownload
> >> >
> >> > [2]
> >> > http://apache-poi.1045710.n5.nabble.com/Using-CommonCrawl-for-POI-regression-mass-testing-td5721585.html
> >> >
> >> > [3] http://downloads.digitalcorpora.org/corpora/files/govdocs1/by_type/
> >> >
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: [email protected]
> >> > For additional commands, e-mail: [email protected]
> >> >
> >> >

