Turns out that's a subset. It looks like there should be ~200k emfs.
I'll try to dig up the extraction code and re-run.

On Tue, Oct 9, 2018 at 8:55 AM Tim Allison <[email protected]> wrote:
>
> Y. Turns out I extracted a bunch a while ago. See the 'emfs'
> directory in this tar.bz2 file:
> http://162.242.228.174/embedded_files/xmfs.tar.bz2
>
> Let me know if you have any questions and/or if I can make that any
> more useful for you.
>
> Cheers,
>
> Tim
> On Mon, Oct 8, 2018 at 7:37 AM Tim Allison <[email protected]> wrote:
> >
> > At some point I extracted all emfs from our corpus. I’ll see if that data
> > is still around and/or re-extract...prob have time tomorrow/ Wednesday
> >
> > On Sun, Oct 7, 2018 at 5:01 PM Dominik Stadler <[email protected]>
> > wrote:
> >>
> >> Hi Andi
> >>
> >> It is easy to change CommonCrawlDocumentDownload to fetch other mime-types,
> >> see
> >> https://github.com/centic9/CommonCrawlDocumentDownload/tree/download_emf
> >>
> >> However .emf files don't appear in the top-100 mimetypes of the crawls and
> >> thus are likely very rarely included if at all. I started a download-run,
> >> but the first two of the 300 index-files do not contain any matching
> >> extension or mime-type.
> >>
> >> See https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes for
> >> mimetype-statistics in the crawl.
> >>
> >> Dominik.
> >>
> >> On Sat, Oct 6, 2018 at 8:14 PM Andreas Beeker <[email protected]> wrote:
> >>
> >> > Hi Tim / Dominik,
> >> >
> >> > please give me a few pointers, how I could access a pool of EMF files,
> >> > e.g. (not only) within the common crawl corpus. My focus is currently on
> >> > rendering, but as I extend the supported records, I also like to validate
> >> > the parsing.
> >> > As the EMF parsing is relatively new, you still might have a corpus for
> >> > it, Tim?
> >> >
> >> > I have a few old mails about the common crawl corpus [2], but I guess
> >> > there has been some restructuring taken place and there might be an easier
> >> > option than downloading the whole index.
> >> >
> >> > Of course office files which I parse for embedded EMFs are also ok.
> >> >
> >> > I have to admit, that I haven't yet tested Dominiks tool [1].
> >> >
> >> > Alternatively I can use the govdocs1 corpus [3]
> >> >
> >> > Best wishes,
> >> > Andi
> >> >
> >> >
> >> > [1] https://github.com/centic9/CommonCrawlDocumentDownload
> >> >
> >> > [2]
> >> > http://apache-poi.1045710.n5.nabble.com/Using-CommonCrawl-for-POI-regression-mass-testing-td5721585.html
> >> >
> >> > [3] http://downloads.digitalcorpora.org/corpora/files/govdocs1/by_type/
> >> >
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: [email protected]
> >> > For additional commands, e-mail: [email protected]
> >> >
