Hi Tim / Dominik,

Please give me a few pointers on how I could access a pool of EMF files, e.g. 
(but not only) within the Common Crawl corpus. My focus is currently on rendering, 
but as I extend the supported records, I would also like to validate the parsing.
Although the EMF parsing is relatively new, you might still have a corpus for it, Tim?

I have a few old mails about the Common Crawl corpus [2], but I guess some 
restructuring has taken place since then, and there might be an easier option 
than downloading the whole index.

Of course, office files that I can parse for embedded EMFs are also fine.

I have to admit that I haven't yet tested Dominik's tool [1].

Alternatively, I can use the govdocs1 corpus [3].

Best wishes,
Andi


[1] https://github.com/centic9/CommonCrawlDocumentDownload

[2] http://apache-poi.1045710.n5.nabble.com/Using-CommonCrawl-for-POI-regression-mass-testing-td5721585.html

[3] http://downloads.digitalcorpora.org/corpora/files/govdocs1/by_type/

