Hi Tim / Dominik,

could you give me a few pointers on how to access a pool of EMF files, for example (but not only) within the Common Crawl corpus? My focus is currently on rendering, but as I extend the set of supported records, I would also like to validate the parsing. Although the EMF parsing is relatively new, do you perhaps already have a corpus for it, Tim?
I have a few old mails about the Common Crawl corpus [2], but I guess some restructuring has taken place since then, and there might be an easier option than downloading the whole index. Office files, which I can parse for embedded EMFs, are of course also fine. I have to admit that I haven't tested Dominik's tool [1] yet. Alternatively, I can use the govdocs1 corpus [3].

Best wishes,
Andi

[1] https://github.com/centic9/CommonCrawlDocumentDownload
[2] http://apache-poi.1045710.n5.nabble.com/Using-CommonCrawl-for-POI-regression-mass-testing-td5721585.html
[3] http://downloads.digitalcorpora.org/corpora/files/govdocs1/by_type/
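
For triaging a locally downloaded corpus such as govdocs1, a signature check might be enough to pick out EMF candidates before any real parsing. The following is only a minimal sketch: the helper names `looks_like_emf` and `find_emf_files` are hypothetical, and it assumes plain files on disk, so EMFs embedded in office documents would still have to be extracted separately. It relies on the EMF header layout, where the first record has type 1 (EMR_HEADER) and the 4-byte signature " EMF" sits at byte offset 40.

```python
import struct
from pathlib import Path

# Signature field of the EMF header (0x464D4520, stored little-endian),
# located at byte offset 40 of the file.
EMF_SIGNATURE = b" EMF"
# Record type of the header record that starts every EMF file.
EMR_HEADER = 1


def looks_like_emf(data: bytes) -> bool:
    """Return True if the bytes start with a plausible EMF header record."""
    if len(data) < 44:
        return False
    (record_type,) = struct.unpack_from("<I", data, 0)
    return record_type == EMR_HEADER and data[40:44] == EMF_SIGNATURE


def find_emf_files(root):
    """Yield paths below root whose first 44 bytes look like an EMF header.

    Hypothetical helper: it ignores file extensions on purpose, because
    corpus files are often mislabeled.
    """
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            with path.open("rb") as fh:
                if looks_like_emf(fh.read(44)):
                    yield path
```

Calling `list(find_emf_files("some/corpus/dir"))` on a directory of your choice would then give the candidate files to feed into the parser. The check only looks at the header, so truncated or otherwise broken EMFs still pass, which may be desirable for regression testing.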
