All,

I'm scraping XMPs out of our corpus and placing them here as standalone files:
https://corpora.tika.apache.org/base/xmps/

I've binned the files roughly based on the container file's mime type, e.g.
https://corpora.tika.apache.org/base/xmps/pdf/

The process is still running, and I view this as a first draft. Please let me know if there's anything I can do to make these data easier to use/more useful or if you see any problems.

Cheers,
Tim