Hi Leonard,

if you could provide a sample document with XMP attached to the various PDF objects you're interested in, I could come up with a quick sample for Tim.
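For reference, here is a minimal sketch of the byte-scraping approach Tim describes in the quoted message: scan the raw file bytes for XMP packet wrappers without parsing the container. It is not Tim's actual code. The class name `XmpScraper` and its methods are hypothetical, and it assumes the packet is stored uncompressed in an ASCII-compatible encoding. That is why XMP inside a compressed stream is missed and why no packet can be tied to a specific object.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: pulls XMP packets out of raw bytes with no container parsing.
final class XmpScraper {
    // XMP packet wrapper markers (processing instructions around the payload).
    private static final byte[] BEGIN = "<?xpacket begin=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] END = "<?xpacket end=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] CLOSE = "?>".getBytes(StandardCharsets.US_ASCII);

    // Naive forward search for pattern in data, starting at index from; -1 if absent.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Returns every complete packet found, in file order, wrapper included.
    static List<byte[]> scrape(byte[] data) {
        List<byte[]> packets = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = indexOf(data, BEGIN, pos);
            if (start < 0) break;
            int end = indexOf(data, END, start + BEGIN.length);
            if (end < 0) break; // truncated packet: no trailer found
            int close = indexOf(data, CLOSE, end + END.length);
            if (close < 0) break; // trailer never terminated
            int stop = close + CLOSE.length;
            packets.add(Arrays.copyOfRange(data, start, stop));
            pos = stop;
        }
        return packets;
    }
}
```

To use it on a file, read the bytes with `java.nio.file.Files.readAllBytes` and pass them to `XmpScraper.scrape`. Getting object-level metadata with its owning object would instead require walking the parsed PDF structure and following each `/Metadata` entry.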
BR
Maruan

On Wednesday, 17.03.2021 at 13:39 -0400, Tim Allison wrote:
> Hi Leonard,
>   I'm literally just scraping bytes out of files for now without any
> parsing...so if the XMP is concealed in a compressed stream or
> something more interesting, I'm not grabbing it. I'm also not
> tracking which XMP is associated with which object.
>   Please forgive me...if I traverse the COSDocument's objects and look
> for /Metadata and grab the stream, will that be what you're looking
> for? Or, is there a command-line tool I can run to get what you're
> interested in?
>   Thank you.
>
>   Cheers,
>
>   Tim
>
> On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
> <lrose...@adobe.com.invalid> wrote:
> >
> > Are you only pulling document-level XMP? If so, could you extend
> > it to support object-level metadata as well? I, for one, would
> > love to get insight into the use of object-level metadata - what
> > objects are they attached to, what are they being used for, etc.
> >
> > Leonard
> >
> > On 3/17/21, 11:37 AM, "Tim Allison" <talli...@apache.org> wrote:
> >
> >     All,
> >
> >     I'm scraping XMPs out of our corpus and placing them here as
> >     standalone files:
> >
> >     https://corpora.tika.apache.org/base/xmps/
> >
> >     I've binned the files roughly based on the container file's
> >     mime type, e.g.
> >     https://corpora.tika.apache.org/base/xmps/pdf/
> >
> >     The process is still running, and I view this as a first
> >     draft. Please let me know if there's anything I can do to make
> >     these data easier to use/more useful or if you see any problems.
> >
> >     Cheers,
> >
> >     Tim
> >

-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Managing Director: Maruan Sahyoun
Commercial register: AG Düsseldorf, HRB 53837
VAT ID: DE248275827