If only we had some kind of a corpus we could all share... LOL Here are two...if I understand correctly.
https://corpora.tika.apache.org/base/docs/bug_trackers/PDFBOX/PDFBOX-4028-0.pdf

<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 3.0-28, framework 1.6'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>
<rdf:Description rdf:about='uuid:f6b0dc62-e0f5-11da-9df8-891f95b09a7c' xmlns:exif='http://ns.adobe.com/exif/1.0/'>
<exif:ColorSpace>4294967295</exif:ColorSpace>
<exif:PixelXDimension>163</exif:PixelXDimension>
<exif:PixelYDimension>124</exif:PixelYDimension>
</rdf:Description>
....

https://corpora.tika.apache.org/base/docs/bug_trackers/PDFBOX/PDFBOX-3724-0.pdf

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.2-c003 61.141987, 2011/02/22-12:03:51 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
  xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/"
  xmlns:plus="http://ns.useplus.org/ldf/xmp/1.0/"
  xmlns:xmp="http://ns.adobe.com/xap/1.0/"
  xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
  xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
  xmlns:fwl="http://ns.fotoware.com/iptcxmp-legacy/1.0/"
  xmlns:fwr="http://ns.fotoware.com/iptcxmp-reserved/1.0/"
  xmlns:crs="http://ns.adobe.com/camera-raw-settings/1.0/"
  xmlns:fwc="http://ns.fotoware.com/iptcxmp-custom/1.0/"
  xmlns:fwu="http://ns.fotoware.com/iptcxmp-user/1.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:Iptc4xmpExt="http://iptc.org/std/Iptc4xmpExt/2008-02-29/"
  photoshop:City="London"
  photoshop:DateCreated="2015-09-24"

On Wed, Mar 17, 2021 at 2:14 PM Tim Allison <talli...@apache.org> wrote:
>
> Sounds like we might be extracting that info in the following line in Tika?
>
> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/ImageGraphicsEngine.java#L302
>
> On Wed, Mar 17, 2021 at 2:03 PM sahy...@fileaffairs.de
> <sahy...@fileaffairs.de> wrote:
> >
> > Hi Leonard,
> >
> > attachments won't work at the mailing list - could you upload it to a
> > public location or send it to me in person?
> >
> > BR
> > Maruan
> >
> > On Wednesday, 17.03.2021 at 17:57 +0000, Leonard Rosenthol wrote:
> > > Here is one that I have handy where there is XMP on the image...
> > >
> > > On 3/17/21, 1:44 PM, "sahy...@fileaffairs.de"
> > > <sahy...@fileaffairs.de> wrote:
> > >
> > > Hi Leonard,
> > >
> > > if you could provide a sample document with XMPs attached to various
> > > PDF objects you're interested in I could come up with a quick sample
> > > for Tim.
> > >
> > > BR
> > > Maruan
> > >
> > > On Wednesday, 17.03.2021 at 13:39 -0400, Tim Allison wrote:
> > > > Hi Leonard,
> > > > I'm literally just scraping bytes out of files for now without any
> > > > parsing...so if the XMP is concealed in a compressed stream or
> > > > something more interesting, I'm not grabbing it. I'm also not
> > > > tracking which XMP is associated with which object.
> > > > Please forgive me...if I traverse the COSDocument's objects and look
> > > > for /Metadata and grab the stream, will that be what you're looking
> > > > for? Or, is there a commandline tool I can run to get what you're
> > > > interested in?
> > > > Thank you.
> > > >
> > > > Cheers,
> > > >
> > > > Tim
> > > >
> > > > On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
> > > > <lrose...@adobe.com.invalid> wrote:
> > > > >
> > > > > Are you only pulling document-level XMP? If so, could you extend
> > > > > it to support object-level metadata as well?
> > > > > I, for one, would love to get insight into the use of object-level
> > > > > metadata - what objects are they attached to, what are they being
> > > > > used for, etc.
> > > > >
> > > > > Leonard
> > > > >
> > > > > On 3/17/21, 11:37 AM, "Tim Allison" <talli...@apache.org> wrote:
> > > > >
> > > > > All,
> > > > >
> > > > > I'm scraping XMPs out of our corpus and placing them here as
> > > > > standalone files:
> > > > >
> > > > > https://corpora.tika.apache.org/base/xmps/
> > > > >
> > > > > I've binned the files roughly based on the container file's mime
> > > > > type, e.g.
> > > > >
> > > > > https://corpora.tika.apache.org/base/xmps/pdf/
> > > > >
> > > > > The process is still running, and I view this as a first draft.
> > > > > Please let me know if there's anything I can do to make these data
> > > > > easier to use/more useful or if you see any problems.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Tim
> > > > >
> > >
> > > --
> > > Maruan Sahyoun
> > >
> > > FileAffairs GmbH
> > > Josef-Schappe-Straße 21
> > > 40882 Ratingen
> > >
> > > Tel: +49 (2102) 89497 88
> > > Fax: +49 (2102) 89497 91
> > > sahy...@fileaffairs.de
> > > http://www.fileaffairs.de/
> > >
> > > Managing Director: Maruan Sahyoun
> > > Commercial register: AG Düsseldorf, HRB 53837
> > > VAT ID: DE248275827
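
For readers following the thread: the "scraping bytes out of files ... without any parsing" approach described above could look roughly like the sketch below. This is only an illustration, not the code that built the corpus. The function name `scrape_xmp_packets` is hypothetical, and it assumes an XMP block can be recognised by the literal `<x:xmpmeta ...>` and `</x:xmpmeta>` tags in the raw bytes. As noted in the thread, anything inside a compressed stream is missed, and nothing records which object the XMP belongs to.

```python
import re
from typing import List

# Assumption: an XMP block appears uncompressed in the file and is delimited
# by the literal <x:xmpmeta ...> and </x:xmpmeta> tags. The optional
# <?xpacket ...?> wrapper around it is ignored here.
XMP_PATTERN = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def scrape_xmp_packets(data: bytes) -> List[bytes]:
    """Return every raw XMP block found in data, in file order.

    No container parsing is done, so XMP hidden in compressed streams
    is not found and the owning object is unknown.
    """
    return XMP_PATTERN.findall(data)


if __name__ == "__main__":
    import sys

    # Usage: python scrape_xmp.py some-file.pdf
    with open(sys.argv[1], "rb") as handle:
        for index, packet in enumerate(scrape_xmp_packets(handle.read())):
            print(f"--- XMP block {index} ({len(packet)} bytes) ---")
            print(packet.decode("utf-8", errors="replace"))
```

Object-level metadata, as requested in the thread, would instead need a real PDF parser to walk the objects and read each /Metadata stream.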