If only we had some kind of a corpus we could all share... LOL

Here are two examples from the corpus, if I understand correctly.

https://corpora.tika.apache.org/base/docs/bug_trackers/PDFBOX/PDFBOX-4028-0.pdf

<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 3.0-28, framework 1.6'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
xmlns:iX='http://ns.adobe.com/iX/1.0/'>

 <rdf:Description rdf:about='uuid:f6b0dc62-e0f5-11da-9df8-891f95b09a7c'
  xmlns:exif='http://ns.adobe.com/exif/1.0/'>
  <exif:ColorSpace>4294967295</exif:ColorSpace>
  <exif:PixelXDimension>163</exif:PixelXDimension>
  <exif:PixelYDimension>124</exif:PixelYDimension>
 </rdf:Description>
....
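
As a rough illustration of what can be read out of a packet like the one above, here is a minimal Python sketch that lists the exif properties. It is only a sketch under assumptions: the excerpt has been closed off by hand so that it parses, and exif_properties is a hypothetical helper name, not something from Tika or PDFBox.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the excerpt above.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "exif": "http://ns.adobe.com/exif/1.0/",
}

# The excerpt above, closed off by hand (assumption) so it is well-formed XML.
SAMPLE_XMP = """<x:xmpmeta xmlns:x='adobe:ns:meta/'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
 <rdf:Description rdf:about='uuid:f6b0dc62-e0f5-11da-9df8-891f95b09a7c'
  xmlns:exif='http://ns.adobe.com/exif/1.0/'>
  <exif:ColorSpace>4294967295</exif:ColorSpace>
  <exif:PixelXDimension>163</exif:PixelXDimension>
  <exif:PixelYDimension>124</exif:PixelYDimension>
 </rdf:Description>
</rdf:RDF>
</x:xmpmeta>"""


def exif_properties(xmp_text):
    """Map exif element names to their text for every rdf:Description."""
    root = ET.fromstring(xmp_text)
    prefix = "{" + NS["exif"] + "}"
    props = {}
    for description in root.iter("{" + NS["rdf"] + "}Description"):
        for child in description:
            # Keep only child elements in the exif namespace.
            if child.tag.startswith(prefix):
                props[child.tag[len(prefix):]] = (child.text or "").strip()
    return props


print(exif_properties(SAMPLE_XMP))
```

Note that this only looks at properties written as child elements. Properties written as attributes on rdf:Description (for example photoshop:City in the second file below) would need to be read from the element's attributes instead.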

https://corpora.tika.apache.org/base/docs/bug_trackers/PDFBOX/PDFBOX-3724-0.pdf
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.2-c003 61.141987, 2011/02/22-12:03:51        ">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/"
    xmlns:plus="http://ns.useplus.org/ldf/xmp/1.0/"
    xmlns:xmp="http://ns.adobe.com/xap/1.0/"
    xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
    xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
    xmlns:fwl="http://ns.fotoware.com/iptcxmp-legacy/1.0/"
    xmlns:fwr="http://ns.fotoware.com/iptcxmp-reserved/1.0/"
    xmlns:crs="http://ns.adobe.com/camera-raw-settings/1.0/"
    xmlns:fwc="http://ns.fotoware.com/iptcxmp-custom/1.0/"
    xmlns:fwu="http://ns.fotoware.com/iptcxmp-user/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:Iptc4xmpExt="http://iptc.org/std/Iptc4xmpExt/2008-02-29/"
   photoshop:City="London"
   photoshop:DateCreated="2015-09-24"
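
For reference, the raw byte scraping described further down the thread can be sketched roughly as follows. This is a minimal sketch and not the code actually used to build the corpus: it assumes a packet runs from an opening x:xmpmeta tag to the matching closing tag, and scrape_xmp and scrape_xmp_file are hypothetical names. As noted in the thread, a scan like this does no PDF parsing, so it misses XMP inside compressed streams and does not record which object a packet belongs to.

```python
import re
from typing import List

# Assumption: a packet is everything from "<x:xmpmeta" through "</x:xmpmeta>".
# DOTALL lets "." cross line breaks; the non-greedy ".*?" stops at the first
# closing tag so that multiple packets in one file are returned separately.
XMP_PATTERN = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def scrape_xmp(data: bytes) -> List[bytes]:
    """Return every uncompressed XMP packet found in the raw bytes."""
    return XMP_PATTERN.findall(data)


def scrape_xmp_file(path: str) -> List[bytes]:
    """Hypothetical helper: read a file from disk and scrape its bytes."""
    with open(path, "rb") as handle:
        return scrape_xmp(handle.read())
```

Calling scrape_xmp_file on a local copy of one of the PDFs above would return a list of byte strings, one per packet found, which could then be written out as standalone files.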

On Wed, Mar 17, 2021 at 2:14 PM Tim Allison <talli...@apache.org> wrote:
>
> Sounds like we might be extracting that info in the following line in Tika?
>
> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/ImageGraphicsEngine.java#L302
>
> On Wed, Mar 17, 2021 at 2:03 PM sahy...@fileaffairs.de
> <sahy...@fileaffairs.de> wrote:
> >
> > Hi Leonard,
> >
> > attachments don't work on the mailing list - could you upload it to a
> > public location or send it to me directly?
> >
> > BR
> > Maruan
> >
> > On Wednesday, 17.03.2021 at 17:57 +0000, Leonard Rosenthol wrote:
> > > Here is one that I have handy where there is XMP on the image...
> > >
> > > On 3/17/21, 1:44 PM, "sahy...@fileaffairs.de"
> > > <sahy...@fileaffairs.de> wrote:
> > >
> > >     Hi Leonard,
> > >
> > >     if you could provide a sample document with XMPs attached to various
> > >     PDF objects you're interested in I could come up with a quick sample
> > >     for Tim.
> > >
> > >     BR
> > >     Maruan
> > >
> > >     On Wednesday, 17.03.2021 at 13:39 -0400, Tim Allison wrote:
> > >     > Hi Leonard,
> > >     >   I'm literally just scraping bytes out of files for now without any
> > >     > parsing...so if the XMP is concealed in a compressed stream or
> > >     > something more interesting, I'm not grabbing it.  I'm also not
> > >     > tracking which XMP is associated with which object.
> > >     >   Please forgive me...if I traverse the COSDocument's objects and look
> > >     > for /Metadata and grab the stream, will that be what you're looking
> > >     > for?  Or, is there a commandline tool I can run to get what you're
> > >     > interested in?
> > >     >   Thank you.
> > >     >
> > >     >   Cheers,
> > >     >
> > >     >               Tim
> > >     >
> > >     > On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
> > >     > <lrose...@adobe.com.invalid> wrote:
> > >     > >
> > >     > > Are you only pulling document-level XMP?  If so, could you extend
> > >     > > it to support object-level metadata as well?   I, for one, would
> > >     > > love to get insight into the use of object-level metadata - what
> > >     > > objects are they attached to, what are they being used for, etc.
> > >     > >
> > >     > > Leonard
> > >     > >
> > >     > > On 3/17/21, 11:37 AM, "Tim Allison" <talli...@apache.org> wrote:
> > >     > >
> > >     > >     All,
> > >     > >
> > >     > >       I'm scraping XMPs out of our corpus and placing them here as
> > >     > >     standalone files:
> > >     > >
> > >     > >     https://corpora.tika.apache.org/base/xmps/
> > >     > >
> > >     > >       I've binned the files roughly based on the container file's
> > >     > >     mime type, e.g.
> > >     > >     https://corpora.tika.apache.org/base/xmps/pdf/
> > >     > >
> > >     > >       The process is still running, and I view this as a first draft.
> > >     > >     Please let me know if there's anything I can do to make these data
> > >     > >     easier to use/more useful or if you see any problems.
> > >     > >
> > >     > >       Cheers,
> > >     > >
> > >     > >                  Tim
> > >     > >
> > >
> > >     --
> > >     --
> > >     Maruan Sahyoun
> > >
> > >     FileAffairs GmbH
> > >     Josef-Schappe-Straße 21
> > >     40882 Ratingen
> > >
> > >     Tel: +49 (2102) 89497 88
> > >     Fax: +49 (2102) 89497 91
> > >     sahy...@fileaffairs.de
> > >
> > >     http://www.fileaffairs.de/
> > >
> > >     Managing Director: Maruan Sahyoun
> > >     Commercial register: AG Düsseldorf, HRB 53837
> > >     VAT ID: DE248275827
> > >
> > >
> >
