> Ah, I wasn't aware of XMPFiles...thank you...I can run that next if that'd be 
> of any interest.

If there were a command-line tool or a Java SDK, I could run that next if
that'd be of any interest. :D
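
For anyone following along, the "quick and dirty byte scanner" described
further down the thread can be sketched roughly as below. This is only an
illustration under stated assumptions, not the code that produced the corpus:
the function name `scan_xmp_packets` is hypothetical, and the sketch assumes
the packet wrapper is stored uncompressed in an ASCII-compatible encoding
(so UTF-16 packets and packets inside compressed streams are missed).

```python
import re
from typing import List

# Assumption: packets are delimited by the standard XMP packet wrapper,
# <?xpacket begin=... ?> ... <?xpacket end="w"?> (or end="r"),
# and appear as plain bytes in the file.
_XMP_PACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>"               # packet header
    rb".*?"                                   # packet body (shortest match)
    rb"<\?xpacket end=[\"'][rw][\"']\?>",     # packet trailer
    re.DOTALL,
)


def scan_xmp_packets(data: bytes) -> List[bytes]:
    """Return every raw XMP packet found in data, in file order.

    No file-format parsing is done, so all packets are returned,
    including ones that a later update may have superseded.
    """
    return [match.group(0) for match in _XMP_PACKET.finditer(data)]
```

Because it returns every packet it finds, a PDF with incremental updates can
yield several packets; deciding which one is "correct" still needs a real
parser, as Leonard points out below.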

On Wed, Mar 17, 2021 at 3:28 PM Tim Allison <talli...@apache.org> wrote:
>
> Ah, I wasn't aware of XMPFiles...thank you...I can run that next if
> that'd be of any interest.
>
> I kicked off a process to run `exiftool -xmp -b` against the files.
> The output will go here:
> https://corpora.tika.apache.org/base/exiftool-xmps/
>
> On Wed, Mar 17, 2021 at 3:24 PM Leonard Rosenthol
> <lrose...@adobe.com.invalid> wrote:
> >
> > Very interesting - thanks.
> >
> > FWIW: The XMPToolkit itself has a module called "XMPFiles" 
> > (https://github.com/adobe/XMP-Toolkit-SDK#xmpfiles) whose job it is to read 
> > & write/update XMP (and other related metadata such as EXIF) from various 
> > file formats.  It's what all the Adobe apps use to handle XMP in any file 
> > format that we encounter.
> >
> > Leonard
> >
> > On 3/17/21, 2:48 PM, "Tim Allison" <talli...@apache.org> wrote:
> >
> >     Wait...I'm sorry...I'm wrong on the first point.
> >
> >     1) In Tika generally, we use Jempbox (currently) to parse XMP when the
> >     parsers come across it and after they select the right one and do any
> >     joining or other modifications...i.e. the "right" XMP.  We use xmpcore
> >     for converting other metadata to XMP in our tika-xmp module, and
> >     xmpcore is a dependency of Drew Noakes' metadata-extractor which is
> >     critical.
> >
> >     On Wed, Mar 17, 2021 at 2:43 PM Tim Allison <talli...@apache.org> wrote:
> >     >
> >     > >Isn't that why you are using the XMP Toolkit???
> >     >
> >     > Sorry, we may be talking about two different things.
> >     >
> >     > 1) In Tika generally, we use xmpcore to parse XMP after the parsers
> >     > extract it and process it (correctly!) from various file formats.
> >     >
> >     > 2) For this exercise, I wanted a quick and dirty byte scanner to
> >     > extract the raw xmp packets...as much as we could find in any file
> >     > format without relying on file-format specific parsers.
> >     >
> >     > I can do a second run where I modify Tika to extract the XMP from the
> >     > various parsers after they do their processing (determining most
> >     > recent/joining, etc) to extract the correct XMP.
> >     >
> >     > And I can do a third run where I modify Tika to extract XMP associated
> >     > with embedded images in PDFs, for example.
> >     >
> >     > I hope this clarifies things.  Please let me know what would be most
> >     > useful for you.
> >     >
> >     > Cheers,
> >     >
> >     >        Tim
> >     >
> >     > On Wed, Mar 17, 2021 at 2:26 PM Leonard Rosenthol
> >     > <lrose...@adobe.com.invalid> wrote:
> >     > >
> >     > > >    The other thing is that I wanted to scrape xmp out of files beyond PDFs.
> >     > > >
> >     > > Isn't that why you are using the XMP Toolkit???
> >     > >
> >     > > Leonard
> >     > >
> >     > > On 3/17/21, 2:10 PM, "Tim Allison" <talli...@apache.org> wrote:
> >     > >
> >     > >     > ARGH!!!!   Please don't do this - it will get you the wrong
> >     > >     > results in almost all cases.  Remember that in a PDF with
> >     > >     > updates, there can/will be a new XMP block with each update.
> >     > >
> >     > >     Ha, right.  I completely understand (perhaps _only_ this small point
> >     > >     on PDFs).  On this pass, my goal was to see what was in the file at
> >     > >     all, not what was the correct XMP. Part of my interest is in what's
> >     > >     available in the file, but not available readily to the user.
> >     > >
> >     > >     The other thing is that I wanted to scrape xmp out of files beyond PDFs.
> >     > >
> >     > >     So, I can definitely take a second run where I let a PDF tool extract
> >     > >     the correct XMP if there's interest in that.
> >     > >
> >     > >     On Wed, Mar 17, 2021 at 1:56 PM Leonard Rosenthol
> >     > >     <lrose...@adobe.com.invalid> wrote:
> >     > >     >
> >     > >     > >      I'm literally just scraping bytes out of files for now
> >     > >     > >      without any parsing
> >     > >     > >
> >     > >     > ARGH!!!!   Please don't do this - it will get you the wrong
> >     > >     > results in almost all cases.  Remember that in a PDF with
> >     > >     > updates, there can/will be a new XMP block with each update.
> >     > >     >
> >     > >     > > if I traverse the COSDocument's objects and look for
> >     > >     > > /Metadata and grab the stream, will that be what you're looking for?
> >     > >     > >
> >     > >     > Just getting those elements would be a great start.  If you could
> >     > >     > also include the rest of the dictionary in which it was found (or
> >     > >     > at least the /Type and /Subtype keys, if present), that would be great!
> >     > >     >
> >     > >     > Leonard
> >     > >     >
> >     > >     > On 3/17/21, 1:39 PM, "Tim Allison" <talli...@apache.org> wrote:
> >     > >     >
> >     > >     >     Hi Leonard,
> >     > >     >       I'm literally just scraping bytes out of files for now without any
> >     > >     >     parsing...so if the XMP is concealed in a compressed stream or
> >     > >     >     something more interesting, I'm not grabbing it.  I'm also not
> >     > >     >     tracking which XMP is associated with which object.
> >     > >     >       Please forgive me...if I traverse the COSDocument's objects and look
> >     > >     >     for /Metadata and grab the stream, will that be what you're looking
> >     > >     >     for?  Or, is there a commandline tool I can run to get what you're
> >     > >     >     interested in?
> >     > >     >       Thank you.
> >     > >     >
> >     > >     >       Cheers,
> >     > >     >
> >     > >     >                   Tim
> >     > >     >
> >     > >     >     On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
> >     > >     >     <lrose...@adobe.com.invalid> wrote:
> >     > >     >     >
> >     > >     >     > Are you only pulling document-level XMP?  If so, could you
> >     > >     >     > extend it to support object-level metadata as well?  I, for one,
> >     > >     >     > would love to get insight into the use of object-level metadata -
> >     > >     >     > what objects are they attached to, what are they being used for, etc.
> >     > >     >     >
> >     > >     >     > Leonard
> >     > >     >     >
> >     > >     >     > On 3/17/21, 11:37 AM, "Tim Allison" <talli...@apache.org> wrote:
> >     > >     >     >
> >     > >     >     >     All,
> >     > >     >     >
> >     > >     >     >       I'm scraping XMPs out of our corpus and placing them here as standalone files:
> >     > >     >     >
> >     > >     >     >     https://corpora.tika.apache.org/base/xmps/
> >     > >     >     >
> >     > >     >     >       I've binned the files roughly based on the container file's mime
> >     > >     >     >     type, e.g. https://corpora.tika.apache.org/base/xmps/pdf/
> >     > >     >     >
> >     > >     >     >       The process is still running, and I view this as a first draft.
> >     > >     >     >     Please let me know if there's anything I can do to make these data
> >     > >     >     >     easier to use/more useful or if you see any problems.
> >     > >     >     >
> >     > >     >     >       Cheers,
> >     > >     >     >
> >     > >     >     >                  Tim
> >     > >     >     >
> >     > >     >
> >     > >
> >
