Very interesting - thanks.

FWIW: The XMPToolkit itself has a module called "XMPFiles" 
(https://github.com/adobe/XMP-Toolkit-SDK#xmpfiles) whose job it is to read & 
write/update XMP (and other related metadata such as EXIF) from various file 
formats.  It's what all the Adobe apps use to handle XMP in any file format 
that we encounter.

Leonard

On 3/17/21, 2:48 PM, "Tim Allison" <talli...@apache.org> wrote:

    Wait...I'm sorry...I'm wrong on the first point.

    1) In Tika generally, we use Jempbox (currently) to parse XMP when the
    parsers come across it, after they select the right one and do any
    joining or other modifications...i.e. the "right" xmp.  We use xmpcore
    for converting other metadata to XMP in our tika-xmp module, and
    xmpcore is a dependency of Drew Noakes' metadata-extractor, which is
    critical.

    On Wed, Mar 17, 2021 at 2:43 PM Tim Allison <talli...@apache.org> wrote:
    >
    > >Isn't that why you are using the XMP Toolkit???
    >
    > Sorry, we may be talking about two different things.
    >
    > 1) In Tika generally, we use xmpcore to parse XMP after the parsers
    > extract it and process it (correctly!) from various file formats.
    >
    > 2) For this exercise, I wanted a quick and dirty byte scanner to
    > extract the raw xmp packets...as much as we could find in any file
    > format without relying on file-format specific parsers.
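A minimal sketch of what such a quick and dirty byte scanner could look like, assuming the packets are stored uncompressed and in an ASCII-compatible encoding such as UTF-8. It relies only on the standard `<?xpacket begin=...?>` and `<?xpacket end=...?>` wrapper; the function name `scan_xmp_packets` is hypothetical and this is not Tika's actual code.

```python
import re
from typing import List

# An XMP packet is wrapped in a pair of processing instructions:
#   <?xpacket begin="..." id="..."?> ... <?xpacket end="w"?>
# Non-greedy matching keeps each header/trailer pair together.
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>.*?<\?xpacket end=.*?\?>",
    re.DOTALL,
)


def scan_xmp_packets(data: bytes) -> List[bytes]:
    """Return every raw XMP packet found in the bytes, in file order.

    Assumption: packets are uncompressed and ASCII-compatible, so
    UTF-16/UTF-32 packets and packets inside compressed streams are missed.
    """
    return [match.group(0) for match in PACKET_RE.finditer(data)]
```

Because it ignores the container format, a scan like this says what is present in the file, not which packet a format-aware parser would treat as the correct one.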
    >
    > I can do a second run where I modify Tika to extract the XMP from the
    > various parsers after they do their processing (determining most
    > recent/joining, etc) to extract the correct XMP.
    >
    > And I can do a third run where I modify Tika to extract XMP associated
    > with embedded images in PDFs, for example.
    >
    > I hope this clarifies things.  Please let me know what would be most
    > useful for you.
    >
    > Cheers,
    >
    >        Tim
    >
    > On Wed, Mar 17, 2021 at 2:26 PM Leonard Rosenthol
    > <lrose...@adobe.com.invalid> wrote:
    > >
    > > >    The other thing is that I wanted to scrape xmp out of files beyond PDFs.
    > > >
    > > Isn't that why you are using the XMP Toolkit???
    > >
    > > Leonard
    > >
    > > On 3/17/21, 2:10 PM, "Tim Allison" <talli...@apache.org> wrote:
    > >
    > >     > ARGH!!!!   Please don't do this - it will get you the wrong results
    > >     > in almost all cases.  Remember that in a PDF with updates, there
    > >     > can/will be a new XMP block with each update.
    > >
    > >     Ha, right.  I completely understand (perhaps _only_ this small point
    > >     on PDFs).  On this pass, my goal was to see what was in the file at
    > >     all, not what was the correct XMP. Part of my interest is in what's
    > >     available in the file, but not available readily to the user.
    > >
    > >     The other thing is that I wanted to scrape xmp out of files beyond PDFs.
    > >
    > >     So, I can definitely take a second run where I let a PDF tool extract
    > >     the correct XMP if there's interest in that.
    > >
    > >     On Wed, Mar 17, 2021 at 1:56 PM Leonard Rosenthol
    > >     <lrose...@adobe.com.invalid> wrote:
    > >     >
    > >     > >      I'm literally just scraping bytes out of files for now without any parsing
    > >     > >
    > >     > ARGH!!!!   Please don't do this - it will get you the wrong results
    > >     > in almost all cases.  Remember that in a PDF with updates, there
    > >     > can/will be a new XMP block with each update.
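To illustrate the concern, here is a rough sketch that flags files where a byte scan cannot tell which XMP block is current. It counts packet headers and `%%EOF` markers; treating the `%%EOF` count as a sign of incremental updates is only a heuristic (linearized files and stream contents can inflate it), and the helper names are hypothetical. Picking the correct block still needs a real PDF parser that follows the latest cross-reference data.

```python
import re
from typing import List, Tuple

PACKET_START = re.compile(rb"<\?xpacket begin=")
EOF_MARKER = re.compile(rb"%%EOF")


def describe_candidates(data: bytes) -> Tuple[List[int], int]:
    """Return (byte offsets of XMP packet headers, number of %%EOF markers)."""
    offsets = [match.start() for match in PACKET_START.finditer(data)]
    eof_count = len(EOF_MARKER.findall(data))
    return offsets, eof_count


def is_ambiguous(data: bytes) -> bool:
    """True when byte scraping alone cannot identify the current XMP.

    Heuristic only: more than one packet, or more than one %%EOF marker,
    suggests the file may have been updated after it was first written.
    """
    offsets, eof_count = describe_candidates(data)
    return len(offsets) > 1 or eof_count > 1
```

A file flagged this way is a candidate for the parser-based second run rather than something to interpret from raw bytes.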
    > >     >
    > >     >
    > >     > > if I traverse the COSDocument's objects and look for /Metadata
    > >     > > and grab the stream, will that be what you're looking for?
    > >     > >
    > >     > Just getting those elements would be a great start.  If you could
    > >     > also include the rest of the dictionary in which it was found (or at
    > >     > least the /Type and /Subtype keys, if present), that would be great!
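The request above can be sketched as a simple filter over already-parsed objects. This assumes a real PDF library has done the parsing and that each object dictionary is exposed as a plain Python dict keyed by PDF name strings, mapped from its object number. That model and the name `find_metadata_owners` are hypothetical stand-ins, not the PDFBox API.

```python
from typing import Any, Dict, List


def find_metadata_owners(objects: Dict[int, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """List every object whose dictionary carries a /Metadata entry.

    Each record keeps the owning object's number plus its /Type and
    /Subtype (None when absent), so document-level and object-level
    metadata can be told apart later.
    """
    found = []
    for obj_num, dictionary in sorted(objects.items()):
        if "/Metadata" not in dictionary:
            continue
        found.append({
            "object": obj_num,
            "type": dictionary.get("/Type"),
            "subtype": dictionary.get("/Subtype"),
            "metadata_ref": dictionary["/Metadata"],
        })
    return found
```

A record whose type is `/Catalog` would be the document-level metadata; anything else (an image XObject, a page, and so on) is object-level.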
    > >     >
    > >     > Leonard
    > >     >
    > >     > On 3/17/21, 1:39 PM, "Tim Allison" <talli...@apache.org> wrote:
    > >     >
    > >     >     Hi Leonard,
    > >     >       I'm literally just scraping bytes out of files for now without
    > >     >     any parsing...so if the XMP is concealed in a compressed stream
    > >     >     or something more interesting, I'm not grabbing it.  I'm also not
    > >     >     tracking which XMP is associated with which object.
    > >     >       Please forgive me...if I traverse the COSDocument's objects and
    > >     >     look for /Metadata and grab the stream, will that be what you're
    > >     >     looking for?  Or, is there a commandline tool I can run to get
    > >     >     what you're interested in?
    > >     >       Thank you.
    > >     >
    > >     >       Cheers,
    > >     >
    > >     >                   Tim
    > >     >
    > >     >     On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
    > >     >     <lrose...@adobe.com.invalid> wrote:
    > >     >     >
    > >     >     > Are you only pulling document-level XMP?  If so, could you
    > >     >     > extend it to support object-level metadata as well?   I, for one,
    > >     >     > would love to get insight into the use of object-level metadata -
    > >     >     > what objects are they attached to, what are they being used for, etc.
    > >     >     >
    > >     >     > Leonard
    > >     >     >
    > >     >     > On 3/17/21, 11:37 AM, "Tim Allison" <talli...@apache.org> wrote:
    > >     >     >
    > >     >     >     All,
    > >     >     >
    > >     >     >       I'm scraping XMPs out of our corpus and placing them here as
    > >     >     >     standalone files:
    > >     >     >
    > >     >     >     https://corpora.tika.apache.org/base/xmps/
    > >     >     >
    > >     >     >       I've binned the files roughly based on the container file's mime
    > >     >     >     type, e.g. https://corpora.tika.apache.org/base/xmps/pdf/
    > >     >     >
    > >     >     >       The process is still running, and I view this as a first draft.
    > >     >     >     Please let me know if there's anything I can do to make these data
    > >     >     >     easier to use/more useful or if you see any problems.
    > >     >     >
    > >     >     >       Cheers,
    > >     >     >
    > >     >     >                  Tim
    > >     >     >
    > >     >
    > >
