All,

    The processes have finished.  https://corpora.tika.apache.org/base/xmps/
now has two subdirectories: one for the original raw byte scraping
(1.2 million files, with some junk:
https://corpora.tika.apache.org/base/xmps/scraped-xmps/) and one for
the logical XMPs extracted by ExifTool (450k files:
https://corpora.tika.apache.org/base/xmps/exiftool-xmps/).
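
     For anyone who wants a feel for what "raw byte scraping" means
here, below is a minimal sketch in Python. It is not the code that
produced the corpus. It assumes packets are stored uncompressed and
are delimited by the standard `<?xpacket begin=...?>` and
`<?xpacket end=...?>` processing instructions; the name
`scrape_xmp_packets` is hypothetical.

```python
import re
from typing import List

# Assumption: an XMP packet starts with an <?xpacket begin=...?> processing
# instruction and ends with <?xpacket end=...?>. Packets inside compressed
# streams, or in non-ASCII-compatible encodings such as UTF-16, are not
# matched by this pattern.
XPACKET = re.compile(
    rb"<\?xpacket\s+begin=.*?\?>.*?<\?xpacket\s+end=.*?\?>",
    re.DOTALL,
)


def scrape_xmp_packets(data: bytes) -> List[bytes]:
    """Return every raw XMP packet found in data, in file order.

    No file-format parsing is done, so superseded packets (for example
    from earlier revisions of an incrementally updated PDF) are returned
    alongside the current one.
    """
    return [match.group(0) for match in XPACKET.finditer(data)]
```

     Because nothing is parsed, a PDF with incremental updates yields
one packet per update, which is why this pass shows what is in the
file at all rather than which XMP is the correct one.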

     I plan to write some lightweight code to traverse the DOM and
look for all /Metadata objects and what they're attached to.

     If the XMP files are of any use or if they'd be of more use to
you if we did further processing or packaging, please let me know.

    Cheers,

              Tim

On Wed, Mar 17, 2021 at 4:21 PM Tim Allison <talli...@apache.org> wrote:
>
> > Ah, I wasn't aware of XMPFiles...thank you...I can run that next if that'd 
> > be of any interest.
>
> If there were a commandline or a Java SDK, I could run that next if
> that'd be of any interest. :D
>
> On Wed, Mar 17, 2021 at 3:28 PM Tim Allison <talli...@apache.org> wrote:
> >
> > Ah, I wasn't aware of XMPFiles...thank you...I can run that next if
> > that'd be of any interest.
> >
> > I kicked off a process to run `exifTool -xmp -b` against the files.
> > The output will go here:
> > https://corpora.tika.apache.org/base/exiftool-xmps/
> >
> > On Wed, Mar 17, 2021 at 3:24 PM Leonard Rosenthol
> > <lrose...@adobe.com.invalid> wrote:
> > >
> > > Very interesting - thanks.
> > >
> > > FWIW: The XMPToolkit itself has a module called "XMPFiles" 
> > > (https://github.com/adobe/XMP-Toolkit-SDK#xmpfiles) whose job it is to 
> > > read & write/update XMP (and other related metadata such as EXIF) from 
> > > various file formats.  It's what all the Adobe apps use to handle XMP in 
> > > any file format that we encounter.
> > >
> > > Leonard
> > >
> > > On 3/17/21, 2:48 PM, "Tim Allison" <talli...@apache.org> wrote:
> > >
> > >     Wait...I'm sorry...I'm wrong on the first point.
> > >
> > >     1) in Tika generally, we use Jempbox (currently) to parse XMP when the
> > >     parsers come across it and after they select the right one and do any
> > >     joining or other modifications...i.e. the "right" xmp.  We use xmpcore
> > >     for converting other metadata to XMP in our tika-xmp module, and
> > >     xmpcore is a dependency of Drew Noakes' metadata-extractor which is
> > >     critical.
> > >
> > >     On Wed, Mar 17, 2021 at 2:43 PM Tim Allison <talli...@apache.org> wrote:
> > >     >
> > >     > >Isn't that why you are using the XMP Toolkit???
> > >     >
> > >     > Sorry, we may be talking about two different things.
> > >     >
> > >     > 1) In Tika generally, we use xmpcore to parse XMP after the parsers
> > >     > extract it and process it (correctly!) from various file formats.
> > >     >
> > >     > 2) For this exercise, I wanted a quick and dirty byte scanner to
> > >     > extract the raw xmp packets...as much as we could find in any file
> > >     > format without relying on file-format specific parsers.
> > >     >
> > >     > I can do a second run where I modify Tika to extract the XMP from the
> > >     > various parsers after they do their processing (determining most
> > >     > recent/joining, etc) to extract the correct XMP.
> > >     >
> > >     > And I can do a third run where I modify Tika to extract XMP associated
> > >     > with embedded images in PDFs, for example.
> > >     >
> > >     > I hope this clarifies things.  Please let me know what would be most
> > >     > useful for you.
> > >     >
> > >     > Cheers,
> > >     >
> > >     >        Tim
> > >     >
> > >     > On Wed, Mar 17, 2021 at 2:26 PM Leonard Rosenthol
> > >     > <lrose...@adobe.com.invalid> wrote:
> > >     > >
> > >     > > >    The other thing is that I wanted to scrape xmp out of files beyond PDFs.
> > >     > > >
> > >     > > Isn't that why you are using the XMP Toolkit???
> > >     > >
> > >     > > Leonard
> > >     > >
> > >     > > On 3/17/21, 2:10 PM, "Tim Allison" <talli...@apache.org> wrote:
> > >     > >
> > >     > >     > ARGH!!!!   Please don't do this - it will get you the wrong
> > >     > >     > results in almost all cases.  Remember that in a PDF with updates,
> > >     > >     > there can/will be a new XMP block with each update.
> > >     > >
> > >     > >     Ha, right.  I completely understand (perhaps _only_ this small point
> > >     > >     on PDFs).  On this pass, my goal was to see what was in the file at
> > >     > >     all, not what was the correct XMP. Part of my interest is in what's
> > >     > >     available in the file, but not available readily to the user.
> > >     > >
> > >     > >     The other thing is that I wanted to scrape xmp out of files beyond PDFs.
> > >     > >
> > >     > >     So, I can definitely take a second run where I let a PDF tool extract
> > >     > >     the correct XMP if there's interest in that.
> > >     > >
> > >     > >     On Wed, Mar 17, 2021 at 1:56 PM Leonard Rosenthol
> > >     > >     <lrose...@adobe.com.invalid> wrote:
> > >     > >     >
> > >     > >     > >      I'm literally just scraping bytes out of files for now without any parsing
> > >     > >     > >
> > >     > >     > ARGH!!!!   Please don't do this - it will get you the wrong
> > >     > >     > results in almost all cases.  Remember that in a PDF with updates,
> > >     > >     > there can/will be a new XMP block with each update.
> > >     > >     >
> > >     > >     >
> > >     > >     > > if I traverse the COSDocument's objects and look for
> > >     > >     > > /Metadata and grab the stream, will that be what you're looking for?
> > >     > >     > >
> > >     > >     > Just getting those elements would be a great start.  If you
> > >     > >     > could also include the rest of the dictionary in which it was found (or
> > >     > >     > at least the /Type and /Subtype keys, if present), that would be great!
> > >     > >     >
> > >     > >     > Leonard
> > >     > >     >
> > >     > >     > On 3/17/21, 1:39 PM, "Tim Allison" <talli...@apache.org> wrote:
> > >     > >     >
> > >     > >     >     Hi Leonard,
> > >     > >     >       I'm literally just scraping bytes out of files for now without any
> > >     > >     >     parsing...so if the XMP is concealed in a compressed stream or
> > >     > >     >     something more interesting, I'm not grabbing it.  I'm also not
> > >     > >     >     tracking which XMP is associated with which object.
> > >     > >     >       Please forgive me...if I traverse the COSDocument's objects and look
> > >     > >     >     for /Metadata and grab the stream, will that be what you're looking
> > >     > >     >     for?  Or, is there a commandline tool I can run to get what you're
> > >     > >     >     interested in?
> > >     > >     >       Thank you.
> > >     > >     >
> > >     > >     >       Cheers,
> > >     > >     >
> > >     > >     >                   Tim
> > >     > >     >
> > >     > >     >     On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
> > >     > >     >     <lrose...@adobe.com.invalid> wrote:
> > >     > >     >     >
> > >     > >     >     > Are you only pulling document-level XMP?  If so,
> > >     > >     >     > could you extend it to support object-level metadata as well?   I, for
> > >     > >     >     > one, would love to get insight into the use of object-level metadata -
> > >     > >     >     > what objects are they attached to, what are they being used for, etc.
> > >     > >     >     >
> > >     > >     >     > Leonard
> > >     > >     >     >
> > >     > >     >     > On 3/17/21, 11:37 AM, "Tim Allison" <talli...@apache.org> wrote:
> > >     > >     >     >
> > >     > >     >     >     All,
> > >     > >     >     >
> > >     > >     >     >       I'm scraping XMPs out of our corpus and placing them here as standalone files:
> > >     > >     >     >
> > >     > >     >     >     https://corpora.tika.apache.org/base/xmps/
> > >     > >     >     >
> > >     > >     >     >       I've binned the files roughly based on the container file's mime
> > >     > >     >     >     type, e.g. https://corpora.tika.apache.org/base/xmps/pdf/
> > >     > >     >     >
> > >     > >     >     >       The process is still running, and I view this as a first draft.
> > >     > >     >     >     Please let me know if there's anything I can do to make these data
> > >     > >     >     >     easier to use/more useful or if you see any problems.
> > >     > >     >     >
> > >     > >     >     >       Cheers,
> > >     > >     >     >
> > >     > >     >     >                  Tim
> > >     > >     >     >
> > >     > >     >
> > >     > >
> > >
