Wait...I'm sorry...I was wrong on the first point. 1) In Tika generally, we (currently) use Jempbox to parse XMP when the parsers come across it, after they have selected the right packet and done any joining or other modifications, i.e. the "right" XMP. We use xmpcore to convert other metadata to XMP in our tika-xmp module, and xmpcore is also a dependency of Drew Noakes' metadata-extractor, which is critical.
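For reference, the "quick and dirty byte scanner" from point 2 of the quoted message below could look roughly like this. It is a minimal Python sketch, not the code actually run against the corpus: it assumes the packets are stored uncompressed, in an 8-bit encoding, inside the standard `<?xpacket begin=...?>` / `<?xpacket end=...?>` wrapper, and the names `PACKET_RE` and `scan_xmp_packets` are made up for illustration.

```python
import re
from typing import List

# Assumption: packets use the standard XMP packet wrapper and are stored
# uncompressed in an 8-bit encoding. Packets inside compressed streams or
# in UTF-16/UTF-32 will not be found by this pattern.
PACKET_RE = re.compile(
    rb"<\?xpacket\s+begin=.*?<\?xpacket\s+end=.*?\?>",
    re.DOTALL,
)


def scan_xmp_packets(data: bytes) -> List[bytes]:
    """Return every raw XMP packet found in data, in file order.

    No file-format parsing is done, so the result says nothing about
    which packet is current or which object it belongs to.
    """
    return [match.group(0) for match in PACKET_RE.finditer(data)]
```

Because it returns every packet it finds, a PDF with incremental updates can yield one packet per update, so this shows what is in the file at all, not which XMP is the correct one.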
On Wed, Mar 17, 2021 at 2:43 PM Tim Allison <talli...@apache.org> wrote:
>
> >Isn't that why are you using the XMP Toolkit???
>
> Sorry, we may be talking about two different things.
>
> 1) In Tika generally, we use xmpcore to parse XMP after the parsers
> extract it and process it (correctly!) from various file formats.
>
> 2) For this exercise, I wanted a quick and dirty byte scanner to
> extract the raw xmp packets...as much as we could find in any file
> format without relying on file-format specific parsers.
>
> I can do a second run where I modify Tika to extract the XMP from the
> various parsers after they do their processing (determining most
> recent/joining, etc) to extract the correct XMP.
>
> And I can do a third run where I modify Tika to extract XMP associated
> with embedded images in PDFs, for example.
>
> I hope this clarifies things. Please let me know what would be most
> useful for you.
>
> Cheers,
>
> Tim
>
> On Wed, Mar 17, 2021 at 2:26 PM Leonard Rosenthol
> <lrose...@adobe.com.invalid> wrote:
> >
> > > The other thing is that I wanted to scrape xmp out of files beyond
> > > PDFs.
> >
> > Isn't that why are you using the XMP Toolkit???
> >
> > Leonard
> >
> > On 3/17/21, 2:10 PM, "Tim Allison" <talli...@apache.org> wrote:
> >
> > > ARGH!!!! Please don't do this - it will get you the wrong results
> > > in almost all cases. Remember that in a PDF with updates, there
> > > can/will be a new XMP block with each update.
> >
> > Ha, right. I completely understand (perhaps _only_ this small point
> > on PDFs). On this pass, my goal was to see what was in the file at
> > all, not what was the correct XMP. Part of my interest is in what's
> > available in the file, but not available readily to the user.
> >
> > The other thing is that I wanted to scrape xmp out of files beyond PDFs.
> >
> > So, I can definitely take a second run where I let a PDF tool extract
> > the correct XMP if there's interest in that.
> >
> > On Wed, Mar 17, 2021 at 1:56 PM Leonard Rosenthol
> > <lrose...@adobe.com.invalid> wrote:
> > >
> > > > I'm literally just scraping bytes out of files for now without
> > > > any parsing
> > >
> > > ARGH!!!! Please don't do this - it will get you the wrong results
> > > in almost all cases. Remember that in a PDF with updates, there
> > > can/will be a new XMP block with each update.
> > >
> > > > if I traverse the COSDocument's objects and look for /Metadata
> > > > and grab the stream, will that be what you're looking for?
> > >
> > > Just getting those elements would be a great start. If you could
> > > also include the rest of the dictionary in which it was found (or at
> > > least the /Type and /Subtype keys, if present) would be great!
> > >
> > > Leonard
> > >
> > > On 3/17/21, 1:39 PM, "Tim Allison" <talli...@apache.org> wrote:
> > >
> > > Hi Leonard,
> > > I'm literally just scraping bytes out of files for now without any
> > > parsing...so if the XMP is concealed in a compressed stream or
> > > something more interesting, I'm not grabbing it. I'm also not
> > > tracking which XMP is associated with which object.
> > > Please forgive me...if I traverse the COSDocument's objects and look
> > > for /Metadata and grab the stream, will that be what you're looking
> > > for? Or, is there a commandline tool I can run to get what you're
> > > interested in?
> > > Thank you.
> > >
> > > Cheers,
> > >
> > > Tim
> > >
> > > On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
> > > <lrose...@adobe.com.invalid> wrote:
> > > >
> > > > Are you only pulling document-level XMP? If so, could you
> > > > extend it to support object-level metadata as well? I, for one,
> > > > would love to get insight into the use of object-level metadata -
> > > > what objects are they attached to, what are they being used for, etc.
> > > >
> > > > Leonard
> > > >
> > > > On 3/17/21, 11:37 AM, "Tim Allison" <talli...@apache.org> wrote:
> > > >
> > > > All,
> > > >
> > > > I'm scraping XMPs out of our corpus and placing them here
> > > > as standalone files:
> > > >
> > > > https://corpora.tika.apache.org/base/xmps/
> > > >
> > > > I've binned the files roughly based on the container file's mime
> > > > type, e.g. https://corpora.tika.apache.org/base/xmps/pdf/
> > > >
> > > > The process is still running, and I view this as a first draft.
> > > > Please let me know if there's anything I can do to make these data
> > > > easier to use/more useful or if you see any problems.
> > > >
> > > > Cheers,
> > > >
> > > > Tim