RE: ISO 19115 as a metadata model for Tika?

Nick Burch Thu, 15 Oct 2015 15:44:21 -0700

On Thu, 15 Oct 2015, Allison, Timothy B. wrote:

Y, as I'm thinking more about c) (and note that this is a personal and half-baked proposal, not at all speaking for the Tika community), we could offer multiple models for advanced users.

One thing I'm keen to see is the ability to map on the output side back into other standards. Our XMP module is one such use of this. The JSON one is too, but less standard... Idea being that we have a common, sane, rich-enough internal representation, then people wanting XMP / ISO-19115 / etc can then transform the output Tika metadata onwards into their chosen format.
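As a rough illustration of that "transform onwards" idea, here is a minimal sketch, assuming the internal representation is a flat key/value map. The function `transform`, the table `TARGET_KEYS`, and the `target:*` names are all hypothetical placeholders; they are not real XMP or ISO 19115 element names, and this is not how Tika's XMP module works.

```python
# Hypothetical key mapping from internal keys to a target standard.
# The "target:*" names are placeholders, not real XMP / ISO 19115 elements.
TARGET_KEYS = {
    "dc:title": "target:title",
    "dc:creator": "target:author",
}


def transform(metadata, key_map):
    """Split metadata into entries renamed for the target and entries left unmapped."""
    mapped, unmapped = {}, {}
    for key, value in metadata.items():
        if key in key_map:
            mapped[key_map[key]] = value
        else:
            # Keep what we could not map so that no information is silently lost.
            unmapped[key] = value
    return mapped, unmapped
```

Returning the unmapped entries separately lets a caller decide whether to drop them or carry them along in some extension area of the target format.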

If someone wanted to contribute code that would represent metadata in ISO 19115 for the appropriate parsers or if we could scrape ISO-19115 out of documents (as we might consider doing with XMP streams), the advanced user could grab that node and go to town. To emphasize Nick's point, we absolutely want to keep the basics easy to get to. No single standard is likely to be sufficient for us, and yet, we also don't want to create our very own.

There's also Giuseppe's work on input metadata, eg TIKA-1691, to allow richer mapping from input metadata onto our standards. Having helped give his talk in Budapest, with help from Michael from OODT/JPL, I more get this. Idea is that quite custom formats (eg PDFs from one specific conference) could say "grab this text as 'first name', that as 'second name', in our own custom metadata standard, then combine the two for dc:creator for everyone else". That probably works best in combination with some of the content -> metadata content handlers.
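The "combine the two for dc:creator" rule could look something like the sketch below. It is only an illustration: `combine_creator` and the `conf:first-name` / `conf:second-name` keys are made-up names, and nothing here reflects the actual TIKA-1691 design.

```python
# Hypothetical sketch: neither the function nor the "conf:*" keys are Tika APIs.
def combine_creator(custom, first_key="conf:first-name", second_key="conf:second-name"):
    """Return a copy of the metadata with dc:creator derived from two custom fields."""
    out = dict(custom)  # keep the custom fields for users of the custom standard
    parts = [custom.get(first_key, "").strip(), custom.get(second_key, "").strip()]
    name = " ".join(p for p in parts if p)
    # Only fill in dc:creator when we have something and it is not already set.
    if name and "dc:creator" not in out:
        out["dc:creator"] = name
    return out
```

The custom keys stay in the result, so advanced users still see the fine-grained values while everyone else can just read dc:creator.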


My view is that we need:
 * A model simple enough for beginners to understand and get started
 * Something flexible enough for advanced users to still utilise
 * Something that's consistent, as much as possible, between formats
 * Something with enough information (possibly hidden by default) to allow
   richly mapping out into other standards/systems
 * Something that works for "simple" office docs, but still copes with
   "complex" ones like media and scientific formats, without too much
   surprise of changes
 * Something that deals with the conflicts in the above ;-)

Our current new-ish retrofitted model with properties (which offer both a simple string and richer typed values) covers most of those, but is struggling with the complex formats case.
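To make "both a simple string and richer typed values" concrete, here is a toy sketch of the general pattern. The `Property` and `Metadata` classes below are simplified stand-ins written for this example, assuming values are stored as strings and parsed on typed access; they are not Tika's actual classes or signatures.

```python
from datetime import date


class Property:
    """A named key that also knows how to parse its string value into a type."""

    def __init__(self, name, parse):
        self.name = name
        self.parse = parse


class Metadata:
    """Stores everything as strings; typed access goes through a Property."""

    def __init__(self):
        self._values = {}

    def set(self, key, value):
        name = key.name if isinstance(key, Property) else key
        self._values[name] = str(value)

    def get(self, key):
        if isinstance(key, Property):
            raw = self._values.get(key.name)
            return None if raw is None else key.parse(raw)
        return self._values.get(key)  # plain string access for simple use


# Hypothetical property definition for illustration.
CREATED = Property("dcterms:created", date.fromisoformat)
```

A beginner can call `get("dcterms:created")` and receive a string, while an advanced user can call `get(CREATED)` and receive a `date`. Nested or repeated structures, as found in the complex formats, do not fit this flat shape, which is the struggle described above.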

All the alternatives, including my own preferred (complexity on the key not value) have downsides and have issues on at least one of the above!

I think we're all very keen to find out how other projects have tackled the same problem space, and how they've squared our circle...!


Nick
