On Thu, 15 Oct 2015, Allison, Timothy B. wrote:
Y, as I'm thinking more about c) (and note that this is a personal and half-baked proposal, not at all speaking for the Tika community), we could offer multiple models for advanced users.

One thing I'm keen to see is the ability to map on the output side back into other standards. Our XMP module is one such use of this. The JSON one is too, but less standard... The idea is that we have a common, sane, rich-enough internal representation, and people wanting XMP / ISO-19115 / etc. can then transform the output Tika metadata onwards into their chosen format.

If someone wanted to contribute code that would represent metadata in ISO 19115 for the appropriate parsers or if we could scrape ISO-19115 out of documents (as we might consider doing with XMP streams), the advanced user could grab that node and go to town. To emphasize Nick's point, we absolutely want to keep the basics easy to get to. No single standard is likely to be sufficient for us, and yet, we also don't want to create our very own.

There's also Giuseppe's work on input metadata, e.g. TIKA-1691, to allow richer mapping from input metadata onto our standards. Having helped give his talk in Budapest, with help from Michael from OODT/JPL, I understand this better now. The idea is that quite custom formats (e.g. PDFs from one specific conference) could say "grab this text as 'first name', that as 'second name', in our own custom metadata standard, then combine the two for dc:creator for everyone else". That probably works best in combination with some of the content -> metadata content handlers.
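
To make the "combine the two for dc:creator" idea concrete, here is a minimal sketch in plain Java. It is not Tika code and not what TIKA-1691 implements: the class name, the method name and the "custom:first-name" / "custom:second-name" keys are all invented for illustration. It only shows the shape of the mapping, where the custom keys are kept and a common key is derived from them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only: none of these names come from Tika.
final class CreatorMappingSketch {

    // Copies the custom metadata and, when both name parts are present,
    // adds a combined "dc:creator" entry for consumers of the common model.
    static Map<String, String> mapToCommon(Map<String, String> custom) {
        Map<String, String> out = new LinkedHashMap<>(custom); // keep the custom keys
        String first = custom.get("custom:first-name");   // assumed key name
        String second = custom.get("custom:second-name"); // assumed key name
        if (first != null && second != null) {
            out.put("dc:creator", first.trim() + " " + second.trim());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> custom = new LinkedHashMap<>();
        custom.put("custom:first-name", "Ada");
        custom.put("custom:second-name", "Lovelace");
        System.out.println(mapToCommon(custom).get("dc:creator")); // prints "Ada Lovelace"
    }
}
```

Advanced users could still read the custom keys, while everyone else only needs to look at dc:creator.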

My view is that we need:
 * A model simple enough for beginners to understand and get started
 * Something flexible enough for advanced users to still utilise
 * Something that's consistent, as much as possible, between formats
 * Something with enough information (possibly hidden by default) to allow
   richly mapping out into other standards/systems
 * Something that works for "simple" office docs, but still copes with
   "complex" ones like media and scientific formats, without too many
   surprising changes
 * Something that deals with the conflicts in the above ;-)

Our current new-ish retrofitted model with properties (which offer both a simple string and richer typed values) covers most of those, but is struggling with the complex formats case.
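
As a rough illustration of "both a simple string and richer typed values", the sketch below models a property as a key name plus a parser, in plain Java. It is a simplification under my own assumptions, not the actual Tika Metadata / Property classes: the class names, getString / getTyped, and the choice of LocalDate are invented here, and the two key names are only examples.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch only: a simplified stand-in, not the Tika API.
final class TypedPropertySketch {

    // A property couples a key name with a parser for its typed form.
    static final class Property<T> {
        final String name;
        final Function<String, T> parser;

        Property(String name, Function<String, T> parser) {
            this.name = name;
            this.parser = parser;
        }
    }

    // Values are stored as strings; typed access parses on demand.
    static final class SimpleMetadata {
        private final Map<String, String> values = new HashMap<>();

        void set(Property<?> property, String raw) {
            values.put(property.name, raw);
        }

        // Beginner view: plain string lookup by key name.
        String getString(String name) {
            return values.get(name);
        }

        // Advanced view: the same value, parsed into the property's type.
        <T> T getTyped(Property<T> property) {
            String raw = values.get(property.name);
            return raw == null ? null : property.parser.apply(raw);
        }
    }

    // Example properties; the key names are illustrative.
    static final Property<LocalDate> CREATED =
            new Property<LocalDate>("dcterms:created", LocalDate::parse);
    static final Property<Integer> PAGE_COUNT =
            new Property<Integer>("xmpTPg:NPages", Integer::valueOf);

    public static void main(String[] args) {
        SimpleMetadata md = new SimpleMetadata();
        md.set(CREATED, "2015-10-15");
        md.set(PAGE_COUNT, "12");
        System.out.println(md.getString("dcterms:created")); // prints "2015-10-15"
        System.out.println(md.getTyped(CREATED).getYear());  // prints 2015
        System.out.println(md.getTyped(PAGE_COUNT) + 1);     // prints 13
    }
}
```

A flat key-to-value map like this handles office-style documents well. What it does not express is nesting or repetition (several streams, tracks or bands per file), which is where the complex formats start to hurt.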

All the alternatives, including my own preferred one (complexity on the key, not the value), have downsides and fall short on at least one of the above!

I think we're all very keen to find out how other projects have tackled the same problem space, and how they've squared our circle...!

Nick
