On Thu, 15 Oct 2015, Allison, Timothy B. wrote:
Yes, as I'm thinking more about c) (and note that this is a personal and
half-baked proposal, not at all speaking for the Tika community), we
could offer multiple models for advanced users.
One thing I'm keen to see is the ability to map on the output side back
into other standards. Our XMP module is one such use of this. The JSON one
is too, but less standard... The idea is that we have a common, sane,
rich-enough internal representation, so that people wanting XMP / ISO-19115
/ etc. can transform the output Tika metadata onwards into their
chosen format.
If someone wanted to contribute code that would represent metadata in
ISO 19115 for the appropriate parsers or if we could scrape ISO-19115
out of documents (as we might consider doing with XMP streams), the
advanced user could grab that node and go to town. To emphasize Nick's
point, we absolutely want to keep the basics easy to get to. No single
standard is likely to be sufficient for us, and yet, we also don't want
to create our very own.
There's also Giuseppe's work on input metadata, e.g. TIKA-1691, to allow
richer mapping from input metadata onto our standards. Having helped give
his talk in Budapest, with help from Michael from OODT/JPL, I understand
this better now. The idea is that quite custom formats (e.g. PDFs from one
specific conference) could say "grab this text as 'first name', that as
'second name', in our own custom metadata standard, then combine the two
for dc:creator for everyone else". That probably works best in combination
with some of the content -> metadata content handlers.
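As a rough illustration of that "combine the two for dc:creator" idea, here
is a minimal sketch in plain Java. It deliberately does not use Tika's API or
anything from TIKA-1691: a plain Map stands in for the metadata object, and
the custom:firstName / custom:secondName keys and the withCreator helper are
made-up names for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a plain Map stands in for a metadata object.
final class CreatorMappingSketch {

    // Assumed keys in a custom, conference-specific metadata standard.
    static final String FIRST_NAME = "custom:firstName";
    static final String SECOND_NAME = "custom:secondName";
    // The common key that "everyone else" would read.
    static final String DC_CREATOR = "dc:creator";

    /** Returns a copy of the input with dc:creator derived from the two name fields. */
    static Map<String, String> withCreator(Map<String, String> custom) {
        Map<String, String> out = new LinkedHashMap<>(custom);
        String first = custom.getOrDefault(FIRST_NAME, "").trim();
        String second = custom.getOrDefault(SECOND_NAME, "").trim();
        String combined = (first + " " + second).trim();
        // Only add dc:creator when at least one name part was captured,
        // and never overwrite a creator that is already present.
        if (!combined.isEmpty()) {
            out.putIfAbsent(DC_CREATOR, combined);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> custom = new LinkedHashMap<>();
        custom.put(FIRST_NAME, "Ada");
        custom.put(SECOND_NAME, "Lovelace");
        System.out.println(withCreator(custom));
    }
}
```

The custom fields are kept alongside the derived value, so an advanced user
can still get at the fine-grained data while a beginner only needs to look at
dc:creator.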
My view is that we need:
* A model simple enough for beginners to understand and get started
* Something flexible enough for advanced users to still utilise
* Something that's consistent, as much as possible, between formats
* Something with enough information (possibly hidden by default) to allow
richly mapping out into other standards/systems
* Something that works for "simple" office docs, but still copes with
"complex" ones like media and scientific formats, without too many
surprising changes
* Something that deals with the conflicts in the above ;-)
Our current new-ish retrofitted model with properties (which offer both a
simple string and richer typed values) covers most of those, but is
struggling with the complex formats case.
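To make "both a simple string and richer typed values" a bit more concrete,
here is a minimal sketch in plain Java. It is not Tika's actual Property or
Metadata code: the Prop class, the key names and the parsers are assumptions
picked purely for illustration.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: values are stored as strings, typed keys add parsing on top.
final class TypedPropertySketch {

    /** A made-up typed key: a name plus a parser for the stored string. */
    static final class Prop<T> {
        final String name;
        final Function<String, T> parser;

        Prop(String name, Function<String, T> parser) {
            this.name = name;
            this.parser = parser;
        }
    }

    // Example keys; the names are assumptions, not taken from any standard mapping.
    static final Prop<LocalDate> CREATED = new Prop<>("dcterms:created", LocalDate::parse);
    static final Prop<Integer> PAGE_COUNT = new Prop<>("page-count", Integer::valueOf);

    private final Map<String, String> values = new HashMap<>();

    /** Stores the value in its string form under the property's name. */
    <T> void set(Prop<T> prop, T value) {
        values.put(prop.name, String.valueOf(value));
    }

    /** Simple view: the raw string, or null if the key is absent. */
    String get(String name) {
        return values.get(name);
    }

    /** Richer view: the same stored string parsed into the property's type. */
    <T> T get(Prop<T> prop) {
        String raw = values.get(prop.name);
        return raw == null ? null : prop.parser.apply(raw);
    }

    public static void main(String[] args) {
        TypedPropertySketch metadata = new TypedPropertySketch();
        metadata.set(CREATED, LocalDate.of(2015, 10, 15));
        metadata.set(PAGE_COUNT, 12);
        System.out.println(metadata.get("dcterms:created"));
        System.out.println(metadata.get(PAGE_COUNT));
    }
}
```

A flat string-per-key store like this suits the simple cases; it does not by
itself address nested or repeated structures, which is where the complex
formats start to hurt.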
All the alternatives, including my own preferred one (complexity in the
key, not the value), have downsides and fall short on at least one of the
above!
I think we're all very keen to find out how other projects have tackled
the same problem space, and how they've squared our circle...!
Nick