A plan to improve the metadata property definitions

Nick Burch Wed, 16 May 2012 08:51:33 -0700

Hi All

I've just been brainstorming with Ray Gauss, and we think we've come up with a way to move towards cleaner and clearer metadata property definitions (prefixes, properties with types etc), whilst maintaining backwards compatibility and avoiding too much work for parsers during the migration. It'll hopefully also help with the larger plan of improving the metadata, and make life easier for people working on that.

I'll use DublinCore as an example, but it's not the only one this'll apply to.

Today, we have all the keys from DublinCore imported onto the Metadata object, and all the parsers call eg Metadata.DESCRIPTION rather than DublinCore.DESCRIPTION. This is a string key, not a property, so there's no information on it about type etc, and it's a raw key of "description" so people outside of the Java space (eg tika-cli users) don't know what it is defined as.

What I think we'd really like is for that to be a property, with type, with a key that includes our chosen prefix (so that tika-cli users etc know what it is), that doesn't break backwards compatibility until 2.0.
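
To make the idea concrete, here is a minimal sketch of a typed property whose key carries a prefix. It is not Tika's actual API: the class names, the ValueType enum, the "dc" prefix and the ":" separator are all assumptions made up for illustration.

```java
class PropertySketch {
    /** Hypothetical set of value types a property may declare. */
    enum ValueType { TEXT, DATE, INTEGER }

    /** A metadata property: a prefix, a local name and a declared type. */
    static final class Property {
        final String prefix;
        final String localName;
        final ValueType type;

        Property(String prefix, String localName, ValueType type) {
            this.prefix = prefix;
            this.localName = localName;
            this.type = type;
        }

        /** The key that raw-key users (eg a command line tool) would see. */
        String key() {
            return prefix + ":" + localName;
        }
    }

    // Assumed prefix "dc"; the real prefix is whatever the project chooses.
    static final Property DC_DESCRIPTION =
            new Property("dc", "description", ValueType.TEXT);

    public static void main(String[] args) {
        // Print the prefixed key and the declared type of the property.
        System.out.println(DC_DESCRIPTION.key() + " is of type " + DC_DESCRIPTION.type);
    }
}
```

With something along these lines, a raw-key user sees a prefixed key instead of a bare "description", and Java callers can read the declared type from the property itself.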

Additionally, we want to identify which properties are common, which all parsers should be mapping their metadata onto (eg everything should map the metadata that corresponds roughly to what Dublin Core defines Description to be, no matter what the format calls it), in addition to any format specific ones (which only advanced users want).


We think we have a plan!

In order to avoid breaking backwards compatibility, we've looked at what uses the metadata key interfaces directly, and basically nothing does. Everything seems to use the Metadata one instead, eg Metadata.DESCRIPTION rather than DublinCore.DESCRIPTION. So, we think we can change the Dublin Core one, provided that Metadata is unchanged.

Step one is therefore to change all the definitions in Dublin Core to be proper properties. We copy over the old strings to Metadata, and mark them @deprecated (until 2.0). Everything should still work.

Next, we define a class to hold the common Tika metadata properties. These are the ones we consider to be common across all formats, which parsers should be trying to populate wherever they can. (Most parsers already do this, eg for title or description). We'll do a few of these, but we'll need others to contribute to help decide the rest. These will be delegated out to a standard property that someone else has already defined, as we do now.

With that done, we can also specify some aliases, so that when you set one property it can be defined to also set some others. This allows us to say "when you set the new Dublin Core description, for now also go and set the old style description". This support will also be helpful for mappings on XMP-aware (or similar) formats, to map between their custom properties and our common ones.
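
The aliasing idea could look roughly like the sketch below. Again this is only an illustration under assumptions, not the planned implementation: AliasingMetadata, addAlias and the key "dc:description" are hypothetical names, and a plain map of strings stands in for the real metadata object.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AliasSketch {
    /** Hypothetical metadata holder that also writes to aliased keys. */
    static final class AliasingMetadata {
        private final Map<String, String> values = new HashMap<>();
        private final Map<String, List<String>> aliases = new HashMap<>();

        /** Declare that setting 'key' should also set 'alsoSet'. */
        void addAlias(String key, String alsoSet) {
            aliases.computeIfAbsent(key, k -> new ArrayList<>()).add(alsoSet);
        }

        /** Store the value under the key and under every alias of that key. */
        void set(String key, String value) {
            values.put(key, value);
            for (String alias : aliases.getOrDefault(key, Collections.<String>emptyList())) {
                values.put(alias, value);
            }
        }

        String get(String key) {
            return values.get(key);
        }
    }

    public static void main(String[] args) {
        AliasingMetadata metadata = new AliasingMetadata();
        // "When you set the new description, also set the old style one."
        metadata.addAlias("dc:description", "description");
        metadata.set("dc:description", "An example document");
        System.out.println(metadata.get("description")); // prints "An example document"
    }
}
```

In this sketch the alias only runs one way: setting the old key does not touch the new one, which matches the "for now also go and set the old style" wording.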

Finally, we go through the parsers and update them to set the new properties, rather than the old strings. They'll maintain compatibility for all users (those using the Java lookups, and those using raw keys eg tika-cli), and when we drop that in 2.0 the parsers don't need to change.

We'll be opening issues for all of these, and doing the work in small chunks so everyone can follow. I believe this all fits with what everyone has been discussing for a while, doesn't break anything, and moves us forward. Despite the long email, the changes are actually quite small!


Nick
