Hi All

I've just been brainstorming with Ray Gauss, and we think we've come up with a way to move towards cleaner and clearer metadata property definitions (prefixes, properties with types etc), whilst maintaining backwards compatibility and avoiding too much work for parsers during the migration. It'll hopefully also help with the larger plan of improving the metadata, and make life easier for people working on that.

I'll use DublinCore as an example, but it's not the only one this'll apply to.

Today, we have all the keys from DublinCore imported onto the Metadata object, and the parsers all call eg Metadata.DESCRIPTION rather than DublinCore.DESCRIPTION. This is a string key, not a property, so it carries no information about type etc, and because it's a raw key of "description", people outside of the Java space (eg tika-cli users) don't know what it is defined as.

What I think we'd really like is for that to be a property, with type, with a key that includes our chosen prefix (so that tika-cli users etc know what it is), that doesn't break backwards compatibility until 2.0.

Additionally, we want to identify which properties are common, which all parsers should be mapping their metadata onto (eg everything should map the metadata that corresponds roughly to what Dublin Core explains Description to be, no matter what the format calls it), in addition to any format specific ones (which only advanced users want).

We think we have a plan!

In order to avoid breaking backwards compatibility, we've looked, and basically nothing uses the metadata key interfaces directly. Everything seems to use the Metadata one instead, eg Metadata.DESCRIPTION rather than DublinCore.DESCRIPTION. So, we think we can change the Dublin Core one, provided that Metadata is unchanged.

Step one is therefore to change all the definitions in Dublin Core to be proper properties. We copy over the old strings to Metadata, and mark them as @Deprecated (until 2.0). Everything should still work.
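As a rough sketch of what step one might look like: the class names below (TypedProperty, DublinCoreSketch, MetadataSketch) are hypothetical stand-ins rather than the real Tika classes, and the "dc:" prefix format is an assumption made purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical typed property: a prefixed key plus a value type. */
final class TypedProperty {
    enum ValueType { TEXT, DATE, INTEGER }

    private final String name;
    private final ValueType type;

    TypedProperty(String prefix, String localName, ValueType type) {
        // Assumed key format: "<prefix>:<localName>", eg "dc:description"
        this.name = prefix + ":" + localName;
        this.type = type;
    }

    String getName() { return name; }
    ValueType getType() { return type; }
}

/** Stand-in for the Dublin Core interface once it holds proper properties. */
interface DublinCoreSketch {
    String PREFIX = "dc";
    TypedProperty DESCRIPTION =
            new TypedProperty(PREFIX, "description", TypedProperty.ValueType.TEXT);
}

/** Stand-in for Metadata: the old raw string key is copied here and deprecated. */
class MetadataSketch {
    /** @deprecated use DublinCoreSketch.DESCRIPTION instead; going away in 2.0 */
    @Deprecated
    public static final String DESCRIPTION = "description";

    private final Map<String, String> values = new HashMap<>();

    // Old style: raw string key
    public void set(String key, String value) { values.put(key, value); }
    public String get(String key) { return values.get(key); }

    // New style: typed property, stored under its prefixed name
    public void set(TypedProperty property, String value) {
        values.put(property.getName(), value);
    }
    public String get(TypedProperty property) {
        return values.get(property.getName());
    }
}
```

Existing callers of the string constant keep compiling (with a deprecation warning), while new code can move over to the typed property at its own pace.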

Next, we define a class to hold the common Tika metadata properties. These are the ones we consider to be common across all formats, which parsers should be trying to populate wherever they can. (Most parsers already do this, eg for title or description). We'll do a few of these, but we'll need others to contribute to help decide the rest. These will be delegated out to a standard property that someone else has already defined, as we do now.
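To illustrate the delegation idea, here's a minimal sketch. NamedProperty, DublinCoreKeys and CommonProperties are hypothetical names, and which properties end up in the common set is still to be decided, so title and description are just the two examples mentioned above.

```java
/** Hypothetical minimal property: just a prefixed name. */
final class NamedProperty {
    private final String name;

    NamedProperty(String name) { this.name = name; }

    String getName() { return name; }
}

/** Stand-in for an existing standard that already defines these properties. */
final class DublinCoreKeys {
    static final NamedProperty TITLE = new NamedProperty("dc:title");
    static final NamedProperty DESCRIPTION = new NamedProperty("dc:description");

    private DublinCoreKeys() {}
}

/** The common set: each entry delegates to a property someone else defined. */
final class CommonProperties {
    static final NamedProperty TITLE = DublinCoreKeys.TITLE;
    static final NamedProperty DESCRIPTION = DublinCoreKeys.DESCRIPTION;

    private CommonProperties() {}
}
```

Because the common entries are the very same objects as the underlying standard ones, a parser that sets CommonProperties.DESCRIPTION and a consumer that reads DublinCoreKeys.DESCRIPTION end up using the same key.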

With that done, we can also specify some aliases, so that when you set one property it can be defined to also set some others. This allows us to say "when you set the new Dublin Core description, for now also go and set the old style description". This support will also be helpful for mappings on XMP-aware (or similar) formats, to map between their custom properties and our common ones.
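The alias behaviour could be sketched roughly like this. AliasingMetadata and its addAlias method are hypothetical, as is the "dc:description" key format; this is one possible shape for the idea, not a description of how it will actually be implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical metadata store where setting one key also sets its aliases. */
class AliasingMetadata {
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, List<String>> aliases = new HashMap<>();

    /** Registers that a write to key should also be written to alias. */
    void addAlias(String key, String alias) {
        aliases.computeIfAbsent(key, k -> new ArrayList<>()).add(alias);
    }

    void set(String key, String value) {
        values.put(key, value);
        // Copy the value onto every alias registered for this key
        List<String> targets = aliases.getOrDefault(key, Collections.<String>emptyList());
        for (String alias : targets) {
            values.put(alias, value);
        }
    }

    String get(String key) { return values.get(key); }
}
```

For example, after addAlias("dc:description", "description"), setting "dc:description" also fills in the old "description" key. In this sketch the mapping is one-way: setting the old key does not touch the new one.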

Finally, we go through the parsers and update them to set the new properties, rather than the old strings. They'll maintain compatibility for all users (those using the Java lookups, and those using raw keys eg tika-cli), and when we drop the old keys in 2.0 the parsers don't need to change.

We'll be opening issues for all of these, and doing the work in small chunks so everyone can follow. I believe this all fits with what everyone has been discussing for a while, doesn't break anything, and moves us forward. Despite the long email, it's actually quite small changes!

Nick
