Hello all
At the last ApacheCon in Budapest, we had some discussion about
geospatial metadata in Tika. Currently Tika has 3 properties (latitude,
longitude, altitude) in its org.apache.tika.metadata.Geographic
interface, also reproduced in the TikaCoreProperties interface.
Geospatial metadata can be more complex, but does Tika wish to support
more geospatial metadata structures, or to keep that model simple?
If Tika wishes to support geospatial metadata more extensively, would
Tika consider using the ISO 19115 metadata model? This international
standard is the official metadata model of the Open Geospatial
Consortium (OGC) and is in use in various organisations (some parts of
NASA, European Space Agency, Food and Agriculture Organisation, etc.).
The ISO 19115 standard is quite big, with about 500 properties.
ISO 19115 could be a format like any other format in Tika. One possible
way for Tika to read and write ISO 19115 documents in XML would be to
use Apache Spatial Information System (SIS). The Maven dependency would be:
<dependency>
  <groupId>org.apache.sis.core</groupId>
  <artifactId>sis-metadata</artifactId>
  <version>0.6</version>
</dependency>
And the code can be (there is a more generic API working also with
NetCDF files, but we can leave that for later):
import java.net.URL;
import org.apache.sis.xml.XML;
import org.opengis.metadata.Metadata;
...
// url is a java.net.URL pointing to an ISO 19115 XML document.
Metadata metadata = (Metadata) XML.unmarshal(url);
The above Metadata object is the root of a tree. It may have many titles
for different things (a title for the data, a title for the quality
evaluation procedure, etc.), many authors, many variables in the
dataset, etc. One possible problem is that ISO 19115 metadata requires a
tree structure, while in my understanding Tika metadata are currently
stored in a flat structure. Does Tika plan to support a tree structure?
Would it be a prerequisite before Tika can support ISO 19115?
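To make the tree-versus-flat question more concrete, here is a minimal
sketch of one possible way to project a tree onto flat key/value pairs
by encoding the path in the key. This is only an illustration: the
FlattenSketch class, the nested maps standing in for the metadata tree,
and the property names are all hypothetical. Nothing here is existing
Tika or Apache SIS API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: neither this class nor its key naming
// scheme exists in Tika or Apache SIS.
class FlattenSketch {

    /** Flattens a tree of nested maps and lists into "path -> value" pairs. */
    static Map<String, String> flatten(Map<String, Object> tree) {
        Map<String, String> flat = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : tree.entrySet()) {
            walk(e.getKey(), e.getValue(), flat);
        }
        return flat;
    }

    private static void walk(String path, Object node, Map<String, String> flat) {
        if (node instanceof Map) {
            // Child element: extend the path with the property name.
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                walk(path + "." + e.getKey(), e.getValue(), flat);
            }
        } else if (node instanceof List) {
            // Multi-valued property: extend the path with the index.
            List<?> items = (List<?>) node;
            for (int i = 0; i < items.size(); i++) {
                walk(path + "[" + i + "]", items.get(i), flat);
            }
        } else {
            // Leaf value: store it under its full path.
            flat.put(path, String.valueOf(node));
        }
    }

    public static void main(String[] args) {
        // Made-up property names, not actual ISO 19115 paths.
        Map<String, Object> citation = new LinkedHashMap<>();
        citation.put("title", "Title of the data");
        citation.put("authors", Arrays.asList("First author", "Second author"));

        Map<String, Object> quality = new LinkedHashMap<>();
        quality.put("title", "Title of the quality evaluation procedure");

        Map<String, Object> root = new LinkedHashMap<>();
        root.put("citation", citation);
        root.put("qualityProcedure", quality);

        // Each "title" keeps its own key because the key carries the path.
        flatten(root).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Such a projection keeps the two titles distinct, but the tree structure
then lives only in the key strings, so consumers would need to agree on
the naming convention and parse it back.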
Another question is related to the fact that while officially a
geospatial metadata standard, ISO 19115 is actually a much more generic
metadata standard with some geospatial parts in it. In my understanding
ISO 19115 contains most of Dublin Core, together with many of the
properties currently provided in various Tika interfaces. ISO 19115
could potentially replace many org.apache.tika.metadata interfaces with
a single consistent model. I presume that such a replacement would not
be possible for compatibility reasons, and maybe also for complexity
reasons. But I would be curious to know whether Tika has any plans for
the evolution of its metadata model.
Regards,
Martin