> So this email is for discussion only - not for immediate action.
Got it. As you can see by TIKA-1607 and [0], this has been an ongoing and
important discussion, and I appreciate your contributions...I'm not a standards
person, and was interested to learn more about ISO 19115.
> But approach 1 or c suggests that different conceptual models (e.g. Dublin
> core versus ISO 19115) would co-exist.
Yes, as I'm thinking more about c) (and note that this is a personal and
half-baked proposal, not at all speaking for the Tika community), we could
offer multiple models for advanced users. If someone wanted to contribute code
that would represent metadata in ISO 19115 for the appropriate parsers or if we
could scrape ISO-19115 out of documents (as we might consider doing with XMP
streams), the advanced user could grab that node and go to town. To emphasize
Nick's point, we absolutely want to keep the basics easy to get to. No single
standard is likely to be sufficient for us, and yet, we also don't want to
create our very own.
Again, I can't emphasize enough the importance of Nick's point on keeping
simple things simple. As SOLR-7232 shows, even our current model is not being
used correctly by very important consumers....I really need to get to work on
that one...
Cheers,
Tim
[0] http://wiki.apache.org/tika/MetadataRoadmap
-----Original Message-----
From: Martin Desruisseaux [mailto:[email protected]]
Sent: Thursday, October 15, 2015 6:10 AM
To: [email protected]
Subject: Re: ISO 19115 as a metadata model for Tika?
On 14/10/15 20:15, Allison, Timothy B. wrote:
> On TIKA-1607, there are two (and a half) proposals:
> 1) move everything to DOM with helper classes for common elements
> 2) use POJOs as metadata values
> c) ;) keep current setup, perhaps add binary values, use DOM inputstreams for
> things that already have standards (e.g. Dublin core) This could be a
> transitional step to option 1 in Tika 2.0.
>
> If we went with 1 or c) we could embed ISO 19115, we could either embed the
> info within the DOM or add an ISO DOM stream that would include this
> information.
Thanks for explaining. But approach 1 or c suggests that different conceptual
models (e.g. Dublin core versus ISO 19115) would co-exist, regardless of the
underlying data structure (DOM or something else), is that right? For example,
if someone wants to get the title of a document, would they specify, for
example, "I'm using the TITLE key from the Dublin core model" or "I'm using the
IDENTIFICATION_INFO/CITATION/TITLE key from the ISO 19115 model"? Or does Tika
plan to propose its own "universal" model?
> (...snip...) However, once we move beyond Map<String, String[]> the
> user is going to have to have some knowledge of the metadata structure
> to extract information, whether that's POJO, DOM or Map<String, Node>.
Right, this is related to my question above. To avoid the need to know the
metadata structure of a specific data format, Tika (in my
understanding) currently maps some metadata to the Dublin core model, which is
used as a "universal" conceptual model. So anyone can ask for the title without
knowing where the title is stored in various data formats.
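To illustrate the idea of a "universal" key, here is a minimal Java sketch. It
is not Tika code: the class name, the "dc:title" key string and the per-format
field names are assumptions chosen for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

public class CommonTitleSketch {
    // Hypothetical native field names per format; illustrative only, not taken from Tika.
    static final Map<String, String> TITLE_FIELD_BY_FORMAT = Map.of(
            "pdf", "Title",
            "netcdf", "title",
            "html", "og:title");

    // Assumed common key, loosely modelled on the Dublin Core "title" element.
    static final String COMMON_TITLE = "dc:title";

    /** Copies the format-specific title, if present, under the common key. */
    public static Map<String, String[]> normalize(String format, Map<String, String> nativeFields) {
        Map<String, String[]> out = new HashMap<>();
        String field = TITLE_FIELD_BY_FORMAT.get(format);
        if (field != null && nativeFields.containsKey(field)) {
            out.put(COMMON_TITLE, new String[] { nativeFields.get(field) });
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String[]> fromPdf = normalize("pdf", Map.of("Title", "Annual report"));
        Map<String, String[]> fromNetcdf = normalize("netcdf", Map.of("title", "Sea level anomaly"));
        // The caller asks for the same key regardless of the source format.
        System.out.println(fromPdf.get(COMMON_TITLE)[0]);
        System.out.println(fromNetcdf.get(COMMON_TITLE)[0]);
    }
}
```

The caller only ever sees the common key; knowledge of each format's own field
name stays inside the mapping table.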
However, for some more advanced needs, the Dublin core model is not enough and
cannot easily be extended. A new conceptual model is needed.
ISO 19115 is one such conceptual model that could be used as a replacement for
Dublin core, but there are also other conceptual models that are even more
complex than ISO 19115. Are there any thoughts about what the compromise
between simplicity and completeness would be in Tika 2?
> On your interest in ISO 19115, to echo Nick, what specifically do you need?
> What document formats do you see populating this information?
We do not need changes in the Tika model at this time, since Apache SIS has its
own metadata engine (but targeting only geospatial data formats like NetCDF - no
Word or PDF parsing - and using ISO 19115 as its "universal model" instead of
Dublin core). But we have seen talks about geospatial metadata in Tika at a
recent ApacheCon, and I was a little bit worried to see that some proposed
solutions (i.e. new properties) were Tika-specific instead of using
international standards (note: I'm not suggesting using Apache SIS - only
considering the international standard behind it).
So I'm not looking for a solution to a technical problem, but I'm trying to
learn more about the strategic direction that Tika wishes to take.
Would Tika consider moving to a richer metadata model than Dublin core? Would
ISO 19115 be considered too geospatial-centric (which I could understand)?
Would Tika support more than one "universal model" if it wants to combine
Dublin core simplicity with the richness of other international standards?
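As a rough illustration of two models co-existing, the Java sketch below keeps
a flat map for the simple case and a nested structure for the richer one. It
is a sketch under stated assumptions, not Tika or Apache SIS API: the class
name, the "dc:title" key and the use of nested maps for the ISO 19115 path are
all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class CoexistingModelsSketch {
    /** Simple model: one flat key, first value wins. */
    public static String flatTitle(Map<String, String[]> flat) {
        String[] values = flat.get("dc:title"); // assumed key name
        return (values == null || values.length == 0) ? null : values[0];
    }

    /** Richer model: walk nested maps along a slash-separated path. */
    public static String pathValue(Map<String, Object> root, String path) {
        Object node = root;
        for (String step : path.split("/")) {
            if (!(node instanceof Map)) {
                return null; // path goes deeper than the structure does
            }
            node = ((Map<?, ?>) node).get(step);
        }
        return (node instanceof String) ? (String) node : null;
    }

    public static void main(String[] args) {
        Map<String, String[]> flat = new HashMap<>();
        flat.put("dc:title", new String[] { "Sea level anomaly" });

        // Nested maps stand in for whatever tree type (DOM, POJO) is chosen.
        Map<String, Object> iso = Map.of(
                "IDENTIFICATION_INFO", Map.of(
                        "CITATION", Map.of("TITLE", "Sea level anomaly")));

        System.out.println(flatTitle(flat));
        System.out.println(pathValue(iso, "IDENTIFICATION_INFO/CITATION/TITLE"));
    }
}
```

Casual users would only need the flat lookup, while advanced users could
navigate the richer structure when it is available.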
About document formats populated with ISO 19115 metadata: standalone ISO
19115 files are provided by various data producers, for example 1) from NASA,
2) from the Spanish mapping agency or 3) from all French government agencies:
1. http://podaac.jpl.nasa.gov/ws/metadata/dataset/?shortName=AVISO_L4_DYN_TOPO_1DEG_1MO&format=iso
2. http://www.ign.es/csw-inspire/srv/spa/xml_iso19139?id=9584
3. http://www.geocatalogue.fr/getMetadata?format=XML&id=1785
ISO 19115 information is also embedded in raster data, as in the "GML in
JPEG2000" standard. Equivalent information is embedded in NetCDF files and translated to
the ISO 19115 model by tools like "ncISO" from NOAA/NGDC. I saw that Tika has
an org.apache.tika.metadata.ClimateForcast interface, but it describes only the
information at the root of NetCDF files without describing the variables
included in those files (which would need a metadata tree structure).
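To show what such a tree structure could look like, here is a minimal Java
sketch. The Node class, its methods and the sample attribute values are
hypothetical and exist only for illustration; they are not part of Tika,
Apache SIS or the NetCDF libraries.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetadataTreeSketch {
    /** A named node with its own attributes and any number of child nodes. */
    static final class Node {
        final String name;
        final Map<String, String> attributes = new LinkedHashMap<>();
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        Node attr(String key, String value) {
            attributes.put(key, value);
            return this;
        }

        Node child(Node c) {
            children.add(c);
            return this;
        }

        /** Returns the first direct child with the given name, or null. */
        Node find(String childName) {
            for (Node c : children) {
                if (c.name.equals(childName)) {
                    return c;
                }
            }
            return null;
        }
    }

    /** Root-level attributes plus one child node per variable (made-up values). */
    static Node example() {
        return new Node("file")
                .attr("title", "Example dataset")
                .child(new Node("sst")
                        .attr("units", "K")
                        .attr("long_name", "sea surface temperature"))
                .child(new Node("time")
                        .attr("units", "days since 1970-01-01"));
    }

    public static void main(String[] args) {
        Node root = example();
        System.out.println(root.attributes.get("title"));
        System.out.println(root.find("sst").attributes.get("units"));
    }
}
```

A flat key/value map can hold the root attributes, but the per-variable
attributes need a level of nesting such as the children list here.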
So this email is for discussion only - not for immediate action.
Regards,
Martin