> So this email is for discussion only - not for immediate action.
Got it.  As you can see by TIKA-1607 and [0], this has been an ongoing and 
important discussion, and I appreciate your contributions...I'm not a standards 
person, and was interested to learn more about ISO 19115.

> But approach 1 or c suggests that different conceptual models (e.g. Dublin 
> core versus ISO 19115) would co-exist.
Yes, as I'm thinking more about c) (and note that this is a personal and 
half-baked proposal, not at all speaking for the Tika community), we could 
offer multiple models for advanced users.  If someone wanted to contribute code 
that would represent metadata in ISO 19115 for the appropriate parsers or if we 
could scrape ISO-19115 out of documents (as we might consider doing with XMP 
streams), the advanced user could grab that node and go to town.  To emphasize 
Nick's point, we absolutely want to keep the basics easy to get to.  No single 
standard is likely to be sufficient for us, and yet, we also don't want to 
create our very own.

Again, I can't emphasize enough the importance of Nick's point on keeping 
simple things simple.  As SOLR-7232 shows, even our current model is not being 
used correctly by very important consumers....I really need to get to work on 
that one...

Cheers,

              Tim


[0] http://wiki.apache.org/tika/MetadataRoadmap

-----Original Message-----
From: Martin Desruisseaux [mailto:[email protected]] 
Sent: Thursday, October 15, 2015 6:10 AM
To: [email protected]
Subject: Re: ISO 19115 as a metadata model for Tika?

On 14/10/15 20:15, Allison, Timothy B. wrote:
> On TIKA-1607, there are two (and a half) proposals:
> 1) move everything to DOM with helper classes for common elements
> 2) use POJOs as metadata values
> c) ;) keep current setup, perhaps add binary values, use DOM inputstreams for 
> things that already have standards (e.g. Dublin core). This could be a 
> transitional step to option 1 in Tika 2.0.
>
> If we went with 1 or c), we could embed ISO 19115: we could either embed the 
> info within the DOM or add an ISO DOM stream that would include this 
> information.

Thanks for explaining. But approach 1 or c suggests that different conceptual 
models (e.g. Dublin core versus ISO 19115) would co-exist, regardless of the 
underlying data structure (DOM or something else), is that right? For example, 
if someone wants to get the title of a document, would they specify 
"I'm using the TITLE key from the Dublin core model" or "I'm using the 
IDENTIFICATION_INFO/CITATION/TITLE key from the ISO 19115 model"? Or does Tika 
plan to propose its own "universal" model?


> (...snip...) However, once we move beyond Map<String, String[]> the 
> user is going to have to have some knowledge of the metadata structure 
> to extract information, whether that's POJO, DOM or Map<String, Node>.

Right, this is related to my question above. To avoid the need to know the 
metadata structure of a specific data format, Tika (in my understanding) 
currently maps some metadata to the Dublin core model, which is 
used as a "universal" conceptual model. So anyone can ask for the title without 
knowing where the title is stored in various data formats.

However, for some more advanced needs, the Dublin core model is not enough and 
cannot easily be extended. A new conceptual model is needed.
ISO 19115 is one such conceptual model that could be used as a replacement for 
Dublin core, but there are also other conceptual models that are even more 
complex than ISO 19115. Are there any thoughts about what the compromise 
between simplicity and completeness would be in Tika 2?


> On your interest in ISO 19115, to echo Nick, what specifically do you need? 
> What document formats do you see populating this information?

We do not need changes in the Tika model at this time since Apache SIS has its 
own metadata engine (but targeting only geospatial data formats like NetCDF - 
no Word or PDF parsing - and using ISO 19115 as its "universal model" instead 
of Dublin core). But we have seen talks about geospatial metadata in Tika at 
the recent ApacheCon, and I was a little bit worried to see that some proposed 
solutions (i.e. new properties) were Tika-specific instead of using 
international standards (note: I'm not suggesting to use Apache SIS - only to 
consider the international standard behind it).

So I'm not looking for a solution to a technical problem, but I'm trying to 
learn more about the strategic direction that Tika wishes to take.
Would Tika consider moving to a richer metadata model than Dublin core? Would 
ISO 19115 be considered too geospatial-centric (which I could understand)? 
Would Tika support more than one "universal model" if it wants to preserve 
Dublin core simplicity while offering the richness of other international 
standards?

About document formats populated with ISO 19115 metadata: standalone ISO
19115 files are provided by various data producers, for example 1) from NASA, 
2) from the Spanish mapping agency or 3) from all French government agencies:

 1. http://podaac.jpl.nasa.gov/ws/metadata/dataset/?shortName=AVISO_L4_DYN_TOPO_1DEG_1MO&format=iso
 2. http://www.ign.es/csw-inspire/srv/spa/xml_iso19139?id=9584
 3. http://www.geocatalogue.fr/getMetadata?format=XML&id=1785

ISO 19115 information is also embedded in raster data, as in the "GML in 
JPEG2000" standard. Equivalent information is embedded in NetCDF files and 
translated to the ISO 19115 model by tools like "ncISO" from NOAA/NGDC. I saw 
that Tika has 
an org.apache.tika.metadata.ClimateForcast interface, but it describes only the 
information at the root of NetCDF files without describing the variables 
included in those files (which would need a metadata tree structure).
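
As a rough illustration of why per-variable information calls for a tree, here 
is a minimal sketch of a metadata node with children. The MetadataNode class, 
the variable name "sst" and the sample values are all hypothetical and are not 
taken from Tika, Apache SIS or any real file.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a metadata tree: file-level entries plus per-variable branches. */
class MetadataNode {
    final String name;
    final String value; // null for branch nodes that only hold children
    final List<MetadataNode> children = new ArrayList<>();

    MetadataNode(String name, String value) {
        this.name = name;
        this.value = value;
    }

    /** Creates a child node, attaches it, and returns it so branches can be filled in. */
    MetadataNode add(String childName, String childValue) {
        MetadataNode child = new MetadataNode(childName, childValue);
        children.add(child);
        return child;
    }

    /** Returns the value at a slash-separated path below this node, or null if absent. */
    String find(String path) {
        MetadataNode node = this;
        for (String part : path.split("/")) {
            MetadataNode next = null;
            for (MetadataNode candidate : node.children) {
                if (candidate.name.equals(part)) {
                    next = candidate;
                    break;
                }
            }
            if (next == null) {
                return null;
            }
            node = next;
        }
        return node.value;
    }

    public static void main(String[] args) {
        MetadataNode root = new MetadataNode("file", null);
        root.add("title", "Example dataset");    // root-level attribute
        MetadataNode sst = root.add("sst", null); // one variable in the file
        sst.add("units", "kelvin");               // attribute of that variable
        System.out.println(root.find("title"));
        System.out.println(root.find("sst/units"));
    }
}
```

A flat key/value map can hold the root-level "title" entry, but the "units" of 
each variable only make sense attached to that variable, which is what the 
nesting expresses.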

So this email is for discussion only - not for immediate action.

    Regards,

        Martin

