On 14/10/15 20:15, Allison, Timothy B. wrote:
> On TIKA-1607, there are two (and a half) proposals:
> 1) move everything to DOM with helper classes for common elements
> 2) use POJOs as metadata values
> c) ;) keep current setup, perhaps add binary values, use DOM inputstreams for 
> things that already have standards (e.g. Dublin core)  This could be a 
> transitional step to option 1 in Tika 2.0.
>
> If we went with 1 or c) we could embed ISO 19115, we could either embed the 
> info within the DOM or add an ISO DOM stream that would include this 
> information.

Thanks for explaining. But approach 1 or c suggests that different
conceptual models (e.g. Dublin Core versus ISO 19115) would co-exist,
regardless of the underlying data structure (DOM or something else), is
that right? For example, if someone wants to get the title of a document,
would they have to specify, for example, "I'm using the TITLE key from
the Dublin Core model" or "I'm using the
IDENTIFICATION_INFO/CITATION/TITLE key from the ISO 19115 model"? Or
does Tika plan to propose its own "universal" model?


> (...snip...) However, once we move beyond Map<String, String[]> the
> user is going to have to have some knowledge of the metadata structure
> to extract information, whether that's POJO, DOM or Map<String, Node>.

Right, this is related to my question above. To avoid the need to know
the metadata structure of a specific data format, Tika (in my
understanding) currently maps some metadata to the Dublin Core model,
which is used as a "universal" conceptual model. So anyone can ask for
the title without knowing where it is stored in the various data formats.
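
As a rough illustration of what I mean by mapping to a "universal"
model, here is a minimal Python sketch. Everything in it is hypothetical
(the format names, the per-format key locations and the to_common_model
helper); it is not Tika's API, only a picture of the idea that callers
read one common TITLE key whatever the source format.

```python
# Hypothetical per-format locations of the title; these are NOT Tika's real keys.
TITLE_LOCATION = {
    "format_a": "dc:title",
    "format_b": "info/Title",
}


def to_common_model(data_format, raw):
    """Copy the format-specific title (if any) into a common TITLE key."""
    common = {}
    location = TITLE_LOCATION.get(data_format)
    if location is not None and location in raw:
        common["TITLE"] = raw[location]
    return common


# Placeholder values: the caller never needs to know where the title came from.
doc_a = to_common_model("format_a", {"dc:title": "Example title"})
doc_b = to_common_model("format_b", {"info/Title": "Example title"})
```

With more than one conceptual model, the caller would additionally have
to say which model the TITLE key belongs to, which is the point of my
question.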

However, for more advanced needs the Dublin Core model is not enough
and cannot easily be extended. A new conceptual model is needed.
ISO 19115 is one such conceptual model that could be used as a
replacement for Dublin Core, but there are also other conceptual models
that are even more complex than ISO 19115. Are there any thoughts about
what the compromise between simplicity and completeness would be in
Tika 2?


> On your interest in ISO 19115, to echo Nick, what specifically do you need? 
> What document formats do you see populating this information?

We do not need changes in the Tika model at this time, since Apache SIS
has its own metadata engine (but targeting only geospatial data formats
like NetCDF - no Word or PDF parsing - and using ISO 19115 as its
"universal model" instead of Dublin Core). But we have seen talks about
geospatial metadata in Tika at recent ApacheCon conferences, and I was
a little bit worried to see that some proposed solutions (i.e. new
properties) were Tika-specific instead of using international standards
(note: I'm not suggesting using Apache SIS - only considering the
international standard behind it).

So I'm not looking for a solution to a technical problem; I'm trying
to learn more about the strategic direction that Tika wishes to take.
Would Tika consider moving to a richer metadata model than Dublin
Core? Would ISO 19115 be considered too geospatial-centric (which I
could understand)? Would Tika support more than one "universal model"
if it wants to preserve the simplicity of Dublin Core alongside the
richness of other international standards?

About document formats populated with ISO 19115 metadata: standalone ISO
19115 files are provided by various data producers, for example 1) from
NASA, 2) from the Spanish mapping agency or 3) from all French
government agencies:

 1. http://podaac.jpl.nasa.gov/ws/metadata/dataset/?shortName=AVISO_L4_DYN_TOPO_1DEG_1MO&format=iso
 2. http://www.ign.es/csw-inspire/srv/spa/xml_iso19139?id=9584
 3. http://www.geocatalogue.fr/getMetadata?format=XML&id=1785

ISO 19115 information is also embedded in raster data, as in the "GML
in JPEG2000" standard. Equivalent information is embedded in NetCDF
files and translated to the ISO 19115 model by tools like "ncISO" from
NOAA/NGDC. I saw that Tika has an
org.apache.tika.metadata.ClimateForcast interface, but it describes only
the information at the root of NetCDF files without describing the
variables included in those files (which would need a metadata tree
structure).
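
To show why the variables need a tree, here is a small Python sketch.
The Node class, the attribute names and the values are all hypothetical
placeholders; this is neither the NetCDF API nor Tika's, only a picture
of root attributes versus per-variable attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One element of a metadata tree: a name, its attributes, its children."""
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def find(self, name):
        """Return the first direct child with the given name, or None."""
        return next((c for c in self.children if c.name == name), None)


# Placeholder content: a file with one root attribute and one variable.
root = Node("file", {"title": "Example title"}, [
    Node("temperature", {"units": "K"}),
])

# A flat key/value view keeps only the root attributes,
# so the per-variable information is lost.
flat = dict(root.attributes)
```

The flat view is roughly what a root-only interface can expose; the
"units" of the variable are reachable only through the tree.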

So this email is for discussion only - not for immediate action.

    Regards,

        Martin

