On 14/10/15 20:15, Allison, Timothy B. wrote:
> On TIKA-1607, there are two (and a half) proposals:
> 1) move everything to DOM with helper classes for common elements
> 2) use POJOs as metadata values
> c) ;) keep current setup, perhaps add binary values, use DOM inputstreams for
> things that already have standards (e.g. Dublin core) This could be a
> transitional step to option 1 in Tika 2.0.
>
> If we went with 1 or c) we could embed ISO 19115, we could either embed the
> info within the DOM or add an ISO DOM stream that would include this
> information.

Thanks for explaining. But approach 1 or c suggests that different conceptual models (e.g. Dublin Core versus ISO 19115) would co-exist, regardless of the underlying data structure (DOM or something else), is that right? For example, if someone wants to get the title of a document, would they specify "I'm using the TITLE key from the Dublin Core model" or "I'm using the IDENTIFICATION_INFO/CITATION/TITLE key from the ISO 19115 model"? Or does Tika plan to propose its own "universal" model?

> (...snip...) However, once we move beyond Map<String, String[]> the
> user is going to have to have some knowledge of the metadata structure
> to extract information, whether that's POJO, DOM or Map<String, Node>.

Right, this is related to my question above. To avoid the need to know the metadata structure of a specific data format, Tika (in my understanding) currently maps some metadata to the Dublin Core model, which is used as a "universal" conceptual model. So anyone can ask for the title without knowing where the title is stored in various data formats. However, for some more advanced needs, the Dublin Core model is not enough and cannot easily be extended. A new conceptual model is needed. ISO 19115 is one such conceptual model that could be used as a replacement for Dublin Core, but there are also other conceptual models that are even more complex than ISO 19115. Are there any thoughts about what the compromise between simplicity and completeness would be in Tika 2?

> On your interest in ISO 19115, to echo Nick, what specifically do you need?
> What document formats do you see populating this information?

We do not need changes in the Tika model at this time, since Apache SIS has its own metadata engine (but targeting only geospatial data formats like NetCDF - no Word or PDF parsing - and using ISO 19115 as its "universal model" instead of Dublin Core).
But we have seen talks about geospatial metadata in Tika at recent ApacheCon events, and I was a little bit worried to see that some proposed solutions (i.e. new properties) were Tika-specific rather than based on international standards (note: I'm not suggesting using Apache SIS, only considering the international standard behind it).

So I'm not looking for a solution to a technical problem; I'm trying to learn more about the strategic direction that Tika wishes to take. Would Tika consider moving to a richer metadata model than Dublin Core? Would ISO 19115 be considered too geospatial-centric (which I could understand)? Would Tika support more than one "universal model" if it wants to preserve Dublin Core simplicity alongside the richness of other international standards?

About document formats populated with ISO 19115 metadata: standalone ISO 19115 files are provided by various data producers, for example 1) NASA, 2) the Spanish mapping agency or 3) all French government agencies:

1. http://podaac.jpl.nasa.gov/ws/metadata/dataset/?shortName=AVISO_L4_DYN_TOPO_1DEG_1MO&format=iso
2. http://www.ign.es/csw-inspire/srv/spa/xml_iso19139?id=9584
3. http://www.geocatalogue.fr/getMetadata?format=XML&id=1785

ISO 19115 information is also embedded in raster data, as in the "GML in JPEG2000" standard. Equivalent information is embedded in NetCDF files and translated to the ISO 19115 model by tools like "ncISO" from NOAA/NGDC. I saw that Tika has an org.apache.tika.metadata.ClimateForcast interface, but it describes only the information at the root of NetCDF files, without describing the variables included in those files (which would need a metadata tree structure).

So this email is for discussion only, not for immediate action.

Regards,

Martin
