Sorry for cross-posting!

Dear CIMI friends,

In agreement with all sides, CIDOC and ICS-FORTH have assisted the
CIMI-Harmony test with mappings of the provided data to the CIDOC CRM model.
We wish to express our particular gratitude for the support we received from
Harmony.

The ABC/Harmony CIMI Collaboration Project
(http://www.cimi.org/public_docs/Harmony_long_desc.html)
stated, among other things:

  "1.3 Goals & Objectives 

       *   make progress on understanding how to get effective
           interoperability between metadata vocabularies.
       *   provide an additional dimension to the testing of the
           CIDOC CRM

   1.4 Expected Outcomes
           ......
       *   demonstrable XML database
       *   identification of deficiencies in the CIDOC CRM"


The basic work was done by ICS-FORTH on a voluntary basis and with limited
resources of its own, so the results presented here can and will be further
improved. A technical report on the mapping method will be published in a
few days.

So far we have mapped the data samples from the National Museum of Denmark
(NMD), the Museum of Natural History London (Clayton Herbarium), and
Australian Museums On-Line (AMOL). We did not have the resources to address
the RLG example, but this will be done in the near future.

The semantics of all of these samples were completely covered by the CIDOC
CRM. There were, however, wide differences in complexity and in the degree
of automation that could be achieved. We comment below, in turn, on the
effort, tools, semantics and automation.

We have used a commercial tool for all transformations. The target files are
XML instances of the simplest DTD that correctly represents the CIDOC CRM
semantics and allows the creation of instances structurally equivalent to
correct RDF instances of a full RDFS version of the CIDOC CRM. The target
files can be read naturally using an XSL file that makes the properties
visible, and are available online at
http://cidoc.ics.forth.gr/data_transformations.html.

The transformation was done by Iraklis Karvasonis, a graduate student in
computer science with no museum background, assisted by me, Martin Doerr, on
the data field semantics. For each example, about 2 full days were needed to
identify the sample schema-to-CRM mappings, and about a week to implement and
test the mappings. The first step, which is straightforward, wraps the whole
sample in an XML instance; it takes longer for deeply nested tables. The
second step is the semantic mapping from XML to XML.
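
As a rough illustration of these two steps (this is not the commercial tool
that was used, and the field names, element names, class names and property
names below are all invented placeholders rather than the actual DTD or CRM
definitions), a minimal Python sketch might look like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical source row; field names and values are invented for illustration.
ROWS = [{"object_id": "X-1", "maker": "Example Maker", "made_year": "1890"}]


def wrap_rows(rows):
    """Step 1: wrap the table as-is in XML, one element per field."""
    root = ET.Element("sample")
    for row in rows:
        record = ET.SubElement(root, "record")
        for field, value in row.items():
            ET.SubElement(record, field).text = value
    return root


def add_property(subject, name, target_class):
    """Attach a named property to `subject` and return the new target entity."""
    prop = ET.SubElement(subject, "property", {"name": name})
    return ET.SubElement(prop, "entity", {"class": target_class})


def map_to_target(source):
    """Step 2: XML-to-XML semantic mapping (class/property names are placeholders)."""
    target = ET.Element("instances")
    for record in source.findall("record"):
        obj = ET.SubElement(target, "entity", {"class": "Object"})
        obj.set("id", record.findtext("object_id"))
        # Maker and year describe the production event, not the object directly.
        event = add_property(obj, "was produced by", "Production")
        actor = add_property(event, "carried out by", "Actor")
        actor.text = record.findtext("maker")
        span = add_property(event, "has time-span", "Time-Span")
        span.text = record.findtext("made_year")
    return target


wrapped = wrap_rows(ROWS)
mapped = map_to_target(wrapped)
print(ET.tostring(mapped, encoding="unicode"))
```

The point of the sketch is only the separation of concerns: the wrapping step
knows nothing about semantics, and all interpretation of the source fields is
concentrated in the second step.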

A month was spent studying the tool, as well as a longer time experimenting
with different representations and identifying semantic errors that could
have been avoided with a domain expert on site. Given that effort, the NMD
and Natural History London samples can be transformed without manual
intervention. Not a single line of programming was done.

About 95% of the 2 days spent on the mapping itself went into understanding
the source schema semantics, from the interpretation of names, data examples
and rare comments. This has nothing to do with the CRM itself, except for the
fact that it is precise in its meanings, as required for effective
information integration.

In detail:

The NMD data are analytical in the detail necessary to allow completely
automatic transformation. Two default assumptions not obvious from the data
could be clarified with the creator and made explicit in the CRM. As the NMD
database uses dynamic types for events, a full mapping of the NMD event types
to CRM classes could have improved the mapping, but was not done because of
resource constraints. For an Internet presentation the data could be
compressed even further if the internal NMD identifiers were omitted. These
are not required by the CRM, but were left in to show the level of detail the
CRM is capable of capturing. Individual lengthy identifiers such as
"NMD System ID: 750" and "2297 Actor" were chosen to show where global
identifiers could be used to facilitate information integration in very
large repositories. This could have been done in a more consistent manner.

The resulting CRM data are more compact than a relational form and exhibit
more explanatory schema semantics.

The Clayton Herbarium sample is as analytical as the NMD one, even though it
is encoded in one "flat" table. This means that parallel fields must be
interpreted as dependent data paths, which is more complex but not
particularly difficult. Even though the table is not in any "normal form"
(e.g. the same fields are assigned once again for a second event), it can be
mapped without any difficulty. The logic behind it is fairly complex
reasoning about classification, which is fully foreseen in the CRM. We did
not have any support in its interpretation, so some potential errors are due
not to the CRM but to us, and must be corrected in the future with the
experts. One link NOT present in the CRM, and also NOT present in the Clayton
schema, but implicit in the data, may be useful to have in the CRM: that the
specimen is the PROTOTYPE for the creation of a species or genus.
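
To make the idea of "parallel fields as dependent data paths" concrete, here
is a small Python sketch. The record and its field names (det1_name, det2_by
and so on) are hypothetical and are not taken from the Clayton schema; the
sketch only shows how repeated field groups in a flat record can be regrouped
into one set of values per event:

```python
import re
from collections import defaultdict

# Hypothetical flat record: the same group of fields is repeated for a second event.
FLAT = {
    "specimen_id": "S-1",
    "det1_name": "First Name",
    "det1_by": "Person A",
    "det2_name": "Second Name",
    "det2_by": "Person B",
}

# Assumed naming pattern for the repeated groups: det<N>_<role>.
PARALLEL = re.compile(r"^det(\d+)_(\w+)$")


def group_events(record):
    """Split a flat record into specimen-level fields and one dict per event."""
    specimen = {}
    events = defaultdict(dict)
    for field, value in record.items():
        match = PARALLEL.match(field)
        if match:
            index, role = match.groups()
            events[int(index)][role] = value
        else:
            specimen[field] = value
    return specimen, [events[i] for i in sorted(events)]


specimen, events = group_events(FLAT)
print(specimen)
print(events)
```

Each resulting event dictionary can then be mapped as its own path starting
from the specimen, instead of as unrelated columns of one row.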

The AMOL data present a difficulty of a different kind: fields with weak
semantics like "description", "statement" and "made note". These seem to
function largely as formatting devices, in the tradition of museum catalogs,
but cannot be used to interpret semantics. We could still have done a good
job if separators had been used in a disciplined way. As the data are now,
automatic interpretation needs background knowledge: place name, person name,
organisation name, materials and object type authorities, heuristics and
possibly natural language interpretation. With these means, a fairly complete
job could still be done automatically. We did not have the resources, and
have created two examples: the first is a mapping of all uninterpretable
texts to a CRM "has note" property, the second a complete interpretation by
hand. The latter shows that the meaning is completely captured, except
perhaps for the "subject" field, which seems to be a heterogeneous notion
from the library world not contained in the CRM. (Heterogeneous meaning that
its interpretation with respect to the object changes depending on the
object category.)
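
The first of the two AMOL examples, the fallback to a "has note" property,
can be sketched in a few lines of Python. The three weak field names are the
ones quoted above; treating them as the complete list, passing all other
fields through unchanged, and the sample record itself are assumptions made
only for this illustration:

```python
# Field names quoted in the text; treating them as the full set is an assumption.
WEAK_FIELDS = {"description", "statement", "made note"}


def map_record(record):
    """Return (property, value) pairs; weak free-text fields become "has note"."""
    pairs = []
    for field, value in record.items():
        if not value:
            continue  # skip empty fields
        if field in WEAK_FIELDS:
            # Keep the source field name in the note so no information is lost.
            pairs.append(("has note", f"{field}: {value}"))
        else:
            # Placeholder: a real mapping would translate this field to a CRM path.
            pairs.append((field, value))
    return pairs


# Hypothetical record, invented for illustration.
record = {"title": "Example object", "description": "Free text.", "made note": ""}
pairs = map_record(record)
print(pairs)
```

This loses no text but also interprets nothing, which is why the second
example was interpreted completely by hand.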

In summary, we could demonstrate with this test that the CIDOC CRM captures
the domain of museum data adequately and effectively, minor improvements
notwithstanding, which will be taken into account in the CRM standardisation
process. Adequate meaning that the CIDOC CRM provides comparable or higher
expressive power than the source schemata. Effective meaning that the size of
the produced raw data is comparable to the source and there is no loss of
meaning in the transformation.

The complexity of mapping is typically due to the intrinsic complexity of
interpreting cultural data sources, and is by no means introduced by the
CRM.

With the AMOL data, it could be shown that the CIDOC CRM can be useful for
designing and introducing a moderate structuring that facilitates semantic
interpretation and is easily comprehensible to end-user documentalists.

The Clayton data show that this structuring need in no way be as complex and
deep as the CRM, nor does the end user need to fully understand the CRM. All
data samples show that the CRM instances are comprehensible, even though the
presented form was NOT designed for presentation, but to render an
understanding of the machine-interpretable RAW DATA themselves.

CRM instances are data ready for automatic integration, provided that
persons etc. can be sufficiently identified globally. This again is a general
problem of the process of integration and not of the CRM.

The test shows that a non-domain expert with ordinary knowledge of handling
IT tools can execute the transformation with affordably brief advice from a
domain expert who is also knowledgeable about the CRM.

This advice is needed once per database, and not per data item, if the data
are sufficiently structured. This intellectual investment cannot be avoided
in any intelligent data integration that tries to preserve and respect the
intellectual qualities of our cultural heritage information. As this
investment comes once per schema, its cost is small compared to the cost of
designing and implementing the source data structure itself. This is
precisely the reason to have an International Standard: one such mapping
should be sufficient to solve global semantic interoperability.

Deficiencies of the CIDOC CRM could not be identified. We collaborate closely
with Harmony on the harmonization with the ABC model, which seems to have
strengths in areas so far not addressed by the CRM, such as performing arts,
copyright issues, evolution of electronic documents and others.
(See: http://cidoc.ics.forth.gr/crmgroup_activities.html,
   "Working Group on Ontology Harmonization:"
  and http://cidoc.ics.forth.gr/docs/rome_full_rep_v2.doc).

We kindly invite everybody to provide us with any kind of feedback that could
be useful for our work and for achieving the best standard for all of us.

Best regards,


Martin Doerr

Chair, 
CIDOC CRM Special Interest Group.
http://cidoc.ics.forth.gr/index.html

-- 

--------------------------------------------------------------
 Dr. Martin Doerr              |  Vox:+30(81)391625          |
 Senior Researcher             |  Fax:+30(81)391609          |
 Project Leader SIS            |  Email: [email protected] |
                                                             |
               Centre for Cultural Informatics               |
               Information Systems Laboratory                |
                Institute of Computer Science                |
   Foundation for Research and Technology - Hellas (FORTH)   |
                                                             |
 Vassilika Vouton,P.O.Box1385,GR71110 Heraklion,Crete,Greece |
                                                             |
         Web-site: http://www.ics.forth.gr/proj/isst         |
--------------------------------------------------------------
