The problem of identifier creation and co-reference resolution needs much more
flexible thinking than we are used to. Obviously, there is no one recipe.
But the problem is not so bad:

If I am sure I am talking about Picasso, I see no point in swamping the world
with my local IDs.

If I talk about Picasso, I am sure there are millions of references.

If I talk about Picasso, it is good not to create new URNs.

If I talk about John Smith's marriage registered in Small Village county,
it is good to use a local ID.

There will hardly be a central resource.

If I talk about John Smith's marriage registered in Small Village county,
there will be very few sources pointing to it.

So, in both cases, we achieve the goal: Very few URNs per person.

Think flexibly, design adaptive algorithms. If you don't find your person X
in a global resource, create a local ID, possibly with a completely different
algorithm.

For instance, if I have a document saying:

"He, who set the Piraeus Bank of Heraklion on fire in 2008, was
never identified"

I had better use a different identifier generation algorithm from the one
I would use for "Smith, John George, (1901-1917)".
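
The adaptive strategy described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `GLOBAL_AUTHORITY` table, the `urn:local:` prefix, the `mydb` source name and all function names are hypothetical, and a real system would query a service such as VIAF or ULAN rather than a dictionary.

```python
import hashlib
import re

# Hypothetical stand-in for a global authority file (e.g. VIAF or ULAN).
# The DNB number is the one quoted elsewhere in this thread.
GLOBAL_AUTHORITY = {
    ("picasso, pablo", "1881-1973"): "DNB|118594206",
}


def slug(text):
    """Lower-case the text and replace runs of non-alphanumerics with '.'."""
    return re.sub(r"[^a-z0-9]+", ".", text.lower()).strip(".")


def identify(name=None, dates=None, description=None, source="mydb"):
    """Return a global ID if one is known, otherwise mint a local one."""
    if name and dates:
        key = (name.lower(), dates)
        if key in GLOBAL_AUTHORITY:
            # Well-known person: reuse the central ID, create nothing new.
            return GLOBAL_AUTHORITY[key]
        # Named but not centrally known: local ID from name and dates.
        return f"urn:local:{source}:{slug(name)}/{dates}"
    if description:
        # Unnamed actor: derive a stable local ID from the describing text.
        digest = hashlib.sha1(description.encode("utf-8")).hexdigest()[:12]
        return f"urn:local:{source}:anon/{digest}"
    raise ValueError("need either name and dates, or a description")


picasso = identify("Picasso, Pablo", "1881-1973")
smith = identify("Smith, John George", "1901-1917")
arsonist = identify(description="set the Piraeus Bank of Heraklion on fire in 2008")
print(picasso, smith, arsonist, sep="\n")
```

The point of the sketch is only that the choice of algorithm depends on what is known about the person: a central ID where one exists, a name-and-dates form where it does not, and a description-based form for actors without a name.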

How many more cases do we have to discuss? I think this is already quite
exhaustive.

Would that be a topic for the coreference working group?

Best,

Martin

Maximilian Schich wrote:
I think you are right, referring to a more 'central' ID is always preferable. The date-range thing will probably work for Picasso, but I am not sure if it would allow for a disambiguation of all the Lees, Kims, Singhs and Smiths.

In general I think it would be nice if there were a local ID for every person/instance, accompanied by a 'central' ID wherever possible. This would allow for a much better discussion of the provided data. Otherwise, i.e. with 'central' IDs only, it would not be possible to say, e.g., "I don't believe your person (local ID x) is identical with Pablo Picasso ('central' ID y)". In other words, all parties should publish their existing local IDs, i.e. their database record numbers, in addition to the 'central' IDs, which allow for better normalization.

Basically I just don't like hidden information, which could be explicit.
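
One way to keep the local ID explicit next to the 'central' ID is to publish the link between them as a separate statement that others can dispute. The following is a minimal sketch, not a proposal for a concrete format; the class names, the record number `museum-a:rec/4711` and the placeholder `ULAN:y` are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SameAsClaim:
    """A statement by some party that a local ID denotes a central ID."""
    local_id: str
    central_id: str
    asserted_by: str
    disputed_by: List[str] = field(default_factory=list)


@dataclass
class PersonRecord:
    local_id: str                      # always published
    label: str
    central_id: Optional[str] = None   # added wherever possible


def publish(record, publisher):
    """Return the explicit co-reference claim for a record, if it has one."""
    if record.central_id is None:
        return None
    return SameAsClaim(record.local_id, record.central_id, publisher)


record = PersonRecord("museum-a:rec/4711", "P. Picasso", "ULAN:y")
claim = publish(record, "museum-a")
# Someone else can now disagree with exactly this identification.
claim.disputed_by.append("reviewer-b")
print(claim)
```

With 'central' IDs only, there would be no `local_id` to attach the disagreement to.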

Best wishes,
max.

Dr. des. Maximilian Schich M.A.
adr.: Westendstrasse 80 | D-80339 München | Germany
tel.: +49-179-6678041 | skype: maximilian.schich
mail: [email protected] | home: www.schich.info

CONFIDENTIALITY NOTICE: This e-mail message including attachments, if
any, is intended only for the person or entity to which it is addressed
and may contain confidential and/or privileged material. Any
unauthorized review, use, disclosure or distribution is prohibited. If
you are not the intended recipient, please contact the sender by reply
e-mail and destroy all copies of the original message. Thank you.


On 16.12.2008 20:36 Uhr, martin wrote:
I agree. The point is very simple:

There will be a long tail of URNs anyhow. If every local database creates its
own identifier, the list will be much, much longer.

For guys like Picasso,
referring either to VIAF or to ULAN would currently be a very sensible choice
(viaf.org: "Picasso, Pablo, ‡d 1881-1973" or "DNB|118594206").
The likelihood that others refer to one of the two would be very high. That makes the world very small. Alternatively, we could create a normalized access point "Picasso, Pablo (1881-1973)",
such as: urn:crm_actor:aacr2:picasso.pablo/1881-1973

Do you like it?

I don't know how many people have exactly the same names and the same birth and death dates.
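
A rough sketch of how such a normalized URN could be generated, and how records sharing the same name and dates could be detected, is given below. The normalization rules (accent stripping, lower-casing, dots between words) are assumptions read off the single example above, the function names are hypothetical, and the two John Smith records are invented test data.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(part):
    """Strip accents, lower-case, and join words with '.'."""
    decomposed = unicodedata.normalize("NFKD", part)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", ".", ascii_text.lower()).strip(".")


def actor_urn(surname, forename, born, died):
    """Build a URN of the shape urn:crm_actor:aacr2:surname.forename/born-died."""
    name = f"{normalize(surname)}.{normalize(forename)}"
    return f"urn:crm_actor:aacr2:{name}/{born}-{died}"


def find_collisions(people):
    """Group records by URN and return only the URNs shared by several records."""
    by_urn = defaultdict(list)
    for record_id, surname, forename, born, died in people:
        by_urn[actor_urn(surname, forename, born, died)].append(record_id)
    return {urn: ids for urn, ids in by_urn.items() if len(ids) > 1}


people = [
    ("r1", "Picasso", "Pablo", 1881, 1973),
    ("r2", "Smith", "John", 1901, 1917),
    ("r3", "Smith", "John", 1901, 1917),  # invented namesake with the same dates
]
picasso_urn = actor_urn("Picasso", "Pablo", 1881, 1973)
collisions = find_collisions(people)
print(picasso_urn)
print(collisions)
```

A shared URN does not say whether the records describe one person or two namesakes; it only marks the cases that need a closer look.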

Best,

Martin

Maximilian Schich wrote:
(posted in this thread for continuity - also relevant for URI policies)

Dear All,

I agree with Martin: There should be a URN or something equivalent for Picasso in ULAN.

However, we should not underestimate the long tail phenomenon:

    * There will be loads of URNs for some single guys (like Picasso).
      Indeed the co-reference of all those Picasso-Identifiers will be
      hard to resolve. (I would bet there will not only be a long tail
      of URN frequency, i.e. how many URNs a Person has, but even a long
      tail of normalization, i.e. in the distribution how often specific
      URNs are used for a person).
    * On the other hand there will be a huge load of people in the long
      tail without any URN in norm-data sources like ULAN (think of 'the
      guy who did the non-art sculpture in my schoolyard' or 'the guy who
      paints sheep from Naples, but isn't the guy who paints sheep from
      Naples').

As far as we know, there is no way to avoid the long tail!

As a consequence, everybody has (to be able) to generate unique identifiers.

Kind regards, max.


On 16.12.2008 13:23 Uhr, martin wrote:
Dear All,

In my opinion, Pablo Picasso should be represented by a URN. I'd expect a proposal from the Getty on how to write URNs for persons identified in ULAN. See the discussion about URNs.

Best,

Martin

Maximilian Schich wrote:
I think we should encourage the owners of databases to use their existing 'database record numbers' in conjunction with an identifier for their institution as IDs for every conceivable instance.
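
As a minimal sketch of this idea: the `institution:table/number` layout and the institution names below are made up for illustration and are not an agreed convention.

```python
def local_id(institution, table, record_number):
    """Combine an institution identifier with an existing record number."""
    for part in (institution, table):
        # Keep the separators unambiguous so the ID can be split again.
        if not part or ":" in part or "/" in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return f"{institution}:{table}/{record_number}"


# Two institutions may reuse the same record number without clashing.
id_a = local_id("museum-a", "person", 1495)
id_b = local_id("archive-b", "person", 1495)
print(id_a, id_b)
```

No new numbering scheme is needed; the record numbers already exist, and the institution prefix makes them unique across databases.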

Of course for 'Pablo Picasso' we would have a number of IDs:
an AKL number, a ULAN number, an ID from his city's birth registry, a record number in every private database, and probably an ID in the future all-encompassing database (e.g. http://en.wikipedia.org/w/index.php?oldid=257931703 for http://en.wikipedia.org/wiki/Pablo_Picasso ).

The string 'Pablo Picasso' is one of the worst IDs, as there might be multiple language versions and different name formats. For example, in the ISI Web of Science the (ambiguous) ID would be 'P Picasso'; many people simply call him 'Picasso'; and his birth name is 'Pablo Diego José Francisco de Paula Juan Nepomuceno María de los Remedios Cipriano de la Santísima Trinidad Martyr Patricio Clito Ruíz y Picasso' - (not a joke!).

How to normalize the IDs is another question. As real data usually comes in long tails, norm data is of limited help.

Best wishes, max.


On 15.12.2008 16:20 Uhr, Vadim Soshkin wrote:
I agree with the approach of moving English terms from class and property identifiers to rdfs:label. Why should users' instance identifiers be different? What identifier are you proposing for 'Pablo Picasso'?

Best regards

Vadim
    -----Original Message-----
    From: [email protected] [mailto:[email protected]] On Behalf Of Maximilian Schich
    Sent: Saturday, December 13, 2008 6:05 AM
    To: [email protected]
    Cc: 'crm-sig'
    Subject: Re: [Crm-sig] RDFS class identifiers

"I want the version that has the class (E) or property (P) number plus the text in the label and just the class (E) or property (P) number in the ID."

    Me too! This clarifies that the node with the ID 'E21' indeed
    represents a CIDOC-CRM concept like 'E21_Person' and not the word
    'Person'. However, we should clarify to the users that they
    should not use a similar strategy in their rdf instances: The
    person 'Pablo Picasso' should not have an ID like '1495r3' and a
    label/appellation like '1495r3_Pablo_Picasso'. This seems logical
    from our point of view, but users may be tempted to do so.

    Can't we leave out * and #...?

    Kind regards,
    max.


    On 13.12.2008 8:32 Uhr, Stephen Stead wrote:
I want the version that has the class (E) or property (P) number plus the text in the label and just the class (E) or property (P) number in the ID.
    Rgds
    SdS

    Stephen Stead
    Tel +44 20 8668 3075     Mob +44 7802 755 013
    E-mail [email protected]


    -----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Vladimir Ivanov
    Sent: 13 December 2008 07:15
    To: martin
    Cc: crm-sig
    Subject: Re: [Crm-sig] RDFS class identifiers

    Dear all,

    I agree with Nick.
    This approach realises the statement that
    CRM is not about (Entity and Property) names
    but about (common, language independent) concepts.

    It also helps to manage multilingual versions of the CRM when
    we have EXX in scope notes and can extend it with the "full name"
    in a certain language.

    Example:

<rdfs:Class rdf:ID="E21_">
<rdfs:label xml:lang="en">Person</rdfs:label>
<rdfs:comment xml:lang="en">[English text]... E21_ [English
    text]...</rdfs:comment>
    ...
<rdfs:label xml:lang="ru">Личность</rdfs:label>
<rdfs:comment xml:lang="ru">[Russian text]... E21_ [Russian
    text]...</rdfs:comment>
</rdfs:Class>
    ----------------

    But natural language descriptions with codes and names are simpler
    than descriptions with codes only!
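
To illustrate how a consumer could combine the bare code with a label in the language it needs, here is a small sketch using Python's standard XML parser. The embedded document is a cut-down, hypothetical variant of the example above (with the ID written as `E21` and only two languages), and the helper names are invented.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical schema fragment with language-tagged labels.
DOC = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}">
  <rdfs:Class rdf:ID="E21">
    <rdfs:label xml:lang="en">Person</rdfs:label>
    <rdfs:label xml:lang="fr">Personne</rdfs:label>
  </rdfs:Class>
</rdf:RDF>"""


def labels_by_language(xml_text):
    """Map each class code to a {language: label} dictionary."""
    result = {}
    root = ET.fromstring(xml_text)
    for cls in root.findall(f"{{{RDFS}}}Class"):
        code = cls.get(f"{{{RDF}}}ID")
        result[code] = {
            label.get(XML_LANG): label.text
            for label in cls.findall(f"{{{RDFS}}}label")
        }
    return result


def display_name(labels, code, lang):
    """Combine code and label, e.g. 'E21 Person'; fall back to the code alone."""
    label = labels.get(code, {}).get(lang)
    return f"{code} {label}" if label else code


labels = labels_by_language(DOC)
print(display_name(labels, "E21", "fr"))
print(display_name(labels, "E21", "ru"))
```

The identifier stays the same in every language; only the label used for display changes, and a missing translation degrades to the code alone.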

    Dear Martin,
    I am afraid that "stars" (or any other symbol) in
    xml attributes may lead to some problems:

    1. <rdfs:label xml:lang="*en*">
    Some systems do not recognize *en* as English (en).

    2. <rdfs:subClassOf rdf:resource="*#E21*" />
    and <rdfs:Class rdf:ID="*E21*">
    refer to different entities.

Maybe, we should write <rdfs:subClassOf rdf:resource="#*E21*" /> ?
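
The second problem can be reproduced with ordinary URI resolution. In this sketch the base URI `http://example.org/crm` is a made-up placeholder, and the two helper functions only mimic how rdf:ID and rdf:resource values are turned into full URIs; a real RDF parser would probably reject `*E21*` as an rdf:ID in the first place, since the value has to be an XML name.

```python
from urllib.parse import urljoin

# Hypothetical base URI of the schema document.
BASE = "http://example.org/crm"


def uri_for_id(rdf_id):
    """rdf:ID="x" names the resource <base#x>."""
    return f"{BASE}#{rdf_id}"


def uri_for_resource(reference):
    """rdf:resource holds a URI reference that is resolved against the base."""
    return urljoin(BASE, reference)


declared = uri_for_id("*E21*")
star_outside = uri_for_resource("*#E21*")  # star before the '#'
star_inside = uri_for_resource("#*E21*")   # star after the '#'
print(declared, star_outside, star_inside, sep="\n")
```

With the star before the `#`, the reference is read as a relative path `*` plus a fragment `E21*`, so it no longer points at the declared class. With the star after the `#`, the two URIs coincide again.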

    Best regards,
    Vladimir

    2008/12/12 martin <[email protected]>:
    Dear Nick,

    I support this proposal as an issue.

    I'd prefer however this form:

<rdfs:Class rdf:ID="*E21*">
     * * <rdfs:label xml:lang="*en*">*E21 Person*</rdfs:label>
     * * <rdfs:label xml:lang="*fr*">*E21 Personne*</rdfs:label>
     * * <rdfs:label xml:lang="*el*">*E21 Πρόσωπο*</rdfs:label>
     * * <rdfs:subClassOf rdf:resource="*#E20*" />
     * * <rdfs:subClassOf rdf:resource="*#E39*" />
</rdfs:Class>

    Opinions?

    Best,

    Martin

    Nicholas Crofts wrote:
    Dear all,

    I've been doing some work recently using the CRM rdfs.
    http://cidoc.ics.forth.gr/rdfs/cidoc_v4.2.rdfs

The naming convention adopted for the class and property identifiers
    strikes me as inconvenient in some respects.
Currently, the names used for the class and property identifiers contain
    both the CRM code and the English label.

    1. If the labels get changed at any time in the future, the identifiers
       are broken.
    2. Non-English speakers are put at a disadvantage.
    3. The rdf syntax is more verbose than necessary ... this may sound
       trivial but that overhead can be huge when migrating large datasets.
    4. The names have been mangled with underscores to make them respect
       xml/rdf syntax.

I would suggest using just the codes (i.e. E1, P2, etc.) as class identifiers and including the names (in various languages) as rdfs:label values.

    The result would look something like this:

<rdfs:Class rdf:ID="*E21*">
    * * <rdfs:label xml:lang="*en*">*Person*</rdfs:label>
    * * <rdfs:label xml:lang="*fr*">*Personne*</rdfs:label>
    * * <rdfs:label xml:lang="*el*">*Πρόσωπο*</rdfs:label>
    * * <rdfs:subClassOf rdf:resource="*#E20*" />
    * * <rdfs:subClassOf rdf:resource="*#E39*" />
</rdfs:Class>

    Rather than this:


<rdfs:Class rdf:ID="*E21.Person*">
* * <rdfs:subClassOf rdf:resource="*#E20.Biological_Object*" />
    * * <rdfs:subClassOf rdf:resource="*#E39.Actor*" />
</rdfs:Class>

    (NB I've removed the rdfs:comments for clarity)
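
Moving from the current identifiers to the proposed ones is mostly a matter of splitting each name at the first dot. The following is a small sketch of that step, assuming identifiers of the `code.English_name` shape shown above; the regular expression and function name are illustrative, not part of any CRM tooling.

```python
import re

# Current style: CRM code, a dot, then the English name with underscores.
IDENTIFIER = re.compile(r"^([EP]\d+[A-Za-z]?)\.(.+)$")


def split_identifier(identifier):
    """Split 'E39.Actor' into the code 'E39' and the label 'Actor'."""
    match = IDENTIFIER.match(identifier)
    if match is None:
        raise ValueError(f"not a code.label identifier: {identifier!r}")
    code, name = match.groups()
    # Undo the underscore mangling so the label reads as plain text.
    return code, name.replace("_", " ")


for old in ("E21.Person", "E20.Biological_Object", "E39.Actor"):
    code, label = split_identifier(old)
    print(f'rdf:ID="{code}"  label: "{label}"')
```

The code part becomes the stable identifier, and the name part becomes the English label, which can then change without breaking references.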

It would be nice, of course, to be able to have both forms and define
    equivalence relationships between them.
This could perhaps be done with the rdfs:isDefinedBy property, but I'm
    not sure that it's meant for this.

    Best wishes

    Nick Crofts




------------------------------------------------------------------------

    _______________________________________________
    Crm-sig mailing list
    [email protected]
    http://lists.ics.forth.gr/mailman/listinfo/crm-sig
    --

    --------------------------------------------------------------
     Dr. Martin Doerr              |  Vox:+30(2810)391625        |
     Principal Researcher          |  Fax:+30(2810)391638        |
                                   |  Email: [email protected] |
                                                                 |
                   Center for Cultural Informatics               |
                   Information Systems Laboratory                |
                    Institute of Computer Science                |
       Foundation for Research and Technology - Hellas (FORTH)   |
                                                                 |
     Vassilika Vouton,P.O.Box1385,GR71110 Heraklion,Crete,Greece |
                                                                 |
             Web-site: http://www.ics.forth.gr/isl               |
    --------------------------------------------------------------
