Re: [Dbpedia-discussion] Concept Identifiers

Markus Kroetzsch Wed, 01 Jun 2016 01:38:46 -0700

Hi Tom,

My impression is that almost all IDs in almost all datasets are opaque
for reasons that have little to do with language-neutrality concerns
(though I guess this has been a relevant point in some efforts, especially
when French and English people collaborate ;-).


Named IDs work best on closed domains where names/labels change very 
rarely. Names of places are maybe the best example, since you can use 
their relative location to make them unique, e.g., the dmoz id for 
Dresden is:

"Regional/Europe/Germany/States/Saxony/Localities/Dresden/"

This is a domain where named IDs work really well. Several scientific 
IDs also are great in this respect.

Other domains are not so easy to handle, but named IDs still work fairly
well there. Humans are a good example: they change their names relatively
rarely, but they cannot be identified by the name alone. You need to add 
something else to achieve uniqueness of the ID. For example, Tim 
Berners-Lee's ID as a Fellow of the Royal Society is 
"timothy-berners-lee-11074". So this would be a mixed approach.
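
The mixed approach can be sketched in a few lines of Python. This is only
an illustration: the function name mixed_id and the slug rules are
assumptions of mine, not how the Royal Society actually builds its IDs.

```python
import re
import unicodedata


def mixed_id(name: str, number: int) -> str:
    """Combine a readable name slug with a number that guarantees uniqueness."""
    # Decompose accented characters and drop the non-ASCII remainder.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase, then collapse every run of other characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
    return f"{slug}-{number}"


print(mixed_id("Timothy Berners-Lee", 11074))
```

The name part keeps the ID readable, while the numeric suffix is what
actually keeps two people with the same name apart.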

For other domains, named IDs are doable but not nice (not even for 
humans). Movies and books are typical examples of areas where labels are 
quite complicated to work with (they are not unique, they can be very 
long, and they can have all kinds of weird symbols and markup). This is 
probably why the vast majority of databases and catalogues in such areas 
are using opaque, numeric IDs.

And then, finally, there is the big class of cross-domain data where 
labels can have complicated forms, clash often, and change all the time.
Wikipedia, Freebase, and Wikidata are dealing with this. Here it is 
rather tricky to maintain stable named IDs, and indeed Wikipedia does 
not work very well as an ID provider. Reuse of the same IDs for a 
variety of things is the main problem here.

I like the approach you had in Freebase, using opaque IDs for stability 
but supporting additional (possibly domain-specific) IDs to ease use. 
Some databases already have something similar, e.g., TED has two speaker 
IDs for Tim Berners-Lee:

http://www.ted.com/speakers/338
http://www.ted.com/speakers/tim_berners_lee
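
A minimal sketch of this idea, assuming a simple in-memory table: every
key, opaque or readable, resolves to the same opaque ID. The class
IdRegistry and its methods are hypothetical names for illustration and do
not describe how TED or Freebase implement it.

```python
class IdRegistry:
    """Maps any number of alias keys onto one stable opaque ID."""

    def __init__(self):
        self._opaque_by_key = {}

    def register(self, opaque_id, *aliases):
        keys = (opaque_id, *aliases)
        # Check all keys first so a clash leaves the table unchanged.
        for key in keys:
            owner = self._opaque_by_key.get(key, opaque_id)
            if owner != opaque_id:
                raise ValueError(f"{key!r} already identifies {owner!r}")
        # The opaque ID resolves to itself, and every alias resolves to it.
        for key in keys:
            self._opaque_by_key[key] = opaque_id

    def resolve(self, key):
        # Raises KeyError for keys that were never registered.
        return self._opaque_by_key[key]


registry = IdRegistry()
registry.register("338", "tim_berners_lee")
print(registry.resolve("tim_berners_lee"))
print(registry.resolve("338"))
```

Readable aliases can be added or retired later without touching the
opaque ID that other datasets link to.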

If I were to implement a system like the one Katie might have in mind, I
would still use the opaque IDs and then display nice labels to the user
within my tool. A properly formatted label is always better than the 
best of URIs when you want to show it to a user ("ted:tim_berners_lee" 
vs. "Tim Berners-Lee"). Raw RDF tools cannot be expected to provide this 
level of (end-)user friendliness, but many of the higher-level tools I 
am seeing today are working with labels (in multiple languages) without 
big problems.
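
As a rough sketch of what such a tool does internally, assuming labels
are stored per opaque ID and per language: the table LABELS, the ID
"x-dresden", and the function display_label are made up for this example.

```python
# Hypothetical label store: opaque ID -> {language code -> label}.
LABELS = {
    "338": {"en": "Tim Berners-Lee"},
    "x-dresden": {"de": "Dresden", "cs": "Drážďany"},
}


def display_label(entity_id, preferred_languages=("en",)):
    """Return the best label for the user, falling back to the raw ID."""
    labels = LABELS.get(entity_id, {})
    for lang in preferred_languages:
        if lang in labels:
            return labels[lang]
    # No preferred language matched: use any label, else the opaque ID.
    return next(iter(labels.values()), entity_id)


print(display_label("338"))
print(display_label("x-dresden", ("cs", "en")))
```

The opaque ID never reaches the user unless no label exists at all.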

Best regards,

Markus


On 31.05.2016 18:12, Tom Morris wrote:
> Hi Katie. I don't think there are universally agreed best practices in
> this space and people often have strongly held views on either side.
> You don't mention internationalization/localization which is, in my
> experience, a bigger concern for folks than semantic drift. Those who
> believe in numeric identifiers often think that using identifiers in a
> given natural language provides that language an undeserved pride of
> place and priority over other languages. Folks in this camp include the
> creators of CIDOC and there are people dismayed by BibFrame's
> abandonment of MARC-style numbers.
>
> From a practical point of view, numeric identifiers, while perfectly
> sensible in the abstract, suffer from the weak tools that we have, so
> end up disadvantaging everyone equally, but everyone more than English
> identifiers probably would.
>
> Your note implies that concept URIs could change over time if they had
> natural language words as part of the URI.  I don't think this would be
> a good practice. If UAT:Black now means "orange," I think you need to
> either live with UAT:Black as the URI, mint a synonym UAT:Orange (and
> keep UAT:Black), or deprecate UAT:Black as a valid concept and create a
> new concept UAT:Orange. Which course of action is most appropriate will
> depend on the specific circumstances of a change. If you decide there's
> a new concept UAT:DarkGrey, that is split off from UAT:Black, perhaps
> the original can exist unchanged, but if you decide that there's really
> no such thing as "black" but just UAT:DarkGrey and UAT:DarkestGrey, then
> perhaps UAT:Black gets deprecated and removed. Changing the pieces of
> URI to UAT101, UAT102, UAT301, etc doesn't really affect most of the
> discussion. The only case it makes easier is avoiding UAT:Black having a
> description of "vibrant orange," if the concept drifts far enough from
> its original label (which is embedded in the URI).
>
> Since Dimitris mentioned Freebase, briefly what they did was initially
> mint English language URIs based on the label of the topic, but
> eventually abandoned the practice because it was too difficult to do
> automatically and added too little value. They did keep English
> identifiers for types & properties which were part of the scheme, but
> these were hand assigned and provided a useful organizing function to
> group properties with the associated type, types with their domain, etc.
> A powerful feature of the Freebase setup was that a single topic could
> have arbitrarily many URIs, so dereferencing /en/Boston,
> /authority/viaf/1234, /authority/loc/lcnam/nm1234,
> /wikipedia/en_title/Boston (city), etc could all fetch the same
> content (without the use of redirects). The core identifiers for
> non-schema topics were machine generated sequential IDs encoded with a
> compact base 37(?) encoding, e.g.  /m/0d_23
>
> Tom
>
> p.s. I'm a couple of blocks away if you want to chat about this stuff
> some time.
>
> On Thu, May 26, 2016 at 2:43 PM, Katie Frey <kf...@cfa.harvard.edu
> <mailto:kf...@cfa.harvard.edu>> wrote:
>
>     Hello,
>
>     How are concept IDs handled for DBpedia?  It looks like the concept
>     URIs are descriptive (i.e. for the concept
>     http://dbpedia.org/page/Solar_System, the concept ID is
>     "Solar_System").  Are the descriptive IDs used throughout all of
>     dbpedia (back and front end) or are terms ultimately kept unique by
>     using numeric identifiers?
>
>     I've been developing a controlled vocabulary and I would also like
>     to use URIs so that my terms can be used with other linked data
>     schemes.  My group and I have had a lot of discussions regarding the
>     concept IDs; some want them to be descriptive, based on the
>     preferred term for each concept so that they are human readable but
>     this could cause problems if the terms used to describe each concept
>     change over time, others want them to be randomly generated so that
>     if the description of a term drifts over time the URI for the
>     concept will always remain static.
>
>     We are trying to figure out if there are any standards or best
>     practices we should be looking towards when it comes to concept
>     IDs.  Any thoughts/comments/justifications would be appreciated.
>
>     Best,
>     Katie
>
>     --
>     Katie E. Frey
>     John G. Wolbach Library, Harvard-Smithsonian Center for Astrophysics
>     60 Garden Street, MS-56, Cambridge, MA 02138
>     email: kf...@cfa.harvard.edu <mailto:kf...@cfa.harvard.edu>   |
>     phone: 617-496-7579 <tel:617-496-7579>
>     http://astrothesaurus.org           | http://library.cfa.harvard.edu/
>
>     "Surprising what you can dig out of books if you read long enough,
>     isn’t it?"
>     - Rand al'Thor (in Robert Jordan's The Shadow Rising, Book Four of
>     the Wheel of Time)
>
>     "This is insanity!"   "No, this is scholarship!"
>     - Yalb and Shallan (in Brandon Sanderson's Words of Radiance, Book
>     Two of the Stormlight Archive)
>
>     _______________________________________________
>     Dbpedia-discussion mailing list
>     Dbpedia-discussion@lists.sourceforge.net
>     <mailto:Dbpedia-discussion@lists.sourceforge.net>
>     https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>
>

-- 
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

