sanity checking the LOD Cloud statistics

2009-04-01 Thread Ted Thibodeau Jr

Hello, all --

I've had a few minutes to start working to update my version [1] of the
LOD Cloud diagram [2], which means I got to start looking at the Data
Set Statistics [3] and Link Statistics [4] pages.

I have found a number of apparent discrepancies.  I'm not sure where these
came from, but I think they need attention and correction.

[3] gave some round, and some exact, values.  It's not at all clear whether
these values were originally intended to reflect triple-counts in the data
set, URIs minted there (i.e., Entities named there), or something else
entirely.  I think the page holds a mix of these, which makes them rather
troublesome as a source of comparison between data sets.

[4] had a few exact values, which appear to have been incorrectly added
there, and apparently means to use only 3 counts for the inter-set
linkages -- > 100, > 1000, > 100.000.  Clearly, the last means
more-than-one-hundred-thousand -- because the first clearly means
more-than-one-hundred -- but this was not obvious at first glance, given
my US training that the period is used for the decimal, not for the
thousands delimiter.

First thing, therefore, I suggest that all period-delimiters on [4] change
to comma-delimiters, to match the first page.  (I've actually made this
change, but incorrect values may well remain -- please read on.)

I think it also makes sense to add > 10,000 and > 1,000,000 to the values
here.  Just looking at the DBpedia actual counts which were on the page,
it's clear that a log scale for comparing the interlinkage levels presents
a better picture than the three arbitrarily chosen levels.  (Again, I've
started using these as relevant.)
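
In case it helps anyone making similar edits, here is a rough sketch
(Python; the sample counts are invented, not taken from the wiki) of the
bucketing I have in mind --

   # Assign each inter-set link count to the largest power-of-ten level
   # it exceeds, and print it with comma thousands-delimiters.
   LEVELS = [100, 1_000, 10_000, 100_000, 1_000_000]

   def bucket(count):
       # Largest level strictly exceeded, or None if count <= 100.
       passed = [lvl for lvl in LEVELS if count > lvl]
       return max(passed) if passed else None

   for count in (233, 4_512, 61_000, 193_407, 2_400_000):  # invented
       lvl = bucket(count)
       label = "> {:,}".format(lvl) if lvl else "<= 100"
       print("{:>11,}  ->  {}".format(count, label))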


Now to the discrepancies.  From [3], I got this line --

   http://dbtune.org/bbc/playcount/   BBC Playcount Data  10,000

At first read, I thought that meant 10,000 triples.  But [4] indicated
these external link counts for BBC Playcount Data --

   http://www.bbc.co.uk/programmes   BBC Programmes   > 100.000
   http://dbtune.org/musicbrainz     Musicbrainz      > 100.000

I don't see a way for 10,000 triples to include 200,000 external links.
That means that the first count must be of Entities.  But going to the
BBC Playcount home page [5], I found --

   Triple count                        1,954,786
   Distinct BBC Programmes resources   6,863
   Distinct Musicbrainz resources      7,055
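
That 10,000-versus-200,000 check is mechanical enough to script; a rough
sketch (Python), with the figures from [3] and [4] hand-copied in --

   # A data set's outgoing links are themselves triples in that set, so
   # its declared size must be at least the sum of its link counts.
   declared_size = {"BBC Playcount Data": 10_000}        # from [3]
   outgoing = {                                          # from [4]
       "BBC Playcount Data": {"BBC Programmes": 100_000,
                              "Musicbrainz": 100_000},
   }

   for name, size in declared_size.items():
       links = sum(outgoing.get(name, {}).values())
       if size < links:
           print("{}: declared size {:,} < {:,} outgoing links"
                 .format(name, size, links))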

An obvious missing number in [5]'s figures is a count of minted URIs --
that is, of BBC Playcount resources/entities -- but I also learned that
BBC Playcount URIs are not pointers-to-values, but values-in-themselves.
The count is *embedded* in the URI (and thus, if a count changes, the URI
changes!) --

   A playcount URI in this service looks like:

      http://dbtune.org/bbc/playcount/id_k

   Where id is the id of the episode or the brand, as in the /programmes
   BBC catalogue, and k is a number between 0 and the number of playcounts
   for the episode or the brand.

If we accept this URI construction as reasonable (which I don't), it seems
that k must actually be a natural or counting number (i.e., an integer
greater than or equal to 1).  A value of 0 is nonsensical, as it would
result in a Cartesian data set -- where each and every Musicbrainz resource
gets a Playcount URI for each and every Programme resource -- and most of
these Playcount URIs would have k = 0, for most Musicbrainz resources were
not played in most Programmes.
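
To put a number on that Cartesian reading (Python, though it is just
arithmetic over the two resource counts from [5]) --

   # Under the Cartesian reading, every (Programme, Musicbrainz) pair
   # gets a Playcount URI, nearly all of them with k = 0.
   programmes  = 6_863   # distinct BBC Programmes resources, from [5]
   musicbrainz = 7_055   # distinct Musicbrainz resources, from [5]
   print("{:,} URIs".format(programmes * musicbrainz))   # 48,418,465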

Even if Zero-Play URIs are created only for those Musicbrainz resources
which were played in *some* Programme, far more URIs than needed are still
created for all the Programmes where they weren't played.

I'm hoping that the folks who built this data set are reading, and will
consider restructuring it.  I'd suggest that the URI structure should be
more like --

   http://dbtune.org/bbc/playcount/id_count

-- where id reflects *either* a Programmes *or* a Musicbrainz ID (this may
mean further thinking, as I'm not directly familiar with these IDs, and
Programmes IDs may collide with Musicbrainz IDs), and the count (the
*value*) is returned when the constructed URI is dereferenced.
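
A sketch of the shape I have in mind, in Python with rdflib -- the
pc:count predicate and the id are my inventions for illustration, not
anything the service actually publishes --

   # One stable URI per resource; the play count lives in a triple, so a
   # changed count no longer re-mints the URI.
   from rdflib import Graph, Literal, Namespace, URIRef
   from rdflib.namespace import XSD

   PC = Namespace("http://dbtune.org/bbc/playcount/vocab#")  # assumed
   g = Graph()
   res = URIRef("http://dbtune.org/bbc/playcount/some_id")   # hypothetical
   g.add((res, PC["count"], Literal(42, datatype=XSD.integer)))
   print(g.serialize(format="turtle"))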


More baffling, and more troubling, on [3] I found --

   http://ieee.rkbexplorer.com/   IEEE   111

-- which purports to be linked out as follows --

   http://acm.rkbexplorer.com/        ACM                 > 1000
   http://eprints.rkbexplorer.com/    eprints             > 100.000
   http://citeseer.rkbexplorer.com/   CiteSeer            > 100.000
   http://dblp.rkbexplorer.com/       DBLP RKB Explorer   > 1000
   http://laas.rkbexplorer.com/       LAAS CNRS           > 100.000

Looking to primary sources again --

   Current statistics for this repository (ieee.rkbexplorer.com) --

      Last data assertion   2009-02-06 13:28:04
      Number of triples     111442
      Number of symbols     31552
      Size of RDF dataset   8.2M

   Current statistics for the CRS for this repository
   (ieee.rkbexplorer.com) --

      Last data assertion
RE: sanity checking the LOD Cloud statistics

2009-04-01 Thread Michal Finkelstein
Hi Ted,

First, I totally agree with the need to change the current (relatively
arbitrary) levels.
Values like > 100 and even > 100,000 seem a bit anachronistic; I guess
these ranges were valid in the very first days of the LOD Cloud, but
today, for the most part, we're talking about millions of URIs, triples,
etc.

Two significant errors I see related to OpenCalais:

   Open Calais   DBpedia    > 100
   Open Calais   Freebase   > 100

The correct number should be > 100,000 for both OpenCalais-to-DBpedia
and OpenCalais-to-Freebase link counts.
To make sure we're on the same page: that's larger than one hundred
thousand.

Also regarding the size of the data set:

 OpenCalais   4,500,000

The number shown actually refers to the URI count, not to the number of
triples.
The number of triples is at least 10 times bigger: roughly 45,000,000
(that's 45 million triples).
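
The two measures are easy to tell apart mechanically; a rough sketch in
Python with rdflib (the dump file name is just a placeholder) --

   # Triples are statements; the URI count is the number of distinct
   # resources those statements mention.
   from rdflib import Graph, URIRef

   g = Graph()
   g.parse("dump.nt", format="nt")   # placeholder file name
   uris = {term for triple in g for term in triple
           if isinstance(term, URIRef)}
   print("triples: {:,}   distinct URIs: {:,}".format(len(g), len(uris)))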

Regards,
Michal

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Michal Finkelstein
Director, Content Strategy
The Calais Initiative

Thomson Reuters

michal.finkelst...@thomsonreuters.com






Re: sanity checking the LOD Cloud statistics

2009-04-01 Thread Hugh Glaser
Nice going Ted.
Sanity checking (and even QA) is always good.
(I'll try and find the time to respond accurately to the RKB queries soon.)

Just one general comment I'd like to make - size isn't everything!

Millions of links between dbpedia and yago or freebase might give us a nice
warm feeling, but it would be nice to find space for what I think of as very
valuable links, that might be in small numbers - small but perfectly formed?

For example, if I were to have a site about the British royal family (or
maybe a small company or institution), I might only have a few hundred
people in it, some of whom would have pages in dbpedia, but certainly fewer
than 100.
If I have carefully made those links, it will be a great benefit to my site
(and possibly LOD in general), but there will be little or no visibility in
the LOD wiki, and certainly not the LOD diagram.
This seems a shame to me.
Of course, I could construct stuff to get over some arbitrary threshold if I
really want to, but we really don't want to encourage that.

(By the way, this is actually the situation for things like our RKB links
to Computer Scientists in dbpedia: as you can imagine, there are not a huge
number of Computer Scientists in wikipedia.)

Best
Hugh




RE: sanity checking the LOD Cloud statistics - Please add the statistics for your dataset to the Wiki

2009-04-01 Thread Chris Bizer
Hi Ted,

Good that you raise this topic.

The statistics were added to the wiki by Anja and reflect her
knowledge/guesses about the size of the datasets and the numbers of links
between them. And of course, some of her guesses might be wrong.  

In an ideal world, these statistics would be provided by Semantic Web
search engines that crawl the cloud and then calculate the statistics based
on what they actually got from the Web. Alternatively, all dataset
providers could publish voiD descriptions of their datasets, which could
also be used to generate the statistics.
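
To illustrate, a minimal voiD description could be generated along these
lines (Python with rdflib; every URI and count below is a placeholder,
not a real statistic) --

   # A data set's size plus one Linkset per target data set, in voiD.
   from rdflib import Graph, Literal, Namespace, URIRef
   from rdflib.namespace import RDF, XSD

   VOID = Namespace("http://rdfs.org/ns/void#")
   g = Graph()
   ds = URIRef("http://example.org/void#MyDataset")         # placeholder
   ls = URIRef("http://example.org/void#LinksToDBpedia")    # placeholder

   g.add((ds, RDF.type, VOID.Dataset))
   g.add((ds, VOID.triples, Literal(1_000_000, datatype=XSD.integer)))
   g.add((ls, RDF.type, VOID.Linkset))
   g.add((ls, VOID.subjectsTarget, ds))
   g.add((ls, VOID.objectsTarget,
          URIRef("http://dbpedia.org/void#DBpedia")))       # placeholder
   g.add((ls, VOID.triples, Literal(100_000, datatype=XSD.integer)))
   print(g.serialize(format="turtle"))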

But as the search engines have not yet reached this point, and as voiD is
also not yet used by all data providers, we thought it would be useful to
put these statistics into the Wiki as a starting point, so that people
(especially data set publishers) can update them and we can use them the
next time we draw the LOD cloud.

I have updated the statistics about outgoing links connecting DBpedia with
other datasets yesterday. 

If everybody on this list does the same for the data sources they
maintain/use, I think we will get a much more accurate LOD diagram the next
time we draw it.

So, please: Take 5 minutes and quickly add the actual statistics about your
datasets to

http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/Statistics
(size of your dataset)

http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/LinkStatistics
(number of links connecting your dataset with other datasets)

Thanks a lot in advance!

Cheers

Chris





ANN: STW Thesaurus for Economics published as Linked Data

2009-04-01 Thread Neubert Joachim
STW Thesaurus for Economics is now available at http://zbw.eu/stw.
 
STW is a richly interconnected vocabulary in English and German on
economics and business economics as well as some related subject areas.
It includes subject categories and lots of synonyms in order to find the
appropriate terms. Its publication aims at providing an interlinking hub
for economics resources on the web of Linked Data.
 
The thesaurus is maintained by the German National Library of Economics
(ZBW) and published under a Creative Commons (by-nc-sa) license.
 
It is delivered as XHTML+RDFa pages with an incremental search interface
and a navigable tree. A SKOS RDF/XML dump version can be downloaded, as
well as a set of links to dbpedia concepts. More information about the
design of the application can be found in a paper for the Linked Data on
the Web workshop in Madrid
(http://events.linkeddata.org/ldow2009/papers/ldow2009_paper7.pdf).
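
If you want to explore the dump programmatically, a small sketch in
Python with rdflib (the local file name stands in for whatever the
download page provides) --

   # Print a few English preferred labels from the STW SKOS dump.
   from itertools import islice
   from rdflib import Graph, Namespace

   SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
   g = Graph()
   g.parse("stw.rdf", format="xml")   # placeholder file name

   labels = ((c, l) for c, l in g.subject_objects(SKOS.prefLabel)
             if getattr(l, "language", None) == "en")
   for concept, label in islice(labels, 10):
       print(concept, label)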
 
Enjoy - Manuela and Joachim

--
Manuela Gastmeyer
Thesaurus Team

Joachim Neubert
IT Development

German National Library of Economics (ZBW)
Leibniz Information Center for Economics



Re: ANN: STW Thesaurus for Economics published as Linked Data

2009-04-01 Thread Kingsley Idehen

Great stuff!

Have the links on this page --

   http://zbw.eu/stw/versions/latest/download/about.en.html

-- been added to the LOD Data Sets page?


--


Regards,

Kingsley Idehen   Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO
OpenLink Software Web: http://www.openlinksw.com