sanity checking the LOD Cloud statistics
Hello, all --

I've had a few minutes to start working to update my version [1] of the LOD Cloud diagram [2], which means I got to start looking at the Data Set Statistics [3] and Link Statistics [4] pages. I have found a number of apparent discrepancies. I'm not sure where these came from, but I think they need attention and correction.

[3] gave some round values and some exact values. It's not at all clear whether these values were originally intended to reflect triple counts in the data set, URIs minted there (i.e., Entities named there), or something else entirely. I think the page holds a mix of these, which makes them rather troublesome as a source of comparison between data sets.

[4] had few exact values, which appear to have been incorrectly added there, and apparently means to use only 3 counts for the inter-set linkages -- 100, 1000, and 100.000. Clearly, the last means more-than-one-hundred-thousand -- because the first clearly means more-than-one-hundred -- but this was not obvious at first glance, given my US training that the period is used for the decimal, not for the thousands delimiter.

First, therefore, I suggest that all period-delimiters on [4] change to comma-delimiters, to match the first page. (I've actually made this change, but incorrect values may well remain -- please read on.)

I think it also makes sense to add 10,000 and 1,000,000 to the values here. Just looking at the actual DBpedia counts which were on the page, it's clear that a log scale comparing the interlinkage levels presents a better picture than the three arbitrarily chosen levels. (Again, I've started using these as relevant.)

Now to the discrepancies. From [3], I got this line --

    http://dbtune.org/bbc/playcount/   BBC Playcount Data   10,000

At first read, I thought that meant 10,000 triples.
But [4] indicated these external link counts for BBC Playcount Data --

    http://www.bbc.co.uk/programmes   BBC Programmes   100.000
    http://dbtune.org/musicbrainz     Musicbrainz      100.000

I don't see a way for 10,000 triples to include 200,000 external links. That means that the first count must be of Entities. But going to the BBC Playcount home page [5], I found --

    Triple count                        1,954,786
    Distinct BBC Programmes resources       6,863
    Distinct Musicbrainz resources          7,055

An obvious missing number here is a count of minted URIs -- that is, of BBC Playcount resources/entities -- but I also learned that BBC Playcount URIs are not pointers-to-values, but values-in-themselves. The count is *embedded* in the URI (and thus, if a count changes, the URI changes!) --

    A playcount URI in this service looks like:
    http://dbtune.org/bbc/playcount/id_k
    Where id is the id of the episode or the brand, as in the
    /programmes BBC catalogue, and k is a number between 0 and
    the number of playcounts for the episode or the brand.

If we accept this URI construction as reasonable (which I don't), it seems that k must actually be a natural or counting number (i.e., an integer greater than or equal to 1). A value of 0 is nonsensical, as it would result in a Cartesian data set -- where each and every Musicbrainz resource gets a Playcount URI for each and every Programme resource -- and most of these Playcount URIs would have k = 0, since most Musicbrainz resources were not played in most Programmes. Even if Zero-Play URIs are created only for those Musicbrainz resources which were played in *some* Programme, far more URIs are created than are needed for those Programmes where they weren't played. I'm hoping that the folks who built this data set are reading, and will consider restructuring it.
I'd suggest that the URI structure should be more like --

    http://dbtune.org/bbc/playcount/id_count

-- where id reflects *either* a Programmes *or* a Musicbrainz ID (this may need further thought, as I'm not directly familiar with these IDs, and Programmes IDs may conflict with Musicbrainz IDs), and the count (the *value*) is returned when the constructed URI is dereferenced.

More baffling, and more troubling, on [3] I found --

    http://ieee.rkbexplorer.com/   IEEE   111

-- which purports to be linked out as follows --

    http://acm.rkbexplorer.com/        ACM                 1000
    http://eprints.rkbexplorer.com/    eprints             100.000
    http://citeseer.rkbexplorer.com/   CiteSeer            100.000
    http://dblp.rkbexplorer.com/       DBLP RKB Explorer   1000
    http://laas.rkbexplorer.com/       LAAS CNRS           100.000

Looking to primary sources again --

    Current statistics for this repository (ieee.rkbexplorer.com) --
    Last data assertion   2009-02-06 13:28:04
    Number of triples     111442
    Number of symbols     31552
    Size of RDF dataset   8.2M

    Current statistics for the CRS for this repository
    (ieee.rkbexplorer.com) --
    Last data assertion
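Ted's proposal -- replace the three arbitrary link-count levels with a log scale (100; 1,000; 10,000; 100,000; 1,000,000), written with comma thousands delimiters -- can be sketched in a few lines of Python. The function names and the `LEVELS` constant are my own illustration, not anything from the wiki pages:

```python
# Log-scale levels Ted suggests for the Link Statistics page.
LEVELS = [100, 1000, 10000, 100000, 1000000]

def link_level(count):
    """Return the largest level <= count, or None if count is below 100."""
    level = None
    for threshold in LEVELS:
        if count >= threshold:
            level = threshold
    return level

def show_level(count):
    """Render a level with US-style comma thousands delimiters."""
    level = link_level(count)
    return "unlisted" if level is None else format(level, ",d")
```

Under this scheme the ~200,000 BBC Playcount links would land in the "100,000" bucket rather than being indistinguishable from a data set with two million links.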
RE: sanity checking the LOD Cloud statistics
Hi Ted,

First, I totally agree with the need to change the current (relatively arbitrary) levels. Values like 100 and even 100,000 seem a bit anachronistic; I guess these ranges were valid in the very first days of the LOD Cloud, but today, for the most part, we're talking about millions of URIs, triples, etc.

Two significant errors I see related to OpenCalais:

    Open Calais   DBpedia    100
    Open Calais   Freebase   100

The correct number should be 100,000 for both the OpenCalais-to-DBpedia and the OpenCalais-to-Freebase link counts. To make sure we're on the same page: that's larger than one hundred thousand.

Also, regarding the size of the data set:

    OpenCalais   4,500,000

The number shown actually refers to the URI count and not to the number of triples. The number of triples is at least 10 times bigger, or: 45,000,000 (that's 45 million triples).

Regards,
Michal

Michal Finkelstein
Director, Content Strategy
The Calais Initiative
Thomson Reuters
michal.finkelst...@thomsonreuters.com

-----Original Message-----
From: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] On Behalf Of Ted Thibodeau Jr
Sent: Wednesday, April 01, 2009 9:06 AM
To: public-lod@w3.org
Subject: sanity checking the LOD Cloud statistics
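The delimiter ambiguity Ted flags (and Michal's "larger than one hundred thousand" clarification) comes down to "100.000" reading as one hundred under US conventions but one hundred thousand under the European convention the wiki apparently used. A minimal sketch, with illustrative function names of my own:

```python
def parse_us(s):
    """US convention: ',' groups thousands, '.' marks the decimal."""
    return float(s.replace(",", ""))

def parse_eu(s):
    """European convention: '.' groups thousands, ',' marks the decimal."""
    return float(s.replace(".", "").replace(",", "."))
```

The same string "100.000" yields 100 one way and 100,000 the other, which is exactly why Ted's switch to comma delimiters on both pages removes the ambiguity for US readers.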
Re: sanity checking the LOD Cloud statistics
Nice going, Ted. Sanity checking (and even QA) is always good. (I'll try to find the time to respond accurately to the RKB queries soon.)

Just one general comment I'd like to make: size isn't everything! Millions of links between DBpedia and Yago or Freebase might give us a nice warm feeling, but it would be nice to find space for what I think of as very valuable links that might come in small numbers -- small but perfectly formed? For example, if I were to have a site about the British royal family (or maybe a small company or institution), I might only have a few hundred people in it, some of whom would have pages in DBpedia, but certainly fewer than 100. If I have carefully made those links, they will be a great benefit to my site (and possibly to LOD in general), but there will be little or no visibility in the LOD wiki, and certainly not in the LOD diagram. This seems a shame to me.

Of course, I could construct stuff to get over some arbitrary threshold if I really wanted to, but we really don't want to encourage that. (By the way, this is actually the situation for things like our RKB links to Computer Scientists in DBpedia: as you can imagine, there are not a huge number of Computer Scientists in Wikipedia.)

Best
Hugh
RE: sanity checking the LOD Cloud statistics - Please add the statistics for your dataset to the Wiki
Hi Ted,

Good that you raise this topic. The statistics were added to the wiki by Anja and reflect her knowledge/guesses about the size of the datasets and the numbers of links between them. And of course, some of her guesses might be wrong.

In an ideal world, these statistics would be provided by Semantic Web search engines that crawl the cloud and afterwards calculate the statistics based on what they actually got from the Web. Alternatively, all dataset providers could publish VoID descriptions of their datasets, which could also be used to generate the statistics. But as the search engines have not yet reached this point, and as VoID is also not used by all data providers, we thought it would be useful to put these statistics into the Wiki as a starting point, so that people (especially data set publishers) can update them and we can use them when we draw the LOD cloud the next time.

I updated the statistics about outgoing links connecting DBpedia with other datasets yesterday. If everybody on this list would do the same for the data sources they maintain/use, I think we will get a much more accurate LOD diagram the next time we draw it.

So, please: take 5 minutes and quickly add the actual statistics about your datasets to

    http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/Statistics
        (size of your dataset)
    http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/LinkStatistics
        (number of links connecting your dataset with other datasets)

Thanks a lot in advance!

Cheers
Chris

-----Original Message-----
From: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] On Behalf Of Ted Thibodeau Jr
Sent: Wednesday, April 1, 2009 08:06
To: public-lod@w3.org
Subject: sanity checking the LOD Cloud statistics
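The VoID descriptions Chris mentions are small RDF documents describing a dataset's size and linkage. Below is a sketch that emits a minimal VoID description as Turtle text; the dataset URI and the counts are placeholders, not real statistics for any dataset, and real deployments would typically also describe linksets between datasets:

```python
def void_description(dataset_uri, triples, entities):
    """Emit a minimal VoID dataset description as Turtle text."""
    return (
        "@prefix void: <http://rdfs.org/ns/void#> .\n"
        "@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .\n"
        "\n"
        "<%s> a void:Dataset ;\n"
        '    void:triples  "%d"^^xsd:integer ;\n'
        '    void:entities "%d"^^xsd:integer .\n'
        % (dataset_uri, triples, entities)
    )

# Example with made-up numbers for a hypothetical dataset:
print(void_description("http://example.org/void/mydataset", 1954786, 13918))
```

If every dataset in the cloud published something like this, the wiki statistics pages could be generated rather than hand-edited.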
ANN: STW Thesaurus for Economics published as Linked Data
The STW Thesaurus for Economics is now available at http://zbw.eu/stw. STW is a richly interconnected vocabulary in English and German on economics and business economics, as well as some related subject areas. It includes subject categories and lots of synonyms in order to help find the appropriate terms. Its publication aims at providing an interlinking hub for economics resources on the web of Linked Data.

The thesaurus is maintained by the German National Library of Economics (ZBW) and published under a Creative Commons (by-nc-sa) license. It is delivered as XHTML+RDFa pages with an incremental search interface and a navigable tree. A SKOS RDF/XML dump version can be downloaded, as well as a set of links to DBpedia concepts.

More information about the design of the application can be found in a paper for the Linked Data on the Web workshop in Madrid (http://events.linkeddata.org/ldow2009/papers/ldow2009_paper7.pdf).

Enjoy - Manuela and Joachim

--
Manuela Gastmeyer, Thesaurus Team
Joachim Neubert, IT Development
German National Library of Economics (ZBW)
Leibniz Information Center for Economics
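A SKOS RDF/XML dump like the one STW offers can be consumed with nothing but the Python standard library. The sketch below extracts English prefLabels from a toy SKOS fragment; the sample data is my own invention, not actual STW content, and a full RDF toolkit would be preferable for anything beyond this simple shape:

```python
import xml.etree.ElementTree as ET

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Toy SKOS concept, loosely shaped like a thesaurus dump entry.
sample = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/stw/1">
    <skos:prefLabel xml:lang="en">Monetary policy</skos:prefLabel>
    <skos:prefLabel xml:lang="de">Geldpolitik</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""

def english_pref_labels(xml_text):
    """Map concept URIs to their English skos:prefLabel."""
    root = ET.fromstring(xml_text)
    labels = {}
    for concept in root.iter("{%s}Concept" % SKOS):
        uri = concept.get("{%s}about" % RDF)
        for label in concept.findall("{%s}prefLabel" % SKOS):
            if label.get("{%s}lang" % XML_NS) == "en":
                labels[uri] = label.text
    return labels
```

Note this only handles concepts serialized as explicit `skos:Concept` elements; RDF/XML permits other serializations of the same triples.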
Re: ANN: STW Thesaurus for Economics published as Linked Data
Neubert Joachim wrote:
> STW Thesaurus for Economics is now available under http://zbw.eu/stw.

Great stuff!

Are the links on this page:

    http://zbw.eu/stw/versions/latest/download/about.en.html

added to the LOD Data Sets page?

--
Regards,

Kingsley Idehen
Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO, OpenLink Software
Web: http://www.openlinksw.com