Hugh,
An important and interesting issue, thanks for raising it, and thanks
also to everyone else who contributed to this thread.
I tend to agree: A search function that allows looking for resources
by name greatly increases the usefulness of any dataset, and providing
such a function is always a good idea.
Let me ask you something, Hugh: Now that you've raised awareness of
the issue, can you propose some concrete steps that we could take to
improve the situation? Shall we review the datasets out there and flag
those without search? Shall we write up a blog post or wiki page?
Something else?
I want to point out that creating such a site search can be very
simple for the dataset publisher. For example, at the old Berlin DBLP
dataset [1], you will find a name search on the homepage. This was a
last-minute hack, implemented in an hour using a pageful of
JavaScript. It works by sending a SPARQL query to the dataset's SPARQL
endpoint via AJAX, and redirecting to the best result. Certainly not
the best search function you've ever seen, but really simple... If
your dataset wraps a triple store or a relational database or a web
API, then you almost certainly can use the search functions provided
by the store/DB/API to implement this, and I would be surprised if it
takes more than half a day.
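To make the idea concrete, here is a minimal sketch of such a name
search. It is not the code behind [1] (that was JavaScript running in
the browser); it is an illustration in Python under stated assumptions:
the endpoint URL is a placeholder, labels are assumed to be published
as rdfs:label, the endpoint is assumed to support the SPARQL 1.1
functions CONTAINS and LCASE and to return SPARQL JSON results, and
the helper names (build_name_query, pick_best, find_uri) are made up
for this example.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint URL; substitute your dataset's SPARQL endpoint.
ENDPOINT = "http://example.org/sparql"


def build_name_query(name, limit=20):
    """Build a SPARQL query for resources whose rdfs:label contains `name`."""
    # Escape backslashes and double quotes so the name is a valid string literal.
    literal = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?s ?label WHERE {\n"
        "  ?s rdfs:label ?label .\n"
        # Case-insensitive substring match (assumes SPARQL 1.1 support).
        f'  FILTER (CONTAINS(LCASE(STR(?label)), LCASE("{literal}")))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def pick_best(bindings, name):
    """Return the URI whose label is closest to `name`, or None if no results."""
    wanted = name.strip().lower()

    def rank(binding):
        label = binding["label"]["value"].strip().lower()
        # Exact matches sort first, then prefix matches, then shorter labels.
        return (label != wanted, not label.startswith(wanted), len(label))

    candidates = [b for b in bindings if b["s"]["type"] == "uri"]
    if not candidates:
        return None
    return min(candidates, key=rank)["s"]["value"]


def find_uri(name, endpoint=ENDPOINT):
    """Query the endpoint and return the best-matching URI (network call)."""
    params = urllib.parse.urlencode({"query": build_name_query(name)})
    request = urllib.request.Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        results = json.load(response)
    return pick_best(results["results"]["bindings"], name)
```

A homepage search box would call something like find_uri("Telemann")
and redirect the browser to the returned URI. The ranking in pick_best
is deliberately crude; a store with a full-text index would very likely
do better, and much faster, than a substring filter.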
Another example to which I've contributed, and which I like quite
much, is the search of the RDF book mashup [2], which works by
wrapping the appropriate method of the Amazon web service API. The
search results are also available as RDF (find it via autodiscovery
links).
Bradley's mention of RDFa is worth highlighting: In an RDFa-enabled
website, the local site search, which is probably already available,
automatically doubles as a search for URIs. This is one of the many
reasons why I'm becoming an RDFa fanboy -- it makes us create good
linked data sites simply by following dusty old good practices for
website design and deployment, such as providing site search!
Finally, allow me to be a bit smirky and quote below from an email I
sent to this list 14 months ago. In it, I recount similar frustrations
in finding entry points into a recently announced dataset -- RKB
Explorer. It's good to see that this site has improved a lot since,
but it's maybe a bit discouraging that we still face the same general
problems more than a year later... Anyways, enjoy! ;-)
Next, let's talk about concrete steps that we can take to improve the
situation.
Best,
Richard
[1] http://www4.wiwiss.fu-berlin.de/dblp/
[2] http://www4.wiwiss.fu-berlin.de/bizer/bookmashup/#search
On 7 Nov 2007, at 21:46, Richard Cyganiak wrote:
Hugh,
This looks like it could be an awesome resource. Unfortunately I
didn't have much luck getting any kind of data back from the services.
The "browse" function doesn't do anything useful for me. I searched
for a wide variety of terms, including "the", "a" and "2003" in the
first ten or so datasets, including the ones called Citeseer and
DBLP. No results. What am I supposed to put into the search box?
I also tried to explore the datasets using SPARQL queries. I started
with queries such as
SELECT DISTINCT ?class WHERE { ?x a ?class }
to learn about the vocabulary used in the dataset. These queries
return some results on some of the datasets (they time out on
others), but clicking any of the results consistently showed a page
with zero results. Same for opening in an RDF browser.
So in fact, despite honestly trying, the only way I could get any
real data back from the services was by using the four example URIs
provided at www.rkbexplorer.com .
Obviously a lot of work went into this. It's a shame that it's so
hard to make any use of it because the last 5% are missing.
What are those last 5%?
1. A brief description of what each dataset actually is, and what
sort of data it contains. The currently available information (who
provided the data and some triple counts) is not enough.
2. A bunch of representative example URIs for each dataset.
3. A bunch of representative and interesting SPARQL queries against
each dataset.
4. If possible, a note on what vocabulary (classes and properties)
is used in each dataset. This would greatly simplify SPARQLing the
datasets.
5. You should think really hard about natural navigation entry
points into the datasets. Is there any natural root from which
everything can be accessed? Is there a category system or class
hierarchy that one can navigate along to find interesting stuff?
6. You should consider adding a few domain-specific search
functions, such as the simple Find Yourself function provided at
http://dblp.l3s.de/d2r/ .
I'm a bit frustrated because this looks like an amazingly great
resource, but I can't actually get any clear feeling for its scope
or quality or contents. This feels like exploring a pitch black room
while wearing boxing gloves.
I'm very hopeful that you can greatly improve this experience with
little effort.
Thanks a lot,
Richard
On 7 Feb 2009, at 13:23, Hugh Glaser wrote:
My proposal:
*We should not permit any site to be a member of the Linked Data cloud
if it does not provide a simple way of finding URIs from natural
language identifiers.*
Rationale:
One aspect of our Linking Data (not to mention our Linking Open Data)
world is that we want people to link to our data - that is, I have
published some stuff about something, with a URI, and I want people to
be able to use that URI.
So my question to you, the publisher, is: "How easy is it for me to
find the URI your users want?"
My experience suggests it is not always very easy.
What is required at the minimum, I suggest, is a text search, so that
if I have a (boring string version of a) name that refers in my mind
to something, I can hope to find an (exciting Linked Data) URI of that
thing.
I call this a projection from the Web to the Semantic Web.
rdfs:label or equivalent usually provides the other one.
At the risk of being seen as critical of the amazing efforts of all my
colleagues (if not also myself), this is rarely an easy thing to do.
Some recent experiences:
OpenCalais: as in my previous message on this list, I tried hard to
find a URI for Tim, but failed.
dbtune: Saw a Twine message about dbtune, trundled over there, and
tried to find a URI for a Telemann, but failed.
dbpedia: wanted Tim again. After clicking on a few web pages, none of
which seemed to provide a search facility, I resorted to my usual
method:- look it up in wikipedia and then hack the URI and hope it
works in dbpedia.
(Sorry to name specific sites, guys, but I needed a few examples.
And I am only asking for a little more, so that the fruits of your
amazing labours can be more widely appreciated!)
wordnet: [2] below
So I have access to Linked Data sites that I know (or at least
strongly suspect) have URIs I might want, but I can't find them.
How on earth do we expect your average punter to join this world?
What have I missed?
Searching, such as Sindice: Well yes, but should I really have to go
off to a search engine to find a dbpedia URI? And when I look up
"Telemann dbtune" I don't get any results. And I wanted the dbtune
link, not some other link.
Did I miss some links on web pages? Quite probably, but the basic
problem still stands.
SPARQL: Well, yes. But we cannot seriously expect our users to
formulate a SPARQL query simply to find out the dbpedia URI for Tim.
What is the regexp I need to put in? (see below [1])
A foaf file: Well Tim's dbpedia URI is probably in his foaf file
(although possibly there are none of Tim's URIs in his foaf file), if
I can actually find the file; but for some reason I can't seem to find
Telemann's foaf file.
If you are still doubting me, try finding a URI for Telemann in
dbpedia without using an external link, just by following stuff from
the home page.
I managed to get a Telemann by using SPARQL without a regexp (it times
out on any regexp), but unfortunately I get the asteroid.
Again, my proposal:
*We should not permit any site to be a member of the Linked Data cloud
if it does not provide a simple way of finding URIs from natural
language identifiers.*
Otherwise we end up in a silo, and the world passes us by.
Very best
Hugh
[And since we have to take our own medicine, I have added a "Just
search" box right at the top level of all the rkbexplorer.com domains,
such as http://wordnet.rkbexplorer.com/ ]
[1]
Dbtune finding of Telemann:
SELECT * WHERE {?s ?p ?name .
FILTER regex(?name, "Telemann$") }
I tried
SELECT * WHERE {?s ?p ?name .
FILTER regex(?name, "telemann$", "i") }
first, but got no results - not sure why.
[2]
<rant>
I cannot believe just how frustrating this stuff can be when you
really try to use it.
Because I looked at Sindice for telemann, I know that it is a word in
wordnet ( http://sindice.com/search?q=Telemann reports loads of
http://wordnet.rkbexplorer.com/ links).
Great, he thinks, I can get a wordnet link from a "proper" wordnet
publisher (ie not me).
Goes to
http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
to find wordnet.
The link there is dead.
Strips off the last bit, to get to the home princeton wordnet page,
and clicks on the browser link I find - also dead.
Go back and look on the
http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets
page, and find the link to http://esw.w3.org/topic/WordNet , but that
doesn't help.
So finally, I do the obvious - google "wordnet rdf".
Of course I get lots of pages saying how available it is, and how
exciting it is that we have it, and how it was produced; and somewhere
in there I find a link: "Wordnet-RDF/RDDL Browser" at
www.openhealth.org/RDDL/wnbrowse
Almost unable to contain myself with excitement, I click on the link
to find a text box, and with trembling hands I type "Telemann" and
click submit.
If I show you what I got, you can come some way to imagining my
devastation:
"Using org.apache.xerces.parsers.SAXParser
Exception net.sf.saxon.trans.DynamicError:
org.xml.sax.SAXParseException:
White spaces are required between publicId and systemId.
org.xml.sax.SAXParseException: White spaces are required between
publicId and systemId."
Does the emperor have any clothes at all?
</rant>