Hi Jeni,
Jeni Tennison wrote:
As part of the linked data work the UK government is doing, we're
looking at how to use the linked data that we have as the basis of APIs
that are readily usable by developers who really don't want to learn
about RDF or SPARQL.
Wow! Talk about timing. We are looking at exactly the same issue as part
of the TSB work and were starting to look at JSON formats just this last
couple of days. We should combine forces.
One thing that we want to do is provide JSON representations of both RDF
graphs and SPARQL results. I wanted to run some ideas past this group as
to how we might do that.
I agree we want both graphs and SPARQL results but I think there is a
third case - lists of described objects.
This seems to have been a common pattern in the apps that I've worked
on. You want to find all objects (resources in RDF speak) that match
some criteria, with some ordering, and get back a list of them and their
associated properties. This is like a SPARQL DESCRIBE operating on each
of an ordered list of resources found by a SPARQL SELECT.
The point is that this is not a graph because the top level list needs
to be ordered. It is not a SPARQL result set because you want the
descriptions to include any of the properties that are present in the
data (potentially included bNode closure) without having to know all
those and spell them out in the query. But it is a natural thing to want
to return from a REST API.
To put this in context, what I think we should aim for is a pure
publishing format that is optimised for approachability for normal
developers, *not* an interchange format. RDF/JSON [1] and the SPARQL
results JSON format [2] aren't entirely satisfactory as far as I'm
concerned because of the way the objects of statements are represented
as JSON objects rather than as simple values. I still think we should
produce them (to wean people on to, and for those using more generic
tools), but I'd like to think about producing something that is a bit
more immediately approachable too.
RDFj [3] is closer to what I think is needed here. However, I don't
think there's a need for setting 'context' given I'm not aiming for an
interchange format, there are no clear rules about how to generate it
from an arbitrary graph (basically there can't be without some
additional configuration) and it's not clear how to deal with datatypes
or languages.
WRT 'context', you might not need it but I don't think it is harmful.
I think if we said to developers that there is some outer wrapper like:
{
"format" : "RDF-JSON",
"version" : "0.1",
"mapping" : ... magic stuff ...
"data" : ... the bit you care about ...
}
The developers would be quite happy doing that one dereference and
ignoring the mapping stuff, but it might allow inversion back to RDF for
those few who do care, or come to care.
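For what it's worth, consuming such a wrapper is a one-liner. A minimal
sketch in Python (the "format"/"version"/"mapping"/"data" keys are just
the illustrative wrapper above, not an agreed spec):

```python
import json

# Parse a response using the illustrative wrapper above; the key names
# ("format", "version", "mapping", "data") are assumptions, not a spec.
doc = json.loads("""
{
  "format" : "RDF-JSON",
  "version" : "0.1",
  "mapping" : { "name" : "http://example.org/terms#fullName" },
  "data" : { "name" : "Dave Beckett" }
}
""")

data = doc["data"]                 # the bit most developers care about
mapping = doc.get("mapping", {})   # kept for anyone who wants to invert to RDF
```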
I suppose my first question is whether there are any other JSON-based
formats that we should be aware of, that we could use or borrow ideas from?
The one that most intrigued me as a possible starting point was the
Simile Exhibit JSON format [1]. It is developer friendly in much the way
that you talk about but it has the advantage of zero configuration, some
measure of invertibility, has an online translator [2] and is supported
by the RPI Sparql proxy [3].
I've some reservations about standardizing on it as is:
- lack of documentation of the mapping
- some inconsistencies in how references between resources are encoded
(at least judging by the output of Babel[2] on test cases)
- handling of bNodes - I'd rather single referenced bNodes were
serialized as nested structures
[There was another format we used in a project in my previous existence
but I'm not sure if that was made public anywhere, will check.]
Assuming there aren't, I wanted to discuss what generic rules we might
use, where configuration is necessary and how the configuration might be
done.
One starting assumption to call out: I'd like to aim for a zero
configuration option and that explicit configuration is only used to
help tidy things up but isn't required to get started.
# RDF Graphs #
Let's take as an example:
<http://www.w3.org/TR/rdf-syntax-grammar>
dc:title "RDF/XML Syntax Specification (Revised)" ;
ex:editor [
ex:fullName "Dave Beckett" ;
ex:homePage <http://purl.org/net/dajobe/> ;
] .
In JSON, I think we'd like to create something like:
{
"$": "http://www.w3.org/TR/rdf-syntax-grammar",
"title": "RDF/XML Syntax Specification (Revised)",
"editor": {
"name": "Dave Beckett",
"homepage": "http://purl.org/net/dajobe/"
}
}
+1 on style
In terms of details I was thinking of following the Simile convention on
short form naming: in the absence of clashes, use the rdfs:label,
falling back to the local name, as the basis for the shortened property
names. So knowing nothing else the bNode would be:
...
"editor": {
"fullName": "Dave Beckett",
"homePage": "http://purl.org/net/dajobe/"
}
In the event of clashes, fall back on prefix-based disambiguation.
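To make that concrete, here's a rough Python sketch of the convention
(the helper names are mine, and the clash handling is deliberately
simplistic - just enough to show the label / local-name / prefix
fallback order):

```python
# Hypothetical sketch of the naming convention described above (not the
# Simile implementation): use rdfs:label where available, fall back to
# the URI local name, and prefix-disambiguate only on a clash.
def local_name(uri):
    # Local name = text after the last '#' or '/'.
    for sep in ('#', '/'):
        if sep in uri:
            return uri.rsplit(sep, 1)[1]
    return uri

def short_names(properties, labels=None, prefixes=None):
    """properties: iterable of property URIs.
    labels: optional {uri: rdfs:label}; prefixes: {namespace: prefix}."""
    labels = labels or {}
    prefixes = prefixes or {}
    chosen = {}
    for uri in properties:
        name = labels.get(uri, local_name(uri))
        if name in chosen.values():
            # Clash: fall back to prefix-based disambiguation.
            ns = uri[: len(uri) - len(local_name(uri))]
            name = prefixes.get(ns, 'ns') + '_' + local_name(uri)
        chosen[uri] = name
    return chosen
```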
Note that the "$" is taken from RDFj. I'm not convinced it's a good idea
to use this symbol, rather than simply a property called "about" or
"this" -- any opinions?
I'd prefer "id" (though "about" is OK), "$" is too heavily overused in
javascript libraries.
Also note that I've made no distinction in the above between a URI and a
literal, while RDFj uses <>s around literals. My feeling is that normal
developers really don't care about the distinction between a URI literal
and a pointer to a resource, and that they will base the treatment of
the value of a property on the (name of the) property itself.
Probably right.
Actually, in your example isn't that value a resource anyway? To make it
a literal you'd have to have:
ex:homePage "http://purl.org/net/dajobe/"^^xsd:anyURI
So, the first piece of configuration that I think we need here is to map
properties on to short names that make good JSON identifiers (i.e. name
tokens without hyphens). Given that properties normally have
lowercaseCamelCase local names, it should be possible to use that as a
default. If you need something more readable, though, it seems like it
should be possible to use a property of the property, such as:
ex:fullName api:jsonName "name" .
ex:homePage api:jsonName "homepage" .
Suggest the Simile approach, with api:jsonName (or your API mapping) as
an optional extra for resolving problems rather than a requirement.
However, in any particular graph, there may be properties that have been
given the same JSON name (or, even more probably, local name). We could
provide multiple alternative names that could be chosen between, but any
mapping to JSON is going to need to give consistent results across a
given dataset for people to rely on it as an API, and that means the
mapping can't be based on what's present in the data. We could do
something with prefixes, but I have a strong aversion to assuming global
prefixes.
So I think this means that we need to provide configuration at an API
level rather than at a global level: something that can be used
consistently across a particular API to determine the token that's used
for a given property. For example:
<> a api:JSON ;
api:mapping [
api:property ex:fullName ;
api:name "name" ;
] , [
api:property ex:homePage ;
api:name "homepage" ;
] .
Are you thinking of this as something the publisher provides or the API
caller provides?
If the former, then OK, but as I say I think a zero-config set of
default conventions is fine, with the API mapping there to allow fine
tuning.
There are four more areas where I think there's configuration we need to
think about:
* multi-valued properties
* typed and language-specific values
* nesting objects
* suppressing properties
## Multi-valued Properties ##
First one first. It seems obvious that if you have a property with
multiple values, it should turn into a JSON array structure. For example:
[] foaf:name "Anna Wilder" ;
foaf:nick "wilding", "wilda" ;
foaf:homepage <http://example.org/about> .
should become something like:
{
"name": "Anna Wilder",
"nick": [ "wilding", "wilda" ],
"homepage": "http://example.org/about"
}
+1
The trouble is that if you determine whether something is an array or
not based on the data that is actually available, you'll get situations
where the value of a particular JSON property is sometimes an array and
sometimes a string; that's bad for predictability for the people using
the API. (RDF/JSON solves this by every value being an array, but that's
counter-intuitive for normal developers.)
So I think a second API-level configuration that needs to be made is to
indicate which properties should be arrays and which not:
<> a api:API ;
api:mapping [
api:property foaf:nick ;
api:name "nick" ;
api:array true ;
] .
So if this is not specified in the mapping then you get the
unpredictable behaviour, but by providing a mapping spec you can force
arrays on single values (though not force singletons on multi-values).
Is that right? If so, OK.
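The rule as I understand it can be sketched like this (Python, with an
illustrative `array_properties` set standing in for the api:array
mapping):

```python
# Sketch of the api:array rule discussed above (the shape of the config
# is illustrative, not a fixed spec): properties flagged as arrays
# always serialize as JSON arrays, even for a single value; unflagged
# properties stay single-valued when there is only one value.
def encode_values(name, values, array_properties):
    if name in array_properties or len(values) > 1:
        return list(values)
    return values[0]
```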
There is a related issue: how to represent RDF lists. There are times
you want ordered property values. At the RDF end the good way to do that
is to use lists (sorry "collections"). I'd argue that a natural
representation of:
<http://example.com/ourpaper>
ex:authors (
<http://example.com/people#Jeni>
<http://example.com/people#Dave>
) .
is
{
"id" : "http://example.com/ourpaper",
"authors" : [
"http://example.com/people#Jeni",
"http://example.com/people#Dave"
]
}
The problem is that this looks just the same as the multi-valued case.
We could:
(1) decide not to care, the mapping can't be inverted
(2) keep this mapping but include context information in the outer
wrapper that allows the inversion (in uniform cases)
(3) have a separate list notation:
{
"id" : "http://example.com/ourpaper",
"authors" : { "type" : "list", "value" : [
"http://example.com/people#Jeni",
"http://example.com/people#Dave"
] }
}
My preference is (2) because I think lists are really useful and should
be as simple as possible in the JSON translation but think (3) is
technically cleaner.
## Typed Values and Languages ##
Typed values and values with languages are really the same problem.
Not sure I agree with this, see later.
If
we have something like:
<http://statistics.data.gov.uk/id/local-authority-district/00PB>
skos:prefLabel "The County Borough of Bridgend"@en ;
skos:prefLabel "Pen-y-bont ar Ogwr"@cy ;
skos:notation "00PB"^^geo:StandardCode ;
skos:notation "6405"^^transport:LocalAuthorityCode .
then we'd really want the JSON to look something like:
{
"$": "http://statistics.data.gov.uk/id/local-authority-district/00PB",
"name": "The County Borough of Bridgend",
"welshName": "Pen-y-bont ar Ogwr",
"onsCode": "00PB",
"dftCode": "6405"
}
I think that for this to work, the configuration needs to be able to
filter values based on language or datatype to determine the JSON
property name. Something like:
<> a api:JSON ;
api:mapping [
api:property skos:prefLabel ;
api:lang "en" ;
api:name "name" ;
] , [
api:property skos:prefLabel ;
api:lang "cy" ;
api:name "welshName" ;
] , [
api:property skos:notation ;
api:datatype geo:StandardCode ;
api:name "onsCode" ;
] , [
api:property skos:notation ;
api:datatype transport:LocalAuthorityCode ;
api:name "dftCode" ;
] .
Neat but ...
Language codes are effectively open-ended. I can't necessarily predict
what lang codes are going to be in my data and provide a property
mapping for every single one.
Plus when working with language-tagged data you often have code to do a
"best match" (not simple lookup) between the user's language preferences
and the available lang tags. That looks hard if each is in a different
property and the lang tags themselves are hidden in the API configuration.
I think we may need the long-winded encoding available:
{
"id" : "http://statistics.data.gov.uk/id/local-authority-district/00PB",
"prefLabel" : [
"The County Borough of Bridgend",
{ "value" : "The County Borough of Bridgend", "lang" : "en" },
{ "value" : "Pen-y-bont ar Ogwr", "lang : "cy" }
]
...
Then it would be up to the publisher whether to provide the simpler
properties as well, or instead. But those could be regarded as
transformations of the RDF for convenience (much like choosing to
include RDFS closure info).
Turning to data types ...
Your onsCode examples are a particular pattern for how to use datatypes
which are indeed a similar case to lang tags. But how are you thinking
of handling the common cases like the XSD types?
I'm assuming that the number formats would all become JSON numbers
rather than strings, right? That loses the distinction between, say,
xsd:decimal and xsd:float, but JavaScript doesn't care about that, and
if we are not doing an interchange format that's OK.
For things like xsd:dateTime there seem to be a couple of options. The
Simile-style option would be to have them as strings but define the
range of the property in some associated context/properties table.
The other would be to use a structured representation:
{
"id" : "http://example.com/ourpaper",
"date" : { "type" : date, "value" : "20091312"}
...
I'm guessing you would just have them as strings and let the consumer
figure out when they want to treat them as dates, is that right?
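If they do come through as plain strings, opting in to date handling is
cheap on the consumer side. For example in Python (assuming the values
are well-formed xsd:dateTime, which is ISO 8601; the sample value here
is mine):

```python
from datetime import datetime

# xsd:dateTime lexical forms are ISO 8601, so a consumer who wants real
# date objects can parse on demand; everyone else keeps the string.
when = datetime.fromisoformat("2009-12-13T00:00:00+00:00")
```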
## Nesting Objects ##
Regarding nested objects, I'm again inclined to view this as a
configuration option rather than something that is based on the
available data. For example, if we have:
<http://example.org/about>
dc:title "Anna's Homepage"@en ;
foaf:maker <http://example.org/anna> .
<http://example.org/anna>
foaf:name "Anna Wilder" ;
foaf:homepage <http://example.org/about> .
this could be expressed in JSON as either:
{
"$": "http://example.org/about",
"title": "Anna's Homepage",
"maker": {
"$": "http://example.org/anna",
"name": "Anna Wilder",
"homepage": "http://example.org/about"
}
}
or:
{
"$": "http://example.org/anna",
"name": "Anna Wilder",
"homepage": {
"$": "http://example.org/about",
"title": "Anna's Homepage",
"maker": "http://example.org/anna"
}
}
Or:
[
{
"id": "http://example.org/about",
"title": "Anna's Homepage",
"maker": "http://example.org/anna"
},
{
"id": "http://example.org/anna",
"name": "Anna Wilder",
"homepage": "http://example.org/about"
}
]
The one that's required could be indicated through the configuration,
for example:
<> a api:API ;
api:mapping [
api:property foaf:maker ;
api:name "maker" ;
api:embed true ;
] .
My zero-configuration default would be to nest single-referenced bNodes
and have everything else as top level resources with cross-references,
as above.
The final thought that I had for representing RDF graphs as JSON was
about suppressing properties. Basically I'm thinking that this
configuration should work on any graph, most likely one generated from a
DESCRIBE query. That being the case, it's likely that there will be
properties that repeat information (because, for example, they are a
super-property of another property). It will make a cleaner JSON API if
those repeated properties aren't included. So something like:
<> a api:API ;
api:mapping [
api:property admingeo:contains ;
api:ignore true ;
] .
Seems reasonable but seems a separate issue from the JSON encoding.
# SPARQL Results #
I'm inclined to think that creating JSON representations of SPARQL
results that are acceptable to normal developers is less important than
creating JSON representations of RDF graphs, for two reasons:
1. SPARQL naturally gives short, usable, names to the properties in
JSON objects
2. You have to be using SPARQL to create them anyway, and if you're
doing that then you can probably grok the extra complexity of having
values that are objects
+1
Nevertheless, there are two things that could be done to simplify the
SPARQL results format for normal developers.
One would be to just return an array of the results, rather than an
object that contains a results property that contains an object with a
bindings property that contains an array of the results. People who want
metadata can always request the standard SPARQL results JSON format.
This seems quite minor; it's very easy to do the deref.
The second would be to always return simple values rather than objects.
For example, rather than:
{
"head": {
"vars": [ "book", "title" ]
},
"results": {
"bindings": [
{
"book": {
"type": "uri",
"value": "http://example.org/book/book6"
},
"title": {
"type": "literal",
"value", "Harry Potter and the Half-Blood Prince"
}
},
{
"book": {
"type": "uri",
"value": "http://example.org/book/book5"
},
"title": {
"type": "literal",
"value": "Harry Potter and the Order of the Phoenix"
}
},
...
]
}
}
a normal developer would want to just get:
[{
"book": "http://example.org/book/book6",
"title": "Harry Potter and the Half-Blood Prince"
},{
"book": "http://example.org/book/book5",
"title": "Harry Potter and the Order of the Phoenix"
},
...
]
I don't think we can do any configuration here. It means that
information about datatypes and languages isn't visible, but (a) I'm
pretty sure that 80% of the time that doesn't matter, (b) there's always
the full JSON version if people need it and (c) they could write SPARQL
queries that used the datatype/language to populate different
variables/properties if they wanted to.
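The flattening itself is a tiny transform over the standard SPARQL
results JSON format. A sketch in Python (function name is mine):

```python
# Sketch of the simplification discussed above: collapse the standard
# SPARQL results JSON into a bare array of objects with plain values,
# deliberately discarding the type/datatype/language information.
def simplify_sparql_json(results):
    return [
        {var: binding[var]["value"] for var in binding}
        for binding in results["results"]["bindings"]
    ]
```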
+1
So there you are. I'd really welcome any thoughts or pointers about any
of this: things I've missed, vocabularies we could reuse, things that
you've already done along these lines, and so on. Reasons why none of
this is necessary are fine too, but I'll warn you in advance that I'm
unlikely to be convinced ;)
Thanks so much for getting this started and kicking off with such
detailed suggestions.
Cheers,
Dave
[1] The data model is described at:
http://simile.mit.edu/wiki/Exhibit/Understanding_Exhibit_Database
The JSON page is unhelpful!
http://simile.mit.edu/wiki/Exhibit/Understanding_Exhibit_JSON_Format
But there is some documentation:
http://simile.mit.edu/wiki/Exhibit/Creating,_Importing,_and_Managing_Data
[2] http://simile.mit.edu/babel/
[3] http://data-gov.tw.rpi.edu/ws/sparqlproxy.php