On 3/6/10 6:59 PM, Houghton,Andrew wrote:
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Bill Dueber
Sent: Saturday, March 06, 2010 05:11 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Q: XML2JSON converter
Anyway, hopefully, it won't be a huge surprise that I don't disagree
with any of the quote above in general; I would assert, though, that
application/json and application/marc+json should both return JSON
(in the same way that text/xml, application/xml, and
application/marc+xml can all be expected to return XML).
Newline-delimited json is starting to crop up in a few places
(e.g. couchdb) and should probably have its own mime type
and associated extension. So I would say something like:
application/json -- return json (obviously)
application/marc+json -- return json
application/marc+ndj -- return newline-delimited json
This sounds like consensus on how to deal with newline-delimited JSON in a
standards-based manner.
I'm not familiar with CouchDB, but I am using MongoDB which is similar. I'll
have to dig into how they deal with this newline-delimited JSON. Can you
provide any references to get me started?
Rather than using a newline-delimited format (the whole of which would
not together be considered a valid JSON object) why not use the JSON
array format with or without new lines? Something like:
[{"key":"value"}, {"key":"value"}]
You could include newline delimiters after the "," if you needed to
make pre-parsing easier (in a streaming context), but you may be able to get
away with just looking for the next "," or "]" after each valid JSON object.
That would allow the entire stream, if desired, to be saved to disk and
read in as a single JSON object, or the same API to serve smaller JSON
collections in a JSON standard way.
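
As a rough illustration of reading such an array one object at a time, here is a minimal Python sketch. It assumes the whole payload is already in memory as a string (a real streaming reader would have to buffer chunks), and the function name iter_array_items is made up for this example. It relies on json.JSONDecoder.raw_decode from the standard library, which parses one value and reports where it stopped. It is lenient about separators, so it is not a validator.

```python
import json


def iter_array_items(text):
    """Yield each element of a top-level JSON array, one at a time."""
    decoder = json.JSONDecoder()
    pos = 0
    # Skip leading whitespace, then require the opening bracket.
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if pos >= len(text) or text[pos] != "[":
        raise ValueError("expected a JSON array")
    pos += 1
    while True:
        # Skip whitespace (including newlines) and commas between elements.
        while pos < len(text) and (text[pos].isspace() or text[pos] == ","):
            pos += 1
        if pos >= len(text):
            raise ValueError("unterminated array")
        if text[pos] == "]":
            return
        # raw_decode parses one JSON value starting at pos and returns
        # the value together with the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


payload = '[{"key":"value"},\n {"key":"value"}]'
for record in iter_array_items(payload):
    print(record)
```

The same payload can still be handed to json.loads() in one go, which is the point of staying within the array format.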
CouchDB uses this array notation when returning multiple document
revisions in one request. CouchDB also offers a slightly more annotated
structure (which might be useful with streaming as well):
{
"total_rows": 2,
"offset": 0,
"rows":[{"key":"value"}, {"key":"value"}]
}
The "rows" array here plays the same role as the array-based format above, but
provides an initial row count for the consumer to use (if it wants) for
knowing what's ahead. The "offset" key is specific to CouchDB, but
similar application-specific information could be stored in the "header"
of the JSON object using this method.
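
A small Python sketch of how a consumer might read that wrapped structure. The function name read_envelope is hypothetical, and the count is treated only as a hint, since a server may return fewer rows than the total it reports.

```python
import json


def read_envelope(text):
    """Return (row-count hint, list of rows) from a wrapped JSON response."""
    doc = json.loads(text)
    # "total_rows" is optional header information; the consumer may ignore it.
    hint = doc.get("total_rows")
    rows = doc["rows"]
    return hint, rows


response = '{"total_rows": 2, "offset": 0, "rows": [{"key": "value"}, {"key": "value"}]}'
hint, rows = read_envelope(response)
print(hint, len(rows))
```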
In all cases, we should agree on a standard record serialization,
though, and the pure-json returns should include something that
indicates what the heck it is (hopefully a URI that can act as a
distinct "namespace"-type identifier, including a version in it).
I agree that our MARC-JSON serialization needs some "namespace" identifier in
it. It also occurred to me that the way it handles indicators, e.g., the ind1 and ind2
properties, might be better expressed as an array to accommodate IFLA's MARC-XML-ish, where
there can be from 1 to 9 indicator values.
BTW, our MARC-JSON content is specified in Unicode, not MARC-8, per the JSON
standard, which means you need to use \uXXXX notation to specify characters in
strings; I'm not sure I made that clear in earlier posts. A downside of the
current ECMA 262 specification is that it doesn't support \U00XXXXXX, as Python
does, for the extended characters. Hopefully that will get rectified in a
future ECMA 262 specification.
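
For what it's worth, JSON can still carry characters beyond U+FFFF with \uXXXX alone, by writing them as a UTF-16 surrogate pair. A minimal Python sketch using only the standard json module (the helper name json_escape is made up for this example):

```python
import json


def json_escape(s):
    """Serialize a string as ASCII-only JSON, using \\uXXXX escapes."""
    return json.dumps(s, ensure_ascii=True)


g_clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the 16-bit range
escaped = json_escape(g_clef)
print(escaped)  # the character is written as two \uXXXX escapes

# Parsing the escaped form gives back the single original character.
assert json.loads(escaped) == g_clef
```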
The question for me, I think, is whether within this community, anyone
who provides one of these types (application/marc+json and
application/marc+ndj) should automatically be expected to provide both.
I don't have an answer for that.
As far as mime-type declarations go in general, I'd recommend avoiding
any format-specific mime types and sticking to the application/json
format and providing document-level hints (if needed) for the content
type. If you do find a need for the special-case mime types, I'd
recommend still responding to Accept: application/json whenever
possible--for the sake of standards. :)
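
A rough Python sketch of that fallback. It assumes the two media types proposed earlier in this thread (application/marc+json and application/marc+ndj, neither of which is registered), ignores q-values for brevity, and uses a made-up function name, choose_media_type.

```python
# Media types this hypothetical server can produce; the marc+ types are
# the ones proposed in this thread, not registered types.
SUPPORTED = ("application/marc+json", "application/marc+ndj", "application/json")


def choose_media_type(accept_header):
    """Pick a response type from an Accept header, or None if nothing fits.

    Simplification: entries are tried in the order listed and q-values
    are ignored.
    """
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in SUPPORTED:
            return media_type
        if media_type in ("*/*", "application/*"):
            # Wildcards fall back to plain application/json.
            return "application/json"
    return None


print(choose_media_type("application/json"))
print(choose_media_type("text/html, application/marc+ndj;q=0.8"))
```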
All told, I'm just glad to see this discussion being had. I'll be happy
to provide some CouchDB test cases (replication, etc) if that's of
interest to anyone.
Thanks,
Benjamin
I think this issue gets into familiar territory when dealing with RDF formats.
Let's see, there is N3, NT, XML, Turtle, etc. Do you need to provide all of
them? No, but it's nice of the server to at least provide NT or Turtle and
XML. Ultimately it's up to the server. But the only difference between use
cases #2 and #3 is whether the output is wrapped in an array, so it's probably
easy for the server to produce both.
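
To show how small that difference is, here is a minimal Python sketch that writes the same records both ways. The function names to_json_array and to_ndj are invented for this example, and the records are placeholders rather than real MARC-JSON.

```python
import json


def to_json_array(records):
    """Serialize records as a single JSON array."""
    return json.dumps(records)


def to_ndj(records):
    """Serialize records as newline-delimited JSON, one object per line.

    json.dumps without indent never emits a raw newline (newlines inside
    strings are escaped), so each record stays on one line.
    """
    return "".join(json.dumps(record) + "\n" for record in records)


records = [{"key": "value"}, {"key": "value"}]
print(to_json_array(records))
print(to_ndj(records), end="")
```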
Depending on how much time I get next week, I'll talk with the developer network
folks to see what I need to do to put a specification under their
infrastructure. From my schedule, it looks like it's going to be another week of
hell :(
Andy.