Re: [CODE4LIB] Q: XML2JSON converter

Benjamin Young Mon, 08 Mar 2010 06:33:38 -0800

On 3/6/10 6:59 PM, Houghton,Andrew wrote:

From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Bill Dueber
Sent: Saturday, March 06, 2010 05:11 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Q: XML2JSON converter


Anyway, hopefully, it won't be a huge surprise that I don't disagree
with any of the quote above in general; I would assert, though, that
application/json and application/marc+json should both return JSON
(in the same way that text/xml, application/xml, and
application/marc+xml can all be expected to return XML).
Newline-delimited json is starting to crop up in a few places
(e.g. couchdb) and should probably have its own mime type
and associated extension. So I would say something like:

application/json -- return json (obviously)
application/marc+json  -- return json
application/marc+ndj  -- return newline-delimited json

This sounds like consensus on how to deal with newline-delimited JSON in a 
standards based manner.

I'm not familiar with CouchDB, but I am using MongoDB which is similar.  I'll 
have to dig into how they deal with this newline-delimited JSON.  Can you 
provide any references to get me started?

Rather than using a newline-delimited format (the whole of which would not together be considered a valid JSON object), why not use the JSON array format with or without new lines? Something like:


[{"key":"value"}, {"key":"value"}]

You could include newline delimiters after the "," if you needed to make pre-parsing easier (in a streaming context), but you may be able to get away with just looking for the next "," or "]" after each valid JSON object.

That would allow the entire stream, if desired, to be saved to disk and read in as a single JSON object, or the same API to serve smaller JSON collections in a JSON standard way.
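
As a rough illustration of the streaming idea above, here is a minimal Python sketch that pulls objects out of a top-level JSON array one at a time. The function name iter_array_items and the sample data are made up for this example, and the separator handling is deliberately loose (it would also tolerate stray commas), so treat it as a sketch rather than a validating parser:

```python
import json


def iter_array_items(text):
    """Yield each element of a top-level JSON array, one at a time."""
    decoder = json.JSONDecoder()
    pos = text.index("[") + 1  # start just past the opening bracket
    while True:
        # Skip whitespace (including optional newlines) and separating commas.
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text) or text[pos] == "]":
            return
        # raw_decode parses one JSON value and reports where it ended.
        item, pos = decoder.raw_decode(text, pos)
        yield item


# Hypothetical sample stream with a newline after the comma.
stream = '[{"key":"value"},\n{"key":"value2"}]'

records = list(iter_array_items(stream))  # consumed record by record
whole = json.loads(stream)                # or read in as one JSON document
```

The same text can be consumed either way, which is the point of preferring the array form over a newline-delimited one.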

CouchDB uses this array notation when returning multiple document revisions in one request. CouchDB also offers a slightly more annotated structure (which might be useful with streaming as well):


{
  "total_rows": 2,
  "offset": 0,
  "rows":[{"key":"value"}, {"key":"value"}]
}

Rows here plays the same role as the above array-based format, but provides an initial row count for the consumer to use (if it wants) for knowing what's ahead. The "offset" key is specific to CouchDB, but similar application-specific information could be stored in the "header" of the JSON object using this method.
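
A small Python sketch of producing that kind of envelope, assuming the records are already in memory. The helper name wrap_rows and its defaults are hypothetical and not part of any CouchDB API; only the three key names come from the example above:

```python
import json


def wrap_rows(rows, total_rows=None, offset=0):
    """Wrap a list of records in a header-plus-rows envelope."""
    if total_rows is None:
        # Assumption: with no paging, the total is just the rows we hold.
        total_rows = len(rows)
    return {"total_rows": total_rows, "offset": offset, "rows": rows}


envelope = wrap_rows([{"key": "value"}, {"key": "value"}])
serialized = json.dumps(envelope)

# A consumer can read the count first, then walk the rows.
parsed = json.loads(serialized)
row_count = parsed["total_rows"]
```

Because "total_rows" is written before "rows", a streaming consumer sees the count before the first record arrives.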

In all cases, we should agree on a standard record serialization,
though, and the pure-json returns should include something that
indicates what the heck it is (hopefully a URI that can act as a
distinct "namespace"-type identifier, including a version in it).

I agree that our MARC-JSON serialization needs some "namespace" identifier in 
it and it occurred to me that the way it is handling indicators, e.g., ind1 and ind2 
properties, might be better handled as an array to accommodate IFLA's MARC-XML-ish where 
they can have from 1-9 indicator values.

BTW, our MARC-JSON content is specified in Unicode not MARC-8, per the JSON 
standard, which means you need to use \uXXXX notation to specify characters in 
strings, not sure I made that clear in earlier posts.  A downside to the 
current ECMA 262 specification is that it doesn't support \U00XXXXXX, as Python 
does, for the extended characters.  Hopefully that will get rectified in a 
future ECMA 262 specification.
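
For what it's worth, characters outside the Basic Multilingual Plane can still be written with \uXXXX notation by using a UTF-16 surrogate pair, which is how JSON escapes them. A minimal Python sketch (the sample character, U+1D11E MUSICAL SYMBOL G CLEF, is chosen only for illustration):

```python
import json

# Python source can use the 8-digit \U escape for an extended character.
original = "\U0001D11E"

# With the default ensure_ascii=True, json.dumps emits only ASCII,
# writing the character as a pair of \uXXXX surrogate escapes.
encoded = json.dumps(original)

# Decoding recombines the surrogate pair into the single character.
decoded = json.loads(encoded)
```

Here encoded contains two \uXXXX escapes and decoded is the original one-character string again.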

The question for me, I think, is whether within this community,  anyone
who provides one of these types (application/marc+json and
application/marc+ndj) should automatically be expected to provide both.
I don't have an answer for that.

As far as mime-type declarations go in general, I'd recommend avoiding any format-specific mime types and sticking to the application/json format and providing document-level hints (if needed) for the content type. If you do find a need for the special-case mime types, I'd recommend still responding to Accept: application/json whenever possible--for the sake of standards. :)
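
A rough Python sketch of that fallback, assuming a server that offers the application/marc+json type proposed earlier in this thread alongside plain application/json. The function name choose_media_type is made up, and the parsing is deliberately naive: it ignores q-values and only understands the two wildcards shown.

```python
def choose_media_type(accept_header,
                      offered=("application/marc+json", "application/json")):
    """Pick a response type from an Accept header (no q-value support)."""
    # Drop any parameters such as ";q=0.8" and normalize case.
    accepted = [part.split(";")[0].strip().lower()
                for part in accept_header.split(",")]
    for media_type in accepted:
        if media_type in offered:
            return media_type
        if media_type in ("*/*", "application/*"):
            # Wildcards fall back to the generic, standard type.
            return "application/json"
    return None  # nothing acceptable; a server might answer 406 here


plain = choose_media_type("application/json")
specific = choose_media_type("application/marc+json, application/json;q=0.8")
unsupported = choose_media_type("text/html")
```

A client that only knows application/json still gets JSON back, while one that asks for the special-case type first gets that instead.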

All told, I'm just glad to see this discussion being had. I'll be happy to provide some CouchDB test cases (replication, etc.) if that's of interest to anyone.


Thanks,
Benjamin

I think this issue gets into familiar territory when dealing with RDF formats.  
Let's see, there is N3, NT, XML, Turtle, etc.  Do you need to provide all of 
them?  No, but it's nice of the server to at least provide NT or Turtle and 
XML.  Ultimately it's up to the server.  But the only difference between use 
cases #2 and #3 is whether the output is wrapped in an array, so it's probably 
easy for the server to produce both.

Depending on how much time I get next week I'll talk with the developer network 
folks to see what I need to do to put a specification under their 
infrastructure.  Looks like from my schedule it's going to be another week of 
hell :(


Andy.
