> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Friday, March 05, 2010 08:48 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> On Fri, Mar 5, 2010 at 6:25 PM, Houghton,Andrew <hough...@oclc.org>
> wrote:
> 
> > OK, I will bite, you stated:
> >
> > 1. That large datasets are a problem.
> > 2. That streaming APIs are a pain to deal with.
> > 3. That tool sets have memory constraints.
> >
> > So how do you propose to process large JSON datasets that:
> >
> > 1. Comply with the JSON specification.
> > 2. Can be read by any JavaScript/JSON processor.
> > 3. Do not require the use of streaming API.
> > 4. Do not exceed the memory limitations of current JSON processors.
> >
> >
> What I'm proposing is that we don't process large JSON datasets; I'm
> proposing that we process smallish JSON documents one at a time by
> pulling them out of a stream based on an end-of-record character.
> 
> This is basically what we use for MARC21 binary format -- have a
> defined structure for a valid record, and separate multiple well-formed
> record structures with an end-of-record character. This preserves JSON
> specification adherence at the record level and uses a different scheme
> to represent collections. Obviously, MARC-XML uses a different
> mechanism to define a collection of records -- putting well-formed
> record structures inside a <collection> tag.
> 
> So... I'm proposing we define what we mean by a single MARC record
> serialized to JSON (in whatever format; I'm not very opinionated 
> on this point) that preserves the order, indicators, tags, data, 
> etc. we need to round-trip between marc21binary, marc-xml, and 
> marc-json.
> 
> And then separate those valid records with an end-of-record character
> -- "\n".

OK, what I see here are divergent use cases and a willingness in the library 
community to break existing Web standards.  This is how the library community 
makes its data harder to use: library-centric protocols and standards raise 
the barrier for people and organizations trying to enter its market.

If I were to try to sell this idea to the Web community at large, and tell 
them that when they send an HTTP request with an Accept: application/json 
header to our services, our services will respond with a 200 HTTP status and 
deliver malformed JSON, I would be immediately impaled with multiple arrows 
and daggers :(  Not to mention that OCLC would be disparaged by a certain 
crowd in their blogs as idiots who cannot follow standards.

OCLC's goal is to use and conform to Web standards so that library data is 
easier to use for people and organizations outside the library community; 
otherwise libraries and their data will become irrelevant.  The JSON 
serialization is a standard, and the Web community expects that when they 
make an HTTP request with an Accept: application/json header, they will get 
back JSON that conforms to the standard.  JSON's main use case is AJAX 
scenarios, where you are not supposed to be sending megabytes of data across 
the wire.
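
For example, virtually every AJAX consumer does something like the sketch 
below (the service URL is hypothetical); it breaks the moment the response 
body is not one well-formed JSON value:

    // Standard AJAX consumption: the whole response body must be one
    // well-formed JSON value, or JSON.parse throws a SyntaxError.
    var xhr = new XMLHttpRequest();
    xhr.open("GET", "/webservices/marc/records?q=twain", true);
    xhr.setRequestHeader("Accept", "application/json");
    xhr.onreadystatechange = function () {
      if (xhr.readyState === 4 && xhr.status === 200) {
        var records = JSON.parse(xhr.responseText);
        // ... use records ...
      }
    };
    xhr.send(null);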

Your proposal is asking me to break a widely deployed Web standard that AJAX 
frameworks use to access millions (OK, many) of Web sites.

> Unless I've read all this wrong, you've come to the conclusion that the
> benefit of having a JSON serialization that is valid JSON at both the
> record and collection level outweighs the pain of having to deal with
> a streaming parser and writer.  This allows a single collection to be
> treated as any other JSON document, which has obvious benefits (which 
> I certainly don't mean to minimize) and all the drawbacks we've been 
> talking about *ad nauseam*.

The goal is to adhere to existing Web standards, and your underlying 
assumption is that you can or will be retrieving large datasets through an 
AJAX scenario.  As I pointed out, this is really an API design issue: given 
the way AJAX works, you should never design an API in that manner.  Under 
the caveat of a well designed API, your assumption that you can or will be 
retrieving large datasets through an AJAX scenario is false.  Therefore you 
will never be put into a scenario requiring the use of JSON streaming, so 
from this point of view your argument is moot.
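
This is also why a well designed API pages its results: the client asks for 
one small, spec-compliant JSON document at a time.  A sketch, with 
hypothetical parameter names:

    // Paged retrieval: every response is a small, valid JSON document,
    // so no single request ever approaches the processor's memory
    // limits.  The start/count parameter names are hypothetical.
    function getPage(start, count, callback) {
      var xhr = new XMLHttpRequest();
      xhr.open("GET", "/webservices/marc/records?q=twain&start=" +
                      start + "&count=" + count, true);
      xhr.setRequestHeader("Accept", "application/json");
      xhr.onreadystatechange = function () {
        if (xhr.readyState === 4 && xhr.status === 200) {
          callback(JSON.parse(xhr.responseText));
        }
      };
      xhr.send(null);
    }

    getPage(0, 50, function (records) { /* first page of 50 records */ });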

But for argument's sake, let's say you could retrieve a line-delimited list 
of JSON objects.  You can no longer use any existing AJAX framework to get 
that JSON back, since it is malformed.  You could use the AJAX framework's 
XMLHTTP object to retrieve the line-delimited list of JSON objects, but this 
still doesn't help, because the XMLHTTP object will keep the entire response 
in memory.

So when our service sends the user agent 100MB of line-delimited JSON 
objects, the XMLHTTP object is going to try to slurp the entire 100MB HTTP 
response into memory.  That will exceed the memory limits of the 
JSON/JavaScript processor or of the browser controlling the XMLHTTP object, 
and the application will never get the chance to process the response one 
record per line.
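
The best you could do is the sketch below, and it only demonstrates the 
problem: by the time you can split on newlines, the entire response is 
already sitting in memory.

    // Attempting to consume line-delimited JSON through XMLHTTP.  The
    // responseText property is only usable after the browser has
    // buffered the ENTIRE response, so the per-line parsing comes too
    // late to save any memory.
    var xhr = new XMLHttpRequest();
    xhr.open("GET", "/webservices/marc/records?q=twain", true);
    xhr.onreadystatechange = function () {
      if (xhr.readyState === 4 && xhr.status === 200) {
        var lines = xhr.responseText.split("\n"); // whole body in memory
        for (var i = 0; i < lines.length; i++) {
          if (lines[i].length > 0) {
            var record = JSON.parse(lines[i]); // one record per line
            // ... process record ...
          }
        }
      }
    };
    xhr.send(null);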

In addition, I wouldn't be surprised if whatever programming libraries or 
frameworks you use to read a line from the stream have trouble with lines 
longer than several thousand characters, a length that MARC-21 records 
serialized into JSON could easily exceed.

> I go the other way. I think the pain of dealing with a streaming
> API outweighs the benefits of having a single valid JSON structure for 
> a collection, and instead have put forward that we use a combination 
> of JSON records and a well-defined end-of-record character ("\n") to 
> represent a collection.  I recognize that this involves providing 
> special-purpose code which must call for JSON-deserialization on each 
> line, instead of being able to throw the whole stream/file/whatever 
> at your json parser. I accept that because getting each line of a 
> text file is something I find easy compared to dealing with streaming
> parsers.

What I see here are divergent use cases; each is sketched with a sample 
payload after the list:

Use case #1: retrieve a single MARC-21 format record serialized as an object 
according to the JSON specification.

Use case #2: retrieve a collection of MARC-21 format records serialized as an 
array according to the JSON specification.

Use case #3: retrieve a collection of MARC-21 format records serializing each 
record as an object according to the JSON specification with the restriction 
that all whitespace tokens are converted to spaces and each JSON object is 
terminated by a newline.
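
To make the divergence concrete, here is roughly what each use case puts on 
the wire.  The record structure shown is illustrative only, not our actual 
MARC-JSON schema:

    Use case #1 -- one record, one valid JSON value:

      {"leader": "00714cam a2200205 a 4500", "fields": ["..."]}

    Use case #2 -- a collection, still one valid JSON value:

      [{"leader": "...", "fields": ["..."]},
       {"leader": "...", "fields": ["..."]}]

    Use case #3 -- newline-delimited objects; each line is valid JSON,
    but the stream as a whole is not:

      {"leader": "...", "fields": ["..."]}
      {"leader": "...", "fields": ["..."]}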

Personally, I have some minor issues with use case #3 in that it requires 
the entire serialization to be on one line.  Programming libraries and 
frameworks often have issues when line lengths exceed certain buffer sizes.  
In addition, collapsing each record onto a single line makes the stream 
difficult for humans to read when things eventually do go wrong and need 
human intervention.  Alternatives to serializing each object to a single 
line would be to use VT (vertical tab), FF (form feed), or a double newline 
to terminate the serialized objects, as sketched below.
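
With a double newline as the terminator, for instance, records could stay 
pretty-printed and still be machine-splittable (again, the record structure 
is illustrative only):

    {
      "leader": "00714cam a2200205 a 4500",
      "fields": ["..."]
    }

    {
      "leader": "...",
      "fields": ["..."]
    }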

Another issue with use case #3 is that it is primarily a file format, to be 
read by library-centric tool chains that feed the individual objects to a 
JSON/JavaScript processor.  In AJAX scenarios, use case #3 works no 
differently from, and provides no advantage over, use case #2, because both 
are limited by the memory constraints of the JSON/JavaScript processor: if 
you can keep use case #2 in memory, you can keep use case #3 in memory.  A 
further disadvantage of use case #3 is that it cannot use existing AJAX 
frameworks to deserialize its JSON objects; each application must build its 
own infrastructure to deserialize the line-delimited objects.
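
That infrastructure is small but real: every consumer ends up writing 
something like the following sketch (Node-style, with a hypothetical file 
name).

    // The special-purpose reader every consumer of use case #3 must
    // build for itself: read one line, JSON-parse it, hand the record
    // to the rest of the tool chain, repeat.
    var fs = require("fs");
    var readline = require("readline");

    var rl = readline.createInterface({
      input: fs.createReadStream("records.mrj") // hypothetical file
    });

    rl.on("line", function (line) {
      if (line.length > 0) {
        var record = JSON.parse(line); // one MARC-JSON object per line
        // ... feed the record to the rest of the tool chain ...
      }
    });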

Use cases #2 and #3 diverge because of standards-compliance expectations, so 
the question becomes: how can use case #3 be made standards compliant?  It 
seems to me that use case #3 defines a different media type than use cases 
#1 and #2, whose media type is defined by the JSON specification.  One way 
to fix this is to say that use cases #1 and #2 conform to the media type 
application/json, while use case #3 conforms to a new media type, say 
application/marc+json.  This new application/marc+json media type becomes a 
library-centric standard, and it avoids breaking a widely deployed Web 
standard.
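
A service could then negotiate the two cleanly, as in this sketch (the 
marc+json type here is the proposal above, not a registered type):

    // Serve the spec-compliant array for application/json, and the
    // line-delimited form only when the client explicitly asks for the
    // proposed application/marc+json media type.
    var http = require("http");

    http.createServer(function (req, res) {
      var accept = req.headers.accept || "";
      if (accept.indexOf("application/marc+json") !== -1) {
        res.writeHead(200, {"Content-Type": "application/marc+json"});
        res.end('{"leader":"..."}\n{"leader":"..."}\n'); // one per line
      } else {
        res.writeHead(200, {"Content-Type": "application/json"});
        res.end('[{"leader":"..."},{"leader":"..."}]'); // one JSON value
      }
    }).listen(8080);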

Given the above discussion, use cases #1 and #2 are already defined by our 
MARC-JSON serialization format and meet existing standards; no changes to 
our existing specification are required.  Our MARC-JSON serialization of an 
object (a MARC-21 record) could be used for use case #3 with the restriction 
that, under your current proposal, the only whitespace tokens allowed in 
serialized objects are spaces.  Use case #3 can be satisfied by an alternate 
specification which defines a new media type and a suggested file extension, 
e.g., application/marc+json and .mrj, just as RFC 2220 defines 
application/marc and .mrc.


Andy.
