Houghton,Andrew
Sat, 06 Mar 2010 10:58:32 -0800
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Friday, March 05, 2010 08:48 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
>
> On Fri, Mar 5, 2010 at 6:25 PM, Houghton,Andrew <hough...@oclc.org>
> wrote:
>
> > OK, I will bite, you stated:
> >
> > 1. That large datasets are a problem.
> > 2. That streaming APIs are a pain to deal with.
> > 3. That tool sets have memory constraints.
> >
> > So how do you propose to process large JSON datasets that:
> >
> > 1. Comply with the JSON specification.
> > 2. Can be read by any JavaScript/JSON processor.
> > 3. Do not require the use of streaming API.
> > 4. Do not exceed the memory limitations of current JSON processors.
>
> What I'm proposing is that we don't process large JSON datasets; I'm
> proposing that we process smallish JSON documents one at a time by
> pulling them out of a stream based on an end-of-record character.
>
> This is basically what we use for MARC21 binary format -- have a
> defined structure for a valid record, and separate multiple well-formed
> record structures with an end-of-record character. This preserves JSON
> specification adherence at the record level and uses a different scheme
> to represent collections. Obviously, MARC-XML uses a different
> mechanism to define a collection of records -- putting well-formed
> record structures inside a <collection> tag.
>
> So... I'm proposing define what we mean by a single MARC record
> serialized to JSON (in whatever format; I'm not very opinionated
> on this point) that preserves the order, indicators, tags, data,
> etc. we need to round-trip between marc21binary, marc-xml, and
> marc-json.
>
> And then separate those valid records with an end-of-record character
> -- "\n".
Ok, what I see here are divergent use cases and the willingness of the library
community to break existing Web standards. This is how the library community
makes it more difficult to use their data and places additional barriers for
people and organizations to enter their market because of these library centric
protocols and standards.
If I were to try to sell this idea to the Web community, at large, and tell
them that when they send an HTTP request with an Accept: application/json
header to our services, our services will respond with a 200 HTTP status and
deliver them malformed JSON, I would be immediately impaled with multiple
arrows and daggers :( Not to mention that OCLC would be disparaged by a certain
crowd in their blogs as being idiots who cannot follow standards.
OCLC's goals are to use and conform to Web standards to make library data easier
to use by people or organizations outside the library community, otherwise
libraries and their data will become irrelevant. The JSON serialization is a
standard and the Web community expects that when they make HTTP requests with
an Accept: application/json header that they will get back JSON conforming
to the standard. JSON's main use case is in AJAX scenarios where you are not
supposed to be sending megabytes of data across the wire.
Your proposal is asking me to break a widely deployed Web standard that is used
by AJAX frameworks and to access millions (ok, many) Web sites.
> Unless I've read all this wrong, you've come to the conclusion that the
> benefit of having a JSON serialization that is valid JSON at both the
> record and collection level outweighs the pain of having to deal with
> a streaming parser and writer. This allows a single collection to be
> treated as any other JSON document, which has obvious benefits (which
> I certainly don't mean to minimize) and all the drawbacks we've been
> talking about *ad nauseam*.
The goal is to adhere to existing Web standards and your underlying assumption
is that you can or will be retrieving large datasets through an AJAX scenario.
As I pointed out this is more an API design issue and due to the way AJAX works
you should never design an API in that manner. Your assumption that you can or
will be retrieving large datasets through an AJAX scenario is false given the
caveat of a well-designed API. Therefore you will never be put into the
scenario requiring the use of JSON streaming, so your argument from this point
of view is moot.
But for argument's sake let's say you could retrieve a line delimited list of
JSON objects. You can no longer use any existing AJAX framework for getting
back that JSON since it's malformed. You could use the AJAX framework's
XMLHTTP to retrieve this line delimited list of JSON objects, but this still
doesn't help because the XMLHTTP object will keep the entire response in memory.
So when our service sends the user agent 100MB of line delimited JSON objects,
the XMLHTTP object is going to try to slurp the entire 100MB HTTP response into
memory and that is going to exceed the memory limits of the JSON/JavaScript
processor or the browser that is controlling the XMLHTTP object, and the
application will never get to process the objects one per line.
In addition, I wouldn't be surprised if whatever programming libraries or
frameworks you use to read a line from the stream will have issues reading
lines longer than several thousand characters which could easily be exceeded by
MARC-21 records serialized into JSON.
> I go the the other way. I think the pain of dealing with a streaming
> API outweighs the benefits of having a single valid JSON structure for
> a collection, and instead have put forward that we use a combination
> of JSON records and a well-defined end-of-record character ("\n") to
> represent a collection. I recognize that this involves providing
> special-purpose code which must call for JSON-deserialization on each
> line, instead of being able to throw the whole stream/file/whatever
> at your json parser is. I accept that because getting each line of a
> text file is something I find easy compared to dealing with streaming
> parsers.
What I see here are divergent use cases:
Use case #1: retrieve a single MARC-21 format record serialized as an object
according to the JSON specification.
Use case #2: retrieve a collection of MARC-21 format records serialized as an
array according to the JSON specification.
Use case #3: retrieve a collection of MARC-21 format records serializing each
record as an object according to the JSON specification with the restriction
that all whitespace tokens are converted to spaces and each JSON object is
terminated by a newline.
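To make the three use cases concrete, here is a minimal Python sketch of the corresponding serializations. The record layout and the function names are hypothetical and purely illustrative; they are not the MARC-JSON specification discussed in this thread.

```python
import json

# Hypothetical, heavily simplified records. The field layout is an
# assumption for illustration only, not the actual MARC-JSON format.
records = [
    {"leader": "00000nam a2200000 a 4500",
     "fields": [{"tag": "245", "ind1": "1", "ind2": "0",
                 "subfields": [{"a": "First title"}]}]},
    {"leader": "00000nam a2200000 a 4500",
     "fields": [{"tag": "245", "ind1": "1", "ind2": "0",
                 "subfields": [{"a": "Second title,\nwith a line break"}]}]},
]


def serialize_record(record):
    """Use case #1: a single record as one JSON object."""
    return json.dumps(record, indent=2)


def serialize_collection(recs):
    """Use case #2: a collection of records as one JSON array."""
    return json.dumps(recs, indent=2)


def serialize_line_delimited(recs):
    """Use case #3: one JSON object per line, each terminated by a newline.

    json.dumps without an indent emits no raw newlines: line breaks inside
    string values are escaped, so every record occupies exactly one line.
    """
    return "".join(json.dumps(r) + "\n" for r in recs)
```

The outputs of use cases #1 and #2 can each be handed whole to any JSON parser. The output of use case #3 cannot: once it holds more than one record, it is no longer a single JSON text, and a standard parser will reject it.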
Personally, I have some minor issues with use case #3 in that it requires the
entire serialization to be on one line. Programming libraries and frameworks
often have issues when line lengths exceed their buffer limits. In
addition, collapsing each record onto a single line makes the stream difficult
for humans to read when things eventually do go wrong and need human
intervention. Other alternatives
to serializing the object to a single line would be to use VT (vertical tab),
FF (form feed) or a double-newline to terminate the serialized objects.
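As a rough sketch of the alternative terminators, the Python reader below splits a stream on a form feed (or any other chosen terminator) instead of relying on one record per line, so each record may stay pretty-printed. The function name, chunk size, and terminator choice are assumptions for illustration, not part of any specification.

```python
import io
import json

FORM_FEED = "\x0c"  # FF; a double newline ("\n\n") could be passed instead


def iter_records(stream, terminator=FORM_FEED, chunk_size=4096):
    """Yield JSON objects from a text stream of terminator-separated records.

    The stream is read in fixed-size chunks, so no line-length limit applies
    and only the record currently being assembled is buffered.
    """
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Emit every complete record currently held in the buffer.
        while terminator in buffer:
            head, _, buffer = buffer.partition(terminator)
            if head.strip():
                yield json.loads(head)
    # Accept a final record that lacks a trailing terminator.
    if buffer.strip():
        yield json.loads(buffer)


# Example: two pretty-printed objects, each followed by a form feed.
sample = io.StringIO('{\n  "tag": "245"\n}\x0c{\n  "tag": "100"\n}\x0c')
parsed = list(iter_records(sample))
```

A raw form feed cannot occur inside a valid JSON text, because control characters must be escaped within strings and FF is not JSON whitespace, so it is unambiguous as a separator. A double newline is only safe if the producer never emits blank lines when pretty-printing.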
Other issues with use case #3 are that this use case is primarily a file format
to be read by library centric tool chains that can feed the individual objects
to a JSON/JavaScript processor. Use case #3 works no differently from and
provides no advantages over use case #2 in AJAX scenarios because both use
cases are limited by memory constraints of the JSON/JavaScript processor, e.g.,
if you can keep use case #2 in memory you will be able to keep use case #3 in
memory. A disadvantage to use case #3 is that it cannot use existing AJAX
frameworks to deserialize JSON objects and each application must build its
own infrastructure to deserialize these line delimited JSON objects.
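The difference in client-side effort can be sketched in Python as follows. Both function names are hypothetical; the point is only that use case #2 needs one call to a standard parser, while use case #3 needs an application-supplied loop, and that in both cases the whole response body is already in memory.

```python
import json


def parse_collection(body):
    """Use case #2: the body is one JSON array, so one parser call suffices."""
    return json.loads(body)


def parse_line_delimited(body):
    """Use case #3: the application splits the body and parses each line.

    Splitting on "\n" only (rather than str.splitlines) avoids breaking on
    other line-separator characters that may appear unescaped in strings.
    """
    return [json.loads(line) for line in body.split("\n") if line.strip()]


# The same two hypothetical records, delivered both ways.
array_body = '[{"tag": "245"}, {"tag": "100"}]'
lines_body = '{"tag": "245"}\n{"tag": "100"}\n'
same = parse_collection(array_body) == parse_line_delimited(lines_body)
```

Either way the caller ends up holding the full body plus the full list of parsed objects, which is why the line-delimited form gives no memory advantage once the response has been buffered by the HTTP client.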
Use cases #2 and #3 diverge because of standards compliance expectations. So
the question becomes how can use case #3 be made standards compliant? It seems
to me that use case #3 is defining a different media type than use case #1 and
#2 whose media types are defined by the JSON specification. A way to fix this
issue is to say that use cases #1 and #2 conform to media type application/json
and use case #3 conforms to a new media type say: application/marc+json. This
new application/marc+json media type now becomes a library centric standard and
it avoids breaking a widely deployed Web standard.
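A service could then choose the serialization from the request's Accept header. The Python sketch below is a simplified illustration under stated assumptions: the `render` function is hypothetical, `application/marc+json` is the media type proposed in this message and not a registered one, and q-values and wildcards in the Accept header are deliberately ignored.

```python
import json

JSON_TYPE = "application/json"
MARC_JSON_TYPE = "application/marc+json"  # proposed here; not a registered type


def render(records, accept):
    """Return (media_type, body) for a list of records.

    Simplified negotiation: media-type parameters such as q-values are
    stripped and ignored, and wildcards are not handled.
    """
    requested = [part.split(";")[0].strip().lower() for part in accept.split(",")]
    if MARC_JSON_TYPE in requested:
        # Use case #3: newline-terminated objects under the new media type.
        body = "".join(json.dumps(r) + "\n" for r in records)
        return MARC_JSON_TYPE, body
    # Use cases #1 and #2: a body that is valid JSON as a whole.
    return JSON_TYPE, json.dumps(records)
```

With this split, a client that sends `Accept: application/json` always receives a body that any JSON parser accepts, and only clients that explicitly ask for the library-centric type receive the line-delimited form.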
Given the above discussion, use cases #1 and #2 are already defined by our
MARC-JSON serialization format and comply with existing standards. No
changes are required by our existing specification. Our MARC-JSON
serialization for an object (MARC-21 record) could be used in use case #3 with
the restriction that all whitespace tokens in serialized objects can only be
spaces, given your current proposal. Use case #3 can be satisfied by an
alternate specification which defines a new media type and suggested file
extension, e.g., application/marc+json and .mrj vs. application/marc and .mrc
as defined by RFC 2220.
Andy.