On Tue, 18 Nov 2008 16:01:51 -0500, Scott McKellar <[EMAIL PROTECTED]> wrote:

--- On Tue, 11/18/08, Bill Erickson <[EMAIL PROTECTED]> wrote:

From: Bill Erickson <[EMAIL PROTECTED]>
<snip>

I've run into one issue with these last 2 patches, in
particular, there
seems to be a subtle difference in uescape() and
buffer_append_uescape().
The majority of the JSON parsing succeeds, but parsing
certain
bibliographic data (MARC records) always results in NULL
when attempting
to encode the containing object (biblio_record entry
objects).    If I
replace the call to buffer_append_uescape() with a call to
uescape() in
osrf_json_object.c, then all is well.

I would send you some sample data to test, but I've yet
to recreate the
issue outside of a running system.  For now, I'm adding
some logging to
see if I can track it down.  I just wanted to pass this on
in case any
lightbulbs went off.  I'll let you if I find
something...

I'm not sure what you mean by "always results in NULL".  What is NULL?
The return from the original uescape()?  That would indicate invalid
UTF-8 characters in the input string.

Maybe you mean that the buffer_append_uescape() seems to work, but the
resulting JSON string later fails to parse correctly.

In my test harness I translated things both ways and tested with strcmp()
to verify that I got the same results.  However I didn't test with
non-ASCII characters, so that's likely to be the weakest link.  It may
be that I'm mangling the UTF-8.

I'm more inclined to think that you have invalid UTF-8 characters in some
of your MARC records.  You wouldn't know about it because uescape()
silently discards an entire input string if any of it is invalid.  The
new function translates as far as it can before it hits the invalid
characters.  I *think* that the result would still be valid JSON, because
we only translate the values of strings, which get enclosed by double
quotes anyway.  However the semantics of that string might be invalid
at some higher level.

For diagnostic purposes, I suggest either or both of the following:

1. Check the return code from buffer_append_uescape(), and log a
message if it's non-zero.

2. Tweak buffer_append_uescape() so that, if it detects invalid UTF-8,
it restores the growing_buffer to its original length.  In effect, this
change would make buffer_append_uescape() behave like uescape() in the
face of invalid UTF-8.  I can send you a patch for this if you like, but
I don't know if you would want it based on the original code or on the
already patched version.

In the long run we should probably think about how best to respond to
invalid UTF-8.  For now we just want to identify the problem and fix it.

Meanwhile you can get most of the speedup by reverting to the original
uescape function but otherwise keeping the rewrite of jsonObjectToJSON().

Scott McKellar
http://home.swbell.net/mck9/ct/



OK, I tracked down 2 problems, the first makes sense, the second, not so much.

The first problem was the result of attempting to escape quotes (") and backslashes (\) in the final if/else block of buffer_append_uescape(). Since " and \ are both >= ' ', they were not getting escaped.

The second was caused by NULL jsonObject's being passed from OSRF_LIST_GET_INDEX (osrf_json_object:337) into add_json_to_buffer(). osrfList allows NULL entries, so it seems the old could should have been dying here, but for some reason only the new code suffers.

Slightly modified patch committed: http://svn.open-ils.org/trac/OpenSRF/changeset/1498

Thanks again, Scott.  Eyes on on the latest patch would be appreciated.

-b

--
Bill Erickson
| VP, Software Development & Integration
| Equinox Software, Inc. / The Evergreen Experts
| phone: 877-OPEN-ILS (673-6457)
| email: [EMAIL PROTECTED]
| web: http://esilibrary.com

Reply via email to