Re: [PATCH v2] routines to generate JSON data

2018-03-22 Thread Ævar Arnfjörð Bjarmason

On Wed, Mar 21 2018, g...@jeffhostetler.com wrote:

> So, I'm not sure we have a route to get UTF-8-clean data out of Git, and if
> we do it is beyond the scope of this patch series.
>
> So I think for our uses here, defining this as "JSON-like" is probably the
> best answer.  We write the strings as we received them (from the file system,
> the index, or whatever).  These strings are properly escaped WRT double
> quotes, backslashes, and control characters, so we shouldn't have an issue
> with decoders getting out of sync -- only with them rejecting non-UTF-8
> sequences.
>
> We could blindly \u encode each of the hi-bit characters, if that would
> help the parsers, but I don't want to do that right now.
>
> WRT binary data, I had not intended using this for binary data.  And without
> knowing what kinds or quantity of binary data we might use it for, I'd like
> to ignore this for now.

I agree we should just ignore this problem for now given the immediate
use-case.


Re: [PATCH v2] routines to generate JSON data

2018-03-22 Thread Jeff King
On Wed, Mar 21, 2018 at 07:28:26PM +, g...@jeffhostetler.com wrote:

> It includes a new "struct json_writer" which is used to guide the
> accumulation of JSON data -- knowing whether an object or array is
> currently being composed.  This allows error checking during construction.
> 
> It also allows construction of nested structures using an inline model (in
> addition to the original bottom-up composition).
> 
> The test helper has been updated to include both the original unit tests and
> a new scripting API to allow individual tests to be written directly in our
> t/t*.sh shell scripts.

Thanks for all of this. The changes look quite sensible to me (I do
still suspect we could do the "first_item" thing without having to
allocate, but I really like the assertions you were able to put in).

> So I think for our uses here, defining this as "JSON-like" is probably the
> best answer.  We write the strings as we received them (from the file system,
> the index, or whatever).  These strings are properly escaped WRT double
> quotes, backslashes, and control characters, so we shouldn't have an issue
> with decoders getting out of sync -- only with them rejecting non-UTF-8
> sequences.

Yeah, I think I've come to the same conclusion. My main goal in raising
it now was to see if there was some other format we might use before we
go too far down the JSON road. But as far as I can tell there really
isn't another good option.

> WRT binary data, I had not intended using this for binary data.  And without
> knowing what kinds or quantity of binary data we might use it for, I'd like
> to ignore this for now.

Yeah, I don't have any plans here either. I was thinking more about
things like author names and file paths.

-Peff


[PATCH v2] routines to generate JSON data

2018-03-21 Thread git
From: Jeff Hostetler 

This is version 2 of my JSON data format routines.  This version addresses
the non-utf8 questions raised on V1.

It includes a new "struct json_writer" which is used to guide the
accumulation of JSON data -- knowing whether an object or array is
currently being composed.  This allows error checking during construction.

It also allows construction of nested structures using an inline model (in
addition to the original bottom-up composition).

The test helper has been updated to include both the original unit tests and
a new scripting API to allow individual tests to be written directly in our
t/t*.sh shell scripts.
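
For readers following along, here is a minimal sketch of the idea of a writer
that knows whether an object or an array is currently open.  It is not the
"struct json_writer" from the patch: the struct jw_sketch type, all jw_*
function names, the fixed-size buffer and the depth limit are assumptions
made for illustration only.  Keys are assumed to need no escaping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define JW_MAX_DEPTH 32

/* Hypothetical sketch; NOT the struct json_writer from the patch. */
struct jw_sketch {
	char buf[256];            /* accumulated output, NUL-terminated */
	size_t len;
	char open[JW_MAX_DEPTH];  /* '{' or '[' for each open level */
	int first[JW_MAX_DEPTH];  /* 1 until the level gets its first item */
	int depth;
};

static void jw_init(struct jw_sketch *jw)
{
	memset(jw, 0, sizeof(*jw));
}

static void jw_append(struct jw_sketch *jw, const char *s)
{
	size_t n = strlen(s);
	assert(jw->len + n < sizeof(jw->buf));
	memcpy(jw->buf + jw->len, s, n + 1);
	jw->len += n;
}

/* Is the innermost open container of the given kind? */
static int jw_in(const struct jw_sketch *jw, char kind)
{
	return jw->depth > 0 && jw->open[jw->depth - 1] == kind;
}

/* Emit a comma unless this is the first item at the current level. */
static void jw_separator(struct jw_sketch *jw)
{
	if (jw->first[jw->depth - 1])
		jw->first[jw->depth - 1] = 0;
	else
		jw_append(jw, ",");
}

static void jw_push(struct jw_sketch *jw, char kind)
{
	assert(jw->depth < JW_MAX_DEPTH);
	jw->open[jw->depth] = kind;
	jw->first[jw->depth] = 1;
	jw->depth++;
	jw_append(jw, kind == '{' ? "{" : "[");
}

static void jw_key(struct jw_sketch *jw, const char *key)
{
	assert(jw_in(jw, '{'));  /* keys are only valid inside an object */
	jw_separator(jw);
	jw_append(jw, "\"");
	jw_append(jw, key);
	jw_append(jw, "\":");
}

static void jw_begin_object(struct jw_sketch *jw)
{
	assert(jw->depth == 0);  /* top-level object only in this sketch */
	jw_push(jw, '{');
}

static void jw_object_int(struct jw_sketch *jw, const char *key, int v)
{
	char num[32];
	jw_key(jw, key);
	snprintf(num, sizeof(num), "%d", v);
	jw_append(jw, num);
}

/* Inline nesting: open an array as the value of a key. */
static void jw_object_begin_array(struct jw_sketch *jw, const char *key)
{
	jw_key(jw, key);
	jw_push(jw, '[');
}

static void jw_array_int(struct jw_sketch *jw, int v)
{
	char num[32];
	assert(jw_in(jw, '['));  /* bare values are only valid inside an array */
	jw_separator(jw);
	snprintf(num, sizeof(num), "%d", v);
	jw_append(jw, num);
}

/* Close the innermost container with the matching bracket. */
static void jw_end(struct jw_sketch *jw)
{
	assert(jw->depth > 0);
	jw->depth--;
	jw_append(jw, jw->open[jw->depth] == '{' ? "}" : "]");
}
```

Calling jw_begin_object(), jw_object_int("a", 1), jw_object_begin_array("b"),
jw_array_int(2), jw_array_int(3) and then jw_end() twice leaves
{"a":1,"b":[2,3]} in the buffer, while calling jw_array_int() with an object
open trips an assertion instead of emitting malformed output.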


TODO


I still don't know what to do about the Unicode/UTF-8 questions that
were raised WRT strings.  Pathnames on Linux can be any sequence of 8-bit
characters -- this is likely to be UTF-8 on modern systems.  Pathnames on
Windows are UCS-2/UTF-16 in the filesystem and we always convert to/from
UTF-8 when moving between git data structures and IO calls.

There are a few other fields (like author name) that we may want to log which
may or may not be UTF-8, but that is beyond our control.  Even localized error
messages may be problematic if they include other fields.

So, I'm not sure we have a route to get UTF-8-clean data out of Git, and if
we do it is beyond the scope of this patch series.

So I think for our uses here, defining this as "JSON-like" is probably the
best answer.  We write the strings as we received them (from the file system,
the index, or whatever).  These strings are properly escaped WRT double
quotes, backslashes, and control characters, so we shouldn't have an issue
with decoders getting out of sync -- only with them rejecting non-UTF-8
sequences.
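
To make the escaping rule concrete, here is a rough sketch of it.  The
function name quote_json_like() and its buffer-based signature are
hypothetical and are not taken from json-writer.c; the point is only that
double quotes, backslashes and control characters are escaped while bytes
with the high bit set are copied through unchanged.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not the patch's code: write `in` as a quoted
 * "JSON-like" string into `out`.  Bytes >= 0x80 are passed through as-is.
 * Returns the output length, or -1 if `out` might be too small.
 */
static int quote_json_like(const char *in, char *out, size_t outsz)
{
	const unsigned char *p = (const unsigned char *)in;
	size_t o = 0;

	/* Worst case: six characters per byte, two quotes and a NUL. */
	if (outsz < strlen(in) * 6 + 3)
		return -1;

	out[o++] = '"';
	for (; *p; p++) {
		switch (*p) {
		case '"':  out[o++] = '\\'; out[o++] = '"';  break;
		case '\\': out[o++] = '\\'; out[o++] = '\\'; break;
		case '\b': out[o++] = '\\'; out[o++] = 'b';  break;
		case '\f': out[o++] = '\\'; out[o++] = 'f';  break;
		case '\n': out[o++] = '\\'; out[o++] = 'n';  break;
		case '\r': out[o++] = '\\'; out[o++] = 'r';  break;
		case '\t': out[o++] = '\\'; out[o++] = 't';  break;
		default:
			if (*p < 0x20) {
				/* remaining control characters use \u00XX */
				snprintf(out + o, 7, "\\u%04x", (unsigned)*p);
				o += 6;
			} else {
				/* includes hi-bit bytes, valid UTF-8 or not */
				out[o++] = (char)*p;
			}
		}
	}
	out[o++] = '"';
	out[o] = '\0';
	return (int)o;
}
```

Because every quote and backslash in the input is escaped, a decoder can
always find the end of the string; whether it then accepts the bytes inside
depends on how strict it is about UTF-8.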

We could blindly \u encode each of the hi-bit characters, if that would
help the parsers, but I don't want to do that right now.
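
For reference, one possible reading of "blindly \u encode" is sketched
below: each byte with the high bit set becomes \u00XX, regardless of whether
it is part of a valid UTF-8 sequence.  The function name escape_high_bytes()
is hypothetical, and this is only an illustration of the idea, not something
the patch does.  Escaping of quotes, backslashes and control characters is
assumed to happen separately.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy `in` to `out`, replacing every byte >= 0x80
 * with \u00XX.  Returns the output length, or -1 if `out` might be too
 * small.
 */
static int escape_high_bytes(const char *in, char *out, size_t outsz)
{
	const unsigned char *p = (const unsigned char *)in;
	size_t o = 0;

	/* Worst case: every byte expands to six characters, plus a NUL. */
	if (outsz < strlen(in) * 6 + 1)
		return -1;

	for (; *p; p++) {
		if (*p >= 0x80) {
			snprintf(out + o, 7, "\\u%04x", (unsigned)*p);
			o += 6;
		} else {
			out[o++] = (char)*p;
		}
	}
	out[o] = '\0';
	return (int)o;
}
```

The output is pure ASCII, so strict parsers will accept it, but a decoder
will read \u00c3 as the code point U+00C3 rather than the byte 0xC3, so a
consumer that wants the original bytes back has to map U+0080..U+00FF to
single bytes itself.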

WRT binary data, I had not intended using this for binary data.  And without
knowing what kinds or quantity of binary data we might use it for, I'd like
to ignore this for now.


Jeff Hostetler (1):
  json_writer: new routines to create data in JSON format

 Makefile                    |   2 +
 json-writer.c               | 321 +
 json-writer.h               |  86 +
 t/helper/test-json-writer.c | 420 
 t/t0019-json-writer.sh      | 102 +++
 5 files changed, 931 insertions(+)
 create mode 100644 json-writer.c
 create mode 100644 json-writer.h
 create mode 100644 t/helper/test-json-writer.c
 create mode 100755 t/t0019-json-writer.sh

-- 
2.9.3