Blue Swirl <blauwir...@gmail.com> writes: > On Thu, Feb 28, 2013 at 8:14 AM, Markus Armbruster <arm...@redhat.com> wrote: >> Blue Swirl <blauwir...@gmail.com> writes: >> >>> On Mon, Feb 4, 2013 at 5:19 PM, Markus Armbruster <arm...@redhat.com> wrote: >>>> Test cases are scraped from Markus Kuhn's UTF-8 decoder capability and >>>> stress test at >>>> http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt >>>> >>>> Unfortunately, both JSON parser and formatter misbehave right now. >>>> This test expects current, incorrect results. They're all clearly >>>> marked, and are to be replaced by correct ones as the bugs get fixed. >>>> See comments in new utf8_string() for details. >>>> >>>> Signed-off-by: Markus Armbruster <arm...@redhat.com> >>>> --- >>>> tests/check-qjson.c | 625 >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>> 1 file changed, 625 insertions(+) >>>> >>>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c >>>> index 32ffb43..4590b3a 100644 >>>> --- a/tests/check-qjson.c >>>> +++ b/tests/check-qjson.c >>>> @@ -1,8 +1,10 @@ >>>> /* >>>> * Copyright IBM, Corp. 2009 >>>> + * Copyright (c) 2013 Red Hat Inc. >>>> * >>>> * Authors: >>>> * Anthony Liguori <aligu...@us.ibm.com> >>>> + * Markus Armbruster <arm...@redhat.com>, >>>> * >>>> * This work is licensed under the terms of the GNU LGPL, version 2.1 or >>>> later. >>>> * See the COPYING.LIB file in the top-level directory. >>>> @@ -131,6 +133,628 @@ static void single_quote_string(void) >>>> } >>>> } >>>> >>>> +static void utf8_string(void) >>>> +{ >>>> + /* >>>> + * FIXME Current behavior for invalid UTF-8 sequences is >>>> + * incorrect. This test expects current, incorrect results. >>>> + * They're all marked "bug:" below, and are to be replaced by >>>> + * correct ones as the bugs get fixed. >>>> + * >>>> + * The JSON parser rejects some invalid sequences, but accepts >>>> + * others without correcting the problem. >>>> + * >>>> + * The JSON formatter replaces some invalid sequences by U+FFFF (a >>>> + * noncharacter), and goes wonky for others. >>>> + * >>>> + * For both directions, we should either reject all invalid >>>> + * sequences, or minimize overlong sequences and replace all other >>>> + * invalid sequences by a suitable replacement character. A >>>> + * common choice for replacement is U+FFFD. >>>> + * >>>> + * Problem: we can't easily deal with embedded U+0000. Parsing >>>> + * the JSON string "this \\u0000" is fun" yields "this \0 is fun", >>>> + * which gets misinterpreted as NUL-terminated "this ". We should >>>> + * consider using overlong encoding \xC0\x80 for U+0000 ("modified >>>> + * UTF-8"). >>>> + * >>>> + * Tests are scraped from Markus Kuhn's UTF-8 decoder capability >>>> + * and stress test at >>>> + * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt >>>> + */ >>>> + static const struct { >>>> + const char *json_in; >>>> + const char *utf8_out; >>>> + const char *json_out; /* defaults to @json_in */ >>>> + const char *utf8_in; /* defaults to @utf8_out */ >>>> + } test_cases[] = { >>>> + /* >>>> + * Bug markers used here: >>>> + * - bug: not corrected >>>> + * JSON parser fails to correct invalid sequence(s) >>>> + * - bug: rejected >>>> + * JSON parser rejects invalid sequence(s) >>>> + * We may choose to define this as feature >>>> + * - bug: want "\"...\"" >>>> + * JSON formatter produces incorrect result, this is the >>>> + * correct one, assuming replacement character U+FFFF >>>> + * - bug: want "..." (no \") >>>> + * JSON parser produces incorrect result, this is the >>>> + * correct one. >>>> + * Not marked explicitly, but trivial to find: >>>> + * - JSON formatter replacing invalid sequence by \\uFFFF is a >>>> + * bug if we want it to fail for invalid sequences. >>>> + */ >> [...] >>>> + /* 2.1.4 4 bytes U+10000 */ >>>> + { >>>> + "\"\xF0\x90\x80\x80\"", >>>> + "\xF0\x90\x80\x80", >>>> + "\"\\u0400\\uFFFF\"", /* bug: want "\"\\uD800\\uDC00\"" */ >>>> + }, >> [...] >>>> + {} >>>> + }; >>>> + int i; >>>> + QObject *obj; >>>> + QString *str; >>>> + const char *json_in, *utf8_out, *utf8_in, *json_out; >>>> + >>>> + for (i = 0; test_cases[i].json_in; i++) { >>>> + json_in = test_cases[i].json_in; >>>> + utf8_out = test_cases[i].utf8_out; >>>> + utf8_in = test_cases[i].utf8_in ?: test_cases[i].utf8_out; >>>> + json_out = test_cases[i].json_out ?: test_cases[i].json_in; >>>> + >>>> + obj = qobject_from_json(json_in); >>>> + if (utf8_out) { >>>> + g_assert(obj); >>>> + g_assert(qobject_type(obj) == QTYPE_QSTRING); >>>> + str = qobject_to_qstring(obj); >>>> + g_assert_cmpstr(qstring_get_str(str), ==, utf8_out); >>>> + } else { >>>> + g_assert(!obj); >>>> + } >>>> + qobject_decref(obj); >>>> + >>>> + obj = QOBJECT(qstring_from_str(utf8_in)); >>>> + str = qobject_to_json(obj); >>>> + if (json_out) { >>>> + g_assert(str); >>>> + g_assert_cmpstr(qstring_get_str(str), ==, json_out); >>> >>> This assertion trips on an ARM host (Debian stable): >>> >>> GTESTER tests/check-qjson >>> ** >>> ERROR:/src/qemu/tests/check-qjson.c:775:utf8_string: assertion failed >>> (qstring_get_str(str) == json_out): ("\"\\u0400\200\"" == >>> "\"\\u0400\\uFFFF\"") >>> GTester: last random seed: R02S88b76f755e809e9024832d2ab6660afd >> >> Must be case 2.1.4, because that's where json_out is >> "\"\\u0400\\uFFFF\"". >> >> We start by passing C string "\"\xF0\x90\x80\x80\"" to the JSON parser >> qobject_from_json(). Yields "\xF0\x90\x80\x80", as expected. >> >> We then pass that to qobject_to_json(). Should yield >> "\"\\uD800\\uDC00\"" (surrogate pair). Does yield "\"\\u0400\\uFFFF\"" >> on my machine (known bug), and "\"\\u0400\200\"" on yours. >> >> Looks like the JSON formatter is not just broken (we knew that already), >> it's broken in machine-dependent ways. Good to know, thanks for >> reporting. >> >> Obvious ways to get "make check" pass for you again *now*: >> >> * Disable check-qjson. That's too big a hammer for me. >> >> * Disable test case 2.1.4 with a comment explaining why. >> >> * Suitable #ifdeffery around the expected value. >> >> Preferences? > > * Fix JSON formatter :-)
I want that too, but I'm afraid we can't have it *now* :) > Disabling 2.1.4 only reveals the next problem: > GTESTER tests/check-qjson > GTester: last random seed: R02S6754f3523201dc81bb21de42e2ba843c > ** > ERROR:/src/qemu/tests/check-qjson.c:777:utf8_string: assertion failed > (qstring_get_str(str) == json_out): ("\"\\u8200\200\200\"" == > "\"\\u8200\\uFFFF\\uFFFF\"") All right, I give up. I can't fix to_json() tonight (I have maybe 30 minutes of usable brain left), but I can make it portably wrong. Please try the appended patch. diff --git a/qobject/qjson.c b/qobject/qjson.c index 83a6b4f..195da1f 100644 --- a/qobject/qjson.c +++ b/qobject/qjson.c @@ -187,7 +187,21 @@ static void to_json(const QObject *obj, QString *str, int pretty, int indent) default: { if (ptr[0] <= 0x1F) { char escape[7]; - snprintf(escape, sizeof(escape), "\\u%04X", ptr[0]); + /* + * Portability band-aid + * + * We used to print ptr[0] here, but when + * plain char is signed, that prints \uFFFF + * for negative values. The code here is crap + * (see utf8_string() in tests/check-qjson.c), + * and needs to be fixed. Can't do that right + * now, and don't want to go from wrong to + * differently wrong, so I make the wrong we + * now get on the most common machines the + * wrong we get on all machines. + */ + snprintf(escape, sizeof(escape), "\\u%04X", + *(signed char *)ptr); qstring_append(str, escape); } else { char buf[2] = { ptr[0], 0 };