On Thu, Feb 28, 2013 at 8:14 AM, Markus Armbruster <arm...@redhat.com> wrote: > Blue Swirl <blauwir...@gmail.com> writes: > >> On Mon, Feb 4, 2013 at 5:19 PM, Markus Armbruster <arm...@redhat.com> wrote: >>> Test cases are scraped from Markus Kuhn's UTF-8 decoder capability and >>> stress test at >>> http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt >>> >>> Unfortunately, both JSON parser and formatter misbehave right now. >>> This test expects current, incorrect results. They're all clearly >>> marked, and are to be replaced by correct ones as the bugs get fixed. >>> See comments in new utf8_string() for details. >>> >>> Signed-off-by: Markus Armbruster <arm...@redhat.com> >>> --- >>> tests/check-qjson.c | 625 >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >>> 1 file changed, 625 insertions(+) >>> >>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c >>> index 32ffb43..4590b3a 100644 >>> --- a/tests/check-qjson.c >>> +++ b/tests/check-qjson.c >>> @@ -1,8 +1,10 @@ >>> /* >>> * Copyright IBM, Corp. 2009 >>> + * Copyright (c) 2013 Red Hat Inc. >>> * >>> * Authors: >>> * Anthony Liguori <aligu...@us.ibm.com> >>> + * Markus Armbruster <arm...@redhat.com>, >>> * >>> * This work is licensed under the terms of the GNU LGPL, version 2.1 or >>> later. >>> * See the COPYING.LIB file in the top-level directory. >>> @@ -131,6 +133,628 @@ static void single_quote_string(void) >>> } >>> } >>> >>> +static void utf8_string(void) >>> +{ >>> + /* >>> + * FIXME Current behavior for invalid UTF-8 sequences is >>> + * incorrect. This test expects current, incorrect results. >>> + * They're all marked "bug:" below, and are to be replaced by >>> + * correct ones as the bugs get fixed. >>> + * >>> + * The JSON parser rejects some invalid sequences, but accepts >>> + * others without correcting the problem. >>> + * >>> + * The JSON formatter replaces some invalid sequences by U+FFFF (a >>> + * noncharacter), and goes wonky for others. >>> + * >>> + * For both directions, we should either reject all invalid >>> + * sequences, or minimize overlong sequences and replace all other >>> + * invalid sequences by a suitable replacement character. A >>> + * common choice for replacement is U+FFFD. >>> + * >>> + * Problem: we can't easily deal with embedded U+0000. Parsing >>> + * the JSON string "this \\u0000" is fun" yields "this \0 is fun", >>> + * which gets misinterpreted as NUL-terminated "this ". We should >>> + * consider using overlong encoding \xC0\x80 for U+0000 ("modified >>> + * UTF-8"). >>> + * >>> + * Tests are scraped from Markus Kuhn's UTF-8 decoder capability >>> + * and stress test at >>> + * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt >>> + */ >>> + static const struct { >>> + const char *json_in; >>> + const char *utf8_out; >>> + const char *json_out; /* defaults to @json_in */ >>> + const char *utf8_in; /* defaults to @utf8_out */ >>> + } test_cases[] = { >>> + /* >>> + * Bug markers used here: >>> + * - bug: not corrected >>> + * JSON parser fails to correct invalid sequence(s) >>> + * - bug: rejected >>> + * JSON parser rejects invalid sequence(s) >>> + * We may choose to define this as feature >>> + * - bug: want "\"...\"" >>> + * JSON formatter produces incorrect result, this is the >>> + * correct one, assuming replacement character U+FFFF >>> + * - bug: want "..." (no \") >>> + * JSON parser produces incorrect result, this is the >>> + * correct one. >>> + * Not marked explicitly, but trivial to find: >>> + * - JSON formatter replacing invalid sequence by \\uFFFF is a >>> + * bug if we want it to fail for invalid sequences. >>> + */ > [...] >>> + /* 2.1.4 4 bytes U+10000 */ >>> + { >>> + "\"\xF0\x90\x80\x80\"", >>> + "\xF0\x90\x80\x80", >>> + "\"\\u0400\\uFFFF\"", /* bug: want "\"\\uD800\\uDC00\"" */ >>> + }, > [...] >>> + {} >>> + }; >>> + int i; >>> + QObject *obj; >>> + QString *str; >>> + const char *json_in, *utf8_out, *utf8_in, *json_out; >>> + >>> + for (i = 0; test_cases[i].json_in; i++) { >>> + json_in = test_cases[i].json_in; >>> + utf8_out = test_cases[i].utf8_out; >>> + utf8_in = test_cases[i].utf8_in ?: test_cases[i].utf8_out; >>> + json_out = test_cases[i].json_out ?: test_cases[i].json_in; >>> + >>> + obj = qobject_from_json(json_in); >>> + if (utf8_out) { >>> + g_assert(obj); >>> + g_assert(qobject_type(obj) == QTYPE_QSTRING); >>> + str = qobject_to_qstring(obj); >>> + g_assert_cmpstr(qstring_get_str(str), ==, utf8_out); >>> + } else { >>> + g_assert(!obj); >>> + } >>> + qobject_decref(obj); >>> + >>> + obj = QOBJECT(qstring_from_str(utf8_in)); >>> + str = qobject_to_json(obj); >>> + if (json_out) { >>> + g_assert(str); >>> + g_assert_cmpstr(qstring_get_str(str), ==, json_out); >> >> This assertion trips on an ARM host (Debian stable): >> >> GTESTER tests/check-qjson >> ** >> ERROR:/src/qemu/tests/check-qjson.c:775:utf8_string: assertion failed >> (qstring_get_str(str) == json_out): ("\"\\u0400\200\"" == >> "\"\\u0400\\uFFFF\"") >> GTester: last random seed: R02S88b76f755e809e9024832d2ab6660afd > > Must be case 2.1.4, because that's where json_out is > "\"\\u0400\\uFFFF\"". > > We start by passing C string "\"\xF0\x90\x80\x80\"" to the JSON parser > qobject_from_json(). Yields "\xF0\x90\x80\x80", as expected. > > We then pass that to qobject_to_json(). Should yield > "\"\\uD800\\uDC00\"" (surrogate pair). Does yield "\"\\u0400\\uFFFF\"" > on my machine (known bug), and "\"\\u0400\200\"" on yours. > > Looks like the JSON formatter is not just broken (we knew that already), > it's broken in machine-dependent ways. Good to know, thanks for > reporting. > > Obvious ways to get "make check" pass for you again *now*: > > * Disable check-qjson. That's too big a hammer for me. > > * Disable test case 2.1.4 with a comment explaining why. > > * Suitable #ifdeffery around the expected value. > > Preferences?
* Fix JSON formatter :-) Disabling 2.1.4 only reveals the next problem: GTESTER tests/check-qjson GTester: last random seed: R02S6754f3523201dc81bb21de42e2ba843c ** ERROR:/src/qemu/tests/check-qjson.c:777:utf8_string: assertion failed (qstring_get_str(str) == json_out): ("\"\\u8200\200\200\"" == "\"\\u8200\\uFFFF\\uFFFF\"") diff --git a/tests/check-qjson.c b/tests/check-qjson.c index ec85a0c..fecd9c1 100644 --- a/tests/check-qjson.c +++ b/tests/check-qjson.c @@ -223,12 +223,14 @@ static void utf8_string(void) "\xE0\xA0\x80", "\"\\u0800\"", }, +#if 0 /* 2.1.4 4 bytes U+10000 */ { "\"\xF0\x90\x80\x80\"", "\xF0\x90\x80\x80", "\"\\u0400\\uFFFF\"", /* bug: want "\"\\uD800\\uDC00\"" */ }, +#endif /* 2.1.5 5 bytes U+200000 */ { "\"\xF8\x88\x80\x80\x80\"", > >>> + } else { >>> + g_assert(!str); >>> + } >>> + QDECREF(str); >>> + qobject_decref(obj); >>> + >>> + /* >>> + * disabled, because json_out currently contains the crap >>> + * qobject_to_json() produces >>> + */ >>> + if (0 && json_out != json_in) { >>> + obj = qobject_from_json(json_out); >>> + g_assert(obj); >>> + g_assert(qobject_type(obj) == QTYPE_QSTRING); >>> + str = qobject_to_qstring(obj); >>> + g_assert_cmpstr(qstring_get_str(str), ==, utf8_out); >>> + } >>> + } >>> +} >>> + >>> static void vararg_string(void) >>> { >>> int i; > [...]