Updated Branches: refs/heads/1425-fix-graceful-surrogate-handling [created] be4e41ff2

Handle invalid UTF-8 byte sequences gracefully by replacing them with '?'

CouchDB's Erlang JSON parser allows storing invalid UTF-8 byte
sequences. The Query Server inside CouchDB fails upon encountering
these byte sequences, and the view process fails for the current
batch of document updates. The result is that the view is invariably
broken.

Otherwise, only removing the document in question solves this, but
finding that document is hard: the `log()` inside the Query Server
dies with the invalid byte sequence, because our protocol is
synchronous and map results and the `log()` messages generated
therein are submitted together.

This patch replaces invalid bytes with the '?' (0x3f) byte.

Closes COUCHDB-1425.

Patch by Sam Rijs <r...@awesan.de>

Eventually, this should be fixed at the HTTP level, so that no
documents with invalid byte sequences can be written to CouchDB.
The jiffy encoder we'll get with BigCouch will do that for us.
This is a fix for the releases until then.

Project: http://git-wip-us.apache.org/repos/asf/couchdb/repo
Commit: http://git-wip-us.apache.org/repos/asf/couchdb/commit/be4e41ff
Tree: http://git-wip-us.apache.org/repos/asf/couchdb/tree/be4e41ff
Diff: http://git-wip-us.apache.org/repos/asf/couchdb/diff/be4e41ff

Branch: refs/heads/1425-fix-graceful-surrogate-handling
Commit: be4e41ff27a9c5ac270e24dcf2b3fca26a938149
Parents: 2b8539d
Author: Jan Lehnardt <j...@apache.org>
Authored: Mon Mar 4 15:09:36 2013 +0100
Committer: Jan Lehnardt <j...@apache.org>
Committed: Mon Mar 4 15:19:33 2013 +0100

----------------------------------------------------------------------
 THANKS.in                        |  1 +
 src/couchdb/priv/couch_js/utf8.c | 22 ++++++++++++----------
 2 files changed, 13 insertions(+), 10 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/couchdb/blob/be4e41ff/THANKS.in
----------------------------------------------------------------------
diff --git a/THANKS.in b/THANKS.in
index 4ebf3f0..db0ac07 100644
--- a/THANKS.in
+++ b/THANKS.in
@@ -94,6 +94,7 @@ suggesting improvements or submitting changes. Some of these people are:
  * Fedor Indutny <fe...@indutny.com>
  * Tim Blair
  * Tady Walsh <he...@tady.me>
+ * Sam Rijs <r...@awesam.de>
 # Authors from commit 6c976bd and onwards are auto-inserted. If you are merging
 # a commit from a non-committer, you should not add an entry to this file. When
 # `bootstrap` is run, the actual THANKS file will be generated.

http://git-wip-us.apache.org/repos/asf/couchdb/blob/be4e41ff/src/couchdb/priv/couch_js/utf8.c
----------------------------------------------------------------------
diff --git a/src/couchdb/priv/couch_js/utf8.c b/src/couchdb/priv/couch_js/utf8.c
index d606426..94dac32 100644
--- a/src/couchdb/priv/couch_js/utf8.c
+++ b/src/couchdb/priv/couch_js/utf8.c
@@ -66,9 +66,11 @@ enc_charbuf(const jschar* src, size_t srclen, char* dst, size_t* dstlenp)
         c = *src++;
         srclen--;
 
-        if((c >= 0xDC00) && (c <= 0xDFFF)) goto bad_surrogate;
-
-        if(c < 0xD800 || c > 0xDBFF)
+        if((c >= 0xDC00) && (c <= 0xDFFF))
+        { // bad surrogate hack -- emit '?' -- COUCHDB-1425
+            v = 0x3f;
+        }
+        else if(c < 0xD800 || c > 0xDBFF)
         {
             v = c;
         }
@@ -78,11 +80,15 @@ enc_charbuf(const jschar* src, size_t srclen, char* dst, size_t* dstlenp)
             c2 = *src++;
             srclen--;
             if ((c2 < 0xDC00) || (c2 > 0xDFFF))
+            { // bad surrogate hack -- emit '?' -- COUCHDB-1425
+                v = 0x3f;
+                src--;
+                srclen++;
+            }
+            else
             {
-                c = c2;
-                goto bad_surrogate;
+                v = ((c - 0xD800) << 10) + (c2 - 0xDC00) + 0x10000;
             }
-            v = ((c - 0xD800) << 10) + (c2 - 0xDC00) + 0x10000;
         }
         if(v < 0x0080)
         {
@@ -109,10 +115,6 @@ enc_charbuf(const jschar* src, size_t srclen, char* dst, size_t* dstlenp)
     *dstlenp = (origDstlen - dstlen);
     return JS_TRUE;
 
-bad_surrogate:
-    *dstlenp = (origDstlen - dstlen);
-    return JS_FALSE;
-
 buffer_too_small:
     *dstlenp = (origDstlen - dstlen);
     return JS_FALSE;
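
For readers who want to try the substitution rule outside the CouchDB tree, here is a minimal standalone sketch of the same idea: encode UTF-16 code units as UTF-8 and emit '?' (0x3f) wherever a surrogate is unpaired. The function name `utf16_to_utf8_lossy`, its signature, and the `(size_t)-1` error return are assumptions made for illustration; they are not part of `utf8.c`. It also differs from the patch in one respect: a high surrogate at the very end of the input becomes '?' here, whereas `enc_charbuf` still jumps to `buffer_too_small` in that case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not CouchDB's enc_charbuf): encode UTF-16 code units
 * as UTF-8, replacing every unpaired surrogate with '?' (0x3f).
 * Returns the number of bytes written, or (size_t)-1 if dst is too small. */
static size_t utf16_to_utf8_lossy(const uint16_t *src, size_t srclen,
                                  char *dst, size_t dstcap)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);

    while (srclen > 0) {
        uint32_t v;
        size_t need;
        uint16_t c = *src++;
        srclen--;

        if (c >= 0xDC00 && c <= 0xDFFF) {
            v = 0x3f;                      /* low surrogate without a lead */
        } else if (c < 0xD800 || c > 0xDBFF) {
            v = c;                         /* ordinary BMP code unit */
        } else if (srclen > 0 && src[0] >= 0xDC00 && src[0] <= 0xDFFF) {
            /* valid pair: combine into a supplementary code point */
            v = (((uint32_t)c - 0xD800) << 10)
              + ((uint32_t)src[0] - 0xDC00) + 0x10000;
            src++;
            srclen--;
        } else {
            v = 0x3f;                      /* high surrogate without a trail;
                                              the next unit is not consumed */
        }

        need = v < 0x80 ? 1 : v < 0x800 ? 2 : v < 0x10000 ? 3 : 4;
        if (dstcap - out < need) return (size_t)-1;

        if (need == 1) {
            dst[out++] = (char)v;
        } else if (need == 2) {
            dst[out++] = (char)(0xC0 | (v >> 6));
            dst[out++] = (char)(0x80 | (v & 0x3F));
        } else if (need == 3) {
            dst[out++] = (char)(0xE0 | (v >> 12));
            dst[out++] = (char)(0x80 | ((v >> 6) & 0x3F));
            dst[out++] = (char)(0x80 | (v & 0x3F));
        } else {
            dst[out++] = (char)(0xF0 | (v >> 18));
            dst[out++] = (char)(0x80 | ((v >> 12) & 0x3F));
            dst[out++] = (char)(0x80 | ((v >> 6) & 0x3F));
            dst[out++] = (char)(0x80 | (v & 0x3F));
        }
    }
    return out;
}
```

Given the three code units `'a'`, `0xDC00`, `'b'`, this sketch writes the three bytes `a?b`. As in the patch, the unit that follows a stray high surrogate is left in the input and processed on the next iteration, so a valid character after a bad lead is not lost.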