[ 
https://issues.apache.org/jira/browse/COUCHDB-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038487#comment-13038487
 ] 

Nuutti Kotivuori commented on COUCHDB-1176:
-------------------------------------------

The bug is in mochijson2.erl: tokenize_string_fast (which is hand-written) 
accepts invalid UTF-8, whereas tokenize_string uses xmerl_ucs:to_utf8 to 
convert escapes to UTF-8 and therefore rejects the values listed below. 
This is directly from the documentation of xmerl:

%%% UTF-8 support
%%% Possible errors encoding UTF-8:
%%%     - Non-character values (something other than 0 .. 2^31-1).
%%%     - Surrogate pair code in string.
%%%     - 16#FFFE or 16#FFFF character in string.

Either the same values should be rejected by tokenize_string_fast, or both 
places should accept the values.
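
To make the mismatch concrete, here is a rough sketch in Python (not the 
Erlang code from mochijson2.erl or xmerl). The function names are 
hypothetical; the first one only mirrors the three error cases quoted above, 
and the second shows what a matching check on raw UTF-8 input might look 
like if the first option (reject in both places) were chosen.

```python
# Illustrative sketch only; this is not mochijson2 or xmerl code.
# Both function names are made up for this example.

def rejected_by_escape_path(cp: int) -> bool:
    """Mirror the three error cases quoted from the xmerl documentation."""
    if not 0 <= cp <= 2**31 - 1:
        return True                    # non-character value
    if 0xD800 <= cp <= 0xDFFF:
        return True                    # surrogate pair code
    return cp in (0xFFFE, 0xFFFF)      # 16#FFFE or 16#FFFF


def raw_string_would_be_rejected(data: bytes) -> bool:
    """Apply the same rule to a string that arrives as raw UTF-8 bytes."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return True                    # malformed byte sequence
    return any(rejected_by_escape_path(ord(ch)) for ch in text)


# U+FFFE sent unescaped is the three bytes EF BF BE.
raw = "\ufffe".encode("utf-8")
print(raw.hex())
print(rejected_by_escape_path(0xFFFE))
print(raw_string_would_be_rejected(raw))
print(raw_string_would_be_rejected(b"abc"))
```

With a check like the second function applied to unescaped strings, both 
input forms of U+FFFE would be treated the same way. Accepting the values in 
both places instead would mean relaxing the check on the escape side.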

> CouchDB accepts data which it cannot replicate (invalid UTF-8 json during 
> replication)
> --------------------------------------------------------------------------------------
>
>                 Key: COUCHDB-1176
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1176
>             Project: CouchDB
>          Issue Type: Bug
>    Affects Versions: 1.0.1, 1.0.2
>         Environment: CentOS 5.5 64bit
>            Reporter: Jaakko Sipari
>            Priority: Critical
>         Attachments: fffe_escaped.json, fffe_utf8.json
>
>
> CouchDB appears to treat some Unicode characters as illegal when parsing 
> escaped Unicode values (\uXXXX) during insert or update of a document. These 
> characters can, however, be inserted into the database by using the UTF-8 
> encoding instead of escaping. An example is the Unicode value 0xFFFE, which 
> is escaped as \uFFFE and is represented in UTF-8 by the consecutive bytes 
> 0xEF, 0xBF and 0xBE.
> Even though the documents are inserted using UTF-8 encoding without errors, 
> CouchDB always serves them in the escaped form. This leads to the actual 
> problem we currently have: if documents containing such unaccepted characters 
> are inserted into CouchDB using UTF-8 encoding, an attempt to replicate the 
> database aborts at the first of those documents with an error like this:
> {"error":"json_encode","reason":"{bad_term,{nocatch,{invalid_json,<<\"[{\\\"ok\\\":{\\\"_id\\\":\\\"192058c4f81afc66c5bf883548004331\\\",\\\"_rev\\\":\\\"1-ad1c9dcee520d12abdf948d91e31cf15\\\",\\\"abc\\\":\\\"\\\\ufffe\\\",\\\"_revisions\\\":{\\\"start\\\":1,\\\"ids\\\":[\\\"ad1c9dcee520d12abdf948d91e31cf15\\\"]}}}]\\n\">>}}}"}
> Here are steps to reproduce:
> curl -X PUT http://localhost:5984/replicationtest_source
> curl -X PUT http://localhost:5984/replicationtest_target
> # Should fail
> curl -H "Content-Type:application/json" -X POST -d @fffe_escaped.json http://localhost:5984/replicationtest_source
> # Should succeed
> curl -H "Content-Type:application/json" -X POST -d @fffe_utf8.json http://localhost:5984/replicationtest_source
> # Should fail with a json_encode error caused by the previously inserted document
> curl -H "Content-Type:application/json" -X POST -d "{\"source\":\"http://localhost:5984/replicationtest_source\",\"target\":\"replicationtest_target\"}" http://localhost:5984/_replicate
> If anyone has a quick fix for this (how to accept "invalid" escaped unicode 
> characters at least during replication), we would be more than happy to test 
> it.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
