[
https://issues.apache.org/jira/browse/COUCHDB-345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749160#action_12749160
]
Curt Arnold commented on COUCHDB-345:
-------------------------------------
I looked at the Erlang test suite and didn't see anything that looked similar
to what I wanted to test. When I did "make check", I got a lot of lines like:
"test/etap/070-couch-db....................dubious"
Test returned status 127 (wstat 32512, 0x7f00)
Looks like it is beyond my skills at the moment to write an Erlang test that
would effectively test the changes.
If you wanted to interpret badly encoded UTF-8 as ISO-8859-1, I'd
suggest logging it, but in the catch, you could do something like:
mochijson2:decode(xmerl_ucs:to_utf8(binary_to_list(S)))
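A minimal sketch of the same fallback using only the standard unicode module, in case it helps. The module name latin1_fallback and the function to_utf8/1 are hypothetical, not part of CouchDB or the attached patches. It returns valid UTF-8 unchanged and otherwise reinterprets each byte as an ISO-8859-1 character:

```erlang
-module(latin1_fallback).
-export([to_utf8/1]).

%% Hypothetical helper, not CouchDB code.
%% Valid UTF-8 is returned unchanged. Anything else is treated as
%% ISO-8859-1, where every byte maps to the code point of the same value.
to_utf8(Bin) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin, utf8, utf8) of
        Utf8 when is_binary(Utf8) ->
            Utf8;
        _Invalid ->
            %% {error, _, _} or {incomplete, _, _}: a good place to log,
            %% then re-encode the raw bytes from Latin-1 to UTF-8.
            unicode:characters_to_binary(Bin, latin1, utf8)
    end.
```

The result could then be handed to mochijson2:decode/1 in the catch clause. This version works on the binary directly and does not go through binary_to_list.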
Sniffing the multi-byte encodings could also be done in couch_db:json_decode as
an enhancement. You don't have the Content-Encoding header available, so you
could not support users who explicitly try to send JSON in arbitrary encodings.
However, since JSON is supposed to be encoded in Unicode and you can
unambiguously determine the encoding from the content, I think it is best to
ignore the Content-Encoding even if it were available.
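As a rough illustration of the sniffing, here is a sketch that assumes the heuristic from RFC 4627 section 3: the first two characters of a JSON text are ASCII, so the pattern of zero bytes in the first four octets identifies the encoding. The module name json_sniff and the function encoding/1 are hypothetical and do not exist in couch_db:

```erlang
-module(json_sniff).
-export([encoding/1]).

%% Hypothetical helper, not CouchDB code.
%% Clause order matters: the UTF-32 patterns are checked before the
%% UTF-16 patterns they overlap with.
encoding(<<0, 0, 0, _, _/binary>>) -> {utf32, big};    %% 00 00 00 xx
encoding(<<0, _, 0, _, _/binary>>) -> {utf16, big};    %% 00 xx 00 xx
encoding(<<_, 0, 0, 0, _/binary>>) -> {utf32, little}; %% xx 00 00 00
encoding(<<_, 0, _, 0, _/binary>>) -> {utf16, little}; %% xx 00 xx 00
encoding(Bin) when is_binary(Bin)  -> utf8.            %% anything else
```

The return values are chosen so they can be passed as the input encoding to unicode:characters_to_binary/3 when converting the body to UTF-8 before parsing.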
I do not know the performance implications of the patch. It would likely be
cheaper if the binary could be scanned only and not extracted to a list, but I
think that has to be relatively minor compared to the JSON parsing.
> "High ASCII" can be inserted into db but not retrieved
> ------------------------------------------------------
>
> Key: COUCHDB-345
> URL: https://issues.apache.org/jira/browse/COUCHDB-345
> Project: CouchDB
> Issue Type: Bug
> Affects Versions: 0.9
> Environment: OSX 10.5.6
> Reporter: Joan Touzet
> Attachments: badenc1.patch, badtext.tar.gz, enctest.zip,
> reject_invalid_utf8.patch
>
>
> It is possible to PUT/POST a document into CouchDB with a "high ASCII" value
> that cannot be retrieved. This results from not escaping a non-ASCII value
> into \u#### when PUT/POSTing the document.
> The attached sample code will recreate the problem using the hex value D8 (Ø)
> in a possibly unsavoury test string.
> Sample output against 0.9.0 is as follows:
> ================================================
> {
> "ok": true
> }
> {
> "id": "fail",
> "ok": true,
> "rev": "1-76726372"
> }
> {
> "error": "ucs",
> "reason": "{bad_utf8_character_code}"
> }
> ================================================
> Please note this defect turned up another problem, namely that the
> bad_utf8_character_code exception thrown by a design document attempting to
> map() the bad document caused Futon to fail silently in building the view,
> with no indication (except via debug log) that there was a failure. The log
> indicated two attempts to build the view, both failing, followed by an
> uncaught exception error for Futon.
> Based on this, there are likely other areas in the codebase that do not
> handle the bad_utf8_character_code exception correctly.
> My belief is that CouchDB shouldn't accept this input and should have
> rejected the PUT/POST, or should have escaped the input itself before the
> insertion.