[
https://issues.apache.org/jira/browse/COUCHDB-345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749149#action_12749149
]
Curt Arnold commented on COUCHDB-345:
-------------------------------------
With the patch applied, the tests pass as they were intended to be written
(asserting that the PUT returns a 400). The attached versions demonstrated the
failure of the GET after the PUT, so they did not assert the 400 response. The
patch hunk for couch_httpd.erl is a little stale and needs to be applied
manually.
The patch modifies mochijson2, so it puts us in the position of diverging from
stock MochiWeb.
My thought was to put a call to unicode:characters_to_binary(Bin,utf8,utf8) in
the PUT code path. If the source Bin is valid UTF-8, the return value will be
identical. If not, it returns { error, "Valid characters",
<<MalformedStuff>> }. Support for the UTF-16 variants could be added at the
same place.
http://erlang.org/doc/apps/stdlib/unicode_usage.html and
http://erlang.org/doc/man/unicode.html mention that the implementation is
complete as documented in R13A, but I don't know how much, if any, of the
unicode module is present in R12B5. MochiWeb references xmerl_ucs, which isn't
in the docs but is apparently the UCS string support for the XML parser.
I'd suggest implementing a check/conversion on the PUT code path using the
unicode module and then adapting it to run on our minimum platform if that is
an issue.
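
A minimal sketch of what such a check might look like, assuming the stdlib
unicode module is available (R13A or later, per the docs linked above). The
module name utf8_check and the function validate_utf8/1 are hypothetical and
are not part of CouchDB or of the attached patch. Besides the error tuple,
the call can also return an incomplete tuple when the input ends partway
through a multi-byte sequence, so the sketch treats both as a rejection.

```erlang
-module(utf8_check).
-export([validate_utf8/1]).

%% Hypothetical helper: returns {ok, Bin} when Bin is well-formed UTF-8,
%% otherwise {error, invalid_utf8}. A caller on the PUT code path could
%% map the error case to a 400 response.
validate_utf8(Bin) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin, utf8, utf8) of
        Out when is_binary(Out) ->
            %% Valid input comes back as a binary.
            {ok, Out};
        {error, _ValidPrefix, _Malformed} ->
            %% A byte sequence that can never be valid UTF-8.
            {error, invalid_utf8};
        {incomplete, _ValidPrefix, _Truncated} ->
            %% Input ended in the middle of a multi-byte sequence.
            {error, invalid_utf8}
    end.
```

For example, utf8_check:validate_utf8(<<"abc">>) should return
{ok, <<"abc">>}, while a body containing the raw byte 16#D8 from the report
should be rejected, since D8 on its own is not a complete UTF-8 sequence.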
> "High ASCII" can be inserted into db but not retrieved
> ------------------------------------------------------
>
> Key: COUCHDB-345
> URL: https://issues.apache.org/jira/browse/COUCHDB-345
> Project: CouchDB
> Issue Type: Bug
> Affects Versions: 0.9
> Environment: OSX 10.5.6
> Reporter: Joan Touzet
> Attachments: badtext.tar.gz, enctest.zip, reject_invalid_utf8.patch
>
>
> It is possible to PUT/POST a document into CouchDB with a "high ASCII" value
> that cannot be retrieved. This results from not escaping a non-ASCII value
> into \u#### when PUT/POSTing the document.
> The attached sample code will recreate the problem using the hex value D8 (Ø)
> in a possibly unsavoury test string.
> Sample output against 0.9.0 is as follows:
> ================================================
> {
> "ok": true
> }
> {
> "id": "fail",
> "ok": true,
> "rev": "1-76726372"
> }
> {
> "error": "ucs",
> "reason": "{bad_utf8_character_code}"
> }
> ================================================
> Please note this defect turned up another problem, namely that the
> bad_utf8_character_code exception thrown by a design document attempting to
> map() the bad document caused Futon to fail silently in building the view,
> with no indication (except via debug log) that there was a failure. The log
> indicated two attempts to build the view, both failing, followed by an
> uncaught exception error for Futon.
> Based on this, there are likely other areas in the codebase that do not
> handle the bad_utf8_character_code exception correctly.
> My belief is that CouchDB shouldn't accept this input and should have
> rejected the PUT/POST, or should have escaped the input itself before the
> insertion.