[ https://issues.apache.org/jira/browse/COUCHDB-345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749164#action_12749164 ]

Noah Slater commented on COUCHDB-345:
-------------------------------------

I disagree with Curt.

The JSON RFC is either wrong or carelessly worded. You cannot encode anything 
as Unicode, because Unicode is not an encoding; it is a collection of code 
points that have no single binary representation. You can encode those code 
points into character data, and you can decode the same character data back 
into Unicode. Unicode is always some internal representation that exists after 
decoding and before encoding. I am guessing everyone already knows this, but I 
keep seeing people form arguments (particularly on IRC) that start with "since 
JSON has to be encoded as Unicode", which is a meaningless sentence (and the 
RFC is to blame, as it uses this wording), so the conclusions drawn from it 
have tended to be false.
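To make the distinction concrete, here is a minimal sketch in Python (chosen
only for brevity; CouchDB itself is Erlang): a Unicode string is a sequence of
abstract code points, and bytes exist only after you pick an encoding.

```python
s = "\u00d8"                  # one code point, U+00D8 (LATIN CAPITAL LETTER O WITH STROKE)
print(hex(ord(s)))            # 0xd8 -- the code point, not any byte

# The same code point has a different byte representation per encoding:
print(s.encode("utf-8"))      # b'\xc3\x98'
print(s.encode("iso-8859-1")) # b'\xd8'
print(s.encode("utf-16-le"))  # b'\xd8\x00'
```

Only after `encode()` is there anything to put on the wire; "encoded as
Unicode" names no particular byte sequence at all.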

In this case, what the JSON RFC should say is that JSON is encoded from 
Unicode, which means that the encoding could be anything from ISO-8859-1 to 
Shift JIS, and hence that we cannot "unambiguously determine the encoding 
from the content." Even if we decided to allow only UTF-8, UTF-16, or UTF-32, 
we could only "unambiguously determine the encoding" if the request body 
included a BOM, which is entirely optional. So again, without an explicit 
charset from the Content-Type header, we are forced to use a heuristic. Such 
heuristics already exist, and where they are not yet available in Erlang, I 
rather suspect that they can be ported with relative ease.
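For what it is worth, RFC 4627 itself describes one such heuristic: since the
first two characters of a (4627-era) JSON text are always ASCII, the UTF
family can be guessed from the pattern of NUL bytes in the first four octets.
A minimal sketch in Python (the function name is my own):

```python
def sniff_json_encoding(data):
    """Guess the UTF family of a JSON text from its first four bytes.

    Based on the byte-pattern table in RFC 4627 section 3; assumes the
    text begins with two ASCII characters and carries no BOM."""
    if len(data) < 4:
        return "utf-8"              # too short to sniff; fall back
    b = data[:4]
    if b[0] == 0 and b[1] == 0 and b[2] == 0:
        return "utf-32-be"          # 00 00 00 xx
    if b[1] == 0 and b[2] == 0 and b[3] == 0:
        return "utf-32-le"          # xx 00 00 00
    if b[0] == 0 and b[2] == 0:
        return "utf-16-be"          # 00 xx 00 xx
    if b[1] == 0 and b[3] == 0:
        return "utf-16-le"          # xx 00 xx 00
    return "utf-8"                  # no NULs: UTF-8 (or any ASCII superset)
```

Note the final case is exactly where the ambiguity lives: a NUL-free body
could be UTF-8, ISO-8859-1, or Shift JIS, and no sniffer can tell them apart.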

If the request declares a charset, we should absolutely use it, and 
absolutely reject as garbage any request body that cannot be decoded with 
that explicit encoding. Any patch that willfully ignored this information 
only to fall back onto a heuristic would get my emphatic veto. I am, however, 
satisfied with requiring UTF-8 in the short term and adding charset awareness 
at some later point.
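To pin down the decode-or-reject behaviour I mean, here is a hypothetical
sketch in Python (the function name and the error string are illustrative,
not CouchDB code):

```python
def decode_body(body, declared_charset=None):
    """Decode a request body with the declared charset, or reject it.

    If a charset was declared and the bytes do not decode under it, the
    request is garbage: fail loudly rather than fall back to a heuristic."""
    if declared_charset is not None:
        try:
            return body.decode(declared_charset)
        except (UnicodeDecodeError, LookupError):
            raise ValueError("400 Bad Request: body is not valid "
                             + declared_charset)
    # Short-term position: with no declared charset, require UTF-8.
    return body.decode("utf-8")
```

Under this rule the reporter's lone 0xD8 byte is accepted if the request says
ISO-8859-1 and rejected if it says (or defaults to) UTF-8.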

> "High ASCII" can be inserted into db but not retrieved
> ------------------------------------------------------
>
>                 Key: COUCHDB-345
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-345
>             Project: CouchDB
>          Issue Type: Bug
>    Affects Versions: 0.9
>         Environment: OSX 10.5.6
>            Reporter: Joan Touzet
>         Attachments: badenc1.patch, badtext.tar.gz, enctest.zip, 
> reject_invalid_utf8.patch
>
>
> It is possible to PUT/POST a document into CouchDB with a "high ASCII" value 
> that cannot be retrieved. This results from not escaping a non-ASCII value 
> into \u#### when PUT/POSTing the document.
> The attached sample code will recreate the problem using the hex value D8 (Ø) 
> in a possibly unsavoury test string.
> Sample output against 0.9.0 is as follows:
> ================================================
> {
>     "ok": true
> }
> {
>     "id": "fail", 
>     "ok": true, 
>     "rev": "1-76726372"
> }
> {
>     "error": "ucs", 
>     "reason": "{bad_utf8_character_code}"
> }
> ================================================
> Please note this defect turned up another problem, namely that the 
> bad_utf8_character_code exception thrown by a design document attempting to 
> map() the bad document caused Futon to fail silently in building the view, 
> with no indication (except via debug log) that there was a failure. The log 
> indicated two attempts to build the view, both failing, followed by an 
> uncaught exception error for Futon.
> Based on this, there are likely other areas in the codebase that do not 
> handle the bad_utf8_character_code exception correctly.
> My belief is that CouchDB shouldn't accept this input and should have 
> rejected the PUT/POST, or should have escaped the input itself before the 
> insertion.
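(For reference, the single 0xD8 byte from the report above is a UTF-8 lead
byte with no continuation byte, which is why the decoder throws; the very
same byte is valid ISO-8859-1. A quick Python demonstration:)

```python
raw = b"\xd8"                        # the reporter's "high ASCII" byte alone
try:
    raw.decode("utf-8")              # 0xD8 starts a 2-byte UTF-8 sequence
except UnicodeDecodeError:
    print("rejected as UTF-8")       # ...so a lone 0xD8 is malformed
print(raw.decode("iso-8859-1"))      # the same byte decodes to U+00D8
```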

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
