[
https://issues.apache.org/jira/browse/COUCHDB-345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749164#action_12749164
]
Noah Slater commented on COUCHDB-345:
-------------------------------------
I disagree with Curt.
The JSON RFC is either wrong or carelessly worded. You cannot encode anything
as Unicode, because Unicode is not an encoding; it is a collection of code
points that have no single binary representation. You can encode those code
points into character data, and you can decode the same character data back
into Unicode. Unicode is always some internal representation that exists after
decoding and before encoding. I am guessing everyone already knows this, but I
keep seeing people form arguments (particularly on IRC) that start with "since
JSON has to be encoded as Unicode". That sentence is meaningless (and the RFC
is to blame, as it uses this wording), so the conclusions drawn from it have
tended to be false.
In this case, what the JSON RFC should say is that JSON should be encoded from
Unicode, which means that the encoding could be anything from ISO-8859-1 to
Shift JIS, which means that we cannot "unambiguously determine the encoding
from the content." Even if we decided to only allow UTF-8, UTF-16, or UTF-32,
we could only "unambiguously determine the encoding" if the request body
included the BOM, which is entirely optional. So again, without the
Content-Encoding information, we are forced to use a heuristic. Heuristics
already exist, and where they are not already available in Erlang, I rather
suspect that they can be ported with relative ease.
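To make the heuristic point concrete: RFC 4627 itself sketches one such
heuristic for the UTF family, based on the BOM and on where null bytes fall in
the first four octets (the first two characters of any JSON text are ASCII). A
minimal sketch in Python follows; the function name is hypothetical, and this
is an illustration of the technique, not CouchDB code. Note it cannot tell
UTF-8 apart from ISO-8859-1 or Shift JIS, which is exactly the ambiguity
described above.

```python
def sniff_json_encoding(data: bytes) -> str:
    """Guess the Unicode encoding of a JSON byte string (hypothetical helper)."""
    # A BOM, when present, identifies the encoding directly.
    boms = [
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xff\xfe\x00\x00", "utf-32-le"),
        (b"\xfe\xff", "utf-16-be"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xef\xbb\xbf", "utf-8"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    # No BOM: use the RFC 4627 null-byte pattern of the first four
    # octets. Check the UTF-32 patterns first, since they also match
    # the corresponding UTF-16 patterns.
    if len(data) >= 4:
        if data[0] == 0 and data[1] == 0 and data[2] == 0:
            return "utf-32-be"
        if data[1] == 0 and data[2] == 0 and data[3] == 0:
            return "utf-32-le"
        if data[0] == 0 and data[2] == 0:
            return "utf-16-be"
        if data[1] == 0 and data[3] == 0:
            return "utf-16-le"
    # Fallback: assume UTF-8 (indistinguishable from Latin-1 here).
    return "utf-8"
```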
If we can access the Content-Encoding, we should absolutely use it, and
absolutely reject as garbage any request that could not be decoded with the
explicit encoding. Any patch that willfully ignored this information only to
fall back onto a heuristic would get my emphatic veto. I am however satisfied
with requiring UTF-8 in the short term, and adding Content-Encoding awareness
at some later point.
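Under the short-term UTF-8-only rule, the "reject as garbage" policy amounts
to decoding strictly and refusing the request on failure, rather than storing
bytes that cannot round-trip. A minimal sketch, with a hypothetical helper
name and error message (not CouchDB's actual API):

```python
def validate_utf8_body(body: bytes) -> str:
    """Strictly decode a request body as UTF-8, or reject it (sketch)."""
    try:
        return body.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        # In an HTTP handler this would become a 400-style response.
        raise ValueError(
            f"invalid UTF-8 at byte {err.start}: request rejected"
        ) from err
```

The crucial design choice is `errors="strict"`: any lenient mode (replacing
or dropping bad bytes) would silently store data the client never sent.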
> "High ASCII" can be inserted into db but not retrieved
> ------------------------------------------------------
>
> Key: COUCHDB-345
> URL: https://issues.apache.org/jira/browse/COUCHDB-345
> Project: CouchDB
> Issue Type: Bug
> Affects Versions: 0.9
> Environment: OSX 10.5.6
> Reporter: Joan Touzet
> Attachments: badenc1.patch, badtext.tar.gz, enctest.zip,
> reject_invalid_utf8.patch
>
>
> It is possible to PUT/POST a document into CouchDB with a "high ASCII" value
> that cannot be retrieved. This results from not escaping a non-ASCII value
> into \u#### when PUT/POSTing the document.
> The attached sample code will recreate the problem using the hex value D8 (Ø)
> in a possibly unsavoury test string.
> Sample output against 0.9.0 is as follows:
> ================================================
> {
> "ok": true
> }
> {
> "id": "fail",
> "ok": true,
> "rev": "1-76726372"
> }
> {
> "error": "ucs",
> "reason": "{bad_utf8_character_code}"
> }
> ================================================
> Please note this defect turned up another problem, namely that the
> bad_utf8_character_code exception thrown by a design document attempting to
> map() the bad document caused Futon to fail silently in building the view,
> with no indication (except via debug log) that there was a failure. The log
> indicated two attempts to build the view, both failing, followed by an
> uncaught exception error for Futon.
> Based on this, there are likely other areas in the codebase that do not
> handle the bad_utf8_character_code exception correctly.
> My belief is that CouchDB shouldn't accept this input and should have
> rejected the PUT/POST, or should have escaped the input itself before the
> insertion.
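The byte named in the report illustrates the failure in miniature (ordinary
Python here, not CouchDB internals): 0xD8 is "Ø" in ISO-8859-1 but an invalid
lead byte in UTF-8, so once stored raw it can no longer be decoded, while
escaping it to \u00d8 at insert time, as the report suggests, avoids the
problem entirely.

```python
import json

raw = b"\xd8"                                  # the byte from the test string
assert raw.decode("iso-8859-1") == "\u00d8"    # perfectly valid as Latin-1

decodable = True
try:
    raw.decode("utf-8")                        # invalid UTF-8 lead byte
except UnicodeDecodeError:
    decodable = False                          # analogue of bad_utf8_character_code
assert not decodable

# Escaping non-ASCII at PUT/POST time keeps the stored JSON pure ASCII.
assert json.dumps("\u00d8") == '"\\u00d8"'
```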
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.