[ 
https://issues.apache.org/jira/browse/COUCHDB-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990974#comment-12990974
 ] 

He Shiming commented on COUCHDB-760:
------------------------------------

Hi again. After debugging for half a day, I've got some insights regarding the 
failure of the test.

couch_util.validate_utf8_fast is mostly correct, with minor problems. According 
to Wikipedia, the clauses for multi-byte sequences should look like this:

    <<_:O/binary, C1, C2, _/binary>> when
            C1 >= 192, C1 =< 223,
            C2 >= 128, C2 =< 191 ->
        validate_utf8_fast(B, 2 + O);
    <<_:O/binary, C1, C2, C3, _/binary>> when
            C1 >= 224, C1 =< 239,
            C2 >= 128, C2 =< 191,
            C3 >= 128, C3 =< 191 ->
        validate_utf8_fast(B, 3 + O);
    <<_:O/binary, C1, C2, C3, C4, _/binary>> when
            C1 >= 240, C1 =< 247,
            C2 >= 128, C2 =< 191,
            C3 >= 128, C3 =< 191,
            C4 >= 128, C4 =< 191 ->
        validate_utf8_fast(B, 4 + O);
    _ ->

After this change the routine is theoretically correct. I extracted it and 
tested it against some strings, and it gave correct results.
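
For anyone who wants to try the same range checks outside CouchDB, here is a 
minimal JavaScript sketch that mirrors the clauses above. The function name 
isValidUtf8 is hypothetical and not part of CouchDB, and the single-byte ASCII 
branch is an assumption, since that clause is not shown in the fragment.

```javascript
// Hypothetical helper, not part of CouchDB: applies the same byte-range
// checks as the Erlang clauses above to an array of byte values.
function isValidUtf8(bytes) {
  // Continuation bytes have the form 10xxxxxx, i.e. 128..191.
  const isContinuation = (b) => b >= 128 && b <= 191;
  let i = 0;
  while (i < bytes.length) {
    const c1 = bytes[i];
    let len;
    if (c1 <= 127) len = 1;                      // ASCII (assumed earlier clause)
    else if (c1 >= 192 && c1 <= 223) len = 2;    // two-byte sequence
    else if (c1 >= 224 && c1 <= 239) len = 3;    // three-byte sequence
    else if (c1 >= 240 && c1 <= 247) len = 4;    // four-byte sequence
    else return false;                           // stray continuation or bad lead byte
    if (i + len > bytes.length) return false;    // sequence is truncated
    for (let k = 1; k < len; k++) {
      if (!isContinuation(bytes[i + k])) return false;
    }
    i += len;
  }
  return true;
}

const cyrillic = Array.from(new TextEncoder().encode("Колян.txt"));
console.log(isValidUtf8(cyrillic));                                      // true
console.log(isValidUtf8([102, 111, 111, 128, 116, 120, 116]));           // false
console.log(isValidUtf8([102, 111, 111, 194, 128, 116, 120, 116]));      // true
```

Like the clauses it mirrors, this checks byte ranges only; it does not reject 
overlong forms or surrogate code points.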

Regarding the tests, the first one is easy to fix. Since you are saving 
"Колян.txt", you should retrieve it by that name: var xhr = CouchDB.request("GET", 
"/test_suite_db/good_doc/Колян.txt"); . This test then actually passes.

I'm not able to fix the rest of the tests, and their failures seem related. 
After debugging, I discovered that the JavaScript string "foo\x80txt", which is 
meant to be incorrectly encoded UTF-8, is altered before Erlang gets to see it.

couch_util.validate_utf8_fast is supposed to see <<102, 111, 111, 128, 116, 
120, 116>>. But it saw <<102,111,111,194,128,116,120,116>> instead.

Either the browser or the Erlang httpd has attempted to fix the incorrect UTF-8 
encoding, making it impossible for CouchDB to see it. Since the original code 
ruled out every byte from 128 up, the test passed with it.
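
The byte sequence seen on the Erlang side is what you get when the JavaScript 
string is serialized as UTF-8. The sketch below only illustrates that 
conversion; it does not establish which component in this setup performs it. 
In JavaScript, "\x80" is the code point U+0080, not a raw byte, and its UTF-8 
encoding is the two bytes 194, 128.

```javascript
// "\x80" in a JavaScript string is the code point U+0080, not the byte 0x80.
const name = "foo\x80txt";

// Encoding the string as UTF-8 turns U+0080 into the two bytes 194, 128.
const bytes = Array.from(new TextEncoder().encode(name));
console.log(JSON.stringify(bytes)); // [102,111,111,194,128,116,120,116]

// Percent-encoding for a URL goes through the same UTF-8 step.
console.log(encodeURIComponent(name)); // foo%C2%80txt
```

So any path that starts from a JavaScript string and goes through a UTF-8 
encoder can only ever produce well-formed UTF-8.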

So in order for UTF-8 attachment names to work, this test will need to be 
rewritten. I tried other variations of the string, but I was unable to get 
past the "encoding fix". CouchDB always sees correctly encoded input.

> Put attachments with cyrillic names is fail.
> --------------------------------------------
>
>                 Key: COUCHDB-760
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-760
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.11
>         Environment: Distributor ID:  Ubuntu
> Description:  Ubuntu 9.10
> Release:      9.10
> Codename:     karmic
>            Reporter: Antonio
>              Labels: attachments
>         Attachments: COUCHDB-760.patch, couchdb_760.patch
>
>
> I try upload any file with cyrillic name(like Колян.txt) and its fail 
> i try with futon.
> And create test http://friendpaste.com/WrVoFIOZb3T5r70Fz8XWB (see line 22):
> this test is fail with # Exception raised: 
> {"error":"bad_request","reason":"Attachment name is not UTF-8 encoded"}

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

