Hi Peter,
I think that allowing raw UTF-8 sequences in uri-generic breaks
compatibility with RFC 3986. In other words, if you construct a URI from a
string that contains non-ASCII UTF-8 sequences, their octets will not get
percent-encoded, and you could end up sending an invalid URI to a legacy
system that does not understand UTF-8. For example, the UTF-8 string "пиле"
consists of the octets D0 BF D0 B8 D0 BB D0 B5. All of these octets lie
outside the ASCII range, and therefore outside the character set allowed by
RFC 3986, so the uri-reference constructor currently rejects them, which is
the correct behavior. However, if we allow UTF-8 string operations in
uri-generic and extend the unreserved character set to include Unicode,
these octets will form a valid character sequence and will be accepted by
uri-reference without being escaped. If you then send the result of
uri->string to a system that does not understand UTF-8, the URI will be
rejected.
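
To make the octets in the example above concrete, here is a minimal sketch.
It is written in Python rather than Scheme purely so that it is easy to
check, and it does not use or model uri-generic itself. It only shows that
every octet of "пиле" is outside ASCII and what the percent-encoded,
RFC 3986-safe form looks like.

```python
# Illustration only: a Python stand-in, not uri-generic code.
from urllib.parse import quote

word = "пиле"
octets = word.encode("utf-8")

# Every octet of a non-ASCII UTF-8 sequence is >= 0x80, i.e. outside ASCII.
hex_octets = " ".join(f"{b:02X}" for b in octets)
all_non_ascii = all(b >= 0x80 for b in octets)

# Percent-encoding each octet yields a pure-ASCII string.
# safe="" means no character is exempted from encoding.
escaped = quote(word, safe="")

print(hex_octets)     # D0 BF D0 B8 D0 BB D0 B5
print(all_non_ascii)  # True
print(escaped)        # %D0%BF%D0%B8%D0%BB%D0%B5
```

A receiver that only understands ASCII can still parse the escaped form,
whereas the raw octets would reach it unescaped.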
My proposed solution is to add a UTF-8-aware constructor to uri-generic and
to prevent percent-decoding of UTF-8 sequences. I believe that this solution
is compatible with the IRI-to-URI mapping described in Section 3.1 of RFC
3987, but I still need to extend the uri-generic test suite with more UTF-8
examples to ensure that nothing is broken. I think that any solution will
have to give the user the choice of whether to use ASCII or UTF-8, and not
just default to UTF-8.
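
As a rough sketch of the idea (again in Python, and not the actual
uri-generic API), the mapping in Section 3.1 of RFC 3987 can be approximated
by percent-encoding the UTF-8 octets of non-ASCII characters and leaving all
ASCII characters alone. The names iri_to_uri, make_uri_string and allow_utf8
are hypothetical. The sketch simplifies the RFC by treating every non-ASCII
character as one to be escaped, and it skips the normalization step.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 octets of non-ASCII characters only.

    ASCII characters, including reserved ones and existing %XX escapes,
    pass through unchanged. (Simplified reading of RFC 3987, Section 3.1.)
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


def make_uri_string(text: str, allow_utf8: bool = False) -> str:
    """Hypothetical front end: ASCII-only unless UTF-8 is explicitly requested."""
    if allow_utf8:
        return iri_to_uri(text)
    if not text.isascii():
        raise ValueError("non-ASCII input; pass allow_utf8=True to map it to a URI")
    return text


print(make_uri_string("http://example.org/пиле?q=a%20b", allow_utf8=True))
# http://example.org/%D0%BF%D0%B8%D0%BB%D0%B5?q=a%20b
```

With the default allow_utf8=False the same input raises an error, so the
ASCII-only behavior stays the default and UTF-8 handling is opt-in.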
Ivan
On Thu, Jan 17, 2013 at 4:51 AM, Peter Bex wrote:
>
> OK, I took some time to investigate and I pinpointed this problem.
> This appears to happen due to the use of core srfi-14 and srfi-13 in
> uri-generic; its char-set operations simply don't deal with anything
> beyond ASCII. It only works after switching to the UTF-8 versions,
> utf8-srfi-14, utf8-srfi-13 and unicode-char-sets:
>
> Without patch:
> $ csi -R uri-generic -P '(uri-encode-string "삼계탕")'
> "�%82%BC�%B3%84�%83%95"
>
> With patch:
> $ csi -R uri-generic -P '(uri-encode-string "삼계탕")'
> "%EC%82%BC%EA%B3%84%ED%83%95"
>
> Ivan, what do you think about adding the UTF8 dependency, as per the
> attached patch (against trunk)?
>
>
_______________________________________________
Chicken-users mailing list
https://lists.nongnu.org/mailman/listinfo/chicken-users