Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

Sungjin Chun Mon, 14 Jan 2013 14:35:53 -0800

Thank you very much. :-)

My proposed hack (yes, a hack, not a solution) works for me, but I found that
it is simply wrong with respect to the RFC.
I'll try your modification and let you know whether it works or not.


Thank you again.


On Mon, Jan 14, 2013 at 5:08 PM, Ivan Raikov <[email protected]> wrote:

> Hi Sungjin,
>
>    Thanks for trying to use the uri-generic library. As Peter already
> pointed out, uri-generic and uri-common are intended to implement RFC 3986
> (URIs), and so far no effort has been made to support RFC 3987 (IRIs).
> However, the IRI RFC does define a mapping from IRI to URI, where Unicode
> characters in IRIs are converted to percent-encoded UTF-8 sequences. The
> caveat here is that if you try to decode these percent-encoded sequences
> they will likely result in invalid URI characters. I have prototyped a
> procedure iri->uri which attempts to percent-encode all UTF-8 sequences in
> the input string and create a URI. You can see it here:
>
>
> http://bugs.call-cc.org/browser/release/4/uri-generic/branches/utf8/uri-generic.scm
>
> You can try iri->uri as follows:
>
> (use uri-generic)
> (print (iri->uri "http://example.com/삼계탕"))
> (URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
> "�%82%BC�%B3%84�%83%95") query=#f fragment=#f)
>
>   Note that the URI constructor still tries to percent-decode all
> characters in the path, and in this example this results in unprintable
> characters being displayed. So I will probably need to add a field to the
> URI structure that indicates if UTF-8 sequences are included and avoid
> percent-decoding altogether. Would this be sufficient for your needs?
>
>   Your proposed solution to extend the definition of the 'unreserved'
> character set is in line with RFC 3987, but I need to look some more at the
> code and see whether it would be possible to have an API where the user can
> choose whether to use IRIs or URIs. I prefer not to use UTF-8 sequences by
> default, since this might result in a uri-generic-based client sending
> invalid URIs to a server. Let me know what the exact requirements of your
> application are, and perhaps we can come up with a simple solution.
>
>   Ivan
>
>
>
> On Mon, Jan 14, 2013 at 1:36 PM, Sungjin Chun <[email protected]> wrote:
>
>> As far as I know, the revised RFC permits UTF-8 characters in the URL without
>> encoding. Am I wrong here?
>> Even Solr (the search engine) permits them.
>>
>>
>>
>> On Mon, Jan 14, 2013 at 1:26 PM, Alex Shinn <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> On Mon, Jan 14, 2013 at 12:52 PM, Sungjin Chun <[email protected]> wrote:
>>>
>>>> First, I might have found the wrong place, but...
>>>>
>>>> It seems that the main source of my problem is related to this part
>>>> of uri-generic.scm:
>>>>
>>>> (define char-set:uri-unreserved
>>>>   (char-set-union char-set:letter+digit (string->char-set "-_.~")))
>>>>
>>>> If I change this part to:
>>>>
>>>> (define char-set:uri-unreserved
>>>>   (char-set-union char-set:letter+digit (string->char-set "-_.~")
>>>>                   char-set:hangul))
>>>>
>>>> then URIs/URLs with Korean characters work. How can I make this part
>>>> more generic?
>>>>
>>>
>>> I believe the ASCII definition is correct even for Unicode URLs.
>>> You need to represent the URL in utf8 and then use percent
>>> escapes on the utf8 bytes, which is what would happen naturally
>>> here.
>>>
>>> --
>>> Alex
>>>
>>>
>>
>> _______________________________________________
>> Chicken-users mailing list
>> [email protected]
>> https://lists.nongnu.org/mailman/listinfo/chicken-users
>>
>>
>
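
For reference, the approach Alex describes in the quoted message (take the
UTF-8 bytes of the URL and percent-escape each non-ASCII byte) can be sketched
roughly as follows. This is only a minimal sketch, not the iri->uri procedure
from the uri-generic utf8 branch. The names hex-digits, byte->percent and
percent-encode-non-ascii are hypothetical, and the code assumes CHICKEN 4 core
strings without the utf8 egg loaded, so that each character of a UTF-8 string
is a single byte.

```scheme
;; Hypothetical helpers, not part of uri-generic or uri-common.
;; Assumption: strings are byte strings (CHICKEN 4 without the utf8 egg),
;; so string->list yields one char per UTF-8 byte.

(define hex-digits "0123456789ABCDEF")

;; Turn one byte value (0..255) into its "%XX" escape, uppercase hex.
(define (byte->percent b)
  (string #\%
          (string-ref hex-digits (quotient b 16))
          (string-ref hex-digits (remainder b 16))))

;; Percent-escape every byte >= 128; ASCII bytes pass through unchanged.
(define (percent-encode-non-ascii str)
  (apply string-append
         (map (lambda (c)
                (let ((b (char->integer c)))
                  (if (< b 128)
                      (string c)
                      (byte->percent b))))
              (string->list str))))

(print (percent-encode-non-ascii "http://example.com/삼계탕"))
```

Under that byte-string assumption the call should print
http://example.com/%EC%82%BC%EA%B3%84%ED%83%95, which is plain ASCII and so
stays within the RFC 3986 character set. The sketch deliberately leaves ASCII
characters alone, including ones such as spaces that are not valid in a URI,
so it is no substitute for a full IRI-to-URI conversion.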

