Hi again,

   I have now extended the utf8 code in uri-generic, so that UTF-8
sequences are percent-encoded as lists of the form '(% h1 h2 [% h3 h4
...]). The percent-decoding routine does not decode sequences of
more than one byte, so percent-encoding normalization will no longer
interfere with encoded UTF-8 sequences. I have also renamed the iri->uri
routine to utf8-string->uri. I think its behavior is now compliant with
both RFC 3986 and 3987:

(utf8-string->uri "http://example.com/삼계탕") =>
#(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
"%EC%82%BC%EA%B3%84%ED%83%95") query=#f fragment=#f)

(uri->string (utf8-string->uri "http://example.com/삼계탕")) =>
"http://example.com/%EC%82%BC%EA%B3%84%ED%83%95"
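
To make the intended behavior concrete, here is a minimal sketch in Python
rather than Scheme. It is not the uri-generic code: the function names are
hypothetical, it encodes every non-ASCII character (RFC 3987 restricts this
to the ucschar and iprivate ranges), and the normalization step assumes the
RFC 3986 rule that only percent-triplets standing for unreserved ASCII
characters are decoded.

```python
import re

# Unreserved characters per RFC 3986, section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def percent_encode_non_ascii(text):
    """Replace each non-ASCII character with its percent-encoded UTF-8 bytes.

    Hypothetical helper; ASCII characters are passed through untouched.
    """
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


def normalize_percent_encoding(text):
    """Decode only triplets that encode unreserved ASCII characters.

    Every other triplet, including each byte of a multi-byte UTF-8
    sequence, is kept encoded and its hex digits are uppercased.
    """
    def fix(match):
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch
        return "%" + match.group(1).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)
```

With these definitions, percent_encode_non_ascii applied to
"http://example.com/삼계탕" gives the same string as the uri->string result
above, and passing that string through normalize_percent_encoding leaves it
unchanged, because none of the encoded bytes is an unreserved ASCII
character.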

The code is available here:

http://bugs.call-cc.org/browser/release/4/uri-generic/branches/utf8

Sungjin, can you take a look at this code and see if it works for you?

  Ivan

On Tue, Jan 15, 2013 at 1:22 PM, Ivan Raikov <[email protected]> wrote:

> Hi all,
>
>    I realized that I replied only to Sungjin and neglected to include the
> mailing list, so let me repeat.
>
> Section 3.1 of RFC 3987 defines a mapping between IRIs and URIs such that
> UTF-8 sequences are percent-encoded.
> So I implemented a procedure iri->uri, which percent-encodes a UTF-8
> string and passes it to the usual URI constructor in uri-generic.
> It is intended to work as follows:
>
> (iri->uri "http://example.com/삼계탕") =>
> #(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
> "%EC%82%BC%EA%B3%84%ED%83%95") query=#f fragment=#f)
>
> However, the uri-generic constructor tries to normalize all URIs by
> percent decoding them, so currently the URL above results in this:
>
> #(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
> "�%82%BC�%B3%84�%83%95") query=#f fragment=#f)
>
>
>   In other words, the first byte of each percent-encoded UTF-8 sequence is
> decoded back to a raw non-ASCII byte, which is not printable on its own.
> So a better solution might indeed be to change iri->uri to pass the
> percent-encoded sequences directly to make-uri without attempts at
> percent-decoding normalization.
>
>   Sungjin's modification to the definition of 'unstructured' is in line
> with the IRI RFC (except of course we will need to add all other character
> sets besides Hangul).
> However, it was already pointed out by Peter and Alex that URIs containing
> native UTF-8 sequences might result in invalid URLs being sent to systems
> that do not understand IRIs or UTF-8.
>
> I will modify iri->uri to avoid normalization and see if this would
> produce ok results.
>
>   Ivan
>
> On Tue, Jan 15, 2013 at 12:20 PM, Alex Shinn <[email protected]> wrote:
>
>> =삼계탕&start=0&rows=10<http://127.0.0.1:8983/solr/select?q=%EC%82%BC%EA%B3%84%ED%83%95&start=0&rows=10>
>
_______________________________________________
Chicken-users mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/chicken-users