Hi again, I have now extended the utf8 code in uri-generic, so that UTF-8 sequences are percent-encoded as lists of the form '(% h1 h2 [% h3 h4 ...]). The percent-decoding routine no longer decodes sequences of more than one byte, so percent-encoding normalization does not interfere with encoded UTF-8 sequences. I have also renamed the iri->uri routine to utf8-string->uri. I think its behavior is now compliant with both RFC 3986 and RFC 3987:

  (utf8-string->uri "http://example.com/삼계탕")
  => #(URI scheme=http authority=#(URIAuth host="example.com" port=#f)
       path=(/ "%EC%82%BC%EA%B3%84%ED%83%95") query=#f fragment=#f)

  (uri->string (utf8-string->uri "http://example.com/삼계탕"))
  => "http://example.com/%EC%82%BC%EA%B3%84%ED%83%95"

The code is available here:

  http://bugs.call-cc.org/browser/release/4/uri-generic/branches/utf8

Sungjin, can you take a look at this code and see if it works for you?

  Ivan

On Tue, Jan 15, 2013 at 1:22 PM, Ivan Raikov <[email protected]> wrote:

> Hi all,
>
> I realized that I replied only to Sungjin and neglected to include the
> mailing list, so let me repeat.
>
> Section 3.1 of RFC 3987 defines a mapping between IRIs and URIs such that
> UTF-8 sequences are percent-encoded.
> So I implemented a procedure iri->uri, which percent-encodes a UTF-8
> string and passes it to the usual URI constructor in uri-generic.
> It is intended to work as follows:
>
> (iri->uri "http://example.com/삼계탕") =>
> #(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
> "%EC%82%BC%EA%B3%84%ED%83%95") query=#f fragment=#f)
>
> However, the uri-generic constructor tries to normalize all URIs by
> percent-decoding them, so currently the URL above results in this:
>
> #(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
> "�%82%BC�%B3%84�%83%95") query=#f fragment=#f)
>
> In other words, parts of the percent-encoded UTF-8 sequences are decoded
> back to unprintable non-ASCII bytes.
> So a better solution might indeed be to change iri->uri to pass the
> percent-encoded sequences directly to make-uri without attempts at
> percent-decoding normalization.
>
> Sungjin's modification to the definition of 'unstructured' is in line
> with the IRI RFC (except of course we will need to add all other character
> sets besides Hangul).
> However, it was already pointed out by Peter and Alex that URIs containing
> native UTF-8 sequences might result in invalid URLs being sent to systems
> that do not understand IRIs or UTF-8.
>
> I will modify iri->uri to avoid normalization and see if this would
> produce ok results.
>
> Ivan
>
> On Tue, Jan 15, 2013 at 12:20 PM, Alex Shinn <[email protected]> wrote:
>
>> =삼계탕&start=0&rows=10<http://127.0.0.1:8983/solr/select?q=%EC%82%BC%EA%B3%84%ED%83%95&start=0&rows=10>
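
For anyone following the thread without the branch checked out, the behavior described at the top of this message can be sketched roughly as below. This is a Python illustration, not the uri-generic code: the names utf8_percent_encode and normalize_pct are made up for the example, and the decoder assumes the RFC 3986 rule that normalization decodes only escapes of unreserved ASCII characters, so escapes belonging to a multi-byte UTF-8 sequence are left alone.

```python
import re

# Unreserved characters per RFC 3986, section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def utf8_percent_encode(text):
    """Percent-encode each non-ASCII character as its UTF-8 bytes.

    ASCII characters are passed through untouched (hypothetical helper,
    loosely following the IRI-to-URI mapping of RFC 3987, section 3.1).
    """
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


def normalize_pct(text):
    """Decode only escapes of unreserved ASCII characters.

    Every other escape, including each byte of an encoded UTF-8
    sequence, is kept and its hex digits are upper-cased.
    """
    def repl(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0).upper()

    return _PCT.sub(repl, text)


encoded = utf8_percent_encode("http://example.com/삼계탕")
normalized = normalize_pct(encoded)
```

With this approach the encoded path survives normalization unchanged, while an escape such as %7E is still decoded to ~, which is the property the earlier iri->uri version was missing.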
_______________________________________________
Chicken-users mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/chicken-users
