Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

Alex Shinn Wed, 23 Jan 2013 00:09:16 -0800

On Wed, Jan 23, 2013 at 3:45 PM, Ivan Raikov <[email protected]> wrote:


> Yes, I ran into this when I was adding UTF-8 support to mbox... If you
> were to add wide char support in srfi-14, is there a way to quantify the
> performance penalty?
>

To add the bounds check so it doesn't error?  Practically
nothing.

To branch to a separate path for a wide-char table if
the bounds check fails?  Same cost if the input is ASCII.

For efficient handling in the case of Unicode input...
how small/fast do you want it?
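
To illustrate the first two options, here is a minimal sketch. It is
not the srfi-14 reference code: every name in it is hypothetical, and
it assumes a set is backed by a 256-entry table (as the string-ref
error quoted below suggests) plus a sorted vector of inclusive
code-point ranges for anything wider.

```scheme
;; Hypothetical sketch, not the srfi-14 reference implementation.
;; A set is a pair: (latin1-table . wide-ranges)
;;   latin1-table: 256-element vector of booleans, indexed by code point
;;   wide-ranges:  vector of (lo . hi) inclusive code-point ranges,
;;                 sorted by lo and non-overlapping

(define (make-sketch-set latin1-table wide-ranges)
  (cons latin1-table wide-ranges))

;; Binary search over the sorted ranges; returns #t if n falls in one.
(define (range-search ranges n)
  (let loop ((lo 0) (hi (- (vector-length ranges) 1)))
    (and (<= lo hi)
         (let* ((mid (quotient (+ lo hi) 2))
                (r (vector-ref ranges mid)))
           (cond ((< n (car r)) (loop lo (- mid 1)))
                 ((> n (cdr r)) (loop (+ mid 1) hi))
                 (else #t))))))

(define (sketch-set-contains? set ch)
  (let ((n (char->integer ch)))
    (if (< n 256)
        (vector-ref (car set) n)        ; fast path: one compare, one lookup
        (range-search (cdr set) n))))   ; only reached for wide chars

;; A "full" set covering every code point.
(define sketch-full
  (make-sketch-set (make-vector 256 #t)
                   (vector (cons 256 #x10FFFF))))

(sketch-set-contains? sketch-full #\a)     ; table lookup
(sketch-set-contains? sketch-full #\x100)  ; range search, no range error
```

For ASCII input the only added work is the (< n 256) comparison, which
is why the cost is the same as the plain bounds check. How fast the
wide path can be depends on the range representation chosen.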

-- 
Alex

On Wed, Jan 23, 2013 at 3:42 PM, Alex Shinn <[email protected]> wrote:

> On Thu, Jan 17, 2013 at 4:51 AM, Peter Bex <[email protected]> wrote:
>>
>>> On Tue, Jan 15, 2013 at 02:44:08PM +0900, Alex Shinn wrote:
>>> > This result looks broken.  As I noted in my previous mail, the URI
>>> > representation already handles non-ASCII characters and escapes on
>>> > output:
>>> >
>>> > $ csi -R uri-common
>>> > #;1> (make-uri scheme: "http" host: "127.0.0.1" path: '(/ "삼계탕"))
>>> > #<URI-common: scheme="http" port=#f host="127.0.0.1" path=(/ "삼계탕") query=#f fragment=#f>
>>> > #;2> (uri->string (make-uri scheme: "http" host: "127.0.0.1" path: '(/ "삼계탕")))
>>> > "http://127.0.0.1/82%BCB3%8483%95"
>>> >
>>> > Unrelated, the actual escaped output looks buggy - it looks like
>>> > some characters like the leading "%EC%" are getting dropped.
>>>
>>> OK, I took some time to investigate and I pinpointed this problem.
>>> This appears to happen due to the use of core srfi-14 and srfi-13 in
>>> uri-generic; its char-set operations simply don't deal with anything
>>> beyond ASCII.
>>
>>
>> As an aside from the uri discussion, we really need to fix srfi-14.
>>
>> The reference implementation is terrible.  Not only does it not
>> handle Unicode, but it doesn't not-handle it gracefully:
>>
>> #;1> (char-set-contains? char-set:full #\x100)
>> Error: (string-ref) out of range [...]
>>
>> At a minimum we should avoid these errors, but really we
>> should be using a Unicode-aware implementation - there's no
>> barrier to doing so like there is for Unicode strings.  We could
>> just move utf8-srfi-14 into the core, or I could patch up the
>> srfi-14 implementation to handle wide chars properly (but maybe
>> slowly) without bringing in the iset dependency.
>>
>> --
>> Alex
>>
>>
>> _______________________________________________
>> Chicken-users mailing list
>> [email protected]
>> https://lists.nongnu.org/mailman/listinfo/chicken-users
>>
>>
>
