[Chicken-users] UTF-8 support in Chicken core [Was: [Q] uri-common has problem with UTF-8 uri.]

Ivan Raikov Sat, 26 Jan 2013 17:43:50 -0800

Hi Alex,

    Yes, I would have thought that more people would be interested in
having UTF-8 support in core Chicken (or at least wide-char compatible
srfi-14). I have changed the title of this thread to reflect the subject
more accurately :-)


  Personally, I think that adding UTF-8 support in core is much better than the
hacks I had to do in mbox, and is a no-brainer considering the benchmark
results you have below.  But I am sure that opinions vary on this subject...

   Can you post your bounds-check patches to srfi-14 on the mailing list,
and/or create a ticket for it? Hopefully there will be more responses this
time.

    Ivan

On Sat, Jan 26, 2013 at 1:42 PM, Alex Shinn <[email protected]> wrote:

> On Wed, Jan 23, 2013 at 5:09 PM, Alex Shinn <[email protected]> wrote:
>
>> On Wed, Jan 23, 2013 at 3:45 PM, Ivan Raikov <[email protected]> wrote:
>>
>>> Yes, I ran into this when I was adding UTF-8 support to mbox... If you
>>> were to add wide char support in srfi-14, is there a way to quantify the
>>> performance penalty?
>>>
>>
>> To add the bounds check so it doesn't error?  Practically
>> nothing.
>>
>> To branch to a separate path for a wide-char table if
>> the bounds check fails?  Same cost if the input is ASCII.
>>
>> For efficient handling in the case of Unicode input...
>> how small/fast do you want it?
>>
>
> I've never met such stony silence in response to an offer to do work...
>
> I ran the following simple char-set-contains? benchmark with
> a few variations:
>
>   (time
>    (do ((i 0 (+ i 1)))
>        ((= i 10000))
>        (do ((j 0 (+ j 1)))
>            ((= j 256))
>          (char-set-contains? char-set:letter (integer->char j)))))
>
> This is the operation most people care about for speed, as
> the boolean and construction operations are less common.
>
> The results:
>
> ;; reference implementation
> ;; 0.312s CPU time, 1/2059 GCs (major/minor)
>
> ;; "fixed" reference implementation (no error, but no support for
> ;; non-latin-1)
> ;; 0.257s CPU time, 1/1706 GCs (major/minor)
>
> ;; utf8-srfi-14 with full Unicode char-set:letter
> ;; 0.243s CPU time, 0/1526 GCs (major/minor)
>
> ;; utf8-srfi-14 with ASCII-only char-set:letter
> ;; 0.242s CPU time, 0/1526 GCs (major/minor)
>
> I was able to add the check and make the reference
> implementation faster because I fixed the common case -
> it was optimized for checking for 0 instead of 1.
>
> Even with the enormous and complex definition of a
> Unicode "letter", utf8-srfi-14 is faster than srfi-14.
>
> As for what we want in Chicken, the answer depends
> on what you're optimizing for.  utf8-srfi-14 will always
> win for space, and generally for speed as well.
>
> If the biggest concern is code size, then you might want
> to borrow the char-set definition from irregex and use
> that as a "fallback" for non-latin-1 chars in the srfi-14
> reference impl.  This would have the same perf as
> srfi-14 for latin-1, yet still support full Unicode and not
> increase the size of the Chicken distribution.
>
> --
> Alex
>
>
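
The bounds check and the "fallback" path for non-latin-1 chars discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the srfi-14 reference implementation, utf8-srfi-14, or the irregex definition: the names make-sketch-set, range-contains?, sketch-set-contains?, ascii-letter? and sketch:letter are all hypothetical, and the two wide ranges (the Greek and Cyrillic blocks) are placeholders rather than a real definition of a Unicode "letter".

```scheme
;; Hypothetical sketch: a char-set is modelled as a pair of
;;   - a 256-entry vector of booleans covering latin-1, and
;;   - a sorted vector of inclusive (lo . hi) code point ranges,
;;     consulted only when the bounds check fails.

(define (make-sketch-set latin1-pred wide-ranges)
  (let ((table (make-vector 256 #f)))
    (do ((i 0 (+ i 1)))
        ((= i 256))
      (vector-set! table i (latin1-pred i)))
    (cons table wide-ranges)))

;; Binary search over the sorted range vector.
(define (range-contains? ranges n)
  (let loop ((lo 0) (hi (- (vector-length ranges) 1)))
    (and (<= lo hi)
         (let* ((mid (quotient (+ lo hi) 2))
                (r (vector-ref ranges mid)))
           (cond ((< n (car r)) (loop lo (- mid 1)))
                 ((> n (cdr r)) (loop (+ mid 1) hi))
                 (else #t))))))

(define (sketch-set-contains? cs ch)
  (let ((n (char->integer ch)))
    (if (< n 256)                         ; the bounds check
        (vector-ref (car cs) n)           ; table lookup for latin-1 input
        (range-contains? (cdr cs) n))))   ; fallback for wide chars

;; ASCII letters only in the table, to keep the example short.
(define (ascii-letter? n)
  (or (and (>= n 65) (<= n 90))
      (and (>= n 97) (<= n 122))))

;; Placeholder wide ranges: Greek and Cyrillic blocks.
(define sketch:letter
  (make-sketch-set ascii-letter?
                   (vector (cons #x0370 #x03FF)
                           (cons #x0400 #x04FF))))
```

With these definitions, (sketch-set-contains? sketch:letter #\a) takes the table path, while (sketch-set-contains? sketch:letter (integer->char #x03B1)) fails the bounds check and goes through the range search instead of raising an error. ASCII and latin-1 input pay only for the extra comparison.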

_______________________________________________
Chicken-users mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/chicken-users
