I'll throw in my two bits here. I'm not personally decided whether utf-8 in core would be an improvement. I don't have enough background or knowledge of the internals to contribute to that decision.
I can offer this, however: I have found that I have to use utf-8 support in every project I've written in Chicken. I do so, and have only had a problem when the utf-8 egg did not map a procedure from core properly.

I'm getting by just fine with the current state of affairs, and I do have a certain nostalgic love of ASCII. If I *could* get away with only having ASCII, I would. This has not been true in practice.

My experience with numbers is slightly different, where I do find I need to do word-level calculation where I depend on the underlying machine implementation of character- and pointer-sized integers. I use the fx versions of these functions when I do rely on this, but I mainly have found I must intentionally subvert the numeric tower to get a specific behavior. This has never been true when I've dealt with characters.

FWIW,

-Alan

On Sun, Jan 27, 2013 at 10:43:41AM +0900, Ivan Raikov wrote:
> Hi Alex,
>
> Yes, I would have thought that more people would be interested in
> having UTF-8 support in core Chicken (or at least wide-char compatible
> srfi-14). I have changed the title of this thread to reflect the subject
> more accurately :-)
>
> Personally, I think that adding UTF-8 in core is much better than the
> hacks I had to do in mbox, and is a no brainer considering the benchmark
> results you have below. But I am sure that opinions vary on this
> subject...
>
> Can you post your bounds-check patches to srfi-14 on the mailing list,
> and/or create a ticket for it? Hopefully there will be more responses this
> time.
>
> Ivan
>
> On Sat, Jan 26, 2013 at 1:42 PM, Alex Shinn <[email protected]> wrote:
>
> > On Wed, Jan 23, 2013 at 5:09 PM, Alex Shinn <[email protected]> wrote:
> >
> > > On Wed, Jan 23, 2013 at 3:45 PM, Ivan Raikov <[email protected]> wrote:
> > >
> > > > Yes, I ran into this when I was adding UTF-8 support to mbox... If
> > > > you were to add wide char support in srfi-14, is there a way to
> > > > quantify the performance penalty?
> > >
> > > To add the bounds check so it doesn't error? Practically nothing.
> > >
> > > To branch to a separate path for a wide-char table if
> > > the bounds check fails? Same cost if the input is ASCII.
> > >
> > > For efficient handling in the case of Unicode input...
> > > how small/fast do you want it?
> >
> > I've never met such stony silence in response to an offer to do work...
> >
> > I ran the following simple char-set-contains? benchmark with
> > a few variations:
> >
> >   (time
> >    (do ((i 0 (+ i 1)))
> >        ((= i 10000))
> >        (do ((j 0 (+ j 1)))
> >            ((= j 256))
> >          (char-set-contains? char-set:letter (integer->char j)))))
> >
> > This is what most people are concerned about for speed, as
> > the boolean and construction operations are less common.
> >
> > The results:
> >
> >   ;; reference implementation
> >   ;; 0.312s CPU time, 1/2059 GCs (major/minor)
> >
> >   ;; "fixed" reference implementation (no error but no support for non-latin-1)
> >   ;; 0.257s CPU time, 1/1706 GCs (major/minor)
> >
> >   ;; utf8-srfi-14 with full Unicode char-set:letter
> >   ;; 0.243s CPU time, 0/1526 GCs (major/minor)
> >
> >   ;; utf8-srfi-14 with ASCII-only char-set:letter
> >   ;; 0.242s CPU time, 0/1526 GCs (major/minor)
> >
> > I was able to add the check and make the reference
> > implementation faster because I fixed the common case -
> > it was optimized for checking for 0 instead of 1.
> >
> > Even with the enormous and complex definition of a
> > Unicode "letter", utf8-srfi-14 is faster than srfi-14.
> >
> > As for what we want in Chicken, the answer depends
> > on what you're optimizing for. utf8-srfi-14 will always
> > win for space, and generally for speed as well.
> >
> > If the biggest concern is code-size, then you might want
> > to borrow the char-set definition from irregex and use
> > that as a "fallback" for non-latin-1 chars in the srfi-14
> > reference impl. This would have the same perf as
> > srfi-14 for latin-1, yet still support full Unicode and not
> > increase the size of the Chicken distribution.
> >
> > --
> > Alex

-- 
my personal website: http://c0redump.org/

_______________________________________________
Chicken-users mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/chicken-users
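
Editorial sketch, not part of the original thread: the quoted message describes adding a bounds check to the table lookup and branching to a separate path when the character falls outside the Latin-1 range. The following is only a rough illustration of that idea in plain Scheme. Every name here (make-demo-char-set, demo-char-set-contains?, in-ranges?, ascii-letter?) is hypothetical and is not taken from srfi-14, utf8-srfi-14, or irregex, and the two "wide" ranges are a tiny stand-in for a real Unicode letter definition.

```scheme
;; Hypothetical sketch only; these names do not exist in srfi-14 or utf8-srfi-14.

;; Fast path: a 256-entry vector of booleans indexed by code point.
(define (make-latin-1-table pred)
  (let ((table (make-vector 256 #f)))
    (do ((i 0 (+ i 1)))
        ((= i 256) table)
      (vector-set! table i (pred (integer->char i))))))

;; Fallback path: linear search over inclusive (lo . hi) code-point ranges.
(define (in-ranges? n ranges)
  (cond ((null? ranges) #f)
        ((and (>= n (caar ranges)) (<= n (cdar ranges))) #t)
        (else (in-ranges? n (cdr ranges)))))

;; A demo char-set is a pair: (latin-1-table . wide-ranges).
(define (make-demo-char-set pred wide-ranges)
  (cons (make-latin-1-table pred) wide-ranges))

;; The bounds check: characters below 256 use the table, everything
;; else takes the fallback instead of signalling an error.
(define (demo-char-set-contains? cs c)
  (let ((n (char->integer c)))
    (if (< n 256)
        (vector-ref (car cs) n)
        (in-ranges? n (cdr cs)))))

;; ASCII-only letter predicate used to fill the table (an assumption
;; made to keep the example independent of the host's char-alphabetic?).
(define (ascii-letter? c)
  (let ((n (char->integer c)))
    (or (and (>= n 65) (<= n 90))
        (and (>= n 97) (<= n 122)))))

;; Illustrative ranges only: Greek small letters and basic Cyrillic letters.
(define demo-letters
  (make-demo-char-set ascii-letter?
                      '((#x3B1 . #x3C9) (#x410 . #x44F))))
```

For ASCII or Latin-1 input the only added cost over a bare table lookup is the single comparison against 256, which matches the "practically nothing" estimate in the quoted message; how fast the fallback is depends entirely on the structure chosen for the wide ranges, which this sketch does not try to optimize.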
