Re: [Chicken-users] UTF-8 support in eggs

Alex Shinn Wed, 09 Jul 2014 16:22:27 -0700

On Wed, Jul 9, 2014 at 7:15 AM, Oleg Kolosov <bazur...@gmail.com> wrote:


>
> IMO just enable utf8 by default and let them break. It's not the 80's
> anymore, latin1 only software should die.


I agree that if people want "latin1 only" there should at best be
a compiler option for this which is disabled by default.  Chicken
is a community project used by people around the world.

However, I don't think that's the real problem.  The issue, as I
understand it, is that although Chicken has both strings and
bytevectors in the core, historically and for continued simplicity
strings are abused as bytevectors in many cases.  This allows
you to use the plentiful string libraries (e.g. srfi-13 and regex)
on binary data, whereas there are few bytevector utils.  There
are also cases where you have mixed text and binary data,
and applying all appropriate conversions can be tedious.
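
To make the string/bytevector distinction concrete, here is a minimal
sketch in C (not Chicken code; the function name is hypothetical and
chosen only for illustration).  It checks whether a buffer has
well-formed UTF-8 structure.  Text passes, while typical binary data
does not, which is why binary data kept in strings becomes a problem
once strings are assumed to be UTF-8.  The check is deliberately
simplified: it does not reject overlong encodings or surrogates.

```c
#include <stddef.h>

/* Returns 1 if buf[0..len) has well-formed UTF-8 lead/continuation
   structure, 0 otherwise.  Simplified for illustration: overlong
   encodings and surrogate code points are NOT rejected. */
int utf8_structurally_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;                       /* continuation bytes expected */

        if (b < 0x80)                extra = 0;   /* ASCII */
        else if ((b & 0xE0) == 0xC0) extra = 1;   /* 110xxxxx */
        else if ((b & 0xF0) == 0xE0) extra = 2;   /* 1110xxxx */
        else if ((b & 0xF8) == 0xF0) extra = 3;   /* 11110xxx */
        else return 0;          /* stray continuation or invalid lead byte */

        if (extra > len - i - 1)
            return 0;           /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= extra; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;       /* expected a 10xxxxxx continuation byte */

        i += extra + 1;
    }
    return 1;
}
```

For example, the bytes 'h' 'i' 0xC3 0xA9 (the text "hié") pass, while
the first four bytes of a PNG file (0x89 'P' 'N' 'G') fail because
0x89 is a continuation byte with no lead byte before it.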

The clean way to handle this is to duplicate the useful string
APIs for bytevectors.  This could be done without code duplication
with the use of functors, though compiler assistance may be
needed for efficiency (e.g. for inlined procedures).  Even without
code duplication there would be an increase in the core library
size, though we could probably move most utilities to external
libraries (how often do you need regexps that operate on binary
data?).

If we could (through functors or in a pinch duplication) bring
the bytevector API up to speed with strings, then the next
step is to identify all such abusers of strings and move them
to bytevectors.

> We did a few tests some time ago and they showed that tackling this from
> Scheme side does not make worthy difference. Using pure C is much
> better. Perhaps utf8 egg could enjoy some yet to be written (or found in
> third party libraries) low level support from the core, so we can have
> the best of the both worlds.
>

There's already some small utf8 support in the core, and
adding the missing pieces would not take measurable space.

The bigger issue from the performance perspective is existing
idioms that use indexes, which can degrade to quadratic behavior
in the worst case no matter how much you optimize (without hacks
that slow down normal usage).  So people would have to learn to
take substrings where appropriate to avoid the start/end parameters
to all SRFI 13 functions, or we would need to deprecate SRFI 13
in favor of a cursor-oriented API (planned for R7RS).
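
The quadratic behavior can be sketched in C as follows.  This is an
illustration only, not the utf8 egg's or the core's implementation;
all function names are hypothetical, and the code assumes valid,
NUL-terminated UTF-8.  Looking up a character index requires a scan
from the start of the string, so a loop over indexes re-scans the
string on every step.  A cursor (here simply a byte offset) advances
by at most a few bytes per step.

```c
#include <stddef.h>

/* Nonzero if b starts a code point, i.e. is not a 10xxxxxx
   continuation byte. */
static int utf8_is_lead(unsigned char b)
{
    return (b & 0xC0) != 0x80;
}

/* Byte offset of the code point at character position `index`.
   Scans from the start, so the cost is linear in `index`.
   Returns the byte length of s if index is past the end. */
size_t utf8_offset_of_index(const char *s, size_t index)
{
    size_t off = 0;
    while (s[off] != '\0') {
        if (utf8_is_lead((unsigned char)s[off])) {
            if (index == 0)
                return off;
            index--;
        }
        off++;
    }
    return off;
}

/* Index-oriented traversal: every step re-scans from the start,
   so visiting all n characters costs O(n^2) in total. */
size_t count_by_index(const char *s)
{
    size_t n = 0;
    while (s[utf8_offset_of_index(s, n)] != '\0')
        n++;
    return n;
}

/* Cursor-oriented step: move a byte offset to the next code point.
   For valid UTF-8 this touches at most four bytes. */
size_t utf8_next(const char *s, size_t cursor)
{
    if (s[cursor] == '\0')
        return cursor;
    cursor++;
    while (s[cursor] != '\0' && !utf8_is_lead((unsigned char)s[cursor]))
        cursor++;
    return cursor;
}

/* Cursor-oriented traversal: O(n) in total. */
size_t count_by_cursor(const char *s)
{
    size_t n = 0;
    for (size_t c = 0; s[c] != '\0'; c = utf8_next(s, c))
        n++;
    return n;
}
```

Both traversals return the same count; only the cost differs.  Taking
a substring once and then working from its start plays the same role
as the cursor here, since it avoids repeating the scan for every
start/end argument.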

So as you see the change is contagious.  We can update the core
efficiently and easily, but then we have to fix the string abusers,
and then we have to replace existing index-oriented APIs.

-- 
Alex

_______________________________________________
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users
