On Thu, 2007-03-29 at 22:06 -0700, Erick Tryzelaar wrote:
> I'd like to make a simpler frontend to tre by wrapping the apis to 
> return felix-native data structures, such as varray. Is it possible to 
> pass in a pointer to varray so that it knows that it owns it, and will 
> delete it when it's garbage collected?
> 
> And speaking of data structures, what should we use as our default 
> return type? varrays or lists?

BTW: on tre .. now that we have a way to compile C code, we should probably
rewrite the build code for tre to compile it as C.

Then, a small python script can be written to wrap up the sources
from the original CVS repository of Ville Laurikari, and we can
sync automatically to the latest tre. A few bugs found in tre
have been fixed, and our C++ version is a bit out
of date.

The hard part is we will have to write our own configuration
script. Configuring tre is a bitch! Many #defines are repeated
with inconsistent definitions, because there are TWO configurations
for TRE: one is for the headers only, the other for the implementation.
The reason is explained below:

There are THREE ways to get regexps:

1. Use gnu libc regex 
2. Use tre from system
3. Use our private copy of tre

In models 2 and 3 there are two variants:

R. Use tre as a gnu libc replacement
N. Use tre as a native regexp library

Model N specifically calls tre, and isn't Posix
compliant C code (the regex handling is though).
Using model N has the advantage that the program fails
at run time if you accidentally bind to gnu regex.

Model R is a compatibility mode, and will allow compiled
code to work with EITHER tre or gnu regex, the choice
being made dynamically at run time, depending on your
LD_LIBRARY_PATH or whether or not tre is actually installed.

Since we can (hopefully) build tre ourselves, and make it
work on ALL platforms, a native binding to tre which uses
our library makes sense.

The PROBLEM with this is that the Posix specs include some
very lame features which I have currently disabled for TRE.
Felix TRE currently does NOT support wide or multi-byte characters
and it does NOT support any locale dependent interpretation of
things like \W, and it does NOT support locale sensitive error
messages: you get English, end of story.

Dropping Posix locale support in regexp handling is necessary
for platform independent behaviour. C's way of handling locales
is entirely screwed and should never be used by anyone in any program.

If you want to use locale specific stuff you should be FORCED
to pass the locale object to any function parametrised by
the locale, and the locale could be obtained from the environment.
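
A minimal sketch of that design, in Python rather than Felix or C. The
Locale class, the two sample locales and split_words are all hypothetical
names made up for illustration; nothing here is TRE or Posix API. The only
point is that the locale is a required argument, never global state:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Locale:
    """Hypothetical locale object: a name plus a word-character test."""
    name: str
    is_word_char: Callable[[str], bool]


# Two illustrative locales. Neither is read from hidden global state.
C_LOCALE = Locale("C", lambda ch: ch.isascii() and (ch.isalnum() or ch == "_"))
UNICODE_LOCALE = Locale("unicode", lambda ch: ch.isalnum() or ch == "_")


def split_words(text: str, locale: Locale) -> List[str]:
    """Split text into words; the caller must say which locale applies."""
    words: List[str] = []
    current: List[str] = []
    for ch in text:
        if locale.is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

A caller that wants environment-driven behaviour would look up the locale
once, at the top of the program, and pass it down explicitly; the same
input then always gives the same output for the same locale argument.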

TRE cannot do this properly because it tries to provide a
Posix compliant mode.

Note that locale sensitive error messages are possibly different:
they're sensibly locale dependent, but Posix doesn't support
that, you need something like gnu gettext, which isn't available
on Windows.

Similarly, support for wchar_t is suspicious because the size
is platform dependent. In some ways this is a bug in TRE:
it should really provide support for int16_t characters
(specifically 16 bit) and int32_t characters (specifically
32 bit), and then alias wchar_t to one of these.
wchar_t is 16 bit on Windows and 32 bit on Linux.
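
The platform dependence is easy to check. The sketch below uses Python's
ctypes to report the size of the C wchar_t on whatever machine runs it;
fixed_width_alias is a hypothetical helper that just names the fixed-width
type wchar_t would alias under the scheme described above:

```python
import ctypes

# Size in bytes of the C wchar_t type on the platform running this script.
WCHAR_SIZE = ctypes.sizeof(ctypes.c_wchar)


def fixed_width_alias(size: int) -> str:
    """Name the fixed-width type that a wchar_t of this size would alias."""
    if size == 2:
        return "int16_t"
    if size == 4:
        return "int32_t"
    raise ValueError(f"unexpected wchar_t size: {size}")


print(WCHAR_SIZE, fixed_width_alias(WCHAR_SIZE))
```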

So, I'm inclined to keep the current configuration which is:

NO support for locale dependent regexps
NO support for locale dependent error messages
NO support for multi-byte encodings
NO support for wide characters

because these things would lead to non-deterministic behaviour.

32 bit Unicode support would be nice but it can be quite
expensive and will never work properly -- human script
doesn't fit 1-1 in Unicode anyhow: you still need
multi-codepoint encodings.

Luckily, UTF8 encoded 8 bit strings work with regexps just fine
provided you don't use garbage like \W, which of course won't
pick out true words from UTF-8 encoded unicode (it will work
if all the high bit set bytes are considered letters .. :)
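
To illustrate, here is a small sketch using Python's re module on byte
strings as a stand-in for an 8 bit regexp engine. This is not TRE, and
TRE's own handling of word classes may differ; the sample text is made up.
It shows literal matching working on UTF-8 bytes, the word class breaking
on non-ASCII characters, and the "high bit bytes are letters" workaround:

```python
import re

data = "naïve café".encode("utf-8")  # an 8 bit string, UTF-8 encoded

# A literal multi-byte sequence matches byte for byte.
literal = re.search("café".encode("utf-8"), data)

# On byte strings \w is ASCII-only, so words break at each non-ASCII char.
ascii_words = re.findall(rb"\w+", data)

# Workaround: treat every byte with the high bit set as a letter.
utf8_words = re.findall(rb"[\w\x80-\xff]+", data)

print(ascii_words)
print([w.decode("utf-8") for w in utf8_words])
```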

Ditto for case mappings. To do this properly we probably
need either to enable multi-byte encoding in TRE
(which I doubt will work, because AFAIK C multi-byte support only
works for 1-2 byte encodings?) or to use the
TRE specific ability to stream characters in with a callback.
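
A rough sketch of what such a callback would have to do, assuming the
matcher simply asks for "the next character" each time. This is Python,
not TRE's C interface, and make_next_char is a hypothetical name. It
decodes UTF-8 incrementally, so a character split across two input chunks
still comes out as one code point:

```python
import codecs
from typing import Callable, Iterable, Optional


def make_next_char(chunks: Iterable[bytes]) -> Callable[[], Optional[int]]:
    """Return a callback yielding one code point per call, None at the end."""
    decoder = codecs.getincrementaldecoder("utf-8")()

    def code_points():
        for chunk in chunks:
            # Incomplete trailing bytes are buffered until the next chunk.
            for ch in decoder.decode(chunk):
                yield ord(ch)
        # Flush; raises UnicodeDecodeError if the input ended mid-character.
        for ch in decoder.decode(b"", final=True):
            yield ord(ch)

    it = code_points()

    def next_char() -> Optional[int]:
        return next(it, None)

    return next_char
```

With this shape the matcher sees whole code points, so character classes
and case mappings could in principle be defined on code points rather
than on bytes.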

But I'm not sure how TRE actually handles \W etc..



-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net
