On Thu, 2007-03-29 at 22:06 -0700, Erick Tryzelaar wrote:
> I'd like to make a simpler frontend to tre by wrapping the apis to
> return felix-native data structures, such as varray. Is it possible to
> pass in a pointer to varray so that it knows that it owns it, and will
> delete it when it's garbage collected?
>
> And speaking of data structures, what should we use as our default
> return type? varrays or lists?

BTW: on tre .. now that we have a way to compile C code, we should probably rewrite the build code for tre to compile it as C. Then a small python script can be written to wrap up the sources from Ville Laurikari's original CVS repository, and we can sync automatically to the latest tre. A few bugs have been found and fixed in tre, and our C++ version is a bit out of date.

The hard part is that we will have to write our own configuration script. Configuring tre is a bitch! Many #defines are repeated with inconsistent definitions, because there are TWO configurations for TRE: one for the headers only, the other for the implementation. The reason is explained below.

There are THREE ways to get regexps:

1. Use gnu libc regex
2. Use tre from system
3. Use our private copy of tre

In models 2 and 3 there are two variants:

R. Use tre as a gnu libc replacement
N. Use tre as a native regexp library

Model N specifically calls tre, and isn't Posix compliant C code (the regex handling is though). Using model N has the advantage that it craps out at run time if you accidentally bind to gnu regex.

Model R is a compatibility mode, and will allow compiled code to work with EITHER tre or gnu regex, the choice being made dynamically at run time, depending on your LD_LIBRARY_PATH or on whether or not tre is actually installed.

Since we can (hopefully) build tre ourselves, and make it work on ALL platforms, a native binding to tre which uses our library makes sense.

The PROBLEM with this is that the Posix specs include some very lame features which I have currently disabled for TRE. Felix TRE currently does NOT support wide or multi-byte characters, it does NOT support any locale dependent interpretation of things like \W, and it does NOT support locale sensitive error messages: you get English, end of story.

Leaving Posix locale support out of regexp handling is mandatory for platform independent behaviour.

C's way of handling locales is entirely screwed and should never be used by anyone in any program. If you want to use locale specific stuff you should be FORCED to pass the locale object to any function parametrised by the locale, and the locale could be obtained from the environment. TRE cannot do this properly because it tries to provide a Posix compliant mode.

Note that locale sensitive error messages are possibly different: they're sensibly locale dependent, but Posix doesn't support that, you need something like gnu gettext, which isn't available on Windows.

Similarly, support for wchar_t is suspicious because the size is platform dependent. In some ways this is a bug in TRE: it should really provide support for int16_t characters (specifically 16 bit) and int32_t characters (specifically 32 bit), and then alias wchar_t to one of these. wchar_t is 16 bit on Windows and 32 bit on Linux.

So, I'm inclined to keep the current configuration, which is:

NO support for locale dependent regexps
NO support for locale dependent error messages
NO support for multi-byte encodings
NO support for wide characters

because these things would lead to non-deterministic behaviour.

32 bit Unicode support would be nice but it can be quite expensive and will never work properly: human script doesn't fit 1-1 in Unicode anyhow, you still need multi-codepoint encodings.

Luckily, UTF8 encoded 8 bit strings work with regexps just fine provided you don't use garbage like \W, which of course won't pick out true words from UTF-8 encoded unicode (it will work if all the high bit set bytes are considered letters .. :) Ditto for case mappings.

To do this properly we probably need either to enable multi-byte encoding in TRE (which I doubt will work because C multi-byte support only works for 1-2 byte encodings AFAIK?) or to use the TRE specific ability to stream characters in with a callback. But I'm not sure how TRE actually handles \W etc..

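To illustrate the point above about UTF-8 encoded 8 bit strings, here is a rough sketch. It uses Python's re module on byte strings as a stand-in for a byte-oriented, locale-free matcher; it is NOT TRE, and TRE's own treatment of \W may well differ. The sample text and the variable names are made up for the example.

```python
import re

# A plain 8 bit string holding UTF-8 encoded text (sample text is made up).
text = "naïve café".encode("utf-8")

# A literal multi-byte sequence matches byte for byte; no Unicode
# support is needed in the matcher for this to work.
literal = re.search("café".encode("utf-8"), text)

# For byte patterns Python's \w is ASCII-only, which is roughly what a
# locale-free matcher sees: the bytes encoding the accented letters have
# the high bit set, are not word characters, and chop the words up.
ascii_words = re.findall(rb"\w+", text)

# The workaround from the text: treat every high bit set byte as a letter.
high_bit_words = re.findall(rb"[A-Za-z0-9_\x80-\xff]+", text)

print(ascii_words)
print(high_bit_words)
```

The first list comes back in fragments and the second gives whole words, which is about as far as you get without real multi-byte support. Note that the high bit trick also accepts non-letter code points such as UTF-8 encoded punctuation, so it is only an approximation.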
-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

_______________________________________________
Felix-language mailing list
Felix-language@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/felix-language