On Tue, Mar 08, 2011 at 05:50:34PM +0000, Andrew S. Townley wrote:
> Tokenizer for the interface and RegexTokenizer for platform-specific regexes
> (which, in fairness, is kinda what people would expect anyway).

Yes, that's the idea. :)

> Many things support Perl5 regexes to varying degrees, so you'd likely not
> have too much trouble from a portability perspective.

That's true, but I think it makes sense to endorse the full use of the host
language's regex engine if that's possible. (It will be a little tricky to
make the analysis chain work with different host string encodings.)

> If you wanted to lock it in across host languages, then you could always
> implement this in C using the library of your choice due to the
> architecture, right?

Yes, most likely using PCRE. I think it would make sense to implement that as
an extension, distributed separately. Bundling PCRE with core Lucy would
provide very little benefit at a large cost, though. Every host provides a
regex engine that users are already familiar with, and I expect that few users
will require indexes to work across multiple hosts.

Marvin Humphrey
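
For illustration only, the split discussed in this message (a host-neutral
Tokenizer interface plus a RegexTokenizer that defers to the host's own regex
engine) might look roughly like this with Python as the host. This is a
minimal sketch under stated assumptions, not Lucy's actual API: the class
layout, the `tokenize` method, and the default pattern are all hypothetical.

```python
import re
from abc import ABC, abstractmethod


class Tokenizer(ABC):
    """Hypothetical host-neutral interface: turn text into a list of tokens."""

    @abstractmethod
    def tokenize(self, text):
        """Return the tokens found in `text` as a list of strings."""


class RegexTokenizer(Tokenizer):
    """Hypothetical implementation backed by the host's regex engine.

    Here the host is Python, so the pattern uses Python `re` syntax. A Perl
    or Ruby host would accept its own native syntax instead.
    """

    def __init__(self, pattern=r"\w+"):
        # The default pattern is an assumption chosen for this sketch.
        self._regex = re.compile(pattern)

    def tokenize(self, text):
        # Each non-overlapping match of the pattern becomes one token.
        return [match.group(0) for match in self._regex.finditer(text)]


tokens = RegexTokenizer().tokenize("Bundling PCRE with core Lucy")
```

Because the pattern is interpreted by the host engine, an index built with a
host-specific pattern would only tokenize identically on another host if that
host's engine treats the pattern the same way, which is the portability
trade-off raised above.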
