Re: [lucy-dev] RegexTokenizer

Marvin Humphrey Tue, 08 Mar 2011 11:26:07 -0800

On Tue, Mar 08, 2011 at 05:50:34PM +0000, Andrew S. Townley wrote:
> Tokenizer for the interface and RegexTokenizer for platform-specific regexes
> (which, in fairness, is kinda what people would expect anyway).


Yes, that's the idea.  :)

> Many things support Perl5 regexes to varying degrees, so you'd likely not
> have too much trouble from a portability perspective.  

That's true, but I think it makes sense to endorse the full use of the host
language's regex engine if that's possible.  (It will be a little tricky to
make the analysis chain work with different host string encodings.)
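
For illustration only, here is a rough sketch of that split in Python: an
abstract Tokenizer interface plus a RegexTokenizer that delegates to the host's
own regex engine (Python's `re` module here).  The class names follow the
thread, but the `tokenize` method, the default pattern, and the
(text, start, end) tuple layout are assumptions made for this example, not
Lucy's actual API.

```python
import re
from abc import ABC, abstractmethod


class Tokenizer(ABC):
    """Hypothetical host-independent interface: text in, tokens out."""

    @abstractmethod
    def tokenize(self, text):
        """Return a list of (token_text, start_offset, end_offset) tuples."""


class RegexTokenizer(Tokenizer):
    """Hypothetical tokenizer backed by the host language's regex engine."""

    def __init__(self, pattern=r"\w+(?:'\w+)*"):
        # The pattern is compiled by the host engine, so its syntax and
        # semantics are whatever the host provides (an assumed default here).
        self._regex = re.compile(pattern)

    def tokenize(self, text):
        # Offsets are in the host's native string units (code points in
        # Python), which is one place host string encodings come into play.
        return [(m.group(0), m.start(), m.end())
                for m in self._regex.finditer(text)]


tokens = RegexTokenizer().tokenize("it's a dog")
```

A different host would supply its own RegexTokenizer behind the same
interface, so a pattern written for one host is not guaranteed to behave
identically on another.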

> If you wanted to lock it in across host languages, then you could always
> implement this in C using the library of your choice due to the
> architecture, right?

Yes, most likely using PCRE.  I think that would make sense to implement as an
extension, distributed separately.  Bundling PCRE with core Lucy would provide
very little benefit at a large cost, though.  Every host provides a regex
engine that users are already familiar with, and I expect that few users will
require indexes to work across multiple hosts.

Marvin Humphrey
