Greets,
Lucene's TokenStream concept does not translate very well from Java
to our target languages, primarily because calling next() once for
each analyzer-token pairing creates a lot of method-call overhead.
Also, the analysis process creates a *lot* of Token objects; in
Plucene these were hash-based objects, relatively expensive to create
and destroy.
The solution adopted by KinoSearch is the TokenBatch. Instead of
passing tokens one by one through an analysis chain, they are grouped
together and passed en masse from analyzer to analyzer. Analyzers do
not supply a next() method; they supply an analyze() method, which
both accepts and returns a TokenBatch object. Analysis chains are
set up using a PolyAnalyzer object, which is an ordered array of
Analyzers, each of which will be called upon to analyze() a
TokenBatch in turn.
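Sketched in C against hypothetical names (assuming each Analyzer
carries an analyze function pointer; none of this is settled API):

    typedef struct lucy_TokenBatch lucy_TokenBatch;  /* defined below */
    typedef struct lucy_Analyzer lucy_Analyzer;

    struct lucy_Analyzer {
        /* Consume a TokenBatch, return a (possibly new) TokenBatch. */
        lucy_TokenBatch* (*analyze)(lucy_Analyzer *self,
                                    lucy_TokenBatch *batch);
    };

    typedef struct lucy_PolyAnalyzer {
        lucy_Analyzer **analyzers;      /* ordered array of sub-analyzers */
        lucy_i32_t      num_analyzers;
    } lucy_PolyAnalyzer;

    /* Run the batch through each sub-analyzer in turn. */
    lucy_TokenBatch*
    lucy_PolyAnalyzer_analyze(lucy_PolyAnalyzer *self,
                              lucy_TokenBatch *batch)
    {
        lucy_i32_t i;
        for (i = 0; i < self->num_analyzers; i++) {
            lucy_Analyzer *child = self->analyzers[i];
            batch = child->analyze(child, batch);
        }
        return batch;
    }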
Implementing TokenBatch as a doubly-linked list of Token structs
seems to work pretty well.
    typedef struct lucy_Token {
        char              *text;          /* token text */
        lucy_i32_t         len;           /* length of text */
        lucy_i32_t         start_offset;  /* start offset into source text */
        lucy_i32_t         end_offset;    /* end offset into source text */
        lucy_i32_t         pos_inc;       /* position increment */
        struct lucy_Token *next;          /* next Token in the batch */
        struct lucy_Token *prev;          /* previous Token in the batch */
    } lucy_Token;
    typedef struct lucy_TokenBatch {
        lucy_Token  *first;        /* head of the doubly-linked list */
        lucy_Token  *last;         /* tail of the doubly-linked list */
        lucy_Token  *current;      /* cursor */
        lucy_i32_t   size;         /* number of Tokens in the batch */
        lucy_bool_t  initialized;  /* has iteration begun? */
    } lucy_TokenBatch;
(We might define token->text as either an 8 or 16 bit type depending
on how the target platform handles strings, but that's a topic for
another day.)
The tricky part here is how to expose TokenBatch and its constituent
tokens as a native API, which is necessary for subclassing Analyzer.
Core Analyzer subclasses distributed with Lucy might peek into the
guts of Token structs. However, it would be a bad idea to make *any*
C struct's makeup part of Lucy's public API. The only supported way
to get at a struct's members should be through methods.
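By way of contrast, core-internal code that's allowed to peek would
splice the linked list directly. An illustrative sketch:

    /* Core-internal: append a Token by splicing it onto the list. */
    void
    lucy_TokenBatch_append(lucy_TokenBatch *self, lucy_Token *token)
    {
        token->next = NULL;
        token->prev = self->last;
        if (self->last != NULL) { self->last->next = token; }
        else                    { self->first = token;      }
        self->last = token;
        self->size++;
    }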
The obvious way to affect individual Tokens would be to spin them off
as native objects when iterating over the TokenBatch object.
Illustrated in Java:
    while ((token = batch.next()) != null) {
        token.setText( lowerCase(token.getText()) );
    }
However, that introduces a garbage collection issue. Who's
responsible for destroying the individual Token structs? We can't
have them destroyed twice, once when the TokenBatch gets destroyed,
and once when the spun-off token gets destroyed. I see two
possibilities, neither of which is attractive. We might implement
our own reference-counting scheme, which would be messy and
complicated. Or we could start off by creating each token as a
native object and defer to native GC, which is messy, complicated,
and expensive to boot.
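For illustration, a home-grown scheme would mean giving each Token a
refcount, with every owner (the TokenBatch plus each spun-off native
object) obliged to balance inc_ref and dec_ref calls; that bookkeeping
is where the mess comes from. A hypothetical sketch:

    #include <stdlib.h>

    /* Hypothetical refcounted Token; owners must pair inc_ref/dec_ref. */
    typedef struct lucy_RefCountedToken {
        lucy_Token token;
        lucy_i32_t refcount;
    } lucy_RefCountedToken;

    void
    lucy_RefCountedToken_inc_ref(lucy_RefCountedToken *self)
    {
        self->refcount++;
    }

    /* Destroy the Token only when the last owner lets go. */
    void
    lucy_RefCountedToken_dec_ref(lucy_RefCountedToken *self)
    {
        if (--self->refcount == 0) {
            free(self->token.text);
            free(self);
        }
    }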
The least worst solution I see is to avoid exposing the individual
tokens via the native API, and to allow access to them one at a time
via methods on the TokenBatch object. Instead of returning a native
Token object, next() would return a boolean indicating whether the
TokenBatch cursor is currently located at a valid Token position.
    while (batch.next()) {
        batch.setText( lowerCase(batch.getText()) );
    }
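In C, those cursor methods might reduce to something like the
following (a sketch against the structs above, not settled API):

    #include <stdlib.h>
    #include <string.h>

    /* Advance the cursor; return true iff it lands on a valid Token. */
    lucy_bool_t
    lucy_TokenBatch_next(lucy_TokenBatch *self)
    {
        if (!self->initialized) {
            self->current     = self->first;
            self->initialized = 1;
        }
        else if (self->current != NULL) {
            self->current = self->current->next;
        }
        return self->current != NULL;
    }

    /* Accessors operate on whichever Token the cursor points at. */
    char*
    lucy_TokenBatch_get_text(lucy_TokenBatch *self)
    {
        return self->current->text;
    }

    void
    lucy_TokenBatch_set_text(lucy_TokenBatch *self, const char *text,
                             lucy_i32_t len)
    {
        lucy_Token *token = self->current;
        free(token->text);
        token->text = (char*)malloc(len + 1);
        memcpy(token->text, text, len);
        token->text[len] = '\0';
        token->len = len;
    }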
This is perhaps a little less intuitive than iterating over a faux
array of Tokens, but it's similar to how Lucene's TermDocs, TermEnum,
and Scorer classes all work.
It's also the fastest scheme I've come up with. Without making
TokenBatch a native array of Token objects...
    for my $token (@tokens) {
        $token->set_text( lc( $token->get_text ) );
    }
... there's no way to avoid the method-call overhead of next(). Any
user-defined subclass of Analyzer implemented using the native API is
therefore going to be a little slower than the core Analyzer classes,
which are allowed to access struct members directly, but that's the
way it goes. If we supply a good set of flexible Analyzer classes,
hopefully few people will need to create custom Analyzers and their
indexing apps will stay fast enough.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/