Greets,
Lucene's TokenStream concept does not translate very well from Java
to our target languages, primarily because calling next() once for
each analyzer-token pairing creates a lot of method-call overhead.
Also, the analysis process creates a *lot* of Token objects; in
Plucene these were hash-based objects, relatively expensive to create
and destroy.
The solution adopted by KinoSearch is the TokenBatch. Instead of
passing tokens one by one through an analysis chain, they are grouped
together and passed en masse from analyzer to analyzer. Analyzers do
not supply a next() method; they supply an analyze() method, which
both accepts and returns a TokenBatch object. Analysis chains are
set up using a PolyAnalyzer object, which is an ordered array of
Analyzers, each of which will be called upon to analyze() a
TokenBatch in turn.
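Sketched in C against hypothetical names (assuming each Analyzer
carries an analyze function pointer; none of this is settled API):

    typedef struct lucy_TokenBatch lucy_TokenBatch;  /* defined below */
    typedef struct lucy_Analyzer lucy_Analyzer;

    struct lucy_Analyzer {
        /* Consume a TokenBatch, return a (possibly new) TokenBatch. */
        lucy_TokenBatch* (*analyze)(lucy_Analyzer *self,
                                    lucy_TokenBatch *batch);
    };

    typedef struct lucy_PolyAnalyzer {
        lucy_Analyzer **analyzers;      /* ordered array of sub-analyzers */
        lucy_i32_t      num_analyzers;
    } lucy_PolyAnalyzer;

    /* Run the batch through each sub-analyzer in turn. */
    lucy_TokenBatch*
    lucy_PolyAnalyzer_analyze(lucy_PolyAnalyzer *self,
                              lucy_TokenBatch *batch)
    {
        lucy_i32_t i;
        for (i = 0; i < self->num_analyzers; i++) {
            lucy_Analyzer *child = self->analyzers[i];
            batch = child->analyze(child, batch);
        }
        return batch;
    }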
Implementing TokenBatch as a doubly-linked list of Token structs
seems to work pretty well.
    typedef struct lucy_Token {
        char              *text;          /* token text */
        lucy_i32_t         len;           /* length of text */
        lucy_i32_t         start_offset;  /* start offset into source text */
        lucy_i32_t         end_offset;    /* end offset into source text */
        lucy_i32_t         pos_inc;       /* position increment */
        struct lucy_Token *next;          /* next Token in the batch */
        struct lucy_Token *prev;          /* previous Token in the batch */
    } lucy_Token;
    typedef struct lucy_TokenBatch {
        lucy_Token  *first;        /* head of the doubly-linked list */
        lucy_Token  *last;         /* tail of the doubly-linked list */
        lucy_Token  *current;      /* cursor */
        lucy_i32_t   size;         /* number of Tokens in the batch */
        lucy_bool_t  initialized;  /* has iteration begun? */
    } lucy_TokenBatch;
(We might define token->text as either an 8 or 16 bit type depending
on how the target platform handles strings, but that's a topic for
another day.)
The tricky part here is how to expose TokenBatch and its constituent
tokens as a native API, which is necessary for subclassing Analyzer.
Core Analyzer subclasses distributed with Lucy might peek into the
guts of Token structs. However, it would be a bad idea to make *any*
C struct's makeup part of Lucy's public API. The only supported way
to get at a struct's members should be through methods.
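By way of contrast, core-internal code that's allowed to peek would
splice the linked list directly. An illustrative sketch:

    /* Core-internal: append a Token by splicing it onto the list. */
    void
    lucy_TokenBatch_append(lucy_TokenBatch *self, lucy_Token *token)
    {
        token->next = NULL;
        token->prev = self->last;
        if (self->last != NULL) { self->last->next = token; }
        else                    { self->first = token;      }
        self->last = token;
        self->size++;
    }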
The obvious way to affect individual Tokens would be to spin them off
as native objects when iterating over the TokenBatch object.
Illustrated in Java:
    while ((token = batch.next()) != null) {
        token.setText( lowerCase(token.getText()) );
    }
However, that introduces a garbage collection issue. Who's
responsible for destroying the individual Token structs? We can't
have them destroyed twice, once when the TokenBatch gets destroyed,
and once when the spun-off token gets destroyed. I see two
possibilities, neither of which is attractive. We might implement
our own reference-counting scheme, which would be messy and
complicated. Or we could start off by creating each token as a
native object and defer to native GC, which is messy, complicated,
and expensive to boot.
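For illustration, a home-grown scheme would mean giving each Token a
refcount, with every owner (the TokenBatch plus each spun-off native
object) obliged to balance inc_ref and dec_ref calls; that bookkeeping
is where the mess comes from. A hypothetical sketch:

    #include <stdlib.h>

    /* Hypothetical refcounted Token; owners must pair inc_ref/dec_ref. */
    typedef struct lucy_RefCountedToken {
        lucy_Token token;
        lucy_i32_t refcount;
    } lucy_RefCountedToken;

    void
    lucy_RefCountedToken_inc_ref(lucy_RefCountedToken *self)
    {
        self->refcount++;
    }

    /* Destroy the Token only when the last owner lets go. */
    void
    lucy_RefCountedToken_dec_ref(lucy_RefCountedToken *self)
    {
        if (--self->refcount == 0) {
            free(self->token.text);
            free(self);
        }
    }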
The least worst solution I see is to avoid exposing the individual
tokens via the native API, and to allow access to them one at a time
via methods on the TokenBatch object. Instead of returning a native
Token object, next() would return a boolean indicating whether the
TokenBatch cursor is currently located at a valid Token position.
    while (batch.next()) {
        batch.setText( lowerCase(batch.getText()) );
    }
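In C, those cursor methods might reduce to something like the
following (a sketch against the structs above, not settled API):

    #include <stdlib.h>
    #include <string.h>

    /* Advance the cursor; return true iff it lands on a valid Token. */
    lucy_bool_t
    lucy_TokenBatch_next(lucy_TokenBatch *self)
    {
        if (!self->initialized) {
            self->current     = self->first;
            self->initialized = 1;
        }
        else if (self->current != NULL) {
            self->current = self->current->next;
        }
        return self->current != NULL;
    }

    /* Accessors operate on whichever Token the cursor points at. */
    char*
    lucy_TokenBatch_get_text(lucy_TokenBatch *self)
    {
        return self->current->text;
    }

    void
    lucy_TokenBatch_set_text(lucy_TokenBatch *self, const char *text,
                             lucy_i32_t len)
    {
        lucy_Token *token = self->current;
        free(token->text);
        token->text = (char*)malloc(len + 1);
        memcpy(token->text, text, len);
        token->text[len] = '\0';
        token->len = len;
    }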
This is perhaps a little less intuitive than iterating over a faux
array of Tokens, but it's similar to how Lucene's TermDocs, TermEnum,
and Scorer classes all work.
It's also the fastest scheme I've come up with. Without making
TokenBatch a native array of Token objects...
    for my $token (@tokens) {
        $token->set_text( lc( $token->get_text ) );
    }
... there's no way to avoid the method-call overhead of next(). Any
user-defined subclass of Analyzer implemented using the native API is
therefore going to be a little slower than the core Analyzer classes,
which are allowed to access struct members directly, but that's the
way it goes. If we supply a good set of flexible Analyzer classes,
hopefully few people will need to create custom Analyzers and their
indexing apps will stay fast enough.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/