I'm finally getting around to extending Lucene.  The only piece I've played
with so far is Similarity, which I have working.  Now I want to write a
custom TokenFilter.

I figured the first step is just to get the class modeling right, so I
started by implementing a shell analyzer that doesn't do anything new:

class FooAnalyzer(object):
    def __init__(self, analyzer=PyLucene.StandardAnalyzer()):
        self.analyzer = analyzer
    def tokenStream(self, fieldName, reader):
        return self.analyzer.tokenStream(fieldName, reader)

I then added a section in AnalyzerUtils.main() that uses FooAnalyzer (after
Simple and Standard).  The output was identical to StandardAnalyzer's.  Way
to go me.
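For anyone following along without PyLucene installed, the delegation
pattern itself can be checked with plain Python stand-ins (FakeAnalyzer
below is made up for the demo; it is not part of PyLucene):

```python
# Minimal sketch of the delegation test, using a hypothetical stand-in
# analyzer instead of PyLucene.StandardAnalyzer.

class FakeAnalyzer(object):
    def tokenStream(self, fieldName, reader):
        # Stand-in: treat the "reader" as a plain string and split it.
        return iter(reader.split())

class FooAnalyzer(object):
    def __init__(self, analyzer=None):
        self.analyzer = analyzer if analyzer is not None else FakeAnalyzer()
    def tokenStream(self, fieldName, reader):
        # Pure delegation -- output should match the wrapped analyzer's.
        return self.analyzer.tokenStream(fieldName, reader)

text = "the quick brown fox"
plain = list(FakeAnalyzer().tokenStream("contents", text))
wrapped = list(FooAnalyzer().tokenStream("contents", text))
print(plain == wrapped)  # True: the wrapper changes nothing
```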

Next I implemented a shell TokenFilter class:

class FooTokenFilter(object):
    def __init__(self, tokenStream):
        self.tokenStream = tokenStream
    def close(self):
        self.tokenStream.close()
    def next(self):
        return self.tokenStream.next()

And modified FooAnalyzer to use it:

    def tokenStream(self, fieldName, reader):
        return FooTokenFilter(self.analyzer.tokenStream(fieldName, reader))

My test program errored out with:

  File "/home/ofer/bin/foo.py", line 72, in tokensFromAnalysis
    return [token for token in analyzer.tokenStream("contents",
StringReader(text))]
TypeError: iteration over non-sequence

That was strange.  It fails to use my tokenstream, even though I implemented
the interface.  It occurred to me that, Pythonically, the TokenFilter
object would need an __iter__ method to work in that context.  I ran a test
script
with 'print dir(tokenstream)' on a standard TokenFilter object, and it did
have __iter__ defined.  So I added it to mine:

class FooTokenFilter(object):
    def __init__(self, tokenStream):
        self.tokenStream = tokenStream
    def __iter__(self):
        return self
    def close(self):
        self.tokenStream.close()
    def next(self):
        return self.tokenStream.next()
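As a sanity check, the __iter__ requirement can be demonstrated without
Lucene at all: an object with only a next method isn't iterable until it
also defines __iter__.  (Sketch below uses the Python 3 spelling __next__;
in the Python 2 I'm running it's just next.)

```python
# Why __iter__ is required: the for-loop / list-comprehension machinery
# calls iter() on the object before it ever calls next().

class NoIter(object):
    def __init__(self, items):
        self.items = list(items)
    def __next__(self):              # spelled 'next' in Python 2
        if not self.items:
            raise StopIteration
        return self.items.pop(0)

class WithIter(NoIter):
    def __iter__(self):
        return self

try:
    [t for t in NoIter("abc")]
except TypeError as e:
    print(e)                          # object is not iterable

print([t for t in WithIter("abc")])  # ['a', 'b', 'c']
```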

I ran the test script again, and this time instead of getting an error
message about iteration over non-sequence, it just ran indefinitely, chewing
up the CPU and eating up memory as fast as it could.
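My current guess at the cause: Lucene's TokenStream.next() signals
end-of-stream by returning null (None on the Python side) rather than
raising StopIteration, so a wrapper like mine never stops.  If that's
right, the fix would look something like this (FakeStream is just a
stand-in so the sketch is self-contained; __next__ is the Python 3
spelling of next):

```python
# Hypothetical fix: translate a None return from the wrapped stream's
# next() into StopIteration, which is what Python iteration expects.

class FooTokenFilter(object):
    def __init__(self, tokenStream):
        self.tokenStream = tokenStream
    def __iter__(self):
        return self
    def __next__(self):
        token = self.tokenStream.next()
        if token is None:            # Lucene signals end with null
            raise StopIteration
        return token
    next = __next__                  # keep the Python 2 name as well
    def close(self):
        self.tokenStream.close()

# Stand-in stream mimicking Lucene's null-at-end behavior:
class FakeStream(object):
    def __init__(self, tokens):
        self.tokens = list(tokens)
    def next(self):
        return self.tokens.pop(0) if self.tokens else None
    def close(self):
        pass

print([t for t in FooTokenFilter(FakeStream(["a", "b", "c"]))])
# ['a', 'b', 'c'] -- and it terminates
```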

---

I checked the PyLucene README, and the note regarding custom tokenfilters
said this:

       "In order to instantiate such a custom token filter, the additional
       tokenFilter() factory method defined on
       org.apache.lucene.analysis.TokenStream instances needs to be invoked
       with the Python extension instance."

However, I couldn't find any reference to a tokenFilter() method in the
TokenStream class family in the Lucene 2.1 docs.

Where have I gone wrong?

-ofer

PS-Andi, thanks for porting the lia libraries.  They're amazingly useful.
Wish they were installed by 'python setup.py install'.  :)

_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
