I'm finally getting around to extending Lucene. The only piece I've played
with so far is Similarity, which I have working. Now I want to write a
custom TokenFilter.
I figured the first step is just get the class modeling correctly, so I
started by implementing a shell analyzer that doesn't do anything new:
class FooAnalyzer(object):
    def __init__(self, analyzer=PyLucene.StandardAnalyzer()):
        self.analyzer = analyzer

    def tokenStream(self, fieldName, reader):
        return self.analyzer.tokenStream(fieldName, reader)
I then added a section in AnalyzerUtils.main() that uses FooAnalyzer (after
Simple and Standard). The output was identical to StandardAnalyzer's. Way
to go me.
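For what it's worth, the delegation pattern itself can be checked outside
PyLucene entirely. Here's a pure-Python sketch; StubAnalyzer is a made-up
stand-in for StandardAnalyzer, not a real PyLucene class:

```python
class StubAnalyzer(object):
    """Stand-in for PyLucene.StandardAnalyzer (hypothetical, for illustration)."""
    def tokenStream(self, fieldName, reader):
        # A real analyzer would tokenize the reader's contents; here we
        # just iterate over whitespace-split words of a plain string.
        return iter(reader.split())

class FooAnalyzer(object):
    def __init__(self, analyzer=None):
        self.analyzer = analyzer if analyzer is not None else StubAnalyzer()

    def tokenStream(self, fieldName, reader):
        # Pure delegation: output should match the wrapped analyzer exactly.
        return self.analyzer.tokenStream(fieldName, reader)

tokens = list(FooAnalyzer().tokenStream("contents", "way to go me"))
# tokens == ['way', 'to', 'go', 'me']
```

Since the wrapper adds nothing, any difference from the wrapped analyzer's
output would point at the wrapping itself.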
Next I implemented a shell TokenFilter class:
class FooTokenFilter(object):
    def __init__(self, tokenStream):
        self.tokenStream = tokenStream

    def close(self):
        self.tokenStream.close()

    def next(self):
        return self.tokenStream.next()
And modified FooAnalyzer to use it:
    def tokenStream(self, fieldName, reader):
        return FooTokenFilter(self.analyzer.tokenStream(fieldName, reader))
My test program errored out with:
  File "/home/ofer/bin/foo.py", line 72, in tokensFromAnalysis
    return [token for token in analyzer.tokenStream("contents",
                                                    StringReader(text))]
TypeError: iteration over non-sequence
That was strange: it failed to iterate over my token stream even though I
had implemented the interface. It occurred to me that, Pythonically, the
TokenFilter object would need an __iter__ method to work in that context.
I ran a test script with 'print dir(tokenstream)' on a standard TokenFilter
object, and it did have __iter__ defined. So I added it to mine:
class FooTokenFilter(object):
    def __init__(self, tokenStream):
        self.tokenStream = tokenStream

    def __iter__(self):
        return self

    def close(self):
        self.tokenStream.close()

    def next(self):
        return self.tokenStream.next()
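For reference, here is the protocol I was trying to satisfy, in plain Python
with no PyLucene involved: a for-loop or list comprehension calls __iter__()
once, then the "next" method repeatedly until it raises StopIteration.
(Python 2 spells the method next(); newer Pythons renamed it __next__(), so
this sketch defines both.)

```python
class CountDown(object):
    def __init__(self, start):
        self.n = start

    def __iter__(self):
        # Returning self is enough: the object is its own iterator.
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration  # this is what ends the loop
        self.n -= 1
        return self.n + 1

    next = __next__              # Python 2 name for the same method

values = [x for x in CountDown(3)]
# values == [3, 2, 1]
```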
I ran the test script again, and this time instead of getting an error
message about iteration over non-sequence, it just ran indefinitely, chewing
up the CPU and eating up memory as fast as it could.
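My working theory, sketched below in pure Python: the underlying Lucene
stream signals end-of-stream by returning null from next() (which PyLucene
would map to None) rather than by raising StopIteration, so the list
comprehension never terminates and just collects None forever. That's an
assumption on my part; the stand-in classes here are made up for
illustration.

```python
import itertools

class NoneTerminated(object):
    """Stand-in for a token stream that returns None when exhausted."""
    def __init__(self, tokens):
        self.tokens = list(tokens)

    def next(self):
        return self.tokens.pop(0) if self.tokens else None

class BrokenFilter(object):
    def __init__(self, stream):
        self.stream = stream

    def __iter__(self):
        return self

    def __next__(self):
        return self.stream.next()  # never raises StopIteration -> loops forever

    next = __next__

# islice keeps the demonstration finite; a bare list() here would hang.
leaked = list(itertools.islice(BrokenFilter(NoneTerminated(["a", "b"])), 5))
# leaked == ['a', 'b', None, None, None]

class FixedFilter(BrokenFilter):
    def __next__(self):
        token = self.stream.next()
        if token is None:
            raise StopIteration  # translate end-of-stream into Python's signal
        return token

    next = __next__

fixed = list(FixedFilter(NoneTerminated(["a", "b"])))
# fixed == ['a', 'b']
```

If that theory is right, translating None into StopIteration in my filter's
next() would stop the runaway loop.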
---
I checked the PyLucene README, and the note regarding custom token filters
said this:
"In order to instantiate such a custom token filter, the additional
tokenFilter() factory method defined on
org.apache.lucene.analysis.TokenStream instances needs to be invoked
with the Python extension instance."
However, I couldn't find any tokenFilter() method anywhere in the
TokenStream class family in the Lucene 2.1 docs.
Where have I gone wrong?
-ofer
PS-Andi, thanks for porting the lia libraries. They're amazingly useful.
Wish they were installed by 'python setup.py install'. :)
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev