Hi,
I'm trying to migrate some Analyzers from API 3.6 to 6.2 and I'm not sure
if I've got the right approach. I'm using PyLucene, so let's assume this is
pseudo-code.
In 3.x (and up to 4), I had access to the StringReader containing the
data in the overridden tokenStream(fieldName, reader):
    class TokenStream3(PythonTokenStream):
        def __init__(self, reader):
            self.data = DATA_FROM_READER(reader)
            self.i = 0
            # prepare termAtt/offsetAtt/posIncrAtt and other helpers

        def incrementToken(self):
            if self.i == len(self.data):
                return False
            # stuff from self.data into termAtt/offsetAtt/posIncrAtt
            self.i += 1
            return True

    class Analyzer3(PythonAnalyzer):
        def tokenStream(self, fieldName, reader):
            return TokenStream3(reader)
-----
In 5.x/6.x, the only approach I've found involves some ugly
indirection: capture the active reader in Analyzer.initReader() and
access it via a callback in the Tokenizer.
    class Tokenizer6(PythonTokenizer):
        def __init__(self, getReader):
            # callable for retrieving current reader
            self.getReader = getReader
            self.i = 0
            self.data = None

        def incrementToken(self):
            if self.i == 0:
                self.data = DATA_FROM_READER(self.getReader())
            if self.i == len(self.data):
                # we are reused - reset
                self.i = 0
                return False
            # stuff from self.data into termAtt/offsetAtt/posIncrAtt
            self.i += 1
            return True

    class Analyzer6(PythonAnalyzer):
        def createComponents(self, fieldName):
            return Analyzer.TokenStreamComponents(
                Tokenizer6(lambda: self._reader))

        def initReader(self, fieldName, reader):
            # capture reader
            self._reader = reader
            return reader
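
To make the reuse concern concrete, here is a Lucene-free sketch of the same
indirection. FakeTokenizer, FakeAnalyzer and tokens() are hypothetical
stand-ins, not PyLucene classes, and io.StringIO plays the role of the reader.
It assumes the analyzer builds its tokenizer once and is handed a fresh reader
for every stream, which is my understanding of component reuse and may not
match the real implementation in every detail.

```python
import io


class FakeTokenizer:
    """Hypothetical stand-in for Tokenizer6: fetches its reader via a callback."""

    def __init__(self, get_reader):
        self.get_reader = get_reader  # callable returning the current reader
        self.i = 0
        self.data = None

    def increment_token(self):
        if self.i == 0:
            # First call for this stream: read from whichever reader is current.
            self.data = self.get_reader().read().split()
        if self.i == len(self.data):
            # Exhausted: reset so the next stream starts from scratch.
            self.i = 0
            return None
        token = self.data[self.i]
        self.i += 1
        return token


class FakeAnalyzer:
    """Hypothetical stand-in for Analyzer6 with a single reused tokenizer."""

    def __init__(self):
        self._reader = None
        self._tokenizer = None

    def token_stream(self, text):
        # Plays the role of initReader(): capture the reader for this stream.
        self._reader = io.StringIO(text)
        if self._tokenizer is None:
            # Plays the role of createComponents(): runs only once.
            self._tokenizer = FakeTokenizer(lambda: self._reader)
        return self._tokenizer


def tokens(analyzer, text):
    """Drain one stream completely and return its tokens as a list."""
    stream = analyzer.token_stream(text)
    out = []
    while True:
        tok = stream.increment_token()
        if tok is None:
            return out
        out.append(tok)
```

With these stand-ins, calling tokens() twice on the same FakeAnalyzer goes
through the same tokenizer object and the second call picks up the second
reader. The sketch also exposes a weak spot of resetting inside the increment
method: a consumer that stops before the stream is exhausted leaves i non-zero,
so the next stream continues with the stale data.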
-----
Is this sane?
--dirk