Hi, I apologise in advance if this information is clearly available somewhere else; I've spent quite a bit of time looking and not yet found it. I've also experimented with the PyLucene (3.0.1) API for many hours without finding an acceptable solution, but that could mainly be due to chronic lack of sleep.
1) For our open natural-language corpora search project we need to be able to get at the _character offsets_ of the matching terms in the original documents. This information definitely seems to be stored with the index, but I've not yet been able to figure out how to access it effectively. It would be extremely helpful if someone could point out how to get at it from (Py)Lucene 3.0.1.

2) If character offsets aren't accessible, token positions would be a vaguely acceptable fallback, though they would limit the final capabilities of our system somewhat. So, although token positions are definitely not our first choice, I'm a little curious why, in the following code, the positions returned by MultipleTermPositions don't match the term positions in the Analyzer's token stream. (If it helps: the input documents are currently SGML or some derivative, so there is a little markup that seems to be ignored by one tokenizer and not the other, and I can't see where to tell MultipleTermPositions or the IndexSearcher which Analyzer to use...)
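To illustrate both points outside Lucene: a plain-Python sketch (not PyLucene; the regexes, token rules, and sample text here are my own assumptions, not StandardAnalyzer's actual grammar) showing (a) how character offsets fall out of tokenization naturally, which is roughly what Lucene's OffsetAttribute exposes per token, and (b) how stripping markup before tokenizing shifts token positions, which is one way index-time positions can disagree with positions from re-analysis of the raw file:

```python
import re

# Hypothetical SGML-ish input, invented for illustration.
text = 'the <hi rend="it">quick</hi> brown fox'

# (a) Character offsets: tokenize with finditer and each match carries
# its start()/end() in the original string.
tokens = [(m.group(), m.start(), m.end())
          for m in re.finditer(r"[^\s<>\"=/]+", text)]
# 'brown' comes out as ('brown', 29, 34)

# (b) Position drift: strip the tags first and the same word lands at a
# different token position than in the raw text.
stripped = re.sub(r"<[^>]+>", " ", text)
tokens_no_markup = re.findall(r"[^\s<>\"=/]+", stripped)

pos_with_markup = [t for t, s, e in tokens].index("brown")  # 6
pos_no_markup = tokens_no_markup.index("brown")             # 2
```

If the indexing pipeline saw the stripped text while the re-analysis in the code below sees the raw file (or vice versa), every token after the first tag would be shifted like this, which would explain positions that are "not even close".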
searcher = IndexSearcher(indexdir, True)
analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)
query = QueryParser(Version.LUCENE_CURRENT, field, analyzer).parse(querystring)
terms = HashSet()
query.extractTerms(terms)
scoreDocs = searcher.search(query, 500).scoreDocs

for scoreDoc in scoreDocs:
    positions = MultipleTermPositions(searcher.getIndexReader(), list(terms))
    positions.skipTo(scoreDoc.doc)
    if positions.doc() == scoreDoc.doc:
        doc = searcher.doc(scoreDoc.doc)
        fpath = path + doc.get('path')
        fhandle = open(fpath, 'r')
        text = fhandle.read()
        fhandle.close()
        reader = StringReader(text)
        tokenStream = analyzer.tokenStream(field, reader)
        # offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class_)
        # termAttribute = tokenStream.getAttribute(TermAttribute.class_)
        p = 0
        for i in range(positions.freq()):
            pos = positions.nextPosition()
            # advance the stream so the last consumed token is the one at
            # position pos (positions are 0-based, so pos - p + 1 steps)
            for j in range(pos - p + 1):
                tokenStream.incrementToken()
            # do something here with the token (which should match the
            # search term but isn't even close...)
            p = pos + 1
    positions.close()
searcher.close()

Any help with either of these problems would be greatly appreciated!

Cheers,
Ben

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org