Hi, I apologise in advance if this information is clearly available somewhere else; I've spent quite a bit of time looking and not yet found it. I've also experimented with the PyLucene (3.0.1) API for many hours without finding an acceptable solution, but that could mainly be due to chronic lack of sleep.
1) For our open natural-language corpora search project we need to be able to get at the _character offsets_ of the matching terms in the original documents. This information definitely seems to be stored with the index, but I've not yet been able to figure out how to access it effectively. It would be extremely helpful if someone could point out how to get at it from (Py)Lucene 3.0.1.

2) If character offsets aren't accessible, token positions would be a vaguely acceptable fallback, though they would limit the final capabilities of our system somewhat. So, although token positions are definitely not our first choice, I'm a little curious why, in the following code, the positions returned by MultipleTermPositions don't match the term positions in the Analyzer's token stream. (If it helps: the input documents are currently SGML or some derivative, so there is a little markup that seems to be ignored by one tokenizer and not the other, and I can't see where to tell MultipleTermPositions or the IndexSearcher which Analyzer to use...)
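To illustrate both points outside Lucene: a plain-Python sketch (not PyLucene; the regexes, token rules, and sample text here are my own assumptions, not StandardAnalyzer's actual grammar) showing (a) how character offsets fall out of tokenization naturally, which is roughly what Lucene's OffsetAttribute exposes per token, and (b) how stripping markup before tokenizing shifts token positions, which is one way index-time positions can disagree with positions from re-analysis of the raw file:

```python
import re

# Hypothetical SGML-ish input, invented for illustration.
text = 'the <hi rend="it">quick</hi> brown fox'

# (a) Character offsets: tokenize with finditer and each match carries
# its start()/end() in the original string.
tokens = [(m.group(), m.start(), m.end())
          for m in re.finditer(r"[^\s<>\"=/]+", text)]
# 'brown' comes out as ('brown', 29, 34)

# (b) Position drift: strip the tags first and the same word lands at a
# different token position than in the raw text.
stripped = re.sub(r"<[^>]+>", " ", text)
tokens_no_markup = re.findall(r"[^\s<>\"=/]+", stripped)

pos_with_markup = [t for t, s, e in tokens].index("brown")  # 6
pos_no_markup = tokens_no_markup.index("brown")             # 2
```

If the indexing pipeline saw the stripped text while the re-analysis in the code below sees the raw file (or vice versa), every token after the first tag would be shifted like this, which would explain positions that are "not even close".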
searcher = IndexSearcher(indexdir, True)
analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)
query = QueryParser(Version.LUCENE_CURRENT, field, analyzer).parse(querystring)
terms = HashSet()
query.extractTerms(terms)
scoreDocs = searcher.search(query, 500).scoreDocs

for scoreDoc in scoreDocs:
    positions = MultipleTermPositions(searcher.getIndexReader(), list(terms))
    positions.skipTo(scoreDoc.doc)
    if positions.doc() == scoreDoc.doc:
        doc = searcher.doc(scoreDoc.doc)
        fpath = path + doc.get('path')
        fhandle = open(fpath, 'r')
        text = fhandle.read()
        fhandle.close()
        reader = StringReader(text)
        tokenStream = analyzer.tokenStream(field, reader)
        # offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class_)
        # termAttribute = tokenStream.getAttribute(TermAttribute.class_)
        p = 0
        for i in range(positions.freq()):
            pos = positions.nextPosition()
            # advance the stream so the last consumed token is the one at
            # position pos (positions are 0-based, so pos - p + 1 steps)
            for j in range(pos - p + 1):
                tokenStream.incrementToken()
            # do something here with the token (which should match the
            # search term but isn't even close...)
            p = pos + 1
    positions.close()
searcher.close()

Any help with either of these problems would be greatly appreciated!

Cheers,
Ben

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org