Hi all,

I'm trying to create a new type of Query in Lucene (working from branch_4x)
to support "containment" operators.  These operators allow searches to
account for position relative to markup that is indexed with a Field.  More
specifics about these operators can be found in the paper available here:
http://plg.uwaterloo.ca/~ftp/mt/TechReports/CS-94-30/ [1]. These are
also described in chapters 2 and 5 of the book "Information Retrieval" [2].
Pretty cool stuff!

Implementing this new Query requires semi-random access into a term's
postings list.  The algebra used to build the Query (as explained in [1]
and [2]) relies on several "primitive" postings-list operations:

nextPosition(int referencePoint) - find the next occurrence of a Term after
the given position.
prevPosition(int referencePoint) - find the previous occurrence of a Term
before the given position.
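
For illustration, here is a minimal sketch of these two primitives over an
in-memory, sorted array of positions for a single term in a single document.
The class name PositionList, the sentinel values NO_MORE and NO_PREV, and the
"strictly after / strictly before" reading of the reference point are all
assumptions made for this example; none of it is Lucene API. It also assumes
the positions are distinct.

```java
import java.util.Arrays;

/** Hypothetical helper: positions of one term in one document, sorted ascending. */
final class PositionList {
    static final int NO_MORE = Integer.MAX_VALUE; // no occurrence after the reference point
    static final int NO_PREV = -1;                // no occurrence before the reference point

    private final int[] positions;

    PositionList(int[] sortedDistinctPositions) {
        this.positions = sortedDistinctPositions.clone();
    }

    /** Smallest position strictly greater than referencePoint, or NO_MORE. */
    int nextPosition(int referencePoint) {
        int i = Arrays.binarySearch(positions, referencePoint);
        // Found at i: the successor is at i + 1. Not found: i encodes the insertion point.
        int idx = (i >= 0) ? i + 1 : -(i + 1);
        return (idx < positions.length) ? positions[idx] : NO_MORE;
    }

    /** Largest position strictly less than referencePoint, or NO_PREV. */
    int prevPosition(int referencePoint) {
        int i = Arrays.binarySearch(positions, referencePoint);
        // Found at i: the predecessor is at i - 1. Not found: it sits just before the insertion point.
        int idx = (i >= 0) ? i - 1 : -(i + 1) - 1;
        return (idx >= 0) ? positions[idx] : NO_PREV;
    }
}
```

Both operations are binary searches, so each call costs O(log n) in the
number of positions once the array is available.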

The closest match to this that I've found so far is
DocsAndPositionsEnum[3], which allows forward iteration (nextPosition())
across the positions in a postings-list.  However, DocsAndPositionsEnum
can't move backwards.  It also doesn't have a notion of a "reference
point".  It doesn't seem to be built for what I'm trying to do.

I can see two ways to implement this "prevPosition" functionality, but
neither looks feasible to me.

1) I could wrap DocsAndPositionsEnum, storing each item in the
postings-list in a container as it's iterated over.  This makes moving back
and forth pretty trivial.  This has performance issues though, as all
positions must be iterated through and stored in memory.
2) I could write my own codec and modify the index file format to more
easily support prev() and next().  This could fix the performance
limitations of (1), but the level of effort appears much higher.  I don't
think I'm familiar enough with Lucene to pull this off successfully.
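
To make option (1) concrete, here is a rough sketch of a buffering wrapper
that reads positions lazily, only as far as each query requires, rather than
draining the whole list up front. To keep the example self-contained, the
ForwardPositions interface is a hypothetical stand-in for the forward-only
part of DocsAndPositionsEnum (its freq() and nextPosition() methods);
BufferedPositions and the sentinel values are likewise invented for this
sketch. It assumes positions arrive in ascending order and are distinct.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the forward-only calls on DocsAndPositionsEnum. */
interface ForwardPositions {
    int freq();         // number of positions in the current document
    int nextPosition(); // next position, ascending; call at most freq() times
}

/** Hypothetical wrapper: buffers one document's positions on demand. */
final class BufferedPositions {
    static final int NO_MORE = Integer.MAX_VALUE;
    static final int NO_PREV = -1;

    private final ForwardPositions source;
    private final int[] buffer; // positions read so far, ascending
    private int filled = 0;     // number of valid entries in buffer

    BufferedPositions(ForwardPositions source) {
        this.source = source;
        this.buffer = new int[source.freq()];
    }

    /** Read from the source until a position greater than ref is buffered, or it is exhausted. */
    private void bufferPast(int ref) {
        while (filled < buffer.length && (filled == 0 || buffer[filled - 1] <= ref)) {
            buffer[filled++] = source.nextPosition();
        }
    }

    /** Smallest position strictly greater than ref, or NO_MORE. */
    int nextPosition(int ref) {
        bufferPast(ref);
        int i = Arrays.binarySearch(buffer, 0, filled, ref);
        int idx = (i >= 0) ? i + 1 : -(i + 1);
        return (idx < filled) ? buffer[idx] : NO_MORE;
    }

    /** Largest position strictly less than ref, or NO_PREV. */
    int prevPosition(int ref) {
        bufferPast(ref); // guarantees every position below ref is buffered
        int i = Arrays.binarySearch(buffer, 0, filled, ref);
        int idx = (i >= 0) ? i - 1 : -(i + 1) - 1;
        return (idx >= 0) ? buffer[idx] : NO_PREV;
    }
}
```

A wrapper like this is only valid for the current document and would need to
be rebuilt after each move to a new document. It lowers the cost when
reference points cluster near the start of a document, but the worst case
still reads and stores every position, so the memory concern in (1) remains.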

Does anyone see any other possibilities for accessing previous postings?
 Am I looking at the wrong class (DocsAndPositionsEnum) entirely?  Is there
another class available that supports going backward?  Does anyone know if
this functionality is reasonable to implement given the default index-file
format, and just not exposed in DocsAndPositionsEnum?  If the default
index-file codec isn't built to support this type of use (as it appears),
does anyone know of an alternative codec that would be more supportive of
this functionality?

Any suggestions or comments greatly appreciated.  Thanks for your thoughts
and time.

Cheers,

Jason Gerlowski

[1] http://plg.uwaterloo.ca/~ftp/mt/TechReports/CS-94-30/
[2] "Information Retrieval: Implementing and Evaluating Search Engines";
Stefan Büttcher, Charles L. A. Clarke, Gordon V. Cormack. 2010
[3]
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/index/DocsAndPositionsEnum.html
