Hi all,

I'm trying to create a new type of Query in Lucene (working from branch_4x) to support "containment" operators. These operators allow searches to account for position relative to markup that is indexed with a Field. More specifics about these operators can be found in the paper available at http://plg.uwaterloo.ca/~ftp/mt/TechReports/CS-94-30/ [1]. They are also described in chapters 2 and 5 of the book "Information Retrieval" [2]. Pretty cool stuff!

Implementing this new Query requires semi-random access into a term's postings list. The algebra used to build the Query (as explained in [1] and [2]) relies on several "primitive" postings-list operations:

  nextPosition(int referencePoint) - find the next occurrence of a Term after the given position.
  prevPosition(int referencePoint) - find the previous occurrence of a Term before the given position.

The closest match I've found so far is DocsAndPositionsEnum [3], which allows forward iteration (nextPosition()) across the positions in a postings list. However, DocsAndPositionsEnum can't move backwards, and it has no notion of a "reference point". It doesn't seem to be built for what I'm trying to do.

I can see two ways to implement this "prevPosition" functionality, but neither looks feasible to me.

1) I could wrap DocsAndPositionsEnum, storing each item in the postings list in a container as it's iterated over. This makes moving back and forth pretty trivial. It has performance issues though, as all positions must be iterated through and stored in memory.

2) I could write my own codec and modify the index file format to more easily support prev() and next(). This could fix the performance limitations of (1), but the level of effort appears much higher. I don't think I'm familiar enough with Lucene to pull this off successfully.

So, a few questions:

- Does anyone see any other possibilities for accessing previous postings?
- Am I looking at the wrong class (DocsAndPositionsEnum) entirely? Is there another class available that supports going backward?
- Does anyone know if this functionality is reasonable to implement given the default index-file format, and is just not exposed in DocsAndPositionsEnum?
- If the default index-file codec isn't built to support this type of use (as it appears), does anyone know of an alternative codec that would be more supportive of this functionality?

Any suggestions or comments are greatly appreciated. Thanks for your thoughts and time.

Cheers,
Jason Gerlowski
[email protected]

[1] http://plg.uwaterloo.ca/~ftp/mt/TechReports/CS-94-30/
[2] "Information Retrieval: Implementing and Evaluating Search Engines"; Stefan Buttcher, Charles L. A. Clarke, Gordon V. Cormack. 2010
[3] http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/index/DocsAndPositionsEnum.html
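
P.S. In case it helps make option (1) concrete, here is a rough sketch of the kind of wrapper I have in mind. It is only an illustration, not working Lucene code: the class name CachedPositions and the NO_POSITION sentinel are made up for this example, and I'm assuming "next" and "prev" mean strictly after and strictly before the reference point. It takes the positions for one term in one document as a plain sorted int array. In Lucene 4.x I would expect to fill that array by calling DocsAndPositionsEnum.nextPosition() freq() times for the current document. Once the positions are cached, both primitives reduce to a binary search.

```java
/**
 * Hypothetical helper (not part of Lucene): caches the positions of one term
 * in one document so they can be searched in either direction.
 */
final class CachedPositions {
    /** Sentinel returned when no matching position exists (my own convention). */
    static final int NO_POSITION = -1;

    // Positions in ascending order, which is how a postings list yields them.
    private final int[] positions;

    CachedPositions(int[] sortedPositions) {
        // Assumption: the caller has already drained the enum into this array.
        this.positions = sortedPositions.clone();
    }

    /** Smallest cached position strictly greater than referencePoint. */
    int nextPosition(int referencePoint) {
        int idx = firstIndexGreaterThan(referencePoint);
        return idx < positions.length ? positions[idx] : NO_POSITION;
    }

    /** Largest cached position strictly less than referencePoint. */
    int prevPosition(int referencePoint) {
        int idx = firstIndexAtLeast(referencePoint) - 1;
        return idx >= 0 ? positions[idx] : NO_POSITION;
    }

    // Binary search: index of the first element > key, or length if none.
    private int firstIndexGreaterThan(int key) {
        int lo = 0, hi = positions.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (positions[mid] <= key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Binary search: index of the first element >= key, or length if none.
    private int firstIndexAtLeast(int key) {
        int lo = 0, hi = positions.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (positions[mid] < key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

Lookups are O(log n) per call, but the memory and up-front iteration cost I mentioned for option (1) is still there, since every position for the document has to be read before the first prevPosition() call.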
