Jason, DocsAndPositionsEnum is the right place to add one or more methods that allow random access to a given position. A simple way to move backwards once is to keep the previous position when doing nextPosition(). When once is enough that should be a good start.
Otherwise random acces can be implemented by adding a codec in Lucene that uses Elias-Fano encoding for the positions. An implementation in MG4J is described by Sebastiano Vigna, "Quasi Succinct Indices", section 6, http://arxiv.org/pdf/1206.4300 >From table 5 there one might conclude that this is faster than Lucene 3.6.0 for position access. It is better compressed, see table 2. I don't know how much the positions encoding in Lucene has changed since 3.6.0. LUCENE-5542 has an implementation of Elias Fano sequences on byte arrays. Random access is in the advanceToValue and backToValue methods. These could be used by a codec that implements random access on positions. Regards, Paul Elschot On 26-08-14 20:30, Jason Gerlowski wrote: > Hi all, > > I'm trying to create a new type of Query in Lucene (working from > branch_4x) to support "containment" operators. These operators allow > searches to account for position relative to markup that is indexed with > a Field. More specifics about these operators can be found in the paper > available here: http://plg.uwaterloo.ca/%7Eftp/mt/TechReports/CS-94-30/ > <http://plg.uwaterloo.ca/~ftp/mt/TechReports/CS-94-30/>) [1]. These are > also described in chapters 2 and 5 of the book "Information > Retrieval"[2]. Pretty cool stuff! > > Implementing this new Query requires semi-random access into a term's > postings list. The algebra used to build the Query (as explained in [1] > and [2]) relies on several "primitive" postings-list operations: > > nextPosition(int referencePoint) - find the next occurrence of a Term > after the given position. > prevPosition(int referencePoint) - find the previous occurrence of a > Term after the given position. > > The closest match to this that I've found so far is > DocsAndPositionsEnum[3], which allows forward iteration (nextPosition()) > across the positions in a postings-list. However DocsAndPositionsEnum > can't move backwards. It also doesn't have a notion of a "reference > point". It doesn't seem to be built for what I'm trying to do. > > I can see two ways to implement this "prevPosition" functionality, but > neither look feasible to me. > > 1) I could wrap DocsAndPositionsEnum, storing each item in the > postings-list in a container as it's iterated over. This makes moving > back and forth pretty trivial. This has performance issues though, as > all positions must be iterated through and stored in memory. > 2) I could write my own codec and modify the index file format to more > easily support prev() and next(). This could fix the performance > limitations of (1), but the level of effort appears much higher. I > don't think I'm familiar enough with Lucene to pull this off successfully. > > Does anyone see any other possibilities for accessing previous postings? > Am I looking at the wrong class (DocsAndPositionsEnum) entirely? Is > there another class available that supports going backward? Does anyone > know if this functionality is reasonable to implement given the default > index-file format, and just not exposed in DocsAndPositionsEnum? If the > default index-file codec isn't build to support this type of use (as it > appears), does anyone know of an alternative codec that would be more > supportive of this functionality? > > Any suggestions or comments greatly appreciated. Thanks for your > thoughts and time. > > Cheers, > > Jason Gerlowski > [email protected] <mailto:[email protected]> > > [1] http://plg.uwaterloo.ca/%7Eftp/mt/TechReports/CS-94-30/ > <http://plg.uwaterloo.ca/~ftp/mt/TechReports/CS-94-30/> > [2] "Information Retrieval: Implementing and Evaluating Search Engines"; > Stefan Buttcher, Charles L. A. Clarke, Gordon V. Cormack. 2010 > [3] > http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/index/DocsAndPositionsEnum.html --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
