Jason,

DocsAndPositionsEnum is the right place to add one or more methods that
allow random access to a given position.
A simple way to move backwards once is to keep the previous position
when doing nextPosition(). When once is enough that should be a good start.

Otherwise random acces can be implemented by adding a codec in Lucene
that uses Elias-Fano encoding for the positions. An implementation in
MG4J is described by Sebastiano Vigna, "Quasi Succinct Indices", section
6, http://arxiv.org/pdf/1206.4300
>From table 5 there one might conclude that this is faster than Lucene
3.6.0 for position access. It is better compressed, see table 2.
I don't know how much the positions encoding in Lucene has changed since
3.6.0.

LUCENE-5542 has an implementation of Elias Fano sequences on byte
arrays. Random access is in the advanceToValue and backToValue methods.
These could be used by a codec that implements random access on positions.


Regards,
Paul Elschot


On 26-08-14 20:30, Jason Gerlowski wrote:
> Hi all,
> 
> I'm trying to create a new type of Query in Lucene (working from
> branch_4x) to support "containment" operators.  These operators allow
> searches to account for position relative to markup that is indexed with
> a Field.  More specifics about these operators can be found in the paper
> available here: http://plg.uwaterloo.ca/%7Eftp/mt/TechReports/CS-94-30/
> <http://plg.uwaterloo.ca/~ftp/mt/TechReports/CS-94-30/>) [1]. These are
> also described in chapters 2 and 5 of the book "Information
> Retrieval"[2].  Pretty cool stuff!
> 
> Implementing this new Query requires semi-random access into a term's
> postings list.  The algebra used to build the Query (as explained in [1]
> and [2]) relies on several "primitive" postings-list operations:
> 
> nextPosition(int referencePoint) - find the next occurrence of a Term
> after the given position.
> prevPosition(int referencePoint) - find the previous occurrence of a
> Term after the given position.
> 
> The closest match to this that I've found so far is
> DocsAndPositionsEnum[3], which allows forward iteration (nextPosition())
> across the positions in a postings-list.  However DocsAndPositionsEnum
> can't move backwards.  It also doesn't have a notion of a "reference
> point".  It doesn't seem to be built for what I'm trying to do.
> 
> I can see two ways to implement this "prevPosition" functionality, but
> neither look feasible to me.
> 
> 1) I could wrap DocsAndPositionsEnum, storing each item in the
> postings-list in a container as it's iterated over.  This makes moving
> back and forth pretty trivial.  This has performance issues though, as
> all positions must be iterated through and stored in memory.
> 2) I could write my own codec and modify the index file format to more
> easily support prev() and next().  This could fix the performance
> limitations of (1), but the level of effort appears much higher.  I
> don't think I'm familiar enough with Lucene to pull this off successfully.
> 
> Does anyone see any other possibilities for accessing previous postings?
>  Am I looking at the wrong class (DocsAndPositionsEnum) entirely?  Is
> there another class available that supports going backward?  Does anyone
> know if this functionality is reasonable to implement given the default
> index-file format, and just not exposed in DocsAndPositionsEnum?  If the
> default index-file codec isn't build to support this type of use (as it
> appears), does anyone know of an alternative codec that would be more
> supportive of this functionality?
> 
> Any suggestions or comments greatly appreciated.  Thanks for your
> thoughts and time.
> 
> Cheers,
> 
> Jason Gerlowski
> [email protected] <mailto:[email protected]>
> 
> [1] http://plg.uwaterloo.ca/%7Eftp/mt/TechReports/CS-94-30/
> <http://plg.uwaterloo.ca/~ftp/mt/TechReports/CS-94-30/> 
> [2] "Information Retrieval: Implementing and Evaluating Search Engines";
> Stefan Buttcher, Charles L. A. Clarke, Gordon V. Cormack. 2010
> [3] 
> http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/index/DocsAndPositionsEnum.html


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to