Re: docid set compression and boolean docid set operations

eks dev Sat, 13 Sep 2008 01:09:40 -0700

Hi Anmol, Paul, 

great that someone is taking on p4delta for lucene! There are people like me 
that beleive this could bring some really noce performance benefits (if we get 
only 50% of speed-up the authors reported, it will still be huge)


> Moreover, we have implementations of Logic operators on the underlying
> docsets which act like filters in Lucene.

I have not looked at the code, but I think this part can be spared as soon as 
https://issues.apache.org/jira/browse/LUCENE-1345 gets done. The idea in this 
issue is to make it possible for Filter to enter BooleanQuery as a clause so we 
can focus complete iterator arithmetic at one place. Lucene is anyhow doing 
this internally.

> True, the table is a placeholder right now. I will run my performance
> tests and update the table in the next day or two.

Maybe better test would be to compare against Paul's CompactVIntFilter as they 
are also compressed and they are in memory representation of postings list 
format in lucene (p4delta should "compete" agains VInt lists)

The most difficult thing here is to replace VInt postings list compression with 
p4delta (to be precise, authors recommend VInt approach for very short postings 
and p4delta for the rest). I guess it will be easy to do it for omitTf(true) 
case, but I do not know how we could do it for interlaved postings  
(docIDDelta, term frequency) *.   








----- Original Message ----
> From: Anmol Bhasin <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Saturday, 13 September, 2008 9:21:21
> Subject: Re: docid set compression and boolean docid set operations
> 
> Hi,
> 
> Michael :
> 
> True, the table is a placeholder right now. I will run my performance
> tests and update the table in the next day or two.
> 
> Paul :
> 
> Thanks for skimming over the code. As John mentioned in his email, we
> currently use Kamikaze for in memory caching for document hits. The
> Kamikaze project is aimed to provide a docset implementation using
> either an Integer array, an OpenBitSet or P4Delta data stucture using
> the OpenBitSet(lucene)  depending on  the size and range of the
> docset. There is a utility function exposed in
> com.kamikaze.docidset.utils. DocSetFactory.java which can decide on
> the fly given the parameters as to what underlying datastructure
> should be employed, however it is not tested thoroughly.
> 
> Moreover, we have implementations of Logic operators on the underlying
> docsets which act like filters in Lucene. Again, these can be employed
> to perform complex logic ops on the underlying docsets which in turn
> could themselves be composite docsets generated using AND|OR|NOT
> operations. The implementations do not materialize the interim or
> final structures but simply expose an iterator to walk the docsets.
> 
> It would be wonderful if the indexing structure can be augumented
> using Kamikaze. I can start
> proactively modifying/improving the implementation/packaging to ease
> the process. As for storing the term frequencies and positions with
> this datastructure, let me revisit the literature to see how best, if
> at all we can assimilate them.
> 
> Thanks for the interest in Kamikaze and I would keep you posted once I
> have updated performance numbers on the wiki.
> 
> Thanks,
> Anmol
> 
> 
> 
> 
> 
> com.kamikaze.docidset.utils
> 
> 
> On Thu, Sep 11, 2008 at 5:23 PM, John Wang wrote:
> > Hi guys:
> >
> >     I will let the author, Anmol Bhasin to respond with details.
> >
> >     In our use case, we are not making changes to the index because we do
> > not want to diverge from the lucene code base. (thought it'd be great if we
> > can enhance indexing structure with this) We load the docIdSets into memory
> > for caching reasons.
> >
> > Thanks
> >
> > -John
> >
> > On Thu, Sep 11, 2008 at 3:28 AM, Paul Elschot 
> > wrote:
> >>
> >> John,
> >>
> >> I've taken a first look at the code, and I have a few questions.
> >>
> >> Did I understand correctly that it is basically a two way
> >> conversion between an integer array and an (Open)BitSet
> >> representing a p4delta data structure?
> >>
> >> In that case it would still be necessary to extend the lucene
> >> index structure to make it understand the p4delta data structure
> >> at the appropriate places.
> >> I can help getting the code integrated into lucene, but I've
> >> never done an index structure extension, so I'd like to have
> >> some support from this list for that.
> >> The code would initially need some package restructuring and
> >> layout changes, and then it could move forward to an index
> >> structure extension.
> >>
> >> Would you have some ideas on how to use de p4delta structure
> >> to store docIds, term frequencies and term positions?
> >> The references give some insights there, but it seems that there
> >> is still quite a bit of work to do get such "details" right.
> >> Fortunately, the existing Lucene TermDocs and TermPositions
> >> appear to be just right for this.
> >>
> >> Regards,
> >> Paul Elschot
> >>
> >>
> >> Op Wednesday 10 September 2008 23:09:18 schreef John Wang:
> >> > Sorry, I meant lucene 2.4
> >> >
> >> > -John
> >> >
> >> > On Wed, Sep 10, 2008 at 2:08 PM, John Wang 
> >> wrote:
> >> > > Hi guys:
> >> > >
> >> > >      We have build this on top of the lucene 1.4. api/refactoring
> >> > > for docid sets and docIdIterater.
> >> > >
> >> > >      We've implemented the p4Delta compression algorithm presented
> >> > > at www2008: http://www2008.org/papers/fp618.html
> >> > >
> >> > >      We've been using this in production here at LinkedIn and would
> >> > > love to contribute it into lucene.
> >> > >
> >> > >      We currently open sourced it at:
> >> > > http://code.google.com/p/lucene-ext/wiki/Kamikaze
> >> > >
> >> > >      Please let us know if it is thing you guys want to proceed, if
> >> > > so, what are the steps we should take.
> >> > >
> >> > > Thanks
> >> > >
> >> > > -John
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >
> >
> 
> 
> 
> -- 
> Courage doesn't always roar. Sometimes courage is the quiet voice at
> the end of the day saying, "I will try again tomorrow"
> 
> 
> Anmol Bhasin
> SSE Data Platform
> www.linkedin.com
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: docid set compression and boolean docid set operations

Reply via email to