On Thu, Jun 13, 2013 at 7:56 PM, Sriram Sankar <san...@gmail.com> wrote:
> Thank you very much. I think I need to play a bit with the code before
> asking more questions. Here is the context for my questions:
>
> I was at Facebook until recently and worked extensively on the Unicorn
> search backend. Unicorn allows documents to be ordered by a static rank
> in the posting lists, and retrieval (based on a query) will therefore
> return documents in decreasing order of importance (by static rank). We
> only retrieved a limited number of these documents - so static rank is
> the first filter. These documents are then scored, which becomes the
> second filter. This two-level filtering ended up working very well.
>
> I am trying to see how we can do the same with Lucene. It looks like we
> can build an index offline (say in Hadoop) and use SortingAtomicReader,
> etc. to get the posting lists in static rank order. Then everything
> works great. However, as we update the index with new data, we end up
> with additional segments.
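[Editor's note: the two-level filtering described above can be sketched outside Lucene as follows. This is an illustrative toy, not Unicorn's actual API; the class and method names (`TwoStageRetrieval`, `firstFilter`, `secondFilter`) are invented for the example.]

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative two-stage retrieval: documents arrive in decreasing
// static-rank order, so the first K candidates are the K best by rank.
// Only those K are then scored against the query (the second filter).
public class TwoStageRetrieval {

    static final class Doc {
        final long id;          // stable external id (like an fbid)
        final double staticRank;
        Doc(long id, double staticRank) { this.id = id; this.staticRank = staticRank; }
    }

    // Stage 1: truncate the rank-ordered candidate stream after k docs.
    static List<Doc> firstFilter(List<Doc> rankOrdered, int k) {
        return new ArrayList<>(rankOrdered.subList(0, Math.min(k, rankOrdered.size())));
    }

    // Stage 2: score only the retained candidates and sort by query score.
    static List<Doc> secondFilter(List<Doc> candidates, ToDoubleFunction<Doc> scorer) {
        candidates.sort(Comparator.comparingDouble(scorer).reversed());
        return candidates;
    }

    public static void main(String[] args) {
        // Postings already sorted by decreasing static rank.
        List<Doc> postings = List.of(
            new Doc(104032849634513L, 0.9),
            new Doc(42L, 0.7),
            new Doc(7L, 0.5),
            new Doc(99L, 0.1));
        // Keep the 3 best by rank, then re-rank them with a toy scorer.
        List<Doc> top = secondFilter(firstFilter(postings, 3), d -> d.id % 10);
        for (Doc d : top) System.out.println(d.id);
    }
}
```

The point of the truncation in stage 1 is that the expensive scorer never sees documents whose static rank already disqualifies them.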
The use-case you are describing is why we built SortingMergePolicy: this
merge policy sorts segments at merging time. The consequence is that
segments resulting from a flush are unsorted, while segments resulting
from a merge are sorted. This is an interesting property because one can
then early-terminate collection on the large segments (which necessarily
result from a merge), which are the most costly to collect.

> To get the same functionality as Unicorn, it looks like we will have
> to sort these additional segments by the same static rank, and then
> traverse the posting lists in these multiple segments simultaneously
> (rather than segment by segment). I am trying to see how easy this
> will be to achieve.

Traversing the postings lists simultaneously to collect documents in
strict global rank order is an option, but segment-by-segment collection
might be faster (depending on the number of segments to collect, the
likelihood that new documents have better ranks, etc.): although you
will over-collect some segments, it is no longer necessary to make a
decision after each collected document about which segment to collect
from next. I think both options are worth trying.

> To get more specific, all documents in Unicorn have two attributes: an
> id (called fbid) which is unique and never changes - so for example
> the fbid for the Lucene page is 104032849634513 (so you can always get
> it as www.facebook.com/104032849634513) - and a static rank. The
> static rank may not be unique (multiple docs may share the same static
> rank, in which case we could order them arbitrarily so long as they
> are ordered the same way in all posting lists).

SortingAtomicReader and SortingMergePolicy ensure this as well by using
a stable sorting algorithm to sort documents: documents which have the
same rank will remain in index order through the sorting process.
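[Editor's note: the "traverse all segments simultaneously" option above amounts to a k-way merge of per-segment, rank-sorted posting lists. A minimal sketch, with plain arrays standing in for per-segment postings enumerations and higher rank meaning better; all names here are invented for illustration, not Lucene API.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative k-way merge: each segment's postings are already sorted
// by decreasing static rank, so a heap over the segment heads yields the
// globally best next document. Collection can stop after `limit` docs
// without exhausting any segment.
public class SimultaneousTraversal {

    static final class Cursor {
        final int segment;
        final double[] ranks; // one segment's postings, decreasing rank
        int pos = 0;
        Cursor(int segment, double[] ranks) { this.segment = segment; this.ranks = ranks; }
        boolean exhausted() { return pos >= ranks.length; }
        double current() { return ranks[pos]; }
    }

    // Returns "segment:rank" strings in globally decreasing rank order.
    static List<String> collect(double[][] segments, int limit) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Double.compare(b.current(), a.current()));
        for (int i = 0; i < segments.length; i++)
            if (segments[i].length > 0) heap.add(new Cursor(i, segments[i]));
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty() && out.size() < limit) {
            Cursor c = heap.poll();            // segment holding the best next doc
            out.add(c.segment + ":" + c.current());
            c.pos++;                           // advance within that segment
            if (!c.exhausted()) heap.add(c);   // re-insert with its new head
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] segments = { {9, 5, 1}, {8, 7, 2}, {6, 3} };
        System.out.println(collect(segments, 4)); // four best docs overall
    }
}
```

The per-document heap operation here is exactly the per-document decision cost that the segment-by-segment alternative avoids, at the price of over-collecting within each segment.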
> It looks like we will need an fbid to docid mapping in Lucene for each
> segment so that the simultaneous calls to advance() in each segment
> can be in sync.

Although I understand why you would need a docid -> fbid mapping (likely
through a stored field or a doc values field), I'm not sure I understand
why you need an fbid -> docid mapping.

> This is how far I've thought things out - let me play with the code a
> bit and I may have more questions in a couple of days.

You are welcome. Online sorting is rather new in Lucene, so there are
likely many things that could be improved!

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org