On Thu, Jun 13, 2013 at 7:56 PM, Sriram Sankar <san...@gmail.com> wrote:
> Thank you very much. I think I need to play a bit with the code before
> asking more questions. Here is the context for my questions:
>
> I was at Facebook until recently and worked extensively on the Unicorn
> search backend. Unicorn allows documents to be ordered by a static rank
> in the posting lists, and retrieval (based on a query) will therefore
> return documents in decreasing order of importance (by static rank). We
> only retrieved a limited number of these documents - so static rank is
> the first filter. These documents are then scored, which becomes the
> second filter. This two-level filtering ended up working very well.
>
> I am trying to see how we can do the same with Lucene. It looks like we
> can build an index offline (say in Hadoop) and use SortingAtomicReader,
> etc. to get the posting lists in static rank order. Then everything
> works great. However, as we update the index with new data, we end up
> with additional segments.
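[Editor's note: the two-level filtering described above can be sketched outside Lucene as follows. This is an illustrative toy, not Unicorn's actual API; the class and method names (`TwoStageRetrieval`, `firstFilter`, `secondFilter`) are invented for the example.]

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative two-stage retrieval: documents arrive in decreasing
// static-rank order, so the first K candidates are the K best by rank.
// Only those K are then scored against the query (the second filter).
public class TwoStageRetrieval {

    static final class Doc {
        final long id;          // stable external id (like an fbid)
        final double staticRank;
        Doc(long id, double staticRank) { this.id = id; this.staticRank = staticRank; }
    }

    // Stage 1: truncate the rank-ordered candidate stream after k docs.
    static List<Doc> firstFilter(List<Doc> rankOrdered, int k) {
        return new ArrayList<>(rankOrdered.subList(0, Math.min(k, rankOrdered.size())));
    }

    // Stage 2: score only the retained candidates and sort by query score.
    static List<Doc> secondFilter(List<Doc> candidates, ToDoubleFunction<Doc> scorer) {
        candidates.sort(Comparator.comparingDouble(scorer).reversed());
        return candidates;
    }

    public static void main(String[] args) {
        // Postings already sorted by decreasing static rank.
        List<Doc> postings = List.of(
            new Doc(104032849634513L, 0.9),
            new Doc(42L, 0.7),
            new Doc(7L, 0.5),
            new Doc(99L, 0.1));
        // Keep the 3 best by rank, then re-rank them with a toy scorer.
        List<Doc> top = secondFilter(firstFilter(postings, 3), d -> d.id % 10);
        for (Doc d : top) System.out.println(d.id);
    }
}
```

The point of the truncation in stage 1 is that the expensive scorer never sees documents whose static rank already disqualifies them.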
The use-case you are describing is why we built SortingMergePolicy: this
merge policy sorts segments at merging time. The consequence is that
segments resulting from a flush are unsorted, while segments resulting
from a merge are sorted. This is an interesting property because one can
then early-terminate collection on the large segments (which necessarily
result from a merge), which are the most costly to collect.

> To get the same functionality as Unicorn, it looks like we will have
> to sort these additional segments by the same static rank, and then
> traverse the posting lists in these multiple segments simultaneously
> (rather than segment by segment). I am trying to see how easy this
> will be to achieve.

Traversing the postings lists simultaneously to collect documents in
strict global rank order is an option, but segment-by-segment collection
might be faster (depending on the number of segments to collect, the
likelihood that new documents have better ranks, etc.): although you
will over-collect some segments, it is no longer necessary to make a
decision after each collected document about which segment to collect
from next. I think both options are worth trying.

> To get more specific, all documents in Unicorn have two attributes: an
> id (called fbid) which is unique and never changes - so for example
> the fbid for the Lucene page is 104032849634513 (so you can always get
> it as www.facebook.com/104032849634513) - and a static rank. The
> static rank may not be unique (multiple docs may share the same static
> rank, in which case we could order them arbitrarily so long as they
> are ordered the same way in all posting lists).

SortingAtomicReader and SortingMergePolicy ensure this as well by using
a stable sorting algorithm to sort documents: documents which have the
same rank will remain in index order through the sorting process.
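[Editor's note: the "traverse all segments simultaneously" option above amounts to a k-way merge of per-segment, rank-sorted posting lists. A minimal sketch, with plain arrays standing in for per-segment postings enumerations and higher rank meaning better; all names here are invented for illustration, not Lucene API.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative k-way merge: each segment's postings are already sorted
// by decreasing static rank, so a heap over the segment heads yields the
// globally best next document. Collection can stop after `limit` docs
// without exhausting any segment.
public class SimultaneousTraversal {

    static final class Cursor {
        final int segment;
        final double[] ranks; // one segment's postings, decreasing rank
        int pos = 0;
        Cursor(int segment, double[] ranks) { this.segment = segment; this.ranks = ranks; }
        boolean exhausted() { return pos >= ranks.length; }
        double current() { return ranks[pos]; }
    }

    // Returns "segment:rank" strings in globally decreasing rank order.
    static List<String> collect(double[][] segments, int limit) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Double.compare(b.current(), a.current()));
        for (int i = 0; i < segments.length; i++)
            if (segments[i].length > 0) heap.add(new Cursor(i, segments[i]));
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty() && out.size() < limit) {
            Cursor c = heap.poll();            // segment holding the best next doc
            out.add(c.segment + ":" + c.current());
            c.pos++;                           // advance within that segment
            if (!c.exhausted()) heap.add(c);   // re-insert with its new head
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] segments = { {9, 5, 1}, {8, 7, 2}, {6, 3} };
        System.out.println(collect(segments, 4)); // four best docs overall
    }
}
```

The per-document heap operation here is exactly the per-document decision cost that the segment-by-segment alternative avoids, at the price of over-collecting within each segment.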
> It looks like we will need an fbid to docid mapping in Lucene for each
> segment so that the simultaneous calls to advance() in each segment
> can be in sync.

Although I understand why you would need a docid -> fbid mapping (likely
through a stored field or a doc values field), I'm not sure I understand
why you need an fbid -> docid mapping.

> This is how far I've thought things out - let me play with the code a
> bit and I may have more questions in a couple of days.

You are welcome. Online sorting is rather new in Lucene, so there are
likely many things that could be improved!

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org