[ https://issues.apache.org/jira/browse/LUCENE-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484234 ]
Karl Wettin commented on LUCENE-851:
------------------------------------

If this is what I think it is, then it is really cool, and can be very interesting for indices with a single static natural order, for instance the publishing date of news articles. I know people have implemented this by reverting norms to a true float (32 bits rather than 8 bits), incrementing the value, and using that as the natural order -- a somewhat crazy solution if you ask me.

> Pruning
> -------
>
>                 Key: LUCENE-851
>                 URL: https://issues.apache.org/jira/browse/LUCENE-851
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index, Search
>            Reporter: Marvin Humphrey
>            Priority: Minor
>
> Greets,
>
> A thread on java-dev a couple of months ago drew my attention to a technique
> used by Nutch for cutting down the number of hits that have to be processed:
> if you have an algorithm for ordering documents by importance, and you sort
> them so that the lowest document numbers have the highest rank, then most of
> your high-scoring hits are going to occur early on in the hit-collection
> process. Say you're looking for the top 100 matches -- the odds are pretty
> good that after you've found 1000 hits, you've gotten most of the good stuff.
> It may not be necessary to score the other e.g. 5,000,000 hits.
>
> To pull this off in Nutch, they run the index through a post-process whereby
> documents are re-ordered by page score using the IndexSorter class.
> Unfortunately, post-processing does not live happily with incremental
> indexing.
>
> However, if we ensure that document numbers are ordered according to our
> criteria within each segment, that's almost as good.
>
> Say we're looking for 100 hits, as before; what we do is collect a maximum
> of 1000 hits per segment. If we are dealing with an index made up of 25
> segments, that's 25,000 hits max we'll have to process fully -- the rest we
> can skip over. That's not as quick as only processing 1000 hits then
> stopping in a fully optimized index, but it's a lot better than churning
> through all 5,000,000 hits.
>
> A lot of those hits from the smallest segments will be garbage; we'll get
> most of our good hits from a few large segments most of the time. But
> that's fine -- the cost to process any one segment is small.
>
> Writing a low-level scoring loop which implements pruning per segment is
> straightforward. KinoSearch's version (in C) is below.
>
> To control the amount of pruning, we need a high-level
> Searcher.setPruneFactor API, which sets a multiplier; the number of
> hits-per-segment which must be processed is determined by multiplying the
> number of hits you need by pruneFactor. Here's code from KS for deriving
> hits-per-seg (a Java sketch of the same derivation appears at the end of
> this message):
>
>     # process prune_factor if supplied
>     my $seg_starts;
>     my $hits_per_seg = 2**31 - 1;
>     if ( defined $self->{prune_factor} and defined $args{num_wanted} ) {
>         my $prune_count = $self->{prune_factor} * $args{num_wanted};
>         if ( $prune_count < $hits_per_seg ) {    # don't exceed I32_MAX
>             $hits_per_seg = $prune_count;
>             $seg_starts   = $reader->get_seg_starts;
>         }
>     }
>
> What I have not yet written is the index-time mechanism for sorting
> documents.
>
> In Nutch, they use the norms from a known indexed, non-tokenized field
> ("site"). However, in Lucene and KS, we can't count on any existing
> fields. Document boost isn't stored directly, either. The obvious answer
> is to start storing it, which would suffice for Nutch-like uses.
> However, it may make sense to avoid coupling document ordering to boost,
> so that pruning can be influenced without affecting scores.
>
> The sort ordering information needs a permanent home in the index, since
> it will be needed whenever segment merging occurs. The fixed-width
> per-document storage in Lucene's .fdx file seems like a good place. If we
> use one float per document, we can simply put it before or after the
> 64-bit file pointer and seek into the file after multiplying the doc num
> by 12 rather than 8 (see the seek-arithmetic sketch at the end of this
> message).
>
> During indexing, we'd keep the ordering info in an array; after all
> documents for a segment have been added, we create an array of sorted
> document numbers. When flushing the postings, their document numbers get
> remapped using the sorted array (a remapping sketch also follows). Then
> we rewrite the .fdx file (and also the .tvx file), moving the file
> pointers (and ordering info) to remapped locations. The fact that the
> .fdt file is now "out of order" isn't a problem -- optimizing sequential
> access to that file isn't important.
>
> This issue is closely tied to LUCENE-843, "Improve how IndexWriter uses
> RAM to buffer added documents", and LUCENE-847, "Factor merge policy out
> of IndexWriter". Michael McCandless, Steven Parks, Ning Li, anybody
> else... comments? Suggestions?
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
> --------------------------------------------------------------------
>
> void
> Scorer_collect(Scorer *self, HitCollector *hc, u32_t start, u32_t end,
>                u32_t hits_per_seg, VArray *seg_starts)
> {
>     u32_t seg_num        = 0;
>     u32_t doc_num_thresh = 0;
>     u32_t hits_this_seg  = 0;
>     u32_t hits_thresh    = hits_per_seg;
>
>     /* get to first doc */
>     if ( !Scorer_Skip_To(self, start) )
>         return;
>
>     /* execute scoring loop */
>     do {
>         u32_t doc_num = Scorer_Doc(self);
>         Tally *tally;
>
>         if (hits_this_seg >= hits_thresh || doc_num >= doc_num_thresh) {
>             if (doc_num >= end) {
>                 /* bail out of loop if we've hit the user-spec'd end */
>                 return;
>             }
>             else if (seg_starts == NULL || seg_starts->size == 0) {
>                 /* must supply seg_starts to enable pruning */
>                 hits_thresh    = U32_MAX;
>                 doc_num_thresh = end;
>             }
>             else if (seg_num == seg_starts->size) {
>                 /* we've finished the last segment */
>                 return;
>             }
>             else {
>                 /* get start of upcoming segment */
>                 Int *this_start = (Int*)VA_Fetch(seg_starts, seg_num);
>                 Int *next_start = (Int*)VA_Fetch(seg_starts, seg_num + 1);
>                 u32_t this_seg_start = this_start->value;
>                 seg_num++;
>
>                 /* skip docs as appropriate if we're pruning */
>                 if (doc_num < this_seg_start) {
>                     if ( Scorer_Skip_To(self, this_seg_start) )
>                         doc_num = Scorer_Doc(self);
>                     else
>                         return;
>                 }
>
>                 /* set the last doc_num we'll score this upcoming segment */
>                 doc_num_thresh = next_start == NULL
>                     ? end /* last segment */
>                     : next_start->value;
>             }
>
>             /* start over counting hits for the new segment */
>             hits_this_seg = 0;
>         }
>
>         /* this doc is in range, so collect it */
>         tally = Scorer_Tally(self);
>         hc->collect(hc, doc_num, tally->score);
>         hits_this_seg++;
>     } while (Scorer_Next(self));
> }
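The description above proposes a Searcher.setPruneFactor API and shows the
KinoSearch Perl that derives the per-segment hit budget. Here is a minimal
Java sketch of the same derivation; the PruningSearcher class and the
setPruneFactor method are hypothetical, since no such API exists in Lucene
today:

    // Hypothetical sketch: derive the per-segment hit budget from a
    // prune factor, mirroring the KinoSearch Perl snippet quoted above.
    public class PruningSearcher {
        private double pruneFactor = -1;   // negative means "pruning disabled"

        public void setPruneFactor(double pruneFactor) {
            this.pruneFactor = pruneFactor;
        }

        int hitsPerSegment(int numWanted) {
            int hitsPerSeg = Integer.MAX_VALUE;        // default: no pruning
            if (pruneFactor > 0 && numWanted > 0) {
                long pruneCount = (long) (pruneFactor * numWanted);
                if (pruneCount < hitsPerSeg) {         // don't exceed 2^31 - 1
                    hitsPerSeg = (int) pruneCount;
                }
            }
            return hitsPerSeg;
        }
    }

With pruneFactor = 10.0 and numWanted = 100, each segment contributes at most
1000 hits, matching the 25-segment example in the description.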
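The widened .fdx record described above -- one float of ordering info
alongside the existing 64-bit pointer into the .fdt file -- makes the seek
arithmetic a one-line change. A sketch under two stated assumptions: the
float precedes the pointer (the description leaves "before or after" open),
and plain RandomAccessFile stands in for Lucene's own IO abstractions:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Sketch of reading a 12-byte .fdx record: 4 bytes of ordering info
    // followed by the 8-byte .fdt file pointer. The layout is an assumption.
    public class FdxRecord {
        static final int RECORD_SIZE = 4 + 8;   // float + long = 12 bytes

        final float ordering;    // sort-ordering value for this doc
        final long  fdtPointer;  // offset of this doc's stored fields in .fdt

        FdxRecord(float ordering, long fdtPointer) {
            this.ordering   = ordering;
            this.fdtPointer = fdtPointer;
        }

        static FdxRecord read(RandomAccessFile fdx, int docNum)
                throws IOException {
            fdx.seek((long) docNum * RECORD_SIZE);   // docNum * 12, not * 8
            return new FdxRecord(fdx.readFloat(), fdx.readLong());
        }
    }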
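Finally, the index-time remapping sketched in the description -- sort the
per-document ordering values once a segment is complete, then translate
every posting's doc num -- amounts to building a permutation. A rough
sketch, with all names invented for illustration and assuming a higher
ordering value means higher rank:

    import java.util.Arrays;
    import java.util.Comparator;

    // Sketch: given one ordering value per document, build oldToNew[] so
    // that the highest-ranked documents receive the lowest doc numbers.
    public class DocRemapper {
        /** Returns oldToNew where oldToNew[oldDocNum] == newDocNum. */
        static int[] buildRemap(final float[] ordering) {
            int numDocs = ordering.length;

            // Doc nums sorted by descending ordering value.
            Integer[] byRank = new Integer[numDocs];
            for (int i = 0; i < numDocs; i++) {
                byRank[i] = i;
            }
            Arrays.sort(byRank, Comparator.comparingDouble(d -> -ordering[d]));

            // Invert the permutation: position in rank order = new doc num.
            int[] oldToNew = new int[numDocs];
            for (int newDoc = 0; newDoc < numDocs; newDoc++) {
                oldToNew[byRank[newDoc]] = newDoc;
            }
            return oldToNew;
        }
    }

When flushing a segment, each posting's doc num would then be replaced with
oldToNew[docNum], and the .fdx and .tvx entries rewritten at their remapped
slots, as the description outlines.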