[ https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406992#comment-13406992 ]
Robert Muir commented on LUCENE-4100:
-------------------------------------

Hello, thank you for working on this! I have just taken a rough glance at the code, and I think we should look at what API changes would make this sort of thing fit better into Lucene and make it easier to implement. Some random thoughts:

- What you are doing in the PostingsWriter is similar to computing impacts (I don't have a copy of the paper, so admittedly I don't know the exact algorithm you are using). It seems you are putting a maxScore into the term dictionary metadata for each term's postings (as a float). With the tool you provide this works, because you have access to e.g. the segment's length-normalization information (your PostingsWriter takes a reader), but we would have to think about how to give PostingsWriters access to this on flush; it seems possible to me, though. Giving the PostingsWriter full statistics (e.g. docFreq) for the Similarity computation seems difficult: while I think we could accumulate this in FreqProxTermsWriter before we flush to the codec, that wouldn't solve the problem at merge time, so you would have to do a two-pass merge in the codec somehow. The alternative of splitting the "impact" (tf/norm) from the document-independent weight (e.g. IDF) isn't pretty either, because it limits the scoring systems (Similarity implementations) that could use the optimization.

- Since many terms are low-frequency (e.g. docFreq=1), I don't think it's worth encoding maxScore for them: we could save space by omitting maxScore for low-frequency terms and just treating it as infinitely large.

- The opposite problem: is it really optimal to encode maxScore for the entire term, or would it be better for high-frequency terms to encode a maxScore per range of postings (e.g. per block)? That way you could skip over ranges of postings that cannot compete, rather than limiting the optimization to the term as a whole. A codec could put this information into a block header, or at certain intervals into the skip data, etc.

- Do we really need a full 4-byte float? How well would the algorithm work with degraded precision, e.g. something like SmallFloat? (I think SmallFloat currently computes a lower bound; we would have to bump to the next byte to make an upper bound.) See the first sketch below.

- Another idea: it might be nice if this optimization could sit underneath the codec, so that you don't need a special Scorer. One possibility would be for your Collector to set an attribute on the DocsEnum (maxScore): a normal codec would ignore it entirely and proceed as today, but codecs like this one could return NO_MORE_DOCS once the postings for that term can no longer compete (see the second sketch below). I'm just not positive this algorithm can be refactored that way, and it would also require some clean way of getting these attributes from Collector -> Scorer -> DocsEnum; currently Scorer is in the way here :)

Just some random thoughts; I'll try to get a copy of the paper so I have a better idea of what's going on with this particular optimization...
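To make the SmallFloat point concrete, here is a minimal sketch of a one-byte upper-bound encoding. SmallFloat and its floatToByte315/byte315ToFloat pair are real Lucene utilities, but the helper class and its method names are made up for illustration, and it assumes scores are non-negative:

    import org.apache.lucene.util.SmallFloat;

    /** Hypothetical helper: lossy one-byte encoding of a per-term maxScore.
     *  SmallFloat.floatToByte315 truncates, so the decoded value is a lower
     *  bound; for a maxScore we need an upper bound, so bump to the next
     *  representable value whenever truncation lost precision. */
    public final class MaxScoreEncoder {
      private MaxScoreEncoder() {}

      /** Encodes maxScore into one byte such that decode(encode(x)) >= x. */
      public static byte encodeCeil(float maxScore) {
        byte b = SmallFloat.floatToByte315(maxScore);
        if (SmallFloat.byte315ToFloat(b) < maxScore && b != (byte) 0xFF) {
          b++; // the encoding is order-preserving over unsigned bytes,
               // so +1 is the next representable value up
        }
        return b; // 0xFF already saturates at the largest representable value
      }

      public static float decode(byte b) {
        return SmallFloat.byte315ToFloat(b);
      }
    }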
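And a sketch of the attribute idea. Nothing here exists in Lucene today; MaxScoreAttribute, its methods, and the DocsEnum snippet are all hypothetical, only meant to make the Collector -> Scorer -> DocsEnum plumbing concrete:

    import org.apache.lucene.util.Attribute;

    /** Hypothetical attribute (not a real Lucene API). A Collector would set
     *  the score a hit must currently beat to enter its priority queue; a
     *  maxScore-aware codec reads it and gives up on postings that cannot
     *  compete, while a normal codec never looks at it and behaves as today. */
    public interface MaxScoreAttribute extends Attribute {
      /** Score a new hit must exceed to be competitive. */
      void setMinCompetitiveScore(float score);

      float getMinCompetitiveScore();
    }

    // Inside a hypothetical maxScore-aware DocsEnum.nextDoc(), where
    // blockMaxScore was read from a block header or from the skip data:
    //
    //   if (blockMaxScore < maxScoreAtt.getMinCompetitiveScore()) {
    //     return NO_MORE_DOCS; // remaining postings for this term can't compete
    //   }
    //
    // With per-block maxScores, the enum could instead skip ahead to the next
    // block rather than ending the whole term.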
> Maxscore - Efficient Scoring
> ----------------------------
>
>                 Key: LUCENE-4100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4100
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs, core/query/scoring, core/search
>    Affects Versions: 4.0
>            Reporter: Stefan Pohl
>              Labels: api-change, patch, performance
>             Fix For: 4.0
>
>         Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood,
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with
> example queries and lucenebench, Mike McCandless's benchmarking package,
> resulting in very significant speedups.
> This ticket is to start the discussion on including the implementation in
> Lucene's codebase. Because the technique requires awareness of it from the
> Lucene user/developer, it seems best to make it a contrib/module package so
> that its use can be a conscious choice.