[
https://issues.apache.org/jira/browse/LUCENE-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239328#comment-14239328
]
Mark Harwood commented on LUCENE-6066:
--------------------------------------
Thanks for the review, Mike. I'm working through changes.
bq. Why couldn't you just pass your custom queue instead of null to super() in
DiversifiedTopDocsCollector ctor?
Oops. That was a cut/paste error made while transferring code from
Elasticsearch. That code relied on a forked PriorityQueue which is obviously
incompatible with the Lucene TopDocsCollector base class.
bq. the abstract method returns NumericDocValues, which is confusing: how does
"beatles" become a number? Why not e.g. SortedDVs
I originally had a getKey(docId) method that returned an object - anything
which implements hashCode and equals. When I talked it through with Adrien he
suggested the use of NumericDocValues as a better abstraction which could be
backed by any system based on ordinals. We need to decide on what this
abstraction should be. One of the things I've been grappling with is whether
the collector should implement support for multi-keyed docs e.g. a field
containing hashes for near-duplicate detection to avoid too-similar texts. This
would require extra code in the collector to determine if any one key had
exceeded limits (and ideally some memory-safeguard for docs with too many keys).
bq. I saw a test about paging; how does/should paging work with such a
collector?
In regular collections, TopScoreDocCollector provides all of the smarts for
in-order/out-of-order and starting from the ScoreDoc at the bottom of the
previous page. I expect I would have to reimplement all of its logic for a new
DiversifiedTopScoreKeyedDocCollector because it makes some assumptions about
using updateTop() that don't apply when we have a two-tier system for scoring
(globally competitive and within-key competitive).
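As a rough illustration of the two-tier idea (a sketch only, not the code in
the patch): the class, field and method names below are made up, and
java.util.PriorityQueue stands in for Lucene's PriorityQueue because it already
has a linear-scan remove(Object). A hit first competes within its key once that
key is at its limit, and otherwise competes globally.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical two-tier collector: a global top-N plus a per-key cap. */
class DiversifiedTopN {
    static final class Hit {
        final int doc; final String key; final float score;
        Hit(int doc, String key, float score) {
            this.doc = doc; this.key = key; this.score = score;
        }
    }

    // Lowest score first, so peek() is always the weakest entry.
    static final Comparator<Hit> BY_SCORE = Comparator.comparingDouble(h -> h.score);

    private final int maxSize;
    private final int maxPerKey;
    private final PriorityQueue<Hit> global = new PriorityQueue<>(BY_SCORE);
    private final Map<String, PriorityQueue<Hit>> perKey = new HashMap<>();

    DiversifiedTopN(int maxSize, int maxPerKey) {
        this.maxSize = maxSize;
        this.maxPerKey = maxPerKey;
    }

    void collect(Hit hit) {
        PriorityQueue<Hit> keyQueue =
            perKey.computeIfAbsent(hit.key, k -> new PriorityQueue<>(BY_SCORE));
        if (keyQueue.size() >= maxPerKey) {
            // Key is at its limit: the hit only competes with the key's weakest.
            Hit weakest = keyQueue.peek();
            if (hit.score <= weakest.score) return;
            keyQueue.poll();
            global.remove(weakest); // removal from the middle of the global queue
        } else if (global.size() >= maxSize) {
            // Key has room but the global queue is full: compete globally.
            Hit weakest = global.peek();
            if (hit.score <= weakest.score) return;
            global.poll();
            perKey.get(weakest.key).remove(weakest);
        }
        keyQueue.add(hit);
        global.add(hit);
    }

    /** Collected hits, best score first. */
    List<Hit> results() {
        List<Hit> out = new ArrayList<>(global);
        out.sort(BY_SCORE.reversed());
        return out;
    }
}
```

The global.remove(weakest) call is the step that updateTop() alone cannot
express, since the evicted entry is usually not the top of the global queue.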
My vague assumption was that the logic for paging would have to be that any
per-key constraints would apply across multiple pages e.g. having had 5 Beatles
hits on pages 1 and 2 you wouldn't expect to find any more the deeper you go
into the results because it had exhausted the "max 5 per key" limit. This logic
would probably preclude any use of the deep-paging optimisation where you can
pass just the ScoreDoc of the last entry on the previous page to minimise the
size of the PQ created for subsequent pages.
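To make that paging assumption concrete, here is a small sketch (hypothetical
names throughout, and it works on an already-ranked list rather than a real
collector). Per-key counts are carried across pages, so every page is computed
by walking the results from the top again, which is why the ScoreDoc of the
last hit on the previous page is not enough on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: pages over hits while enforcing a per-key cap. */
class DiversifiedPager {
    static final class Hit {
        final int doc; final String key; final float score;
        Hit(int doc, String key, float score) {
            this.doc = doc; this.key = key; this.score = score;
        }
    }

    /**
     * @param rankedHits hits sorted by descending score
     * @param pageNumber 1-based page number
     */
    static List<Hit> page(List<Hit> rankedHits, int maxPerKey,
                          int pageSize, int pageNumber) {
        Map<String, Integer> seenPerKey = new HashMap<>();
        List<Hit> diversified = new ArrayList<>();
        int needed = pageSize * pageNumber;
        for (Hit hit : rankedHits) {
            if (diversified.size() == needed) break;
            // Counts accumulate over all earlier pages, not just this one.
            int count = seenPerKey.merge(hit.key, 1, Integer::sum);
            if (count <= maxPerKey) diversified.add(hit);
        }
        int from = Math.min((pageNumber - 1) * pageSize, diversified.size());
        return new ArrayList<>(diversified.subList(from, diversified.size()));
    }
}
```

With a limit of 2 per key and a page size of 2, four Beatles hits at the top of
the ranking fill page 1 with two of them, and page 2 then starts from the next
key rather than returning the remaining two.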
> New "remove" method in PriorityQueue
> ------------------------------------
>
> Key: LUCENE-6066
> URL: https://issues.apache.org/jira/browse/LUCENE-6066
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/query/scoring
> Reporter: Mark Harwood
> Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-PQRemoveV3.patch
>
>
> It would be useful to be able to remove existing elements from a
> PriorityQueue.
> The proposal is that a linear scan is performed to find the element being
> removed and then the end element in heap[size] is swapped into this position
> to perform the delete. The method downHeap() is then called to shuffle the
> replacement element back down the array but the existing downHeap method must
> be modified to allow picking up an entry from any point in the array rather
> than always assuming the first element (which is its only current mode of
> operation).
> A working javascript model of the proposal with animation is available here:
> http://jsfiddle.net/grcmquf2/22/
> In tests the modified version of "downHeap" produces the same results as the
> existing impl but adds the ability to push down from any point.
> An example use case that requires remove is where a client doesn't want more
> than N matches for any given key (e.g. no more than 5 products from any one
> retailer in a marketplace). In these circumstances a document that was
> previously thought of as competitive has to be removed from the final PQ and
> replaced with another doc (e.g. a retailer who already has 5 matches in the PQ
> receives a 6th match which is better than his previous ones). This particular
> process is managed by a special "DiversifyingPriorityQueue" which wraps the
> main PriorityQueue and could be contributed as part of another issue if there
> is interest in that.
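The remove proposal quoted above can be sketched roughly as follows. This is a
standalone int min-heap with made-up names, not the attached patch or Lucene's
PriorityQueue, and it skips capacity and empty-queue checks. One addition
beyond the description: the sketch tries upHeap before downHeap, because the
element taken from the end of the array can be smaller than its new parent
when it comes from a different subtree.

```java
/** Hypothetical 1-based binary min-heap of ints with linear-scan remove. */
class RemovableMinHeap {
    private final int[] heap; // heap[0] is unused; elements live in heap[1..size]
    private int size = 0;

    RemovableMinHeap(int capacity) {
        heap = new int[capacity + 1];
    }

    int size() { return size; }

    void add(int value) {
        heap[++size] = value;
        upHeap(size);
    }

    /** Removes and returns the smallest element (caller ensures size > 0). */
    int pop() {
        int result = heap[1];
        heap[1] = heap[size--];
        if (size > 0) downHeap(1);
        return result;
    }

    /** Linear scan for value; the end element is moved into the vacated slot. */
    boolean remove(int value) {
        for (int i = 1; i <= size; i++) {
            if (heap[i] == value) {
                heap[i] = heap[size];
                size--;
                // Nothing to repair if the removed slot was the last one.
                if (i <= size && !upHeap(i)) {
                    downHeap(i);
                }
                return true;
            }
        }
        return false;
    }

    /** Sifts the element at origPos up; returns true if it moved. */
    private boolean upHeap(int origPos) {
        int i = origPos;
        int node = heap[i];
        int parent = i >>> 1;
        while (parent > 0 && node < heap[parent]) {
            heap[i] = heap[parent];
            i = parent;
            parent = i >>> 1;
        }
        heap[i] = node;
        return i != origPos;
    }

    /** Sifts the element at position i down, starting from any point. */
    private void downHeap(int i) {
        int node = heap[i];
        int child = i << 1;
        while (child <= size) {
            // Pick the smaller of the two children.
            if (child < size && heap[child + 1] < heap[child]) child++;
            if (heap[child] >= node) break;
            heap[i] = heap[child];
            i = child;
            child = i << 1;
        }
        heap[i] = node;
    }
}
```

Adding 1, 10, 2, 11, 12, 3, 4 and then removing 11 moves 4 under the parent 10,
which is the case where a downHeap-only repair would leave the heap out of
order.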
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)