[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

Shai Erera (JIRA) Mon, 30 Mar 2009 22:58:20 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693997#action_12693997
 ]


Shai Erera commented on LUCENE-1575:
------------------------------------

bq. But what is the plan now for the FieldComparator API? We no longer pass 
score all around, but expose access to scorer, which only RelevanceComparator 
(in core) will use?

Yes. FieldComparator will have a default empty setScorer() method, which will 
be overridden by RelevanceComparator. In TopFieldCollector (forget the final 
name now) setScorer() method we set the Scorer on all FieldComparators. During 
collect(), only RelevanceComparator, if it exists, will compute the score.

bq. What if we add a new method, and deprecate the old one?

The current methods are:
* Searchable - search(Weight, Filter, int, Sort)
* Searcher - search(Query, Filter, int, Sort)
* IndexSearcher - search(Weight, Filter, int, Sort, boolean /* fillFields */)

Deprecate all three, and add the same but take another boolean as a parameter? 
I have two comments regarding that:
# The current methods need to call the new ones with trackScores = true since 
that's the current behavior.
# When we are left with only the new versions, I'm afraid those methods will 
not look 'simple fast' to a user - I now have to decide whether I want to track 
scores or not, something I haven't given much thought to before. I kind of like 
the current signature, but I understand your concern regarding defaults.

BTW, Searchable is an interface, so we cannot add it there. Searcher is an 
abstract class and we cannot add the method to it with default implementation 
(as I believe the other search methods will call the new one with 
default=true). So it only leaves IndexSearcher as an option. But then what if 
someone uses MultiSearcher? ParallelMultiSearcher? etc.

Is it possible to deprecate a method, documenting that its runtime behavior 
will change in 3.0 and then in 3.0 change to not track scores?

If we're touching TopFieldCollector in such a way, I'd like to propose the 
following refactoring. It stems from the current complex implementation in 
collect() which checks in every collect call if we have just one Comparator or 
Multi, and we're talking about having two versions w/ and w/o score tracking:
* Keep TopFieldCollector as abstract class with a static create() factory 
method, and an abstract updateBottom() method. It will still implement 
topDocs(start, howMany).
* Have the factory method create one of 4 instances, all extend TFC, but are 
private internal classes, which do not concern the user:
*# OneComparatorNonScoringTopFieldCollector - assumes just one FieldComparator 
exists as well as scoring should not be tracked.
*# OneComparatorScoringTopFieldCollector - assumes just one FieldComparator 
exists as well as scoring should be tracked.
*# MultiComparatorNonScoringTopFieldCollector - assumes more than one 
FieldComparator exists, as well as scoring should not be tracked.
*# MultiComparatorScoringTopFieldCollector - assumes more than one 
FieldComparator exists, as well as scoring should be tracked.

The advantages are:
* We simplify the API - the user is only aware of TopFieldCollector, and 
instead of calling a ctor it calls a static create method, which takes all the 
arguments as the ctor takes.
* We are free to create whatever instance is the right and most optimized one 
given the input parameters. The user does not care how the instance is called. 
Hence the long names - they are internal anyway.
* The code is much cleaner and easy to understand. It also does not need to 
check if we have just one comparator or more in every call to collect.
* We don't need to add a protected score to a NonScoring collector (read above 
about code readability) just because a Scoring one extends it and will make use 
of it.

Since TopFieldCollector is new, we have the freedom to do it right w/o 
deprecating anything. I think it's a much cleaner design. It is orthogonal to 
the discussion we're having regarding the search methods and parameters. They 
will use the create() factory method instead of creating a collector, passing 
whatever arguments they have. So let's not confuse the two.

The patch for this issue is ready. As soon as we agree on how to proceed with 
TFC, I'll add the changes and submit the patch.

> Refactoring Lucene collectors (HitCollector and extensions)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Shai Erera
>             Fix For: 2.9
>
>
> This issue is a result of a recent discussion we've had on the mailing list. 
> You can read the thread 
> [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
> We have agreed to do the following refactoring:
> * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
> be the base class for all Collector implementations.
> * Deprecate HitCollector in favor of the new Collector.
> * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
> those that accept HitCollector.
> ** Create a final class HitCollectorWrapper, and use it in the deprecated 
> methods in IndexSearcher, wrapping the given HitCollector.
> ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
> when we remove HitCollector.
> ** It will remove any instanceof checks that currently exist in IndexSearcher 
> code.
> * Create a new (abstract) TopDocsCollector, which will:
> ** Leave collect and setNextReader unimplemented.
> ** Introduce protected members PriorityQueue and totalHits.
> ** Introduce a single protected constructor which accepts a PriorityQueue.
> ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
> These can be used as-are by extending classes, as well as be overridden.
> ** Introduce a new topDocs(start, howMany) method which will be used a 
> convenience method when implementing a search application which allows paging 
> through search results. It will also attempt to improve the memory 
> allocation, by allocating a ScoreDoc[] of the requested size only.
> * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
> and getTotalHits() implementations as they are from TopDocsCollector. The 
> class will also be made final.
> * Change TopFieldCollector to extend TopDocsCollector, and make the class 
> final. Also implement topDocs(start, howMany).
> * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
> instead of TopScoreDocCollector. Implement topDocs(start, howMany)
> * Review other places where HitCollector is used, such as in Scorer, 
> deprecate those places and use Collector instead.
> Additionally, the following proposal was made w.r.t. decoupling score from 
> collect():
> * Change collect to accecpt only a doc Id (unbased).
> * Introduce a setScorer(Scorer) method.
> * If during collect the implementation needs the score, it can call 
> scorer.score().
> If we do this, then we need to review all places in the code where 
> collect(doc, score) is called, and assert whether Scorer can be passed. Also 
> this raises few questions:
> * What if during collect() Scorer is null? (i.e., not set) - is it even 
> possible?
> * I noticed that many (if not all) of the collect() implementations discard 
> the document if its score is not greater than 0. Doesn't it mean that score 
> is needed in collect() always?
> Open issues:
> * The name for Collector
> * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
> that was when we thought to call Colletor ResultsColletor. Since we decided 
> (so far) on Collector, I think TopDocsCollector makes sense, because of its 
> TopDocs output.
> * Decoupling score from collect().
> I will post a patch a bit later, as this is expected to be a very large 
> patch. I will split it into 2: (1) code patch (2) test cases (moving to use 
> Collector instead of HitCollector, as well as testing the new topDocs(start, 
> howMany) method.
> There might be even a 3rd patch which handles the setScorer thing in 
> Collector (maybe even a different issue?)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

Reply via email to