Re: Is TopDocCollector's collect() implementation correct?

Shai Erera Wed, 25 Mar 2009 12:21:53 -0700

Just minor correction - we can't make TopDocCollector final, as it will
break back-compat. So we keep it deprecated, keep TopScoreDocCollector and
make it extend that TopDocOutputCollector.


On Wed, Mar 25, 2009 at 8:55 PM, Shai Erera <ser...@gmail.com> wrote:

> Ok so I feel we're coming to an end. Few comments though:
>
> The difference is for the new code, it's an upcast, which catches any
>> errors at compile time, not run time.  The compiler determines that
>> the class implements the required interface.
>>
>
> I still don't understand how a compiler can detect at compilation time that
> a HitCollector instance that is used in method A, and is casted to a
> TopDocsOutput instance by calling method B (from A) is actually ok ... I may
> be missing some Java information here, but I simply don't see how that
> happens at compilation time instead of at runtime ...
>
> But that leaves no way forward for current users subclassing
>> TopDocCollector (for the freedom of providing your own pqueue).
>>
>
> TopDocCollector is actually marked deprecated. So nobody should continue
> subclassing it, as it will be removed in 3.0 no? Perhaps you mean
> TopScoreDocCollector? If yes, then I think nobody should subclass it (like
> you write at the end of the email).
>
> Maybe we should in fact simply deprecate HitCollector (in favor of
>> MultiReaderHitCollector)?  After all, making your own HitCollector is
>> an advanced thing; expecting you to properly implement setNextReader
>> may be fine.
>>
>
> I personally don't understand why MRHC was invented in the first place.
> What's wrong with adding a setNextReader to HitCollector with an empty
> implementation which does nothing? That doesn't break back-compat, and
> everyone can still override that method and do whatever they need.
>
> Notice also that Lucene *lies* to its users about the existence of HC and
> MRHC. While I'm passing HC to IndexSearcher, I believe I noticed a code
> segment like:
>
> MultiReaderHitCollector mrhc;
> if (c instanceof MRHC) {
>    mrhc = (MRHC) c;
> else {
>    mrhc = new MRHC(c);
> }
>
> So in fact the internal Lucene code expects only MRHC from a certain point,
> and so even if I wrote a HC and passed it on Searcher, it's still converted
> to MRHC, with an empty setNextReader() method implementation. That's why I
> meant that HC is already deprecated, whether it's marked as deprecated or
> not.
>
> What you say about deprecating HC to me is unnecessary. Simply pull
> setNextReader up with an empty implementation, get rid of all the
> instanceof, casting and wrapping code and you're fine. Nothing is broken.
> All works well and better (instanceof, casting and wrapping have their
> overheads).
> Isn't that right?
>
> I'm all for your suggestions, I just think we can achieve the same thing w/
> HC. We can also not deprecate TopDocCollector and remove
> TopScoreDocCollector (merging implementations so that TopDocCollector is
> more efficient), as well as introduce that in between TopDocsOutputCollector
> (or a better name) with the topDocs() and getTotalHits() methods ...
>
> Regarding interfaces .. I don't think they're that bad. Perhaps a different
> viewing angle might help. Lucene processes a query and fires events. Today
> it fires an event every time a doc's score has been computed, and recently
> every time it starts to process a different reader. HitCollector is a
> listener implementation on the "doc-score event", while MRHC is a listener
> on both.
> To me, interfaces play just nicely here. Assume that you have the following
> interfaces:
> - DocScoreEvent
> - ChangeReaderEvent
> - EndProcessingEvent (thrown when the query has finished processing -
> perhaps to aid collectors to free resources)
> - <any other events you foresee?>
> The Lucene code receives a HitCollector which listens on all events. In the
> future, Lucene might throw other events, but still get a HitCollector. Those
> methods will check for instanceof, and you as a user will know that if you
> want to catch those events, you pass in a collector implementation which
> does. Those events cannot of course be main-stream events (like
> DocScoreEvent), but new ones, perhaps even experimental.
> Since HitCollector is a concrete class, we can always add new interfaces to
> it in the future with empty implementations?
>
> Anyway, this is just a thought. I wouldn't want to deviate the discussion
> we're having.
>
> So how about what I propose:
>
>    1. Pull setNextReader up to HitCollector, w/ empty implementation (easy
>    change).
>    2. Get rid of MRHC and change the classes which extend it (easy
>    change).
>    3. Get rid of all the instanceof checks (easy change - the compiler
>    will warn you right away since MRHC is deleted).
>    4. Create a new TopDocOutputCollector which allows you to set PQ and
>    implements topDocs() and getTotalHits().
>    5. Un-deprecate TopDocCollector, merge its implementation with
>    TopScoreDocCollector and make it final. Override topDocs() and getTotalHits
>    (if necessary).
>
> Did I miss anything?
>
> Shai
>
>
> On Wed, Mar 25, 2009 at 3:38 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> This stuff is surprisingly hard to think about!
>>
>> >> Actually, it was Nadav who first proposed the "read interface", to
>> >> solve the "there's no common way for reading its output" problem.
>> >> With an interface (say TopDocsOutput), then you could have some
>> >> method somewhere:
>> >>
>> >>  renderResults(TopDocsOutput results)
>> >>
>> >> and then any collector, independent of how it *collects* results,
>> >> could implement TopDocsOutput if appropriate.
>> >
>> > You'd still need to cast the collector to TopDocsOutput, won't you?
>> > How's that different than the code snippet I showed above?
>>
>> The difference is for the new code, it's an upcast, which catches any
>> errors at compile time, not run time.  The compiler determines that
>> the class implements the required interface.
>>
>> > The current situation introduces a bug, that's true. However, unless
>> > something better pops up, shouldn't we just make it final?
>>
>> But that leaves no way forward for current users subclassing
>> TopDocCollector (for the freedom of providing your own pqueue).
>>
>> > May I suggest something else? What if MRHC was actually an
>> > interface?
>>
>> I think interface is too dangerous in this case (the future back
>> compatibility problem).  EG here we are wanting to explore a way to
>> not pre-compute the score; had we released MRHC as an interface we'd
>> be in trouble.  (We may still be in trouble, anyway!).
>>
>> >> Would TopDocsCollector subclass HitCollector or
>> >> MultiReaderHitCollector?
>> >
>> > Well ... we've been there as well already :). I don't think there's
>> > an easy answer here. I guess if MRHC is the better approach, and we
>> > think all Top***DocCollector would want to have the MRHC
>> > functionality, then I'd say let's extend MRHC. Otherwise, I don't
>> > have a good answer. When I started this thread, I only knew of
>> > HitCollector, so things were simpler at the time.
>>
>> We have challenging goals here:
>>
>>  * The "collect top N by score" collector should be final, use
>>    ScorerDocQueue, specialized to sorting by score/docID: performance
>>    is important.
>>
>>  * Likewise for the "collect top N by sorted field" collector, though
>>    it does provide extensibility by letting you make a custom
>>    comparator (FieldComparatorSource).  Ideally this'd allow with and
>>    without computing score (it does not today).
>>
>>  * A "top N by my own pqueue" collector (this is what
>>    TopDocCollector/TopScoreDocsCollector allow today, but it has the
>>    bug).
>>
>>  * Allow fully custom collection, with and without score.
>>
>> Maybe we should in fact simply deprecate HitCollector (in favor of
>> MultiReaderHitCollector)?  After all, making your own HitCollector is
>> an advanced thing; expecting you to properly implement setNextReader
>> may be fine.
>>
>> And then we can subclass MultiReaderHitCollector to TopDocsCollector
>> (which adds the totalHits/topDocs "results delivery" API).
>>
>> And then the "collect top docs by score", and "collect top docs by
>> fields" collectors subclass TopDocsCollector?
>>
>> Finally, we add a "collect top docs according to my own pqueue"
>> class.
>>
>> Then we wouldn't need an interface; this works because all core
>> collectors deliver top N results in the end.
>>
>> All that's missing is a way to NOT compute score if it's not needed.
>>
>> Mike
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>

Re: Is TopDocCollector's collect() implementation correct?

Reply via email to