Just minor correction - we can't make TopDocCollector final, as it will break back-compat. So we keep it deprecated, keep TopScoreDocCollector and make it extend that TopDocOutputCollector.
On Wed, Mar 25, 2009 at 8:55 PM, Shai Erera <ser...@gmail.com> wrote: > Ok so I feel we're coming to an end. Few comments though: > > The difference is for the new code, it's an upcast, which catches any >> errors at compile time, not run time. The compiler determines that >> the class implements the required interface. >> > > I still don't understand how a compiler can detect at compilation time that > a HitCollector instance that is used in method A, and is casted to a > TopDocsOutput instance by calling method B (from A) is actually ok ... I may > be missing some Java information here, but I simply don't see how that > happens at compilation time instead of at runtime ... > > But that leaves no way forward for current users subclassing >> TopDocCollector (for the freedom of providing your own pqueue). >> > > TopDocCollector is actually marked deprecated. So nobody should continue > subclassing it, as it will be removed in 3.0 no? Perhaps you mean > TopScoreDocCollector? If yes, then I think nobody should subclass it (like > you write at the end of the email). > > Maybe we should in fact simply deprecate HitCollector (in favor of >> MultiReaderHitCollector)? After all, making your own HitCollector is >> an advanced thing; expecting you to properly implement setNextReader >> may be fine. >> > > I personally don't understand why MRHC was invented in the first place. > What's wrong with adding a setNextReader to HitCollector with an empty > implementation which does nothing? That doesn't break back-compat, and > everyone can still override that method and do whatever they need. > > Notice also that Lucene *lies* to its users about the existence of HC and > MRHC. While I'm passing HC to IndexSearcher, I believe I noticed a code > segment like: > > MultiReaderHitCollector mrhc; > if (c instanceof MRHC) { > mrhc = (MRHC) c; > else { > mrhc = new MRHC(c); > } > > So in fact the internal Lucene code expects only MRHC from a certain point, > and so even if I wrote a HC and passed it on Searcher, it's still converted > to MRHC, with an empty setNextReader() method implementation. That's why I > meant that HC is already deprecated, whether it's marked as deprecated or > not. > > What you say about deprecating HC to me is unnecessary. Simply pull > setNextReader up with an empty implementation, get rid of all the > instanceof, casting and wrapping code and you're fine. Nothing is broken. > All works well and better (instanceof, casting and wrapping have their > overheads). > Isn't that right? > > I'm all for your suggestions, I just think we can achieve the same thing w/ > HC. We can also not deprecate TopDocCollector and remove > TopScoreDocCollector (merging implementations so that TopDocCollector is > more efficient), as well as introduce that in between TopDocsOutputCollector > (or a better name) with the topDocs() and getTotalHits() methods ... > > Regarding interfaces .. I don't think they're that bad. Perhaps a different > viewing angle might help. Lucene processes a query and fires events. Today > it fires an event every time a doc's score has been computed, and recently > every time it starts to process a different reader. HitCollector is a > listener implementation on the "doc-score event", while MRHC is a listener > on both. > To me, interfaces play just nicely here. Assume that you have the following > interfaces: > - DocScoreEvent > - ChangeReaderEvent > - EndProcessingEvent (thrown when the query has finished processing - > perhaps to aid collectors to free resources) > - <any other events you foresee?> > The Lucene code receives a HitCollector which listens on all events. In the > future, Lucene might throw other events, but still get a HitCollector. Those > methods will check for instanceof, and you as a user will know that if you > want to catch those events, you pass in a collector implementation which > does. Those events cannot of course be main-stream events (like > DocScoreEvent), but new ones, perhaps even experimental. > Since HitCollector is a concrete class, we can always add new interfaces to > it in the future with empty implementations? > > Anyway, this is just a thought. I wouldn't want to deviate the discussion > we're having. > > So how about what I propose: > > 1. Pull setNextReader up to HitCollector, w/ empty implementation (easy > change). > 2. Get rid of MRHC and change the classes which extend it (easy > change). > 3. Get rid of all the instanceof checks (easy change - the compiler > will warn you right away since MRHC is deleted). > 4. Create a new TopDocOutputCollector which allows you to set PQ and > implements topDocs() and getTotalHits(). > 5. Un-deprecate TopDocCollector, merge its implementation with > TopScoreDocCollector and make it final. Override topDocs() and getTotalHits > (if necessary). > > Did I miss anything? > > Shai > > > On Wed, Mar 25, 2009 at 3:38 PM, Michael McCandless < > luc...@mikemccandless.com> wrote: > >> This stuff is surprisingly hard to think about! >> >> >> Actually, it was Nadav who first proposed the "read interface", to >> >> solve the "there's no common way for reading its output" problem. >> >> With an interface (say TopDocsOutput), then you could have some >> >> method somewhere: >> >> >> >> renderResults(TopDocsOutput results) >> >> >> >> and then any collector, independent of how it *collects* results, >> >> could implement TopDocsOutput if appropriate. >> > >> > You'd still need to cast the collector to TopDocsOutput, won't you? >> > How's that different than the code snippet I showed above? >> >> The difference is for the new code, it's an upcast, which catches any >> errors at compile time, not run time. The compiler determines that >> the class implements the required interface. >> >> > The current situation introduces a bug, that's true. However, unless >> > something better pops up, shouldn't we just make it final? >> >> But that leaves no way forward for current users subclassing >> TopDocCollector (for the freedom of providing your own pqueue). >> >> > May I suggest something else? What if MRHC was actually an >> > interface? >> >> I think interface is too dangerous in this case (the future back >> compatibility problem). EG here we are wanting to explore a way to >> not pre-compute the score; had we released MRHC as an interface we'd >> be in trouble. (We may still be in trouble, anyway!). >> >> >> Would TopDocsCollector subclass HitCollector or >> >> MultiReaderHitCollector? >> > >> > Well ... we've been there as well already :). I don't think there's >> > an easy answer here. I guess if MRHC is the better approach, and we >> > think all Top***DocCollector would want to have the MRHC >> > functionality, then I'd say let's extend MRHC. Otherwise, I don't >> > have a good answer. When I started this thread, I only knew of >> > HitCollector, so things were simpler at the time. >> >> We have challenging goals here: >> >> * The "collect top N by score" collector should be final, use >> ScorerDocQueue, specialized to sorting by score/docID: performance >> is important. >> >> * Likewise for the "collect top N by sorted field" collector, though >> it does provide extensibility by letting you make a custom >> comparator (FieldComparatorSource). Ideally this'd allow with and >> without computing score (it does not today). >> >> * A "top N by my own pqueue" collector (this is what >> TopDocCollector/TopScoreDocsCollector allow today, but it has the >> bug). >> >> * Allow fully custom collection, with and without score. >> >> Maybe we should in fact simply deprecate HitCollector (in favor of >> MultiReaderHitCollector)? After all, making your own HitCollector is >> an advanced thing; expecting you to properly implement setNextReader >> may be fine. >> >> And then we can subclass MultiReaderHitCollector to TopDocsCollector >> (which adds the totalHits/topDocs "results delivery" API). >> >> And then the "collect top docs by score", and "collect top docs by >> fields" collectors subclass TopDocsCollector? >> >> Finally, we add a "collect top docs according to my own pqueue" >> class. >> >> Then we wouldn't need an interface; this works because all core >> collectors deliver top N results in the end. >> >> All that's missing is a way to NOT compute score if it's not needed. >> >> Mike >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-dev-h...@lucene.apache.org >> >> >