: It seems your interface requires that the SearchFilter know all of the query
: results beforehand. I am not sure this works well with the partial result
: sets that Lucene supports.

No, I'm not suggesting that. I'm saying that the SearchFilter would act as
an iterator over the doc ids that pass the filter -- regardless of what
query might have been done (in my work, i actually use Filters without
Queries quite a bit).

My point was that with the interface you suggested, a searcher would have
to score every doc, and then ask the SearchFilter if it should be included
in the results -- which could result in a lot of docs being scored
unnecessarily. I was suggesting an interface that would allow the searcher
to first ask the filter "what is the lowest doc id that you allow?", score
that doc, then ask "what is the next doc id you allow?" and score that one.

Now that i write it out, i realize i'm basically arguing for a Filter API
that looks a lot like the Scorer API -- just without the score method.
Perhaps that's the best way to go: a generic "DocIterator" interface that
could be implemented by Scorer out of the box....

   public interface DocIterator {
     public int doc();
     public boolean next();
     public boolean skipTo(int target);
   }

:
: -----Original Message-----
: From: [EMAIL PROTECTED]
: [mailto:[EMAIL PROTECTED] Behalf Of Chris Hostetter
: Sent: Thursday, January 26, 2006 1:09 PM
: To: java-dev@lucene.apache.org
: Subject: Re: Filter
:
:
: The subject of revamping the Filter API to support more compact filter
: representations has come up in the past ... At least one patch comes to
: mind that helps with the issue...
:
:    https://issues.apache.org/jira/browse/LUCENE-328
:
: ...i'm not intimately familiar with that code, but if i recall correctly
: from the last time i read it, it doesn't propose any actual API changes
: just some utilities to reduce memory usage.
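
To make the DocIterator idea from the top of this message a bit more
concrete, here is a rough sketch, not a patch. It assumes a filter backed by
a java.util.BitSet; the names BitSetDocIterator and FilteredScoringDemo are
made up for illustration and are not part of Lucene. The loop only touches
docs the filter allows, which is the whole point: nothing gets scored and
then thrown away.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Iterator over allowed doc ids in increasing order, as proposed above. */
interface DocIterator {
    /** Current doc id; only meaningful after next() or skipTo() returned true. */
    int doc();
    /** Advances to the next allowed doc; returns false once exhausted. */
    boolean next();
    /** Advances to the first allowed doc beyond the current one with id >= target. */
    boolean skipTo(int target);
}

/** Hypothetical helper: walks the set bits of a BitSet without copying it. */
final class BitSetDocIterator implements DocIterator {
    private final BitSet bits;
    private int current = -1;
    private boolean exhausted = false;

    BitSetDocIterator(BitSet bits) {
        this.bits = bits;
    }

    public int doc() {
        return current;
    }

    public boolean next() {
        return skipTo(current + 1);
    }

    public boolean skipTo(int target) {
        if (exhausted) {
            return false;
        }
        // Never move backwards, even if the caller passes an old target.
        int from = Math.max(target, current + 1);
        current = bits.nextSetBit(from);
        exhausted = current < 0;
        return !exhausted;
    }
}

final class FilteredScoringDemo {
    /** Returns the doc ids a searcher would score: only those the filter allows. */
    static List<Integer> docsToScore(DocIterator filter) {
        List<Integer> scored = new ArrayList<>();
        while (filter.next()) {
            scored.add(filter.doc()); // a real searcher would score the doc here
        }
        return scored;
    }

    public static void main(String[] args) {
        BitSet allowed = new BitSet();
        allowed.set(3);
        allowed.set(7);
        allowed.set(42);
        System.out.println(docsToScore(new BitSetDocIterator(allowed))); // prints [3, 7, 42]
    }
}
```

A searcher that also has a query scorer could call skipTo() on whichever of
the two iterators is behind, so neither side visits docs the other has
already ruled out.
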
:
: Reading your post has me thinking about this whole issue again,
: particularly the subject of Filters that are straightforward enough they
: could be implemented as simple iterators with very little state and what
: API changes could be made to support the interface you describe and still
: be backwards compatible.
:
: One thing that comes to mind (that i don't remember suggesting before, but
: perhaps someone else has suggested it before) is that since Filter is an
: abstract class which people are currently required to subclass, we could
: follow a migration path something like this...
:
:   1) add a SearchFilter interface like the one you describe to the core
:      code base
:   2) add the following method declaration to the Filter class...
:         public SearchFilter getSearchFilter(IndexReader) throws IOException
:      ...implement this method by calling bits, and returning an instance
:      of a thin inner class that wraps the BitSet
:   3) indicate that Filter.bits() is deprecated.
:   4) change all existing calls to Filter.bits() in the core lucene code
:      base to call Filter.getSearchFilter and do whatever iterating is
:      necessary.
:   5) gradually reimplement all of the concrete instances of Filter in
:      the core lucene code base so they override the getSearchFilter method
:      with something that returns a more "iterator" style SearchFilter,
:      and implement their bits() method to use the SearchFilter to build up
:      the bit set if clients call it directly.
:   6) wait a suitable amount of time.
:   7) remove Filter.bits() and all of the concrete implementations from the
:      lucene core.
:
: ...i think that would be a fairly straightforward and practical way to
: execute such a change. The big question in my mind is what the
: "SearchFilter" interface should look like. what you propose is along the
: usage lines of "iterate over your ScoreDocs, and foreach one test it
: against the filter" ...
: but i'm not convinced that it wouldn't make more
: sense to say "ask the filter what the next viable doc is, now score it",
: ala...
:
:   public interface SearchFilter {
:      /** returns doc ids that pass the filter, in increasing order.
:       * returns 0 once there are no more docs.
:       */
:      int getNextFilteredDoc();
:   }
:
:
: thoughts?
:
:
: : Date: Thu, 26 Jan 2006 14:35:44 +0100
: : From: Morus Walter <[EMAIL PROTECTED]>
: : Reply-To: java-dev@lucene.apache.org
: : To: java-dev@lucene.apache.org
: : Subject: Filter
: :
: : Hi,
: :
: : I would like to suggest a more general filter interface which could be
: : added as an alternative to the current bitset filters.
: : (Replacing the bitset filters would only be possible if api changes were
: : acceptable).
: :
: : While bitset based filters are useful in many use cases the restriction
: : of filters to using bitsets prevents other solutions.
: : Especially since the introduction of field caches for sorting it's easy
: : to implement filters directly based on field values.
: :
: : So I suggest to add a general filter interface that requires a filter
: : just to provide a filter-method that takes a ScoreDoc and returns
: : true if the document passes the filter and false if it is rejected.
: : This would be basically
: : public interface SearchFilter {
: :   boolean filter(ScoreDoc doc);
: : }
: :
: : Thus a filter could be implemented using a bitset or it could get a
: : field cache and check the document's value based on that or in any
: : other way.
: : Providing a ScoreDoc to the filter (instead of the document id alone)
: : allows writing filters that modify the score instead of
: : accepting/rejecting documents.
: :
: : Use cases include
: : - Filtering based on document values
: :   E.g. a date filter.
: :   This can be done by the current bitset based
: :   filters but if the date ranges vary from query to query and the index
: :   change rate is low, using a field cache on the dates seems better than
: :   creating a bitset for each range.
: : - Modifying the score
: :   E.g. a scoring that degrades the score based on a date field to prefer
: :   new documents over old ones. This is not the same as sorting by date
: :   since an old but good hit can still end up with a better score than a
: :   new but low scored hit.
: : - Collecting additional information
: :   Let's say you have a category field in your documents. Using a field
: :   cache you could count the number of hits for each category.
: :
: : Of course this can be done (and I did some of this) by subclassing
: : and extending IndexSearcher, but I think the support for generalised
: : filters should rather be part of the lucene core itself.
: : Adding such an api would mean duplicating all the search methods taking
: : filters to have an additional version taking the generalized filter. Not
: : really nice, but I think it would be worth the effort. And if api
: : changes are accepted (e.g. for 2.0) the bitset filters could be replaced
: : by the generalized filter since a bitset filter could be easily wrapped
: : in a generalized filter (at the cost of an additional method call per
: : lookup).
: :
: : If there is interest in such a change and it would be accepted, I could
: : work out a patch (might take some time though).
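
The three use cases in the quoted message can be sketched with the
predicate-style interface it proposes. This is only an illustration under
assumptions: Hit is a made-up stand-in for Lucene's ScoreDoc so the example
compiles on its own, the plain arrays stand in for per-document values that
would be loaded once from a field cache, dates are assumed to be stored as
ints such as 20060126, and the "halve the score" rule is arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for Lucene's ScoreDoc: a doc id plus a mutable score. */
final class Hit {
    final int doc;
    float score;

    Hit(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

/** Predicate-style filter from the quoted proposal, written against the stand-in type. */
interface SearchFilter {
    boolean filter(Hit hit);
}

/** Use case 1: accept only docs whose cached date lies in [from, to]. */
final class DateRangeFilter implements SearchFilter {
    private final int[] dateByDoc; // one entry per doc id
    private final int from;
    private final int to;

    DateRangeFilter(int[] dateByDoc, int from, int to) {
        this.dateByDoc = dateByDoc;
        this.from = from;
        this.to = to;
    }

    public boolean filter(Hit hit) {
        int date = dateByDoc[hit.doc];
        return date >= from && date <= to;
    }
}

/** Use case 2: never rejects, but halves the score of docs older than a cutoff. */
final class AgePenaltyFilter implements SearchFilter {
    private final int[] dateByDoc;
    private final int cutoff;

    AgePenaltyFilter(int[] dateByDoc, int cutoff) {
        this.dateByDoc = dateByDoc;
        this.cutoff = cutoff;
    }

    public boolean filter(Hit hit) {
        if (dateByDoc[hit.doc] < cutoff) {
            hit.score *= 0.5f; // arbitrary penalty, for illustration only
        }
        return true;
    }
}

/** Use case 3: never rejects, counts hits per category as a side effect. */
final class CategoryCounter implements SearchFilter {
    private final String[] categoryByDoc;
    final Map<String, Integer> counts = new HashMap<>();

    CategoryCounter(String[] categoryByDoc) {
        this.categoryByDoc = categoryByDoc;
    }

    public boolean filter(Hit hit) {
        counts.merge(categoryByDoc[hit.doc], 1, Integer::sum);
        return true;
    }
}
```

Note that all three variants are only consulted after a doc has already been
scored, which is the cost discussed at the top of this message; use cases 2
and 3 need the score or the hit itself, so for them that cost is unavoidable.
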
: :
: : Morus
: :
: : ---------------------------------------------------------------------
: : To unsubscribe, e-mail: [EMAIL PROTECTED]
: : For additional commands, e-mail: [EMAIL PROTECTED]
:
: -Hoss

-Hoss
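
PS: steps 2 and 5 of the migration path quoted above can be sketched roughly
as follows. Everything here is an assumption made for illustration:
MigratingFilter stands in for Lucene's abstract Filter, a plain int maxDoc
replaces the IndexReader argument so the example is self-contained,
NextDocFilter is the iterator-style interface under a different name, and
EveryNthFilter is a toy filter. One deliberate deviation: the sketch returns
-1 rather than 0 when the docs run out, since 0 is itself a valid doc id.

```java
import java.util.BitSet;

/** Iterator-style filter; returns allowed doc ids in increasing order, then -1. */
interface NextDocFilter {
    int getNextFilteredDoc();
}

/** Stand-in for the abstract Filter class during the migration. */
abstract class MigratingFilter {
    /** Old-style API; step 3 would mark this deprecated. */
    abstract BitSet bits(int maxDoc);

    /** Step 2: default implementation is a thin wrapper around bits(). */
    NextDocFilter getSearchFilter(int maxDoc) {
        final BitSet bits = bits(maxDoc);
        return new NextDocFilter() {
            private int next = 0; // -1 once exhausted

            public int getNextFilteredDoc() {
                if (next < 0) {
                    return -1;
                }
                int doc = bits.nextSetBit(next);
                next = (doc < 0) ? -1 : doc + 1;
                return doc;
            }
        };
    }
}

/** Step 5: an iterator-first filter (every n-th doc) whose bits() uses the iterator. */
final class EveryNthFilter extends MigratingFilter {
    private final int n;

    EveryNthFilter(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        this.n = n;
    }

    @Override
    NextDocFilter getSearchFilter(final int maxDoc) {
        return new NextDocFilter() {
            private int next = 0;

            public int getNextFilteredDoc() {
                if (next >= maxDoc) {
                    return -1;
                }
                int doc = next;
                next += n;
                return doc;
            }
        };
    }

    @Override
    BitSet bits(int maxDoc) {
        // Old callers still get a BitSet, built on demand from the iterator.
        BitSet bits = new BitSet(maxDoc);
        NextDocFilter it = getSearchFilter(maxDoc);
        for (int doc = it.getNextFilteredDoc(); doc >= 0; doc = it.getNextFilteredDoc()) {
            bits.set(doc);
        }
        return bits;
    }
}
```

Existing subclasses that only implement bits() keep working through the
default wrapper, while reimplemented filters such as the toy one above never
allocate a BitSet unless an old-style caller asks for it.
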