On 6/27/06, Trent Steele <[EMAIL PROTECTED]> wrote:
> David Balmain wrote:
> > Hi Trent,
> >
> > The way to do this is to search for more than you need and then
> > actually go through each search result and count the types in a hash,
> > only adding a doc if its type count is under the threshold. If you
> > failed to retrieve enough results, then search again and repeat until
> > you get the required number of results. For those of you who know the
> > Lucene API, this is where a Hits class comes in handy. It'll be coming
> > in a future version. For now I'll show you the easiest way, by doing a
> > search and setting :num_docs to max_doc, thereby getting all search
> > results in one go:
> >
> >   def get_results(search_str, max_type = 5, num_required = 10)
> >     type_counter = Hash.new(0)
> >     results = []
> >     index.search_each(search_str, :num_docs => index.size) do |doc_id, score|
> >       doc = index[doc_id]
> >       if type_counter[doc[:type]] < max_type
> >         results << doc
> >         type_counter[doc[:type]] += 1
> >       end
> >       break if results.size >= num_required
> >     end
> >     return results
> >   end
> >
> > Hope that helps,
> > Dave
>
> Hi,
>
> I suspected I'd have to do something like this. Thanks for putting me on
> the right path. Are there any concerns about scalability/speed when the
> index grows larger regarding searching the whole index like this?
As long as you're using the C-backed version of Ferret, the index would have to grow very large before speed becomes a concern in this case. Note that Ferret actually has to go through every single search result anyway to check its score, no matter what you have num_docs set to. The only thing that you are using more of with a high value of num_docs is memory (approximately 12 bytes per hit).

Cheers,
Dave

_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
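For readers who want to try the per-type capping idea from Dave's `get_results` without a Ferret index on hand, the core loop can be exercised on plain Ruby objects. The sketch below is a standalone adaptation: the `docs` array of hashes is a mock stand-in for score-ordered search hits, and `cap_by_type` is a hypothetical helper name, not part of the Ferret API.

```ruby
# Mock search hits, ordered best-score-first; each has a :type field,
# standing in for the docs an index.search_each block would yield.
docs = [
  { id: 0, type: "news" }, { id: 1, type: "news" },
  { id: 2, type: "blog" }, { id: 3, type: "news" },
  { id: 4, type: "news" }, { id: 5, type: "blog" },
  { id: 6, type: "wiki" }, { id: 7, type: "news" },
]

# Keep at most max_type docs per :type, stopping once num_required
# results have been collected -- the same logic as Dave's loop.
def cap_by_type(docs, max_type: 2, num_required: 5)
  type_counter = Hash.new(0)  # per-type count, defaulting to 0
  results = []
  docs.each do |doc|
    if type_counter[doc[:type]] < max_type
      results << doc
      type_counter[doc[:type]] += 1
    end
    break if results.size >= num_required
  end
  results
end

p cap_by_type(docs).map { |d| d[:id] }  # => [0, 1, 2, 5, 6]
```

Walking through it: with `max_type: 2`, the third and fourth "news" hits (ids 3 and 4) are skipped, and the loop stops as soon as five results are collected, exactly as in the capped search.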

