On 6/27/06, Trent Steele <[EMAIL PROTECTED]> wrote:
> David Balmain wrote:
> > Hi Trent,
> >
> > The way to do this is to search for more than you need, then go
> > through each search result, counting the types in a hash and only
> > adding a doc if its type count is under the threshold. If you
> > failed to retrieve enough results, search again and repeat until
> > you get the required number of results. For those of you who know
> > the Lucene API, this is where a Hits class comes in handy; it'll
> > be coming in a future version. For now I'll show you the easiest
> > way, doing a single search with :num_docs set to max_doc so you
> > get all the search results in one go:
> >
> >     def get_results(search_str, max_type = 5, num_required = 10)
> >         type_counter = Hash.new(0)
> >         results = []
> >         index.search_each(search_str, :num_docs => index.size) do
> > |doc_id, score|
> >             doc = index[doc_id]
> >             if type_counter[doc[:type]] < max_type
> >                 results << doc
> >                 type_counter[doc[:type]] += 1
> >             end
> >             break if results.size >= num_required
> >         end
> >         return results
> >     end
> >
> > Hope that helps,
> > Dave
>
> Hi,
>
> I suspected I'd have to do something like this. Thanks for putting me on
> the right path. Are there any concerns about scalability/speed when the
> index grows larger regarding searching the whole index like this?

As long as you're using the C-backed version of Ferret, the index
would have to grow very large before speed becomes a concern in this
case. Note that Ferret has to go through every single search result
anyway to check its score, no matter what you set :num_docs to. The
only thing a high :num_docs value costs you is memory (approximately
12 bytes per hit).
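
If memory ever did become an issue, the repeat-the-search approach from
my first mail can be written as a paging loop that stops as soon as
enough results pass the per-type cap. Here's a minimal sketch of that
loop in plain Ruby; the search itself is stubbed out with an array of
doc hashes so it runs standalone (the data and the fetch step are
purely illustrative — in real code the page fetch would be a Ferret
search scoped to the next slice of hits):

```ruby
# Stubbed "search results": each hit is a doc hash (hypothetical data).
HITS = [
  { :type => "a" }, { :type => "a" }, { :type => "b" },
  { :type => "a" }, { :type => "b" }, { :type => "c" },
  { :type => "a" }, { :type => "c" }, { :type => "b" },
]

def get_results(max_type = 2, num_required = 5, page_size = 3)
  type_counter = Hash.new(0)  # per-type tally, defaults to 0
  results = []
  first = 0
  loop do
    # Fetch the next page of hits; with Ferret this slice would come
    # from a fresh search rather than an in-memory array.
    page = HITS[first, page_size] || []
    page.each do |doc|
      if type_counter[doc[:type]] < max_type
        results << doc
        type_counter[doc[:type]] += 1
      end
      return results if results.size >= num_required
    end
    break if page.size < page_size  # ran out of hits
    first += page_size
  end
  results
end

p get_results  # first hits in order, capped at 2 per type
```

The per-type counting is the same as in the earlier example; the only
difference is that each pass only holds one page of hits in memory.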

Cheers,
Dave
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
