Re: Large scale sorting

Doug Cutting Mon, 09 Apr 2007 11:18:38 -0700

Paul Smith wrote:

> Disadvantages to this approach:
> * It's a lot more I/O intensive

I think this would be prohibitive. Queries matching more than a few hundred documents will take several seconds to sort, since random disk accesses are required per matching document. Such an approach is only practical if you can guarantee that queries match fewer than a hundred documents, which is not generally the case, especially with large collections.
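A rough back-of-envelope for the cost above, as a minimal sketch. It assumes one random disk access per matching document and about 10 ms per access; the 10 ms figure and the class and method names are illustrative assumptions, not something stated in this thread.

```java
// Hypothetical helper: estimates sort time when every matching document
// costs one random disk access. The per-access latency is an assumption.
final class SeekCostEstimate {

    // Seconds spent on random accesses for a given number of matching documents.
    static double sortSeconds(long matchingDocs, double msPerRandomAccess) {
        return matchingDocs * msPerRandomAccess / 1000.0;
    }

    public static void main(String[] args) {
        double assumedMsPerAccess = 10.0; // assumed typical disk latency
        for (long hits : new long[] {100, 300, 1000}) {
            System.out.printf("%d hits -> ~%.1f s%n",
                    hits, sortSeconds(hits, assumedMsPerAccess));
        }
    }
}
```

Under those assumptions the time grows linearly with the number of hits, which is why only small result sets stay fast. Hits already in the buffer cache would cost far less than the assumed latency.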

> I'm working on the basis that it's a LOT harder/more expensive to simply allocate more heap size to cover the current sorting infrastructure. One hits memory limits faster. Not everyone can afford 64-bit hardware with many GB of RAM to allocate to a heap. It _is_ cheaper/easier to build a disk subsystem to tune this I/O approach, and one can still use any RAM as buffer cache for the memory-mapped file anyway.

In my experience, raw search time starts to climb towards one second per query as collections grow to around 10M documents (in round figures and with lots of assumptions). Thus, searching on a single CPU is less practical as collections grow substantially larger than 10M documents, and distributed solutions are required. So it would be convenient if sorting is also practical for ~10M document collections on standard hardware. If 10M strings with 20 characters are required in memory for efficient search, this requires 400MB. This is a lot, but not an unusual amount on today's machines. However, if you have a large number of fields, then this approach may be problematic and force you to consider a distributed solution earlier than you might otherwise.
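The 400MB figure above can be reproduced as a small sketch. It assumes two bytes per character (the size of a Java char) and ignores per-object overhead; that reading of the number, and the class and method names, are assumptions for illustration.

```java
// Hypothetical helper: estimates heap needed to hold one sort value per
// document in memory. Ignores String object and array overhead.
final class SortCacheEstimate {

    // A Java char is a 16-bit UTF-16 code unit.
    static final int BYTES_PER_JAVA_CHAR = 2;

    // Bytes for one sort field across the whole collection.
    static long bytesForField(long docs, int charsPerValue) {
        return docs * charsPerValue * BYTES_PER_JAVA_CHAR;
    }

    // Bytes when several fields of similar size are all sortable.
    static long bytesForFields(long docs, int charsPerValue, int sortFields) {
        return bytesForField(docs, charsPerValue) * sortFields;
    }

    public static void main(String[] args) {
        long docs = 10_000_000L;
        for (int fields : new int[] {1, 5, 10}) {
            long mb = bytesForFields(docs, 20, fields) / 1_000_000L;
            System.out.println(fields + " field(s): ~" + mb + " MB");
        }
    }
}
```

The estimate scales linearly with the number of sortable fields, which is the point at which a single heap stops being enough.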


Doug
