Hi, Doug,

I have been thinking about this as well lately and have some thoughts
similar to Paul's approach.

Lucene keeps norm data for each document field. Conceptually it is a byte
array with one byte per document for each field. At query time, I think the
norm array is loaded into memory the first time it is accessed, allowing
efficient lookup of the norm value for each document.
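
For illustration, something like the following sketch shows how that byte
array is exposed (Lucene 2.x-era API; the index path and field name are
placeholders):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

public class NormPeek {
    public static void main(String[] args) throws Exception {
        // The index path and field name here are made up.
        IndexReader reader = IndexReader.open("/path/to/index");
        byte[] norms = reader.norms("title"); // one byte per document, cached in memory
        int docNo = 0;
        float norm = Similarity.decodeNorm(norms[docNo]); // decode the byte back to a float
        System.out.println("norm of doc " + docNo + " = " + norm);
        reader.close();
    }
}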

Now, if we could use integers to represent the sort field values, which is
typically the case for most applications, maybe we can afford to store the
sort field values on disk and do a disk lookup for each matched document?
Looking up the sort field value would be as simple as reading four bytes at
position docNo * 4 + offset.
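
A minimal sketch of that lookup, assuming the values were written as
big-endian 4-byte ints in docNo order (the file name, class name, and
offset are hypothetical):

import java.io.IOException;
import java.io.RandomAccessFile;

public class SortValueLookup {
    private final RandomAccessFile file;
    private final long offset; // byte position where the int array starts

    public SortValueLookup(String path, long offset) throws IOException {
        this.file = new RandomAccessFile(path, "r");
        this.offset = offset;
    }

    // Read the 4-byte sort value for one document: position = offset + docNo * 4.
    public int sortValue(int docNo) throws IOException {
        file.seek(offset + 4L * docNo);
        return file.readInt(); // big-endian, as RandomAccessFile reads ints
    }
}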

This way, we use the same approach as for constructing the norms (with
proper merging for incremental indexing), but at search time we don't load
the sort field values into memory; instead, they stay on disk.

Will this approach be good enough?

Thanks for your feedback.

Jian


On 4/9/07, Doug Cutting <[EMAIL PROTECTED]> wrote:

Paul Smith wrote:
> Disadvantages to this approach:
> * It's a lot more I/O intensive

I think this would be prohibitive.  Queries matching more than a few
hundred documents will take several seconds to sort, since a random disk
access, on the order of 10ms per seek, is required for each matching
document.  Such an approach is only practical if you can guarantee that
queries match fewer than a hundred documents, which is not generally the
case, especially with large collections.

> I'm working on the basis that it's a LOT harder/more expensive to simply
> allocate more heap size to cover the current sorting infrastructure.
> One hits memory limits faster.  Not everyone can afford 64-bit hardware
> with many GB of RAM to allocate to a heap.  It _is_ cheaper/easier to build
> a disk subsystem to tune this I/O approach, and one can still use any
> RAM as buffer cache for the memory-mapped file anyway.

In my experience, raw search time starts to climb towards one second per
query as collections grow to around 10M documents (in round figures and
with lots of assumptions).  Thus, searching on a single CPU is less
practical as collections grow substantially larger than 10M documents,
and distributed solutions are required.  So it would be convenient if
sorting is also practical for ~10M document collections on standard
hardware.  If 10M strings of 20 characters each are required in memory
for efficient search, that is roughly 400MB (Java strings use two bytes
per character, so 10M * 20 * 2 bytes).  This is a lot, but not an
unusual amount on today's machines.  However, if you have a large number
of fields, then this approach may be problematic and force you to
consider a distributed solution earlier than you might otherwise.
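
To make the arithmetic concrete, a back-of-the-envelope sketch (the sizes
are the assumptions above, not measurements):

public class SortMemoryEstimate {
    public static void main(String[] args) {
        long docs = 10000000L;  // ~10M documents, from the figures above
        long chars = 20;        // characters per sort string (assumption)
        long bytesPerChar = 2;  // Java chars are two bytes each
        long sortFields = 1;    // memory grows linearly with each sorted field
        long bytes = docs * chars * bytesPerChar * sortFields;
        System.out.println(bytes / 1000000 + " MB per sorted field"); // 400 MB
    }
}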

Doug
