Re: External sort

Marvin Humphrey Thu, 17 Dec 2009 09:33:38 -0800

On Thu, Dec 17, 2009 at 05:03:11PM +0100, Toke Eskildsen wrote:

> A third alternative would be to count the number of unique datetime-values
> and make a compressed representation, but that would make the creation of
> the order-array more complex.


This is what we're doing in Lucy/KinoSearch, though we prepare the ordinal
arrays at index time and mmap them at search time.

The provisional implementation has a problem, though.  We use a hash set to
perform uniquing, but when there are a lot of unique values, the memory
requirements of the SortWriter go very high.

Assume that we're sorting string data rather than date time values (so the
memory occupied by all those unique values is significant).  What algorithms
would you consider for performing the uniquing?

Marvin Humphrey


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: External sort

Reply via email to