A discussion on the user list got me thinking about the longer-term scalability of Lucene. Lucene is inherently memory efficient, except for sorting, where the inverted-index structure works against the need to have a value for every document to sort by.

I'm hoping this discussion will stimulate some thought about using Lucene in environments with truly big indexes that need efficient multi-field sorting without requiring a large heap to manage it. I'm not really being altruistic here, because the way we use Lucene would really benefit from improvements in this area.

There have been some patches outlined that use a WeakHashMap implementation, but I'm not sure that's really the best long-term strategy. I've been looking over the FieldCache (and FieldCacheImpl) and FieldSortedHitQueue classes, and I wonder if some use of disk might work for sorting on big indexes.

Conceptually, to sort in Lucene, you must have:

1) For a given doc ID, its 'relationship' to other docs for a given field/locale combination. If each doc ID is ordered based on this, then one can simply compare two docs' positions within that structure. This is how FieldSortedHitQueue does it.

2) The raw value of a document's field, so that the merge sort can complete when merging results from multiple segments.

Point 1 is relatively efficient in terms of memory: it's simply one array of size numDocs, where the value at each index is that doc's numerical position within the sorted list of terms. Point 2 is where the memory can go off the charts, particularly for Strings, and to make sorting efficient over the long haul these terms for each doc are cached in memory. This is why everyone recommends 'warming up' a large index; it's purely loading this stuff into memory!
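To make point 1 concrete, here's a minimal sketch of how such an ordinal array supports comparison. The class and method names are made up for illustration (this is not Lucene's actual API), and it ignores segments and locales:

// Hypothetical sketch: one int per document giving that doc's rank
// within the sorted list of terms for a single field.
public class OrdinalSortSketch {

    // ords[docId] = position of the doc's term in the sorted term list.
    // The array itself costs only 4 bytes per document.
    private final int[] ords;

    public OrdinalSortSketch(int[] ords) {
        this.ords = ords;
    }

    // Comparing two docs is just an int comparison -- no term text needed.
    public int compareByOrdinal(int docA, int docB) {
        return ords[docA] < ords[docB] ? -1 : (ords[docA] == ords[docB] ? 0 : 1);
    }
}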

If you have large indices and require sorting on multiple fields, you can very easily saturate a heap. In our case we get hit by another multiplier: Locale. If you need to sort on those same fields in multiple Locales... (we currently support English, Chinese, French and Japanese sorting, with plans for other languages at some point........)

I'm just going to throw this idea up as a sort of straw-man, hopefully to get some discussion going.

Rather than hold, say, the String[] of terms for a String field in memory, what about writing them to disk in an offset-based memory-mapped file (NIO)? The file structure might look something like:

Header - For each doc #, a byte offset into the data block plus a length, i.e. 2 x 4-byte ints per doc #.
Data - At that offset, each doc's term data is written.
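As a rough illustration of that layout (a sketch only, not a proposed patch; the class and method names are invented, and term bytes are assumed to be UTF-8 encoded so the stored length is in bytes):

import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch of writing the header/data layout described above.
// Header: for each doc, a 4-byte offset into the data block and a 4-byte length.
// Data:   the raw term bytes for each doc, written back to back.
public class SortTermFileWriter {

    public static void write(String path, String[] termPerDoc) throws IOException {
        RandomAccessFile file = new RandomAccessFile(path, "rw");
        try {
            int numDocs = termPerDoc.length;
            long headerSize = 8L * numDocs;     // 2 x 4-byte ints per doc
            long dataPos = headerSize;          // data block starts after the header

            for (int doc = 0; doc < numDocs; doc++) {
                byte[] bytes = termPerDoc[doc].getBytes("UTF-8");

                // Write this doc's (offset, length) entry into the header.
                file.seek(8L * doc);
                file.writeInt((int) (dataPos - headerSize));
                file.writeInt(bytes.length);

                // Append the term bytes to the data block.
                file.seek(dataPos);
                file.write(bytes);
                dataPos += bytes.length;
            }
        } finally {
            file.close();
        }
    }
}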

For a given index of 10 million documents, with an average field length of 43 characters (a sample taken from one of our production databases for mail subject lines), the file size would be roughly:

* Header = 10,000,000 x 8 bytes = ~76 MB
* Data = 10,000,000 x 43 bytes = ~410 MB (assuming single-byte characters)

(This just shows why holding the String[] in memory is not scalable. Yes, one can get reductions from String interning, but one should always consider the worst case, where field values are all unique.)

Performing the lookup of a given Doc's term text to be used for sorting would require:

* Seek to (doc # x (2x4byte)) in the header
* Read offset and length
* Seek to Data location, read length bytes and convert to, say, String
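In code, the lookup might look roughly like this (again only a sketch pairing with the hypothetical writer above; it memory-maps the header via NIO and reads the data block on demand):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch of the per-doc lookup against the header/data file.
public class SortTermFileReader {

    private final RandomAccessFile file;
    private final MappedByteBuffer header; // (offset, length) pairs, 8 bytes per doc
    private final long dataStart;

    public SortTermFileReader(String path, int numDocs) throws IOException {
        this.file = new RandomAccessFile(path, "r");
        this.dataStart = 8L * numDocs;
        // Map only the header; the data block is read on demand.
        this.header = file.getChannel().map(
                FileChannel.MapMode.READ_ONLY, 0, dataStart);
    }

    // Seek to (doc # x 8) in the header, read offset and length,
    // then read the term bytes from the data block.
    public String term(int doc) throws IOException {
        int offset = header.getInt(doc * 8);
        int length = header.getInt(doc * 8 + 4);

        byte[] bytes = new byte[length];
        file.seek(dataStart + offset);
        file.readFully(bytes);
        return new String(bytes, "UTF-8");
    }

    public void close() throws IOException {
        file.close();
    }
}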

Advantages:
* Almost no memory retained for use in sorting, meaning one could hold many more large indexes open.
* When a large index is closed, less GC activity, since less data held for sorting is retained.
* Can now sort on any number of fields, for practically any sized index, in any number of locales.

Disadvantages to this approach:
* It's a lot more I/O intensive
* It would be slower than the current sorting
* It would result in more GC activity

I'm wondering, then, if the sorting infrastructure could be refactored to allow some sort of policy/strategy, where one can choose the point at which one is no longer willing to use memory for sorting, but is willing to pay the disk penalty to get the sort done over a large index without having to allocate a massive gob of heap to do it. This would allow the standard behaviour to continue, but once an index reaches a threshold, sorting memory no longer scales hideously.
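As one way to picture such a policy hook (purely illustrative; these interface and class names are my own invention, not anything in Lucene):

// Hypothetical strategy interface: given an index size, decide whether the
// sort comparator should cache terms on the heap or go via the disk file.
public interface SortCachePolicy {
    boolean useDiskCache(int numDocs);
}

// Example policy: stay in memory below a document-count threshold,
// switch to the memory-mapped file above it.
class ThresholdSortCachePolicy implements SortCachePolicy {
    private final int maxDocsInMemory;

    ThresholdSortCachePolicy(int maxDocsInMemory) {
        this.maxDocsInMemory = maxDocsInMemory;
    }

    public boolean useDiskCache(int numDocs) {
        return numDocs > maxDocsInMemory;
    }
}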

I'm working on the basis that it's a LOT harder/more expensive to simply allocate more heap to cover the current sorting infrastructure; one hits memory limits faster. Not everyone can afford 64-bit hardware with many GB of RAM to allocate to a heap. It _is_ cheaper/easier to build a disk subsystem to tune this I/O approach, and any spare RAM still helps as buffer cache for the memory-mapped file anyway.

I'm almost certain to have screwed up a calculation somewhere, so feel free to point out weaknesses/gaps etc.

Accomplishing this would require a substantial change to FieldSortedHitQueue et al, and I realize that the use of NIO immediately pins Lucene to Java 1.4, so I'm sure this is controversial. But if we wish Lucene to go beyond where it is now, I think we need to start thinking about this particular problem sooner rather than later.

Happy Easter to all,

Paul Smith
