A discussion on the user list got me thinking about the longer-term scalability of Lucene. Lucene is inherently memory efficient, except for sorting, where the inverted-index structure works against the need to have a value for every document to sort by.

I'm hoping this discussion will stimulate some thought about using Lucene in environments with truly big indexes that need efficient multi-field sorting without requiring a large heap to manage it. I'm not really being altruistic here, because the way we use Lucene would really benefit from improvements in this area.

There have been some patches outlined that use a WeakHashMap implementation, but I'm not sure that's really the best long-term strategy. I've been looking over the FieldCache (and FieldCacheImpl) and FieldSortedHitQueue classes, and I wonder if some use of disk might work for sorting on big indexes.

Conceptually, to sort in Lucene, you must have:

1) For a given doc ID, its 'relationship' to other docs for a given field/locale combination. If each doc ID is ordered based on this, then one can simply compare two docs' positions within that structure. This is how FieldSortedHitQueue does it.

2) The raw value of a document's field, so that the merge sort can complete when merging results from multiple segments.

Point 1 is relatively efficient in terms of memory: it's simply one array of size numDocs, where the value at each index is that doc's numerical position within the sorted list of terms. Point 2 is where the memory can go off the charts, particularly for Strings, and to make sorting efficient over the long haul these terms for each doc are cached in memory. This is why everyone recommends 'warming up' a large index; it's purely loading this stuff into memory!
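To make point 1 concrete, here's a minimal sketch of how such an ordinal array supports comparison. The class and method names are made up for illustration (this is not Lucene's actual API), and it ignores segments and locales:

// Hypothetical sketch: one int per document giving that doc's rank
// within the sorted list of terms for a single field.
public class OrdinalSortSketch {

    // ords[docId] = position of the doc's term in the sorted term list.
    // The array itself costs only 4 bytes per document.
    private final int[] ords;

    public OrdinalSortSketch(int[] ords) {
        this.ords = ords;
    }

    // Comparing two docs is just an int comparison -- no term text needed.
    public int compareByOrdinal(int docA, int docB) {
        return ords[docA] < ords[docB] ? -1 : (ords[docA] == ords[docB] ? 0 : 1);
    }
}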

If you have large indices and require sorting on multiple fields, you can very easily saturate a heap. In our case we get hit by another multiplier: Locale. If you need to sort on those same fields in multiple Locales... (we currently support English, Chinese, French and Japanese sorting, with plans for other languages at some point........)

I'm just going to throw this idea up as a sort of straw-man, hopefully to get some discussion going.

Rather than hold, say, the String[] of terms for a String field in memory, what about writing them to disk in an offset-based memory-mapped file (NIO)? The file structure might look something like:

Header - For each doc #, a byte offset into the data block plus a length, i.e. 2 x 4-byte ints per doc #.
Data - At that offset, each doc's term data is written.
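As a rough illustration of that layout (a sketch only, not a proposed patch; the class and method names are invented, and term bytes are assumed to be UTF-8 encoded so the stored length is in bytes):

import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch of writing the header/data layout described above.
// Header: for each doc, a 4-byte offset into the data block and a 4-byte length.
// Data:   the raw term bytes for each doc, written back to back.
public class SortTermFileWriter {

    public static void write(String path, String[] termPerDoc) throws IOException {
        RandomAccessFile file = new RandomAccessFile(path, "rw");
        try {
            int numDocs = termPerDoc.length;
            long headerSize = 8L * numDocs;     // 2 x 4-byte ints per doc
            long dataPos = headerSize;          // data block starts after the header

            for (int doc = 0; doc < numDocs; doc++) {
                byte[] bytes = termPerDoc[doc].getBytes("UTF-8");

                // Write this doc's (offset, length) entry into the header.
                file.seek(8L * doc);
                file.writeInt((int) (dataPos - headerSize));
                file.writeInt(bytes.length);

                // Append the term bytes to the data block.
                file.seek(dataPos);
                file.write(bytes);
                dataPos += bytes.length;
            }
        } finally {
            file.close();
        }
    }
}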

For a given index of 10 million documents, with an average field length of 43 characters (a sample taken from one of our production databases for mail subject lines), the file size would be roughly:

* Header = 10,000,000 x 8 bytes = ~76 MB
* Data = 10,000,000 x 43 bytes = ~410 MB (assuming single-byte characters)

(This just shows why holding the String[] in memory is not scalable. Yes, one can get reductions from String interning, but one should always consider the worst case, where field values are all unique.)

Performing the lookup of a given Doc's term text to be used for sorting would require:

* Seek to (doc # x (2x4byte)) in the header
* Read offset and length
* Seek to Data location, read length bytes and convert to, say, String
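In code, the lookup might look roughly like this (again only a sketch pairing with the hypothetical writer above; it memory-maps the header via NIO and reads the data block on demand):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch of the per-doc lookup against the header/data file.
public class SortTermFileReader {

    private final RandomAccessFile file;
    private final MappedByteBuffer header; // (offset, length) pairs, 8 bytes per doc
    private final long dataStart;

    public SortTermFileReader(String path, int numDocs) throws IOException {
        this.file = new RandomAccessFile(path, "r");
        this.dataStart = 8L * numDocs;
        // Map only the header; the data block is read on demand.
        this.header = file.getChannel().map(
                FileChannel.MapMode.READ_ONLY, 0, dataStart);
    }

    // Seek to (doc # x 8) in the header, read offset and length,
    // then read the term bytes from the data block.
    public String term(int doc) throws IOException {
        int offset = header.getInt(doc * 8);
        int length = header.getInt(doc * 8 + 4);

        byte[] bytes = new byte[length];
        file.seek(dataStart + offset);
        file.readFully(bytes);
        return new String(bytes, "UTF-8");
    }

    public void close() throws IOException {
        file.close();
    }
}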

Advantages:
* Almost no memory retained for use in sorting, meaning one could hold many more large indexes open.
* When a large index is closed, less GC activity, since less data held for sorting is retained.
* Can now sort on any number of fields, for practically any sized index, in any number of locales.

Disadvantages to this approach:
* It's a lot more I/O intensive
* It would be slower than the current sorting
* It would result in more GC activity

I'm wondering, then, if the sorting infrastructure could be refactored to allow some sort of policy/strategy, where one can choose the point at which one is no longer willing to use memory for sorting, but is willing to pay the disk penalty to get the sort done over a large index without having to allocate a massive gob of heap to do it. This would allow the standard behaviour to continue, but once an index reaches a threshold, sorting memory no longer scales hideously.
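As one way to picture such a policy hook (purely illustrative; these interface and class names are my own invention, not anything in Lucene):

// Hypothetical strategy interface: given an index size, decide whether the
// sort comparator should cache terms on the heap or go via the disk file.
public interface SortCachePolicy {
    boolean useDiskCache(int numDocs);
}

// Example policy: stay in memory below a document-count threshold,
// switch to the memory-mapped file above it.
class ThresholdSortCachePolicy implements SortCachePolicy {
    private final int maxDocsInMemory;

    ThresholdSortCachePolicy(int maxDocsInMemory) {
        this.maxDocsInMemory = maxDocsInMemory;
    }

    public boolean useDiskCache(int numDocs) {
        return numDocs > maxDocsInMemory;
    }
}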

I'm working on the basis that it's a LOT harder/more expensive to simply allocate more heap to cover the current sorting infrastructure; one hits memory limits faster. Not everyone can afford 64-bit hardware with many GB of RAM to allocate to a heap. It _is_ cheaper/easier to build a disk subsystem to tune this I/O approach, and any spare RAM still helps as buffer cache for the memory-mapped file anyway.

I'm almost certain to have screwed up a calculation somewhere, so feel free to point out weaknesses/gaps etc.

Accomplishing this would require a substantial change to FieldSortedHitQueue et al, and I realize that the use of NIO immediately pins Lucene to Java 1.4, so I'm sure this is controversial. But if we wish Lucene to go beyond where it is now, I think we need to start thinking about this particular problem sooner rather than later.

Happy Easter to all,

Paul Smith
