A discussion on the user list turned my mind to the longer-term
scalability issues of Lucene. Lucene is inherently memory efficient,
except for sorting, where the inverted structure of the index works
against the requirement of having a value on hand for each document
being sorted.
I'm hoping this discussion will stimulate some thought on using
Lucene in environments with truly big indexes that need efficient
multi-field sorting without requiring a large heap to manage it. I'm
not being entirely altruistic here; the way we use Lucene would
really benefit from some improvements in this area.
There have been some patches proposed that use a WeakHashMap
implementation, but I'm not sure that's really the best long-term
strategy. I've been looking over FieldCache/FieldCacheImpl and the
FieldSortedHitQueue classes, and I wonder if some use of disk might
work for sorting on big indexes.
Conceptually, to sort in Lucene, you must have:
1) For a given doc ID, its 'relationship' to other docs for a given
field/locale combination. If each doc ID is ordered on this basis,
then one can simply compare two docs' positions within the structure.
This is how FieldSortedHitQueue does it.
2) The raw value of a document's field, so that the merge sort can
complete when merging results from multiple segments.
Achieving point 1 is relatively memory efficient: it's simply one
array of size numDocs, where the value at each index is that doc's
numerical position within the sorted list of terms. Point 2 is where
the memory can go off the charts, particularly for Strings, because
to make sorting efficient over the long haul these terms are cached
in memory for every doc. This is why everyone recommends 'warming up'
a large index: it's purely loading this stuff into memory!
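To make that concrete, here is roughly the shape of what gets held in
memory per sorted field today (illustrative only; the names are mine,
not the actual FieldCache API):

    // Illustrative sketch -- not the actual FieldCache classes, just the
    // shape of the per-field data the cache holds.
    class CachedSortField {
        // Point 1: order[docId] = that doc's position in the sorted order
        // of terms. One int per document; cheap even for big indexes.
        final int[] order;

        // Point 2: the raw term value per doc, needed so results from
        // different segments/searchers can be merge-sorted. For Strings,
        // this is what eats the heap.
        final String[] terms;

        CachedSortField(int[] order, String[] terms) {
            this.order = order;
            this.terms = terms;
        }

        // Within a single index, comparing two docs needs only the ordinals.
        int compare(int docA, int docB) {
            return order[docA] - order[docB];
        }
    }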
If you have large indices and require sorting on multiple fields,
you can very easily saturate a heap. In our case we get hit by
another multiplier: Locale. If you need to sort by that same number
of fields in multiple Locales... (we currently support English,
Chinese, French and Japanese sorting, with plans for other languages
at some point.)
I'm just going to throw this idea up as a sort of straw-man,
hopefully to get some discussion going.
Rather than hold, say, the String[] of terms for a String field in
memory, what about writing them to disk in an offset-based
memory-mapped file (NIO)? The file structure might look something like:
Header - For each doc #, a 4-byte offset into the data block where
its value starts, plus a 4-byte length. So, 2 x 4-byte ints per doc #.
Data - At that location, each doc's term data is written.
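A rough sketch of writing such a file (hypothetical class and method
names, not an existing Lucene API; a real implementation would buffer
the writes rather than seek per doc):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    /** Hypothetical sketch only -- not an existing Lucene class. */
    public class SortValueFileWriter {

        /** Writes a fixed-width header (offset + length per doc),
            followed by the data block holding each doc's term bytes. */
        public static void write(String path, String[] termPerDoc)
                throws IOException {
            RandomAccessFile file = new RandomAccessFile(path, "rw");
            long dataStart = (long) termPerDoc.length * 8L; // 2 x 4-byte ints per doc
            long dataPos = dataStart;
            for (int doc = 0; doc < termPerDoc.length; doc++) {
                byte[] bytes = (termPerDoc[doc] == null)
                        ? new byte[0]
                        : termPerDoc[doc].getBytes("UTF-8");
                file.seek((long) doc * 8L);                 // this doc's header slot
                file.writeInt((int) (dataPos - dataStart)); // offset into data block
                file.writeInt(bytes.length);                // length in bytes
                file.seek(dataPos);                         // append to data block
                file.write(bytes);
                dataPos += bytes.length;
            }
            file.close();
        }
    }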
For a given index of 10 million documents, with an average field
length of 43 characters (a sample taken from one of our production
databases, for mail subject lines), the file size would be:
* Header = 76MB
* Data = 410MB
(This just shows why holding the String[] in memory is not scalable.
Yes, one can get reductions from String interning, but one should
always plan for the worst case rather than rely on field values
repeating.)
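For reference, the arithmetic behind those figures (assuming 8 header
bytes per doc and roughly one byte per character of term data):

    Header = 10,000,000 docs x  8 bytes =  80,000,000 bytes ~= 76MB
    Data   = 10,000,000 docs x 43 bytes = 430,000,000 bytes ~= 410MB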
Performing the lookup of a given doc's term text for sorting would
require:
* Seek to (doc # x (2 x 4 bytes)) in the header
* Read the offset and length
* Seek to that offset in the data block, read length bytes and
convert them to, say, a String
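A minimal sketch of that lookup against a memory-mapped version of
the file (again, hypothetical names; error handling and buffer reuse
omitted):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    /** Hypothetical sketch only -- not an existing Lucene class. */
    public class SortValueFileReader {
        private static final int HEADER_ENTRY_SIZE = 8; // 2 x 4-byte ints per doc

        private final MappedByteBuffer buffer;
        private final int dataStart; // data block begins right after the header

        public SortValueFileReader(String path, int numDocs) throws IOException {
            FileChannel channel = new RandomAccessFile(path, "r").getChannel();
            this.buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            this.dataStart = numDocs * HEADER_ENTRY_SIZE;
        }

        /** Returns the raw sort value for a doc, paying the page-cache/disk
            cost on each call instead of holding a String[] on the heap. */
        public String term(int doc) throws IOException {
            int headerPos = doc * HEADER_ENTRY_SIZE;    // seek to doc's header slot
            int offset = buffer.getInt(headerPos);      // offset into the data block
            int length = buffer.getInt(headerPos + 4);  // byte length of the value
            byte[] bytes = new byte[length];
            // No absolute bulk get in the 1.4 buffer API, so use a duplicate
            // with its own position rather than disturbing the shared buffer.
            ByteBuffer view = buffer.duplicate();
            view.position(dataStart + offset);
            view.get(bytes);
            return new String(bytes, "UTF-8");
        }
    }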
Advantages:
* Almost no memory is retained for sorting, so one could hold many
more large indexes open.
* When a large index is closed there is less GC activity, since far
less sorting data was being retained.
* One can now sort on any number of fields, over practically any
size of index, in any number of locales.
Disadvantages to this approach:
* It's a lot more I/O intensive
* It would be slower than the current in-memory sorting
* It would generate more transient garbage (and hence GC activity)
while sorting
I'm wondering, then, whether the sorting infrastructure could be
refactored around some sort of policy/strategy, where one can choose
the point at which one is no longer willing to spend heap on sorting
and is instead willing to pay the disk penalty to get the sort done
over a large index. This would allow the standard behavior to
continue, but once an index crosses the threshold the memory cost
would stop growing hideously.
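To make the policy/strategy idea concrete, something along these
lines (entirely hypothetical, not a proposed final API):

    /** Hypothetical policy: decide per field whether to cache sort values
        in memory (current behavior) or read them from a disk-backed file. */
    public interface SortValueStorePolicy {
        boolean useDiskStore(String field, int numDocs, long estimatedBytes);
    }

    /** Example: fall back to disk once the estimated cache size for the
        field crosses a configured threshold. */
    public class ThresholdPolicy implements SortValueStorePolicy {
        private final long maxBytesInHeap;

        public ThresholdPolicy(long maxBytesInHeap) {
            this.maxBytesInHeap = maxBytesInHeap;
        }

        public boolean useDiskStore(String field, int numDocs, long estimatedBytes) {
            return estimatedBytes > maxBytesInHeap;
        }
    }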
I'm working on the basis that it's a LOT harder/more expensive to
simply allocate more heap to cover the current sorting
infrastructure: one hits memory limits faster, and not everyone can
afford 64-bit hardware with many GB of RAM to dedicate to a heap. It
_is_ cheaper/easier to build a disk subsystem suited to this I/O
approach, and any spare RAM still helps as buffer cache for the
memory-mapped file anyway.
I'm almost certain to have screwed up a calculation somewhere, so
feel free to point out weaknesses/gaps etc.
Accomplishing this would require substantial changes to
FieldSortedHitQueue et al., and I realize that using NIO immediately
pins Lucene to Java 1.4, so I'm sure this is controversial. But if we
want Lucene to go beyond where it is now, I think we need to start
thinking about this particular problem sooner rather than later.
Happy Easter to all,
Paul Smith