Hi Sam,

I am preparing the Zend_Search_Lucene Best Practices documentation section right now, and it will include recommendations for the different indexing modes (see below) :)

Hope it helps.


To get a quick result:
1. Don't limit batch indexing execution time.
2. Choose MaxBufferedDocs according to your memory limit (start with 128 and halve it each time you get an 'out of memory' error).
3. Skip MergeFactor tuning.
4. Set MaxMergeDocs to floor(NumberOfDocuments/64).
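Put together as code, the recipe above might look like this (a minimal sketch; the index path and the $docs array are placeholders, and Zend Framework is assumed to be on the include path):

```php
<?php
// Batch indexing sketch following the four recommendations above.
require_once 'Zend/Search/Lucene.php';

set_time_limit(0);                                   // 1. no execution time limit

$index = Zend_Search_Lucene::open('/path/to/index'); // placeholder path

$index->setMaxBufferedDocs(128);                     // 2. halve on 'out of memory'
                                                     // 3. MergeFactor left at its default
$index->setMaxMergeDocs((int) floor(count($docs) / 64)); // 4. $docs is a placeholder array

foreach ($docs as $doc) {     // each $doc is a Zend_Search_Lucene_Document
    $index->addDocument($doc);
}
$index->commit();             // flush the in-memory buffer to disk
```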



-- Indexing performance -------------
Indexing performance is a compromise between the resources used, indexing time, and index quality.


Index quality is completely determined by the number of index segments.

Each index segment is an entirely independent portion of data, so an index containing more segments needs more memory and more time for searching.

Index optimization is the process of merging several segments into a new one. A fully optimized index contains only one segment.

Full index optimization may be performed with the 'optimize()' method:
----
$index = Zend_Search_Lucene::open($indexPath);

$index->optimize();
----

Index optimization works with data streams and doesn't take a lot of memory, but it does take processor time.


Lucene index segments are not updatable by nature (an update operation requires the segment file to be completely rewritten), so adding new documents to the index always generates a new segment. This decreases index quality.

The index auto-optimization process is performed after each segment generation and consists of partially merging segments.


There are three options that control the behavior of auto-optimization:
1. MaxBufferedDocs is the number of documents buffered in memory before a new segment is generated and written to the hard drive.
2. MaxMergeDocs is the maximum number of documents merged into a new segment by the auto-optimization process.
3. MergeFactor determines how often auto-optimization is performed.
* All these options are properties of the Zend_Search_Lucene object, not of the index itself. So they affect only the current Zend_Search_Lucene object's behavior and may vary between scripts.
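Since these are object properties rather than index properties, each script sets its own values. A hypothetical interactive-mode script might look like (the values are illustrative, not recommendations):

```php
<?php
// Per-object tuning sketch: each script opening the index configures
// its own Zend_Search_Lucene object.
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/path/to/index'); // placeholder path

$index->setMaxBufferedDocs(10);  // small buffer: one request adds few documents
$index->setMaxMergeDocs(1000);   // cap merge size to keep addDocument() fast
$index->setMergeFactor(5);       // merge often to keep the segment count low

// Another script opening the same index starts from the defaults again.
```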

MaxBufferedDocs doesn't matter if you index only one document per script execution; on the other hand, it's very important for batch indexing. A greater value increases indexing performance, but also needs more memory.

There is no way to calculate the best value for the MaxBufferedDocs parameter, because it depends on document size, the analyzer used, and the allowed memory.

A good way to find the right value is to perform several tests with the largest document you expect to add to the index ('memory_get_usage()' and 'memory_get_peak_usage()' may be used to monitor memory usage). It's a good idea not to use more than half of the allowed memory.
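Such a test could be sketched like this ('buildLargestDocument()' is a hypothetical helper that constructs the biggest document you expect to index):

```php
<?php
// Memory test sketch: fill a whole buffer with copies of the largest
// expected document, then compare peak memory usage to the limit.
require_once 'Zend/Search/Lucene.php';

$maxBufferedDocs = 128;          // candidate value under test

$index = Zend_Search_Lucene::create('/tmp/test-index');
$index->setMaxBufferedDocs($maxBufferedDocs);

for ($i = 0; $i < $maxBufferedDocs; $i++) {
    $index->addDocument(buildLargestDocument()); // hypothetical helper
}
$index->commit();

echo 'Peak memory: ', memory_get_peak_usage(), " bytes\n";
// If this exceeds half of the memory_limit ini setting, halve
// $maxBufferedDocs and repeat the test.
```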


MaxMergeDocs limits segment size (in terms of documents), and thus limits auto-optimization time. This guarantees that the addDocument() method never executes longer than a certain time, which is important for interactive applications.

Decreasing the MaxMergeDocs parameter may also improve batch indexing performance. Index auto-optimization is an iterative process performed step by step: small segments are merged into larger ones, at some point those are merged into even larger ones, and so on. Full index optimization is much more effective than this.

On the other hand, smaller segments decrease index quality and may result in too many segments. This may cause a 'Too many open files' error imposed by OS limitations (Zend_Search_Lucene keeps each segment file open to improve search performance).

So background index optimization should be performed for the interactive indexing mode, and MaxMergeDocs shouldn't be set too low for batch indexing.
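For interactive mode, that background optimization could be a separate script run periodically, e.g. from cron (a sketch, assuming the same placeholder index path):

```php
<?php
// optimize-index.php -- run periodically (e.g. nightly via cron) so that
// the interactive indexing scripts never pay the optimization cost.
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/path/to/index'); // placeholder path
$index->optimize();  // merge all segments into one
```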


MergeFactor affects the auto-optimization frequency. Smaller values increase the quality of the unoptimized index; larger values increase indexing performance, but also increase the number of segments. That again may cause a 'Too many open files' error.

MergeFactor groups index segments by their size:
1. Not greater than MaxBufferedDocs.
2. Greater than MaxBufferedDocs, but not greater than MaxBufferedDocs*MergeFactor.
3. Greater than MaxBufferedDocs*MergeFactor, but not greater than MaxBufferedDocs*MergeFactor*MergeFactor.
...

At each addDocument() call, Zend_Search_Lucene checks whether merging any group of segments would move the newly created segment into the next group; if so, the merge is performed.

So an index with N groups may contain up to MaxBufferedDocs + (N-1)*MergeFactor segments and contains at least MaxBufferedDocs*MergeFactor^(N-1) documents.

This gives a good approximation for the number of segments in the index:
NumberOfSegments <= MaxBufferedDocs + MergeFactor*ln(NumberOfDocuments/MaxBufferedDocs)/ln(MergeFactor)
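Plugging concrete numbers into this estimation (plain arithmetic, no Zend_Search_Lucene needed; the values are only examples):

```php
<?php
// Upper-bound estimate of the number of index segments:
//   NumberOfSegments <= MaxBufferedDocs
//       + MergeFactor * ln(NumberOfDocuments/MaxBufferedDocs) / ln(MergeFactor)
function estimateSegments($maxBufferedDocs, $mergeFactor, $numberOfDocuments)
{
    return $maxBufferedDocs
         + $mergeFactor * log($numberOfDocuments / $maxBufferedDocs)
                        / log($mergeFactor);
}

// Example: 100,000 documents, MaxBufferedDocs = 100, MergeFactor = 10 gives
// 100 + 10 * ln(1000)/ln(10) = 100 + 10 * 3 = 130 segments at most.
echo estimateSegments(100, 10, 100000), "\n";
```

Keeping this bound comfortably below the OS open-files limit is the practical goal when choosing MergeFactor.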

MaxBufferedDocs is determined by the allowed memory. This makes it possible to choose an appropriate merge factor to get a reasonable number of segments.


Tuning the MergeFactor parameter affects batch indexing performance more strongly than MaxMergeDocs, but it's also a coarser control. So use the estimation above to tune MergeFactor, then play with MaxMergeDocs to get the best batch indexing performance.
---------------

With best regards,
   Alexander Veremyev.


Sam Davey wrote:
Hi,

I am really impressed with the performance of Zend_Search_Lucene and am
trying to shift my SQL based search to an index based search for the obvious
advantages of relieving stress on my MySql server.

However I have a problem in that there is a massive amount of data I need to
index.  And when I try to index it all the script either runs out of memory
or exceeds the given execution time.  Of course I can use ini_set to
increase these values but I have already increased them to high values and I
can still only index about half of my data.

Does anyone know of a good strategy to minimise or control the required
memory/time of a script indexing this amount of data?

Cheers,

Sam
