Hi Sam,

I am preparing the Zend_Search_Lucene Best Practices documentation section right now, and it will include recommendations for the different indexing modes (see below) :)

Hope it helps.


To get a quick result:
1. Don't limit batch indexing execution time.
2. Choose MaxBufferedDocs according to your memory limit (start with 128 and halve it each time you get an 'out of memory' error).
3. Skip MergeFactor tuning.
4. Set MaxMergeDocs to floor(NumberOfDocuments/64).
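Put together as code, the recipe above might look like this (a minimal sketch; the index path and the $docs array are placeholders, and Zend Framework is assumed to be on the include path):

```php
<?php
// Batch indexing sketch following the four recommendations above.
require_once 'Zend/Search/Lucene.php';

set_time_limit(0);                                   // 1. no execution time limit

$index = Zend_Search_Lucene::open('/path/to/index'); // placeholder path

$index->setMaxBufferedDocs(128);                     // 2. halve on 'out of memory'
                                                     // 3. MergeFactor left at its default
$index->setMaxMergeDocs((int) floor(count($docs) / 64)); // 4. $docs is a placeholder array

foreach ($docs as $doc) {     // each $doc is a Zend_Search_Lucene_Document
    $index->addDocument($doc);
}
$index->commit();             // flush the in-memory buffer to disk
```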



-- Indexing performance -------------
Indexing performance is a compromise between the resources used, indexing time, and index quality.


Index quality is completely determined by the number of index segments.

Each index segment is an entirely independent portion of data, so an index containing more segments needs more memory and more time for searching.

Index optimization is the process of merging several segments into a new one. A fully optimized index contains only one segment.

Full index optimization may be performed with the 'optimize()' method:
----
$index = Zend_Search_Lucene::open($indexPath);

$index->optimize();
----

Index optimization works with data streams and doesn't take a lot of memory, but it does take processor time.


Lucene index segments are not updatable by nature (an update operation requires the segment file to be completely rewritten), so adding new documents to the index always generates a new segment. This decreases index quality.

The index auto-optimization process is performed after each segment generation and consists of partially merging segments.


There are three options that control the behavior of auto-optimization:
1. MaxBufferedDocs is the number of documents buffered in memory before a new segment is generated and written to the hard drive.
2. MaxMergeDocs is the maximum number of documents merged into a new segment by the auto-optimization process.
3. MergeFactor determines how often auto-optimization is performed.
* All these options are properties of the Zend_Search_Lucene object, not of the index itself. So they affect only the current Zend_Search_Lucene object's behavior and may vary between scripts.
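Since these are object properties rather than index properties, each script sets its own values. A hypothetical interactive-mode script might look like (the values are illustrative, not recommendations):

```php
<?php
// Per-object tuning sketch: each script opening the index configures
// its own Zend_Search_Lucene object.
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/path/to/index'); // placeholder path

$index->setMaxBufferedDocs(10);  // small buffer: one request adds few documents
$index->setMaxMergeDocs(1000);   // cap merge size to keep addDocument() fast
$index->setMergeFactor(5);       // merge often to keep the segment count low

// Another script opening the same index starts from the defaults again.
```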

MaxBufferedDocs doesn't matter if you index only one document per script execution; on the other hand, it's very important for batch indexing. A greater value increases indexing performance, but also needs more memory.

There is no way to calculate the best value for the MaxBufferedDocs parameter, because it depends on document size, the analyzer used, and the allowed memory.

A good way to find the right value is to perform several tests with the largest document you expect to add to the index ('memory_get_usage()' and 'memory_get_peak_usage()' may be used to monitor memory usage). It's a good idea not to use more than half of the allowed memory.
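Such a test could be sketched like this ('buildLargestDocument()' is a hypothetical helper that constructs the biggest document you expect to index):

```php
<?php
// Memory test sketch: fill a whole buffer with copies of the largest
// expected document, then compare peak memory usage to the limit.
require_once 'Zend/Search/Lucene.php';

$maxBufferedDocs = 128;          // candidate value under test

$index = Zend_Search_Lucene::create('/tmp/test-index');
$index->setMaxBufferedDocs($maxBufferedDocs);

for ($i = 0; $i < $maxBufferedDocs; $i++) {
    $index->addDocument(buildLargestDocument()); // hypothetical helper
}
$index->commit();

echo 'Peak memory: ', memory_get_peak_usage(), " bytes\n";
// If this exceeds half of the memory_limit ini setting, halve
// $maxBufferedDocs and repeat the test.
```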


MaxMergeDocs limits segment size (in terms of documents), and thus limits auto-optimization time. This guarantees that the addDocument() method never executes longer than a certain time, which is important for interactive applications.

Decreasing the MaxMergeDocs parameter may also improve batch indexing performance. Index auto-optimization is an iterative process performed step by step: small segments are merged into larger ones, at some point those are merged into even larger ones, and so on. Full index optimization is much more effective than this.

On the other hand, smaller segments decrease index quality and may result in too many segments. This may cause a 'Too many open files' error imposed by OS limitations (Zend_Search_Lucene keeps each segment file open to improve search performance).

So background index optimization should be performed for the interactive indexing mode, and MaxMergeDocs shouldn't be set too low for batch indexing.
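For interactive mode, that background optimization could be a separate script run periodically, e.g. from cron (a sketch, assuming the same placeholder index path):

```php
<?php
// optimize-index.php -- run periodically (e.g. nightly via cron) so that
// the interactive indexing scripts never pay the optimization cost.
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/path/to/index'); // placeholder path
$index->optimize();  // merge all segments into one
```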


MergeFactor affects the auto-optimization frequency. Smaller values increase the quality of the unoptimized index; larger values increase indexing performance, but also increase the number of segments. That again may cause a 'Too many open files' error.

MergeFactor groups index segments by their size:
1. Not greater than MaxBufferedDocs.
2. Greater than MaxBufferedDocs, but not greater than MaxBufferedDocs*MergeFactor.
3. Greater than MaxBufferedDocs*MergeFactor, but not greater than MaxBufferedDocs*MergeFactor*MergeFactor.
...

At each addDocument() call, Zend_Search_Lucene checks whether merging any group of segments would move the newly created segment into the next group; if so, the merge is performed.

So an index with N groups may contain up to MaxBufferedDocs + (N-1)*MergeFactor segments and contains at least MaxBufferedDocs*MergeFactor^(N-1) documents.

This gives a good approximation for the number of segments in the index:
NumberOfSegments <= MaxBufferedDocs + MergeFactor*ln(NumberOfDocuments/MaxBufferedDocs)/ln(MergeFactor)
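Plugging concrete numbers into this estimation (plain arithmetic, no Zend_Search_Lucene needed; the values are only examples):

```php
<?php
// Upper-bound estimate of the number of index segments:
//   NumberOfSegments <= MaxBufferedDocs
//       + MergeFactor * ln(NumberOfDocuments/MaxBufferedDocs) / ln(MergeFactor)
function estimateSegments($maxBufferedDocs, $mergeFactor, $numberOfDocuments)
{
    return $maxBufferedDocs
         + $mergeFactor * log($numberOfDocuments / $maxBufferedDocs)
                        / log($mergeFactor);
}

// Example: 100,000 documents, MaxBufferedDocs = 100, MergeFactor = 10 gives
// 100 + 10 * ln(1000)/ln(10) = 100 + 10 * 3 = 130 segments at most.
echo estimateSegments(100, 10, 100000), "\n";
```

Keeping this bound comfortably below the OS open-files limit is the practical goal when choosing MergeFactor.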

MaxBufferedDocs is determined by the allowed memory. This makes it possible to choose an appropriate merge factor to get a reasonable number of segments.


Tuning the MergeFactor parameter affects batch indexing performance more strongly than MaxMergeDocs, but it's also a coarser control. So use the estimation above to tune MergeFactor, then play with MaxMergeDocs to get the best batch indexing performance.
---------------

With best regards,
   Alexander Veremyev.


Sam Davey wrote:
Hi,

I am really impressed with the performance of Zend_Search_Lucene and am
trying to shift my SQL based search to an index based search for the obvious
advantages of relieving stress on my MySql server.

However I have a problem in that there is a massive amount of data I need to
index.  And when I try to index it all the script either runs out of memory
or exceeds the given execution time.  Of course I can use ini_set to
increase these values but I have already increased them to high values and I
can still only index about half of my data.

Does anyone know of a good strategy to minimise or control the required
memory/time of a script indexing this amount of data?

Cheers,

Sam
