Hi!

Sebi wrote:
 >> I have some questions related to Zend Framework's Zend_Search_Lucene. I
 >> want to implement a fast and scalable directory system and I need to
 >> understand some things.

 >Sounds very interesting!


 >> 1. I read that the Zend_Search_Lucene::optimize() function merges all index
 >> segments into a new one. Won't this single segment become too big in time?

 >A segment can't be "too big". A merged segment always takes less space and
 >can be scanned faster than the several "source" segments it replaces.

 >Segment size is limited to 2GB on 32-bit platforms.
 
>(http://framework.zend.com/manual/en/zend.search.index-creation.html#zend.search.index-creation.limitations)

Well, let's say a segment reaches the 2GB limit. What will the optimizer do in that case? Say we have one 2GB segment and two small segments of 50MB each. Will the optimizer start a second segment by merging those two 50MB segments (so that there are two segments after it finishes)?


A Zend_Search_Lucene::optimize() call merges all segments into a new one.

But the automatic optimization behavior depends on the MergeFactor.

It:
1. Waits until it has enough segments that are each smaller than MaxBufferedDocs documents but whose total size is greater than MaxBufferedDocs. When it does, the optimizer merges them.

2. Checks whether it has enough segments with sizes from MaxBufferedDocs up to MaxBufferedDocs*MergeFactor, and merges them.

And so on, until the target segment size reaches MaxMergeDocs.


So, with the default MergeFactor (10) and MaxBufferedDocs (10), it has the following groups:
1. 1-10 documents.
2. 11-100 documents.
3. 101-1000 documents.
4. 1001-10000 documents.
....

If merging a group would move the new segment into the next group, and the resulting segment size is not greater than MaxMergeDocs, then the merge is performed.
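The grouping above can be sketched as a small helper. To be clear, this function is purely illustrative and not part of the Zend_Search_Lucene API; it just computes which group a segment with a given document count falls into, assuming the default MergeFactor and MaxBufferedDocs of 10:

```php
<?php
// Hypothetical helper illustrating the grouping described above;
// not part of the Zend_Search_Lucene API.
function mergeGroup($docCount, $maxBufferedDocs = 10, $mergeFactor = 10)
{
    // Group 1 covers 1..MaxBufferedDocs documents, group 2 covers
    // MaxBufferedDocs+1 .. MaxBufferedDocs*MergeFactor, and so on,
    // each group MergeFactor times larger than the previous one.
    $group = 1;
    $upper = $maxBufferedDocs;
    while ($docCount > $upper) {
        $group++;
        $upper *= $mergeFactor;
    }
    return $group;
}

echo mergeGroup(7);    // 1  (1-10 documents)
echo mergeGroup(55);   // 2  (11-100 documents)
echo mergeGroup(500);  // 3  (101-1000 documents)
```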

Thus it takes about 10 segments, each smaller than 2GB, to merge into a new one that is greater than 2GB.


If a segment exceeds 2GB, it will not be loaded correctly next time.

In principle it's possible to limit the target segment size to 2GB (a merged segment's size is always less than the sum of the sizes of the segments being merged, so we have a good estimate of the target size). That would not help with large indices prepared and already optimized with Java Lucene (there was an issue on this problem: http://framework.zend.com/issues/browse/ZF-527), but it would remove this limit for indices prepared with Zend_Search.


 >> Or will the optimize process create more segments, each with a maximum
 >> number of documents (with the MaxMergeDocs variable used for that
 >> maximum)?

 >"MaxMergeDocs is the largest number of documents ever merged by addDocument()."
 >Automatic index optimization (invoked by addDocument()) is an
 >incremental process. It merges several small segments into a new one,
 >which is larger.
 >When it has enough "larger" segments, it merges them.
 >And so on.

 >MaxMergeDocs guarantees that addDocument() will never execute longer
 >than we want. It's a limit on auto-optimization.

Let me see if I understand. The addDocument() function adds documents to a single segment. When the number of those documents reaches MaxMergeDocs, the optimizer merges this segment together with the other ones. The optimizer will not merge segments which contain fewer documents than MaxMergeDocs. Is that right? Or have I misunderstood the MaxMergeDocs variable?

1. The addDocument() function adds one document to the index.

2. If addDocument() is called only once during script execution, then a new segment (with only one document) is written down. Automatic index optimization may merge this segment with others.

3. If addDocument() is called several times, then all added documents are stored in memory (index segment files are not updatable by their nature) until the number of added documents reaches MaxBufferedDocs. At that moment all buffered documents are flushed into a new segment. Then automatic merging may also be performed.

4. Automatic optimization doesn't merge segments if the target segment would be greater than MaxMergeDocs.
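Putting the points above together, a minimal indexing sketch might look like the following. The index path and field name are placeholders, and the tuning setters (setMaxBufferedDocs(), setMergeFactor(), setMaxMergeDocs()) are assumed to be available in your Zend Framework release:

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Create a new index (use Zend_Search_Lucene::open() for an existing one).
$index = Zend_Search_Lucene::create('/path/to/index');

// Tune the buffering/merging behavior discussed above.
$index->setMaxBufferedDocs(10); // flush to a new segment every 10 documents
$index->setMergeFactor(10);     // size ratio between merge groups
$index->setMaxMergeDocs(10000); // cap on auto-merged segment size

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', 'Example entry'));

// Buffered in memory until MaxBufferedDocs is reached.
$index->addDocument($doc);
```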


 >> 2. Is the optimize process automatic?

 >Yes. But you may call Zend_Search_Lucene::optimize() to perform a full
 >index optimization. It doesn't use MaxMergeDocs and merges all segments
 >into one.

>> 3. When I must use commit? Only after delete operation? Or I must use it
 >> after add operations as well. Is it an automatic process?

 >It's not necessary to use commit() now.
 >But you may use it if you want to be sure that all changes are written
 >down at the point of the commit() call.
 >unset($index) (where $index is a Zend_Search_Lucene object) has the same
 >effect.

When does the automatic commit() happen? Only at the end of the script? Or also after MaxBufferedDocs is reached?

Yes. At each MaxBufferedDocs-th document and at the end of the script.

Where are the documents stored before I call the commit function? In memory?

Yes. In memory. Documents can't be written down earlier without generating a new segment (which is not updatable).
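As a sketch of the commit behavior described above (the index path and field name are placeholders):

```php
<?php
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/path/to/index');

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', 'Buffered document'));
$index->addDocument($doc); // held in memory for now

// Force buffered documents into a new segment right now...
$index->commit();

// ...or simply release the object; its destructor has the same effect.
unset($index);
```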


 >> What happens after MaxBufferedDocs is reached?

 >1. All added documents are written down into a new segment.
 >2. The automatic optimization process may start.


 >> 4. I didn't find any details about Zend_Search performance. Are there any
 >> benchmarks? Can you estimate how it will behave with 2 million
 >> documents? Each document will have a maximum length of 400 characters
 >> (this text is indexed).

 >There are no official benchmarks.
 >I can only say that performance was one of the first goals for
 >Zend_Search_Lucene. I can also say that it's comparable with Java Lucene.

 >The behavior strongly depends on the index contents and query types.
 >Do you have any idea about term selectivity?


With best regards,
   Alexander Veremyev.
