Thank you for your great and detailed explanation. You were very explicit.

I have one little question left. You said "Do you have any idea about
terms selectivity?" What does "terms selectivity" mean?

>  >> I have some questions related to the Zend Framework Search Lucene. I
>  >> want to implement a fast and scalable directory system and I need to
>  >> understand some things.
> 
>  >Sounds very interesting!
> 
> 
>  >> 1. I read that the Zend_Search_Lucene::optimize() function merges all
>  >> index segments into a new one. Won't this single segment become too
>  >> big in time?
> 
>  >A segment can't be "too big". A merged segment always takes less memory
>  >and can be scanned faster than several "source" segments.
> 
>  >Segment size is limited to 2GB on 32-bit platforms.
>  
> >(http://framework.zend.com/manual/en/zend.search.index-creation.html#zend.search.index-creation.limitations)
> 
> Well, let's say the segment reaches the 2GB limit. What will the
> optimizer do in this case? Let's say we have one segment of 2GB and
> 2 small segments of 50MB each. Will it begin a second segment by
> merging those two 50MB segments (so that there will be 2 segments
> after the optimizer finishes)?


A Zend_Search_Lucene::optimize() call merges all segments into a new one.

But the automatic optimization behavior depends on MergeFactor.

It:
1. Waits until it has enough segments, each smaller than MaxBufferedDocs
but with a combined size greater than MaxBufferedDocs. When it does, the
optimizer merges them.

2. Checks whether it has enough segments sized between MaxBufferedDocs
and MaxBufferedDocs*MergeFactor, and merges them.

...and so on, until the target segment size reaches MaxMergeDocs.


So with the default MergeFactor (10) and MaxBufferedDocs (10), it has
the following groups:
1. 1-10 documents.
2. 11-100 documents.
3. 101-1000 documents.
4. 1001-10000 documents.
....

If merging a group would move the new segment into the next group, and
the resulting size is not greater than MaxMergeDocs, the merge is
performed.

Thus it takes about 10 segments, each smaller than 2GB, to merge them
into a new one that is greater than 2GB.
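The grouping and merge policy above can be sketched with a small
simulation. This is a Python sketch of the policy as described in this
mail, not the actual Zend_Search_Lucene (PHP) implementation; the helper
names `group_of` and `auto_merge` are hypothetical, segment "size" is
counted in documents for simplicity, and the MAX_MERGE_DOCS value is
just an illustrative bound:

```python
# Toy simulation of the incremental merge policy described above.
# Defaults (10/10) follow the values mentioned in this mail.

MERGE_FACTOR = 10
MAX_BUFFERED_DOCS = 10
MAX_MERGE_DOCS = 10_000_000   # illustrative cap: never merge past this

def group_of(doc_count):
    """Group 1 holds 1-10 docs, group 2 holds 11-100, and so on."""
    group, upper = 1, MAX_BUFFERED_DOCS
    while doc_count > upper:
        group += 1
        upper *= MERGE_FACTOR
    return group

def auto_merge(segments):
    """Merge MERGE_FACTOR same-group segments while the result stays
    within MAX_MERGE_DOCS; returns the new list of segment sizes."""
    merged = True
    while merged:
        merged = False
        for g in sorted({group_of(s) for s in segments}):
            same = [s for s in segments if group_of(s) == g]
            batch = same[:MERGE_FACTOR]
            if len(same) >= MERGE_FACTOR and sum(batch) <= MAX_MERGE_DOCS:
                for s in batch:
                    segments.remove(s)
                segments.append(sum(batch))  # merged segment joins next group
                merged = True
                break
    return sorted(segments)

# Ten 10-doc segments collapse into one 100-doc segment (next group):
print(auto_merge([10] * 10))   # -> [100]
# Nine segments are not enough to trigger a merge:
print(auto_merge([10] * 9))    # -> [10, 10, 10, 10, 10, 10, 10, 10, 10]
```

Merging ten ~2GB segments would likewise produce one segment past the
2GB limit, which is exactly the failure mode discussed below.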


If a segment exceeds 2GB, it will not be loaded correctly the next time.

In principle it's possible to limit the target segment size to 2GB (the
merged segment size is always less than the sum of the sizes of the
segments to be merged, so we have a good target size estimation).
That would not help with large indices prepared and already optimized
with Java Lucene (there was an issue on this problem -
http://framework.zend.com/issues/browse/ZF-527), but it would remove
this limit for indices prepared with Zend_Search.
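The proposed guard could look something like this. A minimal sketch,
assuming the sum-of-sources estimate described above; `can_merge` and
`SEGMENT_SIZE_LIMIT` are hypothetical names, not real Zend Framework
code:

```python
# Sketch of the proposed 2GB guard: skip a merge when the size estimate
# (sum of the source segment sizes, an upper bound on the merged size)
# would exceed the 32-bit platform limit.

SEGMENT_SIZE_LIMIT = 2 * 1024 ** 3  # 2GB

def can_merge(segment_sizes_bytes):
    """The merged segment is never larger than the sum of its sources,
    so the sum is a safe (conservative) estimate of the target size."""
    return sum(segment_sizes_bytes) <= SEGMENT_SIZE_LIMIT

print(can_merge([800 * 1024 ** 2] * 2))   # two 800MB segments -> True
print(can_merge([1536 * 1024 ** 2] * 2))  # two 1.5GB segments -> False
```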


>  >> Or will the optimize process create more segments with a maximum
>  >> number of documents (is the MaxMergeDocs variable used in this case
>  >> for the maximum number)?
> 
>  >"MaxMergeDocs is the largest number of documents ever merged by
>  >addDocument()."
>  >Automatic index optimization (invoked by addDocument()) is an
>  >incremental process. It merges several small segments into a new,
>  >larger one.
>  >When it has enough "larger" segments it merges them.
>  >And so on.
> 
>  >MaxMergeDocs guarantees that addDocument() will never execute longer
>  >than we want. It's a limitation on auto-optimization.
> 
> Let me see if I understand. The addDocument() function will add
> documents to a single segment. When the number of those documents
> reaches MaxMergeDocs, the optimizer will merge this segment together
> with the other ones. The optimizer will not merge segments which
> contain fewer than MaxMergeDocs documents. Is that right? Or did I
> misunderstand the MaxMergeDocs variable?

1. The addDocument() function adds one document to the index.

2. If addDocument() is called only once during script execution, then a
new segment (with only one document) will be written down. Automatic
index optimization may merge this segment with others.

3. If addDocument() is called several times, then all added documents
are stored in memory (index segment files are not updatable by their
nature) until the number of added documents reaches MaxBufferedDocs.
At that moment all buffered documents are flushed into a new segment.
Then automatic merging may also be performed.

4. Automatic optimization doesn't merge segments if the target segment
would be greater than MaxMergeDocs.
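Points 2 and 3 can be sketched like this. A Python sketch of the
buffering behaviour described, not the PHP implementation; `IndexSketch`
is a hypothetical stand-in for the real index object:

```python
# Sketch of the document buffering described above: documents accumulate
# in memory and are flushed into a new (immutable) segment once the
# buffer reaches MAX_BUFFERED_DOCS, or at commit time.

MAX_BUFFERED_DOCS = 10

class IndexSketch:
    def __init__(self):
        self.segments = []   # each segment is a frozen batch of docs
        self.buffer = []     # documents not yet written down

    def add_document(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= MAX_BUFFERED_DOCS:
            self.commit()    # automatic flush into a new segment

    def commit(self):
        if self.buffer:
            self.segments.append(tuple(self.buffer))  # segments never change
            self.buffer = []

idx = IndexSketch()
for n in range(25):
    idx.add_document(f"doc{n}")
# 25 adds -> two full segments flushed, 5 docs still buffered:
print(len(idx.segments), len(idx.buffer))  # -> 2 5
idx.commit()                               # explicit commit flushes the rest
print(len(idx.segments), len(idx.buffer))  # -> 3 0
```

This also illustrates the commit() discussion below: an explicit
commit() (or destroying the index object) just flushes whatever is
still buffered.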


>  >> 2. Is the optimize process automatic?
> 
>  >Yes. But you may call Zend_Search_Lucene::optimize() to perform full
>  >index optimization. It doesn't use MaxMergeDocs and merges all segments
>  >into one.
> 
>  >> 3. When must I use commit? Only after delete operations? Or must I
>  >> use it after add operations as well? Is it an automatic process?
> 
>  >It's not necessary to use commit() now.
>  >But you may use it if you want to be sure that all changes are written
>  >down at the point of the commit() call.
>  >unset($index) (where $index is a Zend_Search_Lucene object) has the same
>  >effect.
> 
> When does the automatic commit() start? Only at the end of the script?
> Or maybe after MaxBufferedDocs is reached?

Yes. At every MaxBufferedDocs-th document and at the end of the script.

> Where are the documents stored before I call commit function? In the memory?

Yes, in memory. Documents can't be written down earlier without
generating a new segment (which is not updatable).


>  >> What happens after MaxBufferedDocs is reached?
> 
>  >>1. All added documents are written down into a new segment.
>  >>2. The automatic optimization process may start.
> 
> 
>  >> 4. I didn't find details about Zend_Search performance anywhere. Any
>  >> benchmarks? Can you estimate how it will behave with 2 million
>  >> documents? Each document will have a maximum length of 400 characters
>  >> (this text is indexed).
> 
>  >There are no official benchmarks.
>  >I can only say that performance was one of the first goals for
>  >Zend_Search_Lucene. I can also say that it's comparable with Java Lucene.
> 
>  >The behavior strongly depends on the index contents and query types.
>  >Do you have any idea about terms selectivity?


With best regards,
    Alexander Veremyev.

