Hi,
Sebi wrote:
I have some questions related to the Zend Framework Search Lucene. I
want to implement a fast and scalable directory system and I need to
understand some things.
Sounds very interesting!
1. I read that the Zend_Search_Lucene::optimize() function merges all index
segments into a new one. Won't this single segment become too big over time?
A segment can't be "too big". A merged segment always takes less memory and
can be scanned faster than the several "source" segments it replaces.
Segment size is limited to 2 GB on 32-bit platforms.
(http://framework.zend.com/manual/en/zend.search.index-creation.html#zend.search.index-creation.limitations)
Or will the optimize process create more segments, each with a maximum number
of documents (with the MaxMergeDocs variable used in this case as the maximum
number)?
"MaxMergeDocs is a largest number of documents ever merged by addDocument()"
Automatic index optimization (invoked by addDocument()) is an
incremental process. It merges several small segments into a new,
larger one.
When it has accumulated enough of these larger segments, it merges them
in turn, and so on.
MaxMergeDocs guarantees that addDocument() will never run longer
than we want. It is a limit on auto-optimization only.
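To make the auto-optimization settings above more concrete, here is a minimal sketch in PHP. The index path, the field name, the sample strings, and the numeric values passed to the setters are all assumptions picked for illustration, not recommendations.

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Hypothetical index location; adjust to your environment.
$index = Zend_Search_Lucene::create('/tmp/directory-index');

// Number of documents buffered in memory before a new segment is written.
$index->setMaxBufferedDocs(100);
// Controls how often segments are merged by addDocument().
$index->setMergeFactor(10);
// Largest number of documents ever merged by addDocument().
$index->setMaxMergeDocs(100000);

// Hypothetical sample data standing in for real directory entries.
$entries = array('First directory entry', 'Second directory entry');
foreach ($entries as $text) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Text('description', $text));
    // Each call may trigger an incremental merge, bounded by MaxMergeDocs.
    $index->addDocument($doc);
}
```

Lowering MaxMergeDocs should shorten the worst-case addDocument() call, at the cost of leaving more segments for a later full optimization.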
2. Is the optimize process automatic?
Yes. But you may call Zend_Search_Lucene::optimize() to perform a full
index optimization. It ignores MaxMergeDocs and merges all segments
into one.
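A full optimization could be run from a separate maintenance script, for example. This is only a sketch; the index path is an assumption.

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Hypothetical path to an already existing index.
$index = Zend_Search_Lucene::open('/tmp/directory-index');

// Merge all segments into a single one; MaxMergeDocs is not applied here.
$index->optimize();
```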
3. When must I use commit? Only after a delete operation, or after add
operations as well? Is it an automatic process?
It's not necessary to use commit() now.
But you may use it if you want to be sure that all changes are written
down at the point of the commit() call.
unset($index) (where $index is a Zend_Search_Lucene object) has the same
effect.
What happens after MaxBufferedDocs is reached?
1. All added documents are written to a new segment.
2. The automatic optimization process may start.
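The commit() and unset($index) behavior described above might look like this in practice. It is a minimal sketch; the index path, field name, and sample text are assumptions.

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Hypothetical path to an already existing index.
$index = Zend_Search_Lucene::open('/tmp/directory-index');

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('description', 'Another directory entry'));
$index->addDocument($doc);

// Optional: make sure all pending changes are written at this point.
$index->commit();

// Releasing the index object has the same effect as commit().
unset($index);
```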
4. I didn't find any details about Zend Search performance anywhere. Are
there any benchmarks? Can you estimate how it will behave with 2 million
documents? Each document will have a maximum length of 400 characters (this
text is indexed).
There are no official benchmarks.
I can only say that performance was one of the first goals for
Zend_Search_Lucene. I can also say that it's comparable with Java Lucene.
The behavior strongly depends on the index contents and query types.
Do you have any idea about term selectivity?
With best regards,
Alexander Veremyev.