On 04/15/2010 01:50 PM, Earwin Burrfoot wrote:
First, the index format. IMHO, it is a good thing for a major release to be
able to read the prior major release's index. And the ability to convert it
to the current format via optimize is also good. Whatever is decided on this
thread should take this seriously.
Optimize is a bad way to convert to the current format.
1. Conversion is not guaranteed: optimizing an already-optimized index is a no-op.
2. It merges all your segments. If you use BalancedSegmentMergePolicy,
that destroys your segment size distribution.
A dedicated upgrade tool (available both from the command line and
programmatically) is a good way to convert to the current format.
1. Conversion happens exactly when you need it, and it happens for sure;
no additional checks are needed.
2. It should leave all your segments as-is, only changing their format.
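To make the "format-only rewrite" idea concrete, here is a minimal sketch of such a tool. Everything in it is invented for illustration (Lucene had no such tool in its API at the time, and real segments are not one-line-header text files); the point is only that each segment is converted in place, unconditionally, without merging:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/**
 * Hypothetical per-segment index upgrader. Segments are modeled as plain
 * ".seg" files whose first line is a format version; a real tool would
 * rewrite Lucene's actual on-disk segment format instead.
 */
public class IndexUpgraderSketch {
    static final String CURRENT_FORMAT = "4";

    /** Rewrite every stale segment to the current format; the segment layout is untouched. */
    public static int upgrade(Path indexDir) throws IOException {
        List<Path> segments;
        try (Stream<Path> files = Files.list(indexDir)) {
            segments = files.filter(p -> p.getFileName().toString().endsWith(".seg"))
                            .collect(Collectors.toList());
        }
        int converted = 0;
        for (Path seg : segments) {
            List<String> lines = Files.readAllLines(seg);
            if (!lines.isEmpty() && !CURRENT_FORMAT.equals(lines.get(0))) {
                lines.set(0, CURRENT_FORMAT); // change only the format header
                Files.write(seg, lines);      // segment contents stay as-is, no merging
                converted++;
            }
        }
        // Conversion is unconditional, so there is no "already optimized" no-op trap.
        return converted;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("idx");
        Files.write(dir.resolve("_0.seg"), List.of("3", "doc1"));
        Files.write(dir.resolve("_1.seg"), List.of("3", "doc2"));
        System.out.println(upgrade(dir)); // prints 2
    }
}
```

Note the contrast with optimize: the number of segments before and after is identical, so a BalancedSegmentMergePolicy size distribution survives the upgrade.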
It is my observation, though possibly not correct, that core has only
rudimentary analysis capabilities, handling English very well. To handle
other languages well, "contrib/analyzers" is required. Until recently it did
not get much love, and there have been many backward-compatibility-breaking
changes (though with Version one can probably get the prior behavior). IMHO,
most of contrib/analyzers should be in core. My guess is that most
non-trivial applications will use contrib/analyzers.
I counter: most non-trivial applications will use their own analyzers.
The more modules, the merrier. You can choose precisely what you need.
By and large, an analyzer is a simple wrapper for a tokenizer and some
filters. Are you suggesting that most non-trivial apps write their own
tokenizers and filters?
I'd find that hard to believe. For example, I don't know enough Chinese,
Farsi, Arabic, Polish, ... to come up with anything better than what
Lucene has to tokenize, stem, or filter these languages.
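The "wrapper" point can be made concrete with a toy analyzer. This sketch deliberately uses plain Java lists rather than Lucene's actual streaming Analyzer/Tokenizer/TokenFilter API, and the class and method names are invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/**
 * Toy illustration of "an analyzer is a tokenizer plus some filters".
 * Lucene's real API streams tokens; this sketch uses lists for brevity.
 */
public class SimpleAnalyzer {
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "of");

    /** Tokenizer: split on whitespace. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Analyzer: wrap the tokenizer with a lowercase filter and a stop-word filter. */
    public static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            String lowered = token.toLowerCase(Locale.ROOT); // lowercase filter
            if (!STOP_WORDS.contains(lowered)) {             // stop-word filter
                out.add(lowered);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick Brown Fox")); // prints [quick, brown, fox]
    }
}
```

The hard, language-specific work (Chinese segmentation, Arabic or Polish stemming) lives in the tokenizer and filters, which is exactly the part most apps would take from contrib/analyzers rather than write themselves.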
Our user base includes people with ancient, underpowered laptops in
third-world countries. On those machines it might take 10 minutes to create
an index, and during that time the machine is fairly unresponsive. There is
no opportunity to "do it in the background."
Major Lucene releases (feature-wise, not version-wise) happen roughly
once a year, or once every year and a half.
Is it that hard for your users to wait ten minutes once a year?
I said that was for one index. Multiply that by the number of books
available (300+) and yes, it is too much to ask. Even if only a small subset
is indexed, say 30, that's 30 × 10 minutes, around 5 hours of waiting.
Also under consideration is the frequency of breakage. Some are suggesting
breaking more often than yearly.
DM
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org