Re: Lucene's default settings & back compatibility

Grant Ingersoll Tue, 19 May 2009 05:56:46 -0700


On May 19, 2009, at 8:19 AM, Michael McCandless wrote:

On Tue, May 19, 2009 at 7:26 AM, Grant Ingersoll<[email protected]> wrote:
I don't think we have said that bug fixes are required to be back
compatible, even if it is in analysis. I think it is a really bad idea
for TokenStreams to have if clauses in them checking boolean values for
old versus new behaviors.
When they can be back compat, we do, but there is not a requirement. For
instance, we upgraded Snowball.
True (Snowball), but then we have discussions like this:
https://issues.apache.org/jira/browse/LUCENE-1068?focusedCommentId=12550948&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12550948
which added a confusing deprecated "boolean replaceDepAcronym =
false;" to StandardAnalyzer.  Something similar led to
StandardAnalyzer.replaceInvalidAcronym.

I think there have been other cases (in particular StandardAnalyzer,
QueryParser) over time, but I haven't tracked them down.  Analyzer
back compat after fixing issues is especially tricky since the bugs
get "cached" into the index and queries against that index using the
fixed analyzer may no longer match the docs.  (So I think back-compat
is important in Analyzers).
Or, the removal of StopFilter as "Standard" altogether. This, coupled with
a QP that created phrases around stop words, is a better solution.
Interesting... that'd be a pretty big change to StandardAnalyzer,
though.

I can see we are spinning off lots of neat ideas, decoupled from the
"Settings" proposal, here :)
For instance, if we removed the StopFilter from the StandardAnalyzer, then
what?  A Settings object would not be able to account for it.
Why not?  The settings object could have say a property
"analysis.standard.enableStopFilter"?

And what if it is something that has to be called in the next() chain and not during construction? Are you going to want to call that every single time over millions upon millions of tokens in a large collection? Even if it is during construction, you still might end up calling it a lot of times.
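
To make the distinction concrete, here is a rough sketch in Java. It is not Lucene code: SketchSettings and StopWordStream are made-up names, the stream is a simplified stand-in for the real TokenStream API, and only the key "analysis.standard.enableStopFilter" comes from this thread. The setting is resolved once in the constructor, so the per-token path never touches the settings object, but every newly constructed stream still pays for one lookup.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

class StopWordStreamDemo {
    public static void main(String[] args) {
        SketchSettings settings = new SketchSettings()
                .set("analysis.standard.enableStopFilter", false);
        // One stream per document: one lookup per stream, none per token.
        for (int doc = 0; doc < 3; doc++) {
            StopWordStream ts = new StopWordStream(
                    Arrays.asList("the", "quick", "fox").iterator(), settings);
            StringBuilder sb = new StringBuilder();
            for (String t = ts.next(); t != null; t = ts.next()) {
                sb.append(t).append(' ');
            }
            System.out.println(sb.toString().trim());
        }
        System.out.println("lookups = " + settings.lookups);
    }
}

// Hypothetical stand-in for the proposed Settings object; not a Lucene class.
class SketchSettings {
    private final Map<String, Boolean> flags = new HashMap<>();
    int lookups = 0; // counts lookups so the cost is visible

    SketchSettings set(String key, boolean value) {
        flags.put(key, value);
        return this;
    }

    boolean getBoolean(String key, boolean defaultValue) {
        lookups++;
        Boolean v = flags.get(key);
        return v == null ? defaultValue : v;
    }
}

// Simplified token stream (not the real TokenStream API): drops stop words
// only when the setting enables it.
class StopWordStream {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    private final Iterator<String> input;
    private final boolean stopFilterEnabled; // resolved once, at construction

    StopWordStream(Iterator<String> input, SketchSettings settings) {
        this.input = input;
        this.stopFilterEnabled =
                settings.getBoolean("analysis.standard.enableStopFilter", true);
    }

    // Returns the next token, or null at end of input. The per-token path
    // reads a final field and never consults the settings object.
    String next() {
        while (input.hasNext()) {
            String token = input.next();
            if (stopFilterEnabled && STOP_WORDS.contains(token)) {
                continue;
            }
            return token;
        }
        return null;
    }
}
```

The lookup count grows with the number of streams created rather than the number of tokens, which is the cheaper of the two cases but still not free when a stream is built per field per document.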

Likewise, the subtler issue of "fixing" a TokenStream such that it
might produce different tokens.
Settings should cover this in general, I think.
I really worry about Settings objects having to be repeatedly checked inside of tight inner loops. Even looking at the new TokenStream stuff, there are now checks for the "new API" in an area that is called _a lot_ of times.
Agreed, but I'd say this is orthogonal.  We should never do slow
things inside inner loops -- checking settings, calling logging
frameworks, calling List.size(), opening files, etc.  This is the
stuff of standard coding practices...

There's a difference between std. coding practices and purposefully putting in lots of if checks to solve back compatibility issues that are created in order to satisfy some naming convention. Given the length of time between releases, we could easily call every new release a major version and we wouldn't be all that different from most commercial projects. I'd bet if we switched from calling things major.minor and just called them Lucene '09 and Lucene '10 people would be just fine with the changes.

I've said it before and I'll say it again. Given the time between Lucene releases (at least 6 mos. for minor releases and 1+ year for majors) we have _PLENTY_ of time to let users know what is coming and plan accordingly. By being so dogmatic about back compatibility, I believe we are making it harder to innovate and harder for new people to contribute, and we keep cruft around for way too long. (How the heck is a new contributor supposed to keep track of all the things that went into Lucene for the past 1.5 years?) I'm not saying we should throw back compat. out the window, I'm just saying we should take it more on a case by case basis, with the default, obviously, being to favor back compatibility. The large majority of users (I'd venture to say well north of 95% of them) will be able to deal with minor API changes every 6 to 8 months, especially if we are more proactive about communicating them to java-user@ and in CHANGES. In fact, if we announced breaking changes not for the next version, but for the one after, it would give people lots of time to adapt.

Last, and mostly I mention it as an afterthought. How are you going to
handle changes to the Settings?  Say, for instance, we come out w/
Settings2.4, release it and then we realize we missed something (and this
seems likely given the number of settings available in Lucene), then
what?
We deprecate Settings2.4 and come out with TheRealSettingsFor2.4? And then
when that is incomplete?


Well, in 2.9 there would still be a Settings2.4 class, but it'd have
newly created (in 2.9) settings with their defaults bound.

So in 2.9, when sorting by field you can optionally turn off scoring.
It gives a sizable performance boost doing so.  We of course were
forced to leave scoring on for back compat, but if we had this
Settings class online what we would have done instead is add a new
"search.sort.trackScores" (and, "trackMaxScore") setting to the base
Settings class, but the Settings2.4 would bind it to true.

There should be no need to make a new class for 2.4's settings on
releasing 2.9?
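
A minimal sketch of that binding, in Java and under stated assumptions: Settings and Settings24 are hypothetical classes (nothing like them exists in Lucene), the full key "search.sort.trackMaxScore" is extrapolated from the names above, and the base class is assumed to default both settings to false, i.e. the faster 2.9 behaviour.

```java
import java.util.HashMap;
import java.util.Map;

class SettingsDemo {
    public static void main(String[] args) {
        Settings latest = new Settings();
        Settings compat = new Settings24();
        System.out.println("latest trackScores = "
                + latest.getBoolean("search.sort.trackScores"));
        System.out.println("2.4    trackScores = "
                + compat.getBoolean("search.sort.trackScores"));
    }
}

// Hypothetical base class: holds the defaults of the newest release.
class Settings {
    protected final Map<String, Boolean> values = new HashMap<>();

    Settings() {
        // Assumed 2.9 defaults: skip scoring when sorting by field.
        values.put("search.sort.trackScores", false);
        values.put("search.sort.trackMaxScore", false);
    }

    boolean getBoolean(String key) {
        Boolean v = values.get(key);
        if (v == null) {
            throw new IllegalArgumentException("unknown setting: " + key);
        }
        return v;
    }
}

// Hypothetical frozen view of the 2.4 behaviour. Settings added in later
// releases are bound here to whatever 2.4 effectively did.
class Settings24 extends Settings {
    Settings24() {
        values.put("search.sort.trackScores", true);
        values.put("search.sort.trackMaxScore", true);
    }
}
```

In this shape a new release adds a key to the base class and one binding line to each older settings class, rather than creating a new class per release.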

I think you missed the point. The problem lies in releasing 2.4's settings and those settings are wrong. Using your example, say Settings24 was messed up and set trackMaxScore to true when it should have been false (mistakes happen). It gets released in 2.9 as the settings for 2.4 back compatibility. We then realize our mistake. How do you fix it? You can't just set it to false, b/c now you have users who are depending, potentially, on the _wrong_ version. So, now you have to deprecate it and come out with a "new" Settings2.4 called something else.

I still think we would benefit from just communicating upcoming changes
better even in minor releases, thereby allowing for a bit more variance in
back compat.  It should be the exception, not the rule.
I like DM's point, that this Settings class would be a great vehicle
for exactly that communication.  Rather than poring over a
CHANGES.txt, you can see setting-by-setting what changed, and why.

Sorry, I'd rather read CHANGES. It is the one place we all make sure to enter our changes. People aren't as good about javadocs, especially accessors where the name is "self explanatory". Plus it has a link to a JIRA issue.

Also, how useful is it going to be to have 30 or 40 (hundreds?) accessors on a single Settings object? So, then, the logical thing to do is to split it up and have some nested way of doing things. And then people will be tired of having to programmatically set all the values, so they will create a config/properties file that does it. But, because we don't like dependencies, we will re-invent how that works. After it's all said and done, you end up having re-invented IOC.

Another interesting thing to think about is how do we sunset old settings objects. When we are on 4.X, should we still keep around 2.4 settings? Not really something we necessarily need to solve right now.



-Grant


