On May 19, 2009, at 8:19 AM, Michael McCandless wrote:

On Tue, May 19, 2009 at 7:26 AM, Grant Ingersoll <gsing...@apache.org> wrote:

I don't think we have said that bug fixes are required to be back
compatible, even if it is in analysis. I think it is a really bad idea for TokenStreams to have if clauses in them checking boolean values for old
versus new behaviors.

When they can be back compat, we do, but there is not a requirement. For
instance, we upgraded Snowball.

True (Snowball), but then we have discussions like this:

https://issues.apache.org/jira/browse/LUCENE-1068?focusedCommentId=12550948&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12550948

which added a confusing deprecated "boolean replaceDepAcronym =
false;" to StandardAnalyzer.  Something similar led to
StandardAnalyzer.replaceInvalidAcronym.

I think there have been other cases (in particular StandardAnalyzer,
QueryParser) over time, but I haven't tracked them down.  Analyzer
back compat after fixing issues is especially tricky since the bugs
get "cached" into the index and queries against that index using the
fixed analyzer may no longer match the docs.  (So I think back-compat
is important in Analyzers).

Or, the removal of StopFilter as "Standard" altogether. This coupled with
a QP that created phrases around stop words is a better solution.

Interesting... that'd be a pretty big change to StandardAnalyzer,
though.

I can see we are spinning off lots of neat ideas, decoupled from the
"Settings" proposal, here :)

For instance, if we removed the StopFilter from the StandardAnalyzer, then
what?  A Settings object would not be able to account for it.

Why not?  The settings object could have say a property
"analysis.standard.enableStopFilter"?

And what if it is something that has to be called in the next() chain and not during construction? Are you going to want to call that every single time over millions upon millions of tokens in a large collection? Even if it is during construction, you still might end up calling it a lot of times.



Likewise, the subtler issue of "fixing" a TokenStream such that it
might produce different tokens.

Settings should cover this in general, I think.

I really worry about Settings objects having to be repeatedly checked inside of tight inner loops. Even looking at the new TokenStream stuff, there are now checks for the "new API" in an area that is called _a lot_ of times.

Agreed, but I'd say this is orthogonal.  We should never do slow
things inside inner loops -- checking settings, calling logging
frameworks, calling List.size(), opening files, etc.  This is the
stuff of standard coding practices...
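
To make the inner-loop concern concrete, here is a minimal sketch of a filter that reads the "analysis.standard.enableStopFilter" property once, at construction, and keeps the result in a final field. Everything in it is hypothetical: SketchStopFilter, the Map-based settings, and the plain-String tokens are stand-ins for illustration and are not Lucene APIs.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; plain Strings stand in for Lucene tokens.
final class SketchStopFilter {
    private final Iterator<String> input;
    private final Set<String> stopWords;
    private final boolean enabled; // resolved once, in the constructor

    SketchStopFilter(Iterator<String> input, Set<String> stopWords,
                     Map<String, String> settings) {
        this.input = input;
        this.stopWords = stopWords;
        // Assumed default: stop filtering stays on unless explicitly disabled.
        this.enabled = Boolean.parseBoolean(
            settings.getOrDefault("analysis.standard.enableStopFilter", "true"));
    }

    // Returns the next surviving token, or null when the input is exhausted.
    String next() {
        while (input.hasNext()) {
            String token = input.next();
            // Per-token cost is a final-field read, not a settings lookup.
            if (!enabled || !stopWords.contains(token)) {
                return token;
            }
        }
        return null;
    }
}
```

Note that the sketch still leaves one branch per token inside next(), so it addresses the cost of the settings lookup but not the objection to having old-versus-new checks on the hot path at all.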

There's a difference between std. coding practices and purposefully putting in lots of if checks to solve back compatibility issues that are created in order to satisfy some naming convention. Given the length of time between releases, we could easily call every new release a major version and we wouldn't be all that different from most commercial projects. I'd bet if we switched from calling things major.minor and just called them Lucene '09 and Lucene '10 people would be just fine with the changes.

I've said it before and I'll say it again. Given the time between Lucene releases (at least 6 mos. for minor releases and 1+ year for majors) we have _PLENTY_ of time to let users know what is coming and plan accordingly. By being so dogmatic about back compatibility, I believe we are making it harder to innovate and harder for new people to contribute and we keep cruft around for way too long. (How the heck is a new contributor supposed to keep track of all the things that went into Lucene for the past 1.5 years?) I'm not saying we should throw back compat. out the window, I'm just saying we should take it more on a case by case basis, with the default, obviously, being to favor back compatibility. The large majority of users (I'd venture to say well north of 95% of them) will be able to deal with minor API changes every 6 to 8 months, especially if we are more proactive about communicating them to java-user@ and in CHANGES. In fact, if we announced breaking changes not for the next version but for the one after, it would give people lots of time to adapt.




Last, and mostly I mention it as an afterthought. How are you going to
handle changes to the Settings?  Say, for instance, we come out w/
Settings2.4, release it and then we realize we missed something (and this
seems likely given the number of settings available in Lucene), then
what?

We deprecate Settings2.4 and come out with TheRealSettingsFor2.4? And then
when that is incomplete?

Well, in 2.9 there would still be a Settings2.4 class, but it'd have
newly created (in 2.9) settings with their defaults bound.

So in 2.9, when sorting by field you can optionally turn off scoring.
It gives a sizable performance boost doing so.  We of course were
forced to leave scoring on for back compat, but if we had this
Settings class online what we would have done instead is add a new
"search.sort.trackScores" (and, "trackMaxScore") setting to the base
Settings class, but the Settings2.4 would bind it to true.

There should be no need to make a new class for 2.4's settings on
releasing 2.9?
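
A rough sketch of the binding described above, with the caveat that none of these classes exist in Lucene. The names Settings and Settings24 follow the thread, the accessors mirror the proposed "search.sort.trackScores" and "trackMaxScore" settings, and the base-class defaults of false are an assumption made only for illustration.

```java
// Hypothetical classes; Lucene has no Settings API like this.
class Settings {
    // Setting added in 2.9. Assumed new default: skip scoring when sorting by field.
    boolean trackScores() {
        return false;
    }

    // Assumed new default: do not track the maximum score either.
    boolean trackMaxScore() {
        return false;
    }
}

// Binds the 2.9 settings to values that reproduce 2.4 behavior.
class Settings24 extends Settings {
    @Override
    boolean trackScores() {
        return true; // 2.4 always scored while sorting
    }

    @Override
    boolean trackMaxScore() {
        return true;
    }
}
```

Under this shape, a caller who wants 2.4 behavior passes a Settings24 instance, and a setting introduced later only needs one more overridden accessor in the existing class.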

I think you missed the point. The problem arises when we release 2.4's settings and those settings turn out to be wrong. Using your example, say Settings24 was messed up and set trackMaxScore to true when it should have been false (mistakes happen). It gets released in 2.9 as the settings for 2.4 back compatibility. We then realize our mistake. How do you fix it? You can't just set it to false, b/c now you have users who are depending, potentially, on the _wrong_ version. So, now you have to deprecate it and come out with a "new" Settings2.4 called something else.




I still think we would benefit from just communicating upcoming changes better even in minor releases, thereby allowing for a bit more variance in
back compat.  It should be the exception, not the rule.

I like DM's point, that this Settings class would be a great vehicle
for exactly that communication.  Rather than poring over a
CHANGES.txt, you can see setting-by-setting what changed, and why.

Sorry, I'd rather read CHANGES. It is the one place we all make sure to enter our changes. People aren't as good about javadocs, especially accessors where the name is "self explanatory". Plus it has a link to a JIRA issue.


Also, how useful is it going to be to have 30 or 40 (hundreds?) accessors on a single Settings object? So, then, the logical thing to do is to split it up and have some nested way of doing things. And then people will be tired of having to programmatically set all the values, so they will create a config/properties file that does it. But, because we don't like dependencies, we will re-invent how that works. After it's all said and done, you end up having re-invented IoC (inversion of control).

Another interesting thing to think about is how do we sunset old settings objects. When we are on 4.X, should we still keep around 2.4 settings? Not really something we necessarily need to solve right now.


-Grant


