Re: Lucene's default settings & back compatibility

Grant Ingersoll Tue, 19 May 2009 04:26:45 -0700

I like the idea, some thoughts below.

On May 18, 2009, at 5:06 PM, Michael McCandless wrote:

As we all know, Lucene's back-compat policy necessarily hurts the
out-of-the-box experience for new users: because we are only allowed to
make substantial improvements to Lucene's default settings at a major
release, new users won't see the improvements to our settings until a
major release (typically years apart).

Lucene has a number of default settings, eg some recent examples:

 * Read-only IndexReader gives much better performance with threads,
   yet we must now default IndexReader.open to return a non-readOnly
   reader

 * We can now optionally turn off scoring when sorting by field
   (sizable speed gain), but we had to leave it on by default until
   3.0

 * Letting IndexReader.norms return null

 * LogMergePolicy now takes deletions into account, but we had to
   disable it by default, since it could conceivably break back
   compat.

 * Bug fixes in StandardAnalyzer must be delayed until 3.0 since
   there's a remote chance they'd break back compat in an app, or we
   end up adding confusing methods like "public static void
   setDefaultReplaceInvalidAcronym".

I don't think we have said that bug fixes are required to be back compatible, even if it is in analysis. I think it is a really bad idea for TokenStreams to have if clauses in them checking boolean values for old versus new behaviors.

When they can be back compat, we do, but there is not a requirement. For instance, we upgraded Snowball.



 * NIOFSDirectory ought to be "the default" on UNIX, but it's not

 * Constant score rewrite ought to be the default for most multi-term
   queries

 * StopFilter should enable position increments by default

Or, the removal of StopFilter as "Standard" altogether. This, coupled with a QP that created phrases around stop words, is a better solution.

The fact that we are "forced" to delay such "out of the box" improvements
to Lucene for so long is a frustrating cost, since it can only stunt
Lucene's adoption and growth and my sense is that it's a minority of
Lucene's users that need such strict back-compat (this has been
discussed before).  It also clutters our APIs because we end up
creating setter/getters that often only exist for the sake of a back
compat preservation of a bug.

I think we can fix this.  Ie, maintain our strong back-compat policy,
yet still allow new users to experience the best of Lucene on every
release (not just on major releases), by creating an explicit class
that holds settings/defaults used by Lucene.

For example, say we create a base class named Settings.  It holds the
defaults for settings across all of Lucene's classes. When you create
IndexReader, IndexWriter and others, you must pass in a Settings
instance.

A subclass, SettingsMatching24, binds all settings to "match" 2.4's
behavior.  When we make improvements in 2.9, we'd add the back-compat
settings to SettingsMatching24.  So if your app wants to keep exactly
2.4's behavior, you'd pass in SettingsMatching24().  On upgrading to
2.9 you'd still see 2.4's behavior.

Users who'd like to see Lucene's improvements on each minor release
would instead instantiate LatestAndGreatestSettings() (or
CurrentVersionSettings(), or something), understanding that when they
upgrade there might be biggish changes to Lucene's defaults.  My guess
is most users would use this settings class.
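A minimal sketch of the proposal as described above, in Java. Only the class names Settings, SettingsMatching24 and LatestAndGreatestSettings come from the text; the method names and the SketchReader stand-in are assumptions made up for illustration and are not Lucene APIs. The three settings shown are taken from the examples listed earlier (read-only readers, scoring when sorting by field, StopFilter position increments).

```java
// Hypothetical sketch: none of these classes or methods exist in Lucene.

// Base class: holds the defaults; here they track the newest behavior.
class Settings {
    boolean readOnlyReaderByDefault() { return true; }
    boolean scoreWhenSortingByField() { return false; }
    boolean stopFilterEnablesPositionIncrements() { return true; }
}

// Pins every setting to what 2.4 did. Each later improvement would add
// one more override here so that upgrading does not change behavior.
class SettingsMatching24 extends Settings {
    @Override boolean readOnlyReaderByDefault() { return false; }
    @Override boolean scoreWhenSortingByField() { return true; }
    @Override boolean stopFilterEnablesPositionIncrements() { return false; }
}

// Inherits whatever the current release considers best.
class LatestAndGreatestSettings extends Settings { }

// Stand-in for a class such as IndexReader that must be given a Settings.
class SketchReader {
    final boolean readOnly;

    SketchReader(Settings settings) {
        this.readOnly = settings.readOnlyReaderByDefault();
    }
}

class SettingsDemo {
    public static void main(String[] args) {
        SketchReader pinned = new SketchReader(new SettingsMatching24());
        SketchReader latest = new SketchReader(new LatestAndGreatestSettings());
        System.out.println("2.4-compatible reader readOnly: " + pinned.readOnly);
        System.out.println("latest reader readOnly: " + latest.readOnly);
    }
}
```

In this sketch an application that passes SettingsMatching24 keeps a non-read-only reader across upgrades, while one that passes LatestAndGreatestSettings picks up the new default as soon as the base class changes.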

Doug actually suggested this exact idea a while back:

 http://www.gossamer-threads.com/lists/lucene/java-dev/54421#54421.

Now that I realize we could use this to strongly decouple "users
wanting precise back-compat" from "users wanting the latest &
greatest", I think it's a very compelling solution.

If we do this I'd like to do it in 2.9, so that starting with 3.x we
are free to change default settings w/o breaking back compat.

Thoughts?

For instance, if we removed the StopFilter from the StandardAnalyzer, then what? A Settings object would not be able to account for it. Likewise, the subtler issue of "fixing" a TokenStream such that it might produce different tokens.

I really worry about Settings objects having to be repeatedly checked inside of tight inner loops. Even looking at the new TokenStream stuff, there are now checks for the "new API" in an area that is called _a lot_ of times.
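To make the inner-loop concern concrete, here is a small Java sketch contrasting a filter that consults a settings object on every token with one that reads the setting once at construction. Every name in it (LoopSettings, lowercaseTokens, both filter classes) is hypothetical and invented for illustration; it is not Lucene code, and the read-once variant is just one possible mitigation, not something proposed in this thread.

```java
import java.util.Locale;

// Hypothetical settings holder that counts how often it is consulted.
class LoopSettings {
    int lookups = 0;

    boolean lowercaseTokens() {
        lookups++;
        return true;
    }
}

// Consults the settings object on every token: one lookup per call.
class PerTokenCheckFilter {
    private final LoopSettings settings;

    PerTokenCheckFilter(LoopSettings settings) {
        this.settings = settings;
    }

    String next(String token) {
        return settings.lowercaseTokens() ? token.toLowerCase(Locale.ROOT) : token;
    }
}

// Reads the setting once in the constructor and keeps it in a final field.
class ResolvedOnceFilter {
    private final boolean lowercase;

    ResolvedOnceFilter(LoopSettings settings) {
        this.lowercase = settings.lowercaseTokens();
    }

    String next(String token) {
        return lowercase ? token.toLowerCase(Locale.ROOT) : token;
    }
}

class LoopDemo {
    public static void main(String[] args) {
        String[] tokens = {"Lucene", "Back", "Compat"};

        LoopSettings perToken = new LoopSettings();
        PerTokenCheckFilter checked = new PerTokenCheckFilter(perToken);
        for (String t : tokens) checked.next(t);

        LoopSettings once = new LoopSettings();
        ResolvedOnceFilter resolved = new ResolvedOnceFilter(once);
        for (String t : tokens) resolved.next(t);

        System.out.println("per-token lookups: " + perToken.lookups);
        System.out.println("resolved-once lookups: " + once.lookups);
    }
}
```

Both filters produce the same tokens; the difference is only how many times the settings object is touched. Reading once removes the lookup but still leaves a branch on the stored flag in the hot path.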

Last, and mostly I mention it as an afterthought. How are you going to handle changes to the Settings? Say, for instance, we come out w/ Settings2.4, release it and then we realize we missed something (and this seems likely given the number of settings available in Lucene), then what? We deprecate Settings2.4 and come out with TheRealSettingsFor2.4? And then when that is incomplete?

I still think we would benefit from just communicating upcoming changes better even in minor releases, thereby allowing for a bit more variance in back compat. It should be the exception, not the rule.


Still, I think this is worth pursuing.

-Grant


