I like the idea, some thoughts below.

On May 18, 2009, at 5:06 PM, Michael McCandless wrote:
As we all know, Lucene's back-compat policy necessarily hurts the out-of-the-box experience for new users: because we are only allowed to make substantial improvements to Lucene's default settings at a major release, new users won't see the improvements to our settings until a major release (typically years apart).

Lucene has a number of default settings, eg some recent examples:

* Read-only IndexReader gives much better performance with threads, yet we must now default IndexReader.open to return a non-readOnly reader
* We can now optionally turn off scoring when sorting by field (sizable speed gain), but we had to leave it on by default until 3.0
* Letting IndexReader.norms return null
* LogMergePolicy now takes deletions into account, but we had to disable it by default, since it could conceivably break back compat.
* Bug fixes in StandardAnalyzer must be delayed until 3.0 since there's a remote chance they'd break back compat in an app, or we end up adding confusing methods like "public static void setDefaultReplaceInvalidAcronym".
I don't think we have said that bug fixes are required to be back compatible, even if it is in analysis. I think it is a really bad idea for TokenStreams to have if clauses in them checking boolean values for old versus new behaviors.
When they can be back compat, we do, but there is not a requirement. For instance, we upgraded Snowball.
* NIOFSDirectory ought to be "the default" on UNIX, but it's not
* Constant score rewrite ought to be the default for most multi-term queries
* StopFilter should enable position increments by default
Or, the removal of StopFilter as "Standard" altogether. This, coupled with a QP that created phrases around stop words, is a better solution.
The fact that we are "forced" to delay such "out of the box" improvements to Lucene for so long is a frustrating cost, since it can only stunt Lucene's adoption and growth, and my sense is that it's a minority of Lucene's users that need such strict back-compat (this has been discussed before). It also clutters our APIs because we end up creating setters/getters that often exist only to preserve a bug for the sake of back compat.

I think we can fix this. Ie, maintain our strong back-compat policy, yet still allow new users to experience the best of Lucene on every release (not just on major releases), by creating an explicit class that holds settings/defaults used by Lucene.

For example, say we create a base class named Settings. It holds the defaults for settings across all of Lucene's classes. When you create IndexReader, IndexWriter and others, you must pass in a Settings instance. A subclass, SettingsMatching24, binds all settings to "match" 2.4's behavior. When we make improvements in 2.9, we'd add the back-compat settings to SettingsMatching24.

So if your app wants to keep exactly 2.4's behavior, you'd pass in SettingsMatching24(). On upgrading to 2.9 you'd still see 2.4's behavior.

Users who'd like to see Lucene's improvements on each minor release would instead instantiate LatestAndGreatestSettings() (or CurrentVersionSettings(), or something), understanding that when they upgrade there might be biggish changes to Lucene's defaults. My guess is most users would use this settings class.

Doug actually suggested this exact idea a while back: http://www.gossamer-threads.com/lists/lucene/java-dev/54421#54421. Now that I realize we could use this to strongly decouple "users wanting precise back-compat" from "users wanting the latest & greatest", I think it's a very compelling solution.

If we do this I'd like to do it in 2.9, so that starting with 3.x we are free to change default settings w/o breaking back compat.

Thoughts?
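To make the proposal concrete, here is a minimal sketch in Java. Only the names Settings, SettingsMatching24 and LatestAndGreatestSettings come from the message above; the three accessor methods and the ReaderStub class are hypothetical, chosen to mirror defaults mentioned in this thread, and none of this is actual Lucene API.

```java
// Hypothetical sketch of the proposed Settings hierarchy. Not Lucene API.
abstract class Settings {
    /** Should a newly opened reader be read-only by default? */
    abstract boolean readOnlyReader();

    /** Should scores be computed when sorting by field? */
    abstract boolean scoreWhenSortingByField();

    /** Should the stop filter preserve position increments? */
    abstract boolean stopFilterPositionIncrements();
}

/** Pins every default to the 2.4 behavior described in the thread. */
class SettingsMatching24 extends Settings {
    boolean readOnlyReader() { return false; }
    boolean scoreWhenSortingByField() { return true; }
    boolean stopFilterPositionIncrements() { return false; }
}

/** Tracks whatever the current release considers the best defaults. */
class LatestAndGreatestSettings extends Settings {
    boolean readOnlyReader() { return true; }
    boolean scoreWhenSortingByField() { return false; }
    boolean stopFilterPositionIncrements() { return true; }
}

/** Hypothetical stand-in for a class that must be given a Settings instance. */
class ReaderStub {
    final boolean readOnly;

    ReaderStub(Settings settings) {
        // The default comes from the Settings object, not a hard-coded constant.
        this.readOnly = settings.readOnlyReader();
    }
}

class SettingsDemo {
    public static void main(String[] args) {
        ReaderStub pinned = new ReaderStub(new SettingsMatching24());
        ReaderStub latest = new ReaderStub(new LatestAndGreatestSettings());
        System.out.println("2.4-compatible reader readOnly: " + pinned.readOnly);
        System.out.println("latest reader readOnly: " + latest.readOnly);
    }
}
```

In this sketch an application that wants stable behavior keeps passing SettingsMatching24 across upgrades, while one that wants new defaults passes LatestAndGreatestSettings and accepts that its behavior may change on each release.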
For instance, if we removed the StopFilter from the StandardAnalyzer, then what? A Settings object would not be able to account for it. Likewise, the subtler issue of "fixing" a TokenStream such that it might produce different tokens.
I really worry about Settings objects having to be repeatedly checked inside of tight inner loops. Even looking at the new TokenStream stuff, there are now checks for the "new API" in an area that is called _a lot_ of times.
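One way to limit that cost, offered only as a sketch and not as anything proposed in this thread, is to read the setting once at construction time and keep it in a final field, so the per-token loop never calls back into the settings object. The TokenSettings interface and StopSkipper class below are hypothetical stand-ins, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical settings holder. Not a Lucene class.
interface TokenSettings {
    boolean enablePositionIncrements();
}

// Hypothetical stand-in for a stop filter. Not Lucene's StopFilter.
class StopSkipper {
    private final Set<String> stopWords;
    // Resolved once in the constructor; the loop only reads this field.
    private final boolean enablePositionIncrements;

    StopSkipper(Set<String> stopWords, TokenSettings settings) {
        this.stopWords = stopWords;
        this.enablePositionIncrements = settings.enablePositionIncrements();
    }

    /** Returns the position increment of each token that survives filtering. */
    List<Integer> positionIncrements(List<String> tokens) {
        List<Integer> increments = new ArrayList<>();
        int skipped = 0;
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                skipped++;
                continue;
            }
            // Branch on a final field, not on a settings lookup.
            increments.add(enablePositionIncrements ? 1 + skipped : 1);
            skipped = 0;
        }
        return increments;
    }
}

class StopSkipperDemo {
    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the"));
        List<String> tokens = Arrays.asList("the", "quick", "the", "fox");
        System.out.println(new StopSkipper(stops, () -> true).positionIncrements(tokens));
        System.out.println(new StopSkipper(stops, () -> false).positionIncrements(tokens));
    }
}
```

The per-token branch is still there, so this does not remove the concern entirely; it only keeps the settings object itself out of the inner loop.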
Last, and mostly as an afterthought: how are you going to handle changes to the Settings? Say, for instance, we come out w/ Settings2.4, release it, and then realize we missed something (and this seems likely given the number of settings available in Lucene). Then what? We deprecate Settings2.4 and come out with TheRealSettingsFor2.4? And then when that is incomplete?
I still think we would benefit from just communicating upcoming changes better even in minor releases, thereby allowing for a bit more variance in back compat. It should be the exception, not the rule.
Still, I think this is worth pursuing.

-Grant