There has been a lot of heated discussion recently about version tracking in Lucene [1] [2]. I wanted to start a fresh discussion outside of JIRA to give a full description of the current state of things, the problems I have heard, and a proposed solution.

CURRENT

We have 2 pieces of code that handle “versioning.”

The first is Constants.LUCENE_MAIN_VERSION, which is written to the SegmentInfo for each segment. This is a string version which is used to detect when the current version of Lucene is newer than the version that wrote the segment (and how/if an upgrade to a newer codec should be done). There is some complication with the “display” version and non-display version, which are distinguished by whether the version of Lucene was an official release or an alpha/beta version (which was added specifically for the 4.0.0 release ramp up). This string version also has its own parsing and comparison methods.

The second piece of versioning code is in Version.java, which is an enum used by analyzers to maintain backwards compatible behavior given a specific version of Lucene. The enum only contains values for dot releases of Lucene, not bug fix releases (which was what spurred the recent discussions over version). Analyzers’ constructors take a required Version parameter, which is only actually used by the few analyzers that have changed behavior recently. Version.java contains its own, separate version parsing and comparison methods.

CONCERNS

* Having 2 different pieces of code that do very similar things is confusing for development. Very few developers appear to really understand the current system (especially the alpha/beta setup).
* Users are generally confused by the Version passed to analyzers: I know I was when I first started working with Lucene, and Version.LUCENE_CURRENT was deprecated because users used it without understanding the implications.
* Most analyzers currently have dead code in their constructors, since they never make use of Version. There are also a lot of classes used by analyzers which contain similar dead code.
* Backwards compatibility needs to be handled in some fashion, to ensure users have a path to upgrade from one version of Lucene to another without requiring immediate re-indexing.

PROPOSAL

I propose the following:

* Consolidate all version-related enumeration, including reading and writing string versions, into Version.java. Have a static method that returns the current Lucene version (replacing Constants.LUCENE_MAIN_VERSION).
* Make bug fix releases first-class in the enumeration, so that they can be distinguished for any compatibility issues that come up.
* Remove all snapshot/alpha/beta versioning logic. Alpha/beta was really only necessary for 4.0 because of the extreme changes that were being made. The system is much more stable now, and 5.0 should not require preview releases, IMO. I don’t think snapshots should be a concern, because any user building an index from an unreleased build (which they built themselves) is just asking for trouble. They do so at their own risk (of figuring out how to upgrade their indexes if they are not trash-able). Backwards compatibility can be handled by adding the alpha/beta/final versions of 4.0 to the enum (with special parsing logic for them). If Lucene changes so much that we need alpha/beta type discrimination in the future, we can revisit the system if simply having extra versions in the enum won't work.
* Analyzer constructors should have Version removed, and a setter should be added which allows production users to set the version used. This way any analyzer can still use the version if it is set to something other than current (which would be the default), but users who are simply prototyping do not need to worry about it.
* Classes that analyzers use, which take Version, should have Version removed, and the analyzers should choose which settings/variants of those classes to use based on the version they have set. In other words, all version variant logic should be contained within the analyzers.
  For example, there can be a Lucene47WordDelimiterFilter, or StandardAnalyzer can take the Unicode version. Factories could still take Version (e.g. TokenizerFactory, TokenFilterFactory, etc.) to produce the correct component (so nothing will change for Solr in this regard).

I’m sure not everyone will be happy with what I have proposed, but I’m hoping we can work out a solution together and then implement it as a team, the way I have seen the community work in the past and hope to see again in the future.

Thanks
Ryan

[1] https://issues.apache.org/jira/browse/LUCENE-5850
[2] https://issues.apache.org/jira/browse/LUCENE-5859
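
To make the PROPOSAL section more concrete, here is a rough sketch of what a consolidated version enum and a setter-based analyzer might look like. Everything in it is an assumption made for illustration: SketchVersion, SketchAnalyzer, the constant names, the "4.0.0-ALPHA" string format, and the 4.8 threshold are hypothetical and are not the existing Version.java or any real Lucene API.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical stand-in for a consolidated Version.java (illustrative names only).
enum SketchVersion {
    // 4.0 pre-releases kept as plain constants so old indexes can still be identified.
    V_4_0_0_ALPHA(4, 0, 0, "ALPHA"),
    V_4_0_0_BETA(4, 0, 0, "BETA"),
    V_4_0_0(4, 0, 0, ""),
    V_4_8_0(4, 8, 0, ""),
    V_4_9_0(4, 9, 0, ""),
    // Bug fix releases are first-class constants.
    V_4_9_1(4, 9, 1, "");

    final int major;
    final int minor;
    final int bugfix;
    final String qualifier;

    SketchVersion(int major, int minor, int bugfix, String qualifier) {
        this.major = major;
        this.minor = minor;
        this.bugfix = bugfix;
        this.qualifier = qualifier;
    }

    // Single source of truth for "the version of this build"
    // (the role Constants.LUCENE_MAIN_VERSION plays today).
    static SketchVersion current() {
        return V_4_9_1;
    }

    // String form that would be written alongside each segment (format is assumed).
    String toVersionString() {
        String base = major + "." + minor + "." + bugfix;
        return qualifier.isEmpty() ? base : base + "-" + qualifier;
    }

    // The one place where string versions are parsed.
    static SketchVersion parse(String s) {
        String wanted = s.trim().toUpperCase(Locale.ROOT);
        for (SketchVersion v : values()) {
            if (v.toVersionString().equals(wanted)) {
                return v;
            }
        }
        throw new IllegalArgumentException("Unknown version: " + s);
    }

    // Constants are declared in release order, so enum ordering doubles as comparison.
    boolean onOrAfter(SketchVersion other) {
        return compareTo(other) >= 0;
    }
}

// Hypothetical analyzer-like class: no Version in the constructor, optional setter instead.
class SketchAnalyzer {
    private SketchVersion matchVersion = SketchVersion.current();

    void setVersion(SketchVersion version) {
        this.matchVersion = Objects.requireNonNull(version);
    }

    // All version-dependent choices live in the analyzer, not in the components it builds.
    // The 4.8 threshold is chosen purely for illustration.
    String componentVariant() {
        return matchVersion.onOrAfter(SketchVersion.V_4_8_0) ? "current" : "legacy";
    }
}
```

With this shape, someone prototyping would just write new SketchAnalyzer() and get current behavior, while a production user with an older index would call setVersion(SketchVersion.parse("4.0.0")) to pin the legacy variant. Relying on declaration order for comparison keeps the sketch short, but it does mean new constants would always have to be added in release order.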