Thanks for this detailed answer.

On Sat, Sep 10, 2016 at 3:24 AM Uwe Schindler <u...@thetaphi.de> wrote:
> Hello Alexandre,
>
> > I can't see a reason why it should be different, but:
> >
> > This works:
> > <fieldType name="text_basic" class="solr.TextField">
> >   <analyzer>
> >     <tokenizer class="solr.LowerCaseTokenizerFactory" />
> >   </analyzer>
> > </fieldType>
> >
> > This does not:
> > <fieldType name="text_basic" class="solr.TextField">
> >   <analyzer class="solr.SimpleAnalyzer"/>
> > </fieldType>
> >
> > This does work again:
> > <fieldType name="text_basic" class="solr.TextField">
> >   <analyzer class="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
> > </fieldType>
> >
> > Both LowerCaseTokenizerFactory and SimpleAnalyzer are in the same
> > package.
> >
> > Is this a bug or some sort of legacy decision?
>
> There is a long history behind that and there is also a *fundamental*
> difference between the factories used for building custom analyzers in XML
> code and just referring to an Analyzer!
>
> Let me start with some history: From the very beginning there was the
> concept of factories in Solr, so implementation classes are initialized
> from a map of properties given in the XML. Those factories were specified
> by Java binary class name ("org.apache.solr.foo.bar.MyFactory"). This is
> used in many places in Solr. The problem is that those class names could be
> quite long, so the SolrResourceLoader has a "hack" to allow short names
> (which, IMHO, was a horrible decision). When it sees a class name starting
> with "solr.", it tries to look up different possibilities. See code here:
> https://goo.gl/P24ZU3 (subpackages is generally a list like
> "o.a.solr.something",...).
>
> In the early days (before Lucene/Solr 4.0), those factories were *all*
> part of Solr, so the lookup with the "solr." short name prefix was easy and
> the subpackages list was short. So it "just worked" and many people had
> those class names in their config files.
>
> The Analyzers (2nd example) were always referred to by their full name,
> because they were part of Lucene and not Solr. Using a "solr." short name
> was never possible because of that.
>
> Now a change in 4.0 comes into the game: To make the concept of building
> "custom" analyzers easier to use for non-Solr users, and to make the whole
> concept easier to maintain, the factories for tokenstream components were
> moved out of Solr into Lucene
> (https://issues.apache.org/jira/browse/LUCENE-2510). The analysis parts
> got new package names below the Lucene namespace. The effect of this would
> have been that everyone had to change their config files, because the
> "solr." shortcut won't work with Lucene classes.
>
> Now you might ask why the "solr." prefix still works? The reason is a
> second fundamental change with Lucene 4. We no longer use class names in
> Lucene to refer to stuff like Codecs, PostingFormats - we use the Java
> concept of SPI. All components get a name, and the implementation class is
> not exposed to the outside. Like with Codecs, where you use
> Codec.forName("Lucene70") to instantiate it, the same was done for
> TokenStream components. This now allows creating StandardTokenizerFactory
> using the following code: TokenizerFactory.forName("standard"). Or
> LowercaseFilter with TokenFilterFactory.forName("lowercase"). There is no
> such concept for Analyzers (no SPI) [this explains your original question].
>
> Now we have the two pieces to put together: Refactoring of class names and
> adding of the SPI concept.
> The "correct" fix in Solr would have been to remove the "class="
> attribute from the analysis components in the fieldType and replace it
> with something called "name" or "type", so the XML would look like
> (https://goo.gl/Dr3gpO):
>
> <fieldType name="something" class="solr.TextField">
>   <analyzer>
>     <tokenizer name="whitespace" />
>   </analyzer>
> </fieldType>
>
> This is similar to the examples for the corresponding Lucene class that
> builds Analyzers from those SPI names:
> https://lucene.apache.org/core/6_2_0/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html
>
> The above syntax is wonderful, but again this caused lots of complaints
> from Solr developers, that people are unable to understand this WTF :-) It
> may also have to do with those short names looking more like <add
> competitor's name here> analysis component names... (no idea, although it's
> completely unrelated). The issue with more history is here:
> https://issues.apache.org/jira/browse/LUCENE-4044
>
> Because of that, a second hack was added so all schema.xml files
> worked like before (in LUCENE-4044). This hack is the only way to configure
> tokenstream components up to this day - which is a disaster, IMHO! The hack
> is a fancy regular expression that tries to convert the old
> "solr.FoobarTokenFilterFactory" to the nicely readable "names" like above:
> https://goo.gl/mtWmjm
> The factory is then loaded using SPI: https://goo.gl/EwDtQr
> IMHO, the hack should be deprecated and removed and the new syntax, as
> described above, should be introduced.
>
> Analyzer class names would still be *full* class names (and will for sure
> stay like that, as they are seldom used in Solr). There is no way to
> change that!
>
> Now you have a bit of history and you might see that there is absolutely
> no relationship between the class name / package name and the configured
> "class" in schema.xml. In fact, the thing above cannot be fixed.
> Instead, the issue mentioned before should finally be fixed: the "class"
> attribute in token stream components should be deprecated and removed, and
> the above "name" (or maybe "type") syntax should be used.
>
> Uwe

-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
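
P.S. For anyone reading this in the archive, here is a rough sketch of the kind of conversion Uwe describes, where a legacy "solr.FoobarTokenFilterFactory" value is turned into a short SPI-style name. It is only an illustration under assumptions: the class name LegacyNameSketch, the method toSpiName, and the exact pattern are made up for this mail and are not the code behind the links above.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the legacy-name conversion; not the actual Solr code. */
final class LegacyNameSketch {

    // Assumed shape of a legacy value: "solr." + base name + component kind + "Factory".
    private static final Pattern LEGACY = Pattern.compile(
        "solr\\.([\\p{L}_$][\\p{L}\\p{N}_$]+?)(TokenFilter|Filter|Tokenizer|CharFilter)Factory");

    /** Returns the lower-cased base name, or empty if the value is not a legacy factory name. */
    static Optional<String> toSpiName(String className) {
        Matcher m = LEGACY.matcher(className);
        if (!m.matches()) {
            return Optional.empty(); // e.g. an Analyzer: there is no SPI name to derive
        }
        return Optional.of(m.group(1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String[] samples = {
            "solr.LowerCaseTokenizerFactory",
            "solr.FoobarTokenFilterFactory",
            "solr.SimpleAnalyzer",
            "org.apache.lucene.analysis.core.SimpleAnalyzer"
        };
        for (String cls : samples) {
            System.out.println(cls + " -> "
                + toSpiName(cls).orElse("(no SPI name, full class name required)"));
        }
    }
}
```

Under these assumptions the two factory values map to "lowercase" and "foobar", while both SimpleAnalyzer values fall through, which matches the behaviour discussed in this thread: an Analyzer has no factory suffix and no SPI name, so only the full class name can work.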