Wow Uwe,

Thanks for the treatise. That's an interesting discussion, but I wonder if anything has changed since?
In terms of user confusion/migration, we now have managed schema and can probably rewrite from 'solr.x' to symbol names on first use. That, of course, requires some sort of registry of those names, and I am not sure one exists (apart from my own solr-start.com hacks). But then the registry may well align with some other configuration reporting by the components, and with plugins/library jars.

I am also wondering whether the objection is still valid that other components in Solr (such as search components) are not yet able to move to SPI. I am especially curious whether any of that was affected by Noble's work on having libraries loaded into Solr's special collection. What mechanism is used there to load things?

But yes, I can see it is a big topic. I may just update the documentation and examples to mention that Analyzers have to use the full class name when I get to it.

Regards,
   Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/

On 10 September 2016 at 14:24, Uwe Schindler <u...@thetaphi.de> wrote:
> Hello Alexandre,
>
>> I can't see a reason why it should be different, but:
>>
>> This works:
>> <fieldType name="text_basic" class="solr.TextField">
>>   <analyzer>
>>     <tokenizer class="solr.LowerCaseTokenizerFactory" />
>>   </analyzer>
>> </fieldType>
>>
>> This does not:
>> <fieldType name="text_basic" class="solr.TextField">
>>   <analyzer class="solr.SimpleAnalyzer"/>
>> </fieldType>
>>
>> This does work again:
>> <fieldType name="text_basic" class="solr.TextField">
>>   <analyzer class="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
>> </fieldType>
>>
>> Both LowerCaseTokenizerFactory and SimpleAnalyzer are in the same
>> package.
>>
>> Is this a bug or some sort of legacy decision?
>
> There is a long history behind that and there is also a *fundamental*
> difference between the factories used for building custom analyzers in XML
> code and just referring to an Analyzer!
>
> Let me start with some history: From the very beginning there was the
> concept of factories in Solr, so implementation classes are initialized from
> a map of properties given in the XML. Those factories were specified by Java
> binary class name ("org.apache.solr.foo.bar.MyFactory"). This is used in many
> places in Solr. The problem is that those class names could be quite long, so
> the SolrResourceLoader has a "hack" to allow short names (which, IMHO, was a
> horrible decision). When it sees a class name starting with "solr.", it tries
> to look up different possibilities. See code here: https://goo.gl/P24ZU3
> (subpackages is generally a list like "o.a.solr.something",...).
>
> In the early days (before Lucene/Solr 4.0), those factories were *all* part
> of Solr, so the lookup with the "solr." short name prefix was easy and the
> subpackages list was short. So it "just worked" and many people had those
> class names in their config files.
>
> The Analyzers (2nd example) were always referred to by their full name,
> because they were part of Lucene and not Solr. Using a "solr." short name was
> never ever possible because of that.
>
> Now a change in 4.0 comes into the game: To make the concept of building
> "custom" analyzers easier to use for non-Solr users, and to make the whole
> concept easier to maintain, the factories for tokenstream components were
> moved out of Solr into Lucene
> (https://issues.apache.org/jira/browse/LUCENE-2510). The analysis parts got
> new package names below the Lucene namespace. The effect of this would have
> been that all people have to change their config files, because the "solr."
> shortcut won't work with Lucene classes.
>
> Now you might ask why the "solr." prefix still works? The reason is a second
> fundamental change with Lucene 4. We no longer use class names in Lucene to
> refer to stuff like Codecs, PostingsFormats - we use the Java concept of SPI.
> All components get a name; the implementation class is not exposed to the
> outside. Like with Codecs, where you use Codec.forName("Lucene70") to
> instantiate it, the same was done for TokenStream components. This now allows
> you to create StandardTokenizerFactory using the following code:
> TokenizerFactory.forName("standard"). Or LowerCaseFilterFactory with
> TokenFilterFactory.forName("lowercase"). There is no such concept for
> Analyzers (no SPI) [this explains your original question].
>
> Now we have the two pieces to put together: Refactoring of class names and
> adding of the SPI concept. The "correct" fix in Solr would have been to remove
> the "class=" attribute in the fieldType and replace it by something called
> "name" or "type", so the XML would look like (https://goo.gl/Dr3gpO):
>
> <fieldType name="something" class="solr.TextField">
>   <analyzer>
>     <tokenizer name="whitespace" />
>   </analyzer>
> </fieldType>
>
> This is similar to the examples for the corresponding class that builds
> Analyzers from those SPI names in Lucene:
> https://lucene.apache.org/core/6_2_0/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html
>
> The above syntax is wonderful, but again this caused lots of complaints from
> Solr developers, that people are unable to understand this WTF :-) It may
> also have to do with those short names looking more like <add competitor's
> name here> analysis component names.... (no idea, although it's completely
> unrelated). The issue with more history is here:
> https://issues.apache.org/jira/browse/LUCENE-4044
>
> Because of that a second hack was added so all schema.xml files worked
> like before (in LUCENE-4044). This hack is the only way to configure
> tokenstream components up to this day - which is a disaster, IMHO!
> The hack is a fancy regular expression that tries to convert the old
> "solr.FoobarTokenFilterFactory" to the nicely readable "names" like above:
> https://goo.gl/mtWmjm
> The factory is then loaded using SPI: https://goo.gl/EwDtQr
> IMHO, the hack should be deprecated and removed and the new syntax, as
> described above, should be introduced.
>
> Analyzer class names would still be *full* class names (and will for sure
> stay like that, as they are seldom used in Solr). There is no way to change
> that!
>
> Now you have a bit of history and you might see that there is absolutely no
> relationship between the class name / package name and the configured "class"
> in schema.xml. In fact, the thing above cannot be fixed. Instead, the issue
> mentioned before should finally be fixed: the "class" attribute in token
> stream components should be deprecated and removed, and the above "name" (or
> maybe "type") syntax should be used.
>
> Uwe
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
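
For readers following the thread, here is a rough sketch of the kind of conversion described in the quoted message: a regular expression that turns a legacy "solr.FoobarTokenFilterFactory" value into a short SPI-style name, and that finds nothing to convert for an Analyzer class. This is only an illustration under assumptions: the class name LegacyNameSketch, the method toSpiName, and the pattern itself are made up for this example and are not the code behind the links above.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not Solr's actual implementation. */
class LegacyNameSketch {

    // Assumed pattern: the "solr." prefix, a Java-style identifier, then one
    // of the factory suffixes used by tokenstream components.
    private static final Pattern LEGACY = Pattern.compile(
        "solr\\.([A-Za-z_$][A-Za-z0-9_$]*?)"
            + "(TokenFilter|Filter|Tokenizer|CharFilter)Factory");

    /**
     * Returns the lowercased short name if the value looks like a legacy
     * "solr.XxxFactory" reference, or an empty Optional otherwise.
     */
    static Optional<String> toSpiName(String className) {
        Matcher m = LEGACY.matcher(className);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Factories map to a short name ...
        System.out.println(toSpiName("solr.LowerCaseTokenizerFactory")); // Optional[lowercase]
        System.out.println(toSpiName("solr.FoobarTokenFilterFactory"));  // Optional[foobar]
        // ... but an Analyzer has no factory suffix, so nothing matches.
        System.out.println(toSpiName("solr.SimpleAnalyzer"));            // Optional.empty
    }
}
```

The empty result for "solr.SimpleAnalyzer" mirrors the point made in the thread: there is no SPI name to fall back on for Analyzers, so only the full class name works there.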