To add, the managed schema really makes it easy to "rewrite". My plan would be:
- Add a new "type" or "name" attribute to schema.xml, as an alternative
  to the "class" attribute.
- When a managed schema is loaded, classes are resolved using the hack,
  as is done now. Warnings are printed as said before.
- The managed schema is then changed to switch to the new attribute
  (there is a getter to get the symbolic name from the factory, so
  rewriting is easy).

In addition, this simplifies usage: some GUI could show a dropdown list
for clicking together the analyzer. We just need to add a schema REST
endpoint to get all names.

Maybe open an issue targeted for 6.x / 7.0. I'd be happy to help fix
this, although I could only do the SolrResourceLoader and SolrAnalyzer
stuff.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: Saturday, September 10, 2016 4:03 PM
> To: dev@lucene.apache.org; Alexandre Rafalovitch <arafa...@gmail.com>
> Subject: Re: Is "solr.AnalyzerName" expansion supposed to work for
> Analyzers?
>
> Hi,
>
> The registry is there. To get all symbolic names of analyzer components
> on the classpath, use the XxxFactory.availableXxx() static methods.
>
> I don't think it makes sense to replace all factories in Solr with
> named SPIs. But I'd suggest adding the type or name attribute to
> analysis components and promoting it. The class attribute can still be
> used as now, but logs a warning if it was misused to load an SPI. If it
> refers to a real class, all is fine.
>
> Uwe
>
> On 10 September 2016 15:56:51 CEST, Alexandre Rafalovitch
> <arafa...@gmail.com> wrote:
> >Wow Uwe,
> >
> >Thanks for the treatise. That's an interesting discussion, but I
> >wonder if anything changed since?
> >
> >In terms of user-confusion/migration, we now have managed schema and
> >can probably rewrite from 'solr.x' to symbol names on first use.
> >That, of course, requires some sort of registry of those names, which
> >I am not sure exists (apart from my own solr-start.com hacks). But
> >then the registry may well align with some other configuration
> >reporting by the components. And with plugins/library jars.
> >
> >I am also wondering if the objection is still valid that other
> >components in Solr (such as search components) are still not able to
> >move to SPI? I am especially curious if any of that was affected by
> >Nobble's work on having libraries loaded into Solr's special
> >collection. What is the mechanism used there to load things?
> >
> >But yes, I can see it is a big topic. I may just update the
> >documentation and examples to mention that Analyzers have to use
> >the full name when I get to it.
> >
> >Regards,
> >   Alex.
> >----
> >Newsletter and resources for Solr beginners and intermediates:
> >http://www.solr-start.com/
> >
> >
> >On 10 September 2016 at 14:24, Uwe Schindler <u...@thetaphi.de> wrote:
> >> Hello Alexandre,
> >>
> >>> I can't see a reason why it should be different, but:
> >>>
> >>> This works
> >>> <fieldType name="text_basic" class="solr.TextField">
> >>>   <analyzer>
> >>>     <tokenizer class="solr.LowerCaseTokenizerFactory" />
> >>>   </analyzer>
> >>> </fieldType>
> >>>
> >>> This does not:
> >>> <fieldType name="text_basic" class="solr.TextField">
> >>>   <analyzer class="solr.SimpleAnalyzer"/>
> >>> </fieldType>
> >>>
> >>> This does work again:
> >>> <fieldType name="text_basic" class="solr.TextField">
> >>>   <analyzer class="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
> >>> </fieldType>
> >>>
> >>> Both LowerCaseTokenizerFactory and SimpleAnalyzer are in the same
> >>> package.
> >>>
> >>> Is this a bug or some sort of legacy decision?
> >>
> >> There is a long history behind that and there is also a
> >> *fundamental* difference between the factories used for building
> >> custom analyzers in XML code and just referring to an Analyzer!
> >>
> >> Let me start with some history: From the very beginning there was
> >> the concept of factories in Solr, so implementation classes are
> >> initialized from a map of properties given in the XML. Those
> >> factories were specified by Java binary class name
> >> ("org.apache.solr.foo.bar.MyFactory"). This is used in many places
> >> in Solr. The problem is that those class names could be quite long,
> >> so the SolrResourceLoader has a "hack" to allow short names (which,
> >> IMHO, was a horrible decision). When it sees a class name starting
> >> with "solr.", it tries to look up different possibilities. See code
> >> here: https://goo.gl/P24ZU3 (subpackages is generally a list like
> >> "o.a.solr.something",...).
> >>
> >> In the early days (before Lucene/Solr 4.0), those factories were
> >> *all* part of Solr, so the lookup with the "solr." short name prefix
> >> was easy and the subpackages list was short. So it "just worked" and
> >> many people had those class names in their config files.
> >>
> >> The Analyzers (2nd example) were always referred to by their full
> >> name, because they were part of Lucene and not Solr. Using a "solr."
> >> short name was never possible because of that.
> >>
> >> Now a change in 4.0 comes into play: To make the concept of
> >> building "custom" analyzers easier to use for non-Solr users, and to
> >> make the whole concept easier to maintain, the factories for
> >> tokenstream components were moved out of Solr into Lucene
> >> (https://issues.apache.org/jira/browse/LUCENE-2510). The analysis
> >> parts got new package names below the Lucene namespace. The effect
> >> of this would have been that everyone had to change their config
> >> files, because the "solr." shortcut won't work with Lucene classes.
> >>
> >> Now you might ask why the "solr." prefix still works? The reason is
> >> a second fundamental change with Lucene 4. We no longer use class
> >> names in Lucene to refer to stuff like Codecs, PostingFormats - we
> >> use the Java concept of SPI.
> >> All components get a name; the implementation class is not exposed
> >> to the outside. Like with Codecs, where you use
> >> Codec.forName("Lucene70") to instantiate it, the same was done for
> >> TokenStream components. This now allows creating a
> >> StandardTokenizerFactory using the following code:
> >> TokenizerFactory.forName("standard"). Or LowercaseFilter with
> >> TokenFilterFactory.forName("lowercase"). There is no such concept
> >> for Analyzers (no SPI) [this explains your original question].
> >>
> >> Now we have the two pieces to put together: refactoring of class
> >> names and adding the SPI concept. The "correct" fix in Solr would
> >> have been to remove the "class=" attribute in the fieldType and
> >> replace it with something called "name" or "type", so the XML would
> >> look like (https://goo.gl/Dr3gpO):
> >>
> >> <fieldType name="something" class="solr.TextField">
> >>   <analyzer>
> >>     <tokenizer name="whitespace" />
> >>   </analyzer>
> >> </fieldType>
> >>
> >> Similar to those examples of the corresponding class to build
> >> Analyzers from those SPI names in Lucene:
> >> https://lucene.apache.org/core/6_2_0/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html
> >>
> >> The above syntax is wonderful, but again this caused lots of
> >> complaints from Solr developers, that people are unable to
> >> understand this WTF :-) It may also have to do with those short
> >> names looking more like <add competitor's name here> analysis
> >> component names.... (no idea, although it's completely unrelated).
> >> The issue with more history is here:
> >> https://issues.apache.org/jira/browse/LUCENE-4044
> >>
> >> Because of that, a second hack was added so all schema.xml files
> >> worked like before (in LUCENE-4044). This hack is the only way to
> >> configure tokenstream components up to this day - which is a
> >> disaster, IMHO!
> >> The hack is a fancy regular expression that tries to convert the
> >> old "solr.FoobarTokenFilterFactory" to the nicer-reading "names"
> >> like above: https://goo.gl/mtWmjm
> >> The factory is then loaded using SPI: https://goo.gl/EwDtQr
> >> IMHO, the hack should be deprecated and removed and the new syntax,
> >> as described above, should be introduced.
> >>
> >> Analyzer class names would still be *full* class names (and will
> >> for sure stay like that, as they are seldom used in Solr). There is
> >> no way to change that!
> >>
> >> Now you have a bit of history and you might see that there is
> >> absolutely no relationship between the class name / package name
> >> and the configured "class" in schema.xml. In fact, the thing above
> >> cannot be fixed. Instead, the issue mentioned before should finally
> >> be fixed: the "class" attribute in token stream components should
> >> be deprecated and removed, and the above "name" (or maybe "type")
> >> syntax be used.
> >>
> >> Uwe
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
>
> --
> Uwe Schindler
> H.-H.-Meier-Allee 63, 28213 Bremen
> http://www.thetaphi.de
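
For illustration, the rewrite from a legacy "solr.*Factory" class
attribute to a symbolic name, as discussed in this thread, could be
sketched roughly like this in standalone Java. Everything here is an
assumption made for the example: the class SpiNameSketch, the method
toSymbolicName and the suffix list are hypothetical, and the pattern
is only an approximation, not the actual regular expression behind the
links above. A real implementation would more likely ask the loaded
factory for its symbolic name.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpiNameSketch {

    // Hypothetical pattern: the "solr." prefix, a simple class name, and one
    // of the factory suffixes. The lazy group keeps the suffix out of the name.
    private static final Pattern LEGACY_FACTORY = Pattern.compile(
        "solr\\.(\\w+?)(TokenizerFactory|TokenFilterFactory|CharFilterFactory|FilterFactory)");

    // Returns the lower-cased symbolic name for a legacy factory reference,
    // or empty if the attribute does not look like one (e.g. an Analyzer or
    // a fully qualified class name), in which case it should stay unchanged.
    public static Optional<String> toSymbolicName(String classAttr) {
        Matcher m = LEGACY_FACTORY.matcher(classAttr);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String[] samples = {
            "solr.LowerCaseTokenizerFactory",
            "solr.FoobarTokenFilterFactory",
            "solr.SimpleAnalyzer",
            "org.apache.lucene.analysis.core.SimpleAnalyzer"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + toSymbolicName(s).orElse("(left unchanged)"));
        }
    }
}
```

Analyzer references such as "solr.SimpleAnalyzer" are deliberately not
matched, since there is no SPI for Analyzers and they have to stay
full class names.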