Thanks for this detailed answer.

On Sat, Sep 10, 2016 at 3:24 AM Uwe Schindler <u...@thetaphi.de> wrote:

> Hallo Alexandre,
>
> > I can't see a reason why it should be different, but:
> >
> > This works
> >     <fieldType name="text_basic" class="solr.TextField">
> >         <analyzer>
> >             <tokenizer class="solr.LowerCaseTokenizerFactory" />
> >         </analyzer>
> >    </fieldType>
> >
> > This does not:
> >     <fieldType name="text_basic" class="solr.TextField">
> >         <analyzer class="solr.SimpleAnalyzer"/>
> >     </fieldType>
> >
> > This does work again:
> >     <fieldType name="text_basic" class="solr.TextField">
> >         <analyzer class="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
> >     </fieldType>
> >
> > Both LowerCaseTokenizerFactory and SimpleAnalyzer are in the same
> > package.
> >
> > Is this a bug or some sort of legacy decision?
>
> There is a long history behind that and there is also a *fundamental*
> difference between the factories used for building custom analyzers in XML
> code and just referring to an Analyzer!
>
> Let me start with some history: From the early beginning there was the
> concept of factories in Solr, so implementation classes are initialized
> from a map of properties given in the XML. Those factories were specified
> by Java binary class name ("org.apache.solr.foo.bar.MyFactory"). This is
> used in many places in Solr. The problem is that those class names could be
> quite long, so the SolrResourceLoader has a "hack" to allow short names
> (which, IMHO, was a horrible decision). When it sees a class name starting
> with "solr.", it tries to look up different possibilities. See code here:
> https://goo.gl/P24ZU3 (subpackages is generally a list like
> "o.a.solr.something",...).
>
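
To make the short-name lookup described above concrete, here is a minimal
sketch. It is not the SolrResourceLoader code: the class ShortNameLookup, the
resolve method, and its parameters are made up for illustration, and JDK
packages stand in for the Solr subpackages so that it runs without Solr on
the classpath.

```java
import java.util.List;

// Hypothetical illustration of a "solr." short-name lookup; not Solr's implementation.
final class ShortNameLookup {
    private static final String PREFIX = "solr.";

    // Resolves a full class name directly, or a "solr."-prefixed short name by
    // trying each candidate subpackage below basePackage in order.
    static Class<?> resolve(String name, String basePackage, List<String> subpackages) {
        if (!name.startsWith(PREFIX)) {
            try {
                return Class.forName(name); // full names are loaded as-is
            } catch (ClassNotFoundException e) {
                throw new IllegalArgumentException("Unknown class: " + name, e);
            }
        }
        String simpleName = name.substring(PREFIX.length());
        for (String sub : subpackages) {
            try {
                return Class.forName(basePackage + "." + sub + "." + simpleName);
            } catch (ClassNotFoundException e) {
                // Not in this subpackage; fall through to the next candidate.
            }
        }
        throw new IllegalArgumentException("Cannot resolve short name: " + name);
    }
}
```

With base package "java" and subpackages "io" and "util", the name
"solr.ArrayList" resolves to java.util.ArrayList. A class that lives outside
every listed subpackage can never be found this way, which is the situation
for the Lucene Analyzers.
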
> In the early days (before Lucene/Solr 4.0), those factories were *all*
> part of Solr, so the lookup with the "solr." short name prefix was easy and
> the subpackages list was short. So it "just worked" and many people had
> those class names in their config files.
>
> The Analyzers (2nd example) were always referred to by their full name,
> because they were part of Lucene and not Solr. Using a "solr." short name
> was never possible because of that.
>
> Now a change in 4.0 comes into the game: To make the concept of building
> "custom" analyzers easier to use for non-Solr users, and to make the whole
> concept easier to maintain, the factories for tokenstream components were
> moved out of Solr into Lucene (
> https://issues.apache.org/jira/browse/LUCENE-2510). The analysis parts
> got new package names below the Lucene namespace. The effect of this would
> have been that everyone had to change their config files, because the
> "solr." shortcut won't work with Lucene classes.
>
> Now you might ask why the "solr." prefix still works. The reason is a
> second fundamental change with Lucene 4. We no longer use class names in
> Lucene to refer to stuff like Codecs, PostingFormats - we use the Java
> concept of SPI. All components get a name, and the implementation class is
> not exposed to the outside. Like with Codecs, where you use
> Codec.forName("Lucene70") to instantiate it, the same was done for
> TokenStream components. This now allows you to create StandardTokenizerFactory
> using the following code: TokenizerFactory.forName("standard"). Or
> LowerCaseFilterFactory with TokenFilterFactory.forName("lowercase"). There is
> no such concept for Analyzers (no SPI) [this explains your original question].
>
> Now we have the two pieces to put together: refactoring of class names and
> the addition of the SPI concept. The "correct" fix in Solr would have been to
> remove the "class=" attribute on the tokenstream components and replace it by
> something called "name" or "type", so the XML would look like
> (https://goo.gl/Dr3gpO):
>
> <fieldType name="something" class="solr.TextField">
>    <analyzer>
>       <tokenizer name="whitespace" />
>    </analyzer>
> </fieldType>
>
> Similar to those examples of the corresponding class to build Analyzers
> from those SPI names in Lucene:
> https://lucene.apache.org/core/6_2_0/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html
>
> The above syntax is wonderful, but again this caused lots of complaints
> from Solr developers, that people are unable to understand this WTF :-) It
> may also have to do with those short names looking more like <add
> competitor's name here> analysis component names... (no idea, although it's
> completely unrelated). The issue with more history is here:
> https://issues.apache.org/jira/browse/LUCENE-4044
>
> Because of that, a second hack was added so that all schema.xml files
> worked like before (in LUCENE-4044). This hack is the only way to configure
> tokenstream components up to this day - which is a disaster, IMHO! The hack
> is a fancy regular expression that tries to convert the old
> "solr.FoobarTokenFilterFactory" to the nicely readable "names" like above:
> https://goo.gl/mtWmjm
> The factory is then loaded using SPI: https://goo.gl/EwDtQr
> IMHO, the hack should be deprecated and removed and the new syntax, as
> described above, should be introduced.
>
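
The class-name-to-SPI-name conversion described above could look roughly like
the sketch below. The pattern is a simplified assumption, not the regular
expression Solr actually uses (that one is behind the link above), and the
class LegacyFactoryNames and its spiName method are hypothetical names.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of mapping legacy "solr.*Factory" names to SPI names.
final class LegacyFactoryNames {
    // Assumed, simplified pattern: "solr." + base name + component suffix + "Factory".
    // The base name is matched reluctantly so the longest suffix that fits is split off.
    private static final Pattern LEGACY = Pattern.compile(
        "solr\\.(\\p{Alpha}\\w*?)(TokenFilter|Filter|Tokenizer|CharFilter)Factory");

    // Returns the lowercased SPI-style name, or null if the class name does not
    // look like a legacy factory reference (for example an Analyzer class).
    static String spiName(String className) {
        Matcher m = LEGACY.matcher(className);
        return m.matches() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }
}
```

Under these assumptions, "solr.LowerCaseTokenizerFactory" maps to "lowercase"
and "solr.FoobarTokenFilterFactory" to "foobar", while "solr.SimpleAnalyzer"
does not match at all because it has no factory suffix.
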
> Analyzer class names would still be *full* class names (and will surely
> stay like that, as they are seldom used in Solr). There is no way to change
> that!
>
> Now you have a bit of history and you might see that there is absolutely
> no relationship between the class name / package name and the configured
> "class" in schema.xml. In fact, the thing above cannot be fixed. Instead,
> the issue mentioned before should finally be fixed and the "class"
> attribute in token stream components be deprecated and removed and the
> above "name" (or maybe "type") syntax be used.
>
> Uwe
>
--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com