RE: Is "solr.AnalyzerName" expansion supposed to work for Analyzers?

Uwe Schindler Sat, 10 Sep 2016 00:25:27 -0700

Hello Alexandre,

> I can't see a reason why it should be different, but:
> 
> This works
>     <fieldType name="text_basic" class="solr.TextField">
>         <analyzer>
>             <tokenizer class="solr.LowerCaseTokenizerFactory" />
>         </analyzer>
>    </fieldType>
> 
> This does not:
>     <fieldType name="text_basic" class="solr.TextField">
>         <analyzer class="solr.SimpleAnalyzer"/>
>     </fieldType>
> 
> This does work again:
>     <fieldType name="text_basic" class="solr.TextField">
>         <analyzer class="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
>     </fieldType>
> 
> Both LowerCaseTokenizerFactory and SimpleAnalyzer are in the same
> package.
> 
> Is this a bug or some sort of legacy decision?


There is a long history behind that and there is also a *fundamental* 
difference between the factories used for building custom analyzers in XML code 
and just referring to an Analyzer!

Let me start with some history: From the very beginning there was the concept 
of factories in Solr, so implementation classes are initialized from a map of 
properties given in the XML. Those factories were specified by Java binary 
class name ("org.apache.solr.foo.bar.MyFactory"). This is used in many places 
in Solr. The problem is that those class names could be quite long, so the 
SolrResourceLoader has a "hack" to allow short names (which, IMHO, was a 
horrible decision). When it sees a class name starting with "solr.", it tries 
to look up different possibilities. See code here: https://goo.gl/P24ZU3 
(subpackages is generally a list like "o.a.solr.something",...).
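
To make the short-name lookup concrete, here is a minimal sketch of the idea. 
It is not the actual SolrResourceLoader code: the class name 
ShortNameResolverSketch and the method resolve are made up for illustration, 
and the demo uses JDK packages as stand-in "subpackages" only so that it runs 
without Solr on the classpath.

```java
import java.util.List;

// Hypothetical sketch of "solr." short-name expansion; not Solr's real code.
final class ShortNameResolverSketch {
    private static final String PREFIX = "solr.";

    // Resolves a class name. Names starting with "solr." are tried against
    // each candidate package in order; all other names are loaded as given.
    static Class<?> resolve(String name, List<String> subPackages) {
        if (!name.startsWith(PREFIX)) {
            try {
                return Class.forName(name);
            } catch (ClassNotFoundException e) {
                throw new IllegalArgumentException("Unknown class: " + name, e);
            }
        }
        String simpleName = name.substring(PREFIX.length());
        for (String pkg : subPackages) {
            try {
                return Class.forName(pkg + simpleName);
            } catch (ClassNotFoundException e) {
                // Not in this package, try the next candidate.
            }
        }
        throw new IllegalArgumentException("Cannot expand short name: " + name);
    }

    public static void main(String[] args) {
        // Stand-in package list (assumption): JDK packages instead of Solr ones.
        List<String> subPackages = List.of("java.util.concurrent.", "java.util.");
        // prints "java.util.ArrayList"
        System.out.println(resolve("solr.ArrayList", subPackages).getName());
        // prints "java.lang.String"
        System.out.println(resolve("java.lang.String", subPackages).getName());
    }
}
```

The point of the sketch: the expansion only finds classes that live in one of 
the listed packages, so a class outside that list cannot be reached through 
the "solr." prefix.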

In the early days (before Lucene/Solr 4.0), those factories were *all* part of 
Solr, so the lookup with the "solr." short name prefix was easy and the 
subpackages list was short. So it "just worked" and many people had those class 
names in their config files.

The Analyzers (2nd example) were always referred to by their full name, because 
they were part of Lucene and not Solr. Using a "solr." short name was never 
possible because of that.

Now a change in 4.0 comes into the game: To make the concept of building 
"custom" analyzers easier to use for non-Solr users, and to make the whole 
concept easier to maintain, the factories for tokenstream components were moved 
out of Solr into Lucene (https://issues.apache.org/jira/browse/LUCENE-2510). 
The analysis parts got new package names below the Lucene namespace. The effect 
of this would have been that everyone had to change their config files, 
because the "solr." shortcut would not work with Lucene classes.

Now you might ask why the "solr." prefix still works? The reason is a second 
fundamental change with Lucene 4. We no longer use class names in Lucene to 
refer to stuff like Codecs, PostingsFormats - we use the Java concept of SPI. 
All components get a name; the implementation class is not exposed to the 
outside. Like with Codecs, where you use Codec.forName("Lucene70") to 
instantiate it, the same was done for TokenStream components. This now makes it 
possible to create StandardTokenizerFactory using the following code: 
TokenizerFactory.forName("standard"). Or LowerCaseFilterFactory with 
TokenFilterFactory.forName("lowercase"). There is no such concept for Analyzers 
(no SPI) [this explains your original question].
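
The name-based lookup can be illustrated with a small stand-in. This is only a 
sketch of the idea, under the assumption that a plain map is good enough for 
illustration: real SPI discovers implementations through 
java.util.ServiceLoader and META-INF/services files, which does not fit into a 
single file. All names in the sketch (NamedFactorySketch, SketchFactory, 
forName) are hypothetical and are not Lucene classes.

```java
import java.util.Map;
import java.util.TreeSet;
import java.util.function.Supplier;

// Hypothetical stand-in for a name-based (SPI-style) factory lookup.
final class NamedFactorySketch {
    // Hypothetical component type; callers only ever see this interface.
    interface SketchFactory {
        String describe();
    }

    // Registry of public names. The implementations behind the names stay
    // hidden; in real SPI this table is discovered from the classpath.
    private static final Map<String, Supplier<SketchFactory>> REGISTRY = Map.of(
        "standard", () -> () -> "standard tokenizer factory",
        "whitespace", () -> () -> "whitespace tokenizer factory",
        "lowercase", () -> () -> "lowercase tokenizer factory");

    // Looks a component up by its registered name, never by class name.
    static SketchFactory forName(String name) {
        Supplier<SketchFactory> supplier = REGISTRY.get(name);
        if (supplier == null) {
            throw new IllegalArgumentException("No component named '" + name
                + "'; available names: " + new TreeSet<>(REGISTRY.keySet()));
        }
        return supplier.get();
    }

    public static void main(String[] args) {
        // prints "standard tokenizer factory"
        System.out.println(forName("standard").describe());
        try {
            // Nothing is registered under this name, so the lookup fails.
            forName("simple");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A component that was never registered under a name cannot be found this way, 
which mirrors the situation of Analyzers described above.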

Now we have the two pieces to put together: Refactoring of class names and 
adding of the SPI concept. The "correct" fix in Solr would have been to remove 
the "class=" attribute in the fieldType and replace it with something called 
"name" or "type", so the XML would look like (https://goo.gl/Dr3gpO):

<fieldType name="something" class="solr.TextField">
   <analyzer>
      <tokenizer name="whitespace" />
   </analyzer>
</fieldType>

This is similar to the examples for CustomAnalyzer, the corresponding Lucene 
class that builds Analyzers from those SPI names: 
https://lucene.apache.org/core/6_2_0/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html

The above syntax is wonderful, but again this caused lots of complaints from 
Solr developers that people would be unable to understand this WTF :-) It may 
also have to do with the fact that those short names look more like 
<add competitors name here> analysis component names.... (no idea, although 
it's completely unrelated). The issue with more history is here: 
https://issues.apache.org/jira/browse/LUCENE-4044

Because of that, a second hack was added (in LUCENE-4044) so that all 
schema.xml files worked like before. This hack is the only way to configure 
tokenstream components up to this day - which is a disaster, IMHO! The hack is 
a fancy regular expression that tries to convert the old 
"solr.FoobarTokenFilterFactory" to the nicely readable "names" like above: 
https://goo.gl/mtWmjm
The factory is then loaded using SPI: https://goo.gl/EwDtQr
IMHO, the hack should be deprecated and removed and the new syntax, as 
described above, should be introduced.
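
As a rough illustration of that conversion, here is a simplified sketch. The 
regular expression below is my own simplification and not the expression Solr 
actually uses; the class name LegacyNameSketch and the method toSpiName are 
hypothetical. It only shows how a legacy factory class name can be reduced to 
a short name, and why an Analyzer class name falls through.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the legacy-name conversion; not Solr's real regex.
final class LegacyNameSketch {
    // "solr." + base name + component kind + "Factory"
    private static final Pattern LEGACY = Pattern.compile(
        "solr\\.(\\w+?)(TokenFilter|Filter|Tokenizer|CharFilter)Factory");

    // Returns the short name for a legacy factory class name, or empty if
    // the input does not look like a "solr.*Factory" analysis component.
    static Optional<String> toSpiName(String className) {
        Matcher m = LEGACY.matcher(className);
        return m.matches()
            ? Optional.of(m.group(1).toLowerCase(Locale.ROOT))
            : Optional.empty();
    }

    public static void main(String[] args) {
        // prints "Optional[foobar]"
        System.out.println(toSpiName("solr.FoobarTokenFilterFactory"));
        // prints "Optional[lowercase]"
        System.out.println(toSpiName("solr.LowerCaseTokenizerFactory"));
        // prints "Optional.empty": not a factory, so nothing to convert
        System.out.println(toSpiName("solr.SimpleAnalyzer"));
    }
}
```

In this sketch "solr.SimpleAnalyzer" does not match the pattern at all, so 
there is no name to hand over to the SPI lookup.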

Analyzer class names would still be *full* class names (and will for sure stay 
like that, as they are seldom used in Solr). There is no way to change that!

Now you have a bit of history and you might see that there is absolutely no 
relationship between the class name / package name and the configured "class" 
in schema.xml. In fact, the thing above cannot be fixed. Instead, the issue 
mentioned before should finally be fixed and the "class" attribute in token 
stream components be deprecated and removed and the above "name" (or maybe 
"type") syntax be used.

Uwe

