Re: Is "solr.AnalyzerName" expansion supposed to work for Analyzers?

Alexandre Rafalovitch Sat, 10 Sep 2016 06:57:56 -0700

Wow Uwe,

Thanks for the treatise. That's an interesting discussion, but I
wonder if anything has changed since?


In terms of user confusion/migration, we now have managed schema and
can probably rewrite from 'solr.x' to symbol names on first use. That,
of course, requires some sort of registry of those names, and I am
not sure one exists (apart from my own solr-start.com hacks). But
then the registry may well align with some other configuration
reporting by the components, and with plugins/library jars.

I am also wondering whether the objection still holds that other
components in Solr (such as search components) are not able to
move to SPI. I am especially curious whether any of that was affected
by Noble's work on having libraries loaded into Solr's special
collection. What mechanism is used there to load things?

But yes, I can see it is a big topic. I may just update the
documentation and examples to mention that Analyzers have to use
the full class name when I get to it.

Regards,
   Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 10 September 2016 at 14:24, Uwe Schindler <[email protected]> wrote:
> Hallo Alexandre,
>
>> I can't see a reason why it should be different, but:
>>
>> This works
>>     <fieldType name="text_basic" class="solr.TextField">
>>         <analyzer>
>>             <tokenizer class="solr.LowerCaseTokenizerFactory" />
>>         </analyzer>
>>    </fieldType>
>>
>> This does not:
>>     <fieldType name="text_basic" class="solr.TextField">
>>         <analyzer class="solr.SimpleAnalyzer"/>
>>     </fieldType>
>>
>> This does work again:
>>     <fieldType name="text_basic" class="solr.TextField">
>>         <analyzer class="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
>>     </fieldType>
>>
>> Both LowerCaseTokenizerFactory and SimpleAnalyzer are in the same
>> package.
>>
>> Is this a bug or some sort of legacy decision?
>
> There is a long history behind that and there is also a *fundamental* 
> difference between the factories used for building custom analyzers in XML 
> code and just referring to an Analyzer!
>
> Let me start with some history: From the very beginning there was the
> concept of factories in Solr, so implementation classes are initialized from
> a map of properties given in the XML. Those factories were specified by Java
> binary class name ("org.apache.solr.foo.bar.MyFactory"). This is used in many
> places in Solr. The problem is that those class names could be quite long, so
> the SolrResourceLoader has a "hack" to allow short names (which, IMHO, was a
> horrible decision). When it sees a class name starting with "solr.", it tries
> to look up different possibilities. See code here: https://goo.gl/P24ZU3
> (subpackages is generally a list like "o.a.solr.something",...).
>
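
To make the short-name lookup described above concrete, here is a minimal
Java sketch. It is not the SolrResourceLoader code: the class
ShortNameResolver, the BASE prefix and the SUBPACKAGES list are hypothetical,
and JDK packages stand in for the "o.a.solr.something" entries only so that
the sketch runs without Solr on the classpath.

```java
/** Sketch of "solr."-prefix expansion; hypothetical, not the SolrResourceLoader code. */
class ShortNameResolver {
    // Assumed stand-ins for the real base package and sub-package list.
    static final String BASE = "java.";
    static final String[] SUBPACKAGES = {"lang.", "util.", "util.concurrent."};

    /** Resolves a "solr.X" short name by probing each sub-package; other names load as-is. */
    static Class<?> resolve(String name) {
        if (!name.startsWith("solr.")) {
            try {
                return Class.forName(name); // full class name: no expansion
            } catch (ClassNotFoundException e) {
                throw new IllegalArgumentException("Unknown class " + name, e);
            }
        }
        String simpleName = name.substring("solr.".length());
        for (String sub : SUBPACKAGES) {
            try {
                return Class.forName(BASE + sub + simpleName);
            } catch (ClassNotFoundException e) {
                // not in this package; try the next candidate
            }
        }
        throw new IllegalArgumentException("No class found for short name " + name);
    }

    public static void main(String[] args) {
        System.out.println(resolve("solr.ArrayList").getName());    // java.util.ArrayList
        System.out.println(resolve("java.util.HashMap").getName()); // java.util.HashMap
    }
}
```

The point of the sketch is that a short name only resolves if the class
happens to live in one of the probed packages, which is why the list of
candidates matters.
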
> In the early days (before Lucene/Solr 4.0), those factories were *all* part 
> of Solr, so the lookup with the "solr." short name prefix was easy and the 
> subpackages list was short. So it "just worked" and many people had those 
> class names in their config files.
>
> The Analyzers (2nd example) were always referred to by their full name,
> because they were part of Lucene and not Solr. Using a "solr." short name was
> never ever possible because of that.
>
> Now a change in 4.0 comes into play: To make the concept of building
> "custom" analyzers easier to use for non-Solr users, and to make the whole
> concept easier to maintain, the factories for tokenstream components were
> moved out of Solr into Lucene
> (https://issues.apache.org/jira/browse/LUCENE-2510). The analysis parts got
> new package names below the Lucene namespace. The effect of this would have
> been that everyone had to change their config files, because the "solr."
> shortcut won't work with Lucene classes.
>
> Now you might ask why the "solr." prefix still works? The reason is a second
> fundamental change with Lucene 4. We no longer use class names in Lucene to
> refer to stuff like Codecs, PostingsFormats - we use the Java concept of SPI.
> All components get a name; the implementation class is not exposed to the
> outside. Like with Codecs, where you use Codec.forName("Lucene70") to
> instantiate it, the same was done for TokenStream components. This now makes
> it possible to create StandardTokenizerFactory using the following code:
> TokenizerFactory.forName("standard"). Or LowerCaseFilterFactory with
> TokenFilterFactory.forName("lowercase"). There is no such concept for
> Analyzers (no SPI) [this explains your original question].
>
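
The name-based lookup can be illustrated without Lucene. Lucene discovers its
implementations through Java's service-provider convention (META-INF/services
entries); the sketch below swaps that discovery for a hand-filled map so that
it runs standalone. NamedRegistry and its methods are hypothetical names, not
Lucene API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Name-keyed lookup in the spirit of forName(...); hypothetical, not Lucene API. */
class NamedRegistry<T> {
    // Assumption: a plain map replaces the classpath scan a real SPI loader performs.
    private final Map<String, Supplier<? extends T>> providers = new HashMap<>();

    /** Registers a provider under a symbolic name. */
    void register(String name, Supplier<? extends T> provider) {
        providers.put(name, provider);
    }

    /** Creates an instance by symbolic name; callers never see the implementation class. */
    T forName(String name) {
        Supplier<? extends T> provider = providers.get(name);
        if (provider == null) {
            throw new IllegalArgumentException(
                "No component named '" + name + "'; known names: " + providers.keySet());
        }
        return provider.get();
    }

    public static void main(String[] args) {
        NamedRegistry<CharSequence> tokenizers = new NamedRegistry<>();
        tokenizers.register("standard", () -> "stand-in for a tokenizer factory");
        System.out.println(tokenizers.forName("standard"));
    }
}
```

A component type that has no such registry, as described above for Analyzers,
can only be referenced by its class name.
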
> Now we have the two pieces to put together: Refactoring of class names and
> the addition of the SPI concept. The "correct" fix in Solr would have been to
> remove the "class=" attribute in the fieldType and replace it with something
> called "name" or "type", so the XML would look like (https://goo.gl/Dr3gpO):
>
> <fieldType name="something" class="solr.TextField">
>    <analyzer>
>       <tokenizer name="whitespace" />
>    </analyzer>
> </fieldType>
>
> This is similar to the examples for the corresponding Lucene class that
> builds Analyzers from those SPI names:
> https://lucene.apache.org/core/6_2_0/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html
>
> The above syntax is wonderful, but again this caused lots of complaints from
> Solr developers that people would be unable to understand this WTF :-) It may
> also have to do with those short names looking more like <add competitor's
> name here> analysis component names.... (no idea, although it's completely
> unrelated). The issue with more history is here:
> https://issues.apache.org/jira/browse/LUCENE-4044
>
> Because of that, a second hack was added so that all schema.xml files worked
> like before (in LUCENE-4044). This hack is the only way to configure
> tokenstream components up to this day - which is a disaster, IMHO! The hack
> is a fancy regular expression that tries to convert the old
> "solr.FoobarTokenFilterFactory" to the nicely readable "names" like above:
> https://goo.gl/mtWmjm
> The factory is then loaded using SPI: https://goo.gl/EwDtQr
> IMHO, the hack should be deprecated and removed and the new syntax, as
> described above, should be introduced.
>
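
The conversion described above can be sketched in a few lines of Java. The
pattern below is a simplified assumption written for illustration, not the
regular expression behind the short link, and LegacyNameConverter and
toSpiName are hypothetical names.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of mapping legacy "solr.XyzFilterFactory" names to SPI-style names. */
class LegacyNameConverter {
    // Simplified, assumed pattern: optional "solr." prefix, a base name,
    // then one of the known factory suffixes.
    private static final Pattern LEGACY = Pattern.compile(
        "(?:solr\\.)?(\\w+?)(TokenFilter|Filter|Tokenizer|CharFilter)Factory");

    /** Returns the lower-cased base name, or null if the input is not a legacy factory name. */
    static String toSpiName(String className) {
        Matcher m = LEGACY.matcher(className);
        if (!m.matches()) {
            return null; // e.g. an Analyzer class name: nothing to convert
        }
        return m.group(1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toSpiName("solr.LowerCaseTokenizerFactory")); // lowercase
        System.out.println(toSpiName("solr.FoobarTokenFilterFactory"));  // foobar
        System.out.println(toSpiName("solr.SimpleAnalyzer"));            // null
    }
}
```

In this sketch a name such as "solr.SimpleAnalyzer" matches nothing, which is
consistent with the short form working for tokenstream factories but not for
Analyzers.
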
> Analyzer class names would still be *full* class names (and will for sure
> stay like that, as they are seldom used in Solr). There is no way to change
> that!
>
> Now you have a bit of history and you might see that there is absolutely no 
> relationship between the class name / package name and the configured "class" 
> in schema.xml. In fact, the thing above cannot be fixed. Instead, the issue 
> mentioned before should finally be fixed and the "class" attribute in token 
> stream components be deprecated and removed and the above "name" (or maybe 
> "type") syntax be used.
>
> Uwe
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
