IndexingConfiguration jr 1.4 release, analyzing, searching and synonymprovider

Ard Schrijvers Wed, 08 Aug 2007 08:33:58 -0700

Hello, 

and sorry for spamming, but I just want to share my findings/impressions. What 
I am posting here I am willing to implement and port to the JackRabbit trunk 
(so if you bother to read it, and are positive about it, I will implement it 
:-) )


(if you make it to the end of this mail, I also describe how simple it would 
become to add the SynonymProvider functionality that was just created in the 
trunk....)

First of all, the IndexingConfiguration, very promising! Exactly what we need 
for better indexing, and, consequently better search results. Because, in the 
end, what good is a repository when customers can't find the results they are 
looking for? Storing, versioning, workflow, all very important, but no good 
when nobody can find their content (duhh, obviously).

So, one part that bothers me is multilinguality (with language-specific 
stopwords, stemming, synonyms). Many customers these days want multilingual 
sites, and want to search them accordingly. And, obviously, lucene has quite 
some code for exactly this: see contrib/analyzers/src/java. 

Obviously, lucene has many more analyzers, and you can easily add your own. 
AFAIU, there is a single configuration place where I can define the overall 
JackRabbit analyzer that is used within one workspace: 

in repository.xml :

<param name="analyzer" 
value="org.apache.lucene.analysis.standard.StandardAnalyzer"/>

but what I want is a per-property definable analyzer (I would give body_fr a 
French analyzer, body_de a German one, and some properties, like zipcodes, I 
might want indexed with a keyword analyzer). The best place for this, IMO, is 
the IndexingConfiguration: then, if you do not configure it, nothing changes 
for you.
 
So, for example, the first index rule at 
http://wiki.apache.org/jackrabbit/IndexingConfiguration would change to:

<index-rule nodeType="nt:unstructured"
              boost="2.0">
    <property analyzer="org.apache.lucene.analysis.de.GermanAnalyzer">text_de</property>
</index-rule>

and during loading, we construct a Map of {jr-property,analyzer} (call it 
propertyAnalyzerMap). Then, all we need to add is one jackrabbit global 
analyzer, that looks like:

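import java.io.Reader;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

class JRAnalyzer extends Analyzer {
        // fallback for fields that have no analyzer configured
        private final Analyzer defaultAnalyzer = new StandardAnalyzer();
        // {jr-property, analyzer}, built while loading the IndexingConfiguration
        private final Map propertyAnalyzerMap;

        JRAnalyzer(Map propertyAnalyzerMap) {
                this.propertyAnalyzerMap = propertyAnalyzerMap;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
                Analyzer analyzer = (Analyzer) propertyAnalyzerMap.get(fieldName);
                if (analyzer != null) {
                        return analyzer.tokenStream(fieldName, reader);
                }
                return defaultAnalyzer.tokenStream(fieldName, reader);
        }
}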
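
To make the "during loading" step a bit more concrete, here is a minimal 
sketch of how the {jr-property,analyzer} map could be collected from the 
configuration. It only uses the JDK's DOM parser and stores analyzer class 
names as strings; instantiating them, and mapping the property name to the 
actual Lucene field name, is left out. The class and method names 
(PropertyAnalyzerConfigSketch, load, analyzerFor) and the analyzer attribute 
are assumptions for illustration, not existing JackRabbit code.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of JackRabbit, only illustrates the idea.
class PropertyAnalyzerConfigSketch {

    /** Collects property name -> analyzer class name from <property analyzer="..."> elements. */
    static Map<String, String> load(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            Map<String, String> result = new HashMap<>();
            NodeList props = root.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                // getAttribute returns "" when the attribute is absent
                String analyzer = p.getAttribute("analyzer");
                if (!analyzer.isEmpty()) {
                    result.put(p.getTextContent().trim(), analyzer);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid indexing configuration", e);
        }
    }

    /** Properties without an explicit analyzer fall back to the default one. */
    static String analyzerFor(Map<String, String> map, String property, String defaultAnalyzer) {
        return map.getOrDefault(property, defaultAnalyzer);
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<index-rule nodeType=\"nt:unstructured\" boost=\"2.0\">"
                + "<property analyzer=\"org.apache.lucene.analysis.de.GermanAnalyzer\">text_de</property>"
                + "<property>title</property>"
                + "</index-rule></configuration>";
        Map<String, String> map = load(xml);
        String std = "org.apache.lucene.analysis.standard.StandardAnalyzer";
        System.out.println(analyzerFor(map, "text_de", std));
        System.out.println(analyzerFor(map, "title", std));
    }
}
```

Because unconfigured properties simply fall through to the default, an 
existing configuration without any analyzer attribute would behave exactly 
as before.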

This very same JRAnalyzer is also used for the QueryParser in 
LuceneQueryBuilder, so this will also work for searching, IIUC. So, WDOT? I can 
implement it and send a patch, but if the community is reluctant, I will 
have to do it for myself in a way that does not intrude on the jr code.

Example of the SynonymProvider mentioned at the top:

If my suggested changes are accepted, a separate SynonymProvider becomes 
superfluous, because synonym support is very easy to add on the fly:

suppose I always want full-text searching with Dutch synonyms on the "body" 
property of my nodes. This boils down to adding an analyzer for this property 
that extends the DutchAnalyzer in lucene and adds synonym functionality (there 
is a very simple example in the "Lucene in Action" book). I think it is better 
to do synonyms during analyzing (as opposed to the SynonymProvider in jr 
trunk), and simply use an analyzer for it. Of course, one difference would be 
that with the current SynonymProvider you have to specify explicitly that you 
want a synonym search (~term), while with an analyzer you define which 
properties should be indexed with a synonym analyzer, and they are searched 
accordingly (without having to specify it).
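
To illustrate what "synonyms during analyzing" means, here is a minimal 
sketch in plain Java, without any Lucene classes. It lower-cases the text, 
splits it into tokens, and emits every token followed by its synonyms, which 
is roughly what a synonym-injecting token filter would do inside an analyzer. 
The class name SynonymExpansionSketch, the tokenization rule and the example 
synonym pair are assumptions for illustration only; a real Lucene filter 
would typically emit each synonym at the same position as the original token 
(position increment 0).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an analyzer with a synonym filter; not Lucene or JackRabbit code.
class SynonymExpansionSketch {
    private final Map<String, List<String>> synonyms;

    SynonymExpansionSketch(Map<String, List<String>> synonyms) {
        this.synonyms = synonyms;
    }

    /** Lower-cases the text, splits on non-letters, and emits each token followed by its synonyms. */
    List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            out.add(token);
            out.addAll(synonyms.getOrDefault(token, Collections.<String>emptyList()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dutch = new HashMap<>();
        dutch.put("fiets", Arrays.asList("rijwiel")); // illustrative synonym pair
        SynonymExpansionSketch analyzer = new SynonymExpansionSketch(dutch);
        System.out.println(analyzer.analyze("Mijn fiets"));
    }
}
```

Since the synonyms end up in the index itself, a plain query for a synonym 
matches the property without any special query syntax, which is the 
difference with the ~term approach described above.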

So WDOT? Again, sry for mailing so much, just trying to sell my ideas :-) 

 
-- 

Hippo
Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel  +31 (0)20 5224466
-------------------------------------------------------------
[EMAIL PROTECTED] / [EMAIL PROTECTED] / http://www.hippo.nl
--------------------------------------------------------------
