Apologies for not indenting Bertrand's text properly; I am using webmail. See my comments below the second ------ :
> Bertrand Delacretaz wrote:
----------------------------------
"Yes, given that many Lucene TokenFilters are available, this is useful I think. I see two potential issues that you might want to take into account:

1) With configurable indexing analyzers, people sometimes have a hard time figuring out how exactly their data is indexed (and why they don't find it later). Solr provides an analysis test page for that (see "Solr's content analysis test page" in [1]). In the case of Jackrabbit, maybe logging the filtered values of fields at the DEBUG level would help.

2) As discussed previously, one problem with this is which analyzer to use when running a query that applies to several fields. In Solr, you can configure a different analyzer for querying; it's probably the best solution. People then have to make sure their config is consistent for indexing and querying, and might in some cases need to provide their own custom QueryAnalyzer to achieve this: for example, one that provides fake synonyms for a token, with each synonym being the result of one of the analysis methods used. This can get tricky depending on the configured analysis, when searching in multiple fields.

See also http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters for more info on how Solr manages the analyzers."
---------------------------------

I think I do not have these two problems with the solution I am aiming for. I'll add one general analyzer to Jackrabbit that looks something like:

    import java.io.Reader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    class JRAnalyzerImpl extends Analyzer {

        // maps field name -> configured analyzer
        private final Map<String, Analyzer> configuredProperties = new HashMap<String, Analyzer>();
        private final Analyzer defaultAnalyzer = new StandardAnalyzer();

        public TokenStream tokenStream(String fieldName, Reader reader) {
            Analyzer analyzer = configuredProperties.get(fieldName);
            if (analyzer != null) {
                return analyzer.tokenStream(fieldName, reader);
            }
            return defaultAnalyzer.tokenStream(fieldName, reader);
        }
    }

Now, all I need to do is hold the configuredProperties map, which maps each field name to its configured analyzer.
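The dispatch logic above can be sketched without any Lucene dependency. Here each "analyzer" is just a tokenizing function, and `PerFieldTokenizer` plays the role of JRAnalyzerImpl; all names in this sketch are illustrative, not Jackrabbit or Lucene API:

```java
import java.util.*;
import java.util.function.Function;

// Minimal stand-in for a Lucene Analyzer: a function from text to tokens.
interface Tokenizer extends Function<String, List<String>> {}

// Mirrors JRAnalyzerImpl: per-field tokenizers with a default fallback.
class PerFieldTokenizer {
    private final Map<String, Tokenizer> configured = new HashMap<>();
    private final Tokenizer defaultTokenizer =
        text -> Arrays.asList(text.toLowerCase().split("\\s+"));

    void configure(String fieldName, Tokenizer t) {
        configured.put(fieldName, t);
    }

    List<String> tokenize(String fieldName, String text) {
        // Same lookup as JRAnalyzerImpl.tokenStream: use the configured
        // tokenizer if one exists for this field, otherwise the default.
        return configured.getOrDefault(fieldName, defaultTokenizer).apply(text);
    }
}
```

For example, a field configured with a case-preserving tokenizer keeps its case, while every unconfigured field falls through to the lowercasing default.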
When running a query over different fields, I use the JRAnalyzerImpl as always; because it returns a different TokenStream per field, I implicitly use a different analyzer for each field that has one configured. Since this one analyzer is used for indexing *and* querying, on a per-field basis, it will always be consistent.

Might this be a better solution for Solr querying as well? It seems overcomplicated to me that people have to take care of choosing an appropriate analyzer for querying, while this does not seem necessary with the approach above. Not finding a hit where you would expect one can be pretty hard to debug, certainly if you don't know where to look, or don't understand Lucene analysis to some extent.

WDYT?

Regards Ard

-Bertrand

[1] http://www.xml.com/lpt/a/1668
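The consistency point can be seen in a small dependency-free sketch (all names here are illustrative): when the same per-field tokenizer is applied at index time and at query time, a query term always matches its indexed form; tokenizing differently on one side is exactly what produces the puzzling "missing hit":

```java
import java.util.*;
import java.util.function.Function;

class IndexQueryConsistency {
    // Tokenizer used for BOTH indexing and querying a field.
    static final Function<String, List<String>> LOWERCASE =
        s -> Arrays.asList(s.toLowerCase().split("\\s+"));

    // Hypothetical mismatched query-time tokenizer that does NOT lowercase.
    static final Function<String, List<String>> KEEP_CASE =
        s -> Arrays.asList(s.split("\\s+"));

    // A hit requires every query token to appear among the indexed tokens.
    static boolean matches(Function<String, List<String>> indexTok,
                           Function<String, List<String>> queryTok,
                           String stored, String query) {
        Set<String> indexed = new HashSet<>(indexTok.apply(stored));
        return indexed.containsAll(queryTok.apply(query));
    }
}
```

With `LOWERCASE` on both sides, the query "HELLO" finds the stored "Hello World"; with `KEEP_CASE` only at query time, the same query silently finds nothing.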
