Apologies for not indenting Bertrand's text properly; I am using webmail. See my comments below the second ------ :
> Bertrand Delacretaz wrote:
----------------------------------
"Yes, given that many Lucene TokenFilters are available, this is useful I think. I see two potential issues that you might want to take into account:

1) With configurable indexing analyzers, people sometimes have a hard time figuring out how exactly their data is indexed (and why they don't find it later). Solr provides an analysis test page for that (see "Solr's content analysis test page" in [1]). In the case of Jackrabbit, maybe logging the filtered values of fields at the DEBUG level would help.

2) As discussed previously, one problem with this is which analyzer to use when running a query that applies to several fields. In Solr, you can configure a different analyzer for querying; it's probably the best solution. People then have to make sure their config is consistent for indexing and querying, and might in some cases need to provide their own custom QueryAnalyzer to achieve this: for example, one that provides fake synonyms for a token, with each synonym being the result of one of the analysis methods used. This can get tricky depending on the configured analysis, when searching in multiple fields.

See also http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters for more info on how Solr manages the analyzers."
---------------------------------

I think I do not have these two problems with the solution I am aiming for. I'll add one general analyzer to Jackrabbit that looks something like:

    import java.io.Reader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    class JRAnalyzerImpl extends Analyzer {

        // maps field name -> configured analyzer
        private final Map<String, Analyzer> configuredProperties = new HashMap<String, Analyzer>();
        private final Analyzer defaultAnalyzer = new StandardAnalyzer();

        public TokenStream tokenStream(String fieldName, Reader reader) {
            Analyzer analyzer = configuredProperties.get(fieldName);
            if (analyzer != null) {
                return analyzer.tokenStream(fieldName, reader);
            }
            return defaultAnalyzer.tokenStream(fieldName, reader);
        }
    }

Now, all I need to do is hold the configuredProperties map, which maps each field name to its configured analyzer.
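The dispatch logic above can be sketched without any Lucene dependency. Here each "analyzer" is just a tokenizing function, and `PerFieldTokenizer` plays the role of JRAnalyzerImpl; all names in this sketch are illustrative, not Jackrabbit or Lucene API:

```java
import java.util.*;
import java.util.function.Function;

// Minimal stand-in for a Lucene Analyzer: a function from text to tokens.
interface Tokenizer extends Function<String, List<String>> {}

// Mirrors JRAnalyzerImpl: per-field tokenizers with a default fallback.
class PerFieldTokenizer {
    private final Map<String, Tokenizer> configured = new HashMap<>();
    private final Tokenizer defaultTokenizer =
        text -> Arrays.asList(text.toLowerCase().split("\\s+"));

    void configure(String fieldName, Tokenizer t) {
        configured.put(fieldName, t);
    }

    List<String> tokenize(String fieldName, String text) {
        // Same lookup as JRAnalyzerImpl.tokenStream: use the configured
        // tokenizer if one exists for this field, otherwise the default.
        return configured.getOrDefault(fieldName, defaultTokenizer).apply(text);
    }
}
```

For example, a field configured with a case-preserving tokenizer keeps its case, while every unconfigured field falls through to the lowercasing default.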
When running a query over different fields, I use the JRAnalyzerImpl as always; because it returns a different TokenStream per field, I implicitly use a different analyzer for each field that has one configured. Since this one analyzer is used for indexing *and* querying, on a per-field basis, it will always be consistent.

Might this be a better solution for Solr querying as well? It seems overcomplicated to me that people have to take care of choosing an appropriate analyzer for querying, while this does not seem necessary with the approach above. Not finding a hit where you would expect one can be pretty hard to debug, certainly if you don't know where to look, or don't understand Lucene analysis to some extent.

WDYT?

Regards Ard

-Bertrand

[1] http://www.xml.com/lpt/a/1668
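The consistency point can be seen in a small dependency-free sketch (all names here are illustrative): when the same per-field tokenizer is applied at index time and at query time, a query term always matches its indexed form; tokenizing differently on one side is exactly what produces the puzzling "missing hit":

```java
import java.util.*;
import java.util.function.Function;

class IndexQueryConsistency {
    // Tokenizer used for BOTH indexing and querying a field.
    static final Function<String, List<String>> LOWERCASE =
        s -> Arrays.asList(s.toLowerCase().split("\\s+"));

    // Hypothetical mismatched query-time tokenizer that does NOT lowercase.
    static final Function<String, List<String>> KEEP_CASE =
        s -> Arrays.asList(s.split("\\s+"));

    // A hit requires every query token to appear among the indexed tokens.
    static boolean matches(Function<String, List<String>> indexTok,
                           Function<String, List<String>> queryTok,
                           String stored, String query) {
        Set<String> indexed = new HashSet<>(indexTok.apply(stored));
        return indexed.containsAll(queryTok.apply(query));
    }
}
```

With `LOWERCASE` on both sides, the query "HELLO" finds the stored "Hello World"; with `KEEP_CASE` only at query time, the same query silently finds nothing.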
