Re: Lucene Analyzer not used when querying the index ?

Marcel Reutegger Fri, 24 Feb 2006 00:44:21 -0800

Cédric Damioli wrote:

Hi all,
I noticed that no Lucene Analyzer is used when querying the repository: when building the actual Lucene query, the o.a.j.c.query.lucene.LuceneQueryBuilder does not make any use of the Analyzer (at least in my case).

in general the analyzer is used for the contains() function to tokenize the fulltext query parameter. however there is one exception to this rule: terms that use wildcards are not tokenized.

the reason for this is a technical one. an analyzer that is based on a grammar will not be able to process such tokens properly.

e.g. if the grammar rule says 'a', 'b' and 'abc' are tokens, then the analyzer would be unable to determine whether 'ab*' should be tokenized or not.
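
The ambiguity in the 'a', 'b', 'abc' example can be sketched with a toy longest-match tokenizer. This is only an illustration under stated assumptions: the class name WildcardAmbiguity and its token table are hypothetical and have nothing to do with the real Lucene analyzers. It shows that "ab" and "abc" tokenize differently, so a term like 'ab*' cannot be tokenized before it is known what the wildcard stands for.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy longest-match tokenizer for a grammar whose tokens are 'a', 'b' and 'abc'. */
class WildcardAmbiguity {
    // Hypothetical grammar taken from the example; the longest token is listed first.
    static final String[] TOKENS = {"abc", "a", "b"};

    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            String match = null;
            for (String token : TOKENS) {
                if (text.startsWith(token, pos)) {
                    match = token;
                    break;
                }
            }
            if (match == null) {
                pos++; // skip characters the grammar does not know
                continue;
            }
            out.add(match);
            pos += match.length();
        }
        return out;
    }

    public static void main(String[] args) {
        // Two possible expansions of 'ab*' produce unrelated token streams.
        System.out.println(tokenize("ab"));  // [a, b]
        System.out.println(tokenize("abc")); // [abc]
    }
}
```

Whether 'ab*' should become the tokens 'a' and 'b' or stay a prefix of 'abc' depends on text the analyzer never sees, which is why leaving wildcard terms untokenized is a reasonable choice.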

Let me describe my example: I'm using Chinese characters, say A and B. I set a property named "title" with the value "AB" (the two Chinese characters without any whitespace). After indexing (with the default StandardAnalyzer) the text has been tokenized and the index contains at least three noticeable terms:
- one associated with the field _PROPERTIES and the value "title\uFFFFAB"
- one associated with the field FULL:title and the value "A"
- one associated with the field FULL:title and the value "B"
After that I try to execute an XPath query like
//*[jcr:contains(@title,'*AB*')]
I of course expected this query to return the previously set property, but I obtained no results. After looking at the code, I can say that the Analyzer is not called for a WildcardQuery, so my "AB" is not tokenized and furthermore,


if you execute the following query you will get the expected result:
//*[jcr:contains(@title, 'AB')]

assuming A and B are Chinese characters, they will get tokenized and the fulltext query is actually a phrase match, similar to searching for 'hello there'.
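
The difference between the two queries can be sketched as follows. This is a simplified model, not Jackrabbit or Lucene code: the class PhraseMatchSketch and its methods are hypothetical, and it assumes, as described in this thread, that each Chinese character is indexed as its own term. A phrase match looks for the query tokens at consecutive positions, while an untokenized wildcard term is compared against each indexed term on its own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PhraseMatchSketch {
    /** Assumption: every non-whitespace character becomes its own token. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    /** True if the query tokens occur in the document at consecutive positions. */
    static boolean phraseMatch(List<String> doc, List<String> query) {
        return !query.isEmpty() && Collections.indexOfSubList(doc, query) >= 0;
    }

    /** Models an untokenized '*infix*' term: some single indexed term must contain it. */
    static boolean wildcardTermMatch(List<String> doc, String infix) {
        return doc.stream().anyMatch(term -> term.contains(infix));
    }

    public static void main(String[] args) {
        String value = "\u4e2d\u6587"; // two Chinese characters standing in for "AB"
        List<String> indexed = tokenize(value);

        System.out.println(phraseMatch(indexed, tokenize(value))); // true
        System.out.println(wildcardTermMatch(indexed, value));     // false
    }
}
```

In this model the phrase query succeeds because both characters are found next to each other, whereas the wildcard term fails because no single indexed term contains both characters.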

it seems that the _PROPERTIES field is not used when searching, otherwise, I think it would match.


the PROPERTIES field is only used for jcr:like and other operators.
e.g. you can search the workspace with the following query:
//*[jcr:like(@title, '%AB%')]

this will internally use the PROPERTIES field.
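
As a rough sketch of why jcr:like finds the node: the pattern is matched against the whole, untokenized property value. The code below is not Jackrabbit's implementation; LikeSketch is a hypothetical helper that assumes SQL-style semantics ('%' for any sequence of characters, '_' for exactly one character) and ignores escape characters.

```java
import java.util.regex.Pattern;

class LikeSketch {
    /** Translates a LIKE pattern into a regular expression (escape characters not handled). */
    static Pattern likeToRegex(String like) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < like.length(); i++) {
            char c = like.charAt(i);
            if (c == '%' || c == '_') {
                // Flush the literal text collected so far, quoted so it matches verbatim.
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '%' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** True if the complete value matches the LIKE pattern. */
    static boolean like(String value, String pattern) {
        return likeToRegex(pattern).matcher(value).matches();
    }

    public static void main(String[] args) {
        String value = "\u4e2d\u6587"; // two Chinese characters standing in for "AB"
        System.out.println(like(value, "%" + value + "%")); // true
    }
}
```

Because the value is compared as one string, it does not matter how an analyzer would have split it.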

I know that StandardAnalyzer is not the best suited for handling Chinese text, but that's another story.

there might be implementations that are better suited for Chinese text, but I think StandardAnalyzer does a pretty good job.

It seems to me that there may be a Jackrabbit problem here, so I wanted to have your feelings about this.


What you described is imo expected behaviour in jackrabbit.

Regarding analyzers, you can configure one on a per-workspace basis and use any of the many available analyzers, e.g. those from the Lucene website.


regards
 marcel
