Re: Lucene Analyzer not used when querying the index ?

Cédric Damioli Fri, 24 Feb 2006 10:34:18 -0800

Thanks a lot Marcel for your answers,


Marcel Reutegger wrote:

Cédric Damioli wrote:
Hi all,
I noticed that no Lucene Analyzer is used when querying the repository: when building the actual Lucene query, o.a.j.c.query.lucene.LuceneQueryBuilder does not make any use of the Analyzer (at least in my case).
in general the analyzer is used for the contains() function to tokenize the fulltext query parameter. however there is one exception to this rule: terms that use wildcards are not tokenized.
the reason for this is a technical one. an analyzer that is based on a grammar will not be able to process such tokens properly.
e.g. if the grammar rule says 'a', 'b' and 'abc' are tokens, then the analyzer would be unable to determine whether 'ab*' should be tokenized or not.
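
The rule described above can be sketched in plain Java, without any Lucene dependency. The class and method names (WildcardBypassSketch, analyze, prepare) are hypothetical, the whitespace-splitting analyzer is only a stand-in for a real one, and treating both '*' and '?' as wildcards is an assumption. This is not the code of LuceneQueryBuilder, just an illustration of "wildcard terms skip the analyzer".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WildcardBypassSketch {

    // Hypothetical stand-in for an analyzer: lower-cases and splits on whitespace.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Assumption: '*' and '?' are the wildcard characters.
    static boolean hasWildcard(String term) {
        return term.indexOf('*') >= 0 || term.indexOf('?') >= 0;
    }

    // Wildcard terms are passed through untouched; everything else is analyzed.
    static List<String> prepare(String queryText) {
        if (hasWildcard(queryText)) {
            List<String> raw = new ArrayList<>();
            raw.add(queryText);
            return raw;
        }
        return analyze(queryText);
    }

    public static void main(String[] args) {
        System.out.println(prepare("Hello There")); // prints [hello, there]
        System.out.println(prepare("*AB*"));        // prints [*AB*]
    }
}
```

In this sketch '*AB*' stays a single raw term, so it can only match an indexed term that itself contains "AB".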
Let me describe my example: I'm using Chinese characters, say A and B.
I set a property named "title" with the value "AB" (the two Chinese characters without any whitespace).
After indexing (with the default StandardAnalyzer), the text has been tokenized and the index contains at least three noticeable terms:
- one associated with the field _PROPERTIES and the value "title\uFFFFAB"
- one associated with the field FULL:title and the value "A"
- one associated with the field FULL:title and the value "B"
After that I try to execute an XPath query like
//*[jcr:contains(@title, '*AB*')]
I of course expected this query to return the previously set property, but I obtained no results.
After looking at the code, I can say that the Analyzer is not called for a WildcardQuery, so my "AB" is not tokenized and furthermore,
if you execute the following query you will get the expected result:
//*[jcr:contains(@title, 'AB')]
assuming A and B are chinese characters, they will get tokenized and the fulltext query is actually a phrase match, similar to searching for 'hello there'.
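
As a rough illustration of the phrase match mentioned above: once both the indexed text and the query are tokenized, the query matches when its tokens appear consecutively and in the same order. The class and method names are hypothetical, the two Chinese characters (written as Unicode escapes) merely stand in for A and B, and a real Lucene phrase query works on term positions in the index rather than on in-memory lists.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PhraseMatchSketch {

    // True if the phrase tokens occur in the document tokens consecutively and in order.
    static boolean phraseMatch(List<String> docTokens, List<String> phraseTokens) {
        if (phraseTokens.isEmpty()) {
            return false; // an empty phrase matches nothing in this sketch
        }
        return Collections.indexOfSubList(docTokens, phraseTokens) >= 0;
    }

    public static void main(String[] args) {
        // "\u4E2D" and "\u6587" stand in for the characters A and B of the example.
        List<String> indexed = Arrays.asList("\u4E2D", "\u6587");

        System.out.println(phraseMatch(indexed, Arrays.asList("\u4E2D", "\u6587"))); // prints true
        System.out.println(phraseMatch(indexed, Arrays.asList("\u6587", "\u4E2D"))); // prints false

        List<String> english = Arrays.asList("well", "hello", "there");
        System.out.println(phraseMatch(english, Arrays.asList("hello", "there")));   // prints true
    }
}
```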

I actually can't use that query, because my application handles both Chinese and Latin-1 characters, and for Latin ones the query needs to be wildcarded; otherwise it would only match exact tokens, which is not what I want.


But I now understand the processing.
The correct behaviour for me is:

- First, tokenize my query string ("AB") using the same tokenizer as Jackrabbit (StandardTokenizer by default).

- Then build the XPath query with a separate clause for each token: //*[jcr:contains(@title, '*A*') and jcr:contains(@title, '*B*')]

- This query gives me the correct answer.

With this processing I can query the index with both Chinese and European strings.
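
For illustration, the tokenize-then-combine approach could look roughly like the following plain Java sketch. Everything in it is an assumption made for the example: the class and method names are hypothetical, the toy tokenizer only imitates the behaviour discussed in this thread (one token per Chinese character, one token per run of other letters or digits) and is not Lucene's StandardTokenizer, and the `//*` prefix follows the query used earlier in the thread. In a real application the tokens would come from the same analyzer the repository is configured with.

```java
import java.util.ArrayList;
import java.util.List;

public class ContainsQuerySketch {

    // Toy tokenizer: each Han ideograph is its own token, runs of other
    // letters/digits form one token, everything else is a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (!han && Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(cp);
                continue;
            }
            // A Han character or a separator ends the current word.
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (han) {
                tokens.add(new String(Character.toChars(cp)));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    // Builds one wildcarded jcr:contains() clause per token, joined with "and".
    static String buildQuery(String property, String userInput) {
        List<String> clauses = new ArrayList<>();
        for (String token : tokenize(userInput)) {
            clauses.add("jcr:contains(@" + property + ", '*" + token + "*')");
        }
        if (clauses.isEmpty()) {
            throw new IllegalArgumentException("no searchable token in input");
        }
        return "//*[" + String.join(" and ", clauses) + "]";
    }

    public static void main(String[] args) {
        // "\u4E2D\u6587" stands in for the two Chinese characters "AB".
        System.out.println(buildQuery("title", "\u4E2D\u6587"));
        System.out.println(buildQuery("title", "hello"));
    }
}
```

Because the tokens produced here contain only letters, digits or ideographs, they cannot contain a quote character; input that is turned into tokens differently would need escaping before being placed inside the XPath string literal.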


Thanks for your help

Regards,

--
Cédric Damioli
Information Systems Project Manager
Solutions CMS
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com
