Re: Performing a like query

Rahil Sun, 01 Oct 2006 14:48:19 -0700

Hi Erick

Thanks for your response. There's a lot to chew on in your reply and Imlooking at the suggestions you've made.

Yeah I have Luke installed and have queried my index but there isn't anygreat explanation Im getting out of it. A query for "6/12" is sent as"TERM:6/12" which is quite straight-forward. I did an explanation of thequery in my code though and got some more information but that toowasn't of much help either.

--
Explanation explain = searcher.explain(query,0);

OUTPUT:
query: +TERM:6/12
explain.getDescription() : weight(TERM:6/12 in 0), product of:
Detail 0 : 0.99999994 = queryWeight(TERM:6/12), product of:
 2.0986123 = idf(docFreq=1)
 0.47650534 = queryNorm

Detail 1 : 0.0 = fieldWeight(TERM:6/12 in 0), product of:
 0.0 = tf(termFreq(TERM:6/12)=0)
 2.0986123 = idf(docFreq=1)
 0.5 = fieldNorm(field=TERM, doc=0)

Number of results returned: 1
SampleLucene.displayIndexResults
SCORE    DESCRIPTIONSTATUS    CONCEPTID    TERM
1.0    0    260278007    6/12 (finding)
--

My tokeniser called BaseAnalyzer extends Analyzer. Since I wanted toretain all non whitespace characters and not just letters and digits, Iintroduced the following block of code in the overridden tokenStream( )


--
public TokenStream tokenStream(String fieldName, Reader reader) {

return new CharTokenizer(reader) {


           protected char normalize(char c) {
                    return Character.toLowerCase(c);
           }
               protected boolean isTokenChar(char c) {
                      boolean type = false;
                      boolean space =   Character.isWhitespace(c);
                      boolean letDig =  Character.isLetterOrDigit(c);

if(letDig && !space) //letter or digit but notwhitespace

                           type = true;

else if(!letDig && !space) //not letter,digitor whitespace (retain non-whitespace characters)

                           type = true;

else if( !letDig && space) //is nota letter or digit but is a whitespace

                           type = false;
               return type;
           }
       };
     }

---

The problem is that when the term "6/12 (finding)" is tokenised, twotokens are generated viz. '6/12' and '(finding)'. Therefore when Isearch for '6/12' this term is returned as in a way it is an EXACT tokenmatch.

However when the term "R-eye=6/12 (finding)" is tokenised it againresults in two tokens viz. 'R-eye=6/12' and '(finding)'. So now if Ilook for '6/12' its no more an exact match since there is no token withthis EXACT value. A simple searcher.search(query) isnt useful to pullout the partial token match.

I think it wont be useful to create separate tokens for "6", "/", "12"or "R","-","eye","=", and so on. Im having a look at the RegexTermEnumand WildcardTermEnum as they might possibily help.

Would appreciate your comments on the BaseAnalyzer tokenizer and queryexplanation Ive received so far.


Thanks
Rahil

Erick Erickson wrote:

Most often, from what I've seen on this e-mail list, unexpectedresults are

because you're not indexing on the tokens you *think* you're indexing. Or
not searching on them. By that I mean that the analyzers you're using are
behaving in ways you don't expect.

That said, I think you're getting exactly what you should. I suspectyou're

indexing tokens as follows
doc1: "6/12"  and "(finding)"
doc2: "R-eye=6/12" and "(finding)"

So it makes perfect sense that searching in 6/12 returns doc1 andsearch on

R-eye=6/12 returns doc 2

So, first question: Have you actually used something like Luke (googleluke

lucene) to examine your index and see if what you've put in there is what

you expect? What analyzer is your custom analyzer built upon and is itdoing

anything you're unaware of (for instance, lower-casing the 'R' in your
second example)?

Here's what I'd do.
1> get Luke and see what's actually in your index.
2> use searcher.explain (?) to see the query you're actually emitting.
3> if you make no headway, post the smallest code snippets you can that
illustrate the problem. Folks would need the indexing AND searching code.

As far as queryies like "contains" in java.... Well sure. Write a filter
that filters on regular expressions or wildcards (you'll need

WildcardTermEnum and RegexTermEnum). Or index things differently (e.g.index"6/12" and "finding" on doc1 and "r". "eye" "6/12" and "finding" ondoc 2.Now your searches for "6/12" will work. Or index "6" "/", "12" and"finding"

on doc1, index similarly for doc2, and use a SpanNearQuery with an
appropriate span value. Or....

This is all gobbldeygook if you haven't gotten a copy of "Lucene InAction",which you should read in order to get the most out of Lucene. It's forthe1.4 code base, but the 2.0 Lucene code base isn't that much different.Moreimportantly, it ties lots of stuff together. Also, the junit teststhat come

along with the Lucene code can be invaluable to show you how to do
something.

Hope this helps
Erick

On 10/1/06, Rahil <[EMAIL PROTECTED]> wrote:


Hi

I have a custom-built Analyzer where I tokenize all non-whitespace
characters as well available in the field "TERM" (which is the only
field being tokenised).

If I now query my index file for a term "6/12" for instance, I get back
only ONE result

SCORE    DESCRIPTIONSTATUS    CONCEPTID    TERM
1.0    0    260278007    6/12 (finding)

instead of TWO. There is another token in the index file of the form

2561280012    0    163939000    R-eye=6/12 (finding)    0    3    en

At first it wasn't quite obvious to me why this was happening. But after
playing around a bit I realised that if I pass a query "R-eye=6/12"
instead, I will get this second result (but not the first one now). Is
there a way to tweak the  Query query = parser.parse(searchString)
method so that I can get both the records if I query for "6/12".
Something like a 'contains' query in Java.

Will appreciate all help. Thanks a lot

Regards
Rahil


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Performing a like query

Reply via email to