Re: Problems with exact matces on non-tokenized fields...

Stefanos Karasavvidis Thu, 14 Nov 2002 01:53:46 -0800

> Doesn't that one do just that - treats fields differently, based on
> their name?

yes it does, but look at the question's title
"How do I write my own Analyzer?"

if someone has a problem with a non-tokenized field (which was the problem of the mail thread that started this) then he doesn't know that he has to write a custom analyzer, and so he won't be able to find the correct faq entry.

Moreover, the second solution Doug has proposed suites better in some cases and should be included, too. (Doug has written these solutions in a mail to the users list on 27/9/2002 9:24 p.m.)

I still think that there should be a faq entry as I propose in my previous email.

Moreover, there should be an addition to the faq entry
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q15

it states there that it is important to use the same analyzer during indexing and searching. Again this may lead to problems if a field is not tokenized (during indexing it will _not_ get passed through the analyzer, but during searching it get's passed. If the analyzer does not treat that field as a special case, there will be a problem.)

I don't know, maybe I'm missing something here, but it seems obvious to me that non tokenized fields in conjuction with analyzers produce problems which should be mentioned in documenation/faq etc.

Stefanos

Otis Gospodnetic wrote:

Not sure which FAQ entry you are refering to.
This one http://www.jguru.com/faq/view.jsp?EID=1006122 ?

Doesn't that one do just that - treats fields differently, based on
their name?

Otis

--- Stefanos Karasavvidis <[EMAIL PROTECTED]> wrote:

I came accross the same problem and I think that the faq entry you (Otis) propose should get a better title so that users can find more easily an answer to this problem.

Correct me if I'm wrong (and please forgive any wrong assumptions I
may have made), put the problem is on "how to query on a non tokenized
field?"

Problem explanation:
If a field is not tokenized than it is not passed through the
analyzer, independently of the used analyzer (that's what I understand by
looking into DocumentWriter.invertDocument()).
If you construct a query with a given analyzer (for example with QueryParser.parse(query, field, analyzer)) with this field, the queryparser does not know that this field is not tokenized and passes
it through the analyzer. Ther analyzer may alter the query (for example
if the analyzer has a stemming algorithm) and the document is not
matched uppon the query.

The solution:
The solution is to make sure that fields that aren't tokenized during

indexig, are not passed through the analyzer during searching. This
can be done in 2 ways, either by making an analyzer that takes care of
this according to the field, or by constructing a TermQuery with this
field and adding it to the rest of the query

Example:
put here the 2 examples from Doug

Stefanos

Otis Gospodnetic wrote:

Thanks, it's a FAQ entry now:

How do I write my own Analyzer?
http://www.jguru.com/faq/view.jsp?EID=1006122

Otis

--- Doug Cutting <[EMAIL PROTECTED]> wrote:

karl �ie wrote:

I have a Lucene Document with a field named "element" which is

stored

and indexed but not tokenized. The value of the field is "POST" (uppercase). But the only way i can match the field is by entering

"element:POST?" or "element:POST*" in the QueryParser class.

There are two ways to do this.

If this must be entered by users in the query string, then you need
to use a non-lowercasing analyzer for this field. The way to do this

if

you're currently using StandardAnalyzer, is to do something like:

public class MyAnalyzer extends Analyzer {
private Analyzer standard = new StandardAnalyzer();
public TokenStream tokenStream(String field, final Reader
reader) {
if ("element".equals(field)) { // don't tokenize
return new CharTokenizer(reader) {
protected boolean isTokenChar(char c) { return true; }
};
} else { // use standard

analyzer

return standard.tokenStream(field, reader);
}
}
}

Analyzer analyzer = new MyAnalyzer();
Query query = queryParser.parse("... +element:POST", analyzer);

Alternately, if this query field is added by a program, then this

can

be done by bypassing the analyzer for this class, building this clause

directly instead:

Analyzer analyzer = new StandardAnalyzer();
BooleanQuery query = (BooleanQuery)queryParser.parse("...",
analyzer);

// now add the element clause
query.add(new TermQuery(new Term("element", "POST"))), true,
false);

Perhaps this should become an FAQ...

Doug

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@;jakarta.apache.org>
For additional commands, e-mail:
<mailto:lucene-user-help@;jakarta.apache.org>

__________________________________________________
Do you Yahoo!?
New DSL Internet Access from SBC & Yahoo!
http://sbc.yahoo.com

--
To unsubscribe, e-mail:

<mailto:lucene-user-unsubscribe@;jakarta.apache.org>

For additional commands, e-mail:

<mailto:lucene-user-help@;jakarta.apache.org>

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@;jakarta.apache.org>
For additional commands, e-mail:
<mailto:lucene-user-help@;jakarta.apache.org>

__________________________________________________
Do you Yahoo!?
U2 on LAUNCH - Exclusive greatest hits videos
http://launch.yahoo.com/u2

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@;jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org>

--
======================================================================
Stefanos Karasavvidis
Electronics & Computer Engineer
e-mail : [EMAIL PROTECTED]


Multimedia Systems Center S.A.
Kissamou 178
73100 Chania - Crete - Hellas
http://www.multimedia-sa.gr

Tel : +30 821 0 88447
Fax : +30 821 0 88427



--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@;jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org>

Re: Problems with exact matces on non-tokenized fields...

Reply via email to