RE: Why Lucene has to rewrite queries prior to actual searching?

Itamar Syn-Hershko Wed, 09 Apr 2008 14:32:37 -0700

>> When it is known in advance that "w?rd" and "wor*" will be used in
>> queries a lot, one can write a tokenizer that indexes them so that they can
>> be searched directly.


No, that was not at all what I tried to suggest. I will try to explain
better, please try to be open-minded :)

Lucene accesses the index files in some way upon searching to get the data
for the requested term(s): the documents and fields they appear in, along
with position and frequency data. This is called TermInfo, according to
http://lucene.apache.org/java/docs/fileformats.html#Term%20Dictionary. I'm
not sure what that part of Lucene is called, but what I'm proposing relates
to it and it only.

Instead of receiving "word", "foo" and "bar" from a query and looking up the
TermInfo for each of those terms by simply comparing TermInfo.Term to the
term from the query, this part of Lucene should be smart enough to handle
wildcards or even regex. If "foo*" is received from the query, it would
start by retrieving the TermInfo for just "foo", and then continue adding
more TermInfo structures to its cache (or whatever else it does with them)
until the pattern no longer matches. The idea is to treat every term that
matches the pattern as relevant, instead of doing a simple character
comparison on the terms, and to do all this directly against the index
files and the terms list/info.

I'm aware that this would be a bit tougher to write for mid-word wildcards
(or RegEx) than for prefixes, but it is possible and still better than
rewriting a query.
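
To make the prefix case concrete, here is a minimal sketch in plain Python.
It assumes the term dictionary can be treated as a sorted list of terms; the
names terms_with_prefix and TERMS are hypothetical and this is not Lucene's
API, only an illustration of "seek to the prefix, then read until the
pattern stops matching".

```python
from bisect import bisect_left

def terms_with_prefix(sorted_terms, prefix):
    """Yield every term in a sorted term list that starts with `prefix`."""
    # Seek to the first term >= prefix, as a term dictionary lookup would.
    i = bisect_left(sorted_terms, prefix)
    # Walk forward while terms still match. Because the list is sorted,
    # all terms sharing the prefix are contiguous, so the first miss ends
    # the scan.
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        yield sorted_terms[i]
        i += 1

# Hypothetical term dictionary, kept in sorted order.
TERMS = sorted(["bar", "foo", "foobar", "food", "fop", "word"])
matches = list(terms_with_prefix(TERMS, "foo"))
```

The scan stops at "fop" without ever building a list of clauses. A mid-word
wildcard such as "w?rd" would not form a contiguous range, so it would need
a wider scan with a per-term match test.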

Obviously, queries sent to this searcher will have to pass a lighter
QueryParser, so they keep their wildcards.

This has obvious speed, accuracy, and simplicity gains. As I mentioned
before, it should be offered as another tool for searches - the original
searcher may still be required by other search methods.

Please let me know if this makes more sense now (or not...)

Itamar. 

-----Original Message-----
From: Paul Elschot [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 08, 2008 5:57 PM
To: [email protected]
Subject: Re: Why Lucene has to rewrite queries prior to actual searching?

On Tuesday 08 April 2008 15:18:34, Itamar Syn-Hershko wrote:
> Paul,
>
> I don't see how this answers the question. 

Towards the end, the page describes when a Scorer is called and roughly what
it does.

> I was asking why Lucene
> has to access the index with exact terms, and not use RegEx or simpler 
> wildcards support internally? If Lucene will be able to look for 
> "w?rd" or "wor*" and treat the wildcards as wildcards, this will 
> greatly improve speed of searches and will eliminate the need for 
> Query rewriting.

When it is known in advance that "w?rd" and "wor*" will be used in queries a
lot, one can write a tokenizer that indexes them so that they can be
searched directly.
The problem is to know that in advance, that is at indexing time.
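
One possible reading of this, sketched in plain Python under stated
assumptions: if prefix queries such as "wor*" are expected, the indexing
side can emit each prefix as an extra, marked term, so the query becomes an
exact term lookup. The helper names (index_terms, build_index) and the "*"
marker convention are hypothetical, not part of Lucene.

```python
from collections import defaultdict

def index_terms(token, min_prefix=2):
    """Return the token plus marked prefix terms, e.g. 'wo*', 'wor*', 'word*'."""
    prefixes = [token[:n] + "*" for n in range(min_prefix, len(token) + 1)]
    return [token] + prefixes

def build_index(docs):
    """Build a toy inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            for term in index_terms(token):
                index[term].add(doc_id)
    return index

# Hypothetical documents.
DOCS = {1: "word games", 2: "worm farm", 3: "ward nurse"}
INDEX = build_index(DOCS)

# "wor*" is now an ordinary indexed term, so no rewriting is needed.
hits = sorted(INDEX.get("wor*", set()))
```

The cost is a larger index, and the scheme only covers the patterns chosen
when the index was built; "w?rd" would need a different set of extra terms.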

> Since some people may want to index chars like those used in 
> wildcards, they could be escaped (or, those people will use the 
> standard search classes available today instead). I'm not entirely 
> sure what part of Lucene does the actual access to the terms and 
> position vectors, but if it could be sub-classed or cloned, and then 
> modified to honor wildcards or even RegEx, that would bring Lucene to 
> new heights.

There are regular expression queries in the regex contrib module, however
these work by rewriting to actually indexed terms.
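
As a rough illustration of what "rewriting to actually indexed terms" means,
here is a minimal sketch in plain Python. It is not Lucene's implementation:
rewrite, TooManyClauses and the default cap of 1024 are stand-ins chosen for
the example, and shell-style matching via fnmatchcase stands in for the
wildcard syntax.

```python
from fnmatch import fnmatchcase

class TooManyClauses(Exception):
    """Raised when a pattern expands to more terms than allowed."""

def rewrite(pattern, indexed_terms, max_clauses=1024):
    """Expand a wildcard pattern into an OR over terms present in the index."""
    # Only terms that actually exist in the index can become clauses.
    matching = [t for t in sorted(indexed_terms) if fnmatchcase(t, pattern)]
    if len(matching) > max_clauses:
        raise TooManyClauses(f"{pattern!r} expands to {len(matching)} terms")
    # The rewritten query is a flat OR over plain terms.
    return ("OR", matching)

# Hypothetical set of indexed terms.
TERMS = {"ward", "word", "work", "world", "worm"}
rewritten = rewrite("wor*", TERMS)
```

With a small cap, for example rewrite("wor*", TERMS, max_clauses=2), the
expansion fails, which is the situation that produces the clause-count
error discussed in this thread.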

> Unless, again, there is a specific reason why this can't be done.

There is no specific reason why it cannot be done, one only needs to provide
the corresponding tokenizer to be used at indexing time.

Kind regards,
Paul Elschot


>
> Itamar.
>
> -----Original Message-----
> From: Paul Elschot [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, April 08, 2008 1:56 AM
> To: [email protected]
> Subject: Re: Why Lucene has to rewrite queries prior to actual 
> searching?
>
> Itamar,
>
> Have a look here:
> http://lucene.apache.org/java/2_3_1/scoring.html
>
> Regards,
> Paul Elschot
>
> On Tuesday 08 April 2008 00:34:48, Itamar Syn-Hershko wrote:
> > Paul and John,
> >
> > Thanks for your quick reply.
> >
> > The problem with query rewriting is the aforementioned 
> > MaxClauseException. Instead of inflating the query and passing a 
> > deterministic list of terms to the actual search routine, Lucene 
> > could have accessed the vectors in the index using some sort of 
> > filter. So, for example, if it knows to access "Foobar" by its name 
> > in the index, why can't it take "Foo*" and just get all the vectors 
> > until "Fop" is met (for example)? Why does it have to get a 
> > deterministic list of terms?
> >
> > I will take a look at the Scorer - can you describe in short what 
> > exactly it does and where and when it is being called?
> >
> > I don't get John's comment though - Query::rewrite is being called 
> > prior to the actual searching (through QueryParser), how come it can 
> > use "information gathered from IndexReader at search time"?
> >
> > Itamar.
> >
> > -----Original Message-----
> > From: Paul Elschot [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, April 08, 2008 12:57 AM
> > To: [email protected]
> > Subject: Re: Why Lucene has to rewrite queries prior to actual 
> > searching?
> >
> > Itamar,
> >
> > Query rewrite replaces wildcards with terms available from the 
> > index. Usually that involves replacing a wildcard with a 
> > BooleanQuery that is an effective OR over the available terms while 
> > using a flat coordination factor, i.e. it does not matter how many 
> > of the available terms actually match a document, as long as at 
> > least one matches.
> >
> > For the required query parts (AND like), Scorer.skipTo() is used, 
> > and that could well be the filter mechanism you are referring to; 
> > have a look at the javadocs of Scorer, and, if necessary, at the 
> > actual code of ConjunctionScorer.
> >
> > Regards,
> > Paul Elschot
> >
> > On Monday 07 April 2008 23:13:09, Itamar Syn-Hershko wrote:
> > > Hi all,
> > >
> > > Can one of the experts here explain why Lucene has to get a 
> > > "rewritten" query for the Searcher, so that Phrase or Wildcard 
> > > queries have to rewrite themselves into a "primitive" query that 
> > > is then passed to Lucene to look for? I'm probably not too 
> > > familiar with the internals of Lucene, but I'd imagine that if you 
> > > can inflate a query using wildcards via xxxxQuery subclassing, 
> > > you could as easily (?) have some sort of Filter mechanism during 
> > > the search, so that Lucene retrieves the position vectors for all 
> > > the terms that pass that filter, instead of retrieving only the 
> > > position data for deterministic terms (with no wildcards etc.). If 
> > > that were possible somehow, it could greatly increase the 
> > > searchability of Lucene indices by using RegEx (without rewriting 
> > > and getting the dreaded MaxClauseCount error) and similar.
> > >
> > > Would love to hear some insights on this one.
> > >
> > > Itamar.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
