Hi Arun,
this is hard to explain with one single reason - To conclude:
- A wildcard query that consists only of a prefix ("ab*") is internally
implemented by iterating all terms starting with "ab" and ending with the one
before "ac"). This is not different to Lucene 3.x and there is no room for
optimization. If you look at
o.a.l.util.automaton.CompiledAutomaton#getTermsEnum(), you see that prefix
terms are implemented by PrefixTermsEnum, that uses the above algorithm and is
mostly a copy of the Lucene 3.x PrefixTermEnum.
- A more complex wildcard or regex query that does not solely consist of a
prefix but more restrictions after the prefix is faster in 4.x than in Lucene
3.x, because Lucene 3.x was only able to scan all terms starting with the
common prefix ("ab*xy" would have approx. same speed as a prefix "ab*", it just
removes all terms not matching *xy in a second step). This is also the
improvement in FuzzyQuery: In Lucene 3.x a FuzzyQuery has no prefix at all
(unless you specify one, but that limits the fuzzy use), so it has to scan all
terms in the term dictionary and filter them against the Levenshtein distance.
In Lucene 4.x, it can use an automaton, to effectively seek inside the term
enum to find all terms that match the (non-prefix) regex, wildcard or
Levensthein distance. In short: Lucene 4 is able to step over terms that cannot
match the automaton. A cool example is the following regex: "(ht|f)tp://.*". In
Lucene 3.x this would scan all terms because there is no common prefix, in 4.x
this expands to something similar to the intersection of 2 prefix queries, so
the automaton intersection between the regex and the term dictionary will seek
to term "ftp://" first, iterate until the last term of this prefix and then
seeks to "http://", doing the same.
If you only have a prefix query (wildcard "ab*" or regex "ab.*"), then the
query is executed in the same way like in Lucene 3.x, no automatons are
involved at all! The reason why you see a speed improvement is not caused by
the automaton implementation (it's not used at all), but it is caused by a
complete rewrite of Lucene's algorithms regarding the term dictionary and the
posting lists (Lucene Codecs). E.g. Lucene 4.0 uses a new term dictionary based
on FST, Lucene 4.0 no longer creates Strings all the time (all terms are just
slices inside a large byte[]),... and many more improvements. Also when
scanning the posting lists to find all documents related to the terms found,
Lucene uses different data structures (e.g., block postings in 4.1). So it’s a
number of improvements that speed up the index (also while indexing!). Just
review the numerous presentations on conferences or the blog posts by Mike
McCandless (http://blog.mikemccandless.com/).
> - Can we do something to improve speed for such queries ?
You cannot improve speed of pure prefix queries. Maybe another data structure
is helpful? E.g. if you scan for dates as string (queries on dates like
"201301*"), it might be better to use a numeric field and use
NumericRangeQuery(20130101, 20130131)? Another approach for improving prefix
queries is indexing additional terms: If you are always searching for a 2-char
prefix for "ab*", then simply index an additional term in a separate field with
2 chars (e.g., "ab") in your documents while indexing and just search for "ab"?
This is very fast! This also applies to Lucene 3.x!
Uwe
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]
> -----Original Message-----
> From: Arun Kumar K [mailto:[email protected]]
> Sent: Friday, March 29, 2013 11:13 AM
> To: java-user
> Subject: Re: Wild Card Query Performance
>
> Hi Uwe,
>
> Thanks for the info.
> You were mentioning about term dictionary and and other index
> components. I didn't get this.
> - What could be the other factors that improve the speed of such query ?
> Can u explain or give some pointers to this ?
> - Can we do something to improve speed for such queries ?
>
> Also other observation is indexing time has increased by around 6% in 4.0.
>
> Arun
>
> On Fri, Mar 29, 2013 at 3:25 PM, Uwe Schindler <[email protected]> wrote:
>
> > Hi,
> >
> > It depends on the type of wildcard query. If you only have a prefix
> > (ab*), they rewrite to a simple PrefixQuery and this one is
> > implemented exactly like in 3.x, so you only see the speed
> > improvements of Lucene 4.0 in the term dictionary and and other index
> > components, not related to the query itsself.
> >
> > If you have wildcards like ab?xy, then this query will be multiple
> > times faster than in 3.x, because the "?" wildcard can only expand to
> > a limited set of terms, while in Lucene 3.x, it still scans all terms
> > with prefix "ab". The same applies to other wildcard constructs, if
> > they limit more than just prefix.
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: [email protected]
> >
> >
> > > -----Original Message-----
> > > From: Arun Kumar K [mailto:[email protected]]
> > > Sent: Friday, March 29, 2013 10:38 AM
> > > To: java-user
> > > Subject: Wild Card Query Performance
> > >
> > > Hi Guys,
> > >
> > > I have been testing the search time improvement in Lucene 4.0 from
> > > Lucene
> > > 3.0.2 version for Wildcard Queries (with atleast say 2 chars Eg.ar*).
> > >
> > > For a 2GB size index with 4000000 docs, the following observations
> > > were
> > > made:
> > >
> > > Around 3X improvement with and without STRING sort on a sortable
> field.
> > >
> > > I guess this improvement is because of the Automation Query by
> > > Robert which is used in WildCard Queries.
> > >
> > > As per mike's blog, FuzzyQueries are 100X times faster in 4.0 but
> > > these wildcard queries are not that faster comparatively.
> > >
> > > I have used default codecs and postings format.
> > >
> > > Did i miss something or is it the max improvement that we can expect
> > > currently for WildCard Queries?
> > >
> > >
> > > Arun
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]