If I remember correctly it was Baeza-Yates or someone in his group at U 
Santiago who came up with the rotated term indexing. 

Indexing "abc", you explicitly mark end of string and index all rotations using 
a data structure which supports prefix search (such as a trie):

abc$
bc$a
c$ab
$abc

This index directly supports all prefix, suffix, circumfix, and infix wildcard 
searches (a*, *c, a*c, *b*) by transforming the query:

a*      -> $a           (all terms starting with a)
a*c     -> c$a  (all terms starting with a and ending in c)
*c      -> c$           (all terms ending with c)
*b*     -> b            (all terms containing b)

You then simply run a prefix search for the transformed query in the trie of 
rotated strings. Quite similar to suffix arrays, but adds circumfix support 
(a*c) which, if I remember correctly, are not directly supported by standard 
suffix arrays. One can mark up start and end of the indexed term if the same 
trie should also be used for exact match ($abc$).

Cheers, Oli


-----Original Message-----
From: Uwe Schindler [mailto:u...@thetaphi.de] 
Sent: Friday, December 07, 2012 4:34 AM
To: java-user@lucene.apache.org
Subject: RE: Alternative for WildcardQuery with leading *

In general, you seem to need decomposing...: vacancyplan -> tokenized to -> 
vacancyplan, vacancy, plan. Wildcards are in general not really a replacement 
for correct text analysis on the indexing side. Unfortunately, decomposing is a 
hard task, but there are dictionary-based algorithms for e.g. German.

About the question with wildcards: If you really need a wildcard  in front, the 
common use case is to also index all terms in backwards order. E.g. Solr does 
this with ReverseWildCardTokenFilter, which indexes all terms forward and also 
in reverse order with some extra "marker-prefix" on the reverse terms (to be 
able to differentiate between forwards and backwards terms in the same field - 
of course, you could also index the reverse terms into an extra field). The 
query parser then automatically produces a simple PrefixQuery with a 
transformed query term (so it queries the magic extra reverse terms with a 
prefix).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: Clemens Wyss DEV [mailto:clemens...@mysign.ch]
> Sent: Friday, December 07, 2012 10:16 AM
> To: java-user@lucene.apache.org
> Subject: Alternative for WildcardQuery with leading *
> 
> In order to provide suggestions our query also includes a 
> "WildcardQuery with a leading *", which, of course, has a HUGE 
> performance impact :-(
> 
> E.g.
> Say we have indexed "vacancyplan", then if a user typed "plan" he 
> should also be offered "vacancyplan" ...
> 
> How can this feature be implemented without WcQwl* ?
> 
> Any help appreciated
> - Clemens
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to