On Wed, Aug 31, 2011 at 1:30 PM, eks dev <eks...@googlemail.com> wrote:
> I do not think it will be expensive, it is just an attempt to keep
> code smaller, simpler and marginally faster :)

I think you will find the compile is pretty fast, and it only happens once per query (it's not per-segment or anything)... see below.

> those are a lot (Ca 1000) of small prefix based regex-es with limited
> alphabet compiled as RunAutomaton I load on startup and lookup from
> some RunAutomaton[] on request...
>
> they look like Regex("((123)|(124)|(401)|(777)|(351))[0-9]{0,2}")
>
> By the way, what will AutomatonQuery prefer "(XXX)[0-9]{0,2}" or
> "(XXX)[0-9]*" or "(XXX).*" ? Any performance difference?

Well, you would have to benchmark, and it definitely depends on your content.

(XXX)[0-9]{0,2} is the 'simplest' automaton in that it's finite: if you actually have (XXX)[0-9][0-9]<junk>, it will seek past that.

The other two forms you listed are infinite, and when AutomatonQuery finds a 'loop' in the automaton, it temporarily turns itself into a 'filtering range query' whose upper bound is the end of the transition. This prevents it from doing a lot of useless disk seeks.

If you have (XXX)[0-9]*, it is going to seek to (XXX) and then act as a range query up to (XXX)a (exclusive; this just indicates that a is the first valid term after the infinitely long pattern (XXX)999999999999999999999......). Then, for each term in the range, it's going to 'check' that the term matches the automaton.

(XXX).* will be similar to the above, except it's obviously going to accept more terms, e.g. (XXX)m, and its 'range query' will be something like (XXX)->(XXY).

> Semantically are they the same as I know that my content is only 5 digits
>
> I need them to
> 1. formulate complex BooleanQuery, where AutomatonQuery gets one clause
> 2. do post processing (a lot of hits) of the "query against hits" and
> this has to be fast.
>
> I guess, I will switch to keeping only Automaton[] and build
> RunAutomaton on the fly (per request) for fast query vs hits, this is
> done once per request only, but then I need to keep state of the
> RunAutomaton per query... makes things slightly more verbose...

AutomatonQuery computes this stuff a single time, up-front in its constructor. Can you just reuse the AutomatonQuery(s) in your app? This should work fine.

-- 
lucidimagination.com
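
P.S. To make the 'filtering range query' idea above concrete, here is a rough, plain-Java simulation. It is only a sketch and not Lucene's actual code: a sorted TreeSet stands in for the term dictionary, java.util.regex.Pattern stands in for the compiled automaton, and the class and method names (RangeFilterSketch, scanRange) are made up for illustration. For the [0-9]* case the sketch uses "123:" as the exclusive upper bound, since ':' is the character right after '9' in ASCII.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical illustration only: simulates "seek, scan a bounded range,
// check each term" over an in-memory sorted set. Not Lucene code.
class RangeFilterSketch {

    /**
     * Scans the sorted terms in [lower, upper) and keeps only those that
     * the pattern fully matches (the per-term 'check' step).
     */
    public static List<String> scanRange(NavigableSet<String> terms,
                                         String lower, String upper,
                                         Pattern pattern) {
        List<String> hits = new ArrayList<>();
        // subSet gives a view of the range; terms outside it are never visited.
        for (String term : terms.subSet(lower, true, upper, false)) {
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(Arrays.asList(
                "12299", "123", "12345", "1234x", "123m", "124", "40100"));

        // 123[0-9]*: range ["123", "123:"). "1234x" is visited but rejected
        // by the check; "123m" lies outside the range and is never visited.
        System.out.println(
                scanRange(terms, "123", "123:", Pattern.compile("123[0-9]*")));
        // prints [123, 12345]

        // 123.*: wider range ["123", "124"), and every term in it matches.
        System.out.println(
                scanRange(terms, "123", "124", Pattern.compile("123.*")));
        // prints [123, 12345, 1234x, 123m]
    }
}
```

With content that really is digits only, both ranges visit the same terms, so any difference between the two forms would have to come from benchmarking on real data.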