On Tue, Feb 22, 2011 at 11:55 AM, Toke Eskildsen <[email protected]> wrote:
> On Mon, 2011-02-21 at 16:00 +0100, Simon Willnauer wrote:
>> For all real codecs seek(BR, TermState) should be as fast as it gets.
>> There are some codecs which simply forward to seek(BR) so if you have
>> the TermState already you won't lose anything. This might also answer
>> your other question, if you pass an empty BytesRef to a codec that did
>> not override the seek(BR, TermState) method it will seek to the empty
>> term and your code might not work anymore.
>
> Thanks, that makes sense.
>
> It seems to me that I'll have to use the strategy pattern and make a
> TermsEnum-implementation-aware wrapper (or rather codec-aware?), if I
> want the "best" ordinal-seeker.
>
> Toke:
>> > I tried calling with an empty BytesRef term. This gave me an empty
>> > result back for the call itself, but the correct terms for subsequent
>> > calls to next. This works perfectly for my scenario. However, that was
>> > just an experiment using the default variable gap codec, so I am unsure
>> > if I can count on this behavior for any given codec?
>>
>> what do you mean by an empty result for the call itself?
>
> Sorry, I mixed things up. I mean I tried calling with an empty term and
> getting the term with the term()-method, which returned an empty
> BytesRef after the initial call.

Ah ok that makes sense ;) thanks for clarifying.

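To make the forwarding behaviour discussed above concrete, here is a minimal sketch in plain Java. It deliberately does not use the Lucene API: `ForwardingTermsEnum`, `StateAwareTermsEnum`, `SeekDemo`, and the use of an `Integer` list index as the "term state" are all hypothetical stand-ins, and terms are modelled as `String` instead of `BytesRef`. It only illustrates why an empty term is harmless when the state-based seek is overridden and why it can go wrong when the call is forwarded to the term-only seek.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a codec's terms enum; NOT Lucene's TermsEnum. */
class ForwardingTermsEnum {
    protected final List<String> terms; // must be sorted
    protected int pos = -1;             // -1 means "not positioned"

    ForwardingTermsEnum(List<String> sortedTerms) {
        this.terms = sortedTerms;
    }

    /** Term-only seek: binary search, O(log n). Lands on the first term >= target. */
    boolean seek(String target) {
        int i = Collections.binarySearch(terms, target);
        pos = (i >= 0) ? i : -i - 1;
        return i >= 0;
    }

    /** Default assumed in this sketch: ignore the state and forward to seek(target). */
    boolean seek(String target, Object state) {
        return seek(target);
    }

    /** Current term, or null if the enum is unpositioned or past the end. */
    String term() {
        return (pos >= 0 && pos < terms.size()) ? terms.get(pos) : null;
    }
}

/** Hypothetical enum that overrides the state-based seek and restores the position in O(1). */
class StateAwareTermsEnum extends ForwardingTermsEnum {
    StateAwareTermsEnum(List<String> sortedTerms) {
        super(sortedTerms);
    }

    @Override
    boolean seek(String target, Object state) {
        pos = (Integer) state; // the target term is not consulted at all
        return true;
    }
}

class SeekDemo {
    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "banana", "cherry");
        Object cherryState = Integer.valueOf(2); // as if saved while positioned on "cherry"

        ForwardingTermsEnum aware = new StateAwareTermsEnum(terms);
        aware.seek("", cherryState);
        System.out.println(aware.term());      // the saved position wins

        ForwardingTermsEnum forwarding = new ForwardingTermsEnum(terms);
        forwarding.seek("", cherryState);
        System.out.println(forwarding.term()); // the empty term wins
    }
}
```

In this toy model the state-aware enum ends up on "cherry", while the forwarding enum ends up at the start of the list ("apple"), because it only ever looked at the empty term. What a real codec does after seeking to the empty term may differ, so passing the real term together with the state is the safe choice.
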
> Anyway, since codecs are free to fall
> back to BytesRef-seek, my options are reduced to
> seek(BytesRef, TermState) with real values
> or
> seek(BytesRef) which I expect is normally log(n) or better.

Currently all of our 'real' codecs support the fast TermState lookup,
which is O(1) in those cases.

>
>> can't you use a codec that supports ord for your facet / sort fields?
>
> That was also Mike McCandless' suggestion in
> https://issues.apache.org/jira/browse/LUCENE-2843
>
> I think this might be counter-productive. If a non-ordinal-supporting
> codec has significantly lower impact on memory, the extra bookkeeping
> for a BytesRef/TermState-seek-cache might be small enough so that the
> total overhead is still less than that of an ordinal-supporting codec.

I don't know how you implemented that part, but you might consider
using something like ByteBlockPool instead of BytesRef instances to save
some extra memory. As a hint, you can look at BytesRefHash for an
example.

>
> I did try a quick experiment with the variable gap vs. fixed gap codec,
> where I kept every 32nd BytesRef+TermState for the variable gap. With a
> 50M term field, this increased the overhead from 600MB to 800MB (or
> about 130 bytes for each BytesRef/TermState-pair, ignoring the
> memory-impact-difference for variable vs. fixed). This clearly does not
> support my theory. I'll have to make a proper test, but a strong
> recommendation of using ordinal-supporting codec might very well be the
> best solution.

I think we need to check if that BytesRef is really needed. I hope we
can get rid of it eventually.

simon

>
> Thanks for helping,
> Toke Eskildsen
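For readers following the thread, the "keep every 32nd BytesRef+TermState" experiment can be sketched roughly as below. This is only an illustration under stated assumptions, not Toke's actual code and not the Lucene API: `ToyTermsEnum`, `SampledOrdinalSeeker`, `seekOrd`, and `cachedPairs` are hypothetical names, terms are plain `String`s, and the "term state" is just a list index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical minimal enum over a sorted term list; NOT Lucene's TermsEnum. */
class ToyTermsEnum {
    private final List<String> terms;
    private int pos = -1;

    ToyTermsEnum(List<String> sortedTerms) { this.terms = sortedTerms; }

    /** Advances to the next term; returns null once the terms are exhausted. */
    String next() {
        if (pos + 1 >= terms.size()) return null;
        return terms.get(++pos);
    }

    String term() { return pos >= 0 ? terms.get(pos) : null; }

    /** Opaque snapshot of the current position (here simply the list index). */
    Object termState() { return Integer.valueOf(pos); }

    /** Restores a snapshot in O(1); the term is unused in this toy model. */
    void seek(String term, Object state) { pos = (Integer) state; }
}

/** Maps ordinals to terms by caching every interval-th (term, state) pair. */
class SampledOrdinalSeeker {
    private final ToyTermsEnum te;
    private final int interval;
    private final List<String> sampleTerms = new ArrayList<>();
    private final List<Object> sampleStates = new ArrayList<>();
    private final long size;

    SampledOrdinalSeeker(ToyTermsEnum te, int interval) {
        this.te = te;
        this.interval = interval;
        long ord = 0;
        // One full pass over the terms, remembering every interval-th position.
        for (String t = te.next(); t != null; t = te.next()) {
            if (ord % interval == 0) {
                sampleTerms.add(t);
                sampleStates.add(te.termState());
            }
            ord++;
        }
        this.size = ord;
    }

    /** Returns the term at ord using one seek plus at most interval-1 next() calls. */
    String seekOrd(long ord) {
        if (ord < 0 || ord >= size) {
            throw new IndexOutOfBoundsException("ord out of range: " + ord);
        }
        int sample = (int) (ord / interval);
        te.seek(sampleTerms.get(sample), sampleStates.get(sample));
        for (long i = (long) sample * interval; i < ord; i++) {
            te.next();
        }
        return te.term();
    }

    int cachedPairs() { return sampleTerms.size(); }
}

class SampledSeekDemo {
    public static void main(String[] args) {
        List<String> terms = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h", "i", "j");
        SampledOrdinalSeeker seeker = new SampledOrdinalSeeker(new ToyTermsEnum(terms), 4);
        System.out.println(seeker.seekOrd(5));    // term at ordinal 5
        System.out.println(seeker.cachedPairs()); // number of cached (term, state) pairs
    }
}
```

With an interval of 32 and 50M terms, such a cache holds 50,000,000 / 32 = 1,562,500 pairs. The reported growth from 600MB to 800MB then works out to 200,000,000 / 1,562,500 = 128 bytes per pair (taking 1MB as 10^6 bytes), which matches the "about 130 bytes" figure quoted above.
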
