Re: Proper use of TermsEnum.seek?

Simon Willnauer Tue, 22 Feb 2011 03:20:20 -0800

On Tue, Feb 22, 2011 at 11:55 AM, Toke Eskildsen <[email protected]> 
wrote:
> On Mon, 2011-02-21 at 16:00 +0100, Simon Willnauer wrote:
>> For all real codecs seek(BR, TermState) should be as fast as it gets.
>> There are some codecs which simply forward to seek(BR) so if you have
> >> the TermState already you won't lose anything. This might also answer
>> your other question, if you pass an empty BytesRef to a codec that did
>> not override the seek(BR, TermState) method it will seek to the empty
>> term and your code might not work anymore.
>
> Thanks, that makes sense.
>
> It seems to me that I'll have to use the strategy pattern and make a
> TermsEnum-implementation-aware wrapper (or rather codec-aware?), if I
> want the "best" ordinal-seeker.
>
> Toke:
>> > I tried calling with an empty BytesRef term. This gave me an empty
>> > result back for the call itself, but the correct terms for subsequent
>> > calls to next. This works perfectly for my scenario. However, that was
>> > just an experiment using the default variable gap codec, so I am unsure
>> > if I can count on this behavior for any given codec?
>>
>> what do you mean by an empty result for the call itself?
>
> Sorry, I mixed things up. I mean I tried calling with an empty term and
> getting the term with the term()-method, which returned an empty
> BytesRef after the initial call.
Ah ok that makes sense ;) thanks for clarifying.


> Anyway, since codecs are free to fall
> back to BytesRef-seek, my options are reduced to
> seek(BytesRef, TermState) with real values
> or
> seek(BytesRef), which I expect is normally O(log n) or better.
Currently all of our 'real' codecs support the fast TermState
lookup, which is O(1) in those cases.

>
>> can't you use a codec that supports ord for your facet / sort fields?
>
> That was also Mike McCandless' suggestion in
> https://issues.apache.org/jira/browse/LUCENE-2843
>
> I think this might be counter-productive. If a non-ordinal-supporting
> codec has significantly lower impact on memory, the extra bookkeeping
> for a BytesRef/TermState-seek-cache might be small enough so that the
> total overhead is still less than that of an ordinal-supporting codec.

I don't know how you implemented that part, but you might consider
using something like ByteBlockPool instead of BytesRef instances to
save some extra memory. As a hint, you can look at BytesRefHash for
an example.
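
To illustrate the idea behind that hint: instead of one object per cached
term, all term bytes can live in a single shared array with an int offset
per term. The sketch below is plain Java and does not use Lucene at all;
the class name PackedTerms and its methods are hypothetical. As far as I
know, ByteBlockPool allocates fixed-size blocks rather than one growing
array, so treat this only as a picture of the memory layout, not of the
real class.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical append-only term store: all term bytes share one array. */
final class PackedTerms {
    private byte[] bytes = new byte[16]; // shared storage for every term
    private int[] starts = new int[4];   // starts[i] = offset of term i; starts[count] = end of data
    private int count = 0;

    /** Appends a term and returns its index. */
    int add(String term) {
        byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
        int start = starts[count];
        int end = start + utf8.length;
        if (end > bytes.length) {
            // Grow the shared array; at least double to keep appends amortized O(1).
            bytes = Arrays.copyOf(bytes, Math.max(end, bytes.length * 2));
        }
        if (count + 2 > starts.length) {
            starts = Arrays.copyOf(starts, starts.length * 2);
        }
        System.arraycopy(utf8, 0, bytes, start, utf8.length);
        starts[count + 1] = end;
        return count++;
    }

    /** Decodes the term stored at the given index. */
    String get(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int start = starts[index];
        return new String(bytes, start, starts[index + 1] - start, StandardCharsets.UTF_8);
    }

    int size() {
        return count;
    }
}
```

Per term this costs the term's bytes plus one int, with no per-term object
header or reference, which is where the saving over individual BytesRef
instances would come from.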
>
> I did try a quick experiment with the variable gap vs. fixed gap codec,
> where I kept every 32nd BytesRef+TermState for the variable gap. With a
> 50M term field, this increased the overhead from 600MB to 800MB (or
> about 130 bytes for each BytesRef/TermState-pair, ignoring the
> memory-impact-difference for variable vs. fixed). This clearly does not
> support my theory. I'll have to make a proper test, but a strong
> recommendation of using an ordinal-supporting codec might very well be the
> best solution.
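
For readers following the numbers: keeping every 32nd term of 50M terms
gives about 1.56M cached pairs, and 200MB spread over those is roughly
128 bytes per pair, which matches the "about 130 bytes" above. The
checkpoint idea itself can be sketched without Lucene. In the sketch
below, the class name SparseOrdIndex, the length-prefixed byte array
standing in for a sequentially readable term dictionary, and the int byte
offset standing in for a TermState are all assumptions made for
illustration; none of this is Lucene API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical ordinal lookup over a sequential dictionary via sparse checkpoints. */
final class SparseOrdIndex {
    private final byte[] dict;       // entries: 1 length byte followed by the UTF-8 term bytes
    private final int[] checkpoints; // byte offset of every interval-th term (stand-in for TermState)
    private final int interval;
    private final int termCount;

    SparseOrdIndex(List<String> sortedTerms, int interval) {
        if (interval < 1) {
            throw new IllegalArgumentException("interval must be >= 1");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int[] cps = new int[(sortedTerms.size() + interval - 1) / interval];
        int ord = 0;
        for (String term : sortedTerms) {
            if (ord % interval == 0) {
                cps[ord / interval] = out.size(); // remember where this term starts
            }
            byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
            if (utf8.length > 255) {
                throw new IllegalArgumentException("term too long for this sketch");
            }
            out.write(utf8.length);
            out.write(utf8, 0, utf8.length);
            ord++;
        }
        this.dict = out.toByteArray();
        this.checkpoints = cps;
        this.interval = interval;
        this.termCount = ord;
    }

    /** Jumps to the nearest checkpoint at or before ord, then skips forward entry by entry. */
    String termForOrd(int ord) {
        if (ord < 0 || ord >= termCount) {
            throw new IndexOutOfBoundsException("ord " + ord);
        }
        int pos = checkpoints[ord / interval];
        for (int i = ord % interval; i > 0; i--) {
            pos += 1 + (dict[pos] & 0xFF); // skip one length-prefixed entry
        }
        int len = dict[pos] & 0xFF;
        return new String(dict, pos + 1, len, StandardCharsets.UTF_8);
    }
}
```

Each lookup does one array read plus at most interval - 1 skips, so the
interval trades memory for scan length. The sketch stores only an int per
checkpoint; the measured overhead above also includes the cached BytesRef
and the real TermState object, which this stand-in does not model.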

I think we need to check if that BytesRef is really needed. I hope we
can get rid of it eventually.

simon
>
> Thanks for helping,
> Toke Eskildsen
>
>

