On Mon, 2019-02-11 at 13:42 +0100, Daniel Penning wrote:
> I think sorting and searching on docvalues can benefit from similar
> optimizations like impacts do when scoring is used. By storing the
> minimum and maximum value for blocks of docvalues it would be
> possible to skip over sets of documents that don't contain any values
> in the relevant range.
Nice idea. The docID-indexed blocks are handled by IndexedDISI, where each block represents 65536 docs. It would be simple enough to enrich these with min & max, although it would mean 16 bytes extra for each block, where the current minimum is 0 bytes + 8 bytes for the jump table entry. It could be made into an option of course. As Alan suggests, the jump tables would hold the values for maximum skippiness.

I am unsure how much API change it would require for the values to be used. Adding a method such as nextDocIDWhereNumericMightBeHigherThan() seems awfully specialized and might require knowledge of the block size used inside of IndexedDISI, which is implementation specific. I do like the general idea of "can we somehow skip ahead?", though.

Another sore point is how much this would statistically help. What are the chances that a block can be skipped at all? It depends on the distribution of the values themselves: sorting descending on a long-tail distribution might work very well, whereas a more uniform distribution would result in no speed-up at all. A date field in a time-series corpus, indexed in chronological order, would work extremely well sorting ascending and extremely poorly descending (possibly it's the other way around. It's getting late here).

So yeah, it is definitely technically possible and a very interesting idea. And maybe very specialized?

Cc: to Daniel as the thread is a month old.

- Toke Eskildsen, Royal Danish Library

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
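P.S. For concreteness, a minimal Java sketch of the block-level min/max skipping discussed above. All names here (BlockMinMaxSkipper, nextDocWhereValueMightExceed) are invented for illustration and are not part of Lucene's IndexedDISI API; a real implementation would presumably read the min/max from the enriched jump table entries rather than hold them in arrays.

```java
// Hypothetical sketch, NOT Lucene code: per-block min/max metadata used to
// skip whole 65536-doc blocks whose value range cannot match a query bound.
public class BlockMinMaxSkipper {
    private static final int BLOCK_SIZE = 65536; // docs per IndexedDISI block

    private final long[] blockMin; // minimum docvalue per block
    private final long[] blockMax; // maximum docvalue per block

    public BlockMinMaxSkipper(long[] blockMin, long[] blockMax) {
        this.blockMin = blockMin;
        this.blockMax = blockMax;
    }

    /**
     * Returns the first docID at or after fromDoc whose block might contain
     * a value greater than threshold, or -1 if no such block exists.
     * Blocks whose recorded max is <= threshold are skipped wholesale.
     */
    public int nextDocWhereValueMightExceed(int fromDoc, long threshold) {
        for (int block = fromDoc / BLOCK_SIZE; block < blockMax.length; block++) {
            if (blockMax[block] > threshold) {
                return Math.max(fromDoc, block * BLOCK_SIZE);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Three blocks with value ranges [0,10], [5,20], [100,200].
        BlockMinMaxSkipper s = new BlockMinMaxSkipper(
            new long[] {0, 5, 100}, new long[] {10, 20, 200});
        // Looking for values > 50 skips the first two blocks entirely,
        // landing on the first docID of the third block.
        System.out.println(s.nextDocWhereValueMightExceed(0, 50)); // 131072
    }
}
```

This also illustrates the API concern above: the caller never sees BLOCK_SIZE, but the granularity of skipping is still dictated by it.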
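P.P.S. To get a feel for how much the value distribution matters, here is a toy experiment (plain Java, no Lucene) counting how many fixed-size blocks are fully skippable for a "values above threshold" query, comparing index order that matches the sort (e.g. chronological timestamps) against the same values shuffled. The block size of 1000 is chosen arbitrarily to keep the example small.

```java
import java.util.Random;

// Toy experiment, not Lucene code: fraction of blocks whose max value falls
// below a threshold, i.e. blocks a min/max-aware iterator could skip.
public class SkipRateDemo {
    static final int BLOCK = 1000; // stand-in for IndexedDISI's 65536

    // Count blocks whose max is <= threshold (fully skippable).
    static int skippableBlocks(long[] values, long threshold) {
        int skipped = 0;
        for (int start = 0; start < values.length; start += BLOCK) {
            long max = Long.MIN_VALUE;
            for (int i = start; i < Math.min(start + BLOCK, values.length); i++) {
                max = Math.max(max, values[i]);
            }
            if (max <= threshold) skipped++;
        }
        return skipped;
    }

    public static void main(String[] args) {
        int n = 100_000;
        long[] ordered = new long[n];
        for (int i = 0; i < n; i++) ordered[i] = i; // e.g. timestamps in index order
        long[] shuffled = ordered.clone();
        Random rnd = new Random(42);
        for (int i = n - 1; i > 0; i--) { // Fisher-Yates shuffle
            int j = rnd.nextInt(i + 1);
            long tmp = shuffled[i]; shuffled[i] = shuffled[j]; shuffled[j] = tmp;
        }
        long threshold = 90_000; // only the top 10% of values are wanted
        System.out.println("ordered:  " + skippableBlocks(ordered, threshold) + "/100 blocks skippable");
        System.out.println("shuffled: " + skippableBlocks(shuffled, threshold) + "/100 blocks skippable");
    }
}
```

With ordered values, 90 of 100 blocks are skippable; with the shuffled copy, essentially none are, since almost every block of 1000 random values contains at least one value above the threshold. That is the long-tail/chronological case versus the uniform case in miniature.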