[This is a continuation of a thread from the KinoSearch mailing list entitled "Queries with large number of hits."]

On Sep 19, 2008, at 11:25 AM, Nathan Kurz wrote:

Thanks for your quick research and improvements, Marvin.  You prompted
me to run a few quick tests as well:

kinosearch_instream/perl$ search "the OR and OR of OR for" (ten times)
kinosearch_instream/perl$ opreport -alt2 */KinoSearch.so
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  cum. samples  %        cum. %     symbol name
190      190           23.8394  23.8394    kino_InStream_read_c32
112      302           14.0527  37.8921    kino_ScorePost_read_record
68       370            8.5320  46.4241    kino_ScorerDocQ_top_next
55       425            6.9009  53.3250    kino_TermScorer_next
51       476            6.3990  59.7240    kino_InStream_read_u8
45       521            5.6462  65.3701    kino_SegPList_next
31       552            3.8896  69.2597    __i686.get_pc_thunk.bx
30       582            3.7641  73.0238    advance_after_current
29       611            3.6386  76.6625    anonymous symbol from section .plt
25       636            3.1368  79.7992    kino_MemMan_wrapped_realloc
22       658            2.7604  82.5596    kino_ScorePostScorer_tally
17       675            2.1330  84.6926    kino_ORScorer_tally

"opannotate --source */KinoSearch.so" is also useful to glance at, as
is adding the '-c' flag to opreport to see where the calls are coming
from and where the functions are spending their time internally.


The main thing that jumped out is that the function call to
InStream_read_c32 is killing us.  I don't see any way to have this
remain a per-int function call and still get good performance.

Let's set a goal of implementing PForDelta, and figure out how to get there from here.

You need to figure out some way to decode the entire posting with fewer
function calls, and this is where the bulk of them are coming from.
I'd suggest having the Store level return an undecoded raw posting,
and let the Posting class decode the parts it wants.  That way the
VByte code can be macros that work on a local buffer in a tight loop.

It's a little tricky to give anything other than the InStream object itself direct access to the raw buffer. It's inevitable that the bytes that form a compressed integer will straddle a buffer boundary from time to time. InStream_read_c32 takes this into account by calling refill() when necessary, but how do you factor that in when working externally? MathUtils.h already has DECODE_C32 and ENCODE_C32 macros and they're pretty gnarly, even though they assume that the buffer they're pointed at holds all the needed bytes.
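
For a sense of what that tight loop would look like, here's a minimal sketch of a VByte decode macro that trusts its buffer. It is not the actual DECODE_C32 from MathUtils.h; it assumes the usual VByte convention (seven payload bits per byte, a set high bit means another byte follows) and, crucially, it assumes every byte it needs is already in memory -- which is exactly the guarantee that's hard to make:

    /* Hypothetical sketch, not the real DECODE_C32 from MathUtils.h.
     * Decodes one C32 from a local buffer pointer and advances it.
     * Assumes the usual VByte layout: 7 payload bits per byte, high
     * bit set on continuation bytes.  It cannot refill, so it only
     * works if the whole C32 is guaranteed to be in the buffer. */
    #define SKETCH_DECODE_C32(ptr, result) \
        do { \
            uint32_t _value = 0; \
            int      _shift = 0; \
            uint8_t  _byte; \
            do { \
                _byte   = *(ptr)++; \
                _value |= (uint32_t)(_byte & 0x7F) << _shift; \
                _shift += 7; \
            } while (_byte & 0x80); \
            (result) = _value; \
        } while (0)

The per-int function call and its bounds check go away, but only because the boundary problem has been pushed onto whoever filled the buffer.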

Alternatively, if you guarantee that any buffer you supply to an outsider holds all the bytes needed for the C32, the cost of scanning the stream to find the end of the record bites you.
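
To put a number on that cost: under a VByte-style encoding, the only way to find where a run of C32s ends is to touch every byte and test its continuation bit, which is most of the work of decoding. A hypothetical helper (not existing code) makes the point:

    /* Hypothetical: measure how many bytes a run of C32s occupies so
     * that a complete buffer can be handed to an outsider.  Assumes
     * the high bit marks a continuation byte.  Every byte gets
     * touched once -- nearly the cost of decoding, paid twice. */
    static size_t
    scan_c32_run(const uint8_t *buf, uint32_t num_ints) {
        const uint8_t *ptr = buf;
        while (num_ints--) {
            while (*ptr & 0x80) { ptr++; }  /* skip continuation bytes */
            ptr++;                          /* final byte of this C32  */
        }
        return (size_t)(ptr - buf);
    }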

Another theoretical possibility would be to store the size of each undecoded posting in the index, perhaps as a prepended C32 -- similar to how string data is stored. However, that would take extra space and extra time to decode, and would be prohibitively expensive for simple Posting formats like MatchPosting.

The second thing that jumps out is that decompressing VBytes is
expensive.  P4Delta might be a significant advantage, or perhaps there
are ways to optimize the decompression with the existing scheme.

In that study, compressed integers fared the worst of all the competitors. We'll still need them for lots of other things, but let's try to get PForDelta happening in the postings data.
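
For illustration only, here's a toy PForDelta-style block decoder -- not KinoSearch code and not a committed file format. The point is the shape of the inner loop: most values in a block are packed at one fixed bit width, so decoding is shifts and masks with no per-integer function call and no per-byte continuation test, and the few values that don't fit are patched in afterward from an exception list.

    /* Toy PForDelta-style decoder -- an illustration, not a real
     * KinoSearch format.  Each block packs num_values integers at
     * bit_width bits apiece (LSB-first); oversized values were
     * written as exceptions and get patched in after the main loop. */
    #include <stddef.h>
    #include <stdint.h>

    static void
    pfor_decode_block(const uint8_t *packed, uint32_t bit_width,
                      uint32_t num_values, uint32_t *out,
                      const uint32_t *exc_slots, const uint32_t *exc_values,
                      uint32_t num_exceptions) {
        uint64_t bits     = 0;
        uint32_t num_bits = 0;
        uint32_t mask     = bit_width == 32 ? 0xFFFFFFFFu
                                            : (1u << bit_width) - 1;
        for (uint32_t i = 0; i < num_values; i++) {
            while (num_bits < bit_width) {       /* top up from the stream */
                bits     |= (uint64_t)*packed++ << num_bits;
                num_bits += 8;
            }
            out[i]     = (uint32_t)bits & mask;  /* fixed shift-and-mask   */
            bits     >>= bit_width;
            num_bits  -= bit_width;
        }
        for (uint32_t e = 0; e < num_exceptions; e++) {
            out[exc_slots[e]] = exc_values[e];   /* patch the outliers     */
        }
    }

In postings the packed values would be doc-id deltas, so a prefix-sum pass would follow; but the decode loop itself never branches on the data the way the C32 continuation bit forces us to.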

The last thing (sort of a non-thing) is that to my surprise the
double/triple buffering of the Posting data doesn't seem to have much
of a negative effect.  I still think it's worth trying to avoid this,
but it's barely on the current radar.

The double/triple buffering may not cause a slowdown, but it's worth going with mmap for the backing buffer on InStream simply for memory footprint. Since all index files are read-only once committed, they can be shared. That means if you have multiple processes with multiple IndexReaders all reading the same file, they can all effectively use the same backing buffer, and we'll let the OS manage page swapping.
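
A minimal sketch of the mmap idea, assuming POSIX mmap and illustrative function names rather than the InStream API:

    /* Illustrative only -- not the InStream API.  Map a committed,
     * read-only index file; because the mapping is PROT_READ and
     * MAP_SHARED, every IndexReader in every process shares the same
     * physical pages and the OS decides what stays resident. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static const uint8_t*
    map_index_file(const char *path, size_t *len) {
        int fd = open(path, O_RDONLY);
        if (fd == -1) { return NULL; }
        struct stat info;
        if (fstat(fd, &info) == -1) { close(fd); return NULL; }
        void *buf = mmap(NULL, (size_t)info.st_size, PROT_READ,
                         MAP_SHARED, fd, 0);
        close(fd);                 /* the mapping outlives the fd */
        if (buf == MAP_FAILED) { return NULL; }
        *len = (size_t)info.st_size;
        return (const uint8_t*)buf;
    }

Whatever double or triple buffering remains above that, the backing pages themselves exist only once system-wide.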

Did you also recreate an index without the position information, or
are these times based on just skipping over the position info in the
index?

I recreated the index, using a new Schema -- there was no other choice. MatchPosting as presently implemented expects to see <doc,freq> for each posting. (It really ought to be just <doc>, but I hacked it so we could get these tests going.) If you give it ScorePosting data <doc,freq,boost,positions>, it will misinterpret the boost and positions as the next doc's data, get out of sync, return wrong results, and possibly crash with an EOF error.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
