[This is a continuation of a thread from the KinoSearch mailing list entitled "Queries with large number of hits."]

On Sep 19, 2008, at 11:25 AM, Nathan Kurz wrote:

Thanks for your quick research and improvements, Marvin.  You prompted
me to run a few quick tests as well:

kinosearch_instream/perl$ search "the OR and OR of OR for" (ten times)
kinosearch_instream/perl$ opreport -alt2 */KinoSearch.so
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  cum. samples  %        cum. %     symbol name
190      190           23.8394  23.8394    kino_InStream_read_c32
112      302           14.0527  37.8921    kino_ScorePost_read_record
68       370            8.5320  46.4241    kino_ScorerDocQ_top_next
55       425            6.9009  53.3250    kino_TermScorer_next
51       476            6.3990  59.7240    kino_InStream_read_u8
45       521            5.6462  65.3701    kino_SegPList_next
31       552            3.8896  69.2597    __i686.get_pc_thunk.bx
30       582            3.7641  73.0238    advance_after_current
29       611            3.6386  76.6625    anonymous symbol from section .plt
25       636            3.1368  79.7992    kino_MemMan_wrapped_realloc
22       658            2.7604  82.5596    kino_ScorePostScorer_tally
17       675            2.1330  84.6926    kino_ORScorer_tally

"opannotate --source */KinoSearch.so" is also useful to glance at, as
is adding the '-c' flag to opreport to see where the calls are coming
from and where the functions are spending their time internally.


The main thing that jumped out is that the function call to
InStream_read_c32 is killing us.  I don't see any way to have this
remain a per-int function call and still get good performance.

Let's set a goal of implementing PForDelta, and figure out how to get there from here.

You need to figure out some way to decode the entire posting with fewer
function calls, and this is where the bulk of them are coming from.
I'd suggest having the Store level return an undecoded raw posting,
and let the Posting class decode the parts it wants.  That way the
VByte code can be macros that work on a local buffer in a tight loop.

It's a little tricky to give anything other than the InStream object itself direct access to the raw buffer. It's inevitable that the bytes that form a compressed integer will straddle a buffer boundary from time to time. InStream_read_c32 takes this into account by calling refill() when necessary, but how do you factor that in when working externally? MathUtils.h already has DECODE_C32 and ENCODE_C32 macros and they're pretty gnarly, even though they assume that the buffer they're pointed at holds all the needed bytes.
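
For a sense of what that tight loop would look like, here's a minimal sketch of a VByte decode macro that trusts its buffer. It is not the actual DECODE_C32 from MathUtils.h; it assumes the usual VByte convention (seven payload bits per byte, a set high bit means another byte follows) and, crucially, it assumes every byte it needs is already in memory -- which is exactly the guarantee that's hard to make:

    /* Hypothetical sketch, not the real DECODE_C32 from MathUtils.h.
     * Decodes one C32 from a local buffer pointer and advances it.
     * Assumes the usual VByte layout: 7 payload bits per byte, high
     * bit set on continuation bytes.  It cannot refill, so it only
     * works if the whole C32 is guaranteed to be in the buffer. */
    #define SKETCH_DECODE_C32(ptr, result) \
        do { \
            uint32_t _value = 0; \
            int      _shift = 0; \
            uint8_t  _byte; \
            do { \
                _byte   = *(ptr)++; \
                _value |= (uint32_t)(_byte & 0x7F) << _shift; \
                _shift += 7; \
            } while (_byte & 0x80); \
            (result) = _value; \
        } while (0)

The per-int function call and its bounds check go away, but only because the boundary problem has been pushed onto whoever filled the buffer.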

Alternatively, if you guarantee that any buffer you supply to an outsider holds all the bytes needed for the C32, the cost of scanning the stream to find the end of the record bites you.
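
To put a number on that cost: under a VByte-style encoding, the only way to find where a run of C32s ends is to touch every byte and test its continuation bit, which is most of the work of decoding. A hypothetical helper (not existing code) makes the point:

    /* Hypothetical: measure how many bytes a run of C32s occupies so
     * that a complete buffer can be handed to an outsider.  Assumes
     * the high bit marks a continuation byte.  Every byte gets
     * touched once -- nearly the cost of decoding, paid twice. */
    static size_t
    scan_c32_run(const uint8_t *buf, uint32_t num_ints) {
        const uint8_t *ptr = buf;
        while (num_ints--) {
            while (*ptr & 0x80) { ptr++; }  /* skip continuation bytes */
            ptr++;                          /* final byte of this C32  */
        }
        return (size_t)(ptr - buf);
    }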

Another theoretical possibility would be to store the size of each undecoded posting in the index, perhaps as a prepended C32 -- similar to how string data is stored. However, that would take extra space and extra time to decode, and would be prohibitively expensive for simple Posting formats like MatchPosting.

The second thing that jumps out is that decompressing VBytes is
expensive.  P4Delta might be a significant advantage, or perhaps there
are ways to optimize the decompression with the existing scheme.

In that study, compressed integers fared the worst of all the competitors. We'll still need them for lots of other things, but let's try to get PForDelta happening in the postings data.
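
For illustration only, here's a toy PForDelta-style block decoder -- not KinoSearch code and not a committed file format. The point is the shape of the inner loop: most values in a block are packed at one fixed bit width, so decoding is shifts and masks with no per-integer function call and no per-byte continuation test, and the few values that don't fit are patched in afterward from an exception list.

    /* Toy PForDelta-style decoder -- an illustration, not a real
     * KinoSearch format.  Each block packs num_values integers at
     * bit_width bits apiece (LSB-first); oversized values were
     * written as exceptions and get patched in after the main loop. */
    #include <stddef.h>
    #include <stdint.h>

    static void
    pfor_decode_block(const uint8_t *packed, uint32_t bit_width,
                      uint32_t num_values, uint32_t *out,
                      const uint32_t *exc_slots, const uint32_t *exc_values,
                      uint32_t num_exceptions) {
        uint64_t bits     = 0;
        uint32_t num_bits = 0;
        uint32_t mask     = bit_width == 32 ? 0xFFFFFFFFu
                                            : (1u << bit_width) - 1;
        for (uint32_t i = 0; i < num_values; i++) {
            while (num_bits < bit_width) {       /* top up from the stream */
                bits     |= (uint64_t)*packed++ << num_bits;
                num_bits += 8;
            }
            out[i]     = (uint32_t)bits & mask;  /* fixed shift-and-mask   */
            bits     >>= bit_width;
            num_bits  -= bit_width;
        }
        for (uint32_t e = 0; e < num_exceptions; e++) {
            out[exc_slots[e]] = exc_values[e];   /* patch the outliers     */
        }
    }

In postings the packed values would be doc-id deltas, so a prefix-sum pass would follow; but the decode loop itself never branches on the data the way the C32 continuation bit forces us to.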

The last thing (sort of a non-thing) is that to my surprise the
double/triple buffering of the Posting data doesn't seem to have much
of a negative effect.  I still think it's worth trying to avoid this,
but it's barely on the current radar.

The double/triple buffering may not cause a slowdown, but it's worth going with mmap for the backing buffer on InStream simply for memory footprint. Since all index files are read-only once committed, they can be shared. That means if you have multiple processes with multiple IndexReaders all reading the same file, they can all effectively use the same backing buffer, and we'll let the OS manage page swapping.
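
A minimal sketch of the mmap idea, assuming POSIX mmap and illustrative function names rather than the InStream API:

    /* Illustrative only -- not the InStream API.  Map a committed,
     * read-only index file; because the mapping is PROT_READ and
     * MAP_SHARED, every IndexReader in every process shares the same
     * physical pages and the OS decides what stays resident. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static const uint8_t*
    map_index_file(const char *path, size_t *len) {
        int fd = open(path, O_RDONLY);
        if (fd == -1) { return NULL; }
        struct stat info;
        if (fstat(fd, &info) == -1) { close(fd); return NULL; }
        void *buf = mmap(NULL, (size_t)info.st_size, PROT_READ,
                         MAP_SHARED, fd, 0);
        close(fd);                 /* the mapping outlives the fd */
        if (buf == MAP_FAILED) { return NULL; }
        *len = (size_t)info.st_size;
        return (const uint8_t*)buf;
    }

Whatever double or triple buffering remains above that, the backing pages themselves exist only once system-wide.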

Did you also recreate an index without the position information, or
are these times based on just skipping over the position info in the
index?

I recreated the index, using a new Schema -- there was no other choice. MatchPosting as presently implemented expects to see <doc,freq> for each posting. (It really ought to be just <doc>, but I hacked it so we could get these tests going.) If you give it ScorePosting data <doc,freq,boost,positions>, it will misinterpret the boost and positions as the next doc's data, get out of sync, return wrong results, and possibly crash with an EOF error.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
