[This is a continuation of a thread from the KinoSearch mailing list
entitled "Queries with large number of hits."]
On Sep 19, 2008, at 11:25 AM, Nathan Kurz wrote:
Thanks for your quick research and improvements, Marvin. You prompted
me to run a few quick tests as well:
kinosearch_instream/perl$ search "the OR and OR of OR for" (ten times)
kinosearch_instream/perl$ opreport -alt2 */KinoSearch.so
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples cum. samples % cum. % symbol name
190 190 23.8394 23.8394 kino_InStream_read_c32
112 302 14.0527 37.8921 kino_ScorePost_read_record
68 370 8.5320 46.4241 kino_ScorerDocQ_top_next
55 425 6.9009 53.3250 kino_TermScorer_next
51 476 6.3990 59.7240 kino_InStream_read_u8
45 521 5.6462 65.3701 kino_SegPList_next
31 552 3.8896 69.2597 __i686.get_pc_thunk.bx
30 582 3.7641 73.0238 advance_after_current
29 611 3.6386 76.6625 anonymous symbol from section .plt
25 636 3.1368 79.7992 kino_MemMan_wrapped_realloc
22 658 2.7604 82.5596 kino_ScorePostScorer_tally
17 675 2.1330 84.6926 kino_ORScorer_tally
"opannotate --source */KinoSearch.so" is also useful to glance at, as
is adding the '-c' flag to opreport to see where the calls are coming
from and where the functions are spending their time internally.
The main thing that jumped out is that the function call to
InStream_read_c32 is killing us. I don't see any way to have this
remain a per-int function call and still get good performance.
Let's set a goal of implementing PForDelta, and figure out how to get
there from here.
You need to figure out some way to decode the entire posting with
fewer function calls -- this is where the bulk of them are coming from.
I'd suggest having the Store level return an undecoded raw posting,
and let the Posting class decode the parts it wants. That way the
VByte code can be macros that work on a local buffer in a tight loop.
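Concretely, I'm picturing something like this -- hypothetical names
throughout, and the continuation-bit layout below is the common VByte
convention, not necessarily the exact C32 wire format:

    #include <stdint.h>

    /* SIMPLE_DECODE_C32 is a stand-in, not the real MathUtils.h macro.
     * It assumes the buffer holds all the bytes for each integer. */
    #define SIMPLE_DECODE_C32(var, ptr) \
        do { \
            uint8_t  byte_; \
            unsigned shift_ = 0; \
            (var) = 0; \
            do { \
                byte_ = *(ptr)++; \
                (var) |= (uint32_t)(byte_ & 0x7F) << shift_; \
                shift_ += 7; \
            } while (byte_ & 0x80); \
        } while (0)

    /* Decode a whole run of delta-encoded doc nums in one call, rather
     * than one InStream_read_c32() call per integer. */
    static void
    decode_doc_deltas(const uint8_t *buf, uint32_t *docs, int count) {
        uint32_t doc = 0;
        for (int i = 0; i < count; i++) {
            uint32_t delta;
            SIMPLE_DECODE_C32(delta, buf);
            doc += delta;
            docs[i] = doc;
        }
    }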
It's a little tricky to give anything other than the InStream object
itself direct access to the raw buffer. It's inevitable that the
bytes that form a compressed integer will straddle a buffer boundary
from time to time. InStream_read_c32 takes this into account by
calling refill() when necessary, but how do you factor that in when
working externally? MathUtils.h already has DECODE_C32 and ENCODE_C32
macros and they're pretty gnarly, even though they assume that the
buffer they're pointed at holds all the needed bytes.
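Here's roughly the shape of the problem, as a pared-down sketch rather
than the actual InStream internals:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* A hypothetical stand-in for InStream -- just enough state to
     * show where the boundary check forces a refill. */
    typedef struct {
        FILE    *file;
        uint8_t  window[4096];
        uint8_t *buf;     /* next byte to read        */
        uint8_t *limit;   /* one past last valid byte */
    } StreamSketch;

    static void
    refill(StreamSketch *self) {
        size_t got = fread(self->window, 1, sizeof(self->window),
                           self->file);
        if (got == 0) { fprintf(stderr, "unexpected EOF\n"); exit(1); }
        self->buf   = self->window;
        self->limit = self->window + got;
    }

    /* The per-integer call Nathan's profile shows.  The boundary check
     * is exactly what keeps the decode from moving out into a macro
     * that chews on the raw buffer blindly. */
    static uint32_t
    read_c32(StreamSketch *self) {
        uint32_t value = 0;
        unsigned shift = 0;
        uint8_t  byte;
        do {
            if (self->buf == self->limit) refill(self);
            byte = *self->buf++;
            value |= (uint32_t)(byte & 0x7F) << shift;
            shift += 7;
        } while (byte & 0x80);
        return value;
    }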
Alternatively, if you decide to guarantee that any buffer you supply
to an outsider holds all the necessary bytes for the C32, then the
cost of scanning the stream to find the end of the record bites you.
Another theoretical possibility would be to store the size of each
undecoded posting in the index, perhaps as a prepended C32 -- similar
to how string data is stored. However, that would take extra space
and extra time to decode, and would be prohibitively expensive for
simple Posting formats like MatchPosting.
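In sketch form -- reusing the hypothetical StreamSketch stand-in from
above, plus <string.h> for memmove, and assuming records are always
smaller than the window -- the read side might look like:

    /* Top up the window without discarding unconsumed bytes. */
    static void
    refill_keep(StreamSketch *self) {
        size_t kept = (size_t)(self->limit - self->buf);
        memmove(self->window, self->buf, kept);
        size_t got = fread(self->window + kept, 1,
                           sizeof(self->window) - kept, self->file);
        self->buf   = self->window;
        self->limit = self->window + kept + got;
    }

    /* Read one record: a prepended C32 length, then `len` raw bytes
     * handed to the Posting as-is.  Costs extra index space and an
     * extra decode per record, as noted above. */
    static const uint8_t*
    read_raw_record(StreamSketch *self, uint32_t *len_out) {
        uint32_t len = read_c32(self);
        if ((size_t)(self->limit - self->buf) < len) {
            refill_keep(self);  /* guarantee the bytes are contiguous */
        }
        const uint8_t *record = self->buf;
        self->buf += len;
        *len_out   = len;
        return record;
    }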
The second thing that jumps out is that decompressing VBytes is
expensive. P4Delta might be a significant advantage, or perhaps there
are ways to optimize the decompression with the existing scheme.
In that study, compressed integers fared the worst of all the
competitors. We'll still need them for lots of other things, but
let's try to get PForDelta happening in the postings data.
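For reference, the core of a PForDelta-style decode is just a
bit-unpacking loop plus exception patching.  This is a deliberately
simplified sketch of the technique, not the published format
byte-for-byte:

    #include <stdint.h>

    /* Decode one block: `count` values packed at `bits` bits each,
     * then patch `n_exc` oversized values at known slots. */
    static void
    pfor_decode_block(const uint8_t *in, unsigned bits,
                      uint32_t *out, int count,
                      const uint32_t *exc_values,
                      const int *exc_slots, int n_exc) {
        uint64_t bitbuf = 0;
        int      avail  = 0;
        uint32_t mask   = (bits == 32) ? 0xFFFFFFFFu
                                       : ((1u << bits) - 1);

        /* Tight unpack loop: no per-integer function calls, which is
         * the whole point relative to VByte. */
        for (int i = 0; i < count; i++) {
            while (avail < (int)bits) {
                bitbuf |= (uint64_t)(*in++) << avail;
                avail  += 8;
            }
            out[i]   = (uint32_t)bitbuf & mask;
            bitbuf >>= bits;
            avail   -= bits;
        }

        /* Patch the exceptions -- values too big for `bits` bits. */
        for (int i = 0; i < n_exc; i++) {
            out[exc_slots[i]] = exc_values[i];
        }
    }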
The last thing (sort of a non-thing) is that, to my surprise, the
double/triple buffering of the Posting data doesn't seem to have much
of a negative effect. I still think it's worth trying to avoid this,
but it's barely on the current radar.
The double/triple buffering may not cause a slowdown, but it's worth
going with mmap for the backing buffer on InStream simply for memory
footprint. Since all index files are read-only once committed, they
can be shared. That means if you have multiple processes with
multiple IndexReaders all reading the same file, they can all
effectively use the same backing buffer, and we'll let the OS manage
page swapping.
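The plumbing for that is standard POSIX mmap -- a sketch, not the
eventual InStream API:

    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map a read-only index file so every IndexReader in every
     * process shares the same physical pages, with the OS handling
     * eviction. */
    static const uint8_t*
    map_index_file(const char *path, size_t *len_out) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return NULL;

        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return NULL; }

        /* PROT_READ + MAP_SHARED: read-only, shared across procs. */
        void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ,
                          MAP_SHARED, fd, 0);
        close(fd);  /* the mapping stays valid after close */
        if (addr == MAP_FAILED) return NULL;

        *len_out = (size_t)st.st_size;
        return (const uint8_t*)addr;
    }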
Did you also recreate an index without the position information, or
are these times based on just skipping over the position info in the
index?
I recreated the index, using a new Schema -- there was no other
choice. MatchPosting as presently implemented expects to see
<doc,freq> for each posting. (It really ought to be just <doc>, but I
hacked it so we could get these tests going.) If you give it
ScorePosting data <doc,freq,boost,positions>, it will misinterpret the
boost and positions as the next doc's data, get out of sync, return
wrong results, and possibly crash with an EOF error.
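To make the mismatch concrete, here are hypothetical decoders for both
formats, built on the read_c32/refill stand-ins from the sketches
above:

    /* One-byte read with the same boundary check as read_c32. */
    static uint8_t
    read_u8(StreamSketch *self) {
        if (self->buf == self->limit) refill(self);
        return *self->buf++;
    }

    /* MatchPosting as currently hacked: <doc_delta, freq>. */
    static void
    read_match_posting(StreamSketch *in, uint32_t *doc,
                       uint32_t *freq) {
        *doc  += read_c32(in);  /* doc nums are delta encoded */
        *freq  = read_c32(in);
    }

    /* ScorePosting: <doc_delta, freq, boost, positions>.  Point the
     * MatchPosting decoder at this data and it stops after `freq`,
     * then misreads the boost byte as the next record's doc delta --
     * out of sync from there on.  `positions` must hold `freq`
     * slots; delta encoding of positions is an assumption here. */
    static void
    read_score_posting(StreamSketch *in, uint32_t *doc, uint32_t *freq,
                       uint8_t *boost, uint32_t *positions) {
        *doc   += read_c32(in);
        *freq   = read_c32(in);
        *boost  = read_u8(in);
        uint32_t pos = 0;
        for (uint32_t i = 0; i < *freq; i++) {
            pos += read_c32(in);
            positions[i] = pos;
        }
    }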
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/