On Sep 29, 2008, at 1:47 PM, Nathan Kurz wrote:
> I would try very hard to avoid doing buffer bounds checking.
The only place where we're really going to care about the performance issue of
buffer bounds checking is in the posting lists, since that's where the rubber
hits the road at search-time.
The first goal should be to avoid per-int or per-byte bounds checking, as
happens right now in InStream_Read_C32(). (Later, we'll go further and
minimize processor decision branching using PForDelta.)
I said before that we couldn't know the know the maximum length of a
ScorePosting in its current implementation, but we can sort of know it in
stages. Before we start reading, we know that the maximum space that the doc
num, freq, and boost at the top can take up is 11 bytes:
doc num => C32, max 5 bytes
freq => C32, max 5 bytes
boost => 1 byte
Once we know freq, we know the maximum number of bytes that the positions could
occupy.
positions => freq * C32_MAX_BYTES
To take advantage of this knowledge, we can add an argument to our buffer
getter.
/** Get the InStream's buffer. Check to see whether <code>request</code>
* bytes are already in the buffer. If not, fill the buffer with either
* <code>request</code> bytes or the number of bytes remaining before EOF,
* whichever is smaller.
*
* @param request Advisory byte size request.
* @return Pointer to the InStream's internal buffer.
*/
char*
Buf(InStream *self, size_t request);
Here's an alternate implementation of ScorePosting's read method. (Note that
the DECODE_C32 macro from MathUtils advances the pointer supplied to it.)
void
ScorePost_read_record(ScorePosting *self, InStream *instream)
{
u32_t num_prox;
u32_t position = 0;
u32_t *positions;
const size_t max_start_bytes = (C32_MAX_BYTES * 2) + 1;
char *buf = InStream_Buf(self->instream, max_start_bytes);
const u32_t doc_code = DECODE_32(buf);
const u32_t doc_delta = doc_code >> 1;
/* Apply delta doc and retrieve freq. */
self->doc_num += doc_delta;
if (doc_code & 1)
self->freq = 1;
else
self->freq = DECODE_32(buf);
/* Decode boost/norm byte. */
self->weight = self->sim->norm_decoder[ *(u8_t*)buf ];
buf++;
/* Read positions. */
num_prox = self->freq;
buf = InStream_Buf(self->instream, num_prox * C32_MAX_BYTES);
if (num_prox > self->prox_cap) {
self->prox = REALLOCATE(self->prox, num_prox, u32_t);
self->prox_cap = num_prox;
}
positions = self->prox;
while (num_prox--) {
position += DECODE_32(buf);
*positions++ = position;
}
InStream_Advance_Buf(self->instream, buf);
}
The last method invoked, InStream_Advance_Buf(), is a setter which checks to
see whether the supplied pointer indicates a "read past EOF" error.
We're going to be changing the way that Postings are read, but the principle of
splicing in periodic buffer checking instead of needing it for every int read
still applies.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/