Hi Luceners, During a profiling session, I discovered that BufferedIndexInput.readBytes(), the function which reads a bunch of bytes from an index, is very inefficient in many cases. It is efficient for one or two bytes, and also efficient for a very large number of bytes (e.g., when the norms are read all at once); But for anything in between (e.g., 100 bytes), it is a performance disaster. It can easily be improved, though, and below I include a patch to do that (unfortunately, JIRA is down at the moment, so I can't put the patch there).
The basic problem in the existing code was that if you ask it to read 100 bytes, readBytes() simply calls readByte() 100 times in a loop, which means we check byte after byte if the buffer has another character, instead of just checking once how many bytes we have left, and copy them all at once. My version, attached below, copies these 100 bytes if they are available at bulk (using System.arraycopy), and if less than 100 are available, whatever is available gets copied, and then the rest. (as before, when a very large number of bytes is requested, it is read directly into the final buffer). In my profiling (which, incidentally, isn't the typical Lucene run because it adds payloads to the positions), this fix caused amazing performance improvement: previously, BufferedIndexInput.readBytes() took as much as 25% of the run time, and after the fix, this was down to 1% of the run time! Is there an official benchmark for Lucene to test the effect of this improvement for more typical secenarios? In addition to the change to readBytes(), my attached patch also adds a new unit test to BufferedIndexInput (which previously did not have a unit test). This test simulates a "file" which contains a predictable series of bytes, and then tries to read from it with readByte() and readButes() with various sizes (many thousands of combinations are tried) and see that exactly the expected bytes are read. This test is independent of my new readBytes() inplementation, and can be used to check the old implementation as well. If anybody has any comments, or knows of any reason why the existing code was so inefficient (while the code in BufferedIndexOutput makes more sense), I'd love to hear. If a committer will agree to commit this change, even better :-) When JIRA is back online, I'll put the patches there too. Thanks, Nadav Har'El. -- Nadav Har'El | Monday, Oct 23 2006, 1 Heshvan 5767 IBM Haifa Research Lab |----------------------------------------- |Shortening Year-2000 to Y2K was just the http://nadav.harel.org.il |kind of thinking that caused that problem!
Index: src/java/org/apache/lucene/store/BufferedIndexInput.java =================================================================== --- src/java/org/apache/lucene/store/BufferedIndexInput.java (revision 466973) +++ src/java/org/apache/lucene/store/BufferedIndexInput.java (working copy) @@ -34,19 +34,37 @@ return buffer[bufferPosition++]; } - public void readBytes(byte[] b, int offset, int len) - throws IOException { - if (len < BUFFER_SIZE) { - for (int i = 0; i < len; i++) // read byte-by-byte - b[i + offset] = (byte)readByte(); - } else { // read all-at-once - long start = getFilePointer(); - seekInternal(start); - readInternal(b, offset, len); - - bufferStart = start + len; // adjust stream variables - bufferPosition = 0; - bufferLength = 0; // trigger refill() on read + public void readBytes(byte[] b, int offset, int len) throws IOException { + if(len <= (bufferLength-bufferPosition)){ + // the buffer contains enough data to satistfy this request + System.arraycopy(buffer, bufferPosition, b, offset, len); + bufferPosition+=len; + } else { + // the buffer does not have enough data. First serve all we've got. + int available = bufferLength - bufferPosition; + if(available > 0){ + System.arraycopy(buffer, bufferPosition, b, offset, available); + offset += available; + len -= available; + bufferPosition += available; + } + // and now, read the remaining 'len' bytes: + if(len<BUFFER_SIZE){ + // If the amount left to read is small enough, do it in the usual + // buffered way: fill the buffer and copy from it: + refill(); + System.arraycopy(buffer, 0, b, offset, len); + bufferPosition=len; + } else { + // The amount left to read is larger than the buffer - there's no + // performance reason not to read it all at once. Note that unlike + // the previous code of this function, there is no need to do a seek + // here, because there's no need to reread what we had in the buffer. + readInternal(b, offset, len); + bufferStart += bufferPosition+len; // adjust stream variables + bufferPosition = 0; + bufferLength = 0; // trigger refill() on read + } } } Index: src/test/org/apache/lucene/store/TestBufferedIndexInput.java =================================================================== --- src/test/org/apache/lucene/store/TestBufferedIndexInput.java (revision 0) +++ src/test/org/apache/lucene/store/TestBufferedIndexInput.java (revision 0) @@ -0,0 +1,83 @@ +package org.apache.lucene.store; + +import java.io.IOException; + +import junit.framework.TestCase; + +public class TestBufferedIndexInput extends TestCase { + // Call readByte() repeatedly, past the buffer boundary, and see that it + // is working as expected. + // Our input comes from a dynamically generated/ "file" - see + // MyBufferedIndexInput below. + public void testReadByte() throws Exception { + MyBufferedIndexInput input = new MyBufferedIndexInput(); + for(int i=0; i<BufferedIndexInput.BUFFER_SIZE*10; i++){ + assertEquals(input.readByte(), byten(i)); + } + } + + // Call readBytes() repeatedly, with various chunk sizes (from 1 byte to + // larger than the buffer size), and see that it returns the bytes we expect. + // Our input comes from a dynamically generated "file" - + // see MyBufferedIndexInput below. + public void testReadBytes() throws Exception { + MyBufferedIndexInput input = new MyBufferedIndexInput(); + int pos=0; + // gradually increasing size: + for(int size=1; size<BufferedIndexInput.BUFFER_SIZE*10; size=size+size/200+1){ + checkReadBytes(input, size, pos); + pos+=size; + } + // wildly fluctuating size: + for(long i=0; i<1000; i++){ + // The following function generates a fluctuating (but repeatable) + // size, sometimes small (<100) but sometimes large (>10000) + int size1 = (int)( i%7 + 7*(i%5)+ 7*5*(i%3) + 5*5*3*(i%2)); + int size2 = (int)( i%11 + 11*(i%7)+ 11*7*(i%5) + 11*7*5*(i%3) + 11*7*5*3*(i%2) ); + int size = (i%3==0)?size2*10:size1; + checkReadBytes(input, size, pos); + pos+=size; + } + // constant small size (7 bytes): + for(int i=0; i<BufferedIndexInput.BUFFER_SIZE; i++){ + checkReadBytes(input, 7, pos); + pos+=7; + } + } + public void checkReadBytes(BufferedIndexInput input, int size, int pos) throws IOException{ + // Just to see that "offset" is treated properly in readBytes(), we + // add an arbitrary offset at the beginning of the array + int offset = size % 10; // arbitrary + byte[] b = new byte[offset+size]; + input.readBytes(b, offset, size); + for(int i=0; i<size; i++){ + assertEquals(b[offset+i], byten(pos+i)); + } + } + + // byten emulates a file - byten(n) returns the n'th byte in that file. + // MyBufferedIndexInput reads this "file". + private static byte byten(long n){ + return (byte)(n*n%256); + } + private static class MyBufferedIndexInput extends BufferedIndexInput { + long pos=0; + protected void readInternal(byte[] b, int offset, int length) throws IOException { + for(int i=offset; i<offset+length; i++) + b[i] = byten(pos++); + } + + protected void seekInternal(long pos) throws IOException { + this.pos = pos; + } + + public void close() throws IOException { + } + + public long length() { + // an infinite file. Unfortunatelty, this length() is actually used + // in refill(), so we need to give it a real value + return Long.MAX_VALUE; + } + } +}
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]