[ https://issues.apache.org/jira/browse/AVRO-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800068#action_12800068 ]
Scott Carey commented on AVRO-315:
----------------------------------

I could not reproduce the performance improvement from unrolling the loop. I was able to modify the loop to remove a conditional, which had a small impact, but the result is ugly:

{code}
int n = 0;
int b;
int shift = 0;
do {
  b = in.read();
  n |= (b & 0x7f) << shift;
  shift += 7;
  // Loop only while b is in 0..255 with its high bit set; the -1 that
  // read() returns at end of stream fails this test and exits the loop.
} while (((b ^ 0xFFFFFF00) & 0xFFFFFF80) == 0xFFFFFF80);
if (-1 == b) {
  throw new EOFException();
}
return n;
{code}

I have modified the test to pick a broader set of integers to serialize, increased its duration, and increased its warmup time. It is nearly stable now; the JVM is usually done compiling before the test starts. I run the Integer and Long tests twice to help gauge the stability. Test environment: OS X Snow Leopard, Java 1.6.0_15, -server -XX:+UseCompressedOops.

Reviewing the code, the primary problem looks to be the ubiquitous use of this slow function:

{code}
InputStream.read();
{code}

Unfortunately, this old stream API is slow. It is optimized for bulk transfer and performs very poorly when reading small chunks at a time. Ideally one should always use the bulk copy operations on InputStream and its relations if possible.

I made a quick and dirty stab at removing that problem by adding a byte[] buffer in this class and buffering access to the input stream, allowing array access to retrieve bytes. The result is about 2x to 3x as fast!

I'm attaching a patch that has a test class (modified from Thiru's) and a prototype "FastBinaryDecoder.java". There is room for improvement on top of this. This is a prototype to share -- it needs some work to clean it up and merge it into BinaryDecoder. Plus, it is easier to test one versus the other when they are in different classes. The new test class has only one line to change to switch implementations.

The idea is to use array accessors as often as possible, and to reduce the number of times that bounds have to be checked. There are ways we could reduce this even further.
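
As a rough illustration of the buffering idea (this is not the code from the attached patch), the sketch below refills a 2k byte[] with one bulk InputStream.read(byte[], int, int) call and decodes variable-length ints by array access. The class name BufferedVarIntReader and its methods are hypothetical, chosen only for this example, and it decodes the raw 7-bits-per-byte value in the same way as the loop above, without any further transformation.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical illustration only; not the FastBinaryDecoder prototype. */
class BufferedVarIntReader {
  private final InputStream in;
  private final byte[] buf = new byte[2048]; // assumed size: small enough to sit in L1 cache
  private int pos = 0;   // index of the next unread byte in buf
  private int limit = 0; // number of valid bytes currently in buf

  BufferedVarIntReader(InputStream in) {
    this.in = in;
  }

  /** Refills the buffer with a single bulk read; throws at end of stream. */
  private void fill() throws IOException {
    int n = in.read(buf, 0, buf.length);
    if (n <= 0) { // a bulk read with a non-zero length returns -1 only at end of stream
      throw new EOFException();
    }
    pos = 0;
    limit = n;
  }

  /** Decodes one variable-length int, 7 payload bits per byte, low bits first. */
  int readVarInt() throws IOException {
    int n = 0;
    int shift = 0;
    int b;
    do {
      if (pos == limit) {
        fill(); // the only place the underlying stream is touched
      }
      b = buf[pos++] & 0xff; // plain array access instead of in.read()
      n |= (b & 0x7f) << shift;
      shift += 7;
    } while ((b & 0x80) != 0 && shift < 35);
    if ((b & 0x80) != 0) {
      // more than five bytes cannot encode a 32-bit value
      throw new IOException("Invalid int encoding");
    }
    return n;
  }
}
```

Wrapping a stream that holds the bytes 0xAC 0x02 0x05 and calling readVarInt() twice yields 300 and then 5. This version still compares pos with limit on every byte; one possible refinement would be a fast path that skips that comparison whenever at least five bytes remain in the buffer.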

The nio ByteBuffer API is smarter about this sort of thing than InputStream, but raw byte[] access tends to be the fastest, even if it means making an extra copy. An extra 2k copy is almost free (fits in L1 cache) -- two or three orders of magnitude faster than a large in-process memcopy or a copy from out of process.

Sample test output below:

{noformat}
BinaryDecoder (modified, including Thiru's changes but without the loop unrolling):
-----
ReadInt({"type":"array","items":"int"}): 945 ms, 31.731626594910882 million numbers decoded /sec
ReadLong({"type":"array","items":"long"}): 1434 ms, 20.92050209205021 million numbers decoded /sec
ReadInt({"type":"array","items":"int"}): 831 ms, 36.08644869649733 million numbers decoded /sec
ReadLong({"type":"array","items":"long"}): 1430 ms, 20.972303975370128 million numbers decoded /sec
ReadFloat({"type":"array","items":"float"}): 636 ms, 47.09687041296106 million numbers decoded /sec
ReadDouble({"type":"array","items":"double"}): 772 ms, 38.81877068716988 million numbers decoded /sec

FastBinaryDecoder:
--------
ReadInt({"type":"array","items":"int"}): 298 ms, 100.49746243907342 million numbers decoded /sec
ReadLong({"type":"array","items":"long"}): 793 ms, 37.79032416540069 million numbers decoded /sec
ReadInt({"type":"array","items":"int"}): 291 ms, 102.76084126875385 million numbers decoded /sec
ReadLong({"type":"array","items":"long"}): 806 ms, 37.2056106060794 million numbers decoded /sec
ReadFloat({"type":"array","items":"float"}): 324 ms, 92.45904064499426 million numbers decoded /sec
ReadDouble({"type":"array","items":"double"}): 345 ms, 86.7109663359125 million numbers decoded /sec
{noformat}

> Performance improvements to BinaryDecoder
> -----------------------------------------
>
> Key: AVRO-315
> URL: https://issues.apache.org/jira/browse/AVRO-315
> Project: Avro
> Issue Type: Improvement
> Reporter: Thiruvalluvan M. G.
> Assignee: Thiruvalluvan M. G.
> Attachments: AVRO-315-test.patch, AVRO-315.patch, AVRO-315.patch
>
>
> The forthcoming patch improves the performance of BinaryDecoder.readLong(),
> readFloat() and readDouble().
> The test-patch has a command-line program Perf in org.apache.avro.io in the
> (test part of the source directory) which tests the performance of readInt()
> (which calls readLong()) readFloat() and readDouble(). On my machine, the
> patch improves the performance by 10% for readInt() and about 50% for
> readFloat() and readDouble().
> The idea is to unroll the loops in readLong(), readFloat() and readDouble().
> There is a small change in doReadBytes() which checks for most common
> condition before less common ones.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.