I have read the COBS paper, and below I propose modifications to COBS that should make it significantly faster and a better fit for the Hadoop use case. When I first skimmed the paper, I identified the same issues that Doug did -- particularly that it is a byte-by-byte encoding. Java is SLOW at byte-by-byte anything; its JIT compilers simply don't optimize byte-at-a-time work as well as a C compiler does.
I just saw that Matt Massie opened a JIRA ticket for this, so I'll leave the
bulk of the details for that and send what I have written so far below.
Initial comments inline below, with a summary at the end. JIRA details later.

On 5/4/09 10:43 AM, "Doug Cutting" <[email protected]> wrote:

> Matt Massie wrote:
>> (2) Why are we using a 16-byte UUID block delimiter to mark block boundaries
>> instead of using say Consistent Overhead Byte Stuffing (COBS)?
>
> In part because I didn't yet know about COBS when I implemented this
> container. (You're the second person to ask this.) Perhaps we should
> instead use COBS. The only potential disadvantage I see is that COBS
> seems to require byte-by-byte processing. When projecting records to a
> subset schema, we've seen huge speedups if we skip through data by
> chunks, passing over strings, blobs, etc. just incrementing the file
> pointer. So I worry that COBS, while quite fast, might add a
> significant cost to such skipping. Compression also requires
> byte-by-byte processing, but provides more tangible value than COBS. So
> COBS would need to add truly negligible CPU overhead, which it might.

If it's going byte by byte, it's a performance risk for the low-overhead,
high-throughput case. A useful test might be to scan a ByteBuffer and simply
count the number of zero values encountered (the counting prevents dead code
elimination). One could compare this to the same scan done with an IntBuffer
or LongBuffer, counting the occurrences of zero values there.
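Something along these lines is what I have in mind -- a rough, untested sketch
(the class and method names are just placeholders). It isn't an apples-to-apples
comparison, since the second pass counts zero longs rather than zero bytes, but
it should show the relative raw scan throughput of byte-at-a-time versus
64-bits-at-a-time in the JVM:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.LongBuffer;
import java.util.Random;

public class ZeroScanBench {

  public static void main(String[] args) {
    final int size = 64 * 1024 * 1024;   // 64 MB of random test data
    ByteBuffer data = ByteBuffer.allocateDirect(size).order(ByteOrder.nativeOrder());
    Random random = new Random(42);
    byte[] chunk = new byte[8192];
    while (data.hasRemaining()) {
      random.nextBytes(chunk);
      data.put(chunk, 0, Math.min(chunk.length, data.remaining()));
    }

    // Warm up the JIT on the first passes, then look at the later ones.
    for (int pass = 0; pass < 5; pass++) {
      long t0 = System.nanoTime();
      long zeroBytes = countZeroBytes(data);
      long t1 = System.nanoTime();
      long zeroLongs = countZeroLongs(data);
      long t2 = System.nanoTime();
      System.out.printf("pass %d: byte scan %d zeros at %.0f MB/s, long scan %d zeros at %.0f MB/s%n",
          pass, zeroBytes, size / ((t1 - t0) / 1e3), zeroLongs, size / ((t2 - t1) / 1e3));
    }
  }

  // Byte-at-a-time scan: roughly the access pattern a COBS pass is stuck with.
  // Returning the count keeps the JIT from eliminating the loop as dead code.
  static long countZeroBytes(ByteBuffer data) {
    ByteBuffer b = data.duplicate();
    b.clear();
    long count = 0;
    while (b.hasRemaining()) {
      if (b.get() == 0) {
        count++;
      }
    }
    return count;
  }

  // 64-bits-at-a-time scan: roughly the access pattern a word/long variant could use.
  static long countZeroLongs(ByteBuffer data) {
    ByteBuffer b = data.duplicate();
    b.clear();
    LongBuffer longs = b.asLongBuffer();
    long count = 0;
    while (longs.hasRemaining()) {
      if (longs.get() == 0L) {
        count++;
      }
    }
    return count;
  }
}

If the long scan comes out several times faster, that gives a rough idea of
what a byte-oriented COBS pass would cost us relative to a word-oriented one.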
>> COBS
>> also allows for in-place encoding and decoding with very little overhead and
>> copying (see the paper for test results).
> [ ... ]
>
> Thanks for all your COBS analysis! It does seem attractive. Skimming
> the paper, I see mostly space overhead benchmarks, not CPU overhead. In
> projection benchmarks, we saw a 4x speedup from just skipping strings.
> AVRO-25 proposes to make entire arrays and maps skippable, which should,
> e.g., speed the projection of a single field from large, complex records
> by 10x or more. But if compression is always used, perhaps the right
> comparison is with something like LZO. If COBS is negligible next to
> LZO, that could be good enough.

First, I do suspect that compression will be common, but there will be very
important non-compressed use cases as well.

Next, even if COBS is roughly on par with LZO, that is too slow. It would have
to be a couple of times faster than LZO to have minimal CPU impact. A fast
checksum operation is a few times faster than LZO, in large part because most
checksums work on 32-bit or 64-bit chunks at a time.

LZO itself is also "slow". Google mentions a compression scheme for BigTable
that is faster than LZO. From the Google BigTable whitepaper (2006):

"Many clients use a two-pass custom compression scheme. The first pass uses
Bentley and McIlroy’s scheme [6], which compresses long common strings across
a large window. The second pass uses a fast compression algorithm that looks
for repetitions in a small 16 KB window of the data. Both compression passes
are very fast—they encode at 100–200 MB/s, and decode at 400–1000 MB/s on
modern machines."

LZO and similar schemes are somewhat slower, especially in Java:
http://quicklz.com/

Sun's lzjb compression in ZFS claims to be a bit faster (benchmarks seem to
put it in the neighborhood of LZO, and they are related algorithms):
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/compress.c

In short, if you want fast streaming in Java, I suspect COBS won't work in the
byte-by-byte form described in the paper. However ...

> Hopefully we'll develop a benchmark suite for Avro that includes
> projection & compression, so we can more easily evaluate things like this.
>
> Doug

... COBS has fantastic properties. My proposal is essentially to go from COBS
to COWS or COLS: Constant Overhead Word Stuffing or Constant Overhead Long
Stuffing. The properties of COBS all carry over with larger chunks, just
faster. Java may not compile byte-by-byte operations into efficient code, but
it does pretty well with 32-bit or 64-bit chunks. So the proposal is to do
almost the same thing as COBS, but in 32- or 64-bit chunks. This should be
efficient in many languages.

Eyeballing the future, it should probably be 64-bit chunks. Sun is about to
release JVM 1.6.0_14, which has pointer compression, so 64-bit JVMs will
perform better than 32-bit ones on x86-64 with nearly the same memory
overhead.

When scanning for the beginning of a 'packet', one can look for a zero WORD or
LONG rather than a zero BYTE (a rough sketch of such a scan is at the end of
this message). There are drawbacks to this, which I'll address in the JIRA.
There are many ways to handle them, however, and the overhead can be kept to a
maximum of 16 bits. Since we expect most Hadoop-related streams to be larger
than the design goals of COBS, we probably want to encode longer runs so we
can 'skip' more data at a time, and we can afford more overhead per packet.

The truth is, if we had access to SSE instructions, we'd probably do this 128
bits at a time and approach a few GB/sec in memory on newer hardware. The COBS
paper seems to target network protocols and low-level network devices and
hardware, which already operate one byte at a time and so would not have a
performance problem from that aspect.
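To make the scanning idea concrete, here is a rough, untested sketch of what
resynchronizing on a zero LONG could look like from the reader's side. The
method name and the assumption that an aligned all-zero long marks a packet
boundary are only illustrative -- the real framing details belong in the JIRA:

import java.nio.ByteBuffer;
import java.nio.LongBuffer;

public class BoundaryScan {

  /**
   * Returns the offset (relative to the start of the buffer) of the first
   * 8-byte-aligned all-zero long at or after the current position, or -1 if
   * there isn't one. If the encoding guarantees that no all-zero long can
   * appear inside an encoded packet, this is all a reader has to do to find
   * the next packet boundary in the middle of a stream.
   */
  static long nextZeroLongBoundary(ByteBuffer data) {
    ByteBuffer b = data.duplicate();
    // Step forward to the next 8-byte alignment boundary, if necessary.
    int misalignment = b.position() & 7;
    if (misalignment != 0) {
      int step = 8 - misalignment;
      if (b.remaining() < step) {
        return -1;
      }
      b.position(b.position() + step);
    }
    int start = b.position();
    LongBuffer longs = b.asLongBuffer();   // view the remaining bytes as longs
    for (int i = 0; i < longs.limit(); i++) {
      if (longs.get(i) == 0L) {            // one 64-bit compare per iteration
        return start + 8L * i;
      }
    }
    return -1;
  }
}

The point is that the inner loop does one 64-bit comparison per iteration
instead of one byte comparison, which is exactly the kind of loop HotSpot
compiles well.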
