[
https://issues.apache.org/jira/browse/AVRO-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712868#action_12712868
]
Scott Carey commented on AVRO-27:
---------------------------------
{quote}
COLSCodec, one zero word every 10 words
Encoding at 354.09015 MB/sec
Decoding at 812.4928 MB/sec
Original array was modified!
{quote}
That, Sir, is the remaining bug I alluded to but didn't highlight enough in my
previous comment. If you change the size of the array, the random number seed,
or just about anything else, it will go away (or pop up elsewhere).
The before and after arrays match byte-for-byte, except that the one that was
encoded and decoded has an extra word at the end. I stepped through that case
briefly, but was too lazy to fix it. I don't think it is relevant to the
overall results. (And any real Codec would be written more cleanly, with
plenty of unit tests to cover the corner cases.)
Which reminds me, these are the main conclusions I draw that are not specific
to this JIRA:
ByteBuffer.getInt() and getLong() are rather optimized, as are the matching
putInt() and putLong() operations. Bulk put operations are also fast on
ByteBuffer, but not on an IntBuffer created from ByteBuffer.asIntBuffer().
Any encoder or decoder in Java will see potentially large performance gains if
it can read/write in larger chunks.
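To make "larger chunks" concrete, here is the shape of the difference (a
minimal sketch, not code from the attached files):
{code:java}
import java.nio.ByteBuffer;

public class ChunkedCopy {
    // Byte-at-a-time: a bounds check and method call per byte.
    static void copyBytes(ByteBuffer src, ByteBuffer dst) {
        while (src.hasRemaining()) {
            dst.put(src.get());
        }
    }

    // Word-at-a-time: getLong()/putLong() move 8 bytes per call,
    // with a short byte-wise tail for the remainder.
    static void copyWords(ByteBuffer src, ByteBuffer dst) {
        while (src.remaining() >= 8) {
            dst.putLong(src.getLong());
        }
        while (src.hasRemaining()) {
            dst.put(src.get());
        }
    }
}
{code}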
I could be evil and try the same test with the array misaligned -- starting at
position 1 instead of 0 (the JVM aligns array data to 8-byte boundaries, and
many processor instructions are faster on aligned data).
Ok, I decided to be evil and tried it on my laptop with misaligned bytes (I
added a put(0) at the start of the encoder and a get() at the start of the
decoder, misaligning the whole thing by one byte). Now, perhaps getLong() will
be a lot less efficient. Let's see:
Aligned (COLS):
COLSCodec, one zero word every 1 words
Encoding at 323.87604 MB/sec
Decoding at 419.4213 MB/sec
COLSCodec, one zero word every 10 words
Encoding at 376.7943 MB/sec
Decoding at 1041.8271 MB/sec
COLSCodec, one zero word every 10000 words
Encoding at 439.01627 MB/sec
Decoding at 1350.2242 MB/sec
COLSCodec, one zero word every 1000000 words
Encoding at 415.91876 MB/sec
Decoding at 1411.3434 MB/sec
Misaligned (COLS):
COLSCodec, one zero word every 1 words
Encoding at 327.0196 MB/sec
Decoding at 402.65366 MB/sec
COLSCodec, one zero word every 10 words
Encoding at 377.48105 MB/sec
Decoding at 974.4739 MB/sec
COLSCodec, one zero word every 10000 words
Encoding at 445.4802 MB/sec
Decoding at 1440.7946 MB/sec
COLSCodec, one zero word every 1000000 words
Encoding at 443.61166 MB/sec
Decoding at 1423.9922 MB/sec
These are within the usual margin of error and essentially the same. Perhaps
the JVM's JIT isn't smart enough to recognize that in the first case all
access is aligned, where it could use the faster aligned-load processor
instructions? I could write a COLSCodec2 that operates on a LongBuffer rather
than a ByteBuffer to see what that does.
But the main conclusion is that accessing data in larger chunks brings big
gains when it is possible to do so.
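For reference, the "evil" change amounts to something like this (a sketch of
the idea; the buffer handling is mine, not the exact diff to COLSCodec):
{code:java}
import java.nio.ByteBuffer;

public class MisalignDemo {
    // One pad byte at the start shifts every subsequent 8-byte access
    // off its natural alignment relative to the 8-byte-aligned array base.
    static void encodeMisaligned(ByteBuffer out, long[] words) {
        out.put((byte) 0);           // pad byte: misalign by one
        for (long w : words) {
            out.putLong(w);          // every putLong() is now unaligned
        }
    }

    static void decodeMisaligned(ByteBuffer in, long[] words) {
        in.get();                    // skip the pad byte
        for (int i = 0; i < words.length; i++) {
            words[i] = in.getLong(); // unaligned getLong()
        }
    }
}
{code}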
> Consistent Overhead Byte Stuffing (COBS) encoded block format for Object
> Container Files
> ----------------------------------------------------------------------------------------
>
> Key: AVRO-27
> URL: https://issues.apache.org/jira/browse/AVRO-27
> Project: Avro
> Issue Type: New Feature
> Components: spec
> Reporter: Matt Massie
> Attachments: COBSCodec.java, COBSCodec2.java, COBSPerfTest.java,
> COLSCodec.java, COWSCodec.java, COWSCodec2.java, COWSCodec3.java
>
>
> Object Container Files could use a 1 byte sync marker (set to zero) using
> zig-zag and COBS encoding within blocks to efficiently escape zeros from the
> record data.
> h4. Zig-Zag encoding
> With zig-zag encoding, only the value 0 (zero) gets encoded into a value
> with a single zero byte. This property means that we can write any non-zero
> zig-zag long inside a block without concern for creating an unintentional
> sync byte.
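> For reference, a sketch of the standard zig-zag transform this relies on
> (only the value 0 maps to 0, so a non-zero long never varint-encodes to a
> zero byte):
>
> {code:java}
> static long zigZagEncode(long n) {
>     // Maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
>     return (n << 1) ^ (n >> 63);
> }
>
> static long zigZagDecode(long z) {
>     return (z >>> 1) ^ -(z & 1);
> }
> {code}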
> h4. COBS encoding
> We'll use COBS encoding to ensure that all zeros are escaped inside the block
> payload. You can read http://www.sigcomm.org/sigcomm97/papers/p062.pdf for
> the details about COBS encoding.
> h1. Block Format
> All blocks start and end with a sync byte (set to zero) with a
> type-length-value format internally as follows:
> || name || format || length in bytes || value || meaning ||
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker for the start of a block |
> | type | zig-zag long | variable | must be non-zero | The type field expresses whether the block is for _metadata_ or _normal_ data. |
> | length | zig-zag long | variable | must be non-zero | The length field expresses the number of bytes until the next record (including the cobs code and sync byte). Useful for skipping ahead to the next block. |
> | cobs_code | byte | 1 | see COBS code table below | Used in escaping zeros from the block payload |
> | payload | cobs-encoded | greater than or equal to zero | all non-zero bytes | The payload of the block |
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker for the end of the block |
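> For concreteness, a hypothetical block carrying the two payload bytes 0x11
> 0x22. The type and length values here are illustrative only; the table
> above just requires them to be non-zero zig-zag longs:
>
> {code:java}
> byte[] exampleBlock = {
>     0x00,       // sync: start of block
>     0x02,       // type   = zig-zag varint of 1 (say, "normal")
>     0x08,       // length = zig-zag varint of 4: cobs code + 2 bytes + sync
>     0x03,       // cobs_code: two data bytes follow
>     0x11, 0x22, // payload (all non-zero after COBS encoding)
>     0x00        // sync: end of block
> };
> {code}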
> h2. COBS code table
> || Code || Followed by || Meaning ||
> | 0x00 | (not applicable) | (not allowed) |
> | 0x01 | nothing | Empty payload followed by the closing sync byte |
> | 0x02 | one data byte | The single data byte, followed by the closing sync byte |
> | 0x03 | two data bytes | The pair of data bytes, followed by the closing sync byte |
> | 0x04 | three data bytes | The three data bytes, followed by the closing sync byte |
> | n | (n-1) data bytes | The (n-1) data bytes, followed by the closing sync byte |
> | 0xFD | 252 data bytes | The 252 data bytes, followed by the closing sync byte |
> | 0xFE | 253 data bytes | The 253 data bytes, followed by the closing sync byte |
> | 0xFF | 254 data bytes | The 254 data bytes, *not* followed by a zero. |
> (taken from http://www.sigcomm.org/sigcomm97/papers/p062.pdf)
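> (For example, a payload containing a zero, such as 0x11 0x00 0x22, encodes
> as 0x02 0x11 0x02 0x22: within the block, each code byte stands in for one
> zero, and the final group's code is followed by the closing sync byte.)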
> h1. Encoding
> Only the block writer needs to perform byte-by-byte processing to encode the
> block. The overhead for COBS encoding is very small in terms of the
> in-memory state required.
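> A sketch of the encoder implied by the code table above (a plain COBS pass,
> not the attached COBSCodec):
>
> {code:java}
> import java.util.Arrays;
>
> public class CobsEncode {
>     /** Each code byte n is followed by (n-1) data bytes and stands in
>      *  for one zero, except 0xFF (a full 254-byte run with no implied
>      *  zero) and the final group (closed by the sync byte). */
>     static byte[] encode(byte[] in) {
>         byte[] out = new byte[in.length + in.length / 254 + 2];
>         int codePos = 0;  // where the pending code byte will be written
>         int o = 1;        // next free output position
>         int code = 1;     // 1 + number of data bytes in the current run
>         for (byte b : in) {
>             if (b != 0) {
>                 out[o++] = b;
>                 code++;
>             }
>             if (b == 0 || code == 0xFF) {  // close the current group
>                 out[codePos] = (byte) code;
>                 codePos = o++;
>                 code = 1;
>             }
>         }
>         out[codePos] = (byte) code;        // close the final group
>         return Arrays.copyOf(out, o);
>     }
> }
> {code}
>
> The only in-memory state is the position of the pending code byte and a
> counter, which matches the small-overhead point above.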
> h1. Decoding
> Block readers are not required to do as much byte-by-byte processing as a
> writer. The reader could (for example) find a _metadata_ block by doing the
> following:
> # Search for a zero byte in the file which marks the start of a record
> # Read and zig-zag decode the _type_ of the block
> #* If the block is _normal_ data, read the _length_, seek ahead to the next block, and go to step 2 again
> #* If the block is a _metadata_ block, COBS-decode the data (see the sketch below)
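> A sketch of that reader loop. The type constant is an assumption (the
> proposal above only requires type to be non-zero), and the length handling
> assumes length counts the cobs code through the closing sync byte:
>
> {code:java}
> import java.nio.ByteBuffer;
> import java.util.Arrays;
>
> public class BlockReader {
>     static final long METADATA = 2; // illustrative value, not specified
>
>     // Varint read plus zig-zag decode (Avro-style longs).
>     static long readZigZagLong(ByteBuffer buf) {
>         long raw = 0;
>         int shift = 0;
>         int b;
>         do {
>             b = buf.get() & 0xFF;
>             raw |= (long) (b & 0x7F) << shift;
>             shift += 7;
>         } while ((b & 0x80) != 0);
>         return (raw >>> 1) ^ -(raw & 1);
>     }
>
>     // COBS-decode n encoded bytes, per the code table above.
>     static byte[] cobsDecode(ByteBuffer buf, int n) {
>         ByteBuffer out = ByteBuffer.allocate(n);
>         int end = buf.position() + n;
>         while (buf.position() < end) {
>             int code = buf.get() & 0xFF;
>             for (int i = 1; i < code; i++) {
>                 out.put(buf.get());        // (code - 1) data bytes
>             }
>             if (code != 0xFF && buf.position() < end) {
>                 out.put((byte) 0);         // implied zero between groups
>             }
>         }
>         return Arrays.copyOf(out.array(), out.position());
>     }
>
>     static byte[] findMetadataBlock(ByteBuffer file) {
>         while (file.hasRemaining()) {
>             if (file.get() != 0) continue;     // 1. scan for a sync byte
>             long type = readZigZagLong(file);   // 2. decode the block type
>             long length = readZigZagLong(file);
>             if (type == METADATA) {
>                 // 2b. metadata: decode length - 1 encoded bytes
>                 // (length includes the closing sync byte)
>                 return cobsDecode(file, (int) length - 1);
>             }
>             // 2a. normal data: skip ahead and keep scanning
>             file.position(file.position() + (int) length);
>         }
>         return null;                            // no metadata block found
>     }
> }
> {code}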