[
https://issues.apache.org/jira/browse/AVRO-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706754#action_12706754
]
Scott Carey edited comment on AVRO-27 at 5/7/09 1:15 AM:
---------------------------------------------------------
An outsider here -- I have an idea for avoiding the performance pitfalls of
COBS' byte-by-byte nature, and as I thought it through I spotted many other
opportunities for enhancement, since larger chunks afford far more bits in the
code that can be used for things other than the length of the following literal
chunk.
h2. Proposal -- COLS, a modification of COBS
h3. (for greater performance and extensibility for large data streams)
Java is particularly bad at byte-by-byte operations. The COBS paper clearly
indicates its design intention was stuffing data through embedded systems such
as telephone lines and other networks where byte-by-byte processing of the
whole payload is already mandatory.
Doing so here would be a performance bottleneck in Java. Some simple tests can
be constructed to prove or disprove this claim.
I propose that rather than COBS, one use COLS or COWS ... that is, Constant
Overhead Long Stuffing or Constant Overhead Word Stuffing.
This would be inefficient if we expect most payloads to be small (less than 256
bytes), but I suspect most Hadoop-related payloads are large, and often very
large.
I favor stuffing Longs rather than Ints, since most systems will soon be
running 64 bit JVMs. Sun's next JRE release has Object Pointer
Compression, which makes the memory overhead of a 64 bit JVM very small
compared to a 32 bit JVM, and performance is generally faster than the 32 bit
JVM due to native 64 bit operations and more registers (for x86-64 at least).
http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/
I will describe the proposal below assuming a translation of COBS to COLS, from
1 byte at a time to 8 byte at a time encoding. However, it is clear that a 4
byte variant is very similar and may be preferable.
h3. Proposed Changes -- Simple Block format with COLS
|| name || format || length in bytes || value || meaning ||
| sync | long | *8* | *0L* | The sync *long* serves as a clear marker for the
start of a block |
| type | 1 byte | 1 | non-zero | The type field expresses whether the block is
for _metadata_ or _normal_ data. *note - if this is only ever going to be a
binary flag, it can be packed into the length or sequence number as a sign
value.* *However, it is critical for decoding performance to keep the non-COLS
header 8 byte aligned* |
| block sequence number | 3 byte unsigned int | 3 | 0 to 2^24 - 1 | The block
sequence number -- a client can use this to resume a stream from the last
successful block. This may not be needed if the metadata blocks take care of
this. |
| length | 4 byte signed int | 4 | *>= 0* | The length field expresses the
length of the COLS_payload in bytes. Useful for skipping ahead to the next
block. |
| COLS_payload | *COLS* | *length as above* | see COLS description below | The
data in this block, encoded. |
The above would cap the stream length at 2GB * 16M = 32PB. There is room
to increase this significantly by taking bits from the type and giving those to
the block count. 2GB blocks are rather unlikely for now, however -- as are
multi-PB streams.
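To make the header concrete, here is a minimal sketch of writing it with
java.nio.ByteBuffer (the method name and packing details are mine, not part of
the proposal):
{code}
import java.nio.ByteBuffer;

// Sketch only: writes the 16 byte block header proposed above
// (8 byte sync, 1 byte type, 3 byte sequence number, 4 byte length).
static ByteBuffer writeBlockHeader(byte type, int seq, int payloadLength) {
    ByteBuffer header = ByteBuffer.allocate(16); // 16 bytes keeps 8 byte alignment
    header.putLong(0L);                          // sync marker
    header.put(type);                            // must be non-zero
    header.put((byte) (seq >>> 16));             // 3 byte unsigned sequence number,
    header.put((byte) (seq >>> 8));              //   in network (big-endian) order
    header.put((byte) seq);
    header.putInt(payloadLength);                // COLS_payload length in bytes
    header.flip();
    return header;
}
{code}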
h4. Discussion
* The entire stream would need to be 8 byte aligned in order to process it
cleanly with something like java.nio.LongBuffer. This would include metadata
blocks.
* The sequence is assumed to be in network-order. Endianness can be handled
and is not discussed in detail here.
* The type can likely be encoded in a single bit of the block sequence number
or length field; a sketch of this packing appears after this list. If more
than two types of blocks are expected, more bits can be reserved for future
use.
* The length can be stored as the number of longs rather than bytes (bytes / 8)
since the COLS payload is a multiple of 8 bytes.
* The COLS payload here differs from the original proposal. It will have an
entire COBS-like stream, with possibly many COLS code markers (at least one per
0L value in the block data).
* One may want to have both the encoded length above, and the decoded length
(or a checksum) as extra data validation. Perhaps even 4 types: METADATA,
METADATA_CSUM, NORMAL, NORMAL_CSUM -- where the ordinary variants store the
length (fast, but less reliable) and the _CSUM variants store a checksum
(slower, but highly reliable).
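For the type-bit bullet above, a sketch of what packing the flag into the
length field's sign bit might look like (the layout and names here are
hypothetical):
{code}
// Hypothetical packing of the metadata/normal type flag into the sign bit
// of the 4 byte length field, as suggested in the discussion above.
static int packLength(int length, boolean isMetadata) {
    return isMetadata ? (length | 0x80000000) : length;
}
static boolean isMetadata(int packed) { return packed < 0; }
static int rawLength(int packed)      { return packed & 0x7FFFFFFF; }
{code}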
h3. Basic COBS to COLS description
COBS describes a byte-by-byte encoding in which a zero byte cannot appear; a
set of codes is used to encode runs of data that contain no zero byte.
All codes but one have an implicit trailing zero. The last code block is
assumed to have no implicit zero regardless of the code.
COLS is a simple extension of this scheme to 64 bit chunks. In its base form,
it does nothing more than work with larger chunks:
|| COLS Code (Long, 8 bytes) || Followed by || Meaning ||
| 0L | N/A | (not allowed)|
| 1L | nothing | A single zero Long |
| 2L | one long (8 bytes) | The single data long, followed by a trailing zero
long * |
| 3L | two longs (16 bytes) | The pair of data longs, followed by a trailing
zero long * |
| nL | (n-1) longs | The (n-1) longs, followed by a trailing zero long * |
| MAX \*\* | MAX - 1 longs | The (MAX - 1) longs, with no trailing zero |
\* The last code in the sequence (which can be identified by the length header
or a 0L indicating the start of the next block) does NOT have an implicit
trailing zero.
\*\* MAX needs to be chosen, and can't realistically be very large, since
encoding requires an arraycopy of size (MAX - 1) * 8.
The COLS_payload has multiple COLS Code entries (and literals), up to the
length specified in the header (where a 0L should then occur).
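A minimal decode loop for this base table might look like the following sketch
(assuming the payload is already in a java.nio.LongBuffer; MAX handling and
block framing are elided):
{code}
import java.nio.LongBuffer;

// Sketch of base COLS decoding. Reads codes and literal runs from 'in'
// and writes decoded longs to 'out'. The final code's implicit zero is
// suppressed here by the hasRemaining() check; real code would use the
// block length from the header instead.
static void decodeBase(LongBuffer in, LongBuffer out, long max) {
    while (in.hasRemaining()) {
        long code = in.get();                  // COLS code; never 0L
        for (long i = 0; i < code - 1; i++) {  // copy (code - 1) literal longs
            out.put(in.get());
        }
        if (code != max && in.hasRemaining()) {
            out.put(0L);                       // implicit trailing zero
        }
    }
}
{code}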
However -- there are drawbacks to using such a large chunk without other
modifications from COBS:
# 64 bits is far too large for a length field. For encoding, a COLS code block
must fit in RAM and, for performance, should probably fit in half an L2 cache.
For decoding, however, the code length limit is irrelevant.
# If the size of the data encoded is not a multiple of 8 bytes, we need a
mechanism to encode that up to 7 trailing bytes should be truncated (3 bits).
# For most blocks, the overhead will be exactly 8 bytes (unless the block has a
trailing 0L).
# Very long data streams without a zero Long are unlikely, so very large chunk
lengths are not very useful.
There are also benefits however. The above suggests that most of the 8 byte
COLS code block space is not needed to encode length. Much can be done with
this!
Some thoughts:
* The 3 bits needed to define the truncation behavior can be stored in the COLS
code.
* The overhead can be reduced by encoding short trailing sequences into the
upper bits rather than purely truncating -- e.g. one can append 2 bytes instead
of truncating 6.
* Rudimentary run-length encoding or other lightweight compression can be done
with the extra bits (completely encoder-optional).
* We can remove the requirement that most codes have an implicit trailing zero,
and encode that in one of the extra bits.
If only the lower 2 bytes of an 8 byte COLS code represent the size (MAX =
2^16 - 1), then the max literal size is 512KB - 16B. If we remove the implicit
trailing zero, an encoder can optionally encode smaller literal sequences
(perhaps for performance, or compression).
What can be done with the remaining 48 bits?
Some ideas:
# The highest 4 bytes can represent data to append to the literal. In this
way, half of the size overhead of the encoding is removed. This should
generally only apply to the last COLS code in the block (for performance
reasons and to maintain 8 byte alignment on all arraycopy operations), but it
is encoder-optional.
# The next bit represents whether the COLS block has an implicit 0L appended.
# A bit can be used to signify endianness (this might be a better fit for the
block header or stream metadata -- detecting zeros works regardless of
endianness).
# The next three bits can represent how much data is truncated from, or
appended to, the literal (before the optional implicit 0L):
|| value || meaning ||
| 000 | do not truncate or append |
| 001 | truncate the last byte of the literal |
| 010 | truncate the last 2 bytes of the literal |
| 011 | truncate the last 3 bytes of the literal |
| 100 | append the leading byte of the COLS code after the literal |
| 101 | append the first 2 leading bytes of the COLS code after the literal |
| 110 | append the first 3 leading bytes of the COLS code after the literal |
| 111 | append all 4 leading bytes of the COLS code after the literal |
This leaves us with 12 bits. I propose that these be used for rudimentary
(optional) compression:
* Option A:
** Run length only -- the 12 bits represent the number of times to repeat the
literal. Or 4 bits are the number of COLS chunks backwards (including this
one) to repeat, and 8 bits are the number of repeats. Or ... some other form
of emitting copies of entire COLS chunks.
* Option B:
** Some form of LZ-like compression that copies in 8 byte chunks -- 4 bits
represent the number of Longs to copy (so the max match size is 15 * 8 bytes),
and 8 bits represent the number of Longs backwards (from the end of this COLS
chunk) at which to begin that copy (up to 2KB back). Because of the
truncation/append feature, this is not constrained to 8 byte aligned copies on
the output, but the encoded format is entirely 8 byte aligned and all copies
are multiples of 8 bytes. I would not be surprised if this were as fast as
LZO or faster, since it is very similar but operates in a more chunky fashion.
Compression ratios would not be that great, but as with most similar
algorithms, the encoder can do more work to search for matches. Decoding
uncompressed data should be essentially free (if the 4 bits are 0, do nothing
-- and most COLS blocks would be fairly large, so this check does not occur
that frequently). A sketch of this copy step appears after this list.
* Option C:
** Reserve those 12 bits for future use / research
Alternatively, one to 4 extra bytes used for the "append" feature can be
reassigned to have more than 12 bits for compression metadata.
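Here is the copy-step sketch referenced under Option B. The bit positions for
matchLen and dist are hypothetical (the 12 free bits, 16-27, split 4/8); only
the chunked-copy idea is the point:
{code}
// Sketch of the Option B copy step. 'code' is the COLS code, 'out' the
// decoded output, 'outPos' the current output position in longs.
// Bit positions for matchLen and dist are hypothetical.
static int copyMatch(long code, long[] out, int outPos) {
    int matchLen = (int) ((code >>> 16) & 0xF);  // longs to copy; 0 = no compression
    int dist     = (int) ((code >>> 20) & 0xFF); // longs backwards to copy from
    int src = outPos - dist;
    for (int i = 0; i < matchLen; i++) {
        out[outPos + i] = out[src + i];          // 8 byte chunked copy; may overlap
    }
    return outPos + matchLen;
}
{code}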
So, with the above modifications, the COLS code looks like this:
The COLS code is 8 bytes. The low 16 bits encode basic meaning.
An 8 byte COLS code cannot be 0L.
|| Code & 0xFFFF (low 2 bytes) || Followed by || Meaning ||
| 0x0000 | N/A | (not allowed)|
| 0x0001 | nothing | A single zero Long |
| 0x0002 | one long (8 bytes) | The single data long |
| 0x0003 | two longs (16 bytes) | The pair of data longs |
| n | (n-1) longs | The (n-1) longs |
| 0xFFFF | 2^16 - 2 longs | 2^16 - 2 longs |
The next portion determines the state of truncation or appending.
Two options are listed -- only truncation, and truncation/appending. The
appending could be up to 5 bytes if we squeeze all the rest of the space. The
example below is for up to 4 bytes appended and 3 bytes truncated.
{code}long appendCode = (Code >>> 28) & 0xF;{code}
|| appendCode & 0x7 || Append (+) or truncate (-) || From || truncate only
option ||
| 0x0 | 0 | nothing| 0 |
| 0x1 | (-)1 | nothing | (-)1 |
| 0x2 | (-)2 | nothing | (-)2 |
| 0x3 | (-)3 | nothing | (-)3 |
| 0x4 | (+)1 | Code >>> 56 | (-)4 |
| 0x5 | (+)2 | Code >>> 48 | (-)5 |
| 0x6 | (+)3 | Code >>> 40 | (-)6 |
| 0x7 | (+)4 | Code >>> 32 | (-)7 |
It may be wiser to choose an option between these. If 3 bytes are chosen as
the max arbitrary append length, with 4 truncated, 20 bits are left for other
purposes, rather than 12. The average COLS chunk would be one byte larger.
|| appendCode & 0x8 || Append 0L ||
| 0 | do not append 0L (8 zero bytes) |
| 1 | do append 0L (8 zero bytes) |
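Putting the tables together, a sketch of decoding one extended COLS code (this
is still a proposal, so the bit layout below simply follows the tables above:
low 16 bits = size code, bits 28-31 = appendCode, high 4 bytes = append data):
{code}
import java.nio.ByteBuffer;

// Sketch: decode one extended COLS code from 'in' into 'out'.
static void decodeOne(ByteBuffer in, ByteBuffer out) {
    long code = in.getLong();
    int n = (int) (code & 0xFFFF);           // size code; 0x0000 is illegal
    for (int i = 0; i < n - 1; i++) {        // copy the (n - 1) literal longs
        out.putLong(in.getLong());
    }
    int appendCode = (int) ((code >>> 28) & 0xF);
    int mode = appendCode & 0x7;
    if (mode >= 1 && mode <= 3) {            // truncate 1-3 bytes of the literal
        out.position(out.position() - mode);
    } else if (mode >= 4) {                  // append 1-4 leading bytes of the code
        for (int i = 0; i < mode - 3; i++) {
            out.put((byte) (code >>> (56 - 8 * i)));
        }
    }
    if ((appendCode & 0x8) != 0) {
        out.putLong(0L);                     // the "append 0L" bit
    }
}
{code}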
h2. Encoding
The writer would perform processing in 8 byte chunks until the end of the block
where some byte-by-byte processing would occur. Compression options would be
entirely the writer's choice.
The state overhead can be very low or large at the writer's whim. Larger COLS
chunk sizes require more state (and larger arraycopies), and any compression
option adds state overhead.
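The encoder's hot loop is just a scan for the next 0L in 8 byte strides; a
sketch:
{code}
// Sketch of the encoder's hot loop: find the next zero long. The code
// emission and the literal arraycopy both hang off this one comparison
// per 8 bytes of payload.
static int nextZero(long[] data, int from, int to) {
    for (int i = from; i < to; i++) {
        if (data[i] == 0L) return i;
    }
    return to; // no zero in range
}
{code}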
h2. Decoding
Decoding in all circumstances reads data in 8 byte chunks. Copies occur in 8
byte chunks, 8 byte aligned save for the end of a block if the block does not
have a multiple of 8 bytes in its payload. An encoder can cause copy
destinations (but not sources) to not be 8 byte aligned if certain special
options (compression) are used or encoding is intentionally misaligned.
Generally, an encoder can choose to make all but the last few bytes of the last
block in the stream aligned.
> Consistent Overhead Byte Stuffing (COBS) encoded block format for Object
> Container Files
> ----------------------------------------------------------------------------------------
>
> Key: AVRO-27
> URL: https://issues.apache.org/jira/browse/AVRO-27
> Project: Avro
> Issue Type: New Feature
> Components: spec
> Reporter: Matt Massie
>
> Object Container Files could use a 1 byte sync marker (set to zero) using
> zig-zag and COBS encoding within blocks to efficiently escape zeros from the
> record data.
> h4. Zig-Zag encoding
> With zig-zag encoding only the value of 0 (zero) gets encoded into a value
> with a single zero byte. This property means that we can write any non-zero
> zig-zag long inside a block without concern for creating an unintentional sync
> byte.
> h4. COBS encoding
> We'll use COBS encoding to ensure that all zeros are escaped inside the block
> payload. You can read http://www.sigcomm.org/sigcomm97/papers/p062.pdf for
> the details about COBS encoding.
> h1. Block Format
> All blocks start and end with a sync byte (set to zero) with a
> type-length-value format internally as follows:
> || name || format || length in bytes || value || meaning ||
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker
> for the start of a block |
> | type | zig-zag long | variable | must be non-zero | The type field
> expresses whether the block is for _metadata_ or _normal_ data. |
> | length | zig-zag long | variable | must be non-zero | The length field
> expresses the number of bytes until the next record (including the cobs code
> and sync byte). Useful for skipping ahead to the next block. |
> | cobs_code | byte | 1 | see COBS code table below | Used in escaping zeros
> from the block payload |
> | payload | cobs-encoded | Greater than or equal to zero | all non-zero bytes
> | The payload of the block |
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker
> for the end of the block |
> h2. COBS code table
> || Code || Followed by || Meaning ||
> | 0x00 | (not applicable) | (not allowed ) |
> | 0x01 | nothing | Empty payload followed by the closing sync byte |
> | 0x02 | one data byte | The single data byte, followed by the closing sync
> byte |
> | 0x03 | two data bytes | The pair of data bytes, followed by the closing
> sync byte |
> | 0x04 | three data bytes | The three data bytes, followed by the closing
> sync byte |
> | n | (n-1) data bytes | The (n-1) data bytes, followed by the closing sync
> byte |
> | 0xFD | 252 data bytes | The 252 data bytes, followed by the closing sync
> byte |
> | 0xFE | 253 data bytes | The 253 data bytes, followed by the closing sync
> byte |
> | 0xFF | 254 data bytes | The 254 data bytes *not* followed by a zero. |
> (taken from http://www.sigcomm.org/sigcomm97/papers/p062.pdf)
> h1. Encoding
> Only the block writer needs to perform byte-by-byte processing to encode the
> block. The overhead for COBS encoding is very small in terms of the
> in-memory state required.
> h1. Decoding
> Block readers are not required to do as much byte-by-byte processing as a
> writer. The reader could (for example) find a _metadata_ block by doing the
> following:
> # Search for a zero byte in the file which marks the start of a record
> # Read and zig-zag decode the _type_ of the block
> #* If the block is _normal_ data, read the _length_, seek ahead to the next
> block, and go to step 2 again
> #* If the block is a _metadata_ block, cobs decode the data