[
https://issues.apache.org/jira/browse/AVRO-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974410#action_12974410
]
Adam Warrington commented on AVRO-712:
--------------------------------------
Here is a first stab at a definition. I've written a prototype using these
rules in Python, and will attach it to this ticket. The spec I've defined here
is not nearly as efficient, in either space or time to encode/decode, as the
binary encoding Avro currently uses. Its support for features like the
"ignored" and "decreasing" attributes on record fields is also lacking.
Spec:
int and long data is encoded by flipping the sign bit, and encoding as big
endian.
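
A minimal Python sketch of this rule, for illustration only. It assumes fixed
widths of 4 bytes for int and 8 bytes for long; the helper names encode_int and
encode_long are hypothetical and not taken from the attached prototype.

```python
import struct

def encode_int(value: int) -> bytes:
    """Encode a signed 32-bit int so that memcmp order matches numeric order."""
    # Reinterpret as unsigned 32 bits, flip the sign bit, write big endian.
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)

def encode_long(value: int) -> bytes:
    """Same idea for a signed 64-bit long (assumed fixed 8-byte width)."""
    return struct.pack(">Q", (value & 0xFFFFFFFFFFFFFFFF) ^ 0x8000000000000000)

values = [-2**31, -7, -1, 0, 1, 42, 2**31 - 1]
encoded = [encode_int(v) for v in values]
# Python compares bytes objects lexicographically, like memcmp.
print(encoded == sorted(encoded))
```

Flipping the sign bit moves negative numbers below non-negative ones in
unsigned order, which is what a byte-wise comparison sees.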
============================================
float and double data is encoded by flipping the sign bit and, if it is a
negative number, taking the complement of all bits lower than the sign bit.
The resulting 4 or 8 bytes are stored as a big endian int or long.
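
A rough Python sketch of the float/double rule, assuming IEEE 754 bit patterns
obtained through the struct module. The function names are hypothetical, and
NaN handling is not addressed here. For a negative number, flipping the sign
bit and complementing the remaining bits amounts to inverting every bit.

```python
import struct

def encode_float(value: float) -> bytes:
    """Encode a 32-bit float so that memcmp order matches numeric order."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    if bits & 0x80000000:
        bits ^= 0xFFFFFFFF   # negative: flip sign bit and complement the rest
    else:
        bits ^= 0x80000000   # non-negative: flip the sign bit only
    return struct.pack(">I", bits)

def encode_double(value: float) -> bytes:
    """Same idea for a 64-bit double."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    if bits & 0x8000000000000000:
        bits ^= 0xFFFFFFFFFFFFFFFF
    else:
        bits ^= 0x8000000000000000
    return struct.pack(">Q", bits)

values = [float("-inf"), -2.5, -1.0, 0.0, 0.001, 1.0, 2.5, float("inf")]
encoded = [encode_double(v) for v in values]
print(encoded == sorted(encoded))
```

Note that under this rule -0.0 encodes to a value just below 0.0 rather than
equal to it.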
=============================================
bytes: bytes should have a delimiter marking the end of the byte array. This
delimiter should compare lower than any other value, so that when two byte
arrays are equal up to the point where one of them ends, the shorter one always
orders before the longer one. If one were to encode a byte of value 0x00 as two
bytes 0x00 0x01, and one were to encode end of byte array as 0x00 0x00, this
property would hold true. Some examples:
Alphabet 0-7
String1: 012020 -> 01120120100
String2: 01202 -> 011201200
String2 is less than String1
String1: 012021 -> 0112012100
String2: 012020 -> 01120120100
String2 is less than String1
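
The escaping rule can be sketched in a few lines of Python. This is only an
illustration; encode_bytes is a hypothetical name, not the prototype's.

```python
def encode_bytes(data: bytes) -> bytes:
    """Escape each 0x00 as 0x00 0x01 and terminate with 0x00 0x00."""
    return data.replace(b"\x00", b"\x00\x01") + b"\x00\x00"

# The string 012020 from the examples, one symbol per byte.
print(encode_bytes(bytes([0, 1, 2, 0, 2, 0])).hex())
# -> 0001010200010200010000

samples = [b"", b"\x00", b"\x00\x00", b"\x00\x01", b"\x01", b"ab", b"abc", b"\xff"]
by_raw = sorted(samples)
by_encoded = sorted(samples, key=encode_bytes)
print(by_raw == by_encoded)
```

Because the terminator 0x00 0x00 compares lower than any escaped or literal
byte that could follow, a byte array that is a prefix of another always sorts
first.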
============================
strings: encoded as bytes defined above, ordering should hold
============================
boolean, null: encoded same as binary encoder
============================
maps: Not supported. Attempting to encode a map with a memcmp DatumWriter
should throw an exception in my opinion.
=============================
enum: Encoded the same way as a DatumWriter using the memcmp encoder would
encode it, i.e. by writing the ordinal, since enums with lesser ordinals come
before enums with greater ordinals.
=============================
arrays: I don't have a great answer here. One thing that would work, but isn't
space efficient, is to write out a byte after every element in the array
indicating whether it is the last element or not, where 0x00 indicates last
element, and 0x01 indicates not. I've done this in a prototype I'll attach to
this ticket.
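
A sketch of the continuation-byte idea in Python, assuming the elements
themselves are written with a memcmp-safe encoder (a fixed-width int encoder is
used here as a stand-in). The names are hypothetical. The rule as described
does not say how an empty array is written, so this sketch simply rejects it.

```python
import struct

def encode_int(value: int) -> bytes:
    # Stand-in element encoder: flip the sign bit, write 4 bytes big endian.
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)

def encode_array(items, encode_item) -> bytes:
    """Write each element followed by 0x01 (more follow) or 0x00 (last)."""
    if not items:
        # Assumption: empty arrays are outside the rule sketched here.
        raise ValueError("empty arrays are not covered by this sketch")
    out = bytearray()
    last = len(items) - 1
    for i, item in enumerate(items):
        out += encode_item(item)
        out.append(0x00 if i == last else 0x01)
    return bytes(out)

arrays = [[-5], [1], [1, 0], [1, 2], [1, 2, 3], [2]]
encoded = [encode_array(a, encode_int) for a in arrays]
print(encoded == sorted(encoded))
```

The marker after a final element (0x00) is lower than the marker after a
non-final one (0x01), so an array that is a prefix of another sorts first.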
=============================
record: Before writing out the fields in a record, the fields should be ordered
lexicographically by name. Ignored fields either shouldn't be written (makes
the encoder lossy), or the encoder should throw an exception (something I've
done in a prototype). In the prototype I created, I also don't handle the
"decreasing" field attribute, but this could be done by allowing the encoder to
take an optional parameter specifying whether to encode the value in decreasing
order, in which case it would invert all the bits of the encoded value before
returning the encoded bytes.
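
One way the record rule might look in Python, as a sketch only. The field
description as (name, order, encoder) tuples is a hypothetical structure chosen
for this example, and the order values "ascending", "descending" and "ignore"
follow the Avro schema attribute rather than the wording used above.

```python
import struct

def encode_long(value: int) -> bytes:
    # Stand-in field encoder: flip the sign bit, write 8 bytes big endian.
    return struct.pack(">Q", (value & 0xFFFFFFFFFFFFFFFF) ^ 0x8000000000000000)

def encode_record(record: dict, fields) -> bytes:
    """fields: iterable of (name, order, encoder) tuples (hypothetical shape)."""
    out = bytearray()
    # Fields are written in lexicographic order of their names.
    for name, order, encode in sorted(fields, key=lambda f: f[0]):
        if order == "ignore":
            raise ValueError("ignored field %r cannot be encoded" % name)
        encoded = encode(record[name])
        if order == "descending":
            # Inverting every bit reverses the memcmp order of this field.
            encoded = bytes(b ^ 0xFF for b in encoded)
        out += encoded
    return bytes(out)

fields = [("b", "ascending", encode_long), ("a", "descending", encode_long)]
r1 = {"a": 1, "b": 5}
r2 = {"a": 2, "b": 0}
# "a" is written first and is descending, so r2 (a=2) sorts before r1 (a=1).
print(encode_record(r2, fields) < encode_record(r1, fields))
```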
=============================
union: Encoded the same way as a DatumWriter using a memcmp encoder would
encode it currently, first writing out the ordinal of the union using the
memcmp encoder, then using the encoder to encode the value.
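
As an illustration, a Python sketch for a ["null", "long"] union. It assumes
the branch ordinal is written as a fixed-width int and that null is written as
zero bytes, as in the binary encoding. All names are hypothetical.

```python
import struct

def encode_int(value: int) -> bytes:
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)

def encode_long(value: int) -> bytes:
    return struct.pack(">Q", (value & 0xFFFFFFFFFFFFFFFF) ^ 0x8000000000000000)

def encode_null(value: None) -> bytes:
    return b""  # null occupies zero bytes, as in the binary encoding

def encode_union(branch: int, value, branch_encoders) -> bytes:
    """Write the branch ordinal first, then the value with that branch's encoder."""
    return encode_int(branch) + branch_encoders[branch](value)

encoders = [encode_null, encode_long]   # branches of ["null", "long"]
null_value = encode_union(0, None, encoders)
long_value = encode_union(1, -3, encoders)
# Values from an earlier branch sort before values from a later branch.
print(null_value < long_value)
```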
> define memcmp'able encoding
> ---------------------------
>
> Key: AVRO-712
> URL: https://issues.apache.org/jira/browse/AVRO-712
> Project: Avro
> Issue Type: New Feature
> Components: spec
> Reporter: Doug Cutting
>
> It would be useful to have an encoding for Avro data that ordered data
> according to the Avro specification under memcmp.