[ 
https://issues.apache.org/jira/browse/AVRO-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974410#action_12974410
 ] 

Adam Warrington commented on AVRO-712:
--------------------------------------

Here is a first stab at a definition. I've written a prototype using these 
rules in Python, and will attach it to this ticket. The spec I've defined here 
is not nearly as efficient, in either space or encode/decode time, as the 
binary encoding Avro currently uses. Its support for features like the 
"ignored" and "decreasing" attributes on record fields is also lacking.

Spec:

int and long data is encoded by flipping the sign bit and writing the value in 
big-endian byte order.
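
A minimal Python sketch of the int/long rule above. It assumes fixed widths of 
4 bytes for int and 8 bytes for long; the function names are illustrative and 
are not taken from the attached prototype.

```python
import struct


def encode_int(value: int) -> bytes:
    """Encode a signed 32-bit int so that memcmp order matches numeric order."""
    if not -2**31 <= value < 2**31:
        raise ValueError("value does not fit in a 32-bit int")
    # Mask to the two's-complement bit pattern, then flip the sign bit.
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)


def encode_long(value: int) -> bytes:
    """Encode a signed 64-bit long so that memcmp order matches numeric order."""
    if not -2**63 <= value < 2**63:
        raise ValueError("value does not fit in a 64-bit long")
    return struct.pack(">Q", (value & 0xFFFFFFFFFFFFFFFF) ^ 0x8000000000000000)
```

With the sign bit flipped, negative values start with a byte below 0x80 and 
non-negative values start with a byte of 0x80 or above, so comparing the 
encoded bytes as unsigned values gives the numeric order.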

============================================

float and double data is encoded by flipping the sign bit and, if it is a 
negative number, taking the complement of all bits lower than the sign bit. 
The resulting 4 or 8 bytes are stored as a big-endian int or long.
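
A minimal Python sketch of the float/double rule above, assuming IEEE 754 
values. The function names are illustrative. For a negative number, flipping 
the sign bit and complementing the lower bits is the same as inverting every 
bit, which is how it is written here. NaN handling is not covered by the rule 
and is not addressed in this sketch.

```python
import struct


def encode_float(value: float) -> bytes:
    """Encode a 32-bit float so that memcmp order matches numeric order."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    if bits & 0x80000000:
        bits ^= 0xFFFFFFFF  # negative: invert all 32 bits
    else:
        bits ^= 0x80000000  # non-negative: flip only the sign bit
    return struct.pack(">I", bits)


def encode_double(value: float) -> bytes:
    """Encode a 64-bit double so that memcmp order matches numeric order."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    if bits & 0x8000000000000000:
        bits ^= 0xFFFFFFFFFFFFFFFF  # negative: invert all 64 bits
    else:
        bits ^= 0x8000000000000000  # non-negative: flip only the sign bit
    return struct.pack(">Q", bits)
```

Inverting the lower bits of negative numbers is needed because a larger 
magnitude means a smaller value when the sign is negative.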

=============================================

bytes: a byte array should end with a delimiter marking its end. The delimiter 
should be the lowest possible value, so that when two byte arrays are equal up 
to the point where one of them ends, the shorter one always orders before the 
longer one. If a byte of value 0x00 is encoded as the two bytes 0x00 0x01, and 
the end of the byte array is encoded as 0x00 0x00, this property holds. Some 
examples:

Alphabet 0-7
String1: 012020 -> 01120120100
String2: 01202 -> 011201200
String2 is less than String1

String1: 012021 -> 0112012100
String2: 012020 -> 01120120100
String2 is less than String1
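
A minimal Python sketch of the escaping rule for bytes described above. The 
function names are illustrative, and the decoder is included only to show that 
the escaping can be reversed.

```python
from typing import Tuple


def encode_bytes(data: bytes) -> bytes:
    """Escape each 0x00 as 0x00 0x01 and terminate with 0x00 0x00."""
    return data.replace(b"\x00", b"\x00\x01") + b"\x00\x00"


def decode_bytes(buf: bytes, pos: int = 0) -> Tuple[bytes, int]:
    """Decode one byte array starting at pos; return (value, next position)."""
    out = bytearray()
    while True:
        current = buf[pos]
        if current != 0x00:
            out.append(current)
            pos += 1
        elif buf[pos + 1] == 0x01:
            out.append(0x00)  # escaped zero byte
            pos += 2
        elif buf[pos + 1] == 0x00:
            return bytes(out), pos + 2  # end-of-array delimiter
        else:
            raise ValueError("invalid escape sequence")
```

Because the delimiter 0x00 0x00 is lower than any escaped or literal byte that 
could follow the same prefix, a byte array always orders before any longer 
array that it is a prefix of.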

============================

strings: encoded the same way as bytes, as defined above, so the ordering 
should hold

============================

boolean, null: encoded same as binary encoder

============================

maps: Not supported. In my opinion, attempting to encode a map with a memcmp 
DatumWriter should throw an exception.

=============================

enum: Encoded the same way a DatumWriter using the memcmp encoder would encode 
it, by writing the ordinal, since enums with lesser ordinals come before enums 
with greater ordinals.

=============================

arrays: I don't have a great answer here. One thing that would work, but isn't 
space efficient, is to write out a byte after every element in the array 
indicating whether it is the last element or not, where 0x00 indicates last 
element, and 0x01 indicates not. I've done this in a prototype I'll attach to 
this ticket.
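
A minimal Python sketch of the per-element marker idea above. The names are 
illustrative and are not taken from the prototype. The description does not 
say how an empty array is written, so this sketch rejects it; that choice is 
an assumption. The element encoder shown is a 4-byte sign-flipped int, used 
only to make the example self-contained.

```python
import struct
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

LAST = b"\x00"  # written after the final element
MORE = b"\x01"  # written after every other element


def encode_int(value: int) -> bytes:
    """Illustrative element encoder: sign-flipped big-endian 32-bit int."""
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)


def encode_array(items: Sequence[T], encode_item: Callable[[T], bytes]) -> bytes:
    """Write each element followed by a marker byte saying whether more follow."""
    if not items:
        # Assumption: the empty array is not covered by the description.
        raise ValueError("empty arrays are not handled in this sketch")
    out = bytearray()
    for index, item in enumerate(items):
        out += encode_item(item)
        out += LAST if index == len(items) - 1 else MORE
    return bytes(out)
```

Since the "last" marker is lower than the "more" marker, an array that is a 
prefix of another orders before it, as long as the element encoding is itself 
memcmp-ordered.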

=============================

record: Before writing out the fields in a record, the fields should be ordered 
lexicographically by name. Ignored fields either shouldn't be written (makes 
the encoder lossy), or the encoder should throw an exception (something I've 
done in a prototype). In the prototype I created, I also don't handle the 
"decreasing" field attribute, but this could be done by allowing the encoder to 
take an optional parameter specifying whether to encode the value in 
decreasing order, in which case it would invert all the bits before returning 
the encoded bytes.
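
A minimal Python sketch of the record rule above, including the bit inversion 
suggested for decreasing fields. All names and the shape of field_specs are 
hypothetical. The order strings follow the Avro specification's spelling 
("ascending", "descending", "ignore") for what this comment calls decreasing 
and ignored fields. Ignored fields raise an exception, matching the prototype 
behaviour described above.

```python
import struct
from typing import Any, Callable, Dict, Tuple

Encoder = Callable[[Any], bytes]


def encode_int(value: int) -> bytes:
    """Illustrative field encoder: sign-flipped big-endian 32-bit int."""
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)


def invert_bits(encoded: bytes) -> bytes:
    """Invert every bit so that memcmp order is reversed."""
    return bytes(byte ^ 0xFF for byte in encoded)


def encode_record(record: Dict[str, Any],
                  field_specs: Dict[str, Tuple[Encoder, str]]) -> bytes:
    """Encode fields in lexicographic name order, honouring each field's order."""
    out = bytearray()
    for name in sorted(field_specs):
        encoder, order = field_specs[name]
        if order == "ignore":
            raise ValueError("ignored field %r cannot be encoded" % name)
        encoded = encoder(record[name])
        if order == "descending":
            encoded = invert_bits(encoded)
        out += encoded
    return bytes(out)
```

For example, with a field "a" marked descending, a record with a=1 encodes to 
bytes that compare greater than those of a record with a=2.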

=============================

union: Encoded the same way as a DatumWriter using a memcmp encoder would 
encode it currently, first writing out the ordinal of the union using the 
memcmp encoder, then using the encoder to encode the value.
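
A minimal Python sketch of the union rule above. The names are illustrative, 
the caller is assumed to already know which branch the value belongs to, and 
the ordinal is assumed to be written as a 4-byte sign-flipped int. The null 
branch writes zero bytes, as in the binary encoding.

```python
import struct
from typing import Any, Callable, Sequence


def encode_int(value: int) -> bytes:
    """Sign-flipped big-endian 32-bit int, used for ordinals and int values."""
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)


def encode_null(value: None) -> bytes:
    """null is written as zero bytes."""
    return b""


def encode_union(branch_index: int, value: Any,
                 branch_encoders: Sequence[Callable[[Any], bytes]]) -> bytes:
    """Write the branch ordinal first, then the value with that branch's encoder."""
    return encode_int(branch_index) + branch_encoders[branch_index](value)
```

For a union of null and int, every null value orders before every int value 
because the ordinal is compared first.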

> define memcmp'able encoding
> ---------------------------
>
>                 Key: AVRO-712
>                 URL: https://issues.apache.org/jira/browse/AVRO-712
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Doug Cutting
>
> It would be useful to have an encoding for Avro data that ordered data 
> according to the Avro specification under memcmp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
