[jira] Issue Comment Edited: (AVRO-712) define memcmp'able encoding

Scott Carey (JIRA) Thu, 23 Dec 2010 10:49:11 -0800

    [ 
https://issues.apache.org/jira/browse/AVRO-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974691#action_12974691
 ]


Scott Carey edited comment on AVRO-712 at 12/23/10 1:47 PM:
------------------------------------------------------------

Ints/Longs can be variable length, with the same space cost as before: The 
first byte in the sequence has the sign bit first, inverted (0 for negative 
numbers, 1 for positive).  The second bit is the 'is last byte in varint' bit 
-- 1 if there are more, 0 if it is the last byte.  The remaining encoding is 
like before -- zig-zag packed absolute value of the number.  However, if the 
number is negative we need to flip those bits as well as the 'is last byte in 
varint' bits.  Both encoding and decoding are more expensive this way, but 
memcmp would work and the space used would be the same.

Bytes/Strings/Arrays:
All are prefixed with a length in the current format.  you can't prefix with a 
length and get "abc" < "abcd" as well as "abc" < "b".  Lexicographic ordering 
of variable length fields is hard.  We could use a similar tactic to the 
varints above for Bytes and store 7 bits per byte, with a trailing marker bit 
indicating the end of the sequence.  This takes more space, and is definitely 
more expensive to serialize/deserialize.

I think UTF-8's binary properties might be beneficial here for Strings.  At 
least for each sequence of bytes that represents a character, it may already 
sort properly.  We just need a way to identify the end of the string.  I 
suppose trailing marker bytes or bits will work here too, with differing 
space/time tradeoffs.

Arrays can't use the trailing marker bit, a trailing marker byte per element 
would work.

Records:  I think if "ignored" fields were placed at the end, it would just 
work for sorting (equal fields would be adjacent) but an equivalence test would 
fail (one processes as > when they are equal).  



Overall, I am not convinced such an encoding would have that much value.  The 
extra time/space cost of encoding and decoding in many use cases would outweigh 
the savings of sorting and comparison.   For some languages, the cost of doing 
the comparison on the current binary format is either not too great or could be 
optimized quite a bit further than they are now (C/C+\+/Java).  More dynamic 
languages may benefit more from being able to call into an optimized memcmp 
routine since their conditional logic and method call overhead is greater.  For 
those languages, perhaps we should call out to an optimized Avro C or C+\+ stub 
for comparisons?  I wonder if we put our efforts into more optimized 
comparison, serialization, and deserialization if we would end up in a better 
place than adding a binary format.



      was (Author: scott_carey):
    Ints/Longs can be variable length, with the same space cost as before: The 
first byte in the sequence has the sign bit first, inverted (0 for negative 
numbers, 1 for positive).  The second bit is the 'is last byte in varint' bit 
-- 1 if there are more, 0 if it is the last byte.  The remaining encoding is 
like before -- zig-zag packed absolute value of the number.  However, if the 
number is negative we need to flip those bits as well as the 'is last byte in 
varint' bits.  Both encoding and decoding are more expensive this way, but 
memcmp would work and the space used would be the same.

Bytes/Strings/Arrays:
All are prefixed with a length in the current format.  you can't prefix with a 
length and get "abc" < "abcd" as well as "abc" < "b".  Lexicographic ordering 
of variable length fields is hard.  We could use a similar tactic to the 
varints above for Bytes and store 7 bits per byte, with a trailing marker bit 
indicating the end of the sequence.  This takes more space, and is definitely 
more expensive to serialize/deserialize.

I think UTF-8's binary properties might be beneficial here for Strings.  At 
least for each sequence of bytes that represents a character, it may already 
sort properly.  We just need a way to identify the end of the string.  I 
suppose trailing marker bytes or bits will work here too, with differing 
space/time tradeoffs.

Arrays can't use the trailing marker bit, a trailing marker byte per element 
would work.

Records:  I think if "ignored" fields were placed at the end, it would just 
work for sorting (equal fields would be adjacent) but an equivalence test would 
fail (one processes as > when they are equal).  



Overall, I am not convinced such an encoding would have that much value.  The 
extra time/space cost of encoding and decoding in many use cases would outweigh 
the savings of sorting and comparison.   For some languages, the cost of doing 
the comparison on the current binary format is either not too great or could be 
optimized quite a bit further than they are now (C/C++/Java).  More dynamic 
languages may benefit more from being able to call into an optimized memcmp 
routine since their conditional logic and method call overhead is greater.  For 
those languages, perhaps we should call out to an optimized Avro C or C++ stub 
for comparisons?  I wonder if we put our efforts into more optimized 
comparison, serialization, and deserialization if we would end up in a better 
place than adding a binary format.


  
> define memcmp'able encoding
> ---------------------------
>
>                 Key: AVRO-712
>                 URL: https://issues.apache.org/jira/browse/AVRO-712
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Doug Cutting
>         Attachments: memcmp_encoding_prototype.py
>
>
> It would be useful to have an encoding for Avro data that ordered data 
> according to the Avro specification under memcmp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (AVRO-712) define memcmp'able encoding

Reply via email to