[
https://issues.apache.org/jira/browse/AVRO-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974691#action_12974691
]
Scott Carey edited comment on AVRO-712 at 12/23/10 1:47 PM:
------------------------------------------------------------
Ints/Longs can be variable length, with the same space cost as before: The
first byte in the sequence has the sign bit first, inverted (0 for negative
numbers, 1 for positive). The second bit is the 'is last byte in varint' bit
-- 1 if there are more, 0 if it is the last byte. The remaining encoding is
like before -- zig-zag packed absolute value of the number. However, if the
number is negative we need to flip those bits as well as the 'is last byte in
varint' bits. Both encoding and decoding are more expensive this way, but
memcmp would work and the space used would be the same.
Bytes/Strings/Arrays:
All are prefixed with a length in the current format. you can't prefix with a
length and get "abc" < "abcd" as well as "abc" < "b". Lexicographic ordering
of variable length fields is hard. We could use a similar tactic to the
varints above for Bytes and store 7 bits per byte, with a trailing marker bit
indicating the end of the sequence. This takes more space, and is definitely
more expensive to serialize/deserialize.
I think UTF-8's binary properties might be beneficial here for Strings. At
least for each sequence of bytes that represents a character, it may already
sort properly. We just need a way to identify the end of the string. I
suppose trailing marker bytes or bits will work here too, with differing
space/time tradeoffs.
Arrays can't use the trailing marker bit, a trailing marker byte per element
would work.
Records: I think if "ignored" fields were placed at the end, it would just
work for sorting (equal fields would be adjacent) but an equivalence test would
fail (one processes as > when they are equal).
Overall, I am not convinced such an encoding would have that much value. The
extra time/space cost of encoding and decoding in many use cases would outweigh
the savings of sorting and comparison. For some languages, the cost of doing
the comparison on the current binary format is either not too great or could be
optimized quite a bit further than they are now (C/C+\+/Java). More dynamic
languages may benefit more from being able to call into an optimized memcmp
routine since their conditional logic and method call overhead is greater. For
those languages, perhaps we should call out to an optimized Avro C or C+\+ stub
for comparisons? I wonder if we put our efforts into more optimized
comparison, serialization, and deserialization if we would end up in a better
place than adding a binary format.
was (Author: scott_carey):
Ints/Longs can be variable length, with the same space cost as before: The
first byte in the sequence has the sign bit first, inverted (0 for negative
numbers, 1 for positive). The second bit is the 'is last byte in varint' bit
-- 1 if there are more, 0 if it is the last byte. The remaining encoding is
like before -- zig-zag packed absolute value of the number. However, if the
number is negative we need to flip those bits as well as the 'is last byte in
varint' bits. Both encoding and decoding are more expensive this way, but
memcmp would work and the space used would be the same.
Bytes/Strings/Arrays:
All are prefixed with a length in the current format. you can't prefix with a
length and get "abc" < "abcd" as well as "abc" < "b". Lexicographic ordering
of variable length fields is hard. We could use a similar tactic to the
varints above for Bytes and store 7 bits per byte, with a trailing marker bit
indicating the end of the sequence. This takes more space, and is definitely
more expensive to serialize/deserialize.
I think UTF-8's binary properties might be beneficial here for Strings. At
least for each sequence of bytes that represents a character, it may already
sort properly. We just need a way to identify the end of the string. I
suppose trailing marker bytes or bits will work here too, with differing
space/time tradeoffs.
Arrays can't use the trailing marker bit, a trailing marker byte per element
would work.
Records: I think if "ignored" fields were placed at the end, it would just
work for sorting (equal fields would be adjacent) but an equivalence test would
fail (one processes as > when they are equal).
Overall, I am not convinced such an encoding would have that much value. The
extra time/space cost of encoding and decoding in many use cases would outweigh
the savings of sorting and comparison. For some languages, the cost of doing
the comparison on the current binary format is either not too great or could be
optimized quite a bit further than they are now (C/C++/Java). More dynamic
languages may benefit more from being able to call into an optimized memcmp
routine since their conditional logic and method call overhead is greater. For
those languages, perhaps we should call out to an optimized Avro C or C++ stub
for comparisons? I wonder if we put our efforts into more optimized
comparison, serialization, and deserialization if we would end up in a better
place than adding a binary format.
> define memcmp'able encoding
> ---------------------------
>
> Key: AVRO-712
> URL: https://issues.apache.org/jira/browse/AVRO-712
> Project: Avro
> Issue Type: New Feature
> Components: spec
> Reporter: Doug Cutting
> Attachments: memcmp_encoding_prototype.py
>
>
> It would be useful to have an encoding for Avro data that ordered data
> according to the Avro specification under memcmp.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.