[
https://issues.apache.org/jira/browse/TEZ-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479252#comment-16479252
]
Jason Lowe commented on TEZ-3937:
---------------------------------
Thanks for the patch!
I love how this drastically simplifies the code for BitSet serialization, but I
do have one concern with the change. If I'm reading the code correctly, the
new vesion will serialize and deserialize the bytes in the opposite order that
the old code does. I think it's fine as long as all callers are pairing these
two methods together, e.g.: all calls to deserialize a BitSet are passing byte
arrays that were encoded with the new encoded method.
What worries me there is if a job ever ran where some Tez jars happened to be
an older version before this change. The BitSet would be deserialized mostly
backwards (same order of bits within a byte but opposite byte order). That
could lead to silent dataloss since the consumer may think a partition is empty
when in fact it's a completely different partition is empty, due to how shuffle
interprets these BitSets.
If accidental mixing of Tez jars isn't a concern I think this change is fine.
If we're mostly worried about the wasted byte when the number of bits is a
multiple of 8 then I think a safer change is to fix the toByteArray method to
size the destination byte array properly instead.
> Empty partition BitSet to byte[] conversion creates one extra byte in
> rounding error
> ------------------------------------------------------------------------------------
>
> Key: TEZ-3937
> URL: https://issues.apache.org/jira/browse/TEZ-3937
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Priority: Major
> Attachments: TEZ-3937.001.patch
>
>
> Byte array length calculation is defined as (bitset.length / 8) + 1 which has
> off by one errors on byte boundaries. For example, BitSet of length 0 is
> converted to a byte array of length 1. This was introduced as part of TEZ-972
> since BitSet.toByteArray and valueOf were not supported as Tez supported Java
> 6 at the time and API was introduced in Java 7.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)