[ https://issues.apache.org/jira/browse/PIG-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477394#comment-13477394 ]
Koji Noguchi commented on PIG-2975: ----------------------------------- This took me much longer to understand than I first anticipated. This is not necessary a regression bug in PIG-2862, but PIG-2862 helped manifest a subtle bug in pig that we always had for a long time. In short, pig has a bug of using PigBytesRawComparator for NullableBytesWritable key even though its content is stored as a Tuple. To explain in detail, following is the serialized output format for PigNullableWritable.java {noformat} |-------------| | mNull | 1byte |-------------| | | | mValue | WritableComparable __ bytes | | | | |-------------| | mIndex | 1byte |-------------| {noformat} When comparing through PigBytesRawComparator.compare(), we skip the above mNull and mIndex so we can ignore that part. Now, looking at what's stored inside mValue for NullableBytesWritable (extends PigNullableWritable). It holds one DataBytesArray entry inside a Tuple. {noformat} |-------------| | mNull | 1byte |-------------| ------------------ | TINYTUPLE | 1byte | |-------------| | | sz=1 | 1byte | |-------------| mValue | | | | DataByteArr | | | | | | | | |-------------| ------------------- | mIndex | 1byte |-------------| {noformat} And expanding how DataByteArray is serialized for the smallest content, 1 byte. {noformat} |-------------| | mNull | 1byte |-------------| ------------------ | TINYTUPLE | 1byte | |-------------| | | sz=1 | 1byte | |-------------| mValue(5bytes) |TINYBYTEARRAY| 1byte | |-------------| | | sz=1 | 1byte | |-------------| | |content 1byte| 1byte | |-------------| ------------------- | mIndex | 1byte |-------------| {noformat} As I mentioned in the beginning, even though above PigNullableWritable is storing mValue as *Tuple*, we are using PigBytesRawComparator(mWrappedComp = BytesWritable.Comparator) which actually *skips 4 bytes* when comparing the two writables. {noformat} BytesWritable.java 37 private static final int LENGTH_BYTES = 4; ... 205 public int compare(byte[] b1, int s1, int l1, 206 byte[] b2, int s2, int l2) { 207 return compareBytes(b1, s1+LENGTH_BYTES, l1-LENGTH_BYTES, 208 b2, s2+LENGTH_BYTES, l2-LENGTH_BYTES); 209 } {noformat} So, even though there is a mismatch between having Tuple in NullableBytesWritable and comparing them as ByteArrays, coincidentally it is not skipping the actual tuple content. (For larger bytearray whose size does not fit in 1 byte(TINYBYTEARRAY), this compare method would take part of the header as content hurting the order but not the uniqueness.) Now, with this new feature of PIG-2862 (TUPLE BinInterSedes byte identifier), this delicate balance suddenly broke, when the minimum header size became 3 from 4. (TINYTUPLE + sz = 2bytes now became TUPLE_1 = 1byte) {noformat} |-------------| | mNull | 1byte |-------------| ------------------ | TUPLE_1 | 1byte | |-------------| mValue(4bytes) |TINYBYTEARRAY| 1byte | |-------------| | | sz=1 | 1byte | |-------------| | | tuple 1 byte| 1byte | |-------------| ------------------- | mIndex | 1byte |-------------| {noformat} So BytesWritable.Comparator started not only skipping the header but the content itself. On the original example, byte '1' and '3' became 0 byte comparisons and was treated as equal bytes leading to the weird output. I confirmed this by inserting a debug statement inside BytesWritable.Comparator.compare() and seeing 0 length byte array comparisons. I believe the fix would be to store mValue as a BytesArray instead of a Tuple for NullableBytesWritable.java. > TestTypedMap.testOrderBy failing with incorrect result > ------------------------------------------------------- > > Key: PIG-2975 > URL: https://issues.apache.org/jira/browse/PIG-2975 > Project: Pig > Issue Type: Sub-task > Affects Versions: 0.11 > Reporter: Koji Noguchi > Priority: Blocker > Fix For: 0.11 > > > Looked at > {noformat} > junit.framework.AssertionFailedError > at org.apache.pig.test.TestTypedMap.testOrderBy(TestTypedMap.java:352) > {noformat} > This looks like a valid test case failing with incorrect result. > {noformat} > % cat test/orderby.txt > [key#1,key9#23] > [key#3,key3#2] > [key#22] > % cat test/orderby.pig > a = load 'test/orderby.txt' as (m:[]); > b = foreach a generate m#'key' as b0; > dump b; > c = order b by b0; > dump c; > % java ... org.apache.pig.Main -x local test/orderby.pig > [dump b] > (1) > (3) > (22) > ... > [dump c] > (1) > (1) > (22) > % > where did the '(3)' go? > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira