[ 
https://issues.apache.org/jira/browse/PIG-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477394#comment-13477394
 ] 

Koji Noguchi commented on PIG-2975:
-----------------------------------

This took me much longer to understand than I first anticipated.

This is not necessary a regression bug in PIG-2862, but PIG-2862 helped
manifest a subtle bug in pig that we always had for a long time.

In short, pig has a bug of using PigBytesRawComparator for
NullableBytesWritable key even though its content is stored as a Tuple.

To explain in detail, following is the serialized output format for
PigNullableWritable.java
{noformat}
|-------------|
| mNull       |  1byte
|-------------|
|             |
| mValue      |  WritableComparable __ bytes
|             |
|             |
|-------------|
| mIndex      |  1byte
|-------------|
{noformat}
When comparing through PigBytesRawComparator.compare(), we skip the above mNull
and mIndex so we can ignore that part.


Now, looking at what's stored inside mValue for
NullableBytesWritable (extends PigNullableWritable).
It holds one DataBytesArray entry inside a Tuple.
{noformat}
|-------------|
| mNull       |  1byte
|-------------|  ------------------
| TINYTUPLE   |  1byte            |
|-------------|                   |
|  sz=1       |  1byte            |
|-------------|                  mValue
|             |                   |
| DataByteArr |                   |
|             |                   |
|             |                   |
|-------------| -------------------
| mIndex      |  1byte
|-------------|
{noformat}

And expanding how DataByteArray is serialized for the smallest content, 1
byte.
{noformat}
|-------------|
| mNull       |  1byte
|-------------|  ------------------
| TINYTUPLE   |  1byte            |
|-------------|                   |
|  sz=1       |  1byte            |
|-------------|                  mValue(5bytes)
|TINYBYTEARRAY|  1byte            |
|-------------|                   |
|  sz=1       |  1byte            |
|-------------|                   |
|content 1byte|  1byte            |
|-------------| -------------------
| mIndex      |  1byte
|-------------|
{noformat}

As I mentioned in the beginning,
even though above PigNullableWritable is storing mValue as *Tuple*, we are using
PigBytesRawComparator(mWrappedComp = BytesWritable.Comparator) 
which actually *skips 4 bytes* when comparing the two writables.

{noformat}
BytesWritable.java
 37   private static final int LENGTH_BYTES = 4;
...
205     public int compare(byte[] b1, int s1, int l1,
206                        byte[] b2, int s2, int l2) {
207       return compareBytes(b1, s1+LENGTH_BYTES, l1-LENGTH_BYTES,
208                           b2, s2+LENGTH_BYTES, l2-LENGTH_BYTES);
209     }
{noformat}

So, even though there is a mismatch between having Tuple in 
NullableBytesWritable
and comparing them as ByteArrays, coincidentally it is not skipping the actual 
tuple
content.  (For larger bytearray whose size does not fit in 1 
byte(TINYBYTEARRAY), this compare method would take part of the header as 
content hurting the order but not the uniqueness.)

Now, with this new feature of PIG-2862 (TUPLE BinInterSedes byte identifier),
this delicate balance suddenly broke, when the minimum header size became 3 
from 4.
(TINYTUPLE + sz = 2bytes now became TUPLE_1 = 1byte)
{noformat}
|-------------|
| mNull       |  1byte
|-------------|  ------------------
| TUPLE_1     |  1byte            |
|-------------|                  mValue(4bytes)
|TINYBYTEARRAY|  1byte            |
|-------------|                   |
|  sz=1       |  1byte            |
|-------------|                   |
| tuple 1 byte|  1byte            |
|-------------| -------------------
| mIndex      |  1byte
|-------------|
{noformat}

So BytesWritable.Comparator started not only skipping the header but the
content itself.  On the original example, byte '1' and '3' became 0 byte 
comparisons and was treated as equal bytes leading to the weird output.
I confirmed this by inserting a debug statement inside
BytesWritable.Comparator.compare() and seeing 0 length byte array comparisons.

I believe the fix would be to store mValue as a BytesArray instead of a Tuple 
for NullableBytesWritable.java.
                
> TestTypedMap.testOrderBy failing with incorrect result 
> -------------------------------------------------------
>
>                 Key: PIG-2975
>                 URL: https://issues.apache.org/jira/browse/PIG-2975
>             Project: Pig
>          Issue Type: Sub-task
>    Affects Versions: 0.11
>            Reporter: Koji Noguchi
>            Priority: Blocker
>             Fix For: 0.11
>
>
> Looked at 
> {noformat}
> junit.framework.AssertionFailedError
>     at org.apache.pig.test.TestTypedMap.testOrderBy(TestTypedMap.java:352)
> {noformat}
> This looks like a valid test case failing with incorrect result.
> {noformat}
> % cat test/orderby.txt
> [key#1,key9#23]
> [key#3,key3#2]
> [key#22]
> % cat test/orderby.pig
> a = load 'test/orderby.txt' as (m:[]);
> b = foreach a generate m#'key' as b0;
> dump b;
> c = order b by b0;
> dump c;
> % java ... org.apache.pig.Main    -x local test/orderby.pig 
> [dump b]
> (1)
> (3)
> (22)
> ...
> [dump c]
> (1)
> (1)
> (22)
> %
> where did the '(3)' go?
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to