[
https://issues.apache.org/jira/browse/PIG-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Koji Noguchi updated PIG-2975:
------------------------------
Attachment: pig-2975-trunk_v03-unionapproach.txt
bq. 2. We could just make a custom WritableComparator for this case
My impression is that using the BytesWritable compare directly would be the
fastest.
bq. (right now it is going to be TUPLE_1 / {TINYBYTEARRAY, SMALLBYTEARRAY,
BYTEARRAY} / SIZE/ and so on.
If the header size varies, I would need a switch somewhere, so I came up with
this crude approach.
{noformat}
/*
 * This class tries to optimize for the most common input, DataByteArray.
 * In order to preserve the alphabetical ordering for DataByteArray,
 * we skip the first 4 bytes when comparing.
 * For non-DataByteArray, 4 empty bytes are added so that content is not
 * skipped by the above offset. The order for non-DataByteArray would look
 * random since the comparison includes all the headers.
 *
 * Byte comparison is done on the pair (isByteArray, mValue) to avoid any
 * potential collision between DataByteArray and non-DataByteArray.
 *
 * Serialization structure:
 * struct {
 *   byte mNull;
 *   int size;             // empty for non-DataByteArray
 *   byte isByteArray;
 *   union {
 *     byte[size];         // for DataType.BYTEARRAY
 *     Tuple.serialized;   // for all others
 *   } mValue;
 *   byte mIndex;
 * }
 */
{noformat}
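To make the layout above concrete, here is a rough, self-contained sketch. It is not the code from the attached patch: the class name, the method names, and the choice to leave mNull and mIndex out of the comparison are all assumptions made for illustration. It only shows why skipping the 4-byte size keeps DataByteArray keys in alphabetical order.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the struct described above; not the patch itself.
public class UnionKeySketch {
    // Comparison starts after mNull (1 byte) and size (4 bytes).
    static final int COMPARE_START = 1 + 4;

    // DataByteArray case: mNull, size, isByteArray = 1, raw bytes, mIndex.
    static byte[] byteArrayKey(byte[] value, byte index) {
        return ByteBuffer.allocate(1 + 4 + 1 + value.length + 1)
                .put((byte) 0)            // mNull (0 = not null, assumed encoding)
                .putInt(value.length)     // size
                .put((byte) 1)            // isByteArray
                .put(value)               // mValue
                .put(index)               // mIndex
                .array();
    }

    // Any other type: 4 empty bytes, isByteArray = 0, serialized tuple, mIndex.
    static byte[] otherKey(byte[] serializedTuple, byte index) {
        return ByteBuffer.allocate(1 + 4 + 1 + serializedTuple.length + 1)
                .put((byte) 0)
                .putInt(0)                // empty 4 bytes so the offset skips no content
                .put((byte) 0)
                .put(serializedTuple)
                .put(index)
                .array();
    }

    // Unsigned lexicographic comparison of (isByteArray, mValue).
    // mNull and the trailing mIndex byte are ignored here for simplicity.
    static int compare(byte[] a, byte[] b) {
        int endA = a.length - 1;
        int endB = b.length - 1;
        int n = Math.min(endA, endB);
        for (int i = COMPARE_START; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return endA - endB;
    }

    static byte[] bytesKey(String s) {
        return byteArrayKey(s.getBytes(StandardCharsets.UTF_8), (byte) 0);
    }

    static String valueOf(byte[] key) {
        return new String(key, COMPARE_START + 1, key.length - 7, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<byte[]> keys = new ArrayList<>(
                Arrays.asList(bytesKey("1"), bytesKey("3"), bytesKey("22")));
        keys.sort(UnionKeySketch::compare);
        for (byte[] k : keys) {
            System.out.println(valueOf(k));   // prints 1, 22, 3 (one per line)
        }
    }
}
```

If the size field were included in the comparison, "3" (size 1) would sort before "22" (size 2), which is the kind of length-first ordering the 4-byte skip avoids.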
This sacrifices space for performance.
* For DataType.BYTEARRAY, it adds 2 more bytes for a small record (<256).
New: size (4 bytes) + isByteArray (1 byte) = 5 bytes.
Before: TUPLE_1 (1 byte) + TINYBYTEARRAY (1 byte) + size (1 byte) = 3 bytes.
* For non-BYTEARRAY, it adds 5 bytes: 4 empty bytes + 1 byte boolean. This is in
addition to whatever Tuple adds when serialized.
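The arithmetic above can be spelled out as a tiny sketch. The class and method names are hypothetical, and the width of the old size field is passed in as a parameter rather than derived from the real serializer; only the 1-byte case is stated in this message.

```java
// Hypothetical helper that restates the header-size arithmetic; not part of the patch.
public class HeaderOverheadSketch {
    // New layout header: 4-byte size (or 4 empty bytes) + 1-byte isByteArray flag.
    static int newHeaderBytes() {
        return 4 + 1;
    }

    // Old byte-array header: TUPLE_1 marker + byte-array type marker + size field.
    // sizeFieldWidth is 1 for the TINYBYTEARRAY case described above.
    static int oldByteArrayHeaderBytes(int sizeFieldWidth) {
        return 1 + 1 + sizeFieldWidth;
    }

    public static void main(String[] args) {
        int oldTiny = oldByteArrayHeaderBytes(1);
        int extra = newHeaderBytes() - oldTiny;
        System.out.println("old=" + oldTiny + " new=" + newHeaderBytes() + " extra=" + extra);
        // prints: old=3 new=5 extra=2
    }
}
```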
> TestTypedMap.testOrderBy failing with incorrect result
> -------------------------------------------------------
>
> Key: PIG-2975
> URL: https://issues.apache.org/jira/browse/PIG-2975
> Project: Pig
> Issue Type: Sub-task
> Affects Versions: 0.11
> Reporter: Koji Noguchi
> Assignee: Koji Noguchi
> Priority: Blocker
> Fix For: 0.11
>
> Attachments: PIG-2975-0_jco.patch, PIG-2975-0_jco-v2.patch,
> pig-2975-trunk_v01.txt, pig-2975-trunk_v02-broken.txt,
> pig-2975-trunk_v03-unionapproach.txt
>
>
> Looked at
> {noformat}
> junit.framework.AssertionFailedError
> at org.apache.pig.test.TestTypedMap.testOrderBy(TestTypedMap.java:352)
> {noformat}
> This looks like a valid test case failing with incorrect result.
> {noformat}
> % cat test/orderby.txt
> [key#1,key9#23]
> [key#3,key3#2]
> [key#22]
> % cat test/orderby.pig
> a = load 'test/orderby.txt' as (m:[]);
> b = foreach a generate m#'key' as b0;
> dump b;
> c = order b by b0;
> dump c;
> % java ... org.apache.pig.Main -x local test/orderby.pig
> [dump b]
> (1)
> (3)
> (22)
> ...
> [dump c]
> (1)
> (1)
> (22)
> %
> where did the '(3)' go?
> {noformat}