[ 
https://issues.apache.org/jira/browse/PIG-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217309#comment-15217309
 ] 

liyunzhang_intel commented on PIG-4838:
---------------------------------------

Here is a more detailed explanation of why TestBuiltin#testUniqueID behaves 
differently in MR mode and Spark mode.

The Pig script looks like this:
{code}
A = load './testUniqueID.txt' as (name); 
B = foreach A generate name, UniqueID();
store B into './testUnique.out';
{code}
The input file, shown with cat -A ($ marks the end of each line):
# cat -A testUniqueID.txt
1$
2$
3$
4$
5$
1$
2$
3$
4$
5$
$

testUniqueID.txt is 21 bytes long, so there will be 2 splits if we set 
mapred.max.split.size to 10.
In Spark mode, the splits look like this: Split0 contains 10 bytes, Split1 
contains 11 bytes.
{code}
Split0
1$
2$
3$
4$
5$

Split1
1$
2$
3$
4$
5$
$
{code}
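
The 10/11 byte sizes can be reproduced with a minimal sketch, assuming the splits are carved the way Hadoop's FileInputFormat does it, with its 1.1 slop factor. The function name compute_splits is hypothetical, and Pig's own input format may adjust or combine splits beyond what is shown here.

```python
SPLIT_SLOP = 1.1  # slop factor assumed from Hadoop's FileInputFormat


def compute_splits(file_length, split_size):
    """Return (offset, length) pairs for a file of file_length bytes."""
    splits = []
    remaining = file_length
    # Cut full-size splits only while more than 1.1x split_size is left.
    while remaining / split_size > SPLIT_SLOP:
        splits.append((file_length - remaining, split_size))
        remaining -= split_size
    # Whatever is left becomes the last split, so it may exceed split_size.
    if remaining:
        splits.append((file_length - remaining, remaining))
    return splits


print(compute_splits(21, 10))  # [(0, 10), (10, 11)]
```

With 21 bytes and a split size of 10, the remaining 11 bytes do not exceed 1.1 times the split size, so they stay together as one 11-byte split instead of becoming a 10-byte split plus a 1-byte split.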

In MR mode, the splits look like this: Split0 contains 11 bytes, Split1 
contains 10 bytes. org.apache.hadoop.mapreduce.JobSubmitter#writeNewSplits 
sorts the original splits by size, so that the biggest go first.
{code}
Split0
1$
2$
3$
4$
5$
$

Split1
1$
2$
3$
4$
5$
{code}
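
The reordering in MR mode can be sketched as follows. This is only an illustration of the "biggest first" sort described above, not the Hadoop code itself; order_for_mr is a hypothetical name and splits are modeled as (offset, length) pairs.

```python
def order_for_mr(splits):
    """Sort (offset, length) splits by length, largest first.

    The position in the returned list stands in for the map task index.
    """
    return sorted(splits, key=lambda s: s[1], reverse=True)


file_order = [(0, 10), (10, 11)]   # order seen in Spark mode
mr_order = order_for_mr(file_order)
print(mr_order)  # [(10, 11), (0, 10)]
```

The 11-byte split at offset 10 becomes task 0 and the 10-byte split at offset 0 becomes task 1, which is why the part files appear swapped between the two modes.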

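After B = foreach A generate name, UniqueID(); the result is as follows. 
Note that the record counts are consistent with Hadoop's line reader: a split 
also reads the line that starts at its end boundary, and every split except 
the first skips its first line, so the 10-byte split yields 6 records and the 
11-byte split yields 5.
Spark mode: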
{code}
testUnique.out/part-m-00000
1    0-0
2    0-1
3    0-2
4    0-3
5    0-4
1    0-5

testUnique.out/part-m-00001
2    0-0
3    0-1
4    0-2
5    0-3
      0-4
{code}
MR mode:
{code}
testUnique.out/part-m-00000
2    0-0
3    0-1
4    0-2
5    0-3
     0-4

testUnique.out/part-m-00001
1    1-0
2    1-1
3    1-2
4    1-3
5    1-4
1    1-5
{code}
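
Both outputs above can be reproduced with a small simulation. It is a sketch under stated assumptions: read_split imitates how a Hadoop-style line reader assigns lines to splits, unique_ids imitates the "<task index>-<sequence>" format visible in the output, and the task index of 0 for both Spark partitions is taken from the Spark output shown above rather than from the Spark code. All function names are hypothetical.

```python
DATA = b"1\n2\n3\n4\n5\n1\n2\n3\n4\n5\n\n"  # the 21-byte testUniqueID.txt


def read_split(data, start, length):
    """Return the lines a Hadoop-style line reader would emit for one split."""
    end = start + length
    pos = start
    if start != 0:
        # A non-first split discards everything up to the first newline;
        # that line is read by the previous split.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1
    records = []
    # Keep reading while the next line starts at or before the split end.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        line_end = len(data) if nl == -1 else nl
        records.append(data[pos:line_end].decode())
        pos = line_end + 1
    return records


def unique_ids(records, task_index):
    """Pair each record with '<task index>-<sequence>', sequence from 0."""
    return [(name, f"{task_index}-{seq}") for seq, name in enumerate(records)]


first = read_split(DATA, 0, 10)    # ['1', '2', '3', '4', '5', '1']
second = read_split(DATA, 10, 11)  # ['2', '3', '4', '5', '']

# Spark mode: partitions in file order, task index 0 for both (assumption).
spark = [unique_ids(first, 0), unique_ids(second, 0)]
# MR mode: the 11-byte split is task 0, the 10-byte split is task 1.
mr = [unique_ids(second, 0), unique_ids(first, 1)]
```

In this model the Spark result contains the same IDs (0-0 through 0-4) in both part files, while the MR result has distinct IDs across part files, which matches the difference the test exposes.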

> Fix test TestBuiltin
> --------------------
>
>                 Key: PIG-4838
>                 URL: https://issues.apache.org/jira/browse/PIG-4838
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4838.patch
>
>
> In https://builds.apache.org/job/Pig-spark/316/, the following unit tests fail:
> org.apache.pig.test.TestBuiltin.testRANDOMWithJob
> org.apache.pig.test.TestBuiltin.testUniqueID



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
