[
https://issues.apache.org/jira/browse/PIG-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217309#comment-15217309
]
liyunzhang_intel commented on PIG-4838:
---------------------------------------
Here is more detail on why TestBuiltin#testUniqueID produces different results
in MR mode and Spark mode.
The Pig script looks like this:
{code}
A = load './testUniqueID.txt' as (name);
B = foreach A generate name, UniqueID();
store B into './testUnique.out';
{code}
The input file, shown with cat -A (each $ marks a line end; the last line is empty):
# cat -A testUniqueID.txt
1$
2$
3$
4$
5$
1$
2$
3$
4$
5$
$
testUniqueID.txt is 21 bytes long, so there will be 2 splits if we set
mapred.max.split.size to 10.
In Spark mode, the splits look like this: Split0 contains 10 bytes and Split1
contains 11 bytes.
{code}
Split0
1$
2$
3$
4$
5$
Split1
1$
2$
3$
4$
5$
$
{code}
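The 10/11 byte division above can be reproduced with a small model of file splitting. This is only a sketch, not Hadoop source: the class and method names are hypothetical, and it assumes the splitter allows the last split to be up to 10% larger than the configured split size (a slop factor of 1.1), which is why 21 bytes yields 10 + 11 rather than 10 + 10 + 1.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of how a file is cut into splits; not the actual Hadoop code.
final class SplitSizes {
    // Assumption: the final split may exceed splitSize by up to 10%.
    static final double SPLIT_SLOP = 1.1;

    // Returns the split lengths, in file order, for a file of fileLength bytes.
    static List<Long> compute(long fileLength, long splitSize) {
        List<Long> sizes = new ArrayList<>();
        long remaining = fileLength;
        // Cut full-size splits while the remainder is too big for one split.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            sizes.add(splitSize);
            remaining -= splitSize;
        }
        // Whatever is left becomes the last split.
        if (remaining != 0) {
            sizes.add(remaining);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // 21-byte file, max split size 10
        System.out.println(compute(21, 10));
    }
}
```

With a 21-byte file and a split size of 10, this model gives a 10-byte split followed by an 11-byte split, in file order.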
In MR mode, the splits look like this: Split0 contains 11 bytes and Split1
contains 10 bytes. This is because
org.apache.hadoop.mapreduce.JobSubmitter#writeNewSplits sorts the original
splits by size, so that the biggest go first.
{code}
Split0
1$
2$
3$
4$
5$
$
Split1
1$
2$
3$
4$
5$
{code}
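The reordering in MR mode can be sketched as a sort by length in descending order. The Split class and biggestFirst method below are hypothetical stand-ins used only for illustration; they are not the types used by JobSubmitter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative model of "biggest split first" ordering; not Hadoop source.
final class SplitOrder {
    // Hypothetical stand-in for an input split: start offset and length in bytes.
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    // Returns a copy of the splits sorted by length, largest first.
    static List<Split> biggestFirst(List<Split> splits) {
        List<Split> sorted = new ArrayList<>(splits);
        sorted.sort(Comparator.comparingLong((Split s) -> s.length).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<Split> fileOrder = new ArrayList<>();
        fileOrder.add(new Split(0, 10));  // first 10 bytes of the file
        fileOrder.add(new Split(10, 11)); // remaining 11 bytes
        for (Split s : biggestFirst(fileOrder)) {
            System.out.println("start=" + s.start + " length=" + s.length);
        }
    }
}
```

After sorting, the 11-byte split that starts at offset 10 comes first, so it is the one processed as task 0 in MR mode, while Spark mode keeps the file order.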
After B = foreach A generate name, UniqueID(); the results are as follows.
Spark mode:
{code}
testUnique.out/part-m-00000
1 0-0
2 0-1
3 0-2
4 0-3
5 0-4
1 0-5
testUnique.out/part-m-00001
2 0-0
3 0-1
4 0-2
5 0-3
0-4
{code}
MR mode:
{code}
testUnique.out/part-m-00000
2 0-0
3 0-1
4 0-2
5 0-3
0-4
testUnique.out/part-m-00001
1 1-0
2 1-1
3 1-2
4 1-3
5 1-4
1 1-5
{code}
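The record counts per part file (six for the 10-byte split, five for the 11-byte split) follow from how line-oriented readers usually treat split boundaries. The sketch below is a simplified model with hypothetical names, not Pig or Hadoop source. It assumes that a split not starting at offset 0 skips its first line, that a split keeps reading any line that starts at or before its end offset, and that the generated ID has the form "task index, dash, per-task sequence number", as seen in the output above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified model of line-oriented split reading plus UniqueID-style tagging.
// All names are hypothetical; this is not Pig or Hadoop source code.
final class UniqueIdModel {

    // Returns the offset just past the line that starts at pos.
    static int nextLineStart(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    // Returns the lines a reader would emit for the split [start, start + length).
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Assumption: the first line is left to the previous split.
            pos = nextLineStart(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // Assumption: a line that starts at or before 'end' is still read.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            int lineEnd = data[next - 1] == '\n' ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    // Appends "<taskIndex>-<sequence>" to each name, restarting at 0 per task.
    static List<String> tag(List<String> names, int taskIndex) {
        List<String> out = new ArrayList<>();
        long sequence = 0;
        for (String name : names) {
            out.add(name + "\t" + taskIndex + "-" + sequence);
            sequence++;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "1\n2\n3\n4\n5\n1\n2\n3\n4\n5\n\n".getBytes(StandardCharsets.UTF_8);
        // MR mode ordering: the 11-byte split is task 0, the 10-byte split is task 1.
        for (String row : tag(readSplit(data, 10, 11), 0)) {
            System.out.println(row);
        }
        for (String row : tag(readSplit(data, 0, 10), 1)) {
            System.out.println(row);
        }
    }
}
```

Under these assumptions the 10-byte split reads 1, 2, 3, 4, 5, 1 (the sixth line starts exactly at its end offset), and the 11-byte split skips that line and reads 2, 3, 4, 5 plus the empty line. Tagging them with task indexes 0 and 1 in MR order gives the same rows as the MR part files listed above.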
> Fix test TestBuiltin
> --------------------
>
> Key: PIG-4838
> URL: https://issues.apache.org/jira/browse/PIG-4838
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4838.patch
>
>
> In https://builds.apache.org/job/Pig-spark/316/, following unit tests fail:
> org.apache.pig.test.TestBuiltin.testRANDOMWithJob
> org.apache.pig.test.TestBuiltin.testUniqueID