[
https://issues.apache.org/jira/browse/PIG-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Koji Noguchi updated PIG-4819:
------------------------------
Attachment: pig-4819-v02_fix_v02.patch
After discussing with Rohini, I simplified the call by building a long from
two joined jobid hashes and XOR-ing it with the task number. I also added
logic so that if RANDOM is called more than once in a script, each call
returns a different value.
{code}
B = FOREACH A generate RANDOM(), RANDOM();
{code}
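A minimal Java sketch of the seeding idea described above. It is not the patch
code: the class name, the method names, the use of String.hashCode() for both
halves of the long, and the way a call-site index is mixed in are all
assumptions made for illustration.

```java
import java.util.Random;

// Hypothetical sketch only; not the code in pig-4819-v02_fix_v02.patch.
final class DeterministicSeedSketch {

    // Join two 32-bit hashes into one 64-bit value (first hash in the high bits).
    static long joinHashes(int high, int low) {
        return ((long) high << 32) | (low & 0xFFFFFFFFL);
    }

    // The seed depends only on the job id, the task number, and which RANDOM()
    // call in the script this is, so a retried task attempt gets the same seed.
    static long seedFor(String jobId, int taskNumber, int callSiteIndex) {
        int jobHash = jobId.hashCode();            // assumption: plain String hash
        long base = joinHashes(jobHash, jobHash);  // assumption: same hash twice
        // Assumption: spread the call-site index over all 64 bits with an odd
        // constant so that two RANDOM() calls in one script get different seeds.
        long callSiteMix = callSiteIndex * 0x9E3779B97F4A7C15L;
        return base ^ taskNumber ^ callSiteMix;
    }

    // First value a task would produce for a given RANDOM() call site.
    static double firstValue(String jobId, int taskNumber, int callSiteIndex) {
        return new Random(seedFor(jobId, taskNumber, callSiteIndex)).nextDouble();
    }
}
```

With this shape, a retry of the same task regenerates the same sequence, while
the two RANDOM() calls in the FOREACH above would be seeded differently.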
bq. To add more randomness across jobs, adding submit time with XOR.
This didn't work with Tez, because it wasn't transferring
"pig.job.submitted.timestamp". I'm taking it out for now, but it would be
nice to have (even better with nanosecond resolution).
{quote}
bq. But should I simply extend org.apache.pig.builtin.RANDOM from
org.apache.pig.piggybank.evaluation.math.RANDOM
Would be ideal, but if they use newer piggybank jar with older version of pig
it will break. So I think duplicating code is better for now.
{quote}
Given the not-so-obvious changes I've made to the original RANDOM, I wasn't
comfortable with copying and pasting, so I simply went with the extending
option. My understanding is that in the worst case piggybank.RANDOM would
reference the original builtin.RANDOM without my changes, but it wouldn't fail.
> RANDOM() udf can lead to missing or redundant records
> -----------------------------------------------------
>
> Key: PIG-4819
> URL: https://issues.apache.org/jira/browse/PIG-4819
> Project: Pig
> Issue Type: Bug
> Reporter: Koji Noguchi
> Assignee: Koji Noguchi
> Fix For: 0.16.0
>
> Attachments: pig-4819-v01.patch, pig-4819-v02.patch,
> pig-4819-v02_fix_v01.patch, pig-4819-v02_fix_v02.patch
>
>
> When RANDOM() value is used for grouping/distinct/etc, it breaks the
> mapreduce rule and can lead to redundant or missing records.
> Some discussion can be found in
> https://issues.apache.org/jira/browse/PIG-3257?focusedCommentId=13669195#comment-13669195
> We should make RANDOM less random so that it'll produce the same sequence of
> random values from the task retries.