[jira] [Updated] (PIG-3900) SAMPLE and RANDOM should optionally stabilize their output from run-to-run, even across a large input set

Philip (flip) Kromer (JIRA) Wed, 16 Apr 2014 16:45:11 -0700

     [ 
https://issues.apache.org/jira/browse/PIG-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Philip (flip) Kromer updated PIG-3900:
--------------------------------------

    Description: 
SAMPLE and RANDOM should be able to give output that is stable from run-to-run, 
yet random across a large input set. Although PIG-2965 allows the RANDOM 
function to be constructed with a seed, each mapper will generate the same 
sequence of values, which is unacceptable.

It's typically undesirable to have the output of a large job be completely 
non-deterministic. Testing becomes more complicated, and failed map tasks don't 
provide the same output from attempt to attempt, making debugging difficult.

The most desirable implementation would provide a guarantee that a given seed 
and input data would produce an identical result in any environment. I believe 
this is difficult in a distributed environment, however.

If each mapper added the index of its task ID to the provided seed, then the 
output would be stable for most practical purposes -- as long as the assignment 
of input splits to mappers doesn't change from job to job, the number produced 
for each row won't change from job to job. Doing it this way would be backwards 
compatible with the current Pig 0.12.0 implementation (PIG-2965) in the case of 
a single mapper (which is the only justifiable use of the current seed 
feature). Alternatively, one could use a hash of the input file path, the split 
offset, and the provided seed. Both approaches are not stable if the 
splitCombination logic is not stable. 

Suggested documentation for new functionality of RANDOM:

{quote}
This example constructs a function, providing a seed to control the series of 
numbers generated. Each of the three fields will have an  independent series of 
random values, and the output will be stable from run to run. (Note that the 
result is only stable if the input splits remain stable).

{code:sql}
DEFINE rollRand  RANDOM('12345');
DEFINE yawRand   RANDOM('69');
DEFINE pitchRand RANDOM('42');
position = LOAD 'position.tsv';
orientation = FOREACH position GENERATE rollRand() AS roll:double, pitchRand() 
AS pitch:double, yawRand() AS yaw:double;
{code}
{quote}

Suggested documentation for new functionality of SAMPLE:

{quote}
In this example, we provide a seed that stabilizes which rows are selected from 
run to run. (Note that the result is only stable if the input splits remain 
stable).

{code:sql}
a = LOAD 'a.txt';
b = SAMPLE A 0.1 SEED 42;
{code}

{quote}


  was:
SAMPLE and RANDOM should be able to give output that is stable from run-to-run, 
yet random across a large input set. Although PIG-2965 allows the RANDOM 
function to be constructed with a seed, each mapper will generate the same 
sequence of values, which is unacceptable.

It's typically undesirable to have the output of a large job be completely 
non-deterministic. Testing becomes more complicated, and failed map tasks don't 
provide the same output from attempt to attempt, making debugging difficult.

The most desirable implementation would provide a guarantee that a given seed 
and input data would produce an identical result in any environment. I believe 
this is difficult in a distributed environment, however.

If each mapper added the index of its task ID to the provided seed, then the 
output would be stable for most practical purposes -- as long as the assignment 
of input splits to mappers doesn't change from job to job, the number produced 
for each row won't change from job to job. Doing it this way would be backwards 
compatible with the current Pig 0.12.0 implementation (PIG-2965) in the case of 
a single mapper (which is the only justifiable use of the current seed 
feature). Alternatively, one could use a hash of the input file path, the split 
offset, and the provided seed. Both approaches are not stable if the 
splitCombination logic is not stable. 

Suggested documentation for new functionality of RANDOM:

{quote}
This example constructs a function, providing a seed to control the series of 
numbers generated. Each of the three fields will have an  independent series of 
random values, and the output will be stable from run to run. (Note that the 
result is only stable if the input splits remain stable).

{code:pig}
DEFINE rollRand  RANDOM('12345');
DEFINE yawRand   RANDOM('69');
DEFINE pitchRand RANDOM('42');
position = LOAD 'position.tsv';
orientation = FOREACH position GENERATE rollRand() AS roll:double, pitchRand() 
AS pitch:double, yawRand() AS yaw:double;
{code}
{quote}

Suggested documentation for new functionality of SAMPLE:

{quote}
In this example, we provide a seed that stabilizes which rows are selected from 
run to run. (Note that the result is only stable if the input splits remain 
stable).

{code:pig}
a = LOAD 'a.txt';
b = SAMPLE A 0.1 SEED 42;
{code}

{quote}



> SAMPLE and RANDOM should optionally stabilize their output from run-to-run, 
> even across a large input set
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-3900
>                 URL: https://issues.apache.org/jira/browse/PIG-3900
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Philip (flip) Kromer
>            Priority: Minor
>              Labels: features, random, sample, seed
>
> SAMPLE and RANDOM should be able to give output that is stable from 
> run-to-run, yet random across a large input set. Although PIG-2965 allows the 
> RANDOM function to be constructed with a seed, each mapper will generate the 
> same sequence of values, which is unacceptable.
> It's typically undesirable to have the output of a large job be completely 
> non-deterministic. Testing becomes more complicated, and failed map tasks 
> don't provide the same output from attempt to attempt, making debugging 
> difficult.
> The most desirable implementation would provide a guarantee that a given seed 
> and input data would produce an identical result in any environment. I 
> believe this is difficult in a distributed environment, however.
> If each mapper added the index of its task ID to the provided seed, then the 
> output would be stable for most practical purposes -- as long as the 
> assignment of input splits to mappers doesn't change from job to job, the 
> number produced for each row won't change from job to job. Doing it this way 
> would be backwards compatible with the current Pig 0.12.0 implementation 
> (PIG-2965) in the case of a single mapper (which is the only justifiable use 
> of the current seed feature). Alternatively, one could use a hash of the 
> input file path, the split offset, and the provided seed. Both approaches are 
> not stable if the splitCombination logic is not stable. 
> Suggested documentation for new functionality of RANDOM:
> {quote}
> This example constructs a function, providing a seed to control the series of 
> numbers generated. Each of the three fields will have an  independent series 
> of random values, and the output will be stable from run to run. (Note that 
> the result is only stable if the input splits remain stable).
> {code:sql}
> DEFINE rollRand  RANDOM('12345');
> DEFINE yawRand   RANDOM('69');
> DEFINE pitchRand RANDOM('42');
> position = LOAD 'position.tsv';
> orientation = FOREACH position GENERATE rollRand() AS roll:double, 
> pitchRand() AS pitch:double, yawRand() AS yaw:double;
> {code}
> {quote}
> Suggested documentation for new functionality of SAMPLE:
> {quote}
> In this example, we provide a seed that stabilizes which rows are selected 
> from run to run. (Note that the result is only stable if the input splits 
> remain stable).
> {code:sql}
> a = LOAD 'a.txt';
> b = SAMPLE A 0.1 SEED 42;
> {code}
> {quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (PIG-3900) SAMPLE and RANDOM should optionally stabilize their output from run-to-run, even across a large input set

Reply via email to