[ https://issues.apache.org/jira/browse/PIG-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Philip (flip) Kromer updated PIG-3900:
--------------------------------------
Description:
SAMPLE and RANDOM should be able to give output that is stable from run to run,
yet random across a large input set. Although PIG-2965 allows the RANDOM
function to be constructed with a seed, every mapper then generates the same
sequence of values, so the values repeat from split to split instead of being
random across the whole input. That is unacceptable.
It's typically undesirable to have the output of a large job be completely
non-deterministic. Testing becomes difficult, and failed map tasks don't
provide the same output from attempt to attempt, which complicates debugging.
The most desirable implementation would provide a guarantee that a given seed
and input data would produce an identical result in any environment. I believe
this is difficult in a distributed environment, however.
If each mapper added the index of its task ID to the provided seed, then the
output would be stable for most practical purposes -- as long as the assignment
of input splits to mappers doesn't change from job to job, the number produced
for each row won't change from job to job. Doing it this way would be backwards
compatible with the current Pig 0.12.0 implementation (PIG-2965) in the case of
a single mapper (which is the only justifiable use of the current seed
feature). Alternatively, one could use a hash of the input file path, the split
offset, and the provided seed. Neither approach is stable if the
splitCombination logic is not stable.
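To make the two seed derivations above concrete, here is a minimal Python
sketch. It is not Pig's implementation: the function names, the choice of
SHA-256 as the hash, and Python's random.Random standing in for the per-mapper
generator are all assumptions made for illustration.

```python
import hashlib
import random


def seed_from_task_index(user_seed, task_index):
    """Option 1 (hypothetical helper): add the mapper's task index to the seed."""
    return user_seed + task_index


def seed_from_split(user_seed, input_path, split_offset):
    """Option 2 (hypothetical helper): hash the path, split offset, and seed.

    SHA-256 is an assumption; any hash that is stable across runs and
    environments would serve the same purpose.
    """
    key = f"{input_path}\0{split_offset}\0{user_seed}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Keep the first 8 bytes as an unsigned 64-bit seed.
    return int.from_bytes(digest[:8], "big")


def mapper_values(seed, n_rows):
    """Stand-in for one mapper calling RANDOM() once per input row."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_rows)]


if __name__ == "__main__":
    # Mapper 0 reproduces the plain seeded sequence; mapper 1 gets its own.
    print(mapper_values(seed_from_task_index(12345, 0), 3))
    print(mapper_values(seed_from_task_index(12345, 1), 3))
    # The same (path, offset, seed) triple always yields the same derived seed.
    print(seed_from_split(12345, "position.tsv", 0))
```

With task index 0 the derived seed equals the provided seed, which is the
single-mapper backwards compatibility described above. The hash-based variant
gives that up in exchange for not depending on how task IDs are assigned.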
Suggested documentation for new functionality of RANDOM:
{quote}
This example constructs a function, providing a seed to control the series of
numbers generated. Each of the three fields will have an independent series of
random values, and the output will be stable from run to run. (Note that the
result is only stable if the input splits remain stable.)
{code:sql}
DEFINE rollRand RANDOM('12345');
DEFINE yawRand RANDOM('69');
DEFINE pitchRand RANDOM('42');
position = LOAD 'position.tsv';
orientation = FOREACH position GENERATE
    rollRand()  AS roll:double,
    pitchRand() AS pitch:double,
    yawRand()   AS yaw:double;
{code}
{quote}
Suggested documentation for new functionality of SAMPLE:
{quote}
In this example, we provide a seed that stabilizes which rows are selected from
run to run. (Note that the result is only stable if the input splits remain
stable.)
{code:sql}
a = LOAD 'a.txt';
b = SAMPLE a 0.1 SEED 42;
{code}
{quote}
was:
SAMPLE and RANDOM should be able to give output that is stable from run-to-run,
yet random across a large input set. Although PIG-2965 allows the RANDOM
function to be constructed with a seed, each mapper will generate the same
sequence of values, which is unacceptable.
It's typically undesirable to have the output of a large job be completely
non-deterministic. Testing becomes more complicated, and failed map tasks don't
provide the same output from attempt to attempt, making debugging difficult.
The most desirable implementation would provide a guarantee that a given seed
and input data would produce an identical result in any environment. I believe
this is difficult in a distributed environment, however.
If each mapper added the index of its task ID to the provided seed, then the
output would be stable for most practical purposes -- as long as the assignment
of input splits to mappers doesn't change from job to job, the number produced
for each row won't change from job to job. Doing it this way would be backwards
compatible with the current Pig 0.12.0 implementation (PIG-2965) in the case of
a single mapper (which is the only justifiable use of the current seed
feature). Alternatively, one could use a hash of the input file path, the split
offset, and the provided seed. Both approaches are not stable if the
splitCombination logic is not stable.
Suggested documentation for new functionality of RANDOM:
{quote}
This example constructs a function, providing a seed to control the series of
numbers generated. Each of the three fields will have an independent series of
random values, and the output will be stable from run to run. (Note that the
result is only stable if the input splits remain stable).
{code:sql}
DEFINE rollRand RANDOM('12345');
DEFINE yawRand RANDOM('69');
DEFINE pitchRand RANDOM('42');
position = LOAD 'position.tsv';
orientation = FOREACH position GENERATE rollRand() AS roll:double, pitchRand()
AS pitch:double, yawRand() AS yaw:double;
{code}
{quote}
Suggested documentation for new functionality of SAMPLE:
{quote}
In this example, we provide a seed that stabilizes which rows are selected from
run to run. (Note that the result is only stable if the input splits remain
stable).
{code:sql}
a = LOAD 'a.txt';
b = SAMPLE A 0.1 SEED 42;
{code}
{quote}
> SAMPLE and RANDOM should optionally stabilize their output from run-to-run,
> even across a large input set
> ---------------------------------------------------------------------------------------------------------
>
> Key: PIG-3900
> URL: https://issues.apache.org/jira/browse/PIG-3900
> Project: Pig
> Issue Type: Bug
> Reporter: Philip (flip) Kromer
> Priority: Minor
> Labels: features, random, sample, seed