[
https://issues.apache.org/jira/browse/PIG-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550086#comment-14550086
]
li yuntian commented on PIG-3900:
---------------------------------
have you solved sample with seed problem yourself ?
> SAMPLE and RANDOM should optionally stabilize their output from run-to-run,
> even across a large input set
> ---------------------------------------------------------------------------------------------------------
>
> Key: PIG-3900
> URL: https://issues.apache.org/jira/browse/PIG-3900
> Project: Pig
> Issue Type: Bug
> Reporter: Philip (flip) Kromer
> Priority: Minor
> Labels: features, random, sample, seed
>
> SAMPLE and RANDOM should be able to give output that is stable from
> run-to-run, yet random across a large input set. Although PIG-2965 allows the
> RANDOM function to be constructed with a seed, each mapper will generate the
> same sequence of values, which is unacceptable.
> It's typically undesirable to have the output of a large job be completely
> non-deterministic. Testing becomes difficult, and failed map tasks don't
> provide the same output from attempt to attempt, which complicates debugging.
> The most desirable implementation would provide a guarantee that a given seed
> and input data would produce an identical result in any environment. I
> believe this is difficult in a distributed environment, however.
> If each mapper added the index of its task ID to the provided seed, then the
> output would be stable for most practical purposes -- as long as the
> assignment of input splits to mappers doesn't change from job to job, the
> number produced for each row won't change from job to job. Doing it this way
> would be backwards compatible with the current Pig 0.12.0 implementation
> (PIG-2965) in the case of a single mapper (which is the only justifiable use
> of the current seed feature). Alternatively, one could use a hash of the
> input file path, the split offset, and the provided seed. Both approaches are
> not stable if the splitCombination logic is not stable.
> Suggested documentation for new functionality of RANDOM:
> {quote}
> This example constructs a function, providing a seed to control the series of
> numbers generated. Each of the three fields will have an independent series
> of random values, and the output will be stable from run to run. (Note that
> the result is only stable if the input splits remain stable).
> {code:sql}
> DEFINE rollRand RANDOM('12345');
> DEFINE yawRand RANDOM('69');
> DEFINE pitchRand RANDOM('42');
> position = LOAD 'position.tsv';
> orientation = FOREACH position GENERATE rollRand() AS roll:double,
> pitchRand() AS pitch:double, yawRand() AS yaw:double;
> {code}
> {quote}
> Suggested documentation for new functionality of SAMPLE:
> {quote}
> In this example, we provide a seed that stabilizes which rows are selected
> from run to run. (Note that the result is only stable if the input splits
> remain stable).
> {code:sql}
> a = LOAD 'a.txt';
> b = SAMPLE A 0.1 SEED 42;
> {code}
> {quote}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)