[ 
https://issues.apache.org/jira/browse/PIG-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5311:
------------------------------------
    Attachment: PIG-5311-3.patch

bq. both rowProcessed and rowProcessedLong
  Was just the recent influence from writing bytecode :)

bq. this breaks the reservoir sampling algorithm
  Switched from Random to RandomDataGenerator from commons-math3 
(http://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/index.html). 
Thanks for catching this and not letting me take a shortcut by saying this is a 
very rare case.
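
For context, here is a minimal sketch of reservoir sampling with a long row counter. It is not the POReservoirSample code: the class and method names are hypothetical, and java.util.SplittableRandom stands in for RandomDataGenerator only so that the sketch needs no extra dependency. The point it illustrates is that the random index must be drawn uniformly from all rows seen so far, so a seedable generator with a long bound is needed once the count passes Integer.MAX_VALUE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;

// Hypothetical illustration only; not taken from the Pig source.
class ReservoirSketch {

    // Keeps a uniform sample of at most `size` row ids from a stream of
    // `streamLength` rows, using a seeded generator for repeatability.
    static List<Long> sample(long streamLength, int size, long seed) {
        SplittableRandom rand = new SplittableRandom(seed);
        List<Long> reservoir = new ArrayList<>(size);
        long rowsProcessed = 0; // long, so it cannot wrap at Integer.MAX_VALUE

        for (long row = 0; row < streamLength; row++) {
            if (rowsProcessed < size) {
                // Fill the reservoir with the first `size` rows.
                reservoir.add(row);
            } else {
                // Uniform index in [0, rowsProcessed], inclusive.
                long idx = rand.nextLong(rowsProcessed + 1);
                if (idx < size) {
                    reservoir.set((int) idx, row);
                }
            }
            rowsProcessed++;
        }
        return reservoir;
    }

    public static void main(String[] args) {
        System.out.println(sample(1_000_000L, 5, 42L));
    }
}
```

If the index were drawn from a range capped at the int limit, rows beyond that point would be kept with the wrong probability, which is presumably the kind of skew the review comment was concerned about.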

Other things that [~knoguchi] suggested in our offline discussion were:
1) https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadLocalRandom.html
Discarded this as it has nextLong but no option to provide a seed.
2) https://stackoverflow.com/questions/2546078/java-random-long-number-in-0-x-n-range
RandomDataGenerator and extending Random with nextLong() were two of the 
solutions I tried from this. The nextLong() extension produced some negative 
numbers, and I did not spend time debugging that. I tried RandomDataGenerator 
over some ranges and compared it to Random; the results were more or less 
similar, so I went with that.
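
As a side note on the negative numbers: the cause was not investigated here, but one common source is applying % directly to Random.nextLong(), since % keeps the sign of the dividend. The sketch below is an assumption about how a bounded nextLong on top of java.util.Random could avoid that, by clearing the sign bit and rejecting values from the incomplete last block (the same idea Random.nextInt(bound) uses). The class name is hypothetical and this is not the code that was tried.

```java
import java.util.Random;

// Hypothetical illustration only; not the extension that was tested.
class BoundedLongSketch {

    // Returns a value in [0, bound) drawn from the given Random.
    static long nextLong(Random rand, long bound) {
        if (bound <= 0) {
            throw new IllegalArgumentException("bound must be positive");
        }
        long bits;
        long val;
        do {
            // Drop the sign bit so the 63 remaining bits are non-negative.
            bits = rand.nextLong() >>> 1;
            val = bits % bound;
            // If bits falls in the last, incomplete block of size `bound`,
            // the sum overflows to a negative value and we retry.
        } while (bits - val + (bound - 1) < 0L);
        return val;
    }

    public static void main(String[] args) {
        Random rand = new Random(12345L);
        for (int i = 0; i < 5; i++) {
            System.out.println(nextLong(rand, 100_000_000_000L));
        }
    }
}
```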

{code}
long taskIdHashCode = "task_1509095573435_5386881_1_04_000462".hashCode();
        long randomSeed = ((long)taskIdHashCode << 32) | (taskIdHashCode & 0xffffffffL);
        Random rand = new Random(randomSeed);
        BufferedWriter bo = new BufferedWriter(new FileWriter("/tmp/int"));
        for( int i = 1 ; i < Integer.MAX_VALUE; i++) {
            long val = rand.nextInt(i);
            if (val < 100) {
                bo.write(i + " = " + val + "\n");
                System.out.println(i + " = " + val);
            }
            if (i % 10000 == 0) {
                bo.flush();
            }
        }
        bo.close();

        RandomDataGenerator randGen = new RandomDataGenerator();
        randGen.reSeed(randomSeed);
        BufferedWriter bo1 = new BufferedWriter(new FileWriter("/tmp/long"));
        for( long i = 1 ; i < 100000000000L; i++) { //100 Billion
            long val = randGen.nextLong(0, i);
            if (val < 100) {
                bo1.write(i + " = " + val + "\n");
                System.out.println(i + " = " + val);
            }
            if (i % 10000 == 0) {
                bo1.flush();
            }
        }
        bo1.close();
        System.out.println("Done");
{code} 

> POReservoirSample fails for more than Integer.MAX_VALUE records
> ---------------------------------------------------------------
>
>                 Key: PIG-5311
>                 URL: https://issues.apache.org/jira/browse/PIG-5311
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.18.0
>
>         Attachments: PIG-5311-1.patch, PIG-5311-2.patch, PIG-5311-3.patch
>
>
> https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POReservoirSample.java#L128
> rowProcessed is an int. When it exceeds the int range, it wraps around and 
> becomes a negative number, which throws the exception below. It needs to be 
> changed to long.
> {code}
> Caused by: java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POReservoirSample.getNextTuple(POReservoirSample.java:128)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:284)
> {code}
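
The failure described in the issue can be reproduced in isolation with a few lines. This is only a sketch of the mechanism, with a hypothetical class name and a local variable named after the field in the description: an int counter incremented past Integer.MAX_VALUE wraps to a negative value, and Random.nextInt rejects a non-positive bound.

```java
import java.util.Random;

// Hypothetical illustration only; not taken from the Pig source.
class IntOverflowDemo {

    // Returns true if Random.nextInt rejects the wrapped counter.
    static boolean failsAfterWrap() {
        int rowProcessed = Integer.MAX_VALUE;
        rowProcessed++; // wraps around to Integer.MIN_VALUE
        try {
            new Random(1L).nextInt(rowProcessed);
            return false;
        } catch (IllegalArgumentException e) {
            // Random.nextInt requires a strictly positive bound.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("nextInt fails after wrap: " + failsAfterWrap());
    }
}
```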



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
