[ https://issues.apache.org/jira/browse/PIG-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305556#comment-16305556 ]
Koji Noguchi commented on PIG-5311: ----------------------------------- Thanks Rohini. For readability, I'm not sure if having -- both rowProcessed and rowProcessedLong is better than -- having a single count and typecasting at multiple places. I think latter is better but no strong preference. Outside of that, found one critical issue. Learning the "Reservoir sampling" from https://en.wikipedia.org/wiki/Reservoir_sampling and looking at the following change, {code:java} - int rand = randGen.nextInt(rowProcessed + 1); + int rand = randGen.nextInt(getPositiveVal(rowProcessed)); if (rand < numSamples) { samples[rand] = res; } {code} this breaks the reservoir sampling algorithm (when count is larger than Integer.MAX_VALUE). Random number generated here has to be from 0...rowProcessedLong in order to preserve the feature of this algorithm: each item is kept with probability of {{numSamples / total}}. > POReservoirSample fails for more than Integer.MAX_VALUE records > --------------------------------------------------------------- > > Key: PIG-5311 > URL: https://issues.apache.org/jira/browse/PIG-5311 > Project: Pig > Issue Type: Bug > Reporter: Rohini Palaniswamy > Assignee: Rohini Palaniswamy > Fix For: 0.18.0 > > Attachments: PIG-5311-1.patch, PIG-5311-2.patch > > > https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POReservoirSample.java#L128 > The rowProcessed is a int. When it exceeds the int range it wraps around and > becomes a negative number throwing below exception. It needs to be changed to > long. > {code} > Caused by: java.lang.IllegalArgumentException: bound must be positive > at java.util.Random.nextInt(Random.java:388) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POReservoirSample.getNextTuple(POReservoirSample.java:128) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:284) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)