[ https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261346#comment-16261346 ]
OlgaK commented on DATAFU-63:
-----------------------------

The overall test run fails on `datafu-pig:downloadOpenNlpModels`:
{noformat}
:datafu-pig:downloadOpenNlpModels FAILED

FAILURE: Build failed with an exception.

* Where:
Build file '/home/olga/DataFu/datafu-pig/build.gradle' line: 213

* What went wrong:
Execution failed for task ':datafu-pig:downloadOpenNlpModels'.
> Lorg/gradle/logging/ProgressLogger;
{noformat}
Running a single test, `./gradlew :datafu-pig:test -Dtest.single=UniformRandomSampleTest`, fails at the same point as well.

> SimpleRandomSample by a fixed number
> ------------------------------------
>
>                 Key: DATAFU-63
>                 URL: https://issues.apache.org/jira/browse/DATAFU-63
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: jian wang
>            Assignee: jian wang
>
> SimpleRandomSample currently supports random sampling by probability; it does
> not support randomly sampling a fixed number of items. ReservoirSample may do
> the work, but since it relies on an in-memory priority queue, memory issues may
> arise if we are going to sample a huge number of items, e.g. sampling 100M items
> from 100G of data.
>
> The suggested approach is to create a new class "SimpleRandomSampleByCount" that
> uses Manuver's rejection threshold to reject items whose weight exceeds the
> threshold as we go from mapper to combiner to reducer. The majority of the
> algorithm will be very similar to SimpleRandomSample, except that we do not use
> Bernstein's theory to accept items, and we replace the probability with
> p = k / n, where k is the number of items to sample and n is the total number of
> items seen locally in the mapper, combiner, and reducer. (A Pig sketch of this
> rejection step follows the quoted message below.)
>
> Quoting this requirement from others:
>
> "Hi folks,
>
> Question: does anybody know if there is a quicker way to randomly sample a
> specified number of rows from grouped data? I’m currently doing this, since it
> appears that the SAMPLE operator doesn’t work inside FOREACH statements:
>
> photosGrouped = GROUP photos BY farm;
> agg = FOREACH photosGrouped {
>     rnds = FOREACH photos GENERATE *, RANDOM() as rnd;
>     ordered_rnds = ORDER rnds BY rnd;
>     limitSet = LIMIT ordered_rnds 5000;
>     GENERATE group AS farm,
>              FLATTEN(limitSet.(photo_id, server, secret)) AS (photo_id, server, secret);
> };
>
> This approach seems clumsy, and appears to run quite slowly (I’m assuming the
> ORDER/LIMIT isn’t great for performance). Is there a less awkward way to do
> this?
>
> Thanks,
> "
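For intuition, the rejection step described in the issue can be sketched in plain Pig, assuming a target sample size of k = 5000. The relation and field names (`data`, `n_rel`, `w`, `p`) are made up for the sketch, and the real SimpleRandomSampleByCount would choose its threshold more carefully than the bare p = k / n shown here:
{code}
-- Illustrative sketch only: attach a uniform random weight to each row and
-- reject rows whose weight exceeds p = k / n, so about k rows survive.
-- An exact-size sample would then ORDER the survivors BY w and LIMIT to k.
n_rel = FOREACH (GROUP data ALL) GENERATE COUNT(data) AS n;
withWeights = FOREACH data GENERATE *, RANDOM() AS w,
              5000.0 / (double) n_rel.n AS p;  -- n_rel.n used as a scalar
survivors = FILTER withWeights BY w <= p;
{code}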
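As for the quoted question, DataFu's existing ReservoirSample UDF (the class the issue mentions) already gives a fixed-size sample per group without the nested ORDER/LIMIT idiom, subject to the in-memory reservoir caveat that motivates this ticket. A sketch, reusing the schema from the quoted script:
{code}
-- One reservoir of 5000 rows is kept per group, so memory grows with the
-- sample size; that limitation is what SimpleRandomSampleByCount targets.
DEFINE ReservoirSample datafu.pig.sampling.ReservoirSample('5000');

photosGrouped = GROUP photos BY farm;
agg = FOREACH photosGrouped GENERATE group AS farm,
      FLATTEN(ReservoirSample(photos.(photo_id, server, secret)))
          AS (photo_id, server, secret);
{code}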