OlgaK commented on DATAFU-63:
-----------------------------

Thanks for the explanations, Eyal. I think the gradlew script can be "just used" on systems much like the one it was built on; as soon as the target system differs, I expect it to stop working. For example, while Ubuntu and Fedora are both Linux, they are, so to speak, different breeds, and so the script has to change. Moving it out of the git repo makes sense to me.

To me, AlgebraicEvalFunc (used in SimpleRandomSample) and AccumulatorEvalFunc (used in ReservoirSample) look substantially different. It is pointed out that ReservoirSample doesn't scale, but that is because it extends AccumulatorEvalFunc and is a singleton; in other words, it was built in a way that cannot scale, which is very confusing. At the concept level there is no difference between `p` (a float, the fraction of the set) and `k` (an int, the size of the requested sample). What I suggest is one module, an extension of AlgebraicEvalFunc, that handles both the `p` and `k` sample-size parameters: they are easily convertible into one another, the only difference being that `p` doesn't change when we cut the original set into chunks while `k` does, which doesn't mean the same chunking cannot be used (see the sketch at the end of this comment).

Regarding the implementation following the paper: for a small `p` (or `k`), the output of `Initial` will be far bigger than needed, since the run goes through the entire set (line 270) without a `k` or `p` constraint, while for `p`/`k` close to the set size, I doubt you'll get 99% of the requested result. In that case it should be exclusion from the original set rather than addition to an empty one.

Overall, uniform random sampling is very simple and doesn't require this much sophistication; the task gets trickier when one leaves the uniform distribution and tries to fit a curve (getQ1 and getQ2). In that case you indeed can't guarantee (rather, the chances are miserable) that the `k` elements taken will give someone the desired distribution. I'd suggest making simple stuff simple. One can get a guaranteed number or fraction of elements from a set when selection is uniform; then one doesn't need to go through the entire set in the best case, no extra calculations are required, and the memory requirements are decent (ceil(p*n) or k/number_of_chunks).

I'll try to make the implementation as I see it; hopefully it will be clearer in code than in words. I just need to figure out how to adjust `k` if I can't store it in the parent, and to resolve the gradle version dependency. This should give better performance in all aspects. HTH
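To make the `p`/`k` equivalence concrete, here is the sizing sketch mentioned above. It is a minimal illustration with made-up names (not the DataFu API): a fixed fraction `p` applies unchanged to every chunk, while a fixed count `k` has to be rescaled to each chunk's share of the data.

    // Minimal sketch of the p <-> k sizing conversion; illustrative names,
    // not the DataFu API.
    public final class SampleSizing {

        // Sample size implied by a fraction p on a chunk of the given size;
        // memory stays within ceil(p * chunkSize).
        public static long sizeFromFraction(double p, long chunkSize) {
            return (long) Math.ceil(p * chunkSize);
        }

        // Per-chunk share of a fixed total count k: convert k to the global
        // fraction p = k / totalSize, then apply that fraction to the chunk.
        public static long sizeFromCount(long k, long chunkSize, long totalSize) {
            double p = Math.min(1.0, (double) k / totalSize);
            return (long) Math.ceil(p * chunkSize);
        }
    }

Because of the ceiling, the per-chunk sizes can add up to slightly more than `k`, so an exact-`k` result would still need a final trim when the chunks are merged.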
> SimpleRandomSample by a fixed number
> ------------------------------------
>
>                 Key: DATAFU-63
>                 URL: https://issues.apache.org/jira/browse/DATAFU-63
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: jian wang
>            Assignee: jian wang
>
> SimpleRandomSample currently supports random sampling by probability; it does
> not support randomly sampling a fixed number of items. ReservoirSample may do
> the work, but since it relies on an in-memory priority queue, memory issues
> may happen if we are going to sample a huge number of items, e.g. sampling
> 100M items from 100G of data.
>
> The suggested approach is to create a new class "SimpleRandomSampleByCount"
> that uses Manuver's rejection threshold to reject items whose weight exceeds
> the threshold as we go from mapper to combiner to reducer. The majority of
> the algorithm will be very similar to SimpleRandomSample, except that we do
> not use Bernstein's theory to accept items and instead use the probability
> p = k / n, where k is the number of items to sample and n is the total number
> of items seen locally in the mapper, combiner, and reducer.
>
> Quoting this requirement from others:
>
> "Hi folks,
>
> Question: does anybody know if there is a quicker way to randomly sample a
> specified number of rows from grouped data? I'm currently doing this, since
> it appears that the SAMPLE operator doesn't work inside FOREACH statements:
>
> photosGrouped = GROUP photos BY farm;
> agg = FOREACH photosGrouped {
>     rnds = FOREACH photos GENERATE *, RANDOM() as rnd;
>     ordered_rnds = ORDER rnds BY rnd;
>     limitSet = LIMIT ordered_rnds 5000;
>     GENERATE group AS farm,
>         FLATTEN(limitSet.(photo_id, server, secret)) AS (photo_id, server, secret);
> };
>
> This approach seems clumsy, and appears to run quite slowly (I'm assuming the
> ORDER/LIMIT isn't great for performance). Is there a less awkward way to do
> this?
>
> Thanks,
> "
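For reference on the memory concern in the description: a minimal, generic sketch of reservoir sampling (textbook Algorithm R, not DataFu's ReservoirSample code) shows why the footprint grows with `k`: the whole sample is held in memory while the input streams by.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Generic reservoir sampling (Algorithm R) sketch; not DataFu's code.
    // Memory is O(k): the reservoir holds the full sample at all times,
    // which is the problem when k is huge (e.g. 100M items out of 100G).
    public final class Reservoir {
        public static <T> List<T> sample(Iterable<T> items, int k, Random rng) {
            List<T> reservoir = new ArrayList<>(k);
            long seen = 0;
            for (T item : items) {
                seen++;
                if (reservoir.size() < k) {
                    reservoir.add(item);                       // fill phase
                } else {
                    long j = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
                    if (j < k) {
                        reservoir.set((int) j, item);          // keep with prob k / seen
                    }
                }
            }
            return reservoir;
        }
    }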