Github user sachingoel0101 commented on the pull request:
https://github.com/apache/flink/pull/949#issuecomment-128781613
I have worked on this problem before. The idea is to divide the data into
blocks and compute the probability of selecting an element from each block.
Suppose there are blocks B_1, B_2, ..., B_N with probabilities P_1,
P_2, ..., P_N. You then sample k points by first sampling k block ids from the
distribution {P_1, P_2, ..., P_N}, which tells you how many elements you
require from each block. After that you select the required number of points
from each block and take the union.
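A minimal single-process sketch of the two-stage scheme, assuming sampling with replacement and P_i proportional to the size of block B_i (the function name `blocked_sample` is hypothetical, not from the pull request):

```python
import random
from collections import Counter

def blocked_sample(blocks, k, rng=random):
    # Hypothetical sketch: sample k points with replacement from a list of
    # blocks, assuming P_i is proportional to len(blocks[i]).
    # Stage 1: draw k block ids from the distribution {P_1, ..., P_N}.
    sizes = [len(b) for b in blocks]
    block_ids = rng.choices(range(len(blocks)), weights=sizes, k=k)
    # Count how many elements we require from each block.
    need = Counter(block_ids)
    # Stage 2: select the required number of points from each block and
    # take the union.
    sample = []
    for i, n in need.items():
        sample.extend(rng.choices(blocks[i], k=n))
    return sample
```

In a shared-memory setting both stages run in one place, which is why the approach is straightforward there.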
This is pretty easy to implement in a shared-memory system, but it is hard in
a distributed one. I tried the following approach some time ago, although I
didn't quite finish working on it:
blockedData = Data -> (block_id, data)
blockNumbers = (block_id, data) -> (block_id, count)
(1...k) -> (list of block ids we'll be sampling from)
After this, I tried broadcasting the list and selecting the required number
of elements from each block, which can be done quite easily. But what if k is
very large?
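The three pseudocode steps above can be simulated in a single process as follows (a sketch under assumptions: partitions are plain lists, the block id is the partition index, and `sample_from_partitions` is a hypothetical name). Note that after collapsing the length-k list of block ids into (block_id, required_count) pairs, only N pairs need to be broadcast rather than k ids, which is one way the large-k case could be mitigated:

```python
import random
from collections import Counter

def sample_from_partitions(partitions, k, rng=random):
    # blockedData: tag each element with its block id (here, the partition index).
    blocked = [(bid, x) for bid, part in enumerate(partitions) for x in part]
    # blockNumbers: reduce to (block_id, count) pairs.
    counts = Counter(bid for bid, _ in blocked)
    ids = sorted(counts)
    # Draw k block ids from the distribution implied by the counts, then
    # collapse the length-k list into per-block requirements; only these
    # (block_id, required_count) pairs would need to be broadcast.
    drawn = rng.choices(ids, weights=[counts[b] for b in ids], k=k)
    need = Counter(drawn)
    # Each "partition" then selects its required number of elements locally.
    sample = []
    for bid, n in need.items():
        sample.extend(rng.choices(partitions[bid], k=n))
    return sample
```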