Github user sachingoel0101 commented on the pull request:
https://github.com/apache/flink/pull/949#issuecomment-128781613
I have worked on this problem before. The idea is to divide the data into
blocks and compute the probability of selecting an element from each block.
Suppose there are blocks B_1, B_2, ..., B_N with probabilities P_1,
P_2, ..., P_N. You then sample k points by first sampling k block ids from the
distribution {P_1, P_2, ..., P_N}, which tells you how many elements you
require from each block. After that you select the required number of points
from each block and take the union.
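A minimal single-process sketch of the two-stage scheme, assuming sampling with replacement and P_i proportional to the size of block B_i (the function name `blocked_sample` is hypothetical, not from the pull request):

```python
import random
from collections import Counter

def blocked_sample(blocks, k, rng=random):
    # Hypothetical sketch: sample k points with replacement from a list of
    # blocks, assuming P_i is proportional to len(blocks[i]).
    # Stage 1: draw k block ids from the distribution {P_1, ..., P_N}.
    sizes = [len(b) for b in blocks]
    block_ids = rng.choices(range(len(blocks)), weights=sizes, k=k)
    # Count how many elements we require from each block.
    need = Counter(block_ids)
    # Stage 2: select the required number of points from each block and
    # take the union.
    sample = []
    for i, n in need.items():
        sample.extend(rng.choices(blocks[i], k=n))
    return sample
```

In a shared-memory setting both stages run in one place, which is why the approach is straightforward there.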
This is pretty easy to implement in a shared-memory system, but it is hard in
a distributed one. I tried the following approach some time ago, although I
didn't quite finish working on it:
blockedData = Data -> (block_id, data)
blockNumbers = (block_id, data) -> (block_id, count)
(1...k) -> (list of block ids we'll be sampling from)
After this, I tried broadcasting the list and selecting the required number
of elements from each block, which can be done quite easily. But what if k is
very large?
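The three pseudocode steps above can be simulated in a single process as follows (a sketch under assumptions: partitions are plain lists, the block id is the partition index, and `sample_from_partitions` is a hypothetical name). Note that after collapsing the length-k list of block ids into (block_id, required_count) pairs, only N pairs need to be broadcast rather than k ids, which is one way the large-k case could be mitigated:

```python
import random
from collections import Counter

def sample_from_partitions(partitions, k, rng=random):
    # blockedData: tag each element with its block id (here, the partition index).
    blocked = [(bid, x) for bid, part in enumerate(partitions) for x in part]
    # blockNumbers: reduce to (block_id, count) pairs.
    counts = Counter(bid for bid, _ in blocked)
    ids = sorted(counts)
    # Draw k block ids from the distribution implied by the counts, then
    # collapse the length-k list into per-block requirements; only these
    # (block_id, required_count) pairs would need to be broadcast.
    drawn = rng.choices(ids, weights=[counts[b] for b in ids], k=k)
    need = Counter(drawn)
    # Each "partition" then selects its required number of elements locally.
    sample = []
    for bid, n in need.items():
        sample.extend(rng.choices(partitions[bid], k=n))
    return sample
```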