Github user tillrohrmann commented on the pull request:
https://github.com/apache/flink/pull/949#issuecomment-128690389
The current state with the `RichMapPartitionFunctions` looks good to me
:+1:
You're right that users usually want to fix the size of the whole sample.
An easy solution could be to assign each item an index, see
`DataSetUtils.zipWithIndex`. Then we can compute the maximum index (which is
effectively counting the data set elements). This gives us the range from which
we have to sample. By generating a parallel sequence whose length equals our
sample size with `env.generateSequence`, we could then draw random indices from
`[0, maxIndex]`. Finally, we would have to join this data set with the original
data set which has the indices assigned. There are probably more efficient
algorithms out there than this one.
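To make the idea concrete, here is a minimal sketch of the proposed steps on plain Java collections rather than Flink `DataSet`s (so it runs standalone). The method name `sampleFixedSize` and the use of a `Map` to stand in for the index join are illustrative assumptions, not Flink API.

```java
import java.util.*;
import java.util.stream.*;

public class FixedSizeSample {
    // Sketch of the proposed algorithm (not actual Flink code):
    //  1. zip each element with a unique index (DataSetUtils.zipWithIndex in Flink)
    //  2. the maximum index effectively counts the elements
    //  3. draw sampleSize random indices from [0, maxIndex]
    //  4. join the drawn indices back with the indexed data
    static <T> List<T> sampleFixedSize(List<T> data, int sampleSize, Random rnd) {
        // Step 1: assign each element an index
        Map<Long, T> indexed = new HashMap<>();
        long i = 0;
        for (T element : data) {
            indexed.put(i++, element);
        }
        // Step 2: maximum index == count - 1
        long maxIndex = i - 1;

        // Step 3: draw sampleSize distinct random indices in [0, maxIndex]
        // (distinct, so the result has exactly sampleSize elements)
        Set<Long> chosen = new LinkedHashSet<>();
        while (chosen.size() < sampleSize) {
            chosen.add((long) (rnd.nextDouble() * (maxIndex + 1)));
        }

        // Step 4: "join" the chosen indices with the indexed data
        return chosen.stream().map(indexed::get).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c", "d", "e", "f");
        List<String> sample = sampleFixedSize(data, 3, new Random());
        System.out.println(sample.size()); // exactly the requested sample size
    }
}
```

In the distributed version, step 3 would be a parallel source of random indices and step 4 an actual join on the index field; the sketch only shows why the maximum index gives the sampling range.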
Just ping me when you've found a solution for the problem. Looking forward
to reviewing it :-)
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---