Github user tillrohrmann commented on the pull request:
https://github.com/apache/flink/pull/949#issuecomment-128690389
The current state with the `RichMapPartitionFunctions` looks good to me
:+1:
You're right that users usually want to fix the size of the whole sample.
An easy solution could be to assign each item an index, see
`DataSetUtils.zipWithIndex`. Then we can compute the maximum index (which is
effectively counting the data set elements). This gives us the range from which
we have to sample. By generating a parallel sequence whose length equals our
sample size with `env.generateSequence`, we could then draw random indices from
`[0, maxIndex]`. Finally, we would have to join this data set with the original
data set which has the indices assigned. There are probably more efficient
algorithms out there than this one.
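To make the idea concrete, here is a minimal sketch of the proposed steps on plain Java collections rather than Flink `DataSet`s (so it runs standalone). The method name `sampleFixedSize` and the use of a `Map` to stand in for the index join are illustrative assumptions, not Flink API.

```java
import java.util.*;
import java.util.stream.*;

public class FixedSizeSample {
    // Sketch of the proposed algorithm (not actual Flink code):
    //  1. zip each element with a unique index (DataSetUtils.zipWithIndex in Flink)
    //  2. the maximum index effectively counts the elements
    //  3. draw sampleSize random indices from [0, maxIndex]
    //  4. join the drawn indices back with the indexed data
    static <T> List<T> sampleFixedSize(List<T> data, int sampleSize, Random rnd) {
        // Step 1: assign each element an index
        Map<Long, T> indexed = new HashMap<>();
        long i = 0;
        for (T element : data) {
            indexed.put(i++, element);
        }
        // Step 2: maximum index == count - 1
        long maxIndex = i - 1;

        // Step 3: draw sampleSize distinct random indices in [0, maxIndex]
        // (distinct, so the result has exactly sampleSize elements)
        Set<Long> chosen = new LinkedHashSet<>();
        while (chosen.size() < sampleSize) {
            chosen.add((long) (rnd.nextDouble() * (maxIndex + 1)));
        }

        // Step 4: "join" the chosen indices with the indexed data
        return chosen.stream().map(indexed::get).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c", "d", "e", "f");
        List<String> sample = sampleFixedSize(data, 3, new Random());
        System.out.println(sample.size()); // exactly the requested sample size
    }
}
```

In the distributed version, step 3 would be a parallel source of random indices and step 4 an actual join on the index field; the sketch only shows why the maximum index gives the sampling range.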
Just ping me when you've found a solution for the problem. Looking forward
to reviewing it :-)
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---