[ 
https://issues.apache.org/jira/browse/FLINK-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635100#comment-14635100
 ] 

Sachin Goel commented on FLINK-2312:
------------------------------------

[~albermax], usually, when the ratios are not too small, the number of elements 
actually observed is very close to the exact number. Guaranteeing an exactly 
sized sample is feasible when the number of elements being sampled is small; 
however, with a large fraction such as, say, 50%, I see no easy way to do 
this. 
One potential approach is to actually over-sample, say with twice the 
required probability, and then call first() to pick the exact number of 
samples. But this might end up taking too much time, wouldn't it?
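The over-sampling idea above could be sketched roughly as follows. This is a hypothetical illustration in plain Java, not Flink code: the class and method names are made up, and the truncation step stands in for calling first() on a DataSet. Each element is kept with twice the requested fraction, so the over-sample almost surely contains at least the target count, which is then trimmed exactly:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class OversampleSketch {

    // Hypothetical sketch: Bernoulli-sample with twice the requested
    // fraction, then truncate to the exact target count (this mirrors
    // the proposed "over-sample, then first()" approach).
    static List<Integer> exactSample(List<Integer> data, double fraction, long seed) {
        int target = (int) Math.round(fraction * data.size());
        double overFraction = Math.min(1.0, 2 * fraction);
        Random rnd = new Random(seed);
        List<Integer> sampled = new ArrayList<>();
        for (Integer element : data) {
            if (rnd.nextDouble() < overFraction) {
                sampled.add(element);
            }
        }
        // With high probability the over-sample has at least `target`
        // elements; keep exactly the first `target` of them.
        return sampled.subList(0, Math.min(target, sampled.size()));
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            data.add(i);
        }
        // Request a 10% sample; the result is trimmed to exactly 100
        // elements whenever the over-sample is large enough.
        List<Integer> sample = exactSample(data, 0.1, 42L);
        System.out.println(sample.size());
    }
}
```

The cost concern raised above comes from the truncation step: on a distributed DataSet, picking the first n elements forces data movement, whereas the plain Bernoulli sample is a single pass.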

> Random Splits
> -------------
>
>                 Key: FLINK-2312
>                 URL: https://issues.apache.org/jira/browse/FLINK-2312
>             Project: Flink
>          Issue Type: Wish
>          Components: Machine Learning Library
>            Reporter: Maximilian Alber
>            Assignee: pietro pinoli
>            Priority: Minor
>
> In machine learning applications it is common to split data sets into, e.g., 
> a training and a test set.
> To the best of my knowledge there is at the moment no nice way in Flink to 
> split a data set randomly into several partitions according to some ratio.
> The desired semantics would be the same as Spark's RDD.randomSplit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)