[ 
https://issues.apache.org/jira/browse/FLINK-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631731#comment-14631731
 ] 

ASF GitHub Bot commented on FLINK-2312:
---------------------------------------

GitHub user sachingoel0101 opened a pull request:

    https://github.com/apache/flink/pull/921

    [FLINK-2312][ml][WIP] Randomly Splitting a Data Set according to weights given

    Adds a method for randomly splitting a data set.
    
    However, there is a problem. We are effectively creating several data
    sources from one. Each of these sources acts independently later on, and
    execution is kicked off separately for each one. As a result, the splitting
    happens several times, so the outputs are several different random samples
    of the data that together are not exhaustive. @pp86, do you have any idea
    how to deal with this?
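
One possible way around the repeated-execution problem, sketched here in plain Python rather than against the Flink API: give every split the same seed and let each split keep only the elements whose random draw falls into its own slice of [0, 1). Because each pass replays the same sequence of draws, each element lands in exactly one split even though the passes run independently. The names `cumulative_bounds`, `split_view` and `random_split` are hypothetical and not part of this pull request, and the sketch assumes every pass sees the elements in the same order, which a distributed job would have to guarantee per partition.

```python
import random
from itertools import accumulate


def cumulative_bounds(weights):
    """Normalize weights and return one half-open [lo, hi) range per split."""
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and sum to a positive value")
    cum = [0.0] + [c / total for c in accumulate(weights)]
    cum[-1] = 1.0  # guard against rounding so the ranges cover all of [0, 1)
    return list(zip(cum[:-1], cum[1:]))


def split_view(data, bounds, seed):
    """One independent pass over the data that keeps a single split.

    Every pass re-creates the generator from the same seed, so the i-th
    element receives the same draw in every pass (assumption: stable order).
    """
    lo, hi = bounds
    rng = random.Random(seed)
    kept = []
    for item in data:
        draw = rng.random()  # uniform in [0.0, 1.0)
        if lo <= draw < hi:
            kept.append(item)
    return kept


def random_split(data, weights, seed):
    """Return one list per weight; each list is computed in its own pass."""
    return [split_view(data, b, seed) for b in cumulative_bounds(weights)]


if __name__ == "__main__":
    training, validation, testing = random_split(list(range(1000)), [0.6, 0.2, 0.2], seed=42)
    print(len(training), len(validation), len(testing))
```

The split sizes only approximate the requested ratios, since each element is assigned by an independent draw. If the passes used different seeds, the ranges would no longer line up and elements could be dropped or duplicated, which is the behaviour described above.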

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachingoel0101/flink random_splits

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/921.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #921
    
----
commit ad50fd5ebf1277302b93ad14ac2d0e4169687bfe
Author: Sachin Goel <[email protected]>
Date:   2015-07-17T18:41:20Z

    A first version of Random Splits

----


> Random Splits
> -------------
>
>                 Key: FLINK-2312
>                 URL: https://issues.apache.org/jira/browse/FLINK-2312
>             Project: Flink
>          Issue Type: Wish
>          Components: Machine Learning Library
>            Reporter: Maximilian Alber
>            Assignee: pietro pinoli
>            Priority: Minor
>
> In machine learning applications it is common to split data sets into, e.g., 
> training and testing sets.
> To the best of my knowledge, there is at the moment no nice way in Flink to 
> split a data set randomly into several partitions according to some ratio.
> The desired semantics would be the same as those of Spark's RDD randomSplit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
