[ https://issues.apache.org/jira/browse/FLINK-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635328#comment-14635328 ]
ASF GitHub Bot commented on FLINK-2312:
---------------------------------------
Github user sachingoel0101 commented on the pull request:
https://github.com/apache/flink/pull/921#issuecomment-123390786
This leads to non-mutually exclusive splits. I tracked down the reason for
this: the input data is parallelized differently while performing the split
for each fraction. This leads to an altogether different sequence of random
numbers, hence the problem.
@tillrohrmann, I use a seed value to initialize the random number generator,
as in the cross-validation PR. Is there any way I can fix the parallelization
of the data, so that running the split for every fraction produces exactly
the same sequence of random numbers? Persisting would write the data to disk,
which is something of an overhead, isn't it?
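One way to make the split independent of the parallelization is sketched below. This is a hypothetical sketch, not the implementation in #921: instead of drawing from a per-partition random sequence, the split decision is derived deterministically from each element and the seed, so it does not depend on how the input is partitioned or in which order elements arrive.

```scala
import org.apache.flink.api.scala._

import scala.util.hashing.MurmurHash3

// Hypothetical sketch (object name and helper are illustrative only):
// the split decision is a pure function of the element and the seed.
object SeededSplitSketch extends Serializable {

  // Maps an element to a pseudo-random value in [0, 1) computed from its hash
  // and the seed; the same element always produces the same value.
  def uniform(element: Any, seed: Long): Double = {
    val h = MurmurHash3.productHash((element, seed)) & Int.MaxValue
    h.toDouble / Int.MaxValue
  }

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = env.fromCollection(1 to 1000)

    val seed = 42L
    val fraction = 0.8

    // Both filters use the same deterministic function, so the two splits are
    // mutually exclusive and jointly cover the whole data set, regardless of
    // the degree of parallelism.
    val train = data.filter(x => uniform(x, seed) < fraction)
    val test  = data.filter(x => uniform(x, seed) >= fraction)

    println(s"train: ${train.count()}, test: ${test.count()}")
  }
}
```

Note that with this scheme identical elements always land in the same split, which differs from true per-element Bernoulli sampling; it is only meant to show how the decision can be decoupled from the partitioning.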
> Random Splits
> -------------
>
> Key: FLINK-2312
> URL: https://issues.apache.org/jira/browse/FLINK-2312
> Project: Flink
> Issue Type: Wish
> Components: Machine Learning Library
> Reporter: Maximilian Alber
> Assignee: pietro pinoli
> Priority: Minor
>
> In machine learning applications it is common to split data sets, e.g. into
> a training set and a test set.
> To the best of my knowledge there is at the moment no nice way in Flink to
> split a data set randomly into several partitions according to some ratio.
> The desired semantics would be the same as Spark's RDD randomSplit.
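For reference, the Spark behavior referenced above looks roughly as follows. This is a minimal local example (the application name and local master setting are placeholders): randomSplit normalizes the weights and returns mutually exclusive RDDs, deterministic for a fixed seed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal local example of Spark's RDD.randomSplit, the semantics the issue
// asks for in Flink.
object RandomSplitExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("randomSplit-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(1 to 1000)

    // 80/20 split; deterministic for a fixed seed.
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    println(s"train: ${train.count()}, test: ${test.count()}")
    sc.stop()
  }
}
```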
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)