[ https://issues.apache.org/jira/browse/FLINK-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635328#comment-14635328 ]
ASF GitHub Bot commented on FLINK-2312:
---------------------------------------
Github user sachingoel0101 commented on the pull request:
https://github.com/apache/flink/pull/921#issuecomment-123390786
This leads to non-mutually exclusive splits. I tracked down the reason for
this: the input data is parallelized differently while performing the split
for each fraction. This leads to an altogether different sequence of random
numbers, hence the problem.
@tillrohrmann, I use a seed value to initialize the random number generator,
as in the cross-validation PR. Is there any way I can fix the parallelization
of the data, so that running the split for every fraction produces exactly
the same sequence of random numbers? Persisting would write the data to disk,
which is something of an overhead, isn't it?
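One way to make the split independent of the parallelization is sketched below. This is a hypothetical sketch, not the implementation in #921: instead of drawing from a per-partition random sequence, the split decision is derived deterministically from each element and the seed, so it does not depend on how the input is partitioned or in which order elements arrive.

```scala
import org.apache.flink.api.scala._

import scala.util.hashing.MurmurHash3

// Hypothetical sketch (object name and helper are illustrative only):
// the split decision is a pure function of the element and the seed.
object SeededSplitSketch extends Serializable {

  // Maps an element to a pseudo-random value in [0, 1) computed from its hash
  // and the seed; the same element always produces the same value.
  def uniform(element: Any, seed: Long): Double = {
    val h = MurmurHash3.productHash((element, seed)) & Int.MaxValue
    h.toDouble / Int.MaxValue
  }

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = env.fromCollection(1 to 1000)

    val seed = 42L
    val fraction = 0.8

    // Both filters use the same deterministic function, so the two splits are
    // mutually exclusive and jointly cover the whole data set, regardless of
    // the degree of parallelism.
    val train = data.filter(x => uniform(x, seed) < fraction)
    val test  = data.filter(x => uniform(x, seed) >= fraction)

    println(s"train: ${train.count()}, test: ${test.count()}")
  }
}
```

Note that with this scheme identical elements always land in the same split, which differs from true per-element Bernoulli sampling; it is only meant to show how the decision can be decoupled from the partitioning.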
> Random Splits
> -------------
>
> Key: FLINK-2312
> URL: https://issues.apache.org/jira/browse/FLINK-2312
> Project: Flink
> Issue Type: Wish
> Components: Machine Learning Library
> Reporter: Maximilian Alber
> Assignee: pietro pinoli
> Priority: Minor
>
> In machine learning applications it is common to split data sets, e.g. into
> a training set and a test set.
> To the best of my knowledge there is at the moment no nice way in Flink to
> split a data set randomly into several partitions according to some ratio.
> The desired semantics would be the same as Spark's RDD randomSplit.
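For reference, the Spark behavior referenced above looks roughly as follows. This is a minimal local example (the application name and local master setting are placeholders): randomSplit normalizes the weights and returns mutually exclusive RDDs, deterministic for a fixed seed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal local example of Spark's RDD.randomSplit, the semantics the issue
// asks for in Flink.
object RandomSplitExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("randomSplit-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(1 to 1000)

    // 80/20 split; deterministic for a fixed seed.
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    println(s"train: ${train.count()}, test: ${test.count()}")
    sc.stop()
  }
}
```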
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)