[GitHub] [flink-ml] yunfengzhou-hub commented on a diff in pull request #189: [FLINK-30348] Refine Transformer for RandomSplitter

GitBox Mon, 12 Dec 2022 18:19:08 -0800


yunfengzhou-hub commented on code in PR #189:
URL: https://github.com/apache/flink-ml/pull/189#discussion_r1046592801



##########
docs/content/docs/operators/feature/randomsplitter.md:
##########
@@ -34,6 +34,7 @@ An AlgoOperator which splits a table into N tables according to the given weight
 | Key     | Default      | Type     | Required | Description                    |
 |:--------|:-------------|:---------|:---------|:-------------------------------|
 | weights | `[1.0, 1.0]` | Double[] | no       | The weights of data splitting. |
+| seed    | `null`       | Long     | no       | The random seed.               |

Review Comment:
   It might be better to document the conditions under which random split 
results are reproducible. For example, if users set the same random seed but 
change the parallelism of the upstream operator, can they still expect to get 
the same split result?
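
To make the question concrete, here is a minimal plain-Java simulation, not Flink code. It assumes elements are dealt round-robin to the subtasks and that every subtask owns a `Random` built from the same seed, as in the diff to `SplitterOperator`. The class and method names (`SplitReproducibilitySketch`, `simulate`) are hypothetical and exist only for illustration; the real distribution of rows across subtasks may differ.

```java
import java.util.Arrays;
import java.util.Random;

public class SplitReproducibilitySketch {
    /**
     * Simulates which output (0 or 1) each of n elements lands in when the elements are
     * dealt round-robin to `parallelism` subtasks, each owning a Random with the same seed.
     */
    public static int[] simulate(int n, int parallelism, long seed, double firstFraction) {
        Random[] randoms = new Random[parallelism];
        for (int i = 0; i < parallelism; i++) {
            randoms[i] = new Random(seed); // same seed on every subtask (assumption from the diff)
        }
        int[] assignment = new int[n];
        for (int i = 0; i < n; i++) {
            Random r = randoms[i % parallelism]; // assumed round-robin distribution
            assignment[i] = r.nextDouble() < firstFraction ? 0 : 1;
        }
        return assignment;
    }

    public static void main(String[] args) {
        int[] p1 = simulate(1000, 1, 0L, 0.5);
        int[] p1Again = simulate(1000, 1, 0L, 0.5);
        int[] p2 = simulate(1000, 2, 0L, 0.5);
        System.out.println("same seed, same parallelism identical: " + Arrays.equals(p1, p1Again));
        System.out.println("same seed, parallelism 1 vs 2 identical: " + Arrays.equals(p1, p2));
    }
}
```

Under these assumptions, the same seed reproduces the split only while the parallelism stays fixed, because each element consumes a different position in the random sequence once the elements are spread over more subtasks.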



##########
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/randomsplitter/RandomSplitter.java:
##########
@@ -83,11 +84,12 @@ public Table[] transform(Table... inputs) {
 
     private static class SplitterOperator extends AbstractStreamOperator<Row>
             implements OneInputStreamOperator<Row, Row> {
-        private final Random random = new Random(0);
+        private final Random random;
         OutputTag<Row>[] outputTag;
         final double[] fractions;
 
-        public SplitterOperator(OutputTag<Row>[] outputTag, Double[] weights) {
+        public SplitterOperator(OutputTag<Row>[] outputTag, Double[] weights, long seed) {
+            random = new Random(seed);

Review Comment:
   It might be better to avoid having the random values behave identically on 
every subtask, e.g., always assigning the first element to the first output 
table regardless of the subtask id. See 
[`RowGenerator.open()`](https://github.com/apache/flink-ml/blob/master/flink-ml-benchmark/src/main/java/org/apache/flink/ml/benchmark/datagenerator/common/RowGenerator.java#L52) 
for how to create a `Random` from both the initial seed and the subtask id.
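
A minimal sketch of the idea, in plain Java so that it runs without Flink. It assumes the subtask index is available when the operator is opened (for example from the runtime context). The helper `randomForSubtask` and the mixing constant are illustrative choices, not necessarily what `RowGenerator.open()` does.

```java
import java.util.Random;

public class SubtaskSeedSketch {
    /**
     * Hypothetical helper: derives a per-subtask Random from the user seed and the subtask
     * index. The index is multiplied by a large odd constant before being XOR-ed into the
     * seed, so that neighbouring subtasks do not end up with neighbouring seeds.
     */
    public static Random randomForSubtask(long seed, int subtaskIndex) {
        return new Random(seed ^ (subtaskIndex * 0x9E3779B97F4A7C15L));
    }

    public static void main(String[] args) {
        long seed = 0L;
        // Each simulated subtask now starts from a different point in the random sequence.
        for (int subtask = 0; subtask < 4; subtask++) {
            Random random = randomForSubtask(seed, subtask);
            System.out.println("subtask " + subtask + " first draw: " + random.nextDouble());
        }
    }
}
```

With this kind of derivation, a given (seed, subtask index) pair still yields the same sequence on every run, while different subtasks no longer make identical decisions for their first elements.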



##########
flink-ml-lib/src/test/java/org/apache/flink/ml/feature/RandomSplitterTest.java:
##########
@@ -95,7 +96,7 @@ public void testOutputSchema() {
     @Test
     public void testWeights() throws Exception {
         Table data = getTable(1000);
-        RandomSplitter splitter = new RandomSplitter().setWeights(2.0, 1.0, 2.0);
+        RandomSplitter splitter = new RandomSplitter().setWeights(2.0, 1.0, 2.0).setSeed(0);

Review Comment:
   It seems that the default seed is not good enough to generate split results 
that meet statistical expectations. Can we change the random splitting logic 
so that the default behavior meets them as well?
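
One way to check this outside the test suite is to count how many of 1000 draws fall into each output for several seeds. The sketch below is plain Java and assumes the operator picks an output by comparing a uniform draw against cumulative fractions derived from the weights. `SplitCountSketch`, `cumulativeFractions`, and `countSplits` are hypothetical names, not part of flink-ml.

```java
import java.util.Arrays;
import java.util.Random;

public class SplitCountSketch {
    /** Turns weights such as {2, 1, 2} into cumulative fractions {0.4, 0.6, 1.0}. */
    public static double[] cumulativeFractions(double[] weights) {
        double sum = Arrays.stream(weights).sum();
        double[] fractions = new double[weights.length];
        double acc = 0.0;
        for (int i = 0; i < weights.length; i++) {
            acc += weights[i];
            fractions[i] = acc / sum;
        }
        return fractions;
    }

    /** Counts how many of n uniform draws fall into each output for the given seed. */
    public static int[] countSplits(int n, double[] weights, long seed) {
        double[] fractions = cumulativeFractions(weights);
        Random random = new Random(seed);
        int[] counts = new int[weights.length];
        for (int i = 0; i < n; i++) {
            double d = random.nextDouble();
            int k = 0;
            // Advance to the first output whose cumulative fraction exceeds the draw.
            while (k < fractions.length - 1 && d >= fractions[k]) {
                k++;
            }
            counts[k]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] weights = {2.0, 1.0, 2.0};
        // Expected counts for 1000 elements are roughly 400 / 200 / 400.
        for (long seed = 0; seed < 5; seed++) {
            System.out.println(
                    "seed " + seed + ": " + Arrays.toString(countSplits(1000, weights, seed)));
        }
    }
}
```

If the counts stay close to the expected proportions for every seed in such a check, the test should not need to pin a particular seed to pass.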


