yunfengzhou-hub commented on code in PR #189:
URL: https://github.com/apache/flink-ml/pull/189#discussion_r1046592801
##########
docs/content/docs/operators/feature/randomsplitter.md:
##########
@@ -34,6 +34,7 @@ An AlgoOperator which splits a table into N tables according to the given weight
| Key | Default | Type | Required | Description |
|:--------|:-------------|:---------|:---------|:-------------------------------|
| weights | `[1.0, 1.0]` | Double[] | no | The weights of data splitting. |
+| seed | `null` | Long | no | The random seed. |
Review Comment:
It might be better to describe the conditions required to reproduce random
split results. For example, if users set the same random seed but change the
parallelism of the upstream operator, can they still expect to get the same
split result?
##########
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/randomsplitter/RandomSplitter.java:
##########
@@ -83,11 +84,12 @@ public Table[] transform(Table... inputs) {
private static class SplitterOperator extends AbstractStreamOperator<Row>
implements OneInputStreamOperator<Row, Row> {
- private final Random random = new Random(0);
+ private final Random random;
OutputTag<Row>[] outputTag;
final double[] fractions;
- public SplitterOperator(OutputTag<Row>[] outputTag, Double[] weights) {
+ public SplitterOperator(OutputTag<Row>[] outputTag, Double[] weights, long seed) {
+ random = new Random(seed);
Review Comment:
It might be better to avoid having the random values behave the same on every
subtask, e.g., always assigning the first element to the first output table
regardless of the subtask id. You may check
[`RowGenerator.open()`](https://github.com/apache/flink-ml/blob/master/flink-ml-benchmark/src/main/java/org/apache/flink/ml/benchmark/datagenerator/common/RowGenerator.java#L52)
for how to create a `Random` from both the initial seed and the subtask id.
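A minimal sketch of the idea, written in plain Java rather than as Flink operator code: each subtask derives its own `Random` from the user seed combined with the subtask index, so parallel instances do not replay the same sequence. The class `SubtaskRandoms`, the method `forSubtask`, and the mixing constant are hypothetical choices for illustration and are not necessarily what `RowGenerator.open()` does; inside an operator the index would come from the runtime context.

```java
import java.util.Random;

/** Hypothetical helper for illustration; not part of Flink ML. */
public class SubtaskRandoms {

    /** Builds a Random whose seed depends on both the user seed and the subtask index. */
    public static Random forSubtask(long seed, int subtaskIndex) {
        // Assumption: multiplying the index by a large odd constant spreads
        // neighbouring subtasks across well-separated seeds. Any deterministic
        // combination of seed and index would keep results reproducible.
        long mixed = seed ^ ((subtaskIndex + 1L) * 0x9E3779B97F4A7C15L);
        return new Random(mixed);
    }

    public static void main(String[] args) {
        // Two subtasks sharing one user seed draw from different sequences.
        Random first = forSubtask(0L, 0);
        Random second = forSubtask(0L, 1);
        System.out.println(first.nextDouble() + " vs " + second.nextDouble());
    }
}
```

With this approach a run is reproducible only when the seed, the parallelism, and the distribution of elements across subtasks all stay the same, which may be worth stating in the parameter documentation.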
##########
flink-ml-lib/src/test/java/org/apache/flink/ml/feature/RandomSplitterTest.java:
##########
@@ -95,7 +96,7 @@ public void testOutputSchema() {
@Test
public void testWeights() throws Exception {
Table data = getTable(1000);
- RandomSplitter splitter = new RandomSplitter().setWeights(2.0, 1.0, 2.0);
+ RandomSplitter splitter = new RandomSplitter().setWeights(2.0, 1.0, 2.0).setSeed(0);
Review Comment:
It seems that the default seed is not good enough to generate split results
that meet statistical expectations. Can we change the random splitting logic so
that the default behavior improves?