lindong28 commented on code in PR #87:
URL: https://github.com/apache/flink-ml/pull/87#discussion_r855177774
##########
flink-ml-benchmark/README.md:
##########
@@ -63,59 +63,75 @@ Then in Flink ML's binary distribution's folder, execute the following command
to run the default benchmarks.
```bash
-./bin/flink-ml-benchmark.sh ./conf/benchmark-conf.json --output-file results.json
+./bin/benchmark-run.sh ./conf/benchmark-conf.json --output-file results.json
```
You will notice that some Flink jobs are submitted to your Flink cluster, and
the following information is printed out in your terminal. This means that you
have successfully executed the default benchmarks.
```
-Found benchmarks [KMeansModel-1, KMeans-1]
+Found 2 benchmarks.
...
Benchmarks execution completed.
Benchmark results summary:
-[ {
- "name" : "KMeansModel-1",
- "totalTimeMs" : 782.0,
- "inputRecordNum" : 10000,
- "inputThroughput" : 12787.723785166241,
- "outputRecordNum" : 10000,
- "outputThroughput" : 12787.723785166241
-}, {
- "name" : "KMeans-1",
- "totalTimeMs" : 7022.0,
- "inputRecordNum" : 10000,
- "inputThroughput" : 1424.0956992309884,
- "outputRecordNum" : 1,
- "outputThroughput" : 0.14240956992309883
-} ]
+{
+ "KMeansModel-1" : {
+ "stage" : {
+ "className" : "org.apache.flink.ml.clustering.kmeans.KMeansModel",
+ "paramMap" : {
+ "k" : 2,
+ "distanceMeasure" : "euclidean",
+ "featuresCol" : "features",
+ "predictionCol" : "prediction"
+ }
+ },
+ ...
+ "modelData" : null,
+ "results" : {
+ "totalTimeMs" : 6596.0,
+ "inputRecordNum" : 10000,
+ "inputThroughput" : 1516.0703456640388,
+ "outputRecordNum" : 1,
+ "outputThroughput" : 0.15160703456640387
+ }
+ }
+}
Benchmark results saved as json in results.json.
```
The command above would save the results into `results.json` as below.
```json
-[ {
- "name" : "KMeansModel-1",
- "totalTimeMs" : 782.0,
- "inputRecordNum" : 10000,
- "inputThroughput" : 12787.723785166241,
- "outputRecordNum" : 10000,
- "outputThroughput" : 12787.723785166241
-}, {
- "name" : "KMeans-1",
- "totalTimeMs" : 7022.0,
- "inputRecordNum" : 10000,
- "inputThroughput" : 1424.0956992309884,
- "outputRecordNum" : 1,
- "outputThroughput" : 0.14240956992309883
-} ]
+{
+ "KMeansModel-1" : {
+ "stage" : {
+ "className" : "org.apache.flink.ml.clustering.kmeans.KMeansModel",
+ "paramMap" : {
+ "k" : 2,
+ "distanceMeasure" : "euclidean",
+ "featuresCol" : "features",
+ "predictionCol" : "prediction"
+ }
+ },
+ ...
+ "modelData" : null,
Review Comment:
Would it be more intuitive to remove this line if the model data is not specified for this stage?
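One simple way to get that behavior, sketched below under the assumption that the per-benchmark fields are collected into a map before serialization (the `resultMap`, `stageMap`, `modelData`, and `resultsMap` names are illustrative, not the PR's actual code):
```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

Map<String, Object> resultMap = new LinkedHashMap<>();
resultMap.put("stage", stageMap);       // stage description, as in the JSON above
resultMap.put("modelData", modelData);  // may be null when not specified
resultMap.put("results", resultsMap);

// Dropping null-valued entries before serialization means
// `"modelData" : null` never appears in the output JSON.
resultMap.values().removeIf(Objects::isNull);
```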
##########
flink-ml-benchmark/README.md:
##########
@@ -194,3 +189,9 @@ stage. The stage is benchmarked against 10000 randomly generated vectors.
}
}
```
+
+### Benchmark Results Visualization
+
+`benchmark-results-visualize.py` is provided as a helper script to visualize
Review Comment:
Could we provide commands that users can copy and paste to visualize the
`results.json` in this quickstart?
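For example, something along these lines (a hypothetical invocation; the script's actual location and arguments may differ):
```bash
python benchmark-results-visualize.py results.json
```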
##########
flink-ml-benchmark/src/main/java/org/apache/flink/ml/benchmark/BenchmarkResult.java:
##########
@@ -18,13 +18,25 @@
package org.apache.flink.ml.benchmark;
+import org.apache.flink.ml.api.Stage;
+import org.apache.flink.ml.benchmark.datagenerator.DataGenerator;
+import org.apache.flink.ml.benchmark.datagenerator.InputDataGenerator;
import org.apache.flink.util.Preconditions;
/** The result of executing a benchmark. */
public class BenchmarkResult {
/** The benchmark name. */
public final String name;
+ /** The stage to execute benchmark on. */
+ public final Stage<?> stage;
Review Comment:
Instead of keeping the stage/inputDataGenerator/modelDataGenerator instances
in `BenchmarkResult`, would it be simpler to just pass the original
`Map<String, ?> params` (which is used to instantiate these instances) to
`BenchmarkUtils.getResultsMapAsJson(...)` directly?
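For illustration, a sketch of what that could look like; the signature and the use of a plain Jackson `ObjectMapper` here are assumptions, not the PR's actual code:
```java
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical: serialize the original config map plus the measured results
// directly, instead of re-deriving them from Stage/DataGenerator instances.
public static String getResultsMapAsJson(
        String name, Map<String, ?> params, Map<String, ?> results)
        throws JsonProcessingException {
    Map<String, Object> entry = new LinkedHashMap<>(params);
    entry.put("results", results);
    return new ObjectMapper()
            .writerWithDefaultPrettyPrinter()
            .writeValueAsString(Collections.singletonMap(name, entry));
}
```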
##########
flink-ml-benchmark/src/main/java/org/apache/flink/ml/benchmark/Benchmark.java:
##########
@@ -75,33 +75,43 @@ public static void printHelp() {
public static void executeBenchmarks(CommandLine commandLine) throws Exception {
    String configFile = commandLine.getArgs()[0];
    Map<String, ?> benchmarks = BenchmarkUtils.parseJsonFile(configFile);
-   System.out.println("Found benchmarks " + benchmarks.keySet());
+   System.out.println("Found " + benchmarks.keySet().size() + " benchmarks.");
+   String saveFile = commandLine.getOptionValue(OUTPUT_FILE_OPTION.getLongOpt());
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
    List<BenchmarkResult> results = new ArrayList<>();
+   String benchmarkResultsJson = "{}";
+   int index = 0;
    for (Map.Entry<String, ?> benchmark : benchmarks.entrySet()) {
-       LOG.info("Running benchmark " + benchmark.getKey() + ".");
+       LOG.info(
+               String.format(
+                       "Running benchmark %d/%d: %s",
+                       index++, benchmarks.keySet().size(), benchmark.getKey()));
        BenchmarkResult result =
                BenchmarkUtils.runBenchmark(
                        tEnv, benchmark.getKey(), (Map<String, ?>) benchmark.getValue());
        results.add(result);
-       LOG.info(BenchmarkUtils.getResultsMapAsJson(result));
-   }
+       LOG.info("\n" + BenchmarkUtils.getResultsMapAsJson(result));
+
+       benchmarkResultsJson =
+               BenchmarkUtils.getResultsMapAsJson(results.toArray(new BenchmarkResult[0]));
-   String benchmarkResultsJson =
-           BenchmarkUtils.getResultsMapAsJson(results.toArray(new BenchmarkResult[0]));
+       if (commandLine.hasOption(OUTPUT_FILE_OPTION.getLongOpt())) {
+           ReadWriteUtils.saveToFile(saveFile, benchmarkResultsJson, true);
Review Comment:
Instead of saving all benchmark results in the loop body, would it be more efficient to save the benchmark results once after the loop completes?
We probably also need to record in `results.json` the fact that a benchmark failed, with a string indicating the cause of the failure. `benchmark-results-visualize.py` would then need to surface this information in the figure (e.g. with a red `X`).
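A minimal sketch of that flow, reusing the variables from the hunk above; the `BenchmarkResult.failed(...)` factory is hypothetical, invented here only to show where the failure cause could be recorded:
```java
List<BenchmarkResult> results = new ArrayList<>();
for (Map.Entry<String, ?> benchmark : benchmarks.entrySet()) {
    try {
        results.add(
                BenchmarkUtils.runBenchmark(
                        tEnv, benchmark.getKey(), (Map<String, ?>) benchmark.getValue()));
    } catch (Exception e) {
        // Hypothetical: keep a failure entry so results.json records the cause.
        results.add(BenchmarkResult.failed(benchmark.getKey(), e.getMessage()));
    }
}

// Serialize and save once, after all benchmarks have run.
String benchmarkResultsJson =
        BenchmarkUtils.getResultsMapAsJson(results.toArray(new BenchmarkResult[0]));
if (commandLine.hasOption(OUTPUT_FILE_OPTION.getLongOpt())) {
    ReadWriteUtils.saveToFile(saveFile, benchmarkResultsJson, true);
}
```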