lindong28 commented on code in PR #87:
URL: https://github.com/apache/flink-ml/pull/87#discussion_r855177774
##########
flink-ml-benchmark/README.md:
##########
@@ -63,59 +63,75 @@ Then in Flink ML's binary distribution's folder, execute the following command
to run the default benchmarks.
```bash
-./bin/flink-ml-benchmark.sh ./conf/benchmark-conf.json --output-file results.json
+./bin/benchmark-run.sh ./conf/benchmark-conf.json --output-file results.json
```
You will notice that some Flink jobs are submitted to your Flink cluster, and
the following information is printed out in your terminal. This means that you
have successfully executed the default benchmarks.
```
-Found benchmarks [KMeansModel-1, KMeans-1]
+Found 2 benchmarks.
...
Benchmarks execution completed.
Benchmark results summary:
-[ {
- "name" : "KMeansModel-1",
- "totalTimeMs" : 782.0,
- "inputRecordNum" : 10000,
- "inputThroughput" : 12787.723785166241,
- "outputRecordNum" : 10000,
- "outputThroughput" : 12787.723785166241
-}, {
- "name" : "KMeans-1",
- "totalTimeMs" : 7022.0,
- "inputRecordNum" : 10000,
- "inputThroughput" : 1424.0956992309884,
- "outputRecordNum" : 1,
- "outputThroughput" : 0.14240956992309883
-} ]
+{
+ "KMeansModel-1" : {
+ "stage" : {
+ "className" : "org.apache.flink.ml.clustering.kmeans.KMeansModel",
+ "paramMap" : {
+ "k" : 2,
+ "distanceMeasure" : "euclidean",
+ "featuresCol" : "features",
+ "predictionCol" : "prediction"
+ }
+ },
+ ...
+ "modelData" : null,
+ "results" : {
+ "totalTimeMs" : 6596.0,
+ "inputRecordNum" : 10000,
+ "inputThroughput" : 1516.0703456640388,
+ "outputRecordNum" : 1,
+ "outputThroughput" : 0.15160703456640387
+ }
+ }
+}
Benchmark results saved as json in results.json.
```
The command above would save the results into `results.json` as below.
```json
-[ {
- "name" : "KMeansModel-1",
- "totalTimeMs" : 782.0,
- "inputRecordNum" : 10000,
- "inputThroughput" : 12787.723785166241,
- "outputRecordNum" : 10000,
- "outputThroughput" : 12787.723785166241
-}, {
- "name" : "KMeans-1",
- "totalTimeMs" : 7022.0,
- "inputRecordNum" : 10000,
- "inputThroughput" : 1424.0956992309884,
- "outputRecordNum" : 1,
- "outputThroughput" : 0.14240956992309883
-} ]
+{
+ "KMeansModel-1" : {
+ "stage" : {
+ "className" : "org.apache.flink.ml.clustering.kmeans.KMeansModel",
+ "paramMap" : {
+ "k" : 2,
+ "distanceMeasure" : "euclidean",
+ "featuresCol" : "features",
+ "predictionCol" : "prediction"
+ }
+ },
+ ...
+ "modelData" : null,
Review Comment:
Would it be more intuitive to remove this line if the model data is not specified for this stage?
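One simple way to get that behavior, sketched below under the assumption that the per-benchmark fields are collected into a map before serialization (the `resultMap`, `stageMap`, `modelData`, and `resultsMap` names are illustrative, not the PR's actual code):
```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

Map<String, Object> resultMap = new LinkedHashMap<>();
resultMap.put("stage", stageMap);       // stage description, as in the JSON above
resultMap.put("modelData", modelData);  // may be null when not specified
resultMap.put("results", resultsMap);

// Dropping null-valued entries before serialization means
// `"modelData" : null` never appears in the output JSON.
resultMap.values().removeIf(Objects::isNull);
```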
##########
flink-ml-benchmark/README.md:
##########
@@ -194,3 +189,9 @@ stage. The stage is benchmarked against 10000 randomly generated vectors.
}
}
```
+
+### Benchmark Results Visualization
+
+`benchmark-results-visualize.py` is provided as a helper script to visualize
Review Comment:
Could we provide commands that users can copy and paste to visualize the
`results.json` in this quickstart?
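For example, something along these lines (a hypothetical invocation; the script's actual location and arguments may differ):
```bash
python benchmark-results-visualize.py results.json
```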
##########
flink-ml-benchmark/src/main/java/org/apache/flink/ml/benchmark/BenchmarkResult.java:
##########
@@ -18,13 +18,25 @@
package org.apache.flink.ml.benchmark;
+import org.apache.flink.ml.api.Stage;
+import org.apache.flink.ml.benchmark.datagenerator.DataGenerator;
+import org.apache.flink.ml.benchmark.datagenerator.InputDataGenerator;
import org.apache.flink.util.Preconditions;
/** The result of executing a benchmark. */
public class BenchmarkResult {
/** The benchmark name. */
public final String name;
+ /** The stage to execute benchmark on. */
+ public final Stage<?> stage;
Review Comment:
Instead of keeping the stage/inputDataGenerator/modelDataGenerator instances
in `BenchmarkResult`, would it be simpler to just pass the original
`Map<String, ?> params` (which is used to instantiate these instances) to
`BenchmarkUtils.getResultsMapAsJson(...)` directly?
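For illustration, a sketch of what that could look like; the signature and the use of a plain Jackson `ObjectMapper` here are assumptions, not the PR's actual code:
```java
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical: serialize the original config map plus the measured results
// directly, instead of re-deriving them from Stage/DataGenerator instances.
public static String getResultsMapAsJson(
        String name, Map<String, ?> params, Map<String, ?> results)
        throws JsonProcessingException {
    Map<String, Object> entry = new LinkedHashMap<>(params);
    entry.put("results", results);
    return new ObjectMapper()
            .writerWithDefaultPrettyPrinter()
            .writeValueAsString(Collections.singletonMap(name, entry));
}
```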
##########
flink-ml-benchmark/src/main/java/org/apache/flink/ml/benchmark/Benchmark.java:
##########
@@ -75,33 +75,43 @@ public static void printHelp() {
public static void executeBenchmarks(CommandLine commandLine) throws Exception {
    String configFile = commandLine.getArgs()[0];
    Map<String, ?> benchmarks = BenchmarkUtils.parseJsonFile(configFile);
-   System.out.println("Found benchmarks " + benchmarks.keySet());
+   System.out.println("Found " + benchmarks.keySet().size() + " benchmarks.");
+   String saveFile = commandLine.getOptionValue(OUTPUT_FILE_OPTION.getLongOpt());
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
    List<BenchmarkResult> results = new ArrayList<>();
+   String benchmarkResultsJson = "{}";
+   int index = 0;
    for (Map.Entry<String, ?> benchmark : benchmarks.entrySet()) {
-       LOG.info("Running benchmark " + benchmark.getKey() + ".");
+       LOG.info(
+               String.format(
+                       "Running benchmark %d/%d: %s",
+                       index++, benchmarks.keySet().size(), benchmark.getKey()));
        BenchmarkResult result =
                BenchmarkUtils.runBenchmark(
                        tEnv, benchmark.getKey(), (Map<String, ?>) benchmark.getValue());
        results.add(result);
-       LOG.info(BenchmarkUtils.getResultsMapAsJson(result));
-   }
+       LOG.info("\n" + BenchmarkUtils.getResultsMapAsJson(result));
+
+       benchmarkResultsJson =
+               BenchmarkUtils.getResultsMapAsJson(results.toArray(new BenchmarkResult[0]));
-   String benchmarkResultsJson =
-           BenchmarkUtils.getResultsMapAsJson(results.toArray(new BenchmarkResult[0]));
+       if (commandLine.hasOption(OUTPUT_FILE_OPTION.getLongOpt())) {
+           ReadWriteUtils.saveToFile(saveFile, benchmarkResultsJson, true);
Review Comment:
Instead of saving all benchmark results in the loop body, would it be more efficient to save the benchmark results once after the loop completes?
We probably also need to record in `results.json` the fact that a benchmark failed, with a string indicating the cause of the failure. `benchmark-results-visualize.py` would then need to surface this information in the figure (e.g. with a red `X`).
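A minimal sketch of that flow, reusing the variables from the hunk above; the `BenchmarkResult.failed(...)` factory is hypothetical, invented here only to show where the failure cause could be recorded:
```java
List<BenchmarkResult> results = new ArrayList<>();
for (Map.Entry<String, ?> benchmark : benchmarks.entrySet()) {
    try {
        results.add(
                BenchmarkUtils.runBenchmark(
                        tEnv, benchmark.getKey(), (Map<String, ?>) benchmark.getValue()));
    } catch (Exception e) {
        // Hypothetical: keep a failure entry so results.json records the cause.
        results.add(BenchmarkResult.failed(benchmark.getKey(), e.getMessage()));
    }
}

// Serialize and save once, after all benchmarks have run.
String benchmarkResultsJson =
        BenchmarkUtils.getResultsMapAsJson(results.toArray(new BenchmarkResult[0]));
if (commandLine.hasOption(OUTPUT_FILE_OPTION.getLongOpt())) {
    ReadWriteUtils.saveToFile(saveFile, benchmarkResultsJson, true);
}
```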