[
https://issues.apache.org/jira/browse/FLINK-31125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dong Lin updated FLINK-31125:
-----------------------------
Summary: Flink ML benchmark framework should minimize the source operator
overhead (was: Flink ML benchmark result should not include data generation
overhead)
> Flink ML benchmark framework should minimize the source operator overhead
> -------------------------------------------------------------------------
>
> Key: FLINK-31125
> URL: https://issues.apache.org/jira/browse/FLINK-31125
> Project: Flink
> Issue Type: Improvement
> Components: Library / Machine Learning
> Reporter: Dong Lin
> Assignee: Dong Lin
> Priority: Major
> Fix For: ml-2.2.0
>
>
> The Flink ML benchmark framework estimates throughput by having a source
> operator generate a given number (e.g. 10^7) of input records with random
> values, letting the given AlgoOperator process these input records, and
> dividing the number of records by the total execution time.
> The overhead of generating random values for all input records has an
> observable impact on the estimated throughput. We would like to minimize the
> overhead of the source operator so that the benchmark result reflects the
> throughput of the AlgoOperator as much as possible.
> Note that [spark-sql-perf|https://github.com/databricks/spark-sql-perf]
> generates all input records in advance into memory before running the
> benchmark. This allows the Spark ML benchmark to read records from memory
> instead of generating values for those records during the benchmark.
> We can generate a value once and reuse it for all input records. This
> approach minimizes the overhead of the source operator and allows us to
> compare the Flink ML benchmark result fairly with the Spark ML benchmark
> result (obtained using spark-sql-perf).
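
To make the difference between the two source strategies concrete, here is a minimal sketch in plain Python. It is not Flink ML code: the names `random_source`, `reused_source`, and `estimate_throughput` are hypothetical and only illustrate the idea described in the issue, namely that throughput is the number of records divided by total execution time, so any per-record work done by the source is counted against the operator.

```python
import random
import time
from typing import Callable, Iterable, Iterator, List


def random_source(num_records: int, dim: int) -> Iterator[List[float]]:
    """Hypothetical source: builds a fresh random vector for every record."""
    for _ in range(num_records):
        yield [random.random() for _ in range(dim)]


def reused_source(num_records: int, dim: int) -> Iterator[List[float]]:
    """Hypothetical source: builds one random vector and emits it repeatedly."""
    value = [random.random() for _ in range(dim)]
    for _ in range(num_records):
        # The same object is emitted each time, so the per-record cost of
        # the source is close to zero.
        yield value


def estimate_throughput(
    source: Iterable[List[float]],
    operator: Callable[[List[float]], object],
) -> float:
    """Return records per second, timing the source and operator together."""
    start = time.perf_counter()
    processed = 0
    for record in source:
        operator(record)
        processed += 1
    elapsed = time.perf_counter() - start
    return processed / elapsed


if __name__ == "__main__":
    n, dim = 100_000, 10
    # `sum` stands in for the operator under test (an assumption for this demo).
    print("random source:", estimate_throughput(random_source(n, dim), sum))
    print("reused source:", estimate_throughput(reused_source(n, dim), sum))
```

One caveat with this sketch is that the reused value is shared across all records, so it only works if the operator under test does not mutate its input.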