crepererum opened a new issue, #10229: URL: https://github.com/apache/datafusion/issues/10229
### Is your feature request related to a problem or challenge?

DataFusion currently has an extensive benchmark suite that looks roughly like this:

- **fixed task set:** There is a set of predefined queries on predefined data (i.e. they are NOT randomized).
- **execution time metric:** The measured metric is the "execution time" of a query (or its inverse: executions/s), from the moment a client/driver issues a query (mostly SQL text) until it has retrieved the full result data.
- **serialized execution:** A single "client"/driver runs a single query multiple times, each execution serialized after the previous one. A query is repeated in this linear fashion multiple times to improve the robustness of the benchmark. In other words: we are running a single query in a hot loop.
- **semi-bounded environment:** The test VM used for these benchmarks is rather large, because the point of the benchmark is to eat up as many resources as possible to "win the game" (i.e. to make the query faster).

This setup has several drawbacks:

- **environment size:** Apart from some single-query OLAP use cases where people provide seemingly unbounded resources, it is unlikely that DataFusion (or products/solutions built on top of it) runs on super large VMs. In most cases, people want to saturate their VMs to save money and the environment.
- **efficiency:** This is similar to the environment size, but from a different point of view. There is a difference between "fast" (queries/s) and "efficient" (queries/core, queries/watt, queries/VM, queries/tokio-task, ...). One could argue that DataFusion should not eat up additional resources like CPU cores if the return on investment is too low. For example, re-partitioning the data into a wide plan and chopping the record batches into smaller pieces may improve execution time, but it costs more (as an integral of used resources over query time) than just executing the query in a single partition with a rather large record batch size.
- **concurrency:** While many analytic/OLAP/CLI/... solutions basically serve a single user, many server-based solutions need to handle queries concurrently. Especially "eating up" too many resources (both on the host -- like CPU and memory -- and within the process -- like tokio tasks) just to make one query a bit faster can easily make the system as a whole worse (both in terms of end-user experience and for the operator).

It is important to point out that the existing benchmark suite has its valid purpose and that its goals may often conflict with the points mentioned in the second list.

### Describe the solution you'd like

Add a new benchmark suite with the following properties / design (see the sketch after this list):

- **fixed task set:** I think using the same queries and fixtures as the existing suite is fine. From the list of "tasks" (pairs of query and data), select ONE to run at a time.
- **concurrency:** Define a "client" as "run the given query in serial order in a hot loop". Clients keep running for a fixed time period. The number of clients running at the same time defines the "concurrency". Run the test with multiple, ideally exponentially growing concurrency levels, starting at 1 and going up to a point that we deem "exhaustive" for the environment (see below), e.g. 1, 2, 4, 8, 16, 32.
- **constrained environment:** Use a constrained environment (= VM) that we expect DataFusion to (at least with increasing concurrency) eventually saturate.
- **warm-up:** We may want to wait a few seconds before we start measuring the metric (see below) to give the OS and other components some time to warm up.
- **metric:** For each query execution by one of the clients, measure the "query time" the same way we currently do, from start (sending the SQL text) to finish (receiving the entire result). Only successful completions count; errors should terminate the benchmark. Based on that, derive the following metrics:
  - **quantiles:** Over all query executions of all clients, report some nice quantiles like 50%, 90%, 95%, 99%.
  - **average throughput:** Over the entire measurement window (minus the warm-up time), count the number of queries that were successfully completed. Divide that count by the measurement window in seconds and by the client count to derive the average per-client throughput in queries/s.
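To make the driver concrete, here is a minimal sketch of what such a suite could look like, assuming a standalone Rust binary built on `datafusion` and `tokio`. The query text, fixture path, table name, concurrency levels, and warm-up/measurement durations below are all placeholders, not part of any existing harness; the real suite would reuse the existing benchmark fixtures:

```rust
// Assumed Cargo.toml deps: datafusion, tokio = { features = ["rt-multi-thread", "macros", "time"] }
use std::time::Duration;

use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionContext};
use tokio::time::Instant;

const QUERY: &str = "SELECT 1"; // placeholder; one task from the existing fixed task set
const WARM_UP: Duration = Duration::from_secs(5);
const MEASURE: Duration = Duration::from_secs(60);

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Placeholder fixture; the real suite would register the existing benchmark data.
    ctx.register_parquet("lineitem", "data/lineitem.parquet", ParquetReadOptions::default())
        .await?;

    // Exponentially growing concurrency levels.
    for clients in [1usize, 2, 4, 8, 16, 32] {
        let warm_up_end = Instant::now() + WARM_UP;
        let deadline = warm_up_end + MEASURE;

        // Each "client" runs the query in a hot loop until the deadline and
        // reports the latencies of executions that completed after warm-up.
        let handles: Vec<_> = (0..clients)
            .map(|_| {
                let ctx = ctx.clone();
                tokio::spawn(async move {
                    let mut latencies = Vec::new();
                    while Instant::now() < deadline {
                        let started = Instant::now();
                        // "Query time": from sending the SQL text to receiving
                        // the entire result. Errors terminate the benchmark via `?`.
                        let batches = ctx.sql(QUERY).await?.collect().await?;
                        let elapsed = started.elapsed();
                        std::hint::black_box(batches);
                        if Instant::now() >= warm_up_end {
                            latencies.push(elapsed);
                        }
                    }
                    Ok::<_, datafusion::error::DataFusionError>(latencies)
                })
            })
            .collect();

        let mut all = Vec::new();
        for handle in handles {
            all.extend(handle.await.expect("client panicked")?);
        }
        assert!(!all.is_empty(), "no queries completed in the measurement window");

        // Derived metrics: latency quantiles (nearest-rank approximation) and
        // average per-client throughput = completed / (window seconds * client count).
        all.sort();
        let q = |p: f64| all[((all.len() - 1) as f64 * p) as usize];
        let qps_per_client = all.len() as f64 / MEASURE.as_secs_f64() / clients as f64;
        println!(
            "clients={clients} p50={:?} p90={:?} p95={:?} p99={:?} queries/s/client={qps_per_client:.2}",
            q(0.50),
            q(0.90),
            q(0.95),
            q(0.99)
        );
    }
    Ok(())
}
```

Note that the numbers in this sketch are approximate: executions still in flight at the deadline are not counted, and executions that started during warm-up but finished after it are included.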
This suite should be hooked up to the `/benchmark` GitHub command.

### Describe alternatives you've considered

\-

### Additional context

- #10026