crepererum opened a new issue, #10229:
URL: https://github.com/apache/datafusion/issues/10229
### Is your feature request related to a problem or challenge?
DataFusion currently has an extensive benchmark suite that looks roughly
like this:
- **fixed task set:** There is a set of predefined queries on some
predefined data (i.e. they are NOT randomized).
- **execution time metric:** The metric that is measured is the "execution
time" of a query (or it's inverse: executions/s), starting from when a
client/driver issues a query (mostly SQL text) until it retrieved the full
result data.
- **serialized execution:** This is done by running a single query with a
single "client"/driver multiple times, each execution starting only after the
previous one has finished. A query is repeated in this linear fashion multiple
times to improve the robustness of the benchmark. In other words: we are
running a single query in a hot loop (see the sketch after this list).
- **semi-bounded environment:** The test VM that is used for these
benchmarks is rather large because the point of the benchmark is to eat up as
many resources as possible to "win the game" (i.e. to make the query faster).
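For illustration, a minimal sketch of such a serial hot loop, assuming a
DataFusion `SessionContext` (`ctx`) with the benchmark fixtures already
registered; the function name and repetition count are made up:

```rust
use std::time::Instant;

use datafusion::error::Result;
use datafusion::prelude::*;

/// Run `sql` `n` times back to back, returning per-run execution times:
/// each measured from issuing the SQL text to collecting the full result.
async fn serial_hot_loop(ctx: &SessionContext, sql: &str, n: usize) -> Result<Vec<f64>> {
    let mut times = Vec::with_capacity(n);
    for _ in 0..n {
        let start = Instant::now();
        // `collect` drives the plan to completion and materializes the
        // entire result, so the timer covers the full query execution.
        let _batches = ctx.sql(sql).await?.collect().await?;
        times.push(start.elapsed().as_secs_f64());
    }
    Ok(times)
}
```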
This setup has several drawbacks:
- **environment size:** Apart from some single-query OLAP use cases where
people just provide seemingly unbounded resources, it is unlikely that
DataFusion (or products/solutions built on top of it) runs on super large
VMs. In most cases, people want to saturate their VMs to save money and to
spare the environment.
- **efficiency:** This is similar to the environment size, but from a
different point of view. There is a difference between "fast" (queries/s) and
"efficient" (queries/core, queries/watt, queries/VM, queries/tokio-tasks, ...).
One could argue that DataFusion should not "eat up" additional resources like
CPU cores if the return on investment is too low. For example, re-partitioning
the data into a wide plan and chopping the record batches into smaller batches
may improve execution time, but it costs more (measured as the integral of
resources used over the query time) than just executing the query in a single
partition with a rather large record batch size (see the config sketch after
this list).
- **concurrency:** While many analytic/OLAP/CLI/... solutions basically
serve a single user, many server-based solutions need to handle queries
concurrently. Especially "eating up" too many resources (both on the host --
like CPU and memory -- and within the process -- like tokio tasks) just to
make one query a bit faster easily makes the system as a whole worse (both in
terms of end-user experience and for the operator).
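To make the efficiency trade-off concrete, here is a hedged sketch using the
two existing DataFusion configuration knobs involved (`target_partitions` and
`batch_size`); the specific values are illustrative, not recommendations:

```rust
use datafusion::prelude::*;

fn main() {
    // "Fast": a wide plan repartitioned across all cores with small batches;
    // this often improves wall-clock time but burns more total CPU.
    let cores = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    let _fast = SessionConfig::new()
        .with_target_partitions(cores)
        .with_batch_size(1024);

    // "Efficient": a single partition with large batches; slower per query,
    // but a smaller integral of resources used over the query time.
    let efficient = SessionConfig::new()
        .with_target_partitions(1)
        .with_batch_size(65536);

    let _ctx = SessionContext::new_with_config(efficient);
}
```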
It is important to point out that the existing benchmark suite serves a valid
purpose, and that its goals may often conflict with the points mentioned in
the list of drawbacks above.
### Describe the solution you'd like
Add a new benchmark suite that has the following properties / design:
- **fixed task set:** I think using the same queries and fixtures as the
existing suite is fine. From the list of "tasks" (pairs of query and data),
select ONE to run at a time.
- **concurrency:** Define a "client" as "run the given query in serial order
in a hot loop". Clients keep running for a fixed time period. The number of
clients running at the same time defines the "concurrency". Run the test with
multiple, ideally exponentially growing concurrency levels, starting at 1 up
to a point that we deem "exhaustive" for the environment (see below), e.g. 1,
2, 4, 8, 16, 32.
- **constrained environment:** Use a constrained environment (= VM) that we
expect DataFusion to eventually saturate (at least with increasing
concurrency).
- **warm-up:** We may want to wait a few seconds until we start measuring
the metric (see below) to give the OS and other components some time to warm
up.
- **metric:** For each query execution by one of the clients, measure the
"query time" in the same way as we currently do, from start (sending the SQL
text) to finish (receiving the entire result). Only successful completions
count; errors should terminate the benchmark. Based on that, derive the
following metrics (see the sketch after this list):
  - **quantiles:** Over all query executions of all clients, report some
nice quantiles like 50%, 90%, 95%, 99%.
  - **average throughput:** Over the entire measurement window (minus the
warm-up time), count the number of queries that were successfully completed.
Divide by the measurement window in seconds and by the client count to derive
the average per-client throughput in queries/s.
This suite should be hooked up into the `/benchmark` GitHub command.
### Describe alternatives you've considered
-
### Additional context
- #10026