crepererum opened a new issue, #10229:
URL: https://github.com/apache/datafusion/issues/10229

   ### Is your feature request related to a problem or challenge?
   
   DataFusion currently has an extensive benchmark suite that looks roughly 
like this:
   
   - **fixed task set:** There is a set of predefined queries on some 
predefined data (i.e. they are NOT randomized).
   - **execution time metric:** The metric that is measured is the "execution time" of a query (or its inverse: executions/s), from the moment a client/driver issues the query (mostly SQL text) until it has retrieved the full result data.
   - **serialized execution:** A single "client"/driver runs a single query multiple times, with each execution starting only after the previous one has finished. A query is repeated in this linear fashion multiple times to improve the robustness of the benchmark. In other words: we are running a single query in a hot loop (see the sketch after this list).
   - **semi-bounded environment:** The test VM used for these benchmarks is rather large, because the point of the benchmark is to eat up as many resources as possible to "win the game" (i.e. to make the query as fast as possible).
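
   For illustration only (not the actual benchmark code), the measurement in the existing suite boils down to something like the following sketch: a single driver runs the same query several times back to back and records the start-to-finish execution time.

   ```rust
   // Illustrative sketch of the existing serialized hot loop (not the actual
   // benchmark code): one driver repeats the same query and times each run.
   use std::time::Instant;

   use datafusion::error::Result;
   use datafusion::prelude::SessionContext;

   async fn bench_serialized(ctx: &SessionContext, sql: &str, iterations: usize) -> Result<()> {
       for i in 0..iterations {
           let start = Instant::now();
           // "execution time": from issuing the SQL text until the full result
           // has been retrieved by the client/driver.
           let _batches = ctx.sql(sql).await?.collect().await?;
           println!("iteration {i}: {:?}", start.elapsed());
       }
       Ok(())
   }
   ```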
   
   This setup has several drawbacks:
   
   - **environment size:** Apart from some single-query OLAP use cases where people just provide seemingly unbounded resources, it is unlikely that DataFusion (or products/solutions built on top of it) runs on super large VMs. In most cases, people want to saturate their VMs to save money and spare the environment.
   - **efficiency:** This is similar to the environment size, but from a different point of view. There is a difference between "fast" (queries/s) and "efficient" (queries/core, queries/watt, queries/VM, queries/tokio-task, ...). One could argue that DataFusion should not "eat up" additional resources like CPU cores if the return on investment is too low. For example, re-partitioning the data into a wide plan and chopping the record batches into smaller batches may improve execution time, but it costs more (as an integral of resource usage over query time) than just executing the query in a single partition with a rather large record batch size.
   - **concurrency:** While many analytic/OLAP/CLI/... solutions basically serve a single user, many server-based solutions need to handle queries concurrently. In particular, "eating up" too many resources (both on the host -- like CPU and memory -- and within the process -- like tokio tasks) just to make one query a bit faster easily makes the system as a whole worse (both in terms of end user experience and for the operator).
   
   It is important to point out that the existing benchmark suite has a valid purpose and that its goals may often conflict with the drawbacks listed above.
   
   ### Describe the solution you'd like
   
   Add a new benchmark suite that has the following properties / design:
   
   - **fixed task set:** I think using the same queries and fixtures as the existing suite is fine. From the list of "tasks" (query/data pairs), only ONE is run at a time.
   - **concurrency:** Define a "client" as "run the given query serially in a hot loop". Clients keep running for a fixed time period. The number of clients running at the same time defines the "concurrency". Run the test with multiple, ideally exponentially growing concurrency levels, starting at 1 and going up to a point that we deem "exhaustive" for the environment (see below), e.g. 1, 2, 4, 8, 16, 32 (a rough driver sketch follows this list).
   - **constrained environment:** Use a constrained environment (= VM) that we expect DataFusion to eventually saturate (at least with increasing concurrency).
   - **warm-up:** We may want to wait a few seconds before we start measuring the metric (see below) to give the OS and other components some time to warm up.
   - **metric:** For each query execution by one of the clients, measure the "query time" in the same way as we currently do, from start (sending the SQL text) to finish (receiving the entire result). Only successful completions count; errors should terminate the benchmark. Based on that, derive the following metrics:
     - **quantiles:** Over all query executions of all clients, report some nice quantiles like 50%, 90%, 95%, 99%.
     - **average throughput:** Over the entire measurement window (minus the warm-up time), count the number of queries that were successfully completed. Divide by the measurement window in seconds and by the client count to derive "the average per-client throughput in queries/s".
   
   This suite should be hooked up to the `/benchmark` GitHub command.
   
   ### Describe alternatives you've considered
   
   \-
   
   ### Additional context
   
   - #10026

