pdet commented on issue #38837: URL: https://github.com/apache/arrow/issues/38837#issuecomment-2031914284
Just to provide a clearer example of how statistics can be useful for query optimization: in DuckDB, join ordering is currently determined using heuristics based on table cardinalities, with future plans to enhance this with sample statistics. It is not only join ordering that is affected by statistics; even the choice of the probe side in a hash join is determined by the expected cardinalities.

One example of a query affected by join ordering is [Q 21 of TPC-H](https://github.com/duckdb/duckdb/blob/main/extension/tpch/dbgen/queries/q21.sql). However, its plan is too big to share easily in a GitHub discussion. To give a simpler example of how cardinalities affect this, I've created two tables:

1. `t` - has 10^8 rows, with values ranging from 0 to 10^8
2. `t_2` - has 10 rows, with values ranging from 0 to 10

My example query is a simple inner join of these two tables, calculating the sum of `t.i`:

```sql
SELECT SUM(t.i) from t inner join t_2 on (t_2.k = t.i)
```

Because the optimizer doesn't have any statistics from the Arrow side, it will essentially pick the probe side based on how the query is written. Swapping the order of the tables:

```sql
SELECT SUM(t.i) from t_2 inner join t on (t_2.k = t.i)
```

results in a slightly different plan, yet with significant differences in performance.

<img width="358" alt="Screenshot 2024-04-02 at 14 28 03" src="https://github.com/apache/arrow/assets/7377477/135fd50f-a092-4171-b99e-dfc976872035">

As depicted in the screenshot of executing both queries, choosing the incorrect probe side for this query already results in a performance difference of an order of magnitude. For more complex queries, the variations in execution time could be not only larger but also more difficult to trace.

For reference, the code I used for this example:

```python
import duckdb
import time
import statistics

con = duckdb.connect()

# Create a table with 10^8 rows
con.execute("CREATE TABLE t as SELECT * FROM RANGE(0, 100000000) tbl(i)")

# Create a table with 10 rows
con.execute("CREATE TABLE t_2 as SELECT * FROM RANGE(0, 10) tbl(k)")

query_slow = '''
SELECT SUM(t.i) from t inner join t_2 on (t_2.k = t.i);
'''

query_fast = '''
SELECT SUM(t.i) from t_2 inner join t on (t_2.k = t.i);'''

# Export both tables as Arrow tables; the fresh connection below sees them
# only through DuckDB's Arrow replacement scans, i.e. without statistics.
t = con.execute("FROM t").fetch_arrow_table()
t_2 = con.execute("FROM t_2").fetch_arrow_table()

con_2 = duckdb.connect()

print("DuckDB Arrow - Query Slow")
print(con_2.execute("EXPLAIN " + query_slow).fetchall()[0][1])
execution_times = []
for _ in range(5):
    start_time = time.time()
    con_2.execute(query_slow)
    end_time = time.time()
    execution_times.append(end_time - start_time)
median_time = statistics.median(execution_times)
print(median_time)

print("DuckDB Arrow - Query Fast")
print(con_2.execute("EXPLAIN " + query_fast).fetchall()[0][1])
execution_times = []
for _ in range(5):
    start_time = time.time()
    con_2.execute(query_fast)
    end_time = time.time()
    execution_times.append(end_time - start_time)
median_time = statistics.median(execution_times)
print(median_time)
```
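For what it's worth, the statistics needed to make the right choice here are modest. Below is a minimal sketch of the kind of information an Arrow producer could expose, computed with pyarrow's existing `Table.num_rows` and `pyarrow.compute.min_max`; the `column_stats` helper and the dict shape are hypothetical, not an existing Arrow or DuckDB API:

```python
import pyarrow as pa
import pyarrow.compute as pc

def column_stats(table: pa.Table, column: str) -> dict:
    """Hypothetical helper: the basic statistics a producer could attach
    to an Arrow table: cardinality plus a per-column min/max."""
    mm = pc.min_max(table.column(column))  # StructScalar with 'min'/'max'
    return {
        "num_rows": table.num_rows,
        "min": mm["min"].as_py(),
        "max": mm["max"].as_py(),
    }

# With the tables from the example above, a consumer that saw
# num_rows = 100000000 for t and num_rows = 10 for t_2 could always
# build the hash table on t_2 and probe with t, regardless of the
# order in which the join is written.
print(column_stats(t, "i"))
print(column_stats(t_2, "k"))
```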
