pdet commented on issue #38837:
URL: https://github.com/apache/arrow/issues/38837#issuecomment-2031914284

   Just to provide a clearer example of how statistics can be useful for query 
optimization.
   
   In DuckDB, join ordering is currently determined using heuristics based on table cardinalities, with plans to enhance this with sample statistics in the future.
   
   Statistics affect not only join ordering; even the choice of the probe side in a hash join is determined based on the expected cardinalities.
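   
   To make the build/probe terminology concrete, here is a toy sketch of a hash join in plain Python (an illustration only, not DuckDB's actual implementation): the engine builds a hash table on one input and merely streams the other input through it, so building on the smaller input keeps the hash table small.
   
   ```python
   # Toy hash join: count matching pairs between two inputs.
   def hash_join_count(build_rows, probe_rows):
       table = {}
       for key in build_rows:  # build phase: hash table sized by the build side
           table[key] = table.get(key, 0) + 1
       # Probe phase: the probe side is only streamed through.
       return sum(table.get(key, 0) for key in probe_rows)
   
   # Stand-ins for t and t_2, scaled down from 10^8 rows so this runs quickly.
   big = range(1_000_000)
   small = range(10)
   
   # Same result either way, but building on `small` allocates a 10-entry
   # hash table, while building on `big` allocates one with 10^6 entries.
   assert hash_join_count(small, big) == hash_join_count(big, small)
   ```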
   
   One example of a query that is affected by join ordering is [Q21 of TPC-H](https://github.com/duckdb/duckdb/blob/main/extension/tpch/dbgen/queries/q21.sql). However, its plan is too big to share easily in a GitHub discussion.
   
   To give a simpler example of how cardinalities affect this, I've created two tables:
   1. `t` - has 10^8 rows, with values ranging from 0 to 10^8 - 1
   2. `t_2` - has 10 rows, with values ranging from 0 to 9
   
   My example query is a simple inner join of these two tables, computing the sum of `t.i`.
   
   ```sql
   SELECT SUM(t.i) FROM t INNER JOIN t_2 ON (t_2.k = t.i)
   ```
   
   Because the optimizer doesn't have any statistics from the Arrow side, it essentially picks the probe side based on the order in which the tables appear in the query.
   
   ```sql
   SELECT SUM(t.i) FROM t_2 INNER JOIN t ON (t_2.k = t.i)
   ```
   
   This results in an only slightly different plan, yet with a significant difference in performance.
   
   <img width="358" alt="Screenshot 2024-04-02 at 14 28 03" src="https://github.com/apache/arrow/assets/7377477/135fd50f-a092-4171-b99e-dfc976872035">
   
   As the screenshot of executing both queries shows, choosing the incorrect probe side for this query already results in an order-of-magnitude performance difference. For more complex queries, the variations in execution time can be not only larger but also harder to trace.
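   
   For contrast, running the same two queries against the native DuckDB tables (instead of their Arrow exports) should produce the same plan for both orderings, since the optimizer then knows both cardinalities. A minimal, self-contained sketch:
   
   ```python
   import duckdb
   
   con = duckdb.connect()
   con.execute("CREATE TABLE t AS SELECT * FROM RANGE(0, 100000000) tbl(i)")
   con.execute("CREATE TABLE t_2 AS SELECT * FROM RANGE(0, 10) tbl(k)")
   
   # With native tables the optimizer knows the cardinalities, so both
   # EXPLAIN outputs are expected to show t_2 on the build side.
   for q in [
       "SELECT SUM(t.i) FROM t INNER JOIN t_2 ON (t_2.k = t.i)",
       "SELECT SUM(t.i) FROM t_2 INNER JOIN t ON (t_2.k = t.i)",
   ]:
       print(con.execute("EXPLAIN " + q).fetchall()[0][1])
   ```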
   
   For reference, the code I used for this example:
   ```python
   import duckdb
   import time
   import statistics
   
   con = duckdb.connect()
   
   # Create a table with 10^8 rows
   con.execute("CREATE TABLE t AS SELECT * FROM RANGE(0, 100000000) tbl(i)")
   # Create a table with 10 rows
   con.execute("CREATE TABLE t_2 AS SELECT * FROM RANGE(0, 10) tbl(k)")
   
   query_slow = '''
   SELECT SUM(t.i) FROM t INNER JOIN t_2 ON (t_2.k = t.i);
   '''
   
   query_fast = '''
   SELECT SUM(t.i) FROM t_2 INNER JOIN t ON (t_2.k = t.i);
   '''
   
   # Export both tables to Arrow. The second connection sees them only as
   # Arrow tables (via replacement scans), i.e., without DuckDB statistics.
   t = con.execute("FROM t").fetch_arrow_table()
   t_2 = con.execute("FROM t_2").fetch_arrow_table()
   
   con_2 = duckdb.connect()
   
   
   def median_runtime(connection, query, runs=5):
       # Median wall-clock time over `runs` executions.
       execution_times = []
       for _ in range(runs):
           start_time = time.time()
           connection.execute(query)
           execution_times.append(time.time() - start_time)
       return statistics.median(execution_times)
   
   
   print("DuckDB Arrow - Query Slow")
   print(con_2.execute("EXPLAIN " + query_slow).fetchall()[0][1])
   print(median_runtime(con_2, query_slow))
   
   print("DuckDB Arrow - Query Fast")
   print(con_2.execute("EXPLAIN " + query_fast).fetchall()[0][1])
   print(median_runtime(con_2, query_fast))
   
   ```
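   
   As a side note, DuckDB's `EXPLAIN ANALYZE` is an alternative to timing in Python: it executes the query and reports per-operator timings. A sketch reusing `con_2`, `query_slow`, and `query_fast` from the script above:
   
   ```python
   # Executes each query and prints the profiled plan with per-operator timings.
   print(con_2.execute("EXPLAIN ANALYZE " + query_slow).fetchall()[0][1])
   print(con_2.execute("EXPLAIN ANALYZE " + query_fast).fetchall()[0][1])
   ```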
   

