RatulDawar opened a new issue, #23263:
URL: https://github.com/apache/datafusion/issues/23263

   Currently ClickBench Q23 is a major outliner due to the fact that datafusion 
doesn't do late materialization of projected columns. 
   
   ```
   SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;
   ```
   A query like this needs to scan all the columns only after all the redundant 
rows are prunned from output.  
   
   Some numbers that I was able to get locally showing this regression in play. 
   
   | Engine | Query time |
   |--------|------------|
   | DataFusion 54 (`SELECT *`) | ~4.0s avg |
   | DataFusion 54 (`SELECT "EventTime", "URL"`) | ~0.9s |
   
   
   ```text
   SortPreservingMergeExec (limit 10)
   └─ SortExec TopK (EventTime), DynamicFilter [ empty ]
      └─ ProjectionExec (105 columns)
         └─ FilterExec (URL LIKE '%google%')
            └─ DataSourceExec (projection=[105 cols], 111 files)
                 predicate: URL LIKE %google% AND DynamicFilter [ empty ]
   ```
   
   ### `EXPLAIN ANALYZE` metrics (`SELECT *`)
   
   **DataSourceExec**
   
   | Metric | Value |
   |--------|-------|
   | `projection` | 105 columns |
   | `output_rows` | 97.17 M |
   | `output_bytes` | 137.8 GB |
   | `row_groups_pruned_statistics` | 325 → 325 matched |
   | `row_groups_pruned_bloom_filter` | 325 → 325 matched |
   | `page_index_pages_pruned` | 0 |
   
   
   
   What duckdb does in this scenario is that it materialises the projection 
rows that are not needed for any computation or maybe needed in later steps as 
late as possilble. 
   
   But this late materialization comes at the cost of requiring a self join 
later on to get the other columns back, and got self join's we need row 
tagging. 
   
   In datafusion we can tag a row uniquely with `file_row_index()`, 
`input_file_name()`. 
   But then if rows are not dynamically filtered on the basis of above two 
columns it will again lead to a full table scan for probe side of the join. 
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to