Re: [I] Late materialization when LIMIT prunes heavily. [datafusion]

via GitHub Wed, 01 Jul 2026 11:01:29 -0700


RatulDawar commented on issue #23263:
URL: https://github.com/apache/datafusion/issues/23263#issuecomment-4858707444


   Since this largely depends on statistics, here is what I propose.
   
   Symbols (uniform per-cell cost):
   
   | Symbol | Meaning |
   |--------|---------|
   | R | rows before filtering |
   | r | rows after filtering |
   | C | columns projected (`SELECT *`) |
   | c | columns required in phase 1 (filter cols + sort key + row ids) |
   
   ### Scan cost
   
   **Before optimization** — filter after wide decode:
   
   ```text
   scan_before = R × C
   ```
   
   **After optimization** — narrow scan, then wide fetch for survivors:
   
   ```text
   scan_after = R × c + r × C
   ```
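To make the two formulas concrete, here is a minimal sketch that evaluates them for one made-up workload. The function names and the sample numbers are illustrative assumptions, not anything taken from DataFusion, and the model keeps the uniform per-cell cost assumed above.

```python
def scan_before(R: int, C: int) -> int:
    """Cells decoded when every one of the C columns is read for all R rows."""
    return R * C


def scan_after(R: int, r: int, C: int, c: int) -> int:
    """Cells decoded with a narrow pass over R rows, then a wide fetch of r survivors."""
    return R * c + r * C


# Hypothetical workload: 1M rows, 1k survivors, 50 projected columns, 3 needed in phase 1.
R, r, C, c = 1_000_000, 1_000, 50, 3
print(scan_before(R, C))       # 50000000
print(scan_after(R, r, C, c))  # 3050000
```

With these numbers the narrow-then-wide plan touches roughly 6% of the cells the wide scan does.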
   
   ### When optimization wins
   
   ```text
   scan_before > scan_after
   
   R × C  >  R × c + r × C
   
   R × (C - c)  >  r × C
   ```
   
   Same condition, rearranged:
   
   ```text
   r / R  <  (C - c) / C          (selectivity < column savings fraction)
   
   R / r  >  C / (C - c)
   ```
   
   **Break-even** (costs equal):
   
   ```text
   r* = R × (C - c) / C
   
   Optimization wins when  r < r*
   ```
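The break-even rule can be sketched as a small predicate. This is only an illustration of the inequality above under the same uniform per-cell cost assumption; `break_even_rows` and `late_materialization_wins` are hypothetical names, not existing DataFusion functions.

```python
def break_even_rows(R: int, C: int, c: int) -> float:
    """r* = R * (C - c) / C, the survivor count at which both plans cost the same."""
    return R * (C - c) / C


def late_materialization_wins(R: int, r: int, C: int, c: int) -> bool:
    """True when the estimated survivor count r is strictly below the break-even r*."""
    return r < break_even_rows(R, C, c)


# Hypothetical numbers: 1M rows, 50 projected columns, 3 columns needed in phase 1.
print(break_even_rows(1_000_000, 50, 3))                     # 940000.0
print(late_materialization_wins(1_000_000, 1_000, 50, 3))    # True
print(late_materialization_wins(1_000_000, 950_000, 50, 3))  # False
```

Note that when c equals C the break-even point is 0, so the predicate never fires, which matches the intuition that there is nothing to save if phase 1 already needs every column.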
   
   
   
   Here it's possible to determine R, C and c. But r seems a bit hard to estimate, so maybe we can start with just LIMIT queries, since there we have r available.
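For the LIMIT-only starting point, a rough sketch of the check could look like the following. It assumes r can be bounded by `min(limit, R)` and that phase 1 still scans all R rows (as in a sort plus LIMIT plan); both are my assumptions, and `limit_query_wins` is a hypothetical helper rather than an existing API.

```python
def limit_query_wins(R: int, C: int, c: int, limit: int) -> bool:
    """Apply R * (C - c) > r * C with r taken as min(limit, R).

    Assumes at most `limit` rows need the wide fetch and that the
    narrow phase-1 scan still visits all R rows.
    """
    r = min(limit, R)
    return R * (C - c) > r * C


# Hypothetical numbers: 1M rows, 50 projected columns, 3 columns needed in phase 1.
print(limit_query_wins(1_000_000, 50, 3, limit=10))         # True
print(limit_query_wins(1_000_000, 50, 3, limit=2_000_000))  # False
```

Using the integer form of the inequality avoids a division, so the check also behaves sensibly when C equals c.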
   
   @saadtajwar what do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

