RatulDawar commented on issue #23263: URL: https://github.com/apache/datafusion/issues/23263#issuecomment-4858707444
Since this largely depends on statistics, here is what I propose.

Symbols (uniform per-cell cost):

| Symbol | Meaning |
|--------|---------|
| R | rows before filtering |
| r | rows after filtering |
| C | columns projected (`SELECT *`) |
| c | columns required in phase 1 (filter cols + sort key + row ids) |

### Scan cost

**Before optimization** (filter after wide decode):

```text
scan_before = R × C
```

**After optimization** (narrow scan, then wide fetch for survivors):

```text
scan_after = R × c + r × C
```

### When optimization wins

```text
scan_before > scan_after
R × C > R × c + r × C
R × (C - c) > r × C
```

Same condition, rearranged:

```text
r / R < (C - c) / C    (selectivity < column savings fraction)
R / r > C / (C - c)
```

**Break-even** (costs equal):

```text
r* = R × (C - c) / C
Optimization wins when r < r*
```

Here it's possible to determine R, C, and c. But r seems harder to estimate, so maybe we can start with just LIMIT queries, since there we have r available.

@saadtajwar what do you think?
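
To sanity-check the inequality, here is a minimal Python sketch of the cost model above. The function names and the sample numbers (R = 1,000,000, C = 20, c = 4, r = 100) are illustrative assumptions, not anything taken from DataFusion. It only restates the formulas under the same uniform per-cell cost.

```python
# Sketch of the per-cell cost model from this comment.
# All names and sample values are hypothetical, chosen for illustration.

def scan_before(R: int, C: int) -> int:
    """Cost of decoding all C columns for all R rows, then filtering."""
    return R * C


def scan_after(R: int, r: int, C: int, c: int) -> int:
    """Cost of a narrow scan over c columns, then a wide fetch of r survivors."""
    return R * c + r * C


def break_even_rows(R: int, C: int, c: int) -> float:
    """r* = R * (C - c) / C, the surviving-row count at which both costs are equal."""
    return R * (C - c) / C


def optimization_wins(R: int, r: int, C: int, c: int) -> bool:
    """True when the two-phase scan is strictly cheaper than the wide scan."""
    return scan_after(R, r, C, c) < scan_before(R, C)


if __name__ == "__main__":
    # Assumed example: a LIMIT 100 query over 1M rows, 20 columns, 4 needed in phase 1.
    R, C, c, r = 1_000_000, 20, 4, 100
    print(scan_before(R, C))              # 20000000
    print(scan_after(R, r, C, c))         # 4002000
    print(break_even_rows(R, C, c))       # 800000.0
    print(optimization_wins(R, r, C, c))  # True
```

With these assumed numbers the break-even point is 800,000 surviving rows, so a small LIMIT sits far below it, which is consistent with starting from LIMIT queries.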
