askalt opened a new issue, #19929:
URL: https://github.com/apache/datafusion/issues/19929

   This issue covers two related filter push-down improvements.
   
   ## Pass previously pushed filters to supports_filters_pushdown
   
   Currently, the filter push-down optimization does not pass filters that were pushed in a 
previous run (`TableScan::filters`) to 
`TableProvider::supports_filters_pushdown(...)`.
   
   If the optimizer runs multiple times, it may try to push filters into the 
table provider multiple times. In our DataFusion-based project, 
`supports_filters_pushdown(...)` has context-dependent behavior: the provider 
supports any single filter like `column = value`, but not multiple such filters 
at the same time.
   
   Consider the following optimizer pipeline pattern:
   
   1. Try to push `a = 1, b = 1`.
      `supports_filters_pushdown` returns `[Exact, Inexact]`
      OK: the optimizer records that `a = 1` is pushed and creates a filter 
node for `b = 1`.
   
   ...
   Another optimization iteration.
   
   2. Try to push `b = 1`.
      `supports_filters_pushdown` returns `[Exact]`. The table provider cannot remember all previously pushed filters, so it has no choice but to answer `Exact`.
      Now the optimizer thinks the conjunction `a = 1 AND b = 1` is supported exactly, but it is not.
   
   To prevent this problem, I suggest passing filters that were already pushed 
into the scan earlier to `supports_filters_pushdown(...)`.
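
A minimal sketch of the idea, not the actual DataFusion API: `PushDown` stands in for `TableProviderFilterPushDown`, filters are plain strings, and the extra `already_pushed` parameter is the hypothetical addition proposed here. The provider rule ("at most one filter can be applied exactly") is an assumption modelled on the behavior described above.

```rust
/// Stand-in for DataFusion's `TableProviderFilterPushDown` (simplified, hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PushDown {
    Exact,
    Inexact,
}

/// Hypothetical provider rule: only one filter can be applied exactly.
/// `already_pushed` models the filters stored in `TableScan::filters`,
/// which is the extra information this issue proposes to pass in.
pub fn supports_filters_pushdown(already_pushed: &[&str], candidates: &[&str]) -> Vec<PushDown> {
    // If a filter was pushed in an earlier optimizer run, the exact slot is taken.
    let mut exact_taken = !already_pushed.is_empty();
    candidates
        .iter()
        .map(|_| {
            if exact_taken {
                PushDown::Inexact
            } else {
                exact_taken = true;
                PushDown::Exact
            }
        })
        .collect()
}
```

With an empty `already_pushed` slice, the second iteration answers `[Exact]` for `b = 1`, which reproduces the problem. Passing `["a = 1"]` lets the provider answer `[Inexact]`, so the optimizer keeps the filter node for `b = 1`.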
   
   ## Do not assume that the filter support decision is stable
   
   Consider the following scenario:
   
   1. `supports_filters_pushdown` returns `Exact` for some filter, e.g. `a = 1`, where column `a` is not required by the query projection.
   
   2. `a` is removed from the table provider projection by the "optimize projection" rule.
   
   3. `supports_filters_pushdown` changes its decision and returns `Inexact` for this filter the next time.
      For example, the input filters have changed and the provider prefers to use a new one.
   
   4. `a` is not restored to the table provider projection, which leaves a filter that references a column that is not part of the schema.
   
   I suggest extending the rule's logic with the following steps:
   
   1. Collect the columns that are not in the current table provider projection but are required by filter expressions. Call this set `additional_projection`.
   
   2. If `additional_projection` is empty, leave everything as is.
   
   3. Otherwise, extend the table provider projection and wrap the plan in an additional projection node to preserve the schema that was in place before this rule ran.
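
The three steps could look roughly like the sketch below. It is an illustration under simplifying assumptions, not DataFusion code: the projection is an explicit list of column indices (a scan with no projection is not modelled), the filter columns are given as indices too, and `extend_projection` is a hypothetical helper name.

```rust
/// Returns `None` when no extra columns are needed (step 2).
/// Otherwise returns the extended scan projection together with the
/// positions the wrapping projection node must keep, so that the
/// output schema stays the same as before the rule ran (step 3).
pub fn extend_projection(
    projection: &[usize],
    filter_columns: &[usize],
) -> Option<(Vec<usize>, Vec<usize>)> {
    // Step 1: columns required by filters but missing from the projection.
    let mut additional_projection: Vec<usize> = Vec::new();
    for &col in filter_columns {
        if !projection.contains(&col) && !additional_projection.contains(&col) {
            additional_projection.push(col);
        }
    }

    // Step 2: nothing is missing, so the plan is left untouched.
    if additional_projection.is_empty() {
        return None;
    }

    // Step 3: append the missing columns to the scan projection...
    let mut extended = projection.to_vec();
    extended.extend(additional_projection);

    // ...and keep only the original leading columns in the wrapping projection.
    let outer: Vec<usize> = (0..projection.len()).collect();
    Some((extended, outer))
}
```

For example, with projection `[1, 2]` and a filter on column `0`, the scan would read `[1, 2, 0]` and the wrapping projection would keep positions `[0, 1]`, so downstream operators still see only the two original columns.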
   

