I agree it would be better if the projection did not include the columns needed (only) for the pushdown filter . As Jie Wen says, it would be a nice thing to improve.
Would you be willing to file a ticket to track this issue? If not I can do so. Thanks, Andrew On Wed, Apr 12, 2023 at 2:13 AM vin jake <jakevin...@gmail.com> wrote: > Hi, > > The current "Projection pushdown" rule was implemented by me previously. > > I believe the key issue is the execution order of the internal projection > and filter of the `TableScan` operator. > If we do `projection` before `filter`, `Projection [col1, col2]` would be > essential. Otherwise, we will miss some columns. > But If we do `projection` before `filter`, I think `Projection [col1, > col2]` isn't necessary, you can create a new PR to enhance it (if you are > willing to) and request my review. > > Best regards, Jie Wen. > > On Wed, Apr 12, 2023 at 1:36 PM Markus Appel <markusap...@hotmail.de> > wrote: > > > Hello, > > > > I hope this is the right place to ask this. > > > > While working on a project based on arrow-datafusion, I came across some > > weird behavior where a projection did not get eliminated as expected, > thus > > breaking a custom optimizer rule's assumption (into which I won't go > > further, as it's not important for this question). > > > > Specifically, I found an execution plan like: > > > > Projection [col1, col2] > > TableScan projection=[col1, col2, col3], full_filters=[ ... col3 > ...] > > > > to not be simplified to: > > > > TableScan projection=[col1, col2], full_filters=[... col3 ...] > > > > This does seem to be intentional, as I have found this snippet< > > > https://github.com/apache/arrow-datafusion/blob/c97048d178594b10b813c6bcd1543f157db4ba3f/datafusion/optimizer/src/push_down_projection.rs#L174 > > > > in the optimizer rule: > > > > ... > > LogicalPlan::TableScan(scan) > > if !scan.projected_schema.fields().is_empty() => > > { > > let mut used_columns: HashSet<Column> = HashSet::new(); > > // filter expr may not exist in expr in projection. > > // like: TableScan: t1 projection=[bool_col, int_col], > > full_filters=[t1.id = Int32(1)] > > // projection=[bool_col, int_col] don't contain `ti.id`. > > exprlist_to_columns(&scan.filters, &mut used_columns)?; > > ... > > > > However, the comment does not explain why we need to keep the extra > > projection and return the extra column - after all, the filters inside of > > the scan are internal to that scan, and should not affect the execution > > plan, right? > > > > I am looking forward to any opinions. > > > > Best regards, Markus Appel > > > > > > >