askoa commented on issue #4794: URL: https://github.com/apache/arrow-datafusion/issues/4794#issuecomment-1382626825
> The second bug is that the schema has `customer.c_customer_sk` but the join condition has `c.c_customer_sk` The expression in question was originally part of `EXISTS(<subquery>)`. Few iterations before the failure, a `push_down_filter` pushes the `EXISTS(<subquery>)` down. The `push_down_filter` uses `ExprRewriter` to rewrite the expressions. The `ExprRewriter` does not rewrite subquery in any expression that has a subquery. In this case, the expression with subquery is `EXISTS`. Thus the column `c.c_customer_sk` does not get modified to `customer.c_customer_sk`. `push_down_filter` rewriting expressions: https://github.com/apache/arrow-datafusion/blob/a9ddcd3a7558437361835120659b946b903468e1/datafusion/optimizer/src/push_down_filter.rs#L785 `ExprRewriter` does not rewrite subqueries: https://github.com/apache/arrow-datafusion/blob/a9ddcd3a7558437361835120659b946b903468e1/datafusion/expr/src/expr_rewriter.rs#L116 <details><summary>Relevant snippet of logical plan before `push_down_filter`</summary> Filter: c.c_current_addr_sk = ca.ca_address_sk AND ca.ca_county IN ([Utf8("Walker County"), Utf8("Richland County"), Utf8("Gaines County"), Utf8("Douglas County"), Utf8("Dona Ana County")]) AND customer_demographics.cd_demo_sk = c.c_current_cdemo_sk AND (EXISTS (<subquery>) AS EXISTS OR EXISTS (<subquery>) AS EXISTS) Projection: EXISTS (<subquery>) AS EXISTS (<subquery>), c.c_customer_sk, c.c_customer_id, c.c_current_cdemo_sk, c.c_current_hdemo_sk, c.c_current_addr_sk, c.c_first_shipto_date_sk, c.c_first_sales_date_sk, c.c_salutation, c.c_first_name, c.c_last_name, c.c_preferred_cust_flag, c.c_birth_day, c.c_birth_month, c.c_birth_year, c.c_birth_country, c.c_login, c.c_email_address, c.c_last_review_date_sk, ca.ca_address_sk, ca.ca_address_id, ca.ca_street_number, ca.ca_street_name, ca.ca_street_type, ca.ca_suite_number, ca.ca_city, ca.ca_county, ca.ca_state, ca.ca_zip, ca.ca_country, ca.ca_gmt_offset, ca.ca_location_type, customer_demographics.cd_demo_sk, customer_demographics.cd_gender, customer_demographics.cd_marital_status, customer_demographics.cd_education_status, customer_demographics.cd_purchase_estimate, customer_demographics.cd_credit_rating, customer_demographics.cd_dep_count, customer_demographics.cd_dep_employed_count, customer_demographics.cd_dep_college_count Subquery: Projection: web_sales.ws_sold_date_sk, web_sales.ws_sold_time_sk, web_sales.ws_ship_date_sk, web_sales.ws_item_sk, web_sales.ws_bill_customer_sk, web_sales.ws_bill_cdemo_sk, web_sales.ws_bill_hdemo_sk, web_sales.ws_bill_addr_sk, web_sales.ws_ship_customer_sk, web_sales.ws_ship_cdemo_sk, web_sales.ws_ship_hdemo_sk, web_sales.ws_ship_addr_sk, web_sales.ws_web_page_sk, web_sales.ws_web_site_sk, web_sales.ws_ship_mode_sk, web_sales.ws_warehouse_sk, web_sales.ws_promo_sk, web_sales.ws_order_number, web_sales.ws_quantity, web_sales.ws_wholesale_cost, web_sales.ws_list_price, web_sales.ws_sales_price, web_sales.ws_ext_discount_amt, web_sales.ws_ext_sales_price, web_sales.ws_ext_wholesale_cost, web_sales.ws_ext_list_price, web_sales.ws_ext_tax, web_sales.ws_coupon_amt, web_sales.ws_ext_ship_cost, web_sales.ws_net_paid, web_sales.ws_net_paid_inc_tax, web_sales.ws_net_paid_inc_ship, web_sales.ws_net_paid_inc_ship_tax, web_sales.ws_net_profit, date_dim.d_date_sk, date_dim .d_date_id, date_dim.d_date, date_dim.d_month_seq, date_dim.d_week_seq, date_dim.d_quarter_seq, date_dim.d_year, date_dim.d_dow, date_dim.d_moy, date_dim.d_dom, date_dim.d_qoy, date_dim.d_fy_year, date_dim.d_fy_quarter_seq, date_dim.d_fy_week_seq, date_dim.d_day_name, date_dim.d_quarter_name, date_dim.d_holiday, date_dim.d_weekend, date_dim.d_following_holiday, date_dim.d_first_dom, date_dim.d_last_dom, date_dim.d_same_day_ly, date_dim.d_same_day_lq, date_dim.d_current_day, date_dim.d_current_week, date_dim.d_current_month, date_dim.d_current_quarter, date_dim.d_current_year Filter: c.c_customer_sk = web_sales.ws_bill_customer_sk AND web_sales.ws_sold_date_sk = date_dim.d_date_sk AND CAST(date_dim.d_year AS Int64) = Int64(2002) AND CAST(date_dim.d_moy AS Int64) BETWEEN Int64(4) AND Int64(4) + Int64(3) CrossJoin: TableScan: web_sales TableScan: date_dim LeftSemi Join: c.c_customer_sk = store_sales.ss_customer_sk CrossJoin: CrossJoin: SubqueryAlias: c TableScan: customer </details> <details><summary>Relevant snippet of logical plan after `push_down_filter`</summary> Projection: c.c_customer_sk, c.c_customer_id, c.c_current_cdemo_sk, c.c_current_hdemo_sk, c.c_current_addr_sk, c.c_first_shipto_date_sk, c.c_first_sales_date_sk, c.c_salutation, c.c_first_name, c.c_last_name, c.c_preferred_cust_flag, c.c_birth_day, c.c_birth_month, c.c_birth_year, c.c_birth_country, c.c_login, c.c_email_address, c.c_last_review_date_sk, ca.ca_address_sk, ca.ca_address_id, ca.ca_street_number, ca.ca_street_name, ca.ca_street_type, ca.ca_suite_number, ca.ca_city, ca.ca_county, ca.ca_state, ca.ca_zip, ca.ca_country, ca.ca_gmt_offset, ca.ca_location_type, customer_demographics.cd_demo_sk, customer_demographics.cd_gender, customer_demographics.cd_marital_status, customer_demographics.cd_education_status, customer_demographics.cd_purchase_estimate, customer_demographics.cd_credit_rating, customer_demographics.cd_dep_count, customer_demographics.cd_dep_employed_count, customer_demographics.cd_dep_college_count Projection: EXISTS (<subquery>) AS EXISTS (<subquery>), c.c_customer_sk, c.c_customer_id, c.c_current_cdemo_sk, c.c_current_hdemo_sk, c.c_current_addr_sk, c.c_first_shipto_date_sk, c.c_first_sales_date_sk, c.c_salutation, c.c_first_name, c.c_last_name, c.c_preferred_cust_flag, c.c_birth_day, c.c_birth_month, c.c_birth_year, c.c_birth_country, c.c_login, c.c_email_address, c.c_last_review_date_sk, ca.ca_address_sk, ca.ca_address_id, ca.ca_street_number, ca.ca_street_name, ca.ca_street_type, ca.ca_suite_number, ca.ca_city, ca.ca_county, ca.ca_state, ca.ca_zip, ca.ca_country, ca.ca_gmt_offset, ca.ca_location_type, customer_demographics.cd_demo_sk, customer_demographics.cd_gender, customer_demographics.cd_marital_status, customer_demographics.cd_education_status, customer_demographics.cd_purchase_estimate, customer_demographics.cd_credit_rating, customer_demographics.cd_dep_count, customer_demographics.cd_dep_employed_count, customer_demographics.cd_dep_college_count Subquery: Projection: web_sales.ws_sold_date_sk, web_sales.ws_sold_time_sk, web_sales.ws_ship_date_sk, web_sales.ws_item_sk, web_sales.ws_bill_customer_sk, web_sales.ws_bill_cdemo_sk, web_sales.ws_bill_hdemo_sk, web_sales.ws_bill_addr_sk, web_sales.ws_ship_customer_sk, web_sales.ws_ship_cdemo_sk, web_sales.ws_ship_hdemo_sk, web_sales.ws_ship_addr_sk, web_sales.ws_web_page_sk, web_sales.ws_web_site_sk, web_sales.ws_ship_mode_sk, web_sales.ws_warehouse_sk, web_sales.ws_promo_sk, web_sales.ws_order_number, web_sales.ws_quantity, web_sales.ws_wholesale_cost, web_sales.ws_list_price, web_sales.ws_sales_price, web_sales.ws_ext_discount_amt, web_sales.ws_ext_sales_price, web_sales.ws_ext_wholesale_cost, web_sales.ws_ext_list_price, web_sales.ws_ext_tax, web_sales.ws_coupon_amt, web_sales.ws_ext_ship_cost, web_sales.ws_net_paid, web_sales.ws_net_paid_inc_tax, web_sales.ws_net_paid_inc_ship, web_sales.ws_net_paid_inc_ship_tax, web_sales.ws_net_profit, date_dim.d_date_sk, date_dim.d _date_id, date_dim.d_date, date_dim.d_month_seq, date_dim.d_week_seq, date_dim.d_quarter_seq, date_dim.d_year, date_dim.d_dow, date_dim.d_moy, date_dim.d_dom, date_dim.d_qoy, date_dim.d_fy_year, date_dim.d_fy_quarter_seq, date_dim.d_fy_week_seq, date_dim.d_day_name, date_dim.d_quarter_name, date_dim.d_holiday, date_dim.d_weekend, date_dim.d_following_holiday, date_dim.d_first_dom, date_dim.d_last_dom, date_dim.d_same_day_ly, date_dim.d_same_day_lq, date_dim.d_current_day, date_dim.d_current_week, date_dim.d_current_month, date_dim.d_current_quarter, date_dim.d_current_year Filter: c.c_customer_sk = web_sales.ws_bill_customer_sk AND web_sales.ws_sold_date_sk = date_dim.d_date_sk AND CAST(date_dim.d_year AS Int64) = Int64(2002) AND CAST(date_dim.d_moy AS Int64) BETWEEN Int64(4) AND Int64(4) + Int64(3) CrossJoin: TableScan: web_sales TableScan: date_dim LeftSemi Join: c.c_customer_sk = store_sales.ss_customer_sk Filter: customer_demographics.cd_demo_sk = c.c_current_cdemo_sk CrossJoin: Filter: c.c_current_addr_sk = ca.ca_address_sk CrossJoin: SubqueryAlias: c Filter: EXISTS (<subquery>) OR EXISTS (<subquery>) Subquery: Projection: web_sales.ws_sold_date_sk, web_sales.ws_sold_time_sk, web_sales.ws_ship_date_sk, web_sales.ws_item_sk, web_sales.ws_bill_customer_sk, web_sales.ws_bill_cdemo_sk, web_sales.ws_bill_hdemo_sk, web_sales.ws_bill_addr_sk, web_sales.ws_ship_customer_sk, web_sales.ws_ship_cdemo_sk, web_sales.ws_ship_hdemo_sk, web_sales.ws_ship_addr_sk, web_sales.ws_web_page_sk, web_sales.ws_web_site_sk, web_sales.ws_ship_mode_sk, web_sales.ws_warehouse_sk, web_sales.ws_promo_sk, web_sales.ws_order_number, web_sales.ws_quantity, web_sales.ws_wholesale_cost, web_sales.ws_list_price, web_sales.ws_sales_price, web_sales.ws_ext_discount_amt, web_sales.ws_ext_sales_price, web_sales.ws_ext_wholesale_cost, web_sales.ws_ext_list_price, web_sales.ws_ext_tax, web_sales.ws_coupon_amt, web_sales.ws_ext_ship_cost, web_sales.ws_net_paid, web_sales.ws_net_paid_inc_tax, web_sales.ws_net_paid_inc_ship, web_sales.ws_net_paid_inc_ship_tax, web_sales.ws_net_profit, date_dim.d_date_ sk, date_dim.d_date_id, date_dim.d_date, date_dim.d_month_seq, date_dim.d_week_seq, date_dim.d_quarter_seq, date_dim.d_year, date_dim.d_dow, date_dim.d_moy, date_dim.d_dom, date_dim.d_qoy, date_dim.d_fy_year, date_dim.d_fy_quarter_seq, date_dim.d_fy_week_seq, date_dim.d_day_name, date_dim.d_quarter_name, date_dim.d_holiday, date_dim.d_weekend, date_dim.d_following_holiday, date_dim.d_first_dom, date_dim.d_last_dom, date_dim.d_same_day_ly, date_dim.d_same_day_lq, date_dim.d_current_day, date_dim.d_current_week, date_dim.d_current_month, date_dim.d_current_quarter, date_dim.d_current_year Filter: c.c_customer_sk = web_sales.ws_bill_customer_sk AND web_sales.ws_sold_date_sk = date_dim.d_date_sk AND CAST(date_dim.d_year AS Int64) = Int64(2002) AND CAST(date_dim.d_moy AS Int64) BETWEEN Int64(4) AND Int64(4) + Int64(3) CrossJoin: TableScan: web_sales TableScan: date_dim Subquery: Projection: web_sales.ws_sold_date_sk, web_sales.ws_sold_time_sk, web_sales.ws_ship_date_sk, web_sales.ws_item_sk, web_sales.ws_bill_customer_sk, web_sales.ws_bill_cdemo_sk, web_sales.ws_bill_hdemo_sk, web_sales.ws_bill_addr_sk, web_sales.ws_ship_customer_sk, web_sales.ws_ship_cdemo_sk, web_sales.ws_ship_hdemo_sk, web_sales.ws_ship_addr_sk, web_sales.ws_web_page_sk, web_sales.ws_web_site_sk, web_sales.ws_ship_mode_sk, web_sales.ws_warehouse_sk, web_sales.ws_promo_sk, web_sales.ws_order_number, web_sales.ws_quantity, web_sales.ws_wholesale_cost, web_sales.ws_list_price, web_sales.ws_sales_price, web_sales.ws_ext_discount_amt, web_sales.ws_ext_sales_price, web_sales.ws_ext_wholesale_cost, web_sales.ws_ext_list_price, web_sales.ws_ext_tax, web_sales.ws_coupon_amt, web_sales.ws_ext_ship_cost, web_sales.ws_net_paid, web_sales.ws_net_paid_inc_tax, web_sales.ws_net_paid_inc_ship, web_sales.ws_net_paid_inc_ship_tax, web_sales.ws_net_profit, date_dim.d_date_ sk, date_dim.d_date_id, date_dim.d_date, date_dim.d_month_seq, date_dim.d_week_seq, date_dim.d_quarter_seq, date_dim.d_year, date_dim.d_dow, date_dim.d_moy, date_dim.d_dom, date_dim.d_qoy, date_dim.d_fy_year, date_dim.d_fy_quarter_seq, date_dim.d_fy_week_seq, date_dim.d_day_name, date_dim.d_quarter_name, date_dim.d_holiday, date_dim.d_weekend, date_dim.d_following_holiday, date_dim.d_first_dom, date_dim.d_last_dom, date_dim.d_same_day_ly, date_dim.d_same_day_lq, date_dim.d_current_day, date_dim.d_current_week, date_dim.d_current_month, date_dim.d_current_quarter, date_dim.d_current_year Filter: c.c_customer_sk = web_sales.ws_bill_customer_sk AND web_sales.ws_sold_date_sk = date_dim.d_date_sk AND CAST(date_dim.d_year AS Int64) = Int64(2002) AND CAST(date_dim.d_moy AS Int64) BETWEEN Int64(4) AND Int64(4) + Int64(3) CrossJoin: TableScan: web_sales TableScan: date_dim TableScan: customer </details> We could see from the second snippet that the filter `EXISTS (<subquery>) OR EXISTS (<subquery>)` pushed below `Subqueryalias: c`. The column name in the subquery of `EXISTS` is not changed from `c` to `customer`. #### The fix? I would like to discuss the fix here. I don't think we can change `impl ExprRewriter for Expr` as it's used in many places. May be write an exclusive rewriter for `push_down_filter` which rewrites subqueries? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
