askoa commented on issue #4794:
URL: 
https://github.com/apache/arrow-datafusion/issues/4794#issuecomment-1382626825

   > The second bug is that the schema has `customer.c_customer_sk` but the 
join condition has `c.c_customer_sk`
   
   The expression in question was originally part of `EXISTS(<subquery>)`. Few 
iterations before the failure, a `push_down_filter` pushes the 
`EXISTS(<subquery>)` down.  The `push_down_filter` uses  `ExprRewriter` to 
rewrite the expressions. The `ExprRewriter` does not rewrite subquery in any 
expression that has a subquery. In this case, the expression with subquery is 
`EXISTS`. Thus the column `c.c_customer_sk` does not get modified to 
`customer.c_customer_sk`.
   
   `push_down_filter` rewriting expressions:
   
https://github.com/apache/arrow-datafusion/blob/a9ddcd3a7558437361835120659b946b903468e1/datafusion/optimizer/src/push_down_filter.rs#L785
   
   `ExprRewriter` does not rewrite subqueries:
   
https://github.com/apache/arrow-datafusion/blob/a9ddcd3a7558437361835120659b946b903468e1/datafusion/expr/src/expr_rewriter.rs#L116
   
   
   
   <details><summary>Relevant snippet of logical plan before 
`push_down_filter`</summary>
   
   
                 Filter: c.c_current_addr_sk = ca.ca_address_sk AND 
ca.ca_county IN ([Utf8("Walker County"), Utf8("Richland County"), Utf8("Gaines 
County"), Utf8("Douglas County"), Utf8("Dona Ana County")]) AND 
customer_demographics.cd_demo_sk = c.c_current_cdemo_sk AND (EXISTS 
(<subquery>) AS EXISTS OR EXISTS (<subquery>) AS EXISTS)
                   Projection: EXISTS (<subquery>) AS EXISTS (<subquery>), 
c.c_customer_sk, c.c_customer_id, c.c_current_cdemo_sk, c.c_current_hdemo_sk, 
c.c_current_addr_sk, c.c_first_shipto_date_sk, c.c_first_sales_date_sk, 
c.c_salutation, c.c_first_name, c.c_last_name, c.c_preferred_cust_flag, 
c.c_birth_day, c.c_birth_month, c.c_birth_year, c.c_birth_country, c.c_login, 
c.c_email_address, c.c_last_review_date_sk, ca.ca_address_sk, ca.ca_address_id, 
ca.ca_street_number, ca.ca_street_name, ca.ca_street_type, ca.ca_suite_number, 
ca.ca_city, ca.ca_county, ca.ca_state, ca.ca_zip, ca.ca_country, 
ca.ca_gmt_offset, ca.ca_location_type, customer_demographics.cd_demo_sk, 
customer_demographics.cd_gender, customer_demographics.cd_marital_status, 
customer_demographics.cd_education_status, 
customer_demographics.cd_purchase_estimate, 
customer_demographics.cd_credit_rating, customer_demographics.cd_dep_count, 
customer_demographics.cd_dep_employed_count, 
customer_demographics.cd_dep_college_count
                     Subquery:
                       Projection: web_sales.ws_sold_date_sk, 
web_sales.ws_sold_time_sk, web_sales.ws_ship_date_sk, web_sales.ws_item_sk, 
web_sales.ws_bill_customer_sk, web_sales.ws_bill_cdemo_sk, 
web_sales.ws_bill_hdemo_sk, web_sales.ws_bill_addr_sk, 
web_sales.ws_ship_customer_sk, web_sales.ws_ship_cdemo_sk, 
web_sales.ws_ship_hdemo_sk, web_sales.ws_ship_addr_sk, 
web_sales.ws_web_page_sk, web_sales.ws_web_site_sk, web_sales.ws_ship_mode_sk, 
web_sales.ws_warehouse_sk, web_sales.ws_promo_sk, web_sales.ws_order_number, 
web_sales.ws_quantity, web_sales.ws_wholesale_cost, web_sales.ws_list_price, 
web_sales.ws_sales_price, web_sales.ws_ext_discount_amt, 
web_sales.ws_ext_sales_price, web_sales.ws_ext_wholesale_cost, 
web_sales.ws_ext_list_price, web_sales.ws_ext_tax, web_sales.ws_coupon_amt, 
web_sales.ws_ext_ship_cost, web_sales.ws_net_paid, 
web_sales.ws_net_paid_inc_tax, web_sales.ws_net_paid_inc_ship, 
web_sales.ws_net_paid_inc_ship_tax, web_sales.ws_net_profit, 
date_dim.d_date_sk, date_dim
 .d_date_id, date_dim.d_date, date_dim.d_month_seq, date_dim.d_week_seq, 
date_dim.d_quarter_seq, date_dim.d_year, date_dim.d_dow, date_dim.d_moy, 
date_dim.d_dom, date_dim.d_qoy, date_dim.d_fy_year, date_dim.d_fy_quarter_seq, 
date_dim.d_fy_week_seq, date_dim.d_day_name, date_dim.d_quarter_name, 
date_dim.d_holiday, date_dim.d_weekend, date_dim.d_following_holiday, 
date_dim.d_first_dom, date_dim.d_last_dom, date_dim.d_same_day_ly, 
date_dim.d_same_day_lq, date_dim.d_current_day, date_dim.d_current_week, 
date_dim.d_current_month, date_dim.d_current_quarter, date_dim.d_current_year
                         Filter: c.c_customer_sk = 
web_sales.ws_bill_customer_sk AND web_sales.ws_sold_date_sk = 
date_dim.d_date_sk AND CAST(date_dim.d_year AS Int64) = Int64(2002) AND 
CAST(date_dim.d_moy AS Int64) BETWEEN Int64(4) AND Int64(4) + Int64(3)
                           CrossJoin:
                             TableScan: web_sales
                             TableScan: date_dim
                     LeftSemi Join: c.c_customer_sk = store_sales.ss_customer_sk
                       CrossJoin:
                         CrossJoin:
                           SubqueryAlias: c
                             TableScan: customer
   
   
   
   </details>
   
   
   <details><summary>Relevant snippet of logical plan after 
`push_down_filter`</summary>
   
   
               Projection: c.c_customer_sk, c.c_customer_id, 
c.c_current_cdemo_sk, c.c_current_hdemo_sk, c.c_current_addr_sk, 
c.c_first_shipto_date_sk, c.c_first_sales_date_sk, c.c_salutation, 
c.c_first_name, c.c_last_name, c.c_preferred_cust_flag, c.c_birth_day, 
c.c_birth_month, c.c_birth_year, c.c_birth_country, c.c_login, 
c.c_email_address, c.c_last_review_date_sk, ca.ca_address_sk, ca.ca_address_id, 
ca.ca_street_number, ca.ca_street_name, ca.ca_street_type, ca.ca_suite_number, 
ca.ca_city, ca.ca_county, ca.ca_state, ca.ca_zip, ca.ca_country, 
ca.ca_gmt_offset, ca.ca_location_type, customer_demographics.cd_demo_sk, 
customer_demographics.cd_gender, customer_demographics.cd_marital_status, 
customer_demographics.cd_education_status, 
customer_demographics.cd_purchase_estimate, 
customer_demographics.cd_credit_rating, customer_demographics.cd_dep_count, 
customer_demographics.cd_dep_employed_count, 
customer_demographics.cd_dep_college_count
                 Projection: EXISTS (<subquery>) AS EXISTS (<subquery>), 
c.c_customer_sk, c.c_customer_id, c.c_current_cdemo_sk, c.c_current_hdemo_sk, 
c.c_current_addr_sk, c.c_first_shipto_date_sk, c.c_first_sales_date_sk, 
c.c_salutation, c.c_first_name, c.c_last_name, c.c_preferred_cust_flag, 
c.c_birth_day, c.c_birth_month, c.c_birth_year, c.c_birth_country, c.c_login, 
c.c_email_address, c.c_last_review_date_sk, ca.ca_address_sk, ca.ca_address_id, 
ca.ca_street_number, ca.ca_street_name, ca.ca_street_type, ca.ca_suite_number, 
ca.ca_city, ca.ca_county, ca.ca_state, ca.ca_zip, ca.ca_country, 
ca.ca_gmt_offset, ca.ca_location_type, customer_demographics.cd_demo_sk, 
customer_demographics.cd_gender, customer_demographics.cd_marital_status, 
customer_demographics.cd_education_status, 
customer_demographics.cd_purchase_estimate, 
customer_demographics.cd_credit_rating, customer_demographics.cd_dep_count, 
customer_demographics.cd_dep_employed_count, 
customer_demographics.cd_dep_college_count
                   Subquery:
                     Projection: web_sales.ws_sold_date_sk, 
web_sales.ws_sold_time_sk, web_sales.ws_ship_date_sk, web_sales.ws_item_sk, 
web_sales.ws_bill_customer_sk, web_sales.ws_bill_cdemo_sk, 
web_sales.ws_bill_hdemo_sk, web_sales.ws_bill_addr_sk, 
web_sales.ws_ship_customer_sk, web_sales.ws_ship_cdemo_sk, 
web_sales.ws_ship_hdemo_sk, web_sales.ws_ship_addr_sk, 
web_sales.ws_web_page_sk, web_sales.ws_web_site_sk, web_sales.ws_ship_mode_sk, 
web_sales.ws_warehouse_sk, web_sales.ws_promo_sk, web_sales.ws_order_number, 
web_sales.ws_quantity, web_sales.ws_wholesale_cost, web_sales.ws_list_price, 
web_sales.ws_sales_price, web_sales.ws_ext_discount_amt, 
web_sales.ws_ext_sales_price, web_sales.ws_ext_wholesale_cost, 
web_sales.ws_ext_list_price, web_sales.ws_ext_tax, web_sales.ws_coupon_amt, 
web_sales.ws_ext_ship_cost, web_sales.ws_net_paid, 
web_sales.ws_net_paid_inc_tax, web_sales.ws_net_paid_inc_ship, 
web_sales.ws_net_paid_inc_ship_tax, web_sales.ws_net_profit, 
date_dim.d_date_sk, date_dim.d
 _date_id, date_dim.d_date, date_dim.d_month_seq, date_dim.d_week_seq, 
date_dim.d_quarter_seq, date_dim.d_year, date_dim.d_dow, date_dim.d_moy, 
date_dim.d_dom, date_dim.d_qoy, date_dim.d_fy_year, date_dim.d_fy_quarter_seq, 
date_dim.d_fy_week_seq, date_dim.d_day_name, date_dim.d_quarter_name, 
date_dim.d_holiday, date_dim.d_weekend, date_dim.d_following_holiday, 
date_dim.d_first_dom, date_dim.d_last_dom, date_dim.d_same_day_ly, 
date_dim.d_same_day_lq, date_dim.d_current_day, date_dim.d_current_week, 
date_dim.d_current_month, date_dim.d_current_quarter, date_dim.d_current_year
                       Filter: c.c_customer_sk = web_sales.ws_bill_customer_sk 
AND web_sales.ws_sold_date_sk = date_dim.d_date_sk AND CAST(date_dim.d_year AS 
Int64) = Int64(2002) AND CAST(date_dim.d_moy AS Int64) BETWEEN Int64(4) AND 
Int64(4) + Int64(3)
                         CrossJoin:
                           TableScan: web_sales
                           TableScan: date_dim
                   LeftSemi Join: c.c_customer_sk = store_sales.ss_customer_sk
                     Filter: customer_demographics.cd_demo_sk = 
c.c_current_cdemo_sk
                       CrossJoin:
                         Filter: c.c_current_addr_sk = ca.ca_address_sk
                           CrossJoin:
                             SubqueryAlias: c
                               Filter: EXISTS (<subquery>) OR EXISTS 
(<subquery>)
                                 Subquery:
                                   Projection: web_sales.ws_sold_date_sk, 
web_sales.ws_sold_time_sk, web_sales.ws_ship_date_sk, web_sales.ws_item_sk, 
web_sales.ws_bill_customer_sk, web_sales.ws_bill_cdemo_sk, 
web_sales.ws_bill_hdemo_sk, web_sales.ws_bill_addr_sk, 
web_sales.ws_ship_customer_sk, web_sales.ws_ship_cdemo_sk, 
web_sales.ws_ship_hdemo_sk, web_sales.ws_ship_addr_sk, 
web_sales.ws_web_page_sk, web_sales.ws_web_site_sk, web_sales.ws_ship_mode_sk, 
web_sales.ws_warehouse_sk, web_sales.ws_promo_sk, web_sales.ws_order_number, 
web_sales.ws_quantity, web_sales.ws_wholesale_cost, web_sales.ws_list_price, 
web_sales.ws_sales_price, web_sales.ws_ext_discount_amt, 
web_sales.ws_ext_sales_price, web_sales.ws_ext_wholesale_cost, 
web_sales.ws_ext_list_price, web_sales.ws_ext_tax, web_sales.ws_coupon_amt, 
web_sales.ws_ext_ship_cost, web_sales.ws_net_paid, 
web_sales.ws_net_paid_inc_tax, web_sales.ws_net_paid_inc_ship, 
web_sales.ws_net_paid_inc_ship_tax, web_sales.ws_net_profit, date_dim.d_date_
 sk, date_dim.d_date_id, date_dim.d_date, date_dim.d_month_seq, 
date_dim.d_week_seq, date_dim.d_quarter_seq, date_dim.d_year, date_dim.d_dow, 
date_dim.d_moy, date_dim.d_dom, date_dim.d_qoy, date_dim.d_fy_year, 
date_dim.d_fy_quarter_seq, date_dim.d_fy_week_seq, date_dim.d_day_name, 
date_dim.d_quarter_name, date_dim.d_holiday, date_dim.d_weekend, 
date_dim.d_following_holiday, date_dim.d_first_dom, date_dim.d_last_dom, 
date_dim.d_same_day_ly, date_dim.d_same_day_lq, date_dim.d_current_day, 
date_dim.d_current_week, date_dim.d_current_month, date_dim.d_current_quarter, 
date_dim.d_current_year
                                     Filter: c.c_customer_sk = 
web_sales.ws_bill_customer_sk AND web_sales.ws_sold_date_sk = 
date_dim.d_date_sk AND CAST(date_dim.d_year AS Int64) = Int64(2002) AND 
CAST(date_dim.d_moy AS Int64) BETWEEN Int64(4) AND Int64(4) + Int64(3)
                                       CrossJoin:
                                         TableScan: web_sales
                                         TableScan: date_dim
                                 Subquery:
                                   Projection: web_sales.ws_sold_date_sk, 
web_sales.ws_sold_time_sk, web_sales.ws_ship_date_sk, web_sales.ws_item_sk, 
web_sales.ws_bill_customer_sk, web_sales.ws_bill_cdemo_sk, 
web_sales.ws_bill_hdemo_sk, web_sales.ws_bill_addr_sk, 
web_sales.ws_ship_customer_sk, web_sales.ws_ship_cdemo_sk, 
web_sales.ws_ship_hdemo_sk, web_sales.ws_ship_addr_sk, 
web_sales.ws_web_page_sk, web_sales.ws_web_site_sk, web_sales.ws_ship_mode_sk, 
web_sales.ws_warehouse_sk, web_sales.ws_promo_sk, web_sales.ws_order_number, 
web_sales.ws_quantity, web_sales.ws_wholesale_cost, web_sales.ws_list_price, 
web_sales.ws_sales_price, web_sales.ws_ext_discount_amt, 
web_sales.ws_ext_sales_price, web_sales.ws_ext_wholesale_cost, 
web_sales.ws_ext_list_price, web_sales.ws_ext_tax, web_sales.ws_coupon_amt, 
web_sales.ws_ext_ship_cost, web_sales.ws_net_paid, 
web_sales.ws_net_paid_inc_tax, web_sales.ws_net_paid_inc_ship, 
web_sales.ws_net_paid_inc_ship_tax, web_sales.ws_net_profit, date_dim.d_date_
 sk, date_dim.d_date_id, date_dim.d_date, date_dim.d_month_seq, 
date_dim.d_week_seq, date_dim.d_quarter_seq, date_dim.d_year, date_dim.d_dow, 
date_dim.d_moy, date_dim.d_dom, date_dim.d_qoy, date_dim.d_fy_year, 
date_dim.d_fy_quarter_seq, date_dim.d_fy_week_seq, date_dim.d_day_name, 
date_dim.d_quarter_name, date_dim.d_holiday, date_dim.d_weekend, 
date_dim.d_following_holiday, date_dim.d_first_dom, date_dim.d_last_dom, 
date_dim.d_same_day_ly, date_dim.d_same_day_lq, date_dim.d_current_day, 
date_dim.d_current_week, date_dim.d_current_month, date_dim.d_current_quarter, 
date_dim.d_current_year
                                     Filter: c.c_customer_sk = 
web_sales.ws_bill_customer_sk AND web_sales.ws_sold_date_sk = 
date_dim.d_date_sk AND CAST(date_dim.d_year AS Int64) = Int64(2002) AND 
CAST(date_dim.d_moy AS Int64) BETWEEN Int64(4) AND Int64(4) + Int64(3)
                                       CrossJoin:
                                         TableScan: web_sales
                                         TableScan: date_dim
                                 TableScan: customer
   
   
   </details>
   
   We could see from the second snippet that the filter `EXISTS (<subquery>) OR 
EXISTS (<subquery>)` pushed below `Subqueryalias: c`. The column name in the 
subquery of `EXISTS` is not changed from `c` to `customer`.
   
   #### The fix?
    I would like to discuss the fix here. I don't think we can change `impl 
ExprRewriter for Expr` as it's used in many places. May be write an exclusive 
rewriter for `push_down_filter` which rewrites subqueries?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to