sagarlakshmipathy opened a new issue, #180:
URL: https://github.com/apache/arrow-datafusion-comet/issues/180

   ### Describe the bug
   
   While running Comet with OSS Spark, I noticed WARN messages on some 
queries saying that `Comet native execution is disabled`. I would like to 
understand why.
   
   Here's the execution log:
   ```
   
====================================================================================================
   RUNNING: Query # 15 (round 1) (1 statements)
   
----------------------------------------------------------------------------------------------------
   24/03/09 23:16:27 WARN QueryPlanSerde: Comet native execution is disabled due to: unsupported Spark expression: 'might_contain(Subquery subquery#8915, [id=#74608], xxhash64(cs_sold_date_sk#277, 42))' of class 'org.apache.spark.sql.catalyst.expressions.BloomFilterMightContain
   24/03/09 23:16:27 WARN QueryPlanSerde: Comet native execution is disabled due to: unsupported Spark expression: 'might_contain(Subquery subquery#8915, [id=#74608], xxhash64(cs_sold_date_sk#277, 42))' of class 'org.apache.spark.sql.catalyst.expressions.BloomFilterMightContain
   24/03/09 23:16:27 WARN DAGScheduler: Broadcasting large task binary with size 1047.8 KiB
   24/03/09 23:16:33 WARN DAGScheduler: Broadcasting large task binary with size 1096.7 KiB
   24/03/09 23:16:33 WARN DAGScheduler: Broadcasting large task binary with size 1143.9 KiB
   24/03/09 23:16:35 WARN DAGScheduler: Broadcasting large task binary with size 1131.6 KiB
   Time taken: 8596 ms                                                          
   
   
----------------------------------------------------------------------------------------------------
   FINISHED: Query # 15 (round 1)
   
====================================================================================================
   ```
   
   Here's the query itself:
   ```
   --TPC-DS Q15
   select  ca_zip
          ,sum(cs_sales_price)
    from catalog_sales
        ,customer
        ,customer_address
        ,date_dim
    where cs_bill_customer_sk = c_customer_sk
        and c_current_addr_sk = ca_address_sk 
        and ( substr(ca_zip,1,5) in ('85669', '86197','88274','83405','86475',
                                      '85392', '85460', '80348', '81792')
              or ca_state in ('CA','WA','GA')
              or cs_sales_price > 500)
        and cs_sold_date_sk = d_date_sk
        and d_qoy = 2 and d_year = 2002
    group by ca_zip
    order by ca_zip
    limit 100;
   ```
   
   Regardless, I could see that the queries ran faster.
   
   ### Steps to reproduce
   
   1. Run the TPC-DS query test; running only query 15 is enough.
   
   Apologies for the minimal steps. Fortunately, that's all that's needed.
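   
   In case it helps narrow this down: the unsupported expression class in the 
   warning, `BloomFilterMightContain`, is what Spark's runtime bloom filter 
   join optimization injects into a plan. As a sketch only (not run as part of 
   this report, and assuming that optimization is what adds `might_contain` 
   to Q15), turning it off before rerunning the query should show whether the 
   warning goes away:
   
   ```sql
   -- Assumption: might_contain is injected by Spark's runtime bloom filter rule.
   SET spark.sql.optimizer.runtime.bloomFilter.enabled=false;
   -- Then rerun TPC-DS Q15 in the same session and check the log for the WARN.
   ```
   
   This only disables the bloom filter rewrite for the session; it does not 
   make Comet support the expression.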
   
   ### Expected behavior
   
   No WARN messages
   
   ### Additional context
   
   This only happened for some queries. For example, Q46 ran without any issues.
   
   ```
   
====================================================================================================
   RUNNING: Query # 46 (round 1) (1 statements)
   
----------------------------------------------------------------------------------------------------
   Time taken: 18658 ms                                                         
   ]
   
----------------------------------------------------------------------------------------------------
   FINISHED: Query # 46 (round 1)
   
====================================================================================================
   ```
   
   ```
   --TPC-DS Q46
   select  c_last_name
          ,c_first_name
          ,ca_city
          ,bought_city
          ,ss_ticket_number
          ,amt,profit 
    from
      (select ss_ticket_number
             ,ss_customer_sk
             ,ca_city bought_city
             ,sum(ss_coupon_amt) amt
             ,sum(ss_net_profit) profit
       from store_sales,date_dim,store,household_demographics,customer_address 
       where store_sales.ss_sold_date_sk = date_dim.d_date_sk
       and store_sales.ss_store_sk = store.s_store_sk  
       and store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk
       and store_sales.ss_addr_sk = customer_address.ca_address_sk
       and (household_demographics.hd_dep_count = 3 or
            household_demographics.hd_vehicle_count= 1)
       and date_dim.d_dow in (6,0)
       and date_dim.d_year in (1999,1999+1,1999+2) 
       and store.s_city in ('Midway','Fairview','Fairview','Midway','Fairview') 
       group by ss_ticket_number,ss_customer_sk,ss_addr_sk,ca_city) dn,customer,customer_address current_addr
       where ss_customer_sk = c_customer_sk
         and customer.c_current_addr_sk = current_addr.ca_address_sk
         and current_addr.ca_city <> bought_city
     order by c_last_name
             ,c_first_name
             ,ca_city
             ,bought_city
             ,ss_ticket_number
     limit 100;
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
