caican00 opened a new pull request, #37479:
URL: https://github.com/apache/spark/pull/37479

   ### Why are the changes needed?
   Consider the following query:

   SELECT id, data FROM testcat.ns1.ns2.table
   WHERE id = 2
   AND md5(data) = '8cde774d6f7333752ed72cacddb05126'
   AND trim(data) = 'a'

   For this SQL, the filters currently appear in the following order:
   
   // `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND 
(trim(data#23, None) = a)` comes before `(id#22L = 2)`
   == Physical Plan ==
   *(1) Project [id#22L, data#23]
    +- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND 
(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND 
(trim(data#23, None) = a)) AND (id#22L = 2))
       +- BatchScan[id#22L, data#23] class 
org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan
   With this predicate order, the `md5` and `trim` expressions are evaluated 
for every row, even for rows that the later, cheaper predicate `id = 2` would 
reject, which can make Spark tasks run slowly.
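
To make the cost difference concrete, here is a minimal plain-Python sketch (not Spark code). It assumes the conjuncts of a filter are evaluated left to right and that evaluation stops at the first one that is false. The `rows` data and the `count_md5_calls` helper are hypothetical and exist only for this illustration.

```python
import hashlib

# Hypothetical data: 1000 rows of (id, data); only one row has id == 2.
rows = [(i, "a") for i in range(1000)]
TARGET = hashlib.md5(b"a").hexdigest()


def count_md5_calls(id_first):
    """Run the filter over all rows; return (matching_rows, md5_calls)."""
    calls = 0

    def md5_pred(data):
        # Expensive predicate: counts how often it is actually evaluated.
        nonlocal calls
        calls += 1
        return hashlib.md5(data.encode()).hexdigest() == TARGET

    def id_pred(row_id):
        # Cheap predicate: a plain comparison.
        return row_id == 2

    matches = 0
    for row_id, data in rows:
        if id_first:
            ok = id_pred(row_id) and md5_pred(data)
        else:
            ok = md5_pred(data) and id_pred(row_id)
        matches += ok
    return matches, calls


print(count_md5_calls(id_first=False))  # md5 evaluated for every row
print(count_md5_calls(id_first=True))   # md5 evaluated only where id == 2
```

Both orders return the same single matching row, but with the cheap predicate first the hash is computed once instead of 1000 times.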
   
    
   
   So I think that predicates which require expensive evaluation should 
automatically be placed at the far right, so that rows rejected by the cheaper 
predicates never reach them.
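
The reordering idea can be sketched in plain Python as a stable sort over the conjuncts. This is only an illustration under stated assumptions, not the rule implemented in this patch: the `Predicate` class, the `expensive` flag, and the `reorder` function are hypothetical names, and marking a predicate as expensive when it calls a function such as `md5` or `trim` is an assumed heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    sql: str         # textual form of the conjunct
    expensive: bool  # assumed heuristic: True if it calls a function like md5/trim


def reorder(conjuncts):
    """Move expensive conjuncts to the right, keeping relative order otherwise.

    sorted() is stable and False sorts before True, so cheap predicates stay
    in their original order and end up ahead of the expensive ones.
    """
    return sorted(conjuncts, key=lambda p: p.expensive)


conjuncts = [
    Predicate("isnotnull(data)", False),
    Predicate("isnotnull(id)", False),
    Predicate("md5(cast(data as binary)) = '8cde774d6f7333752ed72cacddb05126'", True),
    Predicate("trim(data) = 'a'", True),
    Predicate("id = 2", False),
]

print(" AND ".join(p.sql for p in reorder(conjuncts)))
```

A stable sort is used so that predicates of the same cost class keep the order the optimizer gave them, which limits the change to moving the costly ones rightward.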
   
    
   
   As shown below:
   
   // `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) = 
8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)`
   == Physical Plan ==
   *(1) Project [id#22L, data#23]
    +- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND 
(id#22L = 2)) AND (md5(cast(data#23 as binary)) = 
8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a))
       +- BatchScan[id#22L, data#23] class 
org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan
   
   ### How was this patch tested?
   1. Added a new test.
   2. Manual test: the stage execution time for reading data dropped from 
5min+ to 24s.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

