caican00 opened a new pull request, #37479:
URL: https://github.com/apache/spark/pull/37479
### Why are the changes needed?
SELECT id, data FROM testcat.ns1.ns2.table
WHERE id = 2
AND md5(data) = '8cde774d6f7333752ed72cacddb05126'
AND trim(data) = 'a'
For this query, the filter predicates currently appear in the following order:
// `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND
// (trim(data#23, None) = a)` comes before `(id#22L = 2)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND
(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND
(trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class
org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan
With this predicate order, the expensive `md5` and `trim` predicates are
evaluated for every row, even for rows that the cheaper `id = 2` predicate
would reject, which can make Spark tasks run slowly.
I therefore think that predicates requiring expression evaluation should
automatically be placed at the far right of the condition, so that rows
rejected by the cheaper predicates are never passed to them.
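The cost difference can be illustrated outside Spark with a small sketch. It is not Spark code: it only assumes that the filter condition is a short-circuiting AND evaluated left to right. The rows, `TARGET`, and the helper `run_filter` are hypothetical names made up for this illustration.

```python
import hashlib

# Hypothetical data: 1,000 rows, only one of which has id == 2.
ROWS = [(i, "a") for i in range(1000)]
# Hash of the literal "a", computed here so that every row passes the md5 check.
TARGET = hashlib.md5(b"a").hexdigest()


def run_filter(rows, cheap_first):
    """Apply both predicates with a short-circuit AND; return (matches, md5_calls)."""
    md5_calls = 0

    def md5_matches(data):
        # Expensive predicate: hashes the value on every call.
        nonlocal md5_calls
        md5_calls += 1
        return hashlib.md5(data.encode("utf-8")).hexdigest() == TARGET

    def id_matches(row_id):
        # Cheap predicate: a plain comparison.
        return row_id == 2

    matches = []
    for row_id, data in rows:
        if cheap_first:
            keep = id_matches(row_id) and md5_matches(data)
        else:
            keep = md5_matches(data) and id_matches(row_id)
        if keep:
            matches.append((row_id, data))
    return matches, md5_calls


expensive_first = run_filter(ROWS, cheap_first=False)
cheap_first = run_filter(ROWS, cheap_first=True)
print(expensive_first[1], cheap_first[1])  # prints "1000 1"
```

Both orders return the same single row, but placing the cheap comparison first reduces the number of hash computations from one per row to one per row that already matched `id = 2`.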
As shown below:
// `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) =
// 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)`
== Physical Plan ==
*(1) Project [id#22L, data#23]
+- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND
(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND
(trim(data#23, None) = a)) AND (id#22L = 2))
+- BatchScan[id#22L, data#23] class
org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan
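The intended reordering can be sketched as a stable partition of the conjuncts. This is an illustration only, not the rule implemented in this patch: the `Predicate` class and its `expensive` flag are hypothetical, and the sketch assumes every conjunct is deterministic, so that changing the order does not change the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    sql: str         # textual form of the conjunct
    expensive: bool  # hypothetical flag: True if the conjunct wraps a function call


def reorder_conjuncts(conjuncts):
    """Return the conjuncts with cheap ones first and expensive ones last."""
    # sorted() is stable, so conjuncts with the same flag keep their relative order.
    return sorted(conjuncts, key=lambda p: p.expensive)


conjuncts = [
    Predicate("isnotnull(data)", False),
    Predicate("isnotnull(id)", False),
    Predicate("md5(cast(data as binary)) = '8cde774d6f7333752ed72cacddb05126'", True),
    Predicate("trim(data) = 'a'", True),
    Predicate("id = 2", False),
]

reordered = reorder_conjuncts(conjuncts)
print(" AND ".join(p.sql for p in reordered))
```

With this input, `id = 2` moves ahead of the `md5` and `trim` conjuncts, while the two `isnotnull` checks stay at the front.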
### How was this patch tested?
1. Added a new test.
2. Manually tested: the stage execution time for reading data dropped from
more than 5 min to 24 s.