wankunde opened a new pull request #32514: URL: https://github.com/apache/spark/pull/32514
### What changes were proposed in this pull request?
This PR tries to improve `InferFiltersFromConstraints` performance by avoiding
the generation of too many constraints.
For example:
```scala
test("Expression explosion when analyze test") {
RuleExecutor.resetMetrics()
Seq((1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
.toDF("a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
"k", "l", "m", "n")
.write.saveAsTable("test")
val df = spark.table("test")
val df2 = df.filter("a+b+c+d+e+f+g+h+i+j+k+l+m+n > 100")
val df3 = df2.select('a as 'a1, 'b as 'b1,
'c as 'c1, 'd as 'd1, 'e as 'e1, 'f as 'f1,
'g as 'g1, 'h as 'h1, 'i as 'i1, 'j as 'j1,
'k as 'k1, 'l as 'l1, 'm as 'm1, 'n as 'n1)
val df4 = df3.join(df2, df3("a1") === df2("a"))
df4.explain(true)
logWarning(RuleExecutor.dumpTimeSpent())
}
```
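The test above is slow before this PR because of combinatorial constraint growth: when a constraint references attributes that each have an alias (`a`/`a1`, `b`/`b1`, ...), substituting every occurrence by either the original attribute or its alias can produce on the order of 2^n candidate constraints for n aliased attributes. The following pure-Scala sketch (a hypothetical illustration, not Spark's actual `InferFiltersFromConstraints` implementation) shows how the candidate count grows for the 14 columns in the test:

```scala
// Hypothetical sketch of the blow-up: for each of the n attributes in a
// constraint, choose either the attribute itself or its alias. The number
// of resulting candidate constraints is 2^n.
object ConstraintExplosion {
  // Enumerate all substitutions, aliasing each attribute "x" as "x1".
  def candidateConstraints(attrs: Seq[String]): Seq[Seq[String]] =
    attrs.foldLeft(Seq(Seq.empty[String])) { (acc, a) =>
      acc.flatMap(c => Seq(c :+ a, c :+ (a + "1")))
    }

  def main(args: Array[String]): Unit = {
    val attrs = ('a' to 'n').map(_.toString) // 14 attributes, as in the test
    println(candidateConstraints(attrs).size) // 2^14 = 16384 candidates
  }
}
```

With 14 aliased columns the optimizer would have to materialize and deduplicate thousands of equivalent constraints, which is where the ~4.5 seconds in the "before" metrics below go.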
### Why are the changes needed?
Improve `InferFiltersFromConstraints` performance.
Before this PR:
```
=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 1187
Total time: 5.022786805 seconds
Rule
Effective Time / Total Time Effective Runs / Total Runs
org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints
4528820409 / 4529498144 1 / 2
org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog
0 / 38907142 0 / 13
Combined[org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$InConversion,
org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings,
org.apache.spark.sql.catalyst.analysis.DecimalPrecision,
org.apache.spark.sql.catalyst.analysis.TypeCoercion$BooleanEquality,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$FunctionArgumentConversion,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$ConcatCoercion,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$MapZipWithCoercion,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$EltCoercion,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$CaseWhenCoercion,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$IfCoercion,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$StackCoercion,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$Division,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$IntegralDivision,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$ImplicitTypeCasts,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$DateTimeOperations,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$WindowFrameCoercion,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$StringLiteralCoercion]
0 / 30035714 0 / 13
org.apache.spark.sql.execution.datasources.SchemaPruning
0 / 20202429 0 / 2
org.apache.spark.sql.execution.datasources.PreprocessTableCreation
0 / 15898208 0 / 8
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences
7497131 / 15098789 2 / 13
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations
11633805 / 13755605 1 / 13
```
After this PR:
```
=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 1187
Total time: 0.559125361 seconds
Rule
Effective Time / Total Time Effective Runs / Total Runs
org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints
44387973 / 45044872 1 / 2
org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog
0 / 40652311 0 / 13
Combined[org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$InConversion,
org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings,
org.apache.spark.sql.catalyst.analysis.DecimalPrecision,
org.apache.spark.sql.catalyst.analysis.TypeCoercion$BooleanEquality,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$FunctionArgumentConversion,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$ConcatCoercion,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$MapZipWithCoercion,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$EltCoercion,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$CaseWhenCoercion,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$IfCoercion,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$StackCoercion,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$Division,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$IntegralDivision,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$ImplicitTypeCasts,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$DateTimeOperations,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$WindowFrameCoercion,
org.apache.spark.sql.catalyst.analysis.TypeCoercionBase$StringLiteralCoercion]
0 / 30068620 0 / 13
org.apache.spark.sql.execution.datasources.SchemaPruning
0 / 20810353 0 / 2
org.apache.spark.sql.execution.datasources.PreprocessTableCreation
0 / 19485336 0 / 8
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences
8476540 / 16209891 2 / 13
org.apache.spark.sql.execution.datasources.FindDataSourceTable
10826285 / 14306609 1 / 13
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations
11935867 / 14163328 1 / 13
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests.
