Github user nongli commented on a diff in the pull request:
https://github.com/apache/spark/pull/10444#discussion_r48503899
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ---
@@ -47,6 +48,34 @@ trait Predicate extends Expression {
   override def dataType: DataType = BooleanType
 }
+object Predicate extends PredicateHelper {
+  def toCNF(predicate: Expression, maybeThreshold: Option[Double] = None): Expression = {
+    val cnf = new CNFExecutor(predicate).execute(predicate)
+    val threshold = maybeThreshold.map(predicate.size * _).getOrElse(Double.MaxValue)
+    if (cnf.size > threshold) predicate else cnf
--- End diff ---
I disagree with 1. I don't see why it matters whether it is all CNF or none. I
think the heuristic we want is something like "maximize the number of simple
predicates that are in CNF form", where a simple predicate is one that
contains just one attribute or is a binary predicate between two attributes.
These are the candidates that benefit from further optimization.
We could try cost-basing this, or just stop the expansion after some amount;
see the sketch below.
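Purely to illustrate the "stop the expansion after some amount" option, here is
a minimal standalone sketch. Everything in it (`Expr`, `Leaf`, `BoundedCNF`,
the `budget` parameter) is a hypothetical stand-in, not Catalyst's
`Expression`/`CNFExecutor` API: each OR-over-AND distribution is charged
against a budget, and once the budget is spent the remaining subtree is left
untouched, so already-simple conjuncts still end up in CNF even when a full
conversion would blow up exponentially.

```scala
// Toy expression ADT standing in for Catalyst expressions (hypothetical).
sealed trait Expr {
  def size: Int = this match {
    case And(l, r) => 1 + l.size + r.size
    case Or(l, r)  => 1 + l.size + r.size
    case Not(e)    => 1 + e.size
    case Leaf(_)   => 1
  }
}
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr
case class Not(child: Expr) extends Expr
case class Leaf(name: String) extends Expr // a simple predicate

object BoundedCNF {
  // Convert to CNF, charging each OR-over-AND distribution against `budget`.
  // Once the budget is exhausted, the remaining subtree is returned
  // unexpanded. Input is assumed to already be in NNF for brevity.
  def toCNF(expr: Expr, budget: Int): Expr = {
    var spent = 0
    def go(e: Expr): Expr = e match {
      case And(l, r) => And(go(l), go(r))
      case Or(l, r) =>
        (go(l), go(r)) match {
          // Distribute OR over AND only while budget remains.
          case (And(ll, lr), rr) if spent < budget =>
            spent += 1
            go(And(Or(ll, rr), Or(lr, rr)))
          case (ll, And(rl, rr)) if spent < budget =>
            spent += 1
            go(And(Or(ll, rl), Or(ll, rr)))
          case (ll, rr) => Or(ll, rr)
        }
      case other => other // Leaf or Not over a leaf: already a literal in NNF
    }
    go(expr)
  }
}

// (a AND b) OR c  becomes  (a OR c) AND (b OR c), one distribution step:
// println(BoundedCNF.toCNF(Or(And(Leaf("a"), Leaf("b")), Leaf("c")), budget = 16))
```

A cost-based variant would replace the flat budget with a per-subtree check,
e.g. only distributing when the resulting conjuncts stay "simple" in the sense
above, but the all-or-nothing threshold in the diff seems stricter than needed.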