[ https://issues.apache.org/jira/browse/SPARK-37392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452656#comment-17452656 ]
Josh Rosen commented on SPARK-37392:
------------------------------------
When I ran this in {{spark-shell}} it triggered an OOM in
{{LogicalPlan.constraints()}}. In the heap dump I spotted an
{{ExpressionSet}} containing over 100,000 expressions.
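One way to see this without a heap dump is to force the constraint sets from {{spark-shell}} and count them. A rough sketch (assumptions: {{reproQuery}} stands for the SQL text from the issue description below, and forcing the optimized plan on it may itself be very slow or OOM, so this needs a large heap):

{code:scala}
// Walk the optimized plan and print the size of each node's
// lazily computed constraint set (LogicalPlan.constraints).
// "reproQuery" = the SQL string from the issue description below.
val plan = spark.sql(reproQuery).queryExecution.optimizedPlan
plan.foreach { node =>
  println(s"${node.nodeName}: ${node.constraints.size} constraints")
}
{code}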
Based on the constraints that I saw, I think they were introduced by the
{{InferFiltersFromGenerate}} rule and that some unexpected rule interaction
is causing a huge blowup of derived constraints in {{PruneFilters}}. The
{{InferFiltersFromGenerate}} rule was introduced in SPARK-32295 / Spark
3.1.0, which could explain why this issue isn't reproducible in Spark 2.4.1.
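To illustrate what that rule infers (a sketch; the plan text and column IDs below are approximate, not copied from a real run):

{code:scala}
// On Spark 3.1+, InferFiltersFromGenerate should add a filter like
// (size(arr) > 0 AND isnotnull(arr)) above the Generate for an explode.
import spark.implicits._

val df = Seq(Seq(1, 2, 3)).toDF("arr").selectExpr("explode(arr) AS v")
println(df.queryExecution.optimizedPlan)
// Expected shape (approximate):
//   Generate explode(arr#0), ...
//   +- Filter ((size(arr#0, true) > 0) AND isnotnull(arr#0))
//      +- LocalRelation [arr#0]
{code}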
Looking at the constraints in the huge {{ExpressionSet}}, it looks like the
vast majority (> 99%) of the constraints are {{GreaterThan}} or {{LessThan}}
constraints of the form:
* {{GreaterThan(Size(CreateArray(...)), Literal(0))}}
* {{LessThan(Literal(0), Size(CreateArray(...)))}}
I think the {{GreaterThan}} comes from {{InferFiltersFromGenerate}} and suspect
that the {{LessThan}} equivalents are introduced via expression
canonicalization.
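That hypothesis is easy to sanity-check against the Catalyst internals (a sketch, assuming the Spark 3.1/3.2 APIs; note that if the two forms really were just canonicalization-flipped duplicates of each other, {{ExpressionSet}} should deduplicate them, so the blowup suggests the pairs must differ somewhere, e.g. in their {{CreateArray}} children):

{code:scala}
// Check whether a flipped comparison pair is semantically equal after
// canonicalization; if so, ExpressionSet keeps only one of the pair.
import org.apache.spark.sql.catalyst.expressions._

val arr  = CreateArray(Seq(Literal(1), Literal(2)))
val size = new Size(arr)  // auxiliary constructor; picks up legacySizeOfNull from the session conf
val gt   = GreaterThan(size, Literal(0))
val lt   = LessThan(Literal(0), size)

gt.semanticEquals(lt)            // expected: true (flipped comparisons canonicalize to one form)
ExpressionSet(Seq(gt, lt)).size  // expected: 1, i.e. the pair deduplicates
{code}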
We'll need to dig a bit deeper to figure out what's leading to this buildup of
duplicate constraints. Perhaps it's some sort of interaction between this
particular shape of constraint, the constraint propagation system, and
canonicalization? I'm not sure yet.
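In the meantime, disabling constraint propagation entirely should sidestep the blowup (a workaround, not a fix, since it also disables all constraint-based optimizations):

{code:scala}
// With this set, LogicalPlan.constraints() returns an empty set and
// rules such as PruneFilters no longer derive constraints.
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
{code}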
> Catalyst optimizer very time-consuming and memory-intensive with some "explode(array)"
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-37392
> URL: https://issues.apache.org/jira/browse/SPARK-37392
> Project: Spark
> Issue Type: Bug
> Components: Optimizer
> Affects Versions: 3.1.2, 3.2.0
> Reporter: Francois MARTIN
> Priority: Major
>
> The problem occurs with the simple code below:
> {code:scala}
> import session.implicits._
>
> Seq(
>   (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x",
>       "x", "x", "x", "x", "x", "x", "x", "x", "x", "x")
> ).toDF()
>   .checkpoint() // or save and reload to truncate lineage
>   .createOrReplaceTempView("sub")
>
> session.sql("""
>   SELECT *
>   FROM (
>     SELECT EXPLODE(ARRAY(*)) result
>     FROM (
>       SELECT
>         _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k,
>         _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u
>       FROM sub
>     )
>   )
>   WHERE result != ''
> """).show()
> {code}
> It takes several minutes and uses a very large amount of Java heap, when it
> should complete immediately.
> It does not occur when the single integer value (1) is replaced with a
> string value (_"x"_).
> All the time is spent in the _PruneFilters_ optimization rule.
> Not reproducible in Spark 2.4.1.