j-esse commented on a change in pull request #23556: [SPARK-26626][SQL] Maximum
size for repeatedly substituted aliases in SQL expressions
URL: https://github.com/apache/spark/pull/23556#discussion_r268004208
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -658,7 +658,8 @@ object CollapseProject extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
Review comment:
@HyukjinKwon the problem isn't that processing a huge tree causes an OOM. It's
that the user can write a small tree that seems very reasonable to execute, but
under the hood the optimiser turns it into a huge tree that OOMs. The user
doesn't know beforehand that this will happen, so they can't disable the rule
in advance. It takes a lot of debugging, looking through stack traces, etc., to
identify that the OOM is caused by CollapseProject and that you can disable it.
Also, we typically run many different queries within a Spark session, and
wouldn't want to disable CollapseProject for all of them.

This change means that we can still run CollapseProject; we just don't
substitute overly large aliases. In the types of query we had problems with,
it will collapse the query until the aliases get too large, and then stop.
So we still apply CollapseProject to every query; we just stop substituting
any alias that gets too large.

`spark.sql.maxRepeatedAliasSize` just sets the size of alias tree that is
considered too large to efficiently substitute multiple times. The default
value of `100` was chosen after some basic testing to find the best perf
balance (see charts at top), but I'm happy to tweak this if you don't think
it's appropriate.
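
To make the blow-up concrete, here is a toy sketch of the behaviour described above. It is not Catalyst code and not the implementation in this PR: `Expr`, `Leaf`, `Ref`, `Add`, `substitute` and `collapsedSizes` are hypothetical names, "size" is assumed to mean a simple node count, and the cap is applied to every alias rather than only to repeatedly referenced ones. It stacks projections of the form `a_k = a_(k-1) + a_(k-1)` and inlines each alias into the projection above it, optionally refusing to inline definitions larger than `maxSize`.

```scala
// Toy model only: these types are hypothetical and are not Catalyst classes.
sealed trait Expr
case class Leaf(name: String) extends Expr           // a column of the base relation
case class Ref(name: String) extends Expr            // a reference to an alias from the Project below
case class Add(left: Expr, right: Expr) extends Expr

object AliasBlowupSketch {
  // Number of nodes in the expression tree.
  def size(e: Expr): Long = e match {
    case Add(l, r) => 1 + size(l) + size(r)
    case _         => 1
  }

  // Inline alias definitions into `e`, but leave a reference untouched when
  // its definition has more than `maxSize` nodes (assumed meaning of the cap).
  def substitute(e: Expr, aliases: Map[String, Expr], maxSize: Long): Expr = e match {
    case Ref(n)    => aliases.get(n).filter(size(_) <= maxSize).getOrElse(e)
    case Add(l, r) => Add(substitute(l, aliases, maxSize), substitute(r, aliases, maxSize))
    case other     => other
  }

  // Stack `depth` projections where a_k = a_(k-1) + a_(k-1), collapsing
  // bottom-up, and report the size of each alias definition along the way.
  def collapsedSizes(depth: Int, maxSize: Long): Seq[Long] = {
    val start: Expr = Leaf("x")
    (1 to depth).scanLeft(start) { (below, k) =>
      val prev = s"a${k - 1}"
      substitute(Add(Ref(prev), Ref(prev)), Map(prev -> below), maxSize)
    }.map(size)
  }

  def main(args: Array[String]): Unit = {
    println(collapsedSizes(20, Long.MaxValue).last) // uncapped: 2097151 nodes
    println(collapsedSizes(20, 100L).max)           // capped at 100: never above 127 nodes
  }
}
```

In this model, 20 small stacked projections collapse into an expression of about two million nodes when nothing is capped, whereas with a cap of 100 no definition ever exceeds 127 nodes, because substitution simply stops for one level and then resumes.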