fqaiser94 commented on pull request #27066:
URL: https://github.com/apache/spark/pull/27066#issuecomment-640747267
@cloud-fan
Thanks for your PR. I've merged your changes into this branch with some
modifications in order to pass the tests.
> Remove the SQL API. It's unclear to me if we should add a special syntax
for it, or simply treat it as a function. We can add the SQL API later.
I tried to implement special syntax for it before but it involved a lot of
changes and I didn't think the added complexity was worth it over simply having
a function.
Anyway, I think adding the SQL API later is the correct decision if we're
unsure about it right now and it helps keep this PR small.
> return null if the struct is null. I think the user-facing behavior
matters more than the internal implementation. I don't think it's a big problem
to have deeply nested if-else expressions. It's similar to the encoder
expressions.
While the code will now give this behaviour (`return null if struct is
null`), I'm still not entirely convinced the way this is currently implemented
is a good idea.
Looking at the generated physical plan, it looks like Spark would be doing a
lot of unnecessary work for **nullable** struct columns.
In the following example, I add 2 columns to a nullable struct.
```
nullStructLevel1.withColumn("a", 'a.withField("b", lit(2)).withField("d",
lit(4))).explain()
// == Physical Plan ==
// *(1) Project [if (isnull(if (isnull(a#6)) null else named_struct(a,
a#6.a, b, 2, c, a#6.c))) null else named_struct(a, if (isnull(a#6)) null else
named_struct(a, a#6.a, b, 2, c, a#6.c).a, b, if (isnull(a#6)) null else
named_struct(a, a#6.a, b, 2, c, a#6.c).b, c, if (isnull(a#6)) null else
named_struct(a, a#6.a, b, 2, c, a#6.c).c, d, 4) AS a#8]
// +- *(1) Scan ExistingRDD[a#6]
```
As you can see, the `If(IsNull(_), Literal(null, _), CreateNamedStruct(_))`
check has to happen several times in this simple example unnecessarily.
This quickly gets out of hand the more fields you add.
For example, if I add 3 columns to the same nullable struct:
```
nullStructLevel1.withColumn("a", 'a.withField("b", lit(2)).withField("d",
lit(4)).withField("e", lit(5))).explain()
// == Physical Plan ==
// *(1) Project [if (isnull(if (isnull(if (isnull(a#6)) null else
named_struct(a, a#6.a, b, 2, c, a#6.c))) null else named_struct(a, if
(isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).a, b, if
(isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).b, c, if
(isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c).c, d, 4))) null
else named_struct(a, if (isnull(if (isnull(a#6)) null else named_struct(a,
a#6.a, b, 2, c, a#6.c))) null else named_struct(a, if (isnull(a#6)) null else
named_struct(a, a#6.a, b, 2, c, a#6.c).a, b, if (isnull(a#6)) null else
named_struct(a, a#6.a, b, 2, c, a#6.c).b, c, if (isnull(a#6)) null else
named_struct(a, a#6.a, b, 2, c, a#6.c).c, d, 4).a, b, if (isnull(if
(isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c))) null else
named_struct(a, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c,
a#6.c).a, b, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c,
a#6.c).b, c, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c,
a#6.c).c, d, 4).b, c, if (isnull(if (isnull(a#6)) null else named_struct(a,
a#6.a, b, 2, c, a#6.c))) null else named_struct(a, if (isnull(a#6)) null else
named_struct(a, a#6.a, b, 2, c, a#6.c).a, b, if (isnull(a#6)) null else
named_struct(a, a#6.a, b, 2, c, a#6.c).b, c, if (isnull(a#6)) null else
named_struct(a, a#6.a, b, 2, c, a#6.c).c, d, 4).c, d, if (isnull(if
(isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c, a#6.c))) null else
named_struct(a, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c,
a#6.c).a, b, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c,
a#6.c).b, c, if (isnull(a#6)) null else named_struct(a, a#6.a, b, 2, c,
a#6.c).c, d, 4).d, e, 5) AS a#10]
// +- *(1) Scan ExistingRDD[a#6]
```
Do you still think this is not an issue?
If we're adamant on having this `return null if the struct is null`
behaviour, I'll try to think of ways to fix this (whilst still reusing
`CreateNamedStruct` Expression).
There might be a possibility of writing optimizer rules that can simplify
nested if-else expressions of this sort. One simple rule I can think of is as
follows:
```
object RewriteIsNullIfIsNull extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
case IsNull(If(IsNull(a), Literal(null, _), b)) if !b.nullable =>
IsNull(a)
}
}
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]