Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/482#issuecomment-41753498
I think the problem with the previous implementation is that the change is much
larger and therefore harder to understand. (It took me a while to review it
and find bugs; there were quite a few, and I'm sure I didn't find them all.)
A lot of concepts _could_ be added to the expression hierarchy, but the
goal of Catalyst is instead to allow self-contained rules to perform
optimizations. By keeping these optimization rules self-contained, it is much
easier to look in one place and reason about their correctness. Making the
rule simple while pushing all of the complex logic into many different places in
the code is not the goal here. (A sketch of what such a rule looks like is below.)
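For concreteness, here is a minimal sketch of a self-contained null-propagation rule in the spirit of Catalyst's `NullPropagation`; the package paths follow the Spark 1.x layout (an assumption), and the rule body is deliberately simplified, covering far fewer cases than the real optimizer does:

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.types.BooleanType

// A self-contained rule: all of the null-folding logic lives in one place,
// rather than being spread across the expression hierarchy.
object SimpleNullPropagation extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // A null literal is always null, so IsNull folds to true.
    case IsNull(Literal(null, _)) => Literal(true, BooleanType)
    // Adding null to anything yields null; fold to a typed null literal.
    case e @ Add(Literal(null, _), _) => Literal(null, e.dataType)
    case e @ Add(_, Literal(null, _)) => Literal(null, e.dataType)
  }
}
```

Because the whole rule fits in one pattern match, reviewing it for correctness means reading this one object, not auditing every expression class for how it handles nulls.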
The only reason nullability is modeled in the expression hierarchy at all
is that it is a fairly fundamental concept: it is part of the schema of
the data.
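For instance, nullability surfaces directly on each field of a relation's schema (again using the Spark 1.x-era package path as an assumption):

```scala
import org.apache.spark.sql.catalyst.types._

// Nullability is carried per field in the schema itself, which is why the
// expression hierarchy has to model it at all.
val schema = StructType(Seq(
  StructField("id",   IntegerType, nullable = false),
  StructField("name", StringType,  nullable = true)))
```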
Regarding HiveGenericUdf, I'm not sure how the three-state solution allows
us to do any extra folding. The UDF is still a black box, and we have no idea
how it's going to respond to null inputs.
Regarding your last point about code generation, we don't need three
states. With the current two-state implementation, we already know whether we
need to track a nullable bit when doing code generation. If it is possible to
statically determine a value to be null, that will have already been done by
the optimization rules, and we will just have a null literal. All we need is a
special rule that generates the AST for `Literal(null, _)`. So, actually, I
think the code generation logic is simpler with this implementation.
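To illustrate, here is a hedged sketch of what that single special case might look like in a hand-rolled code generator; `GeneratedExpr` and `exprToCode` are hypothetical names for illustration only, not Spark's actual codegen API:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}

// Hypothetical representation of generated code for one expression:
// a fragment computing the null bit, and a fragment computing the value.
case class GeneratedExpr(isNullCode: String, valueCode: String)

def exprToCode(e: Expression): GeneratedExpr = e match {
  // The one special case discussed above: a statically-null literal is
  // known null at compile time, so no runtime nullable bit is needed.
  case Literal(null, _) => GeneratedExpr(isNullCode = "true", valueCode = "null")
  case Literal(v, _)    => GeneratedExpr(isNullCode = "false", valueCode = v.toString)
  // ... all other expression kinds would be handled here ...
  case other => sys.error(s"Codegen not implemented for: $other")
}
```

Every other case can assume its children were already folded by the optimizer, which is why two states suffice.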