GitHub user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/446#issuecomment-40854164
A few high-level comments:
- I'm not sure stateful UDFs are actually something we want to support.
The semantics for them are not well defined in partitioned systems, especially
where the optimizer decides the partitioning. If you want something like a row ID,
there are already ways to do this with mapPartitionsWithIndex.
- The deferred evaluation class seems like a complicated way to get
short-circuit evaluation. In a lot of cases, can't we just change the order in
which we call the existing eval method? Adding a new interface complicates things,
and in some simple benchmarks that I ran this code is actually slower than what
was there before (probably because of the extra object allocations).
- There are also a lot of unrelated changes here. Fixing a minor
spelling error or something similar is okay, but a whole bunch of unrelated
changes makes the PR more difficult for us to review. For example, maybe you
can do the data type additions for Hive UDFs in their own PR.
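
To illustrate the concern about stateful UDFs, here is a minimal sketch in plain Python (not Spark code) that models partitions as lists. All names in it (`make_stateful_row_id`, `run_stateful`, `map_partitions_with_index`, `tag_rows`) are hypothetical helpers written for this illustration; `map_partitions_with_index` only mimics the shape of Spark's `mapPartitionsWithIndex`. It assumes each partition gets its own copy of the UDF, as each task would in a distributed run.

```python
from typing import Callable, Iterator, List, Tuple

Row = str
Partition = List[Row]


def make_stateful_row_id() -> Callable[[Row], Tuple[int, Row]]:
    """Hypothetical stateful UDF: every copy starts with a fresh counter."""
    counter = 0

    def udf(row: Row) -> Tuple[int, Row]:
        nonlocal counter
        counter += 1
        return (counter, row)

    return udf


def run_stateful(partitions: List[Partition]) -> List[Tuple[int, Row]]:
    # Assumption: each partition is processed with its own copy of the UDF,
    # so the counter state is never shared across partitions.
    out: List[Tuple[int, Row]] = []
    for part in partitions:
        udf = make_stateful_row_id()
        out.extend(udf(row) for row in part)
    return out


def map_partitions_with_index(partitions, f):
    # Stand-in for the real operator: f receives the partition index
    # and an iterator over that partition's rows.
    out = []
    for index, part in enumerate(partitions):
        out.extend(f(index, iter(part)))
    return out


def tag_rows(index: int, rows: Iterator[Row]):
    # (partition index, offset within partition) is unique without shared state.
    for offset, row in enumerate(rows):
        yield ((index, offset), row)


rows = ["a", "b", "c", "d"]
two_way = [rows[:2], rows[2:]]  # same data, two partitions
one_way = [rows]                # same data, one partition

stateful_two = run_stateful(two_way)
stateful_one = run_stateful(one_way)
tagged = map_partitions_with_index(two_way, tag_rows)
```

Under these assumptions the stateful counter hands out different IDs depending on how the same rows happen to be partitioned, and repeats IDs across partitions, while the index-based tags stay unique because they depend only on the partition index and the position within it.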
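
The short-circuit point can be sketched the same way. This is a toy expression tree in plain Python, not the actual Spark SQL classes: `Expression`, `Column`, and `And` are hypothetical stand-ins, and the sketch ignores SQL NULL semantics. It only shows that ordering the calls to an existing `eval` method is enough to skip the right-hand side, without a separate deferred-evaluation interface.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


class Expression:
    def eval(self, row: Row) -> Any:
        raise NotImplementedError


class Column(Expression):
    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log  # records each evaluation so the call order is observable

    def eval(self, row: Row) -> Any:
        self.log.append(self.name)
        return row[self.name]


class And(Expression):
    def __init__(self, left: Expression, right: Expression) -> None:
        self.left = left
        self.right = right

    def eval(self, row: Row) -> bool:
        # Evaluate the left child first; only call right.eval when it can
        # still change the result. No wrapper objects are allocated.
        if not self.left.eval(row):
            return False
        return bool(self.right.eval(row))


log: List[str] = []
expr = And(Column("a", log), Column("b", log))

result_false = expr.eval({"a": False, "b": True})
calls_after_false = list(log)  # snapshot of evaluations so far

result_true = expr.eval({"a": True, "b": True})
calls_after_true = list(log)
```

When the left child is false, the right child is never evaluated; when it is true, both are evaluated in order. Whether this is faster than the deferred-evaluation approach in the PR would have to be measured.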