[ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994557#comment-12994557 ]
John Sichi commented on HIVE-1994:
----------------------------------

@Jonathan: good point that we need to prevent short-circuiting from causing
problems in the SELECT list too (e.g. inside of CASE/AND/OR).

Ideally, we should figure out how to make sure stateful UDF invocations get
eagerly evaluated exactly once before evaluating the entire expression; that
way short-circuiting can still be used.

We should still prohibit stateful UDFs entirely outside of the SELECT list to
avoid semantic ambiguity from other optimizations (e.g. decomposing predicates
during predicate pushdown). But I just checked SQL/OLAP, and it does allow
window functions in ORDER BY, which makes sense for reporting.

> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>
> Because Hive does not yet support window functions from SQL/OLAP, people have
> started hacking around it by writing stateful UDFs for things like
> cumulative sum. An example is row_sequence in contrib.
> To clearly mark these, I think we should add a new annotation (with separate
> semantics from the existing deterministic annotation). I'm proposing the
> name stateful for lack of a better idea, but I'm open to suggestions.
> The semantics are as follows:
> * A stateful UDF can only be used in the SELECT list, not in other clauses
> such as WHERE/ON/ORDER/GROUP
> * When a stateful UDF is present in a query, there's an implication that its
> SELECT needs to be treated as similar to TRANSFORM, i.e. when there's a
> DISTRIBUTE/CLUSTER/SORT clause, it runs inside the corresponding reducer to
> make sure that the results are as expected.
> For the first one, an example of why we need this is AND/OR short-circuiting;
> we don't want these optimizations to cause the invocation to be skipped in a
> confusing way, so we should just ban it outright (which is what SQL/OLAP does
> for window functions).
> For the second one, I'm not entirely certain about the details since some of
> it is lost in the mists of Hive prehistory, but at least if we have the
> annotation, we'll be able to preserve backwards compatibility as we start
> adding new cost-based optimizations which might otherwise break it. A
> specific example would be inserting a materialization step (e.g. for global
> query optimization) in between the DISTRIBUTE/CLUSTER/SORT and the outer
> SELECT containing the stateful UDF invocation; this could be a problem if the
> mappers in the second job subdivide the buckets generated by the first job.
> So we wouldn't do anything immediately, but the presence of the annotation
> will help us going forward.
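
To make the short-circuiting concern concrete, here is a minimal Java sketch. It does not use Hive's classes: the nested UDFType annotation is only a local stand-in for the proposed one, and RowSequence, isStateful, and counterAfterEachRow are hypothetical names chosen for illustration. It compares a short-circuited AND, where the stateful call is skipped for some rows, with eager evaluation exactly once per row.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;

class StatefulUdfSketch {

    // Local stand-in for the proposed annotation (assumption: not Hive's class).
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface UDFType {
        boolean deterministic() default true;
        boolean stateful() default false;
    }

    // Hypothetical stateful function in the spirit of row_sequence:
    // each call returns the next value of an internal counter.
    @UDFType(deterministic = false, stateful = true)
    static class RowSequence {
        private long count = 0;

        long evaluate() {
            return ++count;
        }

        long current() {
            return count;
        }
    }

    // How a planner could check the marker before applying an optimization.
    static boolean isStateful(Class<?> udfClass) {
        UDFType type = udfClass.getAnnotation(UDFType.class);
        return type != null && type.stateful();
    }

    // Evaluates "flag AND seq() > 0" for each row and records the counter
    // value after that row. With eager == false, Java's && skips the call
    // whenever the flag is false, so the counter falls behind the row number.
    static long[] counterAfterEachRow(boolean[] flags, boolean eager) {
        RowSequence seq = new RowSequence();
        long[] counter = new long[flags.length];
        for (int i = 0; i < flags.length; i++) {
            boolean result;
            if (eager) {
                long n = seq.evaluate();                  // exactly once per row
                result = flags[i] && n > 0;
            } else {
                result = flags[i] && seq.evaluate() > 0;  // may be skipped
            }
            counter[i] = seq.current();
        }
        return counter;
    }

    public static void main(String[] args) {
        boolean[] flags = {true, false, true};
        System.out.println("stateful: " + isStateful(RowSequence.class));
        System.out.println("short-circuited: "
                + Arrays.toString(counterAfterEachRow(flags, false)));
        System.out.println("eager: "
                + Arrays.toString(counterAfterEachRow(flags, true)));
    }
}
```

With the flags true, false, true, the short-circuited run ends with the counter at 2 while the eager run ends at 3, which is the kind of skipped invocation the description wants to rule out.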