[ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Sichi updated HIVE-1994:
-----------------------------
    Attachment: HIVE-1994.2.patch

> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>          Key: HIVE-1994
>          URL: https://issues.apache.org/jira/browse/HIVE-1994
>      Project: Hive
>   Issue Type: Improvement
>   Components: Query Processor, UDF
>     Reporter: John Sichi
>     Assignee: John Sichi
>  Attachments: HIVE-1994.0.patch, HIVE-1994.1.patch, HIVE-1994.2.patch
>
>
> Because Hive does not yet support window functions from SQL/OLAP, people have
> started hacking around it by writing stateful UDFs for things like
> cumulative sum. An example is row_sequence in contrib.
>
> To clearly mark these, I think we should add a new annotation (with separate
> semantics from the existing deterministic annotation). I'm proposing the
> name stateful for lack of a better idea, but I'm open to suggestions.
>
> The semantics are as follows:
> * A stateful UDF can only be used in the SELECT list, not in other clauses
>   such as WHERE/ON/ORDER/GROUP.
> * When a stateful UDF is present in a query, there's an implication that its
>   SELECT needs to be treated as similar to TRANSFORM, i.e. when there's a
>   DISTRIBUTE/CLUSTER/SORT clause, the SELECT runs inside the corresponding
>   reducer to make sure that the results are as expected.
>
> For the first one, an example of why we need this is AND/OR short-circuiting;
> we don't want these optimizations to cause the invocation to be skipped in a
> confusing way, so we should just ban it outright (which is what SQL/OLAP does
> for window functions).
>
> For the second one, I'm not entirely certain about the details since some of
> it is lost in the mists of Hive prehistory, but at least if we have the
> annotation, we'll be able to preserve backwards compatibility as we start
> adding new cost-based optimizations which might otherwise break it. A
> specific example would be inserting a materialization step (e.g. for global
> query optimization) in between the DISTRIBUTE/CLUSTER/SORT and the outer
> SELECT containing the stateful UDF invocation; this could be a problem if the
> mappers in the second job subdivide the buckets generated by the first job.
>
> So we wouldn't do anything immediately, but the presence of the annotation
> will help us going forward.
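
To make the short-circuiting concern concrete, here is a minimal, self-contained Java sketch. It does not use Hive's classes: the UDFType annotation below is a local stand-in written for this illustration, and CumulativeSum, isStateful, selectList, and totalAfterShortCircuitFilter are hypothetical names. It only shows why a UDF that carries state across rows gives different results when an AND short-circuit skips some invocations, and how a planner might detect the marker by reflection.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;

public class StatefulUdfSketch {

    // Local stand-in for the proposed annotation (an assumption for this
    // sketch; it is not Hive's own UDFType class).
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface UDFType {
        boolean deterministic() default true;
        boolean stateful() default false;
    }

    // Hypothetical cumulative-sum UDF: each result depends on all rows seen so far.
    @UDFType(deterministic = false, stateful = true)
    public static class CumulativeSum {
        private long total = 0;

        public long evaluate(long value) {
            total += value;
            return total;
        }

        public long current() {
            return total;
        }
    }

    // One way a planner could check for the marker: read it via reflection.
    public static boolean isStateful(Class<?> udfClass) {
        UDFType type = udfClass.getAnnotation(UDFType.class);
        return type != null && type.stateful();
    }

    // SELECT-list style use: the UDF is invoked exactly once per row.
    public static long[] selectList(long[] rows) {
        CumulativeSum udf = new CumulativeSum();
        long[] out = new long[rows.length];
        for (int i = 0; i < rows.length; i++) {
            out[i] = udf.evaluate(rows[i]);
        }
        return out;
    }

    // WHERE-style use, modelled as "row > 15 AND cumsum(row) > 0".
    // When the left operand is false, && skips the UDF call, so the
    // running total silently misses that row.
    public static long totalAfterShortCircuitFilter(long[] rows) {
        CumulativeSum udf = new CumulativeSum();
        for (long row : rows) {
            boolean keep = row > 15 && udf.evaluate(row) > 0;
        }
        return udf.current();
    }

    public static void main(String[] args) {
        long[] rows = {10, 20, 30};
        System.out.println(Arrays.toString(selectList(rows)));   // [10, 30, 60]
        System.out.println(totalAfterShortCircuitFilter(rows));  // 50, row 10 was skipped
        System.out.println(isStateful(CumulativeSum.class));     // true
    }
}
```

With the same three input rows, the SELECT-list path ends at a running total of 60 while the filtered path ends at 50, which is the kind of confusing result the proposed restriction is meant to rule out.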