[
https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994557#comment-12994557
]
John Sichi commented on HIVE-1994:
----------------------------------
@Jonathan: good point that we need to prevent short-circuiting from causing
problems in the SELECT list too (e.g. inside of CASE/AND/OR). Ideally, we
should figure out how to make sure stateful UDF invocations get eagerly
evaluated exactly once before evaluating the entire expression; that way
short-circuiting can still be used.
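To make the short-circuiting concern concrete, here is a minimal Java sketch.
It is not Hive code: RowSequence is a hypothetical stand-in for a stateful UDF
like row_sequence, and the two methods model
CASE WHEN flag THEN row_sequence() ELSE 0 END evaluated lazily versus with the
stateful call hoisted and evaluated once per row up front.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

final class EagerStatefulDemo {
    /** Hypothetical stateful function: returns 1, 2, 3, ... on successive calls. */
    static final class RowSequence implements IntSupplier {
        private int next = 0;
        @Override public int getAsInt() { return ++next; }
    }

    /** Lazy evaluation: the stateful call is skipped whenever flag is false. */
    static List<Integer> lazy(boolean[] flags) {
        RowSequence seq = new RowSequence();
        List<Integer> out = new ArrayList<>();
        for (boolean flag : flags) {
            out.add(flag ? seq.getAsInt() : 0);
        }
        return out;
    }

    /** Eager evaluation: the stateful call runs exactly once per row, then the
     *  surrounding expression is free to short-circuit over the cached value. */
    static List<Integer> eager(boolean[] flags) {
        RowSequence seq = new RowSequence();
        List<Integer> out = new ArrayList<>();
        for (boolean flag : flags) {
            int value = seq.getAsInt();   // evaluated before the CASE
            out.add(flag ? value : 0);
        }
        return out;
    }

    public static void main(String[] args) {
        boolean[] flags = {true, false, true};
        System.out.println(lazy(flags));   // [1, 0, 2]
        System.out.println(eager(flags));  // [1, 0, 3]
    }
}
```

In the lazy version the second row never advances the counter, so the third
row gets 2 rather than 3; that is the confusing skip we want to avoid.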
We should still prevent stateful UDFs entirely outside of the SELECT list to avoid
semantic ambiguity from other optimizations (e.g. decomposing predicates during
predicate pushdown). But I just checked SQL/OLAP, and it does allow them in
ORDER BY, which makes sense for reporting.
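A rough sketch of what the clause check could look like, in plain Java. All
names here are assumptions for illustration: the nested UDFType annotation is
a local stand-in for the proposed one (not Hive's actual class), and Clause,
RowSequence, Upper, isStateful and isAllowed are hypothetical. The set of
clauses where stateful UDFs are permitted is passed in, since whether ORDER BY
belongs in it is still open.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.EnumSet;
import java.util.Set;

final class StatefulClauseCheck {
    /** Local stand-in for the proposed annotation (assumption, not Hive's class). */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface UDFType {
        boolean deterministic() default true;
        boolean stateful() default false;
    }

    /** Hypothetical clause positions in which a UDF call may appear. */
    enum Clause { SELECT, WHERE, ON, GROUP_BY, ORDER_BY }

    /** Hypothetical stateful UDF, in the spirit of row_sequence. */
    @UDFType(deterministic = false, stateful = true)
    static final class RowSequence {
        private long count = 0;
        long evaluate() { return ++count; }
    }

    /** Hypothetical ordinary (stateless) UDF. */
    @UDFType
    static final class Upper {
        String evaluate(String s) { return s.toUpperCase(); }
    }

    /** Reads the annotation reflectively; unannotated classes count as stateless. */
    static boolean isStateful(Class<?> udf) {
        UDFType type = udf.getAnnotation(UDFType.class);
        return type != null && type.stateful();
    }

    /** Stateless UDFs are allowed anywhere; stateful ones only in the given clauses. */
    static boolean isAllowed(Class<?> udf, Clause clause, Set<Clause> statefulClauses) {
        return !isStateful(udf) || statefulClauses.contains(clause);
    }

    public static void main(String[] args) {
        Set<Clause> selectOnly = EnumSet.of(Clause.SELECT);
        System.out.println(isAllowed(RowSequence.class, Clause.SELECT, selectOnly));
        System.out.println(isAllowed(RowSequence.class, Clause.WHERE, selectOnly));
        System.out.println(isAllowed(Upper.class, Clause.WHERE, selectOnly));
    }
}
```

Passing EnumSet.of(Clause.SELECT, Clause.ORDER_BY) instead would model the
SQL/OLAP behavior of allowing them in ORDER BY as well.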
> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
> Key: HIVE-1994
> URL: https://issues.apache.org/jira/browse/HIVE-1994
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor, UDF
> Reporter: John Sichi
> Assignee: John Sichi
>
> Because Hive does not yet support window functions from SQL/OLAP, people have
> started hacking around it by writing stateful UDFs for things like
> cumulative sum. An example is row_sequence in contrib.
> To clearly mark these, I think we should add a new annotation (with separate
> semantics from the existing deterministic annotation). I'm proposing the
> name stateful for lack of a better idea, but I'm open to suggestions.
> The semantics are as follows:
> * A stateful UDF can only be used in the SELECT list, not in other clauses
> such as WHERE/ON/ORDER/GROUP
> * When a stateful UDF is present in a query, there's an implication that its
> SELECT needs to be treated as similar to TRANSFORM, i.e. when there's a
> DISTRIBUTE/CLUSTER/SORT clause, the SELECT runs inside the corresponding
> reducer to make sure that the results are as expected.
> For the first one, an example of why we need this is AND/OR short-circuiting;
> we don't want these optimizations to cause the invocation to be skipped in a
> confusing way, so we should just ban it outright (which is what SQL/OLAP does
> for window functions).
> For the second one, I'm not entirely certain about the details since some of
> it is lost in the mists of Hive prehistory, but at least if we have the
> annotation, we'll be able to preserve backwards compatibility as we start
> adding new cost-based optimizations which might otherwise break it. A
> specific example would be inserting a materialization step (e.g. for global
> query optimization) in between the DISTRIBUTE/CLUSTER/SORT and the outer
> SELECT containing the stateful UDF invocation; this could be a problem if the
> mappers in the second job subdivide the buckets generated by the first job.
> So we wouldn't do anything immediately, but the presence of the annotation
> will help us going forward.