[jira] Commented: (HIVE-1994) Support new annotation @UDFType(stateful = true)

John Sichi (JIRA) Mon, 14 Feb 2011 15:26:22 -0800

    [ https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994557#comment-12994557 ]


John Sichi commented on HIVE-1994:
----------------------------------

@Jonathan:  good point that we need to prevent short-circuiting from causing 
problems in the SELECT list too (e.g. inside of CASE/AND/OR).  Ideally, we 
should figure out how to make sure stateful UDFs get eagerly evaluated exactly 
once before the enclosing expression is evaluated; that way short-circuiting 
can still be used.
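
To make the short-circuiting concern concrete, here is a minimal sketch in 
plain Java with no Hive dependency.  RowSequenceSketch is a hypothetical 
stand-in for a stateful UDF like row_sequence, and lazyOr/eagerOr are 
illustrative names, not Hive evaluator code.  It only shows the idea of 
invoking the stateful call once per row and then short-circuiting over the 
cached value.

```java
import java.util.function.LongSupplier;

// Hypothetical stand-in for a stateful UDF such as row_sequence:
// every invocation advances an internal counter.
final class RowSequenceSketch implements LongSupplier {
    private long next = 0;

    @Override
    public long getAsLong() {
        return ++next;
    }

    // Number of invocations so far (reading it does not advance the counter).
    long current() {
        return next;
    }
}

final class ShortCircuitSketch {
    // Lazy evaluation: the stateful call is the right operand of OR, so it is
    // skipped on every row where the left operand is already true.
    static boolean lazyOr(boolean left, LongSupplier seq) {
        return left || seq.getAsLong() > 2;
    }

    // Eager evaluation: invoke the stateful call exactly once per row first,
    // then short-circuit over the cached value.
    static boolean eagerOr(boolean left, LongSupplier seq) {
        long cached = seq.getAsLong();
        return left || cached > 2;
    }

    public static void main(String[] args) {
        boolean[] leftPerRow = {true, false, true, false};
        RowSequenceSketch lazySeq = new RowSequenceSketch();
        RowSequenceSketch eagerSeq = new RowSequenceSketch();
        for (boolean left : leftPerRow) {
            lazyOr(left, lazySeq);
            eagerOr(left, eagerSeq);
        }
        // Four rows were processed; the lazy variant skipped the rows where left was true.
        System.out.println("lazy invocations:  " + lazySeq.current());
        System.out.println("eager invocations: " + eagerSeq.current());
    }
}
```

With the lazy form the counter falls behind the row count whenever the left 
operand is true, which is the confusing behavior in question; the eager form 
keeps one invocation per row.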

We should still prevent them entirely outside of the SELECT list to avoid 
semantic ambiguity from other optimizations (e.g. decomposing predicates during 
predicate pushdown).  But I just checked SQL/OLAP, and it does allow them in 
ORDER BY, which makes sense for reporting.
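
As a rough sketch of how a clause restriction could be checked, the snippet 
below uses a locally defined annotation.  UdfTypeSketch, Clause, 
RunningCounter, PlainUpper and StatefulClauseCheck are all hypothetical names 
made up for illustration; this is not Hive's UDFType class or its semantic 
analyzer.  Allowing ORDER BY here is an assumption that follows the SQL/OLAP 
observation above, not a decision.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical local stand-in for an annotation carrying the proposed flag.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface UdfTypeSketch {
    boolean deterministic() default true;
    boolean stateful() default false;
}

// Query clauses in which a UDF invocation might appear.
enum Clause { SELECT, WHERE, ON, GROUP_BY, ORDER_BY }

// Illustrative stateful UDF (body omitted; only the annotation matters here).
@UdfTypeSketch(deterministic = false, stateful = true)
final class RunningCounter { }

// Illustrative ordinary UDF using the annotation defaults.
@UdfTypeSketch
final class PlainUpper { }

final class StatefulClauseCheck {
    // Assumption: stateful UDFs are accepted in SELECT and ORDER BY only.
    private static final Set<Clause> ALLOWED_FOR_STATEFUL =
            EnumSet.of(Clause.SELECT, Clause.ORDER_BY);

    // Reads the flag via reflection; unannotated classes count as not stateful.
    static boolean isStateful(Class<?> udfClass) {
        UdfTypeSketch type = udfClass.getAnnotation(UdfTypeSketch.class);
        return type != null && type.stateful();
    }

    // Non-stateful UDFs are allowed anywhere; stateful ones only in the allowed clauses.
    static boolean isAllowed(Class<?> udfClass, Clause clause) {
        return !isStateful(udfClass) || ALLOWED_FOR_STATEFUL.contains(clause);
    }

    public static void main(String[] args) {
        for (Clause clause : Clause.values()) {
            System.out.println(clause + ": stateful allowed = "
                    + isAllowed(RunningCounter.class, clause));
        }
    }
}
```

A real check would also have to run before rewrites such as predicate 
pushdown, so that a decomposed predicate never carries a stateful call into 
another clause.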


> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>
> Because Hive does not yet support window functions from SQL/OLAP, people have 
> started hacking around it by writing stateful UDFs for things like 
> cumulative sum.  An example is row_sequence in contrib.
> To clearly mark these, I think we should add a new annotation (with separate 
> semantics from the existing deterministic annotation).  I'm proposing the 
> name stateful for lack of a better idea, but I'm open to suggestions.
> The semantics are as follows:
> * A stateful UDF can only be used in the SELECT list, not in other clauses 
> such as WHERE/ON/ORDER/GROUP
> * When a stateful UDF is present in a query, there's an implication that its 
> SELECT needs to be treated similarly to TRANSFORM, i.e. when there's a 
> DISTRIBUTE/CLUSTER/SORT clause, then run inside the corresponding reducer to 
> make sure that the results are as expected.
> For the first one, an example of why we need this is AND/OR short-circuiting; 
> we don't want these optimizations to cause the invocation to be skipped in a 
> confusing way, so we should just ban it outright (which is what SQL/OLAP does 
> for window functions).
> For the second one, I'm not entirely certain about the details since some of 
> it is lost in the mists of Hive prehistory, but at least if we have the 
> annotation, we'll be able to preserve backwards compatibility as we start 
> adding new cost-based optimizations which might otherwise break it.  A 
> specific example would be inserting a materialization step (e.g. for global 
> query optimization) in between the DISTRIBUTE/CLUSTER/SORT and the outer 
> SELECT containing the stateful UDF invocation; this could be a problem if the 
> mappers in the second job subdivide the buckets generated by the first job.  
> So we wouldn't do anything immediately, but the presence of the annotation 
> will help us going forward.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
