[ 
https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi updated HIVE-1994:
-----------------------------

    Attachment: HIVE-1994.0.patch

Preliminary patch with everything except the fix to prevent short-circuiting.


> Support new annotation @UDFType(stateful = true)
> ------------------------------------------------
>
>                 Key: HIVE-1994
>                 URL: https://issues.apache.org/jira/browse/HIVE-1994
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, UDF
>            Reporter: John Sichi
>            Assignee: John Sichi
>         Attachments: HIVE-1994.0.patch
>
>
> Because Hive does not yet support window functions from SQL/OLAP, people have 
> started hacking around it by writing stateful UDFs for things like 
> cumulative sum.  An example is row_sequence in contrib.
> To clearly mark these, I think we should add a new annotation (with separate 
> semantics from the existing deterministic annotation).  I'm proposing the 
> name stateful for lack of a better idea, but I'm open to suggestions.
> The semantics are as follows:
> * A stateful UDF can only be used in the SELECT list, not in other clauses 
> such as WHERE/ON/ORDER BY/GROUP BY
> * When a stateful UDF is present in a query, there's an implication that its 
> SELECT needs to be treated like TRANSFORM, i.e. when there's a 
> DISTRIBUTE/CLUSTER/SORT clause, the SELECT runs inside the corresponding 
> reducer to make sure that the results are as expected.
> For the first one, an example of why we need this is AND/OR short-circuiting; 
> we don't want these optimizations to cause the invocation to be skipped in a 
> confusing way, so we should just ban it outright (which is what SQL/OLAP does 
> for window functions).
> For the second one, I'm not entirely certain about the details since some of 
> it is lost in the mists of Hive prehistory, but at least if we have the 
> annotation, we'll be able to preserve backwards compatibility as we start 
> adding new cost-based optimizations which might otherwise break it.  A 
> specific example would be inserting a materialization step (e.g. for global 
> query optimization) in between the DISTRIBUTE/CLUSTER/SORT and the outer 
> SELECT containing the stateful UDF invocation; this could be a problem if the 
> mappers in the second job subdivide the buckets generated by the first job.  
> So we wouldn't do anything immediately, but the presence of the annotation 
> will help us going forward.
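
The short-circuiting concern from the description can be illustrated with a
small standalone sketch. It does not use Hive at all: the nested UDFType
annotation below is a hypothetical local stand-in for the proposed one, and
RowSequence is a toy counter in the spirit of row_sequence, not the contrib
implementation. It only shows how an AND that short-circuits can silently
skip a stateful call, so the counter advances for some rows and not others.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class StatefulShortCircuitDemo {

    // Hypothetical stand-in for the proposed annotation (assumption, not Hive's class).
    @Retention(RetentionPolicy.RUNTIME)
    @interface UDFType {
        boolean stateful() default false;
    }

    // Toy stateful function: each call returns the next value 1, 2, 3, ...
    @UDFType(stateful = true)
    static class RowSequence {
        private long counter = 0;

        long evaluate() {
            return ++counter;
        }

        long invocations() {
            return counter;
        }
    }

    // A planner-style check: is this class marked as stateful?
    static boolean isStateful(Class<?> udfClass) {
        UDFType type = udfClass.getAnnotation(UDFType.class);
        return type != null && type.stateful();
    }

    // Evaluates "flag AND seq() > 0" once per row and reports how many
    // times the stateful function actually ran.
    static long invocationsUnderAnd(boolean[] flags) {
        RowSequence seq = new RowSequence();
        for (boolean flag : flags) {
            // && short-circuits: seq.evaluate() is skipped whenever flag is false.
            boolean keep = flag && seq.evaluate() > 0;
        }
        return seq.invocations();
    }

    public static void main(String[] args) {
        boolean[] flags = {true, false, true, false};
        System.out.println("stateful=" + isStateful(RowSequence.class));
        // prints "rows=4 invocations=2"
        System.out.println("rows=" + flags.length
                + " invocations=" + invocationsUnderAnd(flags));
    }
}
```

With four input rows the function runs only twice, which is the kind of
confusing skipped invocation that banning stateful UDFs outside the SELECT
list is meant to rule out.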

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
