[jira] Commented: (HIVE-1994) Support new annotation @UDFType(stateful = true)

2011-02-24 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998991#comment-12998991
 ] 

Carl Steinbach commented on HIVE-1994:
--

+1. Will commit if tests pass.

 Support new annotation @UDFType(stateful = true)
 

 Key: HIVE-1994
 URL: https://issues.apache.org/jira/browse/HIVE-1994
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor, UDF
Reporter: John Sichi
Assignee: John Sichi
 Fix For: 0.8.0

 Attachments: HIVE-1994.0.patch, HIVE-1994.1.patch, HIVE-1994.2.patch, 
 HIVE-1994.3.patch


 Because Hive does not yet support window functions from SQL/OLAP, people have 
 started hacking around it by writing stateful UDFs for things like 
 cumulative sum.  An example is row_sequence in contrib.
 To clearly mark these, I think we should add a new annotation (with separate 
 semantics from the existing deterministic annotation).  I'm proposing the 
 name stateful for lack of a better idea, but I'm open to suggestions.
 The semantics are as follows:
 * A stateful UDF can only be used in the SELECT list, not in other clauses 
 such as WHERE/ON/ORDER/GROUP
 * When a stateful UDF is present in a query, there's an implication that its 
 SELECT needs to be treated as similar to TRANSFORM, i.e. when there's a 
 DISTRIBUTE/CLUSTER/SORT clause, it should run inside the corresponding reducer to 
 make sure that the results are as expected.
 For the first one, an example of why we need this is AND/OR short-circuiting; 
 we don't want these optimizations to cause the invocation to be skipped in a 
 confusing way, so we should just ban it outright (which is what SQL/OLAP does 
 for window functions).
 For the second one, I'm not entirely certain about the details since some of 
 it is lost in the mists of Hive prehistory, but at least if we have the 
 annotation, we'll be able to preserve backwards compatibility as we start 
 adding new cost-based optimizations which might otherwise break it.  A 
 specific example would be inserting a materialization step (e.g. for global 
 query optimization) in between the DISTRIBUTE/CLUSTER/SORT and the outer 
 SELECT containing the stateful UDF invocation; this could be a problem if the 
 mappers in the second job subdivide the buckets generated by the first job.  
 So we wouldn't do anything immediately, but the presence of the annotation 
 will help us going forward.
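
As a rough illustration of the proposal above, here is a minimal Java sketch of a cumulative-sum UDF carrying the annotation. Everything in it is an assumption made for illustration: the UDFType annotation is a local stand-in declared in the snippet so that it compiles without Hive on the classpath, and CumulativeSum and StatefulCheck are hypothetical names, not code from the attached patches.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Local stand-in for the proposed annotation (assumption: declared here only
// so the sketch is self-contained; it is not Hive's own class).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface UDFType {
    boolean stateful() default false;
}

// Hypothetical stateful UDF: each call depends on all earlier calls,
// so the result is only meaningful if rows arrive in a known order.
@UDFType(stateful = true)
class CumulativeSum {
    private long total = 0;

    public long evaluate(long value) {
        total += value;
        return total;
    }
}

// Hypothetical helper showing how a planner could look the flag up via reflection.
class StatefulCheck {
    static boolean isStateful(Class<?> udfClass) {
        UDFType type = udfClass.getAnnotation(UDFType.class);
        return type != null && type.stateful();
    }
}
```

A planner that can ask isStateful for each function call has what it needs to reject the call outside the SELECT list, or to keep the enclosing SELECT in the reducer that follows a DISTRIBUTE/CLUSTER/SORT clause.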

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HIVE-1994) Support new annotation @UDFType(stateful = true)

2011-02-22 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998171#comment-12998171
 ] 

John Sichi commented on HIVE-1994:
--

Note that for CASE expressions, we *always* want short circuiting, otherwise 
it's impossible to do something like
case when x < 0 then sqrt(-x) else sqrt(x) end (to avoid trying to take the 
square root of a negative number).  So if we detect a stateful UDF inside of a 
CASE expression, we'll throw an exception.
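
The check described here could be sketched roughly as follows. This is only an illustration under assumptions: Expr, Column, Call, CaseExpr and CaseValidator are hypothetical names for a toy expression tree, not Hive's actual expression classes, and the exception type is arbitrary.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, minimal expression tree (not Hive's real classes).
abstract class Expr {
    final List<Expr> children;

    Expr(Expr... children) {
        this.children = Arrays.asList(children);
    }
}

class Column extends Expr {
    final String name;

    Column(String name) {
        this.name = name;
    }
}

class Call extends Expr {
    final String function;
    final boolean stateful;

    Call(String function, boolean stateful, Expr... args) {
        super(args);
        this.function = function;
        this.stateful = stateful;
    }
}

class CaseExpr extends Expr {
    CaseExpr(Expr... branches) {
        super(branches);
    }
}

class CaseValidator {
    // Walks the tree and rejects any stateful call nested under a CASE.
    static void validate(Expr root) {
        validate(root, false);
    }

    private static void validate(Expr e, boolean insideCase) {
        if (insideCase && e instanceof Call && ((Call) e).stateful) {
            throw new IllegalStateException(
                "stateful UDF " + ((Call) e).function + " is not allowed inside CASE");
        }
        boolean nested = insideCase || e instanceof CaseExpr;
        for (Expr child : e.children) {
            validate(child, nested);
        }
    }
}
```

With this sketch, a CASE over sqrt(x) passes validation, a stateful call at the top level of the SELECT list passes, and a stateful call anywhere beneath a CASE is rejected.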






[jira] Commented: (HIVE-1994) Support new annotation @UDFType(stateful = true)

2011-02-22 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998197#comment-12998197
 ] 

John Sichi commented on HIVE-1994:
--

HIVE-1994.1.patch addresses short-circuiting.  I'm running it through tests now.






[jira] Commented: (HIVE-1994) Support new annotation @UDFType(stateful = true)

2011-02-14 Thread Jonathan Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994546#comment-12994546
 ] 

Jonathan Chang commented on HIVE-1994:
--

The AND/OR short circuiting is an issue for both SELECT and WHERE.  I think 
stateful UDFs need to poison containing expressions and force them to not short 
circuit.  
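
To make the concern concrete, here is a small Java sketch of how short-circuiting changes the number of times a stateful function is invoked. It uses Java's own && and & operators as an analogy for the query engine's AND evaluation; RowCounter and ShortCircuitDemo are hypothetical names introduced only for this illustration.

```java
// Hypothetical stateful function: its result depends on how often it was called.
class RowCounter {
    private long count = 0;

    boolean nextIsEven() {
        count++;
        return count % 2 == 0;
    }

    long calls() {
        return count;
    }
}

class ShortCircuitDemo {
    // Short-circuiting AND: the stateful call is skipped whenever the flag is false.
    static long callsWithShortCircuit(boolean[] flags) {
        RowCounter counter = new RowCounter();
        for (boolean flag : flags) {
            boolean ignored = flag && counter.nextIsEven();
        }
        return counter.calls();
    }

    // Non-short-circuiting AND: the stateful call runs once per row, as the UDF author expects.
    static long callsWithoutShortCircuit(boolean[] flags) {
        RowCounter counter = new RowCounter();
        for (boolean flag : flags) {
            boolean ignored = flag & counter.nextIsEven();
        }
        return counter.calls();
    }
}
```

For four rows whose flags alternate true/false, the short-circuiting version invokes the counter twice and the other version four times, so the state seen by later rows differs between the two evaluation strategies.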







[jira] Commented: (HIVE-1994) Support new annotation @UDFType(stateful = true)

2011-02-14 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994581#comment-12994581
 ] 

Adam Kramer commented on HIVE-1994:
---

Agree; also consider deprecating DISTRIBUTE/SORT/CLUSTER BY in favor of 
DISTRIBUTED/SORTED/CLUSTERED BY, a syntax that would explicitly prevent 
short-circuiting and subdivision for only the query it's a part of.

I can't imagine that letting a SORT BY in a subquery drive assumptions in the 
parent query scales well or will last long in any case, but this functionality 
is not only necessary for backwards compatibility; it is also kind of the 
entire reason Hive was developed and/or conceived: to utilize MapReduce 
functionality in order to transform and process data. Preventing the querier 
from making MapReduce assumptions just seems sad.
