rbalamohan opened a new pull request #1940:
URL: https://github.com/apache/hive/pull/1940


   https://issues.apache.org/jira/browse/HIVE-24710
   
   {noformat}
   select x, y, count(*) over (partition by x order by y range between 86400 PRECEDING and CURRENT ROW) r0 from foo
   {noformat}
   
   When "y" contains many duplicates, the window frame becomes very large and the internal implementation of PTFOperator ends up doing O(n^2) work. E.g., in some queries we had 2.5 M entries in the window, which caused a single task to run seemingly forever. On top of this, there is a high amount of IO from reading and discarding rows from RowContainers (note that we only need the count and nothing from the materialized row).
   
   1. In such cases, there is no need to iterate over the RowContainers repeatedly (internally this does O(n^2) operations, which takes forever when the window frame is really large). This can be optimised to reduce CPU burn and IO.
   2. BasePartitionEvaluator::calcFunctionValue need not materialize the ROW when the parameters are empty. This codepath can also be optimised.
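
To make the quadratic cost in point 1 concrete, here is a minimal, self-contained sketch. It is not Hive's code: `FrameCountSketch`, `counts`, and `rowsVisited` are hypothetical names, and the partition is modelled as a plain sorted array of "y" values. It assumes a frame of `RANGE BETWEEN <preceding> PRECEDING AND CURRENT ROW`, where rows with an equal "y" (peers) fall into the same frame. The slow path walks every row in the frame, while the fast path derives the count from the frame bounds alone.

```java
import java.util.Arrays;

final class FrameCountSketch {
    /** Number of frame rows walked by the most recent call to counts(). */
    static long rowsVisited = 0;

    /**
     * Computes count(*) per row for RANGE BETWEEN preceding PRECEDING AND CURRENT ROW.
     * Assumes y is sorted ascending and preceding >= 0 (hypothetical simplification).
     */
    static long[] counts(long[] y, long preceding, boolean fastPath) {
        rowsVisited = 0;
        long[] out = new long[y.length];
        int start = 0; // first index inside the frame
        int end = 0;   // one past the last index inside the frame (peers included)
        for (int i = 0; i < y.length; i++) {
            // Both bounds only move forward because y is sorted.
            while (y[start] < y[i] - preceding) start++;
            while (end < y.length && y[end] <= y[i]) end++;
            if (fastPath) {
                // Fast path: the count is just the size of the frame.
                out[i] = end - start;
            } else {
                // Slow path: touch every row in the frame, O(frame size) per row.
                for (int j = start; j < end; j++) {
                    rowsVisited++;
                    out[i]++;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] y = new long[2000];
        Arrays.fill(y, 42L); // all duplicates: every frame spans the whole partition
        counts(y, 86400L, false);
        System.out.println("slow path rows visited: " + rowsVisited);
        counts(y, 86400L, true);
        System.out.println("fast path rows visited: " + rowsVisited);
    }
}
```

With n duplicate values, the slow path visits n rows for each of the n rows, which is where the O(n^2) behaviour comes from; the fast path returns the same counts without walking the frame.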
   
   ### What changes were proposed in this pull request?
   - For count(*), the PR follows a fast path and simply takes the count from PTFPartitionIterator.
   - When parameters are empty/null, it runs via an optimised iterator that does not materialize anything in ROW. This helps reduce IO cost.
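
The idea behind the second bullet can be illustrated with a small sketch. This is an assumption-laden illustration, not the actual PTFPartitionIterator API: `LazyRowIteratorSketch`, `skip`, and `countStar` are hypothetical names, and rows are modelled as comma-separated strings whose parsing stands in for the cost of materializing a row.

```java
import java.util.Arrays;
import java.util.List;

final class LazyRowIteratorSketch {
    private final List<String> serializedRows;
    private int pos = 0;
    /** How many rows were actually parsed ("materialized") so far. */
    int materialized = 0;

    LazyRowIteratorSketch(List<String> serializedRows) {
        this.serializedRows = serializedRows;
    }

    boolean hasNext() {
        return pos < serializedRows.size();
    }

    /** Regular path: parse the row into columns, which costs CPU and IO. */
    String[] next() {
        materialized++;
        return serializedRows.get(pos++).split(",");
    }

    /** Optimised path: advance past the row without parsing it. */
    void skip() {
        pos++;
    }

    /** count(*) has no parameters, so no column values are needed. */
    static long countStar(LazyRowIteratorSketch it) {
        long n = 0;
        while (it.hasNext()) {
            it.skip();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        LazyRowIteratorSketch it =
            new LazyRowIteratorSketch(Arrays.asList("a,1", "b,2", "c,3"));
        System.out.println(countStar(it) + " rows, " + it.materialized + " materialized");
    }
}
```

The design point is that a function with empty parameters never reads a column, so the iterator can advance without paying for deserialization.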
   
   ### How was this patch tested?
   Tested on a small internal cluster.


