[jira] [Commented] (HIVE-7989) Optimize Windowing function performance for row frames

2014-09-09 Thread Ankit Kamboj (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127282#comment-14127282
 ] 

Ankit Kamboj commented on HIVE-7989:


It looks like the failed tests are not caused by the patch itself (the 
ptf-windowing tests are part of the ql module). Could somebody take a quick 
look and advise?

 Optimize Windowing function performance for row frames
 --

 Key: HIVE-7989
 URL: https://issues.apache.org/jira/browse/HIVE-7989
 Project: Hive
  Issue Type: Improvement
  Components: PTF-Windowing
Affects Versions: 0.13.0
Reporter: Ankit Kamboj
 Attachments: HIVE-7989.patch


 To compute the aggregate value for each row, the current windowing function 
 implementation creates a new aggregation buffer for each row, iterates over 
 all the rows in the corresponding window frame, adds them to the buffer, and 
 then computes the aggregated value. This becomes a bottleneck for partitions 
 with a huge number of rows, because the process runs in O(n^2) time per 
 partition (n being the number of rows in the partition). So, if a dataset has 
 multiple partitions, each with millions of rows, aggregation over all rows 
 can take days to finish.
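
To make the cost concrete, here is a minimal Python sketch of the per-row re-aggregation described above. It is not Hive's code: the function name `naive_window_sum`, the offset convention (None meaning unbounded), and the use of SUM with a plain integer as the "aggregation buffer" are all assumptions made for illustration. An empty frame yields 0 here, whereas SQL would return NULL.

```python
def naive_window_sum(rows, start_offset, end_offset):
    """Sum over a ROWS frame, rebuilding the buffer for every row.

    Offsets are relative to the current row (hypothetical convention):
    start_offset=None means UNBOUNDED PRECEDING,
    end_offset=None means UNBOUNDED FOLLOWING.
    Returns (per-row sums, number of rows fed into buffers).
    """
    n = len(rows)
    results = []
    steps = 0
    for i in range(n):
        lo = 0 if start_offset is None else max(0, i + start_offset)
        hi = n - 1 if end_offset is None else min(n - 1, i + end_offset)
        buf = 0  # a fresh "aggregation buffer" for each row
        for j in range(lo, hi + 1):
            buf += rows[j]
            steps += 1
        results.append(buf)
    return results, steps
```

For a frame of UNBOUNDED PRECEDING to CURRENT ROW, the step count is n(n+1)/2, which is where the quadratic behaviour comes from.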
 There is scope for optimization of row frames in the following cases:
 a) UNBOUNDED PRECEDING start and bounded end: Instead of iterating over the 
 window frame again for each row, we can slide the end one row at a time and 
 aggregate incrementally, since the start is fixed for every row. The running 
 time is then linear in the size of the partition.
 b) Bounded start and UNBOUNDED FOLLOWING end: Instead of iterating over the 
 window frame again for each row, we can slide the start one row at a time and 
 aggregate in reverse, since the end is fixed for every row. The running time 
 is again linear in the size of the partition.
 Also, in general, for both row and value frames, we don't need to iterate 
 over the range and re-create the aggregation buffer if both the start and the 
 end remain the same. Instead, we can re-use the previously created 
 aggregation buffer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7989) Optimize Windowing function performance for row frames

2014-09-06 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124393#comment-14124393
 ] 

Hive QA commented on HIVE-7989:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12666864/HIVE-7989.patch

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 6171 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_8
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadDataPrimitiveTypes
org.apache.hive.jdbc.miniHS2.TestHiveServer2.testConnection
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/664/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/664/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-664/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12666864
