[
https://issues.apache.org/jira/browse/HIVE-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124393#comment-14124393
]
Hive QA commented on HIVE-7989:
---
{color:red}Overall{color}: -1 at least one test failed
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12666864/HIVE-7989.patch
{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 6171 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_8
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadDataPrimitiveTypes
org.apache.hive.jdbc.miniHS2.TestHiveServer2.testConnection
{noformat}
Test results:
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/664/testReport
Console output:
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/664/console
Test logs:
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-664/
Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}
This message is automatically generated.
ATTACHMENT ID: 12666864
Optimize Windowing function performance for row frames
--
Key: HIVE-7989
URL: https://issues.apache.org/jira/browse/HIVE-7989
Project: Hive
Issue Type: Improvement
Components: PTF-Windowing
Affects Versions: 0.13.0
Reporter: Ankit Kamboj
Attachments: HIVE-7989.patch
To compute the aggregate value for each row, the current windowing function
implementation creates a new aggregation buffer for that row, iterates over
all the rows in its window frame, adds them to the buffer, and then evaluates
the aggregate. This becomes a bottleneck for partitions with a huge number of
rows, because the work per partition is O(n^2) in the worst case (n being the
number of rows in the partition). If a dataset has multiple partitions, each
with millions of rows, aggregating all rows can take days to finish.
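To make the cost concrete, here is a minimal sketch of the per-row approach
described above. It is not Hive's code: the function name naive_window_sum,
the use of SUM as the aggregate, and the convention that None means UNBOUNDED
are all assumptions made for illustration.

```python
def naive_window_sum(values, preceding, following):
    """Rebuild the aggregation buffer from scratch for every row.

    preceding / following are row offsets relative to the current row;
    None stands for UNBOUNDED (hypothetical convention for this sketch).
    """
    n = len(values)
    out = []
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n - 1 if following is None else min(n - 1, i + following)
        buf = 0  # fresh "aggregation buffer" for this row
        for j in range(start, end + 1):  # re-scan the whole frame
            buf += values[j]
        out.append(buf)
    return out
```

With an unbounded side, the inner loop touches up to n rows for each of the
n rows in the partition, which is where the quadratic behavior comes from.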
There is scope for optimization for row frames in the following cases:
a) UNBOUNDED PRECEDING start and bounded end: Instead of iterating over the
window frame again for each row, we can slide the end forward one row at a
time and keep aggregating into the same buffer, since the start is fixed for
every row. The running time is then linear in the size of the partition.
b) Bounded start and UNBOUNDED FOLLOWING end: Instead of iterating over the
window frame again for each row, we can process the rows in reverse, sliding
the start back one row at a time and aggregating into the same buffer, since
the end is fixed for every row. The running time is again linear in the size
of the partition.
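Cases a) and b) can be sketched as follows. This is only an illustration
under assumptions, not the attached patch: the function names, the use of SUM,
and the signed offset convention (0 = CURRENT ROW, positive = FOLLOWING,
negative = PRECEDING, None in the output for an empty frame) are hypothetical.

```python
def sum_unbounded_preceding(values, end_offset):
    """Case a): frame is UNBOUNDED PRECEDING .. (current row + end_offset)."""
    n = len(values)
    out = []
    total = 0      # single buffer shared by all rows
    next_row = 0   # first row not yet added to the buffer
    for i in range(n):
        end = min(n - 1, i + end_offset)
        while next_row <= end:        # frame only ever grows at the end
            total += values[next_row]
            next_row += 1
        out.append(total if end >= 0 else None)
    return out


def sum_unbounded_following(values, start_offset):
    """Case b): frame is (current row + start_offset) .. UNBOUNDED FOLLOWING."""
    n = len(values)
    out = [None] * n
    total = 0
    next_row = n - 1   # last row not yet added to the buffer
    for i in range(n - 1, -1, -1):    # walk the partition in reverse
        start = max(0, i + start_offset)
        while next_row >= start:      # frame only ever grows at the start
            total += values[next_row]
            next_row -= 1
        out[i] = total if start <= n - 1 else None
    return out
```

Each row of the partition is added to the buffer at most once, so the total
work is linear. Because the frame only grows in both cases, the same pattern
should apply to any aggregate that can absorb one more row at a time, not
just SUM.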
Also, in general, for both row and value frames, we don't need to iterate over
the range and re-create the aggregation buffer if the start and end both remain
the same as for the previous row. Instead, we can reuse the previously created
aggregation buffer.
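A rough sketch of that reuse idea, again under assumptions: aggregate_with_reuse
and frame_of are hypothetical names, frame_of(i) is assumed to return the
inclusive (start, end) row indexes of the frame for row i, and the evaluation
counter exists only to show how often the frame is actually re-scanned.

```python
def aggregate_with_reuse(values, frame_of, aggregate=sum):
    """Re-evaluate the aggregate only when the frame bounds change."""
    out = []
    last_frame = None
    last_result = None
    evaluations = 0  # number of times the frame was actually scanned
    for i in range(len(values)):
        frame = frame_of(i)
        if frame != last_frame:
            start, end = frame
            last_result = aggregate(values[start:end + 1])
            last_frame = frame
            evaluations += 1
        out.append(last_result)  # same frame as previous row: reuse result
    return out, evaluations
```

For example, with a value frame where rows 0-1 and rows 2-4 are peers:

```python
frames = [(0, 1), (0, 1), (2, 4), (2, 4), (2, 4)]
result, scans = aggregate_with_reuse([5, 5, 7, 7, 7], lambda i: frames[i])
# result == [10, 10, 21, 21, 21], scans == 2
```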
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)