[jira] [Commented] (HIVE-18049) Enable Hive on Tez to provide globally sorted clustered table

Hive QA (JIRA) Mon, 13 Nov 2017 00:04:34 -0800

    [ https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249218#comment-16249218 ]


Hive QA commented on HIVE-18049:
--------------------------------



Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12897288/HIVE-18049.3.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 10 failed/errored test(s), 11374 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dbtxnmgr_showlocks] (batchId=77)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[unionDistinct_1] (batchId=146)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] (batchId=156)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=102)
org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testCliDriver[ct_noperm_loc] (batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multi] (batchId=111)
org.apache.hadoop.hive.cli.control.TestDanglingQOuts.checkDanglingQOut (batchId=206)
org.apache.hadoop.hive.ql.exec.tez.TestWorkloadManager.testApplyPlanQpChanges (batchId=281)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints (batchId=223)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/7785/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/7785/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-7785/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 10 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12897288 - PreCommit-HIVE-Build

> Enable Hive on Tez to provide globally sorted clustered table
> -------------------------------------------------------------
>
>                 Key: HIVE-18049
>                 URL: https://issues.apache.org/jira/browse/HIVE-18049
>             Project: Hive
>          Issue Type: Improvement
>          Components: Hive, Tez
>            Reporter: LingXiao Lan
>             Fix For: 2.1.1
>
>         Attachments: CombinedPartitioner.txt, HIVE-18049.1.patch, tez-0.8.5.txt
>
>
> {code:sql}
> CREATE TABLE `test`(
>    `time` int,
>    `userid` bigint)
>  CLUSTERED BY (
>    userid)
>  SORTED BY (
>    userid ASC)
>  INTO 4 BUCKETS
>  ;
> {code}
> When data is inserted into this table, it is sorted into 4 buckets
> automatically. But because Hive uses a hash partitioner by default, the data
> is sorted only within each bucket, not across buckets. Sometimes we need the
> data to be globally sorted, for example to optimize indexing.
> If we sample the table first and use TotalOrderPartitioner, this can be
> done. The difficulty is deciding automatically when to use
> TotalOrderPartitioner and when not to, because an insertion query can be
> complex and result in a complex DAG in Tez.
> I have implemented a temporary version. It uses a custom partitioner that
> combines the hash partitioner and the total-order partitioner. A physical
> optimizer is added to Hive to decide which partitioner to use. However, to
> reduce the workload, this version requires changes to the Tez source code,
> which is not strictly necessary.
> I'm wondering if we can implement a more general version that addresses this
> issue.
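
The difference between the two partitioning strategies described in the issue can be illustrated with a minimal sketch. This is not the code from the attached patch: all function names are hypothetical, the modulo in `hash_partition` is only a stand-in for Hive's bucket hashing, and the sampling step only approximates what a sampler feeding a total-order partitioner would do.

```python
import bisect
import random


def hash_partition(key, num_buckets):
    # Stand-in for hash bucketing (assumption: small non-negative integer keys).
    return key % num_buckets


def sample_split_points(keys, num_buckets, sample_size, seed=0):
    # Draw a random sample and pick num_buckets - 1 evenly spaced cut points.
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    return [sample[len(sample) * i // num_buckets] for i in range(1, num_buckets)]


def range_partition(key, split_points):
    # Bucket i receives keys k with split_points[i-1] <= k < split_points[i].
    return bisect.bisect_right(split_points, key)


def write_buckets(keys, num_buckets, partition):
    # Route each key to a bucket, then sort inside each bucket (SORTED BY).
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[partition(k)].append(k)
    return [sorted(b) for b in buckets]


def is_globally_sorted(buckets):
    # True if reading bucket 0, 1, 2, ... in order yields sorted keys.
    flat = [k for b in buckets for k in b]
    return flat == sorted(flat)


# Hypothetical input: 100 shuffled userid values written into 4 buckets.
userids = list(range(1, 101))
random.Random(1).shuffle(userids)

hashed = write_buckets(userids, 4, lambda k: hash_partition(k, 4))
splits = sample_split_points(userids, 4, sample_size=20)
ranged = write_buckets(userids, 4, lambda k: range_partition(k, splits))
```

With hash partitioning every bucket is sorted internally but the key ranges of the buckets interleave. With sampled split points each bucket holds a contiguous key range, so reading the buckets in order gives a globally sorted result. Bucket sizes then depend on how representative the sample is.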



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
