[
https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
LingXiao Lan updated HIVE-18049:
--------------------------------
Status: Patch Available (was: Open)
> Enable Hive on Tez to provide globally sorted clustered table
> -------------------------------------------------------------
>
> Key: HIVE-18049
> URL: https://issues.apache.org/jira/browse/HIVE-18049
> Project: Hive
> Issue Type: Improvement
> Components: Hive, Tez
> Reporter: LingXiao Lan
> Fix For: 2.1.1
>
>
> CREATE TABLE `test`(
> `time` int,
> `userid` bigint)
> CLUSTERED BY (
> userid)
> SORTED BY (
> userid ASC)
> INTO 4 BUCKETS
> ;
> When data is inserted into this table, it is sorted into 4 buckets
> automatically. But because Hive uses a hash partitioner by default, the data
> is only sorted within each bucket and isn't sorted across buckets.
> Sometimes we need the data to be globally sorted, for example to optimize
> indexing.
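To illustrate the point about per-bucket ordering, here is a minimal Python sketch. It is not Hive code: `hash_bucket` uses a plain modulo as a stand-in for the real hash function, and the sample `userids` are made up for the example.

```python
def hash_bucket(userid: int, num_buckets: int) -> int:
    # Assumption: plain modulo stands in for the hash partitioner.
    return userid % num_buckets


def write_buckets(userids, num_buckets, partition):
    # Route every row to a bucket, then sort each bucket independently,
    # mirroring CLUSTERED BY ... SORTED BY ... INTO n BUCKETS.
    buckets = [[] for _ in range(num_buckets)]
    for userid in userids:
        buckets[partition(userid, num_buckets)].append(userid)
    return [sorted(bucket) for bucket in buckets]


userids = [17, 4, 9, 22, 1, 12, 30, 7]  # made-up sample data
buckets = write_buckets(userids, 4, hash_bucket)

# Reading the buckets back in bucket order does not give a sorted sequence.
concatenated = [userid for bucket in buckets for userid in bucket]
print(buckets)       # [[4, 12], [1, 9, 17], [22, 30], [7]]
print(concatenated)  # [4, 12, 1, 9, 17, 22, 30, 7]
```

Each bucket is sorted on its own, but the value ranges of the buckets overlap, so the table as a whole is not ordered by userid.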
> If we sample the table first and use TotalOrderPartitioner, this can be
> done. The difficulty is deciding automatically when to use
> TotalOrderPartitioner and when not to, because an insertion query can be
> complex, which results in a complex DAG in Tez.
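The sample-then-range-partition idea can be sketched in Python as follows. This only shows the principle behind a total-order partitioner, not Hadoop's TotalOrderPartitioner itself; the helper names `choose_split_points` and `range_bucket`, the sample size, and the random data are all assumptions made for the example.

```python
import bisect
import random


def choose_split_points(sample, num_buckets):
    # Pick num_buckets - 1 cut points at evenly spaced quantiles of the sample.
    ordered = sorted(sample)
    return [ordered[len(ordered) * i // num_buckets]
            for i in range(1, num_buckets)]


def range_bucket(key, split_points):
    # Keys below the first cut point go to bucket 0, keys at or above the
    # last cut point go to the last bucket.
    return bisect.bisect_right(split_points, key)


rng = random.Random(0)
userids = [rng.randrange(1_000_000) for _ in range(1000)]  # made-up data
sample = rng.sample(userids, 100)                          # sampling pass
split_points = choose_split_points(sample, 4)

buckets = [[] for _ in range(4)]
for userid in userids:
    buckets[range_bucket(userid, split_points)].append(userid)
buckets = [sorted(bucket) for bucket in buckets]

# Bucket i only holds keys smaller than those in bucket i + 1, so reading
# the buckets in order yields a globally sorted sequence.
globally_sorted = [userid for bucket in buckets for userid in bucket]
```

The quality of the sample only affects how evenly the rows are spread over the buckets. The global order holds for any choice of cut points.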
> I have implemented a temporary version. It uses a custom partitioner that
> combines the hash partitioner and the total-order partitioner. A physical
> optimizer is added to Hive to decide which partitioner to use. But to reduce
> the workload, this version modifies the Tez source code, which is not
> actually necessary.
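One possible shape for such a combined partitioner, sketched in Python. The patch itself is not shown in this message, so everything here is hypothetical: the class `HybridPartitioner`, the function `choose_partitioner`, and the decision rule (range-partition only when the edge feeds the final write into a clustered and sorted table) are assumptions for illustration only.

```python
import bisect
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class HybridPartitioner:
    # Hypothetical partitioner: range mode if split points are given,
    # hash mode otherwise.
    num_partitions: int
    split_points: Optional[Sequence[int]] = None

    def __post_init__(self):
        if (self.split_points is not None
                and len(self.split_points) != self.num_partitions - 1):
            raise ValueError("need exactly num_partitions - 1 split points")

    def get_partition(self, key: int) -> int:
        if self.split_points is not None:
            # Total-order mode: locate the key's range among the cut points.
            return bisect.bisect_right(self.split_points, key)
        # Hash mode (assumption: plain modulo stands in for the hash).
        return key % self.num_partitions


def choose_partitioner(feeds_sorted_bucketed_sink, num_buckets,
                       split_points=None):
    # Hypothetical optimizer rule: use range partitioning only on the edge
    # that writes the clustered and sorted table, and only if a sampling
    # step produced split points. Every other edge keeps hash partitioning.
    if feeds_sorted_bucketed_sink and split_points is not None:
        return HybridPartitioner(num_buckets, split_points)
    return HybridPartitioner(num_buckets)


final_edge = choose_partitioner(True, 4, [10, 20, 30])
other_edge = choose_partitioner(False, 4)
print(final_edge.get_partition(5), final_edge.get_partition(35))  # 0 3
print(other_edge.get_partition(10))                               # 2
```

Keeping both modes behind one interface means the rest of the DAG does not need to know which strategy an edge uses. Only the optimizer's choice changes.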
> I'm wondering if we can implement a more general version that addresses this
> issue.