[
https://issues.apache.org/jira/browse/TEZ-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gopal V resolved TEZ-3908.
--------------------------
Resolution: Information Provided
Discussion on the mailing list covers all the cases discussed here.
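Regarding [~findepi]'s comments specifically, here's the problem with
alphabetical sorting + index assignment: after a second insert, each bucket
has two files, so a file's position in the sorted listing no longer equals
its bucket number.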
{code}
create external table bucketed (x int) clustered by (x) into 4 buckets stored as orc;
insert into bucketed values(1),(2),(3),(4);
insert into bucketed values(1),(2),(3),(4);
0: jdbc:hive2://localhost:2181/> dfs -ls /apps/hive/warehouse/bucketed;
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000000_0 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000000_0_copy_1 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000001_0 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000001_0_copy_1 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000002_0 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000002_0_copy_1 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000003_0 |
| -rw-r--r-- 3 hive hdfs 181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000003_0_copy_1 |
{code}
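To make the mismatch concrete, here is a minimal Python sketch (not code from Tez, Hive, or Presto) that takes the eight file names from the listing and compares two ways of assigning a bucket to each file. The helper names are hypothetical, and reading the bucket id from the leading digits of the file name is an assumption based on how the listing is laid out.

```python
import re

# File names copied from the dfs -ls listing above.
FILES = [
    "000000_0", "000000_0_copy_1",
    "000001_0", "000001_0_copy_1",
    "000002_0", "000002_0_copy_1",
    "000003_0", "000003_0_copy_1",
]

def bucket_by_sorted_index(files):
    """Naive scheme: the i-th file in alphabetical order is treated as bucket i."""
    return {name: i for i, name in enumerate(sorted(files))}

def bucket_by_name_prefix(files):
    """Assumed alternative: take the bucket id from the leading digits of the name."""
    return {name: int(re.match(r"(\d+)_", name).group(1)) for name in files}

by_index = bucket_by_sorted_index(FILES)
by_name = bucket_by_name_prefix(FILES)

# Files for which the two schemes disagree.
mismatched = [name for name in sorted(FILES) if by_index[name] != by_name[name]]

for name in sorted(FILES):
    print(f"{name:18} sorted-index={by_index[name]} name-prefix={by_name[name]}")
```

With the second insert's `_copy_1` files present, the sorted-index scheme produces indices 0 through 7 for a table declared with 4 buckets, and only `000000_0` ends up with the same bucket under both schemes.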
> Tez fails to create files for all Hive buckets specified in DDL
> ---------------------------------------------------------------
>
> Key: TEZ-3908
> URL: https://issues.apache.org/jira/browse/TEZ-3908
> Project: Apache Tez
> Issue Type: New Feature
> Affects Versions: 0.8.4
> Reporter: Richard Bross
> Priority: Major
>
> When the Hive DDL specifies a clustering clause, i.e.
> "CLUSTERED BY (x) INTO n BUCKETS",
> Tez may not create all the bucket files if the data is sparse, causing query
> failures with Presto.
> When an INSERT OVERWRITE is done on a partition, the MapReduce engine would
> always create the proper (as defined in the metastore DDL) number of bucket
> files.
> Tez only creates bucket files if there will be data in them. When the data
> is too sparse to force all files to be created, the files on disk no longer
> match the bucket count recorded in the Hive metastore.
> Dependent applications, such as Presto, will then fail to execute
> queries (please see https://github.com/prestodb/presto/issues/10301).
> There should be a conf var that forces Tez to create all the bucket files,
> even those that will be 0 length, so that there is a metastore match as well
> as backwards compatibility, as MapReduce did.
> Since Hive has deprecated MapReduce in favor of Tez, existing users who
> meet all of the following conditions will have query failures:
> * Have a table with CLUSTERED BY
> * Have a partition whose data is too sparse to force Tez to create all
> bucket files
> * Query with Presto
> If a query includes *any* partition without the full complement of buckets,
> the Presto query will fail.
> As a real world example, our inserts are done with Hive/Tez and our query UIs
> are all set up to use Presto. We have run into these failures and currently
> the only non-hack fix is to refactor our DDL to not use CLUSTERED BY.
> As a note - the Presto developers indicated in the issue referenced above
> that they build a query plan for a bucketed table, so when they don't find
> the correct number of buckets they are not sure that there is a performant
> work-around.
>