[ 
https://issues.apache.org/jira/browse/TEZ-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V resolved TEZ-3908.
--------------------------
    Resolution: Information Provided

Discussion on the mailing list covers all the cases discussed here.

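Regarding [~findepi]'s comments specifically, here is the problem with 
alphabetical sorting + index assignment: after a second insert each bucket 
has more than one file, so a file's position in the sorted listing no longer 
matches its bucket number.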

{code}
create external table bucketed (x int) clustered by (x) into 4 buckets stored as orc;
insert into bucketed values(1),(2),(3),(4);
insert into bucketed values(1),(2),(3),(4);

0: jdbc:hive2://localhost:2181/> dfs -ls /apps/hive/warehouse/bucketed;

| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000000_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000000_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000001_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000001_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000002_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000002_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000003_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000003_0_copy_1 |
{code}
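
To make the mismatch concrete, here is a minimal Python sketch. It is not Hive, Tez, or Presto code, and the function names `index_by_sorted_position` and `bucket_from_name` are hypothetical. It assumes that the bucket id is the leading numeric component of the file name, as the listing above suggests, and compares that id with the index each file would get from its position in the alphabetically sorted listing.

```python
import re

# File names taken from the listing above.
FILES = [
    "000000_0", "000000_0_copy_1",
    "000001_0", "000001_0_copy_1",
    "000002_0", "000002_0_copy_1",
    "000003_0", "000003_0_copy_1",
]


def index_by_sorted_position(names):
    """Assign an index to each file by its position in the sorted listing."""
    return {name: i for i, name in enumerate(sorted(names))}


def bucket_from_name(name):
    """Assumption: the digits before the first '_' are the bucket id."""
    match = re.match(r"(\d+)_", name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return int(match.group(1))


by_position = index_by_sorted_position(FILES)
by_name = {name: bucket_from_name(name) for name in FILES}

# Files whose sorted position disagrees with the id encoded in the name.
mismatched = [n for n in sorted(FILES) if by_position[n] != by_name[n]]

for name in sorted(FILES):
    print(f"{name:<18} position={by_position[name]} name_id={by_name[name]}")
```

Under these assumptions only `000000_0` agrees with its sorted position; the other seven files get a different index, and positions 4 through 7 fall outside the 4 buckets declared in the DDL.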

> Tez fails to create files for all Hive buckets specified in DDL
> ---------------------------------------------------------------
>
>                 Key: TEZ-3908
>                 URL: https://issues.apache.org/jira/browse/TEZ-3908
>             Project: Apache Tez
>          Issue Type: New Feature
>    Affects Versions: 0.8.4
>            Reporter: Richard Bross
>            Priority: Major
>
> When the Hive DDL specifies a clustering clause, i.e. 
> "CLUSTERED BY (col) INTO n BUCKETS",
> Tez may not create all the bucket files if the data is sparse, causing query 
> failures with Presto.
> When an INSERT OVERWRITE is done on a partition, the MapReduce engine would 
> always create the proper (as defined in the metastore DDL) number of bucket 
> files.
> Tez only creates bucket files if there will be data in them.  When the data 
> is too sparse to force all files to be created, the file count no longer 
> matches the bucket count recorded in the Hive metastore.
> Dependent applications, such as Presto, will then fail to execute 
> queries (please see https://github.com/prestodb/presto/issues/10301).
> There should be a conf var that forces Tez to create all the bucket files, 
> even those that will be 0 length, so that there is a metastore match as well 
> as backwards compatibility, as MapReduce did.
> Since Hive has deprecated MapReduce in favor of Tez, any existing users who 
> meet all of the following conditions will have query failures:
>  * Have a table with CLUSTERED BY 
>  * Have a partition whose data is too sparse to force Tez to create all 
> bucket files
>  * Query with Presto
> If a query includes *any* partition without the full complement of buckets, 
> the Presto query will fail.
> As a real world example, our inserts are done with Hive/Tez and our query UIs 
> are all set up to use Presto.  We have run into these failures and currently 
> the only non-hack fix is to refactor our DDL to not use CLUSTERED BY.
> As a note - the Presto developers indicated in the issue referenced above 
> that they build a query plan for a bucketed table, so when they don't find 
> the correct number of buckets they are not sure that there is a performant 
> work-around.
>  
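
As an illustration of the condition described in the report, the following Python sketch lists which declared buckets have no file in a partition directory. It is only a sketch under stated assumptions: the helper name `missing_buckets` and the sample file list are hypothetical, and it assumes the bucket id is the leading numeric component of each file name.

```python
import re


def missing_buckets(file_names, declared_buckets):
    """Return the bucket ids in [0, declared_buckets) that have no file.

    Assumption: the digits before the first '_' in a file name are the
    bucket id; names that do not match this pattern are ignored.
    """
    present = set()
    for name in file_names:
        match = re.match(r"(\d+)_", name)
        if match is not None:
            present.add(int(match.group(1)))
    return [b for b in range(declared_buckets) if b not in present]


# Hypothetical sparse partition: only two of four declared buckets got data.
sparse_partition = ["000000_0", "000002_0"]
print(missing_buckets(sparse_partition, 4))  # prints [1, 3]
```

A check like this could run after an insert to flag partitions where the file count would not match the bucket count in the metastore.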



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
