[jira] [Created] (TEZ-3908) Tez fails to create files for all Hive buckets specified in DDL

Richard Bross (JIRA) Mon, 02 Apr 2018 13:48:15 -0700

Richard Bross created TEZ-3908:
----------------------------------

             Summary: Tez fails to create files for all Hive buckets specified 
in DDL
                 Key: TEZ-3908
                 URL: https://issues.apache.org/jira/browse/TEZ-3908
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.8.4
            Reporter: Richard Bross



When the Hive DDL specifies a clustering clause, e.g.

"CLUSTERED BY (x) INTO n BUCKETS",

Tez may not create all the bucket files if the data is sparse, causing query 
failures with Presto.

When an INSERT OVERWRITE is done on a partition, the MapReduce engine would 
always create the proper (as defined in the metastore DDL) number of bucket 
files.

Tez only creates bucket files if there will be data in them.  When the data is 
too sparse to populate every bucket, fewer files are written than the bucket 
count recorded in the Hive metastore, and the two no longer match.
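
To illustrate how sparse data can leave buckets without a file, here is a 
minimal sketch in Python. It assumes an integer clustering column whose hash is 
the value itself, masked to be non-negative and taken modulo the bucket count. 
That matches my understanding of Hive's default bucketing for integers, but 
treat it as an assumption. The function names are hypothetical.

```python
def bucket_for(value: int, num_buckets: int) -> int:
    """Return the bucket an integer key would land in.

    Assumption: the key hashes to itself and is masked to a non-negative
    31-bit value before the modulo, as a Java-style hash would be.
    """
    return (value & 0x7FFFFFFF) % num_buckets


def missing_buckets(values, num_buckets: int):
    """List bucket ids that receive no rows for the given key values."""
    present = {bucket_for(v, num_buckets) for v in values}
    return [b for b in range(num_buckets) if b not in present]


if __name__ == "__main__":
    # Three rows whose keys all fall into the same bucket of a 4-bucket table.
    print(missing_buckets([1, 5, 9], 4))
```

With keys 1, 5 and 9 and 4 buckets, every row lands in bucket 1, so an engine 
that only writes non-empty buckets would produce one file instead of four.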

Dependent applications, such as Presto, will then fail to execute queries 
(please see https://github.com/prestodb/presto/issues/10301).

There should be a conf var that forces Tez to create all the bucket files, even 
those that will be 0 length, as MapReduce did, so that the files match the 
metastore and backwards compatibility is preserved.

Since Hive has deprecated MapReduce in favor of Tez, any existing users who 
meet all of the following conditions will have query failures:
 * Have a table with CLUSTERED BY 
 * Have a partition with sparse data that does not have enough rows to force 
Tez to create all bucket files
 * Query with Presto

If a query includes *any* partition without the full complement of buckets, the 
Presto query will fail.
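
As a rough way to find the affected partitions, the sketch below compares each 
partition's file listing with the expected bucket count. It assumes bucket 
files are named with a zero-padded bucket id followed by an underscore and an 
attempt number (for example 000001_0), and that the listings have already been 
fetched from the filesystem. The function names and the sample partition names 
are hypothetical.

```python
import re

# Assumption: bucket files look like "<bucket id>_<attempt>", e.g. "000001_0".
BUCKET_FILE = re.compile(r"^(\d+)_\d+")


def missing_bucket_ids(file_names, num_buckets):
    """Return bucket ids with no matching file in one partition directory."""
    present = set()
    for name in file_names:
        match = BUCKET_FILE.match(name)
        if match:
            present.add(int(match.group(1)))
    return [b for b in range(num_buckets) if b not in present]


def incomplete_partitions(partitions, num_buckets):
    """Map each partition lacking bucket files to its missing bucket ids."""
    result = {}
    for partition, file_names in partitions.items():
        missing = missing_bucket_ids(file_names, num_buckets)
        if missing:
            result[partition] = missing
    return result


if __name__ == "__main__":
    listings = {
        "dt=2018-04-01": ["000000_0", "000001_0", "000002_0", "000003_0"],
        "dt=2018-04-02": ["000001_0"],
    }
    print(incomplete_partitions(listings, 4))
```

Any partition reported here would be enough to make a Presto query that 
touches it fail.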

As a real-world example, our inserts are done with Hive/Tez and our query UIs 
are all set up to use Presto/Tez.  We have run into these failures, and 
currently the only non-hack fix is to refactor our DDL to not use CLUSTERED BY.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
