Richard Bross created TEZ-3908:
----------------------------------
Summary: Tez fails to create files for all Hive buckets specified
in DDL
Key: TEZ-3908
URL: https://issues.apache.org/jira/browse/TEZ-3908
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.8.4
Reporter: Richard Bross
When the Hive DDL specifies a clustering clause, e.g.
"CLUSTERED BY (x) INTO n BUCKETS",
Tez may not create all the bucket files if the data is sparse, causing query
failures with Presto.
When an INSERT OVERWRITE is done on a partition, the MapReduce engine would
always create the proper (as defined in the metastore DDL) number of bucket
files.
Tez only creates bucket files if there will be data in them. When the data is
too sparse to force all files to be created, the number of files in the
partition no longer matches the bucket count recorded in the Hive metastore.
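The mismatch can be detected by comparing the file names in a partition directory against the declared bucket count. The following is a minimal sketch, not part of Hive, Tez, or Presto: it assumes bucket files are named "<bucket>_<attempt>" (for example "000003_0"), and the function name missing_buckets is hypothetical.

```python
import re

# Assumption: bucket files are named "<bucket>_<attempt>", e.g. "000003_0",
# possibly followed by an extension. Other files (e.g. "_SUCCESS") are ignored.
_BUCKET_FILE = re.compile(r"^(\d+)_\d+")


def missing_buckets(file_names, num_buckets):
    """Return the bucket ids in [0, num_buckets) that have no file."""
    present = set()
    for name in file_names:
        match = _BUCKET_FILE.match(name)
        if match:
            present.add(int(match.group(1)))
    return [b for b in range(num_buckets) if b not in present]


# A 4-bucket table whose partition only received rows for buckets 0 and 2.
print(missing_buckets(["000000_0", "000002_0"], 4))  # prints [1, 3]
```

Any partition for which this returns a non-empty list is one that would trigger the failure described here.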
Dependent applications, such as Presto, will then fail to execute queries
(please see https://github.com/prestodb/presto/issues/10301).
There should be a configuration variable that forces Tez to create all the
bucket files, even those that will be 0 length, as MapReduce did, so that the
files match the metastore and the behavior stays backward compatible.
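To illustrate the requested behavior, here is a rough sketch of padding a partition with zero-length bucket files after the fact. It is an assumption-laden illustration, not a Tez feature: it works on a local directory through pathlib (a real table on HDFS or S3 would need the matching filesystem client), it assumes the "<bucket>_<attempt>" file naming, and pad_bucket_files is a hypothetical helper. Whether a reader accepts an empty file may also depend on the storage format.

```python
import re
from pathlib import Path

# Assumption: bucket files are named "<bucket>_<attempt>", e.g. "000003_0".
_BUCKET_FILE = re.compile(r"^(\d+)_\d+")


def pad_bucket_files(partition_dir, num_buckets):
    """Create zero-length files for buckets that have none.

    Returns the names of the files that were created.
    """
    directory = Path(partition_dir)
    present = set()
    for path in directory.iterdir():
        match = _BUCKET_FILE.match(path.name)
        if match:
            present.add(int(match.group(1)))

    created = []
    for bucket in range(num_buckets):
        if bucket not in present:
            name = f"{bucket:06d}_0"  # hypothetical name following the assumed pattern
            (directory / name).touch()
            created.append(name)
    return created
```

Calling pad_bucket_files(path, n) a second time on the same directory creates nothing, so the helper is safe to rerun after each INSERT OVERWRITE.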
Since Hive has deprecated MapReduce in favor of Tez, any existing users who
meet all of the following conditions will have query failures:
* Have a table with CLUSTERED BY
* Have a partition whose data is too sparse to force
Tez to create all bucket files
* Query with Presto
If a query includes *any* partition without the full complement of buckets, the
Presto query will fail.
As a real world example, our inserts are done with Hive/Tez and our query UIs
are all set up to use Presto/Tez. We have run into these failures and
currently the only non-hack fix is to refactor our DDL to not use CLUSTERED BY.