Stephen Measmer created HIVE-16870:
--------------------------------------
Summary: Give Hive the ability to suppress output of empty files
Key: HIVE-16870
URL: https://issues.apache.org/jira/browse/HIVE-16870
Project: Hive
Issue Type: Improvement
Components: StorageHandler
Reporter: Stephen Measmer
Today, some Hive queries that use joins can write zero-byte output files,
particularly on large joins. This can have a negative effect on HDFS, because
it contributes to the small-files problem [1].
A Cloudera Community thread [2] suggests using LazyOutputFormat as the
OutputFormat, because MapReduce can then be set to suppress the generation of
empty (zero-byte) files.
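To make the idea concrete, here is a minimal sketch of lazy output in plain
Java. It is an illustration only, not Hadoop's or Hive's implementation: the
class name LazyFileWriter and its methods are hypothetical, and it writes to
the local filesystem rather than HDFS. In a plain MapReduce driver, the
equivalent behaviour is usually enabled with
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical illustration of lazy output: the file is created only when
 * the first record is written, so a task that emits nothing leaves no
 * zero-byte file behind.
 */
final class LazyFileWriter implements AutoCloseable {
    private final Path path;
    private BufferedWriter writer; // stays null until the first write

    LazyFileWriter(Path path) {
        this.path = path;
    }

    /** Writes one record followed by a newline, opening the file on first use. */
    void write(String record) {
        try {
            if (writer == null) {
                // Deferred creation: this is the only place the file is opened.
                writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8);
            }
            writer.write(record);
            writer.newLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Closes the file if it was ever opened; otherwise does nothing. */
    @Override
    public void close() {
        if (writer == null) {
            return; // nothing was written, so no file exists
        }
        try {
            writer.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A writer that is opened and closed without any call to write() leaves no file
on disk, which is the behaviour this issue asks Hive to support for its own
output.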
However, it is not possible to create a Hive table whose OutputFormat is just
LazyOutputFormat. Below is what we found when testing:
CREATE TABLE mytable (fip INT, state STRING, zip STRING, level INT)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat';
------------
Error: Error while compiling statement: FAILED: SemanticException [Error
10055]: Output Format must implement HiveOutputFormat, otherwise it should be
either IgnoreKeyTextOutputFormat or SequenceFileOutputFormat
(state=42000,code=10055)
[1] http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
[2] https://community.cloudera.com/t5/Batch-Processing-and-Workflow/how-to-suppress-mapper-output-files-if-the-output-file-does-not/td-p/29540