Ferenc Erdelyi created HIVE-13613:
-------------------------------------
Summary: Add computeSplitSize() to CombineHiveInputFormat and
HiveInputFormat
Key: HIVE-13613
URL: https://issues.apache.org/jira/browse/HIVE-13613
Project: Hive
Issue Type: Improvement
Components: Hive
Affects Versions: 1.1.0
Reporter: Ferenc Erdelyi
The input formats that Hive uses (CombineHiveInputFormat and HiveInputFormat)
do not use computeSplitSize().
Neither class extends FileInputFormat, so that functionality is not available
to them.
computeSplitSize() could be used to tune Parquet file processing.
Please add computeSplitSize() functionality to CombineHiveInputFormat and
HiveInputFormat.
Use case:
We would like our Hive queries to automatically select the right split size
(and consequently the number of mappers) based on the data's block size,
because this gives us significant performance gains (e.g. when processing
Parquet files). The computeSplitSize() implementation in
https://github.com/cloudera/hadoop-common/blob/cdh5-2.6.0_5.5.2/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java
shows the behaviour I would expect.
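For reference, the rule that the linked FileInputFormat applies can be sketched as below. This is only a minimal sketch, assuming the mapreduce-API formula max(minSize, min(maxSize, blockSize)); the class name SplitSizeSketch and the example values are hypothetical and are not part of Hive or Hadoop.

```java
// Sketch only: SplitSizeSketch is a hypothetical helper, not Hive or Hadoop code.
final class SplitSizeSketch {

    // Same formula as the mapreduce FileInputFormat.computeSplitSize():
    // clamp the block size between the minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;

        // Illustrative values (assumptions): a 256 MB block size with
        // effectively unbounded minimum and maximum split sizes.
        long blockSize = 256 * mb;
        long minSize = 1L;
        long maxSize = Long.MAX_VALUE;

        // With these bounds the split size follows the block size.
        System.out.println(computeSplitSize(blockSize, minSize, maxSize));
    }
}
```

With the bounds left wide open, the split size tracks the file's block size, which is the behaviour requested here. A smaller maximum caps the split size, and a larger minimum raises it.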
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)