Chaozhong Yang created HIVE-16972:
-------------------------------------
Summary: FetchOperator: filter out inputSplits which length is zero
Key: HIVE-16972
URL: https://issues.apache.org/jira/browse/HIVE-16972
Project: Hive
Issue Type: Improvement
Components: HiveServer2, Physical Optimizer, Query Planning
Affects Versions: 2.1.1, 2.1.0
Reporter: Chaozhong Yang
Assignee: Chaozhong Yang
Fix For: 2.1.2
* Background
The basic workflow of a typical HQL query can be described as follows:
1. compile and execute
2. fetch results
In many cases, we don't need to worry about fetching results from HDFS
(provided that MapReduce jobs are generated in the planning step). However,
the number of result files on HDFS and the distribution of data among them
can affect the final status of an HQL query, especially with HiveServer2. We
have some map-only queries, e.g.:
{code:sql}
select * from myTable where date > '20170201' and date <= '20170301' and id = 88;
{code}
This query generates more than 10,000 files on HDFS, and most of those files
are empty, so the result rows are very sparse. If we send a
TFetchResultsRequest from a HiveServer2 client with the parameters
(timeout: 90s, maxRows: 1024), FetchOperator cannot fetch 1024 rows within 90
seconds, and our HiveServer2 client marks the TFetchResultsRequest as failed
due to timeout. Why? Fetching results from an empty file is expensive. In our
HDFS cluster (5000+ DataNodes), reading from an empty file costs almost
100 ms, so passing over 1000 empty files already takes 100 ms * 1000 = 100 s,
which exceeds the 90 s timeout. We can therefore filter out those empty files
or splits to speed up FetchResults.
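
As a rough illustration of the proposal, the filtering step could look like
the sketch below. This is not the actual FetchOperator patch: SplitInfo and
EmptySplitFilter are hypothetical names standing in for Hadoop's InputSplit
and its getLength() method, and the sketch assumes that a zero-length split
never contains rows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Hadoop input split; only the length matters here.
final class SplitInfo {
    final String path;
    final long length;

    SplitInfo(String path, long length) {
        this.path = path;
        this.length = length;
    }
}

final class EmptySplitFilter {
    // Keep only the splits that contain at least one byte.
    // Assumption: a zero-length split holds no rows, so skipping it is safe.
    static List<SplitInfo> dropEmpty(List<SplitInfo> splits) {
        List<SplitInfo> kept = new ArrayList<>();
        for (SplitInfo split : splits) {
            if (split.length > 0) {
                kept.add(split);
            }
        }
        return kept;
    }

    // Rough cost of opening every split at a fixed per-open cost.
    static long estimatedOpenMillis(int splitCount, long perOpenMillis) {
        return splitCount * perOpenMillis;
    }

    public static void main(String[] args) {
        // 1000 empty result files plus one file that actually has data.
        List<SplitInfo> splits = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            splits.add(new SplitInfo("part-" + i, 0L));
        }
        splits.add(new SplitInfo("part-1000", 4096L));

        List<SplitInfo> kept = dropEmpty(splits);
        System.out.println(splits.size() + " splits -> " + kept.size() + " after filtering");
        // Cost of the empty files alone, at roughly 100 ms per open.
        System.out.println(estimatedOpenMillis(1000, 100L) + " ms spent on empty files");
    }
}
```

With the numbers from this report, the 1000 empty files alone account for
about 100,000 ms, which is why dropping them before any file is opened keeps
the fetch inside the 90 s timeout.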