[jira] [Updated] (HIVE-16972) FetchOperator: filter out inputSplits which length is zero

Chaozhong Yang (JIRA) Tue, 27 Jun 2017 06:47:17 -0700

     [ https://issues.apache.org/jira/browse/HIVE-16972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Chaozhong Yang updated HIVE-16972:
----------------------------------
    Description: 
* Background
   The basic workflow of a common HQL query is:
  1. compile and execute
  2. fetch results
  In many cases, we don't need to worry about fetching results from HDFS 
(which applies only when MapReduce jobs are generated in the planning step). 
However, the number of result files on HDFS and the distribution of data 
across them affect the final status of an HQL query, especially in 
HiveServer2. We have some map-only queries, e.g.: 
{code:sql}
select * from myTable where date > '20170101' and date <= '20170301' and id = 88;
{code}
    This query generates more than 20,000 files on HDFS (see the attached 
screenshot), and most of those files are empty; the results are very sparse. 
If we send a TFetchResultsRequest from a HiveServer2 client with parameters 
such as timeout: 90s and maxRows: 1024, FetchOperator cannot fetch 1024 rows 
within 90 seconds, and the HiveServer2 client marks the TFetchResultsRequest 
as failed due to timeout. Why? Fetching results from an empty file is 
expensive. In our HDFS cluster (5000+ DataNodes), reading an empty file costs 
almost 100 ms, so 1000 empty files already take 100 ms * 1000 = 100 s, which 
exceeds the 90 s timeout. We can therefore filter out empty files or splits 
to speed up FetchResults. 
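
The idea can be illustrated with a minimal, self-contained Java sketch. It is 
not the attached patch and does not use Hive or Hadoop classes: the Split 
interface, StubSplit, dropEmpty, and openCostSeconds are hypothetical names 
introduced only for this illustration, with Split standing in for a split 
type that reports its length in bytes. The sketch assumes the reported length 
is reliable; splits from input formats that do not report a byte length would 
need different handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EmptySplitFilter {

    /** Hypothetical stand-in for an input split; only the byte length matters here. */
    public interface Split {
        long getLength();
    }

    /** Hypothetical split backed by a file path and a byte length. */
    public static final class StubSplit implements Split {
        private final String path;
        private final long length;

        public StubSplit(String path, long length) {
            this.path = path;
            this.length = length;
        }

        public String getPath() {
            return path;
        }

        @Override
        public long getLength() {
            return length;
        }
    }

    /** Returns a new list holding only the splits whose length is greater than zero. */
    public static <T extends Split> List<T> dropEmpty(List<T> splits) {
        List<T> nonEmpty = new ArrayList<>();
        for (T split : splits) {
            if (split.getLength() > 0) {
                nonEmpty.add(split);
            }
        }
        return nonEmpty;
    }

    /** Estimated seconds spent reading fileCount files at perFileMillis each. */
    public static double openCostSeconds(int fileCount, long perFileMillis) {
        return fileCount * perFileMillis / 1000.0;
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
                new StubSplit("part-00000", 0L),
                new StubSplit("part-00001", 128L),
                new StubSplit("part-00002", 0L),
                new StubSplit("part-00003", 7L));

        List<Split> kept = dropEmpty(splits);
        System.out.println("kept " + kept.size() + " of " + splits.size() + " splits");

        // The estimate from the description: 1000 empty files at about 100 ms each.
        System.out.println("estimated cost (s): " + openCostSeconds(1000, 100L));
    }
}
```

With the empty splits dropped before any reader is opened, the per-file cost 
is paid only for files that can actually contribute rows to the fetch.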

  


> FetchOperator: filter out inputSplits which length is zero
> ----------------------------------------------------------
>
>                 Key: HIVE-16972
>                 URL: https://issues.apache.org/jira/browse/HIVE-16972
>             Project: Hive
>          Issue Type: Improvement
>          Components: Physical Optimizer
>    Affects Versions: 2.1.0, 2.1.1
>            Reporter: Chaozhong Yang
>            Assignee: Chaozhong Yang
>             Fix For: 2.1.2
>
>         Attachments: HIVE-16972.patch, screenshot-1.png
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
