Hey y'all,

### Short Question:
How do we improve the performance of SELECT * FROM plugin.table LIMIT 0?

### Extended Question:

While investigating DRILL-3623 <https://issues.apache.org/jira/browse/DRILL-3623>, I did an analysis to see where we spend time for a SELECT * FROM hive.table LIMIT 0 query.

## Setup:
Copy drill/sample-data/region.parquet (x 20000) into a DFS (MapR-FS in my case) directory named region.
Create a Hive external table pointing to region.
Run Drill with the default configuration.

## Now there are two ways to query this table:

> SELECT * FROM hive.region LIMIT 0;
+--------------+---------+------------+
| r_regionkey  | r_name  | r_comment  |
+--------------+---------+------------+
+--------------+---------+------------+
No rows selected (1203.179 seconds)
...
> SELECT * FROM dfs.test.region LIMIT 0;
+--------------+---------+------------+
| r_regionkey  | r_name  | r_comment  |
+--------------+---------+------------+
+--------------+---------+------------+
No rows selected (94.396 seconds)

Currently, we use HiveRecordReader in the first case and ParquetRecordReader in the second. With DRILL-3209 <https://issues.apache.org/jira/browse/DRILL-3209>, both queries will use ParquetRecordReader. However, for formats that are non-native to Drill, and for other storage plugins, we still face this problem.

Summarizing the query profiles:

+-------+-----------+---------------+----------------+
| Query | Fragments | Planning time | Execution time |
+-------+-----------+---------------+----------------+
| hive  | 1         | ~2 min        | ~18 min        |
| dfs   | 1         | ~1 min        | ~33 sec        |
+-------+-----------+---------------+----------------+

## The time hogs:
# Planning time needs to improve in both cases. How?
# With respect to execution, in the first case the ImplCreator.getExec(…) call in the FragmentExecutor took 1,070,926 ms to create the RecordBatch tree: there are 20,000 readers being initialized in HiveScanBatchCreator. How do we avoid this?
What are the implications for chained impersonation (opening readers in the ctor() rather than in setup())?

### Extending further:

This can be generalized to any "LIMIT n" query where n is a small number. For n > 0, we parallelize the scan, so a LIMIT 1 query runs faster than LIMIT 0. However, there is a sweet spot "n" beyond which parallelization hurts.

### Thank you,
Sudheesh
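P.S. For reference, one way to avoid initializing all 20,000 readers up front is to construct them lazily: the scan holds cheap per-split factories and opens each reader only when the previous one is exhausted, so a LIMIT 0 query (which only needs the schema from the first batch) opens at most one reader. This is only an illustrative sketch, not Drill's actual ScanBatch or HiveRecordReader API; the names SplitReader, LazyScan, and runDemo are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of on-demand reader creation (names are illustrative,
// not Drill APIs). Building a Supplier per split is cheap; the expensive
// reader construction (I/O, metadata fetch) happens only when a split is
// actually read.
public class LazyScanDemo {

    interface SplitReader {
        int readBatch();   // rows produced, 0 when the split is exhausted
        void close();
    }

    static class LazyScan {
        private final List<Supplier<SplitReader>> factories;
        private int next = 0;
        private SplitReader current = null;
        int readersOpened = 0;   // instrumentation for the demo

        LazyScan(List<Supplier<SplitReader>> factories) {
            this.factories = factories;   // no reader is opened here
        }

        /** Returns rows in the next batch, or -1 when all splits are done. */
        int nextBatch() {
            while (true) {
                if (current == null) {
                    if (next >= factories.size()) return -1;
                    current = factories.get(next++).get();  // open on demand
                    readersOpened++;
                }
                int rows = current.readBatch();
                if (rows > 0) return rows;
                current.close();
                current = null;
            }
        }
    }

    /** Simulates a LIMIT 0 plan over 20,000 splits; returns readers opened. */
    static int runDemo() {
        List<Supplier<SplitReader>> factories = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            factories.add(() -> new SplitReader() {
                private boolean done = false;
                public int readBatch() { if (done) return 0; done = true; return 5; }
                public void close() {}
            });
        }
        LazyScan scan = new LazyScan(factories);
        scan.nextBatch();   // LIMIT 0 stops after the first (schema) batch
        return scan.readersOpened;
    }

    public static void main(String[] args) {
        System.out.println("readers opened: " + runDemo());
    }
}
```

With eager construction, all 20,000 readers would be created before the first batch; here only one is. The open question above still applies: deferring reader creation to setup()/first-read time changes when impersonated credentials are used to open files.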