Regarding the table below: for the dfs case, 33 seconds went to execution (which includes parquet reader initialization) and 60 seconds to planning. A rough sketch of the lazy reader initialization idea is appended after the quoted thread.
> On Sep 24, 2015, at 10:01 PM, Jinfeng Ni <jinfengn...@gmail.com> wrote:
>
> "FragmentExecutor took 1,070,926 ms to create RecordBatch tree."
>
> 1,070,926 ms ≈ 17.8 minutes. In other words, the majority of the 18
> minutes of execution in the hive case is spent on the initialization of
> the Hive readers. If we want to improve "limit n", we should probably
> make initialization of the Hive readers "lazy": do the initialization
> only when Drill actually has to read rows from a reader. Otherwise,
> initializing all the readers before reading a single row means a long
> setup time for a "limit n" query when n is relatively small.
>
> For the second case, the 94-second query time seems too long as well.
> I guess most of the time is spent on parquet reader initialization (?)
>
> On Thu, Sep 24, 2015 at 9:32 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:
>> Hey y'all,
>>
>> ### Short Question:
>>
>> How do we improve performance of SELECT * FROM plugin.table LIMIT 0?
>>
>> ### Extended Question:
>>
>> While investigating DRILL-3623
>> <https://issues.apache.org/jira/browse/DRILL-3623>, I did an analysis to
>> see where we spend time for a SELECT * FROM hive.table LIMIT 0 query.
>>
>> ## Setup:
>> Copy drill/sample-data/region.parquet (x 20000) into a DFS (MapR-FS in
>> my case) directory named region. Create a Hive external table pointing
>> to region. Run Drill with the default configuration.
>>
>> ## Now there are two ways to query this table:
>>
>>> SELECT * FROM hive.region LIMIT 0;
>> +--------------+---------+------------+
>> | r_regionkey  | r_name  | r_comment  |
>> +--------------+---------+------------+
>> +--------------+---------+------------+
>> No rows selected (1203.179 seconds)
>> ...
>>
>>> SELECT * FROM dfs.test.region LIMIT 0;
>> +--------------+---------+------------+
>> | r_regionkey  | r_name  | r_comment  |
>> +--------------+---------+------------+
>> +--------------+---------+------------+
>> No rows selected (94.396 seconds)
>>
>> Currently, we use HiveRecordReader in the first case and
>> ParquetRecordReader in the second case. With DRILL-3209
>> <https://issues.apache.org/jira/browse/DRILL-3209>, both queries will
>> use ParquetRecordReader. However, for formats that are non-native to
>> Drill, or for other storage plugins, we still face this problem.
>> Summarizing the query profiles:
>>
>> +-------+-----------+---------------+----------------+
>> | Query | Fragments | Planning time | Execution time |
>> +-------+-----------+---------------+----------------+
>> | hive  | 1         | ~2 min        | ~18 min        |
>> | dfs   | 1         | ~1 min        | ~33 sec        |
>> +-------+-----------+---------------+----------------+
>>
>> ## The time hogs:
>>
>> # Planning time in both cases needs to improve. How?
>>
>> # With respect to execution, in the first case the ImplCreator.getExec(…)
>> call in the FragmentExecutor took 1,070,926 ms to create the RecordBatch
>> tree. There are 20,000 readers being initialized in HiveScanBatchCreator.
>> How do we avoid this? What are the implications of chained impersonation
>> (opening readers in ctor() rather than in setup())?
>>
>> ### Extending further:
>>
>> This can be generalized to any "LIMIT n" query where n is a small number.
>> For n > 0, we parallelize scanning, so a LIMIT 1 query runs faster than
>> LIMIT 0. However, there is a "sweet spot" for n, beyond which
>> parallelization hurts.
>>
>> ###
>>
>> Thank you,
>> Sudheesh
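For reference, here is a minimal sketch of the lazy-initialization idea Jinfeng describes above. The Reader interface, LazyReader class, and LazyReaderDemo driver below are simplified stand-ins of my own, not Drill's actual RecordReader or HiveScanBatchCreator APIs; the only point is that the expensive open moves from operator-tree setup to the first call that actually needs rows, so a LIMIT 0 plan that never reads pays no per-reader setup cost.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// NOTE: "Reader" is a simplified stand-in, not Drill's RecordReader API.
interface Reader {
    void open();     // expensive: open files, contact the metastore, etc.
    boolean next();  // advance; returns false when exhausted
    void close();
}

// Defers the expensive open() until rows are actually requested.
class LazyReader implements Reader {
    private final Supplier<Reader> factory; // cheap: just captures ctor args
    private Reader delegate;                // created on first real read

    LazyReader(Supplier<Reader> factory) { this.factory = factory; }

    private Reader delegate() {
        if (delegate == null) {
            delegate = factory.get();
            delegate.open(); // cost paid here, not during setup
        }
        return delegate;
    }

    @Override public void open() {
        // Intentionally a no-op: for LIMIT 0 the real open may never happen.
    }

    @Override public boolean next() { return delegate().next(); }

    @Override public void close() {
        if (delegate != null) {
            delegate.close(); // readers that were never used cost nothing
        }
    }
}

public class LazyReaderDemo {
    public static void main(String[] args) {
        // Build the equivalent of 20,000 scan readers; none of them opens.
        List<Reader> readers = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            final int split = i;
            readers.add(new LazyReader(() -> new Reader() {
                @Override public void open() {
                    System.out.println("expensive open of split " + split);
                }
                @Override public boolean next() { return false; }
                @Override public void close() { }
            }));
        }
        // A LIMIT 0 plan never calls next(), so nothing above prints.
        readers.forEach(Reader::close);
    }
}

One caveat, tied to the chained-impersonation question above: deferring open() also defers the point at which a reader acquires its security context, so a lazy scheme would presumably need to capture (or re-establish) the impersonated user at setup time and apply it at first read.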