[jira] [Created] (IMPALA-6932) Simple LIMIT 1 query can be really slow on many-filed Avro datasets

Philip Zeyliger (JIRA) Wed, 25 Apr 2018 11:48:57 -0700

Philip Zeyliger created IMPALA-6932:
---------------------------------------


             Summary: Simple LIMIT 1 query can be really slow on many-filed 
Avro datasets
                 Key: IMPALA-6932
                 URL: https://issues.apache.org/jira/browse/IMPALA-6932
             Project: IMPALA
          Issue Type: Task
          Components: Backend
            Reporter: Philip Zeyliger


I recently ran into very slow behavior with the trivial
{{SELECT * FROM table LIMIT 1}} query. The table used Avro as its file format
and had about 45,000 files across about 250 partitions. An optimization kicked
in and set NUM_NODES to 1.

The query ran for about an hour, and the profile indicated that the time was 
spent opening files:
          - TotalRawHdfsOpenFileTime(*): 1.0h (3622833666032)
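
As a rough sanity check on that counter, the sketch below converts it to hours and to an average cost per file. It rests on two assumptions that are not stated in the profile itself: that the raw value in parentheses is in nanoseconds (which matches the "1.0h" shown next to it), and that each of the roughly 45,000 files was opened once. The constant and function names are made up for illustration.

```python
# Values copied from the report above. Two assumptions, both hypothetical:
#   1. the raw counter value is in nanoseconds;
#   2. every one of the ~45,000 files was opened exactly once.
TOTAL_OPEN_TIME_NS = 3_622_833_666_032
APPROX_FILE_COUNT = 45_000


def ns_to_seconds(ns: int) -> float:
    """Convert a nanosecond count to seconds."""
    return ns / 1e9


total_seconds = ns_to_seconds(TOTAL_OPEN_TIME_NS)
total_hours = total_seconds / 3600
# Average open cost per file, in milliseconds, under assumption 2.
per_file_ms = total_seconds / APPROX_FILE_COUNT * 1000

print(f"total open time: {total_seconds:.1f} s ({total_hours:.2f} h)")
print(f"average per file: {per_file_ms:.1f} ms")
```

Under those assumptions the average works out to roughly 80 ms per file open, so the hour would be explained by many modest opens performed one after another rather than by a single stuck open.
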
I took a single minidump while this query was running, and I suspect the query 
was here:
{code:java}
1 impalad!impala::ScannerContext::Stream::GetNextBuffer(long) [scanner-context.cc : 115 + 0x13]
2 impalad!impala::ScannerContext::Stream::GetBytesInternal(long, unsigned char**, bool, long*) [scanner-context.cc : 241 + 0x5]
3 impalad!impala::HdfsAvroScanner::ReadFileHeader() [scanner-context.inline.h : 54 + 0x1f]
4 impalad!impala::BaseSequenceScanner::GetNextInternal(impala::RowBatch*) [base-sequence-scanner.cc : 157 + 0x13]
5 impalad!impala::HdfsScanner::ProcessSplit() [hdfs-scanner.cc : 129 + 0xc]
6 impalad!impala::HdfsScanNode::ProcessSplit(std::vector<impala::FilterContext, std::allocator<impala::FilterContext> > const&, impala::MemPool*, impala::io::ScanRange*) [hdfs-scan-node.cc : 527 + 0x17]
7 impalad!impala::HdfsScanNode::ScannerThread() [hdfs-scan-node.cc : 437 + 0x1c]
8 impalad!impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long>*) [function_template.hpp : 767 + 0x7]
{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
