Joe McDonnell created IMPALA-7827:
-------------------------------------

             Summary: Investigate increasing disk utilization by overlapping file open with reads
                 Key: IMPALA-7827
                 URL: https://issues.apache.org/jira/browse/IMPALA-7827
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
    Affects Versions: Impala 3.2.0
            Reporter: Joe McDonnell

Disk IO threads are responsible for both the HDFS file open and the reads for ScanRanges. Most HDFS file opens are served from the file handle cache. However, on a cache miss, the Disk IO thread is tied up waiting on a round trip to the NameNode. Depending on the number of Disk IO threads and the speed of the NameNode, all of the Disk IO threads could be blocked waiting on HDFS file open calls, even if there are ScanRanges that have file handles available in the cache. In particular, for spinning disks, there is a single Disk IO thread per disk. If this thread gets tied up in an open call, the disk will go idle.

It might make sense for the open call to be serviced by a separate thread pool. The ScanRange would go through a separate state transition that opens the file handle. The Disk IO thread can process ScanRanges that already have an open file handle (cached or otherwise) while the open call is in progress.

This is complicated by the fact that file handles can't be used by multiple threads simultaneously. In order to do the state transition properly, it needs to be clear whether a new file handle is necessary. Keeping a file handle cache at the RequestContext level and using preads (see IMPALA-6403) might make this clear.
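
To make the proposed state transition concrete, here is a rough, standalone sketch of the idea. It is not Impala code and does not use the DiskIoMgr API. The names RangeState, RangeQueue and RunScan are hypothetical, a single opener thread stands in for the separate thread pool, and a 20 ms sleep stands in for the NameNode round trip. Ranges whose file is in the (simulated) handle cache go straight to the disk thread's queue; the rest pass through the opener first, so the disk thread never blocks on an open.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical states for a scan range (illustrative, not Impala's).
enum class RangeState { NEEDS_OPEN, READY, DONE };

struct ScanRange {
  explicit ScanRange(std::string f) : file(std::move(f)) {}
  std::string file;
  RangeState state = RangeState::NEEDS_OPEN;
};

// Minimal blocking FIFO queue of scan ranges.
class RangeQueue {
 public:
  void Push(ScanRange* r) {
    {
      std::lock_guard<std::mutex> l(mu_);
      q_.push_back(r);
    }
    cv_.notify_one();
  }
  // Blocks until a range is available. Returns nullptr once the queue
  // has been closed and drained.
  ScanRange* Pop() {
    std::unique_lock<std::mutex> l(mu_);
    cv_.wait(l, [this] { return closed_ || !q_.empty(); });
    if (q_.empty()) return nullptr;
    ScanRange* r = q_.front();
    q_.pop_front();
    return r;
  }
  void Close() {
    {
      std::lock_guard<std::mutex> l(mu_);
      closed_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<ScanRange*> q_;
  bool closed_ = false;
};

// Runs every range through one opener thread and one disk thread and
// returns the order in which the disk thread "read" the files.
std::vector<std::string> RunScan(std::vector<ScanRange>& ranges,
                                 const std::unordered_set<std::string>& cached) {
  RangeQueue open_q, ready_q;
  std::vector<std::string> read_order;

  // Opener: pays the (simulated) open latency instead of the disk thread.
  std::thread opener([&] {
    while (ScanRange* r = open_q.Pop()) {
      std::this_thread::sleep_for(std::chrono::milliseconds(20));
      r->state = RangeState::READY;
      ready_q.Push(r);
    }
    ready_q.Close();  // No more ranges can become ready after this point.
  });

  // Disk thread: only ever sees ranges that already have a handle.
  std::thread disk([&] {
    while (ScanRange* r = ready_q.Pop()) {
      read_order.push_back(r->file);
      r->state = RangeState::DONE;
    }
  });

  for (ScanRange& r : ranges) {
    if (cached.count(r.file) > 0) {
      r.state = RangeState::READY;  // Cache hit: skip the open step.
      ready_q.Push(&r);
    } else {
      open_q.Push(&r);  // Cache miss: hand off to the opener.
    }
  }
  open_q.Close();
  opener.join();
  disk.join();
  return read_order;
}
```

The sketch sidesteps the complication described above: each range is owned by exactly one thread at a time, and there is no shared handle cache, so it says nothing about how to decide whether a new file handle is needed.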