Joe McDonnell created IMPALA-7827:
-------------------------------------
Summary: Investigate increasing disk utilization by overlapping
file open with reads
Key: IMPALA-7827
URL: https://issues.apache.org/jira/browse/IMPALA-7827
Project: IMPALA
Issue Type: Improvement
Components: Backend
Affects Versions: Impala 3.2.0
Reporter: Joe McDonnell

Disk IO threads perform both the HDFS file open and the reads for ScanRanges.
Most HDFS file opens are served from the file handle cache. On a cache miss,
however, the Disk IO thread is tied up waiting on a round trip to the
NameNode. Depending on the number of Disk IO threads and the speed of the
NameNode, every Disk IO thread could be blocked in an HDFS file open call,
even when other ScanRanges have file handles available in the cache. Spinning
disks are the worst case: each disk has a single Disk IO thread, so if that
thread is tied up in an open call, the disk goes idle.

It might make sense for the open call to be serviced by a separate thread pool.
The ScanRange would go through a separate state transition that opens the file
handle. While the open call is in progress, the Disk IO thread could process
other ScanRanges that already have an open file handle (cached or otherwise).
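
To make the proposal concrete, here is a toy model of it in Python. It is only
a sketch and not Impala code: the names ScanRange, State, slow_open and run are
invented for illustration, and the sleep stands in for the NameNode round trip.
A small opener pool performs the opens, and a single "disk thread" only ever
sees ranges that already hold a handle.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    NEEDS_OPEN = auto()  # cache miss: no file handle yet
    READY = auto()       # handle available, a disk thread may read
    DONE = auto()        # read finished


@dataclass
class ScanRange:
    path: str
    state: State
    handle: object = None


def slow_open(path, delay=0.2):
    """Stand-in for an HDFS open that round-trips to the NameNode."""
    time.sleep(delay)
    return f"handle:{path}"


def run(ranges, num_openers=2):
    """Process all ranges; return the order in which reads completed."""
    ready = queue.Queue()  # ranges the disk thread may read right now
    completed = []

    def open_then_enqueue(r):
        # Runs on the opener pool, never on the disk thread.
        r.handle = slow_open(r.path)
        r.state = State.READY
        ready.put(r)

    def disk_thread():
        for _ in range(len(ranges)):
            r = ready.get()           # waits for a READY range, not for an open
            completed.append(r.path)  # stand-in for the actual disk read
            r.state = State.DONE

    with ThreadPoolExecutor(max_workers=num_openers) as openers:
        for r in ranges:
            if r.state is State.READY:
                ready.put(r)
            else:
                openers.submit(open_then_enqueue, r)
        t = threading.Thread(target=disk_thread)
        t.start()
        t.join()
    return completed
```

With one range that needs an open and two that already hold handles, the two
cached ranges are read while the open is still in flight, so the disk thread
is not idle during the open.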

This is complicated by the fact that file handles can't be used simultaneously
by multiple threads. To do the state transition properly, it needs to be clear
up front whether a new file handle is necessary. Keeping a file handle cache at
the RequestContext level and using preads (see IMPALA-6403) might make this
clear.
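
As a rough illustration of why positional reads help, the sketch below uses a
local file and POSIX pread (os.pread, Unix only) as an analogue. It is not
libhdfs or Impala code, and ContextHandleCache and read_range are hypothetical
names. Because pread takes an explicit offset and leaves the file position
untouched, several threads can share one descriptor, and a per-request cache
can answer "is a new open needed?" with a simple lookup.

```python
import os
import threading


class ContextHandleCache:
    """Hypothetical per-request cache holding one descriptor per path."""

    def __init__(self):
        self._lock = threading.Lock()
        self._fds = {}
        self.opens = 0  # number of real open calls issued

    def needs_open(self, path):
        # The question the state transition has to answer.
        with self._lock:
            return path not in self._fds

    def get(self, path):
        # Opening under the lock serializes opens; acceptable for a sketch.
        with self._lock:
            fd = self._fds.get(path)
            if fd is None:
                fd = os.open(path, os.O_RDONLY)
                self._fds[path] = fd
                self.opens += 1
            return fd

    def close_all(self):
        with self._lock:
            for fd in self._fds.values():
                os.close(fd)
            self._fds.clear()


def read_range(cache, path, offset, length):
    # os.pread reads at an explicit offset without moving the shared file
    # position, so concurrent callers can use the same descriptor.
    return os.pread(cache.get(path), length, offset)
```

Two threads reading different offsets of the same file through read_range end
up issuing a single open between them.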