Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4623: Thread level file handle caching
......................................................................


Patch Set 1:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/6478/1/be/src/runtime/disk-io-mgr.cc
File be/src/runtime/disk-io-mgr.cc:

Line 70: DEFINE_uint64(max_cached_file_handles, 10000, "Maximum number of HDFS 
file handles "
> on further consideration, esp. in light of what tim brought up (separate re
Comment out of date.


Line 388:   std::vector<int> num_threads_per_disk;
> don't use std:: in .cc files
Done


Line 1050: void DiskIoMgr::WorkLoop(DiskQueue* disk_queue, int 
thread_file_handle_cache_size) {
> you use 'fh' elsewhere for 'file handle', feel free to use that elsewhere i
removed


Line 1052:   if (thread_file_handle_cache_size > 0) {
> dcheck that it's not < 0
removed


Line 1081:       Write(worker_context, static_cast<WriteRange*>(range));
> what about writing?
We currently don't do writes to HDFS via the IO manager. The writes here are to 
local files.

When we are writing tables, we open an Hdfs file handle in hdfs-table-sink.cc. 
I don't think these file handles can be reused, because we open it in write 
only mode. The file handle cache opens in read only mode. I don't see any way 
to convert them.


Line 1337:   : capacity_(capacity) {}
> move to .h
removed


Line 1339: DiskIoMgr::ThreadFileHandleCache::~ThreadFileHandleCache() {}
> remove
removed


Line 1343:   std::string file_key = std::string(fname) + std::to_string(mtime);
> is this guaranteed to be unique w/o some sort of separator?
removed


Line 1351:       EvictFileHandle();
> you only call this once, and it's a small function, you might as well inlin
removed


http://gerrit.cloudera.org:8080/#/c/6478/1/be/src/runtime/disk-io-mgr.h
File be/src/runtime/disk-io-mgr.h:

Line 233:   /// This is a single-threaded LRU cache for Hdfs file handles. The 
cache creates and
> I guess I was thinking that each thread would temporarily check out a file 
The approach changed to using a single cache at the DiskIoMgr level like the 
existing cache.


Line 235:   class ThreadFileHandleCache {
> 'thread' is generic. LruFileHandleCache? or simply FileHandleCache?
Removed.


Line 240:     /// Gets an Hdfs file handle for the specified file. It will 
lookup and use a
> "look up"
Removed


Line 245:     HdfsCachedFileHandle* GetFileHandle(const hdfsFS& fs, const char* 
fname, int64_t mtime);
> long line
Removed


Line 248: 
> remove blank line
Removed


Line 256:     class LruListElement {
> make it a struct, class member vars are supposed to be private.
Removed


Line 260:       ~LruListElement() {}
> remove
Removed


Line 533:     /// in the reader context.
> "reader context" = reader_ member. referencing the member directly is more 
N/A in new code


Line 544:     /// the DiskIoRequestContext. This clears the statistics on this 
file handle.
> why not make that a function of diskiorequestcontext
The two statistics that need input from the ScanRange are 
unexpected_remote_bytes_ and num_remote_ranges_.

The unexpected_remote_bytes_ requires knowing the value of expected_local_ for 
the ScanRange. There is some tracing that prints when a ScanRange has 
unexpected remote bytes (see ScanRange::Close for where it is now). The tracing 
is the only thing that would prevent this from being put directly into 
DiskIoMgrContext if we passed in expected_local_ or the ScanRange to the 
function.

num_remote_ranges_ is impacted by the fact that we can call GetStatistics 
multiple times per scan range. It needs to know if this range has already been 
counted at the DiskIoMgrContext level. This can be worked around.


Line 574:     /// at read time, so hdfs_file_ is null.
> is there a need to have it both ways for hdfs files, instead of always open
We need to keep the hdfs file open as long as the cached buffer is around so 
that the hdfs file doesn't get closed. We also use this when file handle 
caching is off (max_cached_file_handles=0).


-- 
To view, visit http://gerrit.cloudera.org:8080/6478
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to