Thanks Dan.

On Thu, Dec 8, 2016 at 11:57 AM, Daniel Hecht <[email protected]> wrote:
> We don't have the tests for it. I think what we should have is a test that
> does stuff to the files from outside of Impala (and therefore invalidates
> the file handles), and make sure nothing "bad" happens (like crashes). And
> maybe some invalidate metadata / refreshes (since IIRC we use the mtime at
> those points as part of the handle key).
>
> Also, I don't remember what the eviction policy is. We should probably
> verify that memory usage is bounded (both on Impala and hdfs side).

Per my understanding, we only evict after an mtime mismatch. Else we'd cache
it forever.

> On Thu, Dec 8, 2016 at 11:14 AM, Bharath Vissapragada <
> [email protected]> wrote:
>
> > I see this <https://gerrit.cloudera.org/#/c/691/> change has reset
> > --max_cached_file_handles to 0, effectively disabling hdfs file handle
> > caching. Any idea why?
> >
> > I don't think it consumes too much memory (~20MB for 10k cached handles).
> > The reason I'm asking this is, without caching, we'd have to create a new
> > handle for every scan range and hence a new RPC every time.
> >
> > // The number of cached file handles defines how much memory can be used per backend for
> > // caching frequently used file handles. Currently, we assume that approximately 2kB data
> > // are associated with a single file handle. 10k file handles will thus reserve ~20MB
> > // data. The actual amount of memory that is associated with a file handle can be larger
> > // or smaller, depending on the replication factor for this file or the path name.
> > DEFINE_uint64(max_cached_file_handles, 0, "Maximum number of HDFS file handles "
> >     "that will be cached. Disabled if set to 0.");
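
To make the mechanism being discussed concrete, here is a minimal, hypothetical
sketch of a handle cache keyed by (path, mtime) and capped at a fixed number of
entries. The names (FileHandleCache, HdfsFileHandle, GetOrCreate) are invented
for illustration and this is not Impala's actual implementation; in particular
the LRU capacity bound is an assumption for the example, whereas the thread
above suggests the real cache may only evict on an mtime mismatch.

// Illustrative only: a toy cache keyed by (file path, mtime), capped at a
// fixed number of entries. A file rewritten outside Impala gets a new mtime,
// so its old handle is never returned again and eventually ages out.
#include <cstdint>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct HdfsFileHandle {
  // In the real cache this would wrap an hdfsFile plus bookkeeping (~2kB each).
  std::string path;
  int64_t mtime;
};

class FileHandleCache {
 public:
  explicit FileHandleCache(size_t capacity) : capacity_(capacity) {}

  // Returns a cached handle for (path, mtime) or creates a new one.
  std::shared_ptr<HdfsFileHandle> GetOrCreate(const std::string& path,
                                              int64_t mtime) {
    if (capacity_ == 0) {  // Caching disabled, like max_cached_file_handles=0.
      return std::make_shared<HdfsFileHandle>(HdfsFileHandle{path, mtime});
    }
    const std::string key = path + "#" + std::to_string(mtime);
    auto it = map_.find(key);
    if (it != map_.end()) {
      lru_.splice(lru_.begin(), lru_, it->second);  // Mark as most recent.
      return it->second->second;
    }
    auto handle = std::make_shared<HdfsFileHandle>(HdfsFileHandle{path, mtime});
    lru_.emplace_front(key, handle);
    map_[key] = lru_.begin();
    if (lru_.size() > capacity_) {  // Bound memory: drop the least recently used.
      map_.erase(lru_.back().first);
      lru_.pop_back();
    }
    return handle;
  }

 private:
  using Entry = std::pair<std::string, std::shared_ptr<HdfsFileHandle>>;
  size_t capacity_;
  std::list<Entry> lru_;
  std::unordered_map<std::string, std::list<Entry>::iterator> map_;
};

With a capacity of 10k entries and roughly 2kB per handle, the bound works out
to the ~20MB figure quoted from the flag's comment above.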
