Andrew Purtell created HBASE-20430:
--------------------------------------
Summary: Improve store file management for non-HDFS filesystems
Key: HBASE-20430
URL: https://issues.apache.org/jira/browse/HBASE-20430
Project: HBase
Issue Type: Sub-task
Reporter: Andrew Purtell
HBase keeps a file open for every active store file so no additional round
trips to the NameNode are needed after the initial open. HDFS internally
multiplexes open files, but the Hadoop S3 filesystem implementations do not,
or, at least, not as well. As the bulk of data under management increases, we
observe that the required number of concurrently open connections rises, and
expect it will eventually exhaust a limit somewhere: in the client, the OS file
descriptor table or open file limits, or the S3 service.

Initially we can simply introduce an option to close every store file after the
reader has finished, and determine the performance impact. Use cases backed by
non-HDFS filesystems will already have to cope with a different read
performance profile. Based on experiments with the S3-backed Hadoop
filesystems, notably S3A, even with aggressively tuned options simple reads can
be very slow when there are blockcache misses; for example, we have observed
15-20 seconds for a Get of a single small row. We expect extensive use of the
BucketCache to mitigate this in such deployments already. It could be backed by
offheap storage, but more likely by a large number of cache files managed by
the file engine on local SSD storage. If misses are already going to be very
expensive, then the motivation to do more than simply open store files on
demand is largely absent.
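As a rough sketch of the simple option above (not existing HBase code), an open-on-demand reader could open the file for each read and close it immediately after, so no descriptor is held between reads. The OnDemandReader class and its read signature are hypothetical, using plain java.io.RandomAccessFile in place of a real store file reader:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: a reader that holds no open file descriptor
// between reads. Each read opens the file, seeks, reads, and closes,
// trading extra open() round trips for a bounded descriptor count.
class OnDemandReader {
    private final String path;

    OnDemandReader(String path) {
        this.path = path;
    }

    // Read 'len' bytes at 'offset'; the descriptor lives only for this call.
    byte[] read(long offset, int len) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            byte[] buf = new byte[len];
            raf.seek(offset);
            raf.readFully(buf);
            return buf;
        }
    }
}
```

On HDFS this would add a NameNode round trip per read, which is exactly why the option should be configurable and off by default there.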

Still, we could employ a predictive cache. Where frequent access to a given
store file (or, at least, its store) is predicted, keep a reference to the
store file open. We can keep statistics about read frequency, write them out to
HFiles during compaction, and consult these stats when opening the region,
perhaps by reading all meta blocks of the region's HFiles at open time.
Otherwise, close the file after reading and open it again on demand. We need to
be careful not to use ARC or an equivalent as the cache replacement strategy,
as it is encumbered. The size of the cache can be determined at startup after
detecting the underlying filesystem, e.g. setCacheSize(VERY_LARGE_CONSTANT) if
(fs instanceof DistributedFileSystem), so we lose little when still on HDFS.
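The keep-open side could be a bounded, LRU-evicting cache of open readers (LRU rather than ARC, per the encumbrance concern), with capacity chosen per the filesystem check above. A minimal sketch using the JDK's LinkedHashMap; OpenReaderCache is a hypothetical name, not an existing HBase API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a bounded cache of open store file readers.
// LinkedHashMap with accessOrder=true gives LRU eviction; the evicted
// (least recently used) reader is closed to release its descriptor.
// Capacity would be set at startup from the detected filesystem:
// very large on HDFS, small on S3-backed stores.
class OpenReaderCache<R extends AutoCloseable> {
    private final LinkedHashMap<String, R> cache;

    OpenReaderCache(int capacity) {
        // accessOrder=true: iteration order is least-recently-used first.
        this.cache = new LinkedHashMap<String, R>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, R> eldest) {
                if (size() > capacity) {
                    try {
                        eldest.getValue().close(); // release the descriptor
                    } catch (Exception e) {
                        // a real implementation would log this
                    }
                    return true; // evict the LRU entry
                }
                return false;
            }
        };
    }

    synchronized R get(String path) {
        return cache.get(path);
    }

    synchronized void put(String path, R reader) {
        cache.put(path, reader);
    }

    synchronized int size() {
        return cache.size();
    }
}
```

A miss in this cache would fall back to open-on-demand; the predicted-hot files simply stay resident long enough to avoid repeated opens.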
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)