Andrew Purtell created HBASE-22659:
--------------------------------------
Summary: Resilient block caching for cache sensitive data serving
Key: HBASE-22659
URL: https://issues.apache.org/jira/browse/HBASE-22659
Project: HBase
Issue Type: Brainstorming
Components: BlockCache
Reporter: Andrew Purtell
Caching in data serving remains crucial for performance. Networks are fast but
not yet fast enough; RDMA may change this once it becomes more widely
available. Caching layers should be resilient to crashes to avoid the cost of
rewarming. In the context of HBase with its root filesystem placed on S3, the
object store is quite slow relative to alternatives like HDFS, so caching is
particularly essential: rewarming costs will be high, paid either as client
visible performance degradation (cache misses and reloads) or as elevated IO
due to prefetching.
For cloud serving backed by S3 we expect the HBase blockcache to be configured
to host the entirety of the warm set, which may be very large, so we also
expect selection of the file backed option and placement of the filesystem for
cache file storage on fast local solid state devices. These devices offer data
persistence beyond the lifetime of an individual process. We can take
advantage of this to make block caching partially resilient to short duration
process failures and restarts.
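As a point of reference, here is a minimal sketch of such a configuration
using the existing bucket cache settings; the mount point and cache size are
hypothetical examples, not recommendations.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FileBackedCacheConfig {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Back the L2 cache with a file on fast local storage
    // (the mount point is a hypothetical example).
    conf.set("hbase.bucketcache.ioengine", "file:/mnt/nvme0/bucketcache.data");
    // Size the cache to hold the entire warm set; value in MB.
    conf.set("hbase.bucketcache.size", "262144");
    return conf;
  }
}
{code}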
When the blockcache is backed by a filesystem, at startup it can reinitialize
and prewarm by scanning preexisting disk contents: cache files left behind by
another process that executed earlier on the same instance. This strategy
applies specifically to process restart and rolling upgrade scenarios. (The
local storage may not survive an instance reboot.)
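A rough sketch of that reload step, assuming the cache persists a serialized
index of its backing map alongside the data file (the index format and class
names here are hypothetical, not an existing HBase API):

{code:java}
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.Collections;
import java.util.Map;

public class CacheReloader {
  /**
   * Rebuild the in-memory block index from an index file persisted by a
   * previous process on the same instance. Returns an empty map when the
   * index is missing or stale, in which case the cache starts cold.
   */
  @SuppressWarnings("unchecked")
  public static Map<String, Long> reload(File indexFile, File dataFile)
      throws IOException {
    if (!indexFile.exists() || !dataFile.exists()) {
      return Collections.emptyMap();
    }
    try (ObjectInputStream in =
        new ObjectInputStream(new FileInputStream(indexFile))) {
      long expectedLength = in.readLong();
      if (expectedLength != dataFile.length()) {
        // The data file was truncated or replaced; discard the stale index.
        return Collections.emptyMap();
      }
      return (Map<String, Long>) in.readObject();
    } catch (ClassNotFoundException e) {
      throw new IOException("Corrupt cache index", e);
    }
  }
}
{code}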
Once the server has reloaded the blockcache metadata from local storage it can
advertise to the HMaster the list of HFiles for which it has precached blocks
resident. This implies the blockcache's file backed option should maintain a
mapping of source HFile paths for the blocks in cache. We don't need to
provide more granular information on which blocks of the HFile are (or are
not) in cache: it is unlikely entries for the HFile will be cached elsewhere,
so we can assume placement of a region containing the HFile on a server with
any of its blocks cached there will be better than the alternatives.
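Deriving that file-level report from the cache index could look like the
sketch below. BlockCacheKey already carries the source HFile name for every
cached block; the stand-in key class and report shape here are
simplifications.

{code:java}
import java.util.Set;
import java.util.TreeSet;

public class CachedFileReporter {
  /** Minimal stand-in for HBase's BlockCacheKey. */
  public static class BlockKey {
    final String hfileName;
    final long offset;
    public BlockKey(String hfileName, long offset) {
      this.hfileName = hfileName;
      this.offset = offset;
    }
  }

  /**
   * Derive the set of HFiles with at least one cached block, suitable for
   * inclusion in the regionserver's report to the HMaster. File-level
   * granularity is all the assignment decision needs.
   */
  public static Set<String> cachedHFiles(Iterable<BlockKey> cachedKeys) {
    Set<String> files = new TreeSet<>();
    for (BlockKey key : cachedKeys) {
      files.add(key.hfileName);
    }
    return files;
  }
}
{code}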
The HMaster already waits for regionserver registration activity to stabilize
before assigning regions, and we can contemplate adding a configurable delay
to region reassignment during server crash handling, in the hope that a
restarted or recovered instance will come online and report its reloaded cache
contents in time for the assignment decision to consider this new factor in
data locality.
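The contemplated delay might look like the following bounded wait; the method
shape and the notion of a registration predicate are hypothetical, purely to
illustrate the idea.

{code:java}
import java.util.function.Predicate;

public class CrashAssignmentDelay {
  /**
   * Before reassigning a dead server's regions, wait a bounded time for a
   * replacement process on the same host to register and report reloaded
   * cache contents. Returns true if the host reported in time.
   */
  public static boolean waitForRestart(String hostname, long delayMs,
      Predicate<String> hasRegistered) throws InterruptedException {
    long deadline = System.currentTimeMillis() + delayMs;
    while (System.currentTimeMillis() < deadline) {
      if (hasRegistered.test(hostname)) {
        return true;
      }
      Thread.sleep(500L);
    }
    return false; // proceed with normal (re)assignment
  }
}
{code}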
When finally processing (re)assignment the HMaster can consider this
additional factor when building the assignment plan. We already calculate an
HDFS-level locality metric. We can also calculate a new cache-level locality
metric aggregated from regionserver reports of re-warmed cache contents. For a
given region we can build a candidate set of servers reporting cached blocks
for its associated HFiles, and the master can assign the region to the server
with the highest weight. Otherwise we (re)assign using the HDFS locality
metric as before.
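The selection step could be as simple as the sketch below, where a server's
weight for a region is the number of the region's HFiles it reports as cached;
the report structure and weighting are assumptions for illustration.

{code:java}
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CacheAwarePlacement {
  /**
   * Prefer the server reporting the most cached HFiles belonging to the
   * region; fall back to the caller's HDFS-locality-based choice when no
   * server reports any cached blocks for it.
   */
  public static String chooseServer(Set<String> regionHFiles,
      Map<String, Set<String>> cachedFilesByServer, String hdfsLocalityChoice) {
    String best = null;
    int bestWeight = 0;
    for (Map.Entry<String, Set<String>> e : cachedFilesByServer.entrySet()) {
      Set<String> overlap = new HashSet<>(e.getValue());
      overlap.retainAll(regionHFiles);
      if (overlap.size() > bestWeight) {
        bestWeight = overlap.size();
        best = e.getKey();
      }
    }
    return best != null ? best : hdfsLocalityChoice;
  }
}
{code}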
In this way, during rolling restarts or quick process restarts under a
supervisory process, we are very likely to assign a region back to the server
that most recently hosted it, and we can pick up for immediate reuse any file
backed blockcache data accumulated for the region by the previous process.
These are the most common scenarios encountered during normal cluster
operation. This will allow HBase's internal data caching to be resilient to
short duration crashes and administrative process restarts.