Andrew Purtell created HBASE-22659:
--------------------------------------

             Summary: Resilient block caching for cache sensitive data serving
                 Key: HBASE-22659
                 URL: https://issues.apache.org/jira/browse/HBASE-22659
             Project: HBase
          Issue Type: Brainstorming
          Components: BlockCache
            Reporter: Andrew Purtell


Caching in data serving remains crucial for performance. Networks are fast but 
not yet fast enough. RDMA may change this once it becomes more widely 
available. Caching layers should be resilient to crashes to avoid the cost of 
rewarming. In the context of HBase with the root filesystem placed on S3, the 
object store is quite slow relative to other options like HDFS, so caching is 
particularly essential: rewarming costs will be high, surfacing either as 
client visible performance degradation (due to cache misses and reloads) or as 
elevated IO due to prefetching.

For cloud serving backed by S3 we expect the HBase blockcache to be configured 
to host the entirety of the warm set, which may be very large, so we also 
expect selection of the file backed option and placement of the filesystem for 
cache file storage on local fast solid state devices. These devices offer data 
persistence beyond the lifetime of an individual process. We can take 
advantage of this to make block caching partially resilient to short duration 
process failures and restarts.
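
To make the expectation concrete, a deployment along these lines might be 
configured roughly as follows. A minimal sketch: the /mnt/ssd paths and the 
64 GB size are illustrative only, while hbase.bucketcache.ioengine, 
hbase.bucketcache.size, and hbase.bucketcache.persistent.path are the existing 
bucket cache settings.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FileBackedCacheConfig {
  public static Configuration configure() {
    Configuration conf = HBaseConfiguration.create();
    // Back the L2 bucket cache with a file on local SSD storage.
    // The /mnt/ssd paths are illustrative only.
    conf.set("hbase.bucketcache.ioengine", "file:/mnt/ssd/bucketcache.data");
    // Size the cache (in MB) generously enough to hold the warm set.
    conf.setInt("hbase.bucketcache.size", 65536);
    // Persist the cache index so a restarted process can locate blocks
    // left behind by its predecessor.
    conf.set("hbase.bucketcache.persistent.path", "/mnt/ssd/bucketcache.index");
    return conf;
  }
}
{code}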

When the blockcache is backed by a filesystem, at startup it can reinitialize 
and prewarm itself with a scan over preexisting disk contents. These will be 
cache files left behind by another process that executed earlier on the same 
instance. This strategy applies specifically to process restart and rolling 
upgrade scenarios. (The local storage may not survive an instance reboot.)
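
As a rough illustration of that startup scan, a minimal sketch follows. The 
BlockEntry record, the index/data file pair, and the tab-separated layout are 
hypothetical stand-ins, not the actual BucketCache persistence format.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Sketch of rewarming a file-backed cache from disk contents at startup. */
public class CacheReload {
  /** Hypothetical index entry: cache key -> (offset, length, source HFile). */
  record BlockEntry(String cacheKey, long offset, int length, String hfilePath) {}

  static Map<String, BlockEntry> reload(Path indexFile, Path dataFile)
      throws IOException {
    Map<String, BlockEntry> backingMap = new HashMap<>();
    if (!Files.exists(indexFile) || !Files.exists(dataFile)) {
      return backingMap; // nothing left by a prior process; start cold
    }
    long dataLen = Files.size(dataFile);
    for (String line : Files.readAllLines(indexFile)) {
      // One entry per line: key \t offset \t length \t hfile (illustrative).
      String[] f = line.split("\t");
      BlockEntry e = new BlockEntry(f[0], Long.parseLong(f[1]),
          Integer.parseInt(f[2]), f[3]);
      // Only re-register entries that still fit inside the data file;
      // a truncated file suggests the prior process died mid-write.
      if (e.offset() + e.length() <= dataLen) {
        backingMap.put(e.cacheKey(), e);
      }
    }
    return backingMap;
  }
}
{code}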

Once the server has reloaded the blockcache metadata from local storage it can 
advertise to the HMaster the list of HFiles for which it has some precached 
blocks resident. This implies the blockcache's file backed option should 
maintain a mapping from the blocks in cache to their source HFile paths. We 
don't need more granular information about which blocks of an HFile are or are 
not in cache: entries for the HFile are unlikely to be cached elsewhere, so we 
can assume that placing a region containing the HFile on a server with any of 
its blocks cached will be better than the alternatives.
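
Given such a mapping, deriving the advertisement is straightforward. A sketch, 
reusing the hypothetical BlockEntry index entry from above:

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch: derive the set of HFiles with at least one cached block. */
public class CachedHFileReport {
  /** Same hypothetical index entry as above, redeclared to stay self-contained. */
  record BlockEntry(String cacheKey, long offset, int length, String hfilePath) {}

  /** Aggregate distinct source HFile paths out of the reloaded backing map. */
  static Set<String> cachedHFiles(Map<String, BlockEntry> backingMap) {
    Set<String> hfiles = new TreeSet<>();
    for (BlockEntry e : backingMap.values()) {
      hfiles.add(e.hfilePath());
    }
    return hfiles;
  }
  // The resulting set would ride along on the regionserver's registration
  // report to the HMaster, e.g. as a new field in the registration RPC
  // (a hypothetical protocol extension; no such field exists today).
}
{code}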

The HMaster already waits for regionserver registration activity to stabilize 
before assigning regions, and we can contemplate adding a configurable delay 
to region reassignment during server crash handling, in the hope that a 
restarted or recovered instance will come online and report its reloaded cache 
contents in time for the assignment decision to consider this new locality 
factor. When finally processing (re)assignment the HMaster can weigh this 
additional factor when building the assignment plan. We already calculate an 
HDFS level locality metric. We can also calculate a new cache level locality 
metric aggregated from regionserver reports of rewarmed cache contents. For a 
given region we can build a candidate set of servers reporting cached blocks 
for its associated HFiles, and the master can assign the region to the server 
with the highest weight, as sketched below. Otherwise we (re)assign using the 
HDFS locality metric as before.
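
A sketch of that selection step, assuming the hypothetical per-server reports 
described above. The weight here is simply the count of the region's HFiles 
with blocks cached on the server; a real implementation might instead weight 
by cached bytes.

{code:java}
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Sketch: pick an assignment target using a cache-level locality metric. */
public class CacheAwareAssignment {
  /**
   * regionHFiles: the HFiles making up the region being assigned.
   * reportedCache: per-server sets of HFiles with cached blocks, as
   * advertised at registration (hypothetical reports; see above).
   * Returns the server whose cache covers the most of the region's
   * HFiles, or empty to fall back to the HDFS locality metric.
   */
  static Optional<String> chooseServer(Set<String> regionHFiles,
      Map<String, Set<String>> reportedCache) {
    String best = null;
    long bestWeight = 0;
    for (Map.Entry<String, Set<String>> e : reportedCache.entrySet()) {
      long weight = e.getValue().stream()
          .filter(regionHFiles::contains).count();
      if (weight > bestWeight) {
        bestWeight = weight;
        best = e.getKey();
      }
    }
    return Optional.ofNullable(best);
  }
}
{code}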

In this way, during rolling restarts or quick process restarts under a 
supervisory process, we are very likely to assign a region back to the server 
that most recently hosted it, and can pick up for immediate reuse any file 
backed blockcache data accumulated for the region by the previous process. 
These are the most common scenarios encountered during normal cluster 
operation. This will allow HBase's internal data caching to be resilient to 
short duration crashes and administrative process restarts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
