Wellington Chevreuil created HBASE-28170:
--------------------------------------------
Summary: Put the cached time at the beginning of the block; run
cache validation in the background when retrieving the persistent cache
Key: HBASE-28170
URL: https://issues.apache.org/jira/browse/HBASE-28170
Project: HBase
Issue Type: Sub-task
Reporter: Wellington Chevreuil
Assignee: Wellington Chevreuil
In HBASE-28004, we added a "cached time" long at the end of each block in the
bucket cache. We also record the cached time in the backing map that we persist
to disk periodically, so that the cache can be recovered after crashes/restarts.
The persisted backing map includes the last modification time of the cache
itself. On restart, once we read the backing map from the persisted file, we
compare the last modification time recorded there against the current last
modification time of the cache. If they differ, the cache was updated after the
backing map was persisted, so the backing map might not be accurate. We then
iterate through the backing map entries and compare each entry's cached time
against the cached time stored with the related block in the cache; if they
differ, we remove the entry from the map.
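The restart-time check described above can be sketched roughly as follows. This
is only an illustration, not the actual BucketCache code: the class, the Entry
type, and the method names are hypothetical, and it assumes that an entry's
length includes the 8 trailing bytes holding the cached time.

```java
import java.nio.ByteBuffer;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical sketch; none of these names come from the HBase code base. */
final class BackingMapValidationSketch {

  /** Hypothetical stand-in for one persisted backing map entry. */
  static final class Entry {
    final int offset;
    final int length;      // assumed to include the trailing 8-byte cached time
    final long cachedTime; // cached time recorded in the persisted map

    Entry(int offset, int length, long cachedTime) {
      this.offset = offset;
      this.length = length;
      this.cachedTime = cachedTime;
    }
  }

  /** Reads the cached time stored in the last 8 bytes of the block. */
  static long cachedTimeAtEnd(ByteBuffer cache, Entry e) {
    return cache.getLong(e.offset + e.length - Long.BYTES);
  }

  /** Drops entries that no longer match the cache; returns how many were removed. */
  static int validate(Map<String, Entry> backingMap, ByteBuffer cache,
      long persistedCacheModTime, long actualCacheModTime) {
    if (persistedCacheModTime == actualCacheModTime) {
      return 0; // cache not modified since the map was persisted
    }
    int removed = 0;
    Iterator<Map.Entry<String, Entry>> it = backingMap.entrySet().iterator();
    while (it.hasNext()) {
      Entry e = it.next().getValue();
      if (cachedTimeAtEnd(cache, e) != e.cachedTime) {
        it.remove(); // block in the cache is not the one the map refers to
        removed++;
      }
    }
    return removed;
  }
}
```

Note that the per-entry scan only runs when the two modification times differ,
which is why a clean shutdown does not pay the validation cost.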
Currently this validation is performed at RS initialisation time, but with
caches as large as 1.6TB/30M+ blocks it can take up to an hour, during which
the RS is unusable. This PR changes the validation to run in the background,
while direct accesses to a block in the cache also perform the "cached time"
comparison.
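One possible shape for this, as a minimal sketch: the full scan runs on another
thread while reads apply the same check to the entry they touch. All names
below are hypothetical and not taken from the PR; the lookup of the cached time
stored with the block is abstracted behind a function, and CompletableFuture
stands in for whatever threading the real change uses.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

/** Hypothetical sketch; names and threading model are assumptions. */
final class BackgroundValidationSketch {

  /** Hypothetical stand-in for one persisted backing map entry. */
  static final class Entry {
    final int offset;
    final long cachedTime;

    Entry(int offset, long cachedTime) {
      this.offset = offset;
      this.cachedTime = cachedTime;
    }
  }

  private final Map<String, Entry> backingMap = new ConcurrentHashMap<>();
  // Returns the cached time currently stored with the block in the cache.
  private final ToLongFunction<Entry> storedCachedTime;

  BackgroundValidationSketch(Map<String, Entry> persisted,
      ToLongFunction<Entry> storedCachedTime) {
    this.backingMap.putAll(persisted);
    this.storedCachedTime = storedCachedTime;
  }

  /** Removes the entry if the block in the cache has a different cached time. */
  private boolean checkOrRemove(String key, Entry entry) {
    if (storedCachedTime.applyAsLong(entry) == entry.cachedTime) {
      return true;
    }
    backingMap.remove(key, entry);
    return false;
  }

  /** Starts the full scan without blocking the caller (for example, startup). */
  CompletableFuture<Void> validateInBackground() {
    return CompletableFuture.runAsync(() -> backingMap.forEach(this::checkOrRemove));
  }

  /** Direct access runs the same check, so a stale entry is never served. */
  Optional<Entry> get(String key) {
    Entry entry = backingMap.get(key);
    return entry != null && checkOrRemove(key, entry)
        ? Optional.of(entry) : Optional.empty();
  }

  int size() {
    return backingMap.size();
  }
}
```

Checking on access matters because the background scan may not yet have
reached the entry being read.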
This PR also moves the "cached time" to the beginning of the block in the
cache, instead of the end. We noticed that with the "cached time" at the end we
can fail to ensure consistency under some conditions. Consider the following:
1) A block B1 of size S gets allocated at offset 0 with cached time T1;
2) The backing map is persisted, containing B1 at offset 0 and cached time T1;
3) B1 is evicted. Its offset in the cache is now free; however, its contents
are still there, including the cached time T1 at its end;
4) A new block B2 of size S/2 gets allocated at offset 0 with cached time T2;
5) RS crashes before the backing map gets saved, so the persisted backing map
still has only the reference to B1, but not B2;
6) At restart, we run the validation. Because B2 was half the size of B1, it
has not overwritten B1's cached time in the cache, so we will successfully
validate B1, although half of its content has been overwritten by B2.
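The scenario above can be replayed with a small simulation. This is a sketch
under simplifying assumptions: the cache is a plain ByteBuffer, S is 64 bytes,
T1 and T2 are arbitrary values, and all class and method names are
hypothetical.

```java
import java.nio.ByteBuffer;

/** Hypothetical simulation of steps 1-6; not HBase code. */
final class CachedTimePlacementSketch {

  /** Writes a block at the given offset, stamping the cached time at its end or start. */
  static void write(ByteBuffer cache, int offset, int size, long cachedTime,
      boolean timeAtEnd) {
    int timePos = timeAtEnd ? offset + size - Long.BYTES : offset;
    cache.putLong(timePos, cachedTime);
  }

  /** Checks a persisted entry against the cached time found in the cache. */
  static boolean validates(ByteBuffer cache, int offset, int size, long expected,
      boolean timeAtEnd) {
    int timePos = timeAtEnd ? offset + size - Long.BYTES : offset;
    return cache.getLong(timePos) == expected;
  }

  /** Returns whether the persisted B1 entry still validates after B2 replaced it. */
  static boolean staleB1Validates(boolean timeAtEnd) {
    final int s = 64;      // assumed block size S
    final long t1 = 1000L; // assumed cached time T1
    final long t2 = 2000L; // assumed cached time T2
    ByteBuffer cache = ByteBuffer.allocate(s);

    write(cache, 0, s, t1, timeAtEnd);     // step 1: B1 at offset 0
    // step 2: the persisted map records B1 as (offset 0, size S, T1)
    // step 3: eviction frees the offset but leaves the bytes in place
    write(cache, 0, s / 2, t2, timeAtEnd); // step 4: B2 of size S/2 at offset 0
    // steps 5-6: crash, restart, validate the persisted B1 entry
    return validates(cache, 0, s, t1, timeAtEnd);
  }
}
```

With the cached time at the end, staleB1Validates(true) returns true, because
T1 sits beyond the bytes written for B2. With the cached time at the beginning,
staleB1Validates(false) returns false, because any block allocated at the same
offset overwrites the previous cached time regardless of its size.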
--
This message was sent by Atlassian Jira
(v8.20.10#820010)