[
https://issues.apache.org/jira/browse/HBASE-26304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432456#comment-17432456
]
Bryan Beaudreault commented on HBASE-26304:
-------------------------------------------
As mentioned above, I have implementations for the above two HDFS issues, and
they work great for ensuring HBase is able to take advantage of new locality
improvements without any DFSClient warnings. Before pushing PRs for those, I'm
now taking a look at the localityIndex reporting issue, in case it affects the
strategy. The core problem is that when a StoreFile is opened, a StoreFileInfo
object is created. Initializing that StoreFileInfo calls
computeHDFSBlocksDistribution and caches the result for the lifetime of the
StoreFileInfo. The resulting value is available via the
getHDFSBlockDistribution method.
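The compute-once caching described above can be sketched roughly like this. This is a simplified illustration, not the real HBase classes; the actual StoreFileInfo carries much more state, and the real computation queries the NameNode for block locations:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the real org.apache.hadoop.hbase.HDFSBlocksDistribution.
class HDFSBlocksDistribution {
  final Map<String, Long> hostToWeight = new HashMap<>();
  long totalWeight;
}

// Illustrative sketch of the compute-once pattern: the distribution is
// computed when the StoreFileInfo is initialized and cached for the
// lifetime of the object, never refreshed.
class StoreFileInfo {
  private final HDFSBlocksDistribution hdfsBlocksDistribution;

  StoreFileInfo() {
    // Computed once at open time and cached for the object's lifetime.
    this.hdfsBlocksDistribution = computeHDFSBlocksDistribution();
  }

  HDFSBlocksDistribution getHDFSBlockDistribution() {
    return hdfsBlocksDistribution; // stale if blocks have since moved
  }

  private HDFSBlocksDistribution computeHDFSBlocksDistribution() {
    // In reality this asks the NameNode where each block's replicas live.
    return new HDFSBlocksDistribution();
  }
}
```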
The getHDFSBlockDistribution method has three usages:
* RatioBasedCompactionPolicy and DateTieredCompactionPolicy use it to force a
major compaction on files whose BlockLocalityIndex is less than a threshold
* The value is aggregated for all StoreFiles in an HRegion, and used to create
RegionLoad objects. RegionLoads are created in a few ways:
** On demand, when loading RegionServer UI "Regions" section
** On demand, through HBaseAdmin.getRegionLoad(ServerName, TableName)
** Periodically, when reporting the heartbeat to the HMaster, by default every
3s. The HMaster uses these in a few ways:
*** Available to query via HBaseAdmin
*** Used in HMaster UI, where you can see localityIndex when viewing table page
*** Used in various load balancer functions (though not localityIndex, since
the balancer computes that separately)
* The value is aggregated for all StoreFiles in an HRegion, and used to report
localityIndex metrics.
** This happens in a thread which executes on an interval, by default 5s. The
resulting metrics are available in JMX, hbtop, and the "Server Metrics" section
at the top of RegionServer UIs.
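For reference, the region-level localityIndex in both of the aggregated usages above boils down to: locally-stored block weight divided by total block weight, summed over all store files in the region. A minimal sketch of that aggregation (names are illustrative, not the actual HBase API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative aggregation of per-StoreFile block weights into a
// region-level locality index; not the real HBase implementation.
class LocalityAggregator {
  private final Map<String, Long> hostToWeight = new HashMap<>();
  private long totalWeight;

  // Record one store file's block weight held on a given host,
  // plus that file's total block weight across all hosts.
  void add(String host, long weightOnHost, long fileTotalWeight) {
    hostToWeight.merge(host, weightOnHost, Long::sum);
    totalWeight += fileTotalWeight;
  }

  // localityIndex as seen from the host serving this region:
  // bytes stored locally / total bytes, in [0.0, 1.0].
  float getBlockLocalityIndex(String host) {
    if (totalWeight == 0) {
      return 0.0f;
    }
    return (float) hostToWeight.getOrDefault(host, 0L) / totalWeight;
  }
}
```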
All of these usages are non-time-sensitive, i.e. not in a core read path or
anything. As such, I think we could treat the StoreFileInfo
hdfsBlockDistribution as a cache which must be cleared. Previously it cached a
value that rarely changed; now we need more control over clearing it. I can
think of three options for this:
* We could create a periodic chore which reloads the cached value for all
store files. This could be filtered to only clear values which are not fully
local.
* We could add a TTL on the cached value, which gets enforced at read time. In
other words, when getHDFSBlockDistribution is called, re-compute if TTL is
expired. We could similarly limit this to only files which are not fully local.
* We could use some trigger from the DFSInputStream to intelligently refresh
the HDFSBlockDistribution only if the underlying stream has been updated. I
think this would have to happen at the HStoreFile level, which has a similar
getHDFSBlockDistribution method that is the only caller of the StoreFileInfo
method.
The HStoreFile has access to the initialReader object which can access the
underlying FSDataInputStreamWrapper. We'd need to expose something in
DFSInputStream that can be used to trigger the logic.
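A rough sketch of what the second option (TTL enforced at read time) might look like. The TTL config, class, and names are all hypothetical; as noted above, the recompute could additionally be skipped for files that are already fully local:

```java
import java.util.function.Supplier;

// Sketch of option 2: re-compute the cached distribution at read time once
// a TTL has elapsed. The TTL value would come from a new (hypothetical)
// refresh-period config, which is exactly the config option 3 avoids.
class TtlCachedDistribution<T> {
  private final Supplier<T> compute;   // e.g. computeHDFSBlocksDistribution()
  private final long ttlMillis;        // hypothetical refresh interval
  private T cached;
  private long lastComputedAt;

  TtlCachedDistribution(Supplier<T> compute, long ttlMillis) {
    this.compute = compute;
    this.ttlMillis = ttlMillis;
  }

  // Called from getHDFSBlockDistribution(): serve the cached value unless
  // the TTL has expired, in which case recompute before returning.
  synchronized T get() {
    long now = System.currentTimeMillis();
    if (cached == null || now - lastComputedAt >= ttlMillis) {
      cached = compute.get();          // refresh lazily, at read time
      lastComputedAt = now;
    }
    return cached;
  }
}
```

The minor cost here is that an expired read pays the block-location fetch inline, which should be acceptable for the non-time-sensitive callers listed above.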
Of the options, I think the last one is the most appealing because we could
avoid yet another config (the refresh TTL/period). It is also the most involved
and requires some investigation. My second preference would be the second
option above, because I'd like to avoid another chore. I don't think the minor
latency
hit of fetching block locations should be an issue for any of the use cases
mentioned above.
I'm going to do a little more investigation into what the third option could
look like.
> Reflect out-of-band locality improvements in served requests
> ------------------------------------------------------------
>
> Key: HBASE-26304
> URL: https://issues.apache.org/jira/browse/HBASE-26304
> Project: HBase
> Issue Type: Sub-task
> Reporter: Bryan Beaudreault
> Assignee: Bryan Beaudreault
> Priority: Major
>
> Once the LocalityHealer has improved locality of a StoreFile (by moving
> blocks onto the correct host), the Reader's DFSInputStream and Region's
> localityIndex metric must be refreshed. Without refreshing the
> DFSInputStream, the improved locality will not improve latencies. In fact,
> the DFSInputStream may try to fetch blocks that have moved, resulting in a
> ReplicaNotFoundException. This is automatically retried, but the retry will
> increase long-tail latencies relative to the configured backoff strategy.
> See https://issues.apache.org/jira/browse/HDFS-16155 for an improvement in
> backoff strategy which can greatly mitigate latency impact of the missing
> block retry.
> Even with that mitigation, a StoreFile is often made up of many blocks.
> Without some sort of intervention, we will continue to hit
> ReplicaNotFoundException over time as clients naturally request data from
> moved blocks.
> In the original LocalityHealer design, I created a new
> RefreshHDFSBlockDistribution RPC on the RegionServer. This RPC accepts a list
> of region names and, for each region store, re-opens the underlying StoreFile
> if the locality has changed.
> I will submit a PR with that implementation, but I am also investigating
> other avenues. For example, I noticed
> https://issues.apache.org/jira/browse/HDFS-15119 which doesn't seem ideal but
> maybe can be improved as an automatic lower-level handling of block moves.