[
https://issues.apache.org/jira/browse/HBASE-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093497#comment-15093497
]
Orange commented on HBASE-10742:
--------------------------------
Can HBase determine data temperature now?
> Data temperature aware compaction policy
> ----------------------------------------
>
> Key: HBASE-10742
> URL: https://issues.apache.org/jira/browse/HBASE-10742
> Project: HBase
> Issue Type: Brainstorming
> Reporter: Andrew Purtell
>
> Reading "Identifying Hot and Cold Data in Main-Memory Databases" (Levandoski,
> Larson, and Stoica), it occurred to me that some of the motivation applies to
> HBase and some of the results can inform a data temperature aware compaction
> policy implementation.
> We also wish to optimize retention of working-set cells in memory, in the
> blockcache.
> We can also consider further and related performance optimizations in HBase
> that awareness of hot and cold data can enable, even for cases where the
> working set does not fit in memory. If we could partition HFiles into hot and
> cold (cold+lukewarm) and move cells between them at compaction time, then we
> could:
> - Migrate hot HFiles onto alternate storage tiers with improved read latency
> and throughput characteristics. This has been discussed before on HBASE-6572.
> Or, migrate cold HFiles to an archival tier.
> - Preload hot HFiles into blockcache to increase cache hit rates, especially
> when regions are first brought online. And/or add another LRU priority to
> increase the likelihood of retention of blocks in hot HFiles. This could be
> sufficiently different from ARC to avoid issues there.
> - Reduce the compaction priority of cold HFiles, with a proportional
> reduction in compaction IO and write amplification, since cold files would
> participate in reads less frequently.
> Levandoski et al. describe determining data temperature with low overhead
> using an out-of-band estimation process running in the background over an
> access log. We could consider logging reads along with mutations and
> similarly processing the result in the background. The WAL could be overloaded
> to carry access log records, or we could follow the approach described in the
> paper and maintain an in-memory access log only.
> {quote}
> We chose the offline approach for several reasons. First, as mentioned
> earlier, the overhead of even the simplest caching scheme is very high.
> Second, the offline approach is generic and requires minimum changes to the
> database engine. Third, logging imposes very little overhead during normal
> operation. Finally, it allows flexibility in when, where, and how to analyze
> the log and estimate access frequencies. For instance, the analysis can be
> done on a separate machine, thus reducing overhead on the system running the
> transactional workloads.
> {quote}
> Importantly, they only log a sample of all accesses.
> {quote}
> To implement sampling, we have each worker thread flip a biased coin before
> starting a new query (where bias correlates with sample rate). The thread
> records its accesses in log buffers (or not) based on the outcome of the coin
> flip. In Section V, we report experimental results showing that sampling 10%
> of the accesses reduces the accuracy by only 2.5%,
> {quote}
> Likewise we would only record a subset of all accesses to limit overheads.
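> A minimal sketch of what that sampling hook might look like (the
> SampledAccessLogger and AccessLog names are hypothetical, not existing HBase
> interfaces; here the coin is flipped per read rather than per query for
> simplicity):
> {code:java}
> import java.util.concurrent.ThreadLocalRandom;
>
> /** Illustrative only: records a configurable fraction of read accesses. */
> public class SampledAccessLogger {
>   private final double sampleRate;    // e.g. 0.10 to log roughly 10% of accesses
>   private final AccessLog accessLog;  // hypothetical sink: WAL records or an in-memory buffer
>
>   public SampledAccessLogger(double sampleRate, AccessLog accessLog) {
>     this.sampleRate = sampleRate;
>     this.accessLog = accessLog;
>   }
>
>   /** Flip the biased coin; append an access record only on "heads". */
>   public void maybeLogRead(byte[] row, byte[] family) {
>     if (ThreadLocalRandom.current().nextDouble() < sampleRate) {
>       accessLog.append(row, family, System.currentTimeMillis());
>     }
>   }
> }
>
> /** Minimal sink interface so the sketch is self-contained. */
> interface AccessLog {
>   void append(byte[] row, byte[] family, long timestampMs);
> }
> {code}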
> The offline process estimates access frequencies over discrete time slices
> using exponential smoothing. (Markers representing time slice boundaries are
> interleaved with access records in the log.) Forward and backward
> classification algorithms are presented. The forward algorithm requires a
> full scan over the log and storage proportional to the number of unique cell
> addresses, while the backward algorithm requires reading at least the tail of
> the log in reverse order.
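> As a rough sketch of the forward pass (the key type, record handling, and
> smoothing factor are assumptions for illustration, not the paper's exact
> formulation):
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> /** Scans the access log in time order, keeping one smoothed estimate per key. */
> public class ForwardTemperatureEstimator {
>   private static final double ALPHA = 0.25;  // smoothing factor; tunable
>
>   private final Map<String, Double> estimate = new HashMap<>();    // key -> smoothed frequency
>   private final Map<String, Long> countInSlice = new HashMap<>();  // key -> hits in current slice
>
>   /** Called for every access record read from the log. */
>   public void onAccess(String key) {
>     countInSlice.merge(key, 1L, Long::sum);
>   }
>
>   /** Called when a time-slice boundary marker is encountered in the log. */
>   public void onSliceBoundary() {
>     // Existing keys decay toward zero when not accessed in this slice.
>     for (Map.Entry<String, Double> e : estimate.entrySet()) {
>       long observed = countInSlice.getOrDefault(e.getKey(), 0L);
>       e.setValue(ALPHA * observed + (1 - ALPHA) * e.getValue());
>     }
>     // Keys seen for the first time start from their smoothed raw count.
>     for (Map.Entry<String, Long> e : countInSlice.entrySet()) {
>       estimate.putIfAbsent(e.getKey(), ALPHA * e.getValue());
>     }
>     countInSlice.clear();
>   }
>
>   /** A key whose smoothed frequency exceeds the threshold is considered hot. */
>   public boolean isHot(String key, double threshold) {
>     return estimate.getOrDefault(key, 0.0) >= threshold;
>   }
> }
> {code}
> Note the storage is proportional to the number of unique keys seen, as the
> paper describes for the forward algorithm.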
> If we overload the WAL to carry the access log, offline data temperature
> estimation can piggyback as a WAL listener. The forward algorithm would then
> be a natural choice. The HBase master is fairly idle most of the time and
> less memory hungry than a regionserver, at least in today's architecture. We
> could probably get away with considering only row+family as a unique
> coordinate to minimize space overhead. Or if instead we maintain the access
> logs in memory at the RegionServer, then there is a parallel formulation and
> we could benefit from the reverse algorithm's ability to terminate early once
> confidence bounds are reached and backwards scanning IO wouldn't be a
> concern. This handwaves over a lot of details.
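> For example, collapsing each access to a row+family coordinate might look
> something like the following (the length-prefixed encoding is just one
> possibility, not an existing HBase utility); the estimator above would then
> key its maps on this coordinate rather than a full cell address:
> {code:java}
> import java.nio.ByteBuffer;
>
> /** Illustrative helper: one map key per row+family rather than per cell. */
> public final class RowFamilyCoordinate {
>   private RowFamilyCoordinate() {}
>
>   /** Length-prefix the row so the (row, family) pair is unambiguous. */
>   public static ByteBuffer of(byte[] row, byte[] family) {
>     ByteBuffer buf = ByteBuffer.allocate(4 + row.length + family.length);
>     buf.putInt(row.length).put(row).put(family);
>     buf.flip();
>     // ByteBuffer compares by content, so the result is usable as a map key.
>     return buf;
>   }
> }
> {code}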
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)