[ https://issues.apache.org/jira/browse/HBASE-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Purtell updated HBASE-10742:
-----------------------------------
Description:
Reading "Identifying Hot and Cold Data in Main-Memory Databases" (Levandoski,
Larson, and Stoica), it occurred to me that some of the motivation applies to
HBase and some of the results can inform a data temperature aware compaction
policy implementation.
We also wish to optimize retention of working set cells in memory, in the
blockcache.
We can also consider further, related performance optimizations in HBase that
awareness of hot and cold data would enable, even where the working set does
not fit in memory. If we could partition HFiles into hot and cold
(cold+lukewarm) and move cells between them at compaction time (a rough sketch
follows the list below), then we could:
- Migrate hot HFiles onto alternate storage tiers with improved read latency
and throughput characteristics. This has been discussed before on HBASE-6572.
Or, migrate cold HFiles to an archival tier.
- Preload hot HFiles into blockcache to increase cache hit rates, especially
when regions are first brought online. And/or add another LRU priority to
increase the likelihood of retention of blocks in hot HFiles. This could be
sufficiently different from ARC to avoid issues there.
- Reduce the compaction priorities of cold HFiles, with a proportional
reduction in IO and write amplification, since cold files would less frequently
participate in reads.
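As a very rough sketch of the cell routing mentioned above, the compactor could
consult per-coordinate access frequency estimates produced by the offline
process and send each cell to either a hot or a cold output writer. All names
below (CellKey, CellWriter, the frequency map, the threshold) are hypothetical
placeholders for illustration, not existing HBase APIs:
{code:java}
import java.util.Map;

// Hypothetical sketch only: route each cell to a "hot" or "cold" output file
// during compaction, based on an externally supplied access frequency estimate.
// CellKey, CellWriter and the frequency map are placeholders, not HBase APIs.
public class TemperatureAwareCompactor {

  // Placeholder for a row+family coordinate.
  public static final class CellKey {
    final String row;
    final String family;
    public CellKey(String row, String family) { this.row = row; this.family = family; }
    @Override public boolean equals(Object o) {
      if (!(o instanceof CellKey)) return false;
      CellKey k = (CellKey) o;
      return row.equals(k.row) && family.equals(k.family);
    }
    @Override public int hashCode() { return 31 * row.hashCode() + family.hashCode(); }
  }

  // Placeholder for an HFile writer.
  public interface CellWriter {
    void append(CellKey key, byte[] value);
  }

  private final Map<CellKey, Double> estimatedFrequency; // from the offline estimator
  private final double hotThreshold;                     // tuning knob

  public TemperatureAwareCompactor(Map<CellKey, Double> estimatedFrequency,
      double hotThreshold) {
    this.estimatedFrequency = estimatedFrequency;
    this.hotThreshold = hotThreshold;
  }

  // Append the cell to the hot or cold output depending on its estimated frequency.
  public void route(CellKey key, byte[] value, CellWriter hotFile, CellWriter coldFile) {
    double freq = estimatedFrequency.getOrDefault(key, 0.0);
    (freq >= hotThreshold ? hotFile : coldFile).append(key, value);
  }
}
{code}
The hot output would then be the candidate for tier placement and blockcache
preloading, while the cold output would carry the reduced compaction priority.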
Levandoski et al. describe determining data temperature with low overhead using
an out-of-band estimation process running in the background over an access log.
We could consider logging reads along with mutations and similarly process the
result in the background. The WAL could be overloaded to carry access log
records, or we could follow the approach described in the paper and maintain an
in-memory access log only.
{quote}
We chose the offline approach for several reasons. First, as mentioned earlier,
the overhead of even the simplest caching scheme is very high. Second, the
offline approach is generic and requires minimum changes to the database
engine. Third, logging imposes very little overhead during normal operation.
Finally, it allows flexibility in when, where, and how to analyze the log and
estimate access frequencies. For instance, the analysis can be done on a
separate machine, thus reducing overhead on the system running the
transactional workloads.
{quote}
Importantly, they only log a sample of all accesses.
{quote}
To implement sampling, we have each worker thread flip a biased coin before
starting a new query (where bias correlates with sample rate). The thread
records its accesses in log buffers (or not) based on the outcome of the coin
flip. In Section V, we report experimental results showing that sampling 10% of
the accesses reduces the accuracy by only 2.5%,
{quote}
Likewise we would only record a subset of all accesses to limit overheads.
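On the recording side, the sampled hook on the read path could be as small as
the following sketch. The AccessSink interface and the sample rate knob are
assumptions for illustration, not existing HBase interfaces:
{code:java}
import java.util.concurrent.ThreadLocalRandom;

// Sketch only: sampled access logging, analogous to the paper's biased coin flip.
public class SampledAccessLogger {

  // Placeholder sink for access records (in-memory buffer, WAL annotation, etc.).
  public interface AccessSink {
    void record(byte[] row, byte[] family, long timestamp);
  }

  private final AccessSink sink;
  private final double sampleRate; // e.g. 0.10 to record ~10% of accesses

  public SampledAccessLogger(AccessSink sink, double sampleRate) {
    this.sink = sink;
    this.sampleRate = sampleRate;
  }

  // Called on the read path; records the access only when the coin flip succeeds.
  public void maybeRecord(byte[] row, byte[] family) {
    if (ThreadLocalRandom.current().nextDouble() < sampleRate) {
      sink.record(row, family, System.currentTimeMillis());
    }
  }
}
{code}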
The offline process estimates access frequencies over discrete time slices
using exponential smoothing. (Markers representing time slice boundaries are
interleaved with access records in the log.) Forward and backward
classification algorithms are presented. The forward algorithm requires a full
scan over the log and storage proportional to the number of unique cell
addresses, while the backward algorithm requires reading at least the tail of
the log in reverse order.
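To make the forward pass concrete, here is a minimal sketch of exponentially
smoothed per-key estimates, updated lazily only when a key is observed, with
the decay for skipped slices folded in at that point. The key and record
representations are placeholders, and this is only in the spirit of the paper's
formulation:
{code:java}
import java.util.HashMap;
import java.util.Map;

// Sketch only: forward estimation of per-key access frequency with exponential
// smoothing over discrete time slices, in the spirit of the paper's forward
// algorithm. The String key stands in for a row+family coordinate.
public class ForwardEstimator {

  // Smoothed estimate plus the slice in which it was last updated.
  private static final class Estimate {
    double value;
    long lastSlice;
  }

  private final double alpha; // smoothing constant, e.g. 0.05
  private final Map<String, Estimate> estimates = new HashMap<>();

  public ForwardEstimator(double alpha) { this.alpha = alpha; }

  // Process one sampled access record, in forward log order.
  public void observe(String key, long slice) {
    Estimate e = estimates.computeIfAbsent(key, k -> new Estimate());
    // Decay the previous estimate by (1 - alpha) for each slice elapsed since
    // the last update, then fold in this observation. Another access in the
    // same slice (gap of zero) simply adds alpha.
    long gap = Math.max(0, slice - e.lastSlice);
    e.value = alpha + Math.pow(1.0 - alpha, gap) * e.value;
    e.lastSlice = slice;
  }

  // Current estimate, decayed forward to the given slice.
  public double estimate(String key, long currentSlice) {
    Estimate e = estimates.get(key);
    if (e == null) {
      return 0.0;
    }
    return Math.pow(1.0 - alpha, currentSlice - e.lastSlice) * e.value;
  }
}
{code}
Estimates are touched only when a key actually appears in the (sampled) log, so
the forward scan's cost stays proportional to the number of log records plus
the number of unique coordinates retained.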
If we overload the WAL to carry the access log, offline data temperature
estimation can piggyback as a WAL listener. The forward algorithm would then be
a natural choice. The HBase master is fairly idle most of the time and less
memory hungry than a regionserver, at least in today's architecture. We could
probably get away with considering only row+family as the unique coordinate to
minimize space overhead. Or, if we instead maintain the access logs in memory
at the regionserver, there is a parallel formulation and we could benefit from
the backward algorithm's ability to terminate early once confidence bounds are
reached; backwards scanning IO wouldn't be a concern there. This handwaves over
a lot of details.
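For the in-memory regionserver variant, a toy illustration of backward scanning
with early termination follows. It classifies against a fixed hot threshold and
assumes at most one logged access per key per time slice; the paper's actual
backward algorithm maintains tighter per-record upper and lower bounds, so
treat this strictly as a sketch of the early-exit idea:
{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only, not the paper's algorithm: backward scan over an in-memory
// access log with early termination against a fixed "hot" threshold. Assumes
// the log is ordered oldest-first and that at most one access per key is
// logged per time slice.
public class BackwardClassifier {

  // Placeholder access record: the coordinate touched and the access's time slice.
  public static final class Access {
    final String key;
    final long slice;
    public Access(String key, long slice) { this.key = key; this.slice = slice; }
  }

  public static Map<String, Boolean> classify(List<Access> log, double alpha,
      double hotThreshold, long newestSlice) {
    Map<String, Double> estimate = new HashMap<>();
    for (int i = log.size() - 1; i >= 0; i--) { // newest to oldest
      Access a = log.get(i);
      long age = newestSlice - a.slice;
      // An access that is `age` slices old contributes alpha * (1 - alpha)^age,
      // so per-key partial sums only grow as the scan moves further back.
      estimate.merge(a.key, alpha * Math.pow(1.0 - alpha, age), Double::sum);
      // Conservative geometric-tail bound on what the unread, older portion of
      // the log could still add to any single key's estimate.
      double maxRemaining = Math.pow(1.0 - alpha, age);
      if (maxRemaining < hotThreshold && decided(estimate, hotThreshold, maxRemaining)) {
        break; // early termination: no classification can change any more
      }
    }
    Map<String, Boolean> hot = new HashMap<>();
    estimate.forEach((k, v) -> hot.put(k, v >= hotThreshold));
    return hot;
  }

  // True when no key currently below the threshold could still cross it.
  private static boolean decided(Map<String, Double> estimate, double threshold,
      double maxRemaining) {
    for (double v : estimate.values()) {
      if (v < threshold && v + maxRemaining >= threshold) {
        return false;
      }
    }
    return true;
  }
}
{code}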
> Data temperature aware compaction policy
> ----------------------------------------
>
> Key: HBASE-10742
> URL: https://issues.apache.org/jira/browse/HBASE-10742
> Project: HBase
> Issue Type: Brainstorming
> Reporter: Andrew Purtell
>
> Reading "Identifying Hot and Cold Data in Main-Memory Databases" (Levandoski,
> Larson, and Stoica), it occurred to me that some of the motivation applies to
> HBase and some of the results can inform a data temperature aware compaction
> policy implementation.
> We also wish to optimize retention of cells in the working set in memory, in
> blockcache.
> We can also consider further and related performance optimizations in HBase
> that awareness of hot and cold data can enable, even for cases where the
> working set does not fit in memory. If we could partition HFiles into hot and
> cold (cold+lukewarm) and move cells between them at compaction time, then we
> could:
> - Migrate hot HFiles onto alternate storage tiers with improved read latency
> and throughput characteristics. This has been discussed before on HBASE-6572.
> Or, migrate cold HFiles to an archival tier.
> - Preload hot HFiles into blockcache to increase cache hit rates, especially
> when regions are first brought online. And/or add another LRU priority to
> increase the likelihood of retention of blocks in hot HFiles. This could be
> sufficiently different from ARC to avoid issues there.
> - Reduce the compaction priorities of cold HFiles, with proportional
> reduction in priority IO and write amplification, since cold files would less
> frequently participate in reads.
> Levandoski et. al. describe determining data temperature with low overhead
> using an out of band estimation process running in the background over an
> access log. We could consider logging reads along with mutations and
> similarly process the result in the background. The WAL could be overloaded
> to carry access log records, or we could follow the approach described in the
> paper and maintain an in memory access log only.
> {quote}
> We chose the offline approach for several reasons. First, as mentioned
> earlier, the overhead of even the simplest caching scheme is very high.
> Second, the offline approach is generic and requires minimum changes to the
> database engine. Third, logging imposes very little overhead during normal
> operation. Finally, it allows flexibility in when, where, and how to analyze
> the log and estimate access frequencies. For instance, the analysis can be
> done on a separate machine, thus reducing overhead on the system running the
> transactional workloads.
> {quote}
> Importantly, they only log a sample of all accesses.
> {quote}
> To implement sampling, we have each worker thread flip a biased coin before
> starting a new query (where bias correlates with sample rate). The thread
> records its accesses in log buffers (or not) based on the outcome of the coin
> flip. In Section V, we report experimental results showing that sampling 10%
> of the accesses reduces the accuracy by only 2.5%,
> {quote}
> Likewise we would only record a subset of all accesses to limit overheads.
> The offline process estimates access frequencies over discrete time slices
> using exponential smoothing. (Markers representing time slice boundaries are
> interleaved with access records in the log.) Forward and backward
> classification algorithms are presented. The forward algorithm requires a
> full scan over the log and storage proportional to the number of unique cell
> addresses, while the backward algorithm requires reading a least the tail of
> the log in reverse order.
> If we overload the WAL to carry the access log, offline data temperature
> estimation can piggyback as a WAL listener. The forward algorithm would then
> be a natural choice. The HBase master is fairly idle most of the time and
> less memory hungry as a regionserver, at least in today's architecture. We
> could probably get away with considering only row+family as a unique
> coordinate to minimize space overhead. Or if instead we maintain the access
> logs in memory at the RegionServer, then there is a parallel formulation and
> we could benefit from the reverse algorithm's ability to terminate early once
> confidence bounds are reached and backwards scanning IO wouldn't be a
> concern. This handwaves over a lot of details.
--
This message was sent by Atlassian JIRA
(v6.2#6252)