[ 
https://issues.apache.org/jira/browse/HBASE-20045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394587#comment-16394587
 ] 

Saad Mufti edited comment on HBASE-20045 at 3/11/18 6:15 PM:
-------------------------------------------------------------

I hope I'm not derailing the discussion, but I would like to describe a 
current use case for which this would be incredibly useful in my line of work. 
We are running HBase on AWS EMR with all storage on S3 via AWS's proprietary 
EMRFS filesystem. We have configured our bucket cache to be more than large 
enough to hold all of our data and then some, but S3 buys us failure 
recoverability: AWS EMR does not yet support more than a single master, so 
losing the master means losing the entire cluster, and keeping our data in S3 
lets us survive a cluster failure cleanly. 

We have a heavy read and write load, and performance is more than good enough 
when everything comes from the bucket cache (we have set the prefetch-on-open 
flag to true in the relevant column families' schemas, except for one family 
we write heavily but never read from within the HBase cluster).
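For reference, the prefetch-on-open flag mentioned above is a per-column-family 
attribute; a minimal HBase shell example (table and family names here are 
placeholders):

```
alter 'my_table', {NAME => 'cf', PREFETCH_BLOCKS_ON_OPEN => 'true'}
```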

Now we come to compaction, both minor and major. We have tuned the minor 
compaction min and max file sizes very low to avoid as much minor compaction 
as practical, and we run a homegrown tool that major-compacts the whole 
cluster in batches, each batch being one region per region server. In past 
HBase clusters where storage was on HDFS this served our needs well, and we'd 
like to keep the tool for its operational flexibility and because it requires 
no downtime to compact. The problem, of course, is that compaction evicts the 
blocks of the compacted-away files from the bucket cache and also refuses to 
cache blocks from the newly compacted file. Under ongoing traffic that does a 
lot of checkAndPut, the reads then go to S3 and are slow, so the row write 
lock taken by checkAndPut is held for a long time, causing timeouts in other 
operations trying to acquire the same row lock. Overall performance also 
suffers while a batch is running because requests build up in the IPC queues; 
response-time means and other percentiles trace a sawtooth pattern. Each batch 
of compactions lasts roughly 10 minutes, and the sawtooth has the same 
periodicity.
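The batching scheme above (one region per region server per batch) can be 
sketched as pure scheduling logic; the function name and input shape are 
assumptions for illustration, and no HBase client calls are made:

```python
from collections import defaultdict, deque

def plan_batches(region_locations):
    """Plan major-compaction batches so that each batch touches at most
    one region per region server, as the homegrown tool above does.

    region_locations: iterable of (region_name, server_name) pairs.
    Returns a list of batches; each batch is a list of region names.
    """
    # Queue up each server's regions in arrival order.
    by_server = defaultdict(deque)
    for region, server in region_locations:
        by_server[server].append(region)
    # Each pass takes the next region from every server that still has one.
    batches = []
    while any(by_server.values()):
        batch = [queue.popleft() for queue in by_server.values() if queue]
        batches.append(batch)
    return batches
```

Each batch can then be submitted (e.g. via the Admin API's major-compact call 
per region) and waited on before starting the next.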

For now we have worked around this by a) using the new client-side setting in 
HBase 1.4.0 to keep one slow region server from blocking client ops to all the 
others, and b) accepting some timeouts as a cost of doing business and 
requeuing them in a special Kafka retry topic for our upstream system to 
reprocess. This also limits traffic to the slow region server, letting it 
drain its backed-up IPC queues instead of being hammered with traffic that 
never lets it recover. We run the tool once a day and it finishes in 8-10 
hours, so performance is great the rest of the day.
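The timeout/requeue pattern described above can be sketched as follows; the 
operation and the requeue hook are injected callables (all names hypothetical; 
in practice the hook would be something like a Kafka producer's send to the 
retry topic):

```python
def execute_or_requeue(op, requeue, request):
    """Attempt an operation against HBase; on timeout, hand the request
    to a retry topic instead of retrying against the slow region server,
    so its backed-up IPC queues get a chance to drain."""
    try:
        return op(request)
    except TimeoutError:
        # The upstream system reprocesses the retry topic later.
        requeue(request)
        return None
```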

But if we had a setting that told HBase to always cache blocks from newly 
compacted files, our performance hit would go away entirely. I see the 
argument above about not bothering to cache a block whose cells are all weeks 
old. In our case the data is advertising identifiers and arrives 
unpredictably, and as I said our bucket cache is big enough anyway, so why not 
just cache everything? The old blocks from the compacted-away files are 
evicted anyway, so we should never run out of bucket cache if it is sized 
much larger than our entire data set.

And of course the default behavior would remain unchanged; any new 
configuration around this could come with heavy warnings about the potential 
consequences, advanced users only, etc.
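To make the ask concrete, such a knob might look like the hbase-site.xml 
fragment below. The property name is entirely hypothetical; no such setting 
exists today, and this is only a sketch of the shape the configuration could 
take:

```
<!-- Hypothetical property; illustrative only, not a real HBase setting. -->
<property>
  <name>hbase.rs.cachecompactedblocksonwrite</name>
  <value>true</value>
</property>
```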

Again, I apologize for the long description if it distracts from the current 
state of the discussion. Even a setting that only caches newly compacted 
blocks containing "new" cells would still be hugely beneficial to us.

Cheers.

 



> When running compaction, cache recent blocks.
> ---------------------------------------------
>
>                 Key: HBASE-20045
>                 URL: https://issues.apache.org/jira/browse/HBASE-20045
>             Project: HBase
>          Issue Type: New Feature
>          Components: BlockCache, Compaction
>    Affects Versions: 2.0.0-beta-1
>            Reporter: Jean-Marc Spaggiari
>            Priority: Major
>
> HBase already allows caching blocks on flush. This is very useful for 
> use cases where most queries are against recent data. However, as soon as 
> there is a compaction, those blocks are evicted. It would be interesting to 
> have a table-level parameter to say "when compacting, cache blocks less than 
> 24 hours old". That way, when running compaction, all blocks where some data 
> is less than 24h old would be automatically cached. 
>  
> Very useful for table designs where there is a TS in the key but a long 
> history (like a year of sensor data).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)