Jean-Marc Spaggiari commented on HBASE-20045:

Interesting. My thought was that compactions do not load the blocks back 
into memory, only flushes do. That would make sense, because if you compact a 
20GB file, you will just blow up the cache, no? I will give that a try.


What I had in mind was a table parameter where, for a given CF, we say 
"KEEP_IN_MEMORY => 604800" (where 604800 represents 7 days in seconds). This 
means that every block we write to disk during a compaction that contains a 
cell less than 7 days old also has to be stored in memory.
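
To make the rule concrete, here is a minimal sketch of the per-block decision 
I am describing. This is not HBase code: the Block class, the 
should_cache_on_compaction function and the newest_cell_ts field are 
hypothetical names used only for illustration. Only the 604800 value comes 
from the proposal above.

```python
from dataclasses import dataclass

# 7 days in seconds, the value used in the proposal above.
KEEP_IN_MEMORY_SECONDS = 604800


@dataclass
class Block:
    """Hypothetical stand-in for a block written during compaction."""
    newest_cell_ts: int  # timestamp (epoch seconds) of the newest cell in the block


def should_cache_on_compaction(block, now, keep_in_memory_seconds=KEEP_IN_MEMORY_SECONDS):
    """Return True if the block holds at least one cell younger than the threshold.

    Checking the newest cell is enough: if the newest cell is too old,
    every other cell in the block is too old as well.
    """
    age_of_newest_cell = now - block.newest_cell_ts
    return age_of_newest_cell < keep_in_memory_seconds
```

With this rule a block whose newest cell is one hour old is cached, and a 
block whose newest cell is exactly 7 days old or older is not.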

Yes, it will store some "older" values. But look at the compaction life cycle.


We flush F1, then F2, then F3. There is already a huge 10GB F0 file. We run a 
minor compaction. F1+F2+F3 become F4, but nothing is put back into the cache. 
F0 is left out of the compaction because it is too big. I would have liked all 
of F4 to go into memory... Then I continue with F5 and F6. I compact F5 and F6 
into F7, keeping F0 and F4. Here again, I would have liked all of F7 to go 
into memory. When I compact F0, F4 and F7, then yes, I might have some blocks 
with some older data, but depending on the key design, that might not be the 
case. If your key is sensorid+ts, then F0 holds old data, and when merging 
with F4 and F7, old data and recent data will most probably be in different 
blocks, so only what is recent goes back into memory. This is to guarantee a 
better 99.999% latency, since that way recent data will ALWAYS be in memory...
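
The sensorid+ts argument can be checked with a small simulation. This is only 
a sketch under made-up assumptions: 3 sensors, one reading per day for 100 
days, 10 cells per block, timestamps counted in days, and a 7 day threshold. 
The function names are hypothetical and nothing here is HBase code.

```python
def build_blocks(cells, cells_per_block):
    """Sort cells by (sensor_id, ts), as a major compaction would, and cut them into blocks."""
    ordered = sorted(cells)
    return [ordered[i:i + cells_per_block] for i in range(0, len(ordered), cells_per_block)]


def blocks_to_cache(blocks, now, max_age):
    """Keep only the blocks that contain at least one cell younger than max_age."""
    return [block for block in blocks if any(now - ts < max_age for _, ts in block)]


# Assumed data set: 3 sensors, one reading per day for days 0..99.
# Days 0..89 would sit in F0, the later days in F4 and F7 before the merge.
sensors = ["s1", "s2", "s3"]
cells = [(sensor, day) for sensor in sensors for day in range(100)]

blocks = build_blocks(cells, cells_per_block=10)
cached = blocks_to_cache(blocks, now=100, max_age=7)

print(f"{len(cached)} of {len(blocks)} blocks would be cached")
```

Under these assumptions only the last block of each sensor contains recent 
cells, so 3 of the 30 blocks go back into memory. A key design that does not 
group cells by time would spread recent cells over many more blocks.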



> When running compaction, cache recent blocks.
> ---------------------------------------------
>                 Key: HBASE-20045
>                 URL: https://issues.apache.org/jira/browse/HBASE-20045
>             Project: HBase
>          Issue Type: New Feature
>          Components: BlockCache, Compaction
>    Affects Versions: 2.0.0-beta-1
>            Reporter: Jean-Marc Spaggiari
>            Priority: Major
> HBase already allows caching blocks on flush. This is very useful for 
> use cases where most queries are against recent data. However, as soon as 
> there is a compaction, those blocks are evicted. It would be interesting to 
> have a table-level parameter to say "When compacting, cache blocks less than 
> 24 hours old". That way, when running a compaction, all blocks where some 
> data is less than 24h old will be automatically cached. 
> Very useful for table design where there is TS in the key but a long history 
> (Like a year of sensor data).
