[
https://issues.apache.org/jira/browse/HADOOP-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12562821#action_12562821
]
Billy Pearson commented on HADOOP-2615:
---------------------------------------
Example: a high number of updates on hot regions
Say I have many regions, say 100, on a server, and a few are getting a lot of
updates while the others are getting some updates.
The few that are getting the bulk of the updates will have many map files, so
compaction will hit the 6 map file limit, but it will be quick to finish since
we will only be working with say 16-20MB, not 64MB+.
That is, if we go from new to old, leaving out the oldest map files, which are
the largest and take the longest to include in a compaction.
The other regions will still tie up the compaction thread for say 10 minutes
each, even on regions that only have 3-4 map files, because the compaction will
include the larger map files.
In that time the few regions that are getting lots of updates will be flushing
more often, meaning they will have many map files.
We will be spending most of our time compacting regions that have only a few
map files, including the larger map files that take the longest to compact,
instead of the regions that have the most map files to compact.
In my example above, if all or most of the regions flushed a map file and
entered the queue for compaction, it would be roughly 16 hours (100 regions at
~10 minutes each) before we got back to the few regions that had been getting
the bulk of the updates. Then, when we got back to them, we would only be
processing 6 of the map files again, leaving many map files for the next
compaction, and we would loop through all the others again, assuming they got a
few flushes over the 16 hours it took to complete the compaction on all the
regions.
We should try to come up with a simple test outside of Hudson to get real
numbers on the time it takes to do a scan on a region, say running the test
with 10, 20, 100, 500 map files.
With my new idea above we could keep the number of map files under control by
only compacting map files under X size, keeping the compaction fast.
The test may show that we can handle say 50 medium size map files during a scan
without much impact on speed. If that is the case, then we may only need to do
major compactions, where we merge all the map files together, once every few
days. The exception is after a split, when we would want to do a major
compaction soon to remove the out-of-range data from each new region.
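To make the size-threshold idea a bit more concrete, here is a minimal sketch of how the file selection could look. MapFileInfo, selectForMinorCompaction, and the maxMinorCompactSize / maxFilesPerCompaction parameters are made-up names for illustration only, not the real HStore code.
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: pick only the small/new map files for a minor
// compaction and leave the large/old ones for an occasional major compaction.
public class MinorCompactionSelector {

  // Hypothetical holder for a map file's name and size; not the real HStore type.
  static class MapFileInfo {
    final String name;
    final long sizeBytes;
    MapFileInfo(String name, long sizeBytes) {
      this.name = name;
      this.sizeBytes = sizeBytes;
    }
  }

  // Assumed knobs: only files under maxMinorCompactSize are considered,
  // and at most maxFilesPerCompaction are merged in one pass.
  static List<MapFileInfo> selectForMinorCompaction(
      List<MapFileInfo> files, long maxMinorCompactSize, int maxFilesPerCompaction) {
    List<MapFileInfo> candidates = new ArrayList<>();
    for (MapFileInfo f : files) {
      if (f.sizeBytes < maxMinorCompactSize) {
        candidates.add(f);
      }
    }
    // Smallest (newest flushes) first, so we merge the cheap files quickly.
    candidates.sort(Comparator.comparingLong((MapFileInfo f) -> f.sizeBytes));
    return candidates.subList(0, Math.min(maxFilesPerCompaction, candidates.size()));
  }

  public static void main(String[] args) {
    List<MapFileInfo> files = Arrays.asList(
        new MapFileInfo("old-major", 64L * 1024 * 1024),   // big, left alone
        new MapFileInfo("flush-1", 4L * 1024 * 1024),
        new MapFileInfo("flush-2", 5L * 1024 * 1024),
        new MapFileInfo("flush-3", 3L * 1024 * 1024));
    // X = 16MB threshold, cap at 6 files per compaction.
    for (MapFileInfo f : selectForMinorCompaction(files, 16L * 1024 * 1024, 6)) {
      System.out.println("compact: " + f.name);
    }
  }
}
{code}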
The bottleneck I have seen on compaction is with block compression: we are
bound by CPU speed to gzip the map file after compaction. So I would rather run
one large compaction every day or two, and only have to gzip the biggest part
of each region every few days, instead of doing it every day or more than once
a day. In my mind it is wasting resources to gunzip 64MB of data, add 4MB, and
gzip it again, many times a day. I think that is wasting CPU time gzipping the
same data over and over again.
My idea here is to spend more compaction time on the regions getting more
updates than on the other regions, so we can handle more regions per server.
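One way that could be expressed is to order the compaction queue by how many map files each region has piled up instead of plain FIFO. This is only a sketch under that assumption; RegionState and the queue below are hypothetical and not how the region server currently schedules compactions.
{code:java}
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch: pull the region with the most un-compacted map files
// off the queue first, so hot regions are not stuck behind quiet ones.
public class CompactionQueueSketch {

  // Hypothetical per-region bookkeeping; not the real HRegion/HStore types.
  static class RegionState {
    final String regionName;
    final int mapFileCount;
    RegionState(String regionName, int mapFileCount) {
      this.regionName = regionName;
      this.mapFileCount = mapFileCount;
    }
  }

  public static void main(String[] args) {
    // Order by map file count, highest first, instead of arrival order.
    PriorityQueue<RegionState> queue = new PriorityQueue<>(
        Comparator.comparingInt((RegionState r) -> r.mapFileCount).reversed());

    queue.add(new RegionState("quiet-region-1", 3));
    queue.add(new RegionState("hot-region", 14));
    queue.add(new RegionState("quiet-region-2", 4));

    // Prints hot-region (14), then quiet-region-2 (4), then quiet-region-1 (3).
    while (!queue.isEmpty()) {
      RegionState next = queue.poll();
      System.out.println("compact " + next.regionName
          + " (" + next.mapFileCount + " map files)");
    }
  }
}
{code}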
My example above is based on 100 regions totaling 9GB of compressed data. With
that kind of number per server, someone wanting to store a TB of compressed
data in hbase would need a very large number of servers or have low update
traffic.
I know we have some other issues on how many regions a server can handle, with
the open file limits per server and stuff like that, but I would like to see
this compaction problem fixed once, have the most efficient compaction we can
for all users, and keep it from becoming an issue later down the road. In the
end, if we go with this new idea, compactions would be faster and use fewer
resources during bulk updates, leaving more resources for other tasks running
on the server like map tasks.
So my proposed idea would be to have two types of compactions (see the sketch
after this list):
1. Compact new flushes into one map file until it reaches a size in MB, then
leave it for the compaction below.
2. Compact all map files for a region together once every x days, or if we are
a child region from a split.
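As a rough sketch of how that two-way decision could look. All names and thresholds below (flushTargetSizeMB, daysBetweenMajorCompactions, isSplitChild, and so on) are made up for illustration and are not existing hbase configuration.
{code:java}
// Hypothetical sketch of the two proposed compaction types:
// 1) minor: merge recent flushes into one map file until it reaches a target size,
// 2) major: merge everything, but only every few days or right after a split.
public class CompactionPolicySketch {

  enum CompactionType { NONE, MINOR, MAJOR }

  // All parameter names here are made up for illustration.
  static CompactionType choose(int newFlushFiles,
                               long mergedFlushSizeMB,
                               long flushTargetSizeMB,
                               long daysSinceMajor,
                               long daysBetweenMajorCompactions,
                               boolean isSplitChild) {
    // Major: periodic full merge, or soon after a split to drop out-of-range data.
    if (isSplitChild || daysSinceMajor >= daysBetweenMajorCompactions) {
      return CompactionType.MAJOR;
    }
    // Minor: keep folding new flushes together until the merged file hits the target size.
    if (newFlushFiles >= 3 && mergedFlushSizeMB < flushTargetSizeMB) {
      return CompactionType.MINOR;
    }
    return CompactionType.NONE;
  }

  public static void main(String[] args) {
    // Hot region: many small flushes, recent major compaction -> cheap minor compaction.
    System.out.println(choose(8, 20, 64, 1, 2, false));   // MINOR
    // Quiet region, two days since the last full merge -> major compaction.
    System.out.println(choose(2, 10, 64, 2, 2, false));   // MAJOR
    // Freshly split child region -> major compaction soon.
    System.out.println(choose(1, 4, 64, 0, 2, true));     // MAJOR
  }
}
{code}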
> Add max number of mapfiles to compact at one time giving us a minor & major
> compaction
> ---------------------------------------------------------------------------------------
>
> Key: HADOOP-2615
> URL: https://issues.apache.org/jira/browse/HADOOP-2615
> Project: Hadoop Core
> Issue Type: Improvement
> Components: contrib/hbase
> Reporter: Billy Pearson
> Priority: Minor
> Fix For: 0.17.0
>
> Attachments: flag-v2.patch, flag.patch, twice.patch
>
>
> Currently we do compaction on a region when the
> hbase.hstore.compactionThreshold is reached - default 3
> I think we should configure a max number of mapfiles to compact at one time,
> similar to doing a minor compaction in bigtable. This keeps compactions
> from getting tied up in one region too long, letting other regions get way too
> many memcache flushes, making compaction take longer and longer for each region.
> If we did that, when a region's updates start to slack off, the max number would
> eventually include all mapfiles, causing a major compaction on that region.
> Unlike bigtable, this would leave the master out of the process, letting
> the region server handle the major compaction when it has time.
> When doing a minor compaction on a few files, I think we should compact the
> newest mapfiles first and leave the larger/older ones for when we have low
> updates to a region.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.