Re: commit semantics

Jean-Daniel Cryans Tue, 12 Jan 2010 11:54:25 -0800

It all depends on the size of your data vs. the size of your cluster
vs. your usage pattern.


Example 1: you have 50 regions on a RS and they are all filled at the
same rate. The RS dies, so the master has to split the logs of 50
regions before reassigning.

Example 2: you have 500 regions on a RS and only 1 is being filled.
When the RS dies, the master will only have 1 region to process.

Since a planned optimization is to reassign regions that have no edits
in any HLog right before log splitting (you have to have that knowledge
prior to processing the files, maybe by storing it in ZooKeeper), you
lose availability on 49 regions in this case. Nevertheless, splitting a
small number of regions should be more efficient.
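
To make the two examples concrete, here is a rough sketch of how I read
the planned optimization. It is not HBase code: plan_recovery and its
arguments are hypothetical names, and it assumes the master already
knows which regions have edits in the dead RS's HLog.

```python
def plan_recovery(all_regions, regions_with_edits):
    """Partition a dead region server's regions into two groups.

    Hypothetical helper, not an HBase API. It assumes the set of
    regions with edits in the HLog is known before the log files are
    processed (for instance because it was recorded somewhere else).
    """
    # Regions without edits do not depend on the log, so they could be
    # reassigned without waiting for the split.
    reassign_now = [r for r in all_regions if r not in regions_with_edits]
    # Regions with edits have to wait until their edits are split out.
    needs_split = [r for r in all_regions if r in regions_with_edits]
    return reassign_now, needs_split


# Example 1: 50 regions, all of them being written to.
example1 = plan_recovery(range(50), set(range(50)))
# Example 2: 500 regions, only region 0 being written to.
example2 = plan_recovery(range(500), {0})

for name, (reassign_now, needs_split) in (("example 1", example1),
                                          ("example 2", example2)):
    print(name, "reassign immediately:", len(reassign_now),
          "wait for log split:", len(needs_split))
```

In the first case every region waits on the split; in the second only
the one region with edits does.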

Also, more regions in general means more memory usage and possibly more
open files, and if your data has to be served very fast, a higher
number of regions means more data to keep in memory.
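
For a feel of the file counts involved, here is a back-of-the-envelope
sketch using the figures from my earlier message quoted below (1000
region servers, 500 regions each, about 5 log files per HLog). The
function name and parameters are made up for illustration, and real
numbers will depend on your roll and flush settings.

```python
def hlog_file_estimate(region_servers, regions_per_server,
                       files_per_hlog, one_hlog_per_region):
    """Estimate (number of HLogs, number of log files on HDFS).

    Illustrative arithmetic only; it assumes every HLog has the same
    average number of files.
    """
    hlogs_per_server = regions_per_server if one_hlog_per_region else 1
    hlogs = region_servers * hlogs_per_server
    return hlogs, hlogs * files_per_hlog


# One HLog per region server (the current design).
per_server = hlog_file_estimate(1000, 500, 5, one_hlog_per_region=False)
# One HLog per region (the alternative being asked about).
per_region = hlog_file_estimate(1000, 500, 5, one_hlog_per_region=True)

print("per region server:", per_server)  # (1000, 5000)
print("per region:", per_region)         # (500000, 2500000)
```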

J-D

On Tue, Jan 12, 2010 at 11:40 AM, Kannan Muthukkaruppan
<[email protected]> wrote:
> Btw, is there much gain in having a large number of regions-- i.e. to the
> tune of 500 -- per region server?
>
> I understand that having multiple regions per region server allows finer
> grained rebalancing when new nodes are added or a node goes down. But would
> having a smaller number of regions per region server (say ~50) be really
> bad? If a region server goes down, 50 other nodes would pick up ~1/50 of its
> work. Not as good as 500 other nodes picking up 1/500 of its work each-- but
> seems acceptable still. Are there other advantages of having a large number
> of regions per region server?
>
> regards,
> Kannan
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel 
> Cryans
> Sent: Tuesday, January 12, 2010 9:42 AM
> To: [email protected]
> Subject: Re: commit semantics
>
> wrt 1 HLog per region server, this is from the Bigtable paper. Their
> main concern is the number of open files, since if you have 1000
> region servers * 500 regions then you may have 500 000 HLogs to
> manage. Also you can have more than one file per HLog, so let's say
> you have on average 5 log files per HLog, that's 2 500 000 files on HDFS.
>
> J-D
>
> On Tue, Jan 12, 2010 at 12:24 AM, Dhruba Borthakur <[email protected]> wrote:
>> Hi Ryan,
>>
>> thanks for your response.
>>
>>>Right now each regionserver has 1 log, so if 2 puts on different
>>>tables hit the same RS, they hit the same HLog.
>>
>> I understand. My point was that the application could insert the same record
>> into two different tables on two different HBase instances on two different
>> pieces of hardware.
>>
>> On a related note, can somebody explain what the tradeoff is if each region
>> has its own hlog? are you worried about the number of files in HDFS? or
>> maybe the number of sync-threads in the region server? Can multiple hlog
>> files provide faster region splits?
>>
>>
>>> I've thought about this issue quite a bit, and I think the sync every
>>> 1 rows combined with optional no-sync and low time sync() is the way
>>> to go. If you want to discuss this more in person, maybe we can meet
>>> up for brews or something.
>>>
>>
>> The group-commit thing I can understand. HDFS does a very similar thing. But
>> can you explain your alternative "sync every 1 rows combined with optional
>> no-sync and low time sync"? For those applications that have the natural
>> characteristics of updating only one row per logical operation, how can they
>> be sure that their data has reached some-sort-of-stable-storage unless they
>> sync after every row update?
>>
>> thanks,
>> dhruba
>>
>
