I am still chasing a problem that occurs after HBase region splits
(0.20.3). It seems a region does not get assigned after a split.


In ProcessRegionOpen.process() I see a line     

        master.regionManager.removeRegion(regionInfo)

which is executed after a MSG_REPORT_OPEN is handled in ServerManager.

It removes the (newly added) regions from the regionsInTransition
structure, where they were put in response to MSG_REPORT_SPLIT (in
trunk this is done via MSG_REPORT_OPEN/setOpen() instead, but either
way the entry ends up being removed).

Later in the message processing thread, the ServerManager triggers
master.regionManager.assignRegions(), which operates on regions in
regionsInTransition.

Despite the fact that this code is synchronized on regionManager and
the messages should be kept in order: can it be that the remove
(sometimes) comes too early, so the regions never get assigned again?
After all, ProcessRegionOpen is processed in a separate TodoQ. If that
queue is processed faster than the message-processing thread....
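
To make the suspected interleaving concrete, here is a minimal,
self-contained sketch. It is NOT the actual 0.20.3 code; the class
name, the plain string keys and the boolean flag are made-up stand-ins
for the real RegionState bookkeeping. It only shows that if the
removeRegion() coming out of the Todo queue runs before the message
thread reaches assignRegions(), there is nothing left to assign:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Toy model of the ordering question above -- not HBase source code.
    public class RegionAssignRace {

        // Stand-in for regionManager.regionsInTransition:
        // region name -> "already assigned?" flag (the real entries are richer).
        static final Map<String, Boolean> regionsInTransition = new ConcurrentHashMap<>();

        // What the split report conceptually does: register the daughter region.
        static void noteDaughterRegion(String region) {
            regionsInTransition.put(region, Boolean.FALSE);
        }

        // What ProcessRegionOpen.process() ends with: removeRegion(regionInfo).
        static void removeRegion(String region) {
            regionsInTransition.remove(region);
        }

        // What the message-processing thread later triggers: assign everything
        // that is still sitting in regionsInTransition and not yet assigned.
        static void assignRegions() {
            regionsInTransition.forEach((region, assigned) -> {
                if (!assigned) {
                    System.out.println("assigning " + region);
                    regionsInTransition.put(region, Boolean.TRUE);
                }
            });
        }

        public static void main(String[] args) throws InterruptedException {
            String daughter = "mytable,splitkey,1270900000000";
            noteDaughterRegion(daughter);

            // Suspected bad interleaving: the Todo queue is drained quickly and
            // ProcessRegionOpen removes the entry first ...
            Thread todoQueue = new Thread(() -> removeRegion(daughter));
            todoQueue.start();
            todoQueue.join();

            // ... so when the message thread finally calls assignRegions(),
            // the daughter region is no longer there and is never assigned.
            assignRegions();
            System.out.println("still in transition: " + regionsInTransition.keySet());
        }
    }

(Again, the real code is synchronized on regionManager; the sketch only
illustrates what would happen if the remove could win that race.)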


Al


On 10.04.2010 21:08, Al Lias wrote:
> Thanks for looking into it, Todd,
> 
> On 09.04.2010 17:16, Todd Lipcon wrote:
>> Hi,
>>
>> This is likely a multiple assignment bug.
>>
> 
> I tried again, this time I grepped for the region that a client
> could not find. Looks like something with "multiple assignment".
> 
> http://pastebin.com/CHD0KSPH
> 
>> Can you grep the NN log for the block ID 991235084167234271 ? This should
>> tell you which file it was originally allocated to, as well as what IP wrote
>> it. You should also see a deletion later. Also, the filename should give you
>> a clue as to which region the block is from. You can then consult those
>> particular RS and master logs to see which servers deleted the file and why.
>>
> 
> Please help: http://pastebin.com/zUxqyyfU (not sorted by time)
> I can only see that the Master advised the delete....
> 
> (This error is a different instance of the same problem as the one above)
> 
> Thx,
> 
>       Al
> 
>> -Todd
>>
>> On Fri, Apr 9, 2010 at 12:56 AM, Al Lias <al.l...@gmx.de> wrote:
>>
>>> I repeatedly have the following problem with
>>> 0.20.3/dfs.datanode.socket.write.timeout=0: Some RS is requested for
>>> some data, the DFS cannot find it, and the client hangs until the timeout.
>>>
>>> Grepping the cluster logs, I can see this:
>>>
>>> 1. At some point the DFS is asked to delete a block, and the blocks are
>>> deleted from the datanodes.
>>>
>>> 2. A few minutes later, an RS seems to ask for exactly this block... DFS
>>> says "Block blk_.. is not valid." and then "No live nodes contain
>>> current block".
>>>
>>> (I have the xceivers and file descriptor limits set high,
>>> dfs.datanode.handler.count=10, no particularly high load, 17 servers with
>>> 24G/4 cores)
>>>
>>> More log here: http://pastebin.com/cdqsy8Ae
>>>
>>> ?
>>>
>>> Thx, Al
>>>
>>>
>>>
>>>
>>
>>
