Oh, the Cloudera lads are working on updating their distro to 0.20.3. Will flag the list when it's done. St.Ack
On Wed, Jan 27, 2010 at 2:51 PM, Stack <st...@duboce.net> wrote:
> On Wed, Jan 27, 2010 at 2:41 PM, James Baldassari <ja...@dataxu.com> wrote:
>>
>> First we shut down the master and all region servers and then manually removed the /hbase root through hadoop/HDFS. One of my colleagues increased some timeout values (I think they were ZooKeeper timeouts).
>
> ticktime?
>
>> Another change was that I recreated the table without LZO compression and without setting the IN_MEMORY flag. I learned that we did not have the LZO libraries installed, and the table had been created originally with compression set to LZO, so I imagine that would cause problems. I didn't see any errors about it in the logs, however. Maybe this explains why we lost data during our initial testing after shutting down HBase. Perhaps it was unable to write the data to HDFS because the LZO libraries were not available?
>
> If LZO is enabled and the libs are not in place, no data is written, IIRC. It's a problem.
>
>> Anyway, everything seems to be OK for now. We can restart HBase without data loss or errors, and we can truncate the table without any problems. If any other issues crop up we plan on upgrading to 0.20.3, but our preference is to stay with the Cloudera distro if we can. We're doing additional testing tonight with a larger dataset, so I'll keep an eye on it and post back if we learn anything new.
>
> Avoid truncating tables if you are not on 0.20.3. It's flaky and may put you back in the spot you complained of originally.
>
> St.Ack
>
>> Thanks again for your help.
>>
>> -James
>>
>> On Wed, 2010-01-27 at 13:54 -0600, Stack wrote:
>>> On Tue, Jan 26, 2010 at 9:03 PM, James Baldassari <ja...@dataxu.com> wrote:
>>> >
>>> > After running a map/reduce job which inserted around 180,000 rows into HBase, HBase appeared to be fine. We could do a count on our table, and no errors were reported. We then tried to truncate the table in preparation for another test but were unable to do so because the region became stuck in a transition state.
>>>
>>> Yes. In older HBase, truncate of anything larger than a small table was flaky. It's better in 0.20.3 (I wrote our brothers over at Cloudera about updating the version they bundle, especially since 0.20.3 just went out).
>>>
>>> > I restarted each region server individually, but it did not fix the problem. I tried the disable_region and close_region commands from the hbase shell, but that didn't work either. After doing all of that, a status 'detailed' showed this:
>>> >
>>> > 1 regionsInTransition
>>> > name=retargeting,,1264546222144, unassigned=false, pendingOpen=false, open=false, closing=true, pendingClose=false, closed=false, offlined=false
>>> >
>>> > Then I restarted the master and all region servers, and it looked like this:
>>> >
>>> > 1 regionsInTransition
>>> > name=retargeting,,1264546222144, unassigned=false, pendingOpen=true, open=false, closing=false, pendingClose=false, closed=false, offlined=false
>>>
>>> Even after a master restart? The above is a dump of a master-internal data structure that is kept in memory. Strange that it would pick up the exact same state on restart (as Ryan says, a restart of the master alone is usually a radical but sufficient fix).
>>>
>>> I was going to say you could try onlining the individual region in the shell, but I don't think that'll work either, not unless you update to a 0.20.3-era HBase.
>>>
>>> > I noticed messages in some of the region server logs indicating that their ZooKeeper sessions had expired. I'm not sure if this has anything to do with the problem.
>>>
>>> It could. The region servers will restart if their session with ZK expires. What's your HBase schema like? How are you doing your upload?
>>>
>>> > I should mention that this scenario is quite repeatable, and the last few times it has happened we had to shut down HBase and manually remove the /hbase root from HDFS, then start HBase and recreate the table.
>>>
>>> For sure you've upped the file descriptor and xceiver params as per the Getting Started guide?
>>>
>>> > I was also wondering whether it was normal for there to be only one region with 180,000+ rows. Shouldn't this region be split into several regions and distributed among the region servers? I'm new to HBase, so maybe my understanding of how it's supposed to work is wrong.
>>>
>>> Get the region's size on the filesystem: ./bin/hadoop fs -dus /hbase/table/regionname. A region splits when it's above a size threshold, 256M usually.
>>>
>>> St.Ack
>>>
>>> > Thanks,
>>> > James
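A note on the ticktime question in the thread above: in 0.20.x-era HBase the session length is set with zookeeper.session.timeout in hbase-site.xml, and, when HBase manages its own ZooKeeper quorum, the quorum tick can be set through hbase.zookeeper.property.tickTime (ZooKeeper caps a session at roughly 20 x tickTime, so a long timeout needs a matching tick). A minimal sketch with illustrative values only, not settings recommended anywhere in this thread:

    <!-- hbase-site.xml; values are illustrative -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>60000</value>   <!-- ms; a region server restarts itself when its session expires -->
    </property>
    <property>
      <!-- only takes effect when HBase manages the ZooKeeper quorum -->
      <name>hbase.zookeeper.property.tickTime</name>
      <value>3000</value>    <!-- ms; ZooKeeper limits a session to about 20 x tickTime -->
    </property>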
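On the LZO point: recreating the table without compression from the hbase shell looks roughly like the following. The family name 'cf' is a placeholder, since the thread never names the actual column families; describe shows whether COMPRESSION or IN_MEMORY is set on each family.

    hbase> describe 'retargeting'
    hbase> disable 'retargeting'
    hbase> drop 'retargeting'
    hbase> create 'retargeting', {NAME => 'cf', COMPRESSION => 'NONE', IN_MEMORY => false}

Note that disable/drop/create is essentially what the shell's truncate does under the hood, so on a pre-0.20.3 release it can run into the same stuck-region trouble described above.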
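And on the file descriptor/xceiver and single-region questions: the usual Getting Started checks look something like the lines below. The region directory name is a placeholder; in 0.20 a region splits once it grows past hbase.hregion.max.filesize, 256MB by default.

    # file descriptors available to the HBase/HDFS user; the stock 1024 is generally considered too low
    ulimit -n

    # size of one region on HDFS (Stack's check from the thread)
    ./bin/hadoop fs -dus /hbase/retargeting/<region-directory>

and in hdfs-site.xml on each datanode (the value is illustrative; the misspelling is the actual property name):

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>2047</value>
    </property>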