Oh, the Cloudera lads are working on updating their distro to 0.20.3. Will flag the list when it's done. St.Ack
On Wed, Jan 27, 2010 at 2:51 PM, Stack <st...@duboce.net> wrote:
> On Wed, Jan 27, 2010 at 2:41 PM, James Baldassari <ja...@dataxu.com> wrote:
>>
>> First we shut down the master and all region servers and then manually removed the /hbase root through hadoop/HDFS. One of my colleagues increased some timeout values (I think they were ZooKeeper timeouts).
>
> ticktime?
>
>> Another change was that I recreated the table without LZO compression and without setting the IN_MEMORY flag. I learned that we did not have the LZO libraries installed, and the table had been created originally with compression set to LZO, so I imagine that would cause problems. I didn't see any errors about it in the logs, however. Maybe this explains why we lost data during our initial testing after shutting down HBase. Perhaps it was unable to write the data to HDFS because the LZO libraries were not available?
>
> If LZO is enabled and the libs are not in place, no data is written, IIRC. It's a problem.
>
>> Anyway, everything seems to be OK for now. We can restart HBase without data loss or errors, and we can truncate the table without any problems. If any other issues crop up we plan on upgrading to 0.20.3, but our preference is to stay with the Cloudera distro if we can. We're doing additional testing tonight with a larger dataset, so I'll keep an eye on it and post back if we learn anything new.
>
> Avoid truncating tables if you are not on 0.20.3. It's flaky and may put you back in the spot you complained of originally.
>
> St.Ack
>
>> Thanks again for your help.
>>
>> -James
>>
>> On Wed, 2010-01-27 at 13:54 -0600, Stack wrote:
>>> On Tue, Jan 26, 2010 at 9:03 PM, James Baldassari <ja...@dataxu.com> wrote:
>>> >
>>> > After running a map/reduce job which inserted around 180,000 rows into HBase, HBase appeared to be fine. We could do a count on our table, and no errors were reported. We then tried to truncate the table in preparation for another test but were unable to do so because the region became stuck in a transition state.
>>>
>>> Yes. In older HBase, truncate of anything larger than a small table was flaky. It's better in 0.20.3 (I wrote our brothers over at Cloudera about updating the version they bundle, especially since 0.20.3 just went out).
>>>
>>> > I restarted each region server individually, but it did not fix the problem. I tried the disable_region and close_region commands from the hbase shell, but that didn't work either. After doing all of that, a status 'detailed' showed this:
>>> >
>>> > 1 regionsInTransition
>>> > name=retargeting,,1264546222144, unassigned=false, pendingOpen=false, open=false, closing=true, pendingClose=false, closed=false, offlined=false
>>> >
>>> > Then I restarted the master and all region servers, and it looked like this:
>>> >
>>> > 1 regionsInTransition
>>> > name=retargeting,,1264546222144, unassigned=false, pendingOpen=true, open=false, closing=false, pendingClose=false, closed=false, offlined=false
>>>
>>> Even after a master restart? The above is a dump of a master-internal data structure that is kept in memory. Strange that it would pick up the exact same state on restart (as Ryan says, a restart of the master alone is usually a radical but sufficient fix).
>>>
>>> I was going to say you could try onlining the individual region in the shell, but I don't think that'll work either, not unless you update to a 0.20.3-era HBase.
>>>
>>> > I noticed messages in some of the region server logs indicating that their ZooKeeper sessions had expired. I'm not sure if this has anything to do with the problem.
>>>
>>> It could. The region servers will restart if their session with ZK expires. What's your HBase schema like? How are you doing your upload?
>>>
>>> > I should mention that this scenario is quite repeatable, and the last few times it has happened we had to shut down HBase and manually remove the /hbase root from HDFS, then start HBase and recreate the table.
>>>
>>> For sure you've upped the file descriptor and xceiver params as per the Getting Started guide?
>>>
>>> > I was also wondering whether it was normal for there to be only one region with 180,000+ rows. Shouldn't this region be split into several regions and distributed among the region servers? I'm new to HBase, so maybe my understanding of how it's supposed to work is wrong.
>>>
>>> Get the region's size on the filesystem: ./bin/hadoop fs -dus /hbase/table/regionname. A region splits when it's above a size threshold, 256M usually.
>>>
>>> St.Ack
>>>
>>> > Thanks,
>>> > James
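A note on the ticktime question in the thread above: in 0.20.x-era HBase the session length is set with zookeeper.session.timeout in hbase-site.xml, and, when HBase manages its own ZooKeeper quorum, the quorum tick can be set through hbase.zookeeper.property.tickTime (ZooKeeper caps a session at roughly 20 x tickTime, so a long timeout needs a matching tick). A minimal sketch with illustrative values only, not settings recommended anywhere in this thread:

    <!-- hbase-site.xml; values are illustrative -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>60000</value>   <!-- ms; a region server restarts itself when its session expires -->
    </property>
    <property>
      <!-- only takes effect when HBase manages the ZooKeeper quorum -->
      <name>hbase.zookeeper.property.tickTime</name>
      <value>3000</value>    <!-- ms; ZooKeeper limits a session to about 20 x tickTime -->
    </property>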
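On the LZO point: recreating the table without compression from the hbase shell looks roughly like the following. The family name 'cf' is a placeholder, since the thread never names the actual column families; describe shows whether COMPRESSION or IN_MEMORY is set on each family.

    hbase> describe 'retargeting'
    hbase> disable 'retargeting'
    hbase> drop 'retargeting'
    hbase> create 'retargeting', {NAME => 'cf', COMPRESSION => 'NONE', IN_MEMORY => false}

Note that disable/drop/create is essentially what the shell's truncate does under the hood, so on a pre-0.20.3 release it can run into the same stuck-region trouble described above.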
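And on the file descriptor/xceiver and single-region questions: the usual Getting Started checks look something like the lines below. The region directory name is a placeholder; in 0.20 a region splits once it grows past hbase.hregion.max.filesize, 256MB by default.

    # file descriptors available to the HBase/HDFS user; the stock 1024 is generally considered too low
    ulimit -n

    # size of one region on HDFS (Stack's check from the thread)
    ./bin/hadoop fs -dus /hbase/retargeting/<region-directory>

and in hdfs-site.xml on each datanode (the value is illustrative; the misspelling is the actual property name):

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>2047</value>
    </property>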