Re: Starting Abnormally After Shutting Down For Some Time

2012-03-28 Thread Bing Li
Dear Manish and Jean-Daniel, After starting DFS (/opt/hadoop/bin/start-dfs.sh), I got the following daemons after typing jps. 5212 Jps 5150 SecondaryNameNode 4932 DataNode 4737 NameNode Then, I started HBase (/opt/hbase/bin/start-hbase.sh). The following daemons were available. 5797 Jps

Re: Starting Abnormally After Shutting Down For Some Time

2012-03-28 Thread Bing Li
Jean-Daniel, I changed dfs.data.dir and dfs.name.dir to new paths in hdfs-site.xml. I really cannot figure out why HBase/Hadoop runs into a problem after being shut down for a couple of days. If I use it frequently, no such master problem happens. Each time, I have to reinstall not only

Re: Starting Abnormally After Shutting Down For Some Time

2012-03-28 Thread Manish Bhoge
Bing, Based on my experience with the configuration, I can list some points, one of which may be your solution. - First and foremost, don't store your service metadata in the system tmp directory, because it may get cleaned up on every start and you lose all your job tracker, datanode
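A minimal sketch of the kind of hdfs-site.xml override Manish is describing (the paths below are illustrative placeholders, not taken from the thread):

  <property>
    <name>dfs.name.dir</name>
    <value>/var/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/hadoop/dfs/data</value>
  </property>

With those pointed at a persistent location, a reboot that clears /tmp no longer wipes the NameNode metadata.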

Re: efficient export w/o HDFS/copying

2012-03-28 Thread Michel Segel
Wouldn't that mean having the NAS attached to all of the nodes in the cluster? Sent from a remote device. Please excuse any typos... Mike Segel On Mar 26, 2012, at 11:07 PM, Stack st...@duboce.net wrote: On Mon, Mar 26, 2012 at 4:31 PM, Ted Tuttle ted.tut...@mentacapital.com wrote: Is

Re: distributed cluster problem

2012-03-28 Thread Roberto Alonso
Hello, I think the problem is not that. If I don't set the env variable $HBASE_CONF_DIR, everything starts correctly, so I don't think I need to set it. My problem is that the MapReduce job is not executing in parallel. If I ask: Configuration config = HBaseConfiguration.create();
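For context, an HBase map-reduce job of that era is wired up roughly as below. This is a sketch only; the table name and MyMapper (a TableMapper subclass) are hypothetical, and whether the job parallelizes depends on the table layout, not on this code:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;

  Configuration config = HBaseConfiguration.create();
  Job job = new Job(config, "scan-job");   // job name is illustrative
  Scan scan = new Scan();
  scan.setCaching(500);                    // batch more rows per RPC
  scan.setCacheBlocks(false);              // don't churn the block cache from MR
  TableMapReduceUtil.initTableMapperJob(
      "my_table",                          // hypothetical table name
      scan, MyMapper.class,                // hypothetical TableMapper
      Text.class, Result.class, job);
  job.waitForCompletion(true);

TableInputFormat creates one map task per region, so a table with a single region never runs more than one mapper regardless of cluster size, which is a common cause of "not parallel" MR jobs.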

Re: Starting Abnormally After Shutting Down For Some Time

2012-03-28 Thread Bing Li
Dear Manish, I appreciate your replies very much! The system tmp directory is changed to another location in my hdfs-site.xml. When I ran $HADOOP_HOME/bin/start-all.sh, all of the services were listed, including job tracker and task tracker. 10211 SecondaryNameNode 10634 Jps 9992

Re: 0.92 and Read/writes not scaling

2012-03-28 Thread Juhani Connolly
I think there is a lot of stuff in this thread and the situation has changed a bit, so I'd like to summarize the current situation and verify a few points: Our current environment: - CDH 4b1: hdfs 0.23 and hbase 0.92 - separate master and namenode, 64gb, 24 cores each, colocating with zookeepers (third

Re: ArrayIndexOutOfBoundsException in 0.90.7-SNAPSHOT

2012-03-28 Thread Daniel Lamberger
Thank you very much! On Tue, Mar 27, 2012 at 6:54 PM, Ted Yu yuzhih...@gmail.com wrote: Index 20 corresponds to RS_ZK_REGION_FAILED_OPEN which was added by: HBASE-5490 Move the enum RS_ZK_REGION_FAILED_OPEN to the last of the enum list in 0.90 EventHandler (Ram) As of now,

Region server shutting down due to HDFS error

2012-03-28 Thread Eran Kutner
Hi, We have region servers sporadically stopping under load, supposedly due to errors writing to HDFS. Things like: 2012-03-28 00:37:11,210 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing java.io.IOException: All datanodes 10.1.104.10:50010 are bad. Aborting.. It's happening with a

Re: Starting Abnormally After Shutting Down For Some Time

2012-03-28 Thread Agarwal, Saurabh
R - Original Message - From: Bing Li [mailto:lbl...@gmail.com] Sent: Wednesday, March 28, 2012 01:32 AM To: user@hbase.apache.org user@hbase.apache.org; hbase-u...@hadoop.apache.org hbase-u...@hadoop.apache.org Subject: Re: Starting Abnormally After Shutting Down For Some Time

RE: 0.92 and Read/writes not scaling

2012-03-28 Thread Buckley,Ron
Juhani, We've been working on some similar performance testing on our 50 node cluster running 0.92.1 and CDH3U3. We were looking mostly at reads, but observed similar behavior. HBase wasn't particularly busy, but we couldn't make it go faster. Some debugging later, we found that many (sometimes

Re: Region server shutting down due to HDFS error

2012-03-28 Thread Jimmy Xiang
Which version of HDFS and HBase are you using? When the problem happens, can you access HDFS, for example via hadoop dfs? Thanks, Jimmy On Wed, Mar 28, 2012 at 4:28 AM, Eran Kutner e...@gigya.com wrote: Hi, We have region servers sporadically stopping under load, supposedly due to

Get failing rarely by retrieving the next row

2012-03-28 Thread Whitney Sorenson
I'm noticing a failure on about 0.0001% of Gets, wherein instead of the actual row I request, I get the next logical row. For example, I create a Get with this key: \x00\x00\xB8\xB210291 and instead get back the row key: \x00\x00\xB8\xB2103. This happens reliably when we run large jobs against
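For reference, the failing lookup would be constructed along these lines; a sketch only, where the table name is hypothetical and the key bytes (a 4-byte binary prefix plus an ASCII suffix) are taken from Whitney's report:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  Configuration conf = HBaseConfiguration.create();
  HTable table = new HTable(conf, "events");   // hypothetical table name
  byte[] key = Bytes.add(
      new byte[] {0x00, 0x00, (byte) 0xB8, (byte) 0xB2},
      Bytes.toBytes("10291"));                 // key bytes from the report
  Result r = table.get(new Get(key));
  // On rare occasions under heavy load, r.getRow() came back as the
  // next logical key (same prefix + "103") instead of the requested one.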

Re: Region server shutting down due to HDFS error

2012-03-28 Thread Eran Kutner
Hi Jimmy, HBase is built from the latest sources of the 0.90 branch (0.90.7-SNAPSHOT); I had the same problem with 0.90.4. Hadoop is 0.20.2 from Cloudera CDH3u1. This failure happens during large M/R jobs; I have 10 servers and usually no more than 1 would fail like this, sometimes none. One thing worth

Re: Region server shutting down due to HDFS error

2012-03-28 Thread Stack
On Wed, Mar 28, 2012 at 8:09 AM, Eran Kutner e...@gigya.com wrote: Hi Jimmy, HBase is built from the latest sources of the 0.90 branch (0.90.7-SNAPSHOT); I had the same problem with 0.90.4. Hadoop is 0.20.2 from Cloudera CDH3u1. Can you upgrade to CDH3u3, Eran? I don't remember if CDH3u1 had support for

Re: Region server shutting down due to HDFS error

2012-03-28 Thread Harsh J
Eran, For 0.90.7-SNAPSHOT, set hbase.regionserver.logroll.errors.tolerated to a value greater than 0 (the default). This will help the RS survive transient HLog sync failures (with the local DN) by retrying a few times before the RS decides to shut itself down. Also worth investigating is whether you had too much IO load/etc. on the
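In hbase-site.xml that would look something like the following (the value 2 is an illustrative choice, assuming the property is available in that 0.90.7 build):

  <property>
    <name>hbase.regionserver.logroll.errors.tolerated</name>
    <value>2</value>
  </property>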

Re: Region server shutting down due to HDFS error

2012-03-28 Thread Eran Kutner
Thanks Stack and Harsh, I'll try both suggestions and update the list with the results. -eran On Wed, Mar 28, 2012 at 17:21, Harsh J ha...@cloudera.com wrote: Eran, For 0.90.7-SNAPSHOT, set hbase.regionserver.logroll.errors.tolerated to a value greater than 0 (the default). This will help the RS survive transient

Re: Region server shutting down due to HDFS error

2012-03-28 Thread Jean-Daniel Cryans
Any chance we can see what happened before that too? Usually you should see a lot more HDFS spam before getting the "all the datanodes are bad" message. J-D On Wed, Mar 28, 2012 at 4:28 AM, Eran Kutner e...@gigya.com wrote: Hi, We have region servers sporadically stopping under load, supposedly due to

Re: Get failing rarely by retrieving the next row

2012-03-28 Thread Stack
On Wed, Mar 28, 2012 at 7:49 AM, Whitney Sorenson wsoren...@hubspot.com wrote: This happens reliably when we run large jobs against our cluster which perform many reads and writes, but it does not always happen on the same keys. Interesting. Anything you can figure about a particular key if

Re: Region server shutting down due to HDFS error

2012-03-28 Thread Eran Kutner
I don't see any prior HDFS issues in the 15 minutes before this exception. The logs on the datanode reported as problematic are clean as well. However, I now see the log is full of errors like this: 2012-03-28 00:15:05,358 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:

Re: Region server shutting down due to HDFS error

2012-03-28 Thread Jean-Daniel Cryans
Can you look even further? Like a day? J-D On Wed, Mar 28, 2012 at 9:45 AM, Eran Kutner e...@gigya.com wrote: I don't see any prior HDFS issues in the 15 minutes before this exception. The logs on the datanode reported as problematic are clean as well. However, I now see the log is full of

Re: Region server shutting down due to HDFS error

2012-03-28 Thread Ted Yu
Eran: The error indicated some ZooKeeper-related issue. Do you see a KeeperException after the error log? I searched the 0.90 codebase but couldn't find the exact log phrase: zhihyu$ find src/main -name '*.java' -exec grep "getting node's version in CLOSI" {} \; -print zhihyu$ find src/main -name

Re: efficient export w/o HDFS/copying

2012-03-28 Thread Stack
On Wed, Mar 28, 2012 at 12:59 AM, Michel Segel michael_se...@hotmail.com wrote: Wouldn't that mean having the NAS attached to all of the nodes in the cluster? Yes. That was the presumption. St.Ack

Re: apache.hadoop.ipc.HBaseServer: (responseTooSlow)

2012-03-28 Thread Stack
On Tue, Mar 27, 2012 at 5:56 PM, Sindy sindyban...@gmail.com wrote: Hadoop 1.0.1 HBase 0.90.2 You mean 0.92.0? 2012-03-27 22:04:06,607 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 60 on 60120 caught: java.nio.channels.ClosedChannelException Client timed out. Check its

Re: Dealing with large data sets in client

2012-03-28 Thread Stack
On Tue, Mar 27, 2012 at 2:36 PM, Bryan Beaudreault bbeaudrea...@hubspot.com wrote: I imagine it isn't a great idea to create a ton of scans (1 for each row), which is the only way I can think to do the above with what we have. You want to step through some set of rows in lock-step? That is,

Re: 0.92 and Read/writes not scaling

2012-03-28 Thread Stack
On Wed, Mar 28, 2012 at 5:41 AM, Buckley,Ron buckl...@oclc.org wrote: For us, setting these two got rid of all of the 20 and 40 ms response times and dropped the average response time we measured from HBase by more than half. Plus, we can push HBase a lot harder. That had an effect on

Re: Dealing with large data sets in client

2012-03-28 Thread Bryan Beaudreault
Thanks Stack, that's correct. It is kind of hard to describe, though I guess it's easiest to think of it as a 2d array where the 2nd dimension is sorted. I think your idea would be doable, too. I'm going to try testing them both and see how well they perform. Luckily I'm not TOO concerned

RE: 0.92 and Read/writes not scaling

2012-03-28 Thread Buckley,Ron
Stack, We're about 80% random read and 20% random write, so that would have been the mix we were running. We'll try a test with Nagle on and then Nagle off, random write only, later this afternoon and see if the same pattern emerges. Ron -Original Message- From:
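Nagle's algorithm on the HBase IPC path is toggled in hbase-site.xml; assuming these are the two settings in question (the exact pair is truncated out of the quoted mail), "Nagle off" corresponds to:

  <property>
    <name>hbase.ipc.client.tcpnodelay</name>
    <value>true</value>
  </property>
  <property>
    <name>ipc.server.tcpnodelay</name>
    <value>true</value>
  </property>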

Re: Hbase RegionServer stalls on initialization

2012-03-28 Thread N Keywal
Then you should have an error in the master logs. If not, it's worth checking that the master and the region servers speak to the same ZK... As it's HBase-related, I'm redirecting the question to the hbase user mailing list (hadoop common is in bcc). On Wed, Mar 28, 2012 at 8:03 PM, Nabib El-Rahman
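One quick sanity check along those lines is that hbase-site.xml on the master and on every region server names the same quorum (hostnames below are placeholders):

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>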

Re: Starting Abnormally After Shutting Down For Some Time

2012-03-28 Thread Peter Vandenabeele
On Wed, Mar 28, 2012 at 7:27 PM, Bing Li lbl...@gmail.com wrote: Dear all, I found that some configuration information was saved in /tmp on my system. So when some of that information is lost, HBase cannot be started normally. But on my system, I have tried to change the HDFS directory to

Re: Region server shutting down due to HDFS error

2012-03-28 Thread Eran Kutner
hmmm... I couldn't find it either, so I've looked at the history of that file and sure enough a few check-ins back it had that message. I have no idea how something like this could happen. I know I had some merge issues when I first got the latest version and built that project but I've then

Re: Starting Abnormally After Shutting Down For Some Time

2012-03-28 Thread Bing Li
Dear Peter, When I had just started the Ubuntu machine, there was nothing in /tmp. After starting $HADOOP/bin/start-dfs.sh and $HBase/bin/start-hbase.sh, the following files were under /tmp. Do you think anything is wrong? Thanks! libing@greatfreeweb:/tmp$ ls -alrt total 112 drwxr-xr-x 22 root root

HBase RefGuide updated

2012-03-28 Thread Doug Meil
Hi folks- The HBase RefGuide has been updated on the website. Doug Meil Chief Software Architect, Explorys doug.m...@explorys.com

Re: HBase RefGuide updated

2012-03-28 Thread Doug Meil
The one thing I wanted to point out in this latest update was that I broke the Case Studies into a separate chapter (from the single entry that I put in Troubleshooting a few weeks ago). http://hbase.apache.org/book.html#casestudies Several people have posted links to some great research, so

Re: Starting Abnormally After Shutting Down For Some Time

2012-03-28 Thread Peter Vandenabeele
On Wed, Mar 28, 2012 at 9:53 PM, Bing Li lbl...@gmail.com wrote: Dear Peter, When I had just started the Ubuntu machine, there was nothing in /tmp. After starting $HADOOP/bin/start-dfs.sh and $HBase/bin/start-hbase.sh, the following files were under /tmp. Do you think anything is wrong? Thanks!

Re: Starting Abnormally After Shutting Down For Some Time

2012-03-28 Thread Suraj Varma
Bing: Your pid file location can be set up via hbase-env.sh; the default is /tmp ... # The directory where pid files are stored. /tmp by default. # export HBASE_PID_DIR=/var/hadoop/pids On Wed, Mar 28, 2012 at 3:04 PM, Peter Vandenabeele pe...@vandenabeele.com wrote: On Wed, Mar 28, 2012 at 9:53

Re: 0.92 and Read/writes not scaling

2012-03-28 Thread Juhani Connolly
Ron, thanks for sharing those settings. Unfortunately they didn't help with our read throughput, but every little bit helps. Another suspicious thing that has come up is with the network... While overall throughput has been verified to be able to go much higher than the tax hbase is putting

Re: 0.92 and Read/writes not scaling

2012-03-28 Thread Dave Wang
As you said, the number of errors and drops you are seeing is very small compared to your overall traffic, so I doubt that is a significant contributor to the throughput problems you are seeing. - Dave On Wed, Mar 28, 2012 at 7:36 PM, Juhani Connolly juhani_conno...@cyberagent.co.jp wrote:

Re: 0.92 and Read/writes not scaling

2012-03-28 Thread Stack
On Wed, Mar 28, 2012 at 7:36 PM, Juhani Connolly juhani_conno...@cyberagent.co.jp wrote: Since we haven't heard anything on expected throughput, we're downgrading our hdfs back to 0.20.2. I'd be curious to hear how other people do with 0.23 and the throughput they're getting. We don't have