Thanks, that seemed to help. Our jobs have been running without failures for the last 48 hours.
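For the archives: the change was essentially raising the open-file limit for the user that runs the HDFS/HBase daemons, along the lines of item #6 on the wiki page Andrew pointed to. A rough sketch of what that looks like - the "hadoop" user name and the exact value are just our setup, so treat them as placeholders:

    # /etc/security/limits.conf (user that runs the datanode/regionserver daemons)
    hadoop  soft  nofile  32768
    hadoop  hard  nofile  32768

    # log in again as that user and check the new limit took effect
    $ ulimit -n
    32768

Depending on the distro, pam_limits may need to be enabled before limits.conf is honored, and the daemons have to be restarted so they actually pick up the new limit.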
On Thu, Apr 1, 2010 at 11:43 PM, Andrew Purtell <apurt...@apache.org> wrote:
> First,
>
> "ulimit: 1024"
>
> That's fatal. You need to up file descriptors to something like 32K.
>
> See http://wiki.apache.org/hadoop/Hbase/Troubleshooting, item #6
>
> From there, let's see.
>
>    - Andy
>
> > From: Oded Rosen <o...@legolas-media.com>
> > Subject: DFSClient errors during massive HBase load
> > To: hbase-user@hadoop.apache.org
> > Date: Thursday, April 1, 2010, 1:19 PM
> >
> > Hi all,
> >
> > I have a problem with a massive HBase loading job. It loads from raw files
> > into HBase, with some mapreduce processing and manipulation along the way
> > (so loading directly to files will not be easy).
> >
> > After some dozens of millions of successful writes - a few hours of load -
> > some of the regionservers start to die, one by one, until the whole cluster
> > is kaput. The HBase master sees a "znode expired" error each time a
> > regionserver falls. The regionserver errors are attached.
> >
> > Current configuration:
> > Four nodes - one namenode+master, three datanodes+regionservers.
> > dfs.datanode.max.xcievers: 2047
> > ulimit: 1024
> > servers: Fedora
> > hadoop-0.20, hbase-0.20, hdfs (private servers, not on EC2 or anything).
> >
> > *The specific errors from the regionserver log (from <IP6>, see comment):*
> >
> > 2010-04-01 11:36:00,224 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream
> > ResponseProcessor exception for block blk_7621973847448611459_244908
> > java.io.IOException: Bad response 1 for block blk_7621973847448611459_244908
> > from datanode <IP2>:50010
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2423)
> >
> > *After that, some of this appears:*
> >
> > 2010-04-01 11:36:20,602 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> > createBlockOutputStream java.io.IOException: Bad connect ack with
> > firstBadLink <IP2>:50010
> > 2010-04-01 11:36:20,602 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> > block blk_4280490438976631008_245009
> >
> > *And the FATAL:*
> >
> > 2010-04-01 11:36:32,634 FATAL org.apache.hadoop.hbase.regionserver.HLog:
> > Could not append. Requesting close of hlog
> > java.io.IOException: Bad connect ack with firstBadLink <IP2>:50010
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2872)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2795)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
> >
> > *This FATAL error appears many times until this one kicks in:*
> >
> > 2010-04-01 11:38:57,281 FATAL org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
> > Replay of hlog required. Forcing server shutdown
> > org.apache.hadoop.hbase.DroppedSnapshotException: region: .META.,,1
> >     at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:977)
> >     at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:846)
> >     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)
> >     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:149)
> > Caused by: java.io.IOException: Bad connect ack with firstBadLink <IP2>:50010
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2872)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2795)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
> >
> > *(then the regionserver starts closing itself)*
> >
> > The regionserver on <IP6> was shut down, but the problems are correlated
> > with <IP2> (notice the IP in the error messages). <IP2> was also considered
> > a dead node after these errors, according to the Hadoop namenode web UI.
> > I think this is an HDFS failure rather than an HBase/ZooKeeper one (although
> > it is probably triggered by HBase's high load...).
> >
> > On the datanodes, once in a while I had:
> >
> > 2010-04-01 11:24:59,265 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(<IP2>:50010,
> > storageID=DS-1822315410-<IP2>-50010-1266860406782, infoPort=50075,
> > ipcPort=50020):DataXceiver
> >
> > but these errors occurred at different times, and not even around the
> > crashes. No fatal errors were found in the datanode log (but it still crashed).
> >
> > I haven't seen this exact error on the web (only similar ones);
> > this guy (http://osdir.com/ml/hbase-user-hadoop-apache/2009-02/msg00186.html)
> > had a similar problem, but not exactly the same.
> >
> > Any ideas?
> > thanks,
> >
> > --
> > Oded

--
Oded