Thanks for writing back to the list, Oded. I changed the subject so it's easier to find your suggestions amongst the mailing-list weeds going forward. On swappiness, setting it to 0 is extreme, but since you've supplied links, users can do as you suggest or do something not so radical.
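
For the not-so-radical route, something like the below would do it -- a rough, untested sketch that assumes a Linux box, root access, and a low-but-nonzero value such as the 5 or 10 Oded mentions:

  # See what the kernel is using right now.
  cat /proc/sys/vm/swappiness

  # Set a moderate value immediately (does not survive a reboot)...
  sysctl -w vm.swappiness=10

  # ...and persist it in /etc/sysctl.conf for the next boot.
  echo 'vm.swappiness = 10' >> /etc/sysctl.conf
  sysctl -p /etc/sysctl.conf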
Good stuff,
St.Ack

On Mon, Apr 12, 2010 at 11:31 AM, Oded Rosen <o...@legolas-media.com> wrote:
> The tips you guys gave me made a huge difference.
> I also used other tips from the "Troubleshooting" section in the hbase wiki, and
> from all around the web.
> I would like to share my current cluster configuration, as only a few places
> around the web offer a guided tour of these important configuration changes.
> This might be helpful for other people with small clusters that have
> problems with loading large amounts of data into hbase on a regular basis.
> I am not a very experienced user (yet...) so if I got something wrong, or if
> I am missing anything, please say so. Thanks in advance.
>
> *1. Prevent your regionserver machines from memory swap* - this is a must
> have, it seems, for small hbase clusters that handle large loads.
>
> *Edit this file (on each regionserver) and then run the following
> commands.*
>
> *File:* /etc/sysctl.conf
> *Add values:*
> vm.swappiness = 0 (this one - on datanodes only!)
>
> *Then run (to apply the changes immediately):*
> sysctl -p /etc/sysctl.conf
> service network restart
>
> Note: this is a kernel property change. A swappiness of zero means the
> machine will not swap process memory to disk at all (or at least that's what
> I understood), so handle with care. A low value (around 5 or 10, out of a
> maximum of 100) might also work. My configuration is zero.
>
> (Further explanation:
> http://cloudepr.blogspot.com/2009/09/cluster-facilities-hardware-and.html )
>
> *2. Increase the file descriptor limit* - this is also a must have for almost
> any use of hbase.
>
> *Edit these two files (on each datanode/namenode) and then run the
> following commands.*
>
> *File:* /etc/security/limits.conf
> *Add values:*
>
> hadoop soft nofile 32768
> hadoop hard nofile 32768
>
> *File:* /etc/sysctl.conf
> *Add values:*
>
> fs.file-max = 32768
>
> *Then run:*
> sysctl -p /etc/sysctl.conf
> service network restart
>
> Note: you can perform steps 1+2 together, since they both edit sysctl.conf.
> Notice that step 1 is only for regionservers (datanodes), while this one can
> also be applied to the master (namenode) - although I'm not so sure it's
> necessary.
>
> (see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A6 )
>
> *3. Raise the HDFS + HBase connection limit upper bounds:*
>
> Edit the hadoop/hbase configuration files to include these entries
> (you might want to change the specific values according to your cluster
> properties and usage).
>
> *File:* hdfs-site.xml
> *Add properties:*
>
> name: dfs.datanode.max.xcievers
> value: 2047
>
> name: dfs.datanode.handler.count
> value: 10 (at least the number of nodes in the cluster, or more if needed)
>
> (see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A6 )
>
> *File:* hbase-site.xml
> *Add properties:*
>
> <property>
>   <name>hbase.regionserver.handler.count</name>
>   <value>100</value>
> </property>
> <property>
>   <name>hbase.zookeeper.property.maxClientCnxns</name>
>   <value>100</value>
> </property>
>
> (see http://hadoop.apache.org/hbase/docs/current/api/overview-summary.html#overview_description )
>
> If you can remember other changes you've made to increase hbase stability,
> you are welcome to reply.
>
> Cheers.
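
Since steps 1 and 2 above both touch /etc/sysctl.conf, here is a rough sketch of
doing them in one pass on a regionserver/datanode. It assumes root access, a Linux
distro that reads /etc/sysctl.conf and /etc/security/limits.conf, and daemons running
as the "hadoop" user -- adjust names and values to your own setup:

  #!/bin/sh
  # Step 1: discourage swapping (regionservers/datanodes only).
  grep -q '^vm.swappiness' /etc/sysctl.conf || echo 'vm.swappiness = 0' >> /etc/sysctl.conf

  # Step 2: raise the file descriptor ceilings.
  grep -q '^fs.file-max' /etc/sysctl.conf || echo 'fs.file-max = 32768' >> /etc/sysctl.conf
  echo 'hadoop soft nofile 32768' >> /etc/security/limits.conf
  echo 'hadoop hard nofile 32768' >> /etc/security/limits.conf

  # Apply the kernel settings now. The limits.conf change only takes effect
  # for new login sessions, so restart the hadoop/hbase daemons afterwards.
  sysctl -p /etc/sysctl.conf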
>
> On Thu, Apr 1, 2010 at 11:43 PM, Andrew Purtell <apurt...@apache.org> wrote:
>
>> First,
>>
>> "ulimit: 1024"
>>
>> That's fatal. You need to up file descriptors to something like 32K.
>>
>> See http://wiki.apache.org/hadoop/Hbase/Troubleshooting, item #6
>>
>> From there, let's see.
>>
>>   - Andy
>>
>> > From: Oded Rosen <o...@legolas-media.com>
>> > Subject: DFSClient errors during massive HBase load
>> > To: hbase-user@hadoop.apache.org
>> > Date: Thursday, April 1, 2010, 1:19 PM
>> > Hi all,
>> >
>> > I have a problem with a massive HBase loading job.
>> > It goes from raw files to hbase, through some mapreduce processing +
>> > manipulation (so loading directly to files will not be easy).
>> >
>> > After some dozen million successful writes - a few hours of load - some of
>> > the regionservers start to die - one by one - until the whole cluster is
>> > kaput.
>> > The hbase master sees a "znode expired" error each time a regionserver
>> > falls. The regionserver errors are attached.
>> >
>> > Current configuration:
>> > Four nodes - one namenode+master, three datanodes+regionservers.
>> > dfs.datanode.max.xcievers: 2047
>> > ulimit: 1024
>> > servers: fedora
>> > hadoop-0.20, hbase-0.20, hdfs (private servers, not on ec2 or anything).
>> >
>> > *The specific errors from the regionserver log (from <IP6>, see comment):*
>> >
>> > 2010-04-01 11:36:00,224 WARN org.apache.hadoop.hdfs.DFSClient:
>> > DFSOutputStream ResponseProcessor exception for block
>> > blk_7621973847448611459_244908 java.io.IOException: Bad response 1 for
>> > block blk_7621973847448611459_244908 from datanode <IP2>:50010
>> >   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2423)
>> >
>> > *After that, some of these appear:*
>> >
>> > 2010-04-01 11:36:20,602 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
>> > createBlockOutputStream java.io.IOException: Bad connect ack with
>> > firstBadLink <IP2>:50010
>> > 2010-04-01 11:36:20,602 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
>> > block blk_4280490438976631008_245009
>> >
>> > *and the FATAL:*
>> >
>> > 2010-04-01 11:36:32,634 FATAL org.apache.hadoop.hbase.regionserver.HLog:
>> > Could not append. Requesting close of hlog
>> > java.io.IOException: Bad connect ack with firstBadLink <IP2>:50010
>> >   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2872)
>> >   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2795)
>> >   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
>> >   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
>> >
>> > *This FATAL error appears many times until this one kicks in:*
>> >
>> > 2010-04-01 11:38:57,281 FATAL org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
>> > Replay of hlog required. Forcing server shutdown
>> > org.apache.hadoop.hbase.DroppedSnapshotException: region: .META.,,1
>> >   at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:977)
>> >   at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:846)
>> >   at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)
>> >   at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:149)
>> > Caused by: java.io.IOException: Bad connect ack with firstBadLink <IP2>:50010
>> >   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2872)
>> >   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2795)
>> >   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
>> >   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
>> >
>> > *(then the regionserver starts closing itself)*
>> >
>> > The regionserver on <IP6> was shut down, but the problems are correlated
>> > with <IP2> (notice the IP in the error msgs). <IP2> was also considered a
>> > dead node after these errors, according to the hadoop namenode web ui.
>> > I think this is an hdfs failure rather than an hbase/zookeeper one
>> > (although it is probably triggered by the high hbase load...).
>> >
>> > On the datanodes, once in a while I had:
>> >
>> > 2010-04-01 11:24:59,265 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
>> > DatanodeRegistration(<IP2>:50010,
>> > storageID=DS-1822315410-<IP2>-50010-1266860406782, infoPort=50075,
>> > ipcPort=50020):DataXceiver
>> >
>> > but these errors occurred at different times, and not even around the
>> > crashes. No fatal errors were found on the datanode log (but it still
>> > crashed).
>> >
>> > I haven't seen this exact error on the web (only similar ones);
>> > this guy (http://osdir.com/ml/hbase-user-hadoop-apache/2009-02/msg00186.html)
>> > had a similar problem, but not exactly the same.
>> >
>> > Any ideas?
>> > thanks,
>> >
>> > --
>> > Oded
>>
>
> --
> Oded
>
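
One more check worth doing after raising the limits: confirm that a running regionserver actually picked up the higher descriptor ceiling, since "ulimit -n" in your own shell says nothing about a daemon started from another session. A rough sketch, assuming a reasonably recent Linux kernel and that the JDK's jps tool is on the PATH:

  # Find the regionserver's pid and read the limit it is actually running with.
  RS_PID=$(jps | awk '/HRegionServer/ {print $1}')
  grep 'Max open files' /proc/$RS_PID/limits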