St.Ack,

On Wed, Mar 4, 2009 at 3:36 PM, stack <[email protected]> wrote:

> See http://wiki.apache.org/hadoop/Hbase/Troubleshooting#5.

Yes, we checked this item before; however, we now hit another exception while
calling MemcacheFlusher.flushRegion:

Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:2786)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:71)
        at java.io.DataOutputStream.writeInt(DataOutputStream.java:182)
        at org.apache.hadoop.hbase.io.ImmutableBytesWritable.write(ImmutableBytesWritable.java:115)

We changed the GC settings according to
http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom,
but the problem is still there. We have also changed to start 4 regionservers
instead of a single one, and will keep an eye on how it goes.
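For reference, the par_gc.oom section linked above boils down to either giving
the JVM more headroom or disabling the overhead-limit check. A minimal
hbase-env.sh sketch, assuming a stock script of this era that exposes
HBASE_HEAPSIZE and HBASE_OPTS, with purely illustrative values:

# hbase-env.sh -- illustrative values only, not necessarily what was applied here
export HBASE_HEAPSIZE=2000                       # regionserver heap in MB; default is much smaller
# last resort: stop the JVM from throwing when GC time dominates
# export HBASE_OPTS="$HBASE_OPTS -XX:-UseGCOverheadLimit"

Disabling the check only hides the symptom; the larger heap is what actually
buys room for the memcaches and mapfile indexes.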
> On your hitting the global mem limit, you are uploading at the time, right?
> If so, then it's probably fine.  What's your schema look like, by the way?

One of the schemas is:

{NAME => '1001_profiles', IS_ROOT => 'false', IS_META => 'false', FAMILIES => [
  {NAME => 'inferred', BLOOMFILTER => 'false', COMPRESSION => 'NONE', VERSIONS => '3', LENGTH => '2147483647', TTL => '-1', IN_MEMORY => 'false', BLOCKCACHE => 'false'},
  {NAME => 'edge', BLOOMFILTER => 'false', COMPRESSION => 'NONE', VERSIONS => '3', LENGTH => '2147483647', TTL => '-1', IN_MEMORY => 'false', BLOCKCACHE => 'false'},
  {NAME => 'scored', BLOOMFILTER => 'false', COMPRESSION => 'NONE', VERSIONS => '3', LENGTH => '2147483647', TTL => '-1', IN_MEMORY => 'false', BLOCKCACHE => 'false'},
  {NAME => 'fetl', BLOOMFILTER => 'false', COMPRESSION => 'NONE', VERSIONS => '3', LENGTH => '2147483647', TTL => '-1', IN_MEMORY => 'false', BLOCKCACHE => 'false'},
  {NAME => 'reverse_edge', BLOOMFILTER => 'false', COMPRESSION => 'NONE', VERSIONS => '3', LENGTH => '2147483647', TTL => '-1', IN_MEMORY => 'false', BLOCKCACHE => 'false'},
  {NAME => 'pre_fetl', BLOOMFILTER => 'false', COMPRESSION => 'NONE', VERSIONS => '3', LENGTH => '2147483647', TTL => '-1', IN_MEMORY => 'false', BLOCKCACHE => 'false'}
], INDEXES => []}

> The droppedsnapshot exception seen in your first message should also be
> addressed by #5 in troubleshooting above.
>
> You have upped your ulimit file descriptors?

The ulimit file descriptors are fine now; we

* added 'root - nofile 65536' to /etc/security/limits.conf
* added 'fs.file-max=200000' to /etc/sysctl.conf, and applied it with sysctl -p

Thanks for your help,
Basil.

> St.Ack
>
> On Tue, Mar 3, 2009 at 11:29 PM, Basil He <[email protected]> wrote:
>
> > Stack,
> >
> > After we switched to a larger EC2 instance, the problem is still there,
> > and at the same time we found the following message in the datanode's log:
> >
> > java.io.IOException: xceiverCount 1024 exceeds the limit of concurrent
> > xcievers 1023
> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:87)
> >         at java.lang.Thread.run(Thread.java:619)
> >
> > Thanks very much for your help.
> > Basil.
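That xceiverCount message means the datanode's ceiling on concurrent block
transceivers is too low for the number of files HBase keeps open. A minimal
sketch of the usual remedy, assuming Hadoop of this vintage reads it from
hadoop-site.xml (hdfs-site.xml on later versions) and that 2048 is only an
illustrative value:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <!-- note the historical misspelling; raise from the small default so each
       datanode can serve more concurrent block streams -->
  <value>2048</value>
</property>

The datanodes need a restart for the new limit to take effect.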
> > On Sat, Feb 28, 2009 at 10:41 AM, Xiaogang He <[email protected]> wrote:
> >
> > > stack,
> > >
> > > Thanks for your reply, I really appreciate it.
> > >
> > > On Fri, Feb 27, 2009 at 11:49 PM, stack <[email protected]> wrote:
> > >
> > >> Tell us more about your hbase install?  Number of servers, number of
> > >> regions, schema, general size of your cells, and hbase version.
> > >
> > > We just have a small hadoop cluster with 1 master and 3 slaves, plus a
> > > single hmaster and 1 regionserver, and the version numbers are both 0.19.
> > >
> > >> The configuration that most directly affects the amount of heap used is
> > >> the below:
> > >>
> > >> <property>
> > >>   <name>hbase.io.index.interval</name>
> > >>   <value>128</value>
> > >>   <description>The interval at which we record offsets in hbase
> > >>   store files/mapfiles.  Default for stock mapfiles is 128.  Index
> > >>   files are read into memory.  If there are many of them, could prove
> > >>   a burden.  If so play with the hadoop io.map.index.skip property and
> > >>   skip every nth index member when reading back the index into memory.
> > >>   Downside to high index interval is lowered access times.
> > >>   </description>
> > >> </property>
> > >>
> > >> You could try setting io.map.index.skip to 4 or 8 across your cluster
> > >> and restart.
> > >
> > > We have namenode/secondarynamenode/hmaster/regionserver running on a small
> > > EC2 instance (1.7G memory). We think that is part of the problem, so we
> > > have switched to a larger instance now. We will try the above suggestion
> > > if we hit the problem again.
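For concreteness, a minimal sketch of that suggestion as it would look in the
site configuration the regionservers read (hadoop-site.xml or hbase-site.xml
of this era); the value 4 is just the lower of the two suggested settings:

<property>
  <name>io.map.index.skip</name>
  <!-- skip index entries when reading a mapfile index back into memory;
       trades a smaller in-memory index for somewhat slower lookups -->
  <value>4</value>
</property>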
> > >> The flushing of the cache seems to be frustrated by an hdfs error in the
> > >> below.  You have read the 'getting started' section and have upped your
> > >> ulimit file descriptors?
> > >
> > > Yes, we have changed the ulimit file descriptors according to the FAQ on
> > > the hbase official site.
> > >
> > > Thank you very much.
> > >
> > > Regards,
> > > Basil.
> > >
> > >> St.Ack
> > >>
> > >> On Thu, Feb 26, 2009 at 8:16 PM, Xiaogang He <[email protected]> wrote:
> > >>
> > >> > hi,
> > >> >
> > >> > I keep hitting the following exception after hbase is restarted and has
> > >> > been running for a while:
> > >> >
> > >> > 2009-02-26 15:14:04,827 INFO org.apache.hadoop.hbase.regionserver.HLog: Closed hdfs://hmaster:50001/hbase/log_10.249.190.85_1235626687854_60020/hlog.dat.1235679079054, entries=100053. New log writer: /hbase/log_10.249.190.85_1235626687854_60020/hlog.dat.1235679244824
> > >> > 2009-02-26 15:14:16,405 INFO org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Forced flushing of 1002_profiles,155123497688845858,1235539496917 because global memcache limit of 396.9m exceeded; currently 396.9m and flushing till 248.1m
> > >> > 2009-02-26 15:14:18,666 INFO org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Forced flushing of 1002_profiles,145928983691898633,1235539496917 because global memcache limit of 396.9m exceeded; currently 386.3m and flushing till 248.1m
> > >> > 2009-02-26 15:14:19,497 INFO org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Forced flushing of 1001_profiles,,1235562106563 because global memcache limit of 396.9m exceeded; currently 376.2m and flushing till 248.1m
> > >> > 2009-02-26 15:14:21,971 INFO org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Forced flushing of 1002_profiles,1859616112140717,1235538938447 because global memcache limit of 396.9m exceeded; currently 367.1m and flushing till 248.1m
> > >> > 2009-02-26 15:14:23,506 INFO org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Forced flushing of 1002_profiles,256848350134132138,1235539352160 because global memcache limit of 396.9m exceeded; currently 358.2m and flushing till 248.1m
> > >> > 2009-02-26 15:14:26,273 INFO org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Forced flushing of 1001_profiles,38395253911274047,1235562695944 because global memcache limit of 396.9m exceeded; currently 349.4m and flushing till 248.1m
> > >> > 2009-02-26 15:14:27,946 INFO org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Forced flushing of 1001_relationships,18320094988761441,1235659399900 because global memcache limit of 396.9m exceeded; currently 340.8m and flushing till 248.1m
> > >> > 2009-02-26 15:14:28,898 INFO org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Forced flushing of 1001_profiles,183105869903093166,1235658588032 because global memcache limit of 396.9m exceeded; currently 332.3m and flushing till 248.1m
> > >> > 2009-02-26 15:14:29,857 INFO org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Forced flushing of 1001_relationships,1279872936511407,1235563047231 because global memcache limit of 396.9m exceeded; currently 323.9m and flushing till 248.1m
> > >> > 2009-02-26 15:14:30,338 INFO org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Forced flushing of 1002_profiles,9374985809090827,1235658787938 because global memcache limit of 396.9m exceeded; currently 315.5m and flushing till 248.1m
> > >> > 2009-02-26 15:14:31,284 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.249.187.102:50010
> > >> > 2009-02-26 15:14:31,284 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-8226110948737137663_51382
> > >> > 2009-02-26 15:14:39,640 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Could not read from stream
> > >> > 2009-02-26 15:14:39,640 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_4802751471280593846_51382
> > >> > 2009-02-26 15:14:45,807 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.249.187.102:50010
> > >> > 2009-02-26 15:14:45,807 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-3919223098697505175_51382
> > >> > 2009-02-26 15:14:51,813 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Could not read from stream
> > >> > 2009-02-26 15:14:51,813 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6922144209752436228_51382
> > >> > 2009-02-26 15:14:57,827 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
> > >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2723)
> > >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
> > >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> > >> >
> > >> > 2009-02-26 15:14:57,845 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6922144209752436228_51382 bad datanode[0] nodes == null
> > >> > 2009-02-26 15:14:57,846 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Aborting...
> > >> > 2009-02-26 15:14:57,924 FATAL org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Replay of hlog required. Forcing server shutdown
> > >> > org.apache.hadoop.hbase.DroppedSnapshotException: region: 1002_profiles,9374985809090827,1235658787938
> > >> >         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:896)
> > >> >         at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:789)
> > >> >         at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.flushRegion(MemcacheFlusher.java:227)
> > >> >         at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.flushSomeRegions(MemcacheFlusher.java:291)
> > >> >         at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.reclaimMemcacheMemory(MemcacheFlusher.java:261)
> > >> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdates(HRegionServer.java:1614)
> > >> >         at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> > >> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >> >         at java.lang.reflect.Method.invoke(Method.java:597)
> > >> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:632)
> > >> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:895)
> > >> >
> > >> > I noticed there are some parameters regarding this, such as
> > >> > hbase.regionserver.globalMemcache.upperLimit and
> > >> > hbase.regionserver.globalMemcache.lowerLimit.
> > >> >
> > >> > I'm just using the default settings:
> > >> >
> > >> > <property>
> > >> >   <name>hbase.regionserver.globalMemcache.upperLimit</name>
> > >> >   <value>0.4</value>
> > >> >   <description>Maximum size of all memcaches in a region server before new
> > >> >   updates are blocked and flushes are forced.  Defaults to 40% of heap.
> > >> >   </description>
> > >> > </property>
> > >> > <property>
> > >> >   <name>hbase.regionserver.globalMemcache.lowerLimit</name>
> > >> >   <value>0.25</value>
> > >> >   <description>When memcaches are being forced to flush to make room in
> > >> >   memory, keep flushing until we hit this mark.  Defaults to 30% of heap.
> > >> >   This value equal to hbase.regionserver.globalmemcache.upperLimit causes
> > >> >   the minimum possible flushing to occur when updates are blocked due to
> > >> >   memcache limiting.
> > >> >   </description>
> > >> > </property>
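Going by the descriptions just quoted, the forced-flush churn is governed by the
heap size and by the gap between these two marks. A minimal illustrative
hbase-site.xml override, assuming the goal is simply fewer, smaller forced
flushes during bulk uploads (the 0.35 value is an example, not a recommendation
from this thread):

<property>
  <name>hbase.regionserver.globalMemcache.upperLimit</name>
  <value>0.4</value>   <!-- unchanged: block updates at 40% of heap -->
</property>
<property>
  <name>hbase.regionserver.globalMemcache.lowerLimit</name>
  <!-- raised toward the upper limit so each forced flush frees less at a time;
       per the description above, equal values give the minimum possible flushing -->
  <value>0.35</value>
</property>

Whatever the marks, 40% of a small heap is still a small memcache budget, so
raising the regionserver heap matters more than tuning these percentages.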
> > >> >
> > >> > Could anyone please give me some guidance to help me out of this issue?
> > >> >
> > >> > Thanks,
> > >> > Basil.
