and in the gc.log of the region server we get CMS failures that cause full GCs (which fail to free memory):
11867.254: [Full GC 11867.254: [CMS: 3712638K->3712638K(3712640K), 4.7779250 secs] 4032614K->4032392K(4057664K), [CMS Perm : 20062K->19883K(33548K)] icms_dc=100 , 4.7780440 secs] [Times: user=4.76 sys=0.02, real=4.78 secs]
11872.033: [GC [1 CMS-initial-mark: 3712638K(3712640K)] 4032392K(4057664K), 0.0734520 secs] [Times: user=0.07 sys=0.00, real=0.07 secs]
11872.107: [CMS-concurrent-mark-start]
11872.107: [Full GC 11872.107: [CMS11872.693: [CMS-concurrent-mark: 0.584/0.586 secs] [Times: user=2.92 sys=0.00, real=0.59 secs] (concurrent mode failure): 3712638K->3712638K(3712640K), 5.3078630 secs] 4032392K->4032392K(4057664K), [CMS Perm : 19883K->19883K(33548K)] icms_dc=100 , 5.3079940 secs] [Times: user=7.63 sys=0.00, real=5.31 secs]
11877.415: [Full GC 11877.415: [CMS: 3712638K->3712638K(3712640K), 4.6467720 secs] 4032392K->4032392K(4057664K), [CMS Perm : 19883K->19883K(33548K)] icms_dc=100 , 4.6468910 secs] [Times: user=4.65 sys=0.00, real=4.65 secs]
11882.063: [GC [1 CMS-initial-mark: 3712638K(3712640K)] 4032402K(4057664K), 0.0730580 secs] [Times: user=0.07 sys=0.00, real=0.07 secs]
11882.136: [CMS-concurrent-mark-start]
11882.300: [Full GC 11882.300: [CMS11882.784: [CMS-concurrent-mark: 0.628/0.648 secs] [Times: user=3.79 sys=0.12, real=0.65 secs] (concurrent mode failure): 3712638K->3712639K(3712640K), 7.2815000 secs] 4057662K->4044438K(4057664K), [CMS Perm : 20001K->20000K(33548K)] icms_dc=100 , 7.2816440 secs] [Times: user=9.19 sys=0.01, real=7.28 secs]

On Sun, Aug 14, 2011 at 7:32 PM, Lior Schachter <li...@infolinks.com> wrote:
> Hi,
>
> Cluster details:
> hbase 0.90.2, 10 machines, 1GB switch.
>
> Use case:
> An M/R job that inserts about 10 million rows to hbase in the reducer,
> followed by an M/R job that works with hdfs files.
> When the maps of the first job finish and the maps of the second job start,
> the region server crashes.
> Please note that when running the 2 jobs separately they both finish
> successfully.
>
> From our monitoring we see that when the 2 jobs run together the network
> load reaches our max bandwidth - 1GB.
>
> In the region server log we see these exceptions:
>
> a.
> 2011-08-14 18:37:36,263 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@491fb2f4)
> from 10.11.87.73:33737: output error
> 2011-08-14 18:37:36,264 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 24 on 8041 caught: java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>     at org.apache.hadoop.hbase.ipc.HBaseServer.channelIO(HBaseServer.java:1387)
>     at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1339)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>
> b.
> 2011-08-14 18:41:56,225 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception for block
> blk_-8181634225601608891_579246 java.io.EOFException
>     at java.io.DataInputStream.readFully(DataInputStream.java:180)
>     at java.io.DataInputStream.readLong(DataInputStream.java:399)
>     at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:122)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2548)
>
> c.
> 2011-08-14 18:42:02,960 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> recovery attempt #0 from primary datanode 10.11.87.72:50010
> org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.ipc.RemoteException: java.io.IOException:
> blk_-8181634225601608891_579246 is already commited, storedBlock == null.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:4877)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:501)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
>
>     at org.apache.hadoop.ipc.Client.call(Client.java:740)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>     at $Proxy4.nextGenerationStamp(Unknown Source)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1577)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1551)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1617)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
>
>     at org.apache.hadoop.ipc.Client.call(Client.java:740)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>     at $Proxy9.recoverBlock(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2706)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2173)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2372)
>
> Few questions:
> 1. Can we configure hadoop/hbase not to consume all network resources
>    (e.g., to specify upper limit for map/reduce network load)?
> 2. Should we increase the timeout for open connections?
> 3. Can we assign different IPs for data transfer and region quorum check
>    protocol (zookeeper)?
>
> Thanks,
> Lior
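What we are planning to try on the GC side is sketched below. The occupancy fraction and the decision to turn incremental CMS off are our own guesses for a ~4G region server heap, not tested or recommended values, in conf/hbase-env.sh:

  # Sketch for conf/hbase-env.sh -- the values below are assumptions, not tested.
  # Start the CMS cycle earlier, so the old generation is not already full when
  # concurrent marking begins (the "concurrent mode failure" entries above), and
  # disable incremental CMS (the icms_dc=100 entries) since these are dedicated
  # server machines.
  export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
    -XX:+CMSParallelRemarkEnabled \
    -XX:CMSInitiatingOccupancyFraction=70 \
    -XX:+UseCMSInitiatingOccupancyOnly \
    -XX:-CMSIncrementalMode"

For question 2 we are also considering raising zookeeper.session.timeout in hbase-site.xml, so a single long pause does not immediately cost the region server its ZooKeeper session - though that only hides the pauses, it does not remove them.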