and in the gc.log of the region server we get CMS failures that cause full GCs (which fail to free memory):
11867.254: [Full GC 11867.254: [CMS: 3712638K->3712638K(3712640K), 4.7779250 secs] 4032614K->4032392K(4057664K), [CMS Perm : 20062K->19883K(33548K)] icms_dc=100 , 4.7780440 secs] [Times: user=4.76 sys=0.02, real=4.78 secs]
11872.033: [GC [1 CMS-initial-mark: 3712638K(3712640K)] 4032392K(4057664K), 0.0734520 secs] [Times: user=0.07 sys=0.00, real=0.07 secs]
11872.107: [CMS-concurrent-mark-start]
11872.107: [Full GC 11872.107: [CMS11872.693: [CMS-concurrent-mark: 0.584/0.586 secs] [Times: user=2.92 sys=0.00, real=0.59 secs] (concurrent mode failure): 3712638K->3712638K(3712640K), 5.3078630 secs] 4032392K->4032392K(4057664K), [CMS Perm : 19883K->19883K(33548K)] icms_dc=100 , 5.3079940 secs] [Times: user=7.63 sys=0.00, real=5.31 secs]
11877.415: [Full GC 11877.415: [CMS: 3712638K->3712638K(3712640K), 4.6467720 secs] 4032392K->4032392K(4057664K), [CMS Perm : 19883K->19883K(33548K)] icms_dc=100 , 4.6468910 secs] [Times: user=4.65 sys=0.00, real=4.65 secs]
11882.063: [GC [1 CMS-initial-mark: 3712638K(3712640K)] 4032402K(4057664K), 0.0730580 secs] [Times: user=0.07 sys=0.00, real=0.07 secs]
11882.136: [CMS-concurrent-mark-start]
11882.300: [Full GC 11882.300: [CMS11882.784: [CMS-concurrent-mark: 0.628/0.648 secs] [Times: user=3.79 sys=0.12, real=0.65 secs] (concurrent mode failure): 3712638K->3712639K(3712640K), 7.2815000 secs] 4057662K->4044438K(4057664K), [CMS Perm : 20001K->20000K(33548K)] icms_dc=100 , 7.2816440 secs] [Times: user=9.19 sys=0.01, real=7.28 secs]

On Sun, Aug 14, 2011 at 7:32 PM, Lior Schachter <li...@infolinks.com> wrote:
> Hi,
>
> Cluster details:
> hbase 0.90.2, 10 machines, 1GB switch.
>
> Use case:
> An M/R job that inserts about 10 million rows to hbase in the reducer,
> followed by an M/R job that works with hdfs files.
> When the maps of the first job finish and the maps of the second job start,
> the region server crashes.
> Please note that when running the 2 jobs separately they both finish
> successfully.
>
> From our monitoring we see that when the 2 jobs run together the network
> load reaches our max bandwidth - 1GB.
>
> In the region server log we see these exceptions:
>
> a.
> 2011-08-14 18:37:36,263 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@491fb2f4)
> from 10.11.87.73:33737: output error
> 2011-08-14 18:37:36,264 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 24 on 8041 caught: java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>     at org.apache.hadoop.hbase.ipc.HBaseServer.channelIO(HBaseServer.java:1387)
>     at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1339)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>
> b.
> 2011-08-14 18:41:56,225 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception for block
> blk_-8181634225601608891_579246 java.io.EOFException
>     at java.io.DataInputStream.readFully(DataInputStream.java:180)
>     at java.io.DataInputStream.readLong(DataInputStream.java:399)
>     at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:122)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2548)
>
> c.
> 2011-08-14 18:42:02,960 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> recovery attempt #0 from primary datanode 10.11.87.72:50010
> org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.ipc.RemoteException: java.io.IOException:
> blk_-8181634225601608891_579246 is already commited, storedBlock == null.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:4877)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:501)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
>
>     at org.apache.hadoop.ipc.Client.call(Client.java:740)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>     at $Proxy4.nextGenerationStamp(Unknown Source)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1577)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1551)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1617)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
>
>     at org.apache.hadoop.ipc.Client.call(Client.java:740)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>     at $Proxy9.recoverBlock(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2706)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2173)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2372)
>
> Few questions:
> 1. Can we configure hadoop/hbase not to consume all network resources
>    (e.g., to specify upper limit for map/reduce network load)?
> 2. Should we increase the timeout for open connections?
> 3. Can we assign different IPs for data transfer and region quorum check
>    protocol (zookeeper)?
>
> Thanks,
> Lior
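What we are planning to try on the GC side is sketched below. The occupancy fraction and the decision to turn incremental CMS off are our own guesses for a ~4G region server heap, not tested or recommended values, in conf/hbase-env.sh:

  # Sketch for conf/hbase-env.sh -- the values below are assumptions, not tested.
  # Start the CMS cycle earlier, so the old generation is not already full when
  # concurrent marking begins (the "concurrent mode failure" entries above), and
  # disable incremental CMS (the icms_dc=100 entries) since these are dedicated
  # server machines.
  export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
    -XX:+CMSParallelRemarkEnabled \
    -XX:CMSInitiatingOccupancyFraction=70 \
    -XX:+UseCMSInitiatingOccupancyOnly \
    -XX:-CMSIncrementalMode"

For question 2 we are also considering raising zookeeper.session.timeout in hbase-site.xml, so a single long pause does not immediately cost the region server its ZooKeeper session - though that only hides the pauses, it does not remove them.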