>> The first part of the log is usually a client that died while it was
>> doing a request.
Wow, I get tons of them in the logs, and there aren't that many clients that 
got killed, as reported by the MR job. Is that the only case in which these 
errors are reported?

>> The second part of the log is a session expiration. The log fragment
>> is too small to tell if it was the region server that paused or the ZK
>> ensemble that was unreachable during that time... are the zk servers
>> collocated with region servers?
Ok, good, so one of the two happened then; I will try to figure out which. 
The ZK servers are not collocated in my setup: they are a set of 5 dedicated 
nodes (nothing else running on them).

>> About your side question, TOF can be used as the output of a map and
>> it will be just like using the HTable, but I'm not sure how that's
>> going to play with your streaming job.
No, I was referring to using TOF in a regular Java-API MR job, but I guessed 
that using TOF would be similar to what I am currently doing.

I will use Ganglia to monitor the stats.

Thank you for the prompt response.
I will update soon with more results.
Vidhya

On 8/4/10 11:45 AM, "Jean-Daniel Cryans" <[email protected]> wrote:

The first part of the log is usually a client that died while it was
doing a request.

The second part of the log is a session expiration. The log fragment
is too small to tell if it was the region server that paused or the ZK
ensemble that was unreachable during that time... are the zk servers
collocated with region servers?

About your side question, TOF can be used as the output of a map and
it will be just like using the HTable, but I'm not sure how that's
going to play with your streaming job.
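For reference, wiring TOF into a map-only job would look roughly like the sketch below. This is a minimal, hedged outline against the org.apache.hadoop.hbase.mapreduce API of that era, not a tested implementation; the table name "DocData" and the "f:v" column are taken from or inferred from the logs in this thread, and the mapper, class names, and input format (tab-separated "rowkey\tvalue" lines) are assumptions for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical map-only job that emits Puts through TableOutputFormat.
// TOF batches the Puts to the region servers much like a client-side
// HTable with a write buffer would.
public class TofUpdateJob {

  // Assumed input: one "rowkey<TAB>value" line per record.
  static class UpdateMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      Put put = new Put(Bytes.toBytes(parts[0]));
      // Single column family, as in the experiment described above;
      // the family/qualifier names here are made up.
      put.add(Bytes.toBytes("f"), Bytes.toBytes("v"), Bytes.toBytes(parts[1]));
      ctx.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set(TableOutputFormat.OUTPUT_TABLE, "DocData"); // table from the logs
    Job job = new Job(conf, "tof-update");
    job.setJarByClass(TofUpdateJob.class);
    job.setMapperClass(UpdateMapper.class);
    job.setOutputFormatClass(TableOutputFormat.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    job.setNumReduceTasks(0); // map-only: no sort/shuffle, Puts go straight out
    // FileInputFormat.addInputPath(job, new Path(args[0])); // input as usual
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

With zero reduce tasks there is no sort/shuffle at all; each mapper's Puts are buffered and flushed by TOF, which is what makes it comparable to the HTable-based streaming setup described in this thread.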

About "Were the RS's too busy to ping the ZK quorum?": apart from
looking at the log and seeing a pause or not, only you can tell if you
are overcommitting your resources. Set up Ganglia, look at the
metrics, see if the CPUs are too busy, and make sure there's 0 kB in
swap (or just disable it altogether).
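As a quick stand-in until Ganglia is up, swap usage can be spot-checked straight from /proc/meminfo. A small sketch (Linux-only; a hypothetical helper, not part of HBase or Hadoop):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Report swap in use on this node by parsing /proc/meminfo.
// Any non-zero swap usage on a region server is a red flag for the
// kind of pause that leads to ZK session expirations.
public class SwapCheck {

  public static long swapUsedKb() throws IOException {
    long total = 0, free = 0;
    for (String line : Files.readAllLines(Paths.get("/proc/meminfo"))) {
      if (line.startsWith("SwapTotal:")) total = parseKb(line);
      if (line.startsWith("SwapFree:"))  free  = parseKb(line);
    }
    return total - free; // 0 means nothing has been swapped out
  }

  // Lines look like "SwapTotal:     2097148 kB"; keep only the digits.
  private static long parseKb(String line) {
    return Long.parseLong(line.replaceAll("[^0-9]", ""));
  }

  public static void main(String[] args) throws IOException {
    System.out.println("swap used (kB): " + swapUsedKb());
  }
}
```

Running this on each region server (e.g. via a parallel-ssh loop) gives a crude but immediate answer to the "0 kB in swap" check above.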

J-D

On Wed, Aug 4, 2010 at 11:27 AM, Vidhyashankar Venkataraman
<[email protected]> wrote:
> I am trying to run an update (insert and modify operations) experiment on an 
> existing 100 TB database.
>
> The job is a map-only job that doesn't use TableOutputFormat (since the 
> mappers cannot know the splits anyway, I can't optimize for locality here 
> unless I customize it on my own, which I will be doing shortly). I have a 
> program that generates Puts on existing rows and new inserts within a 
> specified range, and I use Hadoop Streaming to run the program as individual 
> tasks on specified ranges.
>
> (A side question is whether it is better to use TableOutputFormat with 
> reduce tasks: the cost of the sort-and-shuffle operations seems to be the 
> same in both cases. In my case, a shuffle will happen anyway and sorts will 
> happen in each region server; the maps are at coarser shard ranges: 6 maps 
> per node as against 170 regions per node.)
>
> Existing setup:
> 180 nodes, 6 map tasks per node, 100 TB existing DB, 10 TB of updates. Logs 
> disabled. 4 GB regions. 12 MB write buffer size in each client. Single 
> column family and a single value of 28 KB per row; no compression in these 
> experiments (I will soon be running experiments with compressed data).
>
> Results: around 130 region servers crashed, all of them with logs of the 
> following form. The first set of log lines isn't the root cause of the 
> crashes; they seem to be put errors caused by a possible socket timeout. Is 
> that right?
> The second set of logs below shows the point where a server crashed. It 
> looks like a ZK timeout: were the RS's too busy to ping the ZK quorum?
>
>
> Thank you in advance :)
> Vidhya
>
>
> The logs:
>
>
> 2010-08-04 08:05:42,886 DEBUG 
> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested 
> for region DocData,0000000149473989,1280881832403/571301273 because: 
> regionserver/63.250.207.82:60020.cacheFlusher2010-08-04 08:05:42,886 DEBUG 
> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested 
> for region DocData,0000000148937993,1280881832423/196751683 because: 
> regionserver/63.250.207.82:60020.cacheFlusher2010-08-04 08:05:42,886 DEBUG 
> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested 
> for region DocData,0000000147751972,1280881832465/1982268997 because: 
> regionserver/63.250.207.82:60020.cacheFlusher2010-08-04 08:05:42,900 WARN 
> org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call 
> put([...@661cc0d6, [Lorg.apache.hadoop.hbase.client.Put;@69b855f9) from 
> 63.250.207.23:37600: output error
> 2010-08-04 08:05:42,901 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server 
> Responder, call put([...@460ca540, 
> [Lorg.apache.hadoop.hbase.client.Put;@683530ad) from 98.136.191.232:59388: 
> output error
> 2010-08-04 08:05:42,901 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server 
> Responder, call put([...@631e978e, 
> [Lorg.apache.hadoop.hbase.client.Put;@6d701e78) from 63.250.207.113:44020: 
> output error
> 2010-08-04 08:05:42,913 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server 
> Responder, call put([...@685060b9, 
> [Lorg.apache.hadoop.hbase.client.Put;@61f39aaa) from 98.136.191.122:40316: 
> output error
> 2010-08-04 08:05:42,916 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
> handler 87 on 60020 caught: java.nio.channels.ClosedChannelException        
> at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)      
>    at 
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>        at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
>         at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>        at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
> 2010-08-04 08:05:42,916 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
> handler 86 on 60020 caught: java.nio.channels.ClosedChannelException        
> at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)      
>    at 
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>        at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
>         at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>        at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
>
> 2010-08-04 08:05:42,916 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
> handler 85 on 60020 caught: java.nio.channels.ClosedChannelException
>        at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>        at 
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>        at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
>        at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>        at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
>
>
>
> THE SECOND PART OF THE LOGS, WHERE THE CRASHES OCCURRED:
>
>
>
> 2010-08-04 09:30:29,512 WARN 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> org.apache.hadoop.hbase.Leases$LeaseStillHeldException        at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>        at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>        at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>        at 
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>        at 
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>        at 
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>        at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:549)
>        at java.lang.Thread.run(Thread.java:619)
> 2010-08-04 09:30:29,512 WARN org.apache.zookeeper.ClientCnxn: Exception 
> closing session 0x32a206703ca0616 to sun.nio.ch.selectionkeyi...@3d2292cd
> java.io.IOException: TIMED OUT
>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2010-08-04 09:30:29,512 WARN org.apache.zookeeper.ClientCnxn: Exception 
> closing session 0x32a206703ca0014 to sun.nio.ch.selectionkeyi...@36f0b7f8
> java.io.IOException: TIMED OUT
>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2010-08-04 09:30:29,685 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, 
> state: Disconnected, type: None, path: null2010-08-04 09:30:29,695 DEBUG 
> org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: Got 
> ZooKeeper event, state: Disconnected, type: None, path:
> null
> 2010-08-04 09:30:29,897 INFO org.apache.zookeeper.ClientCnxn: Attempting 
> connection to server b3122695.yst.yahoo.net/98.136.191.163:2181
> 2010-08-04 09:30:29,898 INFO org.apache.zookeeper.ClientCnxn: Priming 
> connection to java.nio.channels.SocketChannel[connected 
> local=/63.250.207.82:40417 remo
> te=b3122695.yst.yahoo.net/98.136.191.163:2181]
> 2010-08-04 09:30:29,898 INFO org.apache.zookeeper.ClientCnxn: Server 
> connection successful
> 2010-08-04 09:30:29,901 DEBUG 
> org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: Got 
> ZooKeeper event, state: Expired, type: None, path: null
> 2010-08-04 09:30:29,901 WARN org.apache.zookeeper.ClientCnxn: Exception 
> closing session 0x32a206703ca0616 to sun.nio.ch.selectionkeyi...@7935c2a8
> java.io.IOException: Session Expired
>        at 
> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>         at 
> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)        
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> 2010-08-04 09:30:29,901 INFO org.apache.zookeeper.ZooKeeper: Closing session: 
> 0x32a206703ca0616
> 2010-08-04 09:30:29,901 INFO org.apache.zookeeper.ClientCnxn: Closing 
> ClientCnxn for session: 0x32a206703ca0616
> 2010-08-04 09:30:29,948 INFO org.apache.zookeeper.ClientCnxn: Disconnecting 
> ClientCnxn for session: 0x32a206703ca0616
> 2010-08-04 09:30:29,949 INFO org.apache.zookeeper.ZooKeeper: Session: 
> 0x32a206703ca0616 closed
> 2010-08-04 09:30:29,949 DEBUG 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with 
> ZooKeeper
> 2010-08-04 09:30:29,949 INFO org.apache.zookeeper.ClientCnxn: EventThread 
> shut down
> 2010-08-04 09:30:30,288 INFO org.apache.zookeeper.ClientCnxn: Attempting 
> connection to server b3122695.yst.yahoo.net/98.136.191.163:2181
> 2010-08-04 09:30:30,289 INFO org.apache.zookeeper.ClientCnxn: Priming 
> connection to java.nio.channels.SocketChannel[connected 
> local=/63.250.207.82:40418 remote=b3122695.yst.yahoo.net/98.136.191.163:2181]
> 2010-08-04 09:30:30,289 INFO org.apache.zookeeper.ClientCnxn: Server 
> connection successful
> 2010-08-04 09:30:30,291 WARN org.apache.zookeeper.ClientCnxn: Exception 
> closing session 0x32a206703ca0014 to sun.nio.ch.selectionkeyi...@3dc37e55
> java.io.IOException: Session Expired
>        at 
> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> 2010-08-04 09:30:30,291 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, 
> state: Expired, type: None, path: null
> 2010-08-04 09:30:30,291 ERROR 
> org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
> 2010-08-04 09:30:30,292 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: 
> request=0.0, regions=162, stores=162, storefiles=280, storefileIndexSize=58, 
> memstoreSize=1724, compactionQueueSize=63, usedHeap=1877, maxHeap=7999, 
> blockCacheSize=20643688, blockCacheFree=2495761816, blockCacheCount=0, 
> blockCacheHitRatio=0, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
>
>
>
