[
https://issues.apache.org/jira/browse/TAJO-257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyunsik Choi updated TAJO-257:
------------------------------
Attachment: TAJO-257.patch
This issue is caused by several problems as follows:
- Before workers are stopped completely, TajoTestingClient tried to shutdown
HDFS.
- Jetty server in each worker was hanging occasionally.
- Some query execution query didn't close query results. So, it caused
resource leak.
This patch solves the above problems. I've tested unit tests ten times
repeatedly. All unit tests are passed without any fail and hanging.
> Unit tests occassionally fail.
> ------------------------------
>
> Key: TAJO-257
> URL: https://issues.apache.org/jira/browse/TAJO-257
> Project: Tajo
> Issue Type: Bug
> Reporter: Hyunsik Choi
> Priority: Blocker
> Fix For: 0.2-incubating
>
> Attachments: TAJO-257.patch
>
>
> {noformat}
> 2013-10-16 14:55:43,110 INFO tajo.TajoTestingCluster
> (TajoTestingCluster.java:startMiniTajoCluster(292)) - Mini Tajo cluster is up
> 2013-10-16 14:55:43,110 INFO rm.TajoWorkerResourceManager
> (TajoWorkerResourceManager.java:workerHeartbeat(395)) -
> TajoWorker:host:local05.gruter.com, port=44238,44237,47304, slots=2048:4:8,
> used=0:0:0 added in live TajoWorker list
> 2013-10-16 14:55:43,149 INFO BlockStateChange
> (BlockManager.java:logAddStoredBlock(2174)) - BLOCK* addStoredBlock: blockMap
> updated: 127.0.0.1:60398 is added to
> blk_-2382303275850279660_1004{blockUCState=COMMITTED, primaryNodeIndex=-1,
> replicas=[ReplicaUnderConstruction[127.0.0.1:60398|RBW]]} size 117894
> 2013-10-16 14:55:43,600 INFO tajo.TajoTestingCluster
> (TajoTestingCluster.java:shutdownMiniCluster(447)) -
> ========================================
> 2013-10-16 14:55:43,600 INFO tajo.TajoTestingCluster
> (TajoTestingCluster.java:shutdownMiniCluster(448)) - Shutdown minicluster
> 2013-10-16 14:55:43,600 INFO tajo.TajoTestingCluster
> (TajoTestingCluster.java:shutdownMiniCluster(449)) -
> ========================================
> 2013-10-16 14:55:43,612 INFO rpc.NettyServerBase
> (NettyServerBase.java:shutdown(120)) - Rpc (TajoMasterProtocol) listened on
> 127.0.0.1:44236) shutdown
> 2013-10-16 14:55:43,613 INFO rpc.NettyServerBase
> (NettyServerBase.java:shutdown(120)) - Rpc (TajoMasterClientProtocol)
> listened on 127.0.0.1:44235) shutdown
> 2013-10-16 14:55:43,614 INFO rpc.NettyServerBase
> (NettyServerBase.java:shutdown(120)) - Rpc (CatalogProtocol) listened on
> 127.0.0.1:44234) shutdown
> 2013-10-16 14:55:43,614 INFO catalog.CatalogServer
> (CatalogServer.java:stop(161)) - Catalog Server (127.0.0.1:44234) shutdown
> 2013-10-16 14:55:43,614 INFO master.TajoMaster (TajoMaster.java:stop(436)) -
> Tajo Master main thread exiting
> 2013-10-16 14:55:43,615 INFO worker.TajoWorker (TajoWorker.java:run(451)) -
> Worker Resource Heartbeat Thread stopped.
> 2013-10-16 14:55:43,616 INFO rpc.NettyClientBase
> (NettyClientBase.java:close(87)) - Proxy is disconnected from 127.0.0.1:44234
> 2013-10-16 14:55:43,618 INFO rpc.NettyClientBase
> (NettyClientBase.java:close(87)) - Proxy is disconnected from 127.0.0.1:44236
> 2013-10-16 14:55:43,619 INFO rpc.NettyServerBase
> (NettyServerBase.java:shutdown(120)) - Rpc (TajoWorkerProtocol) listened on
> 0:0:0:0:0:0:0:0:44238) shutdown
> 2013-10-16 14:55:43,619 INFO worker.TajoWorkerManagerService
> (TajoWorkerManagerService.java:stop(109)) - TajoWorkerManagerService stopped
> 2013-10-16 14:55:43,619 INFO querymaster.QueryMaster
> (QueryMaster.java:run(346)) - QueryMaster heartbeat thread stopped
> 2013-10-16 14:55:43,620 INFO master.TajoAsyncDispatcher
> (TajoAsyncDispatcher.java:stop(122)) - AsyncDispatcher
> stopped:querymaster_1381902943098
> 2013-10-16 14:55:43,620 INFO querymaster.QueryMaster
> (QueryMaster.java:stop(151)) - QueryMaster stop
> 2013-10-16 14:55:43,620 INFO worker.TajoWorkerClientService
> (TajoWorkerClientService.java:stop(103)) - TajoWorkerClientService stopping
> 2013-10-16 14:55:43,621 INFO rpc.NettyServerBase
> (NettyServerBase.java:shutdown(120)) - Rpc (QueryMasterClientProtocol)
> listened on 0:0:0:0:0:0:0:0:44237) shutdown
> 2013-10-16 14:55:43,621 INFO worker.TajoWorkerClientService
> (TajoWorkerClientService.java:stop(107)) - TajoWorkerClientService stopped
> 2013-10-16 14:55:43,628 INFO worker.TajoWorker (TajoWorker.java:stop(211)) -
> TajoWorker main thread exiting
> 2013-10-16 14:55:43,628 WARN datanode.DirectoryScanner
> (DirectoryScanner.java:shutdown(289)) - DirectoryScanner: shutdown has been
> called
> 2013-10-16 14:55:43,745 WARN datanode.DataNode
> (BPServiceActor.java:offerService(575)) - BPOfferService for Block pool
> BP-1571979412-192.168.0.205-1381902941942 (storage id
> DS-977793291-192.168.0.205-60398-1381902942381) service to
> localhost/127.0.0.1:56994 interrupted
> 2013-10-16 14:55:43,745 WARN datanode.DataNode
> (BPServiceActor.java:run(685)) - Ending block pool service for: Block pool
> BP-1571979412-192.168.0.205-1381902941942 (storage id
> DS-977793291-192.168.0.205-60398-1381902942381) service to
> localhost/127.0.0.1:56994
> 2013-10-16 14:55:43,747 WARN datanode.DirectoryScanner
> (DirectoryScanner.java:shutdown(289)) - DirectoryScanner: shutdown has been
> called
> 2013-10-16 14:55:45,043 INFO BlockStateChange
> (BlockManager.java:computeReplicationWorkForBlocks(1308)) - BLOCK* ask
> 127.0.0.1:41909 to replicate blk_8623857380696071962_1002 to datanode(s)
> 127.0.0.1:60398 127.0.0.1:51428
> 2013-10-16 14:55:45,279 WARN datanode.DataNode (DataNode.java:run(1485)) -
> DatanodeRegistration(127.0.0.1,
> storageID=DS-1668021296-192.168.0.205-41909-1381902942201, infoPort=60807,
> ipcPort=52563,
> storageInfo=lv=-40;cid=testClusterID;nsid=1161821716;c=0):Failed to transfer
> BP-1571979412-192.168.0.205-1381902941942:blk_8623857380696071962_1002 to
> 127.0.0.1:60398 got
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:526)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:490)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:1423)
> at java.lang.Thread.run(Thread.java:724)
> 2013-10-16 14:55:45,361 WARN datanode.DataNode
> (BPServiceActor.java:run(685)) - Ending block pool service for: Block pool
> BP-1571979412-192.168.0.205-1381902941942 (storage id
> DS-617348020-192.168.0.205-51428-1381902942289) service to
> localhost/127.0.0.1:56994
> 2013-10-16 14:55:51,045 INFO BlockStateChange
> (BlockManager.java:computeReplicationWorkForBlocks(1308)) - BLOCK* ask
> 127.0.0.1:41909 to replicate blk_8623857380696071962_1002 to datanode(s)
> 127.0.0.1:51428 127.0.0.1:60398
> 2013-10-16 14:55:51,282 WARN datanode.DataNode
> (BlockReceiver.java:<init>(233)) - IOException in BlockReceiver constructor.
> Cause is
> 2013-10-16 14:55:51,282 ERROR datanode.DataNode (DataXceiver.java:run(223)) -
> 127.0.0.1:51428:DataXceiver error processing WRITE_BLOCK operation src:
> /127.0.0.1:46346 dest: /127.0.0.1:51428
> java.io.IOException: block pool BP-1571979412-192.168.0.205-1381902941942 is
> not found
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getBlockPoolSlice(FsVolumeImpl.java:122)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:150)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:818)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:91)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:155)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:393)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
> at java.lang.Thread.run(Thread.java:724)
> ^C2013-10-16 14:55:56,252 INFO worker.TajoWorker (TajoWorker.java:run(459))
> - ============================================
> 2013-10-16 14:55:56,252 INFO worker.TajoWorker (TajoWorker.java:run(460)) -
> TajoWorker received SIGINT Signal
> 2013-10-16 14:55:56,252 INFO worker.TajoWorker (TajoWorker.java:run(461)) -
> ============================================
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.1#6144)