[ https://issues.apache.org/jira/browse/HDFS-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432830#comment-15432830 ]
Kihwal Lee commented on HDFS-10340:
-----------------------------------
I checked the OOM killer code, and it sends SIGKILL, as you pointed out. It
might have used SIGTERM in ancient versions. A kill by the OOM killer wouldn't
have been caught by the syscall snooping anyway, as it does not involve any
system call. It sure looks like something else is sending SIGTERM to the
datanode process. I looked over the OpenJDK 8 source but couldn't find anything
that raises SIGTERM against itself to shut down. Whoever the sender is, you
should be able to catch it with the systemtap instrumentation.
We have had similar issues due to stale pid files, but that can't be it if no
service was (re)started at that time.
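To illustrate the stale pid file case: a stop script that reads a PID from an old pid file and signals it can hit an unrelated process if that PID has been reused. The following is a minimal sketch, not part of Hadoop, that checks whether the PID in a pid file still belongs to the expected process by looking at /proc/<pid>/cmdline on Linux. The function names, the `proc_root` parameter, and the match on a command-line substring are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional


def read_pid(pid_file: Path) -> Optional[int]:
    """Return the PID stored in pid_file, or None if missing or malformed."""
    try:
        return int(pid_file.read_text().strip())
    except (OSError, ValueError):
        return None


def pid_file_status(pid_file: Path, expected: str,
                    proc_root: Path = Path("/proc")) -> str:
    """Classify a pid file as 'missing', 'dead', 'stale', or 'ok'.

    'stale' means the PID is alive but its command line does not contain
    `expected`, i.e. the PID now belongs to some other process and a
    `kill` based on this pid file would hit the wrong target.
    """
    pid = read_pid(pid_file)
    if pid is None:
        return "missing"
    try:
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        raw = (proc_root / str(pid) / "cmdline").read_bytes()
    except OSError:
        # No readable /proc entry: treated here as "process not running".
        return "dead"
    cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
    return "ok" if expected in cmdline else "stale"
```

For a datanode one might call `pid_file_status(Path(pid_path), "org.apache.hadoop.hdfs.server.datanode.DataNode")`, where `pid_path` is wherever your deployment writes the datanode pid file, and assuming the main class name appears on the JVM command line. A result of "stale" for some other service's pid file that happens to point at the datanode PID would be consistent with the scenario above.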
bq. if user of DataNode is same with NodeManager, maybe it is related with
YARN-4459
Are you saying that your cluster is configured this way? If so, I agree
YARN-4459 is a good candidate. If not, we are back to square one. In any case,
the systemtap instrumentation should help identify the source of the signal.
> data node sudden killed
> ------------------------
>
> Key: HDFS-10340
> URL: https://issues.apache.org/jira/browse/HDFS-10340
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.6.0
> Environment: Ubuntu 16.04 LTS , RAM 16g , cpu core : 8 , hdd 100gb,
> hadoop 2.6.0
> Reporter: tu nguyen khac
> Priority: Critical
>
> I tried to set up a new data node using Ubuntu 16
> and have it join an existing Hadoop HDFS cluster (there are 10 nodes in
> this cluster and they all run on CentOS 6).
> But when I try to bootstrap this node, after about 10 or 20 minutes I get
> these strange errors:
> 2016-04-26 20:12:09,394 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /10.3.24.65:55323, dest: /10.3.24.197:50010, bytes: 79902, op: HDFS_WRITE,
> cliID: DFSClient_NONMAPREDUCE_1379996362_1, offset: 0, srvID:
> 225f5b43-1dd3-4ac6-88d2-1e8d27dba55b, blockid:
> BP-352432948-10.3.24.65-1433821675295:blk_1074038505_789832, duration:
> 15331628
> 2016-04-26 20:12:09,394 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> PacketResponder: BP-352432948-10.3.24.65-1433821675295:blk_1074038505_789832,
> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
> 2016-04-26 20:12:25,410 INFO
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification
> succeeded for BP-352432948-10.3.24.65-1433821675295:blk_1074038502_789829
> 2016-04-26 20:12:25,411 INFO
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification
> succeeded for BP-352432948-10.3.24.65-1433821675295:blk_1074038505_789832
> 2016-04-26 20:13:18,546 INFO
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
> Scheduling blk_1074038502_789829 file
> /data/hadoop_data/backup/data/current/BP-352432948-10.3.24.65-1433821675295/current/finalized/subdir4/subdir134/blk_1074038502
> for deletion
> 2016-04-26 20:13:18,562 INFO
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
> Deleted BP-352432948-10.3.24.65-1433821675295 blk_1074038502_789829 file
> /data/hadoop_data/backup/data/current/BP-352432948-10.3.24.65-1433821675295/current/finalized/subdir4/subdir134/blk_1074038502
> 2016-04-26 20:15:46,481 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM
> 2016-04-26 20:15:46,504 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down DataNode at bigdata-dw-24-197/10.3.24.197
> ************************************************************/