[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-5042: Attachment: HDFS-5042-branch-2.8-addendum.patch Here is the addendum patch to move fsync() out of lock. [~kihwal], If this is fine to you, can create another Jira to push this if required. > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Assignee: Vinayakumar B >Priority: Critical > Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2 > > Attachments: HDFS-5042-01.patch, HDFS-5042-02.patch, > HDFS-5042-03.patch, HDFS-5042-04.patch, HDFS-5042-05-branch-2.patch, > HDFS-5042-05.patch, HDFS-5042-branch-2-01.patch, HDFS-5042-branch-2-05.patch, > HDFS-5042-branch-2.7-05.patch, HDFS-5042-branch-2.7-06.patch, > HDFS-5042-branch-2.8-05.patch, HDFS-5042-branch-2.8-06.patch, > HDFS-5042-branch-2.8-addendum.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of >
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kihwal Lee updated HDFS-5042: - Fix Version/s: 2.8.2 2.7.4 > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Assignee: Vinayakumar B >Priority: Critical > Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2 > > Attachments: HDFS-5042-01.patch, HDFS-5042-02.patch, > HDFS-5042-03.patch, HDFS-5042-04.patch, HDFS-5042-05-branch-2.patch, > HDFS-5042-05.patch, HDFS-5042-branch-2-01.patch, HDFS-5042-branch-2-05.patch, > HDFS-5042-branch-2.7-05.patch, HDFS-5042-branch-2.7-06.patch, > HDFS-5042-branch-2.8-05.patch, HDFS-5042-branch-2.8-06.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m > --- CRASH, RESTART - > NN 2013-06-29
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-5042: Attachment: HDFS-5042-branch-2.7-06.patch Updated branch-2.7 patch as well. > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Assignee: Vinayakumar B >Priority: Critical > Fix For: 2.9.0, 3.0.0-alpha4 > > Attachments: HDFS-5042-01.patch, HDFS-5042-02.patch, > HDFS-5042-03.patch, HDFS-5042-04.patch, HDFS-5042-05-branch-2.patch, > HDFS-5042-05.patch, HDFS-5042-branch-2-01.patch, HDFS-5042-branch-2-05.patch, > HDFS-5042-branch-2.7-05.patch, HDFS-5042-branch-2.7-06.patch, > HDFS-5042-branch-2.8-05.patch, HDFS-5042-branch-2.8-06.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m > --- CRASH, RESTART - >
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-5042: Attachment: HDFS-5042-branch-2.8-06.patch Oops. thats the wrong patch I attached. Here is the updated one for branch-2.8 > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Assignee: Vinayakumar B >Priority: Critical > Fix For: 2.9.0, 3.0.0-alpha4 > > Attachments: HDFS-5042-01.patch, HDFS-5042-02.patch, > HDFS-5042-03.patch, HDFS-5042-04.patch, HDFS-5042-05-branch-2.patch, > HDFS-5042-05.patch, HDFS-5042-branch-2-01.patch, HDFS-5042-branch-2-05.patch, > HDFS-5042-branch-2.7-05.patch, HDFS-5042-branch-2.8-05.patch, > HDFS-5042-branch-2.8-06.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m > --- CRASH,
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-5042: Attachment: HDFS-5042-branch-2.7-05.patch Attaching branch-2.7 patch as well. > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Assignee: Vinayakumar B >Priority: Critical > Fix For: 2.9.0, 3.0.0-alpha4 > > Attachments: HDFS-5042-01.patch, HDFS-5042-02.patch, > HDFS-5042-03.patch, HDFS-5042-04.patch, HDFS-5042-05-branch-2.patch, > HDFS-5042-05.patch, HDFS-5042-branch-2-01.patch, HDFS-5042-branch-2-05.patch, > HDFS-5042-branch-2.7-05.patch, HDFS-5042-branch-2.8-05.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m > --- CRASH, RESTART - > NN 2013-06-29 12:01:19,743 INFO
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-5042: Attachment: HDFS-5042-branch-2.8-05.patch Attaching the branch-2.8 patch > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Assignee: Vinayakumar B >Priority: Critical > Fix For: 2.9.0, 3.0.0-alpha4 > > Attachments: HDFS-5042-01.patch, HDFS-5042-02.patch, > HDFS-5042-03.patch, HDFS-5042-04.patch, HDFS-5042-05-branch-2.patch, > HDFS-5042-05.patch, HDFS-5042-branch-2-01.patch, HDFS-5042-branch-2-05.patch, > HDFS-5042-branch-2.8-05.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m > --- CRASH, RESTART - > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* >
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kihwal Lee updated HDFS-5042: - Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 3.0.0-alpha4 2.9.0 Status: Resolved (was: Patch Available) I've committed this to trunk and branch-2. branch-2.8 does not have a separate FileIoProvider, so it will need a different patch. I am resolving this for now. > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Assignee: Vinayakumar B >Priority: Critical > Fix For: 2.9.0, 3.0.0-alpha4 > > Attachments: HDFS-5042-01.patch, HDFS-5042-02.patch, > HDFS-5042-03.patch, HDFS-5042-04.patch, HDFS-5042-05-branch-2.patch, > HDFS-5042-05.patch, HDFS-5042-branch-2-01.patch, HDFS-5042-branch-2-05.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into >
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Chen updated HDFS-5042: Summary: Completed files lost after power failure (was: utuvjndvrcdtgddbdvj) > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Assignee: Vinayakumar B >Priority: Critical > Attachments: HDFS-5042-01.patch, HDFS-5042-02.patch, > HDFS-5042-03.patch, HDFS-5042-04.patch, HDFS-5042-05-branch-2.patch, > HDFS-5042-05.patch, HDFS-5042-branch-2-01.patch, HDFS-5042-branch-2-05.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m > --- CRASH, RESTART - > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: addStoredBlock request received for > blk_1395839728632046111_357084589 on
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-5042: Attachment: HDFS-5042-branch-2-05.patch Re-attaching the branch-2 patch with name changed. > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Assignee: Vinayakumar B >Priority: Critical > Attachments: HDFS-5042-01.patch, HDFS-5042-02.patch, > HDFS-5042-03.patch, HDFS-5042-04.patch, HDFS-5042-05-branch-2.patch, > HDFS-5042-05.patch, HDFS-5042-branch-2-01.patch, HDFS-5042-branch-2-05.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m > --- CRASH, RESTART - > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: addStoredBlock request received for >
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-5042: Attachment: HDFS-5042-05-branch-2.patch Attaching branch-2 patch > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Assignee: Vinayakumar B >Priority: Critical > Attachments: HDFS-5042-01.patch, HDFS-5042-02.patch, > HDFS-5042-03.patch, HDFS-5042-04.patch, HDFS-5042-05-branch-2.patch, > HDFS-5042-05.patch, HDFS-5042-branch-2-01.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m > --- CRASH, RESTART - > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: addStoredBlock request received for > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-5042: Attachment: HDFS-5042-05.patch Fixed the test failure. > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Assignee: Vinayakumar B >Priority: Critical > Attachments: HDFS-5042-01.patch, HDFS-5042-02.patch, > HDFS-5042-03.patch, HDFS-5042-04.patch, HDFS-5042-05.patch, > HDFS-5042-branch-2-01.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m > --- CRASH, RESTART - > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: addStoredBlock request received for > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was > rejected: Reported as block being
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-5042: Attachment: HDFS-5042-04.patch Updated the patch > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Assignee: Vinayakumar B >Priority: Critical > Attachments: HDFS-5042-01.patch, HDFS-5042-02.patch, > HDFS-5042-03.patch, HDFS-5042-04.patch, HDFS-5042-branch-2-01.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m > --- CRASH, RESTART - > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: addStoredBlock request received for > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was > rejected: Reported as block being written but is a block of
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-5042: Attachment: HDFS-5042-03.patch updated patch > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Priority: Critical > Attachments: HDFS-5042-01.patch, HDFS-5042-02.patch, > HDFS-5042-03.patch, HDFS-5042-branch-2-01.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m > --- CRASH, RESTART - > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: addStoredBlock request received for > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was > rejected: Reported as block being written but is a block of closed file. > NN 2013-06-29 12:01:19,743 INFO
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-5042: Attachment: HDFS-5042-02.patch Updated the patch to fix javadoc and checkstyle > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Priority: Critical > Attachments: HDFS-5042-01.patch, HDFS-5042-02.patch, > HDFS-5042-branch-2-01.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m > --- CRASH, RESTART - > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: addStoredBlock request received for > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was > rejected: Reported as block being written but is a block of closed file. > NN 2013-06-29 12:01:19,743 INFO
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-5042: Status: Patch Available (was: Open) > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Priority: Critical > Attachments: HDFS-5042-01.patch, HDFS-5042-branch-2-01.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m > --- CRASH, RESTART - > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: addStoredBlock request received for > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was > rejected: Reported as block being written but is a block of closed file. > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* >
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-5042: Attachment: HDFS-5042-branch-2-01.patch attached the branch-2 patch > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Priority: Critical > Attachments: HDFS-5042-01.patch, HDFS-5042-branch-2-01.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m > --- CRASH, RESTART - > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: addStoredBlock request received for > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was > rejected: Reported as block being written but is a block of closed file. > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-5042: Attachment: HDFS-5042-01.patch Attached the proposed changes for trunk, > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Priority: Critical > Attachments: HDFS-5042-01.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. > blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c > RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: > Completed major compaction of 7 file(s) in n of > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m > --- CRASH, RESTART - > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: addStoredBlock request received for > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was > rejected: Reported as block being written but is a block of closed file. > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* >
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Latham updated HDFS-5042: -- Labels: data (was: ) Completed files lost after power failure Key: HDFS-5042 URL: https://issues.apache.org/jira/browse/HDFS-5042 Project: Hadoop HDFS Issue Type: Bug Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) Reporter: Dave Latham Priority: Critical We suffered a cluster wide power failure after which HDFS lost data that it had acknowledged as closed and complete. The client was HBase which compacted a set of HFiles into a new HFile, then after closing the file successfully, deleted the previous versions of the file. The cluster then lost power, and when brought back up the newly created file was marked CORRUPT. Based on reading the logs it looks like the replicas were created by the DataNodes in the 'blocksBeingWritten' directory. Then when the file was closed they were moved to the 'current' directory. After the power cycle those replicas were again in the blocksBeingWritten directory of the underlying file system (ext3). When those DataNodes reported in to the NameNode it deleted those replicas and lost the file. Some possible fixes could be having the DataNode fsync the directory(s) after moving the block from blocksBeingWritten to current to ensure the rename is durable or having the NameNode accept replicas from blocksBeingWritten under certain circumstances. Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): {noformat} RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: Creating file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c with permission=rwxrwxrwx NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. blk_1395839728632046111_357084589 DN 2013-06-29 11:16:06,832 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /10.0.5.237:50010 NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to blk_1395839728632046111_357084589 size 25418340 DN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 DN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_1395839728632046111_357084589 terminating NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing lease on file /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c from client DFSClient_hb_rs_hs745,60020,1372470111932 NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c is closed by DFSClient_hb_rs_hs745,60020,1372470111932 RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c to hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: Completed major compaction of 7 file(s) in n of users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m --- CRASH, RESTART - NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was rejected: Reported as block being written but is a block of closed file. NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.6.1:50010 NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
[jira] [Updated] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Latham updated HDFS-5042: -- Labels: (was: data) Completed files lost after power failure Key: HDFS-5042 URL: https://issues.apache.org/jira/browse/HDFS-5042 Project: Hadoop HDFS Issue Type: Bug Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) Reporter: Dave Latham Priority: Critical We suffered a cluster wide power failure after which HDFS lost data that it had acknowledged as closed and complete. The client was HBase which compacted a set of HFiles into a new HFile, then after closing the file successfully, deleted the previous versions of the file. The cluster then lost power, and when brought back up the newly created file was marked CORRUPT. Based on reading the logs it looks like the replicas were created by the DataNodes in the 'blocksBeingWritten' directory. Then when the file was closed they were moved to the 'current' directory. After the power cycle those replicas were again in the blocksBeingWritten directory of the underlying file system (ext3). When those DataNodes reported in to the NameNode it deleted those replicas and lost the file. Some possible fixes could be having the DataNode fsync the directory(s) after moving the block from blocksBeingWritten to current to ensure the rename is durable or having the NameNode accept replicas from blocksBeingWritten under certain circumstances. Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): {noformat} RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: Creating file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c with permission=rwxrwxrwx NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. blk_1395839728632046111_357084589 DN 2013-06-29 11:16:06,832 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /10.0.5.237:50010 NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to blk_1395839728632046111_357084589 size 25418340 DN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 DN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_1395839728632046111_357084589 terminating NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing lease on file /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c from client DFSClient_hb_rs_hs745,60020,1372470111932 NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c is closed by DFSClient_hb_rs_hs745,60020,1372470111932 RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c to hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: Completed major compaction of 7 file(s) in n of users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m --- CRASH, RESTART - NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was rejected: Reported as block being written but is a block of closed file. NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.6.1:50010 NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*