[jira] [Created] (HDFS-8715) Checkpoint node keeps throwing exception
Jiahongchao created HDFS-8715:
---------------------------------

             Summary: Checkpoint node keeps throwing exception
                 Key: HDFS-8715
                 URL: https://issues.apache.org/jira/browse/HDFS-8715
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.5.2
         Environment: centos 6.4, sun jdk 1.7
            Reporter: Jiahongchao

I tried to start a checkpoint node using bin/hdfs namenode -checkpoint, but it keeps printing:

15/07/03 23:16:22 ERROR namenode.FSNamesystem: Swallowing exception in NameNodeEditLogRoller:
java.lang.IllegalStateException: Bad state: BETWEEN_LOG_SEGMENTS
        at com.google.common.base.Preconditions.checkState(Preconditions.java:172)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.getCurSegmentTxId(FSEditLog.java:495)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$NameNodeEditLogRoller.run(FSNamesystem.java:4718)
        at java.lang.Thread.run(Thread.java:745)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HDFS-8714) Folder ModificationTime in Millis Changed When NameNode is restarted
[ https://issues.apache.org/jira/browse/HDFS-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613568#comment-14613568 ]

Chandan Biswas commented on HDFS-8714:
--------------------------------------

[~walter.k.su] I used the CDH4.7.1 package, which uses hadoop-2.0.0+1612.

Folder ModificationTime in Millis Changed When NameNode is restarted
--------------------------------------------------------------------

                 Key: HDFS-8714
                 URL: https://issues.apache.org/jira/browse/HDFS-8714
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Chandan Biswas

*Steps to Reproduce*
# Steps performed in the program
** Create a folder in HDFS
** Print the folder modificationTime in millis
** Upload or copy a file to this newly created folder
** Print the file and folder modificationTime in millis
** Restart the name node
** Print the file and folder modificationTime in millis
# Expected result
** The folder modification time should be the file modification time before the name node restart
** The folder modification time should not change after the name node restart
# Actual result
** The folder modification time is not the same as the file modification time
** The folder modification time changes after the name node restart, and it changes to the file modification time

*Impact of this behavior:* Before a task is launched, distributed cache files/folders are checked for any modification. The checks are done by comparing the file/folder modificationTime in millis. So any job that uses the distributed cache has a potential chance of failure if
# the name node restarts and running tasks are resubmitted, or
# e.g., among 100 tasks, 50 are queued to run and the name node then restarts.

Here is the sample code I used for testing:
{code}
// file creation in hdfs
final Path pathToFiles = new Path("/user/vagrant/chandan/test/");
fileSystem.mkdirs(pathToFiles);
System.out.println("HDFS Folder Modification Time in long Before file copy:"
    + fileSystem.getFileStatus(pathToFiles).getModificationTime());
FileUtil.copy(fileSystem, new Path("/user/cloudera/test"), fileSystem, pathToFiles, false, configuration);
System.out.println("HDFS File Modification Time in long:"
    + fileSystem.getFileStatus(new Path("/user/vagrant/chandan/test/test")).getModificationTime());
System.out.println("HDFS Folder Modification Time in long After file copy:"
    + fileSystem.getFileStatus(pathToFiles).getModificationTime());
for (int i = 0; i < 100; i++) {
    System.out.println("Normal HDFS Folder Modification Time in long:"
        + fileSystem.getFileStatus(pathToFiles).getModificationTime());
    System.out.println("Normal HDFS File Modification Time in long:"
        + fileSystem.getFileStatus(new Path("/user/vagrant/chandan/test/test")).getModificationTime());
    Thread.sleep(6 * 2);
}
{code}
Here is the output:
{code}
HDFS Folder Modification Time in long Before file copy:1435868217309
HDFS File Modification Time in long:1435868217368
HDFS Folder Modification Time in long After file copy:1435868217353
Normal HDFS Folder Modification Time in long:1435868217353
Normal HDFS File Modification Time in long:1435868217368
Normal HDFS Folder Modification Time in long:1435868217353
Normal HDFS File Modification Time in long:1435868217368
Normal HDFS Folder Modification Time in long:1435868217368
Normal HDFS File Modification Time in long:1435868217368
{code}
The last two lines are printed after the name node restart.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HDFS-8577) Avoid retrying to recover lease on a file which does not exist
[ https://issues.apache.org/jira/browse/HDFS-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612974#comment-14612974 ]

Vinayakumar B commented on HDFS-8577:
-------------------------------------

Latest patch LGTM. +1. Will commit shortly.

Avoid retrying to recover lease on a file which does not exist
--------------------------------------------------------------

                 Key: HDFS-8577
                 URL: https://issues.apache.org/jira/browse/HDFS-8577
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: J.Andreina
            Assignee: J.Andreina
         Attachments: HDFS-8577.1.patch, HDFS-8577.2.patch

1. Avoid retrying to recover the lease on a file which does not exist:
{noformat}
recoverLease got exception: java.io.FileNotFoundException: File does not exist: /hello_hi
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
Retrying in 5000 ms...
Retry #1
recoverLease got exception: java.io.FileNotFoundException: File does not exist: /hello_hi
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
{noformat}
2. Avoid printing a huge stack trace on the CLI for each retry of recovering the lease on a file.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
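The fix described in the issue can be sketched as a retry loop that bails out on FileNotFoundException instead of retrying. This is an illustrative sketch, not the committed patch: `LeaseRecoverer` is a hypothetical stand-in for the real recoverLease() call, and `MAX_RETRIES`/`RETRY_INTERVAL_MS` are assumed names.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Sketch only: stop retrying lease recovery as soon as the file turns
// out not to exist. All names here are hypothetical stand-ins for the
// real HDFS client code.
class LeaseRecoveryRetry {
    interface LeaseRecoverer {
        boolean recoverLease(String path) throws IOException;
    }

    static final int MAX_RETRIES = 3;
    static final long RETRY_INTERVAL_MS = 5000;

    static boolean recoverWithRetries(LeaseRecoverer dfs, String path)
            throws IOException, InterruptedException {
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            try {
                if (dfs.recoverLease(path)) {
                    return true;
                }
            } catch (FileNotFoundException fnfe) {
                // The file does not exist: retrying can never succeed, so
                // report once (message only, no stack trace) and stop.
                System.err.println("recoverLease: " + fnfe.getMessage());
                return false;
            }
            Thread.sleep(RETRY_INTERVAL_MS);
        }
        return false;
    }
}
```

The key design point is that FileNotFoundException is a permanent failure, unlike a lease still being held, so it is handled before the retry sleep rather than falling through to it.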
[jira] [Commented] (HDFS-8710) Always read DU value from the cached dfsUsed file on datanode startup
[ https://issues.apache.org/jira/browse/HDFS-8710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613014#comment-14613014 ]

Xinwei Qin commented on HDFS-8710:
----------------------------------

[~aw], thanks for your comment. The du value will be recalculated after 600 seconds, so we don't need to calculate a precise value on startup. Also, the current code cannot recalculate the du value after the disk structure has changed if the {{dfsUsed}} value is less than 600 seconds old. In a large cluster, du can take several or even tens of minutes, which slows down the startup of the whole cluster, so a quick startup is necessary. Maybe always skipping du is too radical; adding a quick-restart configuration for the datanode (default true, i.e. skip du) is more reasonable. When the disk structure has changed, the user can turn the quick-restart configuration off to trigger du to recalculate the dfsUsed value. Any thoughts?

Always read DU value from the cached dfsUsed file on datanode startup
---------------------------------------------------------------------

                 Key: HDFS-8710
                 URL: https://issues.apache.org/jira/browse/HDFS-8710
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Xinwei Qin
            Assignee: Xinwei Qin
         Attachments: HDFS-8710.001.patch

Currently, the DataNode periodically caches the DU value in the dfsUsed file. When the DataNode starts or restarts, it will read in the cached DU value from the dfsUsed file if the value is less than 600 seconds old; otherwise, it will run the du command, which is a very time-consuming operation (it may take up to dozens of minutes) when the DataNode has a huge number of blocks. Since slight imprecision of dfsUsed is not critical, and the DU value will be updated every 600 seconds (the default DU interval) after the DataNode has started, we can always read the DU value from the cached file (regardless of whether it is less than 600 seconds old) and skip the du operation on DataNode startup to significantly shorten the startup time.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
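The decision being debated above, namely whether a cached dfsUsed value may be reused on startup, boils down to a small staleness check plus the proposed quick-restart override. A minimal sketch of that logic, with hypothetical names (`isUsable`, `quickRestart`) that do not correspond to the actual Hadoop code:

```java
// Sketch: decide whether a cached dfsUsed value can be trusted on
// DataNode startup. Names are illustrative, not from Hadoop.
class CachedDfsUsed {
    // Mirrors the 600-second du interval described in the issue.
    static final long MAX_AGE_MS = 600_000L;

    /**
     * Current behavior: reuse the cache only if it is fresh enough.
     * Proposed quick-restart behavior: always reuse it, and let the
     * periodic du refresh correct any imprecision later.
     */
    static boolean isUsable(long cachedAtMs, long nowMs, boolean quickRestart) {
        if (quickRestart) {
            return true;
        }
        return nowMs - cachedAtMs < MAX_AGE_MS;
    }
}
```

Under this sketch, turning `quickRestart` off restores the conservative behavior Allen Wittenauer argues for (recomputing du when the disk structure may have changed), while leaving it on gives the fast startup the reporter wants.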
[jira] [Updated] (HDFS-8577) Avoid retrying to recover lease on a file which does not exist
[ https://issues.apache.org/jira/browse/HDFS-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinayakumar B updated HDFS-8577:
--------------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 2.8.0
           Status: Resolved  (was: Patch Available)

Committed to trunk and branch-2. Thanks for the contribution [~andreina].

Avoid retrying to recover lease on a file which does not exist
--------------------------------------------------------------

                 Key: HDFS-8577
                 URL: https://issues.apache.org/jira/browse/HDFS-8577
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: J.Andreina
            Assignee: J.Andreina
             Fix For: 2.8.0
         Attachments: HDFS-8577.1.patch, HDFS-8577.2.patch

1. Avoid retrying to recover the lease on a file which does not exist:
{noformat}
recoverLease got exception: java.io.FileNotFoundException: File does not exist: /hello_hi
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
Retrying in 5000 ms...
Retry #1
recoverLease got exception: java.io.FileNotFoundException: File does not exist: /hello_hi
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
{noformat}
2. Avoid printing a huge stack trace on the CLI for each retry of recovering the lease on a file.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HDFS-8260) Erasure Coding: test of writing EC file
[ https://issues.apache.org/jira/browse/HDFS-8260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612934#comment-14612934 ]

Xinwei Qin commented on HDFS-8260:
----------------------------------

Hi [~demongaorui], sorry, I have been busy with other work for the last several weeks. I will work on these jiras in the next several days and upload the patch ASAP.

Erasure Coding: test of writing EC file
---------------------------------------

                 Key: HDFS-8260
                 URL: https://issues.apache.org/jira/browse/HDFS-8260
             Project: Hadoop HDFS
          Issue Type: Test
    Affects Versions: HDFS-7285
            Reporter: GAO Rui
            Assignee: Xinwei Qin

1. Normally writing an EC file (writing without datanode failure).
2. Writing an EC file with a tolerable number of datanodes failing.
3. Writing an EC file with an intolerable number of datanodes failing.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HDFS-8711) setSpaceQuota command should print the available storage type when input storage type is wrong
[ https://issues.apache.org/jira/browse/HDFS-8711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Surendra Singh Lilhore updated HDFS-8711:
-----------------------------------------
    Attachment: HDFS-8711-01.patch

setSpaceQuota command should print the available storage type when input storage type is wrong
----------------------------------------------------------------------------------------------

                 Key: HDFS-8711
                 URL: https://issues.apache.org/jira/browse/HDFS-8711
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: hdfs-client
    Affects Versions: 2.7.0
            Reporter: Surendra Singh Lilhore
            Assignee: Surendra Singh Lilhore
         Attachments: HDFS-8711-01.patch, HDFS-8711.patch

If the input storage type is wrong, then currently *setSpaceQuota* gives an exception like this:
{code}
./hdfs dfsadmin -setSpaceQuota 1000 -storageType COLD /testDir
setSpaceQuota: No enum constant org.apache.hadoop.fs.StorageType.COLD
{code}
It should be:
{code}
setSpaceQuota: Storage type COLD not available. Available storage types are [SSD, DISK, ARCHIVE]
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
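The "No enum constant" message comes from Enum.valueOf(), so the friendlier message can be produced by catching the IllegalArgumentException it throws and listing the valid constants. A minimal sketch, using a stand-in StorageType enum rather than the real org.apache.hadoop.fs.StorageType, and a hypothetical `parseMessage` helper:

```java
import java.util.Arrays;

// Sketch: turn "No enum constant ..." into a message listing the
// available storage types. StorageType is a stand-in for
// org.apache.hadoop.fs.StorageType.
class StorageTypeParser {
    enum StorageType { SSD, DISK, ARCHIVE }

    static String parseMessage(String input) {
        try {
            StorageType.valueOf(input.toUpperCase());
            return "OK: " + input.toUpperCase();
        } catch (IllegalArgumentException e) {
            // Enum.values() preserves declaration order, so the list in
            // the message is stable.
            return "Storage type " + input + " not available. "
                + "Available storage types are "
                + Arrays.toString(StorageType.values());
        }
    }
}
```

For example, `parseMessage("COLD")` yields "Storage type COLD not available. Available storage types are [SSD, DISK, ARCHIVE]", matching the wording proposed in the issue.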
[jira] [Commented] (HDFS-8710) Always read DU value from the cached dfsUsed file on datanode startup
[ https://issues.apache.org/jira/browse/HDFS-8710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612876#comment-14612876 ]

Allen Wittenauer commented on HDFS-8710:
----------------------------------------

On startup is exactly when you want the du to be recalculated, because there is a good chance that the disk structure changed.

Always read DU value from the cached dfsUsed file on datanode startup
---------------------------------------------------------------------

                 Key: HDFS-8710
                 URL: https://issues.apache.org/jira/browse/HDFS-8710
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Xinwei Qin
            Assignee: Xinwei Qin
         Attachments: HDFS-8710.001.patch

Currently, the DataNode periodically caches the DU value in the dfsUsed file. When the DataNode starts or restarts, it will read in the cached DU value from the dfsUsed file if the value is less than 600 seconds old; otherwise, it will run the du command, which is a very time-consuming operation (it may take up to dozens of minutes) when the DataNode has a huge number of blocks. Since slight imprecision of dfsUsed is not critical, and the DU value will be updated every 600 seconds (the default DU interval) after the DataNode has started, we can always read the DU value from the cached file (regardless of whether it is less than 600 seconds old) and skip the du operation on DataNode startup to significantly shorten the startup time.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HDFS-8711) setSpaceQuota command should print the available storage type when input storage type is wrong
[ https://issues.apache.org/jira/browse/HDFS-8711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612894#comment-14612894 ]

Surendra Singh Lilhore commented on HDFS-8711:
----------------------------------------------

Thanks [~xyao] for reviewing. Attached a new patch with a unit test. Please review.

setSpaceQuota command should print the available storage type when input storage type is wrong
----------------------------------------------------------------------------------------------

                 Key: HDFS-8711
                 URL: https://issues.apache.org/jira/browse/HDFS-8711
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: hdfs-client
    Affects Versions: 2.7.0
            Reporter: Surendra Singh Lilhore
            Assignee: Surendra Singh Lilhore
         Attachments: HDFS-8711-01.patch, HDFS-8711.patch

If the input storage type is wrong, then currently *setSpaceQuota* gives an exception like this:
{code}
./hdfs dfsadmin -setSpaceQuota 1000 -storageType COLD /testDir
setSpaceQuota: No enum constant org.apache.hadoop.fs.StorageType.COLD
{code}
It should be:
{code}
setSpaceQuota: Storage type COLD not available. Available storage types are [SSD, DISK, ARCHIVE]
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HDFS-8260) Erasure Coding: system test of writing EC file
[ https://issues.apache.org/jira/browse/HDFS-8260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

GAO Rui updated HDFS-8260:
--------------------------
    Summary: Erasure Coding: system test of writing EC file  (was: Erasure Coding: test of writing EC file)

Erasure Coding: system test of writing EC file
----------------------------------------------

                 Key: HDFS-8260
                 URL: https://issues.apache.org/jira/browse/HDFS-8260
             Project: Hadoop HDFS
          Issue Type: Test
    Affects Versions: HDFS-7285
            Reporter: GAO Rui
            Assignee: Xinwei Qin

1. Normally writing an EC file (writing without datanode failure).
2. Writing an EC file with a tolerable number of datanodes failing.
3. Writing an EC file with an intolerable number of datanodes failing.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HDFS-8266) Erasure Coding: System Test of snapshot with EC files
[ https://issues.apache.org/jira/browse/HDFS-8266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

GAO Rui updated HDFS-8266:
--------------------------
    Summary: Erasure Coding: System Test of snapshot with EC files  (was: Erasure Coding: Test of snapshot with EC files)

Erasure Coding: System Test of snapshot with EC files
-----------------------------------------------------

                 Key: HDFS-8266
                 URL: https://issues.apache.org/jira/browse/HDFS-8266
             Project: Hadoop HDFS
          Issue Type: Test
    Affects Versions: HDFS-7285
            Reporter: GAO Rui
            Assignee: Rakesh R
         Attachments: HDFS-8266-HDFS-7285-00.patch, HDFS-8266-HDFS-7285-01.patch, HDFS-8266-HDFS-7285-01.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HDFS-8267) Erasure Coding: System Test of Namenode with EC files
[ https://issues.apache.org/jira/browse/HDFS-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

GAO Rui updated HDFS-8267:
--------------------------
    Summary: Erasure Coding: System Test of Namenode with EC files  (was: Erasure Coding: Test of Namenode with EC files)

Erasure Coding: System Test of Namenode with EC files
-----------------------------------------------------

                 Key: HDFS-8267
                 URL: https://issues.apache.org/jira/browse/HDFS-8267
             Project: Hadoop HDFS
          Issue Type: Test
    Affects Versions: HDFS-7285
            Reporter: GAO Rui
            Assignee: Rakesh R
              Labels: EC, test

1. Namenode startup with EC:
1.1. Safemode
1.2. BlockReport
2. Namenode HA with EC:
2.1. Fsimage and editlog test
2.2. Hot restart and recovery of the Active NameNode after failure
2.3. Hot restart and recovery of the Standby NameNode after failure
2.4. Restart and recovery when both the Active and Standby NameNode fail at the same time

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HDFS-8264) Erasure Coding: System Test of version update
[ https://issues.apache.org/jira/browse/HDFS-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

GAO Rui updated HDFS-8264:
--------------------------
    Summary: Erasure Coding: System Test of version update  (was: Erasure Coding: Test of version update)

Erasure Coding: System Test of version update
---------------------------------------------

                 Key: HDFS-8264
                 URL: https://issues.apache.org/jira/browse/HDFS-8264
             Project: Hadoop HDFS
          Issue Type: Test
    Affects Versions: HDFS-7285
            Reporter: GAO Rui
              Labels: EC, test

When implementing version update, fsimage, fseditlog, and conflicting Block IDs should be taken care of. This jira tests these issues during the version update process.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HDFS-8265) Erasure Coding: System Test of Quota calculation for EC files
[ https://issues.apache.org/jira/browse/HDFS-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

GAO Rui updated HDFS-8265:
--------------------------
    Summary: Erasure Coding: System Test of Quota calculation for EC files  (was: Erasure Coding: Test of Quota calculation for EC files)

Erasure Coding: System Test of Quota calculation for EC files
-------------------------------------------------------------

                 Key: HDFS-8265
                 URL: https://issues.apache.org/jira/browse/HDFS-8265
             Project: Hadoop HDFS
          Issue Type: Test
    Affects Versions: HDFS-7285
            Reporter: GAO Rui
            Assignee: Rakesh R

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HDFS-8262) Erasure Coding: System Test of datanode decommission which EC blocks are stored
[ https://issues.apache.org/jira/browse/HDFS-8262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

GAO Rui updated HDFS-8262:
--------------------------
    Summary: Erasure Coding: System Test of datanode decommission which EC blocks are stored  (was: Erasure Coding: Test of datanode decommission which EC blocks are stored)

Erasure Coding: System Test of datanode decommission which EC blocks are stored
-------------------------------------------------------------------------------

                 Key: HDFS-8262
                 URL: https://issues.apache.org/jira/browse/HDFS-8262
             Project: Hadoop HDFS
          Issue Type: Test
            Reporter: GAO Rui
            Assignee: Xinwei Qin

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HDFS-8263) Erasure Coding: System Test of fsck for EC files
[ https://issues.apache.org/jira/browse/HDFS-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

GAO Rui updated HDFS-8263:
--------------------------
    Summary: Erasure Coding: System Test of fsck for EC files  (was: Erasure Coding: Test of fsck for EC files)

Erasure Coding: System Test of fsck for EC files
------------------------------------------------

                 Key: HDFS-8263
                 URL: https://issues.apache.org/jira/browse/HDFS-8263
             Project: Hadoop HDFS
          Issue Type: Test
    Affects Versions: HDFS-7285
            Reporter: GAO Rui

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HDFS-8260) Erasure Coding: system test of writing EC file
[ https://issues.apache.org/jira/browse/HDFS-8260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613047#comment-14613047 ]

GAO Rui commented on HDFS-8260:
-------------------------------

Hi [~xinwei], thank you very much. Please be aware that these jiras focus on system tests, not unit tests.

Erasure Coding: system test of writing EC file
----------------------------------------------

                 Key: HDFS-8260
                 URL: https://issues.apache.org/jira/browse/HDFS-8260
             Project: Hadoop HDFS
          Issue Type: Test
    Affects Versions: HDFS-7285
            Reporter: GAO Rui
            Assignee: Xinwei Qin

1. Normally writing an EC file (writing without datanode failure).
2. Writing an EC file with a tolerable number of datanodes failing.
3. Writing an EC file with an intolerable number of datanodes failing.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HDFS-8261) Erasure Coding: System Test of EC file reconstruction
[ https://issues.apache.org/jira/browse/HDFS-8261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

GAO Rui updated HDFS-8261:
--------------------------
    Summary: Erasure Coding: System Test of EC file reconstruction  (was: Erasure Coding: Test of EC file reconstruction)

Erasure Coding: System Test of EC file reconstruction
-----------------------------------------------------

                 Key: HDFS-8261
                 URL: https://issues.apache.org/jira/browse/HDFS-8261
             Project: Hadoop HDFS
          Issue Type: Test
            Reporter: GAO Rui

1. One datanode failure (one block of the blockGroup corrupted)
2. Two datanodes failure (two blocks of the blockGroup corrupted)
3. Three datanodes failure (three blocks of the blockGroup corrupted)
4. Four datanodes failure (four blocks of the blockGroup corrupted)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HDFS-8259) Erasure Coding: System Test of reading EC file
[ https://issues.apache.org/jira/browse/HDFS-8259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

GAO Rui updated HDFS-8259:
--------------------------
    Summary: Erasure Coding: System Test of reading EC file  (was: Erasure Coding: Test of reading EC file)

Erasure Coding: System Test of reading EC file
----------------------------------------------

                 Key: HDFS-8259
                 URL: https://issues.apache.org/jira/browse/HDFS-8259
             Project: Hadoop HDFS
          Issue Type: Test
    Affects Versions: HDFS-7285
            Reporter: GAO Rui
            Assignee: Xinwei Qin

1. Normally reading an EC file (reading without datanode failure and no need of recovery).
2. Reading an EC file with datanode failure.
3. Reading an EC file with data block recovery by decoding from parity blocks.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HDFS-8660) Slow write to packet mirror should log which mirror and which block
[ https://issues.apache.org/jira/browse/HDFS-8660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613063#comment-14613063 ]

Harsh J commented on HDFS-8660:
-------------------------------

This would be an excellent improvement for certain performance troubleshooting. In looking for more such Slow messages, the following matches may also need similar changes:

The Slow ReadProcessor message in DataStreamer.java can benefit from a block ID.

The Slow waitForAckedSeqno message in DataStreamer.java too could benefit from a block ID as well as a nodes list.

Just the Block ID can also be added into the below messages under BlockReceiver.java:
Slow flushOrSync
Slow BlockReceiver write data to disk
Slow manageWriterOsCache

The DN mirror host and Block ID can both be added into the below message under BlockReceiver.java:
Slow PacketResponder send ack to upstream took

Could you check if these are possible to do as part of the same JIRA as simple changes too?

Slow write to packet mirror should log which mirror and which block
-------------------------------------------------------------------

                 Key: HDFS-8660
                 URL: https://issues.apache.org/jira/browse/HDFS-8660
             Project: Hadoop HDFS
          Issue Type: Improvement
    Affects Versions: 2.7.0
            Reporter: Hazem Mahmoud
            Assignee: Hazem Mahmoud

Currently, the log format states something similar to:
Slow BlockReceiver write packet to mirror took 468ms (threshold=300ms)
For troubleshooting purposes, it would be good to have it mention which block ID it's writing as well as the mirror (DN) that it's writing it to.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
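An enriched warning of the kind requested above might be built as follows; the exact format, field names, and the `slowMirrorMessage` helper are illustrative assumptions, not the committed change:

```java
// Sketch: include the downstream mirror address and the block ID in the
// slow-write warning. The message format and all names here are
// hypothetical, for illustration only.
class SlowWriteLog {
    static String slowMirrorMessage(long tookMs, long thresholdMs,
                                    String mirrorAddr, String blockId) {
        // %d accepts long arguments; the original message kept the
        // took/threshold fields, and we append mirror and block.
        return String.format(
            "Slow BlockReceiver write packet to mirror took %dms "
            + "(threshold=%dms), mirror: %s, block: %s",
            tookMs, thresholdMs, mirrorAddr, blockId);
    }
}
```

With the example values from the issue, `slowMirrorMessage(468, 300, "10.0.0.12:50010", "blk_1001")` yields "Slow BlockReceiver write packet to mirror took 468ms (threshold=300ms), mirror: 10.0.0.12:50010, block: blk_1001", which keeps the original message greppable while adding the two fields needed for troubleshooting.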