[jira] [Comment Edited] (HDFS-15217) Add more information to longest write/read lock held log
[ https://issues.apache.org/jira/browse/HDFS-15217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246225#comment-17246225 ] Toshihiro Suzuki edited comment on HDFS-15217 at 12/9/20, 1:07 AM: --- Sounds good. Thank you [~ahussein]. was (Author: brfrn169): Sound good. Thank you [~ahussein]. > Add more information to longest write/read lock held log > > > Key: HDFS-15217 > URL: https://issues.apache.org/jira/browse/HDFS-15217 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Fix For: 3.4.0 > > > Currently, we can see the stack trace in the longest write/read lock held > log, but sometimes we need more information, for example, a target path of > deletion: > {code:java} > 2020-03-10 21:51:21,116 [main] INFO namenode.FSNamesystem > (FSNamesystemLock.java:writeUnlock(276)) - Number of suppressed > write-lock reports: 0 > Longest write-lock held at 2020-03-10 21:51:21,107+0900 for 6ms via > java.lang.Thread.getStackTrace(Thread.java:1559) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:257) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:233) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1706) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3188) > ... > {code} > Adding more information (opName, path, etc.) to the log is useful to > troubleshoot. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15217) Add more information to longest write/read lock held log
[ https://issues.apache.org/jira/browse/HDFS-15217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246225#comment-17246225 ] Toshihiro Suzuki commented on HDFS-15217: - Sound good. Thank you [~ahussein]. > Add more information to longest write/read lock held log > > > Key: HDFS-15217 > URL: https://issues.apache.org/jira/browse/HDFS-15217 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Fix For: 3.4.0 > > > Currently, we can see the stack trace in the longest write/read lock held > log, but sometimes we need more information, for example, a target path of > deletion: > {code:java} > 2020-03-10 21:51:21,116 [main] INFO namenode.FSNamesystem > (FSNamesystemLock.java:writeUnlock(276)) - Number of suppressed > write-lock reports: 0 > Longest write-lock held at 2020-03-10 21:51:21,107+0900 for 6ms via > java.lang.Thread.getStackTrace(Thread.java:1559) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:257) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:233) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1706) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3188) > ... > {code} > Adding more information (opName, path, etc.) to the log is useful to > troubleshoot. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15217) Add more information to longest write/read lock held log
[ https://issues.apache.org/jira/browse/HDFS-15217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245749#comment-17245749 ] Toshihiro Suzuki commented on HDFS-15217: - {quote} The ops information can be retrieved from the AuditLog. Isn't the AuditLog enough to see the ops? Was there a concern regarding a possible deadLock? Then, why not using debug instead of adding that overhead to hot production code? {quote} I don't think the AuditLog is enough to see the ops when there is a huge number of ops. As mentioned in the Description, I faced an issue where the write lock was held for a long time, but I was not able to identify which operation caused it from the NN log or the AuditLog. That's why I thought that we needed this change and added more information to the longest write/read lock held log. > Add more information to longest write/read lock held log > > > Key: HDFS-15217 > URL: https://issues.apache.org/jira/browse/HDFS-15217 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Fix For: 3.4.0 > > > Currently, we can see the stack trace in the longest write/read lock held > log, but sometimes we need more information, for example, a target path of > deletion: > {code:java} > 2020-03-10 21:51:21,116 [main] INFO namenode.FSNamesystem > (FSNamesystemLock.java:writeUnlock(276)) - Number of suppressed > write-lock reports: 0 > Longest write-lock held at 2020-03-10 21:51:21,107+0900 for 6ms via > java.lang.Thread.getStackTrace(Thread.java:1559) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:257) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:233) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1706) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3188) > ... > {code} > Adding more information (opName, path, etc.) to the log is useful to > troubleshoot. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15217) Add more information to longest write/read lock held log
[ https://issues.apache.org/jira/browse/HDFS-15217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245740#comment-17245740 ] Toshihiro Suzuki commented on HDFS-15217: - Thank you [~ahussein]. Are we sure the slowdown is caused by the changes in this Jira? > Add more information to longest write/read lock held log > > > Key: HDFS-15217 > URL: https://issues.apache.org/jira/browse/HDFS-15217 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Fix For: 3.4.0 > > > Currently, we can see the stack trace in the longest write/read lock held > log, but sometimes we need more information, for example, a target path of > deletion: > {code:java} > 2020-03-10 21:51:21,116 [main] INFO namenode.FSNamesystem > (FSNamesystemLock.java:writeUnlock(276)) - Number of suppressed > write-lock reports: 0 > Longest write-lock held at 2020-03-10 21:51:21,107+0900 for 6ms via > java.lang.Thread.getStackTrace(Thread.java:1559) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:257) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:233) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1706) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3188) > ... > {code} > Adding more information (opName, path, etc.) to the log is useful to > troubleshoot. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15386) ReplicaNotFoundException keeps happening in DN after removing multiple DN's data directories
[ https://issues.apache.org/jira/browse/HDFS-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128753#comment-17128753 ] Toshihiro Suzuki commented on HDFS-15386: - [~sodonnell] I created a PR for branch-2.10 but it looks like QA is broken. https://github.com/apache/hadoop/pull/2054 Can you please help me? > ReplicaNotFoundException keeps happening in DN after removing multiple DN's > data directories > > > Key: HDFS-15386 > URL: https://issues.apache.org/jira/browse/HDFS-15386 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Fix For: 3.0.4, 3.2.2, 3.3.1, 3.4.0, 3.1.5 > > > When removing volumes, we need to invalidate all the blocks in the volumes. > In the following code (FsDatasetImpl), we keep the blocks that will be > invalidate in *blkToInvalidate* map. However as the key of the map is *bpid* > (Block Pool ID), it will be overwritten by other removed volumes. As a > result, the map will have only the blocks of the last volume we are removing, > and invalidate only them: > {code:java} > for (String bpid : volumeMap.getBlockPoolList()) { > List blocks = new ArrayList<>(); > for (Iterator it = > volumeMap.replicas(bpid).iterator(); it.hasNext();) { > ReplicaInfo block = it.next(); > final StorageLocation blockStorageLocation = > block.getVolume().getStorageLocation(); > LOG.trace("checking for block " + block.getBlockId() + > " with storageLocation " + blockStorageLocation); > if (blockStorageLocation.equals(sdLocation)) { > blocks.add(block); > it.remove(); > } > } > blkToInvalidate.put(bpid, blocks); > } > {code} > [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15386) ReplicaNotFoundException keeps happening in DN after removing multiple DN's data directories
[ https://issues.apache.org/jira/browse/HDFS-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126647#comment-17126647 ] Toshihiro Suzuki commented on HDFS-15386: - [~sodonnell] Thank you for merging the PR to trunk! For branch 2, which branch should I create a PR for? Thanks. > ReplicaNotFoundException keeps happening in DN after removing multiple DN's > data directories > > > Key: HDFS-15386 > URL: https://issues.apache.org/jira/browse/HDFS-15386 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > When removing volumes, we need to invalidate all the blocks in the volumes. > In the following code (FsDatasetImpl), we keep the blocks that will be > invalidate in *blkToInvalidate* map. However as the key of the map is *bpid* > (Block Pool ID), it will be overwritten by other removed volumes. As a > result, the map will have only the blocks of the last volume we are removing, > and invalidate only them: > {code:java} > for (String bpid : volumeMap.getBlockPoolList()) { > List blocks = new ArrayList<>(); > for (Iterator it = > volumeMap.replicas(bpid).iterator(); it.hasNext();) { > ReplicaInfo block = it.next(); > final StorageLocation blockStorageLocation = > block.getVolume().getStorageLocation(); > LOG.trace("checking for block " + block.getBlockId() + > " with storageLocation " + blockStorageLocation); > if (blockStorageLocation.equals(sdLocation)) { > blocks.add(block); > it.remove(); > } > } > blkToInvalidate.put(bpid, blocks); > } > {code} > [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15386) ReplicaNotFoundException keeps happening in DN after removing multiple DN's data directories
[ https://issues.apache.org/jira/browse/HDFS-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toshihiro Suzuki updated HDFS-15386: Description: When removing volumes, we need to invalidate all the blocks in the volumes. In the following code (FsDatasetImpl), we keep the blocks that will be invalidate in *blkToInvalidate* map. However as the key of the map is *bpid* (Block Pool ID), it will be overwritten by other removed volumes. As a result, the map will have only the blocks of the last volume we are removing, and invalidate only them: {code:java} for (String bpid : volumeMap.getBlockPoolList()) { List blocks = new ArrayList<>(); for (Iterator it = volumeMap.replicas(bpid).iterator(); it.hasNext();) { ReplicaInfo block = it.next(); final StorageLocation blockStorageLocation = block.getVolume().getStorageLocation(); LOG.trace("checking for block " + block.getBlockId() + " with storageLocation " + blockStorageLocation); if (blockStorageLocation.equals(sdLocation)) { blocks.add(block); it.remove(); } } blkToInvalidate.put(bpid, blocks); } {code} [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595] was: When removing volumes, we need to invalidate all the blocks in the volumes. In the following code (FsDatasetImpl), we keep the blocks that will be invalidate in *blkToInvalidate* map. However as the key of the map is *bpid* (Block Pool ID), it will be overwritten by other removed volumes. As a result, the map will have only the blocks of the last volume, and invalidate only them: {code:java} for (String bpid : volumeMap.getBlockPoolList()) { List blocks = new ArrayList<>(); for (Iterator it = volumeMap.replicas(bpid).iterator(); it.hasNext();) { ReplicaInfo block = it.next(); final StorageLocation blockStorageLocation = block.getVolume().getStorageLocation(); LOG.trace("checking for block " + block.getBlockId() + " with storageLocation " + blockStorageLocation); if (blockStorageLocation.equals(sdLocation)) { blocks.add(block); it.remove(); } } blkToInvalidate.put(bpid, blocks); } {code} [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595] > ReplicaNotFoundException keeps happening in DN after removing multiple DN's > data directories > > > Key: HDFS-15386 > URL: https://issues.apache.org/jira/browse/HDFS-15386 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > When removing volumes, we need to invalidate all the blocks in the volumes. > In the following code (FsDatasetImpl), we keep the blocks that will be > invalidate in *blkToInvalidate* map. However as the key of the map is *bpid* > (Block Pool ID), it will be overwritten by other removed volumes. 
As a > result, the map will have only the blocks of the last volume we are removing, > and invalidate only them: > {code:java} > for (String bpid : volumeMap.getBlockPoolList()) { > List blocks = new ArrayList<>(); > for (Iterator it = > volumeMap.replicas(bpid).iterator(); it.hasNext();) { > ReplicaInfo block = it.next(); > final StorageLocation blockStorageLocation = > block.getVolume().getStorageLocation(); > LOG.trace("checking for block " + block.getBlockId() + > " with storageLocation " + blockStorageLocation); > if (blockStorageLocation.equals(sdLocation)) { > blocks.add(block); > it.remove(); > } > } > blkToInvalidate.put(bpid, blocks); > } > {code} > [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work started] (HDFS-15386) ReplicaNotFoundException keeps happening in DN after removing multiple DN's data directories
[ https://issues.apache.org/jira/browse/HDFS-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-15386 started by Toshihiro Suzuki. --- > ReplicaNotFoundException keeps happening in DN after removing multiple DN's > data directories > > > Key: HDFS-15386 > URL: https://issues.apache.org/jira/browse/HDFS-15386 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > When removing volumes, we need to invalidate all the blocks in the volumes. > In the following code (FsDatasetImpl), we keep the blocks that will be > invalidate in *blkToInvalidate* map. > However as the key of the map is *bpid* (Block Pool ID), it will be > overwritten by other removed volumes. As a result, the map will have only the > blocks of the last volume, and invalidate only them: > {code:java} > for (String bpid : volumeMap.getBlockPoolList()) { > List blocks = new ArrayList<>(); > for (Iterator it = > volumeMap.replicas(bpid).iterator(); it.hasNext();) { > ReplicaInfo block = it.next(); > final StorageLocation blockStorageLocation = > block.getVolume().getStorageLocation(); > LOG.trace("checking for block " + block.getBlockId() + > " with storageLocation " + blockStorageLocation); > if (blockStorageLocation.equals(sdLocation)) { > blocks.add(block); > it.remove(); > } > } > blkToInvalidate.put(bpid, blocks); > } > {code} > [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15386) ReplicaNotFoundException keeps happening in DN after removing multiple DN's data directories
[ https://issues.apache.org/jira/browse/HDFS-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toshihiro Suzuki updated HDFS-15386: Description: When removing volumes, we need to invalidate all the blocks in the volumes. In the following code (FsDatasetImpl), we keep the blocks that will be invalidate in *blkToInvalidate* map. However as the key of the map is *bpid* (Block Pool ID), it will be overwritten by other removed volumes. As a result, the map will have only the blocks of the last volume, and invalidate only them: {code:java} for (String bpid : volumeMap.getBlockPoolList()) { List blocks = new ArrayList<>(); for (Iterator it = volumeMap.replicas(bpid).iterator(); it.hasNext();) { ReplicaInfo block = it.next(); final StorageLocation blockStorageLocation = block.getVolume().getStorageLocation(); LOG.trace("checking for block " + block.getBlockId() + " with storageLocation " + blockStorageLocation); if (blockStorageLocation.equals(sdLocation)) { blocks.add(block); it.remove(); } } blkToInvalidate.put(bpid, blocks); } {code} [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595] was: When removing volumes, we need to invalidate all the blocks in the volumes. In the following code (FsDatasetImpl), we keep the blocks that will be invalidate in *blkToInvalidate* map. However as the key of the map is *bpid* (Block Pool ID), it will be overwritten by other removed volumes. As a result, the map will have only the blocks of the last volume, and invalidate only them: {code:java} for (String bpid : volumeMap.getBlockPoolList()) { List blocks = new ArrayList<>(); for (Iterator it = volumeMap.replicas(bpid).iterator(); it.hasNext();) { ReplicaInfo block = it.next(); final StorageLocation blockStorageLocation = block.getVolume().getStorageLocation(); LOG.trace("checking for block " + block.getBlockId() + " with storageLocation " + blockStorageLocation); if (blockStorageLocation.equals(sdLocation)) { blocks.add(block); it.remove(); } } blkToInvalidate.put(bpid, blocks); } {code} [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595] > ReplicaNotFoundException keeps happening in DN after removing multiple DN's > data directories > > > Key: HDFS-15386 > URL: https://issues.apache.org/jira/browse/HDFS-15386 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > When removing volumes, we need to invalidate all the blocks in the volumes. > In the following code (FsDatasetImpl), we keep the blocks that will be > invalidate in *blkToInvalidate* map. However as the key of the map is *bpid* > (Block Pool ID), it will be overwritten by other removed volumes. 
As a > result, the map will have only the blocks of the last volume, and invalidate > only them: > {code:java} > for (String bpid : volumeMap.getBlockPoolList()) { > List blocks = new ArrayList<>(); > for (Iterator it = > volumeMap.replicas(bpid).iterator(); it.hasNext();) { > ReplicaInfo block = it.next(); > final StorageLocation blockStorageLocation = > block.getVolume().getStorageLocation(); > LOG.trace("checking for block " + block.getBlockId() + > " with storageLocation " + blockStorageLocation); > if (blockStorageLocation.equals(sdLocation)) { > blocks.add(block); > it.remove(); > } > } > blkToInvalidate.put(bpid, blocks); > } > {code} > [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15386) ReplicaNotFoundException keeps happening in DN after removing multiple DN's data directories
[ https://issues.apache.org/jira/browse/HDFS-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toshihiro Suzuki updated HDFS-15386: Summary: ReplicaNotFoundException keeps happening in DN after removing multiple DN's data directories (was: ReplicaNotFoundException keeps happening in DN after removing multiple DN's data direcotries) > ReplicaNotFoundException keeps happening in DN after removing multiple DN's > data directories > > > Key: HDFS-15386 > URL: https://issues.apache.org/jira/browse/HDFS-15386 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > When removing volumes, we need to invalidate all the blocks in the volumes. > In the following code (FsDatasetImpl), we keep the blocks that will be > invalidate in *blkToInvalidate* map. > However as the key of the map is *bpid* (Block Pool ID), it will be > overwritten by other removed volumes. As a result, the map will have only the > blocks of the last volume, and invalidate only them: > {code:java} > for (String bpid : volumeMap.getBlockPoolList()) { > List blocks = new ArrayList<>(); > for (Iterator it = > volumeMap.replicas(bpid).iterator(); it.hasNext();) { > ReplicaInfo block = it.next(); > final StorageLocation blockStorageLocation = > block.getVolume().getStorageLocation(); > LOG.trace("checking for block " + block.getBlockId() + > " with storageLocation " + blockStorageLocation); > if (blockStorageLocation.equals(sdLocation)) { > blocks.add(block); > it.remove(); > } > } > blkToInvalidate.put(bpid, blocks); > } > {code} > [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15386) ReplicaNotFoundException keeps happening in DN after removing multiple DN's data direcotries
Toshihiro Suzuki created HDFS-15386: --- Summary: ReplicaNotFoundException keeps happening in DN after removing multiple DN's data direcotries Key: HDFS-15386 URL: https://issues.apache.org/jira/browse/HDFS-15386 Project: Hadoop HDFS Issue Type: Bug Reporter: Toshihiro Suzuki Assignee: Toshihiro Suzuki When removing volumes, we need to invalidate all the blocks in the volumes. In the following code (FsDatasetImpl), we keep the blocks that will be invalidate in *blkToInvalidate* map. However as the key of the map is *bpid* (Block Pool ID), it will be overwritten by other removed volumes. As a result, the map will have only the blocks of the last volume, and invalidate only them:
{code:java}
for (String bpid : volumeMap.getBlockPoolList()) {
  List<ReplicaInfo> blocks = new ArrayList<>();
  for (Iterator<ReplicaInfo> it =
        volumeMap.replicas(bpid).iterator(); it.hasNext();) {
    ReplicaInfo block = it.next();
    final StorageLocation blockStorageLocation =
        block.getVolume().getStorageLocation();
    LOG.trace("checking for block " + block.getBlockId() +
        " with storageLocation " + blockStorageLocation);
    if (blockStorageLocation.equals(sdLocation)) {
      blocks.add(block);
      it.remove();
    }
  }
  blkToInvalidate.put(bpid, blocks);
}
{code}
[https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
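One way to avoid the overwrite described in this report is to accumulate the blocks of every removed volume into a single per-bpid list instead of replacing the map entry on each volume iteration. The sketch below reuses the variable names from the snippet above (blkToInvalidate, volumeMap, sdLocation) purely for illustration; it is a sketch of the idea, not the committed patch.
{code:java}
// Sketch only: append to any list already collected for this block pool so that
// removing multiple volumes does not drop the earlier volumes' blocks.
for (String bpid : volumeMap.getBlockPoolList()) {
  List<ReplicaInfo> blocks =
      blkToInvalidate.computeIfAbsent(bpid, k -> new ArrayList<>());
  for (Iterator<ReplicaInfo> it =
        volumeMap.replicas(bpid).iterator(); it.hasNext();) {
    ReplicaInfo block = it.next();
    if (block.getVolume().getStorageLocation().equals(sdLocation)) {
      blocks.add(block);  // accumulate instead of overwriting via put()
      it.remove();
    }
  }
}
{code}
With this shape, each removed volume adds its replicas to the existing per-bpid list, so the later invalidation pass sees the blocks of all removed volumes rather than only those of the last one.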
[jira] [Commented] (HDFS-15298) Fix the findbugs warnings introduced in HDFS-15217
[ https://issues.apache.org/jira/browse/HDFS-15298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093355#comment-17093355 ] Toshihiro Suzuki commented on HDFS-15298: - Thank you for reviewing and committing this! [~iwasakims] [~inigoiri] > Fix the findbugs warnings introduced in HDFS-15217 > -- > > Key: HDFS-15298 > URL: https://issues.apache.org/jira/browse/HDFS-15298 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Fix For: 3.4.0 > > > We need to fix the findbugs warnings introduced in HDFS-15217: > https://builds.apache.org/job/hadoop-multibranch/job/PR-1954/5/artifact/out/new-findbugs-hadoop-hdfs-project_hadoop-hdfs.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15298) Fix the findbugs warnings introduced in HDFS-15217
[ https://issues.apache.org/jira/browse/HDFS-15298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toshihiro Suzuki updated HDFS-15298: Description: We need to fix the findbugs warnings introduced in HDFS-15217: https://builds.apache.org/job/hadoop-multibranch/job/PR-1954/5/artifact/out/new-findbugs-hadoop-hdfs-project_hadoop-hdfs.html was: We need to fIx the findbugs warnings introduced in HDFS-15217: https://builds.apache.org/job/hadoop-multibranch/job/PR-1954/5/artifact/out/new-findbugs-hadoop-hdfs-project_hadoop-hdfs.html > Fix the findbugs warnings introduced in HDFS-15217 > -- > > Key: HDFS-15298 > URL: https://issues.apache.org/jira/browse/HDFS-15298 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > We need to fix the findbugs warnings introduced in HDFS-15217: > https://builds.apache.org/job/hadoop-multibranch/job/PR-1954/5/artifact/out/new-findbugs-hadoop-hdfs-project_hadoop-hdfs.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15298) Fix the findbugs warnings introduced in HDFS-15217
[ https://issues.apache.org/jira/browse/HDFS-15298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toshihiro Suzuki updated HDFS-15298: Summary: Fix the findbugs warnings introduced in HDFS-15217 (was: FIx the findbugs warnings introduced in HDFS-15217) > Fix the findbugs warnings introduced in HDFS-15217 > -- > > Key: HDFS-15298 > URL: https://issues.apache.org/jira/browse/HDFS-15298 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > We need to fIx the findbugs warnings introduced in HDFS-15217: > https://builds.apache.org/job/hadoop-multibranch/job/PR-1954/5/artifact/out/new-findbugs-hadoop-hdfs-project_hadoop-hdfs.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15217) Add more information to longest write/read lock held log
[ https://issues.apache.org/jira/browse/HDFS-15217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091055#comment-17091055 ] Toshihiro Suzuki commented on HDFS-15217: - Filed: https://issues.apache.org/jira/browse/HDFS-15298 > Add more information to longest write/read lock held log > > > Key: HDFS-15217 > URL: https://issues.apache.org/jira/browse/HDFS-15217 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Fix For: 3.4.0 > > > Currently, we can see the stack trace in the longest write/read lock held > log, but sometimes we need more information, for example, a target path of > deletion: > {code:java} > 2020-03-10 21:51:21,116 [main] INFO namenode.FSNamesystem > (FSNamesystemLock.java:writeUnlock(276)) - Number of suppressed > write-lock reports: 0 > Longest write-lock held at 2020-03-10 21:51:21,107+0900 for 6ms via > java.lang.Thread.getStackTrace(Thread.java:1559) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:257) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:233) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1706) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3188) > ... > {code} > Adding more information (opName, path, etc.) to the log is useful to > troubleshoot. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15298) FIx the findbugs warnings introduced in HDFS-15217
Toshihiro Suzuki created HDFS-15298: --- Summary: FIx the findbugs warnings introduced in HDFS-15217 Key: HDFS-15298 URL: https://issues.apache.org/jira/browse/HDFS-15298 Project: Hadoop HDFS Issue Type: Bug Reporter: Toshihiro Suzuki Assignee: Toshihiro Suzuki We need to fIx the findbugs warnings introduced in HDFS-15217: https://builds.apache.org/job/hadoop-multibranch/job/PR-1954/5/artifact/out/new-findbugs-hadoop-hdfs-project_hadoop-hdfs.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15217) Add more information to longest write/read lock held log
[ https://issues.apache.org/jira/browse/HDFS-15217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091054#comment-17091054 ] Toshihiro Suzuki commented on HDFS-15217: - [~ayushtkn] I checked the Report again, and yes, the changes in this Jira introduced the findbugs warnings. Sorry I misunderstood the warnings. Will create a new JIRA to address the warnings. Thank you for pointing it out. > Add more information to longest write/read lock held log > > > Key: HDFS-15217 > URL: https://issues.apache.org/jira/browse/HDFS-15217 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Fix For: 3.4.0 > > > Currently, we can see the stack trace in the longest write/read lock held > log, but sometimes we need more information, for example, a target path of > deletion: > {code:java} > 2020-03-10 21:51:21,116 [main] INFO namenode.FSNamesystem > (FSNamesystemLock.java:writeUnlock(276)) - Number of suppressed > write-lock reports: 0 > Longest write-lock held at 2020-03-10 21:51:21,107+0900 for 6ms via > java.lang.Thread.getStackTrace(Thread.java:1559) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:257) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:233) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1706) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3188) > ... > {code} > Adding more information (opName, path, etc.) to the log is useful to > troubleshoot. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15217) Add more information to longest write/read lock held log
[ https://issues.apache.org/jira/browse/HDFS-15217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090933#comment-17090933 ] Toshihiro Suzuki commented on HDFS-15217: - [~ayushtkn] [~elgoiri] I was aware of the findbugs warnings actually but I don't think they were introduced in the changes in this Jira. To fix the findbugs warnings, we need to change the original code. > Add more information to longest write/read lock held log > > > Key: HDFS-15217 > URL: https://issues.apache.org/jira/browse/HDFS-15217 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Fix For: 3.4.0 > > > Currently, we can see the stack trace in the longest write/read lock held > log, but sometimes we need more information, for example, a target path of > deletion: > {code:java} > 2020-03-10 21:51:21,116 [main] INFO namenode.FSNamesystem > (FSNamesystemLock.java:writeUnlock(276)) - Number of suppressed > write-lock reports: 0 > Longest write-lock held at 2020-03-10 21:51:21,107+0900 for 6ms via > java.lang.Thread.getStackTrace(Thread.java:1559) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:257) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:233) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1706) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3188) > ... > {code} > Adding more information (opName, path, etc.) to the log is useful to > troubleshoot. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15217) Add more information to longest write/read lock held log
[ https://issues.apache.org/jira/browse/HDFS-15217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081309#comment-17081309 ] Toshihiro Suzuki edited comment on HDFS-15217 at 4/14/20, 1:10 AM: --- I created a PR for this. After applying this patch, we can see additional information in the lock report message as follows: {code:java} 2020-04-11 23:04:36,020 [IPC Server handler 5 on default port 62641] INFO namenode.FSNamesystem (FSNamesystemLock.java:writeUnlock(321)) - Number of suppressed write-lock reports: 0 Longest write-lock held at 2020-04-11 23:04:36,020+0900 for 3ms by delete (ugi=bob (auth:SIMPLE),ip=/127.0.0.1,src=/file,dst=null,perm=null) via java.lang.Thread.getStackTrace(Thread.java:1559) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:302) org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:261) org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1746) org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3274) org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1130) org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:724) org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529) org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1016) org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:944) java.security.AccessController.doPrivileged(Native Method) javax.security.auth.Subject.doAs(Subject.java:422) org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845) org.apache.hadoop.ipc.Server$Handler.run(Server.java:2948) Total suppressed write-lock held time: 0.0 {code} This patch adds the additional information *"by delete (ugi=bob (auth:SIMPLE),ip=/127.0.0.1,src=/file,dst=null,perm=null)"* which is similar to the audit log format. was (Author: brfrn169): I created a PR for this. 
After this patch, we can see additional information in the lock report message as follows: {code:java} 2020-04-11 23:04:36,020 [IPC Server handler 5 on default port 62641] INFO namenode.FSNamesystem (FSNamesystemLock.java:writeUnlock(321)) - Number of suppressed write-lock reports: 0 Longest write-lock held at 2020-04-11 23:04:36,020+0900 for 3ms by delete (ugi=bob (auth:SIMPLE),ip=/127.0.0.1,src=/file,dst=null,perm=null) via java.lang.Thread.getStackTrace(Thread.java:1559) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:302) org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:261) org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1746) org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3274) org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1130) org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:724) org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529) org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1016) org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:944) java.security.AccessController.doPrivileged(Native Method) javax.security.auth.Subject.doAs(Subject.java:422) org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845) org.apache.hadoop.ipc.Server$Handler.run(Server.java:2948) Total suppressed write-lock held time: 0.0 {code} This patch adds the additional information *"by delete (ugi=bob (auth:SIMPLE),ip=/127.0.0.1,src=/file,dst=null,perm=null)"* which is similar to the audit log format. > Add more information to longest write/read lock held log > > > Key: HDFS-15217 > URL: https://issues.apache.org/jira/browse/HDFS-15217 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > Currently, we can see the stack trace in the longest write/read lock held > log, but
[jira] [Commented] (HDFS-15217) Add more information to longest write/read lock held log
[ https://issues.apache.org/jira/browse/HDFS-15217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081309#comment-17081309 ] Toshihiro Suzuki commented on HDFS-15217: - I created a PR for this. After this patch, we can see additional information in the lock report message as follows: {code:java} 2020-04-11 23:04:36,020 [IPC Server handler 5 on default port 62641] INFO namenode.FSNamesystem (FSNamesystemLock.java:writeUnlock(321)) - Number of suppressed write-lock reports: 0 Longest write-lock held at 2020-04-11 23:04:36,020+0900 for 3ms by delete (ugi=bob (auth:SIMPLE),ip=/127.0.0.1,src=/file,dst=null,perm=null) via java.lang.Thread.getStackTrace(Thread.java:1559) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:302) org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:261) org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1746) org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3274) org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1130) org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:724) org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529) org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1016) org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:944) java.security.AccessController.doPrivileged(Native Method) javax.security.auth.Subject.doAs(Subject.java:422) org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845) org.apache.hadoop.ipc.Server$Handler.run(Server.java:2948) Total suppressed write-lock held time: 0.0 {code} This patch adds the additional information *"by delete (ugi=bob (auth:SIMPLE),ip=/127.0.0.1,src=/file,dst=null,perm=null)"* which is similar to the audit log format. > Add more information to longest write/read lock held log > > > Key: HDFS-15217 > URL: https://issues.apache.org/jira/browse/HDFS-15217 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > Currently, we can see the stack trace in the longest write/read lock held > log, but sometimes we need more information, for example, a target path of > deletion: > {code:java} > 2020-03-10 21:51:21,116 [main] INFO namenode.FSNamesystem > (FSNamesystemLock.java:writeUnlock(276)) - Number of suppressed > write-lock reports: 0 > Longest write-lock held at 2020-03-10 21:51:21,107+0900 for 6ms via > java.lang.Thread.getStackTrace(Thread.java:1559) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:257) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:233) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1706) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3188) > ... > {code} > Adding more information (opName, path, etc.) to the log is useful to > troubleshoot. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
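To make the mechanism behind this report concrete, the sketch below shows one way a write-lock wrapper can carry the operation name and target path of the current holder, so that the "Longest write-lock held" message can include them. All class, method, and field names here are illustrative stand-ins, not the actual FSNamesystemLock API.
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch (not the real FSNamesystemLock): the unlock path records
// which operation held the write lock longest, so the report can print the op
// name and path together with the held time.
public class InstrumentedWriteLock {
  private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock(true);
  private final long reportThresholdMs;

  // Longest hold observed so far; only touched while the write lock is held.
  private long longestHoldMs;
  private String longestHoldContext;

  private final ThreadLocal<Long> acquiredAtNanos = new ThreadLocal<>();

  public InstrumentedWriteLock(long reportThresholdMs) {
    this.reportThresholdMs = reportThresholdMs;
  }

  public void writeLock() {
    rwLock.writeLock().lock();
    acquiredAtNanos.set(System.nanoTime());
  }

  public void writeUnlock(String opName, String path) {
    long heldMs = (System.nanoTime() - acquiredAtNanos.get()) / 1_000_000;
    String reportMsg = null;
    if (heldMs > longestHoldMs) {
      longestHoldMs = heldMs;
      longestHoldContext = opName + " (src=" + path + ")";
    }
    if (heldMs >= reportThresholdMs) {
      // The real message also carries ugi, caller ip and the stack trace.
      reportMsg = "Longest write-lock held for " + longestHoldMs + "ms by "
          + longestHoldContext;
    }
    rwLock.writeLock().unlock();
    if (reportMsg != null) {
      System.out.println(reportMsg);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    InstrumentedWriteLock lock = new InstrumentedWriteLock(5);
    lock.writeLock();
    Thread.sleep(6);               // simulate a slow delete holding the lock
    lock.writeUnlock("delete", "/file");
  }
}
{code}
Threading the opName/path down to the unlock call is what lets the report above show the audit-log-style context such as "by delete (ugi=bob ..., src=/file)".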
[jira] [Commented] (HDFS-15249) ThrottledAsyncChecker is not thread-safe.
[ https://issues.apache.org/jira/browse/HDFS-15249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076935#comment-17076935 ] Toshihiro Suzuki commented on HDFS-15249: - Thank you for reviewing and committing this! [~aajisaka] [~weichiu] > ThrottledAsyncChecker is not thread-safe. > - > > Key: HDFS-15249 > URL: https://issues.apache.org/jira/browse/HDFS-15249 > Project: Hadoop HDFS > Issue Type: Bug > Components: federation >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Fix For: 3.4.0 > > > ThrottledAsyncChecker should be thread-safe because it can be used by > multiple threads when we have multiple namespaces. > *checksInProgress* and *completedChecks* are respectively HashMap and > WeakHashMap which are not thread-safe. So we need to put them in synchronized > block whenever we access them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15249) ThrottledAsyncChecker is not thread-safe.
[ https://issues.apache.org/jira/browse/HDFS-15249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076924#comment-17076924 ] Toshihiro Suzuki commented on HDFS-15249: - [~weichiu] How do we proceed with this? > ThrottledAsyncChecker is not thread-safe. > - > > Key: HDFS-15249 > URL: https://issues.apache.org/jira/browse/HDFS-15249 > Project: Hadoop HDFS > Issue Type: Bug > Components: federation >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > ThrottledAsyncChecker should be thread-safe because it can be used by > multiple threads when we have multiple namespaces. > *checksInProgress* and *completedChecks* are respectively HashMap and > WeakHashMap which are not thread-safe. So we need to put them in synchronized > block whenever we access them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15249) ThrottledAsyncChecker is not thread-safe.
[ https://issues.apache.org/jira/browse/HDFS-15249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071456#comment-17071456 ] Toshihiro Suzuki commented on HDFS-15249: - Currently, we already have a synchronized block for *checksInProgress* and *completedChecks* in addResultCachingCallback(), but don't have in schedule(): https://github.com/apache/hadoop/blob/80b877a72f52ef0f4acafe15db55b8ed61fbe6d2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/checker/ThrottledAsyncChecker.java#L170-L174 https://github.com/apache/hadoop/blob/80b877a72f52ef0f4acafe15db55b8ed61fbe6d2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/checker/ThrottledAsyncChecker.java#L179-L183 I thought we also needed a synchronized block in schedule() and filed this Jira. > ThrottledAsyncChecker is not thread-safe. > - > > Key: HDFS-15249 > URL: https://issues.apache.org/jira/browse/HDFS-15249 > Project: Hadoop HDFS > Issue Type: Bug > Components: federation >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > ThrottledAsyncChecker should be thread-safe because it can be used by > multiple threads when we have multiple namespaces. > *checksInProgress* and *completedChecks* are respectively HashMap and > WeakHashMap which are not thread-safe. So we need to put them in synchronized > block whenever we access them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
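For readers following along, the pattern being discussed is roughly the one sketched below: every read or write of the two non-thread-safe maps is wrapped in a synchronized block, in schedule() as well as in the result callback. The names and structure are simplified stand-ins; this is not the actual ThrottledAsyncChecker code.
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Simplified sketch of the synchronization pattern being proposed: both the
// "already running?" test and the cached-result lookup are guarded by the same
// monitor that the completion path uses.
class ThrottledCheckerSketch<T> {
  private final long minMsBetweenChecks;
  private final Map<T, Long> checksInProgress = new HashMap<>();
  private final Map<T, Long> completedChecks = new WeakHashMap<>();

  ThrottledCheckerSketch(long minMsBetweenChecks) {
    this.minMsBetweenChecks = minMsBetweenChecks;
  }

  /** Returns true if a new check should be started for this target. */
  boolean schedule(T target) {
    long now = System.currentTimeMillis();
    synchronized (this) {
      if (checksInProgress.containsKey(target)) {
        return false;                       // a check is already running
      }
      Long lastCompleted = completedChecks.get(target);
      if (lastCompleted != null && now - lastCompleted < minMsBetweenChecks) {
        return false;                       // throttled: recent result exists
      }
      checksInProgress.put(target, now);
      return true;
    }
  }

  /** Called when the asynchronous check finishes. */
  void completed(T target) {
    synchronized (this) {
      checksInProgress.remove(target);
      completedChecks.put(target, System.currentTimeMillis());
    }
  }
}
{code}
Guarding only one of the two call sites would still let two block pool threads interleave between the containsKey() check and the subsequent get()/put(), which is exactly the race that can show up with federation.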
[jira] [Comment Edited] (HDFS-15249) ThrottledAsyncChecker is not thread-safe.
[ https://issues.apache.org/jira/browse/HDFS-15249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071455#comment-17071455 ] Toshihiro Suzuki edited comment on HDFS-15249 at 3/31/20, 4:37 AM: --- [~weichiu] Thank you for taking a look at this. Actually, I didn't see any Exceptions from this issue. When I checked the source code for another issue, I found this issue. Thanks. was (Author: brfrn169): [~weichiu] Thank you for taking look at this. Actually, I didn't see any Exceptions from this issue. When I checked the source code for another issue, I found this issue. Thanks. > ThrottledAsyncChecker is not thread-safe. > - > > Key: HDFS-15249 > URL: https://issues.apache.org/jira/browse/HDFS-15249 > Project: Hadoop HDFS > Issue Type: Bug > Components: federation >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > ThrottledAsyncChecker should be thread-safe because it can be used by > multiple threads when we have multiple namespaces. > *checksInProgress* and *completedChecks* are respectively HashMap and > WeakHashMap which are not thread-safe. So we need to put them in synchronized > block whenever we access them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15249) ThrottledAsyncChecker is not thread-safe.
[ https://issues.apache.org/jira/browse/HDFS-15249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071455#comment-17071455 ] Toshihiro Suzuki commented on HDFS-15249: - [~weichiu] Thank you for taking look at this. Actually, I didn't see any Exceptions from this issue. When I checked the source code for another issue, I found this issue. Thanks. > ThrottledAsyncChecker is not thread-safe. > - > > Key: HDFS-15249 > URL: https://issues.apache.org/jira/browse/HDFS-15249 > Project: Hadoop HDFS > Issue Type: Bug > Components: federation >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > ThrottledAsyncChecker should be thread-safe because it can be used by > multiple threads when we have multiple namespaces. > *checksInProgress* and *completedChecks* are respectively HashMap and > WeakHashMap which are not thread-safe. So we need to put them in synchronized > block whenever we access them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15249) ThrottledAsyncChecker is not thread-safe.
[ https://issues.apache.org/jira/browse/HDFS-15249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070784#comment-17070784 ] Toshihiro Suzuki commented on HDFS-15249: - CC: [~arp] [~weichiu] > ThrottledAsyncChecker is not thread-safe. > - > > Key: HDFS-15249 > URL: https://issues.apache.org/jira/browse/HDFS-15249 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > ThrottledAsyncChecker should be thread-safe because it can be used by > multiple threads when we have multiple namespaces. > *checksInProgress* and *completedChecks* are respectively HashMap and > WeakHashMap which are not thread-safe. So we need to put them in synchronized > block whenever we access them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15249) ThrottledAsyncChecker is not thread-safe.
[ https://issues.apache.org/jira/browse/HDFS-15249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toshihiro Suzuki updated HDFS-15249: Description: ThrottledAsyncChecker should be thread-safe because it can be used by multiple threads when we have multiple namespaces. *checksInProgress* and *completedChecks* are respectively HashMap and WeakHashMap which are not thread-safe. So we need to put them in synchronized block whenever we access them. was: ThrottledAsyncChecker should be thread-safe because it can be used by multiple threads when we have multiple namespaces. *checksInProgress* and *completedChecks* are respectively HashMap and WeakHashMap which are not tread-safe. So we need to put them in synchronized block whenever we access them. > ThrottledAsyncChecker is not thread-safe. > - > > Key: HDFS-15249 > URL: https://issues.apache.org/jira/browse/HDFS-15249 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > ThrottledAsyncChecker should be thread-safe because it can be used by > multiple threads when we have multiple namespaces. > *checksInProgress* and *completedChecks* are respectively HashMap and > WeakHashMap which are not thread-safe. So we need to put them in synchronized > block whenever we access them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15249) ThrottledAsyncChecker is not thread-safe.
Toshihiro Suzuki created HDFS-15249: --- Summary: ThrottledAsyncChecker is not thread-safe. Key: HDFS-15249 URL: https://issues.apache.org/jira/browse/HDFS-15249 Project: Hadoop HDFS Issue Type: Bug Reporter: Toshihiro Suzuki Assignee: Toshihiro Suzuki ThrottledAsyncChecker should be thread-safe because it can be used by multiple threads when we have multiple namespaces. *checksInProgress* and *completedChecks* are respectively HashMap and WeakHashMap which are not tread-safe. So we need to put them in synchronized block whenever we access them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-14503) ThrottledAsyncChecker throws NPE during block pool initialization
[ https://issues.apache.org/jira/browse/HDFS-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toshihiro Suzuki resolved HDFS-14503. - Resolution: Duplicate > ThrottledAsyncChecker throws NPE during block pool initialization > -- > > Key: HDFS-14503 > URL: https://issues.apache.org/jira/browse/HDFS-14503 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Yiqun Lin >Priority: Major > > ThrottledAsyncChecker throws NPE during block pool initialization. The error > leads the block pool registration failure. > The exception > {noformat} > 2019-05-20 01:02:36,003 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > Unexpected exception in block pool Block pool (Datanode Uuid > x) service to xx.xx.xx.xx/xx.xx.xx.xx > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker$LastCheckResult.access$000(ThrottledAsyncChecker.java:211) > at > org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker.schedule(ThrottledAsyncChecker.java:129) > at > org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker.checkAllVolumes(DatasetVolumeChecker.java:209) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3387) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1508) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:319) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:272) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:768) > at java.lang.Thread.run(Thread.java:745) > {noformat} > Looks like this error due to {{WeakHashMap}} type map {{completedChecks}} has > removed the target entry while we still get that entry. Although we have done > a check before we get it, there is still a chance the entry is got as null. > We met a corner case for this: A federation mode, two block pools in DN, > {{ThrottledAsyncChecker}} schedules two same health checks for same volume. > {noformat} > 2019-05-20 01:02:36,000 INFO > org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: > Scheduling a check for /hadoop/2/hdfs/data/current > 2019-05-20 01:02:36,000 INFO > org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: > Scheduling a check for /hadoop/2/hdfs/data/current > {noformat} > {{completedChecks}} cleans up the entry for one successful check after called > {{completedChecks#get}}. However, after this, another check we get the null. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14503) ThrottledAsyncChecker throws NPE during block pool initialization
[ https://issues.apache.org/jira/browse/HDFS-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070738#comment-17070738 ] Toshihiro Suzuki commented on HDFS-14503: - It looks like this is a duplicate of https://issues.apache.org/jira/browse/HDFS-14074 > ThrottledAsyncChecker throws NPE during block pool initialization > -- > > Key: HDFS-14503 > URL: https://issues.apache.org/jira/browse/HDFS-14503 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Yiqun Lin >Priority: Major > > ThrottledAsyncChecker throws NPE during block pool initialization. The error > leads to block pool registration failure. > The exception > {noformat} > 2019-05-20 01:02:36,003 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > Unexpected exception in block pool Block pool (Datanode Uuid > x) service to xx.xx.xx.xx/xx.xx.xx.xx > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker$LastCheckResult.access$000(ThrottledAsyncChecker.java:211) > at > org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker.schedule(ThrottledAsyncChecker.java:129) > at > org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker.checkAllVolumes(DatasetVolumeChecker.java:209) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3387) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1508) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:319) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:272) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:768) > at java.lang.Thread.run(Thread.java:745) > {noformat} > It looks like this error occurs because the {{WeakHashMap}} {{completedChecks}} has > removed the target entry while we are still trying to get that entry. Although we check > for the entry before we get it, there is still a chance that the get returns null. > We hit a corner case for this: in federation mode, with two block pools in the DN, > {{ThrottledAsyncChecker}} schedules two identical health checks for the same volume. > {noformat} > 2019-05-20 01:02:36,000 INFO > org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: > Scheduling a check for /hadoop/2/hdfs/data/current > 2019-05-20 01:02:36,000 INFO > org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: > Scheduling a check for /hadoop/2/hdfs/data/current > {noformat} > {{completedChecks}} cleans up the entry for one successful check after > {{completedChecks#get}} is called. However, after this, the other check gets null. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15215) The Timestamp for longest write/read lock held log is wrong
[ https://issues.apache.org/jira/browse/HDFS-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056609#comment-17056609 ] Toshihiro Suzuki commented on HDFS-15215: - [~elgoiri] Thank you for reviewing this. I just force-pushed a new commit which modified the existing test to verify the fix. > The Timestamp for longest write/read lock held log is wrong > --- > > Key: HDFS-15215 > URL: https://issues.apache.org/jira/browse/HDFS-15215 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > I found the Timestamp for longest write/read lock held log is wrong in trunk: > {code} > 2020-03-10 16:01:26,585 [main] INFO namenode.FSNamesystem > (FSNamesystemLock.java:writeUnlock(281)) - Number of suppressed > write-lock reports: 0 > Longest write-lock held at 1970-01-03 07:07:40,841+0900 for 3ms via > java.lang.Thread.getStackTrace(Thread.java:1559) > ... > {code} > Looking at the code, it looks like the timestamp comes from System.nanoTime(), > which returns the current value of the running Java Virtual Machine's > high-resolution time source and can only be used to measure > elapsed time: > https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime-- > We need to derive the timestamp from System.currentTimeMillis() instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15217) Add more information to longest write/read lock held log
Toshihiro Suzuki created HDFS-15217: --- Summary: Add more information to longest write/read lock held log Key: HDFS-15217 URL: https://issues.apache.org/jira/browse/HDFS-15217 Project: Hadoop HDFS Issue Type: Improvement Reporter: Toshihiro Suzuki Assignee: Toshihiro Suzuki Currently, we can see the stack trace in the longest write/read lock held log, but sometimes we need more information, for example, a target path of deletion: {code:java} 2020-03-10 21:51:21,116 [main] INFO namenode.FSNamesystem (FSNamesystemLock.java:writeUnlock(276)) - Number of suppressed write-lock reports: 0 Longest write-lock held at 2020-03-10 21:51:21,107+0900 for 6ms via java.lang.Thread.getStackTrace(Thread.java:1559) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:257) org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:233) org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1706) org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3188) ... {code} Adding more information (opName, path, etc.) to the log is useful to troubleshoot. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
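For illustration only, a sketch of what carrying the operation name and target path into the lock-hold report could look like (hypothetical signature and class, not necessarily the committed HDFS-15217 change):
{code:java}
// Minimal sketch: the unlock path receives the operation context so the
// longest-lock-held report can name the operation and target path in
// addition to the stack trace.
public class LockReportContextSketch {
  private long longestHoldMs = -1;
  private String longestHoldDescription = "";

  // Called from the unlock path with the measured hold time and context.
  public synchronized void reportUnlock(String opName, String path, long heldMs) {
    if (heldMs > longestHoldMs) {
      longestHoldMs = heldMs;
      longestHoldDescription = "op=" + opName + ", path=" + path;
    }
  }

  public synchronized String longestReport() {
    return "Longest write-lock held for " + longestHoldMs + "ms ("
        + longestHoldDescription + ")";
  }
}
{code}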
[jira] [Created] (HDFS-15215) The Timestamp for longest write/read lock held log is wrong
Toshihiro Suzuki created HDFS-15215: --- Summary: The Timestamp for longest write/read lock held log is wrong Key: HDFS-15215 URL: https://issues.apache.org/jira/browse/HDFS-15215 Project: Hadoop HDFS Issue Type: Bug Reporter: Toshihiro Suzuki Assignee: Toshihiro Suzuki I found the Timestamp for longest write/read lock held log is wrong in trunk: {code} 2020-03-10 16:01:26,585 [main] INFO namenode.FSNamesystem (FSNamesystemLock.java:writeUnlock(281)) - Number of suppressed write-lock reports: 0 Longest write-lock held at 1970-01-03 07:07:40,841+0900 for 3ms via java.lang.Thread.getStackTrace(Thread.java:1559) ... {code} Looking at the code, it looks like the timestamp comes from System.nanoTime(), which returns the current value of the running Java Virtual Machine's high-resolution time source and can only be used to measure elapsed time: https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime-- We need to derive the timestamp from System.currentTimeMillis() instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
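For illustration only, a minimal sketch of the distinction (hypothetical names, not the FSNamesystemLock code): the monotonic clock measures how long the lock was held, while the wall-clock timestamp for the report should come from System.currentTimeMillis().
{code:java}
import java.text.SimpleDateFormat;
import java.util.Date;

// Minimal sketch: measure the held interval with the monotonic clock, but
// report the wall-clock time of the lock acquisition with currentTimeMillis().
public class LockReportSketch {
  public static void main(String[] args) throws InterruptedException {
    long startNanos = System.nanoTime();
    long startMillis = System.currentTimeMillis();

    Thread.sleep(6); // pretend the write lock was held for ~6ms

    long heldMs = (System.nanoTime() - startNanos) / 1_000_000;
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSSZ");
    // Using new Date(startNanos / 1_000_000) here would print a date near
    // 1970, which is exactly the symptom shown in the bug report.
    System.out.println("Longest write-lock held at "
        + fmt.format(new Date(startMillis)) + " for " + heldMs + "ms");
  }
}
{code}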
[jira] [Updated] (HDFS-15018) DataNode doesn't shutdown although the number of failed disks reaches dfs.datanode.failed.volumes.tolerated
[ https://issues.apache.org/jira/browse/HDFS-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toshihiro Suzuki updated HDFS-15018: Description: In our case, we set dfs.datanode.failed.volumes.tolerated=0 but a DataNode didn't shutdown when a disk in the DataNode host failed for some reason. The following log messages were shown in the DataNode log which indicates the DataNode detected the disk failure, but the DataNode didn't shutdown: {code} 2019-09-17T13:15:43.262-0400 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskErrorAsync callback got 1 failed volumes: [/data2/hdfs/current] 2019-09-17T13:15:43.262-0400 INFO org.apache.hadoop.hdfs.server.datanode.BlockScanner: Removing scanner for volume /data2/hdfs (StorageID DS-329dec9d-a476-4334-9570-651a7e4d1f44) 2019-09-17T13:15:43.263-0400 INFO org.apache.hadoop.hdfs.server.datanode.VolumeScanner: VolumeScanner(/data2/hdfs, DS-329dec9d-a476-4334-9570-651a7e4d1f44) exiting. {code} Looking at the HDFS code, it looks like when the DataNode detects a disk failure, the DataNode waits until the volume reference of the disk is released. https://github.com/hortonworks/hadoop/blob/HDP-2.6.5.0-292-tag/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeList.java#L246 I suspect that the volume reference is not released after the failure detection, but I'm not sure of the reason. And we took thread dumps when the issue was happening. It looks like the following thread is waiting for the volume reference of the disk to be released: {code} "pool-4-thread-1" #174 daemon prio=5 os_prio=0 tid=0x7f9e7c7bf800 nid=0x8325 in Object.wait() [0x7f9e629cb000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.waitVolumeRemoved(FsVolumeList.java:262) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.handleVolumeFailures(FsVolumeList.java:246) - locked <0x000670559278> (a java.lang.Object) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.handleVolumeFailures(FsDatasetImpl.java:2178) at org.apache.hadoop.hdfs.server.datanode.DataNode.handleVolumeFailures(DataNode.java:3410) at org.apache.hadoop.hdfs.server.datanode.DataNode.access$100(DataNode.java:248) at org.apache.hadoop.hdfs.server.datanode.DataNode$4.call(DataNode.java:2013) at org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker$ResultHandler.invokeCallback(DatasetVolumeChecker.java:394) at org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker$ResultHandler.cleanup(DatasetVolumeChecker.java:387) at org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker$ResultHandler.onFailure(DatasetVolumeChecker.java:370) at com.google.common.util.concurrent.Futures$6.run(Futures.java:977) at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253) at org.apache.hadoop.hdfs.server.datanode.checker.AbstractFuture.executeListener(AbstractFuture.java:991) at org.apache.hadoop.hdfs.server.datanode.checker.AbstractFuture.complete(AbstractFuture.java:885) at org.apache.hadoop.hdfs.server.datanode.checker.AbstractFuture.setException(AbstractFuture.java:739) at org.apache.hadoop.hdfs.server.datanode.checker.TimeoutFuture$Fire.run(TimeoutFuture.java:137) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} We found a similar issue HDFS-13339, but we didn't see any deadlock from the thread dump. Attaching the full thread dumps of the problematic DataNode. was: In our case, we set dfs.datanode.failed.volumes.tolerated=0 but a DataNode didn't shutdown when a disk in the DataNode host failed for some reason. The following log messages were shown in the DataNode log which indicates the DataNode detected the disk failure, but the DataNode didn't shutdown: {code} 2019-09-17T13:15:43.262-0400 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskErrorAsync callback got 1 failed volumes: [/data2/hdfs/current] 2019-09-17T13:15:43.262-0400 INFO
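For illustration only, a rough sketch of the wait-for-reference-release pattern being described (hypothetical names, not the FsVolumeList implementation): the failure handler cannot finish until every outstanding reference to the failed volume is released, so a leaked reference keeps the handler waiting indefinitely, which would match the TIMED_WAITING thread above.
{code:java}
// Rough sketch of the wait-for-reference-release pattern (hypothetical names):
// if a reference is never released, waitUntilReleased() loops forever and the
// DataNode never finishes handling the failed volume.
public class VolumeReferenceSketch {
  private int referenceCount = 0;

  public synchronized void acquire() {
    referenceCount++;
  }

  public synchronized void release() {
    referenceCount--;
    if (referenceCount == 0) {
      notifyAll();
    }
  }

  // Blocks until all references are released; with a leaked reference this
  // keeps waking up and waiting, matching the TIMED_WAITING state in the dump.
  public synchronized void waitUntilReleased() throws InterruptedException {
    while (referenceCount > 0) {
      wait(1000);
    }
  }
}
{code}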
[jira] [Created] (HDFS-15018) DataNode doesn't shutdown although the number of failed disks reaches dfs.datanode.failed.volumes.tolerated
Toshihiro Suzuki created HDFS-15018: --- Summary: DataNode doesn't shutdown although the number of failed disks reaches dfs.datanode.failed.volumes.tolerated Key: HDFS-15018 URL: https://issues.apache.org/jira/browse/HDFS-15018 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.7.3 Environment: HDP-2.6.5 Reporter: Toshihiro Suzuki Attachments: thread_dumps.txt In our case, we set dfs.datanode.failed.volumes.tolerated=0 but a DataNode didn't shutdown when a disk in the DataNode host failed for some reason. The following log messages were shown in the DataNode log which indicates the DataNode detected the disk failure, but the DataNode didn't shutdown: {code} 2019-09-17T13:15:43.262-0400 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskErrorAsync callback got 1 failed volumes: [/data2/hdfs/current] 2019-09-17T13:15:43.262-0400 INFO org.apache.hadoop.hdfs.server.datanode.BlockScanner: Removing scanner for volume /data2/hdfs (StorageID DS-329dec9d-a476-4334-9570-651a7e4d1f44) 2019-09-17T13:15:43.263-0400 INFO org.apache.hadoop.hdfs.server.datanode.VolumeScanner: VolumeScanner(/data2/hdfs, DS-329dec9d-a476-4334-9570-651a7e4d1f44) exiting. {code} Looking at the HDFS code, it looks like when the DataNode detects a disk failure, the DataNode waits until the volume reference of the disk is released. https://github.com/hortonworks/hadoop/blob/HDP-2.6.5.0-292-tag/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeList.java#L246 I suspect that the volume reference is not released after the failure detection, but I'm not sure of the reason. And we took thread dumps when the issue was happening. Attaching them to this Jira. It looks like the following thread is waiting for the volume reference of the disk to be released: {code} "pool-4-thread-1" #174 daemon prio=5 os_prio=0 tid=0x7f9e7c7bf800 nid=0x8325 in Object.wait() [0x7f9e629cb000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.waitVolumeRemoved(FsVolumeList.java:262) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.handleVolumeFailures(FsVolumeList.java:246) - locked <0x000670559278> (a java.lang.Object) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.handleVolumeFailures(FsDatasetImpl.java:2178) at org.apache.hadoop.hdfs.server.datanode.DataNode.handleVolumeFailures(DataNode.java:3410) at org.apache.hadoop.hdfs.server.datanode.DataNode.access$100(DataNode.java:248) at org.apache.hadoop.hdfs.server.datanode.DataNode$4.call(DataNode.java:2013) at org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker$ResultHandler.invokeCallback(DatasetVolumeChecker.java:394) at org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker$ResultHandler.cleanup(DatasetVolumeChecker.java:387) at org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker$ResultHandler.onFailure(DatasetVolumeChecker.java:370) at com.google.common.util.concurrent.Futures$6.run(Futures.java:977) at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253) at org.apache.hadoop.hdfs.server.datanode.checker.AbstractFuture.executeListener(AbstractFuture.java:991) at org.apache.hadoop.hdfs.server.datanode.checker.AbstractFuture.complete(AbstractFuture.java:885) at org.apache.hadoop.hdfs.server.datanode.checker.AbstractFuture.setException(AbstractFuture.java:739) at 
org.apache.hadoop.hdfs.server.datanode.checker.TimeoutFuture$Fire.run(TimeoutFuture.java:137) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} We found a similar issue HDFS-13339, but we didn't see any deadlock from the thread dump. Attaching the full thread dumps of the problematic DataNode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional
[jira] [Updated] (HDFS-14123) NameNode failover doesn't happen when running fsfreeze for the NameNode dir (dfs.namenode.name.dir)
[ https://issues.apache.org/jira/browse/HDFS-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toshihiro Suzuki updated HDFS-14123: Attachment: HDFS-14123.01.patch > NameNode failover doesn't happen when running fsfreeze for the NameNode dir > (dfs.namenode.name.dir) > --- > > Key: HDFS-14123 > URL: https://issues.apache.org/jira/browse/HDFS-14123 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Attachments: HDFS-14123.01.patch, HDFS-14123.01.patch, > HDFS-14123.01.patch > > > I ran fsfreeze for the NameNode dir (dfs.namenode.name.dir) in my cluster for > test purpose, but NameNode failover didn't happen. > {code} > fsfreeze -f /mnt > {code} > /mnt is a separate filesystem partition from /. And the NameNode dir > "dfs.namenode.name.dir" is /mnt/hadoop/hdfs/namenode. > I checked the source code, and I found monitorHealth RPC from ZKFC doesn't > fail even if the NameNode dir is frozen. I think that's why the failover > doesn't happen. > Also if the NameNode dir is frozen, it looks like FSImage.rollEditLog() gets > stuck like the following, and it keeps holding the write lock of > FSNamesystem, which causes HDFS service down: > {code} > "IPC Server handler 5 on default port 8020" #53 daemon prio=5 os_prio=0 > tid=0x7f56b96e2000 nid=0x5042 in Object.wait() [0x7f56937bb000] >java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync$SyncEdit.logSyncWait(FSEditLogAsync.java:317) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.logSyncAll(FSEditLogAsync.java:147) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1422) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1316) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4740) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1307) > at > org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:148) > at > org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:14726) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:898) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:844) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2727) >Locked ownable synchronizers: > - <0xc5f4ca10> (a > java.util.concurrent.locks.ReentrantReadWriteLock$FairSync) > {code} > I believe NameNode failover should happen in this case. 
One idea is to check > if the NameNode dir is working when NameNode receives monitorHealth RPC from > ZKFC. > I will attach a patch for this idea. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14123) NameNode failover doesn't happen when running fsfreeze for the NameNode dir (dfs.namenode.name.dir)
[ https://issues.apache.org/jira/browse/HDFS-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toshihiro Suzuki updated HDFS-14123: Attachment: HDFS-14123.01.patch > NameNode failover doesn't happen when running fsfreeze for the NameNode dir > (dfs.namenode.name.dir) > --- > > Key: HDFS-14123 > URL: https://issues.apache.org/jira/browse/HDFS-14123 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Attachments: HDFS-14123.01.patch, HDFS-14123.01.patch > > > I ran fsfreeze for the NameNode dir (dfs.namenode.name.dir) in my cluster for > test purpose, but NameNode failover didn't happen. > {code} > fsfreeze -f /mnt > {code} > /mnt is a separate filesystem partition from /. And the NameNode dir > "dfs.namenode.name.dir" is /mnt/hadoop/hdfs/namenode. > I checked the source code, and I found monitorHealth RPC from ZKFC doesn't > fail even if the NameNode dir is frozen. I think that's why the failover > doesn't happen. > Also if the NameNode dir is frozen, it looks like FSImage.rollEditLog() gets > stuck like the following, and it keeps holding the write lock of > FSNamesystem, which causes HDFS service down: > {code} > "IPC Server handler 5 on default port 8020" #53 daemon prio=5 os_prio=0 > tid=0x7f56b96e2000 nid=0x5042 in Object.wait() [0x7f56937bb000] >java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync$SyncEdit.logSyncWait(FSEditLogAsync.java:317) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.logSyncAll(FSEditLogAsync.java:147) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1422) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1316) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4740) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1307) > at > org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:148) > at > org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:14726) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:898) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:844) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2727) >Locked ownable synchronizers: > - <0xc5f4ca10> (a > java.util.concurrent.locks.ReentrantReadWriteLock$FairSync) > {code} > I believe NameNode failover should happen in this case. One idea is to check > if the NameNode dir is working when NameNode receives monitorHealth RPC from > ZKFC. 
> I will attach a patch for this idea. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14123) NameNode failover doesn't happen when running fsfreeze for the NameNode dir (dfs.namenode.name.dir)
[ https://issues.apache.org/jira/browse/HDFS-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708902#comment-16708902 ] Toshihiro Suzuki commented on HDFS-14123: - I don't think the failures in the last QA are related to the patch. > NameNode failover doesn't happen when running fsfreeze for the NameNode dir > (dfs.namenode.name.dir) > --- > > Key: HDFS-14123 > URL: https://issues.apache.org/jira/browse/HDFS-14123 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Attachments: HDFS-14123.01.patch > > > I ran fsfreeze for the NameNode dir (dfs.namenode.name.dir) in my cluster for > test purpose, but NameNode failover didn't happen. > {code} > fsfreeze -f /mnt > {code} > /mnt is a separate filesystem partition from /. And the NameNode dir > "dfs.namenode.name.dir" is /mnt/hadoop/hdfs/namenode. > I checked the source code, and I found monitorHealth RPC from ZKFC doesn't > fail even if the NameNode dir is frozen. I think that's why the failover > doesn't happen. > Also if the NameNode dir is frozen, it looks like FSImage.rollEditLog() gets > stuck like the following, and it keeps holding the write lock of > FSNamesystem, which causes HDFS service down: > {code} > "IPC Server handler 5 on default port 8020" #53 daemon prio=5 os_prio=0 > tid=0x7f56b96e2000 nid=0x5042 in Object.wait() [0x7f56937bb000] >java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync$SyncEdit.logSyncWait(FSEditLogAsync.java:317) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.logSyncAll(FSEditLogAsync.java:147) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1422) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1316) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4740) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1307) > at > org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:148) > at > org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:14726) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:898) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:844) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2727) >Locked ownable synchronizers: > - <0xc5f4ca10> (a > java.util.concurrent.locks.ReentrantReadWriteLock$FairSync) > {code} > I believe NameNode failover should happen in this case. 
One idea is to check > if the NameNode dir is working when NameNode receives monitorHealth RPC from > ZKFC. > I will attach a patch for this idea. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14123) NameNode failover doesn't happen when running fsfreeze for the NameNode dir (dfs.namenode.name.dir)
[ https://issues.apache.org/jira/browse/HDFS-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toshihiro Suzuki updated HDFS-14123: Attachment: HDFS-14123.01.patch > NameNode failover doesn't happen when running fsfreeze for the NameNode dir > (dfs.namenode.name.dir) > --- > > Key: HDFS-14123 > URL: https://issues.apache.org/jira/browse/HDFS-14123 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Attachments: HDFS-14123.01.patch > > > I ran fsfreeze for the NameNode dir (dfs.namenode.name.dir) in my cluster for > test purpose, but NameNode failover didn't happen. > {code} > fsfreeze -f /mnt > {code} > /mnt is a separate filesystem partition from /. And the NameNode dir > "dfs.namenode.name.dir" is /mnt/hadoop/hdfs/namenode. > I checked the source code, and I found monitorHealth RPC from ZKFC doesn't > fail even if the NameNode dir is frozen. I think that's why the failover > doesn't happen. > Also if the NameNode dir is frozen, it looks like FSImage.rollEditLog() gets > stuck like the following, and it keeps holding the write lock of > FSNamesystem, which causes HDFS service down: > {code} > "IPC Server handler 5 on default port 8020" #53 daemon prio=5 os_prio=0 > tid=0x7f56b96e2000 nid=0x5042 in Object.wait() [0x7f56937bb000] >java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync$SyncEdit.logSyncWait(FSEditLogAsync.java:317) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.logSyncAll(FSEditLogAsync.java:147) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1422) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1316) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4740) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1307) > at > org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:148) > at > org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:14726) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:898) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:844) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2727) >Locked ownable synchronizers: > - <0xc5f4ca10> (a > java.util.concurrent.locks.ReentrantReadWriteLock$FairSync) > {code} > I believe NameNode failover should happen in this case. One idea is to check > if the NameNode dir is working when NameNode receives monitorHealth RPC from > ZKFC. 
> I will attach a patch for this idea. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14123) NameNode failover doesn't happen when running fsfreeze for the NameNode dir (dfs.namenode.name.dir)
[ https://issues.apache.org/jira/browse/HDFS-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708390#comment-16708390 ] Toshihiro Suzuki commented on HDFS-14123: - I attached a patch to check if the NameNode dir is working when NameNode receives monitorHealth RPC from ZKFC. And I tested it in my env, and NameNode failover happened when running fsfreeze for the NameNode dir. Could someone please review the patch? > NameNode failover doesn't happen when running fsfreeze for the NameNode dir > (dfs.namenode.name.dir) > --- > > Key: HDFS-14123 > URL: https://issues.apache.org/jira/browse/HDFS-14123 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Attachments: HDFS-14123.01.patch > > > I ran fsfreeze for the NameNode dir (dfs.namenode.name.dir) in my cluster for > test purpose, but NameNode failover didn't happen. > {code} > fsfreeze -f /mnt > {code} > /mnt is a separate filesystem partition from /. And the NameNode dir > "dfs.namenode.name.dir" is /mnt/hadoop/hdfs/namenode. > I checked the source code, and I found monitorHealth RPC from ZKFC doesn't > fail even if the NameNode dir is frozen. I think that's why the failover > doesn't happen. > Also if the NameNode dir is frozen, it looks like FSImage.rollEditLog() gets > stuck like the following, and it keeps holding the write lock of > FSNamesystem, which causes HDFS service down: > {code} > "IPC Server handler 5 on default port 8020" #53 daemon prio=5 os_prio=0 > tid=0x7f56b96e2000 nid=0x5042 in Object.wait() [0x7f56937bb000] >java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync$SyncEdit.logSyncWait(FSEditLogAsync.java:317) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.logSyncAll(FSEditLogAsync.java:147) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1422) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1316) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4740) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1307) > at > org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:148) > at > org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:14726) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:898) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:844) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2727) >Locked ownable synchronizers: > - <0xc5f4ca10> (a > 
java.util.concurrent.locks.ReentrantReadWriteLock$FairSync) > {code} > I believe NameNode failover should happen in this case. One idea is to check > if the NameNode dir is working when NameNode receives monitorHealth RPC from > ZKFC. > I will attach a patch for this idea. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14123) NameNode failover doesn't happen when running fsfreeze for the NameNode dir (dfs.namenode.name.dir)
[ https://issues.apache.org/jira/browse/HDFS-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toshihiro Suzuki updated HDFS-14123: Status: Patch Available (was: Open) > NameNode failover doesn't happen when running fsfreeze for the NameNode dir > (dfs.namenode.name.dir) > --- > > Key: HDFS-14123 > URL: https://issues.apache.org/jira/browse/HDFS-14123 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Attachments: HDFS-14123.01.patch > > > I ran fsfreeze for the NameNode dir (dfs.namenode.name.dir) in my cluster for > test purpose, but NameNode failover didn't happen. > {code} > fsfreeze -f /mnt > {code} > /mnt is a separate filesystem partition from /. And the NameNode dir > "dfs.namenode.name.dir" is /mnt/hadoop/hdfs/namenode. > I checked the source code, and I found monitorHealth RPC from ZKFC doesn't > fail even if the NameNode dir is frozen. I think that's why the failover > doesn't happen. > Also if the NameNode dir is frozen, it looks like FSImage.rollEditLog() gets > stuck like the following, and it keeps holding the write lock of > FSNamesystem, which causes HDFS service down: > {code} > "IPC Server handler 5 on default port 8020" #53 daemon prio=5 os_prio=0 > tid=0x7f56b96e2000 nid=0x5042 in Object.wait() [0x7f56937bb000] >java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync$SyncEdit.logSyncWait(FSEditLogAsync.java:317) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.logSyncAll(FSEditLogAsync.java:147) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1422) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1316) > - locked <0xc58ca268> (a > org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4740) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1307) > at > org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:148) > at > org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:14726) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:898) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:844) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2727) >Locked ownable synchronizers: > - <0xc5f4ca10> (a > java.util.concurrent.locks.ReentrantReadWriteLock$FairSync) > {code} > I believe NameNode failover should happen in this case. One idea is to check > if the NameNode dir is working when NameNode receives monitorHealth RPC from > ZKFC. 
> I will attach a patch for this idea. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-14123) NameNode failover doesn't happen when running fsfreeze for the NameNode dir (dfs.namenode.name.dir)
Toshihiro Suzuki created HDFS-14123: --- Summary: NameNode failover doesn't happen when running fsfreeze for the NameNode dir (dfs.namenode.name.dir) Key: HDFS-14123 URL: https://issues.apache.org/jira/browse/HDFS-14123 Project: Hadoop HDFS Issue Type: Bug Components: ha Reporter: Toshihiro Suzuki Assignee: Toshihiro Suzuki I ran fsfreeze for the NameNode dir (dfs.namenode.name.dir) in my cluster for test purpose, but NameNode failover didn't happen. {code} fsfreeze -f /mnt {code} /mnt is a separate filesystem partition from /. And the NameNode dir "dfs.namenode.name.dir" is /mnt/hadoop/hdfs/namenode. I checked the source code, and I found monitorHealth RPC from ZKFC doesn't fail even if the NameNode dir is frozen. I think that's why the failover doesn't happen. Also if the NameNode dir is frozen, it looks like FSImage.rollEditLog() gets stuck like the following, and it keeps holding the write lock of FSNamesystem, which causes HDFS service down: {code} "IPC Server handler 5 on default port 8020" #53 daemon prio=5 os_prio=0 tid=0x7f56b96e2000 nid=0x5042 in Object.wait() [0x7f56937bb000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync$SyncEdit.logSyncWait(FSEditLogAsync.java:317) - locked <0xc58ca268> (a org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.logSyncAll(FSEditLogAsync.java:147) at org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1422) - locked <0xc58ca268> (a org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1316) - locked <0xc58ca268> (a org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync) at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1322) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4740) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1307) at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:148) at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:14726) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:898) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:844) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2727) Locked ownable synchronizers: - <0xc5f4ca10> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync) {code} I believe NameNode failover should happen in this case. One idea is to check if the NameNode dir is working when NameNode receives monitorHealth RPC from ZKFC. I will attach a patch for this idea. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
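For illustration only, a minimal sketch of the idea (hypothetical names, not the attached HDFS-14123.01.patch): probe the configured NameNode dir with a small write from the health-check path so that a broken dir makes monitorHealth fail and lets ZKFC trigger a failover.
{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal sketch of the idea only: probe a name dir with a small write so a
// failed write surfaces as an IOException from the health-check path.
public class NameDirHealthCheckSketch {
  public static void checkNameDirWritable(File nameDir) throws IOException {
    Path probe = Files.createTempFile(nameDir.toPath(), "health", ".probe");
    try {
      Files.write(probe, new byte[]{0}); // fails if the dir is not writable
    } finally {
      Files.deleteIfExists(probe);
    }
  }
}
{code}
Note that on a frozen filesystem a write typically blocks rather than fails, so in practice such a probe would also need a timeout to be effective.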
[jira] [Commented] (HDFS-13949) Correct the description of dfs.datanode.disk.check.timeout in hdfs-default.xml
[ https://issues.apache.org/jira/browse/HDFS-13949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644571#comment-16644571 ] Toshihiro Suzuki commented on HDFS-13949: - Thank you for reviewing and committing the patch. [~nandakumar131] > Correct the description of dfs.datanode.disk.check.timeout in hdfs-default.xml > -- > > Key: HDFS-13949 > URL: https://issues.apache.org/jira/browse/HDFS-13949 > Project: Hadoop HDFS > Issue Type: Bug > Components: documentation >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Minor > Fix For: 3.3.0 > > Attachments: HDFS-13949.1.patch > > > The description of dfs.datanode.disk.check.timeout in hdfs-default.xml is as > follows: > {code} > > dfs.datanode.disk.check.timeout > 10m > > Maximum allowed time for a disk check to complete during DataNode > startup. If the check does not complete within this time interval > then the disk is declared as failed. This setting supports > multiple time unit suffixes as described in dfs.heartbeat.interval. > If no suffix is specified then milliseconds is assumed. > > > {code} > I don't think the value of this config is used only during DataNode startup. > I think it's used whenever checking volumes. > The description is misleading so we need to correct it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13949) Correct the description of dfs.datanode.disk.check.timeout in hdfs-default.xml
[ https://issues.apache.org/jira/browse/HDFS-13949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644253#comment-16644253 ] Toshihiro Suzuki commented on HDFS-13949: - Thank you very much for reviewing [~nandakumar131]. The property is also used in ThrottledAsyncChecker, which is initialized in the constructor of DatasetVolumeChecker: {code} diskCheckTimeout = conf.getTimeDuration( DFSConfigKeys.DFS_DATANODE_DISK_CHECK_TIMEOUT_KEY, DFSConfigKeys.DFS_DATANODE_DISK_CHECK_TIMEOUT_DEFAULT, TimeUnit.MILLISECONDS); delegateChecker = new ThrottledAsyncChecker<>( timer, minDiskCheckGapMs, diskCheckTimeout, Executors.newCachedThreadPool( new ThreadFactoryBuilder() .setNameFormat("DataNode DiskChecker thread %d") .setDaemon(true) .build())); {code} This timeout is used in ThrottledAsyncChecker#schedule, which is called by DatasetVolumeChecker#checkVolume. DatasetVolumeChecker#checkVolume is in turn called by DataNode#checkDiskErrorAsync whenever there might be a disk failure. So it looks to me like the property is not used only during DataNode startup. > Correct the description of dfs.datanode.disk.check.timeout in hdfs-default.xml > -- > > Key: HDFS-13949 > URL: https://issues.apache.org/jira/browse/HDFS-13949 > Project: Hadoop HDFS > Issue Type: Bug > Components: documentation >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Minor > Attachments: HDFS-13949.1.patch > > > The description of dfs.datanode.disk.check.timeout in hdfs-default.xml is as > follows: > {code} > > dfs.datanode.disk.check.timeout > 10m > > Maximum allowed time for a disk check to complete during DataNode > startup. If the check does not complete within this time interval > then the disk is declared as failed. This setting supports > multiple time unit suffixes as described in dfs.heartbeat.interval. > If no suffix is specified then milliseconds is assumed. > > > {code} > I don't think the value of this config is used only during DataNode startup. > I think it's used whenever checking volumes. > The description is misleading so we need to correct it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
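For illustration only, a sketch of how such a per-check timeout behaves (plain java.util.concurrent with hypothetical names, not the TimeoutFuture machinery the actual checker uses): every scheduled check, not just the ones at startup, is declared failed if it does not complete within the configured interval.
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Sketch only: apply a timeout to an asynchronously scheduled disk check so
// that a check exceeding the interval completes exceptionally (i.e. failed).
public class TimedCheckSketch {
  static CompletableFuture<Boolean> scheduleCheck(Runnable diskCheck,
                                                  long timeoutMs) {
    return CompletableFuture.supplyAsync(() -> {
      diskCheck.run();       // e.g. probe the volume directory
      return Boolean.TRUE;   // healthy if the probe completed
    }).orTimeout(timeoutMs, TimeUnit.MILLISECONDS); // fail the check on timeout
  }

  public static void main(String[] args) throws Exception {
    // With dfs.datanode.disk.check.timeout=10m, timeoutMs would be 600000.
    System.out.println(scheduleCheck(() -> { }, 600_000L).get());
  }
}
{code}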
[jira] [Commented] (HDFS-13949) Correct the description of dfs.datanode.disk.check.timeout in hdfs-default.xml
[ https://issues.apache.org/jira/browse/HDFS-13949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634843#comment-16634843 ] Toshihiro Suzuki commented on HDFS-13949: - I just attached a patch to correct the description. Could someone please review it? > Correct the description of dfs.datanode.disk.check.timeout in hdfs-default.xml > -- > > Key: HDFS-13949 > URL: https://issues.apache.org/jira/browse/HDFS-13949 > Project: Hadoop HDFS > Issue Type: Bug > Components: documentation >Reporter: Toshihiro Suzuki >Priority: Minor > Attachments: HDFS-13949.1.patch > > > The description of dfs.datanode.disk.check.timeout in hdfs-default.xml is as > follows: > {code} > > dfs.datanode.disk.check.timeout > 10m > > Maximum allowed time for a disk check to complete during DataNode > startup. If the check does not complete within this time interval > then the disk is declared as failed. This setting supports > multiple time unit suffixes as described in dfs.heartbeat.interval. > If no suffix is specified then milliseconds is assumed. > > > {code} > I don't think the value of this config is used only during DataNode startup. > I think it's used whenever checking volumes. > The description is misleading so we need to correct it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13949) Correct the description of dfs.datanode.disk.check.timeout in hdfs-default.xml
[ https://issues.apache.org/jira/browse/HDFS-13949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toshihiro Suzuki updated HDFS-13949: Attachment: HDFS-13949.1.patch > Correct the description of dfs.datanode.disk.check.timeout in hdfs-default.xml > -- > > Key: HDFS-13949 > URL: https://issues.apache.org/jira/browse/HDFS-13949 > Project: Hadoop HDFS > Issue Type: Bug > Components: documentation >Reporter: Toshihiro Suzuki >Priority: Minor > Attachments: HDFS-13949.1.patch > > > The description of dfs.datanode.disk.check.timeout in hdfs-default.xml is as > follows: > {code} > > dfs.datanode.disk.check.timeout > 10m > > Maximum allowed time for a disk check to complete during DataNode > startup. If the check does not complete within this time interval > then the disk is declared as failed. This setting supports > multiple time unit suffixes as described in dfs.heartbeat.interval. > If no suffix is specified then milliseconds is assumed. > > > {code} > I don't think the value of this config is used only during DataNode startup. > I think it's used whenever checking volumes. > The description is misleading so we need to correct it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-13949) Correct the description of dfs.datanode.disk.check.timeout in hdfs-default.xml
Toshihiro Suzuki created HDFS-13949: --- Summary: Correct the description of dfs.datanode.disk.check.timeout in hdfs-default.xml Key: HDFS-13949 URL: https://issues.apache.org/jira/browse/HDFS-13949 Project: Hadoop HDFS Issue Type: Bug Components: documentation Reporter: Toshihiro Suzuki The description of dfs.datanode.disk.check.timeout in hdfs-default.xml is as follows: {code} dfs.datanode.disk.check.timeout 10m Maximum allowed time for a disk check to complete during DataNode startup. If the check does not complete within this time interval then the disk is declared as failed. This setting supports multiple time unit suffixes as described in dfs.heartbeat.interval. If no suffix is specified then milliseconds is assumed. {code} I don't think the value of this config is used only during DataNode startup. I think it's used whenever checking volumes. The description is misleading so we need to correct it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org