[jira] [Updated] (HDFS-13693) Remove unnecessary search in INodeDirectory.addChild during image loading
[ https://issues.apache.org/jira/browse/HDFS-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lisheng Sun updated HDFS-13693:
-------------------------------
    Attachment: HDFS-13693-005.patch

> Remove unnecessary search in INodeDirectory.addChild during image loading
> -------------------------------------------------------------------------
>
>                 Key: HDFS-13693
>                 URL: https://issues.apache.org/jira/browse/HDFS-13693
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: zhouyingchao
>            Assignee: Lisheng Sun
>            Priority: Major
>         Attachments: HDFS-13693-001.patch, HDFS-13693-002.patch, HDFS-13693-003.patch, HDFS-13693-004.patch, HDFS-13693-005.patch
>
> In FSImageFormatPBINode.loadINodeDirectorySection, all child INodes are added to their parent INode's map one by one. The adding procedure searches for a position in the parent's map and then inserts the child at that position. However, during image loading the search is unnecessary, since the insert position should always be at the end of the map, given the sequence in which the INodes are serialized on disk.
> Testing this patch against an fsimage of a 70PB cluster (200 million files and 300 million blocks), the image loading time was reduced from 1210 seconds to 1138 seconds, a reduction of about 6%.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
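The append-at-end idea in the description can be sketched as follows. This is an illustrative simplification, not the actual Hadoop patch: a directory keeps its children sorted by name, the general addChild path binary-searches for the insertion point, and the image-loading path skips the search because the serialized children arrive already sorted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the HDFS-13693 optimization: skip the per-child
// binary search when children are loaded in their serialized (sorted) order.
public class SortedChildren {
  private final List<String> children = new ArrayList<>();

  // General path: O(log n) search for the insertion point, then insert.
  public void addChild(String name) {
    int pos = Collections.binarySearch(children, name);
    if (pos < 0) {
      children.add(-pos - 1, name);
    }
  }

  // Image-loading path: the caller guarantees sorted input, so just append.
  public void addChildSorted(String name) {
    if (!children.isEmpty()
        && children.get(children.size() - 1).compareTo(name) >= 0) {
      throw new IllegalArgumentException("input not sorted: " + name);
    }
    children.add(name);
  }

  public List<String> getChildren() {
    return Collections.unmodifiableList(children);
  }
}
```

Both paths produce the same sorted map; the loading path merely avoids the O(log n) search per child, which adds up over hundreds of millions of inodes.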
[jira] [Updated] (HDFS-13693) Remove unnecessary search in INodeDirectory.addChild during image loading
[ https://issues.apache.org/jira/browse/HDFS-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lisheng Sun updated HDFS-13693:
-------------------------------
    Attachment: HDFS-13693-004.patch
[jira] [Updated] (HDFS-13693) Remove unnecessary search in INodeDirectory.addChild during image loading
[ https://issues.apache.org/jira/browse/HDFS-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lisheng Sun updated HDFS-13693:
-------------------------------
    Attachment: (was: HDFS-13693-004.patch)
[jira] [Commented] (HDFS-13693) Remove unnecessary search in INodeDirectory.addChild during image loading
[ https://issues.apache.org/jira/browse/HDFS-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888483#comment-16888483 ]

Lisheng Sun commented on HDFS-13693:
------------------------------------

[~jojochuang] [~hexiaoqiao] I have updated the patch and added some annotations. Could you help review it? Thank you.
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889667#comment-16889667 ]

Lisheng Sun commented on HDFS-14313:
------------------------------------

Thanks [~linyiqun] for your careful review. I have a few questions to discuss with you.

1. I don't use the component-specific impl class in the common module. The update in GetSpaceUsed in the common module is just to make it inheritable by subclasses, and the update in CommonConfigurationKeys in the common module is for printing the threshold time; it should be moved to DFSConfigKeys, which is more appropriate.

2. There is already a switch that controls the implementation, GetSpaceUsed#Builder#CLASSNAME_KEY:
{code:java}
static final String CLASSNAME_KEY = "fs.getspaceused.classname";
{code}
If we add enableFSCachingGetSpace as you suggest, there will be two switches, which will confuse users.

3.
{quote}Even though deepCopyReplica is only used by another thread, I still prefer to let it be an atomic operation in case this will be used in other places in the future. Can you add datasetLock here?
{quote}
FsDatasetImpl#addBlockPool holds the datasetLock while calling FsVolumeList#addBlockPool:
{code:java}
@Override
public void addBlockPool(String bpid, Configuration conf) throws IOException {
  LOG.info("Adding block pool " + bpid);
  try (AutoCloseableLock lock = datasetLock.acquire()) {
    volumes.addBlockPool(bpid, conf);
    volumeMap.initBlockPool(bpid);
  }
  volumes.getAllVolumesMap(bpid, volumeMap, ramDiskReplicaTracker);
}
{code}
The call chain is FsVolumeList#addBlockPool -> FsVolumeImpl#addBlockPool -> new BlockPoolSlice -> FsDatasetImpl#deepCopyReplica. If deepCopyReplica also took the datasetLock, it would deadlock. So I use Collections.unmodifiableSet to ensure replica info cannot be modified outside:
{code:java}
void addBlockPool(final String bpid, final Configuration conf) throws IOException {
  long totalStartTime = Time.monotonicNow();
  final Map<FsVolumeImpl, IOException> unhealthyDataDirs =
      new ConcurrentHashMap<FsVolumeImpl, IOException>();
  List<Thread> blockPoolAddingThreads = new ArrayList<Thread>();
  for (final FsVolumeImpl v : volumes) {
    Thread t = new Thread() {
      public void run() {
        try (FsVolumeReference ref = v.obtainReference()) {
          FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
              " on volume " + v + "...");
          long startTime = Time.monotonicNow();
          v.addBlockPool(bpid, conf);
          long timeTaken = Time.monotonicNow() - startTime;
          FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
              " on " + v + ": " + timeTaken + "ms");
        } catch (ClosedChannelException e) {
          // ignore.
        } catch (IOException ioe) {
          FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
              ". Will throw later.", ioe);
          unhealthyDataDirs.put(v, ioe);
        }
      }
    };
    blockPoolAddingThreads.add(t);
    t.start();
  }
{code}

4. Following your suggestion, I will modify the UT to use a real minicluster and add a comparison test that uses FSCachingGetSpaceUsed and the default DU way respectively. Please correct me if I am wrong. Thanks [~linyiqun] again.

> Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory
> instead of df/du
> ----------------------------------------------------------------------
>
>                 Key: HDFS-14313
>                 URL: https://issues.apache.org/jira/browse/HDFS-14313
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, performance
>    Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.0
>            Reporter: Lisheng Sun
>            Assignee: Lisheng Sun
>            Priority: Major
>         Attachments: HDFS-14313.000.patch, HDFS-14313.001.patch, HDFS-14313.002.patch, HDFS-14313.003.patch, HDFS-14313.004.patch, HDFS-14313.005.patch, HDFS-14313.006.patch, HDFS-14313.007.patch
>
> The two existing ways of getting used space, DU and DF, are insufficient:
> # Running DU across lots of disks is very expensive, and running all of the processes at the same time creates a noticeable IO spike.
> # Running DF is inaccurate when the disk is shared by multiple datanodes or other servers.
> Getting hdfs used space from the FsDatasetImpl#volumeMap ReplicaInfos in memory is very cheap and accurate.
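The core idea above can be sketched as follows. Names here are hypothetical, not the real FsDatasetImpl API: the datanode already tracks every replica in an in-memory map, so used space can be computed by summing the recorded on-disk lengths instead of forking `du` processes or trusting a shared-disk `df`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the HDFS-14313 approach: derive "used" from the
// in-memory replica map rather than from du/df.
public class InMemoryUsedSpace {
  // block id -> bytes on disk (block file plus metadata file)
  private final Map<Long, Long> replicas = new ConcurrentHashMap<>();

  public void addReplica(long blockId, long bytesOnDisk) {
    replicas.put(blockId, bytesOnDisk);
  }

  public void removeReplica(long blockId) {
    replicas.remove(blockId);
  }

  // O(#replicas) over in-memory data: no process fork, no disk I/O spike,
  // and unaffected by other tenants sharing the same disk.
  public long getUsed() {
    long used = 0;
    for (long bytes : replicas.values()) {
      used += bytes;
    }
    return used;
  }
}
```

As the comment thread notes, a du-style measurement would additionally count files like VERSION and in_use.lock, so the two methods are expected to differ slightly.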
[jira] [Commented] (HDFS-13693) Remove unnecessary search in INodeDirectory.addChild during image loading
[ https://issues.apache.org/jira/browse/HDFS-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889669#comment-16889669 ]

Lisheng Sun commented on HDFS-13693:
------------------------------------

[~hexiaoqiao] has +1'd the patch in Jira and [~jojochuang] has +1'd the PR. [~jojochuang], [~xkrogen], [~elgoiri], [~ayushtkn], would you mind taking another review? Thank you.
[jira] [Commented] (HDFS-13693) Remove unnecessary search in INodeDirectory.addChild during image loading
[ https://issues.apache.org/jira/browse/HDFS-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889696#comment-16889696 ]

Lisheng Sun commented on HDFS-13693:
------------------------------------

Thank you [~ayushtkn] for your review. I have deleted the related PR, and from now on will keep only the v5 patch for this issue. Thanks again.
[jira] [Commented] (HDFS-13693) Remove unnecessary search in INodeDirectory.addChild during image loading
[ https://issues.apache.org/jira/browse/HDFS-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889701#comment-16889701 ]

Lisheng Sun commented on HDFS-13693:
------------------------------------

Thank you [~ayushtkn] for your suggestion. I will pay attention to it next time.
[jira] [Commented] (HDFS-13693) Remove unnecessary search in INodeDirectory.addChild during image loading
[ https://issues.apache.org/jira/browse/HDFS-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890228#comment-16890228 ]

Lisheng Sun commented on HDFS-13693:
------------------------------------

Hi [~ayushtkn], if you see no problem with this patch, could you help merge it to trunk? Thank you.
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890237#comment-16890237 ]

Lisheng Sun commented on HDFS-14313:
------------------------------------

Hi [~linyiqun], would you mind continuing to review this patch? Thank you.
[jira] [Commented] (HDFS-13693) Remove unnecessary search in INodeDirectory.addChild during image loading
[ https://issues.apache.org/jira/browse/HDFS-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890943#comment-16890943 ]

Lisheng Sun commented on HDFS-13693:
------------------------------------

{code:java}
[ERROR] Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:3.3.0-SNAPSHOT:protoc (compile-protoc) on project hadoop-common: org.apache.maven.plugin.MojoExecutionException: protoc version is 'libprotoc 3.0.0', expected version is '2.5.0' -> [Help 1]
{code}
I think this failure is unrelated to this patch. From the output log, it appears to be a protoc version problem.

> Remove unnecessary search in INodeDirectory.addChild during image loading
> -------------------------------------------------------------------------
>
>                 Key: HDFS-13693
>                 URL: https://issues.apache.org/jira/browse/HDFS-13693
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: zhouyingchao
>            Assignee: Lisheng Sun
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: HDFS-13693-001.patch, HDFS-13693-002.patch, HDFS-13693-003.patch, HDFS-13693-004.patch, HDFS-13693-005.patch
[jira] [Comment Edited] (HDFS-13693) Remove unnecessary search in INodeDirectory.addChild during image loading
[ https://issues.apache.org/jira/browse/HDFS-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890943#comment-16890943 ]

Lisheng Sun edited comment on HDFS-13693 at 7/23/19 12:19 PM:
--------------------------------------------------------------
{code:java}
[ERROR] Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:3.3.0-SNAPSHOT:protoc (compile-protoc) on project hadoop-common: org.apache.maven.plugin.MojoExecutionException: protoc version is 'libprotoc 3.0.0', expected version is '2.5.0' -> [Help 1]
{code}
I think this failure is unrelated to this patch. From the output log, it should be a protoc version problem.

was (Author: leosun08):
{code:java}
[ERROR] Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:3.3.0-SNAPSHOT:protoc (compile-protoc) on project hadoop-common: org.apache.maven.plugin.MojoExecutionException: protoc version is 'libprotoc 3.0.0', expected version is '2.5.0' -> [Help 1]
{code}
I think this failure is unrelated to this patch. From the output log, it should be a proto version problem.
[jira] [Assigned] (HDFS-13594) the lock of ShortCircuitCache is hold while close the ShortCircuitReplica
[ https://issues.apache.org/jira/browse/HDFS-13594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lisheng Sun reassigned HDFS-13594:
----------------------------------
    Assignee: Lisheng Sun

> the lock of ShortCircuitCache is hold while close the ShortCircuitReplica
> --------------------------------------------------------------------------
>
>                 Key: HDFS-13594
>                 URL: https://issues.apache.org/jira/browse/HDFS-13594
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 3.0.2
>            Reporter: Gang Xie
>            Assignee: Lisheng Sun
>            Priority: Minor
>         Attachments: no_hdfs.svg
>
> While profiling short-circuit reads, we found that ShortCircuitCache's lock is a hot spot. Looking into the code, when a BlockReaderLocal is closed, it tries to trimEvictionMaps, and several ShortCircuitReplicas are closed while the lock is held. This slows down the close of the BlockReaderLocal, and worse, it blocks other allocations of new ShortCircuitReplicas.
> An idea to avoid this is to close the replicas in an async way. I will do a prototype and measure the performance.
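The async-close idea proposed in the description can be sketched as follows. This is an illustrative outline with hypothetical names, not the actual prototype: evicted replicas are handed to a background executor so the expensive close no longer happens while the cache lock is held.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: move replica close off the lock path onto a
// single-threaded executor so lock holders return quickly.
public class AsyncCloser {
  private final ExecutorService closer = Executors.newSingleThreadExecutor();
  private final AtomicInteger closed = new AtomicInteger();

  // Called with the cache lock held: only enqueues, so it returns quickly.
  public void scheduleClose(AutoCloseable replica) {
    closer.execute(() -> {
      try {
        replica.close();          // the slow work happens off the lock path
        closed.incrementAndGet();
      } catch (Exception e) {
        // log and continue; one failed close must not kill the closer thread
      }
    });
  }

  public int closedCount() {
    return closed.get();
  }

  public void shutdown() throws InterruptedException {
    closer.shutdown();
    closer.awaitTermination(5, TimeUnit.SECONDS);
  }
}
```

The trade-off is that file descriptors held by evicted replicas are released slightly later; the win is that replica allocation is no longer serialized behind slow closes.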
[jira] [Assigned] (HDFS-13564) PreAllocator for DfsClientShm
[ https://issues.apache.org/jira/browse/HDFS-13564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lisheng Sun reassigned HDFS-13564:
----------------------------------
    Assignee: Lisheng Sun

> PreAllocator for DfsClientShm
> -----------------------------
>
>                 Key: HDFS-13564
>                 URL: https://issues.apache.org/jira/browse/HDFS-13564
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 3.0.2
>            Reporter: Gang Xie
>            Assignee: Lisheng Sun
>            Priority: Minor
>             Fix For: 3.0.2
>
> During a stress test against short-circuit local reads, we found a bottleneck: allocating a new DfsClientShm blocks many slot allocations on it.
> Currently there are 128 slots per shm, which means that up to 128 reads can be blocked by one shm allocation. Especially under stress, the domain socket communication to the datanode gets slow, and the datanode could also be in GC, so allocating one shm can take several hundred ms, delaying the reads in turn. This is bad for latency-sensitive services like HBase.
> I'm working on the prototype and will upload the code and test results later.
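The pre-allocation idea in the description can be outlined as follows. Class and method names are hypothetical, not the actual prototype: a background thread keeps a small pool of segments ready, so a reader normally takes one without paying the allocation latency, and only falls back to allocating inline when the pool is empty.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Hypothetical sketch: pre-allocate shared-memory segments off the read
// path so slot allocation rarely waits on a fresh allocation.
public class ShmPreAllocator<T> {
  private final BlockingQueue<T> pool;
  private final Supplier<T> allocator;

  public ShmPreAllocator(int poolSize, Supplier<T> allocator) {
    this.pool = new ArrayBlockingQueue<>(poolSize);
    this.allocator = allocator;
  }

  // Called by a background refill thread; returns false when pool is full.
  public boolean refillOne() {
    return pool.offer(allocator.get());
  }

  // Fast path: reuse a pre-allocated segment.
  // Slow path: pool empty, allocate inline (the original latency cost).
  public T take() {
    T shm = pool.poll();
    return shm != null ? shm : allocator.get();
  }
}
```

The pool size bounds the memory held in reserve; even a pool of one or two segments hides the several-hundred-ms allocation from most readers.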
[jira] [Updated] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lisheng Sun updated HDFS-14313:
-------------------------------
    Attachment: HDFS-14313.008.patch
[jira] [Comment Edited] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892921#comment-16892921 ]

Lisheng Sun edited comment on HDFS-14313 at 7/25/19 4:24 PM:
-------------------------------------------------------------
Thank [~linyiqun] for your review. I have updated the patch as per your comments. Could you help review it? Thank you.

was (Author: leosun08):
Thank [~linyiqun] for your review. I have update the patch as your comments. Could you help review it? Thank you.
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892921#comment-16892921 ]

Lisheng Sun commented on HDFS-14313:
------------------------------------

Thank [~linyiqun] for your review. I have updated the patch as per your comments. Could you help review it? Thank you.
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894465#comment-16894465 ]

Lisheng Sun commented on HDFS-14313:
------------------------------------

Hi [~linyiqun], [~jojochuang], do you have time to review this patch? Thank you very much.
[jira] [Assigned] (HDFS-7868) Use proper blocksize to choose target for blocks
[ https://issues.apache.org/jira/browse/HDFS-7868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun reassigned HDFS-7868: - Assignee: Lisheng Sun (was: zhouyingchao) > Use proper blocksize to choose target for blocks > > > Key: HDFS-7868 > URL: https://issues.apache.org/jira/browse/HDFS-7868 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: zhouyingchao >Assignee: Lisheng Sun >Priority: Major > Labels: BB2015-05-TBR > Attachments: HDFS-7868-001.patch > > > In BlockPlacementPolicyDefault.java:isGoodTarget, the passed-in blockSize is > used to determine whether there is enough room for a new block on a datanode. > However, in two cases the blockSize might not be appropriate for this purpose: > (a) the passed-in block size is just the size of the last block of a file, > which might be very small (e.g., when called from > BlockManager.ReplicationWork.chooseTargets); (b) the file might have been > created with a smaller blocksize. > In these cases, the calculated scheduledSize might be smaller than the > actual value, which can ultimately lead to write or replication failures.
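The failure mode described above can be illustrated with a small sketch. The `Node` class and `isGoodTarget` method here are hypothetical simplifications, not the actual `BlockPlacementPolicyDefault` code: they only show why the remaining-space check should reserve the file's configured block size rather than the possibly tiny size of its last block.

```java
public class TargetCheckSketch {
    // Hypothetical view of a datanode: free capacity and blocks already scheduled to it.
    static class Node {
        final long remainingBytes;
        final int scheduledBlocks;

        Node(long remainingBytes, int scheduledBlocks) {
            this.remainingBytes = remainingBytes;
            this.scheduledBlocks = scheduledBlocks;
        }
    }

    // Good target iff remaining space covers the already-scheduled blocks plus
    // the new one, each reserved at the given block size.
    static boolean isGoodTarget(Node node, long blockSize) {
        long scheduledSize = (long) (node.scheduledBlocks + 1) * blockSize;
        return node.remainingBytes >= scheduledSize;
    }

    public static void main(String[] args) {
        Node node = new Node(200L * 1024 * 1024, 1); // 200 MB free, 1 block scheduled
        long lastBlockSize = 1024;                   // last block of the file: 1 KB
        long configuredBlockSize = 128L * 1024 * 1024; // file's configured block size
        // Reserving at the last block's size wrongly accepts the node; reserving
        // at the configured size correctly rejects it (2 * 128 MB > 200 MB).
        System.out.println(isGoodTarget(node, lastBlockSize));       // true
        System.out.println(isGoodTarget(node, configuredBlockSize)); // false
    }
}
```

In the undersized case the node is chosen, the write later runs out of room, and the replication or write fails, which matches the scenario the issue describes.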
[jira] [Comment Edited] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894893#comment-16894893 ] Lisheng Sun edited comment on HDFS-14313 at 7/29/19 3:11 AM: - Thanks [~linyiqun] for your review. *FSCachingGetSpaceUsed* {quote}Line 53: Add the final keyword for the FsVolumeImpl variable. Line 54: Add the final keyword too. {quote} The values of volume and bpid are assigned via setters, so the final keyword cannot be added to these two variables. {quote}Line 75: We don't pass the config to use the threshold time now, still we need to override this method? If don't need, the change made in GetSpaceUsed can also be reverted. {quote} The new volume and bpid variables are added to FSCachingGetSpaceUsed for the HDFS module, so ReplicaCachingGetSpaceUsed's constructor parameter must be FSCachingGetSpaceUsed#Builder, and FSCachingGetSpaceUsed#build() cannot be removed. *TestReplicaCachingGetSpaceUsed* {quote}Line 69: As I have mentioned before, can we have an additional comparison for the DU impl class? The most of lines can be reused for these two getused impl class. Just passing different key value with restart the mini cluster and comparing the used space. {quote} The space reported by the DU impl includes:
├── current
│   ├── BP-1876464514-10.239.56.179-1564369203299
│   │   ├── current
│   │   │   ├── VERSION
│   │   │   ├── finalized
│   │   │   │   └── subdir0
│   │   │   │       └── subdir0
│   │   │   │           ├── blk_1073741825
│   │   │   │           └── blk_1073741825_1001.meta
│   │   │   └── rbw
│   │   ├── scanner.cursor
│   │   └── tmp
│   └── VERSION
└── in_use.lock
The space reported by the ReplicaCachingGetSpaceUsed impl includes:
├── blk_1073741825
└── blk_1073741825_1001.meta
The DU impl counts the sizes of all directories plus other files such as VERSION, in_use.lock and so on, so its result must be greater than that of the ReplicaCachingGetSpaceUsed impl, and the ReplicaCachingGetSpaceUsed result is more accurate. Given that, is it necessary to add a comparison for the DU impl class?
Please correct me if I was wrong. Thanks [~linyiqun] again.
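The size relationship argued above can be demonstrated with a small sketch. The paths and helper names below are hypothetical, not the actual test code: DU-style accounting sums every file under the block-pool directory, while replica-style accounting sums only block and meta files, so the DU result can only be greater than or equal to the replica result.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DuVsReplicaSketch {
    // DU-style: size of every regular file under the root, including VERSION,
    // in_use.lock, scanner.cursor, etc.
    static long duStyle(Path root) throws IOException {
        return Files.walk(root)
            .filter(Files::isRegularFile)
            .mapToLong(p -> p.toFile().length())
            .sum();
    }

    // Replica-style: only files that look like block data or block meta files.
    static long replicaStyle(Path root) throws IOException {
        return Files.walk(root)
            .filter(Files::isRegularFile)
            .filter(p -> p.getFileName().toString().startsWith("blk_"))
            .mapToLong(p -> p.toFile().length())
            .sum();
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("bp");
        Path finalized = Files.createDirectories(root.resolve("current/finalized/subdir0"));
        Files.write(finalized.resolve("blk_1073741825"), new byte[1024]);
        Files.write(finalized.resolve("blk_1073741825_1001.meta"), new byte[16]);
        Files.write(root.resolve("current").resolve("VERSION"), new byte[64]);
        Files.write(root.resolve("in_use.lock"), new byte[0]);
        // DU also counts VERSION and in_use.lock, so it is always >= the replica sum.
        System.out.println(duStyle(root) >= replicaStyle(root)); // true
        System.out.println(replicaStyle(root)); // 1040
    }
}
```

This is the inequality the comment relies on: any test comparing the two implementations can only assert `duStyle >= replicaStyle`, never equality.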
[jira] [Updated] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14313: --- Attachment: HDFS-14313.009.patch
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894898#comment-16894898 ] Lisheng Sun commented on HDFS-14313: Updated the patch according to the review comments and uploaded the v9 patch.
[jira] [Assigned] (HDFS-14644) That replication of block failed leads to decommission is blocked when the number of replicas of block is greater than the number of datanode
[ https://issues.apache.org/jira/browse/HDFS-14644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun reassigned HDFS-14644: -- Assignee: Lisheng Sun > That replication of block failed leads to decommission is blocked when the > number of replicas of block is greater than the number of datanode > - > > Key: HDFS-14644 > URL: https://issues.apache.org/jira/browse/HDFS-14644 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.1.1, 2.9.2, 3.0.3, 2.8.5, 2.7.7 >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > > 2019-07-10,15:37:18,028 WARN > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to > place enough replicas, still in need of 5 to reach 10 > (unavailableStorages=[DISK, ARCHIVE], > storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], > creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) All > required storage types are unavailable: unavailableStorages=[DISK, ARCHIVE], > storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], > creationFallbacks=[], replicationFallbacks=[ARCHIVE]} > 2019-07-10,15:37:18,028 WARN > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to > place enough replicas, still in need of 5 to reach 10 > (unavailableStorages=[], storagePolicy=BlockStoragePolicy\{HOT:7, > storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, > newBlock=false) For more information, please enable DEBUG log level on > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14644) That replication of block failed leads to decommission is blocked when the number of replicas of block is greater than the number of datanode
[ https://issues.apache.org/jira/browse/HDFS-14644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896099#comment-16896099 ] Lisheng Sun commented on HDFS-14644: Thanks [~sodonnell] [~jojochuang] for your suggestions. I intend to implement it via a command, as in HDFS-12946. If you see no problem with that, I will do this work. Thank you.
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896696#comment-16896696 ] Lisheng Sun commented on HDFS-14313: Hi [~linyiqun], could you find time to continue the review? Thank you.
[jira] [Updated] (HDFS-14290) Unexpected message type: PooledUnsafeDirectByteBuf when get datanode info by DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-14290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14290: --- Resolution: Not A Problem Status: Resolved (was: Patch Available) > Unexpected message type: PooledUnsafeDirectByteBuf when get datanode info by > DatanodeWebHdfsMethods > --- > > Key: HDFS-14290 > URL: https://issues.apache.org/jira/browse/HDFS-14290 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, webhdfs >Affects Versions: 2.7.0, 2.7.1 >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14290.000.patch, webhdfs show.png > > > The issue is that there is no HttpRequestDecoder in the Netty inbound handler, > so an unexpected message type appears when a message is read. > > !webhdfs show.png! > DEBUG org.apache.hadoop.hdfs.server.datanode.web.DatanodeHttpServer: Proxy > failed. Cause: > com.xiaomi.infra.thirdparty.io.netty.handler.codec.EncoderException: > java.lang.IllegalStateException: unexpected message type: > PooledUnsafeDirectByteBuf > at > com.xiaomi.infra.thirdparty.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106) > at > com.xiaomi.infra.thirdparty.io.netty.channel.CombinedChannelDuplexHandler.write(CombinedChannelDuplexHandler.java:348) > at > com.xiaomi.infra.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) > at > com.xiaomi.infra.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) > at > com.xiaomi.infra.thirdparty.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816) > at > com.xiaomi.infra.thirdparty.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723) > at > com.xiaomi.infra.thirdparty.io.netty.handler.stream.ChunkedWriteHandler.doFlush(ChunkedWriteHandler.java:304) > at > 
com.xiaomi.infra.thirdparty.io.netty.handler.stream.ChunkedWriteHandler.flush(ChunkedWriteHandler.java:137) > at > com.xiaomi.infra.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776) > at > com.xiaomi.infra.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:802) > at > com.xiaomi.infra.thirdparty.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:814) > at > com.xiaomi.infra.thirdparty.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:794) > at > com.xiaomi.infra.thirdparty.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:831) > at > com.xiaomi.infra.thirdparty.io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1051) > at > com.xiaomi.infra.thirdparty.io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:300) > at > org.apache.hadoop.hdfs.server.datanode.web.SimpleHttpProxyHandler$Forwarder.channelRead(SimpleHttpProxyHandler.java:80) > at > com.xiaomi.infra.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) > at > com.xiaomi.infra.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) > at > com.xiaomi.infra.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) > at > com.xiaomi.infra.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1414) > at > com.xiaomi.infra.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) > at > com.xiaomi.infra.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) > at > 
com.xiaomi.infra.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:945) > at > com.xiaomi.infra.thirdparty.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:146) > at > com.xiaomi.infra.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) > at > com.xiaomi.infra.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) > at > com.xiaomi.infra.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) > at > com.xiaomi.infra.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) > at > com.xiaomi.infra.thirdparty.io.netty.util.concurrent.SingleTh
[jira] [Commented] (HDFS-14290) Unexpected message type: PooledUnsafeDirectByteBuf when get datanode info by DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-14290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896754#comment-16896754 ] Lisheng Sun commented on HDFS-14290: Yeah. Sorry. It's not a problem. I have closed it.
[jira] [Updated] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14313: --- Attachment: HDFS-14313.010.patch
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16897988#comment-16897988 ] Lisheng Sun commented on HDFS-14313: Thanks [~linyiqun] for the thorough review. I have updated the patch per your comments and uploaded the v10 patch.
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899565#comment-16899565 ] Lisheng Sun commented on HDFS-14313: Ping [~linyiqun], would you mind reviewing this patch? Thank you.
[jira] [Updated] (HDFS-13700) The process of loading image can be done in a pipeline model
[ https://issues.apache.org/jira/browse/HDFS-13700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-13700: --- Description: The process of loading a file system image involves reading the inodes section, deserializing inodes, initializing inodes, adding inodes to the global map, reading the directories section, adding inodes to their parents' maps, caching names, etc. These steps can be done in a pipeline model to reduce the total duration. Testing the patch against an fsimage of a 70PB 2.4 cluster (200 million files and 300 million blocks; the fsimage is around 22GB), the image loading time was reduced from 1210 seconds to 739 seconds. was:The process of loading a file system image involves reading inodes section, deserializing inodes, initializing inodes, adding inodes to the global map, reading directories section, adding inodes to their parents' map, cache name etc. These steps can be done in a pipeline model to reduce the total duration. > The process of loading image can be done in a pipeline model > > > Key: HDFS-13700 > URL: https://issues.apache.org/jira/browse/HDFS-13700 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: zhouyingchao >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-13700-001.patch > > > The process of loading a file system image involves reading the inodes section, > deserializing inodes, initializing inodes, adding inodes to the global map, > reading the directories section, adding inodes to their parents' maps, caching > names, etc. These steps can be done in a pipeline model to reduce the total > duration. > Testing the patch against an fsimage of a 70PB 2.4 cluster (200 million files and > 300 million blocks; the fsimage is around 22GB), the image loading time was > reduced from 1210 seconds to 739 seconds. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13700) The process of loading image can be done in a pipeline model
[ https://issues.apache.org/jira/browse/HDFS-13700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-13700: --- Attachment: HDFS-13700.002.patch > The process of loading image can be done in a pipeline model > > > Key: HDFS-13700 > URL: https://issues.apache.org/jira/browse/HDFS-13700 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: zhouyingchao >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-13700-001.patch, HDFS-13700.002.patch > > > The process of loading a file system image involves reading inodes section, > deserializing inodes, initializing inodes, adding inodes to the global map, > reading directories section, adding inodes to their parents' map, cache name > etc. These steps can be done in a pipeline model to reduce the total > duration. > Test the patch against a fsimage of a 70PB 2.4 cluster (200million files and > 300million blocks, the fsimage is around 22GB), the image loading time be > reduced from 1210 seconds to 739 seconds. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13700) The process of loading image can be done in a pipeline model
[ https://issues.apache.org/jira/browse/HDFS-13700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-13700: --- Description: The process of loading a file system image involves reading inodes section, deserializing inodes, initializing inodes, adding inodes to the global map, reading directories section, adding inodes to their parents' map, cache name etc. These steps can be done in a pipeline model to reduce the total duration. Test the patch against a fsimage of a 70PB cluster (200million files and 300million blocks, the fsimage is around 22GB), the image loading time be reduced from 1210 seconds to 739 seconds. was: The process of loading a file system image involves reading inodes section, deserializing inodes, initializing inodes, adding inodes to the global map, reading directories section, adding inodes to their parents' map, cache name etc. These steps can be done in a pipeline model to reduce the total duration. Test the patch against a fsimage of a 70PB 2.4 cluster (200million files and 300million blocks, the fsimage is around 22GB), the image loading time be reduced from 1210 seconds to 739 seconds. > The process of loading image can be done in a pipeline model > > > Key: HDFS-13700 > URL: https://issues.apache.org/jira/browse/HDFS-13700 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: zhouyingchao >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-13700-001.patch, HDFS-13700.002.patch > > > The process of loading a file system image involves reading inodes section, > deserializing inodes, initializing inodes, adding inodes to the global map, > reading directories section, adding inodes to their parents' map, cache name > etc. These steps can be done in a pipeline model to reduce the total > duration. > Test the patch against a fsimage of a 70PB cluster (200million files and > 300million blocks, the fsimage is around 22GB), the image loading time be > reduced from 1210 seconds to 739 seconds. 
-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
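The pipeline model described above (read, deserialize, initialize, insert into the map) can be sketched as producer/consumer stages connected by a bounded queue, so one stage works on batch N+1 while the next stage consumes batch N. This is a minimal two-stage illustration under assumed names, not the actual HDFS-13700 patch, which pipelines more stages over protobuf sections:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
  // Sentinel batch that tells the downstream stage its upstream is done.
  private static final List<String> DONE = new ArrayList<>();

  public static List<String> run(List<String> serializedInodes) {
    // Bounded queue: backpressure keeps memory use flat if stages run at
    // different speeds.
    BlockingQueue<List<String>> parsed = new ArrayBlockingQueue<>(4);
    List<String> loaded = new ArrayList<>();

    // Stage 1: "deserialize" inode records while stage 2 is still consuming.
    Thread deserializer = new Thread(() -> {
      try {
        for (String s : serializedInodes) {
          List<String> batch = new ArrayList<>();
          batch.add("inode:" + s);
          parsed.put(batch);
        }
        parsed.put(DONE);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    deserializer.start();

    // Stage 2: add inodes to the "global map" (here, just a list) as batches arrive.
    try {
      List<String> batch;
      while ((batch = parsed.take()) != DONE) {
        loaded.addAll(batch);
      }
      deserializer.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return loaded;
  }

  public static void main(String[] args) {
    System.out.println(run(List.of("a", "b", "c")));
  }
}
```

Because batches are handed off in order, the downstream stage sees inodes in the same sequence they were serialized, which is the property the addChild optimization in HDFS-13693 also relies on.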
[jira] [Created] (HDFS-14701) Change Log Level to warn in SlotReleaser
Lisheng Sun created HDFS-14701: -- Summary: Change Log Level to warn in SlotReleaser Key: HDFS-14701 URL: https://issues.apache.org/jira/browse/HDFS-14701 Project: Hadoop HDFS Issue Type: Improvement Reporter: Lisheng Sun {code:java} // @Override public void run() { LOG.trace("{}: about to release {}", ShortCircuitCache.this, slot); final DfsClientShm shm = (DfsClientShm)slot.getShm(); final DomainSocket shmSock = shm.getPeer().getDomainSocket(); final String path = shmSock.getPath(); boolean success = false; try (DomainSocket sock = DomainSocket.connect(path); DataOutputStream out = new DataOutputStream( new BufferedOutputStream(sock.getOutputStream( { new Sender(out).releaseShortCircuitFds(slot.getSlotId()); DataInputStream in = new DataInputStream(sock.getInputStream()); ReleaseShortCircuitAccessResponseProto resp = ReleaseShortCircuitAccessResponseProto.parseFrom( PBHelperClient.vintPrefixed(in)); if (resp.getStatus() != Status.SUCCESS) { String error = resp.hasError() ? resp.getError() : "(unknown)"; throw new IOException(resp.getStatus().toString() + ": " + error); } LOG.trace("{}: released {}", this, slot); success = true; } catch (IOException e) { LOG.error(ShortCircuitCache.this + ": failed to release " + "short-circuit shared memory slot " + slot + " by sending " + "ReleaseShortCircuitAccessRequestProto to " + path + ". Closing shared memory segment.", e); } finally { if (success) { shmManager.freeSlot(slot); } else { shm.getEndpointShmManager().shutdown(shm); } } } {code} 2019-08-05,15:28:03,838 ERROR [ShortCircuitCache_SlotReleaser] org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: ShortCircuitCache(0x65849546): failed to release short-circuit shared memory slot Slot(slotIdx=62, shm=DfsClientShm(70593ef8b3d84cba3c2f0a1e81377eb1)) by sending ReleaseShortCircuitAccessRequestProto to /home/work/app/hdfs/c3micloudsrv-hdd/datanode/dn_socket. Closing shared memory segment. 
java.io.IOException: ERROR_INVALID: there is no shared memory segment registered with shmId 70593ef8b3d84cba3c2f0a1e81377eb1 -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14701) Change Log Level to warn in SlotReleaser
[ https://issues.apache.org/jira/browse/HDFS-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14701: --- Description: {code:java} // @Override public void run() { LOG.trace("{}: about to release {}", ShortCircuitCache.this, slot); final DfsClientShm shm = (DfsClientShm)slot.getShm(); final DomainSocket shmSock = shm.getPeer().getDomainSocket(); final String path = shmSock.getPath(); boolean success = false; try (DomainSocket sock = DomainSocket.connect(path); DataOutputStream out = new DataOutputStream( new BufferedOutputStream(sock.getOutputStream( { new Sender(out).releaseShortCircuitFds(slot.getSlotId()); DataInputStream in = new DataInputStream(sock.getInputStream()); ReleaseShortCircuitAccessResponseProto resp = ReleaseShortCircuitAccessResponseProto.parseFrom( PBHelperClient.vintPrefixed(in)); if (resp.getStatus() != Status.SUCCESS) { String error = resp.hasError() ? resp.getError() : "(unknown)"; throw new IOException(resp.getStatus().toString() + ": " + error); } LOG.trace("{}: released {}", this, slot); success = true; } catch (IOException e) { LOG.error(ShortCircuitCache.this + ": failed to release " + "short-circuit shared memory slot " + slot + " by sending " + "ReleaseShortCircuitAccessRequestProto to " + path + ". Closing shared memory segment.", e); } finally { if (success) { shmManager.freeSlot(slot); } else { shm.getEndpointShmManager().shutdown(shm); } } } {code} *exception stack:* {code:java} // code placeholder {code} 2019-08-05,15:28:03,838 ERROR [ShortCircuitCache_SlotReleaser] org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: ShortCircuitCache(0x65849546): failed to release short-circuit shared memory slot Slot(slotIdx=62, shm=DfsClientShm(70593ef8b3d84cba3c2f0a1e81377eb1)) by sending ReleaseShortCircuitAccessRequestProto to /home/work/app/hdfs/c3micloudsrv-hdd/datanode/dn_socket. Closing shared memory segment. 
java.io.IOException: ERROR_INVALID: there is no shared memory segment registered with shmId 70593ef8b3d84cba3c2f0a1e81377eb1 was: {code:java} // @Override public void run() { LOG.trace("{}: about to release {}", ShortCircuitCache.this, slot); final DfsClientShm shm = (DfsClientShm)slot.getShm(); final DomainSocket shmSock = shm.getPeer().getDomainSocket(); final String path = shmSock.getPath(); boolean success = false; try (DomainSocket sock = DomainSocket.connect(path); DataOutputStream out = new DataOutputStream( new BufferedOutputStream(sock.getOutputStream( { new Sender(out).releaseShortCircuitFds(slot.getSlotId()); DataInputStream in = new DataInputStream(sock.getInputStream()); ReleaseShortCircuitAccessResponseProto resp = ReleaseShortCircuitAccessResponseProto.parseFrom( PBHelperClient.vintPrefixed(in)); if (resp.getStatus() != Status.SUCCESS) { String error = resp.hasError() ? resp.getError() : "(unknown)"; throw new IOException(resp.getStatus().toString() + ": " + error); } LOG.trace("{}: released {}", this, slot); success = true; } catch (IOException e) { LOG.error(ShortCircuitCache.this + ": failed to release " + "short-circuit shared memory slot " + slot + " by sending " + "ReleaseShortCircuitAccessRequestProto to " + path + ". Closing shared memory segment.", e); } finally { if (success) { shmManager.freeSlot(slot); } else { shm.getEndpointShmManager().shutdown(shm); } } } {code} 2019-08-05,15:28:03,838 ERROR [ShortCircuitCache_SlotReleaser] org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: ShortCircuitCache(0x65849546): failed to release short-circuit shared memory slot Slot(slotIdx=62, shm=DfsClientShm(70593ef8b3d84cba3c2f0a1e81377eb1)) by sending ReleaseShortCircuitAccessRequestProto to /home/work/app/hdfs/c3micloudsrv-hdd/datanode/dn_socket. Closing shared memory segment. 
java.io.IOException: ERROR_INVALID: there is no shared memory segment registered with shmId 70593ef8b3d84cba3c2f0a1e81377eb1 > Change Log Level to warn in SlotReleaser > > > Key: HDFS-14701 > URL: https://issues.apache.org/jira/browse/HDFS-14701 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Lisheng Sun >Priority: Minor > > {code:java} > // @Override > public void run() { > LOG.trace("{}: about to release {}", ShortCircuitCache.this, slot); > final DfsClientShm shm = (DfsClientShm)slot.getShm(); > final DomainSocket shmSock = shm.getPeer().getDomainSocket(); > final String path = shmSock.getPath(); > boolean success = false; > try (DomainSocket sock = DomainSocket.connect(path); >DataOutputStream out = new DataOutputStr
[jira] [Updated] (HDFS-14701) Change Log Level to warn in SlotReleaser
[ https://issues.apache.org/jira/browse/HDFS-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14701: --- Description: {code:java} // @Override public void run() { LOG.trace("{}: about to release {}", ShortCircuitCache.this, slot); final DfsClientShm shm = (DfsClientShm)slot.getShm(); final DomainSocket shmSock = shm.getPeer().getDomainSocket(); final String path = shmSock.getPath(); boolean success = false; try (DomainSocket sock = DomainSocket.connect(path); DataOutputStream out = new DataOutputStream( new BufferedOutputStream(sock.getOutputStream( { new Sender(out).releaseShortCircuitFds(slot.getSlotId()); DataInputStream in = new DataInputStream(sock.getInputStream()); ReleaseShortCircuitAccessResponseProto resp = ReleaseShortCircuitAccessResponseProto.parseFrom( PBHelperClient.vintPrefixed(in)); if (resp.getStatus() != Status.SUCCESS) { String error = resp.hasError() ? resp.getError() : "(unknown)"; throw new IOException(resp.getStatus().toString() + ": " + error); } LOG.trace("{}: released {}", this, slot); success = true; } catch (IOException e) { LOG.error(ShortCircuitCache.this + ": failed to release " + "short-circuit shared memory slot " + slot + " by sending " + "ReleaseShortCircuitAccessRequestProto to " + path + ". Closing shared memory segment.", e); } finally { if (success) { shmManager.freeSlot(slot); } else { shm.getEndpointShmManager().shutdown(shm); } } } {code} *exception stack:* {code:java} // 2019-08-05,15:28:03,838 ERROR [ShortCircuitCache_SlotReleaser] org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: ShortCircuitCache(0x65849546): failed to release short-circuit shared memory slot Slot(slotIdx=62, shm=DfsClientShm(70593ef8b3d84cba3c2f0a1e81377eb1)) by sending ReleaseShortCircuitAccessRequestProto to /home/work/app/hdfs/c3micloudsrv-hdd/datanode/dn_socket. Closing shared memory segment. 
java.io.IOException: ERROR_INVALID: there is no shared memory segment registered with shmId 70593ef8b3d84cba3c2f0a1e81377eb1 {code} was: {code:java} // @Override public void run() { LOG.trace("{}: about to release {}", ShortCircuitCache.this, slot); final DfsClientShm shm = (DfsClientShm)slot.getShm(); final DomainSocket shmSock = shm.getPeer().getDomainSocket(); final String path = shmSock.getPath(); boolean success = false; try (DomainSocket sock = DomainSocket.connect(path); DataOutputStream out = new DataOutputStream( new BufferedOutputStream(sock.getOutputStream( { new Sender(out).releaseShortCircuitFds(slot.getSlotId()); DataInputStream in = new DataInputStream(sock.getInputStream()); ReleaseShortCircuitAccessResponseProto resp = ReleaseShortCircuitAccessResponseProto.parseFrom( PBHelperClient.vintPrefixed(in)); if (resp.getStatus() != Status.SUCCESS) { String error = resp.hasError() ? resp.getError() : "(unknown)"; throw new IOException(resp.getStatus().toString() + ": " + error); } LOG.trace("{}: released {}", this, slot); success = true; } catch (IOException e) { LOG.error(ShortCircuitCache.this + ": failed to release " + "short-circuit shared memory slot " + slot + " by sending " + "ReleaseShortCircuitAccessRequestProto to " + path + ". Closing shared memory segment.", e); } finally { if (success) { shmManager.freeSlot(slot); } else { shm.getEndpointShmManager().shutdown(shm); } } } {code} *exception stack:* {code:java} // code placeholder {code} 2019-08-05,15:28:03,838 ERROR [ShortCircuitCache_SlotReleaser] org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: ShortCircuitCache(0x65849546): failed to release short-circuit shared memory slot Slot(slotIdx=62, shm=DfsClientShm(70593ef8b3d84cba3c2f0a1e81377eb1)) by sending ReleaseShortCircuitAccessRequestProto to /home/work/app/hdfs/c3micloudsrv-hdd/datanode/dn_socket. Closing shared memory segment. 
java.io.IOException: ERROR_INVALID: there is no shared memory segment registered with shmId 70593ef8b3d84cba3c2f0a1e81377eb1 > Change Log Level to warn in SlotReleaser > > > Key: HDFS-14701 > URL: https://issues.apache.org/jira/browse/HDFS-14701 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Lisheng Sun >Priority: Minor > > {code:java} > // @Override > public void run() { > LOG.trace("{}: about to release {}", ShortCircuitCache.this, slot); > final DfsClientShm shm = (DfsClientShm)slot.getShm(); > final DomainSocket shmSock = shm.getPeer().getDomainSocket(); > final String path = shmSock.getPath(); > boolean success = false; > try (DomainSocket sock = DomainSocket.connect(path); >
[jira] [Updated] (HDFS-14701) Change Log Level to warn in SlotReleaser
[ https://issues.apache.org/jira/browse/HDFS-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14701: --- Description: If the corresponding DataNode has been stopped or restarted and the DFSClient closes the shared memory segment, the releaseShortCircuitFds API throws an exception and logs an ERROR message. I think it should not be an ERROR log; a WARN log is more reasonable. {code:java} @Override public void run() { LOG.trace("{}: about to release {}", ShortCircuitCache.this, slot); final DfsClientShm shm = (DfsClientShm)slot.getShm(); final DomainSocket shmSock = shm.getPeer().getDomainSocket(); final String path = shmSock.getPath(); boolean success = false; try (DomainSocket sock = DomainSocket.connect(path); DataOutputStream out = new DataOutputStream( new BufferedOutputStream(sock.getOutputStream()))) { new Sender(out).releaseShortCircuitFds(slot.getSlotId()); DataInputStream in = new DataInputStream(sock.getInputStream()); ReleaseShortCircuitAccessResponseProto resp = ReleaseShortCircuitAccessResponseProto.parseFrom( PBHelperClient.vintPrefixed(in)); if (resp.getStatus() != Status.SUCCESS) { String error = resp.hasError() ? resp.getError() : "(unknown)"; throw new IOException(resp.getStatus().toString() + ": " + error); } LOG.trace("{}: released {}", this, slot); success = true; } catch (IOException e) { LOG.error(ShortCircuitCache.this + ": failed to release " + "short-circuit shared memory slot " + slot + " by sending " + "ReleaseShortCircuitAccessRequestProto to " + path + ". 
Closing shared memory segment.", e); } finally { if (success) { shmManager.freeSlot(slot); } else { shm.getEndpointShmManager().shutdown(shm); } } } {code} *exception stack:* {code:java} 2019-08-05,15:28:03,838 ERROR [ShortCircuitCache_SlotReleaser] org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: ShortCircuitCache(0x65849546): failed to release short-circuit shared memory slot Slot(slotIdx=62, shm=DfsClientShm(70593ef8b3d84cba3c2f0a1e81377eb1)) by sending ReleaseShortCircuitAccessRequestProto to /home/work/app/hdfs/c3micloudsrv-hdd/datanode/dn_socket. Closing shared memory segment. java.io.IOException: ERROR_INVALID: there is no shared memory segment registered with shmId 70593ef8b3d84cba3c2f0a1e81377eb1 {code} was: {code:java} // @Override public void run() { LOG.trace("{}: about to release {}", ShortCircuitCache.this, slot); final DfsClientShm shm = (DfsClientShm)slot.getShm(); final DomainSocket shmSock = shm.getPeer().getDomainSocket(); final String path = shmSock.getPath(); boolean success = false; try (DomainSocket sock = DomainSocket.connect(path); DataOutputStream out = new DataOutputStream( new BufferedOutputStream(sock.getOutputStream( { new Sender(out).releaseShortCircuitFds(slot.getSlotId()); DataInputStream in = new DataInputStream(sock.getInputStream()); ReleaseShortCircuitAccessResponseProto resp = ReleaseShortCircuitAccessResponseProto.parseFrom( PBHelperClient.vintPrefixed(in)); if (resp.getStatus() != Status.SUCCESS) { String error = resp.hasError() ? resp.getError() : "(unknown)"; throw new IOException(resp.getStatus().toString() + ": " + error); } LOG.trace("{}: released {}", this, slot); success = true; } catch (IOException e) { LOG.error(ShortCircuitCache.this + ": failed to release " + "short-circuit shared memory slot " + slot + " by sending " + "ReleaseShortCircuitAccessRequestProto to " + path + ". 
Closing shared memory segment.", e); } finally { if (success) { shmManager.freeSlot(slot); } else { shm.getEndpointShmManager().shutdown(shm); } } } {code} *exception stack:* {code:java} // 2019-08-05,15:28:03,838 ERROR [ShortCircuitCache_SlotReleaser] org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: ShortCircuitCache(0x65849546): failed to release short-circuit shared memory slot Slot(slotIdx=62, shm=DfsClientShm(70593ef8b3d84cba3c2f0a1e81377eb1)) by sending ReleaseShortCircuitAccessRequestProto to /home/work/app/hdfs/c3micloudsrv-hdd/datanode/dn_socket. Closing shared memory segment. java.io.IOException: ERROR_INVALID: there is no shared memory segment registered with shmId 70593ef8b3d84cba3c2f0a1e81377eb1 {code} > Change Log Level to warn in SlotReleaser > > > Key: HDFS-14701 > URL: https://issues.apache.org/jira/browse/HDFS-14701 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Lisheng Sun >Priority: Minor > > if the corresponding DataNode has been stopped or restarted and DFSClient > close shared memory segment,releaseShortCircuitFds API throw ex
[jira] [Updated] (HDFS-14701) Change Log Level to warn in SlotReleaser
[ https://issues.apache.org/jira/browse/HDFS-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14701: --- Assignee: Lisheng Sun Attachment: HDFS-14701.001.patch Status: Patch Available (was: Open) > Change Log Level to warn in SlotReleaser > > > Key: HDFS-14701 > URL: https://issues.apache.org/jira/browse/HDFS-14701 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Minor > Attachments: HDFS-14701.001.patch > > > if the corresponding DataNode has been stopped or restarted and DFSClient > close shared memory segment,releaseShortCircuitFds API throw expection and > log a ERROR Message. I think it should not be a ERROR log,and that log a warn > log is more reasonable. > {code:java} > // @Override > public void run() { > LOG.trace("{}: about to release {}", ShortCircuitCache.this, slot); > final DfsClientShm shm = (DfsClientShm)slot.getShm(); > final DomainSocket shmSock = shm.getPeer().getDomainSocket(); > final String path = shmSock.getPath(); > boolean success = false; > try (DomainSocket sock = DomainSocket.connect(path); >DataOutputStream out = new DataOutputStream( >new BufferedOutputStream(sock.getOutputStream( { > new Sender(out).releaseShortCircuitFds(slot.getSlotId()); > DataInputStream in = new DataInputStream(sock.getInputStream()); > ReleaseShortCircuitAccessResponseProto resp = > ReleaseShortCircuitAccessResponseProto.parseFrom( > PBHelperClient.vintPrefixed(in)); > if (resp.getStatus() != Status.SUCCESS) { > String error = resp.hasError() ? resp.getError() : "(unknown)"; > throw new IOException(resp.getStatus().toString() + ": " + error); > } > LOG.trace("{}: released {}", this, slot); > success = true; > } catch (IOException e) { > LOG.error(ShortCircuitCache.this + ": failed to release " + > "short-circuit shared memory slot " + slot + " by sending " + > "ReleaseShortCircuitAccessRequestProto to " + path + > ". 
Closing shared memory segment.", e); > } finally { > if (success) { > shmManager.freeSlot(slot); > } else { > shm.getEndpointShmManager().shutdown(shm); > } > } > } > {code} > *exception stack:* > {code:java} > 2019-08-05,15:28:03,838 ERROR [ShortCircuitCache_SlotReleaser] > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: > ShortCircuitCache(0x65849546): failed to release short-circuit shared memory > slot Slot(slotIdx=62, shm=DfsClientShm(70593ef8b3d84cba3c2f0a1e81377eb1)) by > sending ReleaseShortCircuitAccessRequestProto to > /home/work/app/hdfs/c3micloudsrv-hdd/datanode/dn_socket. Closing shared > memory segment. > java.io.IOException: ERROR_INVALID: there is no shared memory segment > registered with shmId 70593ef8b3d84cba3c2f0a1e81377eb1 > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
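The proposed fix above is a one-line demotion of the log call in the catch block from ERROR to WARN, since a restarted DataNode legitimately forgets its shared memory segments and the finally block already recovers by shutting down the client-side shm. The real code uses SLF4J; the sketch below models the decision with java.util.logging and an illustrative helper, purely to show the intent:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class SlotReleaserLogging {
  private static final Logger LOG = Logger.getLogger("ShortCircuitCache");

  // Before the patch the catch block always logged SEVERE (ERROR in SLF4J).
  // An expected condition, such as the DataNode having restarted and dropped
  // the segment, should only be a WARNING: the client recovers on its own.
  static Level levelFor(boolean expectedCondition) {
    return expectedCondition ? Level.WARNING : Level.SEVERE;
  }

  public static void main(String[] args) {
    boolean dataNodeRestarted = true; // hypothetical scenario flag
    LOG.log(levelFor(dataNodeRestarted),
        "failed to release short-circuit shared memory slot; "
            + "closing shared memory segment");
  }
}
```

The actual patch is simpler still (LOG.error -> LOG.warn unconditionally); the helper only makes the "expected vs. unexpected failure" reasoning explicit.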
[jira] [Updated] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14313: --- Attachment: HDFS-14313.011.patch > Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory > instead of df/du > > > Key: HDFS-14313 > URL: https://issues.apache.org/jira/browse/HDFS-14313 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, performance >Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.0 >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14313.000.patch, HDFS-14313.001.patch, > HDFS-14313.002.patch, HDFS-14313.003.patch, HDFS-14313.004.patch, > HDFS-14313.005.patch, HDFS-14313.006.patch, HDFS-14313.007.patch, > HDFS-14313.008.patch, HDFS-14313.009.patch, HDFS-14313.010.patch, > HDFS-14313.011.patch > > > There are two ways of DU/DF getting used space that are insufficient. > # Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. > # Running DF is inaccurate when the disk sharing by multiple datanode or > other servers. > Getting hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfos in memory > is very small and accurate. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14701) Change Log Level to warn in SlotReleaser
[ https://issues.apache.org/jira/browse/HDFS-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900119#comment-16900119 ] Lisheng Sun commented on HDFS-14701: hi [~ayushtkn] Could you mind help review this patch? Thank you. > Change Log Level to warn in SlotReleaser > > > Key: HDFS-14701 > URL: https://issues.apache.org/jira/browse/HDFS-14701 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Minor > Attachments: HDFS-14701.001.patch > > > if the corresponding DataNode has been stopped or restarted and DFSClient > close shared memory segment,releaseShortCircuitFds API throw expection and > log a ERROR Message. I think it should not be a ERROR log,and that log a warn > log is more reasonable. > {code:java} > // @Override > public void run() { > LOG.trace("{}: about to release {}", ShortCircuitCache.this, slot); > final DfsClientShm shm = (DfsClientShm)slot.getShm(); > final DomainSocket shmSock = shm.getPeer().getDomainSocket(); > final String path = shmSock.getPath(); > boolean success = false; > try (DomainSocket sock = DomainSocket.connect(path); >DataOutputStream out = new DataOutputStream( >new BufferedOutputStream(sock.getOutputStream( { > new Sender(out).releaseShortCircuitFds(slot.getSlotId()); > DataInputStream in = new DataInputStream(sock.getInputStream()); > ReleaseShortCircuitAccessResponseProto resp = > ReleaseShortCircuitAccessResponseProto.parseFrom( > PBHelperClient.vintPrefixed(in)); > if (resp.getStatus() != Status.SUCCESS) { > String error = resp.hasError() ? resp.getError() : "(unknown)"; > throw new IOException(resp.getStatus().toString() + ": " + error); > } > LOG.trace("{}: released {}", this, slot); > success = true; > } catch (IOException e) { > LOG.error(ShortCircuitCache.this + ": failed to release " + > "short-circuit shared memory slot " + slot + " by sending " + > "ReleaseShortCircuitAccessRequestProto to " + path + > ". 
Closing shared memory segment.", e); > } finally { > if (success) { > shmManager.freeSlot(slot); > } else { > shm.getEndpointShmManager().shutdown(shm); > } > } > } > {code} > *exception stack:* > {code:java} > 2019-08-05,15:28:03,838 ERROR [ShortCircuitCache_SlotReleaser] > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: > ShortCircuitCache(0x65849546): failed to release short-circuit shared memory > slot Slot(slotIdx=62, shm=DfsClientShm(70593ef8b3d84cba3c2f0a1e81377eb1)) by > sending ReleaseShortCircuitAccessRequestProto to > /home/work/app/hdfs/c3micloudsrv-hdd/datanode/dn_socket. Closing shared > memory segment. > java.io.IOException: ERROR_INVALID: there is no shared memory segment > registered with shmId 70593ef8b3d84cba3c2f0a1e81377eb1 > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-14701) Change Log Level to warn in SlotReleaser
[ https://issues.apache.org/jira/browse/HDFS-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900119#comment-16900119 ] Lisheng Sun edited comment on HDFS-14701 at 8/5/19 2:32 PM: hi [~ayushtkn] [~xkrogen] [~brahmareddy].Could you mind help review this patch? Thank you. was (Author: leosun08): hi [~ayushtkn] Could you mind help review this patch? Thank you. > Change Log Level to warn in SlotReleaser > > > Key: HDFS-14701 > URL: https://issues.apache.org/jira/browse/HDFS-14701 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Minor > Attachments: HDFS-14701.001.patch > > > if the corresponding DataNode has been stopped or restarted and DFSClient > close shared memory segment,releaseShortCircuitFds API throw expection and > log a ERROR Message. I think it should not be a ERROR log,and that log a warn > log is more reasonable. > {code:java} > // @Override > public void run() { > LOG.trace("{}: about to release {}", ShortCircuitCache.this, slot); > final DfsClientShm shm = (DfsClientShm)slot.getShm(); > final DomainSocket shmSock = shm.getPeer().getDomainSocket(); > final String path = shmSock.getPath(); > boolean success = false; > try (DomainSocket sock = DomainSocket.connect(path); >DataOutputStream out = new DataOutputStream( >new BufferedOutputStream(sock.getOutputStream( { > new Sender(out).releaseShortCircuitFds(slot.getSlotId()); > DataInputStream in = new DataInputStream(sock.getInputStream()); > ReleaseShortCircuitAccessResponseProto resp = > ReleaseShortCircuitAccessResponseProto.parseFrom( > PBHelperClient.vintPrefixed(in)); > if (resp.getStatus() != Status.SUCCESS) { > String error = resp.hasError() ? 
resp.getError() : "(unknown)"; > throw new IOException(resp.getStatus().toString() + ": " + error); > } > LOG.trace("{}: released {}", this, slot); > success = true; > } catch (IOException e) { > LOG.error(ShortCircuitCache.this + ": failed to release " + > "short-circuit shared memory slot " + slot + " by sending " + > "ReleaseShortCircuitAccessRequestProto to " + path + > ". Closing shared memory segment.", e); > } finally { > if (success) { > shmManager.freeSlot(slot); > } else { > shm.getEndpointShmManager().shutdown(shm); > } > } > } > {code} > *exception stack:* > {code:java} > 2019-08-05,15:28:03,838 ERROR [ShortCircuitCache_SlotReleaser] > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: > ShortCircuitCache(0x65849546): failed to release short-circuit shared memory > slot Slot(slotIdx=62, shm=DfsClientShm(70593ef8b3d84cba3c2f0a1e81377eb1)) by > sending ReleaseShortCircuitAccessRequestProto to > /home/work/app/hdfs/c3micloudsrv-hdd/datanode/dn_socket. Closing shared > memory segment. > java.io.IOException: ERROR_INVALID: there is no shared memory segment > registered with shmId 70593ef8b3d84cba3c2f0a1e81377eb1 > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900515#comment-16900515 ] Lisheng Sun commented on HDFS-14313: Thanks [~linyiqun] for your thorough review. I updated the patch per your comments. Checkstyle issue: {code:java} ./hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/FSCachingGetSpaceUsed.java:64: public Builder setBpid(String bpid) {:35: 'bpid' hides a field. [HiddenField] {code} I think it is not a problem. I have uploaded the v11 patch. Could you help review it? Thank you. > Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory > instead of df/du > > > Key: HDFS-14313 > URL: https://issues.apache.org/jira/browse/HDFS-14313 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, performance >Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.0 >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14313.000.patch, HDFS-14313.001.patch, > HDFS-14313.002.patch, HDFS-14313.003.patch, HDFS-14313.004.patch, > HDFS-14313.005.patch, HDFS-14313.006.patch, HDFS-14313.007.patch, > HDFS-14313.008.patch, HDFS-14313.009.patch, HDFS-14313.010.patch, > HDFS-14313.011.patch > > > The two existing ways of getting used space, DU and DF, are both insufficient. > # Running DU across lots of disks is very expensive, and running all of the > processes at the same time creates a noticeable IO spike. > # Running DF is inaccurate when the disk is shared by multiple datanodes or > other servers. > Getting hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfos in memory > has very small overhead and is accurate.
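The in-memory approach described above amounts to summing replica lengths that the DataNode already tracks. A minimal sketch, assuming a plain map as a hypothetical stand-in for `FsDatasetImpl#volumeMap` (the names below are illustrative, not the real Hadoop API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for FsDatasetImpl#volumeMap:
// block id -> bytes the replica occupies on disk.
public class ReplicaUsedSpaceSketch {
    static final Map<Long, Long> replicaMap = new ConcurrentHashMap<>();

    // Used space is the sum of in-memory replica lengths: no du fork,
    // and no df inaccuracy from disks shared with other processes.
    static long getUsedSpace() {
        long used = 0L;
        for (long bytes : replicaMap.values()) {
            used += bytes;
        }
        return used;
    }

    public static void main(String[] args) {
        replicaMap.put(1L, 134217728L); // 128 MB replica
        replicaMap.put(2L, 67108864L);  // 64 MB replica
        System.out.println(getUsedSpace()); // 201326592
    }
}
```

Iterating an in-memory map is cheap compared with forking `du` per volume, which is why the patch caps it with a threshold check rather than a full config knob.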
[jira] [Comment Edited] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900515#comment-16900515 ] Lisheng Sun edited comment on HDFS-14313 at 8/6/19 1:37 AM: Thanks [~linyiqun] for your thorough review. I updated the patch per your comments. Checkstyle issue: {code:java} ./hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/FSCachingGetSpaceUsed.java:64: public Builder setBpid(String bpid) {:35: 'bpid' hides a field. [HiddenField] {code} I think it is not a problem. I have uploaded the v11 patch. Could you help review it? Thank you.
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900563#comment-16900563 ] Lisheng Sun commented on HDFS-14313: Thank you [~jojochuang] for your suggestion. {quote}One thing that's missing out in the v11 is the update to hdfs-default.xml which was missing since v8. {quote} Following [~linyiqun]'s suggestion, I defined hard-coded threshold values, e.g. 1000 ms, in ReplicaCachingGetSpaceUsed, so the config was removed from hdfs-default.xml. {code:java} private static final long DEEP_COPY_REPLICA_THRESHOLD_MS = 50; private static final long REPLICA_CACHING_GET_SPACE_USED_THRESHOLD_MS = 1000; {code} {quote} Additionally there should be an additional config key "fs.getspaceused.classname" in core-default.xml, and state that possible options are org.apache.hadoop.fs.DU (default) org.apache.hadoop.fs.WindowsGetSpaceUsed org.apache.hadoop.fs.DFCachingGetSpaceUsed org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed {quote} Adding ReplicaCachingGetSpaceUsed, which lives in the hdfs module, to core-default.xml in the common module does not seem quite right to me. Please correct me if I am wrong. Thank you again.
[jira] [Updated] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14313: --- Attachment: HDFS-14313.012.patch
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900724#comment-16900724 ] Lisheng Sun commented on HDFS-14313: Thanks [~linyiqun] [~jojochuang] for your good suggestions. I added the key fs.getspaceused.classname and the usage of these four implementation classes to core-default.xml. I have uploaded the v12 patch. Thank you again.
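The v12 change described above would add an entry along these lines to core-default.xml. This is a sketch based on the class names listed in the review comments, not the exact text committed:

```xml
<!-- Hypothetical core-default.xml entry; wording is illustrative. -->
<property>
  <name>fs.getspaceused.classname</name>
  <value>org.apache.hadoop.fs.DU</value>
  <description>
    The class used to measure disk usage. Possible options include
    org.apache.hadoop.fs.DU (default),
    org.apache.hadoop.fs.WindowsGetSpaceUsed,
    org.apache.hadoop.fs.DFCachingGetSpaceUsed, and
    org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed.
  </description>
</property>
```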
[jira] [Updated] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14313: --- Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14313: --- Status: Open (was: Patch Available)
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900753#comment-16900753 ] Lisheng Sun commented on HDFS-14313: {quote} HDFS-14313 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {quote} This happened because the patch only updates core-default.xml.
[jira] [Updated] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14313: --- Attachment: HDFS-14313.013.patch
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900863#comment-16900863 ] Lisheng Sun commented on HDFS-14313: Thanks [~linyiqun] for the reminder. I rebased the code, updated the description of the config key "fs.getspaceused.classname" in core-default.xml, and uploaded the v013 patch.
[jira] [Updated] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14313: --- Attachment: HDFS-14313.014.patch
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901146#comment-16901146 ] Lisheng Sun commented on HDFS-14313: Fixed the UT in hadoop.conf.TestCommonConfigurationFields; the other UT failures are unrelated to this patch. Uploaded the v014 patch. Please help review it. Thanks [~linyiqun].
[jira] [Created] (HDFS-14708) TestLargeBlockReport#testBlockReportSucceedsWithLargerLengthLimit fails in trunk
Lisheng Sun created HDFS-14708: -- Summary: TestLargeBlockReport#testBlockReportSucceedsWithLargerLengthLimit fails in trunk Key: HDFS-14708 URL: https://issues.apache.org/jira/browse/HDFS-14708 Project: Hadoop HDFS Issue Type: Bug Reporter: Lisheng Sun {code:java} 2019-08-07 09:56:26,082 [IPC Server handler 7 on default port 49613] INFO ipc.Server (Server.java:logException(2982)) - IPC Server handler 7 on default port 49613, call Call#7 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.blockReport from 127.0.0.1:49618 java.io.IOException: java.lang.IllegalStateException: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.runBlockOp(BlockManager.java:5011) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1581) at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:181) at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:31664) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921) Caused by: java.lang.IllegalStateException: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit. 
at org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder$1.next(BlockListAsLongs.java:424) at org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder$1.next(BlockListAsLongs.java:396) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.reportDiffSorted(BlockManager.java:2952) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:2787) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:2655) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.lambda$blockReport$0(NameNodeRpcServer.java:1582) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.processQueue(BlockManager.java:5089) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.run(BlockManager.java:5068) Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit. at com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:110) at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:755) at com.google.protobuf.CodedInputStream.readRawByte(CodedInputStream.java:769) at com.google.protobuf.CodedInputStream.readRawVarint64(CodedInputStream.java:462) at org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder$1.next(BlockListAsLongs.java:420) ... 8 more {code} Ref :: [https://builds.apache.org/job/PreCommit-HDFS-Build/27416/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt] -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14708) TestLargeBlockReport#testBlockReportSucceedsWithLargerLengthLimit fails in trunk
[ https://issues.apache.org/jira/browse/HDFS-14708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14708: --- Description: {code:java} [ERROR] testBlockReportSucceedsWithLargerLengthLimit(org.apache.hadoop.hdfs.server.datanode.TestLargeBlockReport) Time elapsed: 47.956 s <<< ERROR! org.apache.hadoop.ipc.RemoteException(java.io.IOException): java.lang.IllegalStateException: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.runBlockOp(BlockManager.java:5011) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1581) at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:181) at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:31664) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921) Caused by: java.lang.IllegalStateException: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit. 
at org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder$1.next(BlockListAsLongs.java:424) at org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder$1.next(BlockListAsLongs.java:396) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.reportDiffSorted(BlockManager.java:2952) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:2787) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:2655) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.lambda$blockReport$0(NameNodeRpcServer.java:1582) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.processQueue(BlockManager.java:5089) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.run(BlockManager.java:5068) Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit. at com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:110) at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:755) at com.google.protobuf.CodedInputStream.readRawByte(CodedInputStream.java:769) at com.google.protobuf.CodedInputStream.readRawVarint64(CodedInputStream.java:462) at org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder$1.next(BlockListAsLongs.java:420) ... 
8 more at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553) at org.apache.hadoop.ipc.Client.call(Client.java:1499) at org.apache.hadoop.ipc.Client.call(Client.java:1396) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) at com.sun.proxy.$Proxy25.blockReport(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:218) at org.apache.hadoop.hdfs.server.datanode.TestLargeBlockReport.testBlockReportSucceedsWithLargerLengthLimit(TestLargeBlockReport.java:97) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExp
[jira] [Assigned] (HDFS-14708) TestLargeBlockReport#testBlockReportSucceedsWithLargerLengthLimit fails in trunk
[ https://issues.apache.org/jira/browse/HDFS-14708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun reassigned HDFS-14708: -- Assignee: Lisheng Sun > TestLargeBlockReport#testBlockReportSucceedsWithLargerLengthLimit fails in > trunk > > > Key: HDFS-14708 > URL: https://issues.apache.org/jira/browse/HDFS-14708 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Minor > > {code:java} > [ERROR] > testBlockReportSucceedsWithLargerLengthLimit(org.apache.hadoop.hdfs.server.datanode.TestLargeBlockReport) > Time elapsed: 47.956 s <<< ERROR! > org.apache.hadoop.ipc.RemoteException(java.io.IOException): > java.lang.IllegalStateException: > com.google.protobuf.InvalidProtocolBufferException: Protocol message was too > large. May be malicious. Use CodedInputStream.setSizeLimit() to increase > the size limit. > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.runBlockOp(BlockManager.java:5011) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1581) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:181) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:31664) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921) > Caused by: java.lang.IllegalStateException: > 
com.google.protobuf.InvalidProtocolBufferException: Protocol message was too > large. May be malicious. Use CodedInputStream.setSizeLimit() to increase > the size limit. > at > org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder$1.next(BlockListAsLongs.java:424) > at > org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder$1.next(BlockListAsLongs.java:396) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.reportDiffSorted(BlockManager.java:2952) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:2787) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:2655) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.lambda$blockReport$0(NameNodeRpcServer.java:1582) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.processQueue(BlockManager.java:5089) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.run(BlockManager.java:5068) > Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol > message was too large. May be malicious. Use > CodedInputStream.setSizeLimit() to increase the size limit. > at > com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:110) > at > com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:755) > at > com.google.protobuf.CodedInputStream.readRawByte(CodedInputStream.java:769) > at > com.google.protobuf.CodedInputStream.readRawVarint64(CodedInputStream.java:462) > at > org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder$1.next(BlockListAsLongs.java:420) > ... 
8 more > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553) > at org.apache.hadoop.ipc.Client.call(Client.java:1499) > at org.apache.hadoop.ipc.Client.call(Client.java:1396) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy25.blockReport(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:218) > at > org.apache.hadoop.hdfs.server.datanode.TestLargeBlockReport.testBlockReportSucceedsWithLargerLengthLimit(TestLargeBlockReport.java:97) > at sun.reflect.NativeMethodAcces
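The root cause in the trace above is protobuf's message size guard: once a parser has consumed more bytes than its configured limit, it aborts with "Protocol message was too large." A minimal, self-contained sketch of that guard (a hypothetical class, not the real com.google.protobuf.CodedInputStream) shows why an oversized block report fails until the limit is raised:

```java
// Hedged sketch of protobuf's size guard (hypothetical stand-in class, not
// the real CodedInputStream): reading past the configured limit fails the
// same way the block report in the stack trace does.
public class SizeLimitSketch {
    // 64 MB default, matching protobuf's historical default limit (assumption).
    private int sizeLimit = 64 * 1024 * 1024;
    private int bytesRead = 0;

    public void setSizeLimit(int limit) {
        sizeLimit = limit;
    }

    // Consume n bytes of the message; throw once the running total exceeds
    // the limit, mirroring how sizeLimitExceeded() surfaced (wrapped in an
    // IllegalStateException) in the trace above.
    public void read(int n) {
        bytesRead += n;
        if (bytesRead > sizeLimit) {
            throw new IllegalStateException(
                "Protocol message was too large. May be malicious. "
                + "Use CodedInputStream.setSizeLimit() to increase the size limit.");
        }
    }
}
```

A fix on the decoding side would call setSizeLimit() with a larger value before parsing the report, which is what the exception message itself suggests.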
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901736#comment-16901736 ] Lisheng Sun commented on HDFS-14313: Thanks [~linyiqun] for all your work on this patch. I will attach the patches for the branch-3.x and branch-2.x branches later. > Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory > instead of df/du > > > Key: HDFS-14313 > URL: https://issues.apache.org/jira/browse/HDFS-14313 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, performance >Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.0 >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14313.000.patch, HDFS-14313.001.patch, > HDFS-14313.002.patch, HDFS-14313.003.patch, HDFS-14313.004.patch, > HDFS-14313.005.patch, HDFS-14313.006.patch, HDFS-14313.007.patch, > HDFS-14313.008.patch, HDFS-14313.009.patch, HDFS-14313.010.patch, > HDFS-14313.011.patch, HDFS-14313.012.patch, HDFS-14313.013.patch, > HDFS-14313.014.patch > > > Both of the DU/DF ways of getting used space are insufficient. > # Running DU across lots of disks is very expensive, and running all of the > processes at the same time creates a noticeable IO spike. > # Running DF is inaccurate when the disk is shared by multiple DataNodes or > other servers. > Getting HDFS used space from the FsDatasetImpl#volumeMap#ReplicaInfos in memory > is very cheap and accurate. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
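The idea behind the patch can be sketched with simplified, hypothetical types (the real FsDatasetImpl/ReplicaInfo API differs): used space is just the sum of replica lengths already tracked in memory, so no du process needs to be forked per disk and no df sharing inaccuracy comes into play.

```java
// Hedged sketch (invented types, not the actual FsDatasetImpl API): compute
// per-volume used space by summing the on-disk lengths of replicas tracked
// in an in-memory map, instead of running du/df against the disk.
import java.util.HashMap;
import java.util.Map;

public class InMemoryUsedSpace {
    // volume id -> (block id -> replica length in bytes), standing in for volumeMap
    private final Map<String, Map<Long, Long>> volumeMap = new HashMap<>();

    public void addReplica(String volume, long blockId, long numBytes) {
        volumeMap.computeIfAbsent(volume, v -> new HashMap<>()).put(blockId, numBytes);
    }

    // O(replicas on volume), all in memory: no process spawn, no IO spike,
    // and the result reflects exactly the replicas this DataNode owns.
    public long getDfsUsed(String volume) {
        Map<Long, Long> replicas = volumeMap.get(volume);
        if (replicas == null) {
            return 0L;
        }
        long used = 0L;
        for (long len : replicas.values()) {
            used += len;
        }
        return used;
    }
}
```

In the real code the map would also have to account for metadata files and in-flight writes; this sketch only illustrates why the in-memory sum is both cheap and immune to other tenants of a shared disk.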
[jira] [Updated] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14313: --- Attachment: HDFS-14313.branch-3.v1.patch > Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory > instead of df/du > > > Key: HDFS-14313 > URL: https://issues.apache.org/jira/browse/HDFS-14313 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, performance >Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.0 >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14313.000.patch, HDFS-14313.001.patch, > HDFS-14313.002.patch, HDFS-14313.003.patch, HDFS-14313.004.patch, > HDFS-14313.005.patch, HDFS-14313.006.patch, HDFS-14313.007.patch, > HDFS-14313.008.patch, HDFS-14313.009.patch, HDFS-14313.010.patch, > HDFS-14313.011.patch, HDFS-14313.012.patch, HDFS-14313.013.patch, > HDFS-14313.014.patch, HDFS-14313.branch-3.v1.patch > > > There are two ways of DU/DF getting used space that are insufficient. > # Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. > # Running DF is inaccurate when the disk sharing by multiple datanode or > other servers. > Getting hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfos in memory > is very small and accurate. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901778#comment-16901778 ] Lisheng Sun commented on HDFS-14313: I attached a patch for branch-3. Do you want me to attach patches for branch-3.0, branch-3.1, branch-3.2 and branch-2.9 instead of branch-3 and branch-2? I think I misunderstood. > Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory > instead of df/du > > > Key: HDFS-14313 > URL: https://issues.apache.org/jira/browse/HDFS-14313 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, performance >Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.0 >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14313.000.patch, HDFS-14313.001.patch, > HDFS-14313.002.patch, HDFS-14313.003.patch, HDFS-14313.004.patch, > HDFS-14313.005.patch, HDFS-14313.006.patch, HDFS-14313.007.patch, > HDFS-14313.008.patch, HDFS-14313.009.patch, HDFS-14313.010.patch, > HDFS-14313.011.patch, HDFS-14313.012.patch, HDFS-14313.013.patch, > HDFS-14313.014.patch, HDFS-14313.branch-3.v1.patch > > > Both of the DU/DF ways of getting used space are insufficient. > # Running DU across lots of disks is very expensive, and running all of the > processes at the same time creates a noticeable IO spike. > # Running DF is inaccurate when the disk is shared by multiple DataNodes or > other servers. > Getting HDFS used space from the FsDatasetImpl#volumeMap#ReplicaInfos in memory > is very cheap and accurate. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14313: --- Attachment: HDFS-14313.branch-3.0.v1.patch > Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory > instead of df/du > > > Key: HDFS-14313 > URL: https://issues.apache.org/jira/browse/HDFS-14313 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, performance >Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.0 >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14313.000.patch, HDFS-14313.001.patch, > HDFS-14313.002.patch, HDFS-14313.003.patch, HDFS-14313.004.patch, > HDFS-14313.005.patch, HDFS-14313.006.patch, HDFS-14313.007.patch, > HDFS-14313.008.patch, HDFS-14313.009.patch, HDFS-14313.010.patch, > HDFS-14313.011.patch, HDFS-14313.012.patch, HDFS-14313.013.patch, > HDFS-14313.014.patch, HDFS-14313.branch-3.0.v1.patch, > HDFS-14313.branch-3.v1.patch > > > There are two ways of DU/DF getting used space that are insufficient. > # Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. > # Running DF is inaccurate when the disk sharing by multiple datanode or > other servers. > Getting hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfos in memory > is very small and accurate. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901783#comment-16901783 ] Lisheng Sun commented on HDFS-14313: Thanks [~linyiqun] for your suggestion. I renamed the branch-3.x patch and uploaded the branch-3.0.v1 patch. > Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory > instead of df/du > > > Key: HDFS-14313 > URL: https://issues.apache.org/jira/browse/HDFS-14313 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, performance >Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.0 >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14313.000.patch, HDFS-14313.001.patch, > HDFS-14313.002.patch, HDFS-14313.003.patch, HDFS-14313.004.patch, > HDFS-14313.005.patch, HDFS-14313.006.patch, HDFS-14313.007.patch, > HDFS-14313.008.patch, HDFS-14313.009.patch, HDFS-14313.010.patch, > HDFS-14313.011.patch, HDFS-14313.012.patch, HDFS-14313.013.patch, > HDFS-14313.014.patch, HDFS-14313.branch-3.0.v1.patch, > HDFS-14313.branch-3.v1.patch > > > Both of the DU/DF ways of getting used space are insufficient. > # Running DU across lots of disks is very expensive, and running all of the > processes at the same time creates a noticeable IO spike. > # Running DF is inaccurate when the disk is shared by multiple DataNodes or > other servers. > Getting HDFS used space from the FsDatasetImpl#volumeMap#ReplicaInfos in memory > is very cheap and accurate. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14313: --- Attachment: HDFS-14313-branch-2.v1.patch > Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory > instead of df/du > > > Key: HDFS-14313 > URL: https://issues.apache.org/jira/browse/HDFS-14313 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, performance >Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.0 >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14313-branch-2.v1.patch, HDFS-14313.000.patch, > HDFS-14313.001.patch, HDFS-14313.002.patch, HDFS-14313.003.patch, > HDFS-14313.004.patch, HDFS-14313.005.patch, HDFS-14313.006.patch, > HDFS-14313.007.patch, HDFS-14313.008.patch, HDFS-14313.009.patch, > HDFS-14313.010.patch, HDFS-14313.011.patch, HDFS-14313.012.patch, > HDFS-14313.013.patch, HDFS-14313.014.patch, HDFS-14313.branch-3.0.v1.patch, > HDFS-14313.branch-3.0.v2.patch, HDFS-14313.branch-3.v1.patch > > > There are two ways of DU/DF getting used space that are insufficient. > # Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. > # Running DF is inaccurate when the disk sharing by multiple datanode or > other servers. > Getting hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfos in memory > is very small and accurate. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14313: --- Attachment: HDFS-14313.branch-3.0.v2.patch > Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory > instead of df/du > > > Key: HDFS-14313 > URL: https://issues.apache.org/jira/browse/HDFS-14313 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, performance >Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.0 >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14313.000.patch, HDFS-14313.001.patch, > HDFS-14313.002.patch, HDFS-14313.003.patch, HDFS-14313.004.patch, > HDFS-14313.005.patch, HDFS-14313.006.patch, HDFS-14313.007.patch, > HDFS-14313.008.patch, HDFS-14313.009.patch, HDFS-14313.010.patch, > HDFS-14313.011.patch, HDFS-14313.012.patch, HDFS-14313.013.patch, > HDFS-14313.014.patch, HDFS-14313.branch-3.0.v1.patch, > HDFS-14313.branch-3.0.v2.patch, HDFS-14313.branch-3.v1.patch > > > There are two ways of DU/DF getting used space that are insufficient. > # Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. > # Running DF is inaccurate when the disk sharing by multiple datanode or > other servers. > Getting hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfos in memory > is very small and accurate. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14701) Change Log Level to warn in SlotReleaser
[ https://issues.apache.org/jira/browse/HDFS-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14701: --- Attachment: HDFS-14701.002.patch > Change Log Level to warn in SlotReleaser > > > Key: HDFS-14701 > URL: https://issues.apache.org/jira/browse/HDFS-14701 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Minor > Attachments: HDFS-14701.001.patch, HDFS-14701.002.patch > > > If the corresponding DataNode has been stopped or restarted and the DFSClient > closes the shared memory segment, the releaseShortCircuitFds API throws an exception and > logs an ERROR message. I think it should not be an ERROR log; a WARN > log is more reasonable. > {code:java} > @Override > public void run() { > LOG.trace("{}: about to release {}", ShortCircuitCache.this, slot); > final DfsClientShm shm = (DfsClientShm)slot.getShm(); > final DomainSocket shmSock = shm.getPeer().getDomainSocket(); > final String path = shmSock.getPath(); > boolean success = false; > try (DomainSocket sock = DomainSocket.connect(path); >DataOutputStream out = new DataOutputStream( >new BufferedOutputStream(sock.getOutputStream()))) { > new Sender(out).releaseShortCircuitFds(slot.getSlotId()); > DataInputStream in = new DataInputStream(sock.getInputStream()); > ReleaseShortCircuitAccessResponseProto resp = > ReleaseShortCircuitAccessResponseProto.parseFrom( > PBHelperClient.vintPrefixed(in)); > if (resp.getStatus() != Status.SUCCESS) { > String error = resp.hasError() ? resp.getError() : "(unknown)"; > throw new IOException(resp.getStatus().toString() + ": " + error); > } > LOG.trace("{}: released {}", this, slot); > success = true; > } catch (IOException e) { > LOG.error(ShortCircuitCache.this + ": failed to release " + > "short-circuit shared memory slot " + slot + " by sending " + > "ReleaseShortCircuitAccessRequestProto to " + path + > ". Closing shared memory segment.", e); > } finally { > if (success) { > shmManager.freeSlot(slot); > } else { > shm.getEndpointShmManager().shutdown(shm); > } > } > } > {code} > *exception stack:* > {code:java} > 2019-08-05,15:28:03,838 ERROR [ShortCircuitCache_SlotReleaser] > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: > ShortCircuitCache(0x65849546): failed to release short-circuit shared memory > slot Slot(slotIdx=62, shm=DfsClientShm(70593ef8b3d84cba3c2f0a1e81377eb1)) by > sending ReleaseShortCircuitAccessRequestProto to > /home/work/app/hdfs/c3micloudsrv-hdd/datanode/dn_socket. Closing shared > memory segment. > java.io.IOException: ERROR_INVALID: there is no shared memory segment > registered with shmId 70593ef8b3d84cba3c2f0a1e81377eb1 > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14701) Change Log Level to warn in SlotReleaser
[ https://issues.apache.org/jira/browse/HDFS-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901929#comment-16901929 ] Lisheng Sun commented on HDFS-14701: Thanks [~jojochuang] for your suggestion. I updated the patch per your comment and uploaded the v002 patch. Could you help review it? Thank you. > Change Log Level to warn in SlotReleaser > > > Key: HDFS-14701 > URL: https://issues.apache.org/jira/browse/HDFS-14701 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Minor > Attachments: HDFS-14701.001.patch, HDFS-14701.002.patch > > > If the corresponding DataNode has been stopped or restarted and the DFSClient > closes the shared memory segment, the releaseShortCircuitFds API throws an exception and > logs an ERROR message. I think it should not be an ERROR log; a WARN > log is more reasonable. > {code:java} > @Override > public void run() { > LOG.trace("{}: about to release {}", ShortCircuitCache.this, slot); > final DfsClientShm shm = (DfsClientShm)slot.getShm(); > final DomainSocket shmSock = shm.getPeer().getDomainSocket(); > final String path = shmSock.getPath(); > boolean success = false; > try (DomainSocket sock = DomainSocket.connect(path); >DataOutputStream out = new DataOutputStream( >new BufferedOutputStream(sock.getOutputStream()))) { > new Sender(out).releaseShortCircuitFds(slot.getSlotId()); > DataInputStream in = new DataInputStream(sock.getInputStream()); > ReleaseShortCircuitAccessResponseProto resp = > ReleaseShortCircuitAccessResponseProto.parseFrom( > PBHelperClient.vintPrefixed(in)); > if (resp.getStatus() != Status.SUCCESS) { > String error = resp.hasError() ? resp.getError() : "(unknown)"; > throw new IOException(resp.getStatus().toString() + ": " + error); > } > LOG.trace("{}: released {}", this, slot); > success = true; > } catch (IOException e) { > LOG.error(ShortCircuitCache.this + ": failed to release " + > "short-circuit shared memory slot " + slot + " by sending " + > "ReleaseShortCircuitAccessRequestProto to " + path + > ". Closing shared memory segment.", e); > } finally { > if (success) { > shmManager.freeSlot(slot); > } else { > shm.getEndpointShmManager().shutdown(shm); > } > } > } > {code} > *exception stack:* > {code:java} > 2019-08-05,15:28:03,838 ERROR [ShortCircuitCache_SlotReleaser] > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: > ShortCircuitCache(0x65849546): failed to release short-circuit shared memory > slot Slot(slotIdx=62, shm=DfsClientShm(70593ef8b3d84cba3c2f0a1e81377eb1)) by > sending ReleaseShortCircuitAccessRequestProto to > /home/work/app/hdfs/c3micloudsrv-hdd/datanode/dn_socket. Closing shared > memory segment. > java.io.IOException: ERROR_INVALID: there is no shared memory segment > registered with shmId 70593ef8b3d84cba3c2f0a1e81377eb1 > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14313: --- Attachment: HDFS-14313-branch-2.v2.patch > Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory > instead of df/du > > > Key: HDFS-14313 > URL: https://issues.apache.org/jira/browse/HDFS-14313 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, performance >Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.0 >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14313-branch-2.v1.patch, > HDFS-14313-branch-2.v2.patch, HDFS-14313.000.patch, HDFS-14313.001.patch, > HDFS-14313.002.patch, HDFS-14313.003.patch, HDFS-14313.004.patch, > HDFS-14313.005.patch, HDFS-14313.006.patch, HDFS-14313.007.patch, > HDFS-14313.008.patch, HDFS-14313.009.patch, HDFS-14313.010.patch, > HDFS-14313.011.patch, HDFS-14313.012.patch, HDFS-14313.013.patch, > HDFS-14313.014.patch, HDFS-14313.branch-3.0.v1.patch, > HDFS-14313.branch-3.0.v2.patch, HDFS-14313.branch-3.v1.patch > > > There are two ways of DU/DF getting used space that are insufficient. > # Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. > # Running DF is inaccurate when the disk sharing by multiple datanode or > other servers. > Getting hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfos in memory > is very small and accurate. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du
[ https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902154#comment-16902154 ] Lisheng Sun commented on HDFS-14313: The branch-2.v2 patch fixes the unchecked warning from javac. It can be ignored. > Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory > instead of df/du > > > Key: HDFS-14313 > URL: https://issues.apache.org/jira/browse/HDFS-14313 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, performance >Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.0 >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14313-branch-2.v1.patch, > HDFS-14313-branch-2.v2.patch, HDFS-14313.000.patch, HDFS-14313.001.patch, > HDFS-14313.002.patch, HDFS-14313.003.patch, HDFS-14313.004.patch, > HDFS-14313.005.patch, HDFS-14313.006.patch, HDFS-14313.007.patch, > HDFS-14313.008.patch, HDFS-14313.009.patch, HDFS-14313.010.patch, > HDFS-14313.011.patch, HDFS-14313.012.patch, HDFS-14313.013.patch, > HDFS-14313.014.patch, HDFS-14313.branch-3.0.v1.patch, > HDFS-14313.branch-3.0.v2.patch, HDFS-14313.branch-3.v1.patch > > > Both of the DU/DF ways of getting used space are insufficient. > # Running DU across lots of disks is very expensive, and running all of the > processes at the same time creates a noticeable IO spike. > # Running DF is inaccurate when the disk is shared by multiple DataNodes or > other servers. > Getting HDFS used space from the FsDatasetImpl#volumeMap#ReplicaInfos in memory > is very cheap and accurate. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13571) Dead DataNode Detector
[ https://issues.apache.org/jira/browse/HDFS-13571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902631#comment-16902631 ] Lisheng Sun commented on HDFS-13571: Sorry [~linyiqun], I have been working on this JIRA. Recently there has been a lot going on at the company, so there is some delay. I will update this JIRA as soon as possible. > Dead DataNode Detector > -- > > Key: HDFS-13571 > URL: https://issues.apache.org/jira/browse/HDFS-13571 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.4.0, 2.6.0, 3.0.2 >Reporter: Gang Xie >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-13571-2.6.diff, node status machine.png > > > Currently, the information about a dead DataNode in DFSInputStream is stored > locally, so it cannot be shared among the input streams of the same > DFSClient. In our production environment, some DataNodes die every day for > various causes. Today, after the first input stream blocks and > detects this, it cannot share this information with the others in the same DFSClient; > thus the other input streams are still blocked by the dead node for some > time, which causes bad service latency. > To eliminate this impact of dead DataNodes, we designed a dead DataNode > detector, which detects the dead ones in advance and shares this information > among all the input streams in the same client. This improvement has been > online for some months and works fine. So we decided to port it to 3.0 (the > versions used in our production environment are 2.4 and 2.6). > I will do the porting work and upload the code later. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
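The sharing idea above can be sketched with invented names (the actual patch's classes may differ): a per-client registry of suspected-dead DataNodes that every DFSInputStream consults, so the first stream's detection spares the others the same timeout.

```java
// Hedged sketch (hypothetical names, not the real patch): a registry of
// suspected-dead DataNodes shared by all DFSInputStreams of one DFSClient.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DeadNodeRegistry {
    // Concurrent set: many input streams read and write this in parallel.
    private final Set<String> deadNodes = ConcurrentHashMap.newKeySet();

    // Called by the stream (or a background detector) that hit the failure.
    public void markDead(String datanodeAddr) {
        deadNodes.add(datanodeAddr);
    }

    // Called when a background re-probe finds the node healthy again.
    public void markAlive(String datanodeAddr) {
        deadNodes.remove(datanodeAddr);
    }

    // Every stream consults this before choosing a replica to read from.
    public boolean isDead(String datanodeAddr) {
        return deadNodes.contains(datanodeAddr);
    }
}
```

The "detect in advance" part of the design would be a background thread that probes entries and calls markAlive/markDead; that state machine (per the attached "node status machine.png") is not sketched here.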
[jira] [Created] (HDFS-14330) Consider StorageID to choose volume
Lisheng Sun created HDFS-14330: -- Summary: Consider StorageID to choose volume Key: HDFS-14330 URL: https://issues.apache.org/jira/browse/HDFS-14330 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 3.0.0-alpha4 Reporter: Lisheng Sun RoundRobinVolumeChoosingPolicy#chooseVolume does not consider the storageId parameter. The {{BlockPlacementPolicy}} considers specific storages and returns the information, including the storageId, to the client. {code:java} @Override public V chooseVolume(final List<V> volumes, long blockSize, String storageId) throws IOException { if (volumes.size() < 1) { throw new DiskOutOfSpaceException("No more available volumes"); } // As all the items in volumes are with the same storage type, // so only need to get the storage type index of the first item in volumes StorageType storageType = volumes.get(0).getStorageType(); int index = storageType != null ? storageType.ordinal() : StorageType.DEFAULT.ordinal(); synchronized (syncLocks[index]) { return chooseVolume(index, volumes, blockSize); } } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
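One possible direction, sketched here under assumed, simplified types (not the actual Hadoop classes or what the JIRA ultimately implemented), is to prefer the volume whose storage ID matches the hint the NameNode already chose, and fall back to plain round-robin otherwise:

```java
// Hedged sketch (hypothetical types): a volume chooser that honors the
// storageId hint from BlockPlacementPolicy, falling back to round-robin
// when no volume matches -- the behavior the issue suggests.
import java.util.List;

public class StorageIdAwareChooser {
    public static class Volume {
        public final String storageId;
        public Volume(String storageId) { this.storageId = storageId; }
    }

    private int next = 0; // round-robin cursor for the fallback path

    public Volume chooseVolume(List<Volume> volumes, String storageId) {
        if (volumes.isEmpty()) {
            throw new IllegalStateException("No more available volumes");
        }
        // Prefer the volume the NameNode already selected for this block.
        for (Volume v : volumes) {
            if (v.storageId.equals(storageId)) {
                return v;
            }
        }
        // No match: fall back to plain round-robin, as the current policy does.
        Volume v = volumes.get(next % volumes.size());
        next++;
        return v;
    }
}
```

A real implementation would also have to keep the per-StorageType locking and the disk-space check of the existing policy; the sketch only shows where the storageId could enter the decision.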
[jira] [Assigned] (HDFS-15046) Backport HDFS-7060 to branch-2.10
[ https://issues.apache.org/jira/browse/HDFS-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun reassigned HDFS-15046: -- Assignee: Lisheng Sun > Backport HDFS-7060 to branch-2.10 > - > > Key: HDFS-15046 > URL: https://issues.apache.org/jira/browse/HDFS-15046 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Wei-Chiu Chuang >Assignee: Lisheng Sun >Priority: Major > > Not sure why it didn't get backported in 2.x before, but looks like a good > improvement overall. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15046) Backport HDFS-7060 to branch-2.10
[ https://issues.apache.org/jira/browse/HDFS-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15046: --- Attachment: HDFS-15046.branch-2.9.001.patch Status: Patch Available (was: Open) > Backport HDFS-7060 to branch-2.10 > - > > Key: HDFS-15046 > URL: https://issues.apache.org/jira/browse/HDFS-15046 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Wei-Chiu Chuang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15046.branch-2.9.001.patch > > > Not sure why it didn't get backported in 2.x before, but looks like a good > improvement overall. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15046) Backport HDFS-7060 to branch-2.10
[ https://issues.apache.org/jira/browse/HDFS-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15046: --- Attachment: HDFS-15046.branch-2.9.002.patch > Backport HDFS-7060 to branch-2.10 > - > > Key: HDFS-15046 > URL: https://issues.apache.org/jira/browse/HDFS-15046 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Wei-Chiu Chuang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15046.branch-2.9.001.patch, > HDFS-15046.branch-2.9.002.patch > > > Not sure why it didn't get backported in 2.x before, but looks like a good > improvement overall. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15046) Backport HDFS-7060 to branch-2.10
[ https://issues.apache.org/jira/browse/HDFS-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15046: --- Attachment: HDFS-15046.branch-2.001.patch > Backport HDFS-7060 to branch-2.10 > - > > Key: HDFS-15046 > URL: https://issues.apache.org/jira/browse/HDFS-15046 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Wei-Chiu Chuang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15046.branch-2.001.patch, > HDFS-15046.branch-2.9.001.patch, HDFS-15046.branch-2.9.002.patch > > > Not sure why it didn't get backported in 2.x before, but looks like a good > improvement overall. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15046) Backport HDFS-7060 to branch-2.10
[ https://issues.apache.org/jira/browse/HDFS-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15046: --- Status: Open (was: Patch Available) > Backport HDFS-7060 to branch-2.10 > - > > Key: HDFS-15046 > URL: https://issues.apache.org/jira/browse/HDFS-15046 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Wei-Chiu Chuang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15046.branch-2.001.patch, > HDFS-15046.branch-2.9.001.patch, HDFS-15046.branch-2.9.002.patch > > > Not sure why it didn't get backported in 2.x before, but looks like a good > improvement overall. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15046) Backport HDFS-7060 to branch-2.10
[ https://issues.apache.org/jira/browse/HDFS-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15046: --- Status: Patch Available (was: Open) > Backport HDFS-7060 to branch-2.10 > - > > Key: HDFS-15046 > URL: https://issues.apache.org/jira/browse/HDFS-15046 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Wei-Chiu Chuang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15046.branch-2.001.patch, > HDFS-15046.branch-2.9.001.patch, HDFS-15046.branch-2.9.002.patch > > > Not sure why it didn't get backported in 2.x before, but looks like a good > improvement overall. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15046) Backport HDFS-7060 to branch-2.10
[ https://issues.apache.org/jira/browse/HDFS-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15046: --- Attachment: HDFS-15046.branch-2.9.002(2).patch > Backport HDFS-7060 to branch-2.10 > - > > Key: HDFS-15046 > URL: https://issues.apache.org/jira/browse/HDFS-15046 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Wei-Chiu Chuang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15046.branch-2.001.patch, > HDFS-15046.branch-2.9.001.patch, HDFS-15046.branch-2.9.002(2).patch, > HDFS-15046.branch-2.9.002.patch > > > Not sure why it didn't get backported in 2.x before, but looks like a good > improvement overall. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15046) Backport HDFS-7060 to branch-2.10
[ https://issues.apache.org/jira/browse/HDFS-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016568#comment-17016568 ] Lisheng Sun commented on HDFS-15046: [~weichiu] Could you help review this patch? Thank you. > Backport HDFS-7060 to branch-2.10 > - > > Key: HDFS-15046 > URL: https://issues.apache.org/jira/browse/HDFS-15046 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Wei-Chiu Chuang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15046.branch-2.001.patch, > HDFS-15046.branch-2.9.001.patch, HDFS-15046.branch-2.9.002(2).patch, > HDFS-15046.branch-2.9.002.patch > > > Not sure why it didn't get backported in 2.x before, but looks like a good > improvement overall. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14651) DeadNodeDetector checks dead node periodically
[ https://issues.apache.org/jira/browse/HDFS-14651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028269#comment-17028269 ] Lisheng Sun commented on HDFS-14651: Thanks [~ahussein] for your questions.
{quote}1. What is the usage of deadNodeDetectInterval? As far as I understand, every call to checkDeadNodes() will change the state to IDLE forcing the DeadNodeDetector to sleep for IDLE_SLEEP_MS. So, why do we need deadNodeDetectInterval if the actual time gap between every check is IDLE_SLEEP_MS?{quote}
The deadNodeDetectInterval check in checkDeadNodes() is not really necessary, since idle() is called after checkDeadNodes().
{quote}2. stopDeadNodeDetectorThread() is supposed to stop the deadNodeDetector thread; but it looks like the implementation of the runnable never terminates. DeadNodeDetector suppresses all interrupts and never checks for a termination flag. Therefore, the caller will just hang for 3 seconds waiting to join.{quote}
{code:java}
  /**
   * Close dead node detector thread.
   */
  public void stopDeadNodeDetectorThread() {
    if (deadNodeDetectorThr != null) {
      deadNodeDetectorThr.interrupt();
      try {
        deadNodeDetectorThr.join(3000);
      } catch (InterruptedException e) {
        LOG.warn("Encountered exception while waiting to join on dead "
            + "node detector thread.", e);
      }
    }
  }
{code}
I will remove the 3s timeout in deadNodeDetectorThr.join() so that the caller waits for the deadNodeDetector thread to stop. I will create issues to fix these two problems. Thank you.
> DeadNodeDetector checks dead node periodically > -- > > Key: HDFS-14651 > URL: https://issues.apache.org/jira/browse/HDFS-14651 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14651.001.patch, HDFS-14651.002.patch, > HDFS-14651.003.patch, HDFS-14651.004.patch, HDFS-14651.005.patch, > HDFS-14651.006.patch, HDFS-14651.007.patch, HDFS-14651.008.patch > > > DeadNodeDetector checks dead node periodically. > DeadNodeDetector periodically detect the Node in DeadNodeDetector#deadnode, > If the access is successful, the Node will be moved from > DeadNodeDetector#deadnode. Continuous detection of the dead node is > necessary. The DataNode need rejoin the cluster due to a service > restart/machine repair. The DataNode may be permanently excluded if there is > no added probe mechanism. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
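The shutdown problem discussed in the comment above can be sketched generically: a background detector loop that checks a termination flag and propagates interrupts, so that a join() returns promptly instead of hanging. This is an illustrative sketch with invented names (DetectorLoop, shutdown, IDLE_SLEEP_MS), not the actual HDFS DeadNodeDetector code.

```java
// Illustrative sketch (not the HDFS implementation): a detector loop that
// honors both a shutdown flag and thread interruption, so a caller's join()
// returns promptly instead of hanging.
public class DetectorLoop implements Runnable {
    private static final long IDLE_SLEEP_MS = 100;
    private volatile boolean shutdown = false;      // termination flag

    public void stop() { shutdown = true; }

    @Override
    public void run() {
        while (!shutdown) {                         // checked on every iteration
            try {
                // ... probe suspected dead nodes here ...
                Thread.sleep(IDLE_SLEEP_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                break;                              // exit instead of suppressing it
            }
        }
    }

    // Starts the loop, stops it, and reports whether the thread terminated.
    public static boolean demo() {
        DetectorLoop loop = new DetectorLoop();
        Thread t = new Thread(loop);
        t.start();
        loop.stop();
        t.interrupt();
        try {
            t.join(3000);                           // returns quickly once the flag is honored
        } catch (InterruptedException e) {
            return false;
        }
        return !t.isAlive();
    }
}
```

With the flag and the interrupt handling in place, a stop method could call join() without any timeout and still not hang.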
[jira] [Created] (HDFS-15161) When evictableMmapped or evictable size is zero, do not throw NoSuchElementException in ShortCircuitCache#close()
Lisheng Sun created HDFS-15161: -- Summary: When evictableMmapped or evictable size is zero, do not throw NoSuchElementException in ShortCircuitCache#close() Key: HDFS-15161 URL: https://issues.apache.org/jira/browse/HDFS-15161 Project: Hadoop HDFS Issue Type: Bug Reporter: Lisheng Sun Assignee: Lisheng Sun detail see HDFS-14541
{code:java}
  /**
   * Close the cache and free all associated resources.
   */
  @Override
  public void close() {
    try {
      lock.lock();
      if (closed) return;
      closed = true;
      LOG.info(this + ": closing");
      maxNonMmappedEvictableLifespanMs = 0;
      maxEvictableMmapedSize = 0;
      // Close and join cacheCleaner thread.
      IOUtilsClient.cleanupWithLogger(LOG, cacheCleaner);
      // Purge all replicas.
      while (true) {
        Object eldestKey;
        try {
          eldestKey = evictable.firstKey();
        } catch (NoSuchElementException e) {
          break;
        }
        purge((ShortCircuitReplica)evictable.get(eldestKey));
      }
      while (true) {
        Object eldestKey;
        try {
          eldestKey = evictableMmapped.firstKey();
        } catch (NoSuchElementException e) {
          break;
        }
        purge((ShortCircuitReplica)evictableMmapped.get(eldestKey));
      }
    } finally {
      lock.unlock();
    }
  }
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
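The fix the issue title asks for can be sketched as draining the eviction maps with an emptiness check instead of using NoSuchElementException from firstKey() as loop control. The TreeMap-keyed-by-eviction-time shape and the purge/drain names below are assumptions of this sketch, not the ShortCircuitCache API.

```java
// Sketch of the intended fix: drain an eviction map with an emptiness check
// rather than exception-driven loop exit.
import java.util.Map;
import java.util.TreeMap;

public class PurgeSketch {

    // stand-in for ShortCircuitCache#purge(replica)
    static void purge(Object replica) {
        // release the replica's resources here
    }

    // returns how many entries were purged
    static int drain(TreeMap<Long, Object> evictable) {
        int purged = 0;
        while (!evictable.isEmpty()) {               // no NoSuchElementException needed
            Map.Entry<Long, Object> eldest = evictable.pollFirstEntry();
            purge(eldest.getValue());
            purged++;
        }
        return purged;
    }
}
```

pollFirstEntry() removes and returns the eldest entry in one call, so the loop never asks an empty map for its first key.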
[jira] [Updated] (HDFS-15161) When evictableMmapped or evictable size is zero, do not throw NoSuchElementException in ShortCircuitCache#close()
[ https://issues.apache.org/jira/browse/HDFS-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15161: --- Description: detail see HDFS-14541 {code:java} /** * Close the cache and free all associated resources. */ @Override public void close() { try { lock.lock(); if (closed) return; closed = true; LOG.info(this + ": closing"); maxNonMmappedEvictableLifespanMs = 0; maxEvictableMmapedSize = 0; // Close and join cacheCleaner thread. IOUtilsClient.cleanupWithLogger(LOG, cacheCleaner); // Purge all replicas. while (true) { Object eldestKey; try { eldestKey = evictable.firstKey(); } catch (NoSuchElementException e) { break; } purge((ShortCircuitReplica)evictable.get(eldestKey)); } while (true) { Object eldestKey; try { eldestKey = evictableMmapped.firstKey(); } catch (NoSuchElementException e) { break; } purge((ShortCircuitReplica)evictableMmapped.get(eldestKey)); } } finally { lock.unlock(); } {code} was: detail see # HDFS-14541 # HDFS-14541 # HDFS-14541 {code:java} /** * Close the cache and free all associated resources. */ @Override public void close() { try { lock.lock(); if (closed) return; closed = true; LOG.info(this + ": closing"); maxNonMmappedEvictableLifespanMs = 0; maxEvictableMmapedSize = 0; // Close and join cacheCleaner thread. IOUtilsClient.cleanupWithLogger(LOG, cacheCleaner); // Purge all replicas. 
while (true) { Object eldestKey; try { eldestKey = evictable.firstKey(); } catch (NoSuchElementException e) { break; } purge((ShortCircuitReplica)evictable.get(eldestKey)); } while (true) { Object eldestKey; try { eldestKey = evictableMmapped.firstKey(); } catch (NoSuchElementException e) { break; } purge((ShortCircuitReplica)evictableMmapped.get(eldestKey)); } } finally { lock.unlock(); } {code} > When evictableMmapped or evictable size is zero, do not throw > NoSuchElementException in ShortCircuitCache#close() > -- > > Key: HDFS-15161 > URL: https://issues.apache.org/jira/browse/HDFS-15161 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > > detail see HDFS-14541 > {code:java} > /** > * Close the cache and free all associated resources. > */ > @Override > public void close() { > try { > lock.lock(); > if (closed) return; > closed = true; > LOG.info(this + ": closing"); > maxNonMmappedEvictableLifespanMs = 0; > maxEvictableMmapedSize = 0; > // Close and join cacheCleaner thread. > IOUtilsClient.cleanupWithLogger(LOG, cacheCleaner); > // Purge all replicas. > while (true) { > Object eldestKey; > try { > eldestKey = evictable.firstKey(); > } catch (NoSuchElementException e) { > break; > } > purge((ShortCircuitReplica)evictable.get(eldestKey)); > } > while (true) { > Object eldestKey; > try { > eldestKey = evictableMmapped.firstKey(); > } catch (NoSuchElementException e) { > break; > } > purge((ShortCircuitReplica)evictableMmapped.get(eldestKey)); > } > } finally { > lock.unlock(); > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15161) When evictableMmapped or evictable size is zero, do not throw NoSuchElementException in ShortCircuitCache#close()
[ https://issues.apache.org/jira/browse/HDFS-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15161: --- Attachment: HDFS-15161.001.patch Status: Patch Available (was: Open) > When evictableMmapped or evictable size is zero, do not throw > NoSuchElementException in ShortCircuitCache#close() > -- > > Key: HDFS-15161 > URL: https://issues.apache.org/jira/browse/HDFS-15161 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15161.001.patch > > > detail see HDFS-14541 > {code:java} > /** > * Close the cache and free all associated resources. > */ > @Override > public void close() { > try { > lock.lock(); > if (closed) return; > closed = true; > LOG.info(this + ": closing"); > maxNonMmappedEvictableLifespanMs = 0; > maxEvictableMmapedSize = 0; > // Close and join cacheCleaner thread. > IOUtilsClient.cleanupWithLogger(LOG, cacheCleaner); > // Purge all replicas. > while (true) { > Object eldestKey; > try { > eldestKey = evictable.firstKey(); > } catch (NoSuchElementException e) { > break; > } > purge((ShortCircuitReplica)evictable.get(eldestKey)); > } > while (true) { > Object eldestKey; > try { > eldestKey = evictableMmapped.firstKey(); > } catch (NoSuchElementException e) { > break; > } > purge((ShortCircuitReplica)evictableMmapped.get(eldestKey)); > } > } finally { > lock.unlock(); > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15161) When evictableMmapped or evictable size is zero, do not throw NoSuchElementException in ShortCircuitCache#close()
[ https://issues.apache.org/jira/browse/HDFS-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034469#comment-17034469 ] Lisheng Sun commented on HDFS-15161: [~ayushtkn] HDFS-14541 left this problem unfixed: ShortCircuitCache#close() still relies on NoSuchElementException when evictableMmapped or evictable is empty. The other methods in ShortCircuitCache have already fixed this problem. > When evictableMmapped or evictable size is zero, do not throw > NoSuchElementException in ShortCircuitCache#close() > -- > > Key: HDFS-15161 > URL: https://issues.apache.org/jira/browse/HDFS-15161 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15161.001.patch > > > detail see HDFS-14541 > {code:java} > /** > * Close the cache and free all associated resources. > */ > @Override > public void close() { > try { > lock.lock(); > if (closed) return; > closed = true; > LOG.info(this + ": closing"); > maxNonMmappedEvictableLifespanMs = 0; > maxEvictableMmapedSize = 0; > // Close and join cacheCleaner thread. > IOUtilsClient.cleanupWithLogger(LOG, cacheCleaner); > // Purge all replicas. > while (true) { > Object eldestKey; > try { > eldestKey = evictable.firstKey(); > } catch (NoSuchElementException e) { > break; > } > purge((ShortCircuitReplica)evictable.get(eldestKey)); > } > while (true) { > Object eldestKey; > try { > eldestKey = evictableMmapped.firstKey(); > } catch (NoSuchElementException e) { > break; > } > purge((ShortCircuitReplica)evictableMmapped.get(eldestKey)); > } > } finally { > lock.unlock(); > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15161) When evictableMmapped or evictable size is zero, do not throw NoSuchElementException in ShortCircuitCache#close()
[ https://issues.apache.org/jira/browse/HDFS-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15161: --- Attachment: HDFS-15161.002.patch > When evictableMmapped or evictable size is zero, do not throw > NoSuchElementException in ShortCircuitCache#close() > -- > > Key: HDFS-15161 > URL: https://issues.apache.org/jira/browse/HDFS-15161 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15161.001.patch, HDFS-15161.002.patch > > > detail see HDFS-14541 > {code:java} > /** > * Close the cache and free all associated resources. > */ > @Override > public void close() { > try { > lock.lock(); > if (closed) return; > closed = true; > LOG.info(this + ": closing"); > maxNonMmappedEvictableLifespanMs = 0; > maxEvictableMmapedSize = 0; > // Close and join cacheCleaner thread. > IOUtilsClient.cleanupWithLogger(LOG, cacheCleaner); > // Purge all replicas. > while (true) { > Object eldestKey; > try { > eldestKey = evictable.firstKey(); > } catch (NoSuchElementException e) { > break; > } > purge((ShortCircuitReplica)evictable.get(eldestKey)); > } > while (true) { > Object eldestKey; > try { > eldestKey = evictableMmapped.firstKey(); > } catch (NoSuchElementException e) { > break; > } > purge((ShortCircuitReplica)evictableMmapped.get(eldestKey)); > } > } finally { > lock.unlock(); > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15172) Remove unnecessary deadNodeDetectInterval in DeadNodeDetector#checkDeadNodes()
Lisheng Sun created HDFS-15172: -- Summary: Remove unnecessary deadNodeDetectInterval in DeadNodeDetector#checkDeadNodes() Key: HDFS-15172 URL: https://issues.apache.org/jira/browse/HDFS-15172 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Lisheng Sun Assignee: Lisheng Sun Every call to checkDeadNodes() will change the state to IDLE forcing the DeadNodeDetector to sleep for IDLE_SLEEP_MS. So, we don't need deadNodeDetectInterval between every checkDeadNodes(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15172) Remove unnecessary deadNodeDetectInterval in DeadNodeDetector#checkDeadNodes()
[ https://issues.apache.org/jira/browse/HDFS-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15172: --- Attachment: HDFS-15172-001.patch Status: Patch Available (was: Open) > Remove unnecessary deadNodeDetectInterval in > DeadNodeDetector#checkDeadNodes() > --- > > Key: HDFS-15172 > URL: https://issues.apache.org/jira/browse/HDFS-15172 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15172-001.patch > > > Every call to checkDeadNodes() will change the state to IDLE forcing the > DeadNodeDetector to sleep for IDLE_SLEEP_MS. So, we don't need > deadNodeDetectInterval between every checkDeadNodes(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15172) Remove unnecessary deadNodeDetectInterval in DeadNodeDetector#checkDeadNodes()
[ https://issues.apache.org/jira/browse/HDFS-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15172: --- Attachment: HDFS-15172-002.patch > Remove unnecessary deadNodeDetectInterval in > DeadNodeDetector#checkDeadNodes() > --- > > Key: HDFS-15172 > URL: https://issues.apache.org/jira/browse/HDFS-15172 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15172-001.patch, HDFS-15172-002.patch > > > Every call to checkDeadNodes() will change the state to IDLE forcing the > DeadNodeDetector to sleep for IDLE_SLEEP_MS. So, we don't need > deadNodeDetectInterval between every checkDeadNodes(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15172) Remove unnecessary deadNodeDetectInterval in DeadNodeDetector#checkDeadNodes()
[ https://issues.apache.org/jira/browse/HDFS-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15172: --- Description: Every call to checkDeadNodes() will change the state to IDLE forcing the DeadNodeDetector to sleep for IDLE_SLEEP_MS. So we don't need deadNodeDetectInterval between every checkDeadNodes(). (was: Every call to checkDeadNodes() will change the state to IDLE forcing the DeadNodeDetector to sleep for IDLE_SLEEP_MS. So, we don't need deadNodeDetectInterval between every checkDeadNodes().) > Remove unnecessary deadNodeDetectInterval in > DeadNodeDetector#checkDeadNodes() > --- > > Key: HDFS-15172 > URL: https://issues.apache.org/jira/browse/HDFS-15172 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15172-001.patch, HDFS-15172-002.patch > > > Every call to checkDeadNodes() will change the state to IDLE forcing the > DeadNodeDetector to sleep for IDLE_SLEEP_MS. So we don't need > deadNodeDetectInterval between every checkDeadNodes(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-15149) TestDeadNodeDetection test cases time-out
[ https://issues.apache.org/jira/browse/HDFS-15149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun reassigned HDFS-15149: -- Assignee: Lisheng Sun (was: Ahmed Hussein) > TestDeadNodeDetection test cases time-out > - > > Key: HDFS-15149 > URL: https://issues.apache.org/jira/browse/HDFS-15149 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Ahmed Hussein >Assignee: Lisheng Sun >Priority: Major > > TestDeadNodeDetection JUnit time out times out with the following stack > traces: > * 1- testDeadNodeDetectionInBackground* > {code:bash} > [ERROR] Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 264.757 s <<< FAILURE! - in org.apache.hadoop.hdfs.TestDeadNodeDetection > [ERROR] > testDeadNodeDetectionInBackground(org.apache.hadoop.hdfs.TestDeadNodeDetection) > Time elapsed: 125.806 s <<< ERROR! > java.util.concurrent.TimeoutException: > Timed out waiting for condition. Thread diagnostics: > Timestamp: 2020-01-24 08:31:07,023 > "client DomainSocketWatcher" daemon prio=5 tid=117 runnable > java.lang.Thread.State: RUNNABLE > at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native > Method) > at > org.apache.hadoop.net.unix.DomainSocketWatcher.access$900(DomainSocketWatcher.java:52) > at > org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:503) > at java.lang.Thread.run(Thread.java:748) > "Session-HouseKeeper-48c3205a" prio=5 tid=350 timed_waiting > java.lang.Thread.State: TIMED_WAITING > at sun.misc.Unsafe.park(Native Method) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) > at > java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093) > at > java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809) > at > 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > "java.util.concurrent.ThreadPoolExecutor$Worker@3ae54156[State = -1, empty > queue]" daemon prio=5 tid=752 in Object.wait() > java.lang.Thread.State: WAITING (on object monitor) > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > "CacheReplicationMonitor(1960356187)" prio=5 tid=386 timed_waiting > java.lang.Thread.State: TIMED_WAITING > at sun.misc.Unsafe.park(Native Method) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163) > at > org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor.run(CacheReplicationMonitor.java:181) > "Timer for 'NameNode' metrics system" daemon prio=5 tid=339 timed_waiting > java.lang.Thread.State: TIMED_WAITING > at java.lang.Object.wait(Native Method) > at java.util.TimerThread.mainLoop(Timer.java:552) > at java.util.TimerThread.run(Timer.java:505) > "org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber@6b760460" > daemon prio=5 tid=385 timed_waiting > java.lang.Thread.State: TIMED_WAITING > at 
java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.run(FSNamesystem.java:4420) > at java.lang.Thread.run(Thread.java:748) > "qtp164757726-349" daemon prio=5 tid=349 runnable > java.lang.Thread.State: RUNNABLE > at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) > at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) > at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) > at sun.nio.ch.SelectorImpl.lockAndDoSelect(Se
[jira] [Commented] (HDFS-15172) Remove unnecessary deadNodeDetectInterval in DeadNodeDetector#checkDeadNodes()
[ https://issues.apache.org/jira/browse/HDFS-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038073#comment-17038073 ] Lisheng Sun commented on HDFS-15172: [~elgoiri] I originally intended this check in DeadNodeDetector#checkDeadNodes to prevent excessively frequent checks. But the waiting time has been added to idle(), and every call to checkDeadNodes() changes the state to IDLE, so excessively frequent checks can no longer happen. TestDeadNodeDetection#testDeadNodeDetectionDeadNodeRecovery() covers this change. Thank you. > Remove unnecessary deadNodeDetectInterval in > DeadNodeDetector#checkDeadNodes() > --- > > Key: HDFS-15172 > URL: https://issues.apache.org/jira/browse/HDFS-15172 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15172-001.patch, HDFS-15172-002.patch > > > Every call to checkDeadNodes() will change the state to IDLE forcing the > DeadNodeDetector to sleep for IDLE_SLEEP_MS. So we don't need > deadNodeDetectInterval between every checkDeadNodes(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15174) Optimize ReplicaCachingGetSpaceUsed by reducing unnecessary io operations
Lisheng Sun created HDFS-15174: -- Summary: Optimize ReplicaCachingGetSpaceUsed by reducing unnecessary io operations Key: HDFS-15174 URL: https://issues.apache.org/jira/browse/HDFS-15174 Project: Hadoop HDFS Issue Type: Improvement Reporter: Lisheng Sun Assignee: Lisheng Sun In ReplicaCachingGetSpaceUsed#refresh(), calculating the size of each block and of its meta file requires I/O operations, which puts pressure on disk performance when there are many blocks. HDFS-14313 is intended to reduce these I/O operations, so get the block size from ReplicaInfo and the meta size via DataChecksum#getChecksumSize().
{code:java}
  @Override
  protected void refresh() {
    if (CollectionUtils.isNotEmpty(replicaInfos)) {
      for (ReplicaInfo replicaInfo : replicaInfos) {
        if (Objects.equals(replicaInfo.getVolume().getStorageID(),
            volume.getStorageID())) {
          dfsUsed += replicaInfo.getBlockDataLength();
          dfsUsed += replicaInfo.getMetadataLength();
          count++;
        }
      }
    }
  }
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
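The meta-file size this optimization computes in memory is header bytes plus one checksum per data chunk, which is what makes an I/O-free refresh() possible. The constants below (7-byte header, 512-byte chunks, 4-byte CRC32 checksums) mirror common HDFS defaults but are assumptions of this sketch, not values taken from the patch.

```java
// Hedged sketch of the in-memory meta-file size arithmetic: header bytes plus
// one checksum per data chunk. Constants are assumed defaults, not read from
// the actual block.
public class MetaLengthSketch {
    static final long HEADER_SIZE = 7;          // meta-file header bytes (assumed)
    static final long BYTES_PER_CHECKSUM = 512; // data chunk size (assumed default)
    static final long CHECKSUM_SIZE = 4;        // CRC32, cf. DataChecksum#getChecksumSize()

    static long metadataLength(long blockDataLength) {
        // ceiling division: number of data chunks, each covered by one checksum
        long numChunks =
            (blockDataLength + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        return HEADER_SIZE + numChunks * CHECKSUM_SIZE;
    }
}
```

Because every term comes from in-memory replica state, no stat() or open() of the meta file is needed per replica.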
[jira] [Updated] (HDFS-15149) TestDeadNodeDetection test cases time-out
[ https://issues.apache.org/jira/browse/HDFS-15149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15149: --- Attachment: HDFS-15149-001.patch Status: Patch Available (was: Open) > TestDeadNodeDetection test cases time-out > - > > Key: HDFS-15149 > URL: https://issues.apache.org/jira/browse/HDFS-15149 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Ahmed Hussein >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15149-001.patch > > > TestDeadNodeDetection JUnit time out times out with the following stack > traces: > * 1- testDeadNodeDetectionInBackground* > {code:bash} > [ERROR] Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 264.757 s <<< FAILURE! - in org.apache.hadoop.hdfs.TestDeadNodeDetection > [ERROR] > testDeadNodeDetectionInBackground(org.apache.hadoop.hdfs.TestDeadNodeDetection) > Time elapsed: 125.806 s <<< ERROR! > java.util.concurrent.TimeoutException: > Timed out waiting for condition. 
Thread diagnostics: > Timestamp: 2020-01-24 08:31:07,023 > "client DomainSocketWatcher" daemon prio=5 tid=117 runnable > java.lang.Thread.State: RUNNABLE > at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native > Method) > at > org.apache.hadoop.net.unix.DomainSocketWatcher.access$900(DomainSocketWatcher.java:52) > at > org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:503) > at java.lang.Thread.run(Thread.java:748) > "Session-HouseKeeper-48c3205a" prio=5 tid=350 timed_waiting > java.lang.Thread.State: TIMED_WAITING > at sun.misc.Unsafe.park(Native Method) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) > at > java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093) > at > java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809) > at > java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > "java.util.concurrent.ThreadPoolExecutor$Worker@3ae54156[State = -1, empty > queue]" daemon prio=5 tid=752 in Object.wait() > java.lang.Thread.State: WAITING (on object monitor) > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > "CacheReplicationMonitor(1960356187)" prio=5 tid=386 timed_waiting > java.lang.Thread.State: TIMED_WAITING > at sun.misc.Unsafe.park(Native Method) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163) > at > org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor.run(CacheReplicationMonitor.java:181) > "Timer for 'NameNode' metrics system" daemon prio=5 tid=339 timed_waiting > java.lang.Thread.State: TIMED_WAITING > at java.lang.Object.wait(Native Method) > at java.util.TimerThread.mainLoop(Timer.java:552) > at java.util.TimerThread.run(Timer.java:505) > "org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber@6b760460" > daemon prio=5 tid=385 timed_waiting > java.lang.Thread.State: TIMED_WAITING > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.run(FSNamesystem.java:4420) > at java.lang.Thread.run(Thread.java:748) > "qtp164757726-349" daemon prio=5 tid=349 runnable > java.lang.Thread.State: RUNNABLE > at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) > at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) > at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSele
[jira] [Commented] (HDFS-15172) Remove unnecessary deadNodeDetectInterval in DeadNodeDetector#checkDeadNodes()
[ https://issues.apache.org/jira/browse/HDFS-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038721#comment-17038721 ] Lisheng Sun commented on HDFS-15172: Thanks [~elgoiri] for the review. This jira removes the unnecessary frequency check. HDFS-15149 should not include this part of the modification; it is supposed to solve the problem that DeadNodeDetector suppresses all interrupts and never checks for a termination flag. I think these two problems are better divided into two jiras. > Remove unnecessary deadNodeDetectInterval in > DeadNodeDetector#checkDeadNodes() > --- > > Key: HDFS-15172 > URL: https://issues.apache.org/jira/browse/HDFS-15172 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15172-001.patch, HDFS-15172-002.patch > > > Every call to checkDeadNodes() will change the state to IDLE forcing the > DeadNodeDetector to sleep for IDLE_SLEEP_MS. So we don't need > deadNodeDetectInterval between every checkDeadNodes(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
Lisheng Sun created HDFS-15182: -- Summary: TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk Key: HDFS-15182 URL: https://issues.apache.org/jira/browse/HDFS-15182 Project: Hadoop HDFS Issue Type: Bug Reporter: Lisheng Sun When only the single UT TestBlockManager#testOneOfTwoRacksDecommissioned() is run, it fails with a NullPointerException. NameNode#metrics is a static variable: when all UTs in TestBlockManager are run, an earlier test has already initialized metrics, but running testOneOfTwoRacksDecommissioned alone, without initializing metrics, throws a NullPointerException. {code:java} java.lang.NullPointerException at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addBlock(BlockManager.java:4088) at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.fulfillPipeline(TestBlockManager.java:518) at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestOneOfTwoRacksDecommissioned(TestBlockManager.java:388) at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testOneOfTwoRacksDecommissioned(TestBlockManager.java:353) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.junit.runner.JUnitCore.run(JUnitCore.java:137) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68) at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47) at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
[ https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15182: --- Attachment: HDFS-15182-001.patch Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
[ https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15182: --- Attachment: HDFS-15182-002.patch
[jira] [Commented] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
[ https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039993#comment-17039993 ] Lisheng Sun commented on HDFS-15182: The v002 patch fixes it by initializing the metrics in @Before.
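The root cause and the @Before-style fix can be reduced to a dependency-free sketch (hypothetical names; the real patch touches TestBlockManager and NameNode.metrics): a static field that only *other* tests happen to initialize makes a test order-dependent, and initializing it in the setup method restores isolation.

```java
// Self-contained illustration (no Hadoop/JUnit dependencies) of the failure
// mode: a static field initialized only by other tests, so a test run in
// isolation dereferences null. The fix mirrors the v002 approach: initialize
// the static state in a @Before-style setup method.
public class StaticMetricsPitfall {
    public static StringBuilder metrics;     // stands in for NameNode.metrics

    public static void addBlock() {
        metrics.append("block");             // NPE if nothing initialized metrics
    }

    public static void setUp() {             // what a JUnit @Before method would do
        metrics = new StringBuilder();
    }

    public static void main(String[] args) {
        boolean npeWithoutSetup = false;
        try {
            metrics = null;                  // simulate running the test alone
            addBlock();
        } catch (NullPointerException e) {
            npeWithoutSetup = true;          // the HDFS-15182 symptom
        }
        setUp();                             // order-independent initialization
        addBlock();                          // now succeeds regardless of order
        System.out.println(npeWithoutSetup && metrics.length() > 0); // true
    }
}
```

Because @Before runs before every test method, the initialization no longer depends on which other tests in the class have executed first.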
[jira] [Commented] (HDFS-15149) TestDeadNodeDetection test cases time-out
[ https://issues.apache.org/jira/browse/HDFS-15149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040025#comment-17040025 ] Lisheng Sun commented on HDFS-15149: Thanks [~ahussein] for your suggestion. {quote} The poll period and waiting time (5000 and 10) in waitFoDeadNode is very large. I assume you had to use large numbers to match the delays of the detector threads. {quote} The poll period and waiting time are indeed too long and I can reduce them. {quote} I have a question about clearAndGetDetectedDeadNodes(): As far as I understand Calling the method in a loop means that a "deadnode" can be removed from the deadNodes map. In other words, the count may never reach 3, because the map does not for the removed nodes from the list. Please feel free to correct my understanding of the code if I am wrong. {quote} The purpose of clearAndGetDetectedDeadNodes() is to return the current dead nodes while excluding any dead node that is no longer used by any DFSInputStream. {quote} IMHO, DeadNodeDetector.java needs to introduce more aggressive mechanisms to coordinate between the threads. Instead of just racing between each other, tasks can use conditional variables to communicate like synchronized queues, or object monitors. Another benefit from using conditional variables is that the runtime of the tests will be improved because there won't be need to wait for a full cycle. The DefaultSpeculator.java has a synchronized queue just for the purpose of testing: "DefaultSpeculator.scanControl". {quote} Based on your suggestion I will optimize it. Thank you. 
> TestDeadNodeDetection test cases time-out > - > > Key: HDFS-15149 > URL: https://issues.apache.org/jira/browse/HDFS-15149 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Ahmed Hussein >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-15149-001.patch > > > TestDeadNodeDetection JUnit time out times out with the following stack > traces: > * 1- testDeadNodeDetectionInBackground* > {code:bash} > [ERROR] Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 264.757 s <<< FAILURE! - in org.apache.hadoop.hdfs.TestDeadNodeDetection > [ERROR] > testDeadNodeDetectionInBackground(org.apache.hadoop.hdfs.TestDeadNodeDetection) > Time elapsed: 125.806 s <<< ERROR! > java.util.concurrent.TimeoutException: > Timed out waiting for condition. Thread diagnostics: > Timestamp: 2020-01-24 08:31:07,023 > "client DomainSocketWatcher" daemon prio=5 tid=117 runnable > java.lang.Thread.State: RUNNABLE > at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native > Method) > at > org.apache.hadoop.net.unix.DomainSocketWatcher.access$900(DomainSocketWatcher.java:52) > at > org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:503) > at java.lang.Thread.run(Thread.java:748) > "Session-HouseKeeper-48c3205a" prio=5 tid=350 timed_waiting > java.lang.Thread.State: TIMED_WAITING > at sun.misc.Unsafe.park(Native Method) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) > at > java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093) > at > java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809) > at > java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) > 
at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > "java.util.concurrent.ThreadPoolExecutor$Worker@3ae54156[State = -1, empty > queue]" daemon prio=5 tid=752 in Object.wait() > java.lang.Thread.State: WAITING (on object monitor) > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > "CacheReplicationMonitor(1960356187)" prio=5 tid=3
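The coordination idea suggested in the comment above can be sketched minimally (assumed names; not a patch against DeadNodeDetector): the worker thread signals a condition-style primitive when a scan completes, so the test blocks on the signal instead of sleeping through poll cycles.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of signalling scan completion with a CountDownLatch instead of
// having the test poll with a long interval (hypothetical names).
public class ScanSignalSketch {
    private final CountDownLatch scanDone = new CountDownLatch(1);

    // Worker side: the detector thread would call this at the end of a scan.
    public void finishScan() {
        // ... probe suspect nodes, move confirmed dead nodes ...
        scanDone.countDown();   // wake any waiting test thread immediately
    }

    // Test side: returns as soon as the scan ends, so no full poll cycle
    // is wasted and the timeout is only a safety net.
    public boolean awaitScan(long timeoutMs) throws InterruptedException {
        return scanDone.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        ScanSignalSketch s = new ScanSignalSketch();
        Thread worker = new Thread(s::finishScan);
        worker.start();
        boolean signalled = s.awaitScan(5000);
        worker.join();
        System.out.println(signalled); // true
    }
}
```

This is the same pattern as the cited DefaultSpeculator.scanControl queue: a dedicated handshake object lets the test observe the worker's progress without timing assumptions.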
[jira] [Commented] (HDFS-15149) TestDeadNodeDetection test cases time-out
[ https://issues.apache.org/jira/browse/HDFS-15149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040027#comment-17040027 ] Lisheng Sun commented on HDFS-15149: Hi [~elgoiri], {quote} I like the rest of the solution though. {quote} Which part does "the rest of the solution" refer to?
[jira] [Updated] (HDFS-15149) TestDeadNodeDetection test cases time-out
[ https://issues.apache.org/jira/browse/HDFS-15149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15149: --- Attachment: HDFS-15149-002.patch