[jira] [Commented] (HDFS-15273) CacheReplicationMonitor hold lock for long time and lead to NN out of service
[ https://issues.apache.org/jira/browse/HDFS-15273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099476#comment-17099476 ] Wei-Chiu Chuang commented on HDFS-15273: Thanks for reporting the issue! Do you have an estimate of how many cache directives correspond to how much time? Any data points? Thanks

> CacheReplicationMonitor hold lock for long time and lead to NN out of service
> -
>
> Key: HDFS-15273
> URL: https://issues.apache.org/jira/browse/HDFS-15273
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: caching, namenode
> Reporter: Xiaoqiao He
> Assignee: Xiaoqiao He
> Priority: Major
>
> CacheReplicationMonitor scans the cache directives and the cached block map
> periodically. If we add more and more cache directives,
> CacheReplicationMonitor will take a very long time to rescan all of the cache
> directives and cached blocks. Meanwhile, the scan operation holds the global write
> lock, so during the scan period the NameNode cannot process other requests.
> So I think we should warn end users who turn on the CacheManager feature
> about this risk until this implementation is improved.
> {code:java}
> private void rescan() throws InterruptedException {
>   scannedDirectives = 0;
>   scannedBlocks = 0;
>   try {
>     namesystem.writeLock();
>     try {
>       lock.lock();
>       if (shutdown) {
>         throw new InterruptedException("CacheReplicationMonitor was " +
>             "shut down.");
>       }
>       curScanCount = completedScanCount + 1;
>     } finally {
>       lock.unlock();
>     }
>     resetStatistics();
>     rescanCacheDirectives();
>     rescanCachedBlockMap();
>     blockManager.getDatanodeManager().resetLastCachingDirectiveSentTime();
>   } finally {
>     namesystem.writeUnlock();
>   }
> }
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
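The description above points out that rescan() holds the global write lock for the entire scan. A common mitigation for this pattern (a hypothetical sketch in plain Java, not the HDFS implementation, with assumed names like ChunkedRescan) is to process the work in chunks and release the lock between chunks so other requests can be served:

```java
import java.util.concurrent.locks.ReentrantLock;

/**
 * Generic sketch (not HDFS code): rescan a large collection in chunks,
 * releasing a global lock between chunks so other requests can proceed.
 */
public class ChunkedRescan {
    private final ReentrantLock globalLock = new ReentrantLock();
    private int scanned = 0;

    /** Scan `total` items, holding the lock for at most `chunk` items at a time. */
    public int rescan(int total, int chunk) {
        int done = 0;
        while (done < total) {
            globalLock.lock();          // stands in for namesystem.writeLock()
            try {
                int end = Math.min(done + chunk, total);
                for (; done < end; done++) {
                    scanned++;          // process one directive/block
                }
            } finally {
                globalLock.unlock();    // yield so other requests can run
            }
        }
        return scanned;
    }
}
```

The trade-off is that the scan no longer sees a single consistent snapshot, which is why chunking a scan like this needs care around concurrent mutations.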
[jira] [Updated] (HDFS-15272) Backport HDFS-12862 to branch-3.1
[ https://issues.apache.org/jira/browse/HDFS-15272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang updated HDFS-15272: --- Fix Version/s: 3.1.5 Resolution: Fixed Status: Resolved (was: Patch Available) > Backport HDFS-12862 to branch-3.1 > - > > Key: HDFS-15272 > URL: https://issues.apache.org/jira/browse/HDFS-15272 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.4 >Reporter: Xiaoqiao He >Assignee: Xiaoqiao He >Priority: Major > Fix For: 3.1.5 > > Attachments: HDFS-15272.branch-3.1.001.patch > > > Backport HDFS-12862 CacheDirective becomes invalid when NN restart or > failover to branch-3.1.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14599) HDFS-12487 breaks test TestDiskBalancer.testDiskBalancerWithFedClusterWithOneNameServiceEmpty
[ https://issues.apache.org/jira/browse/HDFS-14599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099463#comment-17099463 ] Wei-Chiu Chuang commented on HDFS-14599: Sorry I missed this one. Just cherrypicked the change to branch-3.2 and branch-3.1. [~gabor.bota] fyi (3.1.4 RM) > HDFS-12487 breaks test > TestDiskBalancer.testDiskBalancerWithFedClusterWithOneNameServiceEmpty > - > > Key: HDFS-14599 > URL: https://issues.apache.org/jira/browse/HDFS-14599 > Project: Hadoop HDFS > Issue Type: Bug > Components: diskbalancer >Affects Versions: 3.3.0, 3.2.1, 3.1.3 >Reporter: Wei-Chiu Chuang >Assignee: Xiaoqiao He >Priority: Major > Labels: newbie > Fix For: 3.3.0, 3.2.2, 3.1.5 > > Attachments: HDFS-14599.001.patch, HDFS-14599.002.patch > > > It looks like HDFS-12487 changes the error message expected by > {{TestDiskBalancer#testDiskBalancerWithFedClusterWithOneNameServiceEmpty}}. > The test expects error "There are no blocks in the blockPool" but after > HDFS-12487, it returns error string "NextBlock call returned null.No valid > block to copy." > Probably the simplest approach to fix it is to update the expected error > string. > Thoughts? [~bharatviswa] you crafted the test in HDFS-13715. Should we update > the expected error string, or revert HDFS-12487? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15305) Extend ViewFS and provide ViewFSOverloadScheme implementation with scheme configurable.
[ https://issues.apache.org/jira/browse/HDFS-15305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099464#comment-17099464 ] Hudson commented on HDFS-15305: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18216 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18216/]) HDFS-15305. Extend ViewFS and provide ViewFileSystemOverloadScheme (github: rev 9c8236d04dfc3d4cefe7a00b63625f60ee232cfe) * (add) hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/viewfs/TestViewFileSystemOverloadSchemeLocalFileSystem.java * (edit) hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/viewfs/ViewFileSystemBaseTest.java * (edit) hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/FileSystemContractBaseTest.java * (add) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/fs/viewfs/TestViewFileSystemOverloadSchemeHdfsFileSystemContract.java * (add) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/fs/viewfs/TestViewFileSystemOverloadSchemeWithHdfsScheme.java * (add) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/viewfs/ViewFileSystemOverloadScheme.java * (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/viewfs/ViewFileSystem.java * (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FsConstants.java > Extend ViewFS and provide ViewFSOverloadScheme implementation with scheme > configurable. > --- > > Key: HDFS-15305 > URL: https://issues.apache.org/jira/browse/HDFS-15305 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: fs, hadoop-client, hdfs-client, viewfs >Affects Versions: 3.2.1 >Reporter: Uma Maheswara Rao G >Assignee: Uma Maheswara Rao G >Priority: Major > > Provide ViewFsOverloadScheme implementation by extending ViewFileSystem class. 
> # When the target scheme and the URI scheme match, it should create the
> target file systems in a different way than via the FileSystem.get API.
> # Provide the flexibility to configure the overload scheme.
> e.g., by setting the hdfs scheme and its impl to ViewFsOverloadScheme, users should be
> able to continue working with hdfs scheme URIs and should be able to mount
> any Hadoop-compatible file system as a target. It will follow the same mount
> link configuration pattern as ViewFileSystem.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
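The overload-scheme setup described above can be sketched as a core-site.xml fragment. This is an assumption-laden illustration, not taken from the patch: the mount-table property names follow the standard ViewFS link conventions, and the mount-table name `mycluster` and target URIs are hypothetical.

```xml
<!-- Hypothetical core-site.xml sketch: route the hdfs:// scheme through the
     overload scheme, then mount targets via ViewFS-style link properties. -->
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.fs.viewfs.ViewFileSystemOverloadScheme</value>
</property>
<property>
  <!-- "mycluster" is an assumed mount-table name matching the URI authority -->
  <name>fs.viewfs.mounttable.mycluster.link./user</name>
  <value>hdfs://mycluster/user</value>
</property>
<property>
  <name>fs.viewfs.mounttable.mycluster.link./backup</name>
  <value>s3a://example-bucket/backup</value>
</property>
```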
[jira] [Updated] (HDFS-14599) HDFS-12487 breaks test TestDiskBalancer.testDiskBalancerWithFedClusterWithOneNameServiceEmpty
[ https://issues.apache.org/jira/browse/HDFS-14599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang updated HDFS-14599: --- Fix Version/s: 3.1.5 3.2.2 > HDFS-12487 breaks test > TestDiskBalancer.testDiskBalancerWithFedClusterWithOneNameServiceEmpty > - > > Key: HDFS-14599 > URL: https://issues.apache.org/jira/browse/HDFS-14599 > Project: Hadoop HDFS > Issue Type: Bug > Components: diskbalancer >Affects Versions: 3.3.0, 3.2.1, 3.1.3 >Reporter: Wei-Chiu Chuang >Assignee: Xiaoqiao He >Priority: Major > Labels: newbie > Fix For: 3.3.0, 3.2.2, 3.1.5 > > Attachments: HDFS-14599.001.patch, HDFS-14599.002.patch > > > It looks like HDFS-12487 changes the error message expected by > {{TestDiskBalancer#testDiskBalancerWithFedClusterWithOneNameServiceEmpty}}. > The test expects error "There are no blocks in the blockPool" but after > HDFS-12487, it returns error string "NextBlock call returned null.No valid > block to copy." > Probably the simplest approach to fix it is to update the expected error > string. > Thoughts? [~bharatviswa] you crafted the test in HDFS-13715. Should we update > the expected error string, or revert HDFS-12487? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15305) Extend ViewFS and provide ViewFSOverloadScheme implementation with scheme configurable.
[ https://issues.apache.org/jira/browse/HDFS-15305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uma Maheswara Rao G updated HDFS-15305: --- Hadoop Flags: Reviewed Resolution: Fixed Status: Resolved (was: Patch Available) This PR is merged into trunk now. Thanks.

> Extend ViewFS and provide ViewFSOverloadScheme implementation with scheme
> configurable.
> ---
>
> Key: HDFS-15305
> URL: https://issues.apache.org/jira/browse/HDFS-15305
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: fs, hadoop-client, hdfs-client, viewfs
> Affects Versions: 3.2.1
> Reporter: Uma Maheswara Rao G
> Assignee: Uma Maheswara Rao G
> Priority: Major
>
> Provide ViewFsOverloadScheme implementation by extending ViewFileSystem class.
> # When the target scheme and the URI scheme match, it should create the
> target file systems in a different way than via the FileSystem.get API.
> # Provide the flexibility to configure the overload scheme.
> e.g., by setting the hdfs scheme and its impl to ViewFsOverloadScheme, users should be
> able to continue working with hdfs scheme URIs and should be able to mount
> any Hadoop-compatible file system as a target. It will follow the same mount
> link configuration pattern as ViewFileSystem.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15272) Backport HDFS-12862 to branch-3.1
[ https://issues.apache.org/jira/browse/HDFS-15272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099448#comment-17099448 ] Wei-Chiu Chuang commented on HDFS-15272: +1 I'm sorry I missed this one. > Backport HDFS-12862 to branch-3.1 > - > > Key: HDFS-15272 > URL: https://issues.apache.org/jira/browse/HDFS-15272 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.4 >Reporter: Xiaoqiao He >Assignee: Xiaoqiao He >Priority: Major > Attachments: HDFS-15272.branch-3.1.001.patch > > > Backport HDFS-12862 CacheDirective becomes invalid when NN restart or > failover to branch-3.1.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15272) Backport HDFS-12862 to branch-3.1
[ https://issues.apache.org/jira/browse/HDFS-15272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang updated HDFS-15272: --- Fix Version/s: (was: 3.1.5) > Backport HDFS-12862 to branch-3.1 > - > > Key: HDFS-15272 > URL: https://issues.apache.org/jira/browse/HDFS-15272 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.4 >Reporter: Xiaoqiao He >Assignee: Xiaoqiao He >Priority: Major > Attachments: HDFS-15272.branch-3.1.001.patch > > > Backport HDFS-12862 CacheDirective becomes invalid when NN restart or > failover to branch-3.1.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13183) Standby NameNode process getBlocks request to reduce Active load
[ https://issues.apache.org/jira/browse/HDFS-13183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099434#comment-17099434 ] Wei-Chiu Chuang commented on HDFS-13183: I am really sorry, I meant to review but got distracted. I would like to push this feature to the finish line, because CRFS is a big feature and will take time to stabilize. Plus, it requires an additional Observer NameNode, and the logistics of adding an extra master namenode add complexity. A few comments on the patch:
* Does it work in a federated cluster? IIRC you have a large federated cluster, so I am assuming the answer is yes, but does it work out of the box or does it require extra configuration? (Sorry, I don't have much experience with HDFS federation.)
* It looks like the balancer determines which NN is the SbNN at start, and then uses it until the end. There are two issues:
** Failover. If a failover happens, the balancer can't adapt and will then send the requests to the ANN. That is fine, as it shouldn't fail the balancer, but it increases the new ANN's overhead.
** Multiple standby namenode support. The balancer always chooses the first available standby namenode. This is fine, since in any case there can be only one balancer running at a time.
Also, just want to say that you don't actually need to mark FSNamesystem#getBlocks() UNCHECKED. If dfs.ha.allow.stale.reads is true, the Standby NN accepts the request as well. That requires an extra configuration, so it is probably not ideal.
> Standby NameNode process getBlocks request to reduce Active load
>
> Key: HDFS-13183
> URL: https://issues.apache.org/jira/browse/HDFS-13183
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: balancer mover, namenode
> Reporter: Xiaoqiao He
> Assignee: Xiaoqiao He
> Priority: Major
> Attachments: HDFS-13183-trunk.001.patch, HDFS-13183-trunk.002.patch,
> HDFS-13183-trunk.003.patch, HDFS-13183.004.patch, HDFS-13183.005.patch
>
> The performance of the Active NameNode can be impacted when the {{Balancer}} requests
> #getBlocks, since querying blocks of overly full DNs is currently extremely
> inefficient. The main reason is that {{NameNodeRpcServer#getBlocks}}
> holds the read lock for a long time. In the extreme case, all handlers of the Active
> NameNode RPC server are occupied by one reader
> {{NameNodeRpcServer#getBlocks}} and other write operation calls, and thus the Active
> NameNode enters a state of false death for seconds or even minutes.
> Similar performance concerns about the Balancer have been reported in HDFS-9412,
> HDFS-7967, etc.
> If the Standby NameNode can shoulder the heavy #getBlocks burden, it could speed up
> balancing and reduce the performance impact on the Active NameNode.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
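The standby-preference discussed in the review above can be sketched generically. This is a hypothetical illustration in plain Java (the names ReadEndpointSelector, chooseGetBlocksEndpoint, and the health check are invented, not the actual Balancer code): heavy read-only calls go to the first reachable standby, with a fallback to the active NameNode, which matches the failover behavior described in the comment.

```java
import java.util.List;
import java.util.function.Predicate;

/**
 * Generic sketch (hypothetical names, not the actual Balancer code): pick a
 * standby endpoint for heavy read-only calls such as getBlocks, falling back
 * to the active endpoint when no standby is reachable.
 */
public class ReadEndpointSelector {
    /** Returns the first standby that passes the health check, else the active. */
    public static String chooseGetBlocksEndpoint(
            String active, List<String> standbys, Predicate<String> isHealthy) {
        for (String nn : standbys) {
            if (isHealthy.test(nn)) {
                return nn;  // offload the read-only scan to a standby
            }
        }
        return active;      // fallback: correctness over load isolation
    }
}
```

This mirrors the "choose the first available standby at start" behavior; after a failover the health check fails for the old standby and requests flow back to the active, at the cost of extra load there.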
[jira] [Commented] (HDFS-15323) StandbyNode fails transition to active due to insufficient transaction tailing
[ https://issues.apache.org/jira/browse/HDFS-15323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099387#comment-17099387 ] Gabor Bota commented on HDFS-15323: --- [~shv], sorry but it won't be included, I'm going to send the mail with the RC0 in a few minutes. > StandbyNode fails transition to active due to insufficient transaction tailing > -- > > Key: HDFS-15323 > URL: https://issues.apache.org/jira/browse/HDFS-15323 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode, qjm >Affects Versions: 2.7.7 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko >Priority: Major > Fix For: 3.3.0, 3.2.2, 2.10.1, 3.4.0 > > Attachments: HDFS-15323-branch-2.10.002.patch, > HDFS-15323.000.unitTest.patch, HDFS-15323.001.patch, HDFS-15323.002.patch > > > StandbyNode is asked to {{transitionToActive()}}. If it fell too far behind > in tailing journal transaction (from QJM) it can crash with > {{IllegalStateException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15311) [SBN Read] High frequency reQueue cause Reader's performance to degrade
[ https://issues.apache.org/jira/browse/HDFS-15311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099384#comment-17099384 ] Konstantin Shvachko commented on HDFS-15311: I was proposing to avoid re-queueing altogether with a cyclical queue, see HDFS-15291. Although the throughput decrease with {{autoMsyncPeriodMs = 0}} is probably because it doubles the number of RPC calls.

> [SBN Read] High frequency reQueue cause Reader's performance to degrade
> ---
>
> Key: HDFS-15311
> URL: https://issues.apache.org/jira/browse/HDFS-15311
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: xuzq
> Priority: Major
>
> If _autoMsyncPeriodMs_ is 0, every read RPC will do an _msync_.
> On the observer server side, this causes high-frequency reQueue in the Handler.
> Since the queue is a BlockingQueue, the Readers (small in number) and
> Handlers (large in number) compete for the BlockingQueue locks,
> which causes the throughput to decrease.
>
> Maybe we can let the handler sleep a little while to wait for the StateId, to
> reduce reQueue.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
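The "sleep a little instead of re-queueing" idea from the description can be sketched generically. This is a hypothetical plain-Java illustration (StateIdWaiter and its methods are invented names, not the actual RPC server code): the handler briefly sleeps and re-checks the server state id, and only gives up after a timeout, at which point the caller could still fall back to re-queueing.

```java
/**
 * Generic sketch (hypothetical names, not the actual RPC server code): instead
 * of immediately re-queueing a call whose required state id has not been
 * reached, the handler briefly sleeps and re-checks, avoiding the
 * BlockingQueue lock contention described above.
 */
public class StateIdWaiter {
    private volatile long serverStateId = 0;

    public void advanceTo(long id) { serverStateId = id; }

    /** Returns true once serverStateId >= clientSeenStateId, false on timeout. */
    public boolean awaitStateId(long clientSeenStateId, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (serverStateId < clientSeenStateId) {
            if (System.currentTimeMillis() >= deadline) {
                return false;           // caller may then fall back to re-queueing
            }
            try {
                Thread.sleep(1);        // short backoff instead of a re-queue
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;           // treat interrupt as "give up"
            }
        }
        return true;
    }
}
```

The trade-off is that a sleeping handler is unavailable for other calls, which is presumably why the cyclical-queue alternative in HDFS-15291 is mentioned in the comment above.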
[jira] [Commented] (HDFS-15332) Quota Space consumed was wrong in truncate with Snapshots
[ https://issues.apache.org/jira/browse/HDFS-15332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099345#comment-17099345 ] Hadoop QA commented on HDFS-15332: --
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 59s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 4m 43s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 46s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 3m 1s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 59s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 35s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 58s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 97m 15s{color} | {color:red} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 44s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}152m 16s{color} | {color:black} {color} |
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.datanode.TestDataNodeMetrics |
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-HDFS-Build/29235/artifact/out/Dockerfile |
| JIRA Issue | HDFS-15332 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13002022/HDFS-15332.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux f8e64ce5028a 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / ebb878bab99 |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/29235/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results |
[jira] [Commented] (HDFS-15160) ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl methods should use datanode readlock
[ https://issues.apache.org/jira/browse/HDFS-15160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099342#comment-17099342 ] Wei-Chiu Chuang commented on HDFS-15160: [~zhuqi] did you try the latest patch and how did it go? Thanks > ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl > methods should use datanode readlock > --- > > Key: HDFS-15160 > URL: https://issues.apache.org/jira/browse/HDFS-15160 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Attachments: HDFS-15160.001.patch, HDFS-15160.002.patch, > HDFS-15160.003.patch, HDFS-15160.004.patch, HDFS-15160.005.patch, > image-2020-04-10-17-18-08-128.png, image-2020-04-10-17-18-55-938.png > > > Now we have HDFS-15150, we can start to move some DN operations to use the > read lock rather than the write lock to improve concurrence. The first step > is to make the changes to ReplicaMap, as many other methods make calls to it. > This Jira switches read operations against the volume map to use the readLock > rather than the write lock. > Additionally, some methods make a call to replicaMap.replicas() (eg > getBlockReports, getFinalizedBlocks, deepCopyReplica) and only use the result > in a read only fashion, so they can also be switched to using a readLock. > Next is the directory scanner and disk balancer, which only require a read > lock. > Finally (for this Jira) are various "low hanging fruit" items in BlockSender > and fsdatasetImpl where is it fairly obvious they only need a read lock. > For now, I have avoided changing anything which looks too risky, as I think > its better to do any larger refactoring or risky changes each in their own > Jira. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
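The read-lock change discussed above (read-only callers such as the directory scanner and disk balancer taking a shared lock while mutations keep the exclusive lock) follows the standard Java ReadWriteLock pattern. A minimal, self-contained sketch in plain Java, with invented names (ReadMostlyReplicaMap), not the actual FsDatasetImpl code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Generic sketch (not the actual FsDatasetImpl code): a replica map guarded by
 * a ReadWriteLock so read-only callers (reports, scans) can run concurrently,
 * while mutations still take the exclusive write lock.
 */
public class ReadMostlyReplicaMap {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<Long, String> replicas = new HashMap<>();

    public void add(long blockId, String volume) {
        lock.writeLock().lock();        // exclusive: structure is mutated
        try {
            replicas.put(blockId, volume);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public String get(long blockId) {
        lock.readLock().lock();         // shared: many readers may proceed
        try {
            return replicas.get(blockId);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Read-only query, as a scanner or block report would perform. */
    public int size() {
        lock.readLock().lock();
        try {
            return replicas.size();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

The "low hanging fruit" framing in the description corresponds to methods like get() and size() here: they never mutate state, so moving them to the read lock is safe and improves concurrency.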
[jira] [Commented] (HDFS-15270) Account for *env == NULL in hdfsThreadDestructor
[ https://issues.apache.org/jira/browse/HDFS-15270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099329#comment-17099329 ] Hudson commented on HDFS-15270: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18215 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18215/]) HDFS-15270. Account for *env == NULL in hdfsThreadDestructor (#1951) (github: rev 1996351b0b7be6866eda73223ab6ef1ec78d30cd) * (edit) hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/os/windows/thread_local_storage.c * (edit) hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/os/posix/thread_local_storage.c > Account for *env == NULL in hdfsThreadDestructor > > > Key: HDFS-15270 > URL: https://issues.apache.org/jira/browse/HDFS-15270 > Project: Hadoop HDFS > Issue Type: Bug > Environment: Please refer to the "steps to reproduce" the failure in > https://github.com/eclipse/openj9/issues/7752#issue-521732953. >Reporter: Babneet Singh >Assignee: Babneet Singh >Priority: Major > Fix For: 3.4.0 > > > OpenJ9 JVM properly terminates the thread before hdfsThreadDestructor is > invoked. JNIEnv is a mirror of J9VMThread in OpenJ9. After proper thread > termination, accessing JNIEnv in hdfsThreadDestructor (*env)->GetJavaVM, > yields a SIGSEGV since *env is NULL after thread cleanup is performed. > The main purpose of hdfsThreadDestructor is to invoke > DetachCurrentThread, which performs thread cleanup in OpenJ9. Since > OpenJ9 performs thread cleanup before hdfsThreadDestructor is invoked, > hdfsThreadDestructor should account for *env == NULL and skip > DetachCurrentThread. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15311) [SBN Read] High frequency reQueue cause Reader's performance to degrade
[ https://issues.apache.org/jira/browse/HDFS-15311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099321#comment-17099321 ] Wei-Chiu Chuang commented on HDFS-15311: [~cliang] [~xkrogen] [~shv] thoughts? > [SBN Read] High frequency reQueue cause Reader's performance to degrade > --- > > Key: HDFS-15311 > URL: https://issues.apache.org/jira/browse/HDFS-15311 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: xuzq >Priority: Major > > If _autoMsyncPeriodMs_ is 0, will do _msync_ for each read rpc. > On the observer server side, it will cause high frequency reQueue in Handler. > As the Queue is BlockingQueue, so it will cause Readers(small number) and > Handlers(large number) competing for BlockingQueue locks. > It will cause the throughput decrease. > > Maybe we can let the handler sleep a little time to wait the StateId to > decrease ReQueue. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-15270) Account for *env == NULL in hdfsThreadDestructor
[ https://issues.apache.org/jira/browse/HDFS-15270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang resolved HDFS-15270. Fix Version/s: 3.4.0 Resolution: Fixed Thanks [~babsingh] this is in the trunk. Do you have a branch in mind that you want this cherrypicked to? > Account for *env == NULL in hdfsThreadDestructor > > > Key: HDFS-15270 > URL: https://issues.apache.org/jira/browse/HDFS-15270 > Project: Hadoop HDFS > Issue Type: Bug > Environment: Please refer to the "steps to reproduce" the failure in > https://github.com/eclipse/openj9/issues/7752#issue-521732953. >Reporter: Babneet Singh >Assignee: Babneet Singh >Priority: Major > Fix For: 3.4.0 > > > OpenJ9 JVM properly terminates the thread before hdfsThreadDestructor is > invoked. JNIEnv is a mirror of J9VMThread in OpenJ9. After proper thread > termination, accessing JNIEnv in hdfsThreadDestructor (*env)->GetJavaVM, > yields a SIGSEGV since *env is NULL after thread cleanup is performed. > The main purpose of hdfsThreadDestructor is to invoke > DetachCurrentThread, which performs thread cleanup in OpenJ9. Since > OpenJ9 performs thread cleanup before hdfsThreadDestructor is invoked, > hdfsThreadDestructor should account for *env == NULL and skip > DetachCurrentThread. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15323) StandbyNode fails transition to active due to insufficient transaction tailing
[ https://issues.apache.org/jira/browse/HDFS-15323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena updated HDFS-15323: Fix Version/s: 3.3.0 > StandbyNode fails transition to active due to insufficient transaction tailing > -- > > Key: HDFS-15323 > URL: https://issues.apache.org/jira/browse/HDFS-15323 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode, qjm >Affects Versions: 2.7.7 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko >Priority: Major > Fix For: 3.3.0, 3.2.2, 2.10.1, 3.4.0 > > Attachments: HDFS-15323-branch-2.10.002.patch, > HDFS-15323.000.unitTest.patch, HDFS-15323.001.patch, HDFS-15323.002.patch > > > StandbyNode is asked to {{transitionToActive()}}. If it fell too far behind > in tailing journal transaction (from QJM) it can crash with > {{IllegalStateException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14283) DFSInputStream to prefer cached replica
[ https://issues.apache.org/jira/browse/HDFS-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099302#comment-17099302 ] Ayush Saxena commented on HDFS-14283: - Thanx [~leosun08] for the patch.
{code:java}
+ if (!deadNodes.containsKey(cachedLocs[i]) {code}
For this, can we use {{dfsClient.getDeadNodes(this).containsKey(nodes[i])}}? It is added as part of the DeadDatanodeDetection feature. If yes, maybe we can refactor the if checks into a single method and use it at both places.
{code:java}
return new DNAddrPair(chosenNode, targetAddr, storageType, block); {code}
{{storageType}} will be {{null}} if using {{cachedReplica}}; is that ok?

> DFSInputStream to prefer cached replica
> ---
>
> Key: HDFS-14283
> URL: https://issues.apache.org/jira/browse/HDFS-14283
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.6.0
> Environment: HDFS Caching
> Reporter: Wei-Chiu Chuang
> Assignee: Lisheng Sun
> Priority: Major
> Attachments: HDFS-14283.001.patch, HDFS-14283.002.patch,
> HDFS-14283.003.patch, HDFS-14283.004.patch, HDFS-14283.005.patch,
> HDFS-14283.006.patch, HDFS-14283.007.patch
>
> HDFS Caching offers performance benefits. However, currently the NameNode does
> not treat cached replicas with higher priority, so HDFS caching is only useful
> when cache replication = 3, that is to say, all replicas are cached in
> memory, so that a client doesn't randomly pick an uncached replica.
> HDFS-6846 proposed to let the NameNode give higher priority to cached replicas.
> Changing logic in the NameNode is always tricky, so that didn't get much
> traction. Here I propose a different approach: let the client (DFSInputStream)
> prefer the cached replica.
> A {{LocatedBlock}} object already contains the cached replica locations, so a
> client has the needed information. I think we can change
> {{DFSInputStream#getBestNodeDNAddrPair()}} for this purpose.
[jira] [Commented] (HDFS-15323) StandbyNode fails transition to active due to insufficient transaction tailing
[ https://issues.apache.org/jira/browse/HDFS-15323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099281#comment-17099281 ] Konstantin Shvachko commented on HDFS-15323: Thanks [~ayushtkn], please do. [~gabor.bota] would be good if this jira could make it into 3.1.4 release. > StandbyNode fails transition to active due to insufficient transaction tailing > -- > > Key: HDFS-15323 > URL: https://issues.apache.org/jira/browse/HDFS-15323 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode, qjm >Affects Versions: 2.7.7 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko >Priority: Major > Fix For: 3.2.2, 2.10.1, 3.4.0 > > Attachments: HDFS-15323-branch-2.10.002.patch, > HDFS-15323.000.unitTest.patch, HDFS-15323.001.patch, HDFS-15323.002.patch > > > StandbyNode is asked to {{transitionToActive()}}. If it fell too far behind > in tailing journal transaction (from QJM) it can crash with > {{IllegalStateException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15323) StandbyNode fails transition to active due to insufficient transaction tailing
[ https://issues.apache.org/jira/browse/HDFS-15323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shvachko updated HDFS-15323: --- Fix Version/s: 3.4.0 2.10.1 3.2.2 > StandbyNode fails transition to active due to insufficient transaction tailing > -- > > Key: HDFS-15323 > URL: https://issues.apache.org/jira/browse/HDFS-15323 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode, qjm >Affects Versions: 2.7.7 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko >Priority: Major > Fix For: 3.2.2, 2.10.1, 3.4.0 > > Attachments: HDFS-15323-branch-2.10.002.patch, > HDFS-15323.000.unitTest.patch, HDFS-15323.001.patch, HDFS-15323.002.patch > > > StandbyNode is asked to {{transitionToActive()}}. If it fell too far behind > in tailing journal transaction (from QJM) it can crash with > {{IllegalStateException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15332) Quota Space consumed was wrong in truncate with Snapshots
[ https://issues.apache.org/jira/browse/HDFS-15332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15332: - Description: On calculating space quota usage {code:java} if (file.getBlocks() != null) { allBlocks.addAll(Arrays.asList(file.getBlocks())); } if (removed.getBlocks() != null) { allBlocks.addAll(Arrays.asList(removed.getBlocks())); } for (BlockInfo b: allBlocks) { {code} we missed out the blocks of file snapshot feature's Diffs > Quota Space consumed was wrong in truncate with Snapshots > - > > Key: HDFS-15332 > URL: https://issues.apache.org/jira/browse/HDFS-15332 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15332.001.patch > > > On calculating space quota usage > {code:java} >if (file.getBlocks() != null) { > allBlocks.addAll(Arrays.asList(file.getBlocks())); >} >if (removed.getBlocks() != null) { > allBlocks.addAll(Arrays.asList(removed.getBlocks())); >} >for (BlockInfo b: allBlocks) { {code} > we missed out the blocks of file snapshot feature's Diffs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
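A minimal standalone sketch of the fix direction described above, under assumed simplified types (the real code collects {{BlockInfo}} objects, and the missing contribution would come from the file's snapshot feature diffs; the method and parameter names here are illustrative only):

```java
import java.util.*;

public class TruncateQuotaSketch {
    // Collect the distinct blocks that still consume quota after a truncate:
    // the file's current blocks, the blocks of the removed (truncated) tail,
    // and -- the part the reported bug missed -- blocks kept alive only
    // because a snapshot diff still references them.
    static Set<String> blocksConsumingQuota(List<String> fileBlocks,
                                            List<String> removedBlocks,
                                            List<String> snapshotDiffBlocks) {
        Set<String> all = new LinkedHashSet<>(); // dedupe; keep insertion order
        if (fileBlocks != null) all.addAll(fileBlocks);
        if (removedBlocks != null) all.addAll(removedBlocks);
        if (snapshotDiffBlocks != null) all.addAll(snapshotDiffBlocks);
        return all;
    }

    public static void main(String[] args) {
        Set<String> all = blocksConsumingQuota(
            Arrays.asList("blk1"),           // current file blocks after truncate
            Arrays.asList("blk2"),           // blocks removed by the truncate
            Arrays.asList("blk2", "blk3"));  // blocks still referenced by snapshots
        System.out.println(all); // [blk1, blk2, blk3]
    }
}
```

Without the third input, blk3 would be dropped from the count, which is the shape of the quota undercount the issue describes.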
[jira] [Commented] (HDFS-15323) StandbyNode fails transition to active due to insufficient transaction tailing
[ https://issues.apache.org/jira/browse/HDFS-15323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099219#comment-17099219 ] Ayush Saxena commented on HDFS-15323: - Thanks [~shv]. For 3.1.4, RC0 seems to have been created; not sure, I guess we need to check with the Release Manager. Can cherry-pick to the 3.3.0 branch. > StandbyNode fails transition to active due to insufficient transaction tailing > -- > > Key: HDFS-15323 > URL: https://issues.apache.org/jira/browse/HDFS-15323 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode, qjm >Affects Versions: 2.7.7 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko >Priority: Major > Attachments: HDFS-15323-branch-2.10.002.patch, > HDFS-15323.000.unitTest.patch, HDFS-15323.001.patch, HDFS-15323.002.patch > > > StandbyNode is asked to {{transitionToActive()}}. If it fell too far behind > in tailing journal transaction (from QJM) it can crash with > {{IllegalStateException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13904) ContentSummary does not always respect processing limit, resulting in long lock acquisitions
[ https://issues.apache.org/jira/browse/HDFS-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099220#comment-17099220 ] Erik Krogen commented on HDFS-13904: Hi [~umamaheswararao], I'm not actively working on this. I don't believe we applied any fix to the NN; instead we focused on migrating users to the {{getQuotaUsage()}} API, since it was quota checks which caused the really large issues. Yes, the NN had consistent load throughout (besides some minor blips around restarts of course). It indeed was interesting to see the difference across restarts. I don't have any good ideas there. GC pauses were low and consistent with normal behavior. > ContentSummary does not always respect processing limit, resulting in long > lock acquisitions > > > Key: HDFS-13904 > URL: https://issues.apache.org/jira/browse/HDFS-13904 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs, namenode >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > > HDFS-4995 added a config {{dfs.content-summary.limit}} which allows for an > administrator to set a limit on the number of entries processed during a > single acquisition of the {{FSNamesystemLock}} during the creation of a > content summary. This is useful to prevent very long (multiple seconds) > pauses on the NameNode when {{getContentSummary}} is called on large > directories. 
> However, even on versions with HDFS-4995, we have seen warnings like: > {code} > INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem read > lock held for 9398 ms via > java.lang.Thread.getStackTrace(Thread.java:1552) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:950) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.readUnlock(FSNamesystemLock.java:188) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.readUnlock(FSNamesystem.java:1486) > org.apache.hadoop.hdfs.server.namenode.ContentSummaryComputationContext.yield(ContentSummaryComputationContext.java:109) > org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:679) > org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:642) > org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:656) > {code} > happen quite consistently when {{getContentSummary}} was called on a large > directory on a heavily-loaded NameNode. Such long pauses completely destroy > the performance of the NameNode. We have the limit set to its default of > 5000; if it was respected, clearly there would not be a 10-second pause. > The current {{yield()}} code within {{ContentSummaryComputationContext}} > looks like: > {code} > public boolean yield() { > // Are we set up to do this? > if (limitPerRun <= 0 || dir == null || fsn == null) { > return false; > } > // Have we reached the limit? 
> long currentCount = counts.getFileCount() + > counts.getSymlinkCount() + > counts.getDirectoryCount() + > counts.getSnapshotableDirectoryCount(); > if (currentCount <= nextCountLimit) { > return false; > } > // Update the next limit > nextCountLimit = currentCount + limitPerRun; > boolean hadDirReadLock = dir.hasReadLock(); > boolean hadDirWriteLock = dir.hasWriteLock(); > boolean hadFsnReadLock = fsn.hasReadLock(); > boolean hadFsnWriteLock = fsn.hasWriteLock(); > // sanity check. > if (!hadDirReadLock || !hadFsnReadLock || hadDirWriteLock || > hadFsnWriteLock || dir.getReadHoldCount() != 1 || > fsn.getReadHoldCount() != 1) { > // cannot relinquish > return false; > } > // unlock > dir.readUnlock(); > fsn.readUnlock("contentSummary"); > try { > Thread.sleep(sleepMilliSec, sleepNanoSec); > } catch (InterruptedException ie) { > } finally { > // reacquire > fsn.readLock(); > dir.readLock(); > } > yieldCount++; > return true; > } > {code} > We believe that this check in particular is the culprit: > {code} > if (!hadDirReadLock || !hadFsnReadLock || hadDirWriteLock || > hadFsnWriteLock || dir.getReadHoldCount() != 1 || > fsn.getReadHoldCount() != 1) { > // cannot relinquish > return false; > } > {code} > The content summary computation will only relinquish the lock if it is > currently the _only_ holder of the lock. Given the high volume of read > requests on a heavily loaded NameNode, especially when unfair locking is > enabled, it is likely there may be another holder of the read lock performing > some short-lived operation. By refusing to give up the lock in this case, the > content summary computation ends up never relinquishing the lock. > We propose to simply remove the readHoldCount checks from this {{yield()}}.
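The failure mode described above can be demonstrated with a plain {{ReentrantReadWriteLock}} (a standalone model, not HDFS code): a yield guard that insists on being the only reader will essentially never fire while any concurrent reader is active.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class YieldGuardDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch otherReaderIn = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(1);

        // A concurrent short-lived read operation, as on a busy NameNode.
        Thread other = new Thread(() -> {
            lock.readLock().lock();
            otherReaderIn.countDown();
            try {
                done.await();
            } catch (InterruptedException ignored) {
            } finally {
                lock.readLock().unlock();
            }
        });
        other.start();

        lock.readLock().lock(); // the content-summary thread's read lock
        otherReaderIn.await();  // make sure the other reader is really in
        try {
            // Guard in the spirit of the quoted sanity check: refuse to
            // relinquish unless this is effectively the only read hold.
            boolean canYield = lock.getReadLockCount() == 1;
            System.out.println("canYield = " + canYield); // false: another reader is in
        } finally {
            lock.readLock().unlock();
        }
        done.countDown();
        other.join();
    }
}
```

On a loaded lock the second reader is almost always present, so the guard keeps returning false and the long-running reader never sleeps, matching the 9+ second hold times in the quoted log line.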
[jira] [Updated] (HDFS-15332) Quota Space consumed was wrong in truncate with Snapshots
[ https://issues.apache.org/jira/browse/HDFS-15332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15332: - Attachment: HDFS-15332.001.patch Status: Patch Available (was: Open) > Quota Space consumed was wrong in truncate with Snapshots > - > > Key: HDFS-15332 > URL: https://issues.apache.org/jira/browse/HDFS-15332 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15332.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15289) Allow viewfs mounts with hdfs scheme and centralized mount table
[ https://issues.apache.org/jira/browse/HDFS-15289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091107#comment-17091107 ] Uma Maheswara Rao G edited comment on HDFS-15289 at 5/4/20, 6:18 PM: - Thanks a lot, [~virajith] for the comments. Glad to hear that you guys are looking for similar things. Our targeted use cases are pretty much the same as what you mentioned. First and foremost, our goal is to make ViewFSOverloadScheme configurable with different schemes, and “hdfs” is a priority use case as Hive-like systems persist “hdfs://nn1” uris in meta stores. Coming to tools support, we discussed some of the details, and we thought we should first make ViewFS support different schemes (ex: hdfs) and keep configuration centrally to manage mount configurations easily. {quote}saveNamespace and other methods in FileSystem all needed to be implemented in ViewFSOverloadScheme. Do you have any specific plans around testing this? {quote} I have a question here. In the ViewFSOverloadScheme case, we will have multiple target file systems. So, when a user calls ViewFSOverloadScheme#saveNameSpace, do we need to delegate this to all HDFS-specific target file systems? In reality, users may want to run this on specific targets, right? The DistributedFileSystem interface is tagged with: {quote}@InterfaceAudience.LimitedPrivate({ "MapReduce", "HBase" }) @InterfaceStability.Unstable {quote} Unfortunately, many users use DFS classes directly. But we have a publicly exposed class for administration functions: {quote}/** * The public API for performing administrative functions on HDFS. Those writing * applications against HDFS should prefer this interface to directly accessing * functionality in DistributedFileSystem or DFSClient. * * Note that this is distinct from the similarly-named DFSAdmin, which * is a class that provides the functionality for the CLI `hdfs dfsadmin ...' * commands. 
*/ @InterfaceAudience.Public @InterfaceStability.Evolving public class HdfsAdmin {{quote} Can we extend this class to support ViewFS functionality for administration functions? I mean we can do something like this: currently HdfsAdmin holds a DFS instance and delegates calls to it. Probably we can modify this class, or extend it, to support ViewFSOverloadScheme-specific functionality? If that does not work, sure, we can discuss which APIs need to be added in ViewFSOverloadScheme, and we may need additional APIs for when users want to run on specific target child filesystems. Actually, ViewFS already exposes APIs like getChildFileSystems etc. We can add more functions here, for example: ViewFSOverloadScheme#getTargetFS(“/mountPath”). This would return a DFS if /mountPath pointed to a DFS cluster. It would be great if you have some thoughts on how we want a “saveNameSpace”-like API to behave when we have multiple target hdfs links mounted. {quote}Admins will not have a way to directly access HDFS unless admin tooling explicitly sets the right properties. Is this something you considered? How do you plan to make admin tools work? {quote} Yes, I agree. However, supporting a single target DFS (the overloaded-scheme target fs) would be easy: DFSAdmin gets the FS from ViewFSOverloadScheme, obtains the overloaded-scheme fs from there, and delegates calls. The challenge is that we will have multiple DFS clusters configured as targets. We should make the current DFSAdmin get all matching hdfs-scheme target file systems from ViewFSOverloadScheme and delegate the calls. A more appropriate way may be to extend DFSAdmin. I think today, if a user configures defaultFS as “viewfs://” and wants to connect to some of the child hdfs clusters using DFSAdmin, we have the same problem. So this problem exists in ViewFS itself, and we should improve it to provide the flexibility to access child filesystems. One thought is that admin commands use the -fs option and specify the required NN address. 
DFSAdmin can use ViewFSOverloadScheme#getOverloadSchemeFS and pass the calls to that fs. Alternatively, we may have to build a ViewDFSAdmin which provides access to child file systems via the ViewFSOverloadScheme APIs. {quote}How to handle cases where DistributedFileSystem is used instead of FileSystem? {quote} If users access DFS directly, they may need to get the childFileSystems from ViewFSOverloadScheme and check with instanceof. {quote}Do you plan to make ViewFSOverloadScheme extend DistributedFileSystem? {quote} The plan is to extend the ViewFileSystem class. So we will retain pretty much the ViewFS client-side mount-building logic as is, and we will address FS looping issues and remote configuration loading in the extended class. We can also add more usability functions, like getting a child file system by scheme. was (Author: umamaheswararao): Thanks a lot, [~virajith] for comments. Glad to hear that you guys are looking for
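The mount-resolution idea behind the {{getTargetFS("/mountPath")}} helper discussed above can be sketched as a longest-prefix match over a mount table (a simplified standalone model; the names and types are illustrative only, not the actual ViewFS API):

```java
import java.util.*;

public class MountResolveSketch {
    // Resolve a path to its target cluster by longest-prefix match over the
    // mount table, the way ViewFS-style mount points shadow their parents.
    static String resolveTarget(Map<String, String> mounts, String path) {
        String bestMount = null;
        for (String mp : mounts.keySet()) {
            boolean matches = path.equals(mp) || path.startsWith(mp + "/");
            if (matches && (bestMount == null || mp.length() > bestMount.length())) {
                bestMount = mp; // deeper mount point wins
            }
        }
        return bestMount == null ? null : mounts.get(bestMount);
    }

    public static void main(String[] args) {
        Map<String, String> mounts = new HashMap<>();
        mounts.put("/data", "hdfs://nn1");
        mounts.put("/data/warm", "hdfs://nn2");
        System.out.println(resolveTarget(mounts, "/data/warm/part-0")); // hdfs://nn2
        System.out.println(resolveTarget(mounts, "/data/x"));           // hdfs://nn1
    }
}
```

An admin-tool shim in the spirit of the discussion would call something like this first, then hand the resolved target to the existing per-cluster DFSAdmin logic.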
[jira] [Commented] (HDFS-15323) StandbyNode fails transition to active due to insufficient transaction tailing
[ https://issues.apache.org/jira/browse/HDFS-15323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099209#comment-17099209 ] Hudson commented on HDFS-15323: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18214 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18214/]) HDFS-15323. StandbyNode fails transition to active due to insufficient (shv: rev ebb878bab991c242b5089a18881aa10abf318ea0) * (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyInProgressTail.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/EditLogTailer.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumJournalManager.java > StandbyNode fails transition to active due to insufficient transaction tailing > -- > > Key: HDFS-15323 > URL: https://issues.apache.org/jira/browse/HDFS-15323 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode, qjm >Affects Versions: 2.7.7 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko >Priority: Major > Attachments: HDFS-15323-branch-2.10.002.patch, > HDFS-15323.000.unitTest.patch, HDFS-15323.001.patch, HDFS-15323.002.patch > > > StandbyNode is asked to {{transitionToActive()}}. If it fell too far behind > in tailing journal transaction (from QJM) it can crash with > {{IllegalStateException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13904) ContentSummary does not always respect processing limit, resulting in long lock acquisitions
[ https://issues.apache.org/jira/browse/HDFS-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099204#comment-17099204 ] Uma Maheswara Rao G commented on HDFS-13904: Hi [~xkrogen], any updates on this? Just a question: the GC pause monitor is not reporting any pauses, right? Did you have consistent load on the NN in the two restart scenarios mentioned above? It is interesting that after one restart the NN started reporting long lock holds, while after the other it did not. Did you try applying the above proposed fix in your clusters? > ContentSummary does not always respect processing limit, resulting in long > lock acquisitions > > > Key: HDFS-13904 > URL: https://issues.apache.org/jira/browse/HDFS-13904 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs, namenode >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > > HDFS-4995 added a config {{dfs.content-summary.limit}} which allows for an > administrator to set a limit on the number of entries processed during a > single acquisition of the {{FSNamesystemLock}} during the creation of a > content summary. This is useful to prevent very long (multiple seconds) > pauses on the NameNode when {{getContentSummary}} is called on large > directories. 
> However, even on versions with HDFS-4995, we have seen warnings like: > {code} > INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem read > lock held for 9398 ms via > java.lang.Thread.getStackTrace(Thread.java:1552) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:950) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.readUnlock(FSNamesystemLock.java:188) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.readUnlock(FSNamesystem.java:1486) > org.apache.hadoop.hdfs.server.namenode.ContentSummaryComputationContext.yield(ContentSummaryComputationContext.java:109) > org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:679) > org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:642) > org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:656) > {code} > happen quite consistently when {{getContentSummary}} was called on a large > directory on a heavily-loaded NameNode. Such long pauses completely destroy > the performance of the NameNode. We have the limit set to its default of > 5000; if it was respected, clearly there would not be a 10-second pause. > The current {{yield()}} code within {{ContentSummaryComputationContext}} > looks like: > {code} > public boolean yield() { > // Are we set up to do this? > if (limitPerRun <= 0 || dir == null || fsn == null) { > return false; > } > // Have we reached the limit? 
> long currentCount = counts.getFileCount() + > counts.getSymlinkCount() + > counts.getDirectoryCount() + > counts.getSnapshotableDirectoryCount(); > if (currentCount <= nextCountLimit) { > return false; > } > // Update the next limit > nextCountLimit = currentCount + limitPerRun; > boolean hadDirReadLock = dir.hasReadLock(); > boolean hadDirWriteLock = dir.hasWriteLock(); > boolean hadFsnReadLock = fsn.hasReadLock(); > boolean hadFsnWriteLock = fsn.hasWriteLock(); > // sanity check. > if (!hadDirReadLock || !hadFsnReadLock || hadDirWriteLock || > hadFsnWriteLock || dir.getReadHoldCount() != 1 || > fsn.getReadHoldCount() != 1) { > // cannot relinquish > return false; > } > // unlock > dir.readUnlock(); > fsn.readUnlock("contentSummary"); > try { > Thread.sleep(sleepMilliSec, sleepNanoSec); > } catch (InterruptedException ie) { > } finally { > // reacquire > fsn.readLock(); > dir.readLock(); > } > yieldCount++; > return true; > } > {code} > We believe that this check in particular is the culprit: > {code} > if (!hadDirReadLock || !hadFsnReadLock || hadDirWriteLock || > hadFsnWriteLock || dir.getReadHoldCount() != 1 || > fsn.getReadHoldCount() != 1) { > // cannot relinquish > return false; > } > {code} > The content summary computation will only relinquish the lock if it is > currently the _only_ holder of the lock. Given the high volume of read > requests on a heavily loaded NameNode, especially when unfair locking is > enabled, it is likely there may be another holder of the read lock performing > some short-lived operation. By refusing to give up the lock in this case, the > content summary computation ends up never relinquishing the lock. > We propose to simply remove the readHoldCount checks from this {{yield()}}. >
[jira] [Created] (HDFS-15332) Quota Space consumed was wrong in truncate with Snapshots
hemanthboyina created HDFS-15332: Summary: Quota Space consumed was wrong in truncate with Snapshots Key: HDFS-15332 URL: https://issues.apache.org/jira/browse/HDFS-15332 Project: Hadoop HDFS Issue Type: Bug Reporter: hemanthboyina Assignee: hemanthboyina -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15323) StandbyNode fails transition to active due to insufficient transaction tailing
[ https://issues.apache.org/jira/browse/HDFS-15323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099198#comment-17099198 ] Konstantin Shvachko commented on HDFS-15323: Thanks [~ayushtkn] and [~xkrogen] for prompt reviews. I just committed this to tunk, and branches 3.3, 3.2, 3.1, 2.10. I lost track of ongoing releases, please cherry pick this to respective branches. > StandbyNode fails transition to active due to insufficient transaction tailing > -- > > Key: HDFS-15323 > URL: https://issues.apache.org/jira/browse/HDFS-15323 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode, qjm >Affects Versions: 2.7.7 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko >Priority: Major > Attachments: HDFS-15323-branch-2.10.002.patch, > HDFS-15323.000.unitTest.patch, HDFS-15323.001.patch, HDFS-15323.002.patch > > > StandbyNode is asked to {{transitionToActive()}}. If it fell too far behind > in tailing journal transaction (from QJM) it can crash with > {{IllegalStateException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15255) Consider StorageType when DatanodeManager#sortLocatedBlock()
[ https://issues.apache.org/jira/browse/HDFS-15255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099125#comment-17099125 ] Hadoop QA commented on HDFS-15255: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 52s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 1s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 3s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 12s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 22m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 3m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 5m 15s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 26m 49s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 46s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 38s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 11m 10s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 26s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 5s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 21m 5s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 21m 5s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 3m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 5m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 18s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 34s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 3m 14s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs-client generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 4m 13s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 9m 42s{color} | {color:red} hadoop-common in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 2m 10s{color} | {color:red} hadoop-hdfs-client in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red}112m 39s{color} | {color:red} hadoop-hdfs in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 7s{color} | {color:green} hadoop-hdfs-rbf in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 59s{color} | {color:green} The patch does not generate ASF License
[jira] [Commented] (HDFS-12288) Fix DataNode's xceiver count calculation
[ https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099081#comment-17099081 ] Hadoop QA commented on HDFS-12288: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 20s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 59s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 51s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 15s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 47s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 42s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 3m 3s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 1s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 9s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 19s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 6s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}117m 6s{color} | {color:red} hadoop-hdfs in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 41s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}189m 2s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.TestGetFileChecksum | | | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier | | | hadoop.hdfs.server.balancer.TestBalancerWithMultipleNameNodes | | | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy | | | hadoop.hdfs.TestReconstructStripedFile | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-HDFS-Build/29234/artifact/out/Dockerfile | | JIRA Issue | HDFS-12288 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13001992/HDFS-12288.008.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 5d4cc3db6717 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 8dace8ff3a9 | | Default Java | Private
[jira] [Commented] (HDFS-15331) Remove invalid exclusions that minicluster dependency on HDFS
[ https://issues.apache.org/jira/browse/HDFS-15331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099033#comment-17099033 ] Wanqiang Ji commented on HDFS-15331: [https://github.com/apache/hadoop/pull/1996] > Remove invalid exclusions that minicluster dependency on HDFS > - > > Key: HDFS-15331 > URL: https://issues.apache.org/jira/browse/HDFS-15331 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Wanqiang Ji >Assignee: Wanqiang Ji >Priority: Major > > Ozone has been split into an independent repo, but the now-invalid exclusions (kubernetes > client) in minicluster's dependency on HDFS are kept. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15331) Remove invalid exclusions that minicluster dependency on HDFS
[ https://issues.apache.org/jira/browse/HDFS-15331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wanqiang Ji updated HDFS-15331: --- Status: Patch Available (was: Open) > Remove invalid exclusions that minicluster dependency on HDFS > - > > Key: HDFS-15331 > URL: https://issues.apache.org/jira/browse/HDFS-15331 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Wanqiang Ji >Assignee: Wanqiang Ji >Priority: Major > > Ozone has been split into an independent repo, but the now-invalid exclusions (kubernetes > client) in minicluster's dependency on HDFS are kept. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15331) Remove invalid exclusions that minicluster dependency on HDFS
Wanqiang Ji created HDFS-15331: -- Summary: Remove invalid exclusions that minicluster dependency on HDFS Key: HDFS-15331 URL: https://issues.apache.org/jira/browse/HDFS-15331 Project: Hadoop HDFS Issue Type: Bug Reporter: Wanqiang Ji Assignee: Wanqiang Ji Ozone has been split into an independent repo, but the now-invalid exclusions (kubernetes client) in minicluster's dependency on HDFS are kept. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
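As a rough illustration of what such a stale exclusion looks like in a POM (the artifact coordinates below are an assumption for the sketch, not copied from the actual hadoop-minicluster pom):

```xml
<!-- Hypothetical shape of the stale exclusion HDFS-15331 removes: after
     Ozone moved to its own repository, the kubernetes client is no longer
     pulled in transitively through hadoop-hdfs, so an exclusion like this
     is dead weight. Coordinates here are illustrative only. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <exclusions>
    <exclusion>
      <groupId>io.kubernetes</groupId>
      <artifactId>client-java</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Once the excluded artifact no longer appears in the dependency tree at all, the exclusion has no effect and only adds noise to the declaration.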
[jira] [Commented] (HDFS-15323) StandbyNode fails transition to active due to insufficient transaction tailing
[ https://issues.apache.org/jira/browse/HDFS-15323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099021#comment-17099021 ] Erik Krogen commented on HDFS-15323: +1 pretty simple fix, LGTM. > StandbyNode fails transition to active due to insufficient transaction tailing > -- > > Key: HDFS-15323 > URL: https://issues.apache.org/jira/browse/HDFS-15323 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode, qjm >Affects Versions: 2.7.7 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko >Priority: Major > Attachments: HDFS-15323-branch-2.10.002.patch, > HDFS-15323.000.unitTest.patch, HDFS-15323.001.patch, HDFS-15323.002.patch > > > StandbyNode is asked to {{transitionToActive()}}. If it fell too far behind > in tailing journal transactions (from the QJM) it can crash with > {{IllegalStateException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
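A minimal model of this failure mode, with names invented for the sketch (the real logic spans the NameNode's edit-log tailing and failover paths and is considerably more involved): the standby can only become active once every journaled transaction has been applied, so catching up first avoids the crash.

```java
// Toy model of HDFS-15323's failure mode. All names are illustrative
// assumptions, not the actual NameNode code.
public class FailoverModel {
    long lastAppliedTxId;   // highest transaction the standby has applied
    long lastJournalTxId;   // highest transaction written to the journal

    FailoverModel(long applied, long journaled) {
        this.lastAppliedTxId = applied;
        this.lastJournalTxId = journaled;
    }

    // Transitioning to active without first catching up on the remaining
    // edits trips the consistency check, mirroring the reported crash.
    void transitionToActive(boolean catchUpFirst) {
        if (catchUpFirst) {
            lastAppliedTxId = lastJournalTxId; // tail and apply the backlog
        }
        if (lastAppliedTxId != lastJournalTxId) {
            throw new IllegalStateException("standby is behind: applied="
                + lastAppliedTxId + ", journaled=" + lastJournalTxId);
        }
    }
}
```

The fix's intent, in these terms, is to make sure the catch-up step always happens before the consistency check rather than to weaken the check itself.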
[jira] [Assigned] (HDFS-12288) Fix DataNode's xceiver count calculation
[ https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun reassigned HDFS-12288: -- Assignee: Lisheng Sun (was: Chen Zhang) > Fix DataNode's xceiver count calculation > > > Key: HDFS-12288 > URL: https://issues.apache.org/jira/browse/HDFS-12288 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, hdfs >Reporter: Lukas Majercak >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch, > HDFS-12288.003.patch, HDFS-12288.004.patch, HDFS-12288.005.patch, > HDFS-12288.006.patch, HDFS-12288.007.patch, HDFS-12288.008.patch > > > The problem with the ThreadGroup.activeCount() method is that the method is > only a very rough estimate, and in reality returns the total number of > threads in the thread group as opposed to the threads actually running. > In some DNs, we saw this return ~50 for a long time, even though the > actual number of DataXceiver threads was next to none. > This is a big issue as we use the xceiverCount to make decisions on the NN > for choosing a replication source DN or returning DNs to clients for R/W. > The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value, > which only accounts for the actual number of DataXceiver threads currently > running and thus represents the load on the DN much better. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12288) Fix DataNode's xceiver count calculation
[ https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-12288: --- Attachment: HDFS-12288.008.patch > Fix DataNode's xceiver count calculation > > > Key: HDFS-12288 > URL: https://issues.apache.org/jira/browse/HDFS-12288 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, hdfs >Reporter: Lukas Majercak >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch, > HDFS-12288.003.patch, HDFS-12288.004.patch, HDFS-12288.005.patch, > HDFS-12288.006.patch, HDFS-12288.007.patch, HDFS-12288.008.patch > > > The problem with the ThreadGroup.activeCount() method is that the method is > only a very rough estimate, and in reality returns the total number of > threads in the thread group as opposed to the threads actually running. > In some DNs, we saw this return ~50 for a long time, even though the > actual number of DataXceiver threads was next to none. > This is a big issue as we use the xceiverCount to make decisions on the NN > for choosing a replication source DN or returning DNs to clients for R/W. > The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value, > which only accounts for the actual number of DataXceiver threads currently > running and thus represents the load on the DN much better. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-12288) Fix DataNode's xceiver count calculation
[ https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098889#comment-17098889 ] Lisheng Sun commented on HDFS-12288: Hi [~zhangchen], are you still working on this JIRA? If not, I will take it over. Hope you don't mind. > Fix DataNode's xceiver count calculation > > > Key: HDFS-12288 > URL: https://issues.apache.org/jira/browse/HDFS-12288 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, hdfs >Reporter: Lukas Majercak >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch, > HDFS-12288.003.patch, HDFS-12288.004.patch, HDFS-12288.005.patch, > HDFS-12288.006.patch, HDFS-12288.007.patch > > > The problem with the ThreadGroup.activeCount() method is that the method is > only a very rough estimate, and in reality returns the total number of > threads in the thread group as opposed to the threads actually running. > In some DNs, we saw this return ~50 for a long time, even though the > actual number of DataXceiver threads was next to none. > This is a big issue as we use the xceiverCount to make decisions on the NN > for choosing a replication source DN or returning DNs to clients for R/W. > The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value, > which only accounts for the actual number of DataXceiver threads currently > running and thus represents the load on the DN much better. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
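The approach the issue describes, replacing the ThreadGroup-based estimate with an explicit counter of running xceivers, can be sketched roughly as follows. This is an illustrative model, not the actual DataNode code; class and method names are assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: ThreadGroup.activeCount() counts every live thread in
// the group (including idle ones), while a dedicated counter, bumped only
// while an xceiver is actually serving a request, reflects real load.
// Names here are assumptions, not the real HDFS identifiers.
public class XceiverCounter {
    private final AtomicInteger activeXceivers = new AtomicInteger(0);

    // Called when a DataXceiver begins handling an operation.
    public void xceiverStarted() {
        activeXceivers.incrementAndGet();
    }

    // Called when the operation completes, successfully or not.
    public void xceiverFinished() {
        activeXceivers.decrementAndGet();
    }

    // The value a heartbeat would report to the NameNode as the DN's load.
    public int getActiveCount() {
        return activeXceivers.get();
    }
}
```

With this scheme an idle-but-alive thread contributes nothing to the reported count, so the NameNode's replication-source and read/write placement decisions see the DataNode's actual load rather than its thread population.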
[jira] [Commented] (HDFS-15255) Consider StorageType when DatanodeManager#sortLocatedBlock()
[ https://issues.apache.org/jira/browse/HDFS-15255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098868#comment-17098868 ] Lisheng Sun commented on HDFS-15255: Added the v007 patch. This patch removes equals and hashCode from DatanodeInfoWithStorage. > Consider StorageType when DatanodeManager#sortLocatedBlock() > > > Key: HDFS-15255 > URL: https://issues.apache.org/jira/browse/HDFS-15255 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15255-findbugs-test.001.patch, > HDFS-15255.001.patch, HDFS-15255.002.patch, HDFS-15255.003.patch, > HDFS-15255.004.patch, HDFS-15255.005.patch, HDFS-15255.006.patch, > HDFS-15255.007.patch, experiment-find-bugs.001.patch > > > Sometimes only one replica of a block is on SSD while the others are on HDD. > When the client reads the data, the current logic is that it considers the > distance between the client and the DN. I think it should also consider the > StorageType of the replica, preferring the node with the faster StorageType > when the distances are the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15255) Consider StorageType when DatanodeManager#sortLocatedBlock()
[ https://issues.apache.org/jira/browse/HDFS-15255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-15255: --- Attachment: HDFS-15255.007.patch > Consider StorageType when DatanodeManager#sortLocatedBlock() > > > Key: HDFS-15255 > URL: https://issues.apache.org/jira/browse/HDFS-15255 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15255-findbugs-test.001.patch, > HDFS-15255.001.patch, HDFS-15255.002.patch, HDFS-15255.003.patch, > HDFS-15255.004.patch, HDFS-15255.005.patch, HDFS-15255.006.patch, > HDFS-15255.007.patch, experiment-find-bugs.001.patch > > > Sometimes only one replica of a block is on SSD while the others are on HDD. > When the client reads the data, the current logic is that it considers the > distance between the client and the DN. I think it should also consider the > StorageType of the replica, preferring the node with the faster StorageType > when the distances are the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15255) Consider StorageType when DatanodeManager#sortLocatedBlock()
[ https://issues.apache.org/jira/browse/HDFS-15255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098838#comment-17098838 ] Stephen O'Donnell commented on HDFS-15255: -- [~leosun08] Patch 06 no longer applies to trunk, probably due to the other change related to this we committed last week. I asked some of my colleagues to check these findbugs warnings, and they both believed they can be ignored. Before setting an ignore annotation on the code, one person suggested just removing equals (and probably also hashCode) from DatanodeInfoWithStorage. All the class does in both of these methods is call super, and the normal inheritance chain will do that anyway. Could you try rebasing the 06 patch against trunk and then removing equals and hashCode from DatanodeInfoWithStorage, and let's see if that gets rid of the findbugs warning? > Consider StorageType when DatanodeManager#sortLocatedBlock() > > > Key: HDFS-15255 > URL: https://issues.apache.org/jira/browse/HDFS-15255 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15255-findbugs-test.001.patch, > HDFS-15255.001.patch, HDFS-15255.002.patch, HDFS-15255.003.patch, > HDFS-15255.004.patch, HDFS-15255.005.patch, HDFS-15255.006.patch, > experiment-find-bugs.001.patch > > > Sometimes only one replica of a block is on SSD while the others are on HDD. > When the client reads the data, the current logic is that it considers the > distance between the client and the DN. I think it should also consider the > StorageType of the replica, preferring the node with the faster StorageType > when the distances are the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
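The proposal, sort replica locations by network distance first and break ties by storage speed, can be sketched with a two-key comparator. This is an illustrative example, not the actual DatanodeManager#sortLocatedBlock() code; the enum ordering (faster media first) is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of distance-then-StorageType ordering for replica
// locations. Not the real DatanodeManager code; names are assumptions.
public class ReplicaSorter {
    // Ordered fastest-first so ordinal() doubles as a speed rank (assumed).
    enum StorageType { RAM_DISK, SSD, DISK, ARCHIVE }

    static class Replica {
        final String node;
        final int distance;      // network distance from the reading client
        final StorageType type;  // storage medium holding this replica

        Replica(String node, int distance, StorageType type) {
            this.node = node;
            this.distance = distance;
            this.type = type;
        }
    }

    // Primary key: network distance; tie-break: faster storage first.
    static void sort(List<Replica> replicas) {
        replicas.sort(Comparator
            .comparingInt((Replica r) -> r.distance)
            .thenComparingInt(r -> r.type.ordinal()));
    }
}
```

For two equally distant replicas, one on SSD and one on HDD, the SSD copy now sorts first, which is exactly the tie-break the issue description asks for; replicas at different distances are unaffected.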
[jira] [Commented] (HDFS-15310) RBF: Not proxy client's clientId and callId caused RetryCache invalid in NameNode.
[ https://issues.apache.org/jira/browse/HDFS-15310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098807#comment-17098807 ] Ayush Saxena commented on HDFS-15310: - [~hexiaoqiao] [~elgoiri] [~xuzq_zander] [~ferhui] I have raised this problem on the dev list. Please add anything I have missed, or anything that isn't clear. :) > RBF: Not proxy client's clientId and callId caused RetryCache invalid in > NameNode. > -- > > Key: HDFS-15310 > URL: https://issues.apache.org/jira/browse/HDFS-15310 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: xuzq >Assignee: xuzq >Priority: Critical > > RBF does not proxy the client's clientId and callId to the NameNode, which > invalidates the RetryCache in the NameNode, so some RPCs may fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15310) RBF: Not proxy client's clientId and callId caused RetryCache invalid in NameNode.
[ https://issues.apache.org/jira/browse/HDFS-15310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098782#comment-17098782 ] Ayush Saxena commented on HDFS-15310: - Thanks everyone. Yeah, this seems to have dragged on a bit too much; we discussed it in detail on the mailing list for the data locality problem, but the solution decided there didn't conclude due to security reasons. The data locality problem, I think, was still not causing serious issues, but for the retry cache one we need to do something, as it can have an impact on the overall consistency of the system. Anyway, I will start a discussion on the dev list. Hope we get a solution this time. > RBF: Not proxy client's clientId and callId caused RetryCache invalid in > NameNode. > -- > > Key: HDFS-15310 > URL: https://issues.apache.org/jira/browse/HDFS-15310 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: xuzq >Assignee: xuzq >Priority: Critical > > RBF does not proxy the client's clientId and callId to the NameNode, which > invalidates the RetryCache in the NameNode, so some RPCs may fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
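The retry-cache problem under discussion can be modeled in a few lines. This is a toy model, not the real NameNode RetryCache implementation; the point it illustrates is only that the cache is keyed by the caller's (clientId, callId), so a Router that substitutes its own identifiers defeats the cache on client retries.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a NameNode-side retry cache keyed by (clientId, callId).
// If a retried call arrives under the same key, the cached response is
// replayed instead of re-executing the (possibly non-idempotent) operation.
public class RetryCacheModel {
    private final Map<String, String> cache = new HashMap<>();

    // Executes the operation only the first time this (clientId, callId)
    // pair is seen; later arrivals with the same key get the original
    // response back instead of responseIfExecuted.
    public String invoke(String clientId, int callId, String responseIfExecuted) {
        String key = clientId + "#" + callId;
        return cache.computeIfAbsent(key, k -> responseIfExecuted);
    }
}
```

When the Router forwards a client retry under its own freshly generated ids, the key never matches the first attempt's entry, the operation runs twice, and a retried create or rename can surface a spurious "already exists" failure to the client.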