[jira] [Commented] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test

2019-09-21 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935051#comment-16935051
 ] 

He Xiaoqiao commented on HDFS-14461:


{quote}In most cases I like static imports but for 
SecurityConfUtil#initSecurity() and SecurityConfUtil#destroy() I would keep the 
full call.{quote}
It makes sense. [^HDFS-14461.005.patch] uses full calls for 
SecurityConfUtil#{initSecurity,destroy}. Please help take another review, 
[~elgoiri]. Thanks.
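
For illustration, a minimal, self-contained sketch of the full-call style being adopted; the SecurityConfUtil below is a hypothetical stand-in that only mirrors the shape of the real hadoop-hdfs-rbf test utility:

{code:java}
// Hedged sketch: SecurityConfUtil here is a hypothetical stand-in so the
// example compiles on its own; the real class lives in the RBF test code.
public class FullCallStyleSketch {

  static class SecurityConfUtil {
    static void initSecurity() { /* real impl logs in from a test keytab */ }
    static void destroy()      { /* real impl resets the shared login state */ }
  }

  public static void main(String[] args) {
    // Full calls (rather than static imports of initSecurity/destroy) keep
    // the origin of the security lifecycle obvious at each call site.
    SecurityConfUtil.initSecurity();
    try {
      // ... secured test body ...
    } finally {
      SecurityConfUtil.destroy();
    }
  }
}
{code}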

> RBF: Fix intermittently failing kerberos related unit test
> --
>
> Key: HDFS-14461
> URL: https://issues.apache.org/jira/browse/HDFS-14461
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14461.001.patch, HDFS-14461.002.patch, 
> HDFS-14461.003.patch, HDFS-14461.004.patch, HDFS-14461.005.patch
>
>
> TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It 
> may be due to some race condition before using the keytab that's created for 
> testing.
>  
> {code:java}
>  Failed
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken
>  Failing for the past 1 build (Since 
> [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png! 
> #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] )
>  [Took 89 
> ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history]
>   
>  Error Message
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED
> h3. Stacktrace
> org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) 
> at 
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.setup(TestRouterHttpDelegationToken.java:99)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363) at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) 
> at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) 
> 

[jira] [Updated] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test

2019-09-21 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14461:
---
Attachment: HDFS-14461.005.patch

> RBF: Fix intermittently failing kerberos related unit test
> --
>
> Key: HDFS-14461
> URL: https://issues.apache.org/jira/browse/HDFS-14461
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14461.001.patch, HDFS-14461.002.patch, 
> HDFS-14461.003.patch, HDFS-14461.004.patch, HDFS-14461.005.patch
>
>
> TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It 
> may be due to some race condition before using the keytab that's created for 
> testing.
>  
> {code:java}
>  Failed
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken
>  Failing for the past 1 build (Since 
> [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png! 
> #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] )
>  [Took 89 
> ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history]
>   
>  Error Message
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED
> h3. Stacktrace
> org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) 
> at 
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.setup(TestRouterHttpDelegationToken.java:99)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363) at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) 
> at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) 
> Caused by: org.apache.hadoop.security.KerberosAuthException: failure to 
> login: for principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted 

[jira] [Commented] (HDFS-9668) Optimize the locking in FsDatasetImpl

2019-09-20 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934472#comment-16934472
 ] 

He Xiaoqiao commented on HDFS-9668:
---

Hi [~zhangchen], [~jojochuang], is there any plan to push this feature 
forward? I am also wondering whether there are any tests, benchmarks, or other 
ways to evaluate this improvement.

> Optimize the locking in FsDatasetImpl
> -
>
> Key: HDFS-9668
> URL: https://issues.apache.org/jira/browse/HDFS-9668
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Jingcheng Du
>Assignee: Jingcheng Du
>Priority: Major
> Attachments: HDFS-9668-1.patch, HDFS-9668-10.patch, 
> HDFS-9668-11.patch, HDFS-9668-12.patch, HDFS-9668-13.patch, 
> HDFS-9668-14.patch, HDFS-9668-14.patch, HDFS-9668-15.patch, 
> HDFS-9668-16.patch, HDFS-9668-17.patch, HDFS-9668-18.patch, 
> HDFS-9668-19.patch, HDFS-9668-19.patch, HDFS-9668-2.patch, 
> HDFS-9668-20.patch, HDFS-9668-21.patch, HDFS-9668-22.patch, 
> HDFS-9668-23.patch, HDFS-9668-23.patch, HDFS-9668-24.patch, 
> HDFS-9668-25.patch, HDFS-9668-26.patch, HDFS-9668-3.patch, HDFS-9668-4.patch, 
> HDFS-9668-5.patch, HDFS-9668-6.patch, HDFS-9668-7.patch, HDFS-9668-8.patch, 
> HDFS-9668-9.patch, execution_time.png
>
>
> During the HBase test on a tiered storage of HDFS (WAL is stored in 
> SSD/RAMDISK, and all other files are stored in HDD), we observe many 
> long-time BLOCKED threads on FsDatasetImpl in DataNode. The following is part 
> of the jstack result:
> {noformat}
> "DataXceiver for client DFSClient_NONMAPREDUCE_-1626037897_1 at 
> /192.168.50.16:48521 [Receiving block 
> BP-1042877462-192.168.50.13-1446173170517:blk_1073779272_40852]" - Thread 
> t@93336
>java.lang.Thread.State: BLOCKED
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:)
>   - waiting to lock <18324c9> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) owned by 
> "DataXceiver for client DFSClient_NONMAPREDUCE_-1626037897_1 at 
> /192.168.50.16:48520 [Receiving block 
> BP-1042877462-192.168.50.13-1446173170517:blk_1073779271_40851]" t@93335
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:113)
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:183)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:615)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
>   at java.lang.Thread.run(Thread.java:745)
>Locked ownable synchronizers:
>   - None
>   
> "DataXceiver for client DFSClient_NONMAPREDUCE_-1626037897_1 at 
> /192.168.50.16:48520 [Receiving block 
> BP-1042877462-192.168.50.13-1446173170517:blk_1073779271_40851]" - Thread 
> t@93335
>java.lang.Thread.State: RUNNABLE
>   at java.io.UnixFileSystem.createFileExclusively(Native Method)
>   at java.io.File.createNewFile(File.java:1012)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:66)
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:271)
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:286)
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:1140)
>   - locked <18324c9> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:113)
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:183)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:615)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
>   at java.lang.Thread.run(Thread.java:745)
>Locked ownable synchronizers:
>   - None
> {noformat}
> We measured the execution of some operations in FsDatasetImpl during the 
> test. Here following is the result.
> !execution_time.png!
> The operations of finalizeBlock, addBlock and createRbw on HDD in a heavy 
> load take a really long time.
> It means one slow operation of finalizeBlock, addBlock and createRbw in a 
> slow 

[jira] [Commented] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test

2019-09-20 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934447#comment-16934447
 ] 

He Xiaoqiao commented on HDFS-14461:


[^HDFS-14461.004.patch] tries to fix the checkstyle warnings.

> RBF: Fix intermittently failing kerberos related unit test
> --
>
> Key: HDFS-14461
> URL: https://issues.apache.org/jira/browse/HDFS-14461
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14461.001.patch, HDFS-14461.002.patch, 
> HDFS-14461.003.patch, HDFS-14461.004.patch
>
>
> TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It 
> may be due to some race condition before using the keytab that's created for 
> testing.
>  
> {code:java}
>  Failed
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken
>  Failing for the past 1 build (Since 
> [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png! 
> #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] )
>  [Took 89 
> ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history]
>   
>  Error Message
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED
> h3. Stacktrace
> org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) 
> at 
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.setup(TestRouterHttpDelegationToken.java:99)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363) at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) 
> at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) 
> Caused by: org.apache.hadoop.security.KerberosAuthException: failure to 
> login: for principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: 

[jira] [Updated] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test

2019-09-20 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14461:
---
Attachment: HDFS-14461.004.patch

> RBF: Fix intermittently failing kerberos related unit test
> --
>
> Key: HDFS-14461
> URL: https://issues.apache.org/jira/browse/HDFS-14461
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14461.001.patch, HDFS-14461.002.patch, 
> HDFS-14461.003.patch, HDFS-14461.004.patch
>
>
> TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It 
> may be due to some race condition before using the keytab that's created for 
> testing.
>  
> {code:java}
>  Failed
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken
>  Failing for the past 1 build (Since 
> [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png! 
> #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] )
>  [Took 89 
> ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history]
>   
>  Error Message
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED
> h3. Stacktrace
> org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) 
> at 
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.setup(TestRouterHttpDelegationToken.java:99)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363) at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) 
> at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) 
> Caused by: org.apache.hadoop.security.KerberosAuthException: failure to 
> login: for principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) 

[jira] [Commented] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-09-20 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934432#comment-16934432
 ] 

He Xiaoqiao commented on HDFS-14771:


Thanks [~sodonnell] for the reminder. I have just corrected the release note. 
Please help double-check. Thanks.

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>  Labels: release-blocker
> Fix For: 2.10.0
>
> Attachments: HDFS-14771.branch-2.001.patch, 
> HDFS-14771.branch-2.002.patch, HDFS-14771.branch-2.003.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: fsimage load time by 
> writing sub-sections to the fsimage index.






[jira] [Updated] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-09-20 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14771:
---
Release Note: 
This change allows the inode and inode directory sections of the fsimage to be 
loaded in parallel. Tests on large images have shown this change to reduce the 
image load time to about 50% of the pre-change run time.

It works by writing sub-section entries to the image index, effectively 
splitting each image section into many sub-sections which can be processed in 
parallel. By default 12 sub-sections per image section are created when the 
image is saved, and 4 threads are used to load the image at startup.

This is disabled by default for any image with more than 1M inodes 
(dfs.image.parallel.inode.threshold) and can be enabled by setting 
dfs.image.parallel.load to true. When the feature is enabled, the next HDFS 
checkpoint will write the image sub-sections and subsequent namenode restarts 
can load the image in parallel.

An image with the parallel sections can be read even if the feature is disabled, 
but HDFS versions without this Jira cannot load an image with parallel 
sections. OIV can process a parallel-enabled image without issues.

Key configuration parameters are:

dfs.image.parallel.load=false - enable or disable the feature

dfs.image.parallel.target.sections = 12 - The target number of sub-sections. Aim 
for 2 to 3 times the number of dfs.image.parallel.threads.

dfs.image.parallel.inode.threshold = 1000000 - Only save and load in parallel 
if the image has more than this number of inodes.

dfs.image.parallel.threads = 4 - The number of threads used to load the image. 
Testing has shown 4 to be optimal, but this may depend on the environment.

UPGRADE WARNING: 
1. Upgrading smoothly from 2.10 to 3.* is possible if this feature has never 
been enabled.
2. If the fsimage parallel loading feature is enabled, the only upgrade path 
from 2.10 is currently to 3.3.
3. To upgrade from 2.10 to an earlier 3.* release (3.1.*/3.2.*), make sure 
at least one fsimage file is saved after disabling this feature: change the 
configuration parameter first (dfs.image.parallel.load=false) and restart the 
namenode before the upgrade operation.

  was:
This change allows the inode and inode directory sections of the fsimage to be 
loaded in parallel. Tests on large images have shown this change to reduce the 
image load time to about 50% of the pre-change run time.

It works by writing sub-section entries to the image index, effectively 
splitting each image section into many sub-sections which can be processed in 
parallel. By default 12 sub-sections per image section are created when the 
image is saved, and 4 threads are used to load the image at startup.

This is enabled by default for any image with more than 1M inodes 
(dfs.image.parallel.inode.threshold) but can be disabled by setting 
dfs.image.parallel.load to false. When the feature is enabled, the next HDFS 
checkpoint will write the image sub-sections and subsequent namenode restarts 
can load the image in parallel.

An image with the parallel sections can be read even if the feature is disabled, 
but HDFS versions without this Jira cannot load an image with parallel 
sections. OIV can process a parallel-enabled image without issues.

Key configuration parameters are:

dfs.image.parallel.load=true - enable or disable the feature

dfs.image.parallel.target.sections = 12 - The target number of sub-sections. Aim 
for 2 to 3 times the number of dfs.image.parallel.threads.

dfs.image.parallel.inode.threshold = 1000000 - Only save and load in parallel 
if the image has more than this number of inodes.

dfs.image.parallel.threads = 4 - The number of threads used to load the image. 
Testing has shown 4 to be optimal, but this may depend on the environment.

UPGRADE WARNING: 
1. Upgrading smoothly from 2.10 to 3.* is possible if this feature has never 
been enabled.
2. If the fsimage parallel loading feature is enabled, the only upgrade path 
from 2.10 is currently to 3.3.
3. To upgrade from 2.10 to an earlier 3.* release, make sure at least one 
fsimage file is saved after disabling this feature: change the configuration 
parameter first and restart the namenode before the upgrade operation.
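
For reference, a hedged sketch of setting the keys from the corrected note above programmatically, using the standard Hadoop Configuration API (values are the documented defaults/recommendations, shown here as an opt-in example):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class ParallelImageLoadConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Opt in to parallel fsimage save/load (disabled by default in this backport).
    conf.setBoolean("dfs.image.parallel.load", true);
    // Aim for 2 to 3 times the thread count, per the note above.
    conf.setInt("dfs.image.parallel.target.sections", 12);
    conf.setInt("dfs.image.parallel.threads", 4);
    // Only images above this inode count are saved/loaded in parallel.
    conf.setLong("dfs.image.parallel.inode.threshold", 1000000L);
  }
}
{code}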


> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>  Labels: release-blocker
> Fix For: 2.10.0
>
> Attachments: HDFS-14771.branch-2.001.patch, 
> 

[jira] [Updated] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-09-20 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14771:
---
Release Note: 
This change allows the inode and inode directory sections of the fsimage to be 
loaded in parallel. Tests on large images have shown this change to reduce the 
image load time to about 50% of the pre-change run time.

It works by writing sub-section entries to the image index, effectively 
splitting each image section into many sub-sections which can be processed in 
parallel. By default 12 sub-sections per image section are created when the 
image is saved, and 4 threads are used to load the image at startup.

This is enabled by default for any image with more than 1M inodes 
(dfs.image.parallel.inode.threshold) but can be disabled by setting 
dfs.image.parallel.load to false. When the feature is enabled, the next HDFS 
checkpoint will write the image sub-sections and subsequent namenode restarts 
can load the image in parallel.

An image with the parallel sections can be read even if the feature is disabled, 
but HDFS versions without this Jira cannot load an image with parallel 
sections. OIV can process a parallel-enabled image without issues.

Key configuration parameters are:

dfs.image.parallel.load=true - enable or disable the feature

dfs.image.parallel.target.sections = 12 - The target number of sub-sections. Aim 
for 2 to 3 times the number of dfs.image.parallel.threads.

dfs.image.parallel.inode.threshold = 1000000 - Only save and load in parallel 
if the image has more than this number of inodes.

dfs.image.parallel.threads = 4 - The number of threads used to load the image. 
Testing has shown 4 to be optimal, but this may depend on the environment.

UPGRADE WARNING: 
1. Upgrading smoothly from 2.10 to 3.* is possible if this feature has never 
been enabled.
2. If the fsimage parallel loading feature is enabled, the only upgrade path 
from 2.10 is currently to 3.3.
3. To upgrade from 2.10 to an earlier 3.* release, make sure at least one 
fsimage file is saved after disabling this feature: change the configuration 
parameter first and restart the namenode before the upgrade operation.

Provided a release note similar to the HDFS-14617 note, with upgrade warnings 
added. [~xkrogen], [~sodonnell], [~jojochuang], please help take a look. Thanks.

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>  Labels: release-blocker
> Fix For: 2.10.0
>
> Attachments: HDFS-14771.branch-2.001.patch, 
> HDFS-14771.branch-2.002.patch, HDFS-14771.branch-2.003.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: fsimage load time by 
> writing sub-sections to the fsimage index.






[jira] [Commented] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test

2019-09-20 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934239#comment-16934239
 ] 

He Xiaoqiao commented on HDFS-14461:


Thanks [~elgoiri], [~crh] for mentioning HDFS-14609; it is a more graceful 
improvement (thanks [~zhangchen] for the contribution). I just ran 
TestRouterHttpDelegationToken and the other related unit tests; all passed, and 
the results are as expected with HDFS-14609. Please help double-check.
[^HDFS-14461.003.patch] tries to unify initializing the security context and 
destroying it, in order to keep the unit tests from interfering with each other.
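
A hedged sketch of that unified lifecycle, assuming each security-related test class initializes the shared context once and always destroys it afterwards (JUnit 4 style, as these tests use; class and method names are illustrative):

{code:java}
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

// Illustrative skeleton: every security-related test class follows the same
// init/destroy lifecycle so no stale login state leaks into later classes.
public class SecurityLifecycleSketch {

  @BeforeClass
  public static void initSecurityContext() throws Exception {
    // e.g. SecurityConfUtil.initSecurity() in the real tests
  }

  @AfterClass
  public static void destroySecurityContext() throws Exception {
    // e.g. SecurityConfUtil.destroy() in the real tests, so UGI/keytab state
    // from this class cannot affect whichever test class runs next
  }

  @Test
  public void testSecuredOperation() {
    // runs against the freshly initialized security context
  }
}
{code}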

> RBF: Fix intermittently failing kerberos related unit test
> --
>
> Key: HDFS-14461
> URL: https://issues.apache.org/jira/browse/HDFS-14461
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14461.001.patch, HDFS-14461.002.patch, 
> HDFS-14461.003.patch
>
>
> TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It 
> may be due to some race condition before using the keytab that's created for 
> testing.
>  
> {code:java}
>  Failed
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken
>  Failing for the past 1 build (Since 
> [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png! 
> #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] )
>  [Took 89 
> ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history]
>   
>  Error Message
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED
> h3. Stacktrace
> org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) 
> at 
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.setup(TestRouterHttpDelegationToken.java:99)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363) at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) 
> at 

[jira] [Updated] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test

2019-09-20 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14461:
---
Attachment: HDFS-14461.003.patch

> RBF: Fix intermittently failing kerberos related unit test
> --
>
> Key: HDFS-14461
> URL: https://issues.apache.org/jira/browse/HDFS-14461
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14461.001.patch, HDFS-14461.002.patch, 
> HDFS-14461.003.patch
>
>
> TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It 
> may be due to some race condition before using the keytab that's created for 
> testing.
>  
> {code:java}
>  Failed
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken
>  Failing for the past 1 build (Since 
> [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png! 
> #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] )
>  [Took 89 
> ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history]
>   
>  Error Message
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED
> h3. Stacktrace
> org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) 
> at 
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.setup(TestRouterHttpDelegationToken.java:99)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363) at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) 
> at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) 
> Caused by: org.apache.hadoop.security.KerberosAuthException: failure to 
> login: for principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED at 
> 

[jira] [Commented] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-09-18 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932350#comment-16932350
 ] 

He Xiaoqiao commented on HDFS-14771:


Thanks [~jojochuang], and sorry for the late response. 
{quote}Once this features goes into branch-2 (which will become 2.10 release), 
and that the feature is enabled, I assume the only upgrade path is 2.10 --> 
3.3, correct?{quote}
Yes, that is correct. This is the only upgrade path until some tool or 
backward-compatible solution is offered. Do we need more release notes or a 
guide doc?

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>  Labels: release-blocker
> Fix For: 2.10.0
>
> Attachments: HDFS-14771.branch-2.001.patch, 
> HDFS-14771.branch-2.002.patch, HDFS-14771.branch-2.003.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: fsimage load time by 
> writing sub-sections to the fsimage index.






[jira] [Commented] (HDFS-14437) Exception happened when rollEditLog expects empty EditsDoubleBuffer.bufCurrent but not

2019-09-18 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932246#comment-16932246
 ] 

He Xiaoqiao commented on HDFS-14437:


[~daryn], [~jojochuang], [~linyiqun], [~xkrogen], [~elgoiri], would you be 
interested in this case and able to offer some more suggestions and help?
I would like to add some more information:
1. The FSEditLog class has three very frequently used interfaces: #logEdit, 
#rollEditLog, and #logSync.
2. Among them, #logEdit and #rollEditLog hold the FSN lock to control 
concurrency; however, #logSync does not hold the FSN lock, so we have to rely 
on FSEditLog's own `synchronized` sections (a simplified sketch follows this 
comment).
3. Based on the test case [^HDFS-14437.reproduction.patch], which reproduces 
this issue stably, wait/notifyAll is not completely safe for this case, as 
[~angerszhuuu]'s detailed digging and analysis above show.
Any discussions or corrections are very welcome. Thanks, guys.
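
To make that concurrency shape concrete, here is a stripped-down model of the three-step double-buffer sync described in the logSync() comment; this is a simplified illustration, not the real FSEditLog (txid bookkeeping and error handling are omitted):

{code:java}
// Simplified model of FSEditLog's double-buffer sync; names illustrative.
public class DoubleBufferSyncSketch {
  private final StringBuilder bufCurrent = new StringBuilder();
  private volatile boolean isSyncRunning = false;

  // Writers append under the monitor (the logEdit analogue).
  public synchronized void logEdit(String op) {
    bufCurrent.append(op).append('\n');
  }

  public void logSync() throws InterruptedException {
    String toFlush;
    synchronized (this) {
      // Step 1 (synchronized): wait for any in-flight sync, then swap buffers.
      while (isSyncRunning) {
        wait(1000);
      }
      isSyncRunning = true;
      toFlush = bufCurrent.toString();
      bufCurrent.setLength(0);
    }
    // Step 2 (unsynchronized): flush to storage while writers keep appending
    // to bufCurrent. This is exactly the window that close() /
    // endCurrentLogSegment() must wait out via waitForSyncToFinish().
    flushToDisk(toFlush);
    synchronized (this) {
      // Step 3 (synchronized): clear the flag and wake any waiters.
      isSyncRunning = false;
      notifyAll();
    }
  }

  private void flushToDisk(String data) {
    // the real implementation performs the on-disk / quorum write here
  }
}
{code}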

> Exception happened when rollEditLog expects empty EditsDoubleBuffer.bufCurrent but not
> -
>
> Key: HDFS-14437
> URL: https://issues.apache.org/jira/browse/HDFS-14437
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode, qjm
>Reporter: angerszhu
>Priority: Major
> Attachments: HDFS-14437.reproduction.patch, 
> HDFS-14437.reproductionwithlog.patch, screenshot-1.png
>
>
> For the problem mentioned in https://issues.apache.org/jira/browse/HDFS-10943, 
> I have sorted out the process of writing and flushing the EditLog and some 
> important functions. I found that in the FSEditLog class, the close() function 
> runs the following sequence:
>  
> {code:java}
> waitForSyncToFinish();
> endCurrentLogSegment(true);{code}
> Since we have gained the object lock in close(), when the 
> waitForSyncToFinish() method returns it means all logSync work has finished 
> and all data in bufReady has been flushed out; and since the current thread 
> holds the lock on this object, no other thread can gain the lock while 
> endCurrentLogSegment() runs, so nothing can write new edit logs into 
> bufCurrent.
> But when waitForSyncToFinish() is not called before endCurrentLogSegment(), 
> some auto-scheduled logSync() flush may still be in progress, since that 
> flush step does not need synchronization, as mentioned in the comment of the 
> logSync() method:
>  
> {code:java}
> /**
>  * Sync all modifications done by this thread.
>  *
>  * The internal concurrency design of this class is as follows:
>  *   - Log items are written synchronized into an in-memory buffer,
>  * and each assigned a transaction ID.
>  *   - When a thread (client) would like to sync all of its edits, logSync()
>  * uses a ThreadLocal transaction ID to determine what edit number must
>  * be synced to.
>  *   - The isSyncRunning volatile boolean tracks whether a sync is currently
>  * under progress.
>  *
>  * The data is double-buffered within each edit log implementation so that
>  * in-memory writing can occur in parallel with the on-disk writing.
>  *
>  * Each sync occurs in three steps:
>  *   1. synchronized, it swaps the double buffer and sets the isSyncRunning
>  *  flag.
>  *   2. unsynchronized, it flushes the data to storage
>  *   3. synchronized, it resets the flag and notifies anyone waiting on the
>  *  sync.
>  *
>  * The lack of synchronization on step 2 allows other threads to continue
>  * to write into the memory buffer while the sync is in progress.
>  * Because this step is unsynchronized, actions that need to avoid
>  * concurrency with sync() should be synchronized and also call
>  * waitForSyncToFinish() before assuming they are running alone.
>  */
> public void logSync() {
>   long syncStart = 0;
>   // Fetch the transactionId of this thread. 
>   long mytxid = myTransactionId.get().txid;
>   
>   boolean sync = false;
>   try {
> EditLogOutputStream logStream = null;
> synchronized (this) {
>   try {
> printStatistics(false);
> // if somebody is already syncing, then wait
> while (mytxid > synctxid && isSyncRunning) {
>   try {
> wait(1000);
>   } catch (InterruptedException ie) {
>   }
> }
> //
> // If this transaction was already flushed, then nothing to do
> //
> if (mytxid <= synctxid) {
>   numTransactionsBatchedInSync++;
>   if (metrics != null) {
> // Metrics is non-null only when used inside name node
> metrics.incrTransactionsBatchedInSync();
>   }
>   return;
> }
>
> // now, this thread will do the sync
> syncStart = txid;
> isSyncRunning = true;
> sync = true;
> // swap buffers
> try {
>   if 

[jira] [Commented] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test

2019-09-18 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932220#comment-16932220
 ] 

He Xiaoqiao commented on HDFS-14461:


Very sorry that I missed this ticket for a while. 
[~eyang], [~ayushtkn], [~elgoiri], thanks a lot for your comments. I would 
like to continue following up and will test again with HDFS-14609. Thanks again.

> RBF: Fix intermittently failing kerberos related unit test
> --
>
> Key: HDFS-14461
> URL: https://issues.apache.org/jira/browse/HDFS-14461
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14461.001.patch, HDFS-14461.002.patch
>
>
> TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It 
> may be due to some race condition before using the keytab that's created for 
> testing.
>  
> {code:java}
>  Failed
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken
>  Failing for the past 1 build (Since 
> [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png! 
> #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] )
>  [Took 89 
> ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history]
>   
>  Error Message
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED
> h3. Stacktrace
> org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) 
> at 
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.setup(TestRouterHttpDelegationToken.java:99)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363) at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) 
> at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) 
> Caused by: org.apache.hadoop.security.KerberosAuthException: failure to 
> login: for principal: router/localh...@example.com from keytab 
> 

[jira] [Commented] (HDFS-14836) FileIoProvider should not increase FileIoErrors metric in datanode volume metric

2019-09-18 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932217#comment-16932217
 ] 

He Xiaoqiao commented on HDFS-14836:


+1 (non-binding) for [^HDFS-14836-trunk-001.patch] from my side. Thanks 
[~Aiphag0] for the report and contribution. Pending [~jojochuang]'s 
confirmation.

> FileIoProvider should not increase FileIoErrors metric in datanode volume 
> metric
> 
>
> Key: HDFS-14836
> URL: https://issues.apache.org/jira/browse/HDFS-14836
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Aiphago
>Assignee: Aiphago
>Priority: Minor
> Attachments: HDFS-14836-trunk-001.patch, HDFS-14836.patch
>
>
> I found that the FileIoErrors metric increases in BlockSender.sendPacket() 
> when fileIoProvider.transferToSocketFully() is used. But in 
> https://issues.apache.org/jira/browse/HDFS-2054 exceptions such as "Broken 
> pipe" and "Connection reset" have been ignored.
> So should a filter be applied when FileIoProvider increases the FileIoErrors 
> count?
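
A hedged sketch of the kind of filter being asked about, assuming a helper that treats client-disconnect exceptions as non-disk errors; the predicate mirrors the "Broken pipe"/"Connection reset" cases mentioned above, and all names here are hypothetical:

{code:java}
import java.io.IOException;

// Hypothetical helper: decide whether an IOException should count toward the
// volume's FileIoErrors metric or be treated as a client disconnect.
public class FileIoErrorFilterSketch {

  static boolean isClientDisconnect(IOException e) {
    String msg = e.getMessage();
    return msg != null
        && (msg.contains("Broken pipe") || msg.contains("Connection reset"));
  }

  static void onTransferFailure(IOException e) {
    if (isClientDisconnect(e)) {
      // The reader went away; not a disk fault, so skip the metric.
      return;
    }
    // incrementFileIoErrors() would be invoked here in the real provider.
  }
}
{code}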






[jira] [Commented] (HDFS-10943) rollEditLog expects empty EditsDoubleBuffer.bufCurrent which is not guaranteed

2019-09-10 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926629#comment-16926629
 ] 

He Xiaoqiao commented on HDFS-10943:


Thanks [~swingcong] for the ping; that is an important observation. Could you 
share which version you are running?

> rollEditLog expects empty EditsDoubleBuffer.bufCurrent which is not guaranteed
> --
>
> Key: HDFS-10943
> URL: https://issues.apache.org/jira/browse/HDFS-10943
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yongjun Zhang
>Priority: Major
>
> Per the following trace stack:
> {code}
> FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: finalize log 
> segment 10562075963, 10562174157 failed for required journal 
> (JournalAndStream(mgr=QJM to [0.0.0.1:8485, 0.0.0.2:8485, 0.0.0.3:8485, 
> 0.0.0.4:8485, 0.0.0.5:8485], stream=QuorumOutputStream starting at txid 
> 10562075963))
> java.io.IOException: FSEditStream has 49708 bytes still to be flushed and 
> cannot be closed.
> at 
> org.apache.hadoop.hdfs.server.namenode.EditsDoubleBuffer.close(EditsDoubleBuffer.java:66)
> at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.close(QuorumOutputStream.java:65)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.closeStream(JournalSet.java:115)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$4.apply(JournalSet.java:235)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.finalizeLogSegment(JournalSet.java:231)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1243)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1172)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1243)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:6437)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1002)
> at 
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:142)
> at 
> org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12025)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
> 2016-09-23 21:40:59,618 WARN 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting 
> QuorumOutputStream starting at txid 10562075963
> {code}
> The exception is from  EditsDoubleBuffer
> {code}
>  public void close() throws IOException {
> Preconditions.checkNotNull(bufCurrent);
> Preconditions.checkNotNull(bufReady);
> int bufSize = bufCurrent.size();
> if (bufSize != 0) {
>   throw new IOException("FSEditStream has " + bufSize
>   + " bytes still to be flushed and cannot be closed.");
> }
> IOUtils.cleanup(null, bufCurrent, bufReady);
> bufCurrent = bufReady = null;
>   }
> {code}
> We can see that FSNamesystem.rollEditLog expects  
> EditsDoubleBuffer.bufCurrent to be empty.
> Edits are recorded via FSEditLog$logSync, which does:
> {code}
>* The data is double-buffered within each edit log implementation so that
>* in-memory writing can occur in parallel with the on-disk writing.
>*
>* Each sync occurs in three steps:
>*   1. synchronized, it swaps the double buffer and sets the isSyncRunning
>*  flag.
>*   2. unsynchronized, it flushes the data to storage
>*   3. synchronized, it resets the flag and notifies anyone waiting on the
>*  sync.
>*
>* The lack of synchronization on step 2 allows other threads to continue
>* to write into the memory buffer while the sync is in progress.
>* Because this step is unsynchronized, actions that need to avoid
>* concurrency with sync() should be synchronized and also call
>* waitForSyncToFinish() before assuming they are running alone.
>*/
> {code}
> We can see that step 2 is deliberately 
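
For readers following along, a minimal, self-contained sketch (simplified
types; not the actual FSEditLog/EditsDoubleBuffer code) of the three-step
double-buffered sync described above:
{code:java}
import java.io.ByteArrayOutputStream;

// Simplified double-buffered sync: writers append to bufCurrent while a
// single sync flushes bufReady outside the lock.
class DoubleBufferSketch {
  private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream();
  private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();
  private boolean isSyncRunning = false;

  synchronized void logEdit(byte[] edit) {
    bufCurrent.write(edit, 0, edit.length); // writers always append here
  }

  void logSync() throws InterruptedException {
    synchronized (this) {                    // step 1: swap under the lock
      while (isSyncRunning) {
        wait();                              // waitForSyncToFinish() analogue
      }
      ByteArrayOutputStream tmp = bufCurrent;
      bufCurrent = bufReady;
      bufReady = tmp;
      isSyncRunning = true;
    }
    byte[] toFlush = bufReady.toByteArray(); // step 2: flush outside the lock
    // ... write toFlush to durable storage while writers fill bufCurrent ...
    synchronized (this) {                    // step 3: reset flag, wake waiters
      bufReady.reset();
      isSyncRunning = false;
      notifyAll();
    }
  }
}
{code}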

[jira] [Commented] (HDFS-14836) FileIoProvider should not increase FileIoErrors metric in datanode volume metric

2019-09-09 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925661#comment-16925661
 ] 

He Xiaoqiao commented on HDFS-14836:


Thanks [~Aiphag0] for the report. I agree that we should not count 
FileIoErrors when we hit certain explicit exceptions, as HDFS-2054 did; 
otherwise the counter gets polluted and loses its value as a reference.
[~jojochuang] [~ayushtkn], any thoughts? Could you help add [~Aiphag0] as a 
contributor and assign this JIRA to him?
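
For illustration only, a minimal sketch of such a filter, assuming a
hypothetical helper around FileIoProvider's failure handling (the class and
method names below are made up, not the actual Hadoop API):
{code:java}
import java.io.IOException;

// Hypothetical helper (illustrative only): treat client-side disconnects as
// non-volume errors so they do not bump the FileIoErrors metric.
class FileIoErrorFilterSketch {
  static boolean isClientDisconnect(IOException e) {
    String msg = e.getMessage();
    return msg != null
        && (msg.contains("Broken pipe") || msg.contains("Connection reset"));
  }

  // Only count real volume errors, mirroring the exceptions that HDFS-2054
  // chose to ignore.
  static void onFailure(IOException e, Runnable incrementFileIoErrors) {
    if (!isClientDisconnect(e)) {
      incrementFileIoErrors.run();
    }
  }
}
{code}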

> FileIoProvider should not increase FileIoErrors metric in datanode volume 
> metric
> 
>
> Key: HDFS-14836
> URL: https://issues.apache.org/jira/browse/HDFS-14836
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.1
>Reporter: Aiphago
>Priority: Minor
>
> I found that the FileIoErrors metric increases in BlockSender.sendPacket() 
> when fileIoProvider.transferToSocketFully() is used. But in 
> https://issues.apache.org/jira/browse/HDFS-2054 such exceptions, e.g. 
> "Broken pipe" and "Connection reset", were deliberately ignored.
> So should we filter these cases out when fileIoProvider increases the 
> FileIoErrors count?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14303) check block directory logic not correct when there is only meta file, print no meaning warn log

2019-09-09 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925579#comment-16925579
 ] 

He Xiaoqiao commented on HDFS-14303:


[^HDFS-14303-branch-3.2.addendum.03.patch] uploads an addendum patch for 
branch-3.2 that also applies directly to branch-3.1. Let's wait and see what 
Jenkins says.

> check block directory logic not correct when there is only meta file, print 
> no meaning warn log
> ---
>
> Key: HDFS-14303
> URL: https://issues.apache.org/jira/browse/HDFS-14303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.7.3, 3.2.0, 2.9.2, 2.8.5
> Environment: env free
>Reporter: qiang Liu
>Assignee: qiang Liu
>Priority: Minor
>  Labels: easy-fix
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14303-addendnum-branch-2.01.patch, 
> HDFS-14303-addendum-01.patch, HDFS-14303-addendum-02.patch, 
> HDFS-14303-branch-2.005.patch, HDFS-14303-branch-2.009.patch, 
> HDFS-14303-branch-2.010.patch, HDFS-14303-branch-2.015.patch, 
> HDFS-14303-branch-2.017.patch, HDFS-14303-branch-2.7.001.patch, 
> HDFS-14303-branch-2.7.004.patch, HDFS-14303-branch-2.7.006.patch, 
> HDFS-14303-branch-2.9.011.patch, HDFS-14303-branch-2.9.012.patch, 
> HDFS-14303-branch-2.9.013.patch, HDFS-14303-branch-2.addendum.02.patch, 
> HDFS-14303-branch-3.2.addendum.03.patch, HDFS-14303-trunk.014.patch, 
> HDFS-14303-trunk.015.patch, HDFS-14303-trunk.016.patch, 
> HDFS-14303-trunk.016.path, HDFS-14303.branch-3.2.017.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The check-block-directory logic is not correct when only the meta file 
> exists; it prints a meaningless warn log, e.g.:
>  WARN DirectoryScanner:? - Block: 1101939874 has to be upgraded to block 
> ID-based layout. Actual block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68,
>  expected block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68/subdir68
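
A hypothetical guard (illustrative only; the real DirectoryScanner code
differs) showing the intent of the fix: warn only when a block file actually
exists in the wrong directory:
{code:java}
import java.io.File;

// Only emit the "has to be upgraded to block ID-based layout" warning when a
// block file really exists and really sits outside its expected directory;
// a meta-file-only entry should not trigger it.
class LayoutWarnSketch {
  static boolean shouldWarn(File blockFile, File expectedDir) {
    return blockFile != null
        && blockFile.exists()
        && !blockFile.getParentFile().equals(expectedDir);
  }
}
{code}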



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14303) check block directory logic not correct when there is only meta file, print no meaning warn log

2019-09-09 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14303:
---
Attachment: HDFS-14303-branch-3.2.addendum.03.patch

> check block directory logic not correct when there is only meta file, print 
> no meaning warn log
> ---
>
> Key: HDFS-14303
> URL: https://issues.apache.org/jira/browse/HDFS-14303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.7.3, 3.2.0, 2.9.2, 2.8.5
> Environment: env free
>Reporter: qiang Liu
>Assignee: qiang Liu
>Priority: Minor
>  Labels: easy-fix
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14303-addendnum-branch-2.01.patch, 
> HDFS-14303-addendum-01.patch, HDFS-14303-addendum-02.patch, 
> HDFS-14303-branch-2.005.patch, HDFS-14303-branch-2.009.patch, 
> HDFS-14303-branch-2.010.patch, HDFS-14303-branch-2.015.patch, 
> HDFS-14303-branch-2.017.patch, HDFS-14303-branch-2.7.001.patch, 
> HDFS-14303-branch-2.7.004.patch, HDFS-14303-branch-2.7.006.patch, 
> HDFS-14303-branch-2.9.011.patch, HDFS-14303-branch-2.9.012.patch, 
> HDFS-14303-branch-2.9.013.patch, HDFS-14303-branch-2.addendum.02.patch, 
> HDFS-14303-branch-3.2.addendum.03.patch, HDFS-14303-trunk.014.patch, 
> HDFS-14303-trunk.015.patch, HDFS-14303-trunk.016.patch, 
> HDFS-14303-trunk.016.path, HDFS-14303.branch-3.2.017.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The check-block-directory logic is not correct when only the meta file 
> exists; it prints a meaningless warn log, e.g.:
>  WARN DirectoryScanner:? - Block: 1101939874 has to be upgraded to block 
> ID-based layout. Actual block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68,
>  expected block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68/subdir68



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14303) check block directory logic not correct when there is only meta file, print no meaning warn log

2019-09-09 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925531#comment-16925531
 ] 

He Xiaoqiao commented on HDFS-14303:


Hi [~ayushtkn], does [^HDFS-14303-addendum-02.patch] apply only to trunk, 
without being backported to any other branches?

> check block directory logic not correct when there is only meta file, print 
> no meaning warn log
> ---
>
> Key: HDFS-14303
> URL: https://issues.apache.org/jira/browse/HDFS-14303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.7.3, 3.2.0, 2.9.2, 2.8.5
> Environment: env free
>Reporter: qiang Liu
>Assignee: qiang Liu
>Priority: Minor
>  Labels: easy-fix
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14303-addendnum-branch-2.01.patch, 
> HDFS-14303-addendum-01.patch, HDFS-14303-addendum-02.patch, 
> HDFS-14303-branch-2.005.patch, HDFS-14303-branch-2.009.patch, 
> HDFS-14303-branch-2.010.patch, HDFS-14303-branch-2.015.patch, 
> HDFS-14303-branch-2.017.patch, HDFS-14303-branch-2.7.001.patch, 
> HDFS-14303-branch-2.7.004.patch, HDFS-14303-branch-2.7.006.patch, 
> HDFS-14303-branch-2.9.011.patch, HDFS-14303-branch-2.9.012.patch, 
> HDFS-14303-branch-2.9.013.patch, HDFS-14303-branch-2.addendum.02.patch, 
> HDFS-14303-trunk.014.patch, HDFS-14303-trunk.015.patch, 
> HDFS-14303-trunk.016.patch, HDFS-14303-trunk.016.path, 
> HDFS-14303.branch-3.2.017.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The check-block-directory logic is not correct when only the meta file 
> exists; it prints a meaningless warn log, e.g.:
>  WARN DirectoryScanner:? - Block: 1101939874 has to be upgraded to block 
> ID-based layout. Actual block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68,
>  expected block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68/subdir68



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14303) check block directory logic not correct when there is only meta file, print no meaning warn log

2019-09-09 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925517#comment-16925517
 ] 

He Xiaoqiao commented on HDFS-14303:


Uploaded a new patch file with no changes compared to 
[^HDFS-14303-addendnum-branch-2.01.patch], to trigger Jenkins again.

> check block directory logic not correct when there is only meta file, print 
> no meaning warn log
> ---
>
> Key: HDFS-14303
> URL: https://issues.apache.org/jira/browse/HDFS-14303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.7.3, 3.2.0, 2.9.2, 2.8.5
> Environment: env free
>Reporter: qiang Liu
>Assignee: qiang Liu
>Priority: Minor
>  Labels: easy-fix
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14303-addendnum-branch-2.01.patch, 
> HDFS-14303-addendum-01.patch, HDFS-14303-addendum-02.patch, 
> HDFS-14303-branch-2.005.patch, HDFS-14303-branch-2.009.patch, 
> HDFS-14303-branch-2.010.patch, HDFS-14303-branch-2.015.patch, 
> HDFS-14303-branch-2.017.patch, HDFS-14303-branch-2.7.001.patch, 
> HDFS-14303-branch-2.7.004.patch, HDFS-14303-branch-2.7.006.patch, 
> HDFS-14303-branch-2.9.011.patch, HDFS-14303-branch-2.9.012.patch, 
> HDFS-14303-branch-2.9.013.patch, HDFS-14303-branch-2.addendum.02.patch, 
> HDFS-14303-trunk.014.patch, HDFS-14303-trunk.015.patch, 
> HDFS-14303-trunk.016.patch, HDFS-14303-trunk.016.path, 
> HDFS-14303.branch-3.2.017.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The check-block-directory logic is not correct when only the meta file 
> exists; it prints a meaningless warn log, e.g.:
>  WARN DirectoryScanner:? - Block: 1101939874 has to be upgraded to block 
> ID-based layout. Actual block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68,
>  expected block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68/subdir68



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14303) check block directory logic not correct when there is only meta file, print no meaning warn log

2019-09-09 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14303:
---
Attachment: HDFS-14303-branch-2.addendum.02.patch

> check block directory logic not correct when there is only meta file, print 
> no meaning warn log
> ---
>
> Key: HDFS-14303
> URL: https://issues.apache.org/jira/browse/HDFS-14303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.7.3, 3.2.0, 2.9.2, 2.8.5
> Environment: env free
>Reporter: qiang Liu
>Assignee: qiang Liu
>Priority: Minor
>  Labels: easy-fix
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14303-addendnum-branch-2.01.patch, 
> HDFS-14303-addendum-01.patch, HDFS-14303-addendum-02.patch, 
> HDFS-14303-branch-2.005.patch, HDFS-14303-branch-2.009.patch, 
> HDFS-14303-branch-2.010.patch, HDFS-14303-branch-2.015.patch, 
> HDFS-14303-branch-2.017.patch, HDFS-14303-branch-2.7.001.patch, 
> HDFS-14303-branch-2.7.004.patch, HDFS-14303-branch-2.7.006.patch, 
> HDFS-14303-branch-2.9.011.patch, HDFS-14303-branch-2.9.012.patch, 
> HDFS-14303-branch-2.9.013.patch, HDFS-14303-branch-2.addendum.02.patch, 
> HDFS-14303-trunk.014.patch, HDFS-14303-trunk.015.patch, 
> HDFS-14303-trunk.016.patch, HDFS-14303-trunk.016.path, 
> HDFS-14303.branch-3.2.017.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The check-block-directory logic is not correct when only the meta file 
> exists; it prints a meaningless warn log, e.g.:
>  WARN DirectoryScanner:? - Block: 1101939874 has to be upgraded to block 
> ID-based layout. Actual block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68,
>  expected block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68/subdir68



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14303) check block directory logic not correct when there is only meta file, print no meaning warn log

2019-09-08 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925377#comment-16925377
 ] 

He Xiaoqiao commented on HDFS-14303:


[~iamgd67], just trying to trigger Jenkins; please wait a while.

> check block directory logic not correct when there is only meta file, print 
> no meaning warn log
> ---
>
> Key: HDFS-14303
> URL: https://issues.apache.org/jira/browse/HDFS-14303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.7.3, 3.2.0, 2.9.2, 2.8.5
> Environment: env free
>Reporter: qiang Liu
>Assignee: qiang Liu
>Priority: Minor
>  Labels: easy-fix
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14303-addendnum-branch-2.01.patch, 
> HDFS-14303-addendum-01.patch, HDFS-14303-addendum-02.patch, 
> HDFS-14303-branch-2.005.patch, HDFS-14303-branch-2.009.patch, 
> HDFS-14303-branch-2.010.patch, HDFS-14303-branch-2.015.patch, 
> HDFS-14303-branch-2.017.patch, HDFS-14303-branch-2.7.001.patch, 
> HDFS-14303-branch-2.7.004.patch, HDFS-14303-branch-2.7.006.patch, 
> HDFS-14303-branch-2.9.011.patch, HDFS-14303-branch-2.9.012.patch, 
> HDFS-14303-branch-2.9.013.patch, HDFS-14303-trunk.014.patch, 
> HDFS-14303-trunk.015.patch, HDFS-14303-trunk.016.patch, 
> HDFS-14303-trunk.016.path, HDFS-14303.branch-3.2.017.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The check-block-directory logic is not correct when only the meta file 
> exists; it prints a meaningless warn log, e.g.:
>  WARN DirectoryScanner:? - Block: 1101939874 has to be upgraded to block 
> ID-based layout. Actual block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68,
>  expected block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68/subdir68



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14303) check block directory logic not correct when there is only meta file, print no meaning warn log

2019-09-08 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14303:
---
Target Version/s: 2.9.2, 3.2.0  (was: 3.2.0, 2.9.2)
  Status: Patch Available  (was: Reopened)

> check block directory logic not correct when there is only meta file, print 
> no meaning warn log
> ---
>
> Key: HDFS-14303
> URL: https://issues.apache.org/jira/browse/HDFS-14303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.8.5, 2.9.2, 3.2.0, 2.7.3
> Environment: env free
>Reporter: qiang Liu
>Assignee: qiang Liu
>Priority: Minor
>  Labels: easy-fix
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14303-addendnum-branch-2.01.patch, 
> HDFS-14303-addendum-01.patch, HDFS-14303-addendum-02.patch, 
> HDFS-14303-branch-2.005.patch, HDFS-14303-branch-2.009.patch, 
> HDFS-14303-branch-2.010.patch, HDFS-14303-branch-2.015.patch, 
> HDFS-14303-branch-2.017.patch, HDFS-14303-branch-2.7.001.patch, 
> HDFS-14303-branch-2.7.004.patch, HDFS-14303-branch-2.7.006.patch, 
> HDFS-14303-branch-2.9.011.patch, HDFS-14303-branch-2.9.012.patch, 
> HDFS-14303-branch-2.9.013.patch, HDFS-14303-trunk.014.patch, 
> HDFS-14303-trunk.015.patch, HDFS-14303-trunk.016.patch, 
> HDFS-14303-trunk.016.path, HDFS-14303.branch-3.2.017.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The check-block-directory logic is not correct when only the meta file 
> exists; it prints a meaningless warn log, e.g.:
>  WARN DirectoryScanner:? - Block: 1101939874 has to be upgraded to block 
> ID-based layout. Actual block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68,
>  expected block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68/subdir68



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-09-08 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925322#comment-16925322
 ] 

He Xiaoqiao commented on HDFS-14771:


Hi [~jojochuang], [~sodonnell], is this ready to be backported to branch-2?

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>  Labels: release-blocker
> Attachments: HDFS-14771.branch-2.001.patch, 
> HDFS-14771.branch-2.002.patch, HDFS-14771.branch-2.003.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: improve fsimage load 
> time by writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14810) review FSNameSystem editlog sync

2019-09-08 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925320#comment-16925320
 ] 

He Xiaoqiao commented on HDFS-14810:


Hi [~jojochuang], [~ayushtkn], any update on this improvement?

> review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch, 
> HDFS-14810.003.patch, HDFS-14810.004.patch
>
>
> Refactor and unify the edit log sync in FSNamesystem, as HDFS-11246 
> mentioned.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10453) ReplicationMonitor thread could stuck for long time due to the race between replication and delete of same file in a large cluster.

2019-09-07 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924753#comment-16924753
 ] 

He Xiaoqiao commented on HDFS-10453:


[~hustnn] I think this case is explained: when several datanodes shut down at 
the same time, the NameNode triggers replication of the under-replicated 
blocks, and some live datanodes become heavily loaded, especially in a small 
cluster. I just wonder whether the NameNode process hit any exception or hung 
for a long time while logging the information above.

> ReplicationMonitor thread could stuck for long time due to the race between 
> replication and delete of same file in a large cluster.
> ---
>
> Key: HDFS-10453
> URL: https://issues.apache.org/jira/browse/HDFS-10453
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.1, 2.5.2, 2.7.1, 2.6.4
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 2.8.4, 2.7.6, 3.0.3
>
> Attachments: HDFS-10453-branch-2.001.patch, 
> HDFS-10453-branch-2.003.patch, HDFS-10453-branch-2.7.004.patch, 
> HDFS-10453-branch-2.7.005.patch, HDFS-10453-branch-2.7.006.patch, 
> HDFS-10453-branch-2.7.007.patch, HDFS-10453-branch-2.7.008.patch, 
> HDFS-10453-branch-2.7.009.patch, HDFS-10453-branch-2.8.001.patch, 
> HDFS-10453-branch-2.8.002.patch, HDFS-10453-branch-2.9.001.patch, 
> HDFS-10453-branch-2.9.002.patch, HDFS-10453-branch-3.0.001.patch, 
> HDFS-10453-branch-3.0.002.patch, HDFS-10453-trunk.001.patch, 
> HDFS-10453-trunk.002.patch, HDFS-10453.001.patch
>
>
> The ReplicationMonitor thread can get stuck for a long time and, with low 
> probability, lose data. Consider this typical scenario:
> (1) create and close a file with the default replicas (3);
> (2) increase the replication of the file (to 10);
> (3) delete the file while ReplicationMonitor is scheduling blocks that 
> belong to that file for replication.
> When the ReplicationMonitor gets stuck, the NameNode prints logs like:
> {code:xml}
> 2016-04-19 10:20:48,083 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
> place enough replicas, still in need of 7 to reach 10 
> (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, 
> newBlock=false) For more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> ..
> 2016-04-19 10:21:17,184 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
> place enough replicas, still in need of 7 to reach 10 
> (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, 
> newBlock=false) For more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> 2016-04-19 10:21:17,184 WARN 
> org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough 
> replicas: expected size is 7 but only 0 storage types can be selected 
> (replication=10, selected=[], unavailable=[DISK, ARCHIVE], removed=[DISK, 
> DISK, DISK, DISK, DISK, DISK, DISK], policy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2016-04-19 10:21:17,184 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
> place enough replicas, still in need of 7 to reach 10 
> (unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, 
> newBlock=false) All required storage types are unavailable:  
> unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> {code}
> This is because two threads (#NameNodeRpcServer and #ReplicationMonitor) 
> process the same block at the same moment.
> (1) ReplicationMonitor#computeReplicationWorkForBlocks gets blocks to 
> replicate and leaves the global lock.
> (2) FSNamesystem#delete is invoked to delete the blocks and then clears the 
> references in blocksmap, needReplications, etc. The block's NumBytes is set 
> to NO_ACK (Long.MAX_VALUE), which indicates that the block deletion does not 
> need an explicit ACK from the node.
> (3) ReplicationMonitor#computeReplicationWorkForBlocks continues to 
> chooseTargets for the same blocks, and no node is selected even after 
> traversing the whole cluster, because no node satisfies the goodness 
> criteria (its remaining space would have to reach the required size 
> Long.MAX_VALUE).
> During stage #3 the ReplicationMonitor is stuck for 
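
To make the race concrete, a hypothetical guard (illustrative names; not the
actual BlockManager code) that would keep NO_ACK-sized blocks out of target
selection:
{code:java}
// A block whose size was set to NO_ACK (Long.MAX_VALUE) was deleted
// concurrently; scheduling it makes every candidate node fail the
// remaining-space check and leaves ReplicationMonitor spinning.
class ReplicationGuardSketch {
  static final long NO_ACK = Long.MAX_VALUE;

  static boolean shouldChooseTargets(long blockNumBytes) {
    return blockNumBytes != NO_ACK;
  }
}
{code}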

[jira] [Commented] (HDFS-10453) ReplicationMonitor thread could stuck for long time due to the race between replication and delete of same file in a large cluster.

2019-09-05 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923913#comment-16923913
 ] 

He Xiaoqiao commented on HDFS-10453:


Thanks [~hustnn] for the ping. I am confused by the log `Failed to place 
enough replicas, still in need of 0 to reach 3`; it suggests there are already 
enough replicas. Could you offer more related information, or describe how to 
reproduce it?

> ReplicationMonitor thread could stuck for long time due to the race between 
> replication and delete of same file in a large cluster.
> ---
>
> Key: HDFS-10453
> URL: https://issues.apache.org/jira/browse/HDFS-10453
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.1, 2.5.2, 2.7.1, 2.6.4
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 2.8.4, 2.7.6, 3.0.3
>
> Attachments: HDFS-10453-branch-2.001.patch, 
> HDFS-10453-branch-2.003.patch, HDFS-10453-branch-2.7.004.patch, 
> HDFS-10453-branch-2.7.005.patch, HDFS-10453-branch-2.7.006.patch, 
> HDFS-10453-branch-2.7.007.patch, HDFS-10453-branch-2.7.008.patch, 
> HDFS-10453-branch-2.7.009.patch, HDFS-10453-branch-2.8.001.patch, 
> HDFS-10453-branch-2.8.002.patch, HDFS-10453-branch-2.9.001.patch, 
> HDFS-10453-branch-2.9.002.patch, HDFS-10453-branch-3.0.001.patch, 
> HDFS-10453-branch-3.0.002.patch, HDFS-10453-trunk.001.patch, 
> HDFS-10453-trunk.002.patch, HDFS-10453.001.patch
>
>
> The ReplicationMonitor thread can get stuck for a long time and, with low 
> probability, lose data. Consider this typical scenario:
> (1) create and close a file with the default replicas (3);
> (2) increase the replication of the file (to 10);
> (3) delete the file while ReplicationMonitor is scheduling blocks that 
> belong to that file for replication.
> When the ReplicationMonitor gets stuck, the NameNode prints logs like:
> {code:xml}
> 2016-04-19 10:20:48,083 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
> place enough replicas, still in need of 7 to reach 10 
> (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, 
> newBlock=false) For more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> ..
> 2016-04-19 10:21:17,184 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
> place enough replicas, still in need of 7 to reach 10 
> (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, 
> newBlock=false) For more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> 2016-04-19 10:21:17,184 WARN 
> org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough 
> replicas: expected size is 7 but only 0 storage types can be selected 
> (replication=10, selected=[], unavailable=[DISK, ARCHIVE], removed=[DISK, 
> DISK, DISK, DISK, DISK, DISK, DISK], policy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2016-04-19 10:21:17,184 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
> place enough replicas, still in need of 7 to reach 10 
> (unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, 
> newBlock=false) All required storage types are unavailable:  
> unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> {code}
> This is because two threads (#NameNodeRpcServer and #ReplicationMonitor) 
> process the same block at the same moment.
> (1) ReplicationMonitor#computeReplicationWorkForBlocks gets blocks to 
> replicate and leaves the global lock.
> (2) FSNamesystem#delete is invoked to delete the blocks and then clears the 
> references in blocksmap, needReplications, etc. The block's NumBytes is set 
> to NO_ACK (Long.MAX_VALUE), which indicates that the block deletion does not 
> need an explicit ACK from the node.
> (3) ReplicationMonitor#computeReplicationWorkForBlocks continues to 
> chooseTargets for the same blocks, and no node is selected even after 
> traversing the whole cluster, because no node satisfies the goodness 
> criteria (its remaining space would have to reach the required size 
> Long.MAX_VALUE).
> During stage #3 the ReplicationMonitor is stuck for a long time, especially 
> in a large cluster. invalidateBlocks & neededReplications continues to 

[jira] [Commented] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-09-05 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923883#comment-16923883
 ] 

He Xiaoqiao commented on HDFS-14771:


I tried to follow up on both failed unit tests that the branch-2 Jenkins 
reported above. TestQuota#testClrQuotaOnRoot has been fixed, tracked by 
HDFS-14633 (thanks for [~xyao]'s help). 
TestDirectoryScanner#testScanDirectoryStructureWarn has a fix ready and 
pending commit, tracked by HDFS-14303.

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>  Labels: release-blocker
> Attachments: HDFS-14771.branch-2.001.patch, 
> HDFS-14771.branch-2.002.patch, HDFS-14771.branch-2.003.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: improve fsimage load 
> time by writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14303) check block directory logic not correct when there is only meta file, print no meaning warn log

2019-09-05 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923881#comment-16923881
 ] 

He Xiaoqiao commented on HDFS-14303:


Uploaded [^HDFS-14303-addendnum-branch-2.01.patch], an addendum for branch-2 
that tries to fix the failed unit test, which originates from [~iamgd67]. 
Verified locally that it fixes the TestDirectoryScanner failure.
[~ayushtkn], please help review. Thanks.

> check block directory logic not correct when there is only meta file, print 
> no meaning warn log
> ---
>
> Key: HDFS-14303
> URL: https://issues.apache.org/jira/browse/HDFS-14303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.7.3, 3.2.0, 2.9.2, 2.8.5
> Environment: env free
>Reporter: qiang Liu
>Assignee: qiang Liu
>Priority: Minor
>  Labels: easy-fix
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14303-addendnum-branch-2.01.patch, 
> HDFS-14303-addendum-01.patch, HDFS-14303-addendum-02.patch, 
> HDFS-14303-branch-2.005.patch, HDFS-14303-branch-2.009.patch, 
> HDFS-14303-branch-2.010.patch, HDFS-14303-branch-2.015.patch, 
> HDFS-14303-branch-2.017.patch, HDFS-14303-branch-2.7.001.patch, 
> HDFS-14303-branch-2.7.004.patch, HDFS-14303-branch-2.7.006.patch, 
> HDFS-14303-branch-2.9.011.patch, HDFS-14303-branch-2.9.012.patch, 
> HDFS-14303-branch-2.9.013.patch, HDFS-14303-trunk.014.patch, 
> HDFS-14303-trunk.015.patch, HDFS-14303-trunk.016.patch, 
> HDFS-14303-trunk.016.path, HDFS-14303.branch-3.2.017.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The check-block-directory logic is not correct when only the meta file 
> exists; it prints a meaningless warn log, e.g.:
>  WARN DirectoryScanner:? - Block: 1101939874 has to be upgraded to block 
> ID-based layout. Actual block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68,
>  expected block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68/subdir68



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14303) check block directory logic not correct when there is only meta file, print no meaning warn log

2019-09-05 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14303:
---
Attachment: HDFS-14303-addendnum-branch-2.01.patch

> check block directory logic not correct when there is only meta file, print 
> no meaning warn log
> ---
>
> Key: HDFS-14303
> URL: https://issues.apache.org/jira/browse/HDFS-14303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.7.3, 3.2.0, 2.9.2, 2.8.5
> Environment: env free
>Reporter: qiang Liu
>Assignee: qiang Liu
>Priority: Minor
>  Labels: easy-fix
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14303-addendnum-branch-2.01.patch, 
> HDFS-14303-addendum-01.patch, HDFS-14303-addendum-02.patch, 
> HDFS-14303-branch-2.005.patch, HDFS-14303-branch-2.009.patch, 
> HDFS-14303-branch-2.010.patch, HDFS-14303-branch-2.015.patch, 
> HDFS-14303-branch-2.017.patch, HDFS-14303-branch-2.7.001.patch, 
> HDFS-14303-branch-2.7.004.patch, HDFS-14303-branch-2.7.006.patch, 
> HDFS-14303-branch-2.9.011.patch, HDFS-14303-branch-2.9.012.patch, 
> HDFS-14303-branch-2.9.013.patch, HDFS-14303-trunk.014.patch, 
> HDFS-14303-trunk.015.patch, HDFS-14303-trunk.016.patch, 
> HDFS-14303-trunk.016.path, HDFS-14303.branch-3.2.017.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The check-block-directory logic is not correct when only the meta file 
> exists; it prints a meaningless warn log, e.g.:
>  WARN DirectoryScanner:? - Block: 1101939874 has to be upgraded to block 
> ID-based layout. Actual block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68,
>  expected block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68/subdir68



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14633) The StorageType quota and consume in QuotaFeature is not handled for rename

2019-09-05 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923872#comment-16923872
 ] 

He Xiaoqiao commented on HDFS-14633:


Thanks [~xyao] for the quick response and help. I verified the failed unit 
test locally and the result is OK for me. Thanks again.

> The StorageType quota and consume in QuotaFeature is not handled for rename
> ---
>
> Key: HDFS-14633
> URL: https://issues.apache.org/jira/browse/HDFS-14633
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14633-testcases-explanation, HDFS-14633.002.patch, 
> HDFS-14633.003.patch, HDFS-14633.004.patch, HDFS-14633.005.patch, 
> HDFS-14633.006.patch, HDFS-14633.007.patch
>
>
> The NameNode manages the global state of the cluster. We should always take 
> the NameNode's records as the sole criterion, because whatever inconsistency 
> arises, the NameNode should eventually make everything right based on its 
> records. Let's call this rule NSC (NameNode is the Sole Criterion). It means 
> that for all quota-related RPCs we do the quota check according to the 
> NameNode's records, regardless of any inconsistent situation, such as 
> replicas that don't match the storage policy of the file, or a replica count 
> that doesn't match the file's set replication.
>  SPS deals with wrongly placed replicas. There is a thought about putting 
> off the consume update of the DirectoryQuota until all replicas are 
> re-placed by SPS. I can't agree with that, because doing so abandons letting 
> the NameNode's records be the sole criterion. Block replication is a good 
> example of the rule NSC: when we count the consume of a file (CONTIGUOUS), 
> we multiply the replication factor by the file's length, no matter whether 
> the blocks are under-replicated or excess. We should do the same thing for 
> the storage type quota.
>  Another concern is that this change lets setStoragePolicy throw 
> QuotaByStorageTypeExceededException, which it didn't before. I don't think 
> that's a big problem, since setStoragePolicy already throws IOException. We 
> could instead wrap the QuotaByStorageTypeExceededException in an 
> IOException, but I wouldn't recommend that because it's ugly.
>  To make the storage type consume follow the rule NSC, we need to change 
> rename (moving a file whose storage policy is inherited from its parent) and 
> setStoragePolicy. 
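
A tiny, self-contained illustration of the consume calculation under rule NSC
(the numbers are made up):
{code:java}
// Consume for a CONTIGUOUS file under rule NSC: charge the quota purely from
// NameNode records (length x set replication factor), regardless of how many
// replicas actually exist on DataNodes right now.
class NscConsumeExample {
  public static void main(String[] args) {
    long fileLength = 128L * 1024 * 1024; // 128 MB per NameNode metadata
    short replication = 3;                // the file's set replication factor
    long consumed = fileLength * replication;
    System.out.println(consumed);         // 402653184 bytes (384 MB) charged
  }
}
{code}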



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14303) check block directory logic not correct when there is only meta file, print no meaning warn log

2019-09-05 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923461#comment-16923461
 ] 

He Xiaoqiao commented on HDFS-14303:


+1. TestDirectoryScanner in branch-2 has been failing for a while, and the 
addendum patch was not committed; I am not sure whether that is related. 
Would anyone like to help double-check? Thanks, [~ayushtkn], [~iamgd67].

> check block directory logic not correct when there is only meta file, print 
> no meaning warn log
> ---
>
> Key: HDFS-14303
> URL: https://issues.apache.org/jira/browse/HDFS-14303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.7.3, 3.2.0, 2.9.2, 2.8.5
> Environment: env free
>Reporter: qiang Liu
>Assignee: qiang Liu
>Priority: Minor
>  Labels: easy-fix
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14303-addendum-01.patch, 
> HDFS-14303-addendum-02.patch, HDFS-14303-branch-2.005.patch, 
> HDFS-14303-branch-2.009.patch, HDFS-14303-branch-2.010.patch, 
> HDFS-14303-branch-2.015.patch, HDFS-14303-branch-2.017.patch, 
> HDFS-14303-branch-2.7.001.patch, HDFS-14303-branch-2.7.004.patch, 
> HDFS-14303-branch-2.7.006.patch, HDFS-14303-branch-2.9.011.patch, 
> HDFS-14303-branch-2.9.012.patch, HDFS-14303-branch-2.9.013.patch, 
> HDFS-14303-trunk.014.patch, HDFS-14303-trunk.015.patch, 
> HDFS-14303-trunk.016.patch, HDFS-14303-trunk.016.path, 
> HDFS-14303.branch-3.2.017.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The check-block-directory logic is not correct when only the meta file 
> exists; it prints a meaningless warn log, e.g.:
>  WARN DirectoryScanner:? - Block: 1101939874 has to be upgraded to block 
> ID-based layout. Actual block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68,
>  expected block file path: 
> /data14/hadoop/data/current/BP-1461038173-10.8.48.152-1481686842620/current/finalized/subdir174/subdir68/subdir68



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14633) The StorageType quota and consume in QuotaFeature is not handled for rename

2019-09-05 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923449#comment-16923449
 ] 

He Xiaoqiao commented on HDFS-14633:


Hi [~xyao], [~LiJinglun], sorry for the late response. I just checked why 
TestQuota#testClrQuotaOnRoot fails in branch-2: branch-2 doesn't support the 
traditional binary prefix. While digging through the commit history I found 
the following commit info.
{code:java}
commit a524608d1e0a276fd822e9574f0a4a6d51298b13
Author: Xiaoyu Yao 
Date:   Fri Aug 30 16:46:04 2019 -0700

HDFS-14633. The StorageType quota and consume in QuotaFeature is not 
handled for rename. Contributed by Jinglun.

(cherry picked from commit 62d71fbac3789c7d484bc76ced9ec7fa6ff94de1)
{code}
I am confused: the patch v007 source code is very different from the commit 
on branch-2. Is there anything I missed, or does the commit message not match 
the code? I also cannot trace back to 
`62d71fbac3789c7d484bc76ced9ec7fa6ff94de1`, which is marked as the cherry-pick 
source. Where is it? :)

After checking carefully, I think we could fix the failed unit test 
`TestQuota#testClrQuotaOnRoot` by cherry-picking HDFS-12009. Any suggestions? 
Thanks.

> The StorageType quota and consume in QuotaFeature is not handled for rename
> ---
>
> Key: HDFS-14633
> URL: https://issues.apache.org/jira/browse/HDFS-14633
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14633-testcases-explanation, HDFS-14633.002.patch, 
> HDFS-14633.003.patch, HDFS-14633.004.patch, HDFS-14633.005.patch, 
> HDFS-14633.006.patch, HDFS-14633.007.patch
>
>
> The NameNode manages the global state of the cluster. We should always take 
> the NameNode's records as the sole criterion, because whatever inconsistency 
> arises, the NameNode should eventually make everything right based on its 
> records. Let's call this rule NSC (NameNode is the Sole Criterion). It means 
> that for all quota-related RPCs we do the quota check according to the 
> NameNode's records, regardless of any inconsistent situation, such as 
> replicas that don't match the storage policy of the file, or a replica count 
> that doesn't match the file's set replication.
>  SPS deals with wrongly placed replicas. There is a thought about putting 
> off the consume update of the DirectoryQuota until all replicas are 
> re-placed by SPS. I can't agree with that, because doing so abandons letting 
> the NameNode's records be the sole criterion. Block replication is a good 
> example of the rule NSC: when we count the consume of a file (CONTIGUOUS), 
> we multiply the replication factor by the file's length, no matter whether 
> the blocks are under-replicated or excess. We should do the same thing for 
> the storage type quota.
>  Another concern is that this change lets setStoragePolicy throw 
> QuotaByStorageTypeExceededException, which it didn't before. I don't think 
> that's a big problem, since setStoragePolicy already throws IOException. We 
> could instead wrap the QuotaByStorageTypeExceededException in an 
> IOException, but I wouldn't recommend that because it's ugly.
>  To make the storage type consume follow the rule NSC, we need to change 
> rename (moving a file whose storage policy is inherited from its parent) and 
> setStoragePolicy. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-09-05 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923196#comment-16923196
 ] 

He Xiaoqiao edited comment on HDFS-14771 at 9/5/19 8:33 AM:


Checked the failed unit tests; I don't think they are related to this change.
1. TestQuota failed because the parameter check rejects the input '3K' for 
`setQuota` in branch-2, which does not support the traditional binary prefix. 
The root cause is a different check mode between trunk and branch-2; the 
issue comes from HDFS-14633. I would like to resolve it later.
{code:java}
@Test
public void testClrQuotaOnRoot() throws Exception {
  long orignalQuota = dfs.getQuotaUsage(new Path("/")).getQuota();
  DFSAdmin admin = new DFSAdmin(conf);
  String[] args;
  args = new String[] {"-setQuota", "3K", "/"};
  runCommand(admin, args, false);
  assertEquals(3 * 1024, dfs.getQuotaUsage(new Path("/")).getQuota());
  args = new String[] {"-clrQuota", "/"};
  runCommand(admin, args, false);
  assertEquals(orignalQuota, dfs.getQuotaUsage(new Path("/")).getQuota());
}
{code}
2. TestDirectoryScanner always seems to time out in branch-2. It may be 
related to HDFS-14303; was the addendum patch not committed to branch-2?
3. TestJournalNodeRespectsBindHostKeys passes locally.


was (Author: hexiaoqiao):
Checked the failed unit tests; I don't think they are related to this change.
1. TestQuota failed because the parameter check rejects the input '3K' for 
`setQuota` in branch-2, which does not support the traditional binary prefix. 
The root cause is a different check mode between trunk and branch-2; the 
issue comes from HDFS-14633. I would like to resolve it later.
{code:java}
@Test
public void testClrQuotaOnRoot() throws Exception {
  long orignalQuota = dfs.getQuotaUsage(new Path("/")).getQuota();
  DFSAdmin admin = new DFSAdmin(conf);
  String[] args;
  args = new String[] {"-setQuota", "3K", "/"};
  runCommand(admin, args, false);
  assertEquals(3 * 1024, dfs.getQuotaUsage(new Path("/")).getQuota());
  args = new String[] {"-clrQuota", "/"};
  runCommand(admin, args, false);
  assertEquals(orignalQuota, dfs.getQuotaUsage(new Path("/")).getQuota());
}
{code}
2. TestDirectoryScanner always seems to time out in branch-2. It may be 
related to HDFS-14303.
3. TestJournalNodeRespectsBindHostKeys passes locally.

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>  Labels: release-blocker
> Attachments: HDFS-14771.branch-2.001.patch, 
> HDFS-14771.branch-2.002.patch, HDFS-14771.branch-2.003.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: improve fsimage load 
> time by writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-09-05 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923196#comment-16923196
 ] 

He Xiaoqiao commented on HDFS-14771:


Checked the failed unit tests; I don't think they are related to this change.
1. TestQuota failed because the parameter check rejects the input '3K' for 
`setQuota` in branch-2, which does not support the traditional binary prefix. 
The root cause is a different check mode between trunk and branch-2; the 
issue comes from HDFS-14633. I would like to resolve it later (see the 
parsing sketch after this list).
{code:java}
@Test
public void testClrQuotaOnRoot() throws Exception {
  long orignalQuota = dfs.getQuotaUsage(new Path("/")).getQuota();
  DFSAdmin admin = new DFSAdmin(conf);
  String[] args;
  args = new String[] {"-setQuota", "3K", "/"};
  runCommand(admin, args, false);
  assertEquals(3 * 1024, dfs.getQuotaUsage(new Path("/")).getQuota());
  args = new String[] {"-clrQuota", "/"};
  runCommand(admin, args, false);
  assertEquals(orignalQuota, dfs.getQuotaUsage(new Path("/")).getQuota());
}
{code}
2. TestDirectoryScanner always seems to time out in branch-2. It may be 
related to HDFS-14303.
3. TestJournalNodeRespectsBindHostKeys passes locally.
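
For context, a minimal sketch (an assumption about the trunk-style parsing;
not the actual DFSAdmin code) of the binary-prefix handling that branch-2
lacks:
{code:java}
// Illustrative parsing: "3K" -> 3 * 1024; plain digit strings pass through.
class QuotaParseSketch {
  static long parseQuota(String v) {
    char last = v.charAt(v.length() - 1);
    if (Character.isDigit(last)) {
      return Long.parseLong(v);
    }
    long multiplier;
    switch (Character.toUpperCase(last)) {
      case 'K': multiplier = 1L << 10; break;
      case 'M': multiplier = 1L << 20; break;
      case 'G': multiplier = 1L << 30; break;
      case 'T': multiplier = 1L << 40; break;
      default:
        throw new IllegalArgumentException("Unknown size prefix in " + v);
    }
    return Long.parseLong(v.substring(0, v.length() - 1)) * multiplier;
  }
}
{code}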

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>  Labels: release-blocker
> Attachments: HDFS-14771.branch-2.001.patch, 
> HDFS-14771.branch-2.002.patch, HDFS-14771.branch-2.003.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: improve fsimage load 
> time by writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14810) review FSNameSystem editlog sync

2019-09-05 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923173#comment-16923173
 ] 

He Xiaoqiao commented on HDFS-14810:


Checked the failed unit tests [TestNamenodeCapacityReport, 
TestDFSZKFailoverController, TestFileAppend, TestDFSUpgradeWithHA] and ran 
them locally; all passed. I think this is ready to commit. Please help review 
and double-check. cc [~jojochuang], [~ayushtkn].

> review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch, 
> HDFS-14810.003.patch, HDFS-14810.004.patch
>
>
> Refactor and unify the edit log sync in FSNamesystem, as HDFS-11246 
> mentioned.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-09-04 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923036#comment-16923036
 ] 

He Xiaoqiao commented on HDFS-14771:


[^HDFS-14771.branch-2.003.patch] disables this feature by default and adds an 
explicit declaration to the default configuration, following HDFS-14821. 
Pending what Jenkins says.

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>  Labels: release-blocker
> Attachments: HDFS-14771.branch-2.001.patch, 
> HDFS-14771.branch-2.002.patch, HDFS-14771.branch-2.003.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: improve fsimage load 
> time by writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-09-04 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14771:
---
Attachment: HDFS-14771.branch-2.003.patch

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>  Labels: release-blocker
> Attachments: HDFS-14771.branch-2.001.patch, 
> HDFS-14771.branch-2.002.patch, HDFS-14771.branch-2.003.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: fsimage load time by 
> writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14810) review FSNameSystem editlog sync

2019-09-04 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923035#comment-16923035
 ] 

He Xiaoqiao commented on HDFS-14810:


Thanks [~ayushtkn]. [^HDFS-14810.004.patch] removes the success variable in 
#addErasureCodingPolicies, on top of [^HDFS-14810.003.patch]. Pending Jenkins.

> review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch, 
> HDFS-14810.003.patch, HDFS-14810.004.patch
>
>
> refactor and unified type of edit log sync in FSNamesystem as HDFS-11246 
> mentioned.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14810) review FSNameSystem editlog sync

2019-09-04 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14810:
---
Attachment: HDFS-14810.004.patch

> review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch, 
> HDFS-14810.003.patch, HDFS-14810.004.patch
>
>
> refactor and unified type of edit log sync in FSNamesystem as HDFS-11246 
> mentioned.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14810) review FSNameSystem editlog sync

2019-09-04 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922726#comment-16922726
 ] 

He Xiaoqiao commented on HDFS-14810:


[~ayushtkn], [~jojochuang], [~kihwal], would you mind taking another review and 
confirming? Thanks.

> review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch, 
> HDFS-14810.003.patch
>
>
> refactor and unified type of edit log sync in FSNamesystem as HDFS-11246 
> mentioned.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14821) Make HDFS-14617 (fsimage sub-sections) off by default

2019-09-04 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922722#comment-16922722
 ] 

He Xiaoqiao commented on HDFS-14821:


+1 (non-binding) for [^HDFS-14821.001.patch]. Thanks [~sodonnell].

> Make HDFS-14617 (fsimage sub-sections) off by default
> -
>
> Key: HDFS-14821
> URL: https://issues.apache.org/jira/browse/HDFS-14821
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Blocker
> Fix For: 3.3.0
>
> Attachments: HDFS-14821.001.patch
>
>
> In HDFS-14617 I incorrectly stated that a fsimage with sub-sections listed in 
> the image summary section could be loaded without the HDFS-14617 patch 
> applied. However that was not correct. If the cluster was upgraded to a 
> version containing this feature, then upon downgrade the image will fail to 
> load.
> I believe the simplest solution to this problem is to have this feature 
> disabled by default. That way, after upgrading to a version with the feature, 
> the operator can make a decision to enable the feature. If they later want to 
> downgrade, then they must:
>  # Disable the feature
>  # Save the namespace, which will create an image without sub-sections
>  # Perform the downgrade.
> Even though the steps to downgrade are simple, we cannot expect people to be 
> aware of this, and hence it is safest to disable the feature by default.
> The only alternative is to create a new image layout version, but that seems 
> excessive for this feature, as the core contents of the image have not 
> changed and the sub-sections can be removed by disabling the feature and 
> saving the namespace.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-09-04 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922721#comment-16922721
 ] 

He Xiaoqiao commented on HDFS-14771:


{quote}Perhaps on branch 2 that is the way forward too?{quote}
+1, I would like to update the patch and turn this feature off by default on 
branch-2 as well. Let's keep it disabled until there is a more graceful solution.

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>  Labels: release-blocker
> Attachments: HDFS-14771.branch-2.001.patch, 
> HDFS-14771.branch-2.002.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: fsimage load time by 
> writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-09-04 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922715#comment-16922715
 ] 

He Xiaoqiao commented on HDFS-14771:


+1 for disabling this feature by default; it may be the safest and most direct 
approach.
One concern: do we need to reconsider changing the layout version, since the 
compatibility issue is now clear? Alternatively, is it necessary to provide a 
tool that converts between the two formats? I think that would help with 
upgrade/downgrade. In my own practice, I have to prepare fsimage/editlog tools 
before an upgrade so that I can roll back if something unexpected happens, 
especially when the layout versions of fsimage/editlog differ or are not fully 
compatible. I believe we can already do this with two OIV versions (parse the 
binary fsimage to XML with the new OIV first, then rebuild the binary fsimage 
from that XML with the old OIV), but that is not always valid and not very 
convenient. Of course, this should be a separate topic to discuss. FYI.
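
For concreteness, a sketch of that OIV round trip. This assumes both versions 
ship the ReverseXML processor (added by HDFS-9835 in 2.8.0); the image file 
name is illustrative:
{code}
# new OIV: binary fsimage -> XML
hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml
# old OIV: XML -> binary fsimage in the old format
hdfs oiv -p ReverseXML -i fsimage.xml -o fsimage_downgraded
{code}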

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>  Labels: release-blocker
> Attachments: HDFS-14771.branch-2.001.patch, 
> HDFS-14771.branch-2.002.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: fsimage load time by 
> writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14810) review FSNameSystem editlog sync

2019-09-04 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922692#comment-16922692
 ] 

He Xiaoqiao commented on HDFS-14810:


Thanks [~ayushtkn] for your nice review.
1. I removed `snapshotPath` because it is always null when an 
`AccessControlException` is hit, so it is safe to drop from the audit log.
2. About `AddErasureCodingPolicy`: I had added `success` based on the `audit 
false should be only for ACE` point mentioned above. I just checked it 
carefully again and believe we can eliminate the success variable, since it is 
always set to true whether or not an exception occurs, so we can remove it. 
Would you mind helping to double check? Thanks again.

> review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch, 
> HDFS-14810.003.patch
>
>
> refactor and unified type of edit log sync in FSNamesystem as HDFS-11246 
> mentioned.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-09-03 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921585#comment-16921585
 ] 

He Xiaoqiao commented on HDFS-14771:


[^HDFS-14771.branch-2.002.patch] tries to fix the checkstyle issues Jenkins 
reported.
I have tested this patch on branch-2.
*branch-2 with changes* works well: it loads the old-format fsimage, saves a 
checkpoint, and loads the new-format fsimage. The performance improvement is 
similar to what was mentioned above; to offer the numbers again, with a ~6GB 
fsimage containing 58M inodes, loading took 201s before the change versus 108s 
after.
*branch-2 with changes* OIV works well for parsing both the old-format and the 
new-format fsimage (only tested parsing from binary fsimage to XML).

*branch-2 without changes* cannot load the new-format fsimage, which is 
expected. The root cause is that the `INODE_SUB` and `INODE_DIR_SUB` section 
index entries in FileSummary are not filtered out; after being sorted by 
FSImageFormatProtobuf:L270~283, the sections look like the following example.
{code:java}
0 = {FsImageProto$FileSummary$Section@3399} "name: "INODE"\nlength: 
6413637141\noffset: 39\n"
1 = {FsImageProto$FileSummary$Section@3387} "name: "INODE_SUB"\nlength: 
530412553\noffset: 39\n"
2 = {FsImageProto$FileSummary$Section@3388} "name: "INODE_SUB"\nlength: 
543759958\noffset: 530412592\n"
3 = {FsImageProto$FileSummary$Section@3389} "name: "INODE_SUB"\nlength: 
532201078\noffset: 1074172550\n"
4 = {FsImageProto$FileSummary$Section@3390} "name: "INODE_SUB"\nlength: 
532840911\noffset: 1606373628\n"
5 = {FsImageProto$FileSummary$Section@3391} "name: "INODE_SUB"\nlength: 
545683588\noffset: 2139214539\n"
6 = {FsImageProto$FileSummary$Section@3392} "name: "INODE_SUB"\nlength: 
547586964\noffset: 2684898127\n"
7 = {FsImageProto$FileSummary$Section@3393} "name: "INODE_SUB"\nlength: 
531512575\noffset: 3232485091\n"
8 = {FsImageProto$FileSummary$Section@3394} "name: "INODE_SUB"\nlength: 
528600331\noffset: 3763997666\n"
9 = {FsImageProto$FileSummary$Section@3395} "name: "INODE_SUB"\nlength: 
531448330\noffset: 4292597997\n"
10 = {FsImageProto$FileSummary$Section@3396} "name: "INODE_SUB"\nlength: 
532305282\noffset: 4824046327\n"
11 = {FsImageProto$FileSummary$Section@3397} "name: "INODE_SUB"\nlength: 
528844225\noffset: 5356351609\n"
12 = {FsImageProto$FileSummary$Section@3398} "name: "INODE_SUB"\nlength: 
528441346\noffset: 5885195834\n"
13 = {FsImageProto$FileSummary$Section@3400} "name: "INODE_DIR_SUB"\nlength: 
25274497\noffset: 6413637180\n"
14 = {FsImageProto$FileSummary$Section@3401} "name: "INODE_DIR_SUB"\nlength: 
23845900\noffset: 6438911677\n"
15 = {FsImageProto$FileSummary$Section@3402} "name: "INODE_DIR_SUB"\nlength: 
25808697\noffset: 6462757577\n"
16 = {FsImageProto$FileSummary$Section@3386} "name: "NS_INFO"\nlength: 
31\noffset: 8\n"
17 = {FsImageProto$FileSummary$Section@3418} "name: "STRING_TABLE"\nlength: 
89535\noffset: 6719821058\n"
18 = {FsImageProto$FileSummary$Section@3415} "name: "INODE_REFERENCE"\nlength: 
0\noffset: 6719063168\n"
19 = {FsImageProto$FileSummary$Section@3414} "name: "SNAPSHOT"\nlength: 
5\noffset: 6719063163\n"
20 = {FsImageProto$FileSummary$Section@3412} "name: "INODE_DIR"\nlength: 
305379882\noffset: 6413637180\n"
21 = {FsImageProto$FileSummary$Section@3413} "name: 
"FILES_UNDERCONSTRUCTION"\nlength: 46101\noffset: 6719017062\n"
22 = {FsImageProto$FileSummary$Section@3416} "name: "SECRET_MANAGER"\nlength: 
757851\noffset: 6719063168\n"
23 = {FsImageProto$FileSummary$Section@3417} "name: "CACHE_MANAGER"\nlength: 
39\noffset: 6719821019\n"
24 = {FsImageProto$FileSummary$Section@3403} "name: "INODE_DIR_SUB"\nlength: 
25627209\noffset: 6488566274\n"
25 = {FsImageProto$FileSummary$Section@3404} "name: "INODE_DIR_SUB"\nlength: 
25505770\noffset: 6514193483\n"
26 = {FsImageProto$FileSummary$Section@3405} "name: "INODE_DIR_SUB"\nlength: 
25610853\noffset: 6539699253\n"
27 = {FsImageProto$FileSummary$Section@3406} "name: "INODE_DIR_SUB"\nlength: 
25657973\noffset: 6565310106\n"
28 = {FsImageProto$FileSummary$Section@3407} "name: "INODE_DIR_SUB"\nlength: 
25469495\noffset: 6590968079\n"
29 = {FsImageProto$FileSummary$Section@3408} "name: "INODE_DIR_SUB"\nlength: 
25809460\noffset: 6616437574\n"
30 = {FsImageProto$FileSummary$Section@3409} "name: "INODE_DIR_SUB"\nlength: 
25560451\noffset: 6642247034\n"
31 = {FsImageProto$FileSummary$Section@3410} "name: "INODE_DIR_SUB"\nlength: 
25376445\noffset: 6667807485\n"
32 = {FsImageProto$FileSummary$Section@3411} "name: "INODE_DIR_SUB"\nlength: 
25833132\noffset: 6693183930\n"
{code}
Loading then proceeds section by section, and it hits an exception while 
loading the first section, `INODE`, because `NS_INFO` and `STRING_TABLE` have 
not been loaded yet. A sketch of the missing filtering follows, and an example 
of the error closes this comment.
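
A hypothetical sketch (not the actual HDFS-14617 change) of the filtering an 
old reader would need before sorting: drop the sub-section index entries from 
FileSummary so only the classic sections remain. Types are the generated 
FsImageProto classes visible in the dump above:
{code:java}
List<FsImageProto.FileSummary.Section> sections =
    new ArrayList<>(summary.getSectionsList());
Iterator<FsImageProto.FileSummary.Section> it = sections.iterator();
while (it.hasNext()) {
  // INODE_SUB and INODE_DIR_SUB are index-only entries; a reader that
  // predates the feature should skip them entirely.
  if (it.next().getName().endsWith("_SUB")) {
    it.remove();
  }
}
// ...then sort the remaining sections by offset and load as before.
{code}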
{code:java}
2019-09-03 19:32:08,577 ERROR org.apache.hadoop.hdfs.server.namenode.FSImage: 
Failed to load image from 

[jira] [Updated] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-09-03 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14771:
---
Attachment: HDFS-14771.branch-2.002.patch

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>  Labels: release-blocker
> Attachments: HDFS-14771.branch-2.001.patch, 
> HDFS-14771.branch-2.002.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: fsimage load time by 
> writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14810) review FSNameSystem editlog sync

2019-09-02 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920981#comment-16920981
 ] 

He Xiaoqiao commented on HDFS-14810:


Checked the four failed unit tests; all are related to OOM. Re-ran them locally 
and all passed, so they seem unrelated to this change. FYI.

> review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch, 
> HDFS-14810.003.patch
>
>
> refactor and unified type of edit log sync in FSNamesystem as HDFS-11246 
> mentioned.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14810) review FSNameSystem editlog sync

2019-09-02 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920719#comment-16920719
 ] 

He Xiaoqiao commented on HDFS-14810:


[^HDFS-14810.003.patch] changes the following points based on v002, pending 
Jenkins:
1. throw AccessControlException out of #enableErasureCodingPolicy and 
#disableErasureCodingPolicy.
2. add a success audit log for #isFileClosed and #checkAccess.
[~ayushtkn], please take another review. Thanks.
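
A minimal sketch of the audit pattern these two points move toward, in the 
HDFS-11246 style of logging outside the lock (the method shape and the 
FSDirStatAndListingOp call are assumptions for illustration, not the committed 
code):
{code:java}
boolean isFileClosed(String src) throws IOException {
  final String operationName = "isFileClosed";
  boolean closed;
  try {
    readLock();
    try {
      checkOperation(OperationCategory.READ);
      closed = FSDirStatAndListingOp.isFileClosed(dir, src);
    } finally {
      readUnlock(operationName);
    }
  } catch (AccessControlException ace) {
    logAuditEvent(false, operationName, src);  // failure audit, then rethrow
    throw ace;
  }
  logAuditEvent(true, operationName, src);     // success audit, outside the lock
  return closed;
}
{code}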

> review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch, 
> HDFS-14810.003.patch
>
>
> refactor and unified type of edit log sync in FSNamesystem as HDFS-11246 
> mentioned.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14810) review FSNameSystem editlog sync

2019-09-02 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14810:
---
Attachment: HDFS-14810.003.patch

> review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch, 
> HDFS-14810.003.patch
>
>
> refactor and unified type of edit log sync in FSNamesystem as HDFS-11246 
> mentioned.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14810) review FSNameSystem editlog sync

2019-09-01 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920612#comment-16920612
 ] 

He Xiaoqiao commented on HDFS-14810:


[^HDFS-14810.002.patch] tries to fix some exceptions and follows the comments 
[~ayushtkn] made above.
I am confused about #enableErasureCodingPolicy and #disableErasureCodingPolicy, 
which do not throw AccessControlException when they hit one. Is there any 
special consideration behind that?

> review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch
>
>
> refactor and unified type of edit log sync in FSNamesystem as HDFS-11246 
> mentioned.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14810) review FSNameSystem editlog sync

2019-09-01 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14810:
---
Attachment: HDFS-14810.002.patch

> review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch
>
>
> refactor and unified type of edit log sync in FSNamesystem as HDFS-11246 
> mentioned.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-01 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920441#comment-16920441
 ] 

He Xiaoqiao commented on HDFS-13157:


Thanks [~belugabehr] for the great work and detailed analysis. I believe this 
issue is even more visible in a Federation setup. +1 for the deep dig by 
[~sodonnell]: we can tune the parameters [blocksReplWorkMultiplier, 
maxReplicationStreams, maxReplicationStreamsHardLimit] per namespace (a config 
sketch follows this comment), but a decommission is commonly triggered from the 
shell for all namespaces at once, so different NameNodes send replication 
commands at the same time when the node reports to multiple namespaces. The 
load on the decommissioning node is then out of control; I have hit both 
network and single-disk I/O bottlenecks.
I believe the current parameters are enough to solve the network bottleneck.
For the single-disk I/O bottleneck, +1 for updating 
DatanodeDescriptor#BlockIterator to iterate blocks across alternating disks 
rather than draining one disk at a time.
Another thought: we should stop dispatching write operations to a 
decommissioning node and lower its read priority to the lowest, just as for a 
decommissioned node; then a high load on decommissioning nodes would not affect 
clients or the cluster at all.
This discussion does not cover RAID or the scenarios [~zhangchen] mentioned 
above.
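
For reference, a sketch of those per-namespace knobs in hdfs-site.xml. The 
property names are the standard branch-2 keys behind the short names above; the 
values are illustrative only:
{code:xml}
<property>
  <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
  <value>2</value>
</property>
<property>
  <name>dfs.namenode.replication.max-streams</name>
  <value>2</value>
</property>
<property>
  <name>dfs.namenode.replication.max-streams-hard-limit</name>
  <value>4</value>
</property>
{code}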

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order._. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> affect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14810) review FSNameSystem editlog sync

2019-09-01 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920413#comment-16920413
 ] 

He Xiaoqiao commented on HDFS-14810:


Thanks [~ayushtkn] for your feedback; I will address it in the next patch. Thanks again.

> review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14810.001.patch
>
>
> refactor and unified type of edit log sync in FSNamesystem as HDFS-11246 
> mentioned.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14305) Serial number in BlockTokenSecretManager could overlap between different namenodes

2019-09-01 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920397#comment-16920397
 ] 

He Xiaoqiao commented on HDFS-14305:


Thanks [~csun] for your comments. To be honest, I have no hands-on experience 
with multi-NN setups, so I am not sure whether relying on the configuration is 
stable there. Is there any case where an Observer NameNode runs without SBN 
configuration? In HA mode we can be sure that the ANN and SBN share the same 
configuration for all NameNode entries; please confirm the behavior for a 
multi-NN installation if you have any experience with one.
On the other point you raised above: relying on configuration alone indeed 
cannot handle the case of adding NameNodes to a cluster. FYI.
Thanks [~csun] again.

> Serial number in BlockTokenSecretManager could overlap between different 
> namenodes
> --
>
> Key: HDFS-14305
> URL: https://issues.apache.org/jira/browse/HDFS-14305
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode, security
>Reporter: Chao Sun
>Assignee: He Xiaoqiao
>Priority: Major
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-14305.001.patch, HDFS-14305.002.patch, 
> HDFS-14305.003.patch, HDFS-14305.004.patch, HDFS-14305.005.patch, 
> HDFS-14305.006.patch
>
>
> Currently, a {{BlockTokenSecretManager}} starts with a random integer as the 
> initial serial number, and then use this formula to rotate it:
> {code:java}
> this.intRange = Integer.MAX_VALUE / numNNs;
> this.nnRangeStart = intRange * nnIndex;
> this.serialNo = (this.serialNo % intRange) + (nnRangeStart);
>  {code}
> while {{numNNs}} is the total number of NameNodes in the cluster, and 
> {{nnIndex}} is the index of the current NameNode specified in the 
> configuration {{dfs.ha.namenodes.}}.
> However, with this approach, different NameNode could have overlapping ranges 
> for serial number. For simplicity, let's assume {{Integer.MAX_VALUE}} is 100, 
> and we have 2 NameNodes {{nn1}} and {{nn2}} in configuration. Then the ranges 
> for these two are:
> {code}
> nn1 -> [-49, 49]
> nn2 -> [1, 99]
> {code}
> This is because the initial serial number could be any negative integer.
> Moreover, when the keys are updated, the serial number will again be updated 
> with the formula:
> {code}
> this.serialNo = (this.serialNo % intRange) + (nnRangeStart);
> {code}
> which means the new serial number could be updated to a range that belongs to 
> a different NameNode, thus increasing the chance of collision again.
> When the collision happens, DataNodes could overwrite an existing key which 
> will cause clients to fail because of {{InvalidToken}} error.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14802) The feature of protect directories should be used in RenameOp

2019-09-01 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920394#comment-16920394
 ] 

He Xiaoqiao commented on HDFS-14802:


Thanks [~ferhui] for your contribution. I think we may need regex matching for 
the protected directories, e.g. to protect database directories in the 
warehouse from deletion; in my own practice I have had to configure very many 
paths. If we supported regex matching, it would be much simpler. FYI.
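
A purely illustrative sketch of the idea (the pattern and paths are made up; 
fs.protected.directories today takes literal paths):
{code:java}
import java.util.regex.Pattern;

// One pattern entry could cover every database directory in the
// warehouse instead of enumerating each path by hand.
Pattern protectedDirs = Pattern.compile("^/user/hive/warehouse/[^/]+\\.db$");
protectedDirs.matcher("/user/hive/warehouse/sales.db").matches(); // true
protectedDirs.matcher("/tmp/scratch").matches();                  // false
{code}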

> The feature of protect directories should be used in RenameOp
> -
>
> Key: HDFS-14802
> URL: https://issues.apache.org/jira/browse/HDFS-14802
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-14802.001.patch, HDFS-14802.002.patch
>
>
> Now we could set fs.protected.directories to prevent users from deleting 
> important directories. But users can delete directories around the limitation.
> 1. Rename the directories and delete them.
> 2. move the directories to trash and namenode will delete them.
> So I think we should use the feature of protected directories in RenameOp



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14807) SetTimes updates all negative values apart from -1

2019-09-01 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920393#comment-16920393
 ] 

He Xiaoqiao commented on HDFS-14807:


[^HDFS-14807-02.patch] LGTM.
+1 (non-binding).

> SetTimes updates all negative values apart from -1
> --
>
> Key: HDFS-14807
> URL: https://issues.apache.org/jira/browse/HDFS-14807
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Harshakiran Reddy
>Assignee: Ayush Saxena
>Priority: Major
> Attachments: HDFS-14807-01.patch, HDFS-14807-02.patch
>
>
> Set Times API, updates negative time on all negative values apart from -1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-12733) Option to disable to namenode local edits

2019-09-01 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao reassigned HDFS-12733:
--

Assignee: He Xiaoqiao  (was: Brahma Reddy Battula)

> Option to disable to namenode local edits
> -
>
> Key: HDFS-12733
> URL: https://issues.apache.org/jira/browse/HDFS-12733
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode, performance
>Reporter: Brahma Reddy Battula
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12733-001.patch, HDFS-12733-002.patch, 
> HDFS-12733-003.patch, HDFS-12733.004.patch, HDFS-12733.005.patch, 
> HDFS-12733.006.patch
>
>
> As of now, Edits will be written in local and shared locations which will be 
> redundant and local edits never used in HA setup.
> Disabling local edits gives little performance improvement.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12733) Option to disable to namenode local edits

2019-09-01 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920390#comment-16920390
 ] 

He Xiaoqiao commented on HDFS-12733:


To [~brahmareddy]: I just assigned this JIRA to myself; please feel free to 
assign it back if you are around and would like to keep following up on this 
ticket. Thanks.

> Option to disable to namenode local edits
> -
>
> Key: HDFS-12733
> URL: https://issues.apache.org/jira/browse/HDFS-12733
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode, performance
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
>Priority: Major
> Attachments: HDFS-12733-001.patch, HDFS-12733-002.patch, 
> HDFS-12733-003.patch, HDFS-12733.004.patch, HDFS-12733.005.patch, 
> HDFS-12733.006.patch
>
>
> As of now, Edits will be written in local and shared locations which will be 
> redundant and local edits never used in HA setup.
> Disabling local edits gives little performance improvement.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12733) Option to disable to namenode local edits

2019-09-01 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920388#comment-16920388
 ] 

He Xiaoqiao commented on HDFS-12733:


Thanks [~ayushtkn] for picking up this ticket. I would like to continue 
updating it, and patches are ready for the different solutions. However, it 
seems we have not reached agreement yet. Any thoughts? Do we need more comments 
and discussion, or a vote through the mailing list? Thanks again.

> Option to disable to namenode local edits
> -
>
> Key: HDFS-12733
> URL: https://issues.apache.org/jira/browse/HDFS-12733
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode, performance
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
>Priority: Major
> Attachments: HDFS-12733-001.patch, HDFS-12733-002.patch, 
> HDFS-12733-003.patch, HDFS-12733.004.patch, HDFS-12733.005.patch, 
> HDFS-12733.006.patch
>
>
> As of now, Edits will be written in local and shared locations which will be 
> redundant and local edits never used in HA setup.
> Disabling local edits gives little performance improvement.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14810) review FSNameSystem editlog sync

2019-09-01 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920386#comment-16920386
 ] 

He Xiaoqiao commented on HDFS-14810:


Some unnecessary edit logs, such as those reported in HDFS-11291, are not 
updated in this ticket.

> review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14810.001.patch
>
>
> refactor and unified type of edit log sync in FSNamesystem as HDFS-11246 
> mentioned.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14810) review FSNameSystem editlog sync

2019-09-01 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14810:
---
Attachment: HDFS-14810.001.patch
Status: Patch Available  (was: Open)

Submitted the initial patch; pending Jenkins.

> review FSNameSystem editlog sync
> 
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14810.001.patch
>
>
> refactor and unified type of edit log sync in FSNamesystem as HDFS-11246 
> mentioned.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14810) review FSNameSystem editlog sync

2019-09-01 Thread He Xiaoqiao (Jira)
He Xiaoqiao created HDFS-14810:
--

 Summary: review FSNameSystem editlog sync
 Key: HDFS-14810
 URL: https://issues.apache.org/jira/browse/HDFS-14810
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: He Xiaoqiao
Assignee: He Xiaoqiao


refactor and unified type of edit log sync in FSNamesystem as HDFS-11246 
mentioned.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14305) Serial number in BlockTokenSecretManager could overlap between different namenodes

2019-08-28 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918256#comment-16918256
 ] 

He Xiaoqiao commented on HDFS-14305:


[~shv], thanks very much for picking up this JIRA and revisiting it.
IMO, in order to avoid overlap between different NameNodes, we have to split 
the serial number space and distribute ranges to the NNs. However, we cannot 
determine the total number of NNs per namespace from configuration alone, 
especially for multi-NN setups (HDFS-6440); please correct me if I am wrong. 
Hence the restriction, since it is very unlikely that anyone runs more than 64 
NNs in one namespace. I am happy to follow up and update this logic if there 
are other thoughts. Thanks [~shv].
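
A hypothetical sketch of the idea behind the 64-NN restriction (constants and 
names are illustrative, not the committed code): reserve the top bits of the 
non-negative serial number range for the NameNode index so the per-NN ranges 
cannot overlap.
{code:java}
private static final int NN_INDEX_BITS = 6;                       // up to 64 NNs
private static final int LOW_MASK = (1 << (31 - NN_INDEX_BITS)) - 1;

private void setSerialNo(int newSerialNo) {
  // low 25 bits rotate on each key roll; high bits pin the NN's range
  this.serialNo = (newSerialNo & LOW_MASK) | (nnIndex << (31 - NN_INDEX_BITS));
}
{code}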

> Serial number in BlockTokenSecretManager could overlap between different 
> namenodes
> --
>
> Key: HDFS-14305
> URL: https://issues.apache.org/jira/browse/HDFS-14305
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode, security
>Reporter: Chao Sun
>Assignee: He Xiaoqiao
>Priority: Major
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-14305.001.patch, HDFS-14305.002.patch, 
> HDFS-14305.003.patch, HDFS-14305.004.patch, HDFS-14305.005.patch, 
> HDFS-14305.006.patch
>
>
> Currently, a {{BlockTokenSecretManager}} starts with a random integer as the 
> initial serial number, and then use this formula to rotate it:
> {code:java}
> this.intRange = Integer.MAX_VALUE / numNNs;
> this.nnRangeStart = intRange * nnIndex;
> this.serialNo = (this.serialNo % intRange) + (nnRangeStart);
>  {code}
> while {{numNNs}} is the total number of NameNodes in the cluster, and 
> {{nnIndex}} is the index of the current NameNode specified in the 
> configuration {{dfs.ha.namenodes.}}.
> However, with this approach, different NameNode could have overlapping ranges 
> for serial number. For simplicity, let's assume {{Integer.MAX_VALUE}} is 100, 
> and we have 2 NameNodes {{nn1}} and {{nn2}} in configuration. Then the ranges 
> for these two are:
> {code}
> nn1 -> [-49, 49]
> nn2 -> [1, 99]
> {code}
> This is because the initial serial number could be any negative integer.
> Moreover, when the keys are updated, the serial number will again be updated 
> with the formula:
> {code}
> this.serialNo = (this.serialNo % intRange) + (nnRangeStart);
> {code}
> which means the new serial number could be updated to a range that belongs to 
> a different NameNode, thus increasing the chance of collision again.
> When the collision happens, DataNodes could overwrite an existing key which 
> will cause clients to fail because of {{InvalidToken}} error.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-08-27 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917418#comment-16917418
 ] 

He Xiaoqiao commented on HDFS-14771:


Thanks [~kihwal] and [~sodonnell] for the discussion; it seems I misunderstood 
the fsimage layout version. Based on the comments above, the layout version 
needs to change only on an incompatibility, not on every format change, right?
For the branch-2 backport, I am still testing. So far every test case I ran 
(not including the OIV test cases) works well, with no reproducible exception. 
I will attach the results once the tests finish. Thanks again.

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>  Labels: release-blocker
> Attachments: HDFS-14771.branch-2.001.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: fsimage load time by 
> writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14497) Write lock held by metasave impact following RPC processing

2019-08-26 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916380#comment-16916380
 ] 

He Xiaoqiao commented on HDFS-14497:


Verified the failed unit tests; both passed locally. Please take a review, 
[~jojochuang]. Thanks.

> Write lock held by metasave impact following RPC processing
> ---
>
> Key: HDFS-14497
> URL: https://issues.apache.org/jira/browse/HDFS-14497
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14497-addendum.001.patch, HDFS-14497.001.patch
>
>
> NameNode meta save hold global write lock currently, so following RPC r/w 
> request or inner-thread of NameNode could be paused if they try to acquire 
> global read/write lock and have to wait before metasave release it.
> I propose to change write lock to read lock and let some read request could 
> be process normally. I think it could not change informations which meta save 
> try to get if we try to open read request.
> Actually, we need ensure that there are only one thread to execute metaSave, 
> otherwise, output streams could meet exception especially both streams hold 
> the same file handle or some other same output stream.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-08-26 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916314#comment-16916314
 ] 

He Xiaoqiao commented on HDFS-14771:


Thanks [~kihwal] for your response. HDFS-14617 discussed the compatibility of 
this change for trunk; I am verifying the compatibility for branch-2 and will 
attach the results when finished.
{quote}Is layout version being changed?{quote}
The layout version is not changed in the current demo patch. I agree to bump it 
from version 1 to version 2, and I believe we should update trunk as well. 
cc [~sodonnell], [~jojochuang].

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14771.branch-2.001.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: fsimage load time by 
> writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11246) FSNameSystem#logAuditEvent should be called outside the read or write locks

2019-08-25 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915452#comment-16915452
 ] 

He Xiaoqiao commented on HDFS-11246:


Checked the failed unit tests; most of them seem related to OOM. I ran them 
locally and all passed. Please help check if you have time. Thanks.

> FSNameSystem#logAuditEvent should be called outside the read or write locks
> ---
>
> Key: HDFS-11246
> URL: https://issues.apache.org/jira/browse/HDFS-11246
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Kuhu Shukla
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-11246.001.patch, HDFS-11246.002.patch, 
> HDFS-11246.003.patch, HDFS-11246.004.patch, HDFS-11246.005.patch, 
> HDFS-11246.006.patch, HDFS-11246.007.patch, HDFS-11246.008.patch, 
> HDFS-11246.009.patch, HDFS-11246.010.patch, HDFS-11246.011.patch
>
>
> {code}
> readLock();
> boolean success = true;
> ContentSummary cs;
> try {
>   checkOperation(OperationCategory.READ);
>   cs = FSDirStatAndListingOp.getContentSummary(dir, src);
> } catch (AccessControlException ace) {
>   success = false;
>   logAuditEvent(success, operationName, src);
>   throw ace;
> } finally {
>   readUnlock(operationName);
> }
> {code}
> It would be nice to have audit logging outside the lock esp. in scenarios 
> where applications hammer a given operation several times. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-08-25 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915273#comment-16915273
 ] 

He Xiaoqiao commented on HDFS-14771:


Thanks [~linyiqun] for your feedback. The demo patch 
[^HDFS-14771.branch-2.001.patch] is exactly the same as the patch merged to 
trunk, with no other changes.
{quote}I prefer to convert this JIRA to the independent JIRA and add a link to 
HDFS-14617 since HDFS-14617 has been done and closed.{quote}
+1, considering this feature is not completely ready yet (e.g. the OIV support 
mentioned in HDFS-14617, and so on). I think we may need another Über-jira? 
Thanks again.

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14771.branch-2.001.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: fsimage load time by 
> writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model

2019-08-25 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915262#comment-16915262
 ] 

He Xiaoqiao commented on HDFS-14648:


Thanks [~leosun08] for your contribution. I went through 
[^HDFS-14648.004.patch] and have some minor comments:
1. There are some checkstyle issues reported by Jenkins above; please take a 
look.
2. It is better to group configuration items of the same module together in 
hdfs-default.xml so related items are easy to find; I suggest defining 
`dfs.client.deadnode.detect.enabled` together with the other `dfs.client.*` 
items.
3. Some method names are open to different interpretations. For instance, for 
`DeadNodeDetector#removeNodeFromDetect` my first impression was that it removes 
a node from `deadNodes`, which is shared by all DFSInputStreams in the same 
DFSClient; actually it only removes the node from `localNodes`, which is 
visible to just one DFSInputStream. Right?
4. About DeadNodeDetector: I do not see why it defines different STATEs. Will 
they be used by follow-up implementations?
5. About DeadNodeDetector#run: should it catch InterruptedException outside the 
while loop and return?
6. Some constants such as '5000'/'1' are unexplained. I think we should define 
them at the top of the class with some annotation. (A sketch of points 5 and 6 
follows this comment.)
7. IIUC, once a node is detected as dead it should not appear in the next 
pipeline, right? But I do not see anywhere that these deadNodes are added to 
`excludeNodes`.
Thanks [~leosun08] again.
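
An illustrative sketch of points 5 and 6 together (names and values assumed, 
not taken from the patch):
{code:java}
/** Pause between detection rounds, in milliseconds. */
private static final long DETECT_INTERVAL_MS = 5000L;

@Override
public void run() {
  try {
    while (true) {
      // ... probe the nodes in deadNodes and re-check them ...
      Thread.sleep(DETECT_INTERVAL_MS);
    }
  } catch (InterruptedException e) {
    // caught outside the loop: an interrupt ends the detector thread
    Thread.currentThread().interrupt();
  }
}
{code}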

> DeadNodeDetector basic model
> 
>
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, 
> HDFS-14648.003.patch, HDFS-14648.004.patch
>
>
> This Jira constructs DeadNodeDetector state machine model. The function it 
> implements as follow:
>  # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode 
> of the block is found to inaccessible, put the DataNode into 
> DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when 
> DataNode is not accessible, it is likely that the replica has been removed 
> from the DataNode.Therefore, it needs to be confirmed by re-probing and 
> requires a higher priority processing.
>  # DeadNodeDetector will periodically detect the Node in 
> DeadNodeDetector#deadnode, If the access is successful, the Node will be 
> moved from DeadNodeDetector#deadnode. Continuous detection of the dead node 
> is necessary. The DataNode need rejoin the cluster due to a service 
> restart/machine repair. The DataNode may be permanently excluded if there is 
> no added probe mechanism.
>  # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using 
> DataNode. When the DFSInputstream is closed, it will be moved from 
> DeadNodeDetector#dfsInputStreamNodes.
>  # Every time get the global deanode, update the DeadNodeDetector#deadnode. 
> The new DeadNodeDetector#deadnode Equals to the intersection of the old 
> DeadNodeDetector#deadnode and the Datanodes are by 
> DeadNodeDetector#dfsInputStreamNodes.
>  # DeadNodeDetector has a switch that is turned off by default. When it is 
> closed, each DFSInputstream still uses its own local deadnode.
>  # This feature has been used in the XIAOMI production environment for a long 
> time. Reduced hbase read stuck, due to node hangs.
>  # Just open the DeadNodeDetector switch and you can use it directly. No 
> other restrictions. Don't want to use DeadNodeDetector, just close it.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Reopened] (HDFS-14497) Write lock held by metasave impact following RPC processing

2019-08-25 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao reopened HDFS-14497:


> Write lock held by metasave impact following RPC processing
> ---
>
> Key: HDFS-14497
> URL: https://issues.apache.org/jira/browse/HDFS-14497
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14497-addendum.001.patch, HDFS-14497.001.patch
>
>
> NameNode meta save hold global write lock currently, so following RPC r/w 
> request or inner-thread of NameNode could be paused if they try to acquire 
> global read/write lock and have to wait before metasave release it.
> I propose to change write lock to read lock and let some read request could 
> be process normally. I think it could not change informations which meta save 
> try to get if we try to open read request.
> Actually, we need ensure that there are only one thread to execute metaSave, 
> otherwise, output streams could meet exception especially both streams hold 
> the same file handle or some other same output stream.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14497) Write lock held by metasave impact following RPC processing

2019-08-25 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915230#comment-16915230
 ] 

He Xiaoqiao edited comment on HDFS-14497 at 8/25/19 1:45 PM:
-

Thanks [~jojochuang],
{quote}suggests metaSaveLock should be a final object{quote}
It makes sense to me. Reopened this JIRA and submitted 
[^HDFS-14497-addendum.001.patch], which changes metaSaveLock to a final object. 
Please help take a review.


was (Author: hexiaoqiao):
Thanks [~jojochuang],
{quote}suggests metaSaveLock should be a final object{quote}
it makes sense to me.  [^HDFS-14497-addendum.001.patch] try to change 
metaSaveLock to be a final object. Please help to take reviews.

> Write lock held by metasave impact following RPC processing
> ---
>
> Key: HDFS-14497
> URL: https://issues.apache.org/jira/browse/HDFS-14497
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14497-addendum.001.patch, HDFS-14497.001.patch
>
>
> NameNode metasave currently holds the global write lock, so subsequent RPC
> read/write requests or internal NameNode threads can be paused while they
> wait for metasave to release the lock before they can acquire the global
> read/write lock. I propose changing the write lock to a read lock so that
> read requests can still be processed normally; serving reads concurrently
> does not change the information that metasave collects. We also need to
> ensure that only one thread executes metaSave at a time, otherwise the
> output streams could hit exceptions, especially when they share the same
> file handle or the same underlying output stream.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-08-25 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14771:
---
Attachment: HDFS-14771.branch-2.001.patch
Status: Patch Available  (was: Open)

Submitted a demo patch following HDFS-14617; pending Jenkins results.

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14771.branch-2.001.patch
>
>
> This JIRA aims to backport HDFS-14617 to branch-2: improve fsimage load
> time by writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14497) Write lock held by metasave impact following RPC processing

2019-08-25 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915230#comment-16915230
 ] 

He Xiaoqiao commented on HDFS-14497:


Thanks [~jojochuang],
{quote}suggests metaSaveLock should be a final object{quote}
it makes sense to me.  [^HDFS-14497-addendum.001.patch] try to change 
metaSaveLock to be a final object. Please help to take reviews.

> Write lock held by metasave impact following RPC processing
> ---
>
> Key: HDFS-14497
> URL: https://issues.apache.org/jira/browse/HDFS-14497
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14497-addendum.001.patch, HDFS-14497.001.patch
>
>
> NameNode metasave currently holds the global write lock, so subsequent RPC
> read/write requests or internal NameNode threads can be paused while they
> wait for metasave to release the lock before they can acquire the global
> read/write lock. I propose changing the write lock to a read lock so that
> read requests can still be processed normally; serving reads concurrently
> does not change the information that metasave collects. We also need to
> ensure that only one thread executes metaSave at a time, otherwise the
> output streams could hit exceptions, especially when they share the same
> file handle or the same underlying output stream.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14497) Write lock held by metasave impact following RPC processing

2019-08-25 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14497:
---
Attachment: HDFS-14497-addendum.001.patch

> Write lock held by metasave impact following RPC processing
> ---
>
> Key: HDFS-14497
> URL: https://issues.apache.org/jira/browse/HDFS-14497
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14497-addendum.001.patch, HDFS-14497.001.patch
>
>
> NameNode metasave currently holds the global write lock, so subsequent RPC
> read/write requests or internal NameNode threads can be paused while they
> wait for metasave to release the lock before they can acquire the global
> read/write lock. I propose changing the write lock to a read lock so that
> read requests can still be processed normally; serving reads concurrently
> does not change the information that metasave collects. We also need to
> ensure that only one thread executes metaSave at a time, otherwise the
> output streams could hit exceptions, especially when they share the same
> file handle or the same underlying output stream.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14617) Improve fsimage load time by writing sub-sections to the fsimage index

2019-08-25 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915221#comment-16915221
 ] 

He Xiaoqiao commented on HDFS-14617:


[~csun],[~sodonnell],[~jojochuang] HDFS-14771 tracks backporting this patch
to branch-2. I will try to test and cover most FsImage cases once the
branch-2 patch is ready.

> Improve fsimage load time by writing sub-sections to the fsimage index
> --
>
> Key: HDFS-14617
> URL: https://issues.apache.org/jira/browse/HDFS-14617
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14617.001.patch, ParallelLoading.svg, 
> SerialLoading.svg, dirs-single.svg, flamegraph.parallel.svg, 
> flamegraph.serial.svg, inodes.svg
>
>
> Loading an fsimage is basically a single threaded process. The current 
> fsimage is written out in sections, eg iNode, iNode_Directory, Snapshots, 
> Snapshot_Diff etc. Then at the end of the file, an index is written that 
> contains the offset and length of each section. The image loader code uses 
> this index to initialize an input stream to read and process each section. It 
> is important that one section is fully loaded before another is started, as 
> the next section depends on the results of the previous one.
> What I would like to propose is the following:
> 1. When writing the image, we can optionally output sub_sections to the 
> index. That way, a given section would effectively be split into several 
> sections, eg:
> {code:java}
>inode_section offset 10 length 1000
>  inode_sub_section offset 10 length 500
>  inode_sub_section offset 510 length 500
>  
>inode_dir_section offset 1010 length 1000
>  inode_dir_sub_section offset 1010 length 500
>  inode_dir_sub_section offset 1010 length 500
> {code}
> Here you can see we still have the original section index, but then we also 
> have sub-section entries that cover the entire section. Then a processor can 
> either read the full section in serial, or read each sub-section in parallel.
> 2. In the Image Writer code, we should set a target number of sub-sections, 
> and then based on the total inodes in memory, it will create that many 
> sub-sections per major image section. I think the only sections worth doing 
> this for are inode, inode_reference, inode_dir and snapshot_diff. All others 
> tend to be fairly small in practice.
> 3. If there are under some threshold of inodes (eg 10M) then don't bother 
> with the sub-sections as a serial load only takes a few seconds at that scale.
> 4. The image loading code can then have a switch to enable 'parallel loading' 
> and a 'number of threads' where it uses the sub-sections, or if not enabled 
> falls back to the existing logic to read the entire section in serial.
> Working with a large image of 316M inodes and 35GB on disk, I have a proof of 
> concept of this change working, allowing just inode and inode_dir to be 
> loaded in parallel, but I believe inode_reference and snapshot_diff can be 
> made parallel with the same technique.
> Some benchmarks I have are as follows:
> {code:java}
> Threads   1 2 3 4 
> 
> inodes448   290   226   189 
> inode_dir 326   211   170   161 
> Total 927   651   535   488 (MD5 calculation about 100 seconds)
> {code}
> The above table shows the time in seconds to load the inode section and the 
> inode_directory section, and then the total load time of the image.
> With 4 threads using the above technique, we are able to more than halve
> the load time of the two sections. With the patch in HDFS-13694 it would take a 
> further 100 seconds off the run time, going from 927 seconds to 388, which is 
> a significant improvement. Adding more threads beyond 4 has diminishing 
> returns as there are some synchronized points in the loading code to protect 
> the in memory structures.
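To make the parallel path concrete, here is a minimal sketch of loading
sub-sections with a thread pool; SubSection and loadSubSection are
illustrative placeholders, not the actual loader code from the patch:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hedged sketch: each sub-section carries an (offset, length) pair from the
// index; a fixed pool loads them concurrently and get() propagates failures.
void loadInParallel(List<SubSection> subSections, int numThreads)
    throws InterruptedException, ExecutionException {
  ExecutorService pool = Executors.newFixedThreadPool(numThreads);
  List<Future<?>> futures = new ArrayList<>();
  for (SubSection s : subSections) {
    futures.add(pool.submit(() -> loadSubSection(s.getOffset(), s.getLength())));
  }
  for (Future<?> f : futures) {
    f.get();  // wait for completion; rethrows any loading failure
  }
  pool.shutdown();
}
{code}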



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11246) FSNameSystem#logAuditEvent should be called outside the read or write locks

2019-08-25 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915218#comment-16915218
 ] 

He Xiaoqiao commented on HDFS-11246:


Thanks [~jojochuang], [^HDFS-11246.011.patch] corrects the locking around
#addCachePool and fixes checkstyle. Pending Jenkins results.

> FSNameSystem#logAuditEvent should be called outside the read or write locks
> ---
>
> Key: HDFS-11246
> URL: https://issues.apache.org/jira/browse/HDFS-11246
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Kuhu Shukla
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-11246.001.patch, HDFS-11246.002.patch, 
> HDFS-11246.003.patch, HDFS-11246.004.patch, HDFS-11246.005.patch, 
> HDFS-11246.006.patch, HDFS-11246.007.patch, HDFS-11246.008.patch, 
> HDFS-11246.009.patch, HDFS-11246.010.patch, HDFS-11246.011.patch
>
>
> {code}
> readLock();
> boolean success = true;
> ContentSummary cs;
> try {
>   checkOperation(OperationCategory.READ);
>   cs = FSDirStatAndListingOp.getContentSummary(dir, src);
> } catch (AccessControlException ace) {
>   success = false;
>   logAuditEvent(success, operationName, src);
>   throw ace;
> } finally {
>   readUnlock(operationName);
> }
> {code}
> It would be nice to have audit logging outside the lock, especially in
> scenarios where applications hammer a given operation several times.
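A minimal sketch of the proposed rework, moving the audit call after the
unlock; the names follow the snippet above, but the committed patch may
differ:
{code:java}
// Hedged sketch: the read lock is released first and the audit event is
// logged afterwards, so slow audit appenders no longer extend the locked
// region.
readLock();
boolean success = true;
ContentSummary cs = null;
try {
  checkOperation(OperationCategory.READ);
  cs = FSDirStatAndListingOp.getContentSummary(dir, src);
} catch (AccessControlException ace) {
  success = false;
  throw ace;
} finally {
  readUnlock(operationName);
  logAuditEvent(success, operationName, src);  // now outside the lock
}
{code}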



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11246) FSNameSystem#logAuditEvent should be called outside the read or write locks

2019-08-25 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-11246:
---
Attachment: HDFS-11246.011.patch

> FSNameSystem#logAuditEvent should be called outside the read or write locks
> ---
>
> Key: HDFS-11246
> URL: https://issues.apache.org/jira/browse/HDFS-11246
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Kuhu Shukla
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-11246.001.patch, HDFS-11246.002.patch, 
> HDFS-11246.003.patch, HDFS-11246.004.patch, HDFS-11246.005.patch, 
> HDFS-11246.006.patch, HDFS-11246.007.patch, HDFS-11246.008.patch, 
> HDFS-11246.009.patch, HDFS-11246.010.patch, HDFS-11246.011.patch
>
>
> {code}
> readLock();
> boolean success = true;
> ContentSummary cs;
> try {
>   checkOperation(OperationCategory.READ);
>   cs = FSDirStatAndListingOp.getContentSummary(dir, src);
> } catch (AccessControlException ace) {
>   success = false;
>   logAuditEvent(success, operationName, src);
>   throw ace;
> } finally {
>   readUnlock(operationName);
> }
> {code}
> It would be nice to have audit logging outside the lock, especially in
> scenarios where applications hammer a given operation several times.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-08-23 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao reassigned HDFS-14771:
--

Assignee: He Xiaoqiao

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>
> This JIRA aims to backport HDFS-14617 to branch-2: improve fsimage load
> time by writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-08-23 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914211#comment-16914211
 ] 

He Xiaoqiao commented on HDFS-14771:


[~hemanthboyina] thanks for your quick response. As mentioned in the parent
task, the current patch covers only the core logic, without tools (OIV)
support. This JIRA will track progress and aims at the branch-2 backport.
Before all features are ready, I would like to attach a demo patch for
branch-2. Any further discussion is welcome. Thanks again.

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Priority: Major
>
> This JIRA aims to backport HDFS-14617 to branch-2: improve fsimage load
> time by writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-08-23 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14771:
---
Description: This JIRA aims to backport HDFS-14617 to branch-2: fsimage 
load time by writing sub-sections to the fsimage index.  (was: This JIRA aims 
to backport HDFS-12914 to branch-2: fsimage load time by writing sub-sections 
to the fsimage index.)

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Priority: Major
>
> This JIRA aims to backport HDFS-14617 to branch-2: improve fsimage load
> time by writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14771) Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-08-23 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14771:
---
Summary: Backport HDFS-14617 to branch-2 (Improve fsimage load time by 
writing sub-sections to the fsimage index)  (was: Backport HDFS-12914 to 
branch-2 (Improve fsimage load time by writing sub-sections to the fsimage 
index))

> Backport HDFS-14617 to branch-2 (Improve fsimage load time by writing 
> sub-sections to the fsimage index)
> 
>
> Key: HDFS-14771
> URL: https://issues.apache.org/jira/browse/HDFS-14771
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: He Xiaoqiao
>Priority: Major
>
> This JIRA aims to backport HDFS-12914 to branch-2: improve fsimage load
> time by writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14771) Backport HDFS-12914 to branch-2 (Improve fsimage load time by writing sub-sections to the fsimage index)

2019-08-23 Thread He Xiaoqiao (Jira)
He Xiaoqiao created HDFS-14771:
--

 Summary: Backport HDFS-12914 to branch-2 (Improve fsimage load 
time by writing sub-sections to the fsimage index)
 Key: HDFS-14771
 URL: https://issues.apache.org/jira/browse/HDFS-14771
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: namenode
Affects Versions: 2.10.0
Reporter: He Xiaoqiao


This JIRA aims to backport HDFS-12914 to branch-2: improve fsimage load time
by writing sub-sections to the fsimage index.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14617) Improve fsimage load time by writing sub-sections to the fsimage index

2019-08-23 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914179#comment-16914179
 ] 

He Xiaoqiao commented on HDFS-14617:


[~csun] Thanks for following up. I have been running this patch on
branch-2.7 for a while and it works well; I will file another JIRA to
backport it (or offer a demo patch) to branch-2. Note that this patch covers
only the core logic, without tools (OIV) support for now, as [~sodonnell]
said above. I prefer to backport to other branches once most features are
ready. cc [~sodonnell],[~jojochuang]

> Improve fsimage load time by writing sub-sections to the fsimage index
> --
>
> Key: HDFS-14617
> URL: https://issues.apache.org/jira/browse/HDFS-14617
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14617.001.patch, ParallelLoading.svg, 
> SerialLoading.svg, dirs-single.svg, flamegraph.parallel.svg, 
> flamegraph.serial.svg, inodes.svg
>
>
> Loading an fsimage is basically a single threaded process. The current 
> fsimage is written out in sections, eg iNode, iNode_Directory, Snapshots, 
> Snapshot_Diff etc. Then at the end of the file, an index is written that 
> contains the offset and length of each section. The image loader code uses 
> this index to initialize an input stream to read and process each section. It 
> is important that one section is fully loaded before another is started, as 
> the next section depends on the results of the previous one.
> What I would like to propose is the following:
> 1. When writing the image, we can optionally output sub_sections to the 
> index. That way, a given section would effectively be split into several 
> sections, eg:
> {code:java}
>inode_section offset 10 length 1000
>  inode_sub_section offset 10 length 500
>  inode_sub_section offset 510 length 500
>  
>inode_dir_section offset 1010 length 1000
>  inode_dir_sub_section offset 1010 length 500
>  inode_dir_sub_section offset 1010 length 500
> {code}
> Here you can see we still have the original section index, but then we also 
> have sub-section entries that cover the entire section. Then a processor can 
> either read the full section in serial, or read each sub-section in parallel.
> 2. In the Image Writer code, we should set a target number of sub-sections, 
> and then based on the total inodes in memory, it will create that many 
> sub-sections per major image section. I think the only sections worth doing 
> this for are inode, inode_reference, inode_dir and snapshot_diff. All others 
> tend to be fairly small in practice.
> 3. If there are under some threshold of inodes (eg 10M) then don't bother 
> with the sub-sections as a serial load only takes a few seconds at that scale.
> 4. The image loading code can then have a switch to enable 'parallel loading' 
> and a 'number of threads' where it uses the sub-sections, or if not enabled 
> falls back to the existing logic to read the entire section in serial.
> Working with a large image of 316M inodes and 35GB on disk, I have a proof of 
> concept of this change working, allowing just inode and inode_dir to be 
> loaded in parallel, but I believe inode_reference and snapshot_diff can be 
> made parallel with the same technique.
> Some benchmarks I have are as follows:
> {code:java}
> Threads   1 2 3 4 
> 
> inodes448   290   226   189 
> inode_dir 326   211   170   161 
> Total 927   651   535   488 (MD5 calculation about 100 seconds)
> {code}
> The above table shows the time in seconds to load the inode section and the 
> inode_directory section, and then the total load time of the image.
> With 4 threads using the above technique, we are able to more than halve
> the load time of the two sections. With the patch in HDFS-13694 it would take a 
> further 100 seconds off the run time, going from 927 seconds to 388, which is 
> a significant improvement. Adding more threads beyond 4 has diminishing 
> returns as there are some synchronized points in the loading code to protect 
> the in memory structures.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11246) FSNameSystem#logAuditEvent should be called outside the read or write locks

2019-08-23 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914173#comment-16914173
 ] 

He Xiaoqiao commented on HDFS-11246:


Thanks [~jojochuang] for the detailed review. Submitted
[^HDFS-11246.010.patch], which fixes the conflicts and the missing pieces.
As for the other comments about formatting and the audit log, I suggest we
focus here on writing the log outside the lock; I will file another JIRA to
track the remaining issues. What do you think? Thanks again.

> FSNameSystem#logAuditEvent should be called outside the read or write locks
> ---
>
> Key: HDFS-11246
> URL: https://issues.apache.org/jira/browse/HDFS-11246
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Kuhu Shukla
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-11246.001.patch, HDFS-11246.002.patch, 
> HDFS-11246.003.patch, HDFS-11246.004.patch, HDFS-11246.005.patch, 
> HDFS-11246.006.patch, HDFS-11246.007.patch, HDFS-11246.008.patch, 
> HDFS-11246.009.patch, HDFS-11246.010.patch
>
>
> {code}
> readLock();
> boolean success = true;
> ContentSummary cs;
> try {
>   checkOperation(OperationCategory.READ);
>   cs = FSDirStatAndListingOp.getContentSummary(dir, src);
> } catch (AccessControlException ace) {
>   success = false;
>   logAuditEvent(success, operationName, src);
>   throw ace;
> } finally {
>   readUnlock(operationName);
> }
> {code}
> It would be nice to have audit logging outside the lock, especially in
> scenarios where applications hammer a given operation several times.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11246) FSNameSystem#logAuditEvent should be called outside the read or write locks

2019-08-23 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-11246:
---
Attachment: HDFS-11246.010.patch

> FSNameSystem#logAuditEvent should be called outside the read or write locks
> ---
>
> Key: HDFS-11246
> URL: https://issues.apache.org/jira/browse/HDFS-11246
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Kuhu Shukla
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-11246.001.patch, HDFS-11246.002.patch, 
> HDFS-11246.003.patch, HDFS-11246.004.patch, HDFS-11246.005.patch, 
> HDFS-11246.006.patch, HDFS-11246.007.patch, HDFS-11246.008.patch, 
> HDFS-11246.009.patch, HDFS-11246.010.patch
>
>
> {code}
> readLock();
> boolean success = true;
> ContentSummary cs;
> try {
>   checkOperation(OperationCategory.READ);
>   cs = FSDirStatAndListingOp.getContentSummary(dir, src);
> } catch (AccessControlException ace) {
>   success = false;
>   logAuditEvent(success, operationName, src);
>   throw ace;
> } finally {
>   readUnlock(operationName);
> }
> {code}
> It would be nice to have audit logging outside the lock, especially in
> scenarios where applications hammer a given operation several times.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2019-08-22 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913002#comment-16913002
 ] 

He Xiaoqiao commented on HDFS-12749:


Attached [^HDFS-12749-trunk.006.patch], based on trunk, without a unit test.
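For context, a minimal sketch of the defensive pattern under discussion:
extend the retry loop so a generic IOException is retried instead of
breaking register(). Names follow the register() code quoted below; this is
not the literal patch:
{code:java}
// Hedged sketch: any transient IOException from a busy NameNode is retried
// rather than propagated out of the registration loop.
while (shouldRun()) {
  try {
    newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
    break;
  } catch (EOFException e) {            // namenode might have just restarted
    sleepAndLogInterrupts(1000, "connecting to server");
  } catch (SocketTimeoutException e) {  // namenode is busy
    sleepAndLogInterrupts(1000, "connecting to server");
  } catch (IOException e) {             // assumed addition: retry, don't fail
    LOG.warn("Problem connecting to server: " + nnAddr, e);
    sleepAndLogInterrupts(1000, "connecting to server");
  }
}
{code}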

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749-trunk.004.patch, 
> HDFS-12749-trunk.005.patch, HDFS-12749-trunk.006.patch, HDFS-12749.001.patch
>
>
> Our cluster now has thousands of DataNodes and millions of files and
> blocks. When the NN restarts, its load is very high.
> After the NN restart, a DN calls the BPServiceActor#reRegister method to
> register, but the register RPC gets an IOException because the NN is busy
> dealing with block reports. The exception is caught at
> BPServiceActor#processCommand.
> Here is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The uncaught IOException breaks BPServiceActor#register, and the block
> report cannot be sent immediately.
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> 

[jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart

2019-08-22 Thread He Xiaoqiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-12749:
---
Attachment: HDFS-12749-trunk.006.patch

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749-trunk.004.patch, 
> HDFS-12749-trunk.005.patch, HDFS-12749-trunk.006.patch, HDFS-12749.001.patch
>
>
> Our cluster now has thousands of DataNodes and millions of files and
> blocks. When the NN restarts, its load is very high.
> After the NN restart, a DN calls the BPServiceActor#reRegister method to
> register, but the register RPC gets an IOException because the NN is busy
> dealing with block reports. The exception is caught at
> BPServiceActor#processCommand.
> Here is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The uncaught IOException breaks BPServiceActor#register, and the block
> report cannot be sent immediately.
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }
> {code}
> But NameNode has 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2019-08-21 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912960#comment-16912960
 ] 

He Xiaoqiao commented on HDFS-12749:


[~jojochuang] Thanks for your feedback, and sorry for the late response. The
005 patch may have some conflicts with trunk, so I will supply a new one
equivalent to 005.
{quote}i am not sure i understand the test case. Additionally, if the fix is 
removed, the test doesn't fail, so i am not sure if the test case is 
useful.{quote}
Perhaps something changed? I will try to fix it. Thanks again.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749-trunk.004.patch, 
> HDFS-12749-trunk.005.patch, HDFS-12749.001.patch
>
>
> Our cluster now has thousands of DataNodes and millions of files and
> blocks. When the NN restarts, its load is very high.
> After the NN restart, a DN calls the BPServiceActor#reRegister method to
> register, but the register RPC gets an IOException because the NN is busy
> dealing with block reports. The exception is caught at
> BPServiceActor#processCommand.
> Here is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The uncaught IOException breaks BPServiceActor#register, and the block
> report cannot be sent immediately.
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting 

[jira] [Commented] (HDFS-11246) FSNameSystem#logAuditEvent should be called outside the read or write locks

2019-08-21 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912955#comment-16912955
 ] 

He Xiaoqiao commented on HDFS-11246:


Ping [~ayushtkn],[~linyiqun],[~daryn], do you have any bandwidth to help
review?

> FSNameSystem#logAuditEvent should be called outside the read or write locks
> ---
>
> Key: HDFS-11246
> URL: https://issues.apache.org/jira/browse/HDFS-11246
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Kuhu Shukla
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-11246.001.patch, HDFS-11246.002.patch, 
> HDFS-11246.003.patch, HDFS-11246.004.patch, HDFS-11246.005.patch, 
> HDFS-11246.006.patch, HDFS-11246.007.patch, HDFS-11246.008.patch, 
> HDFS-11246.009.patch
>
>
> {code}
> readLock();
> boolean success = true;
> ContentSummary cs;
> try {
>   checkOperation(OperationCategory.READ);
>   cs = FSDirStatAndListingOp.getContentSummary(dir, src);
> } catch (AccessControlException ace) {
>   success = false;
>   logAuditEvent(success, operationName, src);
>   throw ace;
> } finally {
>   readUnlock(operationName);
> }
> {code}
> It would be nice to have audit logging outside the lock, especially in
> scenarios where applications hammer a given operation several times.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10782) Decrease memory frequent exchange of Centralized Cache Management when run balancer

2019-08-21 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-10782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912952#comment-16912952
 ] 

He Xiaoqiao commented on HDFS-10782:


[~jojochuang] Thanks for picking this issue up.
[^HDFS-10782-branch-2.001.patch] is a demo patch based on branch-2.7; as you
said above, it does have some issues (memory footprint, performance, and so
on). I will attach a new patch with a unit test (yes, it is not hard to
build one) based on trunk later. Any further discussion is welcome. Thanks.

> Decrease memory frequent exchange of Centralized Cache Management when run 
> balancer
> ---
>
> Key: HDFS-10782
> URL: https://issues.apache.org/jira/browse/HDFS-10782
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, caching
>Affects Versions: 2.7.1
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>  Labels: patch
> Attachments: HDFS-10782-branch-2.001.patch
>
>
> Cached blocks are currently invisible to the Balancer when centralized
> cache management is active. This makes DataNodes exchange memory
> frequently: because the Balancer does not distinguish cached blocks from
> ordinary blocks, it may trigger a large number of cache/uncache operations.
> I think the NameNode should avoid returning cached blocks as much as
> possible in Balancer#getBlocks.
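A minimal sketch of the idea; the cache lookup and collection names are
assumptions for illustration, and the demo patch may differ:
{code:java}
// Hedged sketch: when assembling the getBlocks() response for the Balancer,
// skip blocks that are currently cached so that moving them does not force
// DataNodes to uncache and re-cache replicas.
List<BlockWithLocations> results = new ArrayList<>();
for (BlockWithLocations block : candidateBlocks) {
  if (isCached(block.getBlock())) {  // assumed helper backed by CacheManager
    continue;  // leave cached replicas where they are
  }
  results.add(block);
}
{code}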



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10606) TrashPolicyDefault supports time of auto clean up can configured

2019-08-18 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910157#comment-16910157
 ] 

He Xiaoqiao commented on HDFS-10606:


TrashPolicyDefault cleans up trash at the fixed time of 00:00 UTC, and there
is no way to tune it to another time. This JIRA aims to offer a
configuration option for the auto-clean start time.
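A minimal sketch of what the proposed knob could look like; the
configuration key, the scheduling math, and the {{conf}} object are
hypothetical, purely for illustration:
{code:java}
// Hedged sketch: offset the emptier's first checkpoint by a configurable
// start hour instead of always aligning with 00:00 UTC. "conf" is assumed
// to be a Hadoop Configuration.
long intervalMs = 24L * 60 * 60 * 1000;  // daily checkpoint, as today
int startHour = conf.getInt("fs.trash.checkpoint.start.hour", 0);  // assumed key
long now = System.currentTimeMillis();
long midnight = now - (now % intervalMs);            // 00:00 UTC today
long next = midnight + startHour * 60L * 60 * 1000;  // shift to startHour
if (next <= now) {
  next += intervalMs;  // today's slot already passed; run tomorrow
}
long sleepMs = next - now;  // how long the emptier thread should sleep
{code}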

> TrashPolicyDefault supports time of auto clean up can configured
> 
>
> Key: HDFS-10606
> URL: https://issues.apache.org/jira/browse/HDFS-10606
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-10606-branch-2.7.001.patch, HDFS-10606.001.patch, 
> HDFS-10606.002.patch
>
>
> TrashPolicyDefault currently cleans up Trash based on
> [UTC|http://www.worldtimeserver.com/current_time_in_UTC.aspx], and the
> clean-up time is 00:00 UTC. When a large amount of trash data has to be
> auto-cleaned, the NN is blocked for a long time because of the global
> lock; in the most serious cases this can cause some cron job submissions
> to fail. Adding a configuration option for the clean-up time would avoid
> the impact on those cron jobs at the default time.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14109) Improve hdfs auditlog format and support federation friendly

2019-08-18 Thread He Xiaoqiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-14109:
---
Resolution: Not A Problem
Status: Resolved  (was: Patch Available)

> Improve hdfs auditlog format and support federation friendly
> 
>
> Key: HDFS-14109
> URL: https://issues.apache.org/jira/browse/HDFS-14109
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14109.patch
>
>
> The current audit log format does not meet the requirements of a
> federation architecture well. In some cases we need to aggregate the audit
> logs of all namespaces, and for requests on common paths (e.g. /tmp,
> /user, etc.; such paths may not appear in the mount table, but they are
> real), we have no way to tell which namespace a request was sent to. So I
> propose adding an {{nsid}} column to make the log more federation friendly.
> {quote}2018-11-27 13:20:30,028 INFO FSNamesystem.audit: allowed=true   
> ugi=hdfs/hostn...@realm.com (auth:KERBEROS)  ip=/10.1.1.2 cmd=getfileinfo 
> src=/path   dst=null        perm=null       proto=rpc       clientName=null
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14109) Improve hdfs auditlog format and support federation friendly

2019-08-18 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910155#comment-16910155
 ] 

He Xiaoqiao commented on HDFS-14109:


My first thought was to add `nsid` to the NameNode audit log to better
support federation and to distinguish between multiple namespaces when all
NameNode audit logs are collected together. However, this does not seem to
be a common requirement, so I will cancel this JIRA and resolve it as `Not A
Problem`.
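For the record, the proposal would have produced an audit line like the
following; the {{nsid=ns1}} field is the illustrative addition:
{quote}2018-11-27 13:20:30,028 INFO FSNamesystem.audit: allowed=true
ugi=hdfs/hostn...@realm.com (auth:KERBEROS)  ip=/10.1.1.2 nsid=ns1
cmd=getfileinfo src=/path   dst=null        perm=null       proto=rpc
clientName=null{quote}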

> Improve hdfs auditlog format and support federation friendly
> 
>
> Key: HDFS-14109
> URL: https://issues.apache.org/jira/browse/HDFS-14109
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14109.patch
>
>
> The following auditlog format does not well meet requirement for federation 
> arch currently. Since some case we need to aggregate all namespace audit log, 
> if there are some common path request(e.g. /tmp, /user/ etc. some path may 
> not appear in mountTable, but the path is very real), we will have no idea to 
> split them that which namespace it request to. So I propose add column 
> {{nsid}} to support federation more friendly.  
> {quote}2018-11-27 13:20:30,028 INFO FSNamesystem.audit: allowed=true   
> ugi=hdfs/hostn...@realm.com (auth:KERBEROS)  ip=/10.1.1.2 cmd=getfileinfo 
> src=/path   dst=null        perm=null       proto=rpc       clientName=null
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10606) TrashPolicyDefault supports time of auto clean up can configured

2019-08-17 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909767#comment-16909767
 ] 

He Xiaoqiao commented on HDFS-10606:


[^HDFS-10606.002.patch] fixes checkstyle and javadoc issues.

> TrashPolicyDefault supports time of auto clean up can configured
> 
>
> Key: HDFS-10606
> URL: https://issues.apache.org/jira/browse/HDFS-10606
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-10606-branch-2.7.001.patch, HDFS-10606.001.patch, 
> HDFS-10606.002.patch
>
>
> TrashPolicyDefault currently cleans up Trash based on
> [UTC|http://www.worldtimeserver.com/current_time_in_UTC.aspx], and the
> clean-up time is 00:00 UTC. When a large amount of trash data has to be
> auto-cleaned, the NN is blocked for a long time because of the global
> lock; in the most serious cases this can cause some cron job submissions
> to fail. Adding a configuration option for the clean-up time would avoid
> the impact on those cron jobs at the default time.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10606) TrashPolicyDefault supports time of auto clean up can configured

2019-08-17 Thread He Xiaoqiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Xiaoqiao updated HDFS-10606:
---
Attachment: HDFS-10606.002.patch

> TrashPolicyDefault supports time of auto clean up can configured
> 
>
> Key: HDFS-10606
> URL: https://issues.apache.org/jira/browse/HDFS-10606
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-10606-branch-2.7.001.patch, HDFS-10606.001.patch, 
> HDFS-10606.002.patch
>
>
> TrashPolicyDefault currently cleans up Trash based on
> [UTC|http://www.worldtimeserver.com/current_time_in_UTC.aspx], and the
> clean-up time is 00:00 UTC. When a large amount of trash data has to be
> auto-cleaned, the NN is blocked for a long time because of the global
> lock; in the most serious cases this can cause some cron job submissions
> to fail. Adding a configuration option for the clean-up time would avoid
> the impact on those cron jobs at the default time.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14583) FileStatus#toString() will throw IllegalArgumentException

2019-08-17 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909716#comment-16909716
 ] 

He Xiaoqiao commented on HDFS-14583:


{quote}
HdfsFileStatus don't has empty symlink check, so I think should fix this issue 
in HdfsFileStatus.
As you said, RouterClientProtocol, it is not necessary to set symlink, may be 
we can remove it.
{quote}
+1, it makes sense to me. Please go ahead, and ping me at any time if you
need any help. Thanks [~xuzq_zander].
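A minimal sketch of the kind of guard discussed here; the accessor names
are assumptions and the actual fix may differ:
{code:java}
// Hedged sketch: refuse to build a Path from an empty symlink target and
// report "not a symlink" instead of an IllegalArgumentException.
public Path getSymlink() throws IOException {
  byte[] target = getSymlinkInBytes();  // assumed accessor
  if (target == null || target.length == 0) {
    throw new IOException("Path " + getPath() + " is not a symbolic link");
  }
  return new Path(DFSUtilClient.bytes2String(target));
}
{code}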

> FileStatus#toString() will throw IllegalArgumentException
> -
>
> Key: HDFS-14583
> URL: https://issues.apache.org/jira/browse/HDFS-14583
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
>  Labels: HDFS
> Attachments: HDFS-14583-trunk-0001.patch
>
>
> FileStatus#toString() will throw IllegalArgumentException; the stack trace
> and error message look like this:
> {code:java}
> java.lang.IllegalArgumentException: Can not create a Path from an empty string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:172)
>   at org.apache.hadoop.fs.Path.(Path.java:184)
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsLocatedFileStatus.getSymlink(HdfsLocatedFileStatus.java:117)
>   at org.apache.hadoop.fs.FileStatus.toString(FileStatus.java:462)
>   at 
> org.apache.hadoop.hdfs.web.TestJsonUtil.testHdfsFileStatus(TestJsonUtil.java:123)
> {code}
> Test Code like this:
> {code:java}
> @Test
> public void testHdfsFileStatus() throws IOException {
>   HdfsFileStatus hdfsFileStatus = new HdfsFileStatus.Builder()
>   .replication(1)
>   .blocksize(1024)
>   .perm(new FsPermission((short) 777))
>   .owner("owner")
>   .group("group")
>   .symlink(new byte[0])
>   .path(new byte[0])
>   .fileId(1010)
>   .isdir(true)
>   .build();
>   System.out.println("HdfsFileStatus = " + hdfsFileStatus.toString());
> }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14583) FileStatus#toString() will throw IllegalArgumentException

2019-08-17 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909704#comment-16909704
 ] 

He Xiaoqiao commented on HDFS-14583:


Thanks [~xuzq_zander] for your report and contribution. I would prefer to
fix this in {{RouterClientProtocol}} rather than {{HdfsFileStatus}}. In my
opinion, HdfsFileStatus does not accept an empty symlink, which is the
expected behavior. IIUC, in RouterClientProtocol it is not necessary to set
{{symlink}}, as shown below. FYI. cc [~elgoiri].
{code:java}
return new HdfsFileStatus.Builder()
.isdir(true)
.mtime(modTime)
.atime(accessTime)
.perm(permission)
.owner(owner)
.group(group)
.path(DFSUtil.string2Bytes(name))
.fileId(inodeId)
.children(childrenNum)
.build();
{code}

> FileStatus#toString() will throw IllegalArgumentException
> -
>
> Key: HDFS-14583
> URL: https://issues.apache.org/jira/browse/HDFS-14583
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
>  Labels: HDFS
> Attachments: HDFS-14583-trunk-0001.patch
>
>
> FileStatus#toString() will throw IllegalArgumentException; the stack trace
> and error message look like this:
> {code:java}
> java.lang.IllegalArgumentException: Can not create a Path from an empty string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:172)
>   at org.apache.hadoop.fs.Path.(Path.java:184)
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsLocatedFileStatus.getSymlink(HdfsLocatedFileStatus.java:117)
>   at org.apache.hadoop.fs.FileStatus.toString(FileStatus.java:462)
>   at 
> org.apache.hadoop.hdfs.web.TestJsonUtil.testHdfsFileStatus(TestJsonUtil.java:123)
> {code}
> Test Code like this:
> {code:java}
> @Test
> public void testHdfsFileStatus() throws IOException {
>   HdfsFileStatus hdfsFileStatus = new HdfsFileStatus.Builder()
>   .replication(1)
>   .blocksize(1024)
>   .perm(new FsPermission((short) 777))
>   .owner("owner")
>   .group("group")
>   .symlink(new byte[0])
>   .path(new byte[0])
>   .fileId(1010)
>   .isdir(true)
>   .build();
>   System.out.println("HdfsFileStatus = " + hdfsFileStatus.toString());
> }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14646) Standby NameNode should not upload fsimage to an inappropriate NameNode.

2019-08-17 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909702#comment-16909702
 ] 

He Xiaoqiao commented on HDFS-14646:


Thanks [~xudongcao] for the ping. To be honest, I do not have any experience
with multiple NNs in our installations. cc [~xkrogen],[~elgoiri],[~csun],
would you mind taking a review?

> Standby NameNode should not upload fsimage to an inappropriate NameNode.
> 
>
> Key: HDFS-14646
> URL: https://issues.apache.org/jira/browse/HDFS-14646
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.2
>Reporter: Xudong Cao
>Assignee: Xudong Cao
>Priority: Major
> Attachments: HDFS-14646.000.patch, HDFS-14646.001.patch
>
>
> *Problem Description:*
>  In the multi-NameNode scenario, when an SNN uploads an FsImage, it puts
> the image to all other NNs (whether the peer NN is an ANN or not). Even if
> the peer NN immediately replies with an error (such as
> TransferResult.NOT_ACTIVE_NAMENODE_FAILURE,
> TransferResult.OLD_TRANSACTION_ID_FAILURE, etc.), the local SNN does not
> terminate the put immediately; it puts the complete FsImage to the peer NN
> and does not read the peer NN's reply until the put has finished.
> Depending on the Jetty version, this behavior leads to different
> consequences; I tested it under 2.7.2 and trunk.
> *1.In Hadoop 2.7.2 (with Jetty 6.1.26)*
>  After peer NN called HttpServletResponse.sendError(), the underlying TCP 
> connection will still be established, and the data SNN sent will be read by 
> Jetty framework itself in the peer NN side, so the SNN will insignificantly 
> send the FsImage to the peer NN continuously, causing a waste of time and 
> bandwidth. In a relatively large HDFS cluster, the size of FsImage can often 
> reach about 30GB, This is indeed a big waste.
> *2.In trunk version (with Jetty 9.3.27)*
>  After peer NN called HttpServletResponse.sendError(), the underlying TCP 
> connection will be auto closed, and then SNN will directly get an "Error 
> writing request body to server" exception, as below, note this test needs a 
> relatively big FSImage (e.g. 10MB level):
> {code:java}
> 2019-08-17 03:59:25,413 INFO namenode.TransferFsImage: Sending fileName: 
> /tmp/hadoop-root/dfs/name/current/fsimage_3364240, fileSize: 
> 9864721. Sent total: 524288 bytes. Size of last segment intended to send: 
> 4096 bytes.
>  java.io.IOException: Error writing request body to server
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:396)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:340)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:314)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:249)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:277)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:272)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
>  2019-08-17 03:59:25,422 INFO namenode.TransferFsImage: Sending fileName: 
> /tmp/hadoop-root/dfs/name/current/fsimage_3364240, fileSize: 
> 9864721. Sent total: 851968 bytes. Size of last segment intended to send: 
> 4096 bytes.
>  java.io.IOException: Error writing request body to server
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:396)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:340)
>   {code}
>                   
> *Solution:*
>  A standby NameNode should not upload the fsimage to an inappropriate 
> NameNode; when it plans to put a FsImage to the peer NN, it needs to check 
> whether it really needs to put it at this time.
> In detail, the local SNN should establish an HTTP connection with the peer 
> NN, send the put request, 
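
A rough, illustrative sketch of that early-abort idea (this is not the 
attached patch; the URL handling and names like fsImagePath are assumptions): 
check the peer's verdict as early as possible and stop streaming on rejection.
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

public class EarlyAbortUploadSketch {
  static void uploadImage(URL putImageUrl, Path fsImagePath) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) putImageUrl.openConnection();
    conn.setRequestMethod("PUT");
    conn.setDoOutput(true);
    // Stream in chunks so the whole multi-GB image is never buffered in memory.
    conn.setChunkedStreamingMode(64 * 1024);
    // Ask the peer to validate the request before the body is sent; a peer
    // that rejects it here lets the SNN skip the transfer entirely.
    conn.setRequestProperty("Expect", "100-continue");
    try (OutputStream out = conn.getOutputStream();
         InputStream image = Files.newInputStream(fsImagePath)) {
      image.transferTo(out); // Java 9+; fails fast if the peer closed the socket
    }
    if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
      throw new IOException("Peer NN rejected image upload: HTTP "
          + conn.getResponseCode());
    }
  }
}
{code}
With Jetty 9 a rejecting peer already surfaces as the IOException shown above; 
the point of checking before streaming is to avoid ever starting a 30 GB 
transfer the peer has refused.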

[jira] [Commented] (HDFS-10606) TrashPolicyDefault supports time of auto clean up can configured

2019-08-17 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909694#comment-16909694
 ] 

He Xiaoqiao commented on HDFS-10606:


Thanks [~jojochuang] for picking it up again.
[^HDFS-10606.001.patch] is based on trunk and just offers a way to tune the 
time at which the trash auto-clean runs.

> TrashPolicyDefault supports time of auto clean up can configured
> 
>
> Key: HDFS-10606
> URL: https://issues.apache.org/jira/browse/HDFS-10606
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-10606-branch-2.7.001.patch, HDFS-10606.001.patch
>
>
> TrashPolicyDefault currently cleans up Trash based on 
> [UTC|http://www.worldtimeserver.com/current_time_in_UTC.aspx], and the 
> cleanup time is 00:00 UTC. When a large amount of trash data has to be 
> auto-cleaned, this blocks the NN for a long time because of the global 
> lock; in the most serious situations it can cause some cron job submissions 
> to fail. Adding a configuration for the cleanup time would avoid the impact 
> on those cron jobs at the default time.
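
A minimal sketch of the idea behind such a knob (the config key and method 
below are hypothetical, not the attached patch): derive the next emptier run 
from a configurable offset instead of the fixed 00:00 UTC floor.
{code:java}
import java.util.concurrent.TimeUnit;

public class TrashCleanOffsetSketch {
  // Hypothetical key name; the real wiring lives in the patch.
  static final String TRASH_CLEAN_OFFSET_MIN = "fs.trash.clean.offset.minutes";

  // Floor "now" to the checkpoint interval, shift by the configured offset,
  // and roll forward one interval if that moment has already passed.
  static long nextCleanupMs(long nowMs, long intervalMs, long offsetMinutes) {
    long offsetMs = TimeUnit.MINUTES.toMillis(offsetMinutes);
    long candidate = (nowMs / intervalMs) * intervalMs + offsetMs;
    return candidate > nowMs ? candidate : candidate + intervalMs;
  }
}
{code}
For example, with a 24-hour interval and an offset of 180 minutes, cleanup 
would run at 03:00 UTC instead of 00:00 UTC, away from the midnight cron rush.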





