[jira] [Commented] (HDFS-13522) Support observer node from Router-Based Federation

2020-09-04 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190767#comment-17190767
 ] 

CR Hota commented on HDFS-13522:


[~elgoiri] Thanks for following up.

[~hemanthboyina] Thanks for uploading the patch, and feel free to take this 
jira. I can also help with the code review. Meanwhile, can you help me 
understand whether the consistency guarantees are the same with and without the 
router, or whether the router relaxes them? This was a discussion point when we 
were last working on this. Please refer to the notes in the thread. The last 
design doc that was uploaded was intended to allow routers in the middle to 
still honor the same consistency guarantees that client to 
Namenode/ObserverNamenode calls honor without routers.

> Support observer node from Router-Based Federation
> --
>
> Key: HDFS-13522
> URL: https://issues.apache.org/jira/browse/HDFS-13522
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: federation, namenode
>Reporter: Erik Krogen
>Assignee: Chao Sun
>Priority: Major
> Attachments: HDFS-13522.001.patch, HDFS-13522_WIP.patch, RBF_ 
> Observer support.pdf, Router+Observer RPC clogging.png, 
> ShortTerm-Routers+Observer.png
>
>
> Changes will need to occur to the router to support the new observer node.
> One such change will be to make the router understand the observer state, 
> e.g. {{FederationNamenodeServiceState}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2019-09-29 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940620#comment-16940620
 ] 

CR Hota commented on HDFS-14090:


[~elgoiri] [~xkrogen] Thanks for your patience. Let's try to close this in the 
coming week.

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, 
> HDFS-14090.012.patch, HDFS-14090.013.patch, HDFS-14090.014.patch, RBF_ 
> Isolation design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures should 
> help minimize the impact that unhealthy clusters have on clients connecting 
> to healthy clusters.
> For example, if there are 2 name nodes downstream and one of them is 
> heavily loaded with calls spiking rpc queue times, then due to back pressure 
> the same will start reflecting on the router. As a result, clients 
> connecting to healthy/faster name nodes will also slow down, as the same rpc 
> queue is maintained for all calls at the router layer. Essentially the same 
> IPC thread pool is used by the router to connect to all name nodes.
> Currently the router uses one single rpc queue for all calls. Let's discuss 
> how we can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from the current call queue, immediately identify 
> the downstream name node, and maintain a separate queue for each underlying 
> name node. Another, simpler way is to maintain some sort of rate limiter 
> configured for each name node and let routers drop or reject requests, or 
> send an error, after a certain threshold.
> This won’t be a simple change, as the router’s ‘Server’ layer would need 
> redesign and implementation. Currently this layer is the same as the name 
> node's.
> Opening this ticket to discuss, design and implement this feature.
>  
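To make the "simpler way" in the description concrete, here is a minimal sketch of per-name-node throttling. It is only an illustration under stated assumptions, not the code in any of the attached patches: the class `NameNodePermits` and its methods are hypothetical names, and it uses a plain `java.util.concurrent.Semaphore` per name node, which caps in-flight calls rather than limiting a time-based rate.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical per-name-node admission control; not the Router's actual code. */
class NameNodePermits {
  // One semaphore per downstream name node, created lazily on first use.
  private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();
  private final int permitsPerNameNode;

  public NameNodePermits(int permitsPerNameNode) {
    this.permitsPerNameNode = permitsPerNameNode;
  }

  /** Returns true if a call to the given name node may proceed right now. */
  public boolean tryAcquire(String nameNodeId) {
    return permits
        .computeIfAbsent(nameNodeId, id -> new Semaphore(permitsPerNameNode))
        .tryAcquire();
  }

  /** Must be called exactly once for every successful tryAcquire. */
  public void release(String nameNodeId) {
    Semaphore s = permits.get(nameNodeId);
    if (s != null) {
      s.release();
    }
  }
}
```

With something like this, a handler would call `tryAcquire` before forwarding a call and reject it (or send an error) when it returns false, so a saturated name node cannot consume the capacity needed for calls to healthy ones.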






[jira] [Commented] (HDFS-14284) RBF: Log Router identifier when reporting exceptions

2019-09-26 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938979#comment-16938979
 ] 

CR Hota commented on HDFS-14284:


[~hemanthboyina] [~inigoiri] [~ayushtkn] Thanks for the discussion so far. 
The overall approach looks fine.

Can we separate out RIOEx from hadoop-common, so that StandByExe does not 
extend RIOEx? It's best not to change hadoop-common directly for this feature.

RIOEx can be added in the hdfs-rbf project, and standby can be used directly: 
construct the error msg containing the router id before creating the standby 
exception. Standby already has client-side logic to fail over, so the log of 
the standby exception will automatically output the router id that was used 
when the exception was created on the server.
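A rough sketch of the suggestion above, assuming the router id is simply prefixed to the message before the standby exception is constructed. The `StandbyException` class here is a local stand-in so the snippet compiles on its own (the real one lives in hadoop-common), and `RouterErrors` with its methods is a hypothetical helper, not something from the patch.

```java
import java.io.IOException;

/** Local stand-in for the hadoop-common StandbyException, for illustration only. */
class StandbyException extends IOException {
  public StandbyException(String msg) {
    super(msg);
  }
}

/** Hypothetical helper that tags error messages with the reporting router. */
class RouterErrors {
  /** Prefixes the message with the router id so client logs show its origin. */
  public static String withRouterId(String routerId, String msg) {
    return "Router " + routerId + ": " + msg;
  }

  /** Builds the standby exception with the router id already in its message. */
  public static StandbyException standby(String routerId, String msg) {
    return new StandbyException(withRouterId(routerId, msg));
  }
}
```

Because the id travels inside the message, nothing in hadoop-common needs to change for the client-side failover log to show which router raised the exception.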

> RBF: Log Router identifier when reporting exceptions
> 
>
> Key: HDFS-14284
> URL: https://issues.apache.org/jira/browse/HDFS-14284
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-14284.001.patch, HDFS-14284.002.patch
>
>
> The typical setup is to use multiple Routers through 
> ConfiguredFailoverProxyProvider.
> In a regular HA Namenode setup, it is easy to know which NN was used.
> However, in RBF, any Router can be the one reporting the exception and it is 
> hard to know which was the one.
> We should have a way to identify which Router/Namenode was the one triggering 
> the exception.
> This would also apply with Observer Namenodes.






[jira] [Commented] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test

2019-09-26 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938970#comment-16938970
 ] 

CR Hota commented on HDFS-14461:


[~hexiaoqiao] 

This looks so much better. Thanks for getting this across the finish line. +1 
for v5.

> RBF: Fix intermittently failing kerberos related unit test
> --
>
> Key: HDFS-14461
> URL: https://issues.apache.org/jira/browse/HDFS-14461
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-14461.001.patch, HDFS-14461.002.patch, 
> HDFS-14461.003.patch, HDFS-14461.004.patch, HDFS-14461.005.patch
>
>
> TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It 
> may be due to some race condition before using the keytab that's created for 
> testing.
>  
> {code:java}
>  Failed
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken
>  Failing for the past 1 build (Since 
> [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png! 
> #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] )
>  [Took 89 
> ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history]
>   
>  Error Message
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED
> h3. Stacktrace
> org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED
>   at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>   at org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.setup(TestRouterHttpDelegationToken.java:99)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
>   at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Caused by: org.apache.hadoop.security.KerberosAuthException: failure to 
> login: for principal: router/localh...@example.com from keytab 
> 

[jira] [Commented] (HDFS-14851) WebHdfs Returns 200 Status Code for Open of Files with Corrupt Blocks

2019-09-19 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933784#comment-16933784
 ] 

CR Hota commented on HDFS-14851:


[~elgoiri] Thanks for tagging me.

[~dannytbecker] Thanks for working on this.

Yes, we do use webhdfs but haven't come across a scenario like this yet. The 
change looks quite expensive performance-wise, since every call pays the cost 
just to fix the response code. Iterating through all blocks to find which ones 
are corrupted looks expensive, especially when 1048576 is the limit of blocks 
per file. We may rather want to expose an API through InputStream that returns 
a list of all corrupted blocks (just like it exposes getAllBlocks); if the 
size of this list is positive, this web call can throw BlockMissingException.
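The idea could look roughly like the sketch below. Everything in it is an assumption for illustration: `CorruptBlockSource` and `getCorruptBlocks` are hypothetical names for the proposed API, block ids are plain strings, and `BlockMissingException` is a simplified local stand-in so the snippet is self-contained (the real HDFS class has a different constructor).

```java
import java.io.IOException;
import java.util.List;

/** Simplified local stand-in for the HDFS BlockMissingException. */
class BlockMissingException extends IOException {
  public BlockMissingException(String msg) {
    super(msg);
  }
}

/** Hypothetical view of an input stream that already knows its corrupt blocks. */
interface CorruptBlockSource {
  List<String> getCorruptBlocks();
}

class OpenCheck {
  /** Fails the open call if the stream reports any corrupt block. */
  public static void failIfCorrupt(String path, CorruptBlockSource in)
      throws BlockMissingException {
    List<String> corrupt = in.getCorruptBlocks();
    if (!corrupt.isEmpty()) {
      throw new BlockMissingException(
          path + " has " + corrupt.size() + " corrupt block(s): " + corrupt);
    }
  }
}
```

The point of this shape is that the web call only checks the size of a list the stream already holds, instead of iterating over every block of the file on each open.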

Cc [~xkrogen] [~jojochuang]

> WebHdfs Returns 200 Status Code for Open of Files with Corrupt Blocks
> -
>
> Key: HDFS-14851
> URL: https://issues.apache.org/jira/browse/HDFS-14851
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-14851.001.patch
>
>
> WebHdfs returns 200 status code for Open operations on files with missing or 
> corrupt blocks.






[jira] [Commented] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test

2019-09-19 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933626#comment-16933626
 ] 

CR Hota commented on HDFS-14461:


[~elgoiri] Thanks for the comment. This is a common issue with jiras that are 
"dependent". We should commit HDFS-14609, as it's meant to solve a specific 
problem, and continue working on this. Meanwhile, [~hexiaoqiao], can you please 
backport the change in HDFS-14609 to your workspace and help fix this issue? 
Happy to help if you need anything.

> RBF: Fix intermittently failing kerberos related unit test
> --
>
> Key: HDFS-14461
> URL: https://issues.apache.org/jira/browse/HDFS-14461
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14461.001.patch, HDFS-14461.002.patch
>
>

[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2019-09-18 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932738#comment-16932738
 ] 

CR Hota commented on HDFS-14090:


Hey [~elgoiri] Should we commit this? Most of the folks had already reviewed 
the patch earlier. Thoughts?

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, 
> HDFS-14090.012.patch, HDFS-14090.013.patch, HDFS-14090.014.patch, RBF_ 
> Isolation design.pdf
>
>






[jira] [Commented] (HDFS-14431) RBF: Rename with multiple subclusters should fail if no eligible locations

2019-09-16 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930830#comment-16930830
 ] 

CR Hota commented on HDFS-14431:


[~elgoiri] Many thanks for all the work done so far.

Took a look at the patch, and the approach seems error-prone as the operations 
in totality are NOT atomic. Filesystems are not transactional in nature.

Since rename is very hard to get right, may I suggest we approach it as we did 
with some other features? Let's come up with a design doc and write down the 
issues, the possible approaches, and which use cases we can and can't solve. 
We can all collaborate. Please count me in.

For someone new, it's very hard to get the context of what is being solved and 
which use cases are not.

On a side note, given the lack of atomic renames, here is how we are 
approaching renames in the short term. Most query engines (e.g. Hive) are 
equipped to handle rename failure by initiating a copy. In the scenario where 
a rename is across clusters, Hive is instructed to invoke a copy operation.
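As an illustration of the "rename, else copy" pattern mentioned above, here is a small sketch on a local filesystem using `java.nio.file`. It is an analogy only: it is not how Hive or the Router implements this, `RenameOrCopy` is a hypothetical name, and it handles a single file rather than a directory tree.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper showing a rename with a copy fallback. */
class RenameOrCopy {
  /**
   * Tries an atomic rename first and falls back to copy plus delete.
   * Returns true if the rename succeeded, false if the fallback was used.
   */
  public static boolean move(Path src, Path dst) throws IOException {
    try {
      Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
      return true;
    } catch (IOException renameFailure) {
      // The fallback is NOT atomic: a crash between the copy and the delete
      // leaves both the source and the destination in place.
      Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
      Files.delete(src);
      return false;
    }
  }
}
```

The comment in the fallback branch is the same caveat raised in this thread: once the operation is split into several steps, the caller has to decide what a partial failure means.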

FYI [~ayushtkn] [~xuzq_zander]

> RBF: Rename with multiple subclusters should fail if no eligible locations
> --
>
> Key: HDFS-14431
> URL: https://issues.apache.org/jira/browse/HDFS-14431
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Major
> Attachments: HDFS-14431-HDFS-13891.001.patch, 
> HDFS-14431-HDFS-13891.002.patch, HDFS-14431-HDFS-13891.003.patch, 
> HDFS-14431-HDFS-13891.004.patch, HDFS-14431-HDFS-13891.005.patch, 
> HDFS-14431-HDFS-13891.006.patch, HDFS-14431-HDFS-13891.007.patch
>
>
> Currently, the rename will fail with FileNotFoundException which is not clear 
> to the user.
> The operation should fail stating the reason is that there are no eligible 
> destinations.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2019-09-12 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928913#comment-16928913
 ] 

CR Hota commented on HDFS-14090:


[~elgoiri] Thanks for the final review.

[~brahmareddy] [~aajisaka] [~xkrogen] [~hexiaoqiao] [~linyiqun] [~tanyuxin] 
Gentle ping.

Let me know if you folks have any final thoughts on v014.patch. I am trying to 
see if we can target this for the 3.3 release.

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, 
> HDFS-14090.012.patch, HDFS-14090.013.patch, HDFS-14090.014.patch, RBF_ 
> Isolation design.pdf
>
>






[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2019-09-12 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928774#comment-16928774
 ] 

CR Hota commented on HDFS-14090:


[~elgoiri] Thanks for the review. Uploaded v014.

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, 
> HDFS-14090.012.patch, HDFS-14090.013.patch, HDFS-14090.014.patch, RBF_ 
> Isolation design.pdf
>
>






[jira] [Updated] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2019-09-12 Thread CR Hota (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14090:
---
Attachment: HDFS-14090.014.patch

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, 
> HDFS-14090.012.patch, HDFS-14090.013.patch, HDFS-14090.014.patch, RBF_ 
> Isolation design.pdf
>
>






[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2019-09-12 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928750#comment-16928750
 ] 

CR Hota commented on HDFS-14090:


[~elgoiri] Thanks a lot for the clarification.

Have taken care of all review comments in the latest v013 patch.

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, 
> HDFS-14090.012.patch, HDFS-14090.013.patch, RBF_ Isolation design.pdf
>
>
> The router is a gateway to the underlying name nodes. A gateway architecture 
> should minimize the impact that unhealthy clusters have on clients connecting 
> to healthy clusters.
> For example, if there are 2 name nodes downstream and one of them is heavily 
> loaded, with calls spiking RPC queue times, back pressure will make the same 
> slowdown show up on the router. As a result, clients connecting to 
> healthy/faster name nodes will also slow down, because the router maintains 
> the same RPC queue for all calls. Essentially, the same IPC thread pool is 
> used by the router to connect to all name nodes.
> Currently the router uses one single RPC queue for all calls. Let's discuss 
> how we can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from the current call queue, immediately identify 
> the downstream name node, and maintain a separate queue for each underlying 
> name node. Another, simpler way is to maintain some sort of rate limiter 
> configured for each name node and let routers drop or reject requests, or 
> send back an error, after a certain threshold.
> This won’t be a simple change, as the router’s ‘Server’ layer would need 
> redesign and implementation. Currently this layer is the same as the name 
> node's.
> Opening this ticket to discuss, design and implement this feature.
>  
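
The second option in the description above (a per-name-node limit with rejection past a threshold) can be sketched roughly as follows. This is only an illustration under assumptions: the class and method names are hypothetical and not taken from any patch on this issue, and it caps concurrent calls with a semaphore per nameservice rather than implementing a true rate limiter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

/**
 * Illustrative only: caps concurrent calls per downstream nameservice so that
 * one slow nameservice cannot consume every handler. Names are hypothetical.
 */
class NameservicePermits {
  private final Map<String, Semaphore> permits = new HashMap<>();

  /** permitsPerNs maps a nameservice id to its maximum concurrent calls. */
  NameservicePermits(Map<String, Integer> permitsPerNs) {
    for (Map.Entry<String, Integer> e : permitsPerNs.entrySet()) {
      permits.put(e.getKey(), new Semaphore(e.getValue()));
    }
  }

  /** Returns true if a call to nsId may proceed; false means reject it. */
  boolean tryAcquire(String nsId) {
    Semaphore s = permits.get(nsId);
    return s != null && s.tryAcquire();
  }

  /** Must be called exactly once for every successful tryAcquire. */
  void release(String nsId) {
    Semaphore s = permits.get(nsId);
    if (s != null) {
      s.release();
    }
  }
}
```

With one permit for ns0 and two for ns1, a second concurrent call to ns0 is rejected while calls to ns1 still go through, which is the isolation property the description asks for.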






[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-12 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928747#comment-16928747
 ] 

CR Hota commented on HDFS-14609:


[~tasanuma] You are correct. 

HDFS-14461 was created to address the issue.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.004.patch, HDFS-14609.005.patch, 
> HDFS-14609.006.patch
>
>
> We worked on router-based federation security as part of HDFS-13532 and kept 
> it compatible with the way the namenode works. However, with HADOOP-16314 and 
> HDFS-16354 in trunk, the auth filters seem to have been changed, causing tests 
> to fail.
> Corresponding changes are needed in RBF, mainly to fix the broken tests.






[jira] [Updated] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2019-09-12 Thread CR Hota (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14090:
---
Attachment: HDFS-14090.013.patch

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, 
> HDFS-14090.012.patch, HDFS-14090.013.patch, RBF_ Isolation design.pdf
>
>






[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-11 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928195#comment-16928195
 ] 

CR Hota commented on HDFS-14609:


[~zhangchen] Thanks for the clarification. I think we are good; InvalidToken 
is covered by other tests as well.

+1 for 006.patch.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.004.patch, HDFS-14609.005.patch, 
> HDFS-14609.006.patch
>
>






[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-10 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927242#comment-16927242
 ] 

CR Hota commented on HDFS-14609:


[~zhangchen] Thanks for the ping and the patch.

After a token is cancelled, the test should try to renew it and expect an 
InvalidToken exception. How does the test validate the InvalidToken exception?

Am I missing something?

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.004.patch, HDFS-14609.005.patch
>
>






[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2019-09-10 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926816#comment-16926816
 ] 

CR Hota commented on HDFS-14090:


[~elgoiri] Thanks for the review.

Sorry, I couldn't understand the second point. The idea of the test is to make 
sure that all threads finish execution and shut down gracefully, and then to 
analyze the metrics based on whether fairness is enabled or disabled.
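
As a rough illustration of that test shape (not the test in the patch), the sketch below submits work from several threads, shuts the pool down gracefully, and only then reads the counters. The class name, the thread count of 4, and the use of a semaphore to stand in for the fairness controller are all assumptions made for this example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical test helper; a semaphore stands in for the fairness controller. */
class FairnessTestSketch {
  /** Submits `calls` tasks from 4 threads against `permits` slots; returns {accepted, rejected}. */
  static int[] run(int calls, int permits) {
    Semaphore slots = new Semaphore(permits);
    AtomicInteger accepted = new AtomicInteger();
    AtomicInteger rejected = new AtomicInteger();
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (int i = 0; i < calls; i++) {
      pool.submit(() -> {
        if (slots.tryAcquire()) {
          try {
            accepted.incrementAndGet();
          } finally {
            slots.release();
          }
        } else {
          rejected.incrementAndGet();
        }
      });
    }
    // Graceful shutdown: stop accepting work, then wait for every task to finish.
    pool.shutdown();
    try {
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("tasks did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    }
    // Only now are the counters final and safe to assert on.
    return new int[] {accepted.get(), rejected.get()};
  }
}
```

Every call is counted exactly once, so accepted plus rejected always equals the number of submitted calls; with at least as many permits as threads, nothing is rejected.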

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, 
> HDFS-14090.012.patch, RBF_ Isolation design.pdf
>
>






[jira] [Commented] (HDFS-14774) RBF: Improve RouterWebhdfsMethods#chooseDatanode() error handling

2019-09-09 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926050#comment-16926050
 ] 

CR Hota commented on HDFS-14774:


Hey [~jojochuang], 

Do you have any follow up questions or shall we close this?

> RBF: Improve RouterWebhdfsMethods#chooseDatanode() error handling
> -
>
> Key: HDFS-14774
> URL: https://issues.apache.org/jira/browse/HDFS-14774
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Assignee: CR Hota
>Priority: Minor
>
>  HDFS-13972 added the following code:
> {code}
> try {
>   dns = rpcServer.getDatanodeReport(DatanodeReportType.LIVE);
> } catch (IOException e) {
>   LOG.error("Cannot get the datanodes from the RPC server", e);
> } finally {
>   // Reset ugi to remote user for remaining operations.
>   RouterRpcServer.resetCurrentUser();
> }
> HashSet excludes = new HashSet();
> if (excludeDatanodes != null) {
>   Collection collection =
>       getTrimmedStringCollection(excludeDatanodes);
>   for (DatanodeInfo dn : dns) {
>     if (collection.contains(dn.getName())) {
>       excludes.add(dn);
>     }
>   }
> }
> {code}
> If {{rpcServer.getDatanodeReport()}} throws an exception, {{dns}} will be 
> null. This doesn't look like the best way to handle the exception. Should the 
> router retry upon exception? Does it retry automatically under the hood?
> [~crh] [~brahmareddy]
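
One way to make the null case explicit is sketched below. It is not the code from HDFS-13972 or from any patch: plain strings stand in for DatanodeInfo, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified stand-in: datanodes are represented by their names. */
class ExcludeDatanodesSketch {
  /**
   * Builds the exclusion set. liveNodes is null when the report call failed;
   * in that case nothing can be matched, so the result is empty.
   */
  static Set<String> buildExcludes(String[] liveNodes, String excludeDatanodes) {
    Set<String> excludes = new HashSet<>();
    if (liveNodes == null || excludeDatanodes == null) {
      return excludes;
    }
    // Split the comma-separated parameter and drop surrounding whitespace.
    Set<String> requested = new HashSet<>();
    for (String name : excludeDatanodes.split(",")) {
      String trimmed = name.trim();
      if (!trimmed.isEmpty()) {
        requested.add(trimmed);
      }
    }
    for (String dn : liveNodes) {
      if (requested.contains(dn)) {
        excludes.add(dn);
      }
    }
    return excludes;
  }
}
```

The guard only avoids iterating over a null array; whether an empty exclusion set is the right outcome after a failed report is exactly the design question raised in the description.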






[jira] [Updated] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2019-09-06 Thread CR Hota (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14090:
---
Attachment: HDFS-14090.012.patch

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, 
> HDFS-14090.012.patch, RBF_ Isolation design.pdf
>
>






[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2019-09-06 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924427#comment-16924427
 ] 

CR Hota commented on HDFS-14090:


[~elgoiri] Thanks for the reviews. Some thoughts below.
{quote}My main issue is that PermitAllocationException is too generic.
 As you mention, it currently covers both (1) not enough handlers and (2) 
misconfigured nameservices.
 I think they should be two separate exceptions.
 The #1 case makes sense but the other one seems more like an 
IllegalArgumentException
{quote}
Both cases are, in principle, misconfigurations, so I wanted to keep them under 
the same umbrella of PermitAllocationException, which all implementations should 
throw if allocation fails; this failure happens because of misconfiguration.
{quote} 
 BTW, should we also add the fairness per user to the Router RPC server?
 It would go to a separate JIRA though.
{quote}
Fairness at the user level can still be enabled via FairCallQueue. We don't 
need to add anything separate from the Router's perspective. With HADOOP-16268 
already checked in, fairness along with balancing across routers is taken care 
of to a large extent.
  

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, RBF_ 
> Isolation design.pdf
>
>






[jira] [Commented] (HDFS-14784) Add more methods to WebHdfsTestUtil to support tests outside of package

2019-09-05 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923845#comment-16923845
 ] 

CR Hota commented on HDFS-14784:


[~elgoiri] Thanks for pointing out.

+1 for 002.patch

> Add more methods to WebHdfsTestUtil to support tests outside of package
> ---
>
> Key: HDFS-14784
> URL: https://issues.apache.org/jira/browse/HDFS-14784
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14784.001.patch, HDFS-14784.002.patch, 
> HDFS-14784.002.patch
>
>
> Before HDFS-14434, we could access a secure cluster through WebHDFS using the 
> user.name parameter and {{PseudoAuthenticationHandler}}, without Kerberos 
> authentication, which is quite useful in some test situations.
> HDFS-14434 ignores the user.name query parameter in secure WebHDFS when 
> WebHdfsFileSystem is used, so the only way to use the user.name parameter is 
> to access WebHDFS by URL.
> This Jira adds more methods to WebHdfsTestUtil so that unit tests outside the 
> package can test WebHDFS in a customized way.
> For more background and discussion, see HDFS-14609.






[jira] [Commented] (HDFS-14784) Add more methods to WebHdfsTestUtil to support tests outside of package

2019-09-05 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923827#comment-16923827
 ] 

CR Hota commented on HDFS-14784:


[~zhangchen] Thanks for the latest patch. It looks good to me too. Yes, it may 
be overkill to add a test for the wrapper; any test written for code that uses 
the function will exercise it anyway.

> Add more methods to WebHdfsTestUtil to support tests outside of package
> ---
>
> Key: HDFS-14784
> URL: https://issues.apache.org/jira/browse/HDFS-14784
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14784.001.patch, HDFS-14784.002.patch, 
> HDFS-14784.002.patch
>
>






[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2019-08-29 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919031#comment-16919031
 ] 

CR Hota commented on HDFS-14090:


[~elgoiri] Thanks for the comments. I have uploaded 011.patch.

I have taken care of the nits except the two below.
 * I think we can make PermitAllocationException more specific (right now it 
just takes whatever string). I think it would be nice to have the messages in 
PermitAllocationException itself and we would just pass the number of handlers, 
the min and the nsId as a parameter. This is already nice in 
PermitLimitExceededException.
 * StaticFairnessPolicyController#184 can fit in one line. Actually, this might 
be better to have as a different exception (which can be a subclass of 
PermitAllocationException).

PermitAllocationException can be raised not just for misconfigured handlers but 
also for misconfigured nameservices, so I deliberately kept it generic with a 
String message as its parameter. I also did not create a subclass of 
PermitAllocationException, to keep things simple in the beginning. I suggest we 
revisit refactoring these once the dynamic allocation work takes shape.
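
For reference, the alternative raised in the review (a generic base exception with more specific subclasses that build their own messages) could look roughly like the sketch below. The subclass names, constructor parameters, message texts, and the choice of a checked base class are assumptions made for illustration; the thread only establishes that PermitAllocationException takes a String message.

```java
/** Base type; the thread only says it takes a free-form String message. */
class PermitAllocationException extends Exception {
  PermitAllocationException(String message) {
    super(message);
  }
}

/** Hypothetical subclass for case (1): not enough handlers configured. */
class InsufficientHandlersException extends PermitAllocationException {
  InsufficientHandlersException(int handlers, int minimum) {
    super("Configured handlers (" + handlers
        + ") are below the required minimum (" + minimum + ")");
  }
}

/** Hypothetical subclass for case (2): a misconfigured nameservice. */
class MisconfiguredNameserviceException extends PermitAllocationException {
  MisconfiguredNameserviceException(String nsId) {
    super("Permit configuration refers to an invalid nameservice: " + nsId);
  }
}
```

Callers that only care that allocation failed can keep catching PermitAllocationException, so splitting the cases later would not break them.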

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, RBF_ 
> Isolation design.pdf
>
>






[jira] [Comment Edited] (HDFS-14774) RBF: Improve RouterWebhdfsMethods#chooseDatanode() error handling

2019-08-29 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919011#comment-16919011
 ] 

CR Hota edited comment on HDFS-14774 at 8/29/19 10:32 PM:
--

[~jojochuang] Thanks for reporting this.

This is OK at this point. The reason is that the router has 2 layers: one is 
the server to external clients and the other is the client to downstream 
namenodes. The client to downstream namenodes (aka RouterRpcClient) is 
configured to retry multiple times on failures from a downstream namenode. It 
also has logic to fail over and try the standby namenode if the standby becomes 
active, etc. So yes, retries happen before dns comes back as null.

And if it does come back as null, the parent method sends back an appropriate 
IOException.
{code:java}
if (dn == null) {
  throw new IOException("Failed to find datanode, suggest to check cluster"
      + " health. excludeDatanodes=" + excludeDatanodes);
}
{code}
Let me know if this helps.
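
The retry-then-fail-over behaviour described above can be pictured with the simplified sketch below. It is not RouterRpcClient's code: the interface, class, and parameter names are hypothetical, and real retry policies are more involved than a fixed number of passes over a list.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical stand-in for one RPC attempt against a single namenode. */
interface NamenodeCall<T> {
  T invoke(String namenode) throws IOException;
}

class FailoverRetrySketch {
  /**
   * Tries each namenode in order, for up to maxRounds passes over the list.
   * Returns the first successful result; otherwise throws an IOException
   * that wraps the last failure seen.
   */
  static <T> T invoke(List<String> namenodes, int maxRounds, NamenodeCall<T> call)
      throws IOException {
    IOException last = null;
    for (int round = 0; round < maxRounds; round++) {
      for (String nn : namenodes) {
        try {
          return call.invoke(nn);
        } catch (IOException e) {
          last = e; // remember the failure and fail over to the next namenode
        }
      }
    }
    throw new IOException(
        "No namenode answered after " + maxRounds + " rounds", last);
  }
}
```

Only after every attempt has failed does the caller see an exception, which corresponds to the point above that retries are exhausted before the web layer ever observes a missing datanode report.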

 

 

 



> RBF: Improve RouterWebhdfsMethods#chooseDatanode() error handling
> -
>
> Key: HDFS-14774
> URL: https://issues.apache.org/jira/browse/HDFS-14774
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Assignee: CR Hota
>Priority: Minor
>






[jira] [Commented] (HDFS-14774) RBF: Improve RouterWebhdfsMethods#chooseDatanode() error handling

2019-08-29 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919011#comment-16919011
 ] 

CR Hota commented on HDFS-14774:


[~jojochuang] Thanks for reporting this.

This is OK at this point. The reason is that the router has 2 layers: one is 
the server to external clients and the other is the client to downstream 
namenodes. The client to downstream namenodes (aka RouterRpcClient) is 
configured to retry multiple times on failures from a downstream namenode. It 
also has logic to fail over and try the standby namenode if the standby becomes 
active, etc. So yes, retries happen before dns comes back as null.

And if it does come back as null, the parent method does send back an 
appropriate IOException.
{code:java}
if (dn == null) {
  throw new IOException("Failed to find datanode, suggest to check cluster"
      + " health. excludeDatanodes=" + excludeDatanodes);
}
{code}
Let me know if this helps.

 

 

 

> RBF: Improve RouterWebhdfsMethods#chooseDatanode() error handling
> -
>
> Key: HDFS-14774
> URL: https://issues.apache.org/jira/browse/HDFS-14774
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Assignee: CR Hota
>Priority: Minor
>






[jira] [Assigned] (HDFS-14774) RBF: Improve RouterWebhdfsMethods#chooseDatanode() error handling

2019-08-29 Thread CR Hota (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota reassigned HDFS-14774:
--

Assignee: CR Hota

> RBF: Improve RouterWebhdfsMethods#chooseDatanode() error handling
> -
>
> Key: HDFS-14774
> URL: https://issues.apache.org/jira/browse/HDFS-14774
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Assignee: CR Hota
>Priority: Minor
>






[jira] [Commented] (HDFS-14784) Add more methods to WebHdfsTestUtil to support tests outside of package

2019-08-29 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919003#comment-16919003
 ] 

CR Hota commented on HDFS-14784:


[~zhangchen] Thanks for working on this. Overall patch looks fine.

Is it possible to add a test for 
WebHdfsFileSystem#convertJsonToDelegationToken? This is the new method we added 
here.

> Add more methods to WebHdfsTestUtil to support tests outside of package
> ---
>
> Key: HDFS-14784
> URL: https://issues.apache.org/jira/browse/HDFS-14784
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14784.001.patch
>
>
> Before HDFS-14434, we could access a secure cluster through WebHDFS using the 
> user.name parameter and {{PseudoAuthenticationHandler}}, without Kerberos 
> authentication, which is quite useful in some test situations.
> HDFS-14434 ignores the user.name query parameter in secure WebHDFS when 
> WebHdfsFileSystem is used, so the only way to use the user.name parameter is 
> to access by URL.
> This Jira tries to add more methods to WebHdfsTestUtil so that unit tests 
> outside the package can test WebHDFS in a customized way.
> For more background and discussion, see HDFS-14609.






[jira] [Updated] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2019-08-29 Thread CR Hota (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14090:
---
Attachment: HDFS-14090.011.patch

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch, RBF_ 
> Isolation design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures should 
> help minimize the impact of unhealthy clusters on clients connecting to 
> healthy clusters.
> For example - If there are 2 name nodes downstream, and one of them is 
> heavily loaded with calls spiking rpc queue times, due to back pressure the 
> same will start reflecting on the router. As a result of this, clients 
> connecting to healthy/faster name nodes will also slow down as same rpc queue 
> is maintained for all calls at the router layer. Essentially the same IPC 
> thread pool is used by router to connect to all name nodes.
> Currently router uses one single rpc queue for all calls. Let's discuss how we 
> can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from current call queue, immediately identify 
> downstream name node and maintain a separate queue for each underlying name 
> node. Another simpler way is to maintain some sort of rate limiter configured 
> for each name node and let routers drop/reject/send error requests after 
> certain threshold. 
> This won’t be a simple change as router’s ‘Server’ layer would need redesign 
> and implementation. Currently this layer is the same as name node.
> Opening this ticket to discuss, design and implement this feature.
>  
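
The "rate limiter configured for each name node" idea in the description can be sketched as below. This is an illustration under stated assumptions, not the code in the attached patches: the class PerNameserviceLimiter, its methods, and the permit counts are all hypothetical. Each downstream nameservice gets a fixed number of permits, and a call is rejected when its nameservice has none left, so a slow nameservice cannot consume capacity meant for the others.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

// Hypothetical sketch; names are illustrative, not the router's API.
class PerNameserviceLimiter {
  private final Map<String, Semaphore> permits = new HashMap<>();

  /** permitsPerNameservice maps a nameservice id to its static permit count. */
  PerNameserviceLimiter(Map<String, Integer> permitsPerNameservice) {
    for (Map.Entry<String, Integer> e : permitsPerNameservice.entrySet()) {
      permits.put(e.getKey(), new Semaphore(e.getValue()));
    }
  }

  /** Tries to reserve a slot for a call to nsId; false means reject the call. */
  boolean tryAcquire(String nsId) {
    Semaphore s = permits.get(nsId);
    return s != null && s.tryAcquire();
  }

  /** Returns the slot once the downstream call has finished. */
  void release(String nsId) {
    Semaphore s = permits.get(nsId);
    if (s != null) {
      s.release();
    }
  }

  public static void main(String[] args) {
    Map<String, Integer> config = new HashMap<>();
    config.put("ns0", 1);
    config.put("ns1", 2);
    PerNameserviceLimiter limiter = new PerNameserviceLimiter(config);
    System.out.println(limiter.tryAcquire("ns0")); // first call gets the permit
    System.out.println(limiter.tryAcquire("ns0")); // ns0 is saturated
    System.out.println(limiter.tryAcquire("ns1")); // ns1 is unaffected
  }
}
```

Rejecting immediately, rather than blocking for a permit, keeps the shared handler threads free; the alternative in the description, a separate queue per name node, would buffer calls instead of rejecting them.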






[jira] [Updated] (HDFS-14793) BlockTokenSecretManager should LOG block token range it operates on.

2019-08-28 Thread CR Hota (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14793:
---
Summary: BlockTokenSecretManager should LOG block token range it operates 
on.  (was: BlockTokenSecretManager should LOG block tokaen range it operates 
on.)

> BlockTokenSecretManager should LOG block token range it operates on.
> 
>
> Key: HDFS-14793
> URL: https://issues.apache.org/jira/browse/HDFS-14793
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Priority: Major
>
> At startup, log enough information to identify the range of block token keys 
> for the NameNode. This should make it easier to debug issues with block 
> tokens.






[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2019-08-27 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916840#comment-16916840
 ] 

CR Hota commented on HDFS-14090:


Hey [~elgoiri] [~brahmareddy] [~aajisaka] [~xkrogen],

Could you help take a final look and commit 010.patch? I have already broken 
this Jira down into static and dynamic parts. Once this is committed, we can 
focus on designing the dynamic allocation model.

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, RBF_ Isolation design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures should 
> help minimize the impact of unhealthy clusters on clients connecting to 
> healthy clusters.
> For example - If there are 2 name nodes downstream, and one of them is 
> heavily loaded with calls spiking rpc queue times, due to back pressure the 
> same will start reflecting on the router. As a result of this, clients 
> connecting to healthy/faster name nodes will also slow down as same rpc queue 
> is maintained for all calls at the router layer. Essentially the same IPC 
> thread pool is used by router to connect to all name nodes.
> Currently router uses one single rpc queue for all calls. Let's discuss how we 
> can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from current call queue, immediately identify 
> downstream name node and maintain a separate queue for each underlying name 
> node. Another simpler way is to maintain some sort of rate limiter configured 
> for each name node and let routers drop/reject/send error requests after 
> certain threshold. 
> This won’t be a simple change as router’s ‘Server’ layer would need redesign 
> and implementation. Currently this layer is the same as name node.
> Opening this ticket to discuss, design and implement this feature.
>  






[jira] [Commented] (HDFS-14760) Log INFO mode if snapshot usage and actual usage differ

2019-08-27 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916836#comment-16916836
 ] 

CR Hota commented on HDFS-14760:


[~jojochuang] Thanks for the review. Could you help commit 002.patch?

> Log INFO mode if snapshot usage and actual usage differ
> ---
>
> Key: HDFS-14760
> URL: https://issues.apache.org/jira/browse/HDFS-14760
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14760.001.patch, HDFS-14760.002.patch
>
>
> In DirectoryWithQuotaFeature#checkStoragespace code logs in error mode 
> without throwing any exceptions or action and pollutes logs. This should be 
> in INFO mode.
> {code}
>   private void checkStoragespace(final INodeDirectory dir, final long computed) {
>     if (-1 != quota.getStorageSpace() && usage.getStorageSpace() != computed) {
>       NameNode.LOG.error("BUG: Inconsistent storagespace for directory "
>           + dir.getFullPathName() + ". Cached = " + usage.getStorageSpace()
>           + " != Computed = " + computed);
>     }
>   }
> {code}






[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-26 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916101#comment-16916101
 ] 

CR Hota commented on HDFS-14609:


[~zhangchen] Thanks for the ping and clarifications. It makes sense why the hdfs 
changes are needed; let's still do the hdfs changes in a separate Jira and then 
fix these tests after.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HDFS-16354 in trunk, auth filters seem to have been changed, causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.






[jira] [Commented] (HDFS-14760) Log INFO mode if snapshot usage and actual usage differ

2019-08-22 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913826#comment-16913826
 ] 

CR Hota commented on HDFS-14760:


[~jojochuang] Thanks!

Seems 002.patch is safe to commit. 

> Log INFO mode if snapshot usage and actual usage differ
> ---
>
> Key: HDFS-14760
> URL: https://issues.apache.org/jira/browse/HDFS-14760
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14760.001.patch, HDFS-14760.002.patch
>
>
> In DirectoryWithQuotaFeature#checkStoragespace code logs in error mode 
> without throwing any exceptions or action and pollutes logs. This should be 
> in INFO mode.
> {code}
>   private void checkStoragespace(final INodeDirectory dir, final long computed) {
>     if (-1 != quota.getStorageSpace() && usage.getStorageSpace() != computed) {
>       NameNode.LOG.error("BUG: Inconsistent storagespace for directory "
>           + dir.getFullPathName() + ". Cached = " + usage.getStorageSpace()
>           + " != Computed = " + computed);
>     }
>   }
> {code}






[jira] [Updated] (HDFS-14760) Log INFO mode if snapshot usage and actual usage differ

2019-08-22 Thread CR Hota (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14760:
---
Attachment: HDFS-14760.002.patch

> Log INFO mode if snapshot usage and actual usage differ
> ---
>
> Key: HDFS-14760
> URL: https://issues.apache.org/jira/browse/HDFS-14760
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14760.001.patch, HDFS-14760.002.patch
>
>
> In DirectoryWithQuotaFeature#checkStoragespace code logs in error mode 
> without throwing any exceptions or action and pollutes logs. This should be 
> in INFO mode.
> {code}
>   private void checkStoragespace(final INodeDirectory dir, final long computed) {
>     if (-1 != quota.getStorageSpace() && usage.getStorageSpace() != computed) {
>       NameNode.LOG.error("BUG: Inconsistent storagespace for directory "
>           + dir.getFullPathName() + ". Cached = " + usage.getStorageSpace()
>           + " != Computed = " + computed);
>     }
>   }
> {code}






[jira] [Commented] (HDFS-14760) Log INFO mode if snapshot usage and actual usage differ

2019-08-22 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913522#comment-16913522
 ] 

CR Hota commented on HDFS-14760:


[~xkrogen] Thanks for the review. 'WARN' makes sense too. Honestly, I haven't 
been able to wrap my head around the whole feature yet or how to handle 
these cases. But at this point, our hdfs installation wants to make sure no 
'ERROR' is logged unless it is really an error that should/can be acted on.

> Log INFO mode if snapshot usage and actual usage differ
> ---
>
> Key: HDFS-14760
> URL: https://issues.apache.org/jira/browse/HDFS-14760
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14760.001.patch
>
>
> In DirectoryWithQuotaFeature#checkStoragespace code logs in error mode 
> without throwing any exceptions or action and pollutes logs. This should be 
> in INFO mode.
> {code}
>   private void checkStoragespace(final INodeDirectory dir, final long computed) {
>     if (-1 != quota.getStorageSpace() && usage.getStorageSpace() != computed) {
>       NameNode.LOG.error("BUG: Inconsistent storagespace for directory "
>           + dir.getFullPathName() + ". Cached = " + usage.getStorageSpace()
>           + " != Computed = " + computed);
>     }
>   }
> {code}






[jira] [Commented] (HDFS-14760) Log INFO mode if snapshot usage and actual usage differ

2019-08-20 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911837#comment-16911837
 ] 

CR Hota commented on HDFS-14760:


Thanks [~jojochuang]
 Adding some more folks for context. [~ayushtkn] [~xkrogen] [~RANith] 
[~brahmareddy] .

> Log INFO mode if snapshot usage and actual usage differ
> ---
>
> Key: HDFS-14760
> URL: https://issues.apache.org/jira/browse/HDFS-14760
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14760.001.patch
>
>
> In DirectoryWithQuotaFeature#checkStoragespace code logs in error mode 
> without throwing any exceptions or action and pollutes logs. This should be 
> in INFO mode.
> {code}
>   private void checkStoragespace(final INodeDirectory dir, final long computed) {
>     if (-1 != quota.getStorageSpace() && usage.getStorageSpace() != computed) {
>       NameNode.LOG.error("BUG: Inconsistent storagespace for directory "
>           + dir.getFullPathName() + ". Cached = " + usage.getStorageSpace()
>           + " != Computed = " + computed);
>     }
>   }
> {code}






[jira] [Updated] (HDFS-14403) Cost-Based RPC FairCallQueue

2019-08-20 Thread CR Hota (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14403:
---
Description: *strong text*HADOOP-15016 initially described extensions to 
the Hadoop FairCallQueue encompassing both cost-based analysis of incoming 
RPCs, as well as support for reservations of RPC capacity for system/platform 
users. This JIRA intends to track the former, as HADOOP-15016 was repurposed to 
more specifically focus on the reservation portion of the work.  (was: 
HADOOP-15016 initially described extensions to the Hadoop FairCallQueue 
encompassing both cost-based analysis of incoming RPCs, as well as support for 
reservations of RPC capacity for system/platform users. This JIRA intends to 
track the former, as HADOOP-15016 was repurposed to more specifically focus on 
the reservation portion of the work.)

> Cost-Based RPC FairCallQueue
> 
>
> Key: HDFS-14403
> URL: https://issues.apache.org/jira/browse/HDFS-14403
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ipc, namenode
>Reporter: Erik Krogen
>Assignee: Christopher Gregorian
>Priority: Major
>  Labels: qos, rpc
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: CostBasedFairCallQueueDesign_v0.pdf, 
> HDFS-14403.001.patch, HDFS-14403.002.patch, HDFS-14403.003.patch, 
> HDFS-14403.004.patch, HDFS-14403.005.patch, HDFS-14403.006.combined.patch, 
> HDFS-14403.006.patch, HDFS-14403.007.patch, HDFS-14403.008.patch, 
> HDFS-14403.009.patch, HDFS-14403.010.patch, HDFS-14403.011.patch, 
> HDFS-14403.012.patch, HDFS-14403.013.patch, HDFS-14403.branch-2.8.patch
>
>
> *strong text*HADOOP-15016 initially described extensions to the Hadoop 
> FairCallQueue encompassing both cost-based analysis of incoming RPCs, as well 
> as support for reservations of RPC capacity for system/platform users. This 
> JIRA intends to track the former, as HADOOP-15016 was repurposed to more 
> specifically focus on the reservation portion of the work.






[jira] [Updated] (HDFS-14403) Cost-Based RPC FairCallQueue

2019-08-20 Thread CR Hota (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14403:
---
Description: HADOOP-15016 initially described extensions to the Hadoop 
FairCallQueue encompassing both cost-based analysis of incoming RPCs, as well 
as support for reservations of RPC capacity for system/platform users. This 
JIRA intends to track the former, as HADOOP-15016 was repurposed to more 
specifically focus on the reservation portion of the work.  (was: *strong 
text*HADOOP-15016 initially described extensions to the Hadoop FairCallQueue 
encompassing both cost-based analysis of incoming RPCs, as well as support for 
reservations of RPC capacity for system/platform users. This JIRA intends to 
track the former, as HADOOP-15016 was repurposed to more specifically focus on 
the reservation portion of the work.)

> Cost-Based RPC FairCallQueue
> 
>
> Key: HDFS-14403
> URL: https://issues.apache.org/jira/browse/HDFS-14403
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ipc, namenode
>Reporter: Erik Krogen
>Assignee: Christopher Gregorian
>Priority: Major
>  Labels: qos, rpc
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: CostBasedFairCallQueueDesign_v0.pdf, 
> HDFS-14403.001.patch, HDFS-14403.002.patch, HDFS-14403.003.patch, 
> HDFS-14403.004.patch, HDFS-14403.005.patch, HDFS-14403.006.combined.patch, 
> HDFS-14403.006.patch, HDFS-14403.007.patch, HDFS-14403.008.patch, 
> HDFS-14403.009.patch, HDFS-14403.010.patch, HDFS-14403.011.patch, 
> HDFS-14403.012.patch, HDFS-14403.013.patch, HDFS-14403.branch-2.8.patch
>
>
> HADOOP-15016 initially described extensions to the Hadoop FairCallQueue 
> encompassing both cost-based analysis of incoming RPCs, as well as support 
> for reservations of RPC capacity for system/platform users. This JIRA intends 
> to track the former, as HADOOP-15016 was repurposed to more specifically 
> focus on the reservation portion of the work.






[jira] [Commented] (HDFS-14760) Log INFO mode if snapshot usage and actual usage differ

2019-08-20 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911816#comment-16911816
 ] 

CR Hota commented on HDFS-14760:


[~jojochuang] 
Could you help take a look at this? I am not very familiar with why this check 
was added historically, or why it takes no action yet logs in error mode.

> Log INFO mode if snapshot usage and actual usage differ
> ---
>
> Key: HDFS-14760
> URL: https://issues.apache.org/jira/browse/HDFS-14760
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14760.001.patch
>
>
> In DirectoryWithQuotaFeature#checkStoragespace code logs in error mode 
> without throwing any exceptions or action and pollutes logs. This should be 
> in INFO mode.
> {code}
>   private void checkStoragespace(final INodeDirectory dir, final long computed) {
>     if (-1 != quota.getStorageSpace() && usage.getStorageSpace() != computed) {
>       NameNode.LOG.error("BUG: Inconsistent storagespace for directory "
>           + dir.getFullPathName() + ". Cached = " + usage.getStorageSpace()
>           + " != Computed = " + computed);
>     }
>   }
> {code}






[jira] [Updated] (HDFS-14760) Log INFO mode if snapshot usage and actual usage differ

2019-08-20 Thread CR Hota (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14760:
---
Attachment: HDFS-14760.001.patch
Status: Patch Available  (was: Open)

> Log INFO mode if snapshot usage and actual usage differ
> ---
>
> Key: HDFS-14760
> URL: https://issues.apache.org/jira/browse/HDFS-14760
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14760.001.patch
>
>
> In DirectoryWithQuotaFeature#checkStoragespace code logs in error mode 
> without throwing any exceptions or action and pollutes logs. This should be 
> in INFO mode.
> {code}
>   private void checkStoragespace(final INodeDirectory dir, final long computed) {
>     if (-1 != quota.getStorageSpace() && usage.getStorageSpace() != computed) {
>       NameNode.LOG.error("BUG: Inconsistent storagespace for directory "
>           + dir.getFullPathName() + ". Cached = " + usage.getStorageSpace()
>           + " != Computed = " + computed);
>     }
>   }
> {code}






[jira] [Updated] (HDFS-14760) Log INFO mode if snapshot usage and actual usage differ

2019-08-20 Thread CR Hota (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14760:
---
Description: 
In DirectoryWithQuotaFeature#checkStoragespace code logs in error mode without 
throwing any exceptions or action and pollutes logs. This should be in INFO 
mode.

{code}
  private void checkStoragespace(final INodeDirectory dir, final long computed) {
    if (-1 != quota.getStorageSpace() && usage.getStorageSpace() != computed) {
      NameNode.LOG.error("BUG: Inconsistent storagespace for directory "
          + dir.getFullPathName() + ". Cached = " + usage.getStorageSpace()
          + " != Computed = " + computed);
    }
  }
{code}


  was:
{code}
  private void checkStoragespace(final INodeDirectory dir, final long computed) {
    if (-1 != quota.getStorageSpace() && usage.getStorageSpace() != computed) {
      NameNode.LOG.error("BUG: Inconsistent storagespace for directory "
          + dir.getFullPathName() + ". Cached = " + usage.getStorageSpace()
          + " != Computed = " + computed);
    }
  }
{code}
The above code logs in error mode without throwing any exceptions or action and 
pollutes logs. This should be in INFO mode.


> Log INFO mode if snapshot usage and actual usage differ
> ---
>
> Key: HDFS-14760
> URL: https://issues.apache.org/jira/browse/HDFS-14760
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
>
> In DirectoryWithQuotaFeature#checkStoragespace code logs in error mode 
> without throwing any exceptions or action and pollutes logs. This should be 
> in INFO mode.
> {code}
>   private void checkStoragespace(final INodeDirectory dir, final long computed) {
>     if (-1 != quota.getStorageSpace() && usage.getStorageSpace() != computed) {
>       NameNode.LOG.error("BUG: Inconsistent storagespace for directory "
>           + dir.getFullPathName() + ". Cached = " + usage.getStorageSpace()
>           + " != Computed = " + computed);
>     }
>   }
> {code}






[jira] [Created] (HDFS-14760) Log INFO mode if snapshot usage and actual usage differ

2019-08-20 Thread CR Hota (Jira)
CR Hota created HDFS-14760:
--

 Summary: Log INFO mode if snapshot usage and actual usage differ
 Key: HDFS-14760
 URL: https://issues.apache.org/jira/browse/HDFS-14760
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: CR Hota
Assignee: CR Hota


{code}
  private void checkStoragespace(final INodeDirectory dir, final long computed) {
    if (-1 != quota.getStorageSpace() && usage.getStorageSpace() != computed) {
      NameNode.LOG.error("BUG: Inconsistent storagespace for directory "
          + dir.getFullPathName() + ". Cached = " + usage.getStorageSpace()
          + " != Computed = " + computed);
    }
  }
{code}
{code}
The above code logs in error mode without throwing any exceptions or action and 
pollutes logs. This should be in INFO mode.
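
A minimal sketch of the proposed change, assuming nothing about the eventual patch: the same check, with the mismatch reported at INFO rather than ERROR. To keep it self-contained it uses java.util.logging and plain long parameters instead of NameNode.LOG and the quota and usage objects; QuotaCheckSketch and its boolean return value are hypothetical additions for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for DirectoryWithQuotaFeature#checkStoragespace.
class QuotaCheckSketch {
  private static final Logger LOG = Logger.getLogger("QuotaCheckSketch");

  /**
   * Returns true when no storage-space quota is set (-1) or when the cached
   * usage matches the computed usage; otherwise logs at INFO and returns false.
   */
  static boolean checkStoragespace(String path, long quota, long cached,
      long computed) {
    if (-1 != quota && cached != computed) {
      LOG.log(Level.INFO, "Inconsistent storagespace for directory " + path
          + ". Cached = " + cached + " != Computed = " + computed);
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    // Quota is set and the two values differ, so one INFO line is logged.
    checkStoragespace("/user/example", 1024L, 10L, 12L);
  }
}
```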






[jira] [Commented] (HDFS-12510) RBF: Add security to UI

2019-08-19 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910726#comment-16910726
 ] 

CR Hota commented on HDFS-12510:


Thanks [~elgoiri]

> RBF: Add security to UI
> ---
>
> Key: HDFS-12510
> URL: https://issues.apache.org/jira/browse/HDFS-12510
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: CR Hota
>Priority: Major
>  Labels: RBF
>
> HDFS-12273 implemented the UI for Router Based Federation without security.






[jira] [Resolved] (HDFS-12510) RBF: Add security to UI

2019-08-19 Thread CR Hota (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota resolved HDFS-12510.

Resolution: Resolved

> RBF: Add security to UI
> ---
>
> Key: HDFS-12510
> URL: https://issues.apache.org/jira/browse/HDFS-12510
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: CR Hota
>Priority: Major
>  Labels: RBF
>
> HDFS-12273 implemented the UI for Router Based Federation without security.






[jira] [Updated] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2019-08-19 Thread CR Hota (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14090:
---
Parent: (was: HDFS-14603)
Issue Type: New Feature  (was: Sub-task)

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, RBF_ Isolation design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures should 
> help minimize the impact of unhealthy clusters on clients connecting to 
> healthy clusters.
> For example - If there are 2 name nodes downstream, and one of them is 
> heavily loaded with calls spiking rpc queue times, due to back pressure the 
> same will start reflecting on the router. As a result of this, clients 
> connecting to healthy/faster name nodes will also slow down as same rpc queue 
> is maintained for all calls at the router layer. Essentially the same IPC 
> thread pool is used by router to connect to all name nodes.
> Currently router uses one single rpc queue for all calls. Let's discuss how we 
> can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from current call queue, immediately identify 
> downstream name node and maintain a separate queue for each underlying name 
> node. Another simpler way is to maintain some sort of rate limiter configured 
> for each name node and let routers drop/reject/send error requests after 
> certain threshold. 
> This won’t be a simple change as router’s ‘Server’ layer would need redesign 
> and implementation. Currently this layer is the same as name node.
> Opening this ticket to discuss, design and implement this feature.
>  






[jira] [Created] (HDFS-14750) RBF: Improved isolation for downstream name nodes. {Dynamic}

2019-08-19 Thread CR Hota (Jira)
CR Hota created HDFS-14750:
--

 Summary: RBF: Improved isolation for downstream name nodes. 
{Dynamic}
 Key: HDFS-14750
 URL: https://issues.apache.org/jira/browse/HDFS-14750
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: CR Hota
Assignee: CR Hota


This Jira tracks the work around dynamic allocation of resources in routers for 
downstream hdfs clusters. 






[jira] [Updated] (HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

2019-08-19 Thread CR Hota (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14090:
---
Summary: RBF: Improved isolation for downstream name nodes. {Static}  (was: 
RBF: Improved isolation for downstream name nodes.)

> RBF: Improved isolation for downstream name nodes. {Static}
> ---
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, RBF_ Isolation design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures should 
> help minimize the impact of unhealthy clusters on clients connecting to 
> healthy clusters.
> For example - If there are 2 name nodes downstream, and one of them is 
> heavily loaded with calls spiking rpc queue times, due to back pressure the 
> same will start reflecting on the router. As a result, clients connecting 
> to healthy/faster name nodes will also slow down, since the same rpc queue 
> is maintained for all calls at the router layer. Essentially the same IPC 
> thread pool is used by the router to connect to all name nodes.
> Currently the router uses one single rpc queue for all calls. Let's discuss 
> how we can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from the current call queue, immediately identify 
> the downstream name node and maintain a separate queue for each underlying 
> name node. Another, simpler way is to maintain some sort of rate limiter 
> configured for each name node and let routers drop/reject/send error 
> requests after a certain threshold. 
> This won’t be a simple change, as the router’s ‘Server’ layer would need 
> redesign and implementation. Currently this layer is the same as the name node's.
> Opening this ticket to discuss, design and implement this feature.
>  
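
As a rough illustration of the first option in the description (a separate queue per downstream name node), here is a minimal Java sketch. The class name PerNameserviceQueues, the string-typed calls and the fixed per-queue capacity are all assumptions made for the example; none of them come from the router code or the attached patches.

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: one bounded call queue per downstream nameservice. */
class PerNameserviceQueues {
  private final Map<String, BlockingQueue<String>> queues = new ConcurrentHashMap<>();
  private final int capacityPerNameservice;

  PerNameserviceQueues(int capacityPerNameservice) {
    this.capacityPerNameservice = capacityPerNameservice;
  }

  /**
   * Queues a call for the given nameservice.
   * Returns false when that nameservice's queue is full, so the caller can
   * reject the call without affecting calls for other nameservices.
   */
  boolean offer(String nameservice, String call) {
    return queues
        .computeIfAbsent(nameservice,
            ns -> new ArrayBlockingQueue<String>(capacityPerNameservice))
        .offer(call);
  }

  /** Returns the next queued call for the nameservice, or null if there is none. */
  String poll(String nameservice) {
    BlockingQueue<String> queue = queues.get(nameservice);
    return queue == null ? null : queue.poll();
  }
}
```

With this shape, a slow name node can only fill its own queue; calls headed to other name nodes are still accepted.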






[jira] [Updated] (HDFS-14749) RBF: Isolation across multiple downstream hdfs clusters

2019-08-19 Thread CR Hota (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14749:
---
Summary: RBF: Isolation across multiple downstream hdfs clusters  (was: 
Isolation across multiple downstream hdfs clusters)

> RBF: Isolation across multiple downstream hdfs clusters
> ---
>
> Key: HDFS-14749
> URL: https://issues.apache.org/jira/browse/HDFS-14749
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
>
> This parent Jira tracks all work done within the context of router isolation 
> across multiple downstream hdfs clusters.
>  # Phase 1 will be static allocation of resources.
>  # Phase 2 will introduce a more dynamic, preemption-based approach.






[jira] [Created] (HDFS-14749) Isolation across multiple downstream hdfs clusters

2019-08-19 Thread CR Hota (Jira)
CR Hota created HDFS-14749:
--

 Summary: Isolation across multiple downstream hdfs clusters
 Key: HDFS-14749
 URL: https://issues.apache.org/jira/browse/HDFS-14749
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: CR Hota
Assignee: CR Hota


This parent Jira tracks all work done within the context of router isolation 
across multiple downstream hdfs clusters.
 # Phase 1 will be static allocation of resources.
 # Phase 2 will introduce a more dynamic, preemption-based approach.






[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes.

2019-08-19 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910708#comment-16910708
 ] 

CR Hota commented on HDFS-14090:


[~hexiaoqiao] Thanks for the review. The points you raised are very valid.

In the design doc I have also mentioned that at some point we need to look into 
and introduce preemption/dynamic allocation. For Phase 1, however, the current 
patch will help installations move forward with the concept of isolation. 
Dynamic/preemption will be a separate implementation of 
{{FairnessPolicyController}}. I will open a ticket to track this next phase. 
It will also need a thorough design analysis and review.

Let's wait for [~elgoiri] [~brahmareddy] [~aajisaka] [~xkrogen] to review the 
010 patch.

> RBF: Improved isolation for downstream name nodes.
> --
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, RBF_ Isolation design.pdf
>
>






[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-19 Thread CR Hota (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910618#comment-16910618
 ] 

CR Hota commented on HDFS-14609:


[~zhangchen] Thanks for the patch. A few points:
 # Let's not add hadoop-hdfs and rbf changes together. Any changes needed in 
hdfs that rbf depends on can be done first in hdfs through a separate jira. 
Feel free to create one.
 # It's not clear why hadoop-hdfs changes are needed in this context.
 # Let's use the configs from DFSConfigs instead of defining them again in the 
test class, for example: private static final String 
HTTP_KERBEROS_PRINCIPAL_CONF_KEY = 
"hadoop.http.authentication.kerberos.principal";

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However, with HADOOP-16314 and 
> HDFS-16354 in trunk, the auth filters seem to have changed, causing tests to 
> fail.
> Appropriate changes are needed in RBF, mainly fixing broken tests.






[jira] [Commented] (HDFS-14744) RBF: Non secured routers should not log in error mode when UGI is default.

2019-08-19 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910186#comment-16910186
 ] 

CR Hota commented on HDFS-14744:


[~ayushtkn] Thanks for the review.

> RBF: Non secured routers should not log in error mode when UGI is default.
> --
>
> Key: HDFS-14744
> URL: https://issues.apache.org/jira/browse/HDFS-14744
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14744.001.patch
>
>
> RouterClientProtocol#getMountPointStatus logs an error when groups are not found 
> for the default web user dr.who. The line should be logged at "error" level for 
> secured clusters; for unsecured clusters, we may want to log it at "debug" level 
> instead, or else the logs fill up with this non-critical line:
> {{ERROR org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer: 
> Cannot get the remote user: There is no primary group for UGI dr.who 
> (auth:SIMPLE)}}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12510) RBF: Add security to UI

2019-08-16 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909449#comment-16909449
 ] 

CR Hota commented on HDFS-12510:


[~elgoiri] [~brahmareddy] Should we mark this done? We can revisit if any 
issues are reported in the future.

> RBF: Add security to UI
> ---
>
> Key: HDFS-12510
> URL: https://issues.apache.org/jira/browse/HDFS-12510
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: CR Hota
>Priority: Major
>  Labels: RBF
>
> HDFS-12273 implemented the UI for Router Based Federation without security.






[jira] [Updated] (HDFS-14744) RBF: Non secured routers should not log in error mode when UGI is default.

2019-08-16 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14744:
---
Attachment: HDFS-14744.001.patch
Status: Patch Available  (was: Open)

> RBF: Non secured routers should not log in error mode when UGI is default.
> --
>
> Key: HDFS-14744
> URL: https://issues.apache.org/jira/browse/HDFS-14744
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14744.001.patch
>
>






[jira] [Created] (HDFS-14744) RBF: Non secured routers should not log in error mode when UGI is default.

2019-08-16 Thread CR Hota (JIRA)
CR Hota created HDFS-14744:
--

 Summary: RBF: Non secured routers should not log in error mode 
when UGI is default.
 Key: HDFS-14744
 URL: https://issues.apache.org/jira/browse/HDFS-14744
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: CR Hota
Assignee: CR Hota


RouterClientProtocol#getMountPointStatus logs an error when groups are not found 
for the default web user dr.who. The line should be logged at "error" level for 
secured clusters; for unsecured clusters, we may want to log it at "debug" level 
instead, or else the logs fill up with this non-critical line:

{{ERROR org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer: Cannot 
get the remote user: There is no primary group for UGI dr.who (auth:SIMPLE)}}
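
One possible shape for this, sketched in plain Java with java.util.logging so that it runs on its own: pick the log level from a security flag. The class name RemoteUserLogSketch and the boolean parameter are hypothetical; in the router the flag would presumably come from something like UserGroupInformation.isSecurityEnabled(), and the real logger is not java.util.logging.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch: choose the log level based on whether security is on. */
class RemoteUserLogSketch {
  private static final Logger LOG =
      Logger.getLogger(RemoteUserLogSketch.class.getName());

  /** SEVERE stands in for "error", FINE stands in for "debug". */
  static Level levelFor(boolean securityEnabled) {
    return securityEnabled ? Level.SEVERE : Level.FINE;
  }

  /** Logs the missing-group message at the level chosen above. */
  static void logMissingGroup(boolean securityEnabled, String user) {
    LOG.log(levelFor(securityEnabled),
        "Cannot get the remote user: no primary group for UGI {0}", user);
  }
}
```

On an unsecured cluster the dr.who message then only shows up when debug-level logging is enabled.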

 

 






[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes.

2019-08-15 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908758#comment-16908758
 ] 

CR Hota commented on HDFS-14090:


[~xkrogen] [~elgoiri] Many thanks for the detailed reviews. Very helpful :) 
I have incorporated almost all the points you folks mentioned.

At a high level, the changes are:
 # "permit" is still the word being used.
 # One configuration controls the feature: {{NoFairnessPolicyController}} is a 
dummy, whereas {{StaticFairnessPolicyController}} is the fairness implementation.
 # The whole start-up will fail if fairness class loading has issues. Test 
cases are changed accordingly to reflect that.
 # {{NoPermitAvailableException}} is renamed to 
{{PermitLimitExceededException}}.
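
For readers who have not opened the patch, the permit idea can be sketched roughly as below. This is an illustration only: the interface and class names carry a "Sketch" suffix because they are made up for this example, and the method signatures in the actual patch may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

/** Hypothetical interface; the signatures in the real patch may differ. */
interface FairnessControllerSketch {
  boolean acquirePermit(String nameservice);
  void releasePermit(String nameservice);
}

/** Dummy controller: fairness is off, every call is admitted. */
class NoFairnessSketch implements FairnessControllerSketch {
  public boolean acquirePermit(String nameservice) { return true; }
  public void releasePermit(String nameservice) { }
}

/** Static controller: a fixed number of permits per nameservice. */
class StaticFairnessSketch implements FairnessControllerSketch {
  private final Map<String, Semaphore> permits = new HashMap<>();

  StaticFairnessSketch(Map<String, Integer> permitsPerNameservice) {
    for (Map.Entry<String, Integer> e : permitsPerNameservice.entrySet()) {
      permits.put(e.getKey(), new Semaphore(e.getValue()));
    }
  }

  /** Non-blocking: returns false when the nameservice has no permit left. */
  public boolean acquirePermit(String nameservice) {
    Semaphore s = permits.get(nameservice);
    return s != null && s.tryAcquire();
  }

  public void releasePermit(String nameservice) {
    Semaphore s = permits.get(nameservice);
    if (s != null) {
      s.release();
    }
  }
}
```

A handler would call acquirePermit before forwarding a call and releasePermit in a finally block; a false return is the point where an exception such as {{PermitLimitExceededException}} would be raised to the client.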

 

To [~xkrogen]'s observations:
{quote}I was considering the scenario where there are two routers R1 and R2, 
and two NameNodes N1 and N2. Assume most clients need to access both N1 and N2. 
What happens in the situation when all of R1's N1-handlers are full (but 
N2-handlers mostly empty), and all of R2's N2-handlers are full (but 
N1-handlers mostly empty)? I'm not sure if this is a situation that is likely 
to arise, or if the system will easily self-heal based on the backoff behavior. 
Maybe worth thinking about a little--not a blocking concern for me, more of a 
thought experiment.
{quote}
 Ideally it should not happen that all handlers of a specific router are busy 
while other handlers are completely free, since clients are expected to connect 
to routers in random order. In any case, from the beginning the design has 
focused on getting the system to self-heal as much as possible, so that traffic 
eventually evens out across all routers in a cluster.
{quote}The configuration for this seems like it will be really tricky to get 
right, particularly knowing how many fan-out handlers to allocate. I imagine as 
an administrator, my thought process would be like:
 I want 35% allocated to NN1 and 65% allocated to NN2, since NN2 is about 2x as 
loaded as NN1. This part is fairly intuitive.
 Then I encounter the fan-out configuration... What am I supposed to do with it?
 Are there perhaps any heuristics we can provide for reasonable values?
{quote}
Yes, configuration values are something users have to pay attention to, 
especially for concurrent calls. In the documentation sub-Jira HDFS-14558, I plan 
to write more about the concurrent calls and some points for users to focus on. 
Users may also need to change the configuration over time, based on new use 
cases, load on downstream clusters, etc.
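
To make the percentage reasoning concrete, here is a small arithmetic sketch in Java. It is not taken from the patch: the class name PermitAllocationSketch, the integer percentages and the floor-with-a-minimum-of-one rule are all assumptions chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: turn percentage shares into per-pool permit counts. */
class PermitAllocationSketch {
  /**
   * Splits totalHandlers across pools by integer percentage, rounding down.
   * Every pool gets at least one permit, so with very few handlers the sum
   * can exceed totalHandlers; a real implementation would need to decide
   * how to handle that case.
   */
  static Map<String, Integer> allocate(int totalHandlers,
                                       Map<String, Integer> percentByPool) {
    int sum = 0;
    for (int p : percentByPool.values()) {
      sum += p;
    }
    if (sum > 100) {
      throw new IllegalArgumentException("Percentages exceed 100: " + sum);
    }
    Map<String, Integer> result = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> e : percentByPool.entrySet()) {
      int permits = Math.max(1, totalHandlers * e.getValue() / 100);
      result.put(e.getKey(), permits);
    }
    return result;
  }
}
```

For example, with 10 handlers and a 35/65 split, NN1 gets 3 permits and NN2 gets 6, leaving one handler unassigned because of rounding. A pool for concurrent (fan-out) calls could be passed in as one more entry of the same map.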

[~aajisaka] [~brahmareddy] [~linyiqun] [~hexiaoqiao] FYI.

> RBF: Improved isolation for downstream name nodes.
> --
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, RBF_ Isolation design.pdf
>
>




[jira] [Comment Edited] (HDFS-14090) RBF: Improved isolation for downstream name nodes.

2019-08-15 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908758#comment-16908758
 ] 

CR Hota edited comment on HDFS-14090 at 8/16/19 5:58 AM:
-

[~xkrogen] [~elgoiri] Many thanks for the detailed reviews. Very helpful :) 
I have incorporated almost all the points you folks mentioned in 010.patch.

At a high level, the changes are:
 # "permit" is still the word being used.
 # One configuration controls the feature: {{NoFairnessPolicyController}} is a 
dummy, whereas {{StaticFairnessPolicyController}} is the fairness implementation.
 # The whole start-up will fail if fairness class loading has issues. Test 
cases are changed accordingly to reflect that.
 # {{NoPermitAvailableException}} is renamed to 
{{PermitLimitExceededException}}.

 

To [~xkrogen]'s observations:
{quote}I was considering the scenario where there are two routers R1 and R2, 
and two NameNodes N1 and N2. Assume most clients need to access both N1 and N2. 
What happens in the situation when all of R1's N1-handlers are full (but 
N2-handlers mostly empty), and all of R2's N2-handlers are full (but 
N1-handlers mostly empty)? I'm not sure if this is a situation that is likely 
to arise, or if the system will easily self-heal based on the backoff behavior. 
Maybe worth thinking about a little--not a blocking concern for me, more of a 
thought experiment.
{quote}
 Ideally it should not happen that all handlers of a specific router are busy 
while other handlers are completely free, since clients are expected to connect 
to routers in random order. In any case, from the beginning the design has 
focused on getting the system to self-heal as much as possible, so that traffic 
eventually evens out across all routers in a cluster.
{quote}The configuration for this seems like it will be really tricky to get 
right, particularly knowing how many fan-out handlers to allocate. I imagine as 
an administrator, my thought process would be like:
 I want 35% allocated to NN1 and 65% allocated to NN2, since NN2 is about 2x as 
loaded as NN1. This part is fairly intuitive.
 Then I encounter the fan-out configuration... What am I supposed to do with it?
 Are there perhaps any heuristics we can provide for reasonable values?
{quote}
Yes, configuration values are something users have to pay attention to, 
especially for concurrent calls. In the documentation sub-Jira HDFS-14558, I plan 
to write more about the concurrent calls and some points for users to focus on. 
Users may also need to change the configuration over time, based on new use 
cases, load on downstream clusters, etc.

[~aajisaka] [~brahmareddy] [~linyiqun] [~hexiaoqiao] FYI.



[jira] [Updated] (HDFS-14090) RBF: Improved isolation for downstream name nodes.

2019-08-15 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14090:
---
Attachment: HDFS-14090.010.patch

> RBF: Improved isolation for downstream name nodes.
> --
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, HDFS-14090.010.patch, RBF_ Isolation design.pdf
>
>






[jira] [Commented] (HDFS-8631) WebHDFS : Support setQuota

2019-08-15 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-8631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908496#comment-16908496
 ] 

CR Hota commented on HDFS-8631:
---

[~csun] Please ignore the TestRouter* test cases; they are tracked in HDFS-14609.

> WebHDFS : Support setQuota
> --
>
> Key: HDFS-8631
> URL: https://issues.apache.org/jira/browse/HDFS-8631
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 2.7.2
>Reporter: nijel
>Assignee: Chao Sun
>Priority: Major
> Attachments: HDFS-8631-001.patch, HDFS-8631-002.patch, 
> HDFS-8631-003.patch, HDFS-8631-004.patch, HDFS-8631-005.patch, 
> HDFS-8631-006.patch, HDFS-8631-007.patch, HDFS-8631-008.patch
>
>
> User is able to do quota management from the filesystem object. The same 
> operation can be allowed through the REST API.






[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes.

2019-08-13 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906462#comment-16906462
 ] 

CR Hota commented on HDFS-14090:


[~aajisaka] [~elgoiri] Might "Quota" confuse admins/readers/developers with the 
actual quota system present in router/hdfs?

> RBF: Improved isolation for downstream name nodes.
> --
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, RBF_ Isolation design.pdf
>
>






[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes.

2019-08-12 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905421#comment-16905421
 ] 

CR Hota commented on HDFS-14090:


[~xkrogen]  :)

> RBF: Improved isolation for downstream name nodes.
> --
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, RBF_ Isolation design.pdf
>
>






[jira] [Commented] (HDFS-13123) RBF: Add a balancer tool to move data across subcluster

2019-08-12 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905411#comment-16905411
 ] 

CR Hota commented on HDFS-13123:


[~hemanthboyina] Thanks for the initial patch. We may need a final design doc 
for this task that explains some of the points below.
 # How is atomicity in distcp taken into account here? If distcp fails, the 
destination cluster may have unused files lying around unaudited. Maybe the 
user can specify an atomicity flag through admin.
 # Will all the actual work be done by a common YARN queue belonging to 
"router", irrespective of the user?
 # How are multiple rebalancing operations going to work if executed? Should 
admin maintain state for which rebalancing operations are in progress and 
which have completed? Some basic auditing at least.
 # How does this rebalancing work interact with overall user quota management?
 # Rebalancing across secured clusters? etc.

 

> RBF: Add a balancer tool to move data across subcluster 
> 
>
> Key: HDFS-13123
> URL: https://issues.apache.org/jira/browse/HDFS-13123
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei Yan
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS Router-Based Federation Rebalancer.pdf, 
> HDFS-13123.patch
>
>
> Follow the discussion in HDFS-12615. This Jira is to track effort for 
> building a rebalancer tool, used by router-based federation to move data 
> among subclusters.






[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-12 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905391#comment-16905391
 ] 

CR Hota commented on HDFS-14609:


[~zhangchen] Thanks for the update. I appreciate you digging into this.

It may be a good idea to just work on trunk, stop looking into HDFS-13891, 
and see how these tests can be fixed. The filter work done in HADOOP-16314 and 
HADOOP-16354 includes some test case examples. You may want to take a look at 
those for reference.

 

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
>
> We worked on router-based federation security as part of HDFS-13532. We kept 
> it compatible with the way the namenode works. However, with HADOOP-16314 and 
> HADOOP-16354 in trunk, the auth filters seem to have been changed, causing 
> tests to fail.
> Appropriate changes are needed in RBF, mainly fixing the broken tests.






[jira] [Resolved] (HDFS-14715) RBF: Fix RBF failed tests

2019-08-09 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota resolved HDFS-14715.

Resolution: Duplicate

> RBF: Fix RBF failed tests
> -
>
> Key: HDFS-14715
> URL: https://issues.apache.org/jira/browse/HDFS-14715
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
>
> including:
> hadoop.hdfs.server.federation.router.TestRouterWithSecureStartup
> hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken






[jira] [Commented] (HDFS-14715) RBF: Fix RBF failed tests

2019-08-09 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904194#comment-16904194
 ] 

CR Hota commented on HDFS-14715:


[~elgoiri] Yeah, it's a duplicate.

[~zhangchen] I have assigned HDFS-14609 to you and will be happy to help you 
get it going.

> RBF: Fix RBF failed tests
> -
>
> Key: HDFS-14715
> URL: https://issues.apache.org/jira/browse/HDFS-14715
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
>
> including:
> hadoop.hdfs.server.federation.router.TestRouterWithSecureStartup
> hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken






[jira] [Assigned] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-09 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota reassigned HDFS-14609:
--

Assignee: Chen Zhang  (was: CR Hota)

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
>
> We worked on router-based federation security as part of HDFS-13532. We kept 
> it compatible with the way the namenode works. However, with HADOOP-16314 and 
> HADOOP-16354 in trunk, the auth filters seem to have been changed, causing 
> tests to fail.
> Appropriate changes are needed in RBF, mainly fixing the broken tests.






[jira] [Commented] (HDFS-14705) Remove unused configuration dfs.min.replication

2019-08-06 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901737#comment-16901737
 ] 

CR Hota commented on HDFS-14705:


[~jojochuang] Thanks for the review. Should we commit this?

> Remove unused configuration dfs.min.replication
> ---
>
> Key: HDFS-14705
> URL: https://issues.apache.org/jira/browse/HDFS-14705
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Assignee: CR Hota
>Priority: Trivial
> Attachments: HDFS-14705.001.patch
>
>
> A few HDFS tests set a configuration property dfs.min.replication. It is 
> not used anywhere in the code, and it doesn't seem like a leftover from 
> legacy code either. Better to clean it out. 






[jira] [Updated] (HDFS-14705) Remove unused configuration dfs.min.replication

2019-08-06 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14705:
---
Attachment: HDFS-14705.001.patch

> Remove unused configuration dfs.min.replication
> ---
>
> Key: HDFS-14705
> URL: https://issues.apache.org/jira/browse/HDFS-14705
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Assignee: CR Hota
>Priority: Trivial
> Attachments: HDFS-14705.001.patch
>
>
> A few HDFS tests set a configuration property dfs.min.replication. It is 
> not used anywhere in the code, and it doesn't seem like a leftover from 
> legacy code either. Better to clean it out. 






[jira] [Updated] (HDFS-14705) Remove unused configuration dfs.min.replication

2019-08-06 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14705:
---
Status: Patch Available  (was: Open)

> Remove unused configuration dfs.min.replication
> ---
>
> Key: HDFS-14705
> URL: https://issues.apache.org/jira/browse/HDFS-14705
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Assignee: CR Hota
>Priority: Trivial
> Attachments: HDFS-14705.001.patch
>
>
> A few HDFS tests set a configuration property dfs.min.replication. It is 
> not used anywhere in the code, and it doesn't seem like a leftover from 
> legacy code either. Better to clean it out. 






[jira] [Commented] (HDFS-14705) Remove unused configuration dfs.min.replication

2019-08-06 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901438#comment-16901438
 ] 

CR Hota commented on HDFS-14705:


[~jojochuang] Found this in just 2 tests and yes, as you said, it's not being 
used. Not sure why it was added in the first place. Can you help take a look at 
the patch?

> Remove unused configuration dfs.min.replication
> ---
>
> Key: HDFS-14705
> URL: https://issues.apache.org/jira/browse/HDFS-14705
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Assignee: CR Hota
>Priority: Trivial
> Attachments: HDFS-14705.001.patch
>
>
> A few HDFS tests set a configuration property dfs.min.replication. It is 
> not used anywhere in the code, and it doesn't seem like a leftover from 
> legacy code either. Better to clean it out. 






[jira] [Assigned] (HDFS-14705) Remove unused configuration dfs.min.replication

2019-08-06 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota reassigned HDFS-14705:
--

Assignee: CR Hota

> Remove unused configuration dfs.min.replication
> ---
>
> Key: HDFS-14705
> URL: https://issues.apache.org/jira/browse/HDFS-14705
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Assignee: CR Hota
>Priority: Trivial
>
> A few HDFS tests set a configuration property dfs.min.replication. It is 
> not used anywhere in the code, and it doesn't seem like a leftover from 
> legacy code either. Better to clean it out. 






[jira] [Commented] (HDFS-14705) Remove unused configuration dfs.min.replication

2019-08-06 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901332#comment-16901332
 ] 

CR Hota commented on HDFS-14705:


[~jojochuang] Thanks for creating this. May I assign this to myself? It will 
be good to know how this property has been moving around historically.

> Remove unused configuration dfs.min.replication
> ---
>
> Key: HDFS-14705
> URL: https://issues.apache.org/jira/browse/HDFS-14705
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Priority: Trivial
>
> A few HDFS tests set a configuration property dfs.min.replication. It is 
> not used anywhere in the code, and it doesn't seem like a leftover from 
> legacy code either. Better to clean it out. 






[jira] [Commented] (HDFS-14704) RBF: NnId should not be null in NamenodeHeartbeatService

2019-08-06 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901300#comment-16901300
 ] 

CR Hota commented on HDFS-14704:


[~xuzq_zander] Thanks for the comment.

The reason I think it's better to put the null check outside of the method is 
to avoid redundant null checks and to let method callers decide what params to 
pass. It also helps fail fast.

createLocalNamenodeHeartbeatService already has a check for nnId == null; the 
code can use this existing check and not call createNamenodeHeartbeatService 
if nnId is null.

> RBF: NnId should not be null in NamenodeHeartbeatService
> 
>
> Key: HDFS-14704
> URL: https://issues.apache.org/jira/browse/HDFS-14704
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
> Attachments: HDFS-14704-trunk-001.patch
>
>
> NnId should not be null in NamenodeHeartbeatService.
> If NnId is null, it will also print the error message like:
> {code:java}
> 2019-08-06 10:38:07,455 ERROR router.NamenodeHeartbeatService 
> (NamenodeHeartbeatService.java:updateState(229)) - Unhandled exception 
> updating NN registration for ns1:null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.federation.protocol.proto.HdfsServerFederationProtos$NamenodeMembershipRecordProto$Builder.setServiceAddress(HdfsServerFederationProtos.java:3831)
> at 
> org.apache.hadoop.hdfs.server.federation.store.records.impl.pb.MembershipStatePBImpl.setServiceAddress(MembershipStatePBImpl.java:119)
> at 
> org.apache.hadoop.hdfs.server.federation.store.records.MembershipState.newInstance(MembershipState.java:108)
> at 
> org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.registerNamenode(MembershipNamenodeResolver.java:267)
> at 
> org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.updateState(NamenodeHeartbeatService.java:223)
> at 
> org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.periodicInvoke(NamenodeHeartbeatService.java:159)
> at 
> org.apache.hadoop.hdfs.server.federation.router.PeriodicService$1.run(PeriodicService.java:178)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}






[jira] [Updated] (HDFS-14704) RBF: NnId should not be null in NamenodeHeartbeatService

2019-08-06 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14704:
---
Issue Type: Sub-task  (was: Improvement)
Parent: HDFS-14603

> RBF: NnId should not be null in NamenodeHeartbeatService
> 
>
> Key: HDFS-14704
> URL: https://issues.apache.org/jira/browse/HDFS-14704
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
> Attachments: HDFS-14704-trunk-001.patch
>
>
> NnId should not be null in NamenodeHeartbeatService.
> If NnId is null, it will also print the error message like:
> {code:java}
> 2019-08-06 10:38:07,455 ERROR router.NamenodeHeartbeatService 
> (NamenodeHeartbeatService.java:updateState(229)) - Unhandled exception 
> updating NN registration for ns1:null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.federation.protocol.proto.HdfsServerFederationProtos$NamenodeMembershipRecordProto$Builder.setServiceAddress(HdfsServerFederationProtos.java:3831)
> at 
> org.apache.hadoop.hdfs.server.federation.store.records.impl.pb.MembershipStatePBImpl.setServiceAddress(MembershipStatePBImpl.java:119)
> at 
> org.apache.hadoop.hdfs.server.federation.store.records.MembershipState.newInstance(MembershipState.java:108)
> at 
> org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.registerNamenode(MembershipNamenodeResolver.java:267)
> at 
> org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.updateState(NamenodeHeartbeatService.java:223)
> at 
> org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.periodicInvoke(NamenodeHeartbeatService.java:159)
> at 
> org.apache.hadoop.hdfs.server.federation.router.PeriodicService$1.run(PeriodicService.java:178)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}






[jira] [Updated] (HDFS-14704) RBF: NnId should not be null in NamenodeHeartbeatService

2019-08-06 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14704:
---
Summary: RBF: NnId should not be null in NamenodeHeartbeatService  (was: 
RBF:NnId should not be null in NamenodeHeartbeatService)

> RBF: NnId should not be null in NamenodeHeartbeatService
> 
>
> Key: HDFS-14704
> URL: https://issues.apache.org/jira/browse/HDFS-14704
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
> Attachments: HDFS-14704-trunk-001.patch
>
>
> NnId should not be null in NamenodeHeartbeatService.
> If NnId is null, it will also print the error message like:
> {code:java}
> 2019-08-06 10:38:07,455 ERROR router.NamenodeHeartbeatService 
> (NamenodeHeartbeatService.java:updateState(229)) - Unhandled exception 
> updating NN registration for ns1:null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.federation.protocol.proto.HdfsServerFederationProtos$NamenodeMembershipRecordProto$Builder.setServiceAddress(HdfsServerFederationProtos.java:3831)
> at 
> org.apache.hadoop.hdfs.server.federation.store.records.impl.pb.MembershipStatePBImpl.setServiceAddress(MembershipStatePBImpl.java:119)
> at 
> org.apache.hadoop.hdfs.server.federation.store.records.MembershipState.newInstance(MembershipState.java:108)
> at 
> org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.registerNamenode(MembershipNamenodeResolver.java:267)
> at 
> org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.updateState(NamenodeHeartbeatService.java:223)
> at 
> org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.periodicInvoke(NamenodeHeartbeatService.java:159)
> at 
> org.apache.hadoop.hdfs.server.federation.router.PeriodicService$1.run(PeriodicService.java:178)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}






[jira] [Assigned] (HDFS-14704) RBF:NnId should not be null in NamenodeHeartbeatService

2019-08-06 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota reassigned HDFS-14704:
--

Assignee: xuzq

> RBF:NnId should not be null in NamenodeHeartbeatService
> ---
>
> Key: HDFS-14704
> URL: https://issues.apache.org/jira/browse/HDFS-14704
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
> Attachments: HDFS-14704-trunk-001.patch
>
>
> NnId should not be null in NamenodeHeartbeatService.
> If NnId is null, it will also print the error message like:
> {code:java}
> 2019-08-06 10:38:07,455 ERROR router.NamenodeHeartbeatService 
> (NamenodeHeartbeatService.java:updateState(229)) - Unhandled exception 
> updating NN registration for ns1:null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.federation.protocol.proto.HdfsServerFederationProtos$NamenodeMembershipRecordProto$Builder.setServiceAddress(HdfsServerFederationProtos.java:3831)
> at 
> org.apache.hadoop.hdfs.server.federation.store.records.impl.pb.MembershipStatePBImpl.setServiceAddress(MembershipStatePBImpl.java:119)
> at 
> org.apache.hadoop.hdfs.server.federation.store.records.MembershipState.newInstance(MembershipState.java:108)
> at 
> org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.registerNamenode(MembershipNamenodeResolver.java:267)
> at 
> org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.updateState(NamenodeHeartbeatService.java:223)
> at 
> org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.periodicInvoke(NamenodeHeartbeatService.java:159)
> at 
> org.apache.hadoop.hdfs.server.federation.router.PeriodicService$1.run(PeriodicService.java:178)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}






[jira] [Commented] (HDFS-14704) RBF:NnId should not be null in NamenodeHeartbeatService

2019-08-06 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900671#comment-16900671
 ] 

CR Hota commented on HDFS-14704:


[~xuzq_zander] Thanks for reporting this and for the patch. It may be good to 
add the check before the method is called. The nsId check is already present; 
the nnId check can be combined with it. It would look something like below.
{code:java}
if (nsId != null && nnId != null) {
  NamenodeHeartbeatService heartbeatService =
      createNamenodeHeartbeatService(nsId, nnId);
  if (heartbeatService != null) {
    ret.put(heartbeatService.getNamenodeDesc(), heartbeatService);
  }
}
{code}
Can you also add a test for this?
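
For illustration only, here is a minimal self-contained sketch of the caller-side check and the kind of assertion a test could make. FakeHeartbeatService and createServices are hypothetical stand-ins rather than the actual Router classes, and the "nsId-nnId" description format is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; none of these names come from the Router code. */
class HeartbeatGuardSketch {

  /** Hypothetical stand-in for NamenodeHeartbeatService. */
  static final class FakeHeartbeatService {
    private final String nsId;
    private final String nnId;

    FakeHeartbeatService(String nsId, String nnId) {
      this.nsId = nsId;
      this.nnId = nnId;
    }

    /** Assumed description format, used only as the map key here. */
    String getNamenodeDesc() {
      return nsId + "-" + nnId;
    }
  }

  /**
   * Builds one service per (nsId, nnId) pair, skipping any pair in which
   * either identifier is null, so the service never sees a null nnId.
   */
  static Map<String, FakeHeartbeatService> createServices(String[][] pairs) {
    Map<String, FakeHeartbeatService> ret = new HashMap<>();
    for (String[] pair : pairs) {
      String nsId = pair[0];
      String nnId = pair[1];
      if (nsId != null && nnId != null) {
        FakeHeartbeatService service = new FakeHeartbeatService(nsId, nnId);
        ret.put(service.getNamenodeDesc(), service);
      }
    }
    return ret;
  }

  public static void main(String[] args) {
    Map<String, FakeHeartbeatService> services = createServices(
        new String[][] {{"ns0", "nn0"}, {"ns1", null}, {null, "nn1"}});
    // Only the pair with both identifiers set is registered.
    System.out.println(services.keySet());
  }
}
```

A real test would presumably start a Router with a monitored namenode whose nnId is unset and assert that no heartbeat service is created for it; the sketch above only mirrors that logic in isolation.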

> RBF:NnId should not be null in NamenodeHeartbeatService
> ---
>
> Key: HDFS-14704
> URL: https://issues.apache.org/jira/browse/HDFS-14704
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: xuzq
>Priority: Major
> Attachments: HDFS-14704-trunk-001.patch
>
>
> NnId should not be null in NamenodeHeartbeatService.
> If NnId is null, it will also print the error message like:
> {code:java}
> 2019-08-06 10:38:07,455 ERROR router.NamenodeHeartbeatService 
> (NamenodeHeartbeatService.java:updateState(229)) - Unhandled exception 
> updating NN registration for ns1:null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.federation.protocol.proto.HdfsServerFederationProtos$NamenodeMembershipRecordProto$Builder.setServiceAddress(HdfsServerFederationProtos.java:3831)
> at 
> org.apache.hadoop.hdfs.server.federation.store.records.impl.pb.MembershipStatePBImpl.setServiceAddress(MembershipStatePBImpl.java:119)
> at 
> org.apache.hadoop.hdfs.server.federation.store.records.MembershipState.newInstance(MembershipState.java:108)
> at 
> org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.registerNamenode(MembershipNamenodeResolver.java:267)
> at 
> org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.updateState(NamenodeHeartbeatService.java:223)
> at 
> org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.periodicInvoke(NamenodeHeartbeatService.java:159)
> at 
> org.apache.hadoop.hdfs.server.federation.router.PeriodicService$1.run(PeriodicService.java:178)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}






[jira] [Commented] (HDFS-14702) Datanode.ReplicaMap memory leak

2019-08-05 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900467#comment-16900467
 ] 

CR Hota commented on HDFS-14702:


[~hexiaoqiao] Thanks for reporting this.

Can you try backporting HDFS-8859 to the 2.7.1 installation you have and let 
us know what the heap dump looks like for the data node?

> Datanode.ReplicaMap memory leak
> ---
>
> Key: HDFS-14702
> URL: https://issues.apache.org/jira/browse/HDFS-14702
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1
>Reporter: He Xiaoqiao
>Priority: Major
>
> DataNode memory is occupied by ReplicaMaps, which causes frequent GC and 
> then degrades write performance.
> There are about 600K block replicas on the DataNode, but a heap dump shows 
> over 8M items in ReplicaMaps with a footprint of over 500MB. It looks like a 
> memory leak. One more observation: the block read/write op rate is very high.
> I have not tested HDFS-8859 and have no idea whether it can solve this issue.






[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes.

2019-07-31 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897621#comment-16897621
 ] 

CR Hota commented on HDFS-14090:


[~aajisaka] [~elgoiri] [~fengnanli] [~linyiqun] [~hexiaoqiao] Thanks for the 
previous reviews. Took care of most of the comments in 009.patch. Below are 3 
things that are still not changed.
 # I could not remove the log line from assignHandlersToNameservices before 
the exception is thrown, since there is no other way to test how the 
assignment fails when the instance is created. It is mainly there to test the 
root cause and error message.
 # Did not add AssertJ APIs; the dependencies need to be added in the pom and 
various places changed. I suggest we introduce the AssertJ API in a separate 
Jira for easier review. Will create one.
 # Concurrent permits won't have any configured default value. Users are 
expected to specify the values based on the load of the clusters. The value 
assigned by default is actually the total number of threads divided by (the 
number of nameservices + 1), computed at run time.
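
To make the arithmetic in the last point concrete, here is a minimal sketch of that run-time split, assuming one semaphore per nameservice plus one extra share. The class, method, and key names are hypothetical and are not taken from the patch; the purpose of the extra share (calls not tied to a single nameservice) is also an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Semaphore;

/** Illustrative sketch only; names are hypothetical, not from the patch. */
class PermitSplitSketch {

  /** Hypothetical key for the extra share not tied to one nameservice. */
  static final String EXTRA_SHARE = "extra";

  /** Splits handlerCount evenly across the nameservices plus one extra share. */
  static Map<String, Semaphore> assignPermits(int handlerCount,
      List<String> nameservices) {
    int perShare = handlerCount / (nameservices.size() + 1);
    if (perShare < 1) {
      throw new IllegalArgumentException(
          "Not enough handlers (" + handlerCount + ") for "
              + nameservices.size() + " nameservices");
    }
    Map<String, Semaphore> permits = new LinkedHashMap<>();
    for (String ns : nameservices) {
      permits.put(ns, new Semaphore(perShare));
    }
    permits.put(EXTRA_SHARE, new Semaphore(perShare));
    return permits;
  }

  /** Returns true if a call to the given nameservice may proceed right now. */
  static boolean tryAcquire(Map<String, Semaphore> permits, String ns) {
    Semaphore semaphore = permits.get(ns);
    return semaphore != null && semaphore.tryAcquire();
  }

  /** Gives a permit back once the downstream call has finished. */
  static void release(Map<String, Semaphore> permits, String ns) {
    Semaphore semaphore = permits.get(ns);
    if (semaphore != null) {
      semaphore.release();
    }
  }

  public static void main(String[] args) {
    Map<String, Semaphore> permits =
        assignPermits(10, Arrays.asList("ns0", "ns1"));
    // 10 handlers / (2 nameservices + 1) = 3 permits per share.
    System.out.println(permits.get("ns0").availablePermits());
  }
}
```

With integer division, any remainder (1 handler in the example) is left unassigned in this sketch. Once a nameservice has used up its permits, further calls to it are rejected while calls to other nameservices still proceed, which is the isolation the ticket is after.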

 

> RBF: Improved isolation for downstream name nodes.
> --
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, RBF_ Isolation design.pdf
>
>
> The router is a gateway to the underlying name nodes. A gateway architecture 
> should help minimize the impact that unhealthy clusters have on clients 
> connecting to healthy clusters.
> For example, if there are 2 name nodes downstream and one of them is heavily 
> loaded, with calls spiking RPC queue times, back pressure will cause the same 
> to start reflecting on the router. As a result, clients connecting to 
> healthy/faster name nodes will also slow down, because the same RPC queue is 
> maintained for all calls at the router layer. Essentially, the same IPC 
> thread pool is used by the router to connect to all name nodes.
> Currently the router uses a single RPC queue for all calls. Let's discuss how 
> we can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from the current call queue, immediately identify 
> the downstream name node, and maintain a separate queue for each underlying 
> name node. Another, simpler way is to maintain some sort of rate limiter 
> configured for each name node and let routers drop/reject/return errors for 
> requests after a certain threshold. 
> This won’t be a simple change, as the router’s ‘Server’ layer would need 
> redesign and implementation. Currently this layer is the same as the name 
> node's.
> Opening this ticket to discuss, design and implement this feature.
>  






[jira] [Updated] (HDFS-14090) RBF: Improved isolation for downstream name nodes.

2019-07-31 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14090:
---
Attachment: HDFS-14090.009.patch

> RBF: Improved isolation for downstream name nodes.
> --
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, 
> HDFS-14090.009.patch, RBF_ Isolation design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures should 
> help minimize the impact of clients connecting to healthy clusters vs 
> unhealthy clusters.
> For example, if there are 2 name nodes downstream and one of them is 
> heavily loaded with calls spiking rpc queue times, then due to back pressure 
> the same will start reflecting on the router. As a result, clients 
> connecting to healthy/faster name nodes will also slow down, as the same rpc 
> queue is maintained for all calls at the router layer. Essentially the same 
> IPC thread pool is used by the router to connect to all name nodes.
> Currently the router uses one single rpc queue for all calls. Let's discuss 
> how we can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from current call queue, immediately identify 
> downstream name node and maintain a separate queue for each underlying name 
> node. Another simpler way is to maintain some sort of rate limiter configured 
> for each name node and let routers drop/reject/send error requests after 
> certain threshold. 
> This won’t be a simple change as router’s ‘Server’ layer would need redesign 
> and implementation. Currently this layer is the same as name node.
> Opening this ticket to discuss, design and implement this feature.
>  






[jira] [Updated] (HDFS-14090) RBF: Improved isolation for downstream name nodes.

2019-07-31 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14090:
---
Attachment: HDFS-14090.008.patch

> RBF: Improved isolation for downstream name nodes.
> --
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch, RBF_ 
> Isolation design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures should 
> help minimize the impact of clients connecting to healthy clusters vs 
> unhealthy clusters.
> For example, if there are 2 name nodes downstream and one of them is 
> heavily loaded with calls spiking rpc queue times, then due to back pressure 
> the same will start reflecting on the router. As a result, clients 
> connecting to healthy/faster name nodes will also slow down, as the same rpc 
> queue is maintained for all calls at the router layer. Essentially the same 
> IPC thread pool is used by the router to connect to all name nodes.
> Currently the router uses one single rpc queue for all calls. Let's discuss 
> how we can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from current call queue, immediately identify 
> downstream name node and maintain a separate queue for each underlying name 
> node. Another simpler way is to maintain some sort of rate limiter configured 
> for each name node and let routers drop/reject/send error requests after 
> certain threshold. 
> This won’t be a simple change as router’s ‘Server’ layer would need redesign 
> and implementation. Currently this layer is the same as name node.
> Opening this ticket to discuss, design and implement this feature.
>  






[jira] [Comment Edited] (HDFS-14678) Allow triggerBlockReport to a specific namenode

2019-07-29 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895754#comment-16895754
 ] 

CR Hota edited comment on HDFS-14678 at 7/30/19 4:16 AM:
-

[~LeonG] This is a very important issue. Thanks for creating the ticket. 
Looking forward to the fix.


was (Author: crh):
[~LeonG] This is a very important. Thanks for creating the ticket. Looking 
forward to the fix.

> Allow triggerBlockReport to a specific namenode
> ---
>
> Key: HDFS-14678
> URL: https://issues.apache.org/jira/browse/HDFS-14678
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.2
>Reporter: Leon Gao
>Assignee: Leon Gao
>Priority: Minor
>
> In our largest prod cluster (running 2.8.2) we have >3k hosts. Every time 
> we do a rolling restart of the NNs we need to wait for block reports, which 
> takes >2.5 hours for each NN.
> One way to make it faster is to manually trigger a full block report from all 
> datanodes (see [HDFS-7278|https://issues.apache.org/jira/browse/HDFS-7278]). 
> However, the current triggerBlockReport command triggers a block report 
> to all NNs, which floods the active NN as well.
> A quick solution would be to add an option to specify the NN that the manually 
> triggered block report will go to, something like:
> *_hdfs dfsadmin [-triggerBlockReport [-incremental] ] 
> [-namenode] _*
> So when doing a restart of standby NN or observer NN we can trigger an 
> aggressive block report to a specific NN to exit safemode faster without 
> risking active NN performance.






[jira] [Commented] (HDFS-14678) Allow triggerBlockReport to a specific namenode

2019-07-29 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895754#comment-16895754
 ] 

CR Hota commented on HDFS-14678:


[~LeonG] This is a very important. Thanks for creating the ticket. Looking 
forward to the fix.

> Allow triggerBlockReport to a specific namenode
> ---
>
> Key: HDFS-14678
> URL: https://issues.apache.org/jira/browse/HDFS-14678
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.2
>Reporter: Leon Gao
>Assignee: Leon Gao
>Priority: Minor
>
> In our largest prod cluster (running 2.8.2) we have >3k hosts. Every time 
> we do a rolling restart of the NNs we need to wait for block reports, which 
> takes >2.5 hours for each NN.
> One way to make it faster is to manually trigger a full block report from all 
> datanodes (see [HDFS-7278|https://issues.apache.org/jira/browse/HDFS-7278]). 
> However, the current triggerBlockReport command triggers a block report 
> to all NNs, which floods the active NN as well.
> A quick solution would be to add an option to specify the NN that the manually 
> triggered block report will go to, something like:
> *_hdfs dfsadmin [-triggerBlockReport [-incremental] ] 
> [-namenode] _*
> So when doing a restart of standby NN or observer NN we can trigger an 
> aggressive block report to a specific NN to exit safemode faster without 
> risking active NN performance.






[jira] [Updated] (HDFS-14558) RBF: Isolation/Fairness documentation

2019-07-29 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14558:
---
Status: Patch Available  (was: Open)

> RBF: Isolation/Fairness documentation
> -
>
> Key: HDFS-14558
> URL: https://issues.apache.org/jira/browse/HDFS-14558
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14558.001.patch
>
>
> Documentation is needed to make users aware of this feature HDFS-14090.






[jira] [Updated] (HDFS-14558) RBF: Isolation/Fairness documentation

2019-07-29 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14558:
---
Attachment: HDFS-14558.001.patch

> RBF: Isolation/Fairness documentation
> -
>
> Key: HDFS-14558
> URL: https://issues.apache.org/jira/browse/HDFS-14558
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14558.001.patch
>
>
> Documentation is needed to make users aware of this feature HDFS-14090.






[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes.

2019-07-29 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895503#comment-16895503
 ] 

CR Hota commented on HDFS-14090:


[~brahmareddy] [~aajisaka] [~linyiqun] [~hexiaoqiao] 

Gentle ping. Please help review 007.patch. I am thinking of adding this feature 
to the 3.3 release.

> RBF: Improved isolation for downstream name nodes.
> --
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, RBF_ Isolation design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures should 
> help minimize the impact of clients connecting to healthy clusters vs 
> unhealthy clusters.
> For example, if there are 2 name nodes downstream and one of them is 
> heavily loaded with calls spiking rpc queue times, then due to back pressure 
> the same will start reflecting on the router. As a result, clients 
> connecting to healthy/faster name nodes will also slow down, as the same rpc 
> queue is maintained for all calls at the router layer. Essentially the same 
> IPC thread pool is used by the router to connect to all name nodes.
> Currently the router uses one single rpc queue for all calls. Let's discuss 
> how we can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from current call queue, immediately identify 
> downstream name node and maintain a separate queue for each underlying name 
> node. Another simpler way is to maintain some sort of rate limiter configured 
> for each name node and let routers drop/reject/send error requests after 
> certain threshold. 
> This won’t be a simple change as router’s ‘Server’ layer would need redesign 
> and implementation. Currently this layer is the same as name node.
> Opening this ticket to discuss, design and implement this feature.
>  






[jira] [Commented] (HDFS-12748) NameNode memory leak when accessing webhdfs GETHOMEDIRECTORY

2019-07-25 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893127#comment-16893127
 ] 

CR Hota commented on HDFS-12748:


[~cheersyang] We deployed this change in our clusters and it helped resolve the 
NN memory leak issue. I can work on the GETFILEBLOCKLOCATIONS issue on branch-2. 
Was a new ticket created? I can create one if not.

> NameNode memory leak when accessing webhdfs GETHOMEDIRECTORY
> 
>
> Key: HDFS-12748
> URL: https://issues.apache.org/jira/browse/HDFS-12748
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>Assignee: Weiwei Yang
>Priority: Major
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-12748-branch-3.1.01.patch, HDFS-12748.001.patch, 
> HDFS-12748.002.patch, HDFS-12748.003.patch, HDFS-12748.004.patch, 
> HDFS-12748.005.patch
>
>
> In our production environment, the standby NN often does full GC. Through MAT 
> we found that the largest object is FileSystem$Cache, which contains 7,844,890 
> DistributedFileSystem instances.
> Viewing the call hierarchy of FileSystem.get(), I found that only 
> NamenodeWebHdfsMethods#get calls FileSystem.get(). I don't know why a 
> different DistributedFileSystem is created every time instead of getting a 
> FileSystem from the cache.
> {code:java}
> case GETHOMEDIRECTORY: {
>   final String js = JsonUtil.toJsonString("Path",
>       FileSystem.get(conf != null ? conf : new Configuration())
>           .getHomeDirectory().toUri().getPath());
>   return Response.ok(js).type(MediaType.APPLICATION_JSON).build();
> }
> {code}
> When we close the FileSystem in GETHOMEDIRECTORY, the NN no longer does full GC.
> {code:java}
> case GETHOMEDIRECTORY: {
>   FileSystem fs = null;
>   try {
>     fs = FileSystem.get(conf != null ? conf : new Configuration());
>     final String js = JsonUtil.toJsonString("Path",
>         fs.getHomeDirectory().toUri().getPath());
>     return Response.ok(js).type(MediaType.APPLICATION_JSON).build();
>   } finally {
>     if (fs != null) {
>       fs.close();
>     }
>   }
> }
> {code}
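
One possible explanation for the cache growth, offered as an assumption rather
than something established in this thread: the FileSystem cache key includes
the calling UserGroupInformation, and UGI equality is identity-based, so a
request that runs under a freshly created UGI would miss the cache and add a
new entry. The self-contained model below uses hypothetical stand-in classes
(Ugi, CacheKey, FsCache, CacheGrowthDemo), not the Hadoop classes, to show the
effect.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Stand-in for a caller identity; equals/hashCode are NOT overridden on purpose. */
final class Ugi {
  final String user;
  Ugi(String user) { this.user = user; }
}

/** Stand-in cache key: scheme, authority, and the caller identity. */
final class CacheKey {
  private final String scheme;
  private final String authority;
  private final Ugi ugi;

  CacheKey(String scheme, String authority, Ugi ugi) {
    this.scheme = scheme;
    this.authority = authority;
    this.ugi = ugi;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof CacheKey)) {
      return false;
    }
    CacheKey other = (CacheKey) o;
    // Identity comparison on ugi: same user name is not enough to match.
    return scheme.equals(other.scheme)
        && authority.equals(other.authority)
        && ugi == other.ugi;
  }

  @Override
  public int hashCode() {
    return Objects.hash(scheme, authority) + System.identityHashCode(ugi);
  }
}

/** Stand-in cache that never evicts; entries leave only on an explicit close. */
final class FsCache {
  private final Map<CacheKey, Object> map = new HashMap<>();

  Object get(String scheme, String authority, Ugi ugi) {
    return map.computeIfAbsent(new CacheKey(scheme, authority, ugi), k -> new Object());
  }

  int size() {
    return map.size();
  }
}

final class CacheGrowthDemo {
  /** Simulates 'requests' lookups and returns how many entries the cache holds. */
  static int entriesAfter(int requests, boolean reuseUgi) {
    FsCache cache = new FsCache();
    Ugi shared = new Ugi("alice");
    for (int i = 0; i < requests; i++) {
      Ugi ugi = reuseUgi ? shared : new Ugi("alice");
      cache.get("hdfs", "nn1:8020", ugi);
    }
    return cache.size();
  }

  public static void main(String[] args) {
    System.out.println(entriesAfter(1000, true));  // 1
    System.out.println(entriesAfter(1000, false)); // 1000
  }
}
```

Under this model, closing the instance after each request (as in the second
snippet of the description) removes the entry again, which would match the
observation that the full GCs stop.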






[jira] [Updated] (HDFS-14670) RBF: Create secret manager instance using FederationUtil#newInstance.

2019-07-25 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14670:
---
Description: Since HDFS-14577 is done, as discussed in that ticket, security 
and isolation work will use this. This ticket is tracking the work around 
security class instantiation.  (was: Since HDFS-14577 is done, as discussed in 
that ticket, security and isolation work will use this. This ticket is tracking 
the work for around security class instantiation.)

> RBF: Create secret manager instance using FederationUtil#newInstance.
> -
>
> Key: HDFS-14670
> URL: https://issues.apache.org/jira/browse/HDFS-14670
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
>
> Since HDFS-14577 is done, as discussed in that ticket, security and isolation 
> work will use this. This ticket is tracking the work around security class 
> instantiation.






[jira] [Commented] (HDFS-14670) RBF: Create secret manager instance using FederationUtil#newInstance.

2019-07-25 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893113#comment-16893113
 ] 

CR Hota commented on HDFS-14670:


[~ayushtkn] [~elgoiri] Created PR for this.

[https://github.com/apache/hadoop/pull/1162]

> RBF: Create secret manager instance using FederationUtil#newInstance.
> -
>
> Key: HDFS-14670
> URL: https://issues.apache.org/jira/browse/HDFS-14670
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
>
> Since HDFS-14577 is done, as discussed in that ticket, security and isolation 
> work will use this. This ticket is tracking the work around security class 
> instantiation.






[jira] [Created] (HDFS-14670) RBF: Create secret manager instance using FederationUtil#newInstance.

2019-07-25 Thread CR Hota (JIRA)
CR Hota created HDFS-14670:
--

 Summary: RBF: Create secret manager instance using 
FederationUtil#newInstance.
 Key: HDFS-14670
 URL: https://issues.apache.org/jira/browse/HDFS-14670
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: CR Hota
Assignee: CR Hota


Since HDFS-14577 is done, as discussed in that ticket, security and isolation 
work will use this. This ticket is tracking the work for around security class 
instantiation.






[jira] [Commented] (HDFS-14603) Über-JIRA: HDFS RBF stabilization phase II

2019-07-25 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893019#comment-16893019
 ] 

CR Hota commented on HDFS-14603:


All Developers,

For the next JIRAs, please use PRs from the beginning.
Anything that already started as patches, let's keep it as is.

> Über-JIRA: HDFS RBF stabilization phase II
> --
>
> Key: HDFS-14603
> URL: https://issues.apache.org/jira/browse/HDFS-14603
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Brahma Reddy Battula
>Priority: Major
>
> To track the pending issues and any new issues after HDFS-13891 (also for 
> grouping all the RBF issues, which will make tracking/maintenance easier).






[jira] [Commented] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test

2019-07-25 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893016#comment-16893016
 ] 

CR Hota commented on HDFS-14461:


[~hexiaoqiao] Thanks for working on this.

There are a couple of things. Sleep may not be the best way to solve this. What 
we should look at is that there are many places in current hadoop which use 
minikdc. The main difference that I see is that for each hadoop test the 
minikdc is instantiated and shut down immediately. In the case of router tests, 
it's static (due to SecurityUtil) and is not shut down after the test finishes. 
If multiple tests are run in the same jvm this will cause issues, and that's 
what was happening based on what I had seen earlier when I opened the ticket.

We may want to follow similar logic of creating and destroying minikdc as all 
other tests.
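
A rough sketch of the two lifecycles contrasted above, using hypothetical
stand-in classes (KdcRegistry, FakeKdc, KdcLifecycleDemo) rather than the real
MiniKdc API. It only models how many KDC instances are still alive in the JVM
after a test class finishes; the keytab and login steps are left as comments.

```java
/** Hypothetical stand-in: tracks how many KDC instances are alive in this JVM. */
final class KdcRegistry {
  int running;
}

/** Hypothetical stand-in for a test KDC; not the MiniKdc API. */
final class FakeKdc implements AutoCloseable {
  private final KdcRegistry registry;
  private boolean alive;

  FakeKdc(KdcRegistry registry) {
    this.registry = registry;
    this.alive = true;
    registry.running++;
  }

  @Override
  public void close() {
    // Guard so that a double close does not decrement twice.
    if (alive) {
      alive = false;
      registry.running--;
    }
  }
}

final class KdcLifecycleDemo {
  /** Each test creates its own KDC and stops it when the test ends. */
  static int runWithPerTestKdc(KdcRegistry registry, int tests) {
    for (int i = 0; i < tests; i++) {
      try (FakeKdc kdc = new FakeKdc(registry)) {
        // test body: create keytab, log in, exercise the router
      }
    }
    return registry.running;
  }

  /** One KDC is shared by the whole test class and never stopped. */
  static int runWithLeakedKdc(KdcRegistry registry, int tests) {
    FakeKdc shared = new FakeKdc(registry);
    for (int i = 0; i < tests; i++) {
      // test body reuses 'shared'
    }
    return registry.running; // 'shared' is still alive here
  }

  public static void main(String[] args) {
    KdcRegistry jvm = new KdcRegistry();
    System.out.println(runWithPerTestKdc(jvm, 3)); // 0
    System.out.println(runWithLeakedKdc(jvm, 3));  // 1
    System.out.println(runWithLeakedKdc(jvm, 3));  // 2
  }
}
```

In the second variant every test class run in the same JVM leaves one more
instance behind, which is the situation in which a later class could end up
talking to a KDC that does not match its freshly written keytab.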

[~elgoiri] Am going to put your comments about PR/patch in the parent ticket 
for reference.

> RBF: Fix intermittently failing kerberos related unit test
> --
>
> Key: HDFS-14461
> URL: https://issues.apache.org/jira/browse/HDFS-14461
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14461.001.patch
>
>
> TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It 
> may be due to some race condition before using the keytab that's created for 
> testing.
>  
> {code:java}
>  Failed
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken
>  Failing for the past 1 build (Since 
> [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png! 
> #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] )
>  [Took 89 
> ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history]
>   
>  Error Message
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED
> h3. Stacktrace
> org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED
>  at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>  at org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.setup(TestRouterHttpDelegationToken.java:99)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
>  at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>  at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>  at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>  at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>  at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>  at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>  at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>  at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>  at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>  at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>  at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>  at 
> 

[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes.

2019-07-24 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892252#comment-16892252
 ] 

CR Hota commented on HDFS-14090:


Thanks [~elgoiri] for the review. Will wait for others to also help review and 
then take a stab back at the patch.

> RBF: Improved isolation for downstream name nodes.
> --
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, RBF_ Isolation design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures should 
> help minimize the impact of clients connecting to healthy clusters vs 
> unhealthy clusters.
> For example, if there are 2 name nodes downstream and one of them is 
> heavily loaded with calls spiking rpc queue times, then due to back pressure 
> the same will start reflecting on the router. As a result, clients 
> connecting to healthy/faster name nodes will also slow down, as the same rpc 
> queue is maintained for all calls at the router layer. Essentially the same 
> IPC thread pool is used by the router to connect to all name nodes.
> Currently the router uses one single rpc queue for all calls. Let's discuss 
> how we can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from current call queue, immediately identify 
> downstream name node and maintain a separate queue for each underlying name 
> node. Another simpler way is to maintain some sort of rate limiter configured 
> for each name node and let routers drop/reject/send error requests after 
> certain threshold. 
> This won’t be a simple change as router’s ‘Server’ layer would need redesign 
> and implementation. Currently this layer is the same as name node.
> Opening this ticket to discuss, design and implement this feature.
>  






[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes.

2019-07-24 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891999#comment-16891999
 ] 

CR Hota commented on HDFS-14090:


[~linyiqun] [~hexiaoqiao] Thanks for the reviews. Was out of office and hence 
the delay in putting in the new patches. Have taken care of all review comments 
and kept the code as simple as possible for easy understanding.

Let me know if you have any further comments.

[~elgoiri] [~brahmareddy] [~aajisaka] Could you also help review ?

> RBF: Improved isolation for downstream name nodes.
> --
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, RBF_ Isolation design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures should 
> help minimize the impact of clients connecting to healthy clusters vs 
> unhealthy clusters.
> For example, if there are 2 name nodes downstream and one of them is 
> heavily loaded with calls spiking rpc queue times, then due to back pressure 
> the same will start reflecting on the router. As a result, clients 
> connecting to healthy/faster name nodes will also slow down, as the same rpc 
> queue is maintained for all calls at the router layer. Essentially the same 
> IPC thread pool is used by the router to connect to all name nodes.
> Currently the router uses one single rpc queue for all calls. Let's discuss 
> how we can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from current call queue, immediately identify 
> downstream name node and maintain a separate queue for each underlying name 
> node. Another simpler way is to maintain some sort of rate limiter configured 
> for each name node and let routers drop/reject/send error requests after 
> certain threshold. 
> This won’t be a simple change as router’s ‘Server’ layer would need redesign 
> and implementation. Currently this layer is the same as name node.
> Opening this ticket to discuss, design and implement this feature.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14090) RBF: Improved isolation for downstream name nodes.

2019-07-24 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14090:
---
Attachment: HDFS-14090.007.patch

> RBF: Improved isolation for downstream name nodes.
> --
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, HDFS-14090.007.patch, RBF_ Isolation design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures should 
> help minimize the impact of clients connecting to healthy clusters vs 
> unhealthy clusters.
> For example, if there are 2 name nodes downstream and one of them is 
> heavily loaded with calls spiking rpc queue times, then due to back pressure 
> the same will start reflecting on the router. As a result, clients 
> connecting to healthy/faster name nodes will also slow down, as the same rpc 
> queue is maintained for all calls at the router layer. Essentially the same 
> IPC thread pool is used by the router to connect to all name nodes.
> Currently the router uses one single rpc queue for all calls. Let's discuss 
> how we can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from current call queue, immediately identify 
> downstream name node and maintain a separate queue for each underlying name 
> node. Another simpler way is to maintain some sort of rate limiter configured 
> for each name node and let routers drop/reject/send error requests after 
> certain threshold. 
> This won’t be a simple change as router’s ‘Server’ layer would need redesign 
> and implementation. Currently this layer is the same as name node.
> Opening this ticket to discuss, design and implement this feature.
>  






[jira] [Updated] (HDFS-14090) RBF: Improved isolation for downstream name nodes.

2019-07-23 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14090:
---
Attachment: HDFS-14090.006.patch

> RBF: Improved isolation for downstream name nodes.
> --
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, 
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch, 
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch, 
> HDFS-14090.006.patch, RBF_ Isolation design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures should 
> help minimize the impact that unhealthy clusters have on clients connecting to 
> healthy clusters.
> For example - If there are 2 name nodes downstream, and one of them is 
> heavily loaded with calls spiking rpc queue times, due to back pressure the 
> same will start reflecting on the router. As a result of this, clients 
> connecting to healthy/faster name nodes will also slow down, as the same rpc 
> queue is maintained for all calls at the router layer. Essentially the same IPC 
> thread pool is used by the router to connect to all name nodes.
> Currently the router uses one single rpc queue for all calls. Let's discuss how 
> we can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from the current call queue, immediately identify the 
> downstream name node and maintain a separate queue for each underlying name 
> node. Another simpler way is to maintain some sort of rate limiter configured 
> for each name node and let routers drop/reject/send error requests after a 
> certain threshold. 
> This won’t be a simple change as the router’s ‘Server’ layer would need redesign 
> and implementation. Currently this layer is the same as the name node's.
> Opening this ticket to discuss, design and implement this feature.
>  
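
The simpler rate-limiter option in the description above could be sketched 
roughly as follows. This is only an illustration under stated assumptions: the 
class name NameNodePermitLimiter, its methods, and the fixed per-name-node 
permit count are hypothetical and are not taken from the attached patches or 
the design document. Each downstream name node gets its own pool of permits, 
so calls to an overloaded name node are rejected once its pool is exhausted 
while calls to other name nodes still proceed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/**
 * Hypothetical per-name-node permit limiter (illustration only, not the
 * HDFS-14090 implementation). A caller reserves a permit before forwarding
 * a call to a name node and returns it when the call completes.
 */
class NameNodePermitLimiter {
  // Assumed fixed budget of concurrent calls allowed per name node.
  private final int permitsPerNameNode;
  // One independent permit pool per name node identifier.
  private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

  NameNodePermitLimiter(int permitsPerNameNode) {
    if (permitsPerNameNode <= 0) {
      throw new IllegalArgumentException("permitsPerNameNode must be positive");
    }
    this.permitsPerNameNode = permitsPerNameNode;
  }

  /** Non-blocking: returns false when the name node's budget is exhausted. */
  boolean tryAcquire(String nameNodeId) {
    return poolFor(nameNodeId).tryAcquire();
  }

  /** Must be called exactly once for every successful tryAcquire. */
  void release(String nameNodeId) {
    poolFor(nameNodeId).release();
  }

  /** Permits currently free for the given name node. */
  int available(String nameNodeId) {
    return poolFor(nameNodeId).availablePermits();
  }

  private Semaphore poolFor(String nameNodeId) {
    // Lazily create the pool the first time a name node is seen.
    return permits.computeIfAbsent(nameNodeId,
        id -> new Semaphore(permitsPerNameNode));
  }
}
```

In such a sketch a handler would call tryAcquire before forwarding a request, 
return an error to the client when it yields false, and call release in a 
finally block once the downstream call completes. The alternative in the 
description, a separate queue per name node, would queue such requests instead 
of rejecting them.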






[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-07-02 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876690#comment-16876690
 ] 

CR Hota commented on HDFS-14609:


[~tasanuma] Highly appreciate your effort. Sure, I will help you get ramped up 
with all security related changes in router.

This is the strange part that needs deeper digging, as I mentioned. The tests 
were fine for a very long period of time. They were added as part of 
HDFS-14052 and, as you can see, kept passing through all subsequent changes 
made to the HDFS-13891 branch until we merged these changes to trunk a few days 
back. BTW, what you tried above is also strange, because you can see that 
HADOOP-16354 is also pulled in even though you checked out a relatively old 
commit id, 506d0734825f01daa7bc4ef93664d450b03f0890.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However, with HADOOP-16314 and 
> HADOOP-16354 in trunk, auth filters seem to have been changed, causing tests to 
> fail.
> Appropriate changes are needed in RBF, mainly fixing broken tests.






[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-07-01 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876513#comment-16876513
 ] 

CR Hota commented on HDFS-14609:


[Eric Yang|http://jira/secure/ViewProfile.jspa?name=eyang] Thanks for the 
detailed explanation. Apologies for a delayed response.

For TestRouterWithSecureStartup#testStartupWithoutSpnegoPrincipal

Since the test was fine earlier, we will just remove the test as it won't make 
much sense now, considering the generic 
hadoop.http.authentication.kerberos.principal is to be used to grab the spnego 
principal. In any case, it's still unclear to me why this was working just fine 
earlier with the same version of AbstractService. This would need some more 
digging.

For TestRouterHttpDelegationToken

We wanted to make sure that, for webhdfs, some tests checked whether tokens 
could be generated by the router's security manager. This was NOT intended to 
be an E2E security test. Again, the router works just fine as it inherits the 
namenode implementation, but we may need to modify the test to inject an 
appropriate no auth filter and bypass auth to maintain the rationale behind the 
test.

[~tasanuma] Do you have any cycles to help with this? I will be out of office 
soon, but I will be happy to help review and guide you. Feel free to assign 
this to yourself if you work on it.

 

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However, with HADOOP-16314 and 
> HADOOP-16354 in trunk, auth filters seem to have been changed, causing tests to 
> fail.
> Appropriate changes are needed in RBF, mainly fixing broken tests.






[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-06-25 Thread CR Hota (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872872#comment-16872872
 ] 

CR Hota commented on HDFS-14609:


[~eyang] Thanks for chiming in.

TestRouterHttpDelegationToken (all 3 tests) and 
TestRouterWithSecureStartup#testStartupWithoutSpnegoPrincipal are the ones 
failing currently.

 

 

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However, with HADOOP-16314 and 
> HADOOP-16354 in trunk, auth filters seem to have been changed, causing tests to 
> fail.
> Appropriate changes are needed in RBF, mainly fixing broken tests.






[jira] [Updated] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-06-25 Thread CR Hota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CR Hota updated HDFS-14609:
---
Description: 
We worked on router based federation security as part of HDFS-13532. We kept it 
compatible with the way namenode works. However, with HDFS-16314 and HDFS-16354 
in trunk, auth filters seem to have been changed, causing tests to fail.

Appropriate changes are needed in RBF, mainly fixing broken tests.

  was:
We worked on router based federation as part of HDFS-13532. We kept it 
compatible with the way namenode works. However with HDFS-16314 and HDFS-16354 
in trunk, auth filters seems to have been changed causing tests to fail.

Changes are needed appropriately in RBF, mainly fixing broken tests.


> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However, with HDFS-16314 and 
> HDFS-16354 in trunk, auth filters seem to have been changed, causing tests to 
> fail.
> Appropriate changes are needed in RBF, mainly fixing broken tests.





