[jira] [Assigned] (HDFS-16428) Source path setted storagePolicy will cause wrong typeConsumed in rename operation

2022-01-20 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun reassigned HDFS-16428:
--

Assignee: lei w

> Source path setted storagePolicy will cause wrong typeConsumed  in rename 
> operation
> ---
>
> Key: HDFS-16428
> URL: https://issues.apache.org/jira/browse/HDFS-16428
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs, namenode
>Reporter: lei w
>Assignee: lei w
>Priority: Major
>  Labels: pull-request-available
> Attachments: example.txt
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When computing quota in the rename operation, we use the storage policy of 
> the target directory to compute the src quota usage. This causes a wrong 
> typeConsumed value when the source path has its own storage policy set. I 
> provided a unit test to demonstrate this situation.
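
A minimal sketch of the scenario (illustrative only, not the attached 
example.txt or the unit test in the pull request; the class name and values 
are hypothetical):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.StorageType;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class RenameTypeConsumedSketch {
  public static void main(String[] args) throws Exception {
    MiniDFSCluster cluster =
        new MiniDFSCluster.Builder(new Configuration()).numDataNodes(3).build();
    try {
      DistributedFileSystem dfs = cluster.getFileSystem();
      Path src = new Path("/src");
      Path dst = new Path("/dst");
      dfs.mkdirs(src);
      dfs.mkdirs(dst);
      dfs.setStoragePolicy(src, "ALL_SSD");  // the source has its own policy
      dfs.setQuotaByStorageType(dst, StorageType.DISK, 1024L * 1024 * 1024);
      try (java.io.OutputStream out = dfs.create(new Path(src, "file"))) {
        out.write(new byte[1024]);  // 1 KB so the rename carries real usage
      }
      dfs.rename(new Path(src, "file"), new Path(dst, "file"));
      // The reported bug: the rename quota check computes the src usage with
      // /dst's storage policy instead of /src's ALL_SSD, so the typeConsumed
      // charged against the quota is wrong.
      System.out.println(
          dfs.getQuotaUsage(dst).getTypeConsumed(StorageType.DISK));
    } finally {
      cluster.shutdown();
    }
  }
}
{code}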



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll

2021-07-07 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-16083:
---
Attachment: HDFS-16083.005.1.patch

> Forbid Observer NameNode trigger  active namenode log roll
> --
>
> Key: HDFS-16083
> URL: https://issues.apache.org/jira/browse/HDFS-16083
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: lei w
>Assignee: lei w
>Priority: Minor
> Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, 
> HDFS-16083.003.patch, HDFS-16083.004.patch, HDFS-16083.005.1.patch, 
> HDFS-16083.005.patch, activeRollEdits.png
>
>
> When the Observer NameNode is enabled in the cluster, the Active NameNode 
> will receive rollEditLog RPC requests from both the Standby NameNode and the 
> Observer NameNode within a short time. The Observer NameNode's rollEditLog 
> request is a repetitive operation, so should we forbid the Observer NameNode 
> from triggering the active NameNode's log roll? We configured 
> 'dfs.ha.log-roll.period' to 300 (5 minutes), and the active NameNode 
> receives rollEditLog RPCs as shown in activeRollEdits.png.
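
A hedged sketch of the idea behind the proposal (illustrative only, not the 
attached patches; the method name is made up): only a NameNode in STANDBY 
state triggers the active log roll, while an OBSERVER keeps tailing edits 
without sending the redundant rollEditLog RPC.
{code:java}
import org.apache.hadoop.ha.HAServiceProtocol.HAServiceState;

public class TriggerRollSketch {
  // Only the standby asks the active to roll its edit log; the observer
  // tails edits the same way but skips the trigger.
  static boolean shouldTriggerActiveLogRoll(HAServiceState state,
      boolean tooLongSinceLastLoad) {
    return state == HAServiceState.STANDBY && tooLongSinceLastLoad;
  }

  public static void main(String[] args) {
    System.out.println(shouldTriggerActiveLogRoll(HAServiceState.STANDBY, true));   // true
    System.out.println(shouldTriggerActiveLogRoll(HAServiceState.OBSERVER, true));  // false
  }
}
{code}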



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll

2021-07-07 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-16083:
---
Attachment: (was: HDFS-16083.005.1.patch)

> Forbid Observer NameNode trigger  active namenode log roll
> --
>
> Key: HDFS-16083
> URL: https://issues.apache.org/jira/browse/HDFS-16083
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: lei w
>Assignee: lei w
>Priority: Minor
> Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, 
> HDFS-16083.003.patch, HDFS-16083.004.patch, HDFS-16083.005.1.patch, 
> HDFS-16083.005.patch, activeRollEdits.png
>
>
> When the Observer NameNode is enabled in the cluster, the Active NameNode 
> will receive rollEditLog RPC requests from both the Standby NameNode and the 
> Observer NameNode within a short time. The Observer NameNode's rollEditLog 
> request is a repetitive operation, so should we forbid the Observer NameNode 
> from triggering the active NameNode's log roll? We configured 
> 'dfs.ha.log-roll.period' to 300 (5 minutes), and the active NameNode 
> receives rollEditLog RPCs as shown in activeRollEdits.png.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll

2021-07-02 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-16083:
---
Status: Open  (was: Patch Available)

> Forbid Observer NameNode trigger  active namenode log roll
> --
>
> Key: HDFS-16083
> URL: https://issues.apache.org/jira/browse/HDFS-16083
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: lei w
>Assignee: lei w
>Priority: Minor
> Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, 
> HDFS-16083.003.patch, HDFS-16083.004.patch, HDFS-16083.005.1.patch, 
> HDFS-16083.005.patch, activeRollEdits.png
>
>
> When the Observer NameNode is enabled in the cluster, the Active NameNode 
> will receive rollEditLog RPC requests from both the Standby NameNode and the 
> Observer NameNode within a short time. The Observer NameNode's rollEditLog 
> request is a repetitive operation, so should we forbid the Observer NameNode 
> from triggering the active NameNode's log roll? We configured 
> 'dfs.ha.log-roll.period' to 300 (5 minutes), and the active NameNode 
> receives rollEditLog RPCs as shown in activeRollEdits.png.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll

2021-07-02 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-16083:
---
Attachment: HDFS-16083.005.1.patch
Status: Patch Available  (was: Open)

> Forbid Observer NameNode trigger  active namenode log roll
> --
>
> Key: HDFS-16083
> URL: https://issues.apache.org/jira/browse/HDFS-16083
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: lei w
>Assignee: lei w
>Priority: Minor
> Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, 
> HDFS-16083.003.patch, HDFS-16083.004.patch, HDFS-16083.005.1.patch, 
> HDFS-16083.005.patch, activeRollEdits.png
>
>
> When the Observer NameNode is enabled in the cluster, the Active NameNode 
> will receive rollEditLog RPC requests from both the Standby NameNode and the 
> Observer NameNode within a short time. The Observer NameNode's rollEditLog 
> request is a repetitive operation, so should we forbid the Observer NameNode 
> from triggering the active NameNode's log roll? We configured 
> 'dfs.ha.log-roll.period' to 300 (5 minutes), and the active NameNode 
> receives rollEditLog RPCs as shown in activeRollEdits.png.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll

2021-06-30 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-16083:
---
Attachment: HDFS-16083.004.patch
Status: Patch Available  (was: Open)

Re-submit v04.

> Forbid Observer NameNode trigger  active namenode log roll
> --
>
> Key: HDFS-16083
> URL: https://issues.apache.org/jira/browse/HDFS-16083
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: lei w
>Assignee: lei w
>Priority: Minor
> Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, 
> HDFS-16083.003.patch, HDFS-16083.004.patch, activeRollEdits.png
>
>
> When the Observer NameNode is enabled in the cluster, the Active NameNode 
> will receive rollEditLog RPC requests from both the Standby NameNode and the 
> Observer NameNode within a short time. The Observer NameNode's rollEditLog 
> request is a repetitive operation, so should we forbid the Observer NameNode 
> from triggering the active NameNode's log roll? We configured 
> 'dfs.ha.log-roll.period' to 300 (5 minutes), and the active NameNode 
> receives rollEditLog RPCs as shown in activeRollEdits.png.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll

2021-06-30 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371915#comment-17371915
 ] 

Jinglun commented on HDFS-16083:


Hi [~lei w], thanks for your patch. Some comments.

In EditLogTailer.java:
 # I'd prefer using `shouldRollLog` instead of avoidTriggerActiveLogRoll.
{code:java}
if (shouldRollLog && tooLongSinceLastLoad() &&
    lastRollTriggerTxId < lastLoadedTxnId) {{code}

In TestStandbyRollEditsLogOnly.java:
 # The test case and setup method should not be static.
 # We need a license header for the new file.

In TestStandbyRollEditsLogOnly#testOnlyStandbyRollEditlog:
 # When you compare observerRollTimeMs1, could you use assertEquals instead of 
assertTrue? (See the sketch after these comments.)
 # The assert messages should be more specific. Something like: "Standby 
should roll the log." and "The observer is not expected to roll the log."
 # I'd prefer the names standbyInitialRollTime and standbyLastRollTime instead 
of the numbered standbyRollTimeMs1 and standbyRollTimeMs2.
 # The sleep time is too long; can we make it faster?

In TestStandbyRollEditsLogOnly#testTransObToStandbyThenRollLog:
 # It fails; could you give it a check?
 # The verify logic is very similar to testOnlyStandbyRollEditlog; can we 
extract the common part into a new method?
 # The idea of this test is good. We can transition the state and verify the 
roll edits more times. Maybe do it 3 times?

There are also some checkstyle issues; please follow the Jenkins suggestions. 
I'll re-submit v03 as v04 to trigger Jenkins.
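
A hedged illustration of the assertEquals/assertNotEquals suggestion (variable 
names and values here are placeholders, not the actual test code):
{code:java}
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotEquals;

import org.junit.Test;

public class RollAssertionSketch {
  @Test
  public void rollAssertions() {
    long standbyInitialRollTime = 100L;   // placeholder timestamps
    long standbyLastRollTime = 200L;
    long observerInitialRollTime = 100L;
    long observerLastRollTime = 100L;
    assertNotEquals("Standby should roll the log.",
        standbyInitialRollTime, standbyLastRollTime);
    assertEquals("The observer is not expected to roll the log.",
        observerInitialRollTime, observerLastRollTime);
  }
}
{code}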

> Forbid Observer NameNode trigger  active namenode log roll
> --
>
> Key: HDFS-16083
> URL: https://issues.apache.org/jira/browse/HDFS-16083
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: lei w
>Assignee: lei w
>Priority: Minor
> Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, 
> HDFS-16083.003.patch, activeRollEdits.png
>
>
> When the Observer NameNode is enabled in the cluster, the Active NameNode 
> will receive rollEditLog RPC requests from both the Standby NameNode and the 
> Observer NameNode within a short time. The Observer NameNode's rollEditLog 
> request is a repetitive operation, so should we forbid the Observer NameNode 
> from triggering the active NameNode's log roll? We configured 
> 'dfs.ha.log-roll.period' to 300 (5 minutes), and the active NameNode 
> receives rollEditLog RPCs as shown in activeRollEdits.png.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll

2021-06-23 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368012#comment-17368012
 ] 

Jinglun commented on HDFS-16083:


Hi [~lei w], thanks for your report! The description makes sense to me. I had 
a quick look at rollEdit, and it seems the redundant roll edit does exist. 
Could you add some logs from the active NameNode showing that it actually 
rolls edits more frequently than configured in 'dfs.ha.log-roll.period'? Also, 
we need a unit test in the patch to make it solid.

> Forbid Observer NameNode trigger  active namenode log roll
> --
>
> Key: HDFS-16083
> URL: https://issues.apache.org/jira/browse/HDFS-16083
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: lei w
>Assignee: lei w
>Priority: Minor
> Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch
>
>
> When the Observer NameNode is enabled in the cluster, the Active NameNode 
> will receive rollEditLog RPC requests from both the Standby NameNode and the 
> Observer NameNode within a short time. The Observer NameNode's rollEditLog 
> request is a repetitive operation, so should we prohibit the Observer 
> NameNode from triggering rollEditLog?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16083) Forbid Observer NameNode trigger active namenode log roll

2021-06-23 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun reassigned HDFS-16083:
--

Assignee: lei w

> Forbid Observer NameNode trigger  active namenode log roll
> --
>
> Key: HDFS-16083
> URL: https://issues.apache.org/jira/browse/HDFS-16083
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: lei w
>Assignee: lei w
>Priority: Minor
> Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch
>
>
> When the Observer NameNode is enabled in the cluster, the Active NameNode 
> will receive rollEditLog RPC requests from both the Standby NameNode and the 
> Observer NameNode within a short time. The Observer NameNode's rollEditLog 
> request is a repetitive operation, so should we prohibit the Observer 
> NameNode from triggering rollEditLog?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16038) DataNode Unrecognized Observer Node when cluster add an observer node

2021-05-30 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354202#comment-17354202
 ] 

Jinglun commented on HDFS-16038:


I mean you can update the package and the configuration at the same time. A 
DataNode with the old package doesn't know about the existence of the 
Observer, so there won't be the HAServiceState.observer issue.

Would you like to share more details about your upgrade process, and why 
updating the package and the configuration at the same time doesn't work for 
you?

> DataNode Unrecognized Observer Node when cluster add an observer node
> -
>
> Key: HDFS-16038
> URL: https://issues.apache.org/jira/browse/HDFS-16038
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: lei w
>Priority: Critical
>
> When an Observer node is added to the cluster, the DataNode will not be able 
> to recognize HAServiceState.observer, because we did not upgrade the 
> DataNode. Generally, it takes a long time for a big cluster to upgrade the 
> DataNodes. So should we add a switch to replace the Observer state with the 
> Standby state when the DataNode cannot recognize the HAServiceState.observer 
> state?
> The following are some error messages from the DataNode:
> {code:java}
> 11:14:31,812 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> IOException in offerService
> com.google.protobuf.InvalidProtocolBufferException: Message missing required 
> fields: haStatus.state
> at 
> com.google.protobuf.UninitializedMessageException.asInvalidProtocolBufferException(UninitializedMessageException.java:81)
> at 
> com.google.protobuf.AbstractParser.checkMessageInitialized(AbstractParser.java:71)
> {code}
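
A hedged sketch of the proposed switch (illustrative only; the method and flag 
names are hypothetical, not committed code):
{code:java}
import org.apache.hadoop.ha.HAServiceProtocol.HAServiceState;

public class ObserverStateSwitchSketch {
  // If the translate switch is on, report OBSERVER as STANDBY so DataNodes
  // still running the old package can parse the heartbeat response.
  static HAServiceState toDataNodeVisibleState(HAServiceState state,
      boolean translateObserverToStandby) {
    if (translateObserverToStandby && state == HAServiceState.OBSERVER) {
      return HAServiceState.STANDBY;
    }
    return state;
  }

  public static void main(String[] args) {
    System.out.println(
        toDataNodeVisibleState(HAServiceState.OBSERVER, true));  // STANDBY
  }
}
{code}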



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-05-30 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17353991#comment-17353991
 ] 

Jinglun edited comment on HDFS-15973 at 5/30/21, 11:44 AM:
---

I made a mistake when committing to trunk. I'm working in a new environment 
and forgot to update my git user.name and user.email. It left Chinese 
characters in the commit message, which might confuse people. So I reverted 
it and re-committed with the correct message.

 

I sincerely apologize to anyone who is disturbed by the commit message. Very 
sorry.


was (Author: lijinglun):
I made a mistake when committing to trunk. I'm working in a new environment 
and forgot to update my git user.name and user.email. It left Chinese 
characters in the commit message, which might confuse people. So I reverted 
it and re-committed with the correct message.

 

I sincerely apologize to everyone who is disturbed by the commit message. Very 
sorry.

> RBF: Add permission check before doing router federation rename.
> 
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, 
> HDFS-15973.009.patch, HDFS-15973.010.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.
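
A hedged sketch of the kind of check being discussed (illustrative only; the 
actual Router-side implementation differs, and this just uses the public 
FileSystem#access API):
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;

public class RenamePermissionSketch {
  // Fail fast with an AccessControlException unless the caller can modify
  // both the source parent and the destination parent of the rename.
  static void checkRenamePermission(FileSystem srcFs, FileSystem dstFs,
      Path src, Path dst) throws IOException {
    srcFs.access(src.getParent(), FsAction.WRITE);
    dstFs.access(dst.getParent(), FsAction.WRITE);
  }
}
{code}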



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-05-30 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17353991#comment-17353991
 ] 

Jinglun commented on HDFS-15973:


I made a mistake when committing to trunk. I'm working in a new environment 
and forgot to update my git user.name and user.email. It left Chinese 
characters in the commit message, which might confuse people. So I reverted 
it and re-committed with the correct message.

 

I sincerely apologize to everyone who is disturbed by the commit message. Very 
sorry.

> RBF: Add permission check before doing router federation rename.
> 
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, 
> HDFS-15973.009.patch, HDFS-15973.010.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-05-30 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17353971#comment-17353971
 ] 

Jinglun commented on HDFS-15973:


Committed to trunk. Thanks [~elgoiri] for the review!

> RBF: Add permission check before doing router federation rename.
> 
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, 
> HDFS-15973.009.patch, HDFS-15973.010.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-05-30 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15973:
---
Summary: RBF: Add permission check before doing router federation rename.  
(was: RBF: Add permission check before doting router federation rename.)

> RBF: Add permission check before doing router federation rename.
> 
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, 
> HDFS-15973.009.patch, HDFS-15973.010.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-05-30 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15973:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> RBF: Add permission check before doing router federation rename.
> 
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, 
> HDFS-15973.009.patch, HDFS-15973.010.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16038) DataNode Unrecognized Observer Node when cluster add an observer node

2021-05-30 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17353945#comment-17353945
 ] 

Jinglun commented on HDFS-16038:


Hi [~lei w], thanks for your report. I have one question.

IMO, when the observer is added to the cluster, the DataNode won't 
automatically recognize it. The administrator needs to update both the 
configuration and the package of the DataNode so that it can recognize the 
address of the observer and the `HAServiceState.observer`. If we update both 
the configuration and the package, we won't run into the situation where the 
DataNode doesn't recognize the HAServiceState.observer.

> DataNode Unrecognized Observer Node when cluster add an observer node
> -
>
> Key: HDFS-16038
> URL: https://issues.apache.org/jira/browse/HDFS-16038
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: lei w
>Priority: Critical
>
> When an Observer node is added to the cluster, the DataNode will not be able 
> to recognize HAServiceState.observer, because we did not upgrade the 
> DataNode. Generally, it takes a long time for a big cluster to upgrade the 
> DataNodes. So should we add a switch to replace the Observer state with the 
> Standby state when the DataNode cannot recognize the HAServiceState.observer 
> state?
> The following are some error messages from the DataNode:
> {code:java}
> 11:14:31,812 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> IOException in offerService
> com.google.protobuf.InvalidProtocolBufferException: Message missing required 
> fields: haStatus.state
> at 
> com.google.protobuf.UninitializedMessageException.asInvalidProtocolBufferException(UninitializedMessageException.java:81)
> at 
> com.google.protobuf.AbstractParser.checkMessageInitialized(AbstractParser.java:71)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-05-26 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17352232#comment-17352232
 ] 

Jinglun commented on HDFS-15973:


Wait one day for further comments. After that I'll commit this.

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, 
> HDFS-15973.009.patch, HDFS-15973.010.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-05-24 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15973:
---
Attachment: HDFS-15973.010.patch

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, 
> HDFS-15973.009.patch, HDFS-15973.010.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-05-24 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350781#comment-17350781
 ] 

Jinglun commented on HDFS-15973:


Hi [~elgoiri], thanks for your comments!
{quote}is just removing the sleep good enough?
{quote}
Yes, I think so. The sleep was meant to make sure the test directories are all 
created, but `cluster.createTestDirectoriesNamenode()` actually verifies 
whether the path exists after creating it, so there is no need to wait (see 
the sketch below).

 

Fixed the whitespace and changed "rpc" to capitals. Submit v10.
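
A minimal sketch of that verify-after-create idea (a standalone illustration 
using plain FileSystem calls; createTestDirectoriesNamenode is the test helper 
referenced above, and this is not its actual code):
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateVerifiedSketch {
  // mkdirs is synchronous and the existence check confirms the directory is
  // visible, so the caller does not need to sleep afterwards.
  static void createVerified(FileSystem fs, Path dir) throws IOException {
    if (!fs.mkdirs(dir) || !fs.exists(dir)) {
      throw new IOException("Failed to create " + dir);
    }
  }
}
{code}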

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, 
> HDFS-15973.009.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-05-24 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15973:
---
Attachment: HDFS-15973.009.patch

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, 
> HDFS-15973.009.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-05-24 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350410#comment-17350410
 ] 

Jinglun commented on HDFS-15973:


Hi [~elgoiri], thanks for your comments! Submitted v09 following your 
suggestions.

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-05-20 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348939#comment-17348939
 ] 

Jinglun commented on HDFS-15973:


Hi [~elgoiri], could you help review v08? Thanks!

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15294) Federation balance tool

2021-05-19 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347640#comment-17347640
 ] 

Jinglun commented on HDFS-15294:


{quote}if the source directory is being written all the time, does it mean 
Federation balance will never exit?
{quote}
Hi [~zhengchenyu], nice comments. HDFS-15640 has introduced a new option: 
'diffThreshold'. If the number of diff entries is no greater than this 
threshold and the open-files check is satisfied (no open files, or force-close 
all open files), the fedBalance will go to the final round of distcp (see the 
sketch below).

By specifying the diff threshold we can make the federation balance job exit. 
Does that work for your situation?

 

I'll review HDFS-15750.
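
A hedged sketch of the loop described above (illustrative only; the helper 
names are hypothetical stand-ins, not the FedBalance classes):
{code:java}
public class FedBalanceLoopSketch {
  static final int DIFF_THRESHOLD = 10;  // illustrative diffThreshold value

  static void balance(String src, String dst) {
    while (true) {
      int diffSize = countSnapshotDiffEntries(src, dst);
      if (diffSize <= DIFF_THRESHOLD && openFilesCheckSatisfied(src)) {
        runFinalDistcp(src, dst);  // final round: writes to src are stopped,
        return;                    // so the job can always exit
      }
      runIncrementalDistcpSync(src, dst);  // sync one snapshot-diff round
    }
  }

  // Hypothetical stand-ins so the sketch compiles; the real tool talks to HDFS.
  static int countSnapshotDiffEntries(String src, String dst) { return 0; }
  static boolean openFilesCheckSatisfied(String src) { return true; }
  static void runFinalDistcp(String src, String dst) { }
  static void runIncrementalDistcpSync(String src, String dst) { }
}
{code}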

> Federation balance tool
> ---
>
> Key: HDFS-15294
> URL: https://issues.apache.org/jira/browse/HDFS-15294
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, 
> HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, 
> HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, 
> HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf
>
>
> This jira introduces a new HDFS federation balance tool to balance data 
> across different federation namespaces. It uses DistCp to copy data from the 
> source path to the target path.
> The process is:
>  1. Use distcp and snapshot diffs to sync data between src and dst until 
> they are the same.
>  2. Update the mount table in the Router if RBF mode is specified.
>  3. Deal with the src data: move it to trash, delete it, or skip it.
> The design of the fedbalance tool comes from the discussion in HDFS-15087.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15294) Federation balance tool

2021-05-19 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347640#comment-17347640
 ] 

Jinglun edited comment on HDFS-15294 at 5/19/21, 12:40 PM:
---

{quote}if the source directory is being written all the time, does it mean 
Federation balance will never exit?
{quote}
Hi [~zhengchenyu], nice comments. HDFS-15640 has introduced an option: 
'diffThreshold'. If the number of diff entries is no greater than this 
threshold and the open-files check is satisfied (no open files, or force-close 
all open files), the fedBalance will go to the final round of distcp.

By specifying the diff threshold we can make the federation balance job exit. 
Does that work for your situation?

 

I'll review HDFS-15750.


was (Author: lijinglun):
{quote}if the source directory is being written all the time, does it mean 
Federation balance will never exit?
{quote}
Hi [~zhengchenyu], nice comments. HDFS-15640 has introduced a new option: 
'diffThreshold'. If the number of diff entries is no greater than this 
threshold and the open-files check is satisfied (no open files, or force-close 
all open files), the fedBalance will go to the final round of distcp.

By specifying the diff threshold we can make the federation balance job exit. 
Does that work for your situation?

 

I'll review HDFS-15750.

> Federation balance tool
> ---
>
> Key: HDFS-15294
> URL: https://issues.apache.org/jira/browse/HDFS-15294
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, 
> HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, 
> HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, 
> HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf
>
>
> This jira introduces a new HDFS federation balance tool to balance data 
> across different federation namespaces. It uses DistCp to copy data from the 
> source path to the target path.
> The process is:
>  1. Use distcp and snapshot diffs to sync data between src and dst until 
> they are the same.
>  2. Update the mount table in the Router if RBF mode is specified.
>  3. Deal with the src data: move it to trash, delete it, or skip it.
> The design of the fedbalance tool comes from the discussion in HDFS-15087.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-13671) Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet

2021-05-19 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun reassigned HDFS-13671:
--

Assignee: Haibin Huang

> Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet
> --
>
> Key: HDFS-13671
> URL: https://issues.apache.org/jira/browse/HDFS-13671
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.0.3
>Reporter: Yiqun Lin
>Assignee: Haibin Huang
>Priority: Major
>
> NameNode hung when deleting large files/blocks. The stack info:
> {code}
> "IPC Server handler 4 on 8020" #87 daemon prio=5 os_prio=0 
> tid=0x7fb505b27800 nid=0x94c3 runnable [0x7fa861361000]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.hadoop.hdfs.util.FoldedTreeSet.compare(FoldedTreeSet.java:474)
>   at 
> org.apache.hadoop.hdfs.util.FoldedTreeSet.removeAndGet(FoldedTreeSet.java:849)
>   at 
> org.apache.hadoop.hdfs.util.FoldedTreeSet.remove(FoldedTreeSet.java:911)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.removeBlock(DatanodeStorageInfo.java:252)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:194)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:108)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlockFromMap(BlockManager.java:3813)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlock(BlockManager.java:3617)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.removeBlocks(FSNamesystem.java:4270)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:4244)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:4180)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:4164)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:871)
>   at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.delete(AuthorizationProviderProxyClientProtocol.java:311)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:625)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> {code}
> In the current deletion logic in the NameNode, there are mainly two steps:
> * Collect the INodes and all blocks to be deleted, then delete the INodes.
> * Remove the blocks chunk by chunk in a loop.
> Actually the first step should be the more expensive operation and take 
> more time. However, we now always see the NN hang during the remove-block 
> operation.
> Looking into this, we introduced a new structure, {{FoldedTreeSet}}, for 
> better performance in dealing with FBRs/IBRs. But compared with the earlier 
> implementation of the remove-block logic, {{FoldedTreeSet}} seems slower, 
> since it takes additional time to balance tree nodes. When there are many 
> blocks to be removed/deleted, it looks bad.
> For the get-type operations in {{DatanodeStorageInfo}}, we only provide 
> {{getBlockIterator}} to return a blocks iterator, and no other get operation 
> for a specified block. Do we still need to use {{FoldedTreeSet}} in 
> {{DatanodeStorageInfo}}? As we know, {{FoldedTreeSet}} benefits Get, not 
> Update. Maybe we can revert this to the early implementation.
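
For intuition, a hedged micro-sketch of the bulk-removal cost (a plain TreeSet 
as a stand-in for the rebalance-on-remove behavior; timings are 
machine-dependent):
{code:java}
import java.util.TreeSet;

public class BulkRemoveSketch {
  public static void main(String[] args) {
    TreeSet<Long> blocks = new TreeSet<>();
    for (long id = 0; id < 1_000_000L; id++) {
      blocks.add(id);
    }
    long start = System.nanoTime();
    for (long id = 0; id < 1_000_000L; id++) {
      blocks.remove(id);  // each remove may rebalance: N removes cost O(N log N)
    }
    System.out.printf("removed 1M entries in %d ms%n",
        (System.nanoTime() - start) / 1_000_000L);
  }
}
{code}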



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13671) Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet

2021-05-18 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347285#comment-17347285
 ] 

Jinglun commented on HDFS-13671:


At Xiaomi we have seen the same slow-deletion problem. [~huanghaibin] solved 
it by reverting the FoldedTreeSet. Would you like to contribute your work 
here, [~huanghaibin]?

> Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet
> --
>
> Key: HDFS-13671
> URL: https://issues.apache.org/jira/browse/HDFS-13671
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.0.3
>Reporter: Yiqun Lin
>Priority: Major
>
> NameNode hung when deleting large files/blocks. The stack info:
> {code}
> "IPC Server handler 4 on 8020" #87 daemon prio=5 os_prio=0 
> tid=0x7fb505b27800 nid=0x94c3 runnable [0x7fa861361000]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.hadoop.hdfs.util.FoldedTreeSet.compare(FoldedTreeSet.java:474)
>   at 
> org.apache.hadoop.hdfs.util.FoldedTreeSet.removeAndGet(FoldedTreeSet.java:849)
>   at 
> org.apache.hadoop.hdfs.util.FoldedTreeSet.remove(FoldedTreeSet.java:911)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.removeBlock(DatanodeStorageInfo.java:252)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:194)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:108)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlockFromMap(BlockManager.java:3813)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlock(BlockManager.java:3617)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.removeBlocks(FSNamesystem.java:4270)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:4244)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:4180)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:4164)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:871)
>   at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.delete(AuthorizationProviderProxyClientProtocol.java:311)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:625)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> {code}
> In the current deletion logic in the NameNode, there are mainly two steps:
> * Collect the INodes and all blocks to be deleted, then delete the INodes.
> * Remove the blocks chunk by chunk in a loop.
> Actually the first step should be the more expensive operation and take 
> more time. However, we now always see the NN hang during the remove-block 
> operation.
> Looking into this, we introduced a new structure, {{FoldedTreeSet}}, for 
> better performance in dealing with FBRs/IBRs. But compared with the earlier 
> implementation of the remove-block logic, {{FoldedTreeSet}} seems slower, 
> since it takes additional time to balance tree nodes. When there are many 
> blocks to be removed/deleted, it looks bad.
> For the get-type operations in {{DatanodeStorageInfo}}, we only provide 
> {{getBlockIterator}} to return a blocks iterator, and no other get operation 
> for a specified block. Do we still need to use {{FoldedTreeSet}} in 
> {{DatanodeStorageInfo}}? As we know, {{FoldedTreeSet}} benefits Get, not 
> Update. Maybe we can revert this to the early implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-05-17 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346104#comment-17346104
 ] 

Jinglun commented on HDFS-15973:


Submitted v08 to fix checkstyle. The failed unit test runs well in my local 
environment, so it is not related.

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-05-17 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15973:
---
Attachment: HDFS-15973.008.patch

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-05-17 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15973:
---
Attachment: HDFS-15973.007.patch

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch, HDFS-15973.007.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-05-17 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346003#comment-17346003
 ] 

Jinglun commented on HDFS-15973:


Hi [~elgoiri], thanks for your comments! The failed test is not related; I 
tested it and it works fine.

Completed the javadocs of RouterFederationRename and updated the description 
in HDFSRouterFederation.md. Upload v07.

 

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch, HDFS-15973.007.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-05-14 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15973:
---
Attachment: HDFS-15973.006.patch

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-05-14 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344359#comment-17344359
 ] 

Jinglun commented on HDFS-15973:


Hi [~elgoiri], thanks for your nice comments! Updated the check of the 
snapshot path and permissions. Moved testPermissionCheck() to a new test 
class. Submit v06.

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, 
> HDFS-15973.006.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-05-10 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342263#comment-17342263
 ] 

Jinglun commented on HDFS-15973:


Hi [~zhengzhuobinzzb] [~elgoiri], do you have time to help review v05? Thanks 
very much!

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-05-08 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341339#comment-17341339
 ] 

Jinglun commented on HDFS-15973:


Since HDFS-15923 is resolved, submit v05 based on the authentication fix.

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-05-08 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15973:
---
Attachment: HDFS-15973.005.patch

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16006) TestRouterFederationRename is flaky

2021-05-08 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341300#comment-17341300
 ] 

Jinglun commented on HDFS-16006:


Hi [~elgoiri] [~hexiaoqiao], HDFS-15923 fixed this issue. The timeout was 
changed from 10s to 20s, and TestRouterFederationRename#testCounter is OK now.
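
An illustration of the kind of change described (a hedged sketch, not the 
actual diff from HDFS-15923):
{code:java}
import org.junit.Test;

public class TimeoutSketch {
  // The per-test timeout was doubled from 10s to 20s so testCounter no
  // longer trips TestTimedOut on slow machines.
  @Test(timeout = 20000)
  public void testCounter() throws Exception {
    // ... federation rename counter assertions ...
  }
}
{code}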

> TestRouterFederationRename is flaky
> ---
>
> Key: HDFS-16006
> URL: https://issues.apache.org/jira/browse/HDFS-16006
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Akira Ajisaka
>Priority: Major
> Attachments: patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt
>
>
> {quote}
> [ERROR] Errors: 
> [ERROR]   
> TestRouterFederationRename.testCounter:440->Object.wait:502->Object.wait:-2 ? 
> TestTimedOut
> [ERROR]   TestRouterFederationRename.testSetup:145 ? Remote The directory 
> /src cannot be...
> [ERROR]   TestRouterFederationRename.testSetup:145 ? Remote The directory 
> /src cannot be...
> {quote}
> https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2970/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters

2021-05-08 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15923:
---
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed to trunk. Thanks for [~zhengzhuobinzzb]'s contribution and 
[~elgoiri]'s review!
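
As the issue description below notes, the fix runs the journal write and the 
job submission under the Router login UGI. A hedged sketch of that doAs 
pattern (the actual committed code differs):
{code:java}
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.security.UserGroupInformation;

public class RouterUgiDoAsSketch {
  static void submitRenameJob() throws Exception {
    // Run as the Router's login user instead of the RPC caller, so Kerberos
    // credentials are available when creating the procedures and the job.
    UserGroupInformation routerUgi = UserGroupInformation.getLoginUser();
    routerUgi.doAs((PrivilegedExceptionAction<Void>) () -> {
      // create DistcpProcedure / TrashProcedure and submit the job here
      return null;
    });
  }
}
{code}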

> RBF:  Authentication failed when rename accross sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Assignee: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
> Fix For: 3.4.0
>
> Attachments: HDFS-15923.001.patch, HDFS-15923.002.patch, 
> HDFS-15923.003.patch, HDFS-15923.stack-trace, 
> hdfs-15923-fix-security-issue.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment will encounter
> the following two errors:
>  # Saving the object to the journal.
>  # The precheck that tries to get the src file status.
> So we need to use the Router login UGI's doAs to create the DistcpProcedure
> and TrashProcedure and submit the job (a hedged sketch of this doAs usage
> follows this message).
>  
> Besides, we should check the user's permission on the src and dst paths on
> the router side before doing the internal rename. (HDFS-15973)
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530)
> at 
> 
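
A minimal sketch of the "Router login UGI doAs" approach described in the quoted issue above; the helper name is an assumption for illustration, not the committed code.

{code:java}
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.security.UserGroupInformation;

public class RouterLoginUgiSketch {
  /**
   * Run an action (e.g. creating the DistcpProcedure/TrashProcedure and
   * submitting the job) under the router's Kerberos login UGI, so the
   * resulting RPCs carry valid credentials and GSS initiation succeeds.
   */
  static <T> T runAsRouterLogin(PrivilegedExceptionAction<T> action)
      throws Exception {
    UserGroupInformation loginUgi = UserGroupInformation.getLoginUser();
    return loginUgi.doAs(action);
  }
}
{code}

A caller would wrap the rename work, e.g. runAsRouterLogin(() -> { /* build procedures, submit job */ return null; }).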

[jira] [Commented] (HDFS-15923) RBF: Authentication failed when renaming across sub clusters

2021-05-07 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340649#comment-17340649
 ] 

Jinglun commented on HDFS-15923:


+1 on v03. Waiting one day for further comments. After that I'll commit this.

> RBF: Authentication failed when renaming across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Assignee: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
> Attachments: HDFS-15923.001.patch, HDFS-15923.002.patch, 
> HDFS-15923.003.patch, HDFS-15923.stack-trace, 
> hdfs-15923-fix-security-issue.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment will encounter
> the following two errors:
>  # Saving the object to the journal.
>  # The precheck that tries to get the src file status.
> So we need to use the Router login UGI's doAs to create the DistcpProcedure
> and TrashProcedure and submit the job.
>  
> Besides, we should check the user's permission on the src and dst paths on
> the router side before doing the internal rename. (HDFS-15973)
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> 

[jira] [Commented] (HDFS-15923) RBF: Authentication failed when renaming across sub clusters

2021-04-26 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332893#comment-17332893
 ] 

Jinglun commented on HDFS-15923:


LGTM. +1 on v002.

> RBF: Authentication failed when renaming across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Assignee: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
> Attachments: HDFS-15923.001.patch, HDFS-15923.002.patch, 
> HDFS-15923.stack-trace, hdfs-15923-fix-security-issue.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment will encounter
> the following two errors:
>  # Saving the object to the journal.
>  # The precheck that tries to get the src file status.
> So we need to use the proxy UGI's doAs to create the DistcpProcedure and
> TrashProcedure and submit the job (a hedged sketch of this proxy-UGI usage
> follows this message).
> In the patch I wrap the above methods in a proxy UGI doAs. It works.
> But there is another strange thing that this patch does not solve:
> the Router uses its own UGI to submit the DistCp job, not the user UGI or the
> proxy UGI. This may grant the DistCp job excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
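
For contrast with the final fix, a hedged sketch of the proxy-UGI variant that the quoted description mentions (helper name assumed):

{code:java}
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUgiSketch {
  /**
   * Impersonate the remote caller on top of the router's login credentials,
   * so the precheck and journal writes run as that user rather than as the
   * router itself. Requires proxy-user privileges for the router principal.
   */
  static <T> T runAsProxy(String remoteUser,
      PrivilegedExceptionAction<T> action) throws Exception {
    UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(
        remoteUser, UserGroupInformation.getLoginUser());
    return proxyUgi.doAs(action);
  }
}
{code}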

[jira] [Commented] (HDFS-15923) RBF: Authentication failed when renaming across sub clusters

2021-04-26 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332465#comment-17332465
 ] 

Jinglun commented on HDFS-15923:


Hi [~zhengzhuobinzzb], thanks for your explanation! Only some minor comments.

1. I think we can just remove the code comment at
RouterFederationRename.java#L114.

2. The code comment at TestRouterFederationRenameInKerberosEnv.java#L129 could
be removed too.

Other than that the patch looks good to me. Nice work!

> RBF: Authentication failed when renaming across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Assignee: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
> Attachments: HDFS-15923.001.patch, HDFS-15923.stack-trace, 
> hdfs-15923-fix-security-issue.patch
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment will encounter
> the following two errors:
>  # Saving the object to the journal.
>  # The precheck that tries to get the src file status.
> So we need to use the proxy UGI's doAs to create the DistcpProcedure and
> TrashProcedure and submit the job.
> In the patch I wrap the above methods in a proxy UGI doAs. It works.
> But there is another strange thing that this patch does not solve:
> the Router uses its own UGI to submit the DistCp job, not the user UGI or the
> proxy UGI. This may grant the DistCp job excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
> at 
> 

[jira] [Updated] (HDFS-15923) RBF: Authentication failed when renaming across sub clusters

2021-04-25 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15923:
---
Attachment: hdfs-15923-fix-security-issue.patch

> RBF: Authentication failed when renaming across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Assignee: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
> Attachments: HDFS-15923.stack-trace, 
> hdfs-15923-fix-security-issue.patch
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment will encounter
> the following two errors:
>  # Saving the object to the journal.
>  # The precheck that tries to get the src file status.
> So we need to use the proxy UGI's doAs to create the DistcpProcedure and
> TrashProcedure and submit the job.
> In the patch I wrap the above methods in a proxy UGI doAs. It works.
> But there is another strange thing that this patch does not solve:
> the Router uses its own UGI to submit the DistCp job, not the user UGI or the
> proxy UGI. This may grant the DistCp job excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> 

[jira] [Updated] (HDFS-15923) RBF: Authentication failed when renaming across sub clusters

2021-04-25 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15923:
---
Attachment: HDFS-15923.stack-trace

> RBF: Authentication failed when renaming across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Assignee: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
> Attachments: HDFS-15923.stack-trace
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment will encounter
> the following two errors:
>  # Saving the object to the journal.
>  # The precheck that tries to get the src file status.
> So we need to use the proxy UGI's doAs to create the DistcpProcedure and
> TrashProcedure and submit the job.
> In the patch I wrap the above methods in a proxy UGI doAs. It works.
> But there is another strange thing that this patch does not solve:
> the Router uses its own UGI to submit the DistCp job, not the user UGI or the
> proxy UGI. This may grant the DistCp job excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:544)
> at 
> 

[jira] [Comment Edited] (HDFS-15923) RBF: Authentication failed when renaming across sub clusters

2021-04-25 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331517#comment-17331517
 ] 

Jinglun edited comment on HDFS-15923 at 4/25/21, 1:04 PM:
--

Hi [~zhengzhuobinzzb], nice work! Here are some comments.

1. The unit test couldn't pass in my local environment. I got an NPE when the
MiniDFSCluster tries to verify the registered datanodes. The stack trace is
uploaded. I think it is caused by the DataNode not loading the security
configuration (the DataNode calls UserGroupInformation.setConfiguration and
changes the authenticationMethod to SIMPLE; see the sketch after these
comments). I made a small change, see _hdfs-15923-fix-security-issue.patch._
I don't know why it worked well in your environment and on Yetus, do you know why?

2. Does TestRouterFederationRenameInKerberosEnv need to extend
ClientBaseWithFixes, and why?

3. I'd prefer to review your patch on Jira. Could you switch to Jira when
submitting your next patch?
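
A hedged sketch of the kind of test-setup change point 1 refers to: re-apply the Kerberos configuration before the MiniDFSCluster starts, so a DataNode's own UserGroupInformation.setConfiguration call cannot leave the static security state on SIMPLE. The configuration key and setConfiguration are real Hadoop APIs; the helper name is assumed.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CommonConfigurationKeysPublic;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureTestSetupSketch {
  // Force the static UGI state back to Kerberos, e.g. before starting the
  // MiniDFSCluster, so DataNode initialization cannot downgrade it to SIMPLE.
  static void forceKerberos(Configuration conf) {
    conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION,
        "kerberos");
    UserGroupInformation.setConfiguration(conf);
  }
}
{code}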

 


was (Author: lijinglun):
Hi [~zhengzhuobinzzb], nice work! Here are some comments.

1. The unit test couldn't pass in my local environment. I got an NPE when the
MiniDFSCluster tries to verify the registered datanodes. The stack trace is
uploaded. I think it is caused by the DataNode not loading the security
configuration (the DataNode calls UserGroupInformation.setConfiguration and
changes the authenticationMethod to SIMPLE). I made a small change, see
_fix-datanode-security-issue.patch._ I don't know why it worked well in your
environment and on Yetus, do you know why?

2. Does TestRouterFederationRenameInKerberosEnv need to extend
ClientBaseWithFixes, and why?

3. I'd prefer to review your patch on Jira. Could you switch to Jira when
submitting your next patch?

 

> RBF: Authentication failed when renaming across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Assignee: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment will encounter
> the following two errors:
>  # Saving the object to the journal.
>  # The precheck that tries to get the src file status.
> So we need to use the proxy UGI's doAs to create the DistcpProcedure and
> TrashProcedure and submit the job.
> In the patch I wrap the above methods in a proxy UGI doAs. It works.
> But there is another strange thing that this patch does not solve:
> the Router uses its own UGI to submit the DistCp job, not the user UGI or the
> proxy UGI. This may grant the DistCp job excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> 

[jira] [Comment Edited] (HDFS-15923) RBF: Authentication failed when renaming across sub clusters

2021-04-25 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331517#comment-17331517
 ] 

Jinglun edited comment on HDFS-15923 at 4/25/21, 1:03 PM:
--

Hi [~zhengzhuobinzzb], nice work! Here are some comments.

1. The unit test couldn't pass in my local environment. I got an NPE when the
MiniDFSCluster tries to verify the registered datanodes. The stack trace is
uploaded. I think it is caused by the DataNode not loading the security
configuration (the DataNode calls UserGroupInformation.setConfiguration and
changes the authenticationMethod to SIMPLE). I made a small change, see
_fix-datanode-security-issue.patch._ I don't know why it worked well in your
environment and on Yetus, do you know why?

2. Does TestRouterFederationRenameInKerberosEnv need to extend
ClientBaseWithFixes, and why?

3. I'd prefer to review your patch on Jira. Could you switch to Jira when
submitting your next patch?

 


was (Author: lijinglun):
Hi [~zhengzhuobinzzb], nice work! Here are some comments.

1. The unit test couldn't pass in my local environment. I got an NPE when the
MiniDFSCluster tries to verify the registered datanodes. The stack trace is
uploaded. I think it is caused by the DataNode not loading the security
configuration. I made a small change, see _fix-datanode-security-issue.patch._
I don't know why it worked well in your environment and on Yetus, do you know why?

2. Does TestRouterFederationRenameInKerberosEnv need to extend
ClientBaseWithFixes, and why?

3. I'd prefer to review your patch on Jira. Could you switch to Jira when
submitting your next patch?

 

> RBF: Authentication failed when renaming across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Assignee: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment will encounter
> the following two errors:
>  # Saving the object to the journal.
>  # The precheck that tries to get the src file status.
> So we need to use the proxy UGI's doAs to create the DistcpProcedure and
> TrashProcedure and submit the job.
> In the patch I wrap the above methods in a proxy UGI doAs. It works.
> But there is another strange thing that this patch does not solve:
> the Router uses its own UGI to submit the DistCp job, not the user UGI or the
> proxy UGI. This may grant the DistCp job excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at 

[jira] [Commented] (HDFS-15923) RBF: Authentication failed when renaming across sub clusters

2021-04-25 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331517#comment-17331517
 ] 

Jinglun commented on HDFS-15923:


Hi [~zhengzhuobinzzb], nice work! Here are some comments.

1. The unit test couldn't pass in my local environment. I got an NPE when the
MiniDFSCluster tries to verify the registered datanodes. The stack trace is
uploaded. I think it is caused by the DataNode not loading the security
configuration. I made a small change, see _fix-datanode-security-issue.patch._
I don't know why it worked well in your environment and on Yetus, do you know why?

2. Does TestRouterFederationRenameInKerberosEnv need to extend
ClientBaseWithFixes, and why?

3. I'd prefer to review your patch on Jira. Could you switch to Jira when
submitting your next patch?

 

> RBF: Authentication failed when renaming across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Assignee: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment will encounter
> the following two errors:
>  # Saving the object to the journal.
>  # The precheck that tries to get the src file status.
> So we need to use the proxy UGI's doAs to create the DistcpProcedure and
> TrashProcedure and submit the job.
> In the patch I wrap the above methods in a proxy UGI doAs. It works.
> But there is another strange thing that this patch does not solve:
> the Router uses its own UGI to submit the DistCp job, not the user UGI or the
> proxy UGI. This may grant the DistCp job excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at 

[jira] [Commented] (HDFS-15923) RBF: Authentication failed when renaming across sub clusters

2021-04-25 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331438#comment-17331438
 ] 

Jinglun commented on HDFS-15923:


OK, I am going to review this.

Hi [~elgoiri], [~ayushtkn], could you help add zhuobin zheng as a
contributor?

> RBF: Authentication failed when renaming across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment will encounter
> the following two errors:
>  # Saving the object to the journal.
>  # The precheck that tries to get the src file status.
> So we need to use the proxy UGI's doAs to create the DistcpProcedure and
> TrashProcedure and submit the job.
> In the patch I wrap the above methods in a proxy UGI doAs. It works.
> But there is another strange thing that this patch does not solve:
> the Router uses its own UGI to submit the DistCp job, not the user UGI or the
> proxy UGI. This may grant the DistCp job excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> 

[jira] [Assigned] (HDFS-15923) RBF: Authentication failed when renaming across sub clusters

2021-04-25 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun reassigned HDFS-15923:
--

Assignee: (was: Jinglun)

> RBF: Authentication failed when renaming across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment will encounter
> the following two errors:
>  # Saving the object to the journal.
>  # The precheck that tries to get the src file status.
> So we need to use the proxy UGI's doAs to create the DistcpProcedure and
> TrashProcedure and submit the job.
> In the patch I wrap the above methods in a proxy UGI doAs. It works.
> But there is another strange thing that this patch does not solve:
> the Router uses its own UGI to submit the DistCp job, not the user UGI or the
> proxy UGI. This may grant the DistCp job excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:544)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:471)

[jira] [Assigned] (HDFS-15923) RBF: Authentication failed when renaming across sub clusters

2021-04-25 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun reassigned HDFS-15923:
--

Assignee: Jinglun

> RBF: Authentication failed when renaming across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Assignee: Jinglun
>Priority: Major
>  Labels: RBF, pull-request-available, rename
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment will encounter
> the following two errors:
>  # Saving the object to the journal.
>  # The precheck that tries to get the src file status.
> So we need to use the proxy UGI's doAs to create the DistcpProcedure and
> TrashProcedure and submit the job.
> In the patch I wrap the above methods in a proxy UGI doAs. It works.
> But there is another strange thing that this patch does not solve:
> the Router uses its own UGI to submit the DistCp job, not the user UGI or the
> proxy UGI. This may grant the DistCp job excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:544)
> at 
> 

[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-04-19 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325026#comment-17325026
 ] 

Jinglun commented on HDFS-15973:


Hi [~zhengzhuobinzzb], thanks for your comments! The security mode was not
considered in v03, thanks for your explanation! Submitting v04.

> RBF: Add permission check before doing router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch
>
>
> The router federation rename lacks a permission check. It is a security
> issue.






[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-04-19 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15973:
---
Attachment: HDFS-15973.004.patch

> RBF: Add permission check before doing router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch
>
>
> The router federation rename lacks a permission check. It is a security
> issue.






[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-04-19 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15973:
---
Attachment: (was: HDFS-15973.004.patch)

> RBF: Add permission check before doing router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-04-19 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15973:
---
Attachment: HDFS-15973.004.patch

> RBF: Add permission check before doing router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch, HDFS-15973.004.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-04-15 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15973:
---
Attachment: HDFS-15973.003.patch

> RBF: Add permission check before doing router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-04-15 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322571#comment-17322571
 ] 

Jinglun commented on HDFS-15973:


Submitting v03 to fix checkstyle. The failed unit tests are unrelated.

> RBF: Add permission check before doing router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, 
> HDFS-15973.003.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-04-15 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322569#comment-17322569
 ] 

Jinglun commented on HDFS-15973:


Hi [~zhengzhuobinzzb], thanks for your comments.
{quote}I think the access check also needs credentials in a kerberos environment
{quote}
I don't fully understand; could you describe it in more detail?

> RBF: Add permission check before doing router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-04-15 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322214#comment-17322214
 ] 

Jinglun commented on HDFS-15973:


Submitting v02, which uses FileSystem.access().

> RBF: Add permission check before doing router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-04-15 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15973:
---
Attachment: HDFS-15973.002.patch

> RBF: Add permission check before doing router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-04-14 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321889#comment-17321889
 ] 

Jinglun commented on HDFS-15973:


Hi [~zhengzhuobinzzb], thanks for your comments. Using FileSystem.access() is 
better; I overlooked the extra rpc :P. I'll submit v02 using access(), and the 
second point can be handled too.
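
For reference, here is a minimal sketch of what such a pre-check could look 
like. It only illustrates the idea, not the actual patch: the class name, the 
chosen paths, and the FsAction values are assumptions.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.security.AccessControlException;

// Hypothetical pre-check before a router federation rename. FileSystem.access()
// lets the NameNode itself verify the permission for the current caller, so we
// avoid fetching file statuses and re-implementing the check locally.
public class RenamePreCheck {
  static void checkCanRename(Configuration conf, Path src, Path dst)
      throws Exception {
    FileSystem srcFs = src.getFileSystem(conf);
    FileSystem dstFs = dst.getFileSystem(conf);
    try {
      // Moving src away requires write access on its parent directory.
      srcFs.access(src.getParent(), FsAction.WRITE);
      // Creating dst requires write access on its parent directory too.
      dstFs.access(dst.getParent(), FsAction.WRITE);
    } catch (AccessControlException e) {
      throw new AccessControlException(
          "rename " + src + " -> " + dst + " denied: " + e.getMessage());
    }
  }
}
{code}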

> RBF: Add permission check before doing router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15923) RBF: Authentication failed when rename across sub clusters

2021-04-14 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320887#comment-17320887
 ] 

Jinglun edited comment on HDFS-15923 at 4/14/21, 3:42 PM:
--

Hi [~zhengzhuobinzzb], you are right! Please continue with your work; we still 
need some test cases. 
{quote}In the current code logic, storing tasks in Journal does not use super 
users and Kerberos credentials. (Because when RPC executes Call, it uses the 
corresponding Ugi's doAs, and the Ugi does not have a Kerberos certificate.)
{quote}
 

I'll start a new Jira (HDFS-15973) to resolve the permission check issue.


was (Author: lijinglun):
Hi [~zhengzhuobinzzb], you are right! Please continue with your work; we still 
need some test cases. 
{quote}In the current code logic, storing tasks in Journal does not use super 
users and Kerberos credentials. (Because when RPC executes Call, it uses the 
corresponding Ugi's doAs, and the Ugi does not have a Kerberos certificate.)
{quote}
 

I'll start a new Jira to resolve the permission check issue.

> RBF: Authentication failed when rename across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Renaming across subclusters with RBF in a Kerberos environment encounters the 
> following two errors:
>  # Saving the object to the journal.
>  # The precheck tries to get the src file status.
> So we need to use the proxy UGI's doAs to create the DistcpProcedure and 
> TrashProcedure and to submit the job.
> In the patch I use the proxy ugi's doAs for the methods above. It worked.
> But there is another strange thing that this patch does not solve: the Router 
> uses its own ugi to submit the Distcp job, not the user ugi or the proxy ugi. 
> This may give distcp excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at 

[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-04-14 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321099#comment-17321099
 ] 

Jinglun commented on HDFS-15973:


Submitting the initial patch. The patch introduces the RouterINode class to save 
the file status. First it collects the file status of the src and the dst and 
saves them to a RouterINode array. Then it uses RouterPermissionChecker (very 
similar to FSPermissionChecker) to do the permission check.
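
A rough shape of that approach, for readers of this thread (the field and 
method names below are illustrative guesses, not the actual patch):
{code:java}
import java.util.Arrays;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.security.AccessControlException;
import org.apache.hadoop.security.UserGroupInformation;

// Snapshot of one path component, built from a FileStatus fetched earlier.
class RouterINode {
  final String owner;
  final String group;
  final FsPermission perm;

  RouterINode(FileStatus status) {
    this.owner = status.getOwner();
    this.group = status.getGroup();
    this.perm = status.getPermission();
  }
}

// Very similar to FSPermissionChecker: pick the owner/group/other bits that
// apply to the caller and verify the requested access.
class RouterPermissionChecker {
  private final UserGroupInformation ugi;

  RouterPermissionChecker(UserGroupInformation ugi) {
    this.ugi = ugi;
  }

  void check(RouterINode inode, FsAction access) throws AccessControlException {
    final FsAction allowed;
    if (ugi.getShortUserName().equals(inode.owner)) {
      allowed = inode.perm.getUserAction();
    } else if (Arrays.asList(ugi.getGroupNames()).contains(inode.group)) {
      allowed = inode.perm.getGroupAction();
    } else {
      allowed = inode.perm.getOtherAction();
    }
    if (!allowed.implies(access)) {
      throw new AccessControlException("Permission denied: user="
          + ugi.getShortUserName() + ", access=" + access);
    }
  }
}
{code}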

> RBF: Add permission check before doing router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-04-14 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15973:
---
Attachment: HDFS-15973.001.patch
Status: Patch Available  (was: Open)

> RBF: Add permission check before doing router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch
>
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-04-14 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun reassigned HDFS-15973:
--

Assignee: Jinglun

> RBF: Add permission check before doing router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
>
> The router federation rename lacks a permission check. It is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-04-14 Thread Jinglun (Jira)
Jinglun created HDFS-15973:
--

 Summary: RBF: Add permission check before doing router federation 
rename.
 Key: HDFS-15973
 URL: https://issues.apache.org/jira/browse/HDFS-15973
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Jinglun


The router federation rename lacks a permission check. It is a security 
issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15923) RBF: Authentication failed when rename across sub clusters

2021-04-14 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15923:
---
Parent: HDFS-15747
Issue Type: Sub-task  (was: Bug)

> RBF: Authentication failed when rename across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Renaming across subclusters with RBF in a Kerberos environment encounters the 
> following two errors:
>  # Saving the object to the journal.
>  # The precheck tries to get the src file status.
> So we need to use the proxy UGI's doAs to create the DistcpProcedure and 
> TrashProcedure and to submit the job.
> In the patch I use the proxy ugi's doAs for the methods above. It worked.
> But there is another strange thing that this patch does not solve: the Router 
> uses its own ugi to submit the Distcp job, not the user ugi or the proxy ugi. 
> This may give distcp excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:544)
> at 
> 

[jira] [Commented] (HDFS-15923) RBF: Authentication failed when rename across sub clusters

2021-04-14 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320887#comment-17320887
 ] 

Jinglun commented on HDFS-15923:


Hi [~zhengzhuobinzzb], you are right! Please continue with your work; we still 
need some test cases. 
{quote}In the current code logic, storing tasks in Journal does not use super 
users and Kerberos credentials. (Because when RPC executes Call, it uses the 
corresponding Ugi's doAs, and the Ugi does not have a Kerberos certificate.)
{quote}
 

I'll start a new Jira to resolve the permission check issue.

> RBF: Authentication failed when rename across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Renaming across subclusters with RBF in a Kerberos environment encounters the 
> following two errors:
>  # Saving the object to the journal.
>  # The precheck tries to get the src file status.
> So we need to use the proxy UGI's doAs to create the DistcpProcedure and 
> TrashProcedure and to submit the job.
> In the patch I use the proxy ugi's doAs for the methods above. It worked.
> But there is another strange thing that this patch does not solve: the Router 
> uses its own ugi to submit the Distcp job, not the user ugi or the proxy ugi. 
> This may give distcp excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
> at 
> 

[jira] [Commented] (HDFS-15972) Fedbalance only copies data partially when there's an existing open file

2021-04-13 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320123#comment-17320123
 ] 

Jinglun commented on HDFS-15972:


Hi [~coconut_icecream], thanks for your report! I'll try to reproduce it and 
then dig into it this week.

> Fedbalance only copies data partially when there's an existing open file
> ---
>
> Key: HDFS-15972
> URL: https://issues.apache.org/jira/browse/HDFS-15972
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Felix N
>Priority: Major
>
> If there are open files when fedbalance is run and data is being written to 
> these files, fedbalance might skip the newly written data.
> Steps to recreate the issue:
>  # Create a dummy file /test/file with some data: {{echo "start" | hdfs dfs 
> -appendToFile /test/file}}
>  # Start writing to the file: {{hdfs dfs -appendToFile /test/file}} but do 
> not stop writing
>  # Run fedbalance: {{hadoop fedbalance submit hdfs://ns1/test 
> hdfs://ns2/test}}
>  # Write something to the file while fedbalance is running, "end" for 
> example, then stop writing
>  # After fedbalance is done, {{hdfs://ns2/test/file}} should only contain 
> "start" while {{hdfs://ns1/user/hadoop/.Trash/Current/test/file}} contains 
> "start\nend"
> Fedbalance is run with default configs and arguments so no diff should happen.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15923) RBF: Authentication failed when rename across sub clusters

2021-04-13 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320117#comment-17320117
 ] 

Jinglun commented on HDFS-15923:


Hi [~zhengzhuobinzzb], I'll take over this; I hope you don't mind. The 
description of this Jira is not precise. After I finish the patch I'll start a 
new Jira to deal with the permission issue.

> RBF: Authentication failed when rename across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Renaming across subclusters with RBF in a Kerberos environment encounters the 
> following two errors:
>  # Saving the object to the journal.
>  # The precheck tries to get the src file status.
> So we need to use the proxy UGI's doAs to create the DistcpProcedure and 
> TrashProcedure and to submit the job.
> In the patch I use the proxy ugi's doAs for the methods above. It worked.
> But there is another strange thing that this patch does not solve: the Router 
> uses its own ugi to submit the Distcp job, not the user ugi or the proxy ugi. 
> This may give distcp excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 

[jira] [Commented] (HDFS-15923) RBF: Authentication failed when rename across sub clusters

2021-03-30 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311555#comment-17311555
 ] 

Jinglun commented on HDFS-15923:


Hi [~zhengzhuobinzzb], thanks for your question!

 

In the current design the journal and the distcp procedure are all done with 
the Router's kerberos credential (a super user). Both the journal path and the 
yarn queue are configured by the administrator. The super user's credential is 
also used for preserving all the permissions in distcp. So we shouldn't use the 
user's ugi: it won't have write access to the journal path, and it doesn't have 
access to the super user's yarn queue either.

 

But there is an issue about the user's ugi: "The Router doesn't do any 
permission check before doing the Router Federation Rename". We should check 
both the source and the dst with the user's ugi before submitting the Balance 
Job.

 

Let me know your thoughts. If you also agree with the permission issue, are you 
interested in fixing it?
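
To make the idea concrete, a hedged sketch of such a pre-check (the class name 
and the exact FsAction checks are assumptions for illustration): the check runs 
under the caller's ugi via doAs, while the Balance Job itself is still 
submitted with the Router's credential.
{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.security.UserGroupInformation;

// Hypothetical gate: verify src and dst with the caller's ugi before the
// Router submits the balance job with its own super-user credential.
public class RenamePermissionGate {
  static void checkWithCallerUgi(final Configuration conf,
      UserGroupInformation callerUgi, final Path src, final Path dst)
      throws Exception {
    callerUgi.doAs((PrivilegedExceptionAction<Void>) () -> {
      // access() throws AccessControlException when the caller lacks the right.
      src.getFileSystem(conf).access(src.getParent(), FsAction.WRITE);
      dst.getFileSystem(conf).access(dst.getParent(), FsAction.WRITE);
      return null;
    });
    // ...only now submit the DistcpProcedure/TrashProcedure as the Router.
  }
}
{code}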

> RBF: Authentication failed when rename across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Renaming across subclusters with RBF in a Kerberos environment encounters the 
> following two errors:
>  # Saving the object to the journal.
>  # The precheck tries to get the src file status.
> So we need to use the proxy UGI's doAs to create the DistcpProcedure and 
> TrashProcedure and to submit the job.
> In the patch I use the proxy ugi's doAs for the methods above. It worked.
> But there is another strange thing that this patch does not solve: the Router 
> uses its own ugi to submit the Distcp job, not the user ugi or the proxy ugi. 
> This may give distcp excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> 

[jira] [Commented] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.

2021-03-18 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303875#comment-17303875
 ] 

Jinglun commented on HDFS-15899:


Submitting v02 to fix checkstyle. The failed unit tests are unrelated.

> Remove rpcThreadPool from DeadNodeDetector.
> ---
>
> Key: HDFS-15899
> URL: https://issues.apache.org/jira/browse/HDFS-15899
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15899.001.patch, HDFS-15899.002.patch
>
>
> The DeadNodeDetector uses a thread pool to do all the probe rpc calls. The 
> purpose is to use the thread pool timeout to monitor the probe timeout. But 
> the rpc client already has a timeout. We can use the rpc client timeout 
> instead of the thread pool timeout and remove the rpcThreadPool.
> The rpcThreadPool introduces additional complexity for probing the DataNode. 
> The probe task waiting in the busy rpcThreadPool might exceed the configured 
> timeout. The probe task will be marked as failed even if it is not scheduled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.

2021-03-18 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15899:
---
Attachment: HDFS-15899.002.patch

> Remove rpcThreadPool from DeadNodeDetector.
> ---
>
> Key: HDFS-15899
> URL: https://issues.apache.org/jira/browse/HDFS-15899
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15899.001.patch, HDFS-15899.002.patch
>
>
> The DeadNodeDetector uses a thread pool to do all the probe rpc calls. The 
> purpose is to use the thread pool timeout to monitor the probe timeout. But 
> the rpc client already has a timeout. We can use the rpc client timeout 
> instead of the thread pool timeout and remove the rpcThreadPool.
> The rpcThreadPool introduces additional complexity for probing the DataNode. 
> The probe task waiting in the busy rpcThreadPool might exceed the configured 
> timeout. The probe task will be marked as failed even if it is not scheduled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.

2021-03-16 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302372#comment-17302372
 ] 

Jinglun commented on HDFS-15899:


Submitting v01. Hi [~leosun08], could you help review this? Thanks!

> Remove rpcThreadPool from DeadNodeDetector.
> ---
>
> Key: HDFS-15899
> URL: https://issues.apache.org/jira/browse/HDFS-15899
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15899.001.patch
>
>
> The DeadNodeDetector uses a thread pool to do all the probe rpc calls. The 
> purpose is to use the thread pool timeout to monitor the probe timeout. But 
> the rpc client already has a timeout. We can use the rpc client timeout 
> instead of the thread pool timeout and remove the rpcThreadPool.
> The rpcThreadPool introduces additional complexity for probing the DataNode. 
> The probe task waiting in the busy rpcThreadPool might exceed the configured 
> timeout. The probe task will be marked as failed even if it is not scheduled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.

2021-03-16 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15899:
---
Attachment: HDFS-15899.001.patch
Status: Patch Available  (was: Open)

> Remove rpcThreadPool from DeadNodeDetector.
> ---
>
> Key: HDFS-15899
> URL: https://issues.apache.org/jira/browse/HDFS-15899
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15899.001.patch
>
>
> The DeadNodeDetector uses a thread pool to do all the probe rpc calls. The 
> purpose is to use the thread pool timeout to monitor the probe timeout. But 
> the rpc client already has a timeout. We can use the rpc client timeout 
> instead of the thread pool timeout and remove the rpcThreadPool.
> The rpcThreadPool introduces additional complexity for probing the DataNode. 
> The probe task waiting in the busy rpcThreadPool might exceed the configured 
> timeout. The probe task will be marked as failed even if it is not scheduled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.

2021-03-16 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun reassigned HDFS-15899:
--

Assignee: Jinglun

> Remove rpcThreadPool from DeadNodeDetector.
> ---
>
> Key: HDFS-15899
> URL: https://issues.apache.org/jira/browse/HDFS-15899
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
>
> The DeadNodeDetector uses a thread pool to do all the probe rpc calls. The 
> purpose is to use the thread pool timeout to monitor the probe timeout. But 
> the rpc client already has a timeout. We can use the rpc client timeout 
> instead of the thread pool timeout and remove the rpcThreadPool.
> The rpcThreadPool introduces additional complexity for probing the DataNode. 
> The probe task waiting in the busy rpcThreadPool might exceed the configured 
> timeout. The probe task will be marked as failed even if it is not scheduled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.

2021-03-16 Thread Jinglun (Jira)
Jinglun created HDFS-15899:
--

 Summary: Remove rpcThreadPool from DeadNodeDetector.
 Key: HDFS-15899
 URL: https://issues.apache.org/jira/browse/HDFS-15899
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Jinglun


The DeadNodeDetector uses a thread pool to do all the probe rpc calls. The 
purpose is to use the thread pool timeout to monitor the probe timeout. But the 
rpc client already has a timeout. We can use the rpc client timeout instead of 
the thread pool timeout and remove the rpcThreadPool.

The rpcThreadPool introduces additional complexity for probing the DataNode. 
The probe task waiting in the busy rpcThreadPool might exceed the configured 
timeout. The probe task will be marked as failed even if it is not scheduled.
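
The pitfall is easy to demonstrate outside HDFS; a minimal plain-Java 
illustration (not DeadNodeDetector code):
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// With a pool-side timeout, a task that merely waits in a busy pool is
// reported as timed out even though it never started running.
public class PoolTimeoutPitfall {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(1);
    pool.submit(() -> { Thread.sleep(5000); return null; }); // occupies the only thread
    Future<Object> probe = pool.submit(() -> null);          // the "probe" just waits
    try {
      probe.get(1, TimeUnit.SECONDS);                        // pool-side timeout
    } catch (TimeoutException e) {
      // Declared failed although it was never scheduled.
      System.out.println("probe marked as failed without ever running");
    }
    pool.shutdownNow();
  }
}
{code}
Relying on the rpc client's own timeout instead ties the failure to an rpc 
call that actually ran.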



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-03-07 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15809:
---
Attachment: HDFS-15809.007.patch

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --
>
> Key: HDFS-15809
> URL: https://issues.apache.org/jira/browse/HDFS-15809
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, 
> HDFS-15809.003.patch, HDFS-15809.004.patch, HDFS-15809.005.patch, 
> HDFS-15809.006.patch, HDFS-15809.007.patch
>
>
> We found the dead node detector might never remove the alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as the last time. So the 30 nodes that have already been probed are added 
> to the queue again.
>  # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If 
> they are all dead then the live nodes behind them could never be recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-03-06 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15809:
---
Attachment: HDFS-15809.006.patch

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --
>
> Key: HDFS-15809
> URL: https://issues.apache.org/jira/browse/HDFS-15809
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, 
> HDFS-15809.003.patch, HDFS-15809.004.patch, HDFS-15809.005.patch, 
> HDFS-15809.006.patch
>
>
> We found the dead node detector might never remove the alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as the last time. So the 30 nodes that have already been probed are added 
> to the queue again.
>  # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If 
> they are all dead then the live nodes behind them could never be recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-03-06 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296505#comment-17296505
 ] 

Jinglun edited comment on HDFS-15809 at 3/6/21, 10:51 AM:
--

After a discussion with [~leosun08], we decided to abandon the 
Collections.synchronizedSet(new LinkedHashSet<>()) plan because the interface 
is not friendly and makes the patch more complicated. It also makes the probe 
queue hard to spy (a set wrapped by synchronized would be final).

Thanks to [~leosun08] for the suggestions on design and unit tests. Submitting 
v05, which fixes the unit test.


was (Author: lijinglun):
After an offline discussion with [~leosun08], we decided to abandon the 
Collections.synchronizedSet(new LinkedHashSet<>()) plan because the interface 
is not friendly and makes the patch more complicated. It also makes the probe 
queue hard to spy (a set wrapped by synchronized would be final).

Thanks to [~leosun08] for the suggestions on design and unit tests. Submitting 
v05, which fixes the unit test.

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --
>
> Key: HDFS-15809
> URL: https://issues.apache.org/jira/browse/HDFS-15809
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, 
> HDFS-15809.003.patch, HDFS-15809.004.patch, HDFS-15809.005.patch
>
>
> We found the dead node detector might never remove the alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as the last time. So the 30 nodes that have already been probed are added 
> to the queue again.
>  # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If 
> they are all dead then the live nodes behind them could never be recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-03-06 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296505#comment-17296505
 ] 

Jinglun commented on HDFS-15809:


After an offline discussion with [~leosun08], we decided to abandon the 
Collections.synchronizedSet(new LinkedHashSet<>()) plan because the interface 
is not friendly and makes the patch more complicated. It also makes the probe 
queue hard to spy (a set wrapped by synchronized would be final).

Thanks to [~leosun08] for the suggestions on design and unit tests. Submitting 
v05, which fixes the unit test.

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --
>
> Key: HDFS-15809
> URL: https://issues.apache.org/jira/browse/HDFS-15809
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, 
> HDFS-15809.003.patch, HDFS-15809.004.patch, HDFS-15809.005.patch
>
>
> We found the dead node detector might never remove the alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as the last time. So the 30 nodes that have already been probed are added 
> to the queue again.
>  # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If 
> they are all dead then the live nodes behind them could never be recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-03-06 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15809:
---
Attachment: HDFS-15809.005.patch

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --
>
> Key: HDFS-15809
> URL: https://issues.apache.org/jira/browse/HDFS-15809
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, 
> HDFS-15809.003.patch, HDFS-15809.004.patch, HDFS-15809.005.patch
>
>
> We found the dead node detector might never remove the alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as the last time. So the 30 nodes that have already been probed are added 
> to the queue again.
>  # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If 
> they are all dead then the live nodes behind them could never be recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-02-23 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289599#comment-17289599
 ] 

Jinglun commented on HDFS-15809:


Hi [~leosun08], thanks for your comments. Submitting v04 using LinkedHashSet. 
The test case testDeadNodeDetectionDeadNodeProbe covers the situation: it 
verifies the whole process of the DeadNodeDetector. One node should first be 
put into the suspect queue, then marked as dead, and finally probed from the 
dead queue multiple times. In the original implementation the 3 datanodes 
wouldn't all be dead.
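
For anyone following along, an illustrative-only sketch (not the patch) of why 
the LinkedHashSet helps: rotating probed nodes to the tail means the next scan 
starts with nodes that have not been probed yet. Note that a plain re-add does 
not move an existing element, so a remove is needed first.
{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;

// Toy model of the dead-node set: hand out up to 'limit' nodes for probing
// and rotate them to the tail so every node eventually gets probed.
class DeadNodeRotation<T> {
  private final LinkedHashSet<T> deadNodes = new LinkedHashSet<>();

  synchronized void add(T node) { deadNodes.add(node); }

  synchronized List<T> pollForProbe(int limit) {
    List<T> batch = new ArrayList<>();
    Iterator<T> it = deadNodes.iterator();
    while (it.hasNext() && batch.size() < limit) {
      batch.add(it.next());
    }
    // Remove then re-add: re-insertion of an existing element would keep
    // its old position, so removing first is what moves it to the tail.
    deadNodes.removeAll(batch);
    deadNodes.addAll(batch);
    return batch;
  }
}
{code}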

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --
>
> Key: HDFS-15809
> URL: https://issues.apache.org/jira/browse/HDFS-15809
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, 
> HDFS-15809.003.patch, HDFS-15809.004.patch
>
>
> We found the dead node detector might never remove the alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as the last time. So the 30 nodes that have already been probed are added 
> to the queue again.
>  # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If 
> they are all dead then the live nodes behind them could never be recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-02-23 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15809:
---
Attachment: HDFS-15809.004.patch

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --
>
> Key: HDFS-15809
> URL: https://issues.apache.org/jira/browse/HDFS-15809
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, 
> HDFS-15809.003.patch, HDFS-15809.004.patch
>
>
> We found the dead node detector might never remove the alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as the last time. So the 30 nodes that have already been probed are added 
> to the queue again.
>  # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If 
> they are all dead then the live nodes behind them could never be recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15845) RBF: Router fails to start due to NoClassDefFoundError for hadoop-federation-balance

2021-02-22 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288800#comment-17288800
 ] 

Jinglun commented on HDFS-15845:


Hi [~tasanuma], thanks for your nice fix! LGTM +1.

> RBF: Router fails to start due to NoClassDefFoundError for 
> hadoop-federation-balance
> 
>
> Key: HDFS-15845
> URL: https://issues.apache.org/jira/browse/HDFS-15845
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {noformat}
> $ hdfs dfsrouter
> ...
> 2021-02-22 17:21:55,400 ERROR router.DFSRouter: Failed to start router
> java.lang.NoClassDefFoundError: 
> org/apache/hadoop/tools/fedbalance/procedure/BalanceProcedure
> at 
> org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.<init>(RouterClientProtocol.java:195)
> at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.<init>(RouterRpcServer.java:394)
> at 
> org.apache.hadoop.hdfs.server.federation.router.Router.createRpcServer(Router.java:391)
> at 
> org.apache.hadoop.hdfs.server.federation.router.Router.serviceInit(Router.java:188)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
> at 
> org.apache.hadoop.hdfs.server.federation.router.DFSRouter.main(DFSRouter.java:69)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.tools.fedbalance.procedure.BalanceProcedure
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 6 more
> 2021-02-22 17:21:55,402 INFO util.ExitUtil: Exiting with status 1: 
> java.lang.NoClassDefFoundError: 
> org/apache/hadoop/tools/fedbalance/procedure/BalanceProcedure
> 2021-02-22 17:21:55,404 INFO router.DFSRouter: SHUTDOWN_MSG:
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15845) RBF: Router fails to start due to NoClassDefFoundError for hadoop-federation-balance

2021-02-22 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288799#comment-17288799
 ] 

Jinglun commented on HDFS-15845:


My bad! I missed the classpath of the dfsrouter command.

Hi [~tasanuma], would you please try adding 
`hadoop_add_to_classpath_tools hadoop-federation-balance` to the dfsrouter 
command, like below?
{quote}File: hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs
{quote}
{quote}dfsrouter)
 HADOOP_SUBCMD_SUPPORTDAEMONIZATION="true"
 HADOOP_CLASSNAME='org.apache.hadoop.hdfs.server.federation.router.DFSRouter'
 hadoop_add_to_classpath_tools hadoop-federation-balance  # add this line
;;{quote}
 

> RBF: Router fails to start due to NoClassDefFoundError for 
> hadoop-federation-balance
> 
>
> Key: HDFS-15845
> URL: https://issues.apache.org/jira/browse/HDFS-15845
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {noformat}
> $ hdfs dfsrouter
> ...
> 2021-02-22 17:21:55,400 ERROR router.DFSRouter: Failed to start router
> java.lang.NoClassDefFoundError: 
> org/apache/hadoop/tools/fedbalance/procedure/BalanceProcedure
> at 
> org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.<init>(RouterClientProtocol.java:195)
> at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.<init>(RouterRpcServer.java:394)
> at 
> org.apache.hadoop.hdfs.server.federation.router.Router.createRpcServer(Router.java:391)
> at 
> org.apache.hadoop.hdfs.server.federation.router.Router.serviceInit(Router.java:188)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
> at 
> org.apache.hadoop.hdfs.server.federation.router.DFSRouter.main(DFSRouter.java:69)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.tools.fedbalance.procedure.BalanceProcedure
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 6 more
> 2021-02-22 17:21:55,402 INFO util.ExitUtil: Exiting with status 1: 
> java.lang.NoClassDefFoundError: 
> org/apache/hadoop/tools/fedbalance/procedure/BalanceProcedure
> 2021-02-22 17:21:55,404 INFO router.DFSRouter: SHUTDOWN_MSG:
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (HDFS-15845) RBF: Router fails to start due to NoClassDefFoundError for hadoop-federation-balance

2021-02-22 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15845:
---
Comment: was deleted

(was: My bad! I missed the classpath of the dfsrouter command.

Hi [~tasanuma], would you please try adding 
`hadoop_add_to_classpath_tools hadoop-federation-balance` to the dfsrouter 
command, like below?
{quote}File: hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs
{quote}
{quote}dfsrouter)
 HADOOP_SUBCMD_SUPPORTDAEMONIZATION="true"
 HADOOP_CLASSNAME='org.apache.hadoop.hdfs.server.federation.router.DFSRouter'
 hadoop_add_to_classpath_tools hadoop-federation-balance  # add this line
;;{quote}
 )

> RBF: Router fails to start due to NoClassDefFoundError for 
> hadoop-federation-balance
> 
>
> Key: HDFS-15845
> URL: https://issues.apache.org/jira/browse/HDFS-15845
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {noformat}
> $ hdfs dfsrouter
> ...
> 2021-02-22 17:21:55,400 ERROR router.DFSRouter: Failed to start router
> java.lang.NoClassDefFoundError: 
> org/apache/hadoop/tools/fedbalance/procedure/BalanceProcedure
> at 
> org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.<init>(RouterClientProtocol.java:195)
> at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.<init>(RouterRpcServer.java:394)
> at 
> org.apache.hadoop.hdfs.server.federation.router.Router.createRpcServer(Router.java:391)
> at 
> org.apache.hadoop.hdfs.server.federation.router.Router.serviceInit(Router.java:188)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
> at 
> org.apache.hadoop.hdfs.server.federation.router.DFSRouter.main(DFSRouter.java:69)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.tools.fedbalance.procedure.BalanceProcedure
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 6 more
> 2021-02-22 17:21:55,402 INFO util.ExitUtil: Exiting with status 1: 
> java.lang.NoClassDefFoundError: 
> org/apache/hadoop/tools/fedbalance/procedure/BalanceProcedure
> 2021-02-22 17:21:55,404 INFO router.DFSRouter: SHUTDOWN_MSG:
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-02-19 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17287477#comment-17287477
 ] 

Jinglun commented on HDFS-15809:


Submitted v03, fixing the checkstyle complaints and the unit test failures.

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --
>
> Key: HDFS-15809
> URL: https://issues.apache.org/jira/browse/HDFS-15809
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, 
> HDFS-15809.003.patch
>
>
> We found the dead node detector might never remove alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as last time, so the 30 nodes that have already been probed are added 
> to the queue again.
>  # Steps 3 and 4 repeat, but we always add the first 30 nodes from the dead 
> set. If they are all dead, the live nodes behind them can never be recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-02-19 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15809:
---
Attachment: HDFS-15809.003.patch

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --
>
> Key: HDFS-15809
> URL: https://issues.apache.org/jira/browse/HDFS-15809
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, 
> HDFS-15809.003.patch
>
>
> We found the dead node detector might never remove alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as last time, so the 30 nodes that have already been probed are added 
> to the queue again.
>  # Steps 3 and 4 repeat, but we always add the first 30 nodes from the dead 
> set. If they are all dead, the live nodes behind them can never be recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-02-19 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17287014#comment-17287014
 ] 

Jinglun commented on HDFS-15809:


I hadn't dealt with the checkstyle complaint and the patch is out of date now 
(cry). Re-uploading v02 to trigger Jenkins.

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --
>
> Key: HDFS-15809
> URL: https://issues.apache.org/jira/browse/HDFS-15809
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch
>
>
> We found the dead node detector might never remove alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as last time, so the 30 nodes that have already been probed are added 
> to the queue again.
>  # Steps 3 and 4 repeat, but we always add the first 30 nodes from the dead 
> set. If they are all dead, the live nodes behind them can never be recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-02-19 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15809:
---
Attachment: HDFS-15809.002.patch

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --
>
> Key: HDFS-15809
> URL: https://issues.apache.org/jira/browse/HDFS-15809
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch
>
>
> We found the dead node detector might never remove alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as last time, so the 30 nodes that have already been probed are added 
> to the queue again.
>  # Steps 3 and 4 repeat, but we always add the first 30 nodes from the dead 
> set. If they are all dead, the live nodes behind them can never be recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15806) DeadNodeDetector should close all the threads when it is closed.

2021-02-18 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17286871#comment-17286871
 ] 

Jinglun commented on HDFS-15806:


Hi [~ayushtkn], thanks for your comments!
{quote}before this was there some kind of memory leak, or these threads were 
getting cleared later?
{quote}
At Xiaomi we use the dead node detector feature only for HBase. HBase doesn't 
close the file system or the DFS client, so we hadn't noticed the leak before. 
Recently we found the dead node detector won't remove alive nodes from the 
dead node set, as described in HDFS-15809. While reviewing the whole feature I 
found this leak bug.
{quote}Secondly, for the shutdown is there some specific order, or it is just 
random
{quote}
It is random. Most of the threads are connected by queues (the 
producer-consumer model), so the order of stopping the producer or the 
consumer is not a problem.

1) The DeadNodeDetector thread is responsible for moving nodes from the 
_suspectAndDeadNodes_ set to the _deadNodesProbeQueue_.

2) The _probeDeadNodesSchedulerThr_ is responsible for taking nodes from the 
_deadNodesProbeQueue_ and submitting probe tasks to the 
_probeDeadNodesThreadPool_.

3) The _probeSuspectNodesSchedulerThr_ is responsible for taking nodes from 
the _suspectNodesProbeQueue_ and submitting probe tasks to the 
_probeSuspectNodesThreadPool_.

4) All the probe tasks issue getDatanodeInfo RPC calls through the 
_rpcThreadPool_.

Some other thoughts: the thread model is a little complicated and could be 
improved. For example, I think we could make the RPC call inside the probe 
task instead of submitting it to the rpcThreadPool. I need to first figure out 
the purpose of the original design, then maybe start a new Jira for the thread 
improvement later. A simplified sketch of one scheduler/pool pair follows.
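
To make the thread model above concrete, here is a much simplified, 
self-contained sketch of one scheduler/pool pair with a close() that stops 
every thread (the concern of this Jira). Class and method names are 
illustrative assumptions, not the DeadNodeDetector source:
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified producer-consumer sketch: a scheduler thread drains a probe
// queue and hands tasks to a pool; close() stops both, in any order.
public class ProbeScheduler implements AutoCloseable {
  private final BlockingQueue<String> probeQueue = new LinkedBlockingQueue<>();
  private final ExecutorService probePool = Executors.newFixedThreadPool(4);
  private final Thread scheduler;

  public ProbeScheduler() {
    scheduler = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          String node = probeQueue.take();      // consume from the queue
          probePool.submit(() -> probe(node));  // produce work for the pool
        } catch (InterruptedException e) {
          return;                               // close() interrupted us
        }
      }
    }, "probeScheduler");
    scheduler.start();
  }

  public void enqueue(String node) {
    probeQueue.add(node);
  }

  private void probe(String node) {
    // In the real detector this step issues a getDatanodeInfo RPC.
  }

  @Override
  public void close() {
    scheduler.interrupt();     // stopping producer or consumer first is fine
    probePool.shutdownNow();
  }
}
{code}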

> DeadNodeDetector should close all the threads when it is closed.
> 
>
> Key: HDFS-15806
> URL: https://issues.apache.org/jira/browse/HDFS-15806
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15806.001.patch
>
>
> The DeadNodeDetector doesn't close all the threads when it is closed. This 
> Jira tries to fix this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-02-18 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17286832#comment-17286832
 ] 

Jinglun edited comment on HDFS-15809 at 2/19/21, 3:23 AM:
--

Hi [~leosun08], thanks for your comments. The solution in v01 introduces a new 
deduplicated queue that rejects duplicate nodes. The size of the queue is not 
fixed either, so all the dead nodes can be added to it. Thus duplicated dead 
nodes can no longer be repeatedly added to the probe queue.

Because the queue itself is deduplicated, we don't need to worry about the 
queue size exploding: it is never greater than the number of datanodes.

Shuffle is a good idea and is a much simpler way, but I think the deduplicated 
way is more efficient because there are no duplicated probes.

Adjusting the queue size won't fix the problem because the queue accepts 
duplicated nodes. Even if the queue size is 10, it could still be filled up 
with the first 30 nodes.
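
For contrast, the shuffle alternative mentioned above would randomize the 
iteration order before each refill, so no fixed prefix of the dead set can 
monopolize the bounded queue. A sketch under assumed names (refill is 
hypothetical, not a DeadNodeDetector method):
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;

public class ShuffleRefill {
  // Sketch of the shuffle approach: randomize the dead nodes before
  // refilling the bounded probe queue. Simpler, but duplicates are still
  // possible, so some probes are wasted compared to a deduplicated queue.
  static void refill(List<String> deadNodeSet, Queue<String> probeQueue,
                     int capacity) {
    List<String> candidates = new ArrayList<>(deadNodeSet);
    Collections.shuffle(candidates);
    for (String node : candidates) {
      if (probeQueue.size() >= capacity) {
        break;  // the probe queue is bounded, as in checkDeadNodes()
      }
      probeQueue.offer(node);
    }
  }
}
{code}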

 


was (Author: lijinglun):
Hi [~leosun08], thanks for your comments. The solution in v01 is to avoid 
adding duplicated dead nodes to the probe queue, so the queue won't be filled 
up with duplicates.

Shuffle is a good idea and is a much simpler way. I could also agree with the 
shuffle approach.

 

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --
>
> Key: HDFS-15809
> URL: https://issues.apache.org/jira/browse/HDFS-15809
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15809.001.patch
>
>
> We found the dead node detector might never remove alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as last time, so the 30 nodes that have already been probed are added 
> to the queue again.
>  # Steps 3 and 4 repeat, but we always add the first 30 nodes from the dead 
> set. If they are all dead, the live nodes behind them can never be recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-02-18 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17286832#comment-17286832
 ] 

Jinglun commented on HDFS-15809:


Hi [~leosun08], thanks for your comments. The solution in v01 is to avoid 
adding duplicated dead nodes to the probe queue, so the queue won't be filled 
up with duplicates.

Shuffle is a good idea and is a much simpler way. I could also agree with the 
shuffle approach.

 

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --
>
> Key: HDFS-15809
> URL: https://issues.apache.org/jira/browse/HDFS-15809
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15809.001.patch
>
>
> We found the dead node detector might never remove alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as last time, so the 30 nodes that have already been probed are added 
> to the queue again.
>  # Steps 3 and 4 repeat, but we always add the first 30 nodes from the dead 
> set. If they are all dead, the live nodes behind them can never be recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-02-01 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15809:
---
Attachment: HDFS-15809.001.patch
Status: Patch Available  (was: Open)

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --
>
> Key: HDFS-15809
> URL: https://issues.apache.org/jira/browse/HDFS-15809
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15809.001.patch
>
>
> We found the dead node detector might never remove alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as last time, so the 30 nodes that have already been probed are added 
> to the queue again.
>  # Steps 3 and 4 repeat, but we always add the first 30 nodes from the dead 
> set. If they are all dead, the live nodes behind them can never be recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-02-01 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276326#comment-17276326
 ] 

Jinglun commented on HDFS-15809:


Hi [~leosun08], could you help review this? Thanks!

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --
>
> Key: HDFS-15809
> URL: https://issues.apache.org/jira/browse/HDFS-15809
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15809.001.patch
>
>
> We found the dead node detector might never remove alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as last time, so the 30 nodes that have already been probed are added 
> to the queue again.
>  # Steps 3 and 4 repeat, but we always add the first 30 nodes from the dead 
> set. If they are all dead, the live nodes behind them can never be recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-02-01 Thread Jinglun (Jira)
Jinglun created HDFS-15809:
--

 Summary: DeadNodeDetector doesn't remove live nodes from dead node 
set.
 Key: HDFS-15809
 URL: https://issues.apache.org/jira/browse/HDFS-15809
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Jinglun


We found the dead node detector might never remove alive nodes from the 
dead node set in a big cluster. For example:
 # 200 nodes are added to the dead node set by DeadNodeDetector.
 # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the deadNodesProbeQueue 
because the queue's length limit is 100.
 # The probe threads start working and probe 30 nodes.
 # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
same as last time, so the 30 nodes that have already been probed are added 
to the queue again.
 # Steps 3 and 4 repeat, but we always add the first 30 nodes from the dead 
set. If they are all dead, the live nodes behind them can never be recovered.
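
A toy reproduction of the starvation (illustrative only, not HDFS code) shows 
the effect with the numbers above: only the first 100 nodes are ever probed, 
and nodes 100-199 are never reached:
{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class StarvationDemo {
  public static void main(String[] args) {
    List<Integer> deadSet = new ArrayList<>();
    for (int i = 0; i < 200; i++) {
      deadSet.add(i);                      // 200 permanently "dead" nodes
    }
    Deque<Integer> probeQueue = new ArrayDeque<>();
    Set<Integer> everProbed = new TreeSet<>();
    for (int round = 0; round < 100; round++) {
      // checkDeadNodes(): refill the bounded queue in a fixed order,
      // duplicates allowed.
      for (int node : deadSet) {
        if (probeQueue.size() >= 100) {
          break;
        }
        probeQueue.add(node);
      }
      // The probe threads drain only 30 nodes before the next refill.
      for (int i = 0; i < 30 && !probeQueue.isEmpty(); i++) {
        everProbed.add(probeQueue.poll());
      }
    }
    // Prints 100: only nodes 0..99 were ever probed; 100..199 starved.
    System.out.println("distinct nodes probed: " + everProbed.size());
  }
}
{code}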



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

2021-02-01 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun reassigned HDFS-15809:
--

Assignee: Jinglun

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --
>
> Key: HDFS-15809
> URL: https://issues.apache.org/jira/browse/HDFS-15809
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
>
> We found the dead node detector might never remove alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as last time, so the 30 nodes that have already been probed are added 
> to the queue again.
>  # Steps 3 and 4 repeat, but we always add the first 30 nodes from the dead 
> set. If they are all dead, the live nodes behind them can never be recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15806) DeadNodeDetector should close all the threads when it is closed.

2021-01-31 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15806:
---
Attachment: HDFS-15806.001.patch
Status: Patch Available  (was: Open)

> DeadNodeDetector should close all the threads when it is closed.
> 
>
> Key: HDFS-15806
> URL: https://issues.apache.org/jira/browse/HDFS-15806
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15806.001.patch
>
>
> The DeadNodeDetector doesn't close all the threads when it is closed. This 
> Jira tries to fix this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15806) DeadNodeDetector should close all the threads when it is closed.

2021-01-31 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276095#comment-17276095
 ] 

Jinglun commented on HDFS-15806:


Hi [~leosun08], would you help review this? Thanks!

> DeadNodeDetector should close all the threads when it is closed.
> 
>
> Key: HDFS-15806
> URL: https://issues.apache.org/jira/browse/HDFS-15806
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15806.001.patch
>
>
> The DeadNodeDetector doesn't close all the threads when it is closed. This 
> Jira tries to fix this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15806) DeadNodeDetector should close all the threads when it is closed.

2021-01-31 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun reassigned HDFS-15806:
--

Assignee: Jinglun

> DeadNodeDetector should close all the threads when it is closed.
> 
>
> Key: HDFS-15806
> URL: https://issues.apache.org/jira/browse/HDFS-15806
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
>
> The DeadNodeDetector doesn't close all the threads when it is closed. This 
> Jira tries to fix this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15806) DeadNodeDetector should close all the threads when it is closed.

2021-01-31 Thread Jinglun (Jira)
Jinglun created HDFS-15806:
--

 Summary: DeadNodeDetector should close all the threads when it is 
closed.
 Key: HDFS-15806
 URL: https://issues.apache.org/jira/browse/HDFS-15806
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Jinglun


The DeadNodeDetector doesn't close all the threads when it is closed. This Jira 
tries to fix this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15661) The DeadNodeDetector shouldn't be shared by different DFSClients.

2021-01-25 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271212#comment-17271212
 ] 

Jinglun commented on HDFS-15661:


Hi [~leosun08], thanks for your comments! Fixed checkstyle and submitted v05. 
The failed unit tests run fine on my local machine, so they should not be 
related.

> The DeadNodeDetector shouldn't be shared by different DFSClients.
> -
>
> Key: HDFS-15661
> URL: https://issues.apache.org/jira/browse/HDFS-15661
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15661.001.patch, HDFS-15661.002.patch, 
> HDFS-15661.003.patch, HDFS-15661.004.patch, HDFS-15661.005.patch
>
>
> Currently the DeadNodeDetector is a member of ClientContext, which means it is 
> shared by many different DFSClients. When one DFSClient.close() is invoked, 
> the DeadNodeDetector thread is interrupted, impacting the other DFSClients.
> From the original design in HDFS-13571 we can see the DeadNodeDetector is 
> supposed to share dead nodes across the many input streams of the same client. 
> We should make the DeadNodeDetector a member of DFSClient instead of 
> ClientContext. 
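
As a schematic illustration of the proposed ownership change, the classes 
below are simplified stand-ins, not the real ClientContext/DFSClient sources:
{code:java}
// Illustrative stand-ins only: the point is field placement, not API fidelity.
class DeadNodeDetector {
  void shutdown() {
    // stop the detector's probe threads
  }
}

class ClientContext {
  // Before: one shared detector lived here, so every DFSClient reusing this
  // context shared it, and one client's close() interrupted it for all.
}

class DFSClient implements AutoCloseable {
  // After: each client owns its own detector, shared only by the input
  // streams of that client.
  private final DeadNodeDetector deadNodeDetector = new DeadNodeDetector();

  @Override
  public void close() {
    deadNodeDetector.shutdown();  // stops only this client's detector
  }
}
{code}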



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


