[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=583141=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-583141
 ]

ASF GitHub Bot logged work on HDFS-15869:
-

Author: ASF GitHub Bot
Created on: 15/Apr/21 05:19
Start Date: 15/Apr/21 05:19
Worklog Time Spent: 10m 
  Work Description: functioner commented on pull request #2737:
URL: https://github.com/apache/hadoop/pull/2737#issuecomment-820106508


   @Hexiaoqiao thanks for the reminder. I've added:
   1. comments for removing `logSyncNotifyExecutor.shutdown()`
   2. configuration for the size of `logSyncNotifyExecutor`
   3. the docs for that configuration
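
   The second item can be pictured with the minimal sketch below. It is only an illustration, not the code in the pull request: the key `example.edits.async.notify.threads`, the default of 10, and the class `NotifyPoolSketch` are hypothetical names. The only idea taken from this thread is that notifications are handed to an executor whose size is read from configuration, so one stuck notification does not have to stall the others.

```java
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class NotifyPoolSketch {
  // Hypothetical key and default, for illustration only.
  static final String SIZE_KEY = "example.edits.async.notify.threads";
  static final int SIZE_DEFAULT = 10;

  /** Reads the pool size; missing, malformed, or non-positive values fall back to the default. */
  static int poolSize(Properties conf) {
    String raw = conf.getProperty(SIZE_KEY);
    if (raw == null) {
      return SIZE_DEFAULT;
    }
    try {
      int n = Integer.parseInt(raw.trim());
      return n > 0 ? n : SIZE_DEFAULT;
    } catch (NumberFormatException e) {
      return SIZE_DEFAULT;
    }
  }

  /**
   * Submits one notification that stays blocked (standing in for a stuck network
   * write), then several fast ones, and returns how many fast ones finished
   * within 500 ms.
   */
  static int fastNotifiesCompleted(int threads, int fastTasks) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch unblock = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(fastTasks);
    try {
      pool.execute(() -> awaitQuietly(unblock, Long.MAX_VALUE));
      for (int i = 0; i < fastTasks; i++) {
        pool.execute(done::countDown);
      }
      awaitQuietly(done, 500);
      return fastTasks - (int) done.getCount();
    } finally {
      unblock.countDown(); // let the blocked task finish so the pool can shut down
      pool.shutdown();
    }
  }

  private static void awaitQuietly(CountDownLatch latch, long millis) {
    try {
      latch.await(millis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(SIZE_KEY, "2");
    System.out.println("pool size = " + poolSize(conf));
    System.out.println("1 thread:  " + fastNotifiesCompleted(1, 3) + " fast notifications done");
    System.out.println("2 threads: " + fastNotifiesCompleted(2, 3) + " fast notifications done");
  }
}
```

   With a single thread, the blocked task occupies the only worker and no fast notification completes. With two or more threads, the fast notifications are expected to get through, which is why the pool size is worth exposing as a setting.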


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 583141)
Time Spent: 2.5h  (was: 2h 20m)

> Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can 
> cause the namenode to hang
> 
>
> Key: HDFS-15869
> URL: https://issues.apache.org/jira/browse/HDFS-15869
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs async, namenode
>Affects Versions: 3.2.2
>Reporter: Haoze Wu
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
>     While testing the latest stable Hadoop release, 3.2.2, we found that a 
> network issue can cause the namenode to hang even with asynchronous edit 
> logging (FSEditLogAsync).
>     The workflow of the FSEditLogAsync thread is basically:
>  # get EditLog from a queue (line 229)
>  # do the transaction (line 232)
>  # sync the log if doSync (line 243)
>  # do logSyncNotify (line 248)
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
>   @Override
>   public void run() {
>     try {
>       while (true) {
>         boolean doSync;
>         Edit edit = dequeueEdit(); // line 229
>         if (edit != null) {
>           // sync if requested by edit log.
>           doSync = edit.logEdit(); // line 232
>           syncWaitQ.add(edit);
>         } else {
>           // sync when editq runs dry, but have edits pending a sync.
>           doSync = !syncWaitQ.isEmpty();
>         }
>         if (doSync) {
>           // normally edit log exceptions cause the NN to terminate, but tests
>           // relying on ExitUtil.terminate need to see the exception.
>           RuntimeException syncEx = null;
>           try {
>             logSync(getLastWrittenTxId()); // line 243
>           } catch (RuntimeException ex) {
>             syncEx = ex;
>           }
>           while ((edit = syncWaitQ.poll()) != null) {
>             edit.logSyncNotify(syncEx); // line 248
>           }
>         }
>       }
>     } catch (InterruptedException ie) {
>       LOG.info(Thread.currentThread().getName() + " was interrupted, exiting");
>     } catch (Throwable t) {
>       terminate(t);
>     }
>   }
> {code}
>     In step 4, FSEditLogAsync$RpcEdit.logSyncNotify essentially performs a 
> network write (line 365).
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
>   private static class RpcEdit extends Edit {
>     // ...
>     @Override
>     public void logSyncNotify(RuntimeException syncEx) {
>       try {
>         if (syncEx == null) {
>           call.sendResponse(); // line 365
>         } else {
>           call.abortResponse(syncEx);
>         }
>       } catch (Exception e) {} // don't care if not sent.
>     }
>     // ...
>   }
> {code}
>     If the sendResponse operation at line 365 gets stuck, the whole 
> FSEditLogAsync thread cannot proceed. The critical logSync (line 243) then 
> cannot run for incoming transactions, and the namenode hangs. This is 
> undesirable because the key feature of FSEditLogAsync is asynchronous edit 
> logging, which is supposed to tolerate slow I/O.
>     To see why the sendResponse operation in line 365 may get stuck, here is 
> the stack trace:
> {code:java}
>  '(org.apache.hadoop.ipc.Server,channelWrite,3593)',
>  '(org.apache.hadoop.ipc.Server,access$1700,139)',
>  '(org.apache.hadoop.ipc.Server$Responder,processResponse,1657)',
>  

[jira] [Work logged] (HDFS-15963) Unreleased volume references cause an infinite loop

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15963?focusedWorklogId=583134=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-583134
 ]

ASF GitHub Bot logged work on HDFS-15963:
-

Author: ASF GitHub Bot
Created on: 15/Apr/21 05:00
Start Date: 15/Apr/21 05:00
Worklog Time Spent: 10m 
  Work Description: Hexiaoqiao commented on pull request #2889:
URL: https://github.com/apache/hadoop/pull/2889#issuecomment-820096203


   Thanks @zhangshuyan0 for your great catch here. LGTM. +1 from my side.




Issue Time Tracking
---

Worklog Id: (was: 583134)
Time Spent: 3h  (was: 2h 50m)

> Unreleased volume references cause an infinite loop
> ---
>
> Key: HDFS-15963
> URL: https://issues.apache.org/jira/browse/HDFS-15963
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-15963.001.patch, HDFS-15963.002.patch, 
> HDFS-15963.003.patch
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> When BlockSender throws an exception because the meta-data cannot be found, 
> the volume reference obtained by the thread is not released, which causes the 
> thread trying to remove the volume to wait and fall into an infinite loop.
> {code:java}
> boolean checkVolumesRemoved() {
>   Iterator<FsVolumeImpl> it = volumesBeingRemoved.iterator();
>   while (it.hasNext()) {
>     FsVolumeImpl volume = it.next();
>     if (!volume.checkClosed()) {
>       return false;
>     }
>     it.remove();
>   }
>   return true;
> }
> boolean checkClosed() {
>   // Always true once a reference has leaked.
>   if (this.reference.getReferenceCount() > 0) {
>     FsDatasetImpl.LOG.debug("The reference count for {} is {}, wait to be 0.",
>         this, reference.getReferenceCount());
>     return false;
>   }
>   return true;
> }
> {code}
> At the same time, because the thread has been holding checkDirsLock when 
> removing the volume, other threads trying to acquire the same lock will be 
> permanently blocked.
> Similar problems also occur in RamDiskAsyncLazyPersistService and 
> FsDatasetAsyncDiskService.
> This patch releases the three previously unreleased volume references.
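
The leak pattern described above, and one common way to avoid it, can be sketched as follows. This is a simplified illustration under stated assumptions, not the patch itself: `Volume`, `openLeaky`, and `openSafe` are hypothetical stand-ins, and try-with-resources is used here only as one idiomatic way to guarantee the release on every path.

```java
import java.io.Closeable;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

final class VolumeRefSketch {
  /** Hypothetical stand-in for a volume that can only be removed at reference count zero. */
  static final class Volume {
    private final AtomicInteger refs = new AtomicInteger();

    /** Takes a reference; closing the returned handle gives it back. */
    Closeable obtainReference() {
      refs.incrementAndGet();
      return () -> refs.decrementAndGet();
    }

    boolean checkClosed() {
      return refs.get() == 0;
    }
  }

  /** Leaky pattern: the exception skips the release, so the count never returns to zero. */
  static void openLeaky(Volume v, boolean metaMissing) throws IOException {
    Closeable ref = v.obtainReference();
    if (metaMissing) {
      throw new FileNotFoundException("meta file not found");
    }
    ref.close();
  }

  /** Safe pattern: try-with-resources releases the reference on every exit path. */
  static void openSafe(Volume v, boolean metaMissing) throws IOException {
    try (Closeable ref = v.obtainReference()) {
      if (metaMissing) {
        throw new FileNotFoundException("meta file not found");
      }
    }
  }

  /** Simulates a missing meta file and reports whether the volume could then be removed. */
  static boolean removableAfterFailure(boolean safe) {
    Volume v = new Volume();
    try {
      if (safe) {
        openSafe(v, true);
      } else {
        openLeaky(v, true);
      }
    } catch (IOException expected) {
      // The failure is expected; only the remaining reference count matters here.
    }
    return v.checkClosed();
  }

  public static void main(String[] args) {
    System.out.println("leaky: removable = " + removableAfterFailure(false));
    System.out.println("safe:  removable = " + removableAfterFailure(true));
  }
}
```

In the leaky variant the count stays above zero forever, which corresponds to a removal loop that keeps waiting for `checkClosed()` to succeed.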



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=583125=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-583125
 ]

ASF GitHub Bot logged work on HDFS-15869:
-

Author: ASF GitHub Bot
Created on: 15/Apr/21 04:12
Start Date: 15/Apr/21 04:12
Worklog Time Spent: 10m 
  Work Description: Hexiaoqiao commented on pull request #2737:
URL: https://github.com/apache/hadoop/pull/2737#issuecomment-820080877


   @functioner after reviewing the code, the comment from Yiqun seems 
unresolved. Would you mind making the thread pool size configurable?




Issue Time Tracking
---

Worklog Id: (was: 583125)
Time Spent: 2h 20m  (was: 2h 10m)

> Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can 
> cause the namenode to hang
> 
>
> Key: HDFS-15869
> URL: https://issues.apache.org/jira/browse/HDFS-15869
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs async, namenode
>Affects Versions: 3.2.2
>Reporter: Haoze Wu
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
>     While testing the latest stable Hadoop release, 3.2.2, we found that a 
> network issue can cause the namenode to hang even with asynchronous edit 
> logging (FSEditLogAsync).
>     The workflow of the FSEditLogAsync thread is basically:
>  # get EditLog from a queue (line 229)
>  # do the transaction (line 232)
>  # sync the log if doSync (line 243)
>  # do logSyncNotify (line 248)
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
>   @Override
>   public void run() {
>     try {
>       while (true) {
>         boolean doSync;
>         Edit edit = dequeueEdit(); // line 229
>         if (edit != null) {
>           // sync if requested by edit log.
>           doSync = edit.logEdit(); // line 232
>           syncWaitQ.add(edit);
>         } else {
>           // sync when editq runs dry, but have edits pending a sync.
>           doSync = !syncWaitQ.isEmpty();
>         }
>         if (doSync) {
>           // normally edit log exceptions cause the NN to terminate, but tests
>           // relying on ExitUtil.terminate need to see the exception.
>           RuntimeException syncEx = null;
>           try {
>             logSync(getLastWrittenTxId()); // line 243
>           } catch (RuntimeException ex) {
>             syncEx = ex;
>           }
>           while ((edit = syncWaitQ.poll()) != null) {
>             edit.logSyncNotify(syncEx); // line 248
>           }
>         }
>       }
>     } catch (InterruptedException ie) {
>       LOG.info(Thread.currentThread().getName() + " was interrupted, exiting");
>     } catch (Throwable t) {
>       terminate(t);
>     }
>   }
> {code}
>     In step 4, FSEditLogAsync$RpcEdit.logSyncNotify essentially performs a 
> network write (line 365).
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
>   private static class RpcEdit extends Edit {
>     // ...
>     @Override
>     public void logSyncNotify(RuntimeException syncEx) {
>       try {
>         if (syncEx == null) {
>           call.sendResponse(); // line 365
>         } else {
>           call.abortResponse(syncEx);
>         }
>       } catch (Exception e) {} // don't care if not sent.
>     }
>     // ...
>   }
> {code}
>     If the sendResponse operation at line 365 gets stuck, the whole 
> FSEditLogAsync thread cannot proceed. The critical logSync (line 243) then 
> cannot run for incoming transactions, and the namenode hangs. This is 
> undesirable because the key feature of FSEditLogAsync is asynchronous edit 
> logging, which is supposed to tolerate slow I/O.
>     To see why the sendResponse operation in line 365 may get stuck, here is 
> the stack trace:
> {code:java}
>  '(org.apache.hadoop.ipc.Server,channelWrite,3593)',
>  '(org.apache.hadoop.ipc.Server,access$1700,139)',
>  '(org.apache.hadoop.ipc.Server$Responder,processResponse,1657)',
>  '(org.apache.hadoop.ipc.Server$Responder,doRespond,1727)',
>  

[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.

2021-04-14 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321889#comment-17321889
 ] 

Jinglun commented on HDFS-15973:


Hi [~zhengzhuobinzzb], thanks for your comments. Using FileSystem.access() is 
better; I overlooked that RPC :P. I'll submit v02 using access(), and the 
second point can be handled too.

> RBF: Add permission check before doing router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch
>
>
> The router federation rename lacks a permission check, which is a security 
> issue.






[jira] [Commented] (HDFS-15974) RBF: Unable to display the datanode UI of the router

2021-04-14 Thread zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321881#comment-17321881
 ] 

zhu commented on HDFS-15974:


Thank you [~elgoiri]. The function GetEnteringMaintenanceNodes returns "N/A", 
which causes a JSON parsing exception on the page. I think having this function 
return null solves the problem, and my test confirms it.

*The debug exception is as follows:*

SyntaxError: Unexpected token N in JSON at position 0
 at JSON.parse (<anonymous>)
 at workaround (federationhealth.js:321)
 at Object.<anonymous> (federationhealth.js:398)
 at Object.success (federationhealth.js:504)
 at c (jquery-3.5.1.min.js:2)
 at Object.fireWith [as resolveWith] (jquery-3.5.1.min.js:2)
 at l (jquery-3.5.1.min.js:2)
 at XMLHttpRequest.<anonymous> (jquery-3.5.1.min.js:2)

 

!image-2021-04-15-11-36-47-644.png!

> RBF: Unable to display the datanode UI of the router
> 
>
> Key: HDFS-15974
> URL: https://issues.apache.org/jira/browse/HDFS-15974
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf, ui
>Affects Versions: 3.4.0
>Reporter: zhu
>Priority: Major
> Attachments: HDFS-15358-1.patch, image-2021-04-15-11-36-47-644.png
>
>
> Clicking the Datanodes tab on the Router UI gets no response.






[jira] [Updated] (HDFS-15974) RBF: Unable to display the datanode UI of the router

2021-04-14 Thread zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhu updated HDFS-15974:
---
Attachment: image-2021-04-15-11-36-47-644.png

> RBF: Unable to display the datanode UI of the router
> 
>
> Key: HDFS-15974
> URL: https://issues.apache.org/jira/browse/HDFS-15974
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf, ui
>Affects Versions: 3.4.0
>Reporter: zhu
>Priority: Major
> Attachments: HDFS-15358-1.patch, image-2021-04-15-11-36-47-644.png
>
>
> Clicking the Datanodes tab on the Router UI gets no response.






[jira] [Updated] (HDFS-15974) RBF: Unable to display the datanode UI of the router

2021-04-14 Thread zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhu updated HDFS-15974:
---
Attachment: (was: image-2021-04-15-10-05-35-985.png)

> RBF: Unable to display the datanode UI of the router
> 
>
> Key: HDFS-15974
> URL: https://issues.apache.org/jira/browse/HDFS-15974
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf, ui
>Affects Versions: 3.4.0
>Reporter: zhu
>Priority: Major
> Attachments: HDFS-15358-1.patch, image-2021-04-15-11-36-47-644.png
>
>
> Clicking the Datanodes tab on the Router UI gets no response.






[jira] [Updated] (HDFS-15940) Some tests in TestBlockRecovery are consistently failing

2021-04-14 Thread Hui Fei (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui Fei updated HDFS-15940:
---
Fix Version/s: 3.2.3

> Some tests in TestBlockRecovery are consistently failing
> 
>
> Key: HDFS-15940
> URL: https://issues.apache.org/jira/browse/HDFS-15940
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.2.3
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Some long-running tests in TestBlockRecovery are consistently failing. Also, 
> TestBlockRecovery is huge, with many tests; we should move some of the 
> long-running and race-condition-specific tests to a separate class.






[jira] [Work logged] (HDFS-15940) Some tests in TestBlockRecovery are consistently failing

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15940?focusedWorklogId=583107=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-583107
 ]

ASF GitHub Bot logged work on HDFS-15940:
-

Author: ASF GitHub Bot
Created on: 15/Apr/21 03:20
Start Date: 15/Apr/21 03:20
Worklog Time Spent: 10m 
  Work Description: ferhui merged pull request #2902:
URL: https://github.com/apache/hadoop/pull/2902


   




Issue Time Tracking
---

Worklog Id: (was: 583107)
Time Spent: 6h 20m  (was: 6h 10m)

> Some tests in TestBlockRecovery are consistently failing
> 
>
> Key: HDFS-15940
> URL: https://issues.apache.org/jira/browse/HDFS-15940
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Some long-running tests in TestBlockRecovery are consistently failing. Also, 
> TestBlockRecovery is huge, with many tests; we should move some of the 
> long-running and race-condition-specific tests to a separate class.






[jira] [Work logged] (HDFS-15810) RBF: RBFMetrics's TotalCapacity out of bounds

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15810?focusedWorklogId=583106=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-583106
 ]

ASF GitHub Bot logged work on HDFS-15810:
-

Author: ASF GitHub Bot
Created on: 15/Apr/21 03:14
Start Date: 15/Apr/21 03:14
Worklog Time Spent: 10m 
  Work Description: cxorm commented on a change in pull request #2910:
URL: https://github.com/apache/hadoop/pull/2910#discussion_r613727691



##
File path: 
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/metrics/RBFMetrics.java
##
@@ -371,13 +372,13 @@ private static void setStateStoreVersions(
   }
 
   @Override
-  public long getTotalCapacity() {
-return getNameserviceAggregatedLong(MembershipStats::getTotalSpace);
+  public BigInteger getTotalCapacity() {

Review comment:
   Not sure whether changing the public API is appropriate.






Issue Time Tracking
---

Worklog Id: (was: 583106)
Time Spent: 0.5h  (was: 20m)

> RBF: RBFMetrics's TotalCapacity out of bounds
> -
>
> Key: HDFS-15810
> URL: https://issues.apache.org/jira/browse/HDFS-15810
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Xiaoxing Wei
>Assignee: Fengnan Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-02-02-10-59-17-113.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The long-typed fields TotalCapacity, UsedCapacity and RemainingCapacity in 
> RBFMetrics may go out of bounds (overflow).
> !image-2021-02-02-10-59-17-113.png!
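
The overflow reported here is ordinary `long` wrap-around when large values are added together. The sketch below shows the effect and the `BigInteger` alternative under discussion in the review; the class and method names are hypothetical and only illustrate the arithmetic, not the RBFMetrics code.

```java
import java.math.BigInteger;

final class CapacitySumSketch {
  /** Sums with primitive long arithmetic, which wraps around silently on overflow. */
  static long sumAsLong(long[] capacities) {
    long total = 0;
    for (long c : capacities) {
      total += c;
    }
    return total;
  }

  /** Sums with BigInteger, which grows as needed and cannot overflow. */
  static BigInteger sumAsBigInteger(long[] capacities) {
    BigInteger total = BigInteger.ZERO;
    for (long c : capacities) {
      total = total.add(BigInteger.valueOf(c));
    }
    return total;
  }

  public static void main(String[] args) {
    // Two hypothetical nameservices, each reporting the largest representable capacity.
    long[] capacities = {Long.MAX_VALUE, Long.MAX_VALUE};
    System.out.println(sumAsLong(capacities));       // prints -2
    System.out.println(sumAsBigInteger(capacities)); // prints 18446744073709551614
  }
}
```

The trade-off raised in the review still applies: switching the return type fixes the arithmetic but changes a public method signature.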






[jira] [Updated] (HDFS-15974) RBF: Unable to display the datanode UI of the router

2021-04-14 Thread zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhu updated HDFS-15974:
---
Attachment: (was: image-2021-04-15-10-31-27-915.png)

> RBF: Unable to display the datanode UI of the router
> 
>
> Key: HDFS-15974
> URL: https://issues.apache.org/jira/browse/HDFS-15974
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf, ui
>Affects Versions: 3.4.0
>Reporter: zhu
>Priority: Major
> Attachments: HDFS-15358-1.patch, image-2021-04-15-10-05-35-985.png
>
>
> Clicking the Datanodes tab on the Router UI gets no response.






[jira] [Work logged] (HDFS-15940) Some tests in TestBlockRecovery are consistently failing

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15940?focusedWorklogId=583104=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-583104
 ]

ASF GitHub Bot logged work on HDFS-15940:
-

Author: ASF GitHub Bot
Created on: 15/Apr/21 03:13
Start Date: 15/Apr/21 03:13
Worklog Time Spent: 10m 
  Work Description: ferhui commented on pull request #2902:
URL: https://github.com/apache/hadoop/pull/2902#issuecomment-820025360


   @jojochuang Thanks for this PR. @virajjasani Thanks for review!
   merged




Issue Time Tracking
---

Worklog Id: (was: 583104)
Time Spent: 6h 10m  (was: 6h)

> Some tests in TestBlockRecovery are consistently failing
> 
>
> Key: HDFS-15940
> URL: https://issues.apache.org/jira/browse/HDFS-15940
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> Some long-running tests in TestBlockRecovery are consistently failing. Also, 
> TestBlockRecovery is huge, with many tests; we should move some of the 
> long-running and race-condition-specific tests to a separate class.






[jira] [Comment Edited] (HDFS-15756) RBF: Cannot get updated delegation token from zookeeper

2021-04-14 Thread zhangxiping (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321866#comment-17321866
 ] 

zhangxiping edited comment on HDFS-15756 at 4/15/21, 3:04 AM:
--

-_- like :
{code:java}
// Code placeholder: AbstractDelegationTokenSecretManager

public synchronized long renewToken(Token<TokenIdent> token,
    String renewer) throws InvalidToken, IOException {
  ByteArrayInputStream buf = new ByteArrayInputStream(token.getIdentifier());
  DataInputStream in = new DataInputStream(buf);
  TokenIdent id = createIdentifier();
  id.readFields(in);
  LOG.info("Token renewal for identifier: " + formatTokenId(id)
      + "; total currentTokens " + currentTokens.size());

  long now = Time.now();
  ...
  ...

  if (getTokenInfo(id) == null && id.getIssueDate() + 6 > now) {
    throw new StandbyException("Renewal request for unknown token "
        + formatTokenId(id) + ", will fail over to another server and retry");
  }

  if (getTokenInfo(id) == null) {
    throw new InvalidToken("Renewal request for unknown token "
        + formatTokenId(id));
  }
  updateToken(id, info);
  return renewTime;
}

{code}



> RBF: Cannot get updated delegation token from zookeeper
> ---
>
> Key: HDFS-15756
> URL: https://issues.apache.org/jira/browse/HDFS-15756
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.0.0
>Reporter: hbprotoss
>Priority: Major
>
> Affected versions: all versions with RBF.
> When RBF works with Spark 2.4 in client mode, there is a chance that a token 
> is missing on some nodes of the RBF cluster. The root cause is that Spark 
> renews the token (via the resource manager) immediately after obtaining it. 
> Because ZooKeeper does not guarantee strong consistency right after an 
> update, a ZooKeeper client may read a stale value from a follower that has 
> not yet synced with the other nodes.
>  
> We applied a patch in Spark, but the underlying problem is still in RBF. Is 
> it possible for RBF to replace the delegation token store with some other 
> data source (Redis, for example)?






[jira] [Commented] (HDFS-15756) RBF: Cannot get updated delegation token from zookeeper

2021-04-14 Thread zhangxiping (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321866#comment-17321866
 ] 

zhangxiping commented on HDFS-15756:


like :
{code:java}
// Code placeholder: AbstractDelegationTokenSecretManager

public synchronized long renewToken(Token<TokenIdent> token,
    String renewer) throws InvalidToken, IOException {
  ByteArrayInputStream buf = new ByteArrayInputStream(token.getIdentifier());
  DataInputStream in = new DataInputStream(buf);
  TokenIdent id = createIdentifier();
  id.readFields(in);
  LOG.info("Token renewal for identifier: " + formatTokenId(id)
      + "; total currentTokens " + currentTokens.size());

  long now = Time.now();
  ...
  ...

  if (getTokenInfo(id) == null && id.getIssueDate() + 6 > now) {
    throw new StandbyException("Renewal request for unknown token "
        + formatTokenId(id) + ", will fail over to another server and retry");
  }

  if (getTokenInfo(id) == null) {
    throw new InvalidToken("Renewal request for unknown token "
        + formatTokenId(id));
  }
  updateToken(id, info);
  return renewTime;
}

{code}

> RBF: Cannot get updated delegation token from zookeeper
> ---
>
> Key: HDFS-15756
> URL: https://issues.apache.org/jira/browse/HDFS-15756
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.0.0
>Reporter: hbprotoss
>Priority: Major
>
> Affected versions: all versions with RBF.
> When RBF works with Spark 2.4 in client mode, there is a chance that a token 
> is missing on some nodes of the RBF cluster. The root cause is that Spark 
> renews the token (via the resource manager) immediately after obtaining it. 
> Because ZooKeeper does not guarantee strong consistency right after an 
> update, a ZooKeeper client may read a stale value from a follower that has 
> not yet synced with the other nodes.
>  
> We applied a patch in Spark, but the underlying problem is still in RBF. Is 
> it possible for RBF to replace the delegation token store with some other 
> data source (Redis, for example)?






[jira] [Comment Edited] (HDFS-15756) RBF: Cannot get updated delegation token from zookeeper

2021-04-14 Thread zhangxiping (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321862#comment-17321862
 ] 

zhangxiping edited comment on HDFS-15756 at 4/15/21, 2:35 AM:
--

[~hexiaoqiao]  

I am very happy to see your comment. We are also using Router and have 
encountered this problem when submitting a Spark application. During task 
submission, the following log appeared:

 
{code:java}
// Code placeholder
2021-04-13 01:01:13 CST DFSClient INFO - Created HDFS_DELEGATION_TOKEN token 
205440696 for da_music on ha-hdfs:hz-cluster11 
2021-04-13 01:01:13 CST SparkContext ERROR - Error initializing SparkContext. 
org.apache.hadoop.security.token.SecretManager$InvalidToken: Renewal request 
for unknown token (token for da_music: HDFS_DELEGATION_TOKEN 
owner=da_music/d...@hadoop.hz.netease.com, renewer=da_music, realUser=, 
issueDate=1618246873345, maxDate=1618851673345, sequenceNumber=205440696, 
masterKeyId=161) at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:543)

{code}
{code:java}
// Code placeholder
  private def getTokenRenewalInterval(
      hadoopConf: Configuration,
      sparkConf: SparkConf,
      filesystems: Set[FileSystem]): Option[Long] = {
    // We cannot use the tokens generated with renewer yarn. Trying to renew
    // those will fail with an access control issue. So create new tokens with
    // the logged in user as renewer.
    sparkConf.get(PRINCIPAL).flatMap { renewer =>
      val creds = new Credentials()
      fetchDelegationTokens(renewer, filesystems, creds)

      val renewIntervals = creds.getAllTokens.asScala.filter {
        _.decodeIdentifier().isInstanceOf[AbstractDelegationTokenIdentifier]
      }.flatMap { token =>
        Try {
          val newExpiration = token.renew(hadoopConf)
          val identifier =
            token.decodeIdentifier().asInstanceOf[AbstractDelegationTokenIdentifier]
          val interval = newExpiration - identifier.getIssueDate
          logInfo(s"Renewal interval is $interval for token ${token.getKind.toString}")
          interval
        }.toOption
      }
      if (renewIntervals.isEmpty) None else Some(renewIntervals.min)
    }
  }
}
{code}
Looking at the Spark 2.4.7 code, we found that the time between creating a 
token and renewing it is very short.

So could we retry only requests for recently created tokens, such as those 
created within the last minute?
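
The proposal can be sketched as a small, self-contained decision function. This is an illustration under stated assumptions, not Hadoop code: the class `RenewalDecisionSketch`, the `Outcome` names, and the 60-second `GRACE_MS` window (taken from the "within 1 minute" suggestion) are all hypothetical.

```java
final class RenewalDecisionSketch {
  enum Outcome { RENEW, RETRY_ELSEWHERE, INVALID }

  // Hypothetical grace window: tokens younger than this may simply not have propagated yet.
  static final long GRACE_MS = 60_000L;

  /**
   * Decides how to answer a renewal request.
   *
   * @param tokenKnown  whether this node currently knows the token
   * @param issueDateMs token issue time in epoch milliseconds
   * @param nowMs       current time in epoch milliseconds
   */
  static Outcome decide(boolean tokenKnown, long issueDateMs, long nowMs) {
    if (tokenKnown) {
      return Outcome.RENEW;
    }
    // Unknown but very recent: ask the client to retry against another node.
    if (nowMs - issueDateMs < GRACE_MS) {
      return Outcome.RETRY_ELSEWHERE;
    }
    // Unknown and old enough that replication lag is no longer a plausible cause.
    return Outcome.INVALID;
  }

  public static void main(String[] args) {
    long issued = 1618246873345L; // issueDate from the log above
    System.out.println(decide(false, issued, issued + 500));     // prints RETRY_ELSEWHERE
    System.out.println(decide(false, issued, issued + 120_000)); // prints INVALID
    System.out.println(decide(true, issued, issued + 120_000));  // prints RENEW
  }
}
```

Limiting the retry to a short window keeps genuinely invalid tokens failing fast while giving a freshly issued token time to become visible on every node.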

 


was (Author: zhangxiping):
[~hexiaoqiao]  

I am very happy to see your comment. We are also using Router now and have 
encountered this problem.We encountered this problem when we submitted the 
Spark application. During the task submission, the following log appeared:

 
{code:java}
//代码占位符
2021-04-13 01:01:13 CST DFSClient INFO - Created HDFS_DELEGATION_TOKEN token 
205440696 for da_music on ha-hdfs:hz-cluster11 
2021-04-13 01:01:13 CST SparkContext ERROR - Error initializing SparkContext. 
org.apache.hadoop.security.token.SecretManager$InvalidToken: Renewal request 
for unknown token (token for da_music: HDFS_DELEGATION_TOKEN 
owner=da_music/d...@hadoop.hz.netease.com, renewer=da_music, realUser=, 
issueDate=1618246873345, maxDate=1618851673345, sequenceNumber=205440696, 
masterKeyId=161) at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:543)

{code}
!image-2021-04-15-10-27-29-927.png!

Looking at the Spark 2.4.7 code, we found that the time between creating a token 
and renewing it is very short.

So could we retry the renewal only for tokens that were created very recently, 
for example those created within the last minute?

 

> RBF: Cannot get updated delegation token from zookeeper
> ---
>
> Key: HDFS-15756
> URL: https://issues.apache.org/jira/browse/HDFS-15756
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.0.0
>Reporter: hbprotoss
>Priority: Major
>
> Affected versions: all versions with RBF
> When RBF works with Spark 2.4 in client mode, there is a chance that a token 
> is missing on some nodes in the RBF cluster. The root cause is that 
> Spark renews the token (via the ResourceManager) immediately after obtaining it. 
> Because ZooKeeper does not guarantee strong consistency after an update in the 
> cluster, a ZooKeeper client may read a stale value from a follower that has not 
> yet synced with the other nodes.
>  
> We applied a patch in Spark, but the underlying problem is still in RBF. Is it 
> possible for RBF to replace the delegation token store with some other 
> data source (Redis, for example)?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, 

[jira] [Commented] (HDFS-15756) RBF: Cannot get updated delegation token from zookeeper

2021-04-14 Thread zhangxiping (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321862#comment-17321862
 ] 

zhangxiping commented on HDFS-15756:


[~hexiaoqiao]  

I am very happy to see your comment. We are also using the Router now and have 
encountered this problem. It occurred when we submitted a 
Spark application. During the task submission, the following log appeared:

 
{code:java}
// code placeholder
2021-04-13 01:01:13 CST DFSClient INFO - Created HDFS_DELEGATION_TOKEN token 
205440696 for da_music on ha-hdfs:hz-cluster11 
2021-04-13 01:01:13 CST SparkContext ERROR - Error initializing SparkContext. 
org.apache.hadoop.security.token.SecretManager$InvalidToken: Renewal request 
for unknown token (token for da_music: HDFS_DELEGATION_TOKEN 
owner=da_music/d...@hadoop.hz.netease.com, renewer=da_music, realUser=, 
issueDate=1618246873345, maxDate=1618851673345, sequenceNumber=205440696, 
masterKeyId=161) at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:543)

{code}
!image-2021-04-15-10-27-29-927.png!

Looking at the Spark 2.4.7 code, we found that the time between creating a token 
and renewing it is very short.

So could we retry the renewal only for tokens that were created very recently, 
for example those created within the last minute?

 

> RBF: Cannot get updated delegation token from zookeeper
> ---
>
> Key: HDFS-15756
> URL: https://issues.apache.org/jira/browse/HDFS-15756
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.0.0
>Reporter: hbprotoss
>Priority: Major
>
> Affected versions: all versions with RBF
> When RBF works with Spark 2.4 in client mode, there is a chance that a token 
> is missing on some nodes in the RBF cluster. The root cause is that 
> Spark renews the token (via the ResourceManager) immediately after obtaining it. 
> Because ZooKeeper does not guarantee strong consistency after an update in the 
> cluster, a ZooKeeper client may read a stale value from a follower that has not 
> yet synced with the other nodes.
>  
> We applied a patch in Spark, but the underlying problem is still in RBF. Is it 
> possible for RBF to replace the delegation token store with some other 
> data source (Redis, for example)?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-04-14 Thread zhuobin zheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321861#comment-17321861
 ] 

zhuobin zheng commented on HDFS-15973:
--

Hi, I reviewed this patch and have two questions:
 # Why not use the FileSystem.access() method to check permissions?
 # checkRenameSrcPermission and checkRenameDstPermission are almost the same. Can 
the common logic be extracted into a shared sub-method?
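
To illustrate the second question, the two checks could share one helper along these lines. This is only a sketch: AccessChecker is a hypothetical stand-in for a call such as FileSystem.access(path, mode), and the assumption that both sides need write access to the parent directory is an illustration, not something taken from HDFS-15973.001.patch.

```java
/** Sketch only: one shared helper behind the source and destination checks. */
final class RenamePermissionSketch {

  /** Hypothetical stand-in for Hadoop's FsAction. */
  enum Action { READ, WRITE }

  /** Hypothetical stand-in for FileSystem.access(path, mode). */
  interface AccessChecker {
    /** Throws SecurityException when the caller lacks the given access. */
    void check(String path, Action action);
  }

  /** Returns the parent directory of an absolute path ("/" for top level). */
  static String parentOf(String path) {
    int idx = path.lastIndexOf('/');
    return idx <= 0 ? "/" : path.substring(0, idx);
  }

  /** Common logic: the caller must be able to write to the parent directory. */
  private static void checkParentWritable(AccessChecker fs, String path, String role) {
    String parent = parentOf(path);
    try {
      fs.check(parent, Action.WRITE);
    } catch (SecurityException e) {
      throw new SecurityException(
          "rename " + role + " " + path + ": no write access to " + parent, e);
    }
  }

  static void checkRenameSrcPermission(AccessChecker fs, String src) {
    checkParentWritable(fs, src, "source");
  }

  static void checkRenameDstPermission(AccessChecker fs, String dst) {
    checkParentWritable(fs, dst, "destination");
  }
}
```

Only the role label differs between the two public methods, so any later change to the rule lands in one place.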

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch
>
>
> The router federation rename lacks a permission check. This is a security 
> issue.






[jira] [Updated] (HDFS-15974) RBF: Unable to display the datanode UI of the router

2021-04-14 Thread zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhu updated HDFS-15974:
---
Attachment: image-2021-04-15-10-31-27-915.png

> RBF: Unable to display the datanode UI of the router
> 
>
> Key: HDFS-15974
> URL: https://issues.apache.org/jira/browse/HDFS-15974
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf, ui
>Affects Versions: 3.4.0
>Reporter: zhu
>Priority: Major
> Attachments: HDFS-15358-1.patch, image-2021-04-15-10-05-35-985.png, 
> image-2021-04-15-10-31-27-915.png
>
>
> Clicking the Datanodes tab on the Router UI gets no response.






[jira] [Work logged] (HDFS-15810) RBF: RBFMetrics's TotalCapacity out of bounds

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15810?focusedWorklogId=583091=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-583091
 ]

ASF GitHub Bot logged work on HDFS-15810:
-

Author: ASF GitHub Bot
Created on: 15/Apr/21 02:25
Start Date: 15/Apr/21 02:25
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2910:
URL: https://github.com/apache/hadoop/pull/2910#issuecomment-820010002


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 35s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  33m 49s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 42s |  |  trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |   0m 38s |  |  trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 26s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 43s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 41s |  |  trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   0m 56s |  |  trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 13s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  14m 20s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 33s |  |  the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 28s |  |  the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 28s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 17s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 30s |  |  the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   0m 47s |  |  the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | -1 :x: |  spotbugs  |   1m 19s | [/new-spotbugs-hadoop-hdfs-project_hadoop-hdfs-rbf.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2910/1/artifact/out/new-spotbugs-hadoop-hdfs-project_hadoop-hdfs-rbf.html) |  hadoop-hdfs-project/hadoop-hdfs-rbf generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0)  |
   | +1 :green_heart: |  shadedclient  |  14m  2s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  |  17m 48s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2910/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) |  hadoop-hdfs-rbf in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 34s |  |  The patch does not generate ASF License warnings.  |
   |  |   |  93m  9s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | SpotBugs | module:hadoop-hdfs-project/hadoop-hdfs-rbf |
   |  |  Return value of java.math.BigInteger.add(BigInteger) ignored in org.apache.hadoop.hdfs.server.federation.metrics.RBFMetrics.getNameserviceAggregatedBigInt(ToLongFunction)  At RBFMetrics.java:in org.apache.hadoop.hdfs.server.federation.metrics.RBFMetrics.getNameserviceAggregatedBigInt(ToLongFunction)  At RBFMetrics.java:[line 795] |
   | Failed junit tests | hadoop.hdfs.server.federation.router.TestRouterRpcMultiDestination |
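
The SpotBugs finding above is the usual immutability pitfall: BigInteger.add returns a new object and leaves the receiver unchanged, so the result has to be assigned back. A small self-contained Java sketch of the difference (the class and method names are hypothetical and this is not the RBFMetrics code):

```java
import java.math.BigInteger;

/** Sketch only: summing long values into a BigInteger without overflow. */
final class BigIntegerSumSketch {

  /** Buggy pattern: the result of add() is discarded, so the sum stays zero. */
  static BigInteger sumIgnoringResult(long[] values) {
    BigInteger sum = BigInteger.ZERO;
    for (long v : values) {
      sum.add(BigInteger.valueOf(v)); // new object is returned and dropped
    }
    return sum;
  }

  /** Correct pattern: reassign the value returned by add(). */
  static BigInteger sum(long[] values) {
    BigInteger sum = BigInteger.ZERO;
    for (long v : values) {
      sum = sum.add(BigInteger.valueOf(v));
    }
    return sum;
  }

  public static void main(String[] args) {
    long[] capacities = {Long.MAX_VALUE, Long.MAX_VALUE};
    System.out.println(sumIgnoringResult(capacities)); // 0
    System.out.println(sum(capacities));               // 18446744073709551614
  }
}
```

The second result is larger than Long.MAX_VALUE, which is the case a plain long total cannot represent.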
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2910/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2910 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux ee336fa6323a 

[jira] [Updated] (HDFS-15974) RBF: Unable to display the datanode UI of the router

2021-04-14 Thread zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhu updated HDFS-15974:
---
Attachment: image-2021-04-15-10-05-35-985.png

> RBF: Unable to display the datanode UI of the router
> 
>
> Key: HDFS-15974
> URL: https://issues.apache.org/jira/browse/HDFS-15974
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf, ui
>Affects Versions: 3.4.0
>Reporter: zhu
>Priority: Major
> Attachments: HDFS-15358-1.patch, image-2021-04-15-10-05-35-985.png
>
>
> Clicking the Datanodes tab on the Router UI gets no response.






[jira] [Commented] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters

2021-04-14 Thread zhuobin zheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321844#comment-17321844
 ] 

zhuobin zheng commented on HDFS-15923:
--

OK, I will finish my work as soon as possible.

> RBF:  Authentication failed when rename accross sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Renaming across subclusters with RBF in a Kerberos environment encounters 
> the following two errors:
>  # Save Object to journal.
>  # Precheck tries to get the src file status.
> So we need to use a proxy UGI doAs to create the DistcpProcedure and 
> TrashProcedure and submit the job.
> In the patch I use a proxy UGI doAs for the above methods. It worked.
> But there is another strange thing that this patch does not solve:
> The Router uses its own UGI to submit the DistCp job, rather than the user UGI 
> or the proxy UGI. This may give the DistCp job excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:544)
> at 
> 

[jira] [Work logged] (HDFS-15970) Print network topology on the web

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15970?focusedWorklogId=583086=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-583086
 ]

ASF GitHub Bot logged work on HDFS-15970:
-

Author: ASF GitHub Bot
Created on: 15/Apr/21 01:51
Start Date: 15/Apr/21 01:51
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #2896:
URL: https://github.com/apache/hadoop/pull/2896#issuecomment-819960710


   Hi @tasanuma , could you please do another review? Thank you.




Issue Time Tracking
---

Worklog Id: (was: 583086)
Time Spent: 1h 50m  (was: 1h 40m)

> Print network topology on the web
> -
>
> Key: HDFS-15970
> URL: https://issues.apache.org/jira/browse/HDFS-15970
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
> Attachments: hdfs-topology.jpg, hdfs-web.jpg
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> In order to query the network topology information conveniently, we can print 
> it on the web.






[jira] [Work logged] (HDFS-15975) Use LongAdder instead of AtomicLong

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15975?focusedWorklogId=583083=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-583083
 ]

ASF GitHub Bot logged work on HDFS-15975:
-

Author: ASF GitHub Bot
Created on: 15/Apr/21 01:49
Start Date: 15/Apr/21 01:49
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #2907:
URL: https://github.com/apache/hadoop/pull/2907#issuecomment-819959935


   The failed unit tests are unrelated to the change, and they pass 
locally.
   
   Hi @goiri @tasanuma , could you please review the code? Thanks very much.




Issue Time Tracking
---

Worklog Id: (was: 583083)
Time Spent: 40m  (was: 0.5h)

> Use LongAdder instead of AtomicLong
> ---
>
> Key: HDFS-15975
> URL: https://issues.apache.org/jira/browse/HDFS-15975
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When counting some indicators, we can use LongAdder instead of AtomicLong to 
> improve performance. The long value is not an atomic snapshot in LongAdder, 
> but I think we can tolerate that.
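
A small sketch of the trade-off described above, using only java.util.concurrent.atomic.LongAdder. The class and method names are hypothetical. sum() is exact once all writers have finished, but while increments are still in flight it is not an atomic snapshot, which is the tolerance the description refers to.

```java
import java.util.concurrent.atomic.LongAdder;

/** Sketch only: a contended counter backed by LongAdder. */
final class LongAdderCounterSketch {

  /** Starts the given number of threads, each incrementing the counter. */
  static long countWithLongAdder(int threads, int incrementsPerThread) {
    LongAdder counter = new LongAdder();
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        for (int j = 0; j < incrementsPerThread; j++) {
          counter.increment(); // updates a per-thread cell under contention
        }
      });
      workers[i].start();
    }
    for (Thread t : workers) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting for workers", e);
      }
    }
    // All writers are done, so the sum is exact at this point.
    return counter.sum();
  }

  public static void main(String[] args) {
    System.out.println(countWithLongAdder(4, 10_000)); // 40000
  }
}
```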






[jira] [Updated] (HDFS-15810) RBF: RBFMetrics's TotalCapacity out of bounds

2021-04-14 Thread Fengnan Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengnan Li updated HDFS-15810:
--
Status: Patch Available  (was: Open)

> RBF: RBFMetrics's TotalCapacity out of bounds
> -
>
> Key: HDFS-15810
> URL: https://issues.apache.org/jira/browse/HDFS-15810
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Xiaoxing Wei
>Assignee: Fengnan Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-02-02-10-59-17-113.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The long-typed fields TotalCapacity, UsedCapacity and RemainingCapacity in 
> RBFMetrics may go out of bounds.
> !image-2021-02-02-10-59-17-113.png!






[jira] [Updated] (HDFS-15810) RBF: RBFMetrics's TotalCapacity out of bounds

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-15810:
--
Labels: pull-request-available  (was: )

> RBF: RBFMetrics's TotalCapacity out of bounds
> -
>
> Key: HDFS-15810
> URL: https://issues.apache.org/jira/browse/HDFS-15810
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Xiaoxing Wei
>Assignee: Fengnan Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-02-02-10-59-17-113.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The long-typed fields TotalCapacity, UsedCapacity and RemainingCapacity in 
> RBFMetrics may go out of bounds.
> !image-2021-02-02-10-59-17-113.png!






[jira] [Work logged] (HDFS-15810) RBF: RBFMetrics's TotalCapacity out of bounds

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15810?focusedWorklogId=583060=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-583060
 ]

ASF GitHub Bot logged work on HDFS-15810:
-

Author: ASF GitHub Bot
Created on: 15/Apr/21 00:50
Start Date: 15/Apr/21 00:50
Worklog Time Spent: 10m 
  Work Description: fengnanli opened a new pull request #2910:
URL: https://github.com/apache/hadoop/pull/2910


   ## NOTICE
   
   Please create an issue in ASF JIRA before opening a pull request,
   and you need to set the title of the pull request which starts with
   the corresponding JIRA issue number. (e.g. HADOOP-X. Fix a typo in YYY.)
   For more details, please see 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
   




Issue Time Tracking
---

Worklog Id: (was: 583060)
Remaining Estimate: 0h
Time Spent: 10m

> RBF: RBFMetrics's TotalCapacity out of bounds
> -
>
> Key: HDFS-15810
> URL: https://issues.apache.org/jira/browse/HDFS-15810
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Xiaoxing Wei
>Assignee: Fengnan Li
>Priority: Major
> Attachments: image-2021-02-02-10-59-17-113.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The long-typed fields TotalCapacity, UsedCapacity and RemainingCapacity in 
> RBFMetrics may go out of bounds.
> !image-2021-02-02-10-59-17-113.png!






[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-04-14 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321828#comment-17321828
 ] 

Hadoop QA commented on HDFS-15973:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 22s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 23s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 10s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  5m 13s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m 46s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 11s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 54s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 18m 29s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 28s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 16s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 26m 32s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  4m 23s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 21s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 43s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  5m  5s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  5m  5s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m 40s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  4m 40s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  1m  8s{color} | {color:orange}https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/573/artifact/out/diff-checkstyle-hadoop-hdfs-project.txt{color} | {color:orange} hadoop-hdfs-project: The patch generated 11 new + 23 unchanged - 0 fixed = 34 total (was 23) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 46s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | 

[jira] [Work logged] (HDFS-15975) Use LongAdder instead of AtomicLong

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15975?focusedWorklogId=582989=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582989
 ]

ASF GitHub Bot logged work on HDFS-15975:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 22:47
Start Date: 14/Apr/21 22:47
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2907:
URL: https://github.com/apache/hadoop/pull/2907#issuecomment-819897893


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 36s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  15m 59s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  27m 32s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  28m 48s |  |  trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |  24m  6s |  |  trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   5m  6s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   4m 32s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   3m  8s |  |  trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   4m 13s |  |  trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |  10m 30s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  21m 42s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 28s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   3m 21s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  20m 20s |  |  the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |  20m 20s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  18m  4s |  |  the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  javac  |  18m  4s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | -0 :warning: |  checkstyle  |   3m 45s | [/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2907/1/artifact/out/results-checkstyle-root.txt) |  root: The patch generated 5 new + 242 unchanged - 5 fixed = 247 total (was 247)  |
   | +1 :green_heart: |  mvnsite  |   4m 10s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   3m  0s |  |  the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   4m  3s |  |  the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   8m 42s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  16m 41s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  17m 25s |  |  hadoop-common in the patch passed.  |
   | +1 :green_heart: |  unit  |   2m 36s |  |  hadoop-hdfs-client in the patch passed.  |
   | -1 :x: |  unit  | 246m 51s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2907/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 12s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 495m 56s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   |   | hadoop.hdfs.server.balancer.TestBalancer |
   |   | hadoop.hdfs.TestReconstructStripedFile |
   |   | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
   |   | hadoop.hdfs.server.namenode.snapshot.TestNestedSnapshots |
   |   | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks |
   |   | hadoop.hdfs.server.namenode.TestAddOverReplicatedStripedBlocks |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2907/1/artifact/out/Dockerfile
 |
   | 

[jira] [Assigned] (HDFS-15810) RBF: RBFMetrics's TotalCapacity out of bounds

2021-04-14 Thread Fengnan Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengnan Li reassigned HDFS-15810:
-

Assignee: Fengnan Li

> RBF: RBFMetrics's TotalCapacity out of bounds
> -
>
> Key: HDFS-15810
> URL: https://issues.apache.org/jira/browse/HDFS-15810
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Xiaoxing Wei
>Assignee: Fengnan Li
>Priority: Major
> Attachments: image-2021-02-02-10-59-17-113.png
>
>
> The Long type fields TotalCapacity, UsedCapacity and RemainingCapacity in 
> RBFMetrics may be out of bounds.
> !image-2021-02-02-10-59-17-113.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15810) RBF: RBFMetrics's TotalCapacity out of bounds

2021-04-14 Thread Fengnan Li (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321352#comment-17321352
 ] 

Fengnan Li commented on HDFS-15810:
---

Sure. I will give it a try with BigInteger.

> RBF: RBFMetrics's TotalCapacity out of bounds
> -
>
> Key: HDFS-15810
> URL: https://issues.apache.org/jira/browse/HDFS-15810
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Xiaoxing Wei
>Priority: Major
> Attachments: image-2021-02-02-10-59-17-113.png
>
>
> The Long type fields TotalCapacity, UsedCapacity and RemainingCapacity in 
> RBFMetrics may be out of bounds.
> !image-2021-02-02-10-59-17-113.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15810) RBF: RBFMetrics's TotalCapacity out of bounds

2021-04-14 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321344#comment-17321344
 ] 

Íñigo Goiri commented on HDFS-15810:


The other option is to go for BigInteger.
I guess that's the proper way to go around this issue.
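
As a rough illustration of the trade-off discussed here, the sketch below sums capacities once as `long` and once as `BigInteger`. The class and method names are hypothetical and are not taken from RBFMetrics; the input values are made up to force an overflow.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of Hadoop: contrasts long and BigInteger sums.
class CapacitySum {

    // Plain long addition silently wraps around on overflow.
    static long sumAsLong(long[] capacities) {
        long total = 0L;
        for (long c : capacities) {
            total += c;
        }
        return total;
    }

    // BigInteger has arbitrary precision, so the sum cannot overflow.
    static BigInteger sumAsBigInteger(long[] capacities) {
        BigInteger total = BigInteger.ZERO;
        for (long c : capacities) {
            total = total.add(BigInteger.valueOf(c));
        }
        return total;
    }

    public static void main(String[] args) {
        long[] capacities = {Long.MAX_VALUE, 1L};
        System.out.println("long sum:       " + sumAsLong(capacities));
        System.out.println("BigInteger sum: " + sumAsBigInteger(capacities));
    }
}
```

With these inputs the `long` total wraps to a negative number, while the `BigInteger` total stays exact.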

> RBF: RBFMetrics's TotalCapacity out of bounds
> -
>
> Key: HDFS-15810
> URL: https://issues.apache.org/jira/browse/HDFS-15810
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Xiaoxing Wei
>Priority: Major
> Attachments: image-2021-02-02-10-59-17-113.png
>
>
> The Long type fields TotalCapacity, UsedCapacity and RemainingCapacity in 
> RBFMetrics may be out of bounds.
> !image-2021-02-02-10-59-17-113.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15810) RBF: RBFMetrics's TotalCapacity out of bounds

2021-04-14 Thread Fengnan Li (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321315#comment-17321315
 ] 

Fengnan Li commented on HDFS-15810:
---

Can we use double, which has a much bigger MAX than long?
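
One thing to weigh with this option: a double has a larger range than a long but only 53 bits of integer precision, so large byte counts may not be represented exactly. A minimal sketch of that effect, using a hypothetical helper that is not part of Hadoop:

```java
// Hypothetical helper: checks whether a long survives a round trip through double.
class DoublePrecision {

    // Returns true only if converting to double and back gives the same value.
    static boolean roundTripsExactly(long value) {
        return (long) (double) value == value;
    }

    public static void main(String[] args) {
        long exact = 1L << 53;         // 2^53 is representable as a double
        long inexact = (1L << 53) + 1; // 2^53 + 1 is not
        System.out.println(exact + " exact: " + roundTripsExactly(exact));
        System.out.println(inexact + " exact: " + roundTripsExactly(inexact));
    }
}
```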

> RBF: RBFMetrics's TotalCapacity out of bounds
> -
>
> Key: HDFS-15810
> URL: https://issues.apache.org/jira/browse/HDFS-15810
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Xiaoxing Wei
>Priority: Major
> Attachments: image-2021-02-02-10-59-17-113.png
>
>
> The Long type fields TotalCapacity, UsedCapacity and RemainingCapacity in 
> RBFMetrics may be out of bounds.
> !image-2021-02-02-10-59-17-113.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=582863=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582863
 ]

ASF GitHub Bot logged work on HDFS-15869:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 20:30
Start Date: 14/Apr/21 20:30
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2737:
URL: https://github.com/apache/hadoop/pull/2737#issuecomment-819811557


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 36s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  34m  1s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 21s |  |  trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |   1m 16s |  |  trunk passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m  2s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 19s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 54s |  |  trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 24s |  |  trunk passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m  4s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  16m  4s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  16m 23s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 12s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 13s |  |  the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |   1m 13s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  9s |  |  the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  javac  |   1m  9s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 52s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2737/3/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 2 unchanged - 
0 fixed = 3 total (was 2)  |
   | +1 :green_heart: |  mvnsite  |   1m 11s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 45s |  |  the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 16s |  |  the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m  6s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  15m 52s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 231m 29s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2737/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 45s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 317m 56s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.balancer.TestBalancer |
   |   | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks |
   |   | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
   |   | hadoop.hdfs.server.namenode.snapshot.TestNestedSnapshots |
   |   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2737/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2737 |
   | Optional Tests | 

[jira] [Work logged] (HDFS-15976) Make mkdtemp cross platform

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15976?focusedWorklogId=582740=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582740
 ]

ASF GitHub Bot logged work on HDFS-15976:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 18:14
Start Date: 14/Apr/21 18:14
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2908:
URL: https://github.com/apache/hadoop/pull/2908#issuecomment-819729658


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 54s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 6 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  34m 51s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   2m 38s |  |  trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |   2m 39s |  |  trunk passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  mvnsite  |   0m 28s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  54m 12s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 16s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 34s |  |  the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  cc  |   2m 34s |  |  the patch passed  |
   | +1 :green_heart: |  golang  |   2m 34s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 34s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 32s |  |  the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  cc  |   2m 32s |  |  the patch passed  |
   | +1 :green_heart: |  golang  |   2m 32s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 32s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  mvnsite  |   0m 19s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  13m 21s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 116m  2s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs-native-client.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2908/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-native-client.txt)
 |  hadoop-hdfs-native-client in the patch failed.  |
   | +1 :green_heart: |  asflicense  |   0m 34s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 193m  3s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed CTEST tests | configuration |
   |   | memcheck_configuration |
   |   | hdfs_configuration |
   |   | memcheck_hdfs_configuration |
   |   | hdfs_builder_test |
   |   | memcheck_hdfs_builder_test |
   |   | hdfs_config_connect_bugs |
   |   | memcheck_hdfs_config_connect_bugs |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2908/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2908 |
   | Optional Tests | dupname asflicense compile cc mvnsite javac unit 
codespell golang |
   | uname | Linux 8bfbfed8d78b 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 
23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 5df023b90978abf2bf5ec131b5e401c1f880d268 |
   | Default Java | Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
   | CTEST | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2908/2/artifact/out/patch-hadoop-hdfs-project_hadoop-hdfs-native-client-ctest.txt
 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2908/2/testReport/ |
   | Max. process+thread count | 713 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-native-client U: 
hadoop-hdfs-project/hadoop-hdfs-native-client |
   | Console output | 

[jira] [Work logged] (HDFS-15976) Make mkdtemp cross platform

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15976?focusedWorklogId=582730=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582730
 ]

ASF GitHub Bot logged work on HDFS-15976:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 17:55
Start Date: 14/Apr/21 17:55
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2908:
URL: https://github.com/apache/hadoop/pull/2908#issuecomment-819718466


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 54s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 6 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  38m  4s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   2m 57s |  |  trunk passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |   2m 52s |  |  trunk passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  mvnsite  |   0m 22s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  60m  3s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 13s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 42s |  |  the patch passed with JDK 
Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  cc  |   2m 42s |  |  the patch passed  |
   | +1 :green_heart: |  golang  |   2m 42s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 42s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 44s |  |  the patch passed with JDK 
Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  cc  |   2m 44s |  |  the patch passed  |
   | +1 :green_heart: |  golang  |   2m 44s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 44s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  mvnsite  |   0m 15s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  15m 52s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  |  91m 44s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs-native-client.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2908/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-native-client.txt)
 |  hadoop-hdfs-native-client in the patch failed.  |
   | +1 :green_heart: |  asflicense  |   0m 30s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 177m 20s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed CTEST tests | configuration |
   |   | memcheck_configuration |
   |   | hdfs_configuration |
   |   | memcheck_hdfs_configuration |
   |   | hdfs_builder_test |
   |   | memcheck_hdfs_builder_test |
   |   | hdfs_config_connect_bugs |
   |   | memcheck_hdfs_config_connect_bugs |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2908/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2908 |
   | Optional Tests | dupname asflicense compile cc mvnsite javac unit 
codespell golang |
   | uname | Linux cd68af073374 4.15.0-136-generic #140-Ubuntu SMP Thu Jan 28 
05:20:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 5df023b90978abf2bf5ec131b5e401c1f880d268 |
   | Default Java | Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
   | CTEST | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2908/1/artifact/out/patch-hadoop-hdfs-project_hadoop-hdfs-native-client-ctest.txt
 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2908/1/testReport/ |
   | Max. process+thread count | 569 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-native-client U: 
hadoop-hdfs-project/hadoop-hdfs-native-client |
   | Console output | 

[jira] [Commented] (HDFS-15974) RBF: Unable to display the datanode UI of the router

2021-04-14 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321197#comment-17321197
 ] 

Íñigo Goiri commented on HDFS-15974:


Can you give a little more context on why switching the maintenance mode to 
null fixes the issue?

> RBF: Unable to display the datanode UI of the router
> 
>
> Key: HDFS-15974
> URL: https://issues.apache.org/jira/browse/HDFS-15974
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf, ui
>Affects Versions: 3.4.0
>Reporter: zhu
>Priority: Major
> Attachments: HDFS-15358-1.patch
>
>
> Clicking the Datanodes tab on the Router UI gets no response.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15940) Some tests in TestBlockRecovery are consistently failing

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15940?focusedWorklogId=582713=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582713
 ]

ASF GitHub Bot logged work on HDFS-15940:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 17:20
Start Date: 14/Apr/21 17:20
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2902:
URL: https://github.com/apache/hadoop/pull/2902#issuecomment-819685591


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   1m  4s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 3 new or modified test files.  |
    _ branch-3.2 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  30m 27s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  compile  |   1m  1s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 46s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  mvnsite  |   1m  9s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  javadoc  |   0m 56s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  spotbugs  |   2m 53s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  shadedclient  |  16m 11s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m  6s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 55s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 55s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 40s |  |  
hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 56 unchanged - 7 
fixed = 56 total (was 63)  |
   | +1 :green_heart: |  mvnsite  |   1m  1s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 46s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   3m  1s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  17m 32s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 192m 23s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2902/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch failed.  |
   | +1 :green_heart: |  asflicense  |   0m 36s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 269m 55s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.namenode.TestRedudantBlocks |
   |   | hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA |
   |   | hadoop.hdfs.server.namenode.TestEditLogRace |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2902/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2902 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 82eff6c5936f 4.15.0-136-generic #140-Ubuntu SMP Thu Jan 28 
05:20:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.2 / 3c8ee877e3bf8fbc99e0b64692c6e71a25ec4b61 |
   | Default Java | Private Build-1.8.0_282-8u282-b08-0ubuntu1~18.04-b08 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2902/2/testReport/ |
   | Max. process+thread count | 2364 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2902/2/console |
   | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 582713)
Time Spent: 6h  (was: 5h 50m)

> Some tests in TestBlockRecovery 

[jira] [Work logged] (HDFS-15957) The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15957?focusedWorklogId=582650=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582650
 ]

ASF GitHub Bot logged work on HDFS-15957:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 16:06
Start Date: 14/Apr/21 16:06
Worklog Time Spent: 10m 
  Work Description: amahussein commented on a change in pull request #2878:
URL: https://github.com/apache/hadoop/pull/2878#discussion_r613382377



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
##
@@ -378,13 +381,18 @@ public void logSyncWait() {
 
 @Override
 public void logSyncNotify(RuntimeException syncEx) {
-  try {
-if (syncEx == null) {
-  call.sendResponse();
-} else {
-  call.abortResponse(syncEx);
+  for (int retries = 0; retries <= RESPONSE_SEND_RETRIES; retries++) {

Review comment:
   @functioner Thanks for the explanation.
   
   Adding an explicit IOException alters the purpose of the implementation. To 
observe the symptom on the client, the fault injection needs to be closer 
to what the code was implemented for. For example, instead of throwing an 
exception directly, a fault injection point should be added to the RPC call so 
that `call.sendResponse()`/`call.abortResponse(syncEx)` throws the appropriate 
exception.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 582650)
Time Spent: 1h  (was: 50m)

> The ignored IOException in the RPC response sent by FSEditLogAsync can cause 
> the HDFS client to hang
> 
>
> Key: HDFS-15957
> URL: https://issues.apache.org/jira/browse/HDFS-15957
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs async, namenode
>Affects Versions: 3.2.2
>Reporter: Haoze Wu
>Priority: Critical
>  Labels: pull-request-available
> Attachments: fsshell.txt, namenode.txt, reproduce.patch, 
> secondnamenode.txt
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
>     In FSEditLogAsync, the RpcEdit notification in line 248 could be skipped, 
> because the possible exception (e.g., IOException) thrown in line 365 is 
> always ignored.
>  
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
> class FSEditLogAsync extends FSEditLog implements Runnable {
>   // ...
>   @Override
>   public void run() {
>     try {
>       while (true) {
>         boolean doSync;
>         Edit edit = dequeueEdit();
>         if (edit != null) {
>           // sync if requested by edit log.
>           doSync = edit.logEdit();
>           syncWaitQ.add(edit);
>         } else {
>           // sync when editq runs dry, but have edits pending a sync.
>           doSync = !syncWaitQ.isEmpty();
>         }
>         if (doSync) {
>           // normally edit log exceptions cause the NN to terminate, but tests
>           // relying on ExitUtil.terminate need to see the exception.
>           RuntimeException syncEx = null;
>           try {
>             logSync(getLastWrittenTxId());
>           } catch (RuntimeException ex) {
>             syncEx = ex;
>           }
>           while ((edit = syncWaitQ.poll()) != null) {
>             edit.logSyncNotify(syncEx);   // line 248
>           }
>         }
>       }
>     } catch (InterruptedException ie) {
>       LOG.info(Thread.currentThread().getName() + " was interrupted, exiting");
>     } catch (Throwable t) {
>       terminate(t);
>     }
>   }
>   // the calling rpc thread will return immediately from logSync but the
>   // rpc response will not be sent until the edit is durable.
>   private static class RpcEdit extends Edit {
>     // ...
>     @Override
>     public void logSyncNotify(RuntimeException syncEx) {
>       try {
>         if (syncEx == null) {
>           call.sendResponse();        // line 365
>         } else {
>           call.abortResponse(syncEx);
>         }
>       } catch (Exception e) {} // don't care if not sent.
>     }
>   }
> }
> {code}
>     The `call.sendResponse()` call may throw an IOException. According to the 
> comment (“don’t care if not sent”) there, this exception is neither handled 
> nor logged. However, we suspect that some RPC responses sent there 
> may be critical, and there should be some retry mechanism.
>     We try to introduce a 

[jira] [Work logged] (HDFS-15957) The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15957?focusedWorklogId=582647=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582647
 ]

ASF GitHub Bot logged work on HDFS-15957:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 16:04
Start Date: 14/Apr/21 16:04
Worklog Time Spent: 10m 
  Work Description: functioner commented on a change in pull request #2878:
URL: https://github.com/apache/hadoop/pull/2878#discussion_r613361072



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
##
@@ -378,13 +381,18 @@ public void logSyncWait() {
 
 @Override
 public void logSyncNotify(RuntimeException syncEx) {
-  try {
-if (syncEx == null) {
-  call.sendResponse();
-} else {
-  call.abortResponse(syncEx);
+  for (int retries = 0; retries <= RESPONSE_SEND_RETRIES; retries++) {

Review comment:
   @daryn-sharp, in our report 
([HDFS-15957](https://issues.apache.org/jira/browse/HDFS-15957)), we are doing 
fault injection testing: we inject an IOException in `call.sendResponse()` 
and then observe that the client gets stuck. In this scenario, 
a retry can help.
   
   >  If the response cannot be sent it's either because the connection is 
already closed or there's a bug preventing the encoding of the response.
   
   There is a lot of code within `call.sendResponse()`; some of it is 
related to OS kernel I/O, and some of it may change in the future, 
so I think that, now and in the future, there can be multiple reasons for an 
IOException besides the two you list here. For example, for transient 
connection issues, a retry would help. Furthermore, in the scenarios you 
describe, a retry wouldn't make things worse.
   
   Actually, it's not that a retry can't help with connection issues; it's just a 
matter of whether our fix implements the retry correctly, i.e., whether 
we should re-create a connection object or not. Therefore, I think it's still 
worth adding the retry logic here, although it might not be able to handle the 
two scenarios you describe. Do you agree?
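
   For illustration only, here is a minimal sketch of the kind of bounded retry 
being discussed. `BoundedRetry`, `IoAction`, and `runWithRetries` are 
hypothetical names, not Hadoop APIs; `IoAction` merely stands in for a call 
such as `call.sendResponse()`. The loop stops at the first success and gives up 
after `maxRetries` additional attempts. It does not address whether retrying on 
the same connection object is sufficient.

```java
import java.io.IOException;

/** Hypothetical helper (not part of Hadoop): retries an I/O action a bounded number of times. */
final class BoundedRetry {

  /** Stand-in for an operation such as call.sendResponse(). */
  @FunctionalInterface
  interface IoAction {
    void run() throws IOException;
  }

  /**
   * Runs the action up to maxRetries + 1 times.
   * Returns the number of attempts used on success, or -1 if every attempt failed.
   */
  static int runWithRetries(IoAction action, int maxRetries) {
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        action.run();
        return attempt + 1;
      } catch (IOException e) {
        // A real implementation would log the exception here rather than drop it.
      }
    }
    return -1;
  }
}
```

   A caller could treat a return value of -1 as "response not delivered" and at 
least log it, instead of silently ignoring the failure.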






Issue Time Tracking
---

Worklog Id: (was: 582647)
Time Spent: 50m  (was: 40m)

> The ignored IOException in the RPC response sent by FSEditLogAsync can cause 
> the HDFS client to hang
> 
>
> Key: HDFS-15957
> URL: https://issues.apache.org/jira/browse/HDFS-15957
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs async, namenode
>Affects Versions: 3.2.2
>Reporter: Haoze Wu
>Priority: Critical
>  Labels: pull-request-available
> Attachments: fsshell.txt, namenode.txt, reproduce.patch, 
> secondnamenode.txt
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
>     In FSEditLogAsync, the RpcEdit notification in line 248 could be skipped, 
> because the possible exception (e.g., IOException) thrown in line 365 is 
> always ignored.
>  
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
> class FSEditLogAsync extends FSEditLog implements Runnable {
>   // ...
>   @Override
>   public void run() {
>     try {
>       while (true) {
>         boolean doSync;
>         Edit edit = dequeueEdit();
>         if (edit != null) {
>           // sync if requested by edit log.
>           doSync = edit.logEdit();
>           syncWaitQ.add(edit);
>         } else {
>           // sync when editq runs dry, but have edits pending a sync.
>           doSync = !syncWaitQ.isEmpty();
>         }
>         if (doSync) {
>           // normally edit log exceptions cause the NN to terminate, but tests
>           // relying on ExitUtil.terminate need to see the exception.
>           RuntimeException syncEx = null;
>           try {
>             logSync(getLastWrittenTxId());
>           } catch (RuntimeException ex) {
>             syncEx = ex;
>           }
>           while ((edit = syncWaitQ.poll()) != null) {
>             edit.logSyncNotify(syncEx);   // line 248
>           }
>         }
>       }
>     } catch (InterruptedException ie) {
>       LOG.info(Thread.currentThread().getName() + " was interrupted, exiting");
>     } catch (Throwable t) {
>       terminate(t);
>     }
>   }
>   // the calling rpc thread will return immediately from logSync but the
>   // rpc response will not be sent 

[jira] [Comment Edited] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters

2021-04-14 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320887#comment-17320887
 ] 

Jinglun edited comment on HDFS-15923 at 4/14/21, 3:42 PM:
--

Hi [~zhengzhuobinzzb], you are right! Please continue with your work; we still 
need some test cases. 
{quote}In the current code logic, storing tasks in Journal does not use super 
users and Kerberos credentials. (Because when RPC executes Call, it uses the 
corresponding Ugi's doAs, and the Ugi does not have a Kerberos certificate.)
{quote}
 

I'll start a new Jira (HDFS-15973) to resolve the permission check issue.


was (Author: lijinglun):
Hi [~zhengzhuobinzzb], you are right! Please continue with your work; we still 
need some test cases. 
{quote}In the current code logic, storing tasks in Journal does not use super 
users and Kerberos credentials. (Because when RPC executes Call, it uses the 
corresponding Ugi's doAs, and the Ugi does not have a Kerberos certificate.)
{quote}
 

I'll start a new Jira to resolve the permission check issue.

> RBF:  Authentication failed when rename accross sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Renaming across subclusters with RBF in a Kerberos environment encounters 
> the following two errors:
>  # Save Object to journal.
>  # Precheck tries to get the src file status.
> So we need to use a proxy UGI doAs to create the DistcpProcedure and 
> TrashProcedure and submit the job.
> In the patch I wrap the above methods in a proxy UGI doAs, and it works.
> But there is another strange thing that this patch does not solve:
> the Router uses its own UGI to submit the Distcp job, rather than the user UGI 
> or proxy UGI. This may give the distcp job excessive permissions.
> First: Save Object to journal.
> {code:java}
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at 

[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-04-14 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321099#comment-17321099
 ] 

Jinglun commented on HDFS-15973:


Submitted the initial patch. The patch introduces the RouterINode class to hold 
the file status. First it collects the file status of the src and the dst and 
saves them to a RouterINode array. Then it uses RouterPermissionChecker (very 
similar to FsPermissionChecker) to do the permission check.

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch
>
>
> The router federation rename lacks a permission check. This is a security 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15957) The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15957?focusedWorklogId=582621=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582621
 ]

ASF GitHub Bot logged work on HDFS-15957:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 15:39
Start Date: 14/Apr/21 15:39
Worklog Time Spent: 10m 
  Work Description: functioner commented on a change in pull request #2878:
URL: https://github.com/apache/hadoop/pull/2878#discussion_r613361072



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
##
@@ -378,13 +381,18 @@ public void logSyncWait() {
 
 @Override
 public void logSyncNotify(RuntimeException syncEx) {
-  try {
-if (syncEx == null) {
-  call.sendResponse();
-} else {
-  call.abortResponse(syncEx);
+  for (int retries = 0; retries <= RESPONSE_SEND_RETRIES; retries++) {

Review comment:
   @daryn-sharp, in our report 
([HDFS-15957](https://issues.apache.org/jira/browse/HDFS-15957)), we are doing 
fault injection testing: we inject an IOException in `call.sendResponse()` 
and then observe that the client gets stuck. In this scenario, 
a retry can help.
   
   >  If the response cannot be sent it's either because the connection is 
already closed or there's a bug preventing the encoding of the response.
   
   There is a lot of code within `call.sendResponse()`; some of it is 
related to OS kernel I/O, and some of it may change in the future, 
so I think that, now and in the future, there can be multiple reasons for an 
IOException besides the two you list here. Furthermore, in the scenarios you 
describe, a retry wouldn't make things worse.
   
   Therefore, I think it's still worth adding the retry logic here, although it 
might not be able to handle the two scenarios you describe. Do you agree?






Issue Time Tracking
---

Worklog Id: (was: 582621)
Time Spent: 40m  (was: 0.5h)

> The ignored IOException in the RPC response sent by FSEditLogAsync can cause 
> the HDFS client to hang
> 
>
> Key: HDFS-15957
> URL: https://issues.apache.org/jira/browse/HDFS-15957
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs async, namenode
>Affects Versions: 3.2.2
>Reporter: Haoze Wu
>Priority: Critical
>  Labels: pull-request-available
> Attachments: fsshell.txt, namenode.txt, reproduce.patch, 
> secondnamenode.txt
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
>     In FSEditLogAsync, the RpcEdit notification in line 248 could be skipped, 
> because the possible exception (e.g., IOException) thrown in line 365 is 
> always ignored.
>  
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
> class FSEditLogAsync extends FSEditLog implements Runnable {
>   // ...
>   @Override
>   public void run() {
>     try {
>       while (true) {
>         boolean doSync;
>         Edit edit = dequeueEdit();
>         if (edit != null) {
>           // sync if requested by edit log.
>           doSync = edit.logEdit();
>           syncWaitQ.add(edit);
>         } else {
>           // sync when editq runs dry, but have edits pending a sync.
>           doSync = !syncWaitQ.isEmpty();
>         }
>         if (doSync) {
>           // normally edit log exceptions cause the NN to terminate, but tests
>           // relying on ExitUtil.terminate need to see the exception.
>           RuntimeException syncEx = null;
>           try {
>             logSync(getLastWrittenTxId());
>           } catch (RuntimeException ex) {
>             syncEx = ex;
>           }
>           while ((edit = syncWaitQ.poll()) != null) {
>             edit.logSyncNotify(syncEx);   // line 248
>           }
>         }
>       }
>     } catch (InterruptedException ie) {
>       LOG.info(Thread.currentThread().getName() + " was interrupted, exiting");
>     } catch (Throwable t) {
>       terminate(t);
>     }
>   }
>   // the calling rpc thread will return immediately from logSync but the
>   // rpc response will not be sent until the edit is durable.
>   private static class RpcEdit extends Edit {
>     // ...
>     @Override
>     public void logSyncNotify(RuntimeException syncEx) {
>       try {
>         if (syncEx == null) {
>           call.sendResponse(); 

[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=582618=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582618
 ]

ASF GitHub Bot logged work on HDFS-15869:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 15:38
Start Date: 14/Apr/21 15:38
Worklog Time Spent: 10m 
  Work Description: Hexiaoqiao commented on pull request #2737:
URL: https://github.com/apache/hadoop/pull/2737#issuecomment-819615655


   > I think we also need to add some comments in the code explaining why we don't 
shut down this executor, in case some developers get confused, because at 
first glance it's natural to think that this executor should 
be shut down on `close`.
   
   +1.
   
   > Furthermore, I think your argument implies that `FSEditLogAsync` is 
a singleton within the namenode process; otherwise we may create multiple 
executors without shutting down any of them. Is `FSEditLogAsync` really always 
a singleton (now and in the future)?
   
   IMO, `FSEditLogAsync` is indeed a singleton instance within the NameNode 
process.




Issue Time Tracking
---

Worklog Id: (was: 582618)
Time Spent: 2h  (was: 1h 50m)

> Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can 
> cause the namenode to hang
> 
>
> Key: HDFS-15869
> URL: https://issues.apache.org/jira/browse/HDFS-15869
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs async, namenode
>Affects Versions: 3.2.2
>Reporter: Haoze Wu
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
>     We were testing the latest stable Hadoop release, 3.2.2, and 
> found that a network issue can cause the namenode to hang even with async 
> edit logging (FSEditLogAsync).
>     The workflow of the FSEditLogAsync thread is basically:
>  # get EditLog from a queue (line 229)
>  # do the transaction (line 232)
>  # sync the log if doSync (line 243)
>  # do logSyncNotify (line 248)
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
>   @Override
>   public void run() {
>     try {
>       while (true) {
>         boolean doSync;
>         Edit edit = dequeueEdit();             // line 229
>         if (edit != null) {
>           // sync if requested by edit log.
>           doSync = edit.logEdit();             // line 232
>           syncWaitQ.add(edit);
>         } else {
>           // sync when editq runs dry, but have edits pending a sync.
>           doSync = !syncWaitQ.isEmpty();
>         }
>         if (doSync) {
>           // normally edit log exceptions cause the NN to terminate, but tests
>           // relying on ExitUtil.terminate need to see the exception.
>           RuntimeException syncEx = null;
>           try {
>             logSync(getLastWrittenTxId());     // line 243
>           } catch (RuntimeException ex) {
>             syncEx = ex;
>           }
>           while ((edit = syncWaitQ.poll()) != null) {
>             edit.logSyncNotify(syncEx);        // line 248
>           }
>         }
>       }
>     } catch (InterruptedException ie) {
>       LOG.info(Thread.currentThread().getName() + " was interrupted, exiting");
>     } catch (Throwable t) {
>       terminate(t);
>     }
>   }
> {code}
>     In step 4, FSEditLogAsync$RpcEdit.logSyncNotify 
> essentially performs a network write (line 365).
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
>   private static class RpcEdit extends Edit {
>     // ...
>     @Override
>     public void logSyncNotify(RuntimeException syncEx) {
>       try {
>         if (syncEx == null) {
>           call.sendResponse();   // line 365
>         } else {
>           call.abortResponse(syncEx);
>         }
>       } catch (Exception e) {} // don't care if not sent.
>     }
>     // ...
>   }{code}
>     If the sendResponse operation in line 365 gets stuck, the whole 
> FSEditLogAsync thread cannot proceed. In this case, the critical 
> logSync (line 243) can’t be executed for incoming transactions, and the 
> namenode hangs. This is undesirable because FSEditLogAsync’s key 

[jira] [Updated] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-04-14 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15973:
---
Attachment: HDFS-15973.001.patch
Status: Patch Available  (was: Open)

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15973.001.patch
>
>
> The router federation rename lacks a permission check. This is a security 
> issue.






[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=582603=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582603
 ]

ASF GitHub Bot logged work on HDFS-15869:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 15:24
Start Date: 14/Apr/21 15:24
Worklog Time Spent: 10m 
  Work Description: functioner commented on pull request #2737:
URL: https://github.com/apache/hadoop/pull/2737#issuecomment-819605128


   @Hexiaoqiao Thanks for the comment, I removed 
`logSyncNotifyExecutor.shutdown()`.
   
   I think we also need to add some comments in the code explaining why we don't 
shut down this executor, in case some developers get confused, because at 
first glance it's natural to think that this executor should 
be shut down on `close`.
   
   Furthermore, I think your argument implies that `FSEditLogAsync` is 
a singleton within the namenode process; otherwise we may create multiple 
executors without shutting down any of them. Is `FSEditLogAsync` really always 
a singleton (now and in the future)?
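
   To make the executor discussion concrete, here is a rough sketch of handing 
the notification to a separate pool so that the thread doing the log sync is 
not blocked by a slow network write. This is not the code in the pull request: 
`NotifyOffload` and `notifyAsync` are hypothetical names, the constructor 
argument merely stands in for a configurable pool size, and using daemon 
threads is an assumption about one way to make a never-shut-down executor 
harmless at JVM exit.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

/** Hypothetical sketch: run sync notifications on a separate pool. */
final class NotifyOffload {

  // Assumption: daemon threads, so an executor that is never shut down
  // does not keep the JVM alive on exit.
  private static final ThreadFactory DAEMON_FACTORY = r -> {
    Thread t = new Thread(r, "logSyncNotify-sketch");
    t.setDaemon(true);
    return t;
  };

  private final ExecutorService notifyExecutor;

  /** poolSize stands in for a configurable executor size. */
  NotifyOffload(int poolSize) {
    this.notifyExecutor = Executors.newFixedThreadPool(poolSize, DAEMON_FACTORY);
  }

  /** Queues the notification and returns immediately; the caller is never blocked by it. */
  void notifyAsync(Runnable notification) {
    notifyExecutor.execute(notification);
  }
}
```

   With a pool of size greater than one, a single stuck notification occupies 
one worker while the others keep draining the queue; with a pool of size one, 
notifications queue up behind the stuck one, but the submitting thread still 
makes progress.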




Issue Time Tracking
---

Worklog Id: (was: 582603)
Time Spent: 1h 50m  (was: 1h 40m)

> Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can 
> cause the namenode to hang
> 
>
> Key: HDFS-15869
> URL: https://issues.apache.org/jira/browse/HDFS-15869
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs async, namenode
>Affects Versions: 3.2.2
>Reporter: Haoze Wu
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
>     We were testing the latest stable Hadoop release, 3.2.2, and 
> found that a network issue can cause the namenode to hang even with async 
> edit logging (FSEditLogAsync).
>     The workflow of the FSEditLogAsync thread is basically:
>  # get EditLog from a queue (line 229)
>  # do the transaction (line 232)
>  # sync the log if doSync (line 243)
>  # do logSyncNotify (line 248)
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
>   @Override
>   public void run() {
>     try {
>       while (true) {
>         boolean doSync;
>         Edit edit = dequeueEdit();             // line 229
>         if (edit != null) {
>           // sync if requested by edit log.
>           doSync = edit.logEdit();             // line 232
>           syncWaitQ.add(edit);
>         } else {
>           // sync when editq runs dry, but have edits pending a sync.
>           doSync = !syncWaitQ.isEmpty();
>         }
>         if (doSync) {
>           // normally edit log exceptions cause the NN to terminate, but tests
>           // relying on ExitUtil.terminate need to see the exception.
>           RuntimeException syncEx = null;
>           try {
>             logSync(getLastWrittenTxId());     // line 243
>           } catch (RuntimeException ex) {
>             syncEx = ex;
>           }
>           while ((edit = syncWaitQ.poll()) != null) {
>             edit.logSyncNotify(syncEx);        // line 248
>           }
>         }
>       }
>     } catch (InterruptedException ie) {
>       LOG.info(Thread.currentThread().getName() + " was interrupted, exiting");
>     } catch (Throwable t) {
>       terminate(t);
>     }
>   }
> {code}
>     In step 4, FSEditLogAsync$RpcEdit.logSyncNotify 
> essentially performs a network write (line 365).
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
>   private static class RpcEdit extends Edit {
>     // ...
>     @Override
>     public void logSyncNotify(RuntimeException syncEx) {
>       try {
>         if (syncEx == null) {
>           call.sendResponse();   // line 365
>         } else {
>           call.abortResponse(syncEx);
>         }
>       } catch (Exception e) {} // don't care if not sent.
>     }
>     // ...
>   }{code}
>     If the sendResponse operation in line 365 gets stuck, the whole 
> FSEditLogAsync thread cannot proceed. In this case, the critical 
> logSync (line 243) can’t be executed for incoming transactions, and the 
> namenode hangs. This is undesirable because FSEditLogAsync’s key feature is 

[jira] [Work logged] (HDFS-15975) Use LongAdder instead of AtomicLong

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15975?focusedWorklogId=582589=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582589
 ]

ASF GitHub Bot logged work on HDFS-15975:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 15:08
Start Date: 14/Apr/21 15:08
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #2907:
URL: https://github.com/apache/hadoop/pull/2907#issuecomment-819592967


   > Nice one. For metrics and statistics, `LongAdder` provides much better 
throughput.
   
   Thanks @virajjasani for the review.




Issue Time Tracking
---

Worklog Id: (was: 582589)
Time Spent: 20m  (was: 10m)

> Use LongAdder instead of AtomicLong
> ---
>
> Key: HDFS-15975
> URL: https://issues.apache.org/jira/browse/HDFS-15975
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When counting metrics, we can use LongAdder instead of AtomicLong to 
> improve performance. The value read from a LongAdder is not an atomic 
> snapshot, but I think we can tolerate that.
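
A small sketch of the swap described above, for illustration only. `OpCounter` 
is a hypothetical class, not taken from the patch. It uses 
`java.util.concurrent.atomic.LongAdder`, whose `increment()` spreads contended 
updates across internal cells and whose `sum()` adds those cells up without 
locking, which is why a read taken during concurrent updates is not an atomic 
snapshot.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical metrics counter backed by LongAdder instead of AtomicLong. */
final class OpCounter {

  private final LongAdder ops = new LongAdder();

  /** Called on the hot path by many threads; cheap under contention. */
  void record() {
    ops.increment();
  }

  /**
   * Returns the current total. While other threads are still updating,
   * the result may miss in-flight increments; once updates have stopped it is exact.
   */
  long current() {
    return ops.sum();
  }
}
```

This trade-off suits counters that are written often and read rarely, such as 
metrics; values that drive control flow (for example sequence numbers) still 
need AtomicLong.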






[jira] [Work logged] (HDFS-15957) The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15957?focusedWorklogId=582587=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582587
 ]

ASF GitHub Bot logged work on HDFS-15957:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 15:07
Start Date: 14/Apr/21 15:07
Worklog Time Spent: 10m 
  Work Description: daryn-sharp commented on a change in pull request #2878:
URL: https://github.com/apache/hadoop/pull/2878#discussion_r61355



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
##
@@ -378,13 +381,18 @@ public void logSyncWait() {
 
 @Override
 public void logSyncNotify(RuntimeException syncEx) {
-  try {
-if (syncEx == null) {
-  call.sendResponse();
-} else {
-  call.abortResponse(syncEx);
+  for (int retries = 0; retries <= RESPONSE_SEND_RETRIES; retries++) {

Review comment:
   This appears to be a "solution" to the contrived fault injection.  If 
the response cannot be sent it's either because the connection is already 
closed or there's a bug preventing the encoding of the response.  In either 
case, retrying is not going to help.
   
   Have you observed a real problem or just noticed this by casual inspection 
of the code?






Issue Time Tracking
---

Worklog Id: (was: 582587)
Time Spent: 0.5h  (was: 20m)

> The ignored IOException in the RPC response sent by FSEditLogAsync can cause 
> the HDFS client to hang
> 
>
> Key: HDFS-15957
> URL: https://issues.apache.org/jira/browse/HDFS-15957
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs async, namenode
>Affects Versions: 3.2.2
>Reporter: Haoze Wu
>Priority: Critical
>  Labels: pull-request-available
> Attachments: fsshell.txt, namenode.txt, reproduce.patch, 
> secondnamenode.txt
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
>     In FSEditLogAsync, the RpcEdit notification in line 248 could be skipped, 
> because the possible exception (e.g., IOException) thrown in line 365 is 
> always ignored.
>  
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
> class FSEditLogAsync extends FSEditLog implements Runnable {
>   // ...
>   @Override
>   public void run() {
>     try {
>       while (true) {
>         boolean doSync;
>         Edit edit = dequeueEdit();
>         if (edit != null) {
>           // sync if requested by edit log.
>           doSync = edit.logEdit();
>           syncWaitQ.add(edit);
>         } else {
>           // sync when editq runs dry, but have edits pending a sync.
>           doSync = !syncWaitQ.isEmpty();
>         }
>         if (doSync) {
>           // normally edit log exceptions cause the NN to terminate, but tests
>           // relying on ExitUtil.terminate need to see the exception.
>           RuntimeException syncEx = null;
>           try {
>             logSync(getLastWrittenTxId());
>           } catch (RuntimeException ex) {
>             syncEx = ex;
>           }
>           while ((edit = syncWaitQ.poll()) != null) {
>             edit.logSyncNotify(syncEx);   // line 248
>           }
>         }
>       }
>     } catch (InterruptedException ie) {
>       LOG.info(Thread.currentThread().getName() + " was interrupted, exiting");
>     } catch (Throwable t) {
>       terminate(t);
>     }
>   }
>   // the calling rpc thread will return immediately from logSync but the
>   // rpc response will not be sent until the edit is durable.
>   private static class RpcEdit extends Edit {
>     // ...
>     @Override
>     public void logSyncNotify(RuntimeException syncEx) {
>       try {
>         if (syncEx == null) {
>           call.sendResponse();   // line 365
>         } else {
>           call.abortResponse(syncEx);
>         }
>       } catch (Exception e) {} // don't care if not sent.
>     }
>   }
> }
> {code}
>     `call.sendResponse()` may throw an IOException. According to the 
> comment there (“don’t care if not sent”), this exception is neither handled 
> nor logged. However, we suspect that some RPC responses sent there 
> may be critical, and there should be some retry mechanism.
>     We injected a single IOException at line 365 and found that the 
> HDFS client (e.g., 

[jira] [Commented] (HDFS-15957) The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang

2021-04-14 Thread Daryn Sharp (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321069#comment-17321069
 ] 

Daryn Sharp commented on HDFS-15957:


I'll take a look today.  In practice, the only time sendResponse should 
actually throw is if the client has disconnected.  Encoding rpc responses can 
theoretically fail if there's a bug in the NN but it should close the 
connection.

> The ignored IOException in the RPC response sent by FSEditLogAsync can cause 
> the HDFS client to hang
> 
>
> Key: HDFS-15957
> URL: https://issues.apache.org/jira/browse/HDFS-15957
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs async, namenode
>Affects Versions: 3.2.2
>Reporter: Haoze Wu
>Priority: Critical
>  Labels: pull-request-available
> Attachments: fsshell.txt, namenode.txt, reproduce.patch, 
> secondnamenode.txt
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
>     In FSEditLogAsync, the RpcEdit notification in line 248 could be skipped, 
> because the possible exception (e.g., IOException) thrown in line 365 is 
> always ignored.
>  
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
> class FSEditLogAsync extends FSEditLog implements Runnable {
>   // ...
>   @Override
>   public void run() {
>     try {
>       while (true) {
>         boolean doSync;
>         Edit edit = dequeueEdit();
>         if (edit != null) {
>           // sync if requested by edit log.
>           doSync = edit.logEdit();
>           syncWaitQ.add(edit);
>         } else {
>           // sync when editq runs dry, but have edits pending a sync.
>           doSync = !syncWaitQ.isEmpty();
>         }
>         if (doSync) {
>           // normally edit log exceptions cause the NN to terminate, but tests
>           // relying on ExitUtil.terminate need to see the exception.
>           RuntimeException syncEx = null;
>           try {
>             logSync(getLastWrittenTxId());
>           } catch (RuntimeException ex) {
>             syncEx = ex;
>           }
>           while ((edit = syncWaitQ.poll()) != null) {
>             edit.logSyncNotify(syncEx);   // line 248
>           }
>         }
>       }
>     } catch (InterruptedException ie) {
>       LOG.info(Thread.currentThread().getName() + " was interrupted, exiting");
>     } catch (Throwable t) {
>       terminate(t);
>     }
>   }
>   // the calling rpc thread will return immediately from logSync but the
>   // rpc response will not be sent until the edit is durable.
>   private static class RpcEdit extends Edit {
>     // ...
>     @Override
>     public void logSyncNotify(RuntimeException syncEx) {
>       try {
>         if (syncEx == null) {
>           call.sendResponse();   // line 365
>         } else {
>           call.abortResponse(syncEx);
>         }
>       } catch (Exception e) {} // don't care if not sent.
>     }
>   }
> }
> {code}
>     The `call.sendResponse()` call may throw an IOException. According to the 
> comment (“don’t care if not sent”) there, this exception is neither handled 
> nor logged. However, we suspect that some RPC responses sent there 
> may be critical, and that there should be some retry mechanism.
>     We injected a single IOException at line 365 and found that the 
> HDFS client (e.g., `bin/hdfs dfs -copyFromLocal ./foo.txt /1.txt`) may get 
> stuck forever (it hung for >30 min without any log output). We can reproduce 
> this symptom in multiple ways. One of the simplest ways is as follows:
>  # Start a new empty HDFS cluster (1 namenode, 2 datanodes) with the default 
> configuration.
>  # Generate a file of 15MB for testing, by `fallocate -l 15MB foo.txt`.
>  # Run the HDFS client `bin/hdfs dfs -copyFromLocal ./foo.txt /1.txt`.
>  # When line 365 is invoked the third time (it is invoked 6 times in total in 
> this experiment), inject an IOException there. (A patch for injecting the 
> exception this way is attached to reproduce the issue)
>     Then the client hangs forever, without any log output. If we run 
> `bin/hdfs dfs -ls /` to check the file status, we cannot see the expected 
> 15MB `/1.txt` file.
>     The jstack of the HDFS client shows that there is an RPC call waiting 
> indefinitely.
> {code:java}
> "Thread-6" #18 daemon prio=5 os_prio=0 tid=0x7f9cd5295800 nid=0x26b9 in Object.wait() [0x7f9ca354f000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00071e709610> (a org.apache.hadoop.ipc.Client$Call)
>         at java.lang.Object.wait(Object.java:502)
> at 

[jira] [Work logged] (HDFS-15976) Make mkdtemp cross platform

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15976?focusedWorklogId=582579=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582579
 ]

ASF GitHub Bot logged work on HDFS-15976:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 14:57
Start Date: 14/Apr/21 14:57
Worklog Time Spent: 10m 
  Work Description: GauthamBanasandra opened a new pull request #2908:
URL: https://github.com/apache/hadoop/pull/2908


   * mkdtemp is used for creating a temporary
 directory that adheres to the given pattern.
 It's not available in Visual C++, so we need
 to make this cross-platform.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 582579)
Remaining Estimate: 0h
Time Spent: 10m

> Make mkdtemp cross platform
> ---
>
> Key: HDFS-15976
> URL: https://issues.apache.org/jira/browse/HDFS-15976
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: libhdfs++
>Affects Versions: 3.4.0
>Reporter: Gautham Banasandra
>Assignee: Gautham Banasandra
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> mkdtemp is used for creating a temporary directory that adheres to the given 
> pattern. It's not available in Visual C++, so we need to make this 
> cross-platform.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15976) Make mkdtemp cross platform

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-15976:
--
Labels: pull-request-available  (was: )

> Make mkdtemp cross platform
> ---
>
> Key: HDFS-15976
> URL: https://issues.apache.org/jira/browse/HDFS-15976
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: libhdfs++
>Affects Versions: 3.4.0
>Reporter: Gautham Banasandra
>Assignee: Gautham Banasandra
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> mkdtemp is used for creating a temporary directory that adheres to the given 
> pattern. It's not available in Visual C++, so we need to make this 
> cross-platform.






[jira] [Commented] (HDFS-15134) Any write calls with REST API on Standby NN print error message with wrong online help URL

2021-04-14 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321059#comment-17321059
 ] 

Hadoop QA commented on HDFS-15134:
--

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Logfile || Comment ||
| 0 | reexec | 4m 6s | | Docker mode activated. |
|| || || || Prechecks || ||
| +1 | dupname | 0m 0s | | No case conflicting files found. |
| +1 | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || trunk Compile Tests || ||
| +1 | mvninstall | 25m 41s | | trunk passed |
| +1 | compile | 1m 27s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 | compile | 1m 18s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 | checkstyle | 0m 59s | | trunk passed |
| +1 | mvnsite | 1m 22s | | trunk passed |
| +1 | shadedclient | 17m 36s | | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 55s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 | javadoc | 1m 21s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| 0 | spotbugs | 23m 11s | | Both FindBugs and SpotBugs are enabled, using SpotBugs. |
| +1 | spotbugs | 3m 21s | | trunk passed |
|| || || || Patch Compile Tests || ||
| +1 | mvninstall | 1m 16s | | the patch passed |
| +1 | compile | 1m 18s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 | javac | 1m 18s | | the patch passed |
| +1 | compile | 1m 8s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 | javac | 1m 8s | | the patch passed |
| +1 | checkstyle | 0m 54s | | the patch passed |
| +1 | mvnsite | 1m 14s | | the patch passed |
| +1 | whitespace | 0m 0s | | The patch has no whitespace issues. |
| +1 | shadedclient | 14m 57s | | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 50s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| 

[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=582560=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582560
 ]

ASF GitHub Bot logged work on HDFS-15869:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 14:41
Start Date: 14/Apr/21 14:41
Worklog Time Spent: 10m 
  Work Description: Hexiaoqiao commented on a change in pull request #2737:
URL: https://github.com/apache/hadoop/pull/2737#discussion_r613305815



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
##
@@ -117,6 +125,7 @@ void openForWrite(int layoutVersion) throws IOException {
   public void close() {
     super.close();
     stopSyncThread();
+    logSyncNotifyExecutor.shutdown();

Review comment:
   This could cause a problem when the NameNode transitions from active to 
standby and then back to active: the executor has already been shut down, so 
no response will be sent. Moreover, I suspect that the namenode process could 
hit a fatal error in this case. In my opinion, we do not need to shut down 
this executor. FYI. Thanks.
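
To illustrate the concern in this review comment, here is a minimal, self-contained Java sketch. It is not taken from the pull request: `notifyExecutor` and the class name are hypothetical stand-ins, and it assumes the executor would be reused after `close()` without being recreated. It only shows the standard `java.util.concurrent` behaviour that a pool rejects new tasks once `shutdown()` has been called.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical demo class; not part of Hadoop.
class ExecutorShutdownSketch {

  /** Returns true if a task submitted after shutdown() is rejected. */
  static boolean rejectedAfterShutdown() {
    // Stand-in for an executor that sends RPC responses.
    ExecutorService notifyExecutor = Executors.newFixedThreadPool(1);

    // Analogous to shutting the executor down in close().
    notifyExecutor.shutdown();

    try {
      // Analogous to trying to notify again after a later re-open.
      notifyExecutor.execute(() -> System.out.println("response sent"));
      return false;
    } catch (RejectedExecutionException e) {
      // The default rejection policy throws once the pool is shut down.
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("rejected after shutdown: " + rejectedAfterShutdown());
  }
}
```

If the executor really is shared across active/standby transitions, it would either have to stay alive or be recreated when the edit log is reopened; which of the two fits the patch better is for the PR author to decide.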






Issue Time Tracking
---

Worklog Id: (was: 582560)
Time Spent: 1h 40m  (was: 1.5h)

> Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can 
> cause the namenode to hang
> 
>
> Key: HDFS-15869
> URL: https://issues.apache.org/jira/browse/HDFS-15869
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs async, namenode
>Affects Versions: 3.2.2
>Reporter: Haoze Wu
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
>     We were testing the latest stable Hadoop release, 3.2.2, and found that 
> some network issues can cause the namenode to hang even with async 
> edit logging (FSEditLogAsync).
>     The workflow of the FSEditLogAsync thread is basically:
>  # get EditLog from a queue (line 229)
>  # do the transaction (line 232)
>  # sync the log if doSync (line 243)
>  # do logSyncNotify (line 248)
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
>   @Override
>   public void run() {
>     try {
>       while (true) {
>         boolean doSync;
>         Edit edit = dequeueEdit();                 // line 229
>         if (edit != null) {
>           // sync if requested by edit log.
>           doSync = edit.logEdit();                 // line 232
>           syncWaitQ.add(edit);
>         } else {
>           // sync when editq runs dry, but have edits pending a sync.
>           doSync = !syncWaitQ.isEmpty();
>         }
>         if (doSync) {
>           // normally edit log exceptions cause the NN to terminate, but tests
>           // relying on ExitUtil.terminate need to see the exception.
>           RuntimeException syncEx = null;
>           try {
>             logSync(getLastWrittenTxId());         // line 243
>           } catch (RuntimeException ex) {
>             syncEx = ex;
>           }
>           while ((edit = syncWaitQ.poll()) != null) {
>             edit.logSyncNotify(syncEx);            // line 248
>           }
>         }
>       }
>     } catch (InterruptedException ie) {
>       LOG.info(Thread.currentThread().getName() + " was interrupted, exiting");
>     } catch (Throwable t) {
>       terminate(t);
>     }
>   }
> {code}
>     For step 4, FSEditLogAsync$RpcEdit.logSyncNotify is essentially doing a 
> network write (line 365).
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
>   private static class RpcEdit extends Edit {
>     // ...
>     @Override
>     public void logSyncNotify(RuntimeException syncEx) {
>       try {
>         if (syncEx == null) {
>           call.sendResponse();                     // line 365
>         } else {
>           call.abortResponse(syncEx);
>         }
>       } catch (Exception e) {} // don't care if not sent.
>     }
>     // ...
>   }
> {code}
>     If the sendResponse operation in line 365 gets stuck, then the whole 
> FSEditLogAsync thread is not able to proceed. In this case, the critical 
> logSync (line 243) can’t be executed, for the incoming transactions. Then the 
> namenode hangs. This is undesirable 
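
The hang described in the quoted report can be sketched in isolation. The following is only an illustration under stated assumptions, not the code from pull request #2737: the class, the pool size of 2 and the latch are hypothetical, and "sync" is reduced to a counter. It shows the general idea of handing the notification step to an executor so that one stuck notification does not stall the loop that performs the syncs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo class; not part of Hadoop.
class NotifyOffloadSketch {

  /**
   * Runs a loop of `edits` iterations. The first notification blocks (a
   * stand-in for a stuck network write) until the loop has finished.
   * Returns how many sync steps completed while it was still blocked.
   */
  static int syncsCompletedWhileNotifyBlocked(int edits) {
    // Hypothetical pool that runs the notifications off the sync thread.
    ExecutorService notifyExecutor = Executors.newFixedThreadPool(2);
    CountDownLatch stuck = new CountDownLatch(1);
    int synced = 0;
    try {
      for (int i = 0; i < edits; i++) {
        synced++; // stand-in for the sync step

        final boolean first = (i == 0);
        // Stand-in for the notify step, submitted instead of run inline.
        notifyExecutor.execute(() -> {
          if (first) {
            try {
              stuck.await(); // simulates a send that does not return
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
          }
        });
      }
      return synced;
    } finally {
      stuck.countDown();         // release the simulated stuck send
      notifyExecutor.shutdown(); // let the demo JVM exit cleanly
    }
  }

  public static void main(String[] args) {
    System.out.println("syncs completed: " + syncsCompletedWhileNotifyBlocked(5));
  }
}
```

Had the blocking call been made inline in the loop, the first iteration would never have returned. The trade-off of offloading is that responses may then be sent from a different thread, so any ordering assumptions need to be checked separately.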

[jira] [Created] (HDFS-15975) Use LongAdder instead of AtomicLong

2021-04-14 Thread tomscut (Jira)
tomscut created HDFS-15975:
--

 Summary: Use LongAdder instead of AtomicLong
 Key: HDFS-15975
 URL: https://issues.apache.org/jira/browse/HDFS-15975
 Project: Hadoop HDFS
  Issue Type: Wish
Reporter: tomscut
Assignee: tomscut


When counting some indicators, we can use LongAdder instead of AtomicLong to 
improve performance. The long value is not an atomic snapshot in LongAdder, but 
I think we can tolerate that.
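
As a rough illustration of the proposal (a sketch only; the class and method names are made up and it is not the code from the pull request), `java.util.concurrent.atomic.LongAdder` can be used as a drop-in counter for write-heavy statistics. The comment in the code marks where the "not an atomic snapshot" caveat applies.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical demo class; not part of Hadoop.
class LongAdderSketch {

  /** Increments one shared LongAdder from several threads and returns the total. */
  static long countWithLongAdder(int threads, int incrementsPerThread) {
    LongAdder counter = new LongAdder();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      pool.execute(() -> {
        for (int i = 0; i < incrementsPerThread; i++) {
          counter.increment(); // spreads contention across internal cells
        }
      });
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    }
    // All writers have finished, so sum() is exact here. While writers are
    // still running, sum() is not an atomic snapshot and may miss updates
    // that happen during the summation.
    return counter.sum();
  }

  public static void main(String[] args) {
    System.out.println("total = " + countWithLongAdder(4, 10_000));
  }
}
```

This suits counters that are updated often and read rarely (for example, when metrics are published). Where a value must be read and updated atomically in one step, AtomicLong remains the safer choice.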






[jira] [Work logged] (HDFS-15975) Use LongAdder instead of AtomicLong

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15975?focusedWorklogId=582546=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582546
 ]

ASF GitHub Bot logged work on HDFS-15975:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 14:30
Start Date: 14/Apr/21 14:30
Worklog Time Spent: 10m 
  Work Description: tomscut opened a new pull request #2907:
URL: https://github.com/apache/hadoop/pull/2907


   JIRA: [HDFS-15975](https://issues.apache.org/jira/browse/HDFS-15975)
   
   When counting some indicators, we can use LongAdder instead of AtomicLong to 
improve performance. The long value is not an atomic snapshot in LongAdder, but 
I think we can tolerate that.




Issue Time Tracking
---

Worklog Id: (was: 582546)
Remaining Estimate: 0h
Time Spent: 10m

> Use LongAdder instead of AtomicLong
> ---
>
> Key: HDFS-15975
> URL: https://issues.apache.org/jira/browse/HDFS-15975
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When counting some indicators, we can use LongAdder instead of AtomicLong to 
> improve performance. The long value is not an atomic snapshot in LongAdder, 
> but I think we can tolerate that.






[jira] [Created] (HDFS-15976) Make mkdtemp cross platform

2021-04-14 Thread Gautham Banasandra (Jira)
Gautham Banasandra created HDFS-15976:
-

 Summary: Make mkdtemp cross platform
 Key: HDFS-15976
 URL: https://issues.apache.org/jira/browse/HDFS-15976
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: libhdfs++
Affects Versions: 3.4.0
Reporter: Gautham Banasandra
Assignee: Gautham Banasandra


mkdtemp is used for creating a temporary directory that adheres to the given 
pattern. It's not available in Visual C++, so we need to make this 
cross-platform.






[jira] [Updated] (HDFS-15975) Use LongAdder instead of AtomicLong

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-15975:
--
Labels: pull-request-available  (was: )

> Use LongAdder instead of AtomicLong
> ---
>
> Key: HDFS-15975
> URL: https://issues.apache.org/jira/browse/HDFS-15975
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When counting some indicators, we can use LongAdder instead of AtomicLong to 
> improve performance. The long value is not an atomic snapshot in LongAdder, 
> but I think we can tolerate that.






[jira] [Commented] (HDFS-15957) The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang

2021-04-14 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320983#comment-17320983
 ] 

Wei-Chiu Chuang commented on HDFS-15957:


[~kihwal] [~daryn] [~ahussein] any ideas about this async edit logger bug?

> The ignored IOException in the RPC response sent by FSEditLogAsync can cause 
> the HDFS client to hang
> 
>
> Key: HDFS-15957
> URL: https://issues.apache.org/jira/browse/HDFS-15957
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs async, namenode
>Affects Versions: 3.2.2
>Reporter: Haoze Wu
>Priority: Critical
>  Labels: pull-request-available
> Attachments: fsshell.txt, namenode.txt, reproduce.patch, 
> secondnamenode.txt
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
>     In FSEditLogAsync, the RpcEdit notification in line 248 could be skipped, 
> because the possible exception (e.g., IOException) thrown in line 365 is 
> always ignored.
>  
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
> class FSEditLogAsync extends FSEditLog implements Runnable {
>   // ...
>   @Override
>   public void run() {
>     try {
>       while (true) {
>         boolean doSync;
>         Edit edit = dequeueEdit();
>         if (edit != null) {
>           // sync if requested by edit log.
>           doSync = edit.logEdit();
>           syncWaitQ.add(edit);
>         } else {
>           // sync when editq runs dry, but have edits pending a sync.
>           doSync = !syncWaitQ.isEmpty();
>         }
>         if (doSync) {
>           // normally edit log exceptions cause the NN to terminate, but tests
>           // relying on ExitUtil.terminate need to see the exception.
>           RuntimeException syncEx = null;
>           try {
>             logSync(getLastWrittenTxId());
>           } catch (RuntimeException ex) {
>             syncEx = ex;
>           }
>           while ((edit = syncWaitQ.poll()) != null) {
>             edit.logSyncNotify(syncEx);   // line 248
>           }
>         }
>       }
>     } catch (InterruptedException ie) {
>       LOG.info(Thread.currentThread().getName() + " was interrupted, exiting");
>     } catch (Throwable t) {
>       terminate(t);
>     }
>   }
>   // the calling rpc thread will return immediately from logSync but the
>   // rpc response will not be sent until the edit is durable.
>   private static class RpcEdit extends Edit {
>     // ...
>     @Override
>     public void logSyncNotify(RuntimeException syncEx) {
>       try {
>         if (syncEx == null) {
>           call.sendResponse();        // line 365
>         } else {
>           call.abortResponse(syncEx);
>         }
>       } catch (Exception e) {} // don't care if not sent.
>     }
>   }
> }
> {code}
>     The `call.sendResponse()` call may throw an IOException. According to the 
> comment (“don’t care if not sent”) there, this exception is neither handled 
> nor logged. However, we suspect that some RPC responses sent there 
> may be critical, and that there should be some retry mechanism.
>     We injected a single IOException at line 365 and found that the 
> HDFS client (e.g., `bin/hdfs dfs -copyFromLocal ./foo.txt /1.txt`) may get 
> stuck forever (it hung for >30 min without any log output). We can reproduce 
> this symptom in multiple ways. One of the simplest ways is as follows:
>  # Start a new empty HDFS cluster (1 namenode, 2 datanodes) with the default 
> configuration.
>  # Generate a file of 15MB for testing, by `fallocate -l 15MB foo.txt`.
>  # Run the HDFS client `bin/hdfs dfs -copyFromLocal ./foo.txt /1.txt`.
>  # When line 365 is invoked the third time (it is invoked 6 times in total in 
> this experiment), inject an IOException there. (A patch for injecting the 
> exception this way is attached to reproduce the issue)
>     Then the client hangs forever, without any log output. If we run 
> `bin/hdfs dfs -ls /` to check the file status, we cannot see the expected 
> 15MB `/1.txt` file.
>     The jstack of the HDFS client shows that there is an RPC call waiting 
> indefinitely.
> {code:java}
> "Thread-6" #18 daemon prio=5 os_prio=0 tid=0x7f9cd5295800 nid=0x26b9 in Object.wait() [0x7f9ca354f000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00071e709610> (a org.apache.hadoop.ipc.Client$Call)
>         at java.lang.Object.wait(Object.java:502)
>         at org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:59)
>         at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1556)
> - locked 

[jira] [Commented] (HDFS-15614) Initialize snapshot trash root during NameNode startup if enabled

2021-04-14 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320982#comment-17320982
 ] 

Ayush Saxena commented on HDFS-15614:
-

Quoting [~shashikant] from the 
[PR|https://github.com/apache/hadoop/pull/2881#issuecomment-819454476]
{quote}Coming to think of it, if providing an external command for admins to 
create the Trash directory is feasible and makes sense, I think it's OK to 
remove the NN startup logic to create Trash directories inside snapshottable 
root.
{quote}
Yep, that's exactly my point. We seem to be on the same page now. :)

[~smeng]
{quote}Therefore this would be the same deal for encryption zone trash root 
creation. When replacing dfs.allowSnapshot calls with dfs.createEncryptionZone 
in the first test case we should also find trash root inside encryption zone 
missing with dfs.provisionSnapshotTrash call alone.
{quote}
Well, I didn't actually catch this, but the ambiguity I am talking about is not 
between commands executed from DFSAdmin and DFS. The ambiguity I am talking 
about is the change that appears after a namenode startup/failover. If the 
trash root was there before the failover/restart, it should be there 
afterwards. If it wasn't there before, it shouldn't surface after the 
failover/restart either.

EncryptionZones don't create any trash dir themselves during restart or 
failover, AFAIK. So this ambiguity will not be there. And Trash existing when 
created through DFSAdmin but not through DFS is OK for both your case and the 
encryption zone case. That is just a behavioural aspect.
{quote}name quota can become an issue indeed.
 I think I got your point. Maybe a better way to create those necessary Trash 
dirs is to ask an admin to run dfsadmin command manually after flipping 
dfs.namenode.snapshot.trashroot.enabled to true.
 Currently we already have dfsadmin -provisionSnapshotTrash but can only be 
done one by one. dfsadmin -provisionSnapshotTrash -all can be implemented to 
achieve this.
{quote}
Yep, we have an agreement again. :) This way sounds very appropriate. Having 
{{dfsadmin -provisionSnapshotTrash -all}} will save effort and would be really 
good. It should be very doable on the client side itself. It could go to the 
Namenode too, but going to the namenode to create all of them in one shot 
might lead to lock starvation and the like.

So, if everyone is on the same page (I suppose), we can get rid of HDFS-15614 
and HDFS-15820, and any other fix on top of them, and open a new jira for the 
-all stuff.

 

Thanks Shashikant and Siyao!!!

> Initialize snapshot trash root during NameNode startup if enabled
> -
>
> Key: HDFS-15614
> URL: https://issues.apache.org/jira/browse/HDFS-15614
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> This is a follow-up to HDFS-15607.
> Goal:
> Initialize (create) snapshot trash root for all existing snapshottable 
> directories if {{dfs.namenode.snapshot.trashroot.enabled}} is set to 
> {{true}}. So admins won't have to run {{dfsadmin -provisionTrash}} manually 
> on all those existing snapshottable directories.
> The change is expected to land in {{FSNamesystem}}.
> Discussion:
> 1. Currently in HDFS-15607, the snapshot trash root creation logic is on the 
> client side. But in order for NN to create it at startup, the logic must 
> (also) be implemented on the server side as well. -- which is also a 
> requirement by WebHDFS (HDFS-15612).
> 2. Alternatively, we can provide an extra parameter to the 
> {{-provisionTrash}} command like: {{dfsadmin -provisionTrash -all}} to 
> initialize/provision trash root on all existing snapshottable dirs.






[jira] [Work logged] (HDFS-15963) Unreleased volume references cause an infinite loop

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15963?focusedWorklogId=582430=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582430
 ]

ASF GitHub Bot logged work on HDFS-15963:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 12:22
Start Date: 14/Apr/21 12:22
Worklog Time Spent: 10m 
  Work Description: jojochuang commented on pull request #2889:
URL: https://github.com/apache/hadoop/pull/2889#issuecomment-819475596


   Please fix the spotbugs warning. Other than that I am +1.




Issue Time Tracking
---

Worklog Id: (was: 582430)
Time Spent: 2h 50m  (was: 2h 40m)

> Unreleased volume references cause an infinite loop
> ---
>
> Key: HDFS-15963
> URL: https://issues.apache.org/jira/browse/HDFS-15963
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-15963.001.patch, HDFS-15963.002.patch, 
> HDFS-15963.003.patch
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> When BlockSender throws an exception because the meta-data cannot be found, 
> the volume reference obtained by the thread is not released, which causes the 
> thread trying to remove the volume to wait and fall into an infinite loop.
> {code:java}
> boolean checkVolumesRemoved() {
>   Iterator<FsVolumeImpl> it = volumesBeingRemoved.iterator();
>   while (it.hasNext()) {
>     FsVolumeImpl volume = it.next();
>     if (!volume.checkClosed()) {
>       return false;
>     }
>     it.remove();
>   }
>   return true;
> }
> boolean checkClosed() {
>   // always be true.
>   if (this.reference.getReferenceCount() > 0) {
>     FsDatasetImpl.LOG.debug("The reference count for {} is {}, wait to be 0.",
>         this, reference.getReferenceCount());
>     return false;
>   }
>   return true;
> }
> {code}
> At the same time, because the thread has been holding checkDirsLock when 
> removing the volume, other threads trying to acquire the same lock will be 
> permanently blocked.
> Similar problems also occur in RamDiskAsyncLazyPersistService and 
> FsDatasetAsyncDiskService.
> This patch releases the three previously unreleased volume references.
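
The leak pattern described in the quoted issue can be sketched without any Hadoop classes. Everything below is hypothetical (the `Volume` class, `failingRead`, and the method names are invented for illustration, and this is not the attached patch): it only shows how a reference taken before an exception stays counted unless it is released on the failure path, and how try-with-resources avoids that.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; not part of Hadoop.
class VolumeReferenceSketch {

  /** Hypothetical stand-in for a volume that counts outstanding references. */
  static class Volume {
    private final AtomicInteger refCount = new AtomicInteger();

    /** Takes a reference; closing the returned handle releases it. */
    Closeable obtainReference() {
      refCount.incrementAndGet();
      return () -> refCount.decrementAndGet();
    }

    int referenceCount() {
      return refCount.get();
    }
  }

  /** Stand-in for a read that fails, e.g. because metadata cannot be found. */
  static void failingRead() throws IOException {
    throw new IOException("meta file not found");
  }

  /** Leaky variant: the exception skips the release. Returns the leftover count. */
  static int leakyRead(Volume volume) {
    try {
      Closeable ref = volume.obtainReference();
      failingRead(); // throws before the release below is reached
      ref.close();
    } catch (IOException e) {
      // the reference is never released on this path
    }
    return volume.referenceCount();
  }

  /** Safe variant: try-with-resources releases the reference on every path. */
  static int safeRead(Volume volume) {
    try (Closeable ref = volume.obtainReference()) {
      failingRead();
    } catch (IOException e) {
      // the reference has already been released here
    }
    return volume.referenceCount();
  }

  public static void main(String[] args) {
    System.out.println("leaky: " + leakyRead(new Volume()));
    System.out.println("safe:  " + safeRead(new Volume()));
  }
}
```

With a leftover count above zero, a check such as `checkClosed()` in the quoted code keeps returning false, which matches the endless wait described in the issue.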






[jira] [Work logged] (HDFS-15961) standby namenode failed to start ordered snapshot deletion is enabled while having snapshottable directories

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15961?focusedWorklogId=582408=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582408
 ]

ASF GitHub Bot logged work on HDFS-15961:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 11:44
Start Date: 14/Apr/21 11:44
Worklog Time Spent: 10m 
  Work Description: bshashikant commented on pull request #2881:
URL: https://github.com/apache/hadoop/pull/2881#issuecomment-819454476


   > > I think We should hold this off, I think the code has issues, as I said. 
Earlier I thought, there is some catch but I don’t think. It is misbehaving 
only. Ideally such features should go in a branch first
   > 
   > @ayushtkn, IMO it's not a misbehaviour here. Once the snapshotTrash 
feature is enabled, the .Trash directory has to be present for the feature 
to work. As there can be pre-existing snapshottable directories, one way to 
create the .Trash inside these was to create it right after startup is done. 
The other solution was to explicitly provision the .Trash using a command-line 
option. The choice was made to do it automatically on restart, and to fail the 
Namenode in case any issues occur.
   > 
   > The other solution is to not fail the namenode but log a warning; 
later, any Trash operation performed inside the snapshottable root 
will fail.
   > 
   > The discussion is here: [#2682 
(comment)](https://github.com/apache/hadoop/pull/2682#discussion_r570461526)
   
   Come to think of it, if providing an external command for admins to create the Trash 
directory is feasible and makes sense, I think it's OK to remove the 
NN startup logic that creates Trash directories inside snapshottable roots.
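
The trade-off discussed above (abort NameNode startup when a Trash directory cannot be provisioned, versus log a warning and let later Trash operations fail) can be sketched roughly as below. This is a minimal, self-contained illustration and not the actual Hadoop code: `TrashProvisioningSketch`, `Policy` and `provisionAll` are hypothetical names, and the provisioning step is a caller-supplied predicate standing in for the real directory creation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of the two startup policies; none of these names exist in Hadoop.
final class TrashProvisioningSketch {

  enum Policy { FAIL_STARTUP, WARN_AND_CONTINUE }

  /**
   * Tries to provision a Trash directory under each snapshottable root.
   * The provisioner returns true on success; a thrown RuntimeException counts as failure.
   * Returns the roots that could not be provisioned (always empty under FAIL_STARTUP,
   * because the first failure aborts with an exception).
   */
  static List<String> provisionAll(List<String> roots,
                                   Predicate<String> provisioner,
                                   Policy policy) {
    List<String> failed = new ArrayList<>();
    for (String root : roots) {
      boolean ok;
      try {
        ok = provisioner.test(root);
      } catch (RuntimeException e) {
        ok = false;
      }
      if (!ok) {
        if (policy == Policy.FAIL_STARTUP) {
          // Mirrors the "exit the NameNode" choice: nothing after this root is attempted.
          throw new IllegalStateException(
              "Could not provision Trash directory for " + root);
        }
        // Mirrors the "log a warning" choice: startup continues and the root is remembered.
        System.err.println("WARN: could not provision Trash directory for " + root);
        failed.add(root);
      }
    }
    return failed;
  }
}
```

Under `WARN_AND_CONTINUE` the caller gets back the roots that still lack a Trash directory, which is where later Trash operations would be expected to fail.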


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 582408)
Time Spent: 2h 20m  (was: 2h 10m)

> standby namenode failed to start ordered snapshot deletion is enabled while 
> having snapshottable directories
> 
>
> Key: HDFS-15961
> URL: https://issues.apache.org/jira/browse/HDFS-15961
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: snapshots
>Affects Versions: 3.4.0
>Reporter: Nilotpal Nandi
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> {code:java}
> 2021-04-08 12:07:25,398 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new 
> storage ID DS-515dfb62-9975-4a2d-8384-d33ac8ff9cd1 for DN 172.27.121.195:9866
> 2021-04-08 12:07:55,581 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: Could not provision Trash directory for existing snapshottable 
> directories. Exiting Namenode.
> 2021-04-08 12:07:55,596 INFO 
> org.apache.ranger.audit.provider.AuditProviderFactory: ==> 
> JVMShutdownHook.run()
> 2021-04-08 12:07:55,596 INFO 
> org.apache.ranger.audit.provider.AuditProviderFactory: JVMShutdownHook: 
> Signalling async audit cleanup to start.
> {code}






[jira] [Updated] (HDFS-15974) RBF: Unable to display the datanode UI of the router

2021-04-14 Thread zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhu updated HDFS-15974:
---
Attachment: HDFS-15358-1.patch

> RBF: Unable to display the datanode UI of the router
> 
>
> Key: HDFS-15974
> URL: https://issues.apache.org/jira/browse/HDFS-15974
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf, ui
>Affects Versions: 3.4.0
>Reporter: zhu
>Priority: Major
> Attachments: HDFS-15358-1.patch
>
>
> Clicking the Datanodes tab on the Router UI gets no response.






[jira] [Created] (HDFS-15974) RBF: Unable to display the datanode UI of the router

2021-04-14 Thread zhu (Jira)
zhu created HDFS-15974:
--

 Summary: RBF: Unable to display the datanode UI of the router
 Key: HDFS-15974
 URL: https://issues.apache.org/jira/browse/HDFS-15974
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: rbf, ui
Affects Versions: 3.4.0
Reporter: zhu


Clicking the Datanodes tab on the Router UI gets no response.






[jira] [Assigned] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-04-14 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun reassigned HDFS-15973:
--

Assignee: Jinglun

> RBF: Add permission check before doting router federation rename.
> -
>
> Key: HDFS-15973
> URL: https://issues.apache.org/jira/browse/HDFS-15973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
>
> The router federation rename lacks a permission check. This is a security 
> issue.






[jira] [Created] (HDFS-15973) RBF: Add permission check before doting router federation rename.

2021-04-14 Thread Jinglun (Jira)
Jinglun created HDFS-15973:
--

 Summary: RBF: Add permission check before doting router federation 
rename.
 Key: HDFS-15973
 URL: https://issues.apache.org/jira/browse/HDFS-15973
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Jinglun


The router federation rename lacks a permission check. This is a security 
issue.






[jira] [Comment Edited] (HDFS-15614) Initialize snapshot trash root during NameNode startup if enabled

2021-04-14 Thread Siyao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320927#comment-17320927
 ] 

Siyao Meng edited comment on HDFS-15614 at 4/14/21, 11:31 AM:
--

Thanks for bringing this up [~ayushtkn].


{quote}
And this fails, And yep there is an ambiguity.
{quote}

The reason is that 
[{{DFS#provisionSnapshotTrash}}|https://github.com/apache/hadoop/blob/c6539e3289711d29f508930bbda40302f48ddf4c/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java#L2984]
 followed EZ counterpart's 
[{{DFS#provisionEZTrash}}|https://github.com/apache/hadoop/blob/c6539e3289711d29f508930bbda40302f48ddf4c/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java#L2913]
 implementation. {{dfs.provisionSnapshotTrash}} is not automatically called 
from {{dfs.allowSnapshot}}, mirroring how encryption zones behave.

Therefore this would be the same deal for encryption zone trash root creation. 
When replacing {{dfs.allowSnapshot}} calls with {{dfs.createEncryptionZone}} in 
the first test case we should also find trash root inside encryption zone 
missing with {{dfs.provisionSnapshotTrash}} call *alone*.

I suggest posting some guidelines, and adding to the javadoc that 
allowSnapshot is better performed with the dfsadmin CLI (and the same for 
createEncryptionZone, if there aren't any already).


{quote}
How come a client side feature that important, that can make the cluster go 
down in times of critical situation like failover, Again a test to show that:
{quote}

[name 
quota|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html#Name_Quotas]
 can become an issue indeed.


I think I got your point. Maybe a better way to create the necessary Trash 
dirs is to ask an admin to run the dfsadmin command *manually* after flipping 
{{dfs.namenode.snapshot.trashroot.enabled}} to {{true}}.

Currently we already have {{dfsadmin -provisionSnapshotTrash}}, but it can only be 
run on one directory at a time. {{dfsadmin -provisionSnapshotTrash -all}} can be 
implemented to achieve this.
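
To make the proposed {{-all}} behaviour concrete, here is a minimal in-memory sketch of provisioning a trash root in every snapshottable directory that lacks one. It is an illustration only: `ProvisionAllSketch` and its method are hypothetical names, the file system is modelled as a plain set of path strings, and the `-all` flag itself is only a proposal in this comment, not an existing option.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model; none of these names are Hadoop APIs.
final class ProvisionAllSketch {

  static final String TRASH = ".Trash";

  /**
   * Creates "<root>/.Trash" for every snapshottable root that does not have one yet.
   * existingPaths stands in for the namespace and is updated in place.
   * Returns the trash paths that were newly created, in sorted root order.
   */
  static List<String> provisionAll(Set<String> snapshottableRoots,
                                   Set<String> existingPaths) {
    List<String> created = new ArrayList<>();
    for (String root : new TreeSet<>(snapshottableRoots)) {
      String trash = root.endsWith("/") ? root + TRASH : root + "/" + TRASH;
      // Set.add returns false when the path already exists, so reruns are no-ops.
      if (existingPaths.add(trash)) {
        created.add(trash);
      }
    }
    return created;
  }

  public static void main(String[] args) {
    Set<String> roots = new TreeSet<>(Arrays.asList("/data", "/logs"));
    Set<String> namespace =
        new TreeSet<>(Arrays.asList("/data", "/logs", "/logs/.Trash"));
    System.out.println("created: " + provisionAll(roots, namespace));
  }
}
```

Because existing trash roots are skipped, running the operation twice is harmless, which matters if an admin is expected to run it manually after enabling the feature.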


Cheers,
Siyao



> Initialize snapshot trash root during NameNode startup if enabled
> -
>
> Key: HDFS-15614
> URL: https://issues.apache.org/jira/browse/HDFS-15614
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> This is a follow-up to HDFS-15607.
> Goal:
> Initialize (create) snapshot trash root for all existing snapshottable 
> directories if {{dfs.namenode.snapshot.trashroot.enabled}} is set to 
> {{true}}. So admins won't have to run {{dfsadmin -provisionTrash}} manually 
> on 

[jira] [Commented] (HDFS-15614) Initialize snapshot trash root during NameNode startup if enabled

2021-04-14 Thread Siyao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320927#comment-17320927
 ] 

Siyao Meng commented on HDFS-15614:
---

Thanks for bringing this up [~ayushtkn].


{quote}
And this fails, And yep there is an ambiguity.
{quote}

The reason is that 
[{{DFS#provisionSnapshotTrash}}|https://github.com/apache/hadoop/blob/c6539e3289711d29f508930bbda40302f48ddf4c/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java#L2984]
 followed EZ counterpart's 
[{{DFS#provisionEZTrash}}|https://github.com/apache/hadoop/blob/c6539e3289711d29f508930bbda40302f48ddf4c/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java#L2913]
 implementation. {{dfs.provisionSnapshotTrash}} is not automatically called 
from {{dfs.allowSnapshot}}, mirroring how encryption zones behave.

Therefore this would be the same deal for encryption zone trash root creation. 
When replacing {{dfs.allowSnapshot}} calls with {{dfs.createEncryptionZone}} in 
the first test case we should also find trash root inside encryption zone 
missing with {{dfs.provisionSnapshotTrash}} call *alone*.

I suggest posting some guidelines, and adding to the javadoc that 
allowSnapshot is better performed with the dfsadmin CLI (and the same for 
createEncryptionZone, if there aren't any already).


{quote}
How come a client side feature that important, that can make the cluster go 
down in times of critical situation like failover, Again a test to show that:
{quote}

[name 
quota|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html#Name_Quotas]
 can become an issue indeed.


I think I got your point. Maybe a better way to create the necessary Trash 
dirs is to ask an admin to run the dfsadmin command *manually* after flipping 
{{dfs.namenode.snapshot.trashroot.enabled}} to {{true}}.

Currently we already have {{dfsadmin -provisionSnapshotTrash}}, but it can only be 
run on one directory at a time. {{dfsadmin -provisionSnapshotTrash -all}} can be 
implemented to achieve this.


Cheers,
Siyao

> Initialize snapshot trash root during NameNode startup if enabled
> -
>
> Key: HDFS-15614
> URL: https://issues.apache.org/jira/browse/HDFS-15614
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> This is a follow-up to HDFS-15607.
> Goal:
> Initialize (create) snapshot trash root for all existing snapshottable 
> directories if {{dfs.namenode.snapshot.trashroot.enabled}} is set to 
> {{true}}. So admins won't have to run {{dfsadmin -provisionTrash}} manually 
> on all those existing snapshottable directories.
> The change is expected to land in {{FSNamesystem}}.
> Discussion:
> 1. Currently in HDFS-15607, the snapshot trash root creation logic is on the 
> client side. But in order for NN to create it at startup, the logic must 
> (also) be implemented on the server side as well. -- which is also a 
> requirement by WebHDFS (HDFS-15612).
> 2. Alternatively, we can provide an extra parameter to the 
> {{-provisionTrash}} command like: {{dfsadmin -provisionTrash -all}} to 
> initialize/provision trash root on all existing snapshottable dirs.






[jira] [Updated] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters

2021-04-14 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15923:
---
Parent: HDFS-15747
Issue Type: Sub-task  (was: Bug)

> RBF:  Authentication failed when rename accross sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment encounters 
> the following two errors:
>  # Save Object to journal.
>  # Precheck tries to get the src file status.
> So we need to use a proxy UGI doAs to create the DistcpProcedure and TrashProcedure and 
> submit the job.
> In the patch I use a proxy UGI doAs for the above methods. It worked.
> But there is another strange thing that this patch does not solve:
> the Router uses its own UGI to submit the Distcp job, not the user UGI or proxy 
> UGI. This may give the distcp job excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:544)
> at 
> 

[jira] [Commented] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters

2021-04-14 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320887#comment-17320887
 ] 

Jinglun commented on HDFS-15923:


Hi [~zhengzhuobinzzb], you are right! Please continue with your work; we still 
need some test cases. 
{quote}In the current code logic, storing tasks in Journal does not use super 
users and Kerberos credentials. (Because when RPC executes Call, it uses the 
corresponding Ugi's doAs, and the Ugi does not have a Kerberos certificate.)
{quote}
 

I'll start a new Jira to resolve the permission check issue.

> RBF:  Authentication failed when rename accross sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment encounters 
> the following two errors:
>  # Save Object to journal.
>  # Precheck tries to get the src file status.
> So we need to use a proxy UGI doAs to create the DistcpProcedure and TrashProcedure and 
> submit the job.
> In the patch I use a proxy UGI doAs for the above methods. It worked.
> But there is another strange thing that this patch does not solve:
> the Router uses its own UGI to submit the Distcp job, not the user UGI or proxy 
> UGI. This may give the distcp job excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
> at 
> 

[jira] [Work logged] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15923?focusedWorklogId=582356=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582356
 ]

ASF GitHub Bot logged work on HDFS-15923:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 10:19
Start Date: 14/Apr/21 10:19
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2819:
URL: https://github.com/apache/hadoop/pull/2819#issuecomment-819407274


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   1m  0s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  36m 27s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 42s |  |  trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |   0m 32s |  |  trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 23s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 39s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 36s |  |  trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   0m 51s |  |  trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 23s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  17m  8s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 34s |  |  the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |   0m 34s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 29s |  |  the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 29s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | -0 :warning: |  checkstyle  |   0m 16s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2819/3/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) |  hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 5 new + 0 unchanged - 0 fixed = 5 total (was 0)  |
   | +1 :green_heart: |  mvnsite  |   0m 31s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 30s |  |  the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   0m 50s |  |  the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 23s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  17m  9s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  |  25m  2s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2819/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) |  hadoop-hdfs-rbf in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 28s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 108m 52s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.server.federation.router.TestRouterRpc |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2819/3/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2819 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 3f4a540e031b 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 8a2cd53a4d27271f9498f4e02e3901d17c5c1a9c |
   | Default Java | Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
   | Multi-JDK versions | 

[jira] [Commented] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters

2021-04-14 Thread zhuobin zheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320833#comment-17320833
 ] 

zhuobin zheng commented on HDFS-15923:
--

I'm really sorry, I only just noticed the recent comment.
 Having read [~LiJinglun]'s earlier comments, I agree with most of his 
views, except:
 * In the current code logic, storing tasks in Journal does not use super users 
and Kerberos credentials. (Because when RPC executes Call, it uses the 
corresponding Ugi's doAs, and the Ugi does not have a Kerberos certificate.)

Then I tried to modify the code to use the super user to store tasks in the 
journal, and to check the user's permissions before the rename. The code is almost 
finished; it still lacks unit tests (other commitments and my limited 
understanding of the HDFS code have consumed a lot of time).
 I don't mind @jinglun taking over the issue at all. But if this patch meets 
your expectations and you haven't started work yet, I can go on to complete the 
unit tests.

> RBF:  Authentication failed when rename accross sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment encounters 
> the following two errors:
>  # Save Object to journal.
>  # Precheck tries to get the src file status.
> So we need to use a proxy UGI doAs to create the DistcpProcedure and TrashProcedure and 
> submit the job.
> In the patch I use a proxy UGI doAs for the above methods. It worked.
> But there is another strange thing that this patch does not solve:
> the Router uses its own UGI to submit the Distcp job, not the user UGI or proxy 
> UGI. This may give the distcp job excessive permissions.
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> 

[jira] [Updated] (HDFS-15961) standby namenode failed to start ordered snapshot deletion is enabled while having snapshottable directories

2021-04-14 Thread Shashikant Banerjee (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shashikant Banerjee updated HDFS-15961:
---
Reporter: Nilotpal Nandi  (was: Shashikant Banerjee)

> standby namenode failed to start ordered snapshot deletion is enabled while 
> having snapshottable directories
> 
>
> Key: HDFS-15961
> URL: https://issues.apache.org/jira/browse/HDFS-15961
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: snapshots
>Affects Versions: 3.4.0
>Reporter: Nilotpal Nandi
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> {code:java}
> 2021-04-08 12:07:25,398 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new 
> storage ID DS-515dfb62-9975-4a2d-8384-d33ac8ff9cd1 for DN 172.27.121.195:9866
> 2021-04-08 12:07:55,581 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: Could not provision Trash directory for existing snapshottable 
> directories. Exiting Namenode.
> 2021-04-08 12:07:55,596 INFO 
> org.apache.ranger.audit.provider.AuditProviderFactory: ==> 
> JVMShutdownHook.run()
> 2021-04-08 12:07:55,596 INFO 
> org.apache.ranger.audit.provider.AuditProviderFactory: JVMShutdownHook: 
> Signalling async audit cleanup to start.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15614) Initialize snapshot trash root during NameNode startup if enabled

2021-04-14 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320757#comment-17320757
 ] 

Ayush Saxena commented on HDFS-15614:
-

[~shashikant], you tried through the shell; I said it creates the trash 
directory when executed through HdfsAdmin. :(

See here above:
{quote}Secondly, an ambiguity: a client did an allowSnapshot, say not from 
HdfsAdmin, so he didn't have any Trash directory in the snapshot dir,
{quote}
 But OK, let me show a case where you can see the problem through the shell as 
well. I deployed a single-node cluster on top of your PR-2881 and ran, through 
the shell:

 
{noformat}
ayushsaxena@ayushsaxena-MBP16 hadoop-3.4.0-SNAPSHOT % bin/hdfs dfs -mkdir /dir

ayushsaxena@ayushsaxena-MBP16 hadoop-3.4.0-SNAPSHOT % bin/hdfs dfs -put 
bin/yarn /dir/file1

ayushsaxena@ayushsaxena-MBP16 hadoop-3.4.0-SNAPSHOT % bin/hdfs dfsadmin 
-setQuota 1 /dir   

ayushsaxena@ayushsaxena-MBP16 hadoop-3.4.0-SNAPSHOT % bin/hdfs dfsadmin 
-allowSnapshot /dir

allowSnapshot: The NameSpace quota (directories and files) of directory /dir is 
exceeded: quota=1 file count=3
at 
org.apache.hadoop.hdfs.server.namenode.DirectoryWithQuotaFeature.verifyNamespaceQuota(DirectoryWithQuotaFeature.java:188)
at 
org.apache.hadoop.hdfs.server.namenode.DirectoryWithQuotaFeature.verifyQuota(DirectoryWithQuotaFeature.java:221)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyQuota(FSDirectory.java:1230)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.updateCount(FSDirectory.java:1061)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.addLastINode(FSDirectory.java:1378)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.unprotectedMkdir(FSDirMkdirOp.java:225)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.createSingleDirectory(FSDirMkdirOp.java:169)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.mkdirs(FSDirMkdirOp.java:77)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3483)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:1156)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:746)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:537)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1037)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:965)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1900)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2972)

ayushsaxena@ayushsaxena-MBP16 hadoop-3.4.0-SNAPSHOT % bin/hdfs dfsadmin 
-setQuota 2 /dir   

ayushsaxena@ayushsaxena-MBP16 hadoop-3.4.0-SNAPSHOT % bin/hdfs 
lsSnapshottableDir

drwxr-xr-x 0 ayushsaxena staff 0 2021-04-14 12:26 0 65536 /dir


// Namenode is working as of now, let's restart!!!
ayushsaxena@ayushsaxena-MBP16 hadoop-3.4.0-SNAPSHOT % sbin/hadoop-daemon.sh 
stop namenode
WARNING: Use of this script to stop HDFS daemons is deprecated.
WARNING: Attempting to execute replacement "hdfs --daemon stop" instead.
ayushsaxena@ayushsaxena-MBP16 hadoop-3.4.0-SNAPSHOT % sbin/hadoop-daemon.sh 
start namenode
WARNING: Use of this script to start HDFS daemons is deprecated.
WARNING: Attempting to execute replacement "hdfs --daemon start" instead.

ayushsaxena@ayushsaxena-MBP16 hadoop-3.4.0-SNAPSHOT % jps
88755 
36531 DataNode
37158 Jps
// No Namenode

ayushsaxena@ayushsaxena-MBP16 hadoop-3.4.0-SNAPSHOT % bin/hdfs dfs -ls /
2021-04-14 12:29:00,026 WARN util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
ls: Call From ayushsaxena-MBP16.local/192.168.0.194 to localhost:9000 failed on 
connection exception: java.net.ConnectException: Connection refused; For more 
details see:  http://wiki.apache.org/hadoop/ConnectionRefused

// Let's check the logs
ayushsaxena@ayushsaxena-MBP16 hadoop-3.4.0-SNAPSHOT % tail -f 
logs/hadoop-ayushsaxena-namenode-ayushsaxena-MBP16.local.log 
2021-04-14 12:28:51,113 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: initializing 
replication queues
2021-04-14 12:28:51,113 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving 
safe mode after 0 secs
2021-04-14 12:28:51,114 INFO org.apache.hadoop.hdfs.StateChange: STATE* Network 
topology has 0 racks and 0 datanodes
2021-04-14 12:28:51,114 INFO 

[jira] [Work logged] (HDFS-15961) standby namenode failed to start ordered snapshot deletion is enabled while having snapshottable directories

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15961?focusedWorklogId=582243=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582243
 ]

ASF GitHub Bot logged work on HDFS-15961:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 06:39
Start Date: 14/Apr/21 06:39
Worklog Time Spent: 10m 
  Work Description: bshashikant edited a comment on pull request #2881:
URL: https://github.com/apache/hadoop/pull/2881#issuecomment-819258964


   > I think we should hold this off; I think the code has issues, as I said. 
Earlier I thought there was some catch, but I don’t think so. It is simply 
misbehaving. Ideally such features should go in a branch first
   
   @ayushtkn, IMO it's not a misbehaviour here. Once the snapshotTrash feature 
is enabled, the .Trash directory has to be present for the feature to work. As 
there can be pre-existing snapshottable directories, one way to create the 
.Trash inside these was to create it right after startup is done. The other 
solution was to explicitly provision the .Trash using a command-line option. 
The choice was made to do it automatically on restart, and fail the Namenode in 
case any issues occur.
   
   The other solution is to not fail the namenode, but log a warning; later, 
any Trash operation performed inside the snapshottable root will fail.
   
   The discussion is here: 
https://github.com/apache/hadoop/pull/2682#discussion_r570461526
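
The two options described above (fail the Namenode on a provisioning error, or log a warning and carry on) can be sketched as below. This is only an illustrative model, not the code in PR #2881: the class, the `TrashProvisioner` interface, the `FailurePolicy` enum and `provisionAll` are all hypothetical names, and the quota failure for `/dir` is simulated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the two startup policies; not Hadoop code.
class TrashProvisioningSketch {

  // Stand-in for whatever creates .Trash under one snapshottable root.
  interface TrashProvisioner {
    void provision(String snapshottableRoot) throws Exception;
  }

  enum FailurePolicy { FAIL_STARTUP, WARN_AND_CONTINUE }

  // Tries every root and returns the ones that could not be provisioned.
  // Under FAIL_STARTUP the first failure is rethrown, which models the
  // "exit the Namenode" choice; otherwise a warning is logged instead.
  static List<String> provisionAll(List<String> roots, TrashProvisioner provisioner,
      FailurePolicy policy) {
    List<String> failed = new ArrayList<>();
    for (String root : roots) {
      try {
        provisioner.provision(root);
      } catch (Exception e) {
        if (policy == FailurePolicy.FAIL_STARTUP) {
          throw new IllegalStateException(
              "Could not provision Trash directory for " + root, e);
        }
        System.err.println("WARN: no trash root for " + root + ": " + e.getMessage());
        failed.add(root);
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    // Simulated provisioner: /dir fails as if its namespace quota were exceeded.
    TrashProvisioner quotaLimited = root -> {
      if (root.equals("/dir")) {
        throw new Exception("namespace quota exceeded");
      }
    };
    List<String> roots = Arrays.asList("/dir", "/data");

    System.out.println("unprovisioned: "
        + provisionAll(roots, quotaLimited, FailurePolicy.WARN_AND_CONTINUE));
    try {
      provisionAll(roots, quotaLimited, FailurePolicy.FAIL_STARTUP);
    } catch (IllegalStateException e) {
      System.out.println("startup aborted: " + e.getMessage());
    }
  }
}
```

With the warn-and-continue policy the caller gets back the list of roots that still lack a trash root, which is where later Trash operations would be expected to fail.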


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 582243)
Time Spent: 2h 10m  (was: 2h)

> standby namenode failed to start ordered snapshot deletion is enabled while 
> having snapshottable directories
> 
>
> Key: HDFS-15961
> URL: https://issues.apache.org/jira/browse/HDFS-15961
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: snapshots
>Affects Versions: 3.4.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> {code:java}
> 2021-04-08 12:07:25,398 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new 
> storage ID DS-515dfb62-9975-4a2d-8384-d33ac8ff9cd1 for DN 172.27.121.195:9866
> 2021-04-08 12:07:55,581 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: Could not provision Trash directory for existing snapshottable 
> directories. Exiting Namenode.
> 2021-04-08 12:07:55,596 INFO 
> org.apache.ranger.audit.provider.AuditProviderFactory: ==> 
> JVMShutdownHook.run()
> 2021-04-08 12:07:55,596 INFO 
> org.apache.ranger.audit.provider.AuditProviderFactory: JVMShutdownHook: 
> Signalling async audit cleanup to start.
> {code}






[jira] [Work logged] (HDFS-15961) standby namenode failed to start ordered snapshot deletion is enabled while having snapshottable directories

2021-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15961?focusedWorklogId=582234=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-582234
 ]

ASF GitHub Bot logged work on HDFS-15961:
-

Author: ASF GitHub Bot
Created on: 14/Apr/21 06:17
Start Date: 14/Apr/21 06:17
Worklog Time Spent: 10m 
  Work Description: bshashikant commented on pull request #2881:
URL: https://github.com/apache/hadoop/pull/2881#issuecomment-819258964


   > I think we should hold this off; I think the code has issues, as I said. 
Earlier I thought there was some catch, but I don’t think so. It is simply 
misbehaving. Ideally such features should go in a branch first
   
   @ayushtkn, IMO it's not a misbehaviour here. Once the snapshotTrash feature 
is enabled, the .Trash directory has to be present for the feature to work. As 
there can be pre-existing snapshottable directories, one way to create the 
.Trash inside these was to create it right after startup is done. The other 
solution was to explicitly provision the .Trash using a command-line option. 
The choice was made to do it automatically on restart, and fail the Namenode in 
case any issues occur.
   
   The other solution is to not fail the namenode, but log a warning; later, 
any Trash operation performed inside the snapshottable root will fail.
   
   The discussion is here: 
https://github.com/apache/hadoop/pull/2682#discussion_r570461526




Issue Time Tracking
---

Worklog Id: (was: 582234)
Time Spent: 2h  (was: 1h 50m)

> standby namenode failed to start ordered snapshot deletion is enabled while 
> having snapshottable directories
> 
>
> Key: HDFS-15961
> URL: https://issues.apache.org/jira/browse/HDFS-15961
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: snapshots
>Affects Versions: 3.4.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> {code:java}
> 2021-04-08 12:07:25,398 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new 
> storage ID DS-515dfb62-9975-4a2d-8384-d33ac8ff9cd1 for DN 172.27.121.195:9866
> 2021-04-08 12:07:55,581 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: Could not provision Trash directory for existing snapshottable 
> directories. Exiting Namenode.
> 2021-04-08 12:07:55,596 INFO 
> org.apache.ranger.audit.provider.AuditProviderFactory: ==> 
> JVMShutdownHook.run()
> 2021-04-08 12:07:55,596 INFO 
> org.apache.ranger.audit.provider.AuditProviderFactory: JVMShutdownHook: 
> Signalling async audit cleanup to start.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15614) Initialize snapshot trash root during NameNode startup if enabled

2021-04-14 Thread Shashikant Banerjee (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320719#comment-17320719
 ] 

Shashikant Banerjee commented on HDFS-15614:


[~ayushtkn], I just tried making a snapshottable directory, and it seems the 
.Trash is implicitly created once the config 
"dfs.namenode.snapshot.trashroot.enabled" is set to true.
{code:java}
hdfs dfsadmin -fs hdfs://127.0.0.1: -allowsnapshot /

hdfs dfs -ls hdfs://127.0.0.1:/ 
Found 2 items
drwxrwxrwt - shashikant supergroup 0 2021-04-12 11:20 
hdfs://127.0.0.1:/.Trash 
drwxr-xr-x - shashikant supergroup 0 2021-04-12 11:19 
hdfs://127.0.0.1:/dir1{code}
[~smeng], can you please confirm?
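
For comparison, the quota case from the shell transcript posted on this issue can be modelled with a toy in-memory namespace. This is a rough sketch, not Hadoop code: `QuotaTrashSketch`, `Dir` and their members are hypothetical, and it assumes (as the transcript suggests) that the directory itself counts toward its namespace quota, that setting a quota does not validate current usage, and that allowSnapshot marks the directory snapshottable before the separate mkdir of .Trash.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy in-memory model of the quota scenario; hypothetical names, not Hadoop code.
class QuotaTrashSketch {

  static class Dir {
    final Set<String> children = new LinkedHashSet<>();
    // Assumption: the directory itself counts toward its namespace quota.
    long nsQuota = Long.MAX_VALUE;
    boolean snapshottable;

    long inodeCount() {
      return 1 + children.size();
    }

    // Adding an inode is checked against the quota; setting the quota is not.
    void add(String name) {
      long after = inodeCount() + 1;
      if (after > nsQuota) {
        throw new IllegalStateException("quota=" + nsQuota + " file count=" + after);
      }
      children.add(name);
    }
  }

  // Assumption: two separate steps, so a failed mkdir leaves the flag set.
  static void allowSnapshot(Dir d) {
    d.snapshottable = true;
    d.add(".Trash");
  }

  public static void main(String[] args) {
    Dir dir = new Dir();
    dir.add("file1");   // -put bin/yarn /dir/file1
    dir.nsQuota = 1;    // -setQuota 1 /dir
    try {
      allowSnapshot(dir);
    } catch (IllegalStateException e) {
      System.out.println("allowSnapshot: " + e.getMessage());
    }
    System.out.println("snapshottable=" + dir.snapshottable
        + " hasTrash=" + dir.children.contains(".Trash"));

    dir.nsQuota = 2;    // -setQuota 2 /dir
    try {
      dir.add(".Trash"); // what provisioning at restart would attempt
    } catch (IllegalStateException e) {
      System.out.println("startup provisioning: " + e.getMessage());
    }
  }
}
```

In this model the directory ends up snapshottable without a .Trash, and raising the quota to 2 is still not enough for a later provisioning attempt, since the directory, file1 and .Trash need three inodes.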

> Initialize snapshot trash root during NameNode startup if enabled
> -
>
> Key: HDFS-15614
> URL: https://issues.apache.org/jira/browse/HDFS-15614
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> This is a follow-up to HDFS-15607.
> Goal:
> Initialize (create) snapshot trash root for all existing snapshottable 
> directories if {{dfs.namenode.snapshot.trashroot.enabled}} is set to 
> {{true}}. So admins won't have to run {{dfsadmin -provisionTrash}} manually 
> on all those existing snapshottable directories.
> The change is expected to land in {{FSNamesystem}}.
> Discussion:
> 1. Currently in HDFS-15607, the snapshot trash root creation logic is on the 
> client side. But in order for NN to create it at startup, the logic must 
> also be implemented on the server side, which is also a requirement of 
> WebHDFS (HDFS-15612).
> 2. Alternatively, we can provide an extra parameter to the 
> {{-provisionTrash}} command like: {{dfsadmin -provisionTrash -all}} to 
> initialize/provision trash root on all existing snapshottable dirs.


