[jira] [Resolved] (HDFS-17177) ErasureCodingWork reconstruct ignore the block length is Long.MAX_VALUE.

2023-09-10 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-17177.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

> ErasureCodingWork reconstruct ignore the block length is Long.MAX_VALUE.
> 
>
> Key: HDFS-17177
> URL: https://issues.apache.org/jira/browse/HDFS-17177
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> As noted in https://issues.apache.org/jira/browse/HDFS-14720, 
> ErasureCodingWork reconstruction may also need to ignore blocks whose 
> length is Long.MAX_VALUE.






[jira] [Created] (HDFS-16587) Allow configuring Handler number for the JournalNodeRpcServer

2022-05-21 Thread ZanderXu (Jira)
ZanderXu created HDFS-16587:
---

 Summary: Allow configuring Handler number for the 
JournalNodeRpcServer
 Key: HDFS-16587
 URL: https://issues.apache.org/jira/browse/HDFS-16587
 Project: Hadoop HDFS
  Issue Type: Wish
Reporter: ZanderXu
Assignee: ZanderXu


Allow the number of RPC handler threads of the JournalNodeRpcServer to be configured instead of being hardcoded.
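
A minimal sketch of the idea, assuming a new key dfs.journalnode.handler.count (the key name and default are assumptions, not a committed API):
{code:java}
// Sketch: read the handler count from configuration when building the
// JournalNode RPC server, instead of hardcoding it.
int handlerCount = conf.getInt("dfs.journalnode.handler.count", 5);
this.server = new RPC.Builder(conf)
    .setProtocol(QJournalProtocolPB.class)
    .setInstance(service)
    .setBindAddress(addr.getHostName())
    .setPort(addr.getPort())
    .setNumHandlers(handlerCount)
    .setVerbose(false)
    .build();
{code}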






[jira] [Created] (HDFS-16596) Improve the processing capability of FsDatasetAsyncDiskService

2022-05-26 Thread ZanderXu (Jira)
ZanderXu created HDFS-16596:
---

 Summary: Improve the processing capability of 
FsDatasetAsyncDiskService
 Key: HDFS-16596
 URL: https://issues.apache.org/jira/browse/HDFS-16596
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


In our production environment, when a DN needs to delete a large number of 
blocks, we find that many deletion tasks back up in the queue of the 
ThreadPoolExecutor in FsDatasetAsyncDiskService. We can't improve its 
throughput because the number of core threads is hardcoded.

So the DN should allow the number of core threads of FsDatasetAsyncDiskService 
to be configured.
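
A minimal sketch of the idea (the configuration key is illustrative, and the constants are assumed to come from FsDatasetAsyncDiskService):
{code:java}
// Sketch: size the per-volume executor from configuration instead of the
// hardcoded CORE_THREADS_PER_VOLUME. The key name and default below are
// assumptions, not an existing setting.
int coreThreads = conf.getInt(
    "dfs.datanode.async.disk.service.core.threads", CORE_THREADS_PER_VOLUME);
ThreadPoolExecutor executor = new ThreadPoolExecutor(
    coreThreads, MAXIMUM_THREADS_PER_VOLUME,
    THREADS_KEEP_ALIVE_SECONDS, TimeUnit.SECONDS,
    new LinkedBlockingQueue<Runnable>(), threadFactory);
{code}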






[jira] [Created] (HDFS-16593) Correct inaccurate BlocksRemoved metric on DataNode side

2022-05-24 Thread ZanderXu (Jira)
ZanderXu created HDFS-16593:
---

 Summary: Correct inaccurate BlocksRemoved metric on DataNode side
 Key: HDFS-16593
 URL: https://issues.apache.org/jira/browse/HDFS-16593
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


While tracing the root cause of a production issue, I found that the 
BlocksRemoved metric on the DataNode side was inaccurate.

{code:java}
case DatanodeProtocol.DNA_INVALIDATE:
  //
  // Some local block(s) are obsolete and can be 
  // safely garbage-collected.
  //
  Block toDelete[] = bcmd.getBlocks();
  try {
// using global fsdataset
dn.getFSDataset().invalidate(bcmd.getBlockPoolId(), toDelete);
  } catch(IOException e) {
// Exceptions caught here are not expected to be disk-related.
throw e;
  }
  dn.metrics.incrBlocksRemoved(toDelete.length);
  break;
{code}

Even if the invalidate method throws an exception, some blocks may already 
have been deleted internally; since the exception is rethrown, 
incrBlocksRemoved is skipped entirely and those deletions go uncounted.
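
A minimal sketch of the idea (countSuccessfullyDeleted() is a hypothetical helper; the committed patch may count deletions differently, e.g. inside invalidate() itself):
{code:java}
// Sketch: count only the blocks that were actually deleted, even when
// invalidate() fails part-way through.
Block[] toDelete = bcmd.getBlocks();
int deleted = toDelete.length;
try {
  dn.getFSDataset().invalidate(bcmd.getBlockPoolId(), toDelete);
} catch (IOException e) {
  // Some blocks may still have been removed before the failure.
  deleted = countSuccessfullyDeleted(toDelete); // hypothetical helper
  throw e;
} finally {
  dn.metrics.incrBlocksRemoved(deleted);
}
{code}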







[jira] [Created] (HDFS-16601) Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try

2022-05-27 Thread ZanderXu (Jira)
ZanderXu created HDFS-16601:
---

 Summary: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try
 Key: HDFS-16601
 URL: https://issues.apache.org/jira/browse/HDFS-16601
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


In our production environment, we found a bug, with a stack like:

{code:java}
java.io.IOException: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try. (Nodes: 
current=[DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK],
 
DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK]],
 
original=[DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK],
 
DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK]]).
 The current failed datanode replacement policy is DEFAULT, and a client may 
configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' 
in its configuration.
at 
org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1418)
at 
org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1478)
at 
org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1704)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1605)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
{code}

The root cause is that the DFSClient cannot perceive exceptions from 
transferBlock during pipeline recovery. If the transfer fails, the DFSClient 
will retry all datanodes in the cluster and eventually fail.







[jira] [Created] (HDFS-16598) All datanodes [DatanodeInfoWithStorage[127.0.0.1:57448,DS-1b5f7e33-a2bf-4edc-9122-a74c995a99f5,DISK]] are bad. Aborting...

2022-05-27 Thread ZanderXu (Jira)
ZanderXu created HDFS-16598:
---

 Summary: All datanodes 
[DatanodeInfoWithStorage[127.0.0.1:57448,DS-1b5f7e33-a2bf-4edc-9122-a74c995a99f5,DISK]]
 are bad. Aborting...
 Key: HDFS-16598
 URL: https://issues.apache.org/jira/browse/HDFS-16598
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


org.apache.hadoop.hdfs.testPipelineRecoveryOnRestartFailure failed with a 
stack like:
{code:java}
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[127.0.0.1:57448,DS-1b5f7e33-a2bf-4edc-9122-a74c995a99f5,DISK]]
 are bad. Aborting...
at 
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1667)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1601)
at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
{code}

After tracing the root cause, we found this bug was introduced by 
[HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534]: the block GS on 
the client side may be smaller than on the DN when pipeline recovery fails.







[jira] [Created] (HDFS-16600) Deadlock on DataNode

2022-05-27 Thread ZanderXu (Jira)
ZanderXu created HDFS-16600:
---

 Summary: Deadlock on DataNode
 Key: HDFS-16600
 URL: https://issues.apache.org/jira/browse/HDFS-16600
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


The UT 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.testSynchronousEviction 
failed because of a deadlock introduced by 
[HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534]. 
The deadlock:
{code:java}
// org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.createRbw line 1588 
need a read lock
try (AutoCloseableLock lock = lockManager.readLock(LockLevel.BLOCK_POOl,
b.getBlockPoolId()))

// org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.evictBlocks line 3526 
need a write lock
try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, bpid))
{code}








[jira] [Created] (HDFS-16648) Normalize the usage of debug logs in NameNode

2022-07-01 Thread ZanderXu (Jira)
ZanderXu created HDFS-16648:
---

 Summary: Normalize the usage of debug logs in NameNode
 Key: HDFS-16648
 URL: https://issues.apache.org/jira/browse/HDFS-16648
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


There are many irregular debug logs in the NameNode, such as:
Error type 1:
{code:java}
if (LOG.isDebugEnabled()) {
  LOG.debug("Getting groups for user " + user);
}
{code}
We can reformat it as:
{code:java}
LOG.debug("Getting groups for user {}. ", user);
{code}

Error type 2:
{code:java}
LOG.debug("*DIR* NameNode.renameSnapshot: Snapshot Path {}, " +
"snapshotOldName {}, snapshotNewName {}", snapshotRoot,
snapshotOldName, snapshotNewName);
{code}
We can reformat it as:
{code:java}
if (LOG.isDebugEnabled()) {
  LOG.debug("*DIR* NameNode.renameSnapshot: Snapshot Path {}, " +
"snapshotOldName {}, snapshotNewName {}", snapshotRoot,
snapshotOldName, snapshotNewName); 
}
{code}

Error type 3:
{code:java}
if (LOG.isDebugEnabled()) {
  LOG.debug("getAdditionalDatanode: src=" + src
  + ", fileId=" + fileId
  + ", blk=" + blk
  + ", existings=" + Arrays.asList(existings)
  + ", excludes=" + Arrays.asList(excludes)
  + ", numAdditionalNodes=" + numAdditionalNodes
  + ", clientName=" + clientName);
}
{code}
We can reformat it as:
{code:java}
 if (LOG.isDebugEnabled()) {
   LOG.debug("getAdditionalDatanode: src={}, fileId={}, "
   + "blk={}, existings={}, excludes={}, numAdditionalNodes={}, "
  + "clientName={}.", src, fileId, blk, Arrays.asList(existings),
  Arrays.asList(excludes), numAdditionalNodes, clientName);
 }
{code}







[jira] [Created] (HDFS-16645) Multi inProgress segments caused "Invalid log manifest"

2022-06-30 Thread ZanderXu (Jira)
ZanderXu created HDFS-16645:
---

 Summary: Multi inProgress segments caused "Invalid log manifest"
 Key: HDFS-16645
 URL: https://issues.apache.org/jira/browse/HDFS-16645
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


{code:java}
java.lang.IllegalStateException: Invalid log manifest (log [1-? (in-progress)] 
overlaps [6-? (in-progress)])[[6-? (in-progress)], [1-? (in-progress)]] 
CommittedTxId: 0 
at 
org.apache.hadoop.hdfs.server.protocol.RemoteEditLogManifest.checkState(RemoteEditLogManifest.java:62)
at 
org.apache.hadoop.hdfs.server.protocol.RemoteEditLogManifest.<init>(RemoteEditLogManifest.java:46)
at 
org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:740)
{code}







[jira] [Created] (HDFS-16646) [RBF] Improved isolation for downstream name nodes. {Elastic}

2022-06-30 Thread ZanderXu (Jira)
ZanderXu created HDFS-16646:
---

 Summary: [RBF] Improved isolation for downstream name nodes. 
{Elastic}
 Key: HDFS-16646
 URL: https://issues.apache.org/jira/browse/HDFS-16646
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu









[jira] [Created] (HDFS-16641) [HDFS] Add RPC ReQueue Metrics

2022-06-24 Thread ZanderXu (Jira)
ZanderXu created HDFS-16641:
---

 Summary: [HDFS] Add RPC ReQueue Metrics
 Key: HDFS-16641
 URL: https://issues.apache.org/jira/browse/HDFS-16641
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


Add RPC ReQueue Metrics to easily locate the abnormal case where the 
ObserverNameNode has a lower RPCProcessingTime but a higher RPCQueueTime.






[jira] [Created] (HDFS-16642) [HDFS] Moving selecting inputstream from JN in EditlogTailer out of FSNLock

2022-06-24 Thread ZanderXu (Jira)
ZanderXu created HDFS-16642:
---

 Summary: [HDFS] Moving selecting inputstream from JN in 
EditlogTailer out of FSNLock
 Key: HDFS-16642
 URL: https://issues.apache.org/jira/browse/HDFS-16642
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


The EditlogTailer spends a long time selecting InputStreams from the 
JournalNodes while holding the write lock of the FSNLock. During this period, 
the 8020 handlers of the Observer NameNode are blocked on the FSN lock.

In theory, selecting inputstreams from the JournalNodes does not change any 
in-memory state of the NameNode, so we can move the selection out of the FSN 
lock, which should improve the throughput of the Observer NameNode.






[jira] [Created] (HDFS-16623) IllegalArgumentException in LifelineSender

2022-06-06 Thread ZanderXu (Jira)
ZanderXu created HDFS-16623:
---

 Summary: IllegalArgumentException in LifelineSender
 Key: HDFS-16623
 URL: https://issues.apache.org/jira/browse/HDFS-16623
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


In our production environment, an IllegalArgumentException occurred in the 
LifelineSender on one DataNode, because that DataNode was undergoing GC at the 
time.






[jira] [Created] (HDFS-16622) addRDBI in IncrementalBlockReportManager may remove the block with bigger GS.

2022-06-05 Thread ZanderXu (Jira)
ZanderXu created HDFS-16622:
---

 Summary: addRDBI in IncrementalBlockReportManager may remove the 
block with bigger GS.
 Key: HDFS-16622
 URL: https://issues.apache.org/jira/browse/HDFS-16622
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


In our production environment there is a strange missing block; according to 
the logs, I suspect there is a bug in the function 
addRDBI(ReceivedDeletedBlockInfo rdbi, DatanodeStorage storage) (line 250).

The buggy code is in the for loop:
{code:java}
synchronized void addRDBI(ReceivedDeletedBlockInfo rdbi,
  DatanodeStorage storage) {
// Make sure another entry for the same block is first removed.
// There may only be one such entry.
for (PerStorageIBR perStorage : pendingIBRs.values()) {
  if (perStorage.remove(rdbi.getBlock()) != null) {
break;
  }
}
getPerStorageIBR(storage).put(rdbi);
  }
{code}

The GS of the block removed from the pending IBRs may be greater than the GS 
of the block in the incoming rdbi. And the NN will invalidate the replica with 
the smaller GS when it completes a block. 
So if there is only one replica for a block, there is a possibility of a 
missing block because of this wrong logic. A guard is sketched below.
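
A minimal sketch of one possible guard, reusing the PerStorageIBR API shown above (an illustration, not the committed patch):
{code:java}
synchronized void addRDBI(ReceivedDeletedBlockInfo rdbi,
    DatanodeStorage storage) {
  // Only drop the pending entry when it is not newer than the incoming
  // report; otherwise keep the entry with the bigger GS.
  for (PerStorageIBR perStorage : pendingIBRs.values()) {
    ReceivedDeletedBlockInfo removed = perStorage.remove(rdbi.getBlock());
    if (removed != null) {
      if (removed.getBlock().getGenerationStamp()
          > rdbi.getBlock().getGenerationStamp()) {
        // The pending entry is newer than the report; restore it.
        perStorage.put(removed);
        return;
      }
      break;
    }
  }
  getPerStorageIBR(storage).put(rdbi);
}
{code}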







[jira] [Created] (HDFS-16670) Improve Code With Lambda in EditLogTailer class

2022-07-20 Thread ZanderXu (Jira)
ZanderXu created HDFS-16670:
---

 Summary: Improve Code With Lambda in EditLogTailer class
 Key: HDFS-16670
 URL: https://issues.apache.org/jira/browse/HDFS-16670
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


Improve Code With Lambda in EditLogTailer class






[jira] [Created] (HDFS-16671) RBF: RouterRpcFairnessPolicyController supports configurable permit acquire timeout

2022-07-20 Thread ZanderXu (Jira)
ZanderXu created HDFS-16671:
---

 Summary: RBF: RouterRpcFairnessPolicyController supports 
configurable permit acquire timeout
 Key: HDFS-16671
 URL: https://issues.apache.org/jira/browse/HDFS-16671
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


RouterRpcFairnessPolicyController should support a configurable permit-acquire 
timeout. The hardcoded 1s is very long, and it has caused an incident in our 
prod environment when one nameservice was busy.

The optimal timeout should probably be less than the p50 RPC time (avgTime).

When this happens, all handlers in RBF are waiting to acquire the permit of 
the busy ns:

{code:java}
"IPC Server handler 12 on default port " #2370 daemon prio=5 os_prio=0 
tid=? nid=?  waiting on condition [?]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for   (a 
java.util.concurrent.Semaphore$NonfairSync)
at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409)
at 
org.apache.hadoop.hdfs.server.federation.fairness.AbstractRouterRpcFairnessPolicyController.acquirePermit(AbstractRouterRpcFairnessPolicyController.java:56)
at 
org.apache.hadoop.hdfs.server.federation.fairness.DynamicRouterRpcFairnessPolicyController.acquirePermit(DynamicRouterRpcFairnessPolicyController.java:123)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.acquirePermit(RouterRpcClient.java:1500)
{code}







[jira] [Created] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer

2022-07-25 Thread ZanderXu (Jira)
ZanderXu created HDFS-16689:
---

 Summary: Standby NameNode crashes when transitioning to Active 
with in-progress tailer
 Key: HDFS-16689
 URL: https://issues.apache.org/jira/browse/HDFS-16689
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


Standby NameNode crashes when transitioning to Active with an in-progress 
tailer. The error message looks like below:


{code:java}
Caused by: java.lang.IllegalStateException: Cannot start writing at txid X when 
there is a stream available for read: ByteStringEditLog[X, Y], 
ByteStringEditLog[X, 0]
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:344)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.openForWrite(FSEditLogAsync.java:113)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1423)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:2132)
... 36 more
{code}

After tracing, I found a critical bug in 
*EditlogTailer#catchupDuringFailover()* when *DFS_HA_TAILEDITS_INPROGRESS_KEY* 
is true: *catchupDuringFailover()* tries to replay all missed edits from the 
JournalNodes with *onlyDurableTxns=true*, and it may be unable to replay any 
edits when some JournalNodes are abnormal.

Reproduction steps, suppose:
- There are 2 namenodes, NN0 and NN1, whose states are Active and Standby 
respectively. And there are 3 JournalNodes, JN0, JN1 and JN2.
- NN0 tries to sync 3 edits with start txid 3 to the JNs, but only 
successfully syncs them to JN1 and JN2. JN0 is abnormal, e.g. GC, bad network 
or restarted.
- NN1's lastAppliedTxId is 2, and at this moment, we try to fail the active 
over from NN0 to NN1.
- NN1 only gets two responses, from JN0 and JN1, when it tries to select 
inputStreams with *fromTxnId=3* and *onlyDurableTxns=true*, and the txid 
counts of the responses are 0 and 3 respectively. JN2 is abnormal, e.g. GC, 
bad network or restarted.
- NN1 cannot replay any edits with *fromTxnId=3* from the JournalNodes because 
the *maxAllowedTxns* is 0.


So I think the Standby NameNode should run *catchupDuringFailover()* with 
*onlyDurableTxns=false*, so that it can replay all missed edits from the 
JournalNodes.
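
A minimal sketch of the direction (the call site and the signature of selectInputStreams here are assumptions about EditLogTailer's internals, not the committed patch):
{code:java}
// Sketch: during failover catch-up, do not require quorum-durable txns,
// so edits that were written to a majority of JNs but cannot currently be
// confirmed by a quorum (one JN slow or down) can still be replayed.
Collection<EditLogInputStream> streams = editLog.selectInputStreams(
    lastTxnId + 1, 0, null, inProgressOk, false /* onlyDurableTxns */);
{code}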






[jira] [Created] (HDFS-16692) Add detailed scope info in NotEnoughReplicas Reason logs.

2022-07-26 Thread ZanderXu (Jira)
ZanderXu created HDFS-16692:
---

 Summary: Add detailed scope info in NotEnoughReplicas Reason logs.
 Key: HDFS-16692
 URL: https://issues.apache.org/jira/browse/HDFS-16692
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


When writing some EC data from a client outside the HDFS cluster, there is a 
large amount of INFO log output, as below:
{code:shell}
2022-07-26 15:50:40,973 INFO  blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(912)) - Not enough replicas was 
chosen. Reason: {NO_REQUIRED_STORAGE_TYPE=1, TOO_MANY_NODES_ON_RACK=17}
2022-07-26 15:50:40,974 INFO  blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(912)) - Not enough replicas was 
chosen. Reason: {NO_REQUIRED_STORAGE_TYPE=1, TOO_MANY_NODES_ON_RACK=18}
2022-07-26 15:50:40,974 INFO  blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(912)) - Not enough replicas was 
chosen. Reason: {NO_REQUIRED_STORAGE_TYPE=1, TOO_MANY_NODES_ON_RACK=17}
2022-07-26 15:50:40,975 INFO  blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(912)) - Not enough replicas was 
chosen. Reason: {NO_REQUIRED_STORAGE_TYPE=1, TOO_MANY_NODES_ON_RACK=18}
2022-07-26 15:50:40,975 INFO  blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(912)) - Not enough replicas was 
chosen. Reason: {NO_REQUIRED_STORAGE_TYPE=1, TOO_MANY_NODES_ON_RACK=17}
2022-07-26 15:50:40,976 INFO  blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(912)) - Not enough replicas was 
chosen. Reason: {NO_REQUIRED_STORAGE_TYPE=1, TOO_MANY_NODES_ON_RACK=18}
2022-07-26 15:50:40,976 INFO  blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(912)) - Not enough replicas was 
chosen. Reason: {NO_REQUIRED_STORAGE_TYPE=1, TOO_MANY_NODES_ON_RACK=18}
2022-07-26 15:50:40,977 INFO  blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(912)) - Not enough replicas was 
chosen. Reason: {NO_REQUIRED_STORAGE_TYPE=1, TOO_MANY_NODES_ON_RACK=18}
2022-07-26 15:50:40,977 INFO  blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(912)) - Not enough replicas was 
chosen. Reason: {NO_REQUIRED_STORAGE_TYPE=1, TOO_MANY_NODES_ON_RACK=19}
2022-07-26 15:50:40,977 INFO  blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(912)) - Not enough replicas was 
chosen. Reason: {NO_REQUIRED_STORAGE_TYPE=1, TOO_MANY_NODES_ON_RACK=3}
{code}

I feel that we should add detailed scope info to this log to show the scope 
within which we could not select any good node, as sketched below.
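
A minimal sketch, assuming the scope string and reason map available inside BlockPlacementPolicyDefault#chooseRandom (the variable names are assumptions):
{code:java}
// Sketch: include the scope being searched in the message, so the failing
// scope can be identified directly from the log line.
LOG.info("Not enough replicas was chosen within scope {}. Reason: {}",
    scope, reasonMap);
{code}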







[jira] [Created] (HDFS-16658) BlockManager should output some logs when logEveryBlock is true.

2022-07-13 Thread ZanderXu (Jira)
ZanderXu created HDFS-16658:
---

 Summary: BlockManager should output some logs when logEveryBlock 
is true.
 Key: HDFS-16658
 URL: https://issues.apache.org/jira/browse/HDFS-16658
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


While locating some abnormal cases in our prod environment, I found that 
BlockManager does not output any log in `addStoredBlock` even though 
`logEveryBlock` is true.

I feel that we need to change the log level from DEBUG to INFO.
{code:java}
private Block addStoredBlock(final BlockInfo block,
    final Block reportedBlock,
    DatanodeStorageInfo storageInfo,
    DatanodeDescriptor delNodeHint,
    boolean logEveryBlock)
    throws IOException {

  if (logEveryBlock) {
    blockLog.debug("BLOCK* addStoredBlock: {} is added to {} (size={})",
        node, storedBlock, storedBlock.getNumBytes());
  }
  ...
}
{code}






[jira] [Created] (HDFS-16661) Improve Code With Lambda in AsyncLoggerSet class

2022-07-14 Thread ZanderXu (Jira)
ZanderXu created HDFS-16661:
---

 Summary: Improve Code With Lambda in AsyncLoggerSet class
 Key: HDFS-16661
 URL: https://issues.apache.org/jira/browse/HDFS-16661
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


Improve Code With Lambda in AsyncLoggerSet class






[jira] [Created] (HDFS-16659) JournalNode should throw CacheMissException when SinceTxId is more than HighestWrittenTxId

2022-07-13 Thread ZanderXu (Jira)
ZanderXu created HDFS-16659:
---

 Summary: JournalNode should throw CacheMissException when 
SinceTxId is more than HighestWrittenTxId
 Key: HDFS-16659
 URL: https://issues.apache.org/jira/browse/HDFS-16659
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


JournalNode should throw `CacheMissException` if `sinceTxId` is bigger than 
`highestWrittenTxId`. Otherwise the EditlogTailer may not be able to tail 
edits, and the ObserverNameNode may then be unable to handle requests from 
clients.

Suppose there are 3 JournalNodes, JN0 ~ JN2.
The corner case is as below:
* JN0 has some abnormal cases while the Active NameNode is journaling edits 
with start txId 11
* The NameNode just ignores the abnormal JN0 and continues to write edits to 
JN1 and JN2
* JN0 comes back to health
* The Observer NameNode tries to select an EditLogInputStream via RPC with 
start txId 21
* JN1 has some abnormal cases causing slow RPC responses

The expected selection result is: the response should contain 20 edits, from 
txId 21 to txId 40, from JN1 and JN2. The Active NameNode successfully wrote 
these edits to JN1 and JN2 and failed to write them to JN0, so there are no 
edits from txId 21 to 40 in the cache of JN0.

But in the current implementation, there are no edits in the response, because 
the NameNode successfully got a response from JN0 that did not contain any 
edits. The buggy code is as below, with a sketch of the proposed behavior 
after it:

{code:java}
if (sinceTxId > getHighestWrittenTxId()) {
// Requested edits that don't exist yet; short-circuit the cache here
metrics.rpcEmptyResponses.incr();
return GetJournaledEditsResponseProto.newBuilder().setTxnCount(0).build(); 
}
{code}
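
A minimal sketch of the proposed behavior (the exact exception type and message are assumptions, not the committed patch):
{code:java}
if (sinceTxId > getHighestWrittenTxId()) {
  // Fail the RPC instead of returning an empty-but-successful response,
  // so the reader falls back to the other JournalNodes rather than
  // treating this JN's empty cache as authoritative.
  metrics.rpcEmptyResponses.incr();
  throw new CacheMissException("Requested txid " + sinceTxId
      + " is beyond the highest written txid " + getHighestWrittenTxId());
}
{code}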







[jira] [Created] (HDFS-16660) Improve Code With Lambda in IPCLoggerChannel class

2022-07-13 Thread ZanderXu (Jira)
ZanderXu created HDFS-16660:
---

 Summary: Improve Code With Lambda in IPCLoggerChannel class
 Key: HDFS-16660
 URL: https://issues.apache.org/jira/browse/HDFS-16660
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


Improve Code With Lambda in IPCLoggerChannel class






[jira] [Created] (HDFS-16703) Enable RPC Timeout for some protocols of NameNode.

2022-07-29 Thread ZanderXu (Jira)
ZanderXu created HDFS-16703:
---

 Summary: Enable RPC Timeout for some protocols of NameNode.
 Key: HDFS-16703
 URL: https://issues.apache.org/jira/browse/HDFS-16703
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


While reading some protocol-related code, I found that only the 
ClientNamenodeProtocolPB proxy is created with an RPC timeout; the other 
protocolPB proxies are created without one, such as 
RefreshAuthorizationPolicyProtocolPB, RefreshUserMappingsProtocolPB, 
RefreshCallQueueProtocolPB, GetUserMappingsProtocolPB and NamenodeProtocolPB.

Without an RPC timeout, a proxy can be blocked for a long time if the NN 
machine crashes or the network goes bad while writing to or reading from the 
NN.

So I feel that we should enable an RPC timeout for all ProtocolPBs.
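
A minimal sketch for one of the proxies, reusing the existing ipc.client.rpc-timeout.ms setting (the exact wiring is an assumption):
{code:java}
// Sketch: build a NamenodeProtocolPB proxy with an RPC timeout, mirroring
// what is already done for ClientNamenodeProtocolPB.
int rpcTimeoutMs = conf.getInt(
    CommonConfigurationKeys.IPC_CLIENT_RPC_TIMEOUT_KEY,
    CommonConfigurationKeys.IPC_CLIENT_RPC_TIMEOUT_DEFAULT);
NamenodeProtocolPB proxy = RPC.getProtocolProxy(
    NamenodeProtocolPB.class,
    RPC.getProtocolVersion(NamenodeProtocolPB.class),
    nnAddr, ugi, conf,
    NetUtils.getDefaultSocketFactory(conf),
    rpcTimeoutMs, null).getProxy();
{code}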






[jira] [Created] (HDFS-16704) Datanode return empty response instead of NPE for GetVolumeInfo during restarting

2022-07-29 Thread ZanderXu (Jira)
ZanderXu created HDFS-16704:
---

 Summary: Datanode return empty response instead of NPE for 
GetVolumeInfo during restarting
 Key: HDFS-16704
 URL: https://issues.apache.org/jira/browse/HDFS-16704
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


During datanode startup, I found some NPEs in the logs:

{code:java}
Caused by: java.lang.NullPointerException: Storage not yet initialized
    at 
org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:899)
    at 
org.apache.hadoop.hdfs.server.datanode.DataNode.getVolumeInfo(DataNode.java:3533)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:72)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:276)
    at 
com.sun.jmx.mbeanserver.ConvertingMethod.invokeWithOpenReturn(ConvertingMethod.java:193)
    at 
com.sun.jmx.mbeanserver.ConvertingMethod.invokeWithOpenReturn(ConvertingMethod.java:175)
 {code}
 

 

Because the storage of the datanode is not yet initialized when we try to get 
the datanode metrics; the related code is as below:
{code:java}
@Override // DataNodeMXBean
public String getVolumeInfo() {
  Preconditions.checkNotNull(data, "Storage not yet initialized");
  return JSON.toString(data.getVolumeInfoMap());
} {code}
The logic is OK, but I feel that a more reasonable behavior would be to return 
an empty response instead of an NPE, because the InfoServer is started before 
initBlockPool. A sketch is below.
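
A minimal sketch of the proposed behavior (not the committed patch):
{code:java}
@Override // DataNodeMXBean
public String getVolumeInfo() {
  // The InfoServer starts before the block pools are initialized, so
  // return an empty JSON map instead of failing the whole JMX call while
  // 'data' is still null.
  if (data == null) {
    return JSON.toString(Collections.emptyMap());
  }
  return JSON.toString(data.getVolumeInfoMap());
}
{code}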






[jira] [Created] (HDFS-16678) RBF supports disable getNodeUsage() in RBFMetrics

2022-07-21 Thread ZanderXu (Jira)
ZanderXu created HDFS-16678:
---

 Summary: RBF supports disable getNodeUsage() in RBFMetrics
 Key: HDFS-16678
 URL: https://issues.apache.org/jira/browse/HDFS-16678
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


In our prod environment, we try to collect RBF metrics every 15s through 
jmx_exporter, and we found that the collection task often failed.

After tracing, we found that the collection task is blocked at getNodeUsage() 
in RBFMetrics, because it collects every datanode's usage from the downstream 
nameservices. This is a very expensive and almost useless operation: in most 
scenarios each nameservice contains almost the same DNs, so we can get the 
datanode usage from any one nameservice directly rather than from RBF.

So I feel that RBF should support disabling getNodeUsage() in RBFMetrics.






[jira] [Created] (HDFS-16696) NameNode supports a new MsyncRPCServer to reduce the latency of msync() rpc

2022-07-27 Thread ZanderXu (Jira)
ZanderXu created HDFS-16696:
---

 Summary: NameNode supports a new MsyncRPCServer to reduce the 
latency of msync() rpc
 Key: HDFS-16696
 URL: https://issues.apache.org/jira/browse/HDFS-16696
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


HDFS-12943 introduced Consistent Reads from Standby Node. It uses the msync 
mechanism to guarantee consistency, so the latency of the msync() RPC is very 
important, especially for end users who need to call msync() every time.

Unfortunately, the NameNode handles msync() RPCs the same way as other RPCs: 
they are enqueued, wait, and are then handled. So msync() can be blocked by 
other RPCs, such as setQuota, rename, delete, etc.

So we need a new mechanism to guarantee the latency of the msync() RPC. For 
example:
* NameNode can support a new MsyncRPCServer to handle msync() RPCs separately.






[jira] [Created] (HDFS-16695) Improve Code With Lambda in hadoop-hdfs module

2022-07-27 Thread ZanderXu (Jira)
ZanderXu created HDFS-16695:
---

 Summary: Improve Code With Lambda in hadoop-hdfs module
 Key: HDFS-16695
 URL: https://issues.apache.org/jira/browse/HDFS-16695
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


Improve Code with Lambda in hadoop-hdfs module. 

For example:
Current logic:
{code:java}
public ListenableFuture<GetJournaledEditsResponseProto> getJournaledEdits(
    long fromTxnId, int maxTransactions) {
  return parallelExecutor.submit(
      new Callable<GetJournaledEditsResponseProto>() {
        @Override
        public GetJournaledEditsResponseProto call() throws IOException {
          return getProxy().getJournaledEdits(journalId, nameServiceId,
              fromTxnId, maxTransactions);
        }
      });
}
{code}

Improved Code with Lambda:
{code:java}
public ListenableFuture<GetJournaledEditsResponseProto> getJournaledEdits(
    long fromTxnId, int maxTransactions) {
  return parallelExecutor.submit(() -> getProxy().getJournaledEdits(
      journalId, nameServiceId, fromTxnId, maxTransactions));
}
{code}








[jira] [Created] (HDFS-16698) Add a metric to sense possible MaxDirectoryItemsExceededException in time.

2022-07-27 Thread ZanderXu (Jira)
ZanderXu created HDFS-16698:
---

 Summary: Add a metric to sense possible 
MaxDirectoryItemsExceededException in time.
 Key: HDFS-16698
 URL: https://issues.apache.org/jira/browse/HDFS-16698
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


In our prod environment, we occasionally encounter a 
MaxDirectoryItemsExceededException that causes job failures.
{code:java}
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
 The directory item limit of /user/XXX/.sparkStaging is exceeded: limit=1048576 
items=1048576
{code}

In order to avoid this, we add a metric to sense a possible 
MaxDirectoryItemsExceededException in time, so that we can react before jobs 
fail. A sketch is below.
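
An illustrative sketch only; the metric name and the hook point are assumptions, not the committed patch:
{code:java}
// Sketch: when a directory's child count approaches
// dfs.namenode.fs-limits.max-directory-items, bump a metric that
// monitoring can alert on. incrDirectoryItemsNearLimit() is hypothetical.
void checkNearMaxDirItems(INodeDirectory parent) {
  int items = parent.getChildrenNum(Snapshot.CURRENT_STATE_ID);
  if (items >= maxDirItems * 0.9) {
    NameNode.getNameNodeMetrics().incrDirectoryItemsNearLimit();
  }
}
{code}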






[jira] [Created] (HDFS-16705) RBF supports healthMonitor timeout configurable and cache NN and client proxy in NamenodeHeartbeatService

2022-07-30 Thread ZanderXu (Jira)
ZanderXu created HDFS-16705:
---

 Summary: RBF supports healthMonitor timeout configurable and cache 
NN and client proxy in NamenodeHeartbeatService
 Key: HDFS-16705
 URL: https://issues.apache.org/jira/browse/HDFS-16705
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


When I read NamenodeHeartbeatService.class of RBF, I feel that there are some 
things we can do for NamenodeHeartbeatService.class:
 * Cache the NameNode protocol and client protocol proxies to avoid creating a 
new proxy every time
 * Support a configurable healthMonitor timeout
 * Format the code of getNamenodeStatusReport to make it clearer






[jira] [Resolved] (HDFS-16670) Improve Code With Lambda in EditLogTailer class

2022-08-01 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16670.
-
Resolution: Duplicate

> Improve Code With Lambda in EditLogTailer class
> ---
>
> Key: HDFS-16670
> URL: https://issues.apache.org/jira/browse/HDFS-16670
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Improve Code With Lambda in EditLogTailer class






[jira] [Created] (HDFS-16750) NameNode should use NameNode.getRemoteUser() to log audit event to avoid possible NPE

2022-08-29 Thread ZanderXu (Jira)
ZanderXu created HDFS-16750:
---

 Summary: NameNode should use NameNode.getRemoteUser() to log audit 
event to avoid possible NPE 
 Key: HDFS-16750
 URL: https://issues.apache.org/jira/browse/HDFS-16750
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


NameNode should use NameNode.getRemoteUser() to log audit events to avoid a 
possible NPE.

The related code is:
{code:java}
private void logAuditEvent(boolean succeeded, String cmd, String src,
String dst, FileStatus stat) throws IOException {
  if (isAuditEnabled() && isExternalInvocation()) {
logAuditEvent(succeeded, Server.getRemoteUser(), Server.getRemoteIp(),
cmd, src, dst, stat);
  }
}

// the ugi may be null.
private void logAuditEvent(boolean succeeded,
UserGroupInformation ugi, InetAddress addr, String cmd, String src,
String dst, FileStatus status) {
  final String ugiStr = ugi.toString();
  ...
} {code}
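
A minimal sketch of the fix, assuming NameNode.getRemoteUser() falls back to the current UGI when the call does not come through the RPC server:
{code:java}
private void logAuditEvent(boolean succeeded, String cmd, String src,
    String dst, FileStatus stat) throws IOException {
  if (isAuditEnabled() && isExternalInvocation()) {
    // NameNode.getRemoteUser() never returns null: it falls back to
    // UserGroupInformation.getCurrentUser() for non-RPC invocations.
    logAuditEvent(succeeded, NameNode.getRemoteUser(), Server.getRemoteIp(),
        cmd, src, dst, stat);
  }
}
{code}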






[jira] [Created] (HDFS-16748) DFSClient should diff the writing files with namespace Id and iNodeId.

2022-08-27 Thread ZanderXu (Jira)
ZanderXu created HDFS-16748:
---

 Summary: DFSClient should diff the writing files with namespace Id 
and iNodeId.
 Key: HDFS-16748
 URL: https://issues.apache.org/jira/browse/HDFS-16748
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


DFSClient should distinguish the files being written by namespaceId and 
iNodeId, because files being written may belong to different namespaces while 
sharing the same iNodeId.

The related code is as below:
{code:java}
public void putFileBeingWritten(final long inodeId,
  final DFSOutputStream out) {
synchronized(filesBeingWritten) {
  filesBeingWritten.put(inodeId, out);
  // update the last lease renewal time only when there was no
  // writes. once there is one write stream open, the lease renewer
  // thread keeps it updated well with in anyone's expiration time.
  if (lastLeaseRenewal == 0) {
updateLastLeaseRenewal();
  }
}
  }
{code}
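
A minimal sketch of the idea (the composite key and the namespaceId parameter are illustrative; the actual patch may differ):
{code:java}
// Sketch: key the map by (namespaceId, inodeId) instead of inodeId alone,
// so streams from different namespaces never collide.
private final Map<String, DFSOutputStream> filesBeingWritten =
    new HashMap<>();

private static String key(String namespaceId, long inodeId) {
  return namespaceId + "/" + inodeId;
}

public void putFileBeingWritten(final String namespaceId, final long inodeId,
    final DFSOutputStream out) {
  synchronized (filesBeingWritten) {
    filesBeingWritten.put(key(namespaceId, inodeId), out);
    if (lastLeaseRenewal == 0) {
      updateLastLeaseRenewal();
    }
  }
}
{code}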






[jira] [Created] (HDFS-16737) Fix number of threads in FsDatasetAsyncDiskService#addExecutorForVolume

2022-08-22 Thread ZanderXu (Jira)
ZanderXu created HDFS-16737:
---

 Summary: Fix number of threads in 
FsDatasetAsyncDiskService#addExecutorForVolume
 Key: HDFS-16737
 URL: https://issues.apache.org/jira/browse/HDFS-16737
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


The thread pool in FsDatasetAsyncDiskService#addExecutorForVolume is nominally 
elastic right now; make it fixed.
Presently the corePoolSize is set to 1 and maximumPoolSize is set to 
maxNumThreadsPerVolume, but since the capacity of the queue is 
Integer.MAX_VALUE, the queue never tends to get full, so the pool stays at 1 
thread irrespective of maxNumThreadsPerVolume.
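
A minimal sketch of the fix (the constants and threadFactory are assumed to come from the surrounding FsDatasetAsyncDiskService code):
{code:java}
// With an unbounded queue, ThreadPoolExecutor never grows beyond
// corePoolSize, so pin both bounds to maxNumThreadsPerVolume and let idle
// core threads time out instead.
ThreadPoolExecutor executor = new ThreadPoolExecutor(
    maxNumThreadsPerVolume, maxNumThreadsPerVolume,
    THREADS_KEEP_ALIVE_SECONDS, TimeUnit.SECONDS,
    new LinkedBlockingQueue<Runnable>(), threadFactory);
executor.allowCoreThreadTimeOut(true);
{code}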






[jira] [Created] (HDFS-16754) UnderConstruct file with missingBlock should be able to recover lease

2022-08-31 Thread ZanderXu (Jira)
ZanderXu created HDFS-16754:
---

 Summary: UnderConstruct file with missingBlock should be able to 
recover lease
 Key: HDFS-16754
 URL: https://issues.apache.org/jira/browse/HDFS-16754
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


While looking into the logic of RecoverLease, I suspect there is a bug:
{code:java}
int nrCompleteBlocks;
BlockInfo curBlock = null;
for(nrCompleteBlocks = 0; nrCompleteBlocks < nrBlocks; nrCompleteBlocks++) {
  curBlock = blocks[nrCompleteBlocks];
  if(!curBlock.isComplete())
break;
  // Why?
  assert blockManager.hasMinStorage(curBlock) :
  "A COMPLETE block is not minimally replicated in " + src;
} {code}
RecoverLease only tries to align the replicas of the last block of an 
under-construction file.

So a UC file with completed blocks that have lost all their replicas should 
still be able to recover its lease.






[jira] [Resolved] (HDFS-16641) [HDFS] Add RPC ReQueue Metrics

2022-09-07 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16641.
-
Resolution: Duplicate

> [HDFS] Add RPC ReQueue Metrics
> --
>
> Key: HDFS-16641
> URL: https://issues.apache.org/jira/browse/HDFS-16641
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Add RPC ReQueue Metrics to easily locate the abnormal case where the 
> ObserverNameNode has a lower RPCProcessingTime but a higher RPCQueueTime.






[jira] [Created] (HDFS-16758) Add an optional RetriableFileFastCopyCommand to Distcp

2022-09-06 Thread ZanderXu (Jira)
ZanderXu created HDFS-16758:
---

 Summary: Add an optional RetriableFileFastCopyCommand to Distcp
 Key: HDFS-16758
 URL: https://issues.apache.org/jira/browse/HDFS-16758
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: ZanderXu
Assignee: ZanderXu


Add an optional RetriableFileFastCopyCommand to DistCp to support copying a 
contiguous file from one namespace to another via FastCopy.

RetriableFileFastCopyCommand will be a subclass of RetriableFileCopyCommand. 
If a file to be copied does not meet the conditions for FastCopy, it will 
automatically fall back to the super method to copy the file.

Tips: the current task only supports copying contiguous files.






[jira] [Created] (HDFS-16760) Support a block level inputFormat for RetriableFileFastCopyCommand

2022-09-06 Thread ZanderXu (Jira)
ZanderXu created HDFS-16760:
---

 Summary: Support a block level inputFormat for 
RetriableFileFastCopyCommand
 Key: HDFS-16760
 URL: https://issues.apache.org/jira/browse/HDFS-16760
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: ZanderXu
Assignee: ZanderXu


RetriableFileFastCopyCommand needs a block-level InputFormat to deal with the 
data skew problem at the block level.

If an existing InputFormat already has similar functionality, we can modify it 
to meet this requirement first.






[jira] [Created] (HDFS-16757) Add a new method copyBlockCrossNamespace to DataNode

2022-09-06 Thread ZanderXu (Jira)
ZanderXu created HDFS-16757:
---

 Summary: Add a new method copyBlockCrossNamespace to DataNode
 Key: HDFS-16757
 URL: https://issues.apache.org/jira/browse/HDFS-16757
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: ZanderXu
Assignee: ZanderXu


Add a new method copyBlockCrossNamespace to DataTransferProtocol on the 
DataNode side.

This method copies a source block from one namespace to a target block in a 
different namespace. If the target DN is the same as the current DN, the block 
is copied via HardLink. If the target DN is different from the current DN, the 
block is copied via TransferBlock.

This method will take the following parameters:
 * ExtendedBlock sourceBlock
 * Token<BlockTokenIdentifier> sourceBlockToken
 * ExtendedBlock targetBlock
 * Token<BlockTokenIdentifier> targetBlockToken
 * DatanodeInfo targetDN






[jira] [Created] (HDFS-16759) RetriableFileFastCopyCommand supports migrating EC files

2022-09-06 Thread ZanderXu (Jira)
ZanderXu created HDFS-16759:
---

 Summary: RetriableFileFastCopyCommand supports migrating EC files
 Key: HDFS-16759
 URL: https://issues.apache.org/jira/browse/HDFS-16759
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: ZanderXu
Assignee: ZanderXu


RetriableFileFastCopyCommand should support migrating EC files via FastCopy. 
The main places that need to be modified are as below:
 * Preparing the favored-nodes list
 * Aligning the DNs of each block






[jira] [Created] (HDFS-16738) Invalid CallerContext caused NullPointerException

2022-08-22 Thread ZanderXu (Jira)
ZanderXu created HDFS-16738:
---

 Summary: Invalid CallerContext caused NullPointerException
 Key: HDFS-16738
 URL: https://issues.apache.org/jira/browse/HDFS-16738
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


{code:java}
2022-08-23 11:58:03,258 [FSEditLogAsync] ERROR namenode.FSEditLog 
(JournalSet.java:mapJournalsAndReportErrors(398)) - Error: write op failed for 
required journal (JournalAndStream(mgr=QJM to [127.0.0.1:55779, 
127.0.0.1:55781, 127.0.0.1:55783], stream=QuorumOutputStream starting at txid 
1))
java.lang.NullPointerException
at org.apache.hadoop.io.UTF8.set(UTF8.java:97)
at 
org.apache.hadoop.hdfs.server.namenode.FSImageSerialization.writeString(FSImageSerialization.java:361)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$AddCloseOp.writeFields(FSEditLogOp.java:586)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Writer.writeOp(FSEditLogOp.java:4986)
at 
org.apache.hadoop.hdfs.server.namenode.EditsDoubleBuffer$TxnBuffer.writeOp(EditsDoubleBuffer.java:158)
at 
org.apache.hadoop.hdfs.server.namenode.EditsDoubleBuffer.writeOp(EditsDoubleBuffer.java:61)
at 
org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.write(QuorumOutputStream.java:50)
at 
org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$1.apply(JournalSet.java:462)
at 
org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at 
org.apache.hadoop.hdfs.server.namenode.JournalSet.access$200(JournalSet.java:56)
at 
org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.write(JournalSet.java:458)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.doEditTransaction(FSEditLog.java:496)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync$Edit.logEdit(FSEditLogAsync.java:311)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.run(FSEditLogAsync.java:253)
at java.lang.Thread.run(Thread.java:748)
{code}







[jira] [Created] (HDFS-16734) RBF: fix some bugs when handling getContentSummary RPC

2022-08-19 Thread ZanderXu (Jira)
ZanderXu created HDFS-16734:
---

 Summary: RBF: fix some bugs when handling getContentSummary RPC
 Key: HDFS-16734
 URL: https://issues.apache.org/jira/browse/HDFS-16734
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


Suppose there are some mount points as below in RBF, without a default 
namespace.

||Source Path||NameSpace||Destination Path||
|/a/b|ns0|/a/b|
|/a/b/c|ns0|/a/b/c|
|/a/b/c/d|ns1|/a/b/c/d|

Suppose there is a file /a/b/c/file1 with 10MB of data in ns0 and a file 
/a/b/c/d/file2 with 20MB of data in ns1.

There are bugs in handling the following cases:

||Case Number||Case||Current Result||Expected Result||
|1|getContentSummary('/a')|Throws RouterResolveException|2 files and 30MB data|
|2|getContentSummary('/a/b')|2 files and 40MB data|3 files and 40MB data|

The bugs for these cases:

Case 1: if RBF can't find any location for the path, it should try again with 
the sub mount points.

Case 2: RBF shouldn't repeatedly get the content summary from the same 
namespace with nested ancestor paths, such as from ns0 with /a/b and again 
from ns0 with /a/b/c.






[jira] [Created] (HDFS-16785) DataNode hold BP write lock to scan disk

2022-09-28 Thread ZanderXu (Jira)
ZanderXu created HDFS-16785:
---

 Summary: DataNode hold BP write lock to scan disk
 Key: HDFS-16785
 URL: https://issues.apache.org/jira/browse/HDFS-16785
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


When patching the fine-grained locking of the datanode, I found that 
`addVolume` holds the write mode of the BP lock while scanning the new volume 
to collect its blocks. If we try to add a full volume that was fixed offline 
before, it will hold the write lock for a long time.

The related code is as below:
{code:java}
for (final NamespaceInfo nsInfo : nsInfos) {
  String bpid = nsInfo.getBlockPoolID();
  try (AutoCloseDataSetLock l = lockManager.writeLock(LockLevel.BLOCK_POOl, 
bpid)) {
fsVolume.addBlockPool(bpid, this.conf, this.timer);
fsVolume.getVolumeMap(bpid, tempVolumeMap, ramDiskReplicaTracker);
  } catch (IOException e) {
LOG.warn("Caught exception when adding " + fsVolume +
". Will throw later.", e);
exceptions.add(e);
  }
} {code}
And I noticed that this lock was added by HDFS-15382, meaning this logic was 
not under the lock before.
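
A minimal sketch of one possible direction (whether addBlockPool itself still needs the lock is left open; this is an illustration, not a committed patch):
{code:java}
for (final NamespaceInfo nsInfo : nsInfos) {
  String bpid = nsInfo.getBlockPoolID();
  try {
    // Register the block pool under the BP write lock ...
    try (AutoCloseDataSetLock l = lockManager.writeLock(LockLevel.BLOCK_POOl,
        bpid)) {
      fsVolume.addBlockPool(bpid, this.conf, this.timer);
    }
    // ... and run the long disk scan outside the lock; tempVolumeMap is
    // still private to this thread at this point.
    fsVolume.getVolumeMap(bpid, tempVolumeMap, ramDiskReplicaTracker);
  } catch (IOException e) {
    LOG.warn("Caught exception when adding " + fsVolume +
        ". Will throw later.", e);
    exceptions.add(e);
  }
}
{code}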






[jira] [Created] (HDFS-16787) Remove redundant lock in DataSetLockManager#removeLock in datanode.

2022-09-28 Thread ZanderXu (Jira)
ZanderXu created HDFS-16787:
---

 Summary: Remove redundant lock in DataSetLockManager#removeLock in 
datanode.
 Key: HDFS-16787
 URL: https://issues.apache.org/jira/browse/HDFS-16787
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


While patching the datanode fine-grained locking, I found there is a redundant 
lock in DataSetLockManager#removeLock; the code is as below:
{code:java}
@Override
public void removeLock(LockLevel level, String... resources) {
  String lockName = generateLockName(level, resources);
  try (AutoCloseDataSetLock lock = writeLock(level, resources)) {
// Here, this lock is redundant.
lock.lock();
lockMap.removeLock(lockName);
  }
} {code}
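
A minimal sketch of the fix (writeLock(level, resources) is assumed to acquire the lock itself, as the try-with-resources usage implies):
{code:java}
@Override
public void removeLock(LockLevel level, String... resources) {
  String lockName = generateLockName(level, resources);
  // writeLock() already acquires the lock; the extra lock() call is gone.
  try (AutoCloseDataSetLock lock = writeLock(level, resources)) {
    lockMap.removeLock(lockName);
  }
}
{code}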






[jira] [Resolved] (HDFS-16803) Improve some annotations in hdfs module

2022-10-19 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16803.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Improve some annotations in hdfs module
> ---
>
> Key: HDFS-16803
> URL: https://issues.apache.org/jira/browse/HDFS-16803
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation, namenode
>Affects Versions: 2.9.2, 3.3.4
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> In the hdfs module, some annotations are out of date, e.g.:
> {code:java}
>   FSDirRenameOp: 
>   /**
>* @see {@link #unprotectedRenameTo(FSDirectory, String, String, 
> INodesInPath,
>* INodesInPath, long, BlocksMapUpdateInfo, Options.Rename...)}
>*/
>   static RenameResult renameTo(FSDirectory fsd, FSPermissionChecker pc,
>   String src, String dst, BlocksMapUpdateInfo collectedBlocks,
>   boolean logRetryCache,Options.Rename... options)
>   throws IOException {
> {code}
> We should try to improve these annotations so that the documentation reads 
> better.






[jira] [Resolved] (HDFS-16817) Remove useless DataNode lock related configuration

2022-10-27 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16817.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Remove useless DataNode lock related configuration
> --
>
> Key: HDFS-16817
> URL: https://issues.apache.org/jira/browse/HDFS-16817
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When looking at the code related to the DataNode lock, we found that the 
> relevant configurations are invalid and can probably be removed:
> {code:java}
> public static final String DFS_DATANODE_LOCK_READ_WRITE_ENABLED_KEY =
> "dfs.datanode.lock.read.write.enabled";
> public static final Boolean DFS_DATANODE_LOCK_READ_WRITE_ENABLED_DEFAULT =
> true;
> public static final String  DFS_DATANODE_LOCK_REPORTING_THRESHOLD_MS_KEY =
> "dfs.datanode.lock-reporting-threshold-ms";
> public static final long
> DFS_DATANODE_LOCK_REPORTING_THRESHOLD_MS_DEFAULT = 300L;
> <property>
>   <name>dfs.datanode.lock.read.write.enabled</name>
>   <value>true</value>
>   <description>If this is true, the FsDataset lock will be a read write lock.
>     If it is false, all locks will be a write lock.
>     Enabling this should give better datanode throughput, as many read only
>     functions can run concurrently under the read lock, when they would
>     previously have required the exclusive write lock. As the feature is
>     experimental, this switch can be used to disable the shared read lock, and
>     cause all lock acquisitions to use the exclusive write lock.
>   </description>
> </property>
> <property>
>   <name>dfs.datanode.lock-reporting-threshold-ms</name>
>   <value>300</value>
>   <description>When thread waits to obtain a lock, or a thread holds a lock for
>     more than the threshold, a log message will be written. Note that
>     dfs.lock.suppress.warning.interval ensures a single log message is
>     emitted per interval for waiting threads and a single message for holding
>     threads to avoid excessive logging.
>   </description>
> </property>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16826) [RBF SBN] ConnectionManager should advance the client stateId for every request

2022-10-28 Thread ZanderXu (Jira)
ZanderXu created HDFS-16826:
---

 Summary: [RBF SBN] ConnectionManager should advance the client 
stateId for every request
 Key: HDFS-16826
 URL: https://issues.apache.org/jira/browse/HDFS-16826
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


ConnectionManager should advance the client stateId for every request, whether 
the pool is null or not.

 

The buggy code is as below:
{code:java}
// Create the pool if not created before
if (pool == null) {
  writeLock.lock();
  try {
pool = this.pools.get(connectionId);
if (pool == null) {
  pool = new ConnectionPool(
  this.conf, nnAddress, ugi, this.minSize, this.maxSize,
  this.minActiveRatio, protocol,
  new PoolAlignmentContext(this.routerStateIdContext, nsId));
  this.pools.put(connectionId, pool);
  this.connectionPoolToNamespaceMap.put(connectionId, nsId);
}
// BUG Here
long clientStateId = 
RouterStateIdContext.getClientStateIdFromCurrentCall(nsId);
pool.getPoolAlignmentContext().advanceClientStateId(clientStateId);
  } finally {
writeLock.unlock();
  }
} {code}
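
A minimal sketch of the fix, moving the advance out of the pool-creation 
branch so it runs for every request:
{code:java}
if (pool == null) {
  writeLock.lock();
  try {
    pool = this.pools.get(connectionId);
    if (pool == null) {
      pool = new ConnectionPool(
          this.conf, nnAddress, ugi, this.minSize, this.maxSize,
          this.minActiveRatio, protocol,
          new PoolAlignmentContext(this.routerStateIdContext, nsId));
      this.pools.put(connectionId, pool);
      this.connectionPoolToNamespaceMap.put(connectionId, nsId);
    }
  } finally {
    writeLock.unlock();
  }
}
// Advance the client stateId unconditionally, whether the pool was cached
// or freshly created.
long clientStateId =
    RouterStateIdContext.getClientStateIdFromCurrentCall(nsId);
pool.getPoolAlignmentContext().advanceClientStateId(clientStateId);
{code}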
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16827) [RBF SBN] RouterStateIdContext shouldn't update the ResponseState if client doesn't use ObserverReadProxyProvider

2022-10-28 Thread ZanderXu (Jira)
ZanderXu created HDFS-16827:
---

 Summary: [RBF SBN] RouterStateIdContext shouldn't update the 
ResponseState if client doesn't use ObserverReadProxyProvider
 Key: HDFS-16827
 URL: https://issues.apache.org/jira/browse/HDFS-16827
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


RouterStateIdContext shouldn't update the ResponseState if the client doesn't 
use ObserverReadProxyProvider.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16802) Print options when accessing ClientProtocol#rename2()

2022-10-28 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16802.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Print options when accessing ClientProtocol#rename2()
> -
>
> Key: HDFS-16802
> URL: https://issues.apache.org/jira/browse/HDFS-16802
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When accessing ClientProtocol#rename2(), the carried options cannot be seen 
> in the log. Here is some log information:
> {code:java}
> 2022-10-13 10:21:10,727 [Listener at localhost/59732] DEBUG  hdfs.StateChange 
> (FSDirRenameOp.java:renameToInt(255)) - DIR* NameSystem.renameTo: with 
> options - /testNamenodeRetryCache/testRename2/src to 
> /testNamenodeRetryCache/testRename2/target
> {code}
> We should improve this; printing the options would be better.
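>
> A minimal sketch of the improvement (hypothetical log statement; the exact
> logger and message format may differ):
> {code:java}
> NameNode.stateChangeLog.debug(
>     "DIR* NameSystem.renameTo: with options {} - {} to {}",
>     java.util.Arrays.toString(options), src, dst);
> {code}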



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16771) JN should tersely print logs about NewerTxnIdException

2022-09-13 Thread ZanderXu (Jira)
ZanderXu created HDFS-16771:
---

 Summary: JN should tersely print logs about NewerTxnIdException
 Key: HDFS-16771
 URL: https://issues.apache.org/jira/browse/HDFS-16771
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


JournalNode should print logs about NewerTxnIdException tersely, without the 
full stack trace.
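
A minimal sketch, assuming the fix is to register the exception as terse on 
the JournalNode RPC server so only the message is logged:
{code:java}
// org.apache.hadoop.ipc.Server#addTerseExceptions suppresses stack traces
// for the listed exception classes.
server.addTerseExceptions(NewerTxnIdException.class);
{code}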



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16772) refreshHostsReader should use the new configuration

2022-09-14 Thread ZanderXu (Jira)
ZanderXu created HDFS-16772:
---

 Summary: refreshHostsReader should use the new configuration
 Key: HDFS-16772
 URL: https://issues.apache.org/jira/browse/HDFS-16772
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


`refreshHostsReader` should use the latest configuration.

And the current code is as below:
{code:java}
/** Reread include/exclude files. */
private void refreshHostsReader(Configuration conf) throws IOException {
  if (conf == null) {
conf = new HdfsConfiguration();
// BUG here
this.hostConfigManager.setConf(conf);
  }
  this.hostConfigManager.refresh();
} {code}
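
A minimal sketch of the fix, applying the passed-in configuration 
unconditionally before refreshing:
{code:java}
/** Reread include/exclude files. */
private void refreshHostsReader(Configuration conf) throws IOException {
  if (conf == null) {
    conf = new HdfsConfiguration();
  }
  // Apply the (possibly new) configuration in all cases, not only when a
  // fresh default one is created.
  this.hostConfigManager.setConf(conf);
  this.hostConfigManager.refresh();
} {code}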



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16764) ObserverNamenode handles addBlock rpc and throws a FileNotFoundException

2022-09-07 Thread ZanderXu (Jira)
ZanderXu created HDFS-16764:
---

 Summary: ObserverNamenode handles addBlock rpc and throws a 
FileNotFoundException 
 Key: HDFS-16764
 URL: https://issues.apache.org/jira/browse/HDFS-16764
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


ObserverNameNode currently can handle the addBlock RPC, but it may throw a 
FileNotFoundException when its state contains a stale txid.
 * AddBlock is not a coordinated method, so the Observer will not check the 
stateId.
 * AddBlock does the validation with checkOperation(OperationCategory.READ)

So the observer can handle the addBlock RPC. If this observer has not yet 
replayed the edit that created the file, it will throw a FileNotFoundException 
during validation.

The related code as follows:
{code:java}
checkOperation(OperationCategory.READ);
final FSPermissionChecker pc = getPermissionChecker();
FSPermissionChecker.setOperationType(operationName);
readLock();
try {
  checkOperation(OperationCategory.READ);
  r = FSDirWriteFileOp.validateAddBlock(this, pc, src, fileId, clientName,
previous, onRetryBlock);
} finally {
  readUnlock(operationName);
} {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16779) Add ErasureCodingPolicy information to the response description for GETFILESTATUS in WebHDFS.md

2022-09-21 Thread ZanderXu (Jira)
ZanderXu created HDFS-16779:
---

 Summary: Add ErasureCodingPolicy information to the response 
description for GETFILESTATUS in WebHDFS.md
 Key: HDFS-16779
 URL: https://issues.apache.org/jira/browse/HDFS-16779
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


[WebHDFS_GETFILESTATUS|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Status_of_a_FileDirectory]
 doesn't contain the ErasureCodingPolicy information. We should add it to 
make it easier for end users to get the EC policy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16780) Add ErasureCodePolicy information to the response of GET_BLOCK_LOCATIONS in NamenodeWebHdfsMethods

2022-09-21 Thread ZanderXu (Jira)
ZanderXu created HDFS-16780:
---

 Summary: Add ErasureCodePolicy information to the response of 
GET_BLOCK_LOCATIONS in NamenodeWebHdfsMethods
 Key: HDFS-16780
 URL: https://issues.apache.org/jira/browse/HDFS-16780
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


The response of GET_BLOCK_LOCATIONS in NamenodeWebHdfsMethods does not contain 
the ErasureCodingPolicy information. Add it to make it easier for end users to 
get the EC policy from the result of GET_BLOCK_LOCATIONS.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16793) ObserverNameNode fails to select streaming inputStream with a timeout exception

2022-10-04 Thread ZanderXu (Jira)
ZanderXu created HDFS-16793:
---

 Summary: ObserverNameNode fails to select streaming inputStream 
with a timeout exception 
 Key: HDFS-16793
 URL: https://issues.apache.org/jira/browse/HDFS-16793
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


In our production environment, we encountered one case where the observer 
namenode failed to select a streaming inputStream due to a timeout exception. 
The related code is as below:

 
{code:java}
@Override
public void selectInputStreams(Collection<EditLogInputStream> streams,
long fromTxnId, boolean inProgressOk,
boolean onlyDurableTxns) throws IOException { 
  if (inProgressOk && inProgressTailingEnabled) {
...
  }
  // Timeout here.
  selectStreamingInputStreams(streams, fromTxnId, inProgressOk,
  onlyDurableTxns);
} {code}
 

 

After looking into the code, I found that JournalNode performs one very 
expensive and redundant operation that scans all edits of the last 
in-progress segment with IO. The related code is as below:

 
{code:java}
public List<RemoteEditLog> getRemoteEditLogs(long firstTxId,
    boolean inProgressOk) throws IOException {
  File currentDir = sd.getCurrentDir();
  List<EditLogFile> allLogFiles = matchEditLogs(currentDir);
  List<RemoteEditLog> ret = Lists.newArrayListWithCapacity(
  allLogFiles.size());
  for (EditLogFile elf : allLogFiles) {
if (elf.hasCorruptHeader() || (!inProgressOk && elf.isInProgress())) {
  continue;
}
// Here.
if (elf.isInProgress()) {
  try {
elf.scanLog(getLastReadableTxId(), true);
  } catch (IOException e) {
LOG.error("got IOException while trying to validate header of " +
elf + ".  Skipping.", e);
continue;
  }
}
if (elf.getFirstTxId() >= firstTxId) {
  ret.add(new RemoteEditLog(elf.firstTxId, elf.lastTxId,
  elf.isInProgress()));
} else if (elf.getFirstTxId() < firstTxId && firstTxId <= 
elf.getLastTxId()) {
  // If the firstTxId is in the middle of an edit log segment. Return this
  // anyway and let the caller figure out whether it wants to use it.
  ret.add(new RemoteEditLog(elf.firstTxId, elf.lastTxId,
  elf.isInProgress()));
}
  }
  
  Collections.sort(ret);
  
  return ret;
} {code}
 

Expensive:
 * This scan operation will scan all of the edits of the in-progress segment 
with IO.

Redundant:
 * This scan operation just finds the lastTxId of this in-progress segment.
 * But the caller method getEditLogManifest(long sinceTxId, boolean 
inProgressOk) in Journal.java just ignores the lastTxId of the in-progress 
segment and uses getHighestWrittenTxId() as the lastTxId of the in-progress 
segment returned to the namenode.
 * So, the scan operation is redundant.

 

If end users enable the Observer Read feature, the latency of tailing edits 
from the journalnode is very important, whether in the normal process or the 
fallback process.

And there are no further comments about this scan logic after looking into the 
code and HDFS-6634, which added this logic.

The only effect I can find is scanning the in-progress segment for corruption, 
but the namenode can already handle a corrupted in-progress segment.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16596) Improve the processing capability of FsDatasetAsyncDiskService

2022-10-05 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16596.
-
Resolution: Duplicate

> Improve the processing capability of FsDatasetAsyncDiskService
> --
>
> Key: HDFS-16596
> URL: https://issues.apache.org/jira/browse/HDFS-16596
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In our production environment, when the DN needs to delete a large number of 
> blocks, we find that many deletion tasks are backlogged in the queue of the 
> threadPoolExecutor in FsDatasetAsyncDiskService. We can't improve its 
> throughput because the number of core threads is hard coded.
> So the DN needs to make the number of core threads of 
> FsDatasetAsyncDiskService configurable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16798) SerialNumberMap should decrease the current if the old already exist

2022-10-08 Thread ZanderXu (Jira)
ZanderXu created HDFS-16798:
---

 Summary: SerialNumberMap should decrease the current if the old 
already exist
 Key: HDFS-16798
 URL: https://issues.apache.org/jira/browse/HDFS-16798
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


While looking into some XATTR-related code, I found a bug in 
SerialNumberMap, as below:
{code:java}
public int get(T t) {
  if (t == null) {
return 0;
  }
  Integer sn = t2i.get(t);
  if (sn == null) {
sn = current.getAndIncrement();
if (sn > max) {
  current.getAndDecrement();
  throw new IllegalStateException(name + ": serial number map is full");
}
Integer old = t2i.putIfAbsent(t, sn);
if (old != null) {
  // here: if the old is not null, we should decrease the current value.
  return old;
}
i2t.put(sn, t);
  }
  return sn;
} {code}
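
A minimal sketch of the fix, rolling back the allocated serial number when 
another thread won the race:
{code:java}
Integer old = t2i.putIfAbsent(t, sn);
if (old != null) {
  // Another thread already mapped this entry; give back the serial number
  // we allocated so the counter does not leak.
  current.getAndDecrement();
  return old;
}
{code}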



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16774) Improve async delete replica on datanode

2022-10-11 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16774.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

> Improve async delete replica on datanode
> 
>
> Key: HDFS-16774
> URL: https://issues.apache.org/jira/browse/HDFS-16774
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> In our online cluster, a large number of ReplicaNotFoundExceptions occur when 
> clients read data.
> After tracing the root cause, we found it is caused by the asynchronous 
> replica deletion operation: many stacked pending deletions lead to the 
> ReplicaNotFoundException.
> Currently, the asynchronous replica deletion process is as follows:
> 1. remove the replica from the ReplicaMap
> 2. delete the replica file on the disk [blocked in threadpool]
> 3. notify the namenode through IBR [blocked in threadpool]
> In order to avoid similar problems as much as possible, consider optimizing 
> the execution flow:
> Deleting the replica from the ReplicaMap, deleting the replica from disk, and 
> notifying the namenode through IBR should be processed in the same 
> asynchronous thread.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16804) AddVolume contains a race condition with shutdown block pool

2022-10-14 Thread ZanderXu (Jira)
ZanderXu created HDFS-16804:
---

 Summary: AddVolume contains a race condition with shutdown block 
pool
 Key: HDFS-16804
 URL: https://issues.apache.org/jira/browse/HDFS-16804
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


AddVolume contains a race condition with shutting down a block pool, which can 
leave the ReplicaMap still containing some blocks that belong to the removed 
block pool.

And the new volume still contains one unused BlockPoolSlice belonging to the 
removed block pool, causing some problems, such as incorrect dfsUsed and 
incorrect numBlocks of the volume.

Let's review the logic of addVolume and shutdownBlockPool respectively.

 

AddVolume Logic:
 * Step1: Get all namespaceInfo from blockPoolManager
 * Step2: Create one temporary FsVolumeImpl object
 * Step3: Create some blockPoolSlice according to the namespaceInfo and add 
them to the temporary FsVolumeImpl object
 * Step4: Scan all blocks of the namespaceInfo from the volume and store them 
by one temporary ReplicaMap
 * Step5: Activate the temporary FsVolumeImpl created before (with the 
FsDatasetImpl synchronized lock)
 ** Step5.1: Merge all blocks of the temporary ReplicaMap to the global 
ReplicaMap
 ** Step5.2: Add the FsVolumeImpl to the volumes

ShutdownBlockPool Logic: (with blockPool write lock)
 * Step1: Cleanup the blockPool from the global ReplicaMap
 * Step2: Shutdown the block pool from all the volumes
 ** Step2.1: do some cleanup operations for the block pool, such as saveReplica, 
saveDfsUsed, etc.
 ** Step2.2: remove the blockPool from bpSlices

The race condition can be reproduced by the following steps:
 * AddVolume Step1: Get all namespaceInfo from blockPoolManager
 * ShutdownBlockPool Step1: Cleanup the blockPool from the global ReplicaMap
 * ShutdownBlockPool Step2: Shutdown the block pool from all the volumes
 * AddVolume Step 2~5

Actual result:
 * The global replicaMap contains some blocks belonging to the removed blockPool
 * The bpSlices of the FsVolumeImpl contain one blockPoolSlice belonging to the 
removed blockPool

Expected result:
 * The global replicaMap shouldn't contain any blocks belonging to the removed 
blockPool
 * The bpSlices of any FsVolumeImpl shouldn't contain any blockPoolSlice 
belonging to the removed blockPool



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16783) Remove the redundant lock in deepCopyReplica

2022-09-27 Thread ZanderXu (Jira)
ZanderXu created HDFS-16783:
---

 Summary: Remove the redundant lock in deepCopyReplica
 Key: HDFS-16783
 URL: https://issues.apache.org/jira/browse/HDFS-16783
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


While patching the fine-grained locking of the datanode, I found a redundant 
lock in deepCopyReplica; maybe we can remove it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16756) RBF proxies the client's user by the login user to enable CacheEntry

2022-09-05 Thread ZanderXu (Jira)
ZanderXu created HDFS-16756:
---

 Summary: RBF proxies the client's user by the login user to enable 
CacheEntry
 Key: HDFS-16756
 URL: https://issues.apache.org/jira/browse/HDFS-16756
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


RBF only proxies the client's user via the login user for Kerberos 
authentication. If the cluster uses the SIMPLE authentication method, RBF 
will not proxy the client's user via the login user, so the downstream 
namespace will not use the real clientIp, clientPort, clientId and callId 
even if the namenode has configured dfs.namenode.ip-proxy-users.

 

And the related code is as below:
{code:java}
UserGroupInformation connUGI = ugi;
if (UserGroupInformation.isSecurityEnabled()) {
  UserGroupInformation routerUser = UserGroupInformation.getLoginUser();
  connUGI = UserGroupInformation.createProxyUser(
  ugi.getUserName(), routerUser);
} {code}
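
A minimal sketch of the fix, assuming the intent is simply to drop the 
security-enabled guard so the proxy user is always created:
{code:java}
// Always wrap the caller in a proxy user so the downstream namenode sees
// the real client, regardless of the authentication method.
UserGroupInformation routerUser = UserGroupInformation.getLoginUser();
UserGroupInformation connUGI = UserGroupInformation.createProxyUser(
    ugi.getUserName(), routerUser);
{code}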



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16715) HostRestrictingAuthorizationFilter#matchRule should catch IOException

2022-08-03 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16715.
-
Resolution: Invalid

> HostRestrictingAuthorizationFilter#matchRule should catch IOException
> -
>
> Key: HDFS-16715
> URL: https://issues.apache.org/jira/browse/HDFS-16715
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>
> The latest trunk build failed because 
> HostRestrictingAuthorizationFilter#matchRule didn't catch the IOException 
> introduced by HADOOP-18301.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16715) HostRestrictingAuthorizationFilter#matchRule should catch IOException

2022-08-03 Thread ZanderXu (Jira)
ZanderXu created HDFS-16715:
---

 Summary: HostRestrictingAuthorizationFilter#matchRule should catch 
IOException
 Key: HDFS-16715
 URL: https://issues.apache.org/jira/browse/HDFS-16715
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


The latest trunk build failed because HostRestrictingAuthorizationFilter#matchRule 
didn't catch the IOException introduced by HADOOP-18301.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16718) Improve Code with Lambda in org.apache.hadoop.hdfs.server.datanode packages

2022-08-05 Thread ZanderXu (Jira)
ZanderXu created HDFS-16718:
---

 Summary: Improve Code with Lambda in 
org.apache.hadoop.hdfs.server.datanode packages
 Key: HDFS-16718
 URL: https://issues.apache.org/jira/browse/HDFS-16718
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


Improve Code with Lambda in org.apache.hadoop.hdfs.server.datanode packages.

For example:
Current logic:
{code:java}
synchronized void startAll() throws IOException {
  try {
UserGroupInformation.getLoginUser().doAs(
new PrivilegedExceptionAction<Object>() {
  @Override
  public Object run() throws Exception {
for (BPOfferService bpos : offerServices) {
  bpos.start();
}
return null;
  }
});
  } catch (InterruptedException ex) {
IOException ioe = new IOException();
ioe.initCause(ex.getCause());
throw ioe;
  }
}{code}
Improved Code with Lambda:
{code:java}
synchronized void startAll() throws IOException {
  try {
UserGroupInformation.getLoginUser().doAs(
(PrivilegedExceptionAction<Object>) () -> {
  for (BPOfferService bpos : offerServices) {
bpos.start();
  }
  return null;
});
  } catch (InterruptedException ex) {
IOException ioe = new IOException();
ioe.initCause(ex.getCause());
throw ioe;
  }
}{code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16719) Remove useless import in hadoop-hdfs module

2022-08-05 Thread ZanderXu (Jira)
ZanderXu created HDFS-16719:
---

 Summary: Remove useless import in hadoop-hdfs module
 Key: HDFS-16719
 URL: https://issues.apache.org/jira/browse/HDFS-16719
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


While reading code of the hadoop-hdfs module, I found some unused imports 
in some classes, such as the unused import of TrustedChannelResolver in 
WhitelistBasedTrustedChannelResolver.java:
{code:java}
import org.apache.hadoop.hdfs.protocol.datatransfer.TrustedChannelResolver 
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16720) Import detailed class instead of import * in hadoop-hdfs module

2022-08-05 Thread ZanderXu (Jira)
ZanderXu created HDFS-16720:
---

 Summary: Import detailed class instead of import * in hadoop-hdfs 
module
 Key: HDFS-16720
 URL: https://issues.apache.org/jira/browse/HDFS-16720
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


While reading code in the hadoop-hdfs module, I found that some classes use 
wildcard imports (import *), such as:
{code:java}
// BlockPlacementPolicyDefault.java
import java.util.*;{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16719) Remove unused import in hadoop-hdfs module

2022-08-05 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16719.
-
Resolution: Invalid

> Remove unused import in hadoop-hdfs module
> --
>
> Key: HDFS-16719
> URL: https://issues.apache.org/jira/browse/HDFS-16719
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> While reading code of the hadoop-hdfs module, I found some unused imports 
> in some classes, such as the unused import of TrustedChannelResolver in 
> WhitelistBasedTrustedChannelResolver.java:
> {code:java}
> import org.apache.hadoop.hdfs.protocol.datatransfer.TrustedChannelResolver 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16720) Import detailed class instead of import * in hadoop-hdfs module

2022-08-05 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16720.
-
Resolution: Invalid

> Import detailed class instead of import * in hadoop-hdfs module
> ---
>
> Key: HDFS-16720
> URL: https://issues.apache.org/jira/browse/HDFS-16720
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> While reading code in the hadoop-hdfs module, I found that some classes use 
> wildcard imports (import *), such as:
> {code:java}
> // BlockPlacementPolicyDefault.java
> import java.util.*;{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16723) Replace incorrect SafeModeException with StandbyException in RouterRpcServer.class

2022-08-07 Thread ZanderXu (Jira)
ZanderXu created HDFS-16723:
---

 Summary: Replace incorrect SafeModeException with StandbyException 
in RouterRpcServer.class
 Key: HDFS-16723
 URL: https://issues.apache.org/jira/browse/HDFS-16723
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


Incorrect code as below:
{code:java}
/**
 * ...
 * @throws SafeModeException If the Router is in safe mode and cannot serve
 *   client requests.
 */
void checkOperation(OperationCategory op)
throws StandbyException {
  ...
} {code}
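
A minimal sketch of the fix (the javadoc should reference the exception the 
method actually declares):
{code:java}
/**
 * ...
 * @throws StandbyException If the Router is in safe mode and cannot serve
 *   client requests.
 */
void checkOperation(OperationCategory op)
    throws StandbyException {
  ...
} {code}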



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-15310) RBF: Not proxy client's clientId and callId caused RetryCache invalid in NameNode.

2022-08-08 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-15310.
-
Resolution: Duplicate

> RBF: Not proxy client's clientId and callId caused RetryCache invalid in 
> NameNode.
> --
>
> Key: HDFS-15310
> URL: https://issues.apache.org/jira/browse/HDFS-15310
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Critical
>
> RBF does not proxy the client's clientId and callId to the NameNode, which 
> invalidates the RetryCache in the NameNode and can cause some RPCs to fail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16728) RBF throw IndexOutOfBoundsException with disableNameServices

2022-08-11 Thread ZanderXu (Jira)
ZanderXu created HDFS-16728:
---

 Summary: RBF throw IndexOutOfBoundsException with 
disableNameServices
 Key: HDFS-16728
 URL: https://issues.apache.org/jira/browse/HDFS-16728
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


RBF will throw an IndexOutOfBoundsException when the namespace is disabled.

Suppose we have a mount point /a/b -> ns0 -> /a/b and we disable ns0.

RBF will throw an IndexOutOfBoundsException while handling requests with paths 
starting with /a/b.
{code:java}
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0    at 
java.util.ArrayList.rangeCheck(ArrayList.java:657)
    at java.util.ArrayList.get(ArrayList.java:433)
    at 
org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.mkdirs(RouterClientProtocol.java:756)
    at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.mkdirs(RouterRpcServer.java:980)
 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16713) Improve Code with Lambda in org.apache.hadoop.hdfs.server.namenode sub packages

2022-08-02 Thread ZanderXu (Jira)
ZanderXu created HDFS-16713:
---

 Summary: Improve Code with Lambda in 
org.apache.hadoop.hdfs.server.namenode sub packages
 Key: HDFS-16713
 URL: https://issues.apache.org/jira/browse/HDFS-16713
 Project: Hadoop HDFS
  Issue Type: Improvement
 Environment: Improve Code with Lambda in 
org.apache.hadoop.hdfs.server.namenode sub packages.

For example:
Current logic:
{code:java}
public ListenableFuture<GetJournaledEditsResponseProto> getJournaledEdits(
    long fromTxnId, int maxTransactions) {
  return parallelExecutor.submit(
      new Callable<GetJournaledEditsResponseProto>() {
        @Override
        public GetJournaledEditsResponseProto call() throws IOException {
          return getProxy().getJournaledEdits(journalId, nameServiceId,
              fromTxnId, maxTransactions);
        }
      });
} {code}
Improved Code with Lambda:
{code:java}
public ListenableFuture<GetJournaledEditsResponseProto> getJournaledEdits(
    long fromTxnId, int maxTransactions) {
  return parallelExecutor.submit(() -> getProxy().getJournaledEdits(
      journalId, nameServiceId, fromTxnId, maxTransactions));
} {code}
 
Reporter: ZanderXu
Assignee: ZanderXu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16712) Fix incorrect placeholder in DataNode.java

2022-08-02 Thread ZanderXu (Jira)
ZanderXu created HDFS-16712:
---

 Summary: Fix incorrect placeholder in DataNode.java
 Key: HDFS-16712
 URL: https://issues.apache.org/jira/browse/HDFS-16712
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


Fix incorrect placeholder in DataNode.java
{code:java}
public String getDiskBalancerStatus() {
  try {
return getDiskBalancer().queryWorkStatus().toJsonString();
  } catch (IOException ex) {
// incorrect placeholder
LOG.debug("Reading diskbalancer Status failed. ex:{}", ex);
return "";
  }
} {code}
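
A minimal sketch of the fix, assuming the intent is to let SLF4J log the 
stack trace by passing the exception as a Throwable rather than as a 
placeholder argument:
{code:java}
LOG.debug("Reading diskbalancer Status failed.", ex);
{code}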



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16717) Replace NPE with IOException in DataNode.class

2022-08-04 Thread ZanderXu (Jira)
ZanderXu created HDFS-16717:
---

 Summary: Replace NPE with IOException in DataNode.class
 Key: HDFS-16717
 URL: https://issues.apache.org/jira/browse/HDFS-16717
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


In the current logic, if the storage is not yet initialized, DataNode.class 
will throw an NPE. Developers and SREs are very sensitive to NPEs, so I feel 
we can throw an IOException instead when the storage is not yet initialized.
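
A minimal sketch of the idea (the field name and message are hypothetical):
{code:java}
// Fail with a descriptive, checked exception instead of an implicit NPE.
if (storage == null) {
  throw new IOException("Storage not yet initialized");
}
{code}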



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16724) RBF should support getting the information about ancestor mount points

2022-08-08 Thread ZanderXu (Jira)
ZanderXu created HDFS-16724:
---

 Summary: RBF should support getting the information about ancestor 
mount points
 Key: HDFS-16724
 URL: https://issues.apache.org/jira/browse/HDFS-16724
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


Suppose an RBF cluster has 2 nameservices and two mount points as below:
 * /user/ns1 -> ns1 -> /user/ns1
 * /user/ns2 -> ns2 -> /user/ns2

Suppose we disable the default nameservice of the RBF cluster and try to 
getFileInfo on the path /user; RBF will throw an IOException to the client 
because it cannot find locations for path /user.

But in this case, RBF should return a valid response to the client, because 
/user has two sub mount points, ns1 and ns2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16709) Remove redundant cast in FSEditLogOp.class

2022-08-01 Thread ZanderXu (Jira)
ZanderXu created HDFS-16709:
---

 Summary: Remove redundant cast in FSEditLogOp.class
 Key: HDFS-16709
 URL: https://issues.apache.org/jira/browse/HDFS-16709
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


While reading some classes related to NameNode edits, I found many redundant 
casts in FSEditLogOp.class; I feel we should remove them.

Such as:
{code:java}
static UpdateBlocksOp getInstance(OpInstanceCache cache) {
  return (UpdateBlocksOp)cache.get(OP_UPDATE_BLOCKS);
} {code}
Because cache.get() already casts the result to T, as below:
{code:java}
@SuppressWarnings("unchecked")
public <T extends FSEditLogOp> T get(FSEditLogOpCodes opCode) {
  return useCache ? (T)CACHE.get().get(opCode) : (T)newInstance(opCode);
} {code}
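
A minimal sketch of the cleanup (the explicit cast can simply be dropped, 
since the return type is inferred from the call site):
{code:java}
static UpdateBlocksOp getInstance(OpInstanceCache cache) {
  return cache.get(OP_UPDATE_BLOCKS);
} {code}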



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16710) Remove redundant throw exceptions in org.apache.hadoop.hdfs.server.namenode package

2022-08-01 Thread ZanderXu (Jira)
ZanderXu created HDFS-16710:
---

 Summary: Remove redundant throw exceptions in 
org.apache.hadoop.hdfs.server.namenode package
 Key: HDFS-16710
 URL: https://issues.apache.org/jira/browse/HDFS-16710
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


While reading some classes of the HDFS NameNode, I found many redundant throws 
declarations in the org.apache.hadoop.hdfs.server.namenode package, such as:

 
{code:java}
public synchronized void transitionToObserver(StateChangeRequestInfo req)
throws ServiceFailedException, AccessControlException, IOException {
  checkNNStartup();
  nn.checkHaStateChange(req);
  nn.transitionToObserver();
} {code}
Because ServiceFailedException and AccessControlException are subclasses of 
IOException, declaring them is redundant; we can remove them to make the code 
clearer, as below:

 

 
{code:java}
public synchronized void transitionToObserver(StateChangeRequestInfo req)
throws IOException {
  checkNNStartup();
  nn.checkHaStateChange(req);
  nn.transitionToObserver();
} {code}
 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16831) [RBF SBN] GetNamenodesForNameserviceId should shuffle Observer NameNodes every time

2022-11-01 Thread ZanderXu (Jira)
ZanderXu created HDFS-16831:
---

 Summary: [RBF SBN] GetNamenodesForNameserviceId should shuffle 
Observer NameNodes every time
 Key: HDFS-16831
 URL: https://issues.apache.org/jira/browse/HDFS-16831
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


The method getNamenodesForNameserviceId in MembershipNamenodeResolver.class 
should shuffle Observer NameNodes every time. The current logic returns the 
cached list, which causes all read requests to be forwarded to the first 
observer namenode.

 

The related code is as below:
{code:java}
@Override
public List<? extends FederationNamenodeContext> getNamenodesForNameserviceId(
    final String nsId, boolean listObserversFirst) throws IOException {

  List<? extends FederationNamenodeContext> ret = cacheNS.get(Pair.of(nsId,
      listObserversFirst));
  if (ret != null) {
return ret;
  } 
  ...
}{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16831) [RBF SBN] GetNamenodesForNameserviceId should shuffle Observer NameNodes every time

2022-12-22 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16831.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> [RBF SBN] GetNamenodesForNameserviceId should shuffle Observer NameNodes 
> every time
> ---
>
> Key: HDFS-16831
> URL: https://issues.apache.org/jira/browse/HDFS-16831
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The method getNamenodesForNameserviceId in MembershipNamenodeResolver.class 
> should shuffle Observer NameNodes every time. The current logic returns the 
> cached list, which causes all read requests to be forwarded to the first 
> observer namenode.
>  
> The related code is as below:
> {code:java}
> @Override
> public List<? extends FederationNamenodeContext> getNamenodesForNameserviceId(
>     final String nsId, boolean listObserversFirst) throws IOException {
>   List<? extends FederationNamenodeContext> ret = cacheNS.get(Pair.of(nsId,
>       listObserversFirst));
>   if (ret != null) {
> return ret;
>   } 
>   ...
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16879) EC : Fsck -blockId shows number of redundant internal block replicas for EC Blocks

2023-01-03 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16879.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> EC : Fsck -blockId shows number of redundant internal block replicas for EC 
> Blocks
> --
>
> Key: HDFS-16879
> URL: https://issues.apache.org/jira/browse/HDFS-16879
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> For a block of an EC file, running hdfs fsck -blockId xxx can additionally 
> show the number of redundant internal block replicas.
> For example: the current block group has 10 live replicas, but fsck shows 
> there are 9 live replicas.
> Actually, one live replica should be in the redundant state, so we can add 
> "No. of redundant Replica: 1" to the output.
> {code:java}
> hdfs fsck -blockId blk_-xxx
> Block Id: blk_-xxx
> Block belongs to: /ec/file1
> No. of Expected Replica: 9
> No. of live Replica: 9
> No. of excess Replica: 0
> No. of stale Replica: 0
> No. of decommissioned Replica: 0
> No. of decommissioning Replica: 0
> No. of corrupted Replica: 0
> Block replica on datanode/rack: ip-xxx1 is HEALTHY
> Block replica on datanode/rack: ip-xxx2 is HEALTHY
> Block replica on datanode/rack: ip-xxx3 is HEALTHY
> Block replica on datanode/rack: ip-xxx4 is HEALTHY
> Block replica on datanode/rack: ip-xxx5 is HEALTHY
> Block replica on datanode/rack: ip-xxx6 is HEALTHY
> Block replica on datanode/rack: ip-xxx7 is HEALTHY
> Block replica on datanode/rack: ip-xxx8 is HEALTHY
> Block replica on datanode/rack: ip-xxx9 is HEALTHY
> Block replica on datanode/rack: ip-xxx10 is HEALTHY
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16837) [RBF SBN] ClientGSIContext should merge RouterFederatedStates to get the max state id for each namespace

2022-12-05 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16837.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> [RBF SBN] ClientGSIContext should merge RouterFederatedStates to get the max 
> state id for each namespace
> 
>
> Key: HDFS-16837
> URL: https://issues.apache.org/jira/browse/HDFS-16837
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> ClientGSIContext should merge local and remote RouterFederatedState to get 
> the max state id for each namespace.
> And the related code is as below:
> {code:java}
> @Override
> public synchronized void receiveResponseState(RpcResponseHeaderProto header) {
>   if (header.hasRouterFederatedState()) {
> // BUG here
> routerFederatedState = header.getRouterFederatedState();
>   } else {
> lastSeenStateId.accumulate(header.getStateId());
>   }
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16865) The source path is always / after RBF proxied the complete, addBlock and getAdditionalDatanode RPC.

2022-12-09 Thread ZanderXu (Jira)
ZanderXu created HDFS-16865:
---

 Summary: The source path is always / after RBF proxied the 
complete, addBlock and getAdditionalDatanode RPC.
 Key: HDFS-16865
 URL: https://issues.apache.org/jira/browse/HDFS-16865
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


The source path is always / after RBF proxies the complete, addBlock and 
getAdditionalDatanode RPCs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16853) The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed because HADOOP-18324

2022-11-23 Thread ZanderXu (Jira)
ZanderXu created HDFS-16853:
---

 Summary: The UT 
TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed because 
HADOOP-18324
 Key: HDFS-16853
 URL: https://issues.apache.org/jira/browse/HDFS-16853
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed with 
the error message "Waiting for cluster to become active". The blocking jstack 
is as below:
{code:java}
"BP-1618793397-192.168.3.4-1669198559828 heartbeating to 
localhost/127.0.0.1:54673" #260 daemon prio=5 os_prio=31 tid=0x
7fc1108fa000 nid=0x19303 waiting on condition [0x700017884000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0007430a9ec0> (a 
java.util.concurrent.SynchronousQueue$TransferQueue)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.SynchronousQueue$TransferQueue.awaitFulfill(SynchronousQueue.java:762)
        at 
java.util.concurrent.SynchronousQueue$TransferQueue.transfer(SynchronousQueue.java:695)
        at java.util.concurrent.SynchronousQueue.put(SynchronousQueue.java:877)
        at 
org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1186)
        at org.apache.hadoop.ipc.Client.call(Client.java:1482)
        at org.apache.hadoop.ipc.Client.call(Client.java:1429)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
        at com.sun.proxy.$Proxy23.sendHeartbeat(Unknown Source)
        at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:168)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:570)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:714)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:915)
        at java.lang.Thread.run(Thread.java:748)  {code}
After looking into the code, I found that this bug was introduced by 
HADOOP-18324: RpcRequestSender exited without cleaning up the 
rpcRequestQueue, which caused BPServiceActor to block while sending a request.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16841) Enhance the function of DebugAdmin#VerifyECCommand

2022-11-23 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16841.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Enhance the function of DebugAdmin#VerifyECCommand
> --
>
> Key: HDFS-16841
> URL: https://issues.apache.org/jira/browse/HDFS-16841
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Currently DebugAdmin#VerifyECCommand supports verifying the correctness of 
> erasure coding on a file. When the first failing block group occurs during 
> verification, the verification ends.
> 1. Consider adding an option to control whether to ignore failing block 
> groups. If set, verification ignores failing block groups and continues 
> verifying all block groups of the file.
> 2. Add option support for specifying the block group to verify.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16813) Remove parameter validation logic such as dfs.namenode.decommission.blocks.per.interval in DatanodeAdminManager#activate

2022-11-22 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16813.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Remove parameter validation logic such as 
> dfs.namenode.decommission.blocks.per.interval in DatanodeAdminManager#activate
> 
>
> Key: HDFS-16813
> URL: https://issues.apache.org/jira/browse/HDFS-16813
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> In DatanodeAdminManager#activate
> {code:java}
> int blocksPerInterval = conf.getInt(
> DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_BLOCKS_PER_INTERVAL_KEY,
> DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_BLOCKS_PER_INTERVAL_DEFAULT);
> final String deprecatedKey =
> "dfs.namenode.decommission.nodes.per.interval";
> final String strNodes = conf.get(deprecatedKey);
> if (strNodes != null) {
>   LOG.warn("Deprecated configuration key {} will be ignored.",
>   deprecatedKey);
>   LOG.warn("Please update your configuration to use {} instead.",
>   DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_BLOCKS_PER_INTERVAL_KEY);
> }
> checkArgument(blocksPerInterval > 0,
> "Must set a positive value for "
> + DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_BLOCKS_PER_INTERVAL_KEY);
> final int maxConcurrentTrackedNodes = conf.getInt(
> DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_MAX_CONCURRENT_TRACKED_NODES,
> DFSConfigKeys
> .DFS_NAMENODE_DECOMMISSION_MAX_CONCURRENT_TRACKED_NODES_DEFAULT);
> checkArgument(maxConcurrentTrackedNodes >= 0, "Cannot set a negative " +
> "value for "
> + DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_MAX_CONCURRENT_TRACKED_NODES);
> {code}
> There is no need to validate the parameters
> dfs.namenode.decommission.blocks.per.interval and
> dfs.namenode.decommission.max.concurrent.tracked.nodes here,
> because the parameters are already processed in DatanodeAdminMonitorBase and 
> DatanodeAdminDefaultMonitor.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16826) [RBF SBN] ConnectionManager should advance the client stateId for every request

2022-11-24 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16826.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> [RBF SBN] ConnectionManager should advance the client stateId for every 
> request
> ---
>
> Key: HDFS-16826
> URL: https://issues.apache.org/jira/browse/HDFS-16826
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> ConnectionManager should advance the client stateId for every request, 
> whether the pool is null or not.
>  
> The buggy code is as below:
> {code:java}
> // Create the pool if not created before
> if (pool == null) {
>   writeLock.lock();
>   try {
> pool = this.pools.get(connectionId);
> if (pool == null) {
>   pool = new ConnectionPool(
>   this.conf, nnAddress, ugi, this.minSize, this.maxSize,
>   this.minActiveRatio, protocol,
>   new PoolAlignmentContext(this.routerStateIdContext, nsId));
>   this.pools.put(connectionId, pool);
>   this.connectionPoolToNamespaceMap.put(connectionId, nsId);
> }
> // BUG Here
> long clientStateId = 
> RouterStateIdContext.getClientStateIdFromCurrentCall(nsId);
> pool.getPoolAlignmentContext().advanceClientStateId(clientStateId);
>   } finally {
> writeLock.unlock();
>   }
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16779) Add ErasureCodingPolicy information to the response description for GETFILESTATUS in WebHDFS.md

2022-11-24 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16779.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Add ErasureCodingPolicy information to the response description for 
> GETFILESTATUS in WebHDFS.md
> ---
>
> Key: HDFS-16779
> URL: https://issues.apache.org/jira/browse/HDFS-16779
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> [WebHDFS_GETFILESTATUS|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Status_of_a_FileDirectory]
>  doesn't contain the ErasureCodingPolicy information. We should add it to 
> make it easier for end users to get the EC policy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16837) [RBF SBN] ClientGSIContext should merge RouterFederatedStates to get the max state id for each namespace

2022-11-10 Thread ZanderXu (Jira)
ZanderXu created HDFS-16837:
---

 Summary: [RBF SBN] ClientGSIContext should merge 
RouterFederatedStates to get the max state id for each namespace
 Key: HDFS-16837
 URL: https://issues.apache.org/jira/browse/HDFS-16837
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


ClientGSIContext should merge local and remote RouterFederatedState to get the 
max state id for each namespace.

And the related code is as below:
{code:java}
@Override
public synchronized void receiveResponseState(RpcResponseHeaderProto header) {
  if (header.hasRouterFederatedState()) {
// BUG here
routerFederatedState = header.getRouterFederatedState();
  } else {
lastSeenStateId.accumulate(header.getStateId());
  }
} {code}
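
A minimal sketch of the merge, assuming hypothetical helpers that decode and 
re-encode the per-namespace state map (the helper names are illustrative, not 
the actual API):
{code:java}
// Take the per-namespace max of the local and the newly received federated
// state instead of overwriting the local state wholesale.
Map<String, Long> local = decodeRouterFederatedState(routerFederatedState);
Map<String, Long> remote =
    decodeRouterFederatedState(header.getRouterFederatedState());
remote.forEach((nsId, stateId) -> local.merge(nsId, stateId, Long::max));
routerFederatedState = encodeRouterFederatedState(local);
{code}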



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16838) Fix NPE in testAddRplicaProcessorForAddingReplicaInMap

2022-11-10 Thread ZanderXu (Jira)
ZanderXu created HDFS-16838:
---

 Summary: Fix NPE in testAddRplicaProcessorForAddingReplicaInMap
 Key: HDFS-16838
 URL: https://issues.apache.org/jira/browse/HDFS-16838
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


There is an NPE in 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList#testAddRplicaProcessorForAddingReplicaInMap
 if we run this UT individually. The related bug is as below:

 
{code:java}
public void testAddRplicaProcessorForAddingReplicaInMap() throws Exception {
  // BUG here
  BlockPoolSlice.reInitializeAddReplicaThreadPool();
  Configuration cnf = new Configuration();
  int poolSize = 5; 
  ...
}{code}
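
A minimal sketch of a possible fix, guarding the re-initialization against a 
pool that was never created (the method body is an assumption, not the actual 
patch):
{code:java}
public static void reInitializeAddReplicaThreadPool() {
  // Guard against the pool not having been initialized yet, e.g. when the
  // test runs individually.
  if (addReplicaThreadPool != null) {
    addReplicaThreadPool.shutdown();
    addReplicaThreadPool = null;
  }
} {code}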
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16834) PoolAlignmentContext should not max poolLocalStateId with sharedGlobalStateId when sending requests to the namenode.

2022-11-10 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16834.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> PoolAlignmentContext should not max poolLocalStateId with sharedGlobalStateId 
> when sending requests to the namenode.
> 
>
> Key: HDFS-16834
> URL: https://issues.apache.org/jira/browse/HDFS-16834
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Simbarashe Dzinamarira
>Assignee: Simbarashe Dzinamarira
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When sending requests to the namenode, we should only use the 
> poolLocalStateId. Maxing it with the sharedGlobalStateId forces reads to be 
> consistent across all clients using a router, which is unnecessary and leads 
> to more waiting on the observer.
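>
> A minimal sketch of the intended behavior, assuming PoolAlignmentContext tracks a poolLocalStateId separate from the router-wide sharedGlobalStateId (field names are illustrative):
> {code:java}
> @Override
> public void updateRequestState(RpcRequestHeaderProto.Builder header) {
>   // Send only the pool-local state id; do not max it with the
>   // shared global state id.
>   header.setStateId(poolLocalStateId.get());
> }
> {code}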



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16962) The blockReport RPC should not update the lastBlockReportTime if this blockReport is ignored

2023-03-22 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16962.
-
Resolution: Invalid

> The blockReport RPC should not update the lastBlockReportTime if this 
> blockReport is ignored
> 
>
> Key: HDFS-16962
> URL: https://issues.apache.org/jira/browse/HDFS-16962
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> The blockReport RPC should not update the lastBlockReportTime if this 
> blockReport is ignored. The related code is as below:
> {code:java}
> public DatanodeCommand blockReport(final DatanodeRegistration nodeReg,
>   String poolId, final StorageBlockReport[] reports,
>   final BlockReportContext context) throws IOException {
>   // code placeholder
>   ...
>   try {
> // this blockReport may be ignored if bm.checkBlockReportLease returns
> // false
> if (bm.checkBlockReportLease(context, nodeReg)) {
>   // code placeholder
>   ...
> } 
>   }
>   // If this blockReport is ignored, the removeBRLeaseIfNeeded should not
>   // update the lastBlockReportTime
>   bm.removeBRLeaseIfNeeded(nodeReg, context);
>   // code placeholder
>   ...
>   return null;
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16962) The blockReport RPC should not update the lastBlockReportTime if this blockReport is ignored

2023-03-21 Thread ZanderXu (Jira)
ZanderXu created HDFS-16962:
---

 Summary: The blockReport RPC should not update the 
lastBlockReportTime if this blockReport is ignored
 Key: HDFS-16962
 URL: https://issues.apache.org/jira/browse/HDFS-16962
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


The blockReport RPC should not update the lastBlockReportTime if this 
blockReport is ignored. The related code is as below:
{code:java}
public DatanodeCommand blockReport(final DatanodeRegistration nodeReg,
  String poolId, final StorageBlockReport[] reports,
  final BlockReportContext context) throws IOException {
  // code placeholder
  ...
  try {
    // this blockReport may be ignored if bm.checkBlockReportLease returns false
if (bm.checkBlockReportLease(context, nodeReg)) {
  // code placeholder
  ...
} 
  }
  // If this blockReport is ignored, the removeBRLeaseIfNeeded should not
  // update the lastBlockReportTime
  bm.removeBRLeaseIfNeeded(nodeReg, context);

  // code placeholder
  ...

  return null;
} {code}
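
A minimal sketch of the change proposed here, threading the lease-check result through so an ignored report does not refresh the lease bookkeeping (shape is illustrative only, not a committed patch):
{code:java}
boolean leaseValid = bm.checkBlockReportLease(context, nodeReg);
if (leaseValid) {
  // ... process the storage reports ...

  // Only remove the lease and update lastBlockReportTime for reports
  // that were actually accepted.
  bm.removeBRLeaseIfNeeded(nodeReg, context);
}
{code}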



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16961) The blockReport RPC should throw UnregisteredNodeException when the storedDN is null

2023-03-21 Thread ZanderXu (Jira)
ZanderXu created HDFS-16961:
---

 Summary: The blockReport RPC should throw 
UnregisteredNodeException when the storedDN is null
 Key: HDFS-16961
 URL: https://issues.apache.org/jira/browse/HDFS-16961
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


The blockReport RPC should throw UnregisteredNodeException when the storedDN is 
null. The related code is as below:
{code:java}
public void removeBRLeaseIfNeeded(final DatanodeID nodeID,
final BlockReportContext context) throws IOException {
  namesystem.writeLock(OperationName.REMOVE_BR_LEASE_IF_NEEDED);
  DatanodeDescriptor node;
  try {
    // Here, if the node is null, we should throw UnregisteredNodeException
    // instead of an NPE
    node = datanodeManager.getDatanode(nodeID);
if (context != null) {
  if (context.getTotalRpcs() == context.getCurRpc() + 1) {
long leaseId = this.getBlockReportLeaseManager().removeLease(node);
BlockManagerFaultInjector.getInstance().
removeBlockReportLease(node, leaseId);
node.setLastBlockReportTime(now());
node.setLastBlockReportMonotonic(Time.monotonicNow());
  }
  LOG.debug("Processing RPC with index {} out of total {} RPCs in "
  + "processReport 0x{}", context.getCurRpc(),
  context.getTotalRpcs(), Long.toHexString(context.getReportId()));
}
  } finally {
namesystem.writeUnlock(OperationName.REMOVE_BR_LEASE_IF_NEEDED);
  }
}{code}
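
A minimal sketch of the suggested guard (hedged: the exact exception constructor used by any eventual patch may differ):
{code:java}
node = datanodeManager.getDatanode(nodeID);
if (node == null) {
  // Fail fast with a meaningful exception instead of hitting an NPE
  // further down when the DataNode is not registered.
  throw new UnregisteredNodeException(nodeID, null);
}
{code}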



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16960) Remove useless getEditLog().logSync() for setTimes(), setPermission() and setQuota() RPC

2023-03-21 Thread ZanderXu (Jira)
ZanderXu created HDFS-16960:
---

 Summary: Remove useless getEditLog().logSync() for setTimes(), 
setPermission() and setQuota() RPC
 Key: HDFS-16960
 URL: https://issues.apache.org/jira/browse/HDFS-16960
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: ZanderXu
Assignee: ZanderXu


Remove the useless getEditLog().logSync() calls for the setTimes(), setPermission() and 
setQuota() RPCs when nothing has changed.
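
An illustrative shape of the idea (names are simplified placeholders, not the actual patch):
{code:java}
// Only force an edit-log sync when the operation actually logged an edit.
boolean changed = dir.setTimes(src, mtime, atime);  // illustrative: true only if an edit was logged
if (changed) {
  getEditLog().logSync();
}
{code}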



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16933) A race in SerialNumberMap will cause wrong ownership

2023-02-24 Thread ZanderXu (Jira)
ZanderXu created HDFS-16933:
---

 Summary: A race in SerialNumberMap will cause wrong ownership
 Key: HDFS-16933
 URL: https://issues.apache.org/jira/browse/HDFS-16933
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


If the namenode enables parallel fsimage loading, a race in 
SerialNumberMap will cause wrong ownership for INodes.
{code:java}
public int get(T t) {
  if (t == null) {
return 0;
  }
  Integer sn = t2i.get(t);
  if (sn == null) {
    // Assume there are two threads with different t, such as:
    // T1 with hbase
    // T2 with hdfs
    // If T1 and T2 get the sn at the same time, they can end up with the
    // same sn, such as 10
sn = current.getAndIncrement();
if (sn > max) {
  current.getAndDecrement();
  throw new IllegalStateException(name + ": serial number map is full");
}
Integer old = t2i.putIfAbsent(t, sn);
if (old != null) {
  current.getAndDecrement();
  return old;
}
    // If T1 puts 10 -> hbase into i2t first, T2 will overwrite it with
    // 10 -> hdfs. As a result, the INodes will get the wrong owner hdfs
    // when it should actually be hbase.
i2t.put(sn, t);
  }
  return sn;
} {code}
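
A minimal sketch of one way to close the race (illustrative, not the committed patch): take a lock around the slow allocation path with a re-check, so a serial number can never be handed out twice or overwritten.
{code:java}
public int get(T t) {
  if (t == null) {
    return 0;
  }
  Integer sn = t2i.get(t);
  if (sn == null) {
    synchronized (this) {
      sn = t2i.get(t);  // re-check under the lock
      if (sn == null) {
        sn = current.getAndIncrement();
        if (sn > max) {
          current.getAndDecrement();
          throw new IllegalStateException(name + ": serial number map is full");
        }
        i2t.put(sn, t);  // publish the reverse mapping first
        t2i.put(t, sn);
      }
    }
  }
  return sn;
}
{code}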



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16987) HA Failover may cause some corrupted blocks

2023-04-23 Thread ZanderXu (Jira)
ZanderXu created HDFS-16987:
---

 Summary: HA Failover may cause some corrupted blocks
 Key: HDFS-16987
 URL: https://issues.apache.org/jira/browse/HDFS-16987
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


In our prod environment, we encountered an incident where HA failover produced 
some newly corrupted blocks, causing some jobs to fail.

 

We traced it down and found a bug in the processing of all pending DN messages 
when starting active services.

The steps to reproduce are as follows:
 # Suppose NN1 is Active and NN2 is Standby; Active works well and Standby is 
unstable
 # Timing 1, the client creates a file, writes some data and closes it.
 # Timing 2, the client appends to this file, writes some data and closes it.
 # Timing 3, Standby replays the second closing edits of this file.
 # Timing 4, Standby processes the blockReceivedAndDeleted of the first create 
operation.
 # Timing 5, Standby processes the blockReceivedAndDeleted of the second append 
operation.
 # Timing 6, the admin switches the active namenode from NN1 to NN2.
 # Timing 7, the client fails to append some data to this file.

{code:java}
org.apache.hadoop.ipc.RemoteException(java.io.IOException): append: 
lastBlock=blk_1073741825_1002 of src=/testCorruptedBlockAfterHAFailover is not 
sufficiently replicated yet.
    at 
org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:138)
    at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2992)
    at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:858)
    at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:527)
    at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
    at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
    at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1221)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1144)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3170) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16910) Fix incorrectly initializing RandomAccessFile caused flush performance decreased for JN

2023-02-08 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-16910.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix incorrectly initializing RandomAccessFile caused flush performance 
> decreased for JN
> ---
>
> Key: HDFS-16910
> URL: https://issues.apache.org/jira/browse/HDFS-16910
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> After backporting HDFS-15882 to our cluster, setting 
> shouldSyncWritesAndSkipFsync to false caused a flush performance 
> degradation in the JN.
> *Root Cause*:
> When shouldSyncWritesAndSkipFsync is set to false, the RandomAccessFile is 
> initialized in `rws` mode. 
> Even though fc.force(false) is executed during flushAndSync (the intent 
> being that only updates to the file's content, not its metadata, are written 
> synchronously to storage), 
> the `rws` mode requires updates to both the file's content and its metadata 
> to be written, 
> which degrades JN flush performance.
> *Fix:*
> Initialize the RandomAccessFile with mode `rwd` instead of `rws`:
> rwd: Open for reading and writing, as with "rw", and also require that every 
> update to the file's content be written synchronously to the underlying 
> storage device.
> {code:java}
> if (shouldSyncWritesAndSkipFsync) {
>   rp = new RandomAccessFile(name, "rwd");
> } else {
>   rp = new RandomAccessFile(name, "rw");
> }
> {code}
> In this way, when flushAndSync is executed: if shouldSyncWritesAndSkipFsync 
> is false, the RandomAccessFile is in 'rw' mode and fc.force(false) is 
> called; otherwise the `rwd` mode performs the synchronous writes.
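>
> A minimal sketch of the flushAndSync counterpart (illustrative of the behavior described above):
> {code:java}
> // With "rw" mode we must force the content to storage ourselves; with
> // "rwd" every write is already synchronous, so the force is skipped.
> if (!shouldSyncWritesAndSkipFsync) {
>   fc.force(false);  // flush content only, not metadata
> }
> {code}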



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16923) The getListing RPC will throw NPE if the path does not exist

2023-02-15 Thread ZanderXu (Jira)
ZanderXu created HDFS-16923:
---

 Summary: The getListing RPC will throw NPE if the path does not 
exist
 Key: HDFS-16923
 URL: https://issues.apache.org/jira/browse/HDFS-16923
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu


The getListing RPC will throw an NPE if the path does not exist. The stack 
trace is as below:
{code:java}
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): 
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
java.lang.NullPointerException
    at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:4195)
    at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:1421)
    at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:783)
    at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:622)
    at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:590)
    at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
 {code}
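
A minimal sketch of the kind of guard that avoids the NPE (purely illustrative; the listing helper and post-processing step below are placeholders, not the committed patch):
{code:java}
DirectoryListing dl = getListingInt(dir, pc, src, startAfter, needLocation);
if (dl != null) {
  // Only touch the listing when the path exists; a null listing
  // means the path was not found.
  postProcess(dl);  // illustrative placeholder
}
return dl;
{code}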



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17019) Optimize the logic for reconfigure slow peer enable for Namenode

2023-06-07 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-17019.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

>  Optimize the logic for reconfigure slow peer enable for Namenode
> -
>
> Key: HDFS-17019
> URL: https://issues.apache.org/jira/browse/HDFS-17019
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The logic for reconfiguring slow peer enable for the Namenode requires the 
> following optimizations:
> 1. Make the SlowPeerTracker field slowPeerTracker volatile.
> 2. When starting the NameNode with dfs.datanode.peer.stats.enabled set to 
> false, DatanodeManager#startSlowPeerCollector() is not called, so the slow 
> peers collection thread 'slowPeerCollectorDaemon' is not started.
>  If dfs.datanode.peer.stats.enabled is then dynamically refreshed to true, 
> the current logic still does not call 
> DatanodeManager#startSlowPeerCollector(), so 'slowPeerCollectorDaemon' is 
> not started as expected. This is what we optimize here; see the sketch 
> below.
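>
> A minimal sketch of the intended reconfigure handling (the stop method and handler shape are illustrative, not the committed patch):
> {code:java}
> // Inside the reconfigure handler for dfs.datanode.peer.stats.enabled:
> boolean enable = Boolean.parseBoolean(newVal);
> if (enable) {
>   datanodeManager.startSlowPeerCollector();  // start the daemon on toggle-on
> } else {
>   datanodeManager.stopSlowPeerCollector();   // illustrative: stop it on toggle-off
> }
> {code}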



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17387) [FGL] Abstract selectable locking mode

2024-02-19 Thread ZanderXu (Jira)
ZanderXu created HDFS-17387:
---

 Summary: [FGL] Abstract selectable locking mode
 Key: HDFS-17387
 URL: https://issues.apache.org/jira/browse/HDFS-17387
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: ZanderXu


Abstract a lock mode to cover the current global lock and the new fine-grained 
locks (global FS lock and global BM lock).

End users can select the lock mode through configuration; see the sketch after 
the list below.

The possible lock modes after this patch are as follows:
 * GLOBAL Lock
 * FS Lock
 * BM Lock
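
A minimal sketch of what such a selectable abstraction could look like (all names here are illustrative, not the committed design):
{code:java}
// Illustrative lock-mode abstraction: callers name the logical lock they
// need; the configured implementation decides which physical lock backs it.
public enum LockMode { GLOBAL, FS, BM }

public interface FSNLockManager {
  void readLock(LockMode mode);
  void readUnlock(LockMode mode);
  void writeLock(LockMode mode);
  void writeUnlock(LockMode mode);
}
// A global-lock implementation maps every mode onto the single namesystem
// lock, while a fine-grained implementation backs FS and BM separately.
{code}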



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17389) [FGL] Create RPC minimizes the scope of the global BM lock

2024-02-19 Thread ZanderXu (Jira)
ZanderXu created HDFS-17389:
---

 Summary: [FGL] Create RPC minimizes the scope of the global BM lock
 Key: HDFS-17389
 URL: https://issues.apache.org/jira/browse/HDFS-17389
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: ZanderXu


Minimize the scope of the global BM lock in the Create RPC, since it does not 
need the global BM lock in most cases. A sketch of the scoping idea follows.
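
A minimal sketch of the scoping idea, reusing the illustrative LockMode abstraction from the HDFS-17387 sketch above (not the committed patch):
{code:java}
// Namespace changes for create only need the FS lock; the BM lock is
// taken separately and only around the rare paths that actually touch
// block management.
lockManager.writeLock(LockMode.FS);
try {
  // resolve the path, check permissions, add the INode ...
} finally {
  lockManager.writeUnlock(LockMode.FS);
}
{code}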



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org


