[jira] [Created] (HDFS-17539) TestFileChecksum should not spin up a MiniDFSCluster for every test

2024-05-30 Thread Felix N (Jira)
Felix N created HDFS-17539:
--

 Summary: TestFileChecksum should not spin up a MiniDFSCluster for 
every test
 Key: HDFS-17539
 URL: https://issues.apache.org/jira/browse/HDFS-17539
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Felix N
Assignee: Felix N


TestFileChecksum has 34 tests. Add its parameterized COMPOSITE_CRC counterpart 
and a cluster is spun up and shut down 68 times, when twice would suffice (or 
perhaps even once, though twice is not too bad).
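One way to get down to a single cluster per test class is JUnit's class-level 
fixture hooks (@BeforeClass/@AfterClass, or @BeforeAll/@AfterAll in JUnit 5). 
The sketch below only illustrates the effect, with a hypothetical 
ExpensiveFixture standing in for MiniDFSCluster; it is not the actual test code:

```java
/**
 * Minimal sketch of sharing one expensive fixture across tests instead of
 * rebuilding it per test. ExpensiveFixture is a hypothetical stand-in for
 * MiniDFSCluster; in real JUnit code the build/teardown would live in
 * @BeforeClass/@AfterClass methods.
 */
public class SharedFixtureSketch {
    static int buildCount = 0;  // how many times the "cluster" was started

    static class ExpensiveFixture {
        ExpensiveFixture() { buildCount++; }   // stands in for cluster startup
        boolean isUp() { return true; }
    }

    private static ExpensiveFixture cluster;

    /** Lazily build the shared fixture once, like a @BeforeClass method. */
    static ExpensiveFixture cluster() {
        if (cluster == null) {
            cluster = new ExpensiveFixture();
        }
        return cluster;
    }

    // Two "tests" that both use the shared fixture.
    static boolean testChecksumA() { return cluster().isUp(); }
    static boolean testChecksumB() { return cluster().isUp(); }
}
```

With this shape, N tests cost one startup/shutdown instead of N.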



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17474) [FGL] Make INodeMap thread safe

2024-05-27 Thread Felix N (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849863#comment-17849863
 ] 

Felix N commented on HDFS-17474:


Hi [~zhanghaobo], a more detailed description will be in the PR once I'm done 
cleaning up the code (the code is roughly done but still needs some tidying 
up). Here's the rough idea: the underlying data structure used in INodeMap 
(and also in BlocksMap) is a GSet, and it should be made thread-safe (i.e. a 
ThreadSafeGSet). The simplest option is a ConcurrentHashMap, but a HashMap 
occupies far more memory than the current LightWeightGSet, which makes it 
unviable. I have 3 options in mind:
 * Just a LightWeightGSet, but with all its operations synchronized
 * LightWeightGSet with a lock per element; operations on an element 
synchronize on the lock assigned to that element
 * LightWeightGSet with a lock per group of elements (grouped in powers of 2); 
operations on an element synchronize on the lock assigned to that element's 
group. Alternatively, use a ReentrantReadWriteLock instead of synchronizing on 
a lock object.

I'm still benchmarking but the implementation I'm leaning towards right now is 
the 3rd option with lock objects.
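As a rough illustration of the third option (not the actual HDFS patch; the 
class and method names here are made up), lock striping with a power-of-two 
stripe count lets the stripe index be computed with a cheap bit mask:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Illustrative sketch of lock striping: one lock per group of elements,
 * with the stripe count rounded up to a power of two so the stripe index
 * is a bit-mask rather than a modulo. Not the actual HDFS code.
 */
public class StripedLocks {
    private final ReentrantReadWriteLock[] stripes;
    private final int mask;

    public StripedLocks(int requestedStripes) {
        // Round up to the next power of two ("grouped in powers of 2").
        int n = Integer.highestOneBit(Math.max(1, requestedStripes));
        if (n < requestedStripes) {
            n <<= 1;
        }
        stripes = new ReentrantReadWriteLock[n];
        for (int i = 0; i < n; i++) {
            stripes[i] = new ReentrantReadWriteLock();
        }
        mask = n - 1;
    }

    public int stripeCount() {
        return stripes.length;
    }

    /** Map an element (e.g. an inode id) to the lock guarding its group. */
    public ReentrantReadWriteLock lockFor(long elementId) {
        // Spread the hash before masking so nearby ids don't all collide.
        int h = Long.hashCode(elementId);
        h ^= (h >>> 16);
        return stripes[h & mask];
    }
}
```

An operation on an element would take `lockFor(id).readLock()` or 
`.writeLock()` around the GSet access, trading a little memory for far less 
contention than one global lock.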

> [FGL] Make INodeMap thread safe
> ---
>
> Key: HDFS-17474
> URL: https://issues.apache.org/jira/browse/HDFS-17474
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: Felix N
>Priority: Major
>
> Operations related to INodeMap should be handled by the namenode safely, 
> since operations may access or update INodeMap concurrently.






[jira] [Updated] (HDFS-17529) RBF: Improve router state store cache entry deletion

2024-05-22 Thread Felix N (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix N updated HDFS-17529:
---
Summary: RBF: Improve router state store cache entry deletion  (was: 
Improve router state store cache entry deletion)

> RBF: Improve router state store cache entry deletion
> 
>
> Key: HDFS-17529
> URL: https://issues.apache.org/jira/browse/HDFS-17529
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, rbf
>Reporter: Felix N
>Assignee: Felix N
>Priority: Major
>  Labels: pull-request-available
>
> The current implementation of router state store updates is quite 
> inefficient: when routers are removed and many NameNodeMembership records 
> are deleted in a short burst, the deletions triggered router safemode in our 
> cluster and caused a lot of trouble.
> This ticket aims to improve the deletion process for the ZK state store 
> implementation.
> See HDFS-17532 for the other half of this improvement.






[jira] [Created] (HDFS-17532) Allow router state store cache update to overwrite and delete in parallel

2024-05-20 Thread Felix N (Jira)
Felix N created HDFS-17532:
--

 Summary: Allow router state store cache update to overwrite and 
delete in parallel
 Key: HDFS-17532
 URL: https://issues.apache.org/jira/browse/HDFS-17532
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs, rbf
Reporter: Felix N
Assignee: Felix N


The current implementation of router state store updates is quite inefficient: 
when routers are removed and many NameNodeMembership records are deleted in a 
short burst, the deletions triggered router safemode in our cluster and caused 
a lot of trouble.

This ticket aims to allow the overwrite part and delete part of 
org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords
 to run in parallel.

See HDFS-17529 for the other half of this improvement.
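The parallelization itself can be as simple as submitting the two phases to a 
small executor and waiting for both. A hedged sketch follows; the phase bodies 
are placeholders, not the actual CachedRecordStore logic:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Illustrative sketch only: run an "overwrite" phase and a "delete" phase
 * concurrently instead of sequentially, mirroring the idea of parallelizing
 * the two halves of overrideExpiredRecords.
 */
public class ParallelOverride {
    public static void runPhases(Runnable overwritePhase, Runnable deletePhase) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<?> overwrite = pool.submit(overwritePhase);
            Future<?> delete = pool.submit(deletePhase);
            overwrite.get();   // wait for both, surfacing failures from each
            delete.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted during cache update", e);
        } catch (ExecutionException e) {
            throw new RuntimeException("cache update phase failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The wall-clock cost then becomes max(overwrite, delete) instead of their sum, 
which matters most during the burst deletions described above.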






[jira] [Updated] (HDFS-17529) Improve router state store cache entry deletion

2024-05-20 Thread Felix N (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix N updated HDFS-17529:
---
Description: 
The current implementation of router state store updates is quite inefficient: 
when routers are removed and many NameNodeMembership records are deleted in a 
short burst, the deletions triggered router safemode in our cluster and caused 
a lot of trouble.

This ticket aims to improve the deletion process for the ZK state store 
implementation.

See HDFS-17532 for the other half of this improvement.

  was:
Current implementation for router state store update is quite inefficient, so 
much that when routers are removed and a lot of NameNodeMembership records are 
deleted in a short burst, the deletions triggered a router safemode in our 
cluster and caused a lot of troubles.

This ticket aims to improve the deletion process for ZK state store 
implementation.


> Improve router state store cache entry deletion
> ---
>
> Key: HDFS-17529
> URL: https://issues.apache.org/jira/browse/HDFS-17529
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, rbf
>Reporter: Felix N
>Assignee: Felix N
>Priority: Major
>  Labels: pull-request-available
>
> The current implementation of router state store updates is quite 
> inefficient: when routers are removed and many NameNodeMembership records 
> are deleted in a short burst, the deletions triggered router safemode in our 
> cluster and caused a lot of trouble.
> This ticket aims to improve the deletion process for the ZK state store 
> implementation.
> See HDFS-17532 for the other half of this improvement.






[jira] [Updated] (HDFS-17529) Improve router state store cache entry deletion

2024-05-20 Thread Felix N (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix N updated HDFS-17529:
---
Description: 
The current implementation of router state store updates is quite inefficient: 
when routers are removed and many NameNodeMembership records are deleted in a 
short burst, the deletions triggered router safemode in our cluster and caused 
a lot of trouble.

This ticket aims to improve the deletion process for the ZK state store 
implementation.

  was:
Current implementation for router state store update is quite inefficient, so 
much that when routers are removed and a lot of NameNodeMembership records are 
deleted in a short burst, the deletions triggered a router safemode in our 
cluster and caused a lot of troubles.

This ticket contains 2 parts: improving the deletion process for ZK state store 
implementation, and allowing the overwrite part and delete part of 
org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords
 to run in parallel.


> Improve router state store cache entry deletion
> ---
>
> Key: HDFS-17529
> URL: https://issues.apache.org/jira/browse/HDFS-17529
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, rbf
>Reporter: Felix N
>Assignee: Felix N
>Priority: Major
>  Labels: pull-request-available
>
> The current implementation of router state store updates is quite 
> inefficient: when routers are removed and many NameNodeMembership records 
> are deleted in a short burst, the deletions triggered router safemode in our 
> cluster and caused a lot of trouble.
> This ticket aims to improve the deletion process for the ZK state store 
> implementation.






[jira] [Updated] (HDFS-17529) Improve router state store cache entry deletion

2024-05-20 Thread Felix N (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix N updated HDFS-17529:
---
Summary: Improve router state store cache entry deletion  (was: Improve 
router state store cache update)

> Improve router state store cache entry deletion
> ---
>
> Key: HDFS-17529
> URL: https://issues.apache.org/jira/browse/HDFS-17529
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, rbf
>Reporter: Felix N
>Assignee: Felix N
>Priority: Major
>  Labels: pull-request-available
>
> The current implementation of router state store updates is quite 
> inefficient: when routers are removed and many NameNodeMembership records 
> are deleted in a short burst, the deletions triggered router safemode in our 
> cluster and caused a lot of trouble.
> This ticket contains 2 parts: improving the deletion process for the ZK 
> state store implementation, and allowing the overwrite part and delete part 
> of 
> org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords
>  to run in parallel.






[jira] [Updated] (HDFS-17529) Improve router state store cache update

2024-05-17 Thread Felix N (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix N updated HDFS-17529:
---
Description: 
The current implementation of router state store updates is quite inefficient: 
when routers are removed and many NameNodeMembership records are deleted in a 
short burst, the deletions triggered router safemode in our cluster and caused 
a lot of trouble.

This ticket contains 2 parts: improving the deletion process for the ZK state 
store implementation, and allowing the overwrite part and delete part of 
org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords
 to run in parallel.

  was:
Current implementation for router state store update is quite inefficient, so 
much that when routers are removed and a lot of NameNodeMembership records are 
deleted in a short burst, the deletions triggered a router safemode in our 
cluster and caused a lot of troubles.

This ticket contains 2 parts: improving the deletion process for ZK state store 
implementation, and allowing the overwrite part and delete part of


> Improve router state store cache update
> ---
>
> Key: HDFS-17529
> URL: https://issues.apache.org/jira/browse/HDFS-17529
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, rbf
>Reporter: Felix N
>Assignee: Felix N
>Priority: Major
>
> The current implementation of router state store updates is quite 
> inefficient: when routers are removed and many NameNodeMembership records 
> are deleted in a short burst, the deletions triggered router safemode in our 
> cluster and caused a lot of trouble.
> This ticket contains 2 parts: improving the deletion process for the ZK 
> state store implementation, and allowing the overwrite part and delete part 
> of 
> org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords
>  to run in parallel.






[jira] [Created] (HDFS-17529) Improve router state store cache update

2024-05-17 Thread Felix N (Jira)
Felix N created HDFS-17529:
--

 Summary: Improve router state store cache update
 Key: HDFS-17529
 URL: https://issues.apache.org/jira/browse/HDFS-17529
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs, rbf
Reporter: Felix N
Assignee: Felix N


The current implementation of router state store updates is quite inefficient: 
when routers are removed and many NameNodeMembership records are deleted in a 
short burst, the deletions triggered router safemode in our cluster and caused 
a lot of trouble.

This ticket contains 2 parts: improving the deletion process for ZK state store 
implementation, and allowing the overwrite part and delete part of






[jira] [Updated] (HDFS-17492) [FGL] Abstract a INodeLockManager to manage acquiring and releasing locks in the directory-tree

2024-04-30 Thread Felix N (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix N updated HDFS-17492:
---
Summary: [FGL] Abstract a INodeLockManager to manage acquiring and 
releasing locks in the directory-tree  (was: [FGL] Abstract a INodeLockManager 
to mange acquiring and releasing locks in the directory-tree)

> [FGL] Abstract a INodeLockManager to manage acquiring and releasing locks in 
> the directory-tree
> ---
>
> Key: HDFS-17492
> URL: https://issues.apache.org/jira/browse/HDFS-17492
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>
> Abstract an INodeLockManager to manage acquiring and releasing locks in the 
> directory-tree.
>  # Abstract a lock type to cover all cases in the NN
>  # Acquire the full path lock for the input path based on the input lock type
>  # Acquire the full path lock for the input iNodeId based on the input lock 
> type
>  # Acquire the full path lock for some input paths, such as for rename and 
> concat
>  
> INodeLockManager should return an IIP which contains both iNodes and locks.






[jira] [Commented] (HDFS-17488) DN can fail IBRs with NPE when a volume is removed

2024-04-22 Thread Felix N (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839556#comment-17839556
 ] 

Felix N commented on HDFS-17488:


Hi [~zhanghaobo], thanks for letting me know. I think you can review my PR, 
since it should contain your patch plus some extra steps to prevent the 
situation from arising, plus unit tests.

> DN can fail IBRs with NPE when a volume is removed
> --
>
> Key: HDFS-17488
> URL: https://issues.apache.org/jira/browse/HDFS-17488
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Felix N
>Assignee: Felix N
>Priority: Major
>  Labels: pull-request-available
>
>  
> Error logs
> {code:java}
> 2024-04-22 15:46:33,422 [BP-1842952724-10.22.68.249-1713771988830 
> heartbeating to localhost/127.0.0.1:64977] ERROR datanode.DataNode 
> (BPServiceActor.java:run(922)) - Exception in BPOfferService for Block pool 
> BP-1842952724-10.22.68.249-1713771988830 (Datanode Uuid 
> 1659ffaf-1a80-4a8e-a542-643f6bd97ed4) service to localhost/127.0.0.1:64977
> java.lang.NullPointerException
>     at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReceivedAndDeleted(DatanodeProtocolClientSideTranslatorPB.java:246)
>     at 
> org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager.sendIBRs(IncrementalBlockReportManager.java:218)
>     at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:749)
>     at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:920)
>     at java.lang.Thread.run(Thread.java:748) {code}
> The root cause is in BPOfferService#notifyNamenodeBlock: it happens when the 
> method is called on a block belonging to a volume that was already removed. 
> Because the volume is gone, the storage lookup returns null:
>  
> {code:java}
> private void notifyNamenodeBlock(ExtendedBlock block, BlockStatus status,
> String delHint, String storageUuid, boolean isOnTransientStorage) {
>   checkBlock(block);
>   final ReceivedDeletedBlockInfo info = new ReceivedDeletedBlockInfo(
>   block.getLocalBlock(), status, delHint);
>   final DatanodeStorage storage = dn.getFSDataset().getStorage(storageUuid);
>   
>   // storage == null here because it's already removed earlier.
>   for (BPServiceActor actor : bpServices) {
> actor.getIbrManager().notifyNamenodeBlock(info, storage,
> isOnTransientStorage);
>   }
> } {code}
> so IBRs with a null storage are now pending.
> The reason notifyNamenodeBlock can be triggered for such blocks lies in 
> DirectoryScanner#reconcile:
> {code:java}
>   public void reconcile() throws IOException {
>     LOG.debug("reconcile start DirectoryScanning");
>     scan();
> // If a volume is removed here after scan() already finished running,
> // diffs is stale and checkAndUpdate will run on a removed volume
>     // HDFS-14476: run checkAndUpdate with batch to avoid holding the lock too
>     // long
>     int loopCount = 0;
>     synchronized (diffs) {
>       for (final Map.Entry entry : diffs.getEntries()) {
>         dataset.checkAndUpdate(entry.getKey(), entry.getValue());        
>     ...
>   } {code}
> Inside checkAndUpdate, memBlockInfo is null because all the block metadata 
> in memory was removed during the volume removal, but diskFile still exists. 
> DataNode#notifyNamenodeDeletedBlock (and further down the line, 
> notifyNamenodeBlock) is then called on this block.
>  






[jira] [Created] (HDFS-17488) DN can fail IBRs with NPE when a volume is removed

2024-04-22 Thread Felix N (Jira)
Felix N created HDFS-17488:
--

 Summary: DN can fail IBRs with NPE when a volume is removed
 Key: HDFS-17488
 URL: https://issues.apache.org/jira/browse/HDFS-17488
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Reporter: Felix N
Assignee: Felix N


 

Error logs
{code:java}
2024-04-22 15:46:33,422 [BP-1842952724-10.22.68.249-1713771988830 heartbeating 
to localhost/127.0.0.1:64977] ERROR datanode.DataNode 
(BPServiceActor.java:run(922)) - Exception in BPOfferService for Block pool 
BP-1842952724-10.22.68.249-1713771988830 (Datanode Uuid 
1659ffaf-1a80-4a8e-a542-643f6bd97ed4) service to localhost/127.0.0.1:64977
java.lang.NullPointerException
    at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReceivedAndDeleted(DatanodeProtocolClientSideTranslatorPB.java:246)
    at 
org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager.sendIBRs(IncrementalBlockReportManager.java:218)
    at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:749)
    at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:920)
    at java.lang.Thread.run(Thread.java:748) {code}
The root cause is in BPOfferService#notifyNamenodeBlock: it happens when the 
method is called on a block belonging to a volume that was already removed. 
Because the volume is gone, the storage lookup returns null:

 
{code:java}
private void notifyNamenodeBlock(ExtendedBlock block, BlockStatus status,
String delHint, String storageUuid, boolean isOnTransientStorage) {
  checkBlock(block);
  final ReceivedDeletedBlockInfo info = new ReceivedDeletedBlockInfo(
  block.getLocalBlock(), status, delHint);
  final DatanodeStorage storage = dn.getFSDataset().getStorage(storageUuid);
  
  // storage == null here because it's already removed earlier.

  for (BPServiceActor actor : bpServices) {
actor.getIbrManager().notifyNamenodeBlock(info, storage,
isOnTransientStorage);
  }
} {code}
so IBRs with a null storage are now pending.

The reason notifyNamenodeBlock can be triggered for such blocks lies in 
DirectoryScanner#reconcile:
{code:java}
  public void reconcile() throws IOException {
    LOG.debug("reconcile start DirectoryScanning");
    scan();

// If a volume is removed here after scan() already finished running,
// diffs is stale and checkAndUpdate will run on a removed volume

    // HDFS-14476: run checkAndUpdate with batch to avoid holding the lock too
    // long
    int loopCount = 0;
    synchronized (diffs) {
      for (final Map.Entry entry : diffs.getEntries()) {
        dataset.checkAndUpdate(entry.getKey(), entry.getValue());        
    ...
  } {code}
Inside checkAndUpdate, memBlockInfo is null because all the block metadata in 
memory was removed during the volume removal, but diskFile still exists. 
DataNode#notifyNamenodeDeletedBlock (and further down the line, 
notifyNamenodeBlock) is then called on this block.
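A toy model of the race and of one defensive fix (all names below are 
illustrative stand-ins, not the actual DataNode code): when the storage lookup 
comes back null, drop the notification instead of queueing an IBR that would 
later NPE in sendIBRs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Toy model of the volume-removal race. IbrGuard, addStorage, etc. are
 * hypothetical names; the point is only the null check before queueing.
 */
public class IbrGuard {
    private final Map<String, String> storages = new HashMap<>();
    private final List<String> pendingIbrs = new ArrayList<>();

    public void addStorage(String uuid, String desc) { storages.put(uuid, desc); }
    public void removeStorage(String uuid) { storages.remove(uuid); }

    /** Returns true if the block notification was queued, false if dropped. */
    public boolean notifyNamenodeBlock(String blockId, String storageUuid) {
        String storage = storages.get(storageUuid);
        if (storage == null) {
            // Volume already removed: drop (or log) rather than queue a null
            // storage, which would NPE later when the IBR is sent.
            return false;
        }
        pendingIbrs.add(blockId + "@" + storage);
        return true;
    }

    public int pendingCount() { return pendingIbrs.size(); }
}
```

This only models the symptom; the actual PR also has to keep 
DirectoryScanner's stale diffs from reaching this path at all.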

 






[jira] [Commented] (HDFS-17475) Add a command to check if files are readable

2024-04-17 Thread Felix N (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838119#comment-17838119
 ] 

Felix N commented on HDFS-17475:


Hi [~ayushtkn], the requirements for this feature did indeed come from our 
production users. While fsck can check whether some blocks are missing, AFAIK 
a successful fsck doesn't guarantee that all blocks are readable. This feature 
aims to provide a way to verify whether a large number of files are readable 
without going through the full read pipeline for each file.

> Add a command to check if files are readable
> 
>
> Key: HDFS-17475
> URL: https://issues.apache.org/jira/browse/HDFS-17475
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Felix N
>Assignee: Felix N
>Priority: Minor
>  Labels: pull-request-available
>
> Sometimes a job can fail down the line due to one unreadable file, caused by 
> missing replicas, dead DNs, or other reasons. This command should allow 
> users to check whether files are readable by checking for metadata on the 
> DNs without executing the full read pipeline for each file.






[jira] [Created] (HDFS-17475) Add a command to check if files are readable

2024-04-17 Thread Felix N (Jira)
Felix N created HDFS-17475:
--

 Summary: Add a command to check if files are readable
 Key: HDFS-17475
 URL: https://issues.apache.org/jira/browse/HDFS-17475
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Reporter: Felix N
Assignee: Felix N
 Fix For: 3.5.0


Sometimes a job can fail down the line due to one unreadable file, caused by 
missing replicas, dead DNs, or other reasons. This command should allow users 
to check whether files are readable by checking for metadata on the DNs 
without executing the full read pipeline for each file.






[jira] [Commented] (HDFS-17459) [FGL] Summarize this feature

2024-04-11 Thread Felix N (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836396#comment-17836396
 ] 

Felix N commented on HDFS-17459:


I can help with this one. I assume it's documentation for this feature?

> [FGL] Summarize this feature 
> -
>
> Key: HDFS-17459
> URL: https://issues.apache.org/jira/browse/HDFS-17459
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>
> Write a doc to summarize this feature so we can merge it into the trunk.






[jira] [Created] (HDFS-17271) Web UI DN report shows random order when sorting with dead DNs

2023-12-01 Thread Felix N (Jira)
Felix N created HDFS-17271:
--

 Summary: Web UI DN report shows random order when sorting with 
dead DNs
 Key: HDFS-17271
 URL: https://issues.apache.org/jira/browse/HDFS-17271
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode, rbf, ui
Affects Versions: 3.4.0
Reporter: Felix N
Assignee: Felix N
 Fix For: 3.4.0
 Attachments: image-2023-12-01-15-04-11-047.png

When sorted by "last contact" in ascending order, dead nodes come up on top in 
a random order

!image-2023-12-01-15-04-11-047.png|width=337,height=263!






[jira] [Commented] (HDFS-16539) RBF: Support refreshing/changing router fairness policy controller without rebooting router

2022-04-27 Thread Felix N (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528665#comment-17528665
 ] 

Felix N commented on HDFS-16539:


We might need to add some documentation somewhere, since this patch makes use 
of the generic refresh mechanism and that feature seems underutilized for 
routers.

> RBF: Support refreshing/changing router fairness policy controller without 
> rebooting router
> ---
>
> Key: HDFS-16539
> URL: https://issues.apache.org/jira/browse/HDFS-16539
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Felix N
>Assignee: Felix N
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Add support for refreshing/changing router fairness policy controller without 
> the need to reboot a router.






[jira] [Resolved] (HDFS-16539) RBF: Support refreshing/changing router fairness policy controller without rebooting router

2022-04-27 Thread Felix N (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix N resolved HDFS-16539.

Fix Version/s: 3.4.0
   Resolution: Fixed

> RBF: Support refreshing/changing router fairness policy controller without 
> rebooting router
> ---
>
> Key: HDFS-16539
> URL: https://issues.apache.org/jira/browse/HDFS-16539
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Felix N
>Assignee: Felix N
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Add support for refreshing/changing router fairness policy controller without 
> the need to reboot a router.






[jira] [Commented] (HDFS-14750) RBF: Improved isolation for downstream name nodes. {Dynamic}

2022-04-26 Thread Felix N (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528060#comment-17528060
 ] 

Felix N commented on HDFS-14750:


Tried my hand at it since there have been no updates on this ticket.

The rough idea is to utilize the metrics added by HDFS-16296 and HDFS-16302 
and spawn a background thread that periodically resizes the semaphores based 
on the traffic to each namespace (determined from the metrics).
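For the resizing part, one plausible building block (my assumption, not 
necessarily what the patch does) is that java.util.concurrent.Semaphore 
already supports growing via release(n) and shrinking via the protected 
reducePermits(n), so a small subclass is enough for the background thread to 
adjust per-namespace permits:

```java
import java.util.concurrent.Semaphore;

/**
 * Sketch: a Semaphore whose permit count a background thread can adjust
 * periodically from observed per-namespace traffic. Semaphore hides
 * reducePermits(), so the subclass exposes it. The metrics-driven policy
 * deciding the deltas is out of scope here.
 */
public class ResizableSemaphore extends Semaphore {
    public ResizableSemaphore(int permits) {
        super(permits);
    }

    /** Positive delta adds permits; negative delta removes them. */
    public void resizeBy(int delta) {
        if (delta > 0) {
            release(delta);          // grow: hand out extra permits
        } else if (delta < 0) {
            reducePermits(-delta);   // shrink: may drive available permits negative
        }
    }
}
```

Shrinking does not forcibly reclaim permits already held; the count simply 
settles as in-flight requests release, which is likely the behavior wanted for 
handler rebalancing anyway.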

> RBF: Improved isolation for downstream name nodes. {Dynamic}
> 
>
> Key: HDFS-14750
> URL: https://issues.apache.org/jira/browse/HDFS-14750
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This Jira tracks the work around dynamic allocation of resources in routers 
> for downstream hdfs clusters. 






[jira] [Updated] (HDFS-16539) RBF: Support refreshing/changing router fairness policy controller without rebooting router

2022-04-13 Thread Felix N (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix N updated HDFS-16539:
---
Description: Add support for refreshing/changing router fairness policy 
controller without the need to reboot a router.

> RBF: Support refreshing/changing router fairness policy controller without 
> rebooting router
> ---
>
> Key: HDFS-16539
> URL: https://issues.apache.org/jira/browse/HDFS-16539
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Felix N
>Assignee: Felix N
>Priority: Minor
>
> Add support for refreshing/changing router fairness policy controller without 
> the need to reboot a router.






[jira] [Created] (HDFS-16539) RBF: Support refreshing/changing router fairness policy controller without rebooting router

2022-04-13 Thread Felix N (Jira)
Felix N created HDFS-16539:
--

 Summary: RBF: Support refreshing/changing router fairness policy 
controller without rebooting router
 Key: HDFS-16539
 URL: https://issues.apache.org/jira/browse/HDFS-16539
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Felix N
Assignee: Felix N









[jira] [Commented] (HDFS-16417) RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if concurrent ns handler count is configured

2022-01-06 Thread Felix N (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470308#comment-17470308
 ] 

Felix N commented on HDFS-16417:


Hi [~elgoiri], sorry, I was using the wrong formatter, hence the import changes. 
I have submitted a PR with that addressed.

> RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if 
> concurrent ns handler count is configured
> ---
>
> Key: HDFS-16417
> URL: https://issues.apache.org/jira/browse/HDFS-16417
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: Felix N
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-16417.001.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If {{dfs.federation.router.fairness.handler.count.concurrent}} is configured, 
> unassignedNS is empty and {{handlerCount % unassignedNS.size()}} throws a 
> division-by-zero exception.
> Changed it to assign the extra handlers to the concurrent ns when it is 
> configured.






[jira] [Updated] (HDFS-16417) RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if concurrent ns handler count is configured

2022-01-06 Thread Felix N (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix N updated HDFS-16417:
---
Attachment: HDFS-16417.001.patch
Status: Patch Available  (was: Open)

> RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if 
> concurrent ns handler count is configured
> ---
>
> Key: HDFS-16417
> URL: https://issues.apache.org/jira/browse/HDFS-16417
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: Felix N
>Priority: Minor
> Attachments: HDFS-16417.001.patch
>
>
> If {{dfs.federation.router.fairness.handler.count.concurrent}} is configured, 
> unassignedNS is empty and {{handlerCount % unassignedNS.size()}} throws a 
> division-by-zero exception.
> Changed it to assign the extra handlers to the concurrent ns when it is 
> configured.






[jira] [Created] (HDFS-16417) RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if concurrent ns handler count is configured

2022-01-06 Thread Felix N (Jira)
Felix N created HDFS-16417:
--

 Summary: RBF: StaticRouterRpcFairnessPolicyController init fails 
with division by 0 if concurrent ns handler count is configured
 Key: HDFS-16417
 URL: https://issues.apache.org/jira/browse/HDFS-16417
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: rbf
Reporter: Felix N


If {{dfs.federation.router.fairness.handler.count.concurrent}} is configured, 
unassignedNS is empty and {{handlerCount % unassignedNS.size()}} throws a 
division-by-zero exception.

Changed it to assign the extra handlers to the concurrent ns when it is configured.
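
The failure mode above can be sketched in isolation. The class and method names 
below are illustrative only, not the actual 
StaticRouterRpcFairnessPolicyController code; they just show why an empty 
unassignedNS list made the modulo throw, and the guard the fix describes:

```java
import java.util.List;

public class HandlerAssignment {
  /**
   * Hypothetical sketch: distribute leftover handlers across unassigned
   * namespaces. Without the isEmpty() guard, an empty unassignedNS list
   * makes the modulo throw ArithmeticException: / by zero.
   */
  static int extraHandlers(int handlerCount, List<String> unassignedNS) {
    if (unassignedNS.isEmpty()) {
      // All namespaces already have a configured count; per the fix,
      // route the extra handlers to the concurrent ns instead.
      return handlerCount;
    }
    return handlerCount % unassignedNS.size();
  }

  public static void main(String[] args) {
    System.out.println(extraHandlers(10, List.of()));            // 10
    System.out.println(extraHandlers(7, List.of("ns1", "ns2"))); // 1
  }
}
```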






[jira] [Commented] (HDFS-15972) Fedbalance only copies data partially when there's existing opened file

2021-04-12 Thread Felix N (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319870#comment-17319870
 ] 

Felix N commented on HDFS-15972:


Hi [~LiJinglun], is this the expected behavior? During heavy write periods, this 
might lead to data loss.

> Fedbalance only copies data partially when there's existing opened file
> ---
>
> Key: HDFS-15972
> URL: https://issues.apache.org/jira/browse/HDFS-15972
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Felix N
>Priority: Major
>
> If there are opened files when fedbalance is run and data is being written to 
> these files, fedbalance might skip the newly written data.
> Steps to recreate the issue:
>  # Create a dummy file /test/file with some data: {{echo "start" | hdfs dfs 
> -appendToFile /test/file}}
>  # Start writing to the file: {{hdfs dfs -appendToFile /test/file}} but do 
> not stop writing
>  # Run fedbalance: {{hadoop fedbalance submit hdfs://ns1/test 
> hdfs://ns2/test}}
>  # Write something to the file while fedbalance is running, "end" for 
> example, then stop writing
>  # After fedbalance is done, {{hdfs://ns2/test/file}} should only contain 
> "start" while {{hdfs://ns1/user/hadoop/.Trash/Current/test/file}} contains 
> "start\nend"
> Fedbalance is run with default configs and arguments so no diff should happen.






[jira] [Moved] (HDFS-15972) Fedbalance only copies data partially when there's existing opened file

2021-04-12 Thread Felix N (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix N moved HADOOP-17634 to HDFS-15972:
-

Key: HDFS-15972  (was: HADOOP-17634)
Project: Hadoop HDFS  (was: Hadoop Common)

> Fedbalance only copies data partially when there's existing opened file
> ---
>
> Key: HDFS-15972
> URL: https://issues.apache.org/jira/browse/HDFS-15972
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Felix N
>Priority: Major
>
> If there are opened files when fedbalance is run and data is being written to 
> these files, fedbalance might skip the newly written data.
> Steps to recreate the issue:
>  # Create a dummy file /test/file with some data: {{echo "start" | hdfs dfs 
> -appendToFile /test/file}}
>  # Start writing to the file: {{hdfs dfs -appendToFile /test/file}} but do 
> not stop writing
>  # Run fedbalance: {{hadoop fedbalance submit hdfs://ns1/test 
> hdfs://ns2/test}}
>  # Write something to the file while fedbalance is running, "end" for 
> example, then stop writing
>  # After fedbalance is done, {{hdfs://ns2/test/file}} should only contain 
> "start" while {{hdfs://ns1/user/hadoop/.Trash/Current/test/file}} contains 
> "start\nend"
> Fedbalance is run with default configs and arguments so no diff should happen.






[jira] [Commented] (HDFS-15294) Federation balance tool

2020-11-02 Thread Felix N (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224549#comment-17224549
 ] 

Felix N commented on HDFS-15294:


Thanks for the awesome work! May I ask if there is any plan to backport this 
feature to 2.x?

> Federation balance tool
> ---
>
> Key: HDFS-15294
> URL: https://issues.apache.org/jira/browse/HDFS-15294
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, 
> HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, 
> HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, 
> HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf
>
>
> This jira introduces a new HDFS federation balance tool to balance data 
> across different federation namespaces. It uses Distcp to copy data from the 
> source path to the target path.
> The process is:
>  1. Use distcp and snapshot diff to sync data between src and dst until they 
> are the same.
>  2. Update mount table in Router if we specified RBF mode.
>  3. Deal with src data, move to trash, delete or skip them.
> The design of fedbalance tool comes from the discussion in HDFS-15087.


