[jira] [Commented] (HDFS-13224) RBF: Resolvers to support mount points across multiple subclusters

2023-02-15 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689432#comment-17689432
 ] 

Daniel Ma commented on HDFS-13224:
--

[~elgoiri]  Could you please share the design doc for this feature? I have no 
idea what kind of scenario needs to span subclusters.
Thanks

> RBF: Resolvers to support mount points across multiple subclusters
> --
>
> Key: HDFS-13224
> URL: https://issues.apache.org/jira/browse/HDFS-13224
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.3
>
> Attachments: HDFS-13224-branch-2.000.patch, HDFS-13224.000.patch, 
> HDFS-13224.001.patch, HDFS-13224.002.patch, HDFS-13224.003.patch, 
> HDFS-13224.004.patch, HDFS-13224.005.patch, HDFS-13224.006.patch, 
> HDFS-13224.007.patch, HDFS-13224.008.patch, HDFS-13224.009.patch, 
> HDFS-13224.010.patch
>
>
> Currently, a mount point points to a single subcluster. We should be able to 
> spread files in a mount point across subclusters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16115) Asynchronous BPServiceActor command handling may result in BPServiceActor never failing even when CommandProcessingThread is closed with a fatal error.

2022-12-22 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17651129#comment-17651129
 ] 

Daniel Ma commented on HDFS-16115:
--

[~hemanthboyina] [~brahma]
Could you please help review this?

> Asynchronous BPServiceActor command handling may result in 
> BPServiceActor never failing even when CommandProcessingThread is closed with 
> a fatal error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode asynchronously 
> (CommandProcessingThread handles the commands), so if any exception or error 
> occurs in CommandProcessingThread, that thread fails and stops. BPServiceActor 
> cannot be aware of this and keeps putting commands from the NameNode into the 
> queue, waiting for them to be handled by a CommandProcessingThread that is 
> actually already dead.
> 2- The second sub-issue builds on the first: if CommandProcessingThread died 
> owing to a non-fatal error like "unable to create new native thread" (caused 
> by too many threads existing in the OS), this kind of problem should be given 
> much more tolerance instead of simply shutting the thread down with no 
> automatic recovery, because the non-fatal errors mentioned above can probably 
> be recovered from by themselves soon:
> {code:java}
> // code placeholder
> 2021-07-02 16:26:02,315 | WARN  | Command processor | Exception happened when 
> process queue BPServiceActor.java:1393
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365)
> {code}
> Currently, the DataNode's BPServiceActor cannot return to normal even after 
> the non-fatal error has been eliminated.
> Therefore, this patch does two things:
> 1- Adds a retry mechanism to the BPServiceActor and CommandProcessingThread 
> threads; the retry limit is 5 by default and configurable;
> 2- Adds a periodic monitor thread in BPOfferService: if a BPServiceActor 
> thread dies owing to repeated non-fatal errors, it should not simply be 
> removed from the BPServiceActor list stored in BPOfferService; instead, the 
> monitor thread will periodically try to restart these dead BPServiceActor 
> threads. The interval is also configurable.
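
A minimal, self-contained sketch of the retry-and-monitor idea described above 
(the class, thread name, and constants are illustrative, not the actual patch):

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: restart a command-processing worker a bounded number of times
 *  instead of letting a non-fatal error kill it permanently. */
public class RetryingWorkerSketch {
  private static final int MAX_NON_FATAL_RETRIES = 5;     // default, configurable
  private static final long MONITOR_INTERVAL_MS = 10_000; // also configurable

  private final AtomicInteger nonFatalFailures = new AtomicInteger();
  private volatile Thread worker;

  private void startWorker() {
    worker = new Thread(() -> {
      try {
        processQueue();                                   // stands in for the real loop
      } catch (OutOfMemoryError | RuntimeException nonFatal) {
        nonFatalFailures.incrementAndGet();               // count, do not give up yet
      }
    }, "CommandProcessorSketch");
    worker.start();
  }

  /** Periodic monitor: bring a dead worker back while retries remain. */
  public void start() {
    startWorker();
    ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
    monitor.scheduleAtFixedRate(() -> {
      if (!worker.isAlive() && nonFatalFailures.get() <= MAX_NON_FATAL_RETRIES) {
        startWorker();
      }
    }, MONITOR_INTERVAL_MS, MONITOR_INTERVAL_MS, TimeUnit.MILLISECONDS);
  }

  private void processQueue() { /* drain queued NameNode commands here */ }
}
{code}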



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname

2022-12-19 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16871:
-
Description: 
The DiskBalancer process reads DataNode hostnames as lowercase letters,
 !screenshot-1.png! 
 but there is no letter-case transform in getNodeByName.
 !screenshot-2.png! 
For a DataNode with a lowercase hostname, everything is OK.
But for a DataNode with an uppercase hostname, when the Balancer process tries 
to migrate onto it, an IllegalArgumentException is thrown as below:

{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: 
java.lang.IllegalArgumentException: Unable to find the specified node. 
node-group-1YlRf0002
{code}



  was:
The DiskBalancer process reads DataNode hostnames as lowercase letters,
 !screenshot-1.png! 
 but there is no letter-case transform in getNodeByName.
 !screenshot-2.png! 
For a DataNode with a lowercase hostname, everything is OK.
But for a DataNode with an uppercase hostname, an IllegalArgumentException is 
thrown as below:

{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: 
java.lang.IllegalArgumentException: Unable to find the specified node. 
node-group-1YlRf0002
{code}




> DiskBalancer process may throw IllegalArgumentException when the target 
> DataNode has a capital letter in its hostname
> 
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> The DiskBalancer process reads DataNode hostnames as lowercase letters,
>  !screenshot-1.png! 
>  but there is no letter-case transform in getNodeByName.
>  !screenshot-2.png! 
> For a DataNode with a lowercase hostname, everything is OK.
> But for a DataNode with an uppercase hostname, when the Balancer process tries 
> to migrate onto it, an IllegalArgumentException is thrown as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: 
> java.lang.IllegalArgumentException: Unable to find the specified node. 
> node-group-1YlRf0002
> {code}
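
A minimal sketch of the mismatch and the obvious guard, assuming the balancer 
keeps its nodes in a map keyed by the lowercased hostname (all names here are 
illustrative):

{code:java}
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch: hostnames are stored lowercased, so lookups must normalize
 *  case too, or an uppercase hostname is never found. */
public class NodeLookupSketch {
  private final Map<String, String> nodesByName = new HashMap<>();

  void addNode(String hostname) {
    // Stored lowercased, mirroring how the balancer reads the report.
    nodesByName.put(hostname.toLowerCase(Locale.ROOT), hostname);
  }

  String getNodeByName(String hostname) {
    // Without this normalization, a hostname containing capitals misses.
    String node = nodesByName.get(hostname.toLowerCase(Locale.ROOT));
    if (node == null) {
      throw new IllegalArgumentException("Unable to find the specified node. " + hostname);
    }
    return node;
  }
}
{code}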



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname

2022-12-19 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16871:
-
Attachment: screenshot-2.png

> DiskBalancer process may throw IllegalArgumentException when the target 
> DataNode has a capital letter in its hostname
> 
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> The DiskBalancer process reads DataNode hostnames as lowercase letters,
>  !screenshot-1.png! 
>  but there is no letter-case transform in getNodeByName.
> For a DataNode with a lowercase hostname, everything is OK.
> But for a DataNode with an uppercase hostname, an IllegalArgumentException is 
> thrown as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: 
> java.lang.IllegalArgumentException: Unable to find the specified node. 
> node-group-1YlRf0002
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname

2022-12-19 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16871:
-
Description: 
The DiskBalancer process reads DataNode hostnames as lowercase letters,
 !screenshot-1.png! 
 but there is no letter-case transform in getNodeByName.
 !screenshot-2.png! 
For a DataNode with a lowercase hostname, everything is OK.
But for a DataNode with an uppercase hostname, an IllegalArgumentException is 
thrown as below:

{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: 
java.lang.IllegalArgumentException: Unable to find the specified node. 
node-group-1YlRf0002
{code}



  was:
The DiskBalancer process reads DataNode hostnames as lowercase letters,
 !screenshot-1.png! 
 but there is no letter-case transform in getNodeByName.
For a DataNode with a lowercase hostname, everything is OK.
But for a DataNode with an uppercase hostname, an IllegalArgumentException is 
thrown as below:

{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: 
java.lang.IllegalArgumentException: Unable to find the specified node. 
node-group-1YlRf0002
{code}




> DiskBalancer process may throw IllegalArgumentException when the target 
> DataNode has a capital letter in its hostname
> 
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> The DiskBalancer process reads DataNode hostnames as lowercase letters,
>  !screenshot-1.png! 
>  but there is no letter-case transform in getNodeByName.
>  !screenshot-2.png! 
> For a DataNode with a lowercase hostname, everything is OK.
> But for a DataNode with an uppercase hostname, an IllegalArgumentException is 
> thrown as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: 
> java.lang.IllegalArgumentException: Unable to find the specified node. 
> node-group-1YlRf0002
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname

2022-12-19 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16871:
-
Description: 
The DiskBalancer process reads DataNode hostnames as lowercase letters,
 !screenshot-1.png! 
 but there is no letter-case transform in getNodeByName.
For a DataNode with a lowercase hostname, everything is OK.
But for a DataNode with an uppercase hostname, an IllegalArgumentException is 
thrown as below:

{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: 
java.lang.IllegalArgumentException: Unable to find the specified node. 
node-group-1YlRf0002
{code}



> DiskBalancer process may throw IllegalArgumentException when the target 
> DataNode has a capital letter in its hostname
> 
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Major
> Attachments: screenshot-1.png
>
>
> The DiskBalancer process reads DataNode hostnames as lowercase letters,
>  !screenshot-1.png! 
>  but there is no letter-case transform in getNodeByName.
> For a DataNode with a lowercase hostname, everything is OK.
> But for a DataNode with an uppercase hostname, an IllegalArgumentException is 
> thrown as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: 
> java.lang.IllegalArgumentException: Unable to find the specified node. 
> node-group-1YlRf0002
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname

2022-12-19 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma reassigned HDFS-16871:


Assignee: Daniel Ma

> DiskBalancer process may throw IllegalArgumentException when the target 
> DataNode has a capital letter in its hostname
> 
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Major
> Attachments: screenshot-1.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname

2022-12-19 Thread Daniel Ma (Jira)
Daniel Ma created HDFS-16871:


 Summary: DiskBalancer process may throw IllegalArgumentException 
when the target DataNode has a capital letter in its hostname
 Key: HDFS-16871
 URL: https://issues.apache.org/jira/browse/HDFS-16871
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Daniel Ma
 Attachments: screenshot-1.png





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname

2022-12-19 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16871:
-
Attachment: screenshot-1.png

> DiskBalancer process may throw IllegalArgumentException when the target 
> DataNode has a capital letter in its hostname
> 
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Major
> Attachments: screenshot-1.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16870) Client IP should also be recorded when NameNode is processing reportBadBlocks

2022-12-17 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16870:
-
Description: 
There are two scenarios involved in reportBadBlocks:
1- The HDFS client reports a bad block to the NameNode once the block size is 
inconsistent with its meta;
2- The DataNode reports a bad block to the NameNode via heartbeat if a replica 
stored on the DataNode is corrupted or has been modified.

Currently, when the NameNode processes a reportBadBlocks RPC request, only the 
DataNode address is recorded in the log message.
The client IP should also be recorded to distinguish where the report comes 
from, which is very useful for troubleshooting.

  was:
There are two scenarios involved in reportBadBlocks:
1- The HDFS client reports a bad block to the NameNode once the block size is 
inconsistent with its meta;
2- The DataNode reports a bad block to the NameNode via heartbeat if a replica 
stored on the DataNode is corrupted or has been modified.

Currently, when the NameNode processes a reportBadBlocks RPC request, only the 
DataNode IP is recorded in the log message.
The client IP should also be recorded to distinguish where the report comes 
from, which is very useful for troubleshooting.


> Client IP should also be recorded when NameNode is processing reportBadBlocks
> -
>
> Key: HDFS-16870
> URL: https://issues.apache.org/jira/browse/HDFS-16870
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Daniel Ma
>Priority: Trivial
>
> There are two scenarios involved in reportBadBlocks:
> 1- The HDFS client reports a bad block to the NameNode once the block size is 
> inconsistent with its meta;
> 2- The DataNode reports a bad block to the NameNode via heartbeat if a replica 
> stored on the DataNode is corrupted or has been modified.
> Currently, when the NameNode processes a reportBadBlocks RPC request, only the 
> DataNode address is recorded in the log message.
> The client IP should also be recorded to distinguish where the report comes 
> from, which is very useful for troubleshooting.
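
A hedged sketch of what the extra logging could look like. Server.getRemoteIp() 
is the standard way to obtain the RPC caller's address inside a Hadoop server 
handler; the surrounding method and log format are illustrative:

{code:java}
import java.net.InetAddress;
import org.apache.hadoop.ipc.Server;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Sketch: record the client IP alongside the DataNode address when the
 *  NameNode processes a reportBadBlocks request. */
public class ReportBadBlocksLogSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(ReportBadBlocksLogSketch.class);

  void logBadBlockReport(String block, String datanodeAddress) {
    // Returns the caller's address, or null when not inside an RPC handler.
    InetAddress clientIp = Server.getRemoteIp();
    LOG.info("reportBadBlocks: block {} on {} reported by client {}",
        block, datanodeAddress,
        clientIp != null ? clientIp.getHostAddress() : "unknown");
  }
}
{code}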



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16870) Client IP should also be recorded when NameNode is processing reportBadBlocks

2022-12-17 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16870:
-
Description: 
There are two scenarios involved in reportBadBlocks:
1- The HDFS client reports a bad block to the NameNode once the block size is 
inconsistent with its meta;
2- The DataNode reports a bad block to the NameNode via heartbeat if a replica 
stored on the DataNode is corrupted or has been modified.

Currently, when the NameNode processes a reportBadBlocks RPC request, only the 
DataNode IP is recorded in the log message.
The client IP should also be recorded to distinguish where the report comes 
from, which is very useful for troubleshooting.

> Client IP should also be recorded when NameNode is processing reportBadBlocks
> -
>
> Key: HDFS-16870
> URL: https://issues.apache.org/jira/browse/HDFS-16870
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Daniel Ma
>Priority: Trivial
>
> There are two scenarios involved in reportBadBlocks:
> 1- The HDFS client reports a bad block to the NameNode once the block size is 
> inconsistent with its meta;
> 2- The DataNode reports a bad block to the NameNode via heartbeat if a replica 
> stored on the DataNode is corrupted or has been modified.
> Currently, when the NameNode processes a reportBadBlocks RPC request, only the 
> DataNode IP is recorded in the log message.
> The client IP should also be recorded to distinguish where the report comes 
> from, which is very useful for troubleshooting.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16870) Client IP should also be recorded when NameNode is processing reportBadBlocks

2022-12-17 Thread Daniel Ma (Jira)
Daniel Ma created HDFS-16870:


 Summary: Client IP should also be recorded when NameNode is 
processing reportBadBlocks
 Key: HDFS-16870
 URL: https://issues.apache.org/jira/browse/HDFS-16870
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Daniel Ma






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16869) Fail to start namenode owing to 0 size of clientid recorded in edit log.

2022-12-17 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16869:
-
Description: 
We first encountered this issue on version 3.3.1 while upgrading from 3.1.1 to 
3.3.1. It may cause NameNode startup failure, but only occasionally, not every 
time.

The root cause of the zero-size clientId recorded here is still unknown after a 
long investigation, so we add a protective check to exclude zero-size clientIds 
from being added into the cache.

  was:
The root cause of the zero-size clientId recorded here is still unknown, so we 
add a protective check to exclude zero-size clientIds from being added into the 
cache.


> Fail to start namenode owing to 0 size of clientid recorded in edit log.
> 
>
> Key: HDFS-16869
> URL: https://issues.apache.org/jira/browse/HDFS-16869
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.1, 3.3.2, 3.3.3, 3.3.4
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Major
>
> We first encountered this issue on version 3.3.1 while upgrading from 3.1.1 
> to 3.3.1. It may cause NameNode startup failure, but only occasionally, not 
> every time.
> The root cause of the zero-size clientId recorded here is still unknown after 
> a long investigation, so we add a protective check to exclude zero-size 
> clientIds from being added into the cache.
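
A minimal sketch of the protective check described above, assuming the cache 
keys entries by (clientId, callId); the cache type and method names here are 
illustrative:

{code:java}
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

/** Sketch: refuse cache entries whose clientId is empty, so a zero-size
 *  clientId read from the edit log cannot break NameNode startup. */
public class RetryCacheGuardSketch {
  private final Map<String, Long> entries = new HashMap<>();

  void addCacheEntry(byte[] clientId, int callId) {
    if (clientId == null || clientId.length == 0) {
      // Root cause of the zero-size clientId is unknown; just skip it.
      return;
    }
    String key = Base64.getEncoder().encodeToString(clientId) + "/" + callId;
    entries.put(key, System.currentTimeMillis());
  }
}
{code}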



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16869) Fail to start namenode owing to 0 size of clientid recorded in edit log.

2022-12-16 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16869:
-
Description: 
The root cause of the zero-size clientId recorded here is still unknown, so we 
add a protective check to exclude zero-size clientIds from being added into the 
cache.

  was:
The DelegationTokenRenewer timeout feature may cause high CPU utilization and 
an object leak.
1- If the YARN cluster is idle, that is, almost no token renewal events are 
triggered, the DelegationTokenRenewerPoolTracker thread does nothing but spin 
in a busy loop, which causes high CPU utilization.

2- The renewal events are held in a map named futures, which has no removal 
logic, so the map grows without bound as time goes by.


> Fail to start namenode owing to 0 size of clientid recorded in edit log.
> 
>
> Key: HDFS-16869
> URL: https://issues.apache.org/jira/browse/HDFS-16869
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.1, 3.3.2, 3.3.3, 3.3.4
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Major
>
> The root cause of the zero-size clientId recorded here is still unknown, so 
> we add a protective check to exclude zero-size clientIds from being added 
> into the cache.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16869) Fail to start namenode owing to 0 size of clientid recorded in edit log.

2022-12-16 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16869:
-
Issue Type: Bug  (was: Improvement)

> Fail to start namenode owing to 0 size of clientid recorded in edit log.
> 
>
> Key: HDFS-16869
> URL: https://issues.apache.org/jira/browse/HDFS-16869
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.1, 3.3.2, 3.3.3, 3.3.4
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Major
>
> The DelegationTokenRenewer timeout feature may cause high CPU utilization and 
> an object leak.
> 1- If the YARN cluster is idle, that is, almost no token renewal events are 
> triggered, the DelegationTokenRenewerPoolTracker thread does nothing but spin 
> in a busy loop, which causes high CPU utilization.
> 2- The renewal events are held in a map named futures, which has no removal 
> logic, so the map grows without bound as time goes by.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16869) Fail to start namenode owing to 0 size of clientid recorded in edit log.

2022-12-16 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16869:
-
Summary: Fail to start namenode owing to 0 size of clientid recorded in 
edit log.  (was: DelegationTokenRenewer timeout feature may cause high 
utilization of CPU and object leak)

> Fail to start namenode owing to 0 size of clientid recorded in edit log.
> 
>
> Key: HDFS-16869
> URL: https://issues.apache.org/jira/browse/HDFS-16869
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.1, 3.3.2, 3.3.3, 3.3.4
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Major
>
> The DelegationTokenRenewer timeout feature may cause high CPU utilization and 
> an object leak.
> 1- If the YARN cluster is idle, that is, almost no token renewal events are 
> triggered, the DelegationTokenRenewerPoolTracker thread does nothing but spin 
> in a busy loop, which causes high CPU utilization.
> 2- The renewal events are held in a map named futures, which has no removal 
> logic, so the map grows without bound as time goes by.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16869) DelegationTokenRenewer timeout feature may cause high utilization of CPU and object leak

2022-12-16 Thread Daniel Ma (Jira)
Daniel Ma created HDFS-16869:


 Summary: DelegationTokenRenewer timeout feature may cause high 
utilization of CPU and object leak
 Key: HDFS-16869
 URL: https://issues.apache.org/jira/browse/HDFS-16869
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.3.4, 3.3.3, 3.3.2, 3.3.1
Reporter: Daniel Ma
Assignee: Daniel Ma


The DelegationTokenRenewer timeout feature may cause high CPU utilization and 
an object leak.
1- If the YARN cluster is idle, that is, almost no token renewal events are 
triggered, the DelegationTokenRenewerPoolTracker thread does nothing but spin 
in a busy loop, which causes high CPU utilization.

2- The renewal events are held in a map named futures, which has no removal 
logic, so the map grows without bound as time goes by.
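
A hedged sketch of the two fixes this description points at: block on a queue 
instead of busy-looping when idle, and prune completed entries from the futures 
map (all names here are illustrative):

{code:java}
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Sketch: a tracker that waits on a queue rather than spinning, and
 *  removes finished futures so the map cannot grow without bound. */
public class RenewerTrackerSketch implements Runnable {
  private final LinkedBlockingQueue<Runnable> events = new LinkedBlockingQueue<>();
  private final Map<Runnable, Future<?>> futures = new ConcurrentHashMap<>();
  private final ExecutorService pool = Executors.newFixedThreadPool(4);

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        // Blocking poll: no busy loop while the cluster is idle.
        Runnable event = events.poll(1, TimeUnit.SECONDS);
        if (event != null) {
          futures.put(event, pool.submit(event));
        }
        // Prune completed entries so the map does not leak.
        for (Iterator<Future<?>> it = futures.values().iterator(); it.hasNext(); ) {
          if (it.next().isDone()) {
            it.remove();
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
{code}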



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16060) There is an inconsistency between replicas of datanodes when hardware is abnormal

2022-05-12 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17535888#comment-17535888
 ] 

Daniel Ma edited comment on HDFS-16060 at 5/12/22 6:27 AM:
---

[~ferhui] 

Thanks for your report; I have encountered a similar issue.

There are a few points I would like to confirm:

1- Does the write operation succeed when the exception appears in the DataNode 
log?

2- Does the read operation also fail?


was (Author: daniel ma):
[~ferhui] 

Thanks for your report; I have encountered a similar issue.

There are a few points I would like to confirm:

1- Does the write operation succeed even with the exception info in the 
DataNode log?

2- Does the read operation also fail?

> There is an inconsistency between replicas of datanodes when hardware is 
> abnormal
> 
>
> Key: HDFS-16060
> URL: https://issues.apache.org/jira/browse/HDFS-16060
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Hui Fei
>Priority: Major
>
> We found the following case in a production environment:
>  * replicas of the same block are stored on dn1 and dn2.
>  * the replicas on dn1 and dn2 are different.
>  * verifying meta & data for the replica succeeds on dn1, and likewise on dn2.
> The user code is just copyFromLocal.
> We first found some error logs on the datanode:
> {quote}
> 2021-05-27 04:54:20,471 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Checksum error in block 
> BP-1453431581-x.x.x.x-1531302155027:blk_13892199285_12902824176 from 
> /y.y.y.y:47960
> org.apache.hadoop.fs.ChecksumException: Checksum error: 
> DFSClient_NONMAPREDUCE_-1760730985_129 at 0 exp: 37939694 got: -1180138774
>  at 
> org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray(Native 
> Method)
>  at 
> org.apache.hadoop.util.NativeCrc32.verifyChunkedSumsByteArray(NativeCrc32.java:69)
>  at 
> org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:347)
>  at 
> org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:294)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:438)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:582)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:885)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
>  at java.lang.Thread.run(Thread.java:748)
> {quote}
> After this, a new pipeline is created and then wrong data and meta are 
> written to the disk file.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16060) There is an inconsistency between replicas of datanodes when hardware is abnormal

2022-05-12 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17535888#comment-17535888
 ] 

Daniel Ma edited comment on HDFS-16060 at 5/12/22 6:22 AM:
---

[~ferhui] 

Thanks for your report; I have encountered a similar issue.

There are a few points I would like to confirm:

1- Does the write operation succeed even with the exception info in the 
DataNode log?

2- Does the read operation also fail?


was (Author: daniel ma):
[~ferhui] 

Thanks for your report; I have encountered a similar issue.

There are a few points I would like to confirm:

1- Does the write operation succeed even with the exception info in the 
DataNode log?

2- Does the read operation fail owing to inconsistent replica data and meta?

> There is an inconsistency between replicas of datanodes when hardware is 
> abnormal
> 
>
> Key: HDFS-16060
> URL: https://issues.apache.org/jira/browse/HDFS-16060
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Hui Fei
>Priority: Major
>
> We found the following case in a production environment:
>  * replicas of the same block are stored on dn1 and dn2.
>  * the replicas on dn1 and dn2 are different.
>  * verifying meta & data for the replica succeeds on dn1, and likewise on dn2.
> The user code is just copyFromLocal.
> We first found some error logs on the datanode:
> {quote}
> 2021-05-27 04:54:20,471 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Checksum error in block 
> BP-1453431581-x.x.x.x-1531302155027:blk_13892199285_12902824176 from 
> /y.y.y.y:47960
> org.apache.hadoop.fs.ChecksumException: Checksum error: 
> DFSClient_NONMAPREDUCE_-1760730985_129 at 0 exp: 37939694 got: -1180138774
>  at 
> org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray(Native 
> Method)
>  at 
> org.apache.hadoop.util.NativeCrc32.verifyChunkedSumsByteArray(NativeCrc32.java:69)
>  at 
> org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:347)
>  at 
> org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:294)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:438)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:582)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:885)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
>  at java.lang.Thread.run(Thread.java:748)
> {quote}
> After this, a new pipeline is created and then wrong data and meta are 
> written to the disk file.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16060) There is an inconsistency between replicas of datanodes when hardware is abnormal

2022-05-12 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17535888#comment-17535888
 ] 

Daniel Ma commented on HDFS-16060:
--

[~ferhui] 

Thanks for your report; I have encountered a similar issue.

There are a few points I would like to confirm:

1- Does the write operation succeed even with the exception info in the 
DataNode log?

2- Does the read operation fail owing to inconsistent replica data and meta?

> There is an inconsistency between replicas of datanodes when hardware is 
> abnormal
> 
>
> Key: HDFS-16060
> URL: https://issues.apache.org/jira/browse/HDFS-16060
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Hui Fei
>Priority: Major
>
> We found the following case in a production environment:
>  * replicas of the same block are stored on dn1 and dn2.
>  * the replicas on dn1 and dn2 are different.
>  * verifying meta & data for the replica succeeds on dn1, and likewise on dn2.
> The user code is just copyFromLocal.
> We first found some error logs on the datanode:
> {quote}
> 2021-05-27 04:54:20,471 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Checksum error in block 
> BP-1453431581-x.x.x.x-1531302155027:blk_13892199285_12902824176 from 
> /y.y.y.y:47960
> org.apache.hadoop.fs.ChecksumException: Checksum error: 
> DFSClient_NONMAPREDUCE_-1760730985_129 at 0 exp: 37939694 got: -1180138774
>  at 
> org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray(Native 
> Method)
>  at 
> org.apache.hadoop.util.NativeCrc32.verifyChunkedSumsByteArray(NativeCrc32.java:69)
>  at 
> org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:347)
>  at 
> org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:294)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:438)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:582)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:885)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
>  at java.lang.Thread.run(Thread.java:748)
> {quote}
> After this, a new pipeline is created and then wrong data and meta are 
> written to the disk file.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-08 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377326#comment-17377326
 ] 

Daniel Ma commented on HDFS-15796:
--

[~sodonnell]

Yes, the NameNode will exit when the exception happens, which results in an 
unexpected NameNode switchover.

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Critical
> Attachments: HDFS-15796-0001.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  
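
A minimal sketch of the usual fix for this pattern: iterate over a snapshot (or 
hold the lock for the whole iteration) rather than letting another thread mutate 
an ArrayList while the RedundancyMonitor iterates it (all names illustrative):

{code:java}
import java.util.ArrayList;
import java.util.List;

/** Sketch: take a snapshot before iterating, so a concurrent add/remove on
 *  the shared list cannot throw ConcurrentModificationException. */
public class SnapshotIterationSketch {
  private final List<String> blocksToReconstruct = new ArrayList<>();

  void computeReconstructionWork() {
    List<String> snapshot;
    synchronized (blocksToReconstruct) {
      snapshot = new ArrayList<>(blocksToReconstruct);
    }
    for (String block : snapshot) {
      // Safe: later mutations of blocksToReconstruct do not affect this loop.
      process(block);
    }
  }

  private void process(String block) { /* schedule reconstruction work */ }
}
{code}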



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16115) Asynchronous BPServiceActor command handling may result in BPServiceActor never failing even when CommandProcessingThread is closed with a fatal error.

2021-07-07 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma reassigned HDFS-16115:


Assignee: Daniel Ma

> Asynchronous BPServiceActor command handling may result in 
> BPServiceActor never failing even when CommandProcessingThread is closed with 
> a fatal error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode asynchronously 
> (CommandProcessingThread handles the commands), so if any exception or error 
> occurs in CommandProcessingThread, that thread fails and stops. BPServiceActor 
> cannot be aware of this and keeps putting commands from the NameNode into the 
> queue, waiting for them to be handled by a CommandProcessingThread that is 
> actually already dead.
> 2- The second sub-issue builds on the first: if CommandProcessingThread died 
> owing to a non-fatal error like "unable to create new native thread" (caused 
> by too many threads existing in the OS), this kind of problem should be given 
> much more tolerance instead of simply shutting the thread down with no 
> automatic recovery, because the non-fatal errors mentioned above can probably 
> be recovered from by themselves soon:
> {code:java}
> // code placeholder
> 2021-07-02 16:26:02,315 | WARN  | Command processor | Exception happened when 
> process queue BPServiceActor.java:1393
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365)
> {code}
> Currently, the DataNode's BPServiceActor cannot return to normal even after 
> the non-fatal error has been eliminated.
> Therefore, this patch does two things:
> 1- Adds a retry mechanism to the BPServiceActor and CommandProcessingThread 
> threads; the retry limit is 5 by default and configurable;
> 2- Adds a periodic monitor thread in BPOfferService: if a BPServiceActor 
> thread dies owing to repeated non-fatal errors, it should not simply be 
> removed from the BPServiceActor list stored in BPOfferService; instead, the 
> monitor thread will periodically try to restart these dead BPServiceActor 
> threads. The interval is also configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16117) Add file count info to the audit log for delete and getListing RPC requests to assist user troubleshooting when RPC cost is increasing

2021-07-07 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376938#comment-17376938
 ] 

Daniel Ma commented on HDFS-16117:
--

Hello [~sodonnell],

Could you please help review this patch?

> Add file count info to the audit log for delete and getListing RPC requests 
> to assist user troubleshooting when RPC cost is increasing 
> 
>
> Key: HDFS-16117
> URL: https://issues.apache.org/jira/browse/HDFS-16117
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Major
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16117.patch
>
>
> Currently, there is no file count in the audit log for delete and getListing 
> RPC requests. Therefore, when RPC cost is increasing, it is not easy to figure 
> out whether a time-consuming RPC request is related to too many files being 
> operated on in that request.
>  
> It is therefore necessary to add file count info to the audit log to assist 
> maintenance.
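
A minimal sketch of the idea: append a file count to the audit entry for delete 
and getListing so slow RPCs can be correlated with large operations (the logger 
name and entry format here are illustrative, not the actual Hadoop audit layout):

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Sketch: include how many files a delete/getListing touched in the
 *  audit entry emitted for that RPC. */
public class AuditFileCountSketch {
  private static final Logger AUDIT_LOG =
      LoggerFactory.getLogger("FSNamesystem.audit");

  void logAuditEvent(String cmd, String src, int fileCount) {
    // e.g. cmd=delete  src=/user/foo  fileCount=152340
    AUDIT_LOG.info("cmd={}\tsrc={}\tfileCount={}", cmd, src, fileCount);
  }
}
{code}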



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16117) Add file count info to the audit log for delete and getListing RPC requests to assist user troubleshooting when RPC cost is increasing

2021-07-07 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma reassigned HDFS-16117:


Assignee: Daniel Ma

> Add file count info to the audit log for delete and getListing RPC requests 
> to assist user troubleshooting when RPC cost is increasing 
> 
>
> Key: HDFS-16117
> URL: https://issues.apache.org/jira/browse/HDFS-16117
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Major
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16117.patch
>
>
> Currently, there is no file count in the audit log for delete and getListing 
> RPC requests. Therefore, when RPC cost is increasing, it is not easy to figure 
> out whether a time-consuming RPC request is related to too many files being 
> operated on in that request.
>  
> It is therefore necessary to add file count info to the audit log to assist 
> maintenance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16093) DataNodes under decommission will still be returned to the client via getLocatedBlocks, so the client may request decommissioning datanodes to read which will cause badl

2021-07-07 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma reassigned HDFS-16093:


Assignee: Daniel Ma

> DataNodes under decommission will still be returned to the client via 
> getLocatedBlocks, so the client may request decommissioning datanodes to read, 
> which will cause bad contention on disk I/O.
> --
>
> Key: HDFS-16093
> URL: https://issues.apache.org/jira/browse/HDFS-16093
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Critical
>
> DataNodes under decommission will still be returned to the client via 
> getLocatedBlocks, so the client may request decommissioning datanodes to read, 
> which will cause bad contention on disk I/O.
> Therefore, datanodes under decommission should be removed from the return 
> list of the getLocatedBlocks API.
> !image-2021-06-29-10-50-44-739.png!
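
A hedged sketch of the proposed filtering. DatanodeInfo.isDecommissionInProgress() 
is the real Hadoop predicate for this state; the surrounding helper is 
illustrative:

{code:java}
import java.util.List;
import java.util.stream.Collectors;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

/** Sketch: drop DataNodes that are being decommissioned from located-block
 *  replies so clients do not read from them while they drain. */
public class LocationFilterSketch {
  static List<DatanodeInfo> excludeDecommissioning(List<DatanodeInfo> locations) {
    return locations.stream()
        .filter(dn -> !dn.isDecommissionInProgress())
        .collect(Collectors.toList());
  }
}
{code}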



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-07 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376586#comment-17376586
 ] 

Daniel Ma edited comment on HDFS-15796 at 7/7/21, 1:52 PM:
---

[~sodonnell]

Yes, it is more elegant.(y)


was (Author: daniel ma):
[~sodonnell]

Yes, it is more elegant.

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: HDFS-15796-0001.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-07 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-15796:
-
Attachment: HDFS-15796-0001.patch

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: HDFS-15796-0001.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-07 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-15796:
-
Attachment: (was: 0002-HDFS-15796.patch)

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-07 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-15796:
-
Attachment: (was: 0003-HDFS-15796.patch)

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-07 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-15796:
-
Attachment: (was: 0001-HDFS-15796.patch)

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-07 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376586#comment-17376586
 ] 

Daniel Ma commented on HDFS-15796:
--

[~sodonnell]

Yes, it is more elegant.

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch, 0002-HDFS-15796.patch, 
> 0003-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-07 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-15796:
-
Attachment: 0003-HDFS-15796.patch

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch, 0002-HDFS-15796.patch, 
> 0003-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-07 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376571#comment-17376571
 ] 

Daniel Ma commented on HDFS-16115:
--

Hello [~hexiaoqiao].

Thanks for the review.

1-For non-fatal errors, I define two kinds at present:

These two errors are caused by too many threads in the OS and too many open 
files in the OS, conditions which can probably recover soon.

Even if the OS limit cannot recover proactively, users expect HDFS to recover 
automatically after manual intervention.

 
{code:java}
// Error messages classified as non-fatal: conditions likely to clear up
// on their own or after a quick manual fix.
enum NON_FATAL_TYPES {
  THREAD_EXCEED("unable to create new native thread"),
  FILE_EXCEED("Too many open files");

  private final String errorMsg;

  NON_FATAL_TYPES(String errorMsg) {
    this.errorMsg = errorMsg;
  }

  public String getErrorMsg() {
    return errorMsg;
  }
}
{code}
2-The main defect of HDFS-15651 is that the BPServiceActor thread never comes 
back to normal unless the DataNode is restarted, which is usually unacceptable 
in production environments, as reported by real users. That is why I developed 
this feature. 
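
To make the retry behavior concrete, here is a minimal sketch (not the actual 
patch) built on the NON_FATAL_TYPES enum above; the NonFatalRetry class name 
and its run() hook are illustrative assumptions:
{code:java}
// Sketch: retry a command a configurable number of times when the failure
// matches a known non-fatal error message; otherwise fail immediately.
class NonFatalRetry {
  private final int maxRetries; // 5 by default in the proposal, configurable

  NonFatalRetry(int maxRetries) {
    this.maxRetries = maxRetries;
  }

  private boolean isNonFatal(Throwable t) {
    String msg = t.getMessage();
    if (msg == null) {
      return false;
    }
    for (NON_FATAL_TYPES type : NON_FATAL_TYPES.values()) {
      if (msg.contains(type.getErrorMsg())) {
        return true;
      }
    }
    return false;
  }

  void run(Runnable command) {
    for (int attempt = 1; ; attempt++) {
      try {
        command.run();
        return;
      } catch (Throwable t) {
        // Fatal error, or retries exhausted: rethrow so the caller can
        // stop the corresponding BPServiceActor as described above.
        if (!isNonFatal(t) || attempt >= maxRetries) {
          throw t;
        }
        // Non-fatal: fall through and retry; the OS limit may recover soon.
      }
    }
  }
}
{code}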

 

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode in an 
> asynchronous way (CommandProcessThread handles the commands), so if any 
> exception or error happens in the CommandProcessThread, the thread fails and 
> stops, of which BPServiceActor is not aware; it still keeps putting commands 
> from the NameNode into queues waiting to be handled by CommandProcessThread, 
> when actually CommandProcessThread is already dead.
> 2- The second sub-issue is based on the first one: if CommandProcessThread 
> died owing to some non-fatal error like "can not create native thread", which 
> is caused by too many threads in the OS, this kind of problem should be given 
> much more tolerance instead of simply shutting down the thread and never 
> recovering automatically, because the non-fatal errors mentioned above can 
> probably recover soon by themselves.
> {code:java}
> 2021-07-02 16:26:02,315 | WARN  | Command processor | Exception happened when 
> process queue BPServiceActor.java:1393
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365)
> {code}
> Currently, the DataNode BPServiceActor cannot return to normal even after the 
> non-fatal error is eliminated.
> Therefore, in this patch, two things will be done:
> 1- Add a retry mechanism to the BPServiceActor and CommandProcessThread 
> threads, with 5 retries by default and configurable;
> 2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor 
> thread died owing to too many non-fatal errors, it should 

[jira] [Comment Edited] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-07 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376429#comment-17376429
 ] 

Daniel Ma edited comment on HDFS-15796 at 7/7/21, 11:57 AM:


[~sodonnell]

yes, I also think it is a better way:
{code}
A better approach may be to return a new ArrayList from getTargets, eg:
{code}
The patch is updated. Pls help to review.
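
For context, here is a minimal sketch of that defensive-copy approach under a 
simplified, assumed container shape; the class and field names are 
illustrative, not the actual BlockManager code:
{code:java}
// Sketch of the "return a copy" fix: hand callers a snapshot so their
// iteration cannot race with later modifications of the internal list.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PendingTargetsSketch<B, D> {
  private final Map<B, List<D>> pending = new HashMap<>();

  // Returning a new ArrayList means the caller iterates over a private
  // copy; a concurrent add/remove on the internal list can no longer
  // trigger ConcurrentModificationException in the caller's loop.
  synchronized List<D> getTargets(B block) {
    List<D> targets = pending.get(block);
    return targets == null ? null : new ArrayList<>(targets);
  }

  synchronized void addTarget(B block, D target) {
    pending.computeIfAbsent(block, k -> new ArrayList<>()).add(target);
  }
}
{code}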


was (Author: daniel ma):
[~sodonnell]

yes, I totally agree with the solution you raised.

I will work on it and update the patch.

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch, 0002-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-07 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-15796:
-
Attachment: 0002-HDFS-15796.patch

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch, 0002-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-07 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376429#comment-17376429
 ] 

Daniel Ma commented on HDFS-15796:
--

[~sodonnell]

yes, I totally agree with the solution you raised.

I will work on it and update the patch.

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16117) Add file count info in audit log to record the file count for delete and getListing RPC request to assist user trouble shooting when RPC cost is increasing

2021-07-07 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16117:
-
Summary: Add file count info in audit log to record the file count for 
delete and getListing RPC request to assist user trouble shooting when RPC cost 
is increasing   (was: Add file count info in audit log to record the file count 
count in delete and getListing RPC request to assist user trouble shooting when 
RPC cost is increasing )

> Add file count info in audit log to record the file count for delete and 
> getListing RPC request to assist user trouble shooting when RPC cost is 
> increasing 
> 
>
> Key: HDFS-16117
> URL: https://issues.apache.org/jira/browse/HDFS-16117
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Major
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16117.patch
>
>
> Currently, there is no file count in the audit log for delete and getListing 
> RPC requests; therefore, when RPC cost is increasing, it is not easy to 
> figure out whether a time-consuming RPC request is related to too many files 
> being operated on in that request.
>  
> Therefore, it is necessary to add file count info to the audit log to assist 
> maintenance. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16117) Add file count info in audit log to record the file count count in delete and getListing RPC request to assist user trouble shooting when RPC cost is increasing

2021-07-07 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16117:
-
Attachment: 0001-HDFS-16117.patch

> Add file count info in audit log to record the file count count in delete and 
> getListing RPC request to assist user trouble shooting when RPC cost is 
> increasing 
> -
>
> Key: HDFS-16117
> URL: https://issues.apache.org/jira/browse/HDFS-16117
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Major
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16117.patch
>
>
> Currently, there is no file count in the audit log for delete and getListing 
> RPC requests; therefore, when RPC cost is increasing, it is not easy to 
> figure out whether a time-consuming RPC request is related to too many files 
> being operated on in that request.
>  
> Therefore, it is necessary to add file count info to the audit log to assist 
> maintenance. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16117) Add file count info in audit log to record the file count count in delete and getListing RPC request to assist user trouble shooting when RPC cost is increasing

2021-07-07 Thread Daniel Ma (Jira)
Daniel Ma created HDFS-16117:


 Summary: Add file count info in audit log to record the file count 
count in delete and getListing RPC request to assist user trouble shooting when 
RPC cost is increasing 
 Key: HDFS-16117
 URL: https://issues.apache.org/jira/browse/HDFS-16117
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 3.3.1
Reporter: Daniel Ma
 Fix For: 3.3.1


Currently, there is no file count in the audit log for delete and getListing 
RPC requests; therefore, when RPC cost is increasing, it is not easy to figure 
out whether a time-consuming RPC request is related to too many files being 
operated on in that request.

 

Therefore, it is necessary to add file count info to the audit log to assist 
maintenance. 
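
As a purely hypothetical illustration of the idea (not the actual NameNode 
audit logger or its format), a helper that appends the operation's file count 
to an audit record might look like this:
{code:java}
// Hypothetical sketch: append a fileCount field to an audit record so that
// slow delete/getListing calls can be correlated with how many files the
// request touched.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class AuditLogSketch {
  private static final Logger AUDIT =
      LoggerFactory.getLogger("hdfs.audit.sketch");

  static void logAuditEvent(String user, String cmd, String src,
                            int fileCount) {
    // e.g. "ugi=alice cmd=delete src=/data/tmp fileCount=120000"
    AUDIT.info("ugi={} cmd={} src={} fileCount={}", user, cmd, src, fileCount);
  }
}
{code}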



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16094) HDFS balancer process start failed owing to daemon pid file is not cleared in some exception senario

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16094:
-
Target Version/s: 3.1.1  (was: 3.4.0)

> HDFS balancer process start failed owing to daemon pid file is not cleared in 
> some exception senario
> 
>
> Key: HDFS-16094
> URL: https://issues.apache.org/jira/browse/HDFS-16094
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: scripts
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Major
>
> The HDFS balancer process fails to start because the daemon pid file is not 
> cleared in some exception scenarios, and there is no useful information in 
> the log to troubleshoot with, as below:
> {code}
> hadoop_error "${daemonname} is running as process $(cat "${daemon_pidfile}")
> {code}
> But actually, the process is not running, contrary to what the error message 
> above claims.
> Therefore, more explicit information should be printed in the error log to 
> guide users to clear the pid file, including where the pid file is located.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376146#comment-17376146
 ] 

Daniel Ma edited comment on HDFS-16115 at 7/7/21, 2:01 AM:
---

[~hexiaoqiao] Thanks for your review and tips. Actually, the issue that my 
patch attempts to solve is totally different from the one HDFS-15651 mentioned.

I had noticed that Jira previously, but it cannot solve my issue perfectly.

What I try to solve in this patch is:

1-Once the CommandProcess thread catches a non-fatal error or exception, there 
will be 5 retries instead of simply interrupting it, and after it reaches the 
max retry count, we need to stop the corresponding BPServiceActor thread as 
well.

In HDFS-15651, no matter what kind of error it is, the thread is simply 
closed, but there are many non-fatal errors that can probably recover 
automatically, like "cannot create native thread"; once the OS thread count 
drops, the BPServiceActor service is still dead and cannot recover by itself.

2-In my patch, for non-fatal errors, the BPOfferService thread always runs a 
periodic task to try to recover any BPServiceActor thread that died owing to a 
non-fatal error, which is the essential difference between my patch and 
HDFS-15651.

 


was (Author: daniel ma):
[~hexiaoqiao] Thanks for your review and tips. Actually, the issue that my 
patch attempts to solve is totally different from the one HDFS-15651 mentioned.

I had noticed that Jira previously, but it cannot solve my issue perfectly.

What I try to solve in this patch is:

1-Once the CommandProcess thread is dead, we need to stop the corresponding 
BPServiceActor thread as well.

2-In my patch, for non-fatal errors, the BPOfferService thread always runs a 
periodic task to try to recover any BPServiceActor thread that died owing to a 
non-fatal error, which is the essential difference between my patch and 
HDFS-15651.

 

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode in an 
> asynchronous way (CommandProcessThread handles the commands), so if any 
> exception or error happens in the CommandProcessThread, the thread fails and 
> stops, of which BPServiceActor is not aware; it still keeps putting commands 
> from the NameNode into queues waiting to be handled by CommandProcessThread, 
> when actually CommandProcessThread is already dead.
> 2- The second sub-issue is based on the first one: if CommandProcessThread 
> died owing to some non-fatal error like "can not create native thread", which 
> is caused by too many threads in the OS, this kind of problem should be given 
> much more tolerance instead of simply shutting down the thread and never 
> recovering automatically, because the non-fatal errors mentioned above can 
> probably recover soon by themselves.
> {code:java}
> 2021-07-02 16:26:02,315 | WARN  | Command processor | Exception happened when 
> process queue BPServiceActor.java:1393
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
> at 
> 

[jira] [Comment Edited] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375401#comment-17375401
 ] 

Daniel Ma edited comment on HDFS-16115 at 7/7/21, 1:47 AM:
---

Hello [~brahmareddy],[~hemant]

Could you pls help to review this patch? Thanks.


was (Author: daniel ma):
Hello [~brahmareddy],[~hemant]

Pls help to review this patch. Thanks.

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode in an 
> asynchronous way (CommandProcessThread handles the commands), so if any 
> exception or error happens in the CommandProcessThread, the thread fails and 
> stops, of which BPServiceActor is not aware; it still keeps putting commands 
> from the NameNode into queues waiting to be handled by CommandProcessThread, 
> when actually CommandProcessThread is already dead.
> 2- The second sub-issue is based on the first one: if CommandProcessThread 
> died owing to some non-fatal error like "can not create native thread", which 
> is caused by too many threads in the OS, this kind of problem should be given 
> much more tolerance instead of simply shutting down the thread and never 
> recovering automatically, because the non-fatal errors mentioned above can 
> probably recover soon by themselves.
> {code:java}
> 2021-07-02 16:26:02,315 | WARN  | Command processor | Exception happened when 
> process queue BPServiceActor.java:1393
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365)
> {code}
> Currently, the DataNode BPServiceActor cannot return to normal even after the 
> non-fatal error is eliminated.
> Therefore, in this patch, two things will be done:
> 1- Add a retry mechanism to the BPServiceActor and CommandProcessThread 
> threads, with 5 retries by default and configurable;
> 2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor 
> thread died owing to too many non-fatal errors, it should not be simply 
> removed from the BPServiceActor list stored in BPOfferService; instead, the 
> monitor thread will periodically try to restart these dead BPServiceActor 
> threads. The interval is also configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375401#comment-17375401
 ] 

Daniel Ma edited comment on HDFS-16115 at 7/7/21, 1:46 AM:
---

Hello [~brahmareddy],[~hemant]

Pls help to review this patch. Thanks.


was (Author: daniel ma):
Hello [~brahmareddy],

[~ayush]

Pls help to review this patch. Thanks.

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode in an 
> asynchronous way (CommandProcessThread handles the commands), so if any 
> exception or error happens in the CommandProcessThread, the thread fails and 
> stops, of which BPServiceActor is not aware; it still keeps putting commands 
> from the NameNode into queues waiting to be handled by CommandProcessThread, 
> when actually CommandProcessThread is already dead.
> 2- The second sub-issue is based on the first one: if CommandProcessThread 
> died owing to some non-fatal error like "can not create native thread", which 
> is caused by too many threads in the OS, this kind of problem should be given 
> much more tolerance instead of simply shutting down the thread and never 
> recovering automatically, because the non-fatal errors mentioned above can 
> probably recover soon by themselves.
> {code:java}
> 2021-07-02 16:26:02,315 | WARN  | Command processor | Exception happened when 
> process queue BPServiceActor.java:1393
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365)
> {code}
> Currently, the DataNode BPServiceActor cannot return to normal even after the 
> non-fatal error is eliminated.
> Therefore, in this patch, two things will be done:
> 1- Add a retry mechanism to the BPServiceActor and CommandProcessThread 
> threads, with 5 retries by default and configurable;
> 2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor 
> thread died owing to too many non-fatal errors, it should not be simply 
> removed from the BPServiceActor list stored in BPOfferService; instead, the 
> monitor thread will periodically try to restart these dead BPServiceActor 
> threads. The interval is also configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376146#comment-17376146
 ] 

Daniel Ma commented on HDFS-16115:
--

[~hexiaoqiao] Thanks for your review and tips. Actually, the issue that my 
patch attempts to solve is totally different from the one HDFS-15651 mentioned.

I had noticed that Jira previously, but it cannot solve my issue perfectly.

What I try to solve in this patch is:

1-Once the CommandProcess thread is dead, we need to stop the corresponding 
BPServiceActor thread as well.

2-In my patch, for non-fatal errors, the BPOfferService thread always runs a 
periodic task to try to recover any BPServiceActor thread that died owing to a 
non-fatal error, which is the essential difference between my patch and 
HDFS-15651.

 

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode in an 
> asynchronous way (CommandProcessThread handles the commands), so if any 
> exception or error happens in the CommandProcessThread, the thread fails and 
> stops, of which BPServiceActor is not aware; it still keeps putting commands 
> from the NameNode into queues waiting to be handled by CommandProcessThread, 
> when actually CommandProcessThread is already dead.
> 2- The second sub-issue is based on the first one: if CommandProcessThread 
> died owing to some non-fatal error like "can not create native thread", which 
> is caused by too many threads in the OS, this kind of problem should be given 
> much more tolerance instead of simply shutting down the thread and never 
> recovering automatically, because the non-fatal errors mentioned above can 
> probably recover soon by themselves.
> {code:java}
> 2021-07-02 16:26:02,315 | WARN  | Command processor | Exception happened when 
> process queue BPServiceActor.java:1393
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365)
> {code}
> Currently, the DataNode BPServiceActor cannot return to normal even after the 
> non-fatal error is eliminated.
> Therefore, in this patch, two things will be done:
> 1- Add a retry mechanism to the BPServiceActor and CommandProcessThread 
> threads, with 5 retries by default and configurable;
> 2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor 
> thread died owing to too many non-fatal errors, it should not be simply 
> removed from the BPServiceActor list stored in BPOfferService; instead, the 
> monitor thread will periodically try to restart these dead BPServiceActor 
> threads. The interval is also configurable.



--
This message was sent by 

[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode in an 
asynchronous way (CommandProcessThread handles the commands), so if any 
exception or error happens in the CommandProcessThread, the thread fails and 
stops, of which BPServiceActor is not aware; it still keeps putting commands 
from the NameNode into queues waiting to be handled by CommandProcessThread, 
when actually CommandProcessThread is already dead.

2- The second sub-issue is based on the first one: if CommandProcessThread 
died owing to some non-fatal error like "can not create native thread", which 
is caused by too many threads in the OS, this kind of problem should be given 
much more tolerance instead of simply shutting down the thread and never 
recovering automatically, because the non-fatal errors mentioned above can 
probably recover soon by themselves.
{code:java}
2021-07-02 16:26:02,315 | WARN  | Command processor | Exception happened when 
process queue BPServiceActor.java:1393
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365)

{code}
Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

Therefore, in this patch, two things will be done:

1- Add a retry mechanism to the BPServiceActor and CommandProcessThread 
threads, with 5 retries by default and configurable;

2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor thread 
died owing to too many non-fatal errors, it should not be simply removed from 
the BPServiceActor list stored in BPOfferService; instead, the monitor thread 
will periodically try to restart these dead BPServiceActor threads. The 
interval is also configurable.
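
For illustration (not the actual patch), a minimal sketch of such a periodic 
monitor; the Actor interface and its isDeadFromNonFatalError()/restart() hooks 
are assumed names:
{code:java}
// Sketch: periodically scan for actors that died from non-fatal errors and
// try to restart them, instead of removing them from the actor list.
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ActorMonitorSketch {
  interface Actor {
    boolean isDeadFromNonFatalError(); // illustrative hook
    void restart();                    // illustrative hook
  }

  private final List<Actor> actors = new CopyOnWriteArrayList<>();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void register(Actor actor) {
    actors.add(actor);
  }

  // intervalSeconds plays the role of the configurable interval above.
  void start(long intervalSeconds) {
    scheduler.scheduleAtFixedRate(() -> {
      for (Actor actor : actors) {
        if (actor.isDeadFromNonFatalError()) {
          actor.restart(); // keep the actor in the list and revive it
        }
      }
    }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
  }
}
{code}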

  was:
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode in an 
asynchronous way (CommandProcessThread handles the commands), so if any 
exception or error happens in the CommandProcessThread, the thread fails and 
stops, of which BPServiceActor is not aware; it still keeps putting commands 
from the NameNode into queues waiting to be handled by CommandProcessThread, 
when actually CommandProcessThread is already dead.

2- The second sub-issue is based on the first one: if CommandProcessThread 
died owing to some non-fatal error like "can not create native thread", which 
is caused by too many threads in the OS, this kind of problem should be given 
much more tolerance instead of simply shutting down the thread and never 
recovering automatically, because the non-fatal errors mentioned above can 
probably recover soon by themselves.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

Therefore, in this patch, two things will be done:

1- Add a retry mechanism to the BPServiceActor and CommandProcessThread 
threads, with 5 retries by default and configurable;

2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor thread 
died owing to too many non-fatal errors, it should not be simply removed from 
the BPServiceActor list stored in BPOfferService; instead, the monitor thread 
will periodically try to start these special 

[jira] [Commented] (HDFS-16098) ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375427#comment-17375427
 ] 

Daniel Ma commented on HDFS-16098:
--

[~wangyanfu]

Thanks for reporting this issue.

Could you pls share more details about the error stack?

> ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException
> ---
>
> Key: HDFS-16098
> URL: https://issues.apache.org/jira/browse/HDFS-16098
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: diskbalancer
>Affects Versions: 2.6.0
> Environment: VERSION info:
> Hadoop 2.6.0-cdh5.14.4
>Reporter: wangyanfu
>Priority: Blocker
>  Labels: diskbalancer
> Fix For: 2.6.0
>
> Attachments: image-2021-07-01-18-34-54-905.png, on-branch-3.1.jpg
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> When I tried to run
> hdfs diskbalancer -plan $(hostname -f)
> I got this notice:
> 21/06/30 11:30:41 ERROR tools.DiskBalancerCLI: 
> java.lang.IllegalArgumentException
> Then I tried writing the real hostname into my command; it did not work, and 
> gave the same error notice.
> I also tried using --plan instead of -plan; it did not work, and gave the 
> same error notice.
> I found this 
> [link|https://community.cloudera.com/t5/Support-Questions/Error-trying-to-balance-disks-on-node/m-p/59989#M54850]
> but there is no solution there. Can somebody help me?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375421#comment-17375421
 ] 

Daniel Ma edited comment on HDFS-15796 at 7/6/21, 9:57 AM:
---

[~sodonnell]

Thanks for reviewing. Actually you missed the for loop here:
{code:java}
synchronized (pendingReconstruction) {
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }
}
{code}
The problem happens when the code above tries to traverse the DataNodes stored 
in the pendingReconstruction object while the DataNode list is also being 
modified elsewhere.

In other words, if you modify a List (delete or add an element) and visit it 
at the same time, a ConcurrentModificationException will be thrown.
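
For reference, a minimal self-contained demonstration of that failure mode:
{code:java}
// Demonstrates ConcurrentModificationException: the list iterator detects a
// structural modification made outside of it during iteration.
import java.util.ArrayList;
import java.util.List;

public class CmeDemo {
  public static void main(String[] args) {
    List<String> targets = new ArrayList<>();
    targets.add("dn1");
    targets.add("dn2");
    targets.add("dn3");

    for (String dn : targets) {    // iterates over the live list
      if (dn.equals("dn1")) {
        targets.remove(dn);        // structural change mid-iteration
      }
    }
    // The next iterator step after the remove throws
    // java.util.ConcurrentModificationException.
  }
}
{code}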


was (Author: daniel ma):
[~sodonnell]

Thanks for reviewing. Actually you missed the for loop here:
{code:java}
synchronized (pendingReconstruction) {
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }
}
{code}
The problem happens when the code above tries to traverse the DataNodes stored 
in the pendingReconstruction object while the DataNode list is also being 
modified elsewhere.

In other words, if you modify a List (delete or add an element) and visit it 
at the same time, a ConcurrentModificationException will be thrown.

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375421#comment-17375421
 ] 

Daniel Ma commented on HDFS-15796:
--

[~sodonnell]

Thanks for reviewing. Actually you missed the for loop here:
{code:java}
synchronized (pendingReconstruction) {
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }
}
{code}
The problem happens when the code above tries to traverse the DataNodes stored 
in the pendingReconstruction object while the DataNode list is also being 
modified elsewhere.

In other words, if you modify a List (delete or add an element) and visit it 
at the same time, a ConcurrentModificationException will be thrown.

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode in an 
asynchronous way (CommandProcessThread handles the commands), so if any 
exception or error happens in the CommandProcessThread, the thread fails and 
stops, of which BPServiceActor is not aware; it still keeps putting commands 
from the NameNode into queues waiting to be handled by CommandProcessThread, 
when actually CommandProcessThread is already dead.

2- The second sub-issue is based on the first one: if CommandProcessThread 
died owing to some non-fatal error like "can not create native thread", which 
is caused by too many threads in the OS, this kind of problem should be given 
much more tolerance instead of simply shutting down the thread and never 
recovering automatically, because the non-fatal errors mentioned above can 
probably recover soon by themselves.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

Therefore, in this patch, two things will be done:

1- Add a retry mechanism to the BPServiceActor and CommandProcessThread 
threads, with 5 retries by default and configurable;

2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor thread 
died owing to too many non-fatal errors, it should not be simply removed from 
the BPServiceActor list stored in BPOfferService; instead, the monitor thread 
will periodically try to restart these dead BPServiceActor threads. The 
interval is also configurable.

  was:
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode in an 
asynchronous way (CommandProcessThread handles the commands), so if any 
exception or error happens in the CommandProcessThread, the thread fails and 
stops, of which BPServiceActor is not aware; it still keeps putting commands 
from the NameNode into queues waiting to be handled by CommandProcessThread, 
when actually CommandProcessThread is already dead.

2- The second sub-issue is based on the first one: if CommandProcessThread 
died owing to some non-fatal error like "can not create native thread", which 
is caused by too many threads in the OS, this kind of problem should be given 
much more tolerance instead of simply shutting down the thread and never 
recovering automatically, because the non-fatal errors mentioned above can 
probably recover soon by themselves.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

Therefore, in this patch, two things will be done:

1- Add a retry mechanism to the BPServiceActor and CommandProcessThread 
threads, with 5 retries by default and configurable;

2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor thread 
died owing to too many non-fatal errors, it should not be simply removed from 
the BPServiceActor list stored in BPOfferService; instead, the monitor thread 
will periodically try to restart these dead BPService Actor threads. The 
interval is also configurable.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode in an 
> asynchronous way (CommandProcessThread handles the commands), so if any 
> exception or error happens in the CommandProcessThread, the thread fails and 
> stops, of which BPServiceActor is not aware; it still keeps putting commands 
> from the NameNode into queues waiting to be handled by CommandProcessThread, 
> when actually CommandProcessThread is already dead.
> 2- The second sub-issue is based on the first one: if CommandProcessThread 
> died owing to some non-fatal error like "can not create native thread", 
> which is caused by too many threads in the OS, this kind of problem should 
> be given much more tolerance instead of simply shutting down the thread and 
> never recovering automatically, because the non-fatal errors mentioned above 
> can probably recover soon by themselves.
> Currently, the DataNode BPServiceActor 

[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread, even though the 
CommandProcessingThread is already dead.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing in the OS, this kind of problem should 
be given more tolerance instead of simply shutting the thread down and never 
recovering automatically, because the non-fatal errors mentioned above can 
probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread, with a retry limit that is 5 by default and 
configurable;

2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor thread 
dies after too many non-fatal errors, it should not simply be removed from the 
BPServiceActor list stored in BPOfferService; instead, the monitor thread will 
periodically try to restart these dead BPServiceActor threads. The monitor 
interval is also configurable.
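A minimal hedged sketch of the monitor idea in item 2 (Java threads cannot be 
restarted once dead, so the restart hook below would have to create a 
replacement actor; all names here are illustrative, not from the patch):
{code:java}
// Hypothetical sketch: periodically scan the actor list and revive dead
// actors instead of removing them permanently.
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ActorMonitorSketch {
  private final List<Thread> actors = new CopyOnWriteArrayList<>();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  // intervalMs corresponds to the configurable monitor interval.
  void start(long intervalMs) {
    scheduler.scheduleWithFixedDelay(() -> {
      for (Thread actor : actors) {
        if (!actor.isAlive()) {
          restart(actor); // keep it in the list and try to bring it back
        }
      }
    }, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
  }

  private void restart(Thread deadActor) {
    // Recreate and start a replacement thread; omitted in this sketch.
  }
}
{code}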

  was:
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread, even though the 
CommandProcessingThread is already dead.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing in the OS, this kind of problem should 
be given more tolerance instead of simply shutting the thread down and never 
recovering automatically, because the non-fatal errors mentioned above can 
probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread, with a retry limit that is 5 by default and 
configurable;

2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor thread 
dies after too many non-fatal errors, it should not simply be removed from the 
BPServiceActor list stored in BPOfferService; instead, the monitor thread will 
periodically try to restart these dead BPServiceActor threads. The monitor 
interval is also configurable.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode
> asynchronously (the CommandProcessingThread handles the commands), so if any
> exception or error in the CommandProcessingThread causes that thread to fail
> and stop, the BPServiceActor is not aware of it and keeps putting commands
> from the NameNode into the queue to be handled by the
> CommandProcessingThread, even though the CommandProcessingThread is already
> dead.
> 2- The second sub-issue builds on the first: if the CommandProcessingThread
> died owing to a non-fatal error such as "can not create native thread",
> which is caused by too many threads existing in the OS, this kind of problem
> should be given more tolerance instead of simply shutting the thread down
> and never recovering automatically, because the non-fatal errors mentioned
> above can probably be recovered from soon.
> currently, Datanode BPServiceActor 

[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread, even though the 
CommandProcessingThread is already dead.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing in the OS, this kind of problem should 
be given more tolerance instead of simply shutting the thread down and never 
recovering automatically, because the non-fatal errors mentioned above can 
probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread, with a retry limit that is 5 by default and 
configurable;

2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor thread 
dies after too many non-fatal errors, it should not simply be removed from the 
BPServiceActor list stored in BPOfferService; instead, the monitor thread will 
periodically try to restart these dead BPServiceActor threads. The monitor 
interval is also configurable.
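For the first sub-issue, the enqueueing side can also be made to notice a dead 
processor before queueing more work; a hedged sketch of that idea (the wrapper 
below is illustrative and not part of the patch):
{code:java}
// Hypothetical sketch: refuse to enqueue when the consumer thread is dead, so
// the producer fails fast instead of filling a queue nobody will drain.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class GuardedQueueSketch<T> {
  private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
  private final Thread consumer;

  GuardedQueueSketch(Thread consumer) {
    this.consumer = consumer;
  }

  void enqueue(T command) throws InterruptedException {
    if (!consumer.isAlive()) {
      // Surface the failure to the caller (e.g. the enqueueing actor).
      throw new IllegalStateException("command processor is dead");
    }
    queue.put(command);
  }
}
{code}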

  was:
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread, even though the 
CommandProcessingThread is already dead.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing in the OS, this kind of problem should 
be given more tolerance instead of simply shutting the thread down and never 
recovering automatically, because the non-fatal errors mentioned above can 
probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread, with a retry limit that is 5 by default and 
configurable;

2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor thread 
dies after too many non-fatal errors, it should not simply be removed from the 
BPServiceActor list stored in BPOfferService; instead, the monitor thread will 
periodically try to restart these dead BPServiceActor threads. The monitor 
interval is also configurable.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode
> asynchronously (the CommandProcessingThread handles the commands), so if any
> exception or error in the CommandProcessingThread causes that thread to fail
> and stop, the BPServiceActor is not aware of it and keeps putting commands
> from the NameNode into the queue to be handled by the
> CommandProcessingThread, even though the CommandProcessingThread is already
> dead.
> 2- The second sub-issue builds on the first: if the CommandProcessingThread
> died owing to a non-fatal error such as "can not create native thread",
> which is caused by too many threads existing in the OS, this kind of problem
> should be given more tolerance instead of simply shutting the thread down
> and never recovering automatically, because the non-fatal errors mentioned
> above can probably be recovered from soon.
> currently, Datanode BPServiceActor 

[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread, even though the 
CommandProcessingThread is already dead.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing in the OS, this kind of problem should 
be given more tolerance instead of simply shutting the thread down and never 
recovering automatically, because the non-fatal errors mentioned above can 
probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread, with a retry limit that is 5 by default and 
configurable;

2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor thread 
dies after too many non-fatal errors, it should not simply be removed from the 
BPServiceActor list stored in BPOfferService; instead, the monitor thread will 
periodically try to restart these dead BPServiceActor threads. The monitor 
interval is also configurable.

  was:
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread, even though the 
CommandProcessingThread is already dead.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing on the node, this kind of problem 
should be given more tolerance instead of simply shutting the thread down and 
never recovering automatically, because the non-fatal errors mentioned above 
can probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread, with a retry limit that is 5 by default and 
configurable;

2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor thread 
dies after too many non-fatal errors, it should not simply be removed from the 
BPServiceActor list stored in BPOfferService; instead, the monitor thread will 
periodically try to restart these dead BPServiceActor threads. The monitor 
interval is also configurable.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode
> asynchronously (the CommandProcessingThread handles the commands), so if any
> exception or error in the CommandProcessingThread causes that thread to fail
> and stop, the BPServiceActor is not aware of it and keeps putting commands
> from the NameNode into the queue to be handled by the
> CommandProcessingThread, even though the CommandProcessingThread is already
> dead.
> 2- The second sub-issue builds on the first: if the CommandProcessingThread
> died owing to a non-fatal error such as "can not create native thread",
> which is caused by too many threads existing in the OS, this kind of problem
> should be given more tolerance instead of simply shutting the thread down
> and never recovering automatically, because the non-fatal errors mentioned
> above can probably be recovered from soon.
> currently, Datanode BPServiceActor cannot turn to 

[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread, even though the 
CommandProcessingThread is already dead.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing in the OS, this kind of problem should 
be given more tolerance instead of simply shutting the thread down and never 
recovering automatically, because the non-fatal errors mentioned above can 
probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread, with a retry limit that is 5 by default and 
configurable;

2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor thread 
dies after too many non-fatal errors, it should not simply be removed from the 
BPServiceActor list stored in BPOfferService; instead, the monitor thread will 
periodically try to restart these dead BPServiceActor threads. The monitor 
interval is also configurable.

  was:
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread, even though the 
CommandProcessingThread is already dead.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing in the OS, this kind of problem should 
be given more tolerance instead of simply shutting the thread down and never 
recovering automatically, because the non-fatal errors mentioned above can 
probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread, with a retry limit that is 5 by default and 
configurable;

2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor thread 
dies after too many non-fatal errors, it should not simply be removed from the 
BPServiceActor list stored in BPOfferService; instead, the monitor thread will 
periodically try to restart these dead BPServiceActor threads. The monitor 
interval is also configurable.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode
> asynchronously (the CommandProcessingThread handles the commands), so if any
> exception or error in the CommandProcessingThread causes that thread to fail
> and stop, the BPServiceActor is not aware of it and keeps putting commands
> from the NameNode into the queue to be handled by the
> CommandProcessingThread, even though the CommandProcessingThread is already
> dead.
> 2- The second sub-issue builds on the first: if the CommandProcessingThread
> died owing to a non-fatal error such as "can not create native thread",
> which is caused by too many threads existing in the OS, this kind of problem
> should be given more tolerance instead of simply shutting the thread down
> and never recovering automatically, because the non-fatal errors mentioned
> above can probably be recovered from soon.
> currently, Datanode BPServiceActor cannot 

[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread, even though the 
CommandProcessingThread is already dead.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing on the node, this kind of problem 
should be given more tolerance instead of simply shutting the thread down and 
never recovering automatically, because the non-fatal errors mentioned above 
can probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread, with a retry limit that is 5 by default and 
configurable;

2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor thread 
dies after too many non-fatal errors, it should not simply be removed from the 
BPServiceActor list stored in BPOfferService; instead, the monitor thread will 
periodically try to restart these dead BPServiceActor threads. The monitor 
interval is also configurable.

  was:
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread, even though the 
CommandProcessingThread is already dead.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing on the node, this kind of problem 
should be given more tolerance instead of simply shutting the thread down and 
never recovering automatically, because the non-fatal errors mentioned above 
can probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread, with a retry limit that is 5 by default and 
configurable;

2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor thread 
dies after too many non-fatal errors, it should not simply be removed from the 
BPServiceActor list stored in BPOfferService; instead, the monitor thread will 
periodically try to restart these dead BPServiceActor threads.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode
> asynchronously (the CommandProcessingThread handles the commands), so if any
> exception or error in the CommandProcessingThread causes that thread to fail
> and stop, the BPServiceActor is not aware of it and keeps putting commands
> from the NameNode into the queue to be handled by the
> CommandProcessingThread, even though the CommandProcessingThread is already
> dead.
> 2- The second sub-issue builds on the first: if the CommandProcessingThread
> died owing to a non-fatal error such as "can not create native thread",
> which is caused by too many threads existing on the node, this kind of
> problem should be given more tolerance instead of simply shutting the thread
> down and never recovering automatically, because the non-fatal errors
> mentioned above can probably be recovered from soon.
> currently, Datanode BPServiceActor cannot turn to normal even when the 

[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread, even though the 
CommandProcessingThread is already dead.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing on the node, this kind of problem 
should be given more tolerance instead of simply shutting the thread down and 
never recovering automatically, because the non-fatal errors mentioned above 
can probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread, with a retry limit that is 5 by default and 
configurable;

2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor thread 
dies after too many non-fatal errors, it should not simply be removed from the 
BPServiceActor list stored in BPOfferService; instead, the monitor thread will 
periodically try to restart these dead BPServiceActor threads.

  was:
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread, even though the 
CommandProcessingThread is already dead.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing on the node, this kind of problem 
should be given more tolerance instead of simply shutting the thread down and 
never recovering automatically, because the non-fatal errors mentioned above 
can probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode
> asynchronously (the CommandProcessingThread handles the commands), so if any
> exception or error in the CommandProcessingThread causes that thread to fail
> and stop, the BPServiceActor is not aware of it and keeps putting commands
> from the NameNode into the queue to be handled by the
> CommandProcessingThread, even though the CommandProcessingThread is already
> dead.
> 2- The second sub-issue builds on the first: if the CommandProcessingThread
> died owing to a non-fatal error such as "can not create native thread",
> which is caused by too many threads existing on the node, this kind of
> problem should be given more tolerance instead of simply shutting the thread
> down and never recovering automatically, because the non-fatal errors
> mentioned above can probably be recovered from soon.
> Currently, the DataNode BPServiceActor cannot return to normal even after
> the non-fatal error is eliminated.
> Therefore, this patch does two things:
> 1- Add a retry mechanism to the BPServiceActor thread and the
> CommandProcessingThread, with a retry limit that is 5 by default and
> configurable;
> 2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor
> thread dies after too many non-fatal errors, it should not simply be removed
> from the BPServiceActor list stored in BPOfferService; instead, the monitor
> thread will periodically try to restart these special dead
> BPService 

[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread, even though the 
CommandProcessingThread is already dead.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing on the node, this kind of problem 
should be given more tolerance instead of simply shutting the thread down and 
never recovering automatically, because the non-fatal errors mentioned above 
can probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

  was:
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread, even though the 
CommandProcessingThread is already dead.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing on the node, this kind of problem 
should be given more tolerance instead of simply shutting the thread down and 
never recovering automatically, because the non-fatal errors mentioned above 
can probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode
> asynchronously (the CommandProcessingThread handles the commands), so if any
> exception or error in the CommandProcessingThread causes that thread to fail
> and stop, the BPServiceActor is not aware of it and keeps putting commands
> from the NameNode into the queue to be handled by the
> CommandProcessingThread, even though the CommandProcessingThread is already
> dead.
> 2- The second sub-issue builds on the first: if the CommandProcessingThread
> died owing to a non-fatal error such as "can not create native thread",
> which is caused by too many threads existing on the node, this kind of
> problem should be given more tolerance instead of simply shutting the thread
> down and never recovering automatically, because the non-fatal errors
> mentioned above can probably be recovered from soon.
> Currently, the DataNode BPServiceActor cannot return to normal even after
> the non-fatal error is eliminated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375401#comment-17375401
 ] 

Daniel Ma edited comment on HDFS-16115 at 7/6/21, 9:33 AM:
---

Hello [~brahmareddy],

[~ayush]

Please help to review this patch. Thanks.


was (Author: daniel ma):
[~brahmareddy]

[~ayush]

Please help to review this patch.

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode
> asynchronously (the CommandProcessingThread handles the commands), so if any
> exception or error in the CommandProcessingThread causes that thread to fail
> and stop, the BPServiceActor is not aware of it and keeps putting commands
> from the NameNode into the queue to be handled by the
> CommandProcessingThread, even though the CommandProcessingThread is already
> dead.
> 2- The second sub-issue builds on the first: if the CommandProcessingThread
> died owing to a non-fatal error such as "can not create native thread",
> which is caused by too many threads existing on the node, this kind of
> problem should be given more tolerance instead of simply shutting the thread
> down and never recovering automatically, because the non-fatal errors
> mentioned above can probably be recovered from soon.
> Currently, the DataNode BPServiceActor cannot return to normal even after
> the non-fatal error is eliminated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375401#comment-17375401
 ] 

Daniel Ma commented on HDFS-16115:
--

[~brahmareddy]

[~ayush]

Please help to review this patch.

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode
> asynchronously (the CommandProcessingThread handles the commands), so if any
> exception or error in the CommandProcessingThread causes that thread to fail
> and stop, the BPServiceActor is not aware of it and keeps putting commands
> from the NameNode into the queue to be handled by the
> CommandProcessingThread, even though the CommandProcessingThread is already
> dead.
> 2- The second sub-issue builds on the first: if the CommandProcessingThread
> died owing to a non-fatal error such as "can not create native thread",
> which is caused by too many threads existing on the node, this kind of
> problem should be given more tolerance instead of simply shutting the thread
> down and never recovering automatically, because the non-fatal errors
> mentioned above can probably be recovered from soon.
> Currently, the DataNode BPServiceActor cannot return to normal even after
> the non-fatal error is eliminated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread, even though the 
CommandProcessingThread is already dead.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing on the node, this kind of problem 
should be given more tolerance instead of simply shutting the thread down and 
never recovering automatically, because the non-fatal errors mentioned above 
can probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.

  was:
It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing on the node, this kind of problem 
should be given more tolerance instead of simply shutting the thread down and 
never recovering automatically, because the non-fatal errors mentioned above 
can probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode
> asynchronously (the CommandProcessingThread handles the commands), so if any
> exception or error in the CommandProcessingThread causes that thread to fail
> and stop, the BPServiceActor is not aware of it and keeps putting commands
> from the NameNode into the queue to be handled by the
> CommandProcessingThread, even though the CommandProcessingThread is already
> dead.
> 2- The second sub-issue builds on the first: if the CommandProcessingThread
> died owing to a non-fatal error such as "can not create native thread",
> which is caused by too many threads existing on the node, this kind of
> problem should be given more tolerance instead of simply shutting the thread
> down and never recovering automatically, because the non-fatal errors
> mentioned above can probably be recovered from soon.
> Currently, the DataNode BPServiceActor cannot return to normal even after
> the non-fatal error is eliminated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Attachment: 0001-HDFS-16115.patch

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode
> asynchronously (the CommandProcessingThread handles the commands), so if any
> exception or error in the CommandProcessingThread causes that thread to fail
> and stop, the BPServiceActor is not aware of it and keeps putting commands
> from the NameNode into the queue to be handled by the
> CommandProcessingThread.
> 2- The second sub-issue builds on the first: if the CommandProcessingThread
> died owing to a non-fatal error such as "can not create native thread",
> which is caused by too many threads existing on the node, this kind of
> problem should be given more tolerance instead of simply shutting the thread
> down and never recovering automatically, because the non-fatal errors
> mentioned above can probably be recovered from soon.
> Currently, the DataNode BPServiceActor cannot return to normal even after
> the non-fatal error is eliminated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)
Daniel Ma created HDFS-16115:


 Summary: Asynchronously handle BPServiceActor command mechanism 
may result in BPServiceActor never fails even CommandProcessingThread is closed 
with fatal error.
 Key: HDFS-16115
 URL: https://issues.apache.org/jira/browse/HDFS-16115
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 3.3.1
Reporter: Daniel Ma
 Fix For: 3.3.1


It is an improvement issue. Actually the issue has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(the CommandProcessingThread handles the commands), so if any exception or 
error in the CommandProcessingThread causes that thread to fail and stop, the 
BPServiceActor is not aware of it and keeps putting commands from the NameNode 
into the queue to be handled by the CommandProcessingThread.

2- The second sub-issue builds on the first: if the CommandProcessingThread 
died owing to a non-fatal error such as "can not create native thread", which 
is caused by too many threads existing on the node, this kind of problem 
should be given more tolerance instead of simply shutting the thread down and 
never recovering automatically, because the non-fatal errors mentioned above 
can probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error is eliminated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375331#comment-17375331
 ] 

Daniel Ma commented on HDFS-15796:
--

[~weichiu],[~hexiaoqiao] 

Please help to review this patch, thanks

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  
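The stack trace above is the classic fail-fast iterator failure: one thread 
structurally modifies an ArrayList while another thread (here the 
RedundancyMonitor) is iterating it. A minimal hedged repro of that failure 
mode and the usual snapshot fix; this is illustrative only, not the attached 
patch:
{code:java}
import java.util.ArrayList;
import java.util.List;

public class CmeSketch {
  public static void main(String[] args) {
    List<String> blocks = new ArrayList<>();
    blocks.add("blk_1");
    blocks.add("blk_2");
    blocks.add("blk_3");

    // Broken: structural modification while a fail-fast iterator is live
    // triggers checkForComodification(), as in the log above.
    try {
      for (String b : blocks) {
        blocks.remove(b);
      }
    } catch (java.util.ConcurrentModificationException e) {
      System.out.println("reproduced: " + e);
    }

    // Common fix: iterate over a snapshot (or synchronize both sides on the
    // same lock so iteration and mutation cannot interleave).
    for (String b : new ArrayList<>(blocks)) {
      blocks.remove(b);
    }
    System.out.println("remaining: " + blocks);
  }
}
{code}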



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-15796:
-
Attachment: 0001-HDFS-15796.patch

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-15796:
-
Target Version/s: 3.3.1  (was: 3.4.0)

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16094) HDFS balancer process start failed owing to daemon pid file is not cleared in some exception senario

2021-06-28 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16094:
-
Summary: HDFS balancer process start failed owing to daemon pid file is not 
cleared in some exception senario  (was: HDFS start failed owing to daemon pid 
file is not cleared in some exception senario)

> HDFS balancer process start failed owing to daemon pid file is not cleared in 
> some exception senario
> 
>
> Key: HDFS-16094
> URL: https://issues.apache.org/jira/browse/HDFS-16094
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: scripts
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Major
>
> The NameNode failed to start because the daemon pid file was not cleared in
> some exception scenarios, and there is no useful information in the log for
> troubleshooting, as below.
> {code:java}
> // code placeholder
> hadoop_error "${daemonname} is running as process $(cat "${daemon_pidfile}")
> {code}
> But in fact the process is not running, contrary to what the error message
> says.
> Therefore, the error log should print more explicit information that tells
> users where the pid file is located and that they should clear it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16094) HDFS balancer process start failed owing to daemon pid file is not cleared in some exception senario

2021-06-28 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16094:
-
Description: 
The HDFS balancer process failed to start because the daemon pid file was not 
cleared in some exception scenarios, and there is no useful information in the 
log for troubleshooting, as below.
{code:java}
// code placeholder
hadoop_error "${daemonname} is running as process $(cat "${daemon_pidfile}")
{code}
But in fact the process is not running, contrary to what the error message 
says.

Therefore, the error log should print more explicit information that tells 
users where the pid file is located and that they should clear it.

 

  was:
The NameNode failed to start because the daemon pid file was not cleared in 
some exception scenarios, and there is no useful information in the log for 
troubleshooting, as below.
{code:java}
// code placeholder
hadoop_error "${daemonname} is running as process $(cat "${daemon_pidfile}")
{code}
But in fact the process is not running, contrary to what the error message 
says.

Therefore, the error log should print more explicit information that tells 
users where the pid file is located and that they should clear it.

 


> HDFS balancer process start failed owing to daemon pid file is not cleared in 
> some exception senario
> 
>
> Key: HDFS-16094
> URL: https://issues.apache.org/jira/browse/HDFS-16094
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: scripts
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Major
>
> The HDFS balancer process failed to start because the daemon pid file was
> not cleared in some exception scenarios, and there is no useful information
> in the log for troubleshooting, as below.
> {code:java}
> // code placeholder
> hadoop_error "${daemonname} is running as process $(cat "${daemon_pidfile}")
> {code}
> But in fact the process is not running, contrary to what the error message
> says.
> Therefore, the error log should print more explicit information that tells
> users where the pid file is located and that they should clear it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16094) HDFS start failed owing to daemon pid file is not cleared in some exception senario

2021-06-28 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16094:
-
Summary: HDFS start failed owing to daemon pid file is not cleared in some 
exception senario  (was: NameNode start failed owing to daemon pid file is not 
cleared in some exception senario)

> HDFS start failed owing to daemon pid file is not cleared in some exception 
> senario
> ---
>
> Key: HDFS-16094
> URL: https://issues.apache.org/jira/browse/HDFS-16094
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: scripts
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Major
>
> The NameNode failed to start because the daemon pid file was not cleared in
> some exception scenarios, and there is no useful information in the log for
> troubleshooting, as below.
> {code:java}
> // code placeholder
> hadoop_error "${daemonname} is running as process $(cat "${daemon_pidfile}")
> {code}
> But in fact the process is not running, contrary to what the error message
> says.
> Therefore, the error log should print more explicit information that tells
> users where the pid file is located and that they should clear it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16094) NameNode start failed owing to daemon pid file is not cleared in some exception senario

2021-06-28 Thread Daniel Ma (Jira)
Daniel Ma created HDFS-16094:


 Summary: NameNode start failed owing to daemon pid file is not 
cleared in some exception senario
 Key: HDFS-16094
 URL: https://issues.apache.org/jira/browse/HDFS-16094
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: scripts
Affects Versions: 3.3.1
Reporter: Daniel Ma


NameNode start failed because the daemon pid file was not cleared in some 
exception scenario, but there is no useful information in the log to 
troubleshoot, as shown below.
{code:java}
// code placeholder
hadoop_error "${daemonname} is running as process $(cat "${daemon_pidfile}")"
{code}
but actually, the process is not running as the error message claims.

Therefore, more explicit information should be printed in the error log to 
tell users where the pid file is located and to guide them to clear it.
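
A minimal sketch of the kind of explicit check and message being asked for, 
reusing the {{hadoop_error}} helper and the variable names from the snippet 
above; the if/else structure and exact wording here are illustrative 
assumptions, not the actual hadoop-functions.sh flow or the eventual patch:
{code:bash}
# If the recorded pid is no longer alive, name the stale pid file explicitly
# so the user knows exactly what to remove before retrying.
if [[ -f "${daemon_pidfile}" ]]; then
  pid=$(cat "${daemon_pidfile}")
  if kill -0 "${pid}" >/dev/null 2>&1; then
    hadoop_error "${daemonname} is running as process ${pid}.  Stop it first."
    exit 1
  else
    hadoop_error "${daemonname} is not running, but a stale pid file exists."
    hadoop_error "Remove ${daemon_pidfile} (stale pid ${pid}) and start again."
    exit 1
  fi
fi
{code}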

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16093) DataNodes under decommission will still be returned to the client via getLocatedBlocks, so the client may request decommissioning datanodes to read which will cause bad

2021-06-28 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371015#comment-17371015
 ] 

Daniel Ma commented on HDFS-16093:
--

Hi [~tomscut], thanks for your quick reply.

The Jira you mentioned can relieve this issue to some extent, but I think only 
DataNodes in service should be returned to the client.

All DataNodes in an abnormal state, such as DECOMMISSIONING or MAINTENANCE, 
should be removed from the returned list. A sketch of the filtering idea is 
given below.
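
A minimal sketch of that filtering, assuming Hadoop's 
org.apache.hadoop.hdfs.protocol.DatanodeInfo state accessors 
(isDecommissionInProgress, isEnteringMaintenance, isInMaintenance); the 
class name here is hypothetical and this illustrates the idea rather than 
the actual NameNode-side change:
{code:java}
import java.util.Arrays;

import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

/** Sketch: keep only in-service replicas from a located block's list. */
public final class InServiceReplicaFilter {
  private InServiceReplicaFilter() {
  }

  /** Drops replicas that are decommissioning or entering/in maintenance. */
  public static DatanodeInfo[] inServiceOnly(DatanodeInfo[] replicas) {
    DatanodeInfo[] filtered = Arrays.stream(replicas)
        .filter(dn -> !dn.isDecommissionInProgress()
            && !dn.isEnteringMaintenance()
            && !dn.isInMaintenance())
        .toArray(DatanodeInfo[]::new);
    // Fall back to the full list rather than returning no replicas at all,
    // e.g. when every replica of a block happens to be decommissioning.
    return filtered.length > 0 ? filtered : replicas;
  }
}
{code}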

> DataNodes under decommission will still be returned to the client via 
> getLocatedBlocks, so the client may request decommissioning datanodes to read 
> which will cause bad contention on disk IO.
> --
>
> Key: HDFS-16093
> URL: https://issues.apache.org/jira/browse/HDFS-16093
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
>
> DataNodes under decommission will still be returned to the client via 
> getLocatedBlocks, so the client may request decommissioning datanodes to read 
> which will cause bad contention on disk IO.
> Therefore, datanodes under decommission should be removed from the return 
> list of the getLocatedBlocks API.
> !image-2021-06-29-10-50-44-739.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16093) DataNodes under decommission will still be returned to the client via getLocatedBlocks, so the client may request decommissioning datanodes to read which will cause badly

2021-06-28 Thread Daniel Ma (Jira)
Daniel Ma created HDFS-16093:


 Summary: DataNodes under decommission will still be returned to 
the client via getLocatedBlocks, so the client may request decommissioning 
datanodes to read which will cause bad contention on disk IO.
 Key: HDFS-16093
 URL: https://issues.apache.org/jira/browse/HDFS-16093
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.3.1
Reporter: Daniel Ma


DataNodes under decommission will still be returned to the client via 
getLocatedBlocks, so the client may request decommissioning datanodes to read, 
which will cause bad contention on disk IO.

Therefore, datanodes under decommission should be removed from the return list 
of the getLocatedBlocks API.

!image-2021-06-29-10-50-44-739.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-06-28 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370995#comment-17370995
 ] 

Daniel Ma edited comment on HDFS-15796 at 6/29/21, 2:36 AM:


[~weichiu]  No idea what kind of condition can reproduce this problem. It seems 
the targets object is modified elsewhere while computeReconstructionWorkForBlocks 
is in progress, owing to a thread-safety issue.
{code:java}
// code placeholder
// Step 2: choose target nodes for each reconstruction task
for (BlockReconstructionWork rw : reconWork) {
  // Exclude all of the containing nodes from being targets.
  // This list includes decommissioning or corrupt nodes.
  final Set<Node> excludedNodes = new HashSet<>(rw.getContainingNodes());
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }

  // choose replication targets: NOT HOLDING THE GLOBAL LOCK
  final BlockPlacementPolicy placementPolicy =
      placementPolicies.getPolicy(rw.getBlock().getBlockType());
  rw.chooseTargets(placementPolicy, storagePolicySuite, excludedNodes);
}

{code}
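
The failure mode is the classic one: one thread iterates the targets list 
while another thread mutates it. A minimal, self-contained demonstration of 
the problem and of the snapshot-copy mitigation, using only the JDK; the 
class name is hypothetical and this is illustrative only, not the actual 
BlockManager fix:
{code:java}
import java.util.ArrayList;
import java.util.List;

public class SnapshotIterationDemo {
  public static void main(String[] args) throws InterruptedException {
    final List<Integer> targets = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      targets.add(i);
    }

    // Writer thread mutates the list under its own lock.
    Thread writer = new Thread(() -> {
      for (int i = 0; i < 1000; i++) {
        synchronized (targets) {
          targets.add(i);
        }
      }
    });
    writer.start();

    // Unsafe: iterating targets directly here may throw
    // java.util.ConcurrentModificationException.

    // Safe: take a snapshot under the same lock, then iterate the copy.
    List<Integer> snapshot;
    synchronized (targets) {
      snapshot = new ArrayList<>(targets);
    }
    long sum = 0;
    for (int t : snapshot) {
      sum += t;
    }
    writer.join();
    System.out.println("sum=" + sum + ", snapshot size=" + snapshot.size());
  }
}
{code}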
 


was (Author: daniel ma):
[~weichiu]  No idea what kind of condition can reproduce this problem. It seems 
the targets object is modified elsewhere when 
computeReconstructionWorkForBlocks is in progress.
{code:java}
// code placeholder
// Step 2: choose target nodes for each reconstruction task
for (BlockReconstructionWork rw : reconWork) {
  // Exclude all of the containing nodes from being targets.
  // This list includes decommissioning or corrupt nodes.
  final Set<Node> excludedNodes = new HashSet<>(rw.getContainingNodes());
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }

  // choose replication targets: NOT HOLDING THE GLOBAL LOCK
  final BlockPlacementPolicy placementPolicy =
      placementPolicies.getPolicy(rw.getBlock().getBlockType());
  rw.chooseTargets(placementPolicy, storagePolicySuite, excludedNodes);
}

{code}
 

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-06-28 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370995#comment-17370995
 ] 

Daniel Ma edited comment on HDFS-15796 at 6/29/21, 2:35 AM:


[~weichiu]  No idea what kind of condition can reproduce this problem. It seems 
the targets object is modified elsewhere when 
computeReconstructionWorkForBlocks is in progress.
{code:java}
// code placeholder
// Step 2: choose target nodes for each reconstruction task
for (BlockReconstructionWork rw : reconWork) {
  // Exclude all of the containing nodes from being targets.
  // This list includes decommissioning or corrupt nodes.
  final Set<Node> excludedNodes = new HashSet<>(rw.getContainingNodes());
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }

  // choose replication targets: NOT HOLDING THE GLOBAL LOCK
  final BlockPlacementPolicy placementPolicy =
      placementPolicies.getPolicy(rw.getBlock().getBlockType());
  rw.chooseTargets(placementPolicy, storagePolicySuite, excludedNodes);
}

{code}
 


was (Author: daniel ma):
[~weichiu]  No idea what kind of condition can reproduce this problem. It seems 
the targets object is modified elsewhere when 
computeReconstructionWorkForBlocks is in progress.
{quote}// Step 2: choose target nodes for each reconstruction task
for (BlockReconstructionWork rw : reconWork) {
  // Exclude all of the containing nodes from being targets.
  // This list includes decommissioning or corrupt nodes.
  final Set<Node> excludedNodes = new HashSet<>(rw.getContainingNodes());
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }

  // choose replication targets: NOT HOLDING THE GLOBAL LOCK
  final BlockPlacementPolicy placementPolicy =
      placementPolicies.getPolicy(rw.getBlock().getBlockType());
  rw.chooseTargets(placementPolicy, storagePolicySuite, excludedNodes);
}
{quote}
 

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-06-28 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370995#comment-17370995
 ] 

Daniel Ma edited comment on HDFS-15796 at 6/29/21, 2:34 AM:


[~weichiu]  No idea what kind of condition can reproduce this problem. It seems 
the targets object is modified elsewhere when 
computeReconstructionWorkForBlocks is in progress.
{quote}// Step 2: choose target nodes for each reconstruction task
for (BlockReconstructionWork rw : reconWork) {
  // Exclude all of the containing nodes from being targets.
  // This list includes decommissioning or corrupt nodes.
  final Set<Node> excludedNodes = new HashSet<>(rw.getContainingNodes());
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }

  // choose replication targets: NOT HOLDING THE GLOBAL LOCK
  final BlockPlacementPolicy placementPolicy =
      placementPolicies.getPolicy(rw.getBlock().getBlockType());
  rw.chooseTargets(placementPolicy, storagePolicySuite, excludedNodes);
}
{quote}
 


was (Author: daniel ma):
[~weichiu]  No idea what kind of condition can reproduce this problem. It seems 
the targets object is modified elsewhere when 
computeReconstructionWorkForBlocks is in progress.
{quote}// Step 2: choose target nodes for each reconstruction task
for (BlockReconstructionWork rw : reconWork) {
  // Exclude all of the containing nodes from being targets.
  // This list includes decommissioning or corrupt nodes.
  final Set<Node> excludedNodes = new HashSet<>(rw.getContainingNodes());
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }

  // choose replication targets: NOT HOLDING THE GLOBAL LOCK
  final BlockPlacementPolicy placementPolicy =
      placementPolicies.getPolicy(rw.getBlock().getBlockType());
  rw.chooseTargets(placementPolicy, storagePolicySuite, excludedNodes);
}
{quote}
 

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-06-28 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370995#comment-17370995
 ] 

Daniel Ma edited comment on HDFS-15796 at 6/29/21, 2:33 AM:


[~weichiu]  No idea what kind of condition can reproduce this problem. It seems 
the targets object is modified elsewhere when 
computeReconstructionWorkForBlocks is in progress.
{quote}// Step 2: choose target nodes for each reconstruction task
for (BlockReconstructionWork rw : reconWork) {
  // Exclude all of the containing nodes from being targets.
  // This list includes decommissioning or corrupt nodes.
  final Set<Node> excludedNodes = new HashSet<>(rw.getContainingNodes());
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }

  // choose replication targets: NOT HOLDING THE GLOBAL LOCK
  final BlockPlacementPolicy placementPolicy =
      placementPolicies.getPolicy(rw.getBlock().getBlockType());
  rw.chooseTargets(placementPolicy, storagePolicySuite, excludedNodes);
}
{quote}
 


was (Author: daniel ma):
[~weichiu]  No idea what kind of condition can reproduce this problem. It seems 
the targets object is modified elsewhere when 
computeReconstructionWorkForBlocks is in progress.
{quote}// Step 2: choose target nodes for each reconstruction task
for (BlockReconstructionWork rw : reconWork) {
  // Exclude all of the containing nodes from being targets.
  // This list includes decommissioning or corrupt nodes.
  final Set<Node> excludedNodes = new HashSet<>(rw.getContainingNodes());
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }

  // choose replication targets: NOT HOLDING THE GLOBAL LOCK
  final BlockPlacementPolicy placementPolicy =
      placementPolicies.getPolicy(rw.getBlock().getBlockType());
  rw.chooseTargets(placementPolicy, storagePolicySuite, excludedNodes);
}
{quote}

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-06-28 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370995#comment-17370995
 ] 

Daniel Ma edited comment on HDFS-15796 at 6/29/21, 2:32 AM:


[~weichiu]  No idea what kind of condition can reproduce this problem. It seems 
the targets object is modified elsewhere when 
computeReconstructionWorkForBlocks is in progress.
{quote}// Step 2: choose target nodes for each reconstruction task
for (BlockReconstructionWork rw : reconWork) {
  // Exclude all of the containing nodes from being targets.
  // This list includes decommissioning or corrupt nodes.
  final Set<Node> excludedNodes = new HashSet<>(rw.getContainingNodes());
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }

  // choose replication targets: NOT HOLDING THE GLOBAL LOCK
  final BlockPlacementPolicy placementPolicy =
      placementPolicies.getPolicy(rw.getBlock().getBlockType());
  rw.chooseTargets(placementPolicy, storagePolicySuite, excludedNodes);
}
{quote}


was (Author: daniel ma):
[~weichiu]  No idea what kind of condition can reproduce this problem. It seems 
the targets object is modified elsewhere when 
computeReconstructionWorkForBlocks is in progress.

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-06-28 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370997#comment-17370997
 ] 

Daniel Ma commented on HDFS-15796:
--

[~sodonnell]

We have made some modifications based on the open-source version, like merging 
some patches from newer versions into our 3.1.1 version.

So the line numbers in the error stack trace are not exactly the same.

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-06-28 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370995#comment-17370995
 ] 

Daniel Ma commented on HDFS-15796:
--

[~weichiu]  No idea what kind of condition can reproduce this problem. It seems 
the targets object is modified elsewhere when 
computeReconstructionWorkForBlocks is in progress.

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-01-27 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-15796:
-
Description: 
ConcurrentModificationException error happens on NameNode occasionally.

 
{code:java}
2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor thread 
received Runtime exception.  | BlockManager.java:4746
java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
at java.util.ArrayList$Itr.next(ArrayList.java:859)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
at java.lang.Thread.run(Thread.java:748)
{code}
 

 

  was:
ConcurrentModificationException error happens on NameNode occasionally

 

!file:///C:/Users/m00425105/AppData/Roaming/eSpace_Desktop/UserData/m00425105/imagefiles/10B02DC2-A9F0-4AE6-949B-92B8F1E9249A.png!


> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.1.1
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-01-27 Thread Daniel Ma (Jira)
Daniel Ma created HDFS-15796:


 Summary: ConcurrentModificationException error happens on NameNode 
occasionally
 Key: HDFS-15796
 URL: https://issues.apache.org/jira/browse/HDFS-15796
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Affects Versions: 3.1.1
Reporter: Daniel Ma
 Fix For: 3.1.1


ConcurrentModificationException error happens on NameNode occasionally

 

!file:///C:/Users/m00425105/AppData/Roaming/eSpace_Desktop/UserData/m00425105/imagefiles/10B02DC2-A9F0-4AE6-949B-92B8F1E9249A.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org