[ 
https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-15651:
-----------------------------
    Description: 
In our cluster, we applied the HDFS-14997 improvement.
 We hit one case where the CommandProcessingThread exited due to an OOM error. The OOM error was caused by an abnormal application running on the same DN node.
{noformat}
2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor encountered fatal exception and exit.
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:717)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
{noformat}
The main point here is that a crashed CommandProcessingThread has a very bad impact: none of the commands returned by the NN are processed on the DN side anymore.
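
To make the failure mode concrete, below is a minimal, self-contained sketch (hypothetical code, not the actual BPServiceActor implementation): once any Throwable escapes the processing loop, the outermost handler only logs it, the thread exits, and every command enqueued afterwards just sits in the queue.
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/**
 * Simplified model of the failure mode (not the actual BPServiceActor code):
 * once a Throwable escapes the processing loop, the outermost handler only
 * logs it and the thread exits, so commands queued afterwards by the
 * heartbeat thread are never drained.
 */
public class CommandProcessorSketch {

  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

  private final Thread processor = new Thread(() -> {
    try {
      processQueue();
    } catch (Throwable t) {
      // Mirrors the log line above: the error is logged and the thread exits,
      // leaving nobody to process further NameNode commands.
      System.err.println("Command processor encountered fatal exception and exit: " + t);
    }
  }, "Command processor");

  private void processQueue() throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      Runnable command = queue.poll(1, TimeUnit.SECONDS);
      if (command != null) {
        // An OutOfMemoryError thrown here (e.g. while scheduling an async
        // block deletion) propagates out of the loop and kills the thread.
        command.run();
      }
    }
  }

  public void submit(Runnable command) {
    queue.add(command);  // the heartbeat thread enqueues NN commands here
  }

  public static void main(String[] args) {
    CommandProcessorSketch sketch = new CommandProcessorSketch();
    sketch.processor.start();
    // Simulate the thread pool failing to create a new native thread.
    sketch.submit(() -> {
      throw new OutOfMemoryError("unable to create new native thread");
    });
    // A later command such as DNA_ACCESSKEYUPDATE stays in the queue forever.
    sketch.submit(() -> System.out.println("never executed"));
  }
}
{code}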

We have block tokens enabled for data access, but in this case the DN command DNA_ACCESSKEYUPDATE is not processed in time by the DN. We then see lots of SASL errors due to key expiration in the DN log:
{noformat}
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
[Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, 
userId=xxx, blockPoolId=xxxx, blockId=xxx, access modes=[READ]), since the 
required block key (keyID=xxx) doesn't exist.]
{noformat}
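
The chain from the stalled command processor to these SASL errors can be illustrated with a small hypothetical model (not the real BlockTokenSecretManager API): the DN only learns new block keys when it applies DNA_ACCESSKEYUPDATE, so a client token signed with a newer key can no longer be verified once the processor has stopped.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical illustration (not the Hadoop BlockTokenSecretManager API) of
 * why a stalled command processor leads to the SaslException above: the DN's
 * view of block keys is only refreshed when DNA_ACCESSKEYUPDATE is applied.
 */
public class BlockKeyExpirySketch {

  /** keyId -> shared secret, normally refreshed by DNA_ACCESSKEYUPDATE. */
  private final Map<Integer, byte[]> blockKeys = new ConcurrentHashMap<>();

  /** Called when the DN applies a DNA_ACCESSKEYUPDATE command. */
  public void applyKeyUpdate(int keyId, byte[] secret) {
    blockKeys.put(keyId, secret);
  }

  /**
   * Called during the SASL handshake to recompute the password for a client's
   * block token. Fails if the DN never received the key the token was signed
   * with, which is exactly what happens once the command processor has exited.
   */
  public byte[] retrievePassword(int tokenKeyId) {
    byte[] key = blockKeys.get(tokenKeyId);
    if (key == null) {
      throw new IllegalStateException(
          "Can't re-compute password: required block key (keyID=" + tokenKeyId
              + ") doesn't exist");
    }
    return key;  // the real implementation would derive an HMAC from the token
  }
}
{code}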
 

On the client side, our users receive lots of 'could not obtain block' errors with BlockMissingException.

CommandProcessingThread is a critical thread; it should always be running. Once an unexpected error happens, better handling would be one of the following (a sketch of both options is given after this list):
 * catch the exception and keep the thread running, or
 * exit the DN process so that the admin can investigate
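
As a rough illustration only (using the same simplified model as the sketch above, not a proposed patch), the loop could be hardened like this:
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/**
 * Rough sketch of the two proposed options, applied to the same simplified
 * command loop used above (illustrative only, not the eventual patch).
 */
public class ResilientCommandProcessorSketch {

  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

  private void processQueue() throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      Runnable command = queue.poll(1, TimeUnit.SECONDS);
      if (command == null) {
        continue;
      }
      try {
        command.run();
      } catch (Throwable t) {
        // Option 1: log and keep the processor alive so later commands
        // (e.g. DNA_ACCESSKEYUPDATE) still get applied.
        System.err.println("Command failed, processor keeps running: " + t);

        // Option 2 (alternative): fail the whole DataNode fast so the admin
        // notices immediately, e.g. with Hadoop's ExitUtil:
        // ExitUtil.terminate(1, "Command processor failed: " + t);
      }
    }
  }

  public void submit(Runnable command) {
    queue.add(command);
  }

  public void start() {
    Thread t = new Thread(() -> {
      try {
        processQueue();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "Command processor");
    t.setDaemon(true);
    t.start();
  }
}
{code}
Option 1 keeps the DN serving but risks running on in a degraded environment after an OOM; option 2 (for example via Hadoop's ExitUtil) trades availability for a failure that is immediately visible to the admin.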

> Client could not obtain block when DN CommandProcessingThread exit
> ------------------------------------------------------------------
>
>                 Key: HDFS-15651
>                 URL: https://issues.apache.org/jira/browse/HDFS-15651
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Yiqun Lin
>            Priority: Major
>


