[ https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yiqun Lin updated HDFS-15651:
-----------------------------
Description:
In our cluster, we applied the HDFS-14997 improvement. We found one case where CommandProcessingThread exits due to an OOM error. The OOM error was caused by an abnormal application running on this DN node.

{noformat}
2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor encountered fatal exception and exit.
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:717)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
{noformat}

The main point here is that a crashed CommandProcessingThread has a very bad
impact. None of the NN response commands will be processed on the DN side. We enabled block tokens for data access, but the DN command DNA_ACCESSKEYUPDATE is not processed in time by the DN. We then see many SASL errors in the DN log due to key expiration:

{noformat}
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, userId=xxx, blockPoolId=xxxx, blockId=xxx, access modes=[READ]), since the required block key (keyID=xxx) doesn't exist.]
{noformat}

As for the client-side impact, our users receive many 'could not obtain block' errors with BlockMissingException.

CommandProcessingThread is a critical thread; it should always be running.

{code:java}
/**
 * CommandProcessingThread that processes commands asynchronously.
 */
class CommandProcessingThread extends Thread {
  private final BPServiceActor actor;
  private final BlockingQueue<Runnable> queue;
  ...
  @Override
  public void run() {
    try {
      processQueue();
    } catch (Throwable t) {
      // <=== should not exit this thread
      LOG.error("{} encountered fatal exception and exit.", getName(), t);
    }
  }
}
{code}

Once an unexpected error happens, better handling would be to either:
* catch the exception, deal with the error appropriately, and let processQueue continue to run, or
* exit the DN process so that an admin can investigate.


> Client could not obtain block when DN CommandProcessingThread exit
> ------------------------------------------------------------------
>
>                 Key: HDFS-15651
>                 URL: https://issues.apache.org/jira/browse/HDFS-15651
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Yiqun Lin
>            Priority: Major
>


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
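
[Editor's sketch] The first option above (catch, handle the error, let the processing loop continue) can be illustrated with a minimal, self-contained worker thread. This is a hypothetical sketch, not the actual BPServiceActor code: the class name ResilientCommandProcessor, the POISON shutdown marker, and the counters are all invented for illustration; the real fix would live inside processQueue.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch of the "catch and keep running" option: a failure
 * in one command is logged and counted, and the loop stays alive instead
 * of letting a single Throwable terminate the whole thread.
 */
class ResilientCommandProcessor extends Thread {
  static final Runnable POISON = () -> { };          // orderly-shutdown marker
  final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  final AtomicInteger processed = new AtomicInteger();
  final AtomicInteger failed = new AtomicInteger();

  @Override
  public void run() {
    while (true) {
      try {
        Runnable cmd = queue.poll(1, TimeUnit.SECONDS);
        if (cmd == POISON) {
          return;                                    // only exit path besides interrupt
        }
        if (cmd != null) {
          cmd.run();
          processed.incrementAndGet();
        }
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return;
      } catch (Throwable t) {
        // Log and continue; the thread survives a failing command.
        // A real fix might rethrow fatal Errors (e.g. OutOfMemoryError)
        // and terminate the DN process instead of silently continuing.
        failed.incrementAndGet();
        System.err.println(getName() + " command failed: " + t);
      }
    }
  }
}
```

For the second option (exit the DN process so an admin can investigate), Hadoop's ExitUtil.terminate is the usual mechanism; which Throwables count as "fatal enough" to terminate rather than continue is exactly the policy question this issue raises.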