[ https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880426#comment-16880426 ]
Sean Chow commented on HDFS-14476:
----------------------------------

I've added a patch here that processes the inconsistent blocks in the map with a fixed batch size of 500, and it has worked well in my production cluster. !datanode-with-patch-14476.png! You can see there is a spike periodically before applying this patch. This is the jstack output at that spike time:

{code:java}
"DataXceiver for client DFSClient_NONMAPREDUCE_-2048576664_27 at /x.x.x.x:49793 [Sending block BP-1644920766-x.x.x.x-1450099987967:blk_2423788189_1403044486]" daemon prio=10 tid=0x00007f8a0ea3d800 nid=0x19a2 waiting for monitor entry [0x00007f89982a2000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:231)
        - waiting to lock <0x00000004001d6348> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:529)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:116)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:243)
        at java.lang.Thread.run(Thread.java:745)

"369604090@qtp-1069118552-20368" daemon prio=10 tid=0x0000000001f94000 nid=0x19a1 waiting for monitor entry [0x00007f89c0d41000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:295)
        - waiting to lock <0x00000004001d6348> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getAvailable(FsVolumeImpl.java:334)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.getRemaining(FsVolumeList.java:162)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getRemaining(FsDatasetImpl.java:496)
        - locked <0x0000000400502b70> (a java.lang.Object)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)

"java.util.concurrent.ThreadPoolExecutor$Worker@704adcff[State = -1, empty queue]" daemon prio=10 tid=0x00007f8a000f3000 nid=0x59f1 runnable [0x00007f89c9c66000]
   java.lang.Thread.State: RUNNABLE
        at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
        at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
        at java.io.File.exists(File.java:813)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkAndUpdate(FsDatasetImpl.java:1965)
        - locked <0x00000004001d6348> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:410)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:360)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{code}

All threads are blocked until they time out while the directory scanner checks and updates blocks. Also, the scan result {{missing metadata files:23574, missing block files:23574, missing blocks in memory:47625}} is not entirely accurate, because the scan may take two hours to finish and already-scanned directories may gain new files that are then added to the inconsistent-blocks map.
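The batching idea in the patch can be sketched roughly as follows. This is a simplified illustration, not the actual patch: {{BatchedReconcile}}, {{partition}}, and the lock object are hypothetical names, and the real {{checkAndUpdate}} work is only indicated by a comment. The point is that the diff list is split into fixed-size batches and the dataset lock is released (with a pause) between batches, so DataXceiver threads can make progress:

{code:java}
import java.util.ArrayList;
import java.util.List;

public class BatchedReconcile {
    static final int BATCH_SIZE = 500;   // fixed batch size used in the patch
    static final long SLEEP_MS = 2000;   // pause between batches, as proposed

    // Split the directory-scanner diff records into fixed-size batches.
    public static <T> List<List<T>> partition(List<T> diffs, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < diffs.size(); i += batchSize) {
            batches.add(new ArrayList<>(diffs.subList(i, Math.min(i + batchSize, diffs.size()))));
        }
        return batches;
    }

    // Fix each batch under the dataset lock, then release the lock and
    // sleep so normal block reads/writes are not starved for minutes.
    public static <T> void reconcile(List<T> diffs, Object datasetLock) throws InterruptedException {
        for (List<T> batch : partition(diffs, BATCH_SIZE)) {
            synchronized (datasetLock) {
                for (T diff : batch) {
                    // the real checkAndUpdate(diff) work would run here
                }
            }
            Thread.sleep(SLEEP_MS); // let blocked DataXceiver threads proceed
        }
    }
}
{code}

With ~25000 abnormal blocks per scan, this bounds each continuous lock hold to one batch instead of the whole diff list.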
> lock too long when fix inconsistent blocks between disk and in-memory
> ---------------------------------------------------------------------
>
>                 Key: HDFS-14476
>                 URL: https://issues.apache.org/jira/browse/HDFS-14476
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 2.6.0, 2.7.0
>            Reporter: Sean Chow
>            Priority: Major
>         Attachments: HDFS-14476.00.patch, datanode-with-patch-14476.png
>
> When the directoryScanner has the results of the differences between on-disk
> and in-memory blocks, it runs {{checkAndUpdate}} to fix them. However,
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call.
> As I have about 6 million blocks on each datanode and every 6-hour scan
> finds about 25000 abnormal blocks to fix, this leads to the FsDatasetImpl
> lock being held for a long time.
> Assume every block needs 10ms to fix (because of SAS disk latency); that
> adds up to about 250 seconds. That means all reads and writes on that
> datanode are blocked for roughly four minutes.
>
> {code:java}
> 2019-05-06 08:06:51,704 INFO
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing
> metadata files:23574, missing block files:23574, missing blocks in
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> Took 588402ms to process 1 commands from NN
> {code}
> Processing commands from the namenode takes a long time because the threads
> are blocked, so the namenode sees a long lastContact time for this datanode.
> This probably affects all HDFS versions.
> *how to fix:*
> Just as invalidate commands from the namenode are processed with a batch
> size of 1000, fixing these abnormal blocks should be batched too, sleeping
> 2 seconds between batches to allow normal block reads and writes.
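Working through the numbers in the proposal above, as a rough sanity check (a sketch, not code from the patch; the class and method names are made up for illustration): with ~25000 abnormal blocks at ~10ms each, a hypothetical batch size of 1000, and a 2-second sleep between batches, the longest continuous lock hold drops from ~250s to ~10s, at the cost of a modestly longer total reconcile time:

{code:java}
public class LockHoldEstimate {
    // Longest continuous hold of the FsDatasetImpl lock: one full batch.
    public static long maxLockHoldMs(int batchSize, long perBlockMs) {
        return batchSize * perBlockMs;
    }

    // Total wall-clock time: all the per-block work plus the sleeps
    // between batches (no sleep after the final batch).
    public static long totalWallClockMs(int blocks, int batchSize,
                                        long perBlockMs, long sleepMs) {
        int batches = (blocks + batchSize - 1) / batchSize; // ceiling division
        return (long) blocks * perBlockMs + (long) (batches - 1) * sleepMs;
    }
}
{code}

For the figures in this issue, maxLockHoldMs(1000, 10) is 10 seconds per batch, and totalWallClockMs(25000, 1000, 10, 2000) is 298 seconds overall: slightly slower end-to-end, but readers and writers get the lock every couple of seconds instead of waiting minutes.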
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org