Hi Wei-Chiu,

I came across HDFS-14476 while searching to see whether anyone else is
seeing the same issues we are. I didn't see it merged, so I assumed
other fixes had gone in around what you describe there.
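I will grab the jstack you asked for as soon as I am on the server
again. For reference, this is roughly how I plan to capture it (just a
sketch; it assumes the DataNode pid shows up in jps and that jstack
comes from the same JDK the datanode runs on):

    jps | grep DataNode               # find the DataNode pid
    jstack -l <pid> > dn-jstack.txt   # -l adds lock/synchronizer details

I'll post the relevant threads from that dump once I have it.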
HDFS-11187 is already applied to 2.9.1 and we are on 2.9.2, so it may
not be what's affecting us, though I need to be 100% sure of that. In
the meantime, I have the following WARN statement that I felt could
point to the issue. By the way, we have tested the disk I/O, and when
these slowdowns happen the disks are hardly under any I/O pressure. And
on a restart of the datanode process everything goes back to normal,
i.e. the disks are able to sustain high I/O again.

2019-09-27 00:48:14,101 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Lock held time above threshold: lock identifier: org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl lockHeldTimeMs=12628 ms. Suppressed 1 lock warnings. The stack trace is:
java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1021)
org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:143)
org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186)
org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133)
org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84)
org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96)
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock(FsDatasetImpl.java:1781)
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.finalizeBlock(BlockReceiver.java:1517)
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1474)
java.lang.Thread.run(Thread.java:748)

Thanks,
Viral

On Thu, Sep 26, 2019 at 7:03 PM Wei-Chiu Chuang <weic...@cloudera.com> wrote:

> or maybe https://issues.apache.org/jira/browse/HDFS-14476
>
> I reverted this fix and I've not looked at it further. But take a look.
> It's disappointing to me that none of the active Hadoop contributors
> seem to understand DirectoryScanner well enough.
>
> or HDFS-11187 <https://issues.apache.org/jira/browse/HDFS-11187>
>
> If you can post a jstack snippet I might be able to help out.
>
> On Thu, Sep 26, 2019 at 9:48 PM Viral Bajaria <viral.baja...@gmail.com>
> wrote:
>
>> Thanks for the quick response, Jonathan.
>>
>> Honestly, I am not sure if 2.10.0 will fix my issue, but it looks
>> similar to https://issues.apache.org/jira/browse/HDFS-14536, which is
>> not fixed yet, so we will probably not see the benefit.
>>
>> We need to dig more into the logs and jstacks and create a JIRA to see
>> if the developers can comment on what's going on. A datanode restart
>> fixes the latency issue, so it is something that builds up over time
>> and needs the right instrumentation to figure out!
>>
>> Thanks,
>> Viral
>>
>>
>> On Thu, Sep 26, 2019 at 6:41 PM Jonathan Hung <jyhung2...@gmail.com>
>> wrote:
>>
>> > Hey Viral, yes. We're working on a 2.10.0 release; I'm the release
>> > manager for that. I can't comment on the particular issue you're
>> > seeing, but I plan to start the release process for 2.10.0 early
>> > next week, and 2.10.0 will be released shortly after that (assuming
>> > all goes well).
>> >
>> > Thanks,
>> > Jonathan Hung
>> >
>> >
>> > On Thu, Sep 26, 2019 at 6:34 PM Viral Bajaria <viral.baja...@gmail.com>
>> > wrote:
>> >
>> >> (Cross-posting from the user list based on feedback from Sean Busbey)
>> >>
>> >> All,
>> >>
>> >> Just saw the announcement of the new Hadoop 3.2.1 release.
>> >> Congratulations, team, and thanks for all the hard work!
>> >>
>> >> Are we going to see a new release in the 2.x.x line?
>> >>
>> >> I noticed that a bunch of tickets resolved in the last year have
>> >> been tagged with 2.10.0 as a fix version, and it's been a while
>> >> since 2.9.2 was released, so I was wondering: are we going to see a
>> >> 2.10.0 release soon, or should we start looking to upgrade to the
>> >> 3.x line?
>> >>
>> >> The reason I ask is that we are seeing very high ReadBlockOp latency
>> >> on our datanodes and suspect that the issue is due to some locking
>> >> going on between the VolumeScanner, the DirectoryScanner, RW block
>> >> operations, and the MetricsRegistry. Looking at a few JIRAs, it
>> >> looks like 2.10.0 might have some fixes that we should try; not
>> >> fully sure yet!
>> >>
>> >> Thanks,
>> >> Viral
>> >>
>> >
>> >
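PS: in case anyone wants to compare numbers, the ReadBlockOp latency I
mention in the quoted mail comes from the DataNode metrics exposed
through the JMX JSON servlet. A quick, rough way to pull it (this
assumes the default 2.x datanode web port 50075 and plain HTTP; adjust
host and port for your setup):

    curl -s 'http://<datanode-host>:50075/jmx' | grep -i ReadBlockOp
    # ReadBlockOpNumOps / ReadBlockOpAvgTime come from the
    # DataNodeActivity bean

Here <datanode-host> is a placeholder for one of the affected datanodes.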