[ https://issues.apache.org/jira/browse/HDFS-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16897220#comment-16897220 ]
Stephen O'Donnell commented on HDFS-14657:
------------------------------------------

I did not look back at the 2.x branch FBR processing code, but focused on the trunk code. On trunk, it seems block reports are processed by walking two sorted iterators:

1. The first comes from the block report itself.
2. The second is a block iterator over the storage the FBR is for.

The code then makes use of the fact that both are sorted to do a sort of merge of the two lists. The problem with dropping the write lock is therefore that this second iterator can be invalidated by a concurrent modification. However, that is probably solvable, either by making the iterator keyed and reopening it at the correct position after re-acquiring the lock (or when it throws ConcurrentModificationException), or by fast-forwarding it to the correct position (we would need to check the overhead of this).

It probably then makes sense to consider what, outside the current block report, can alter the blocks in a storage for a volume:

1. A newly added file gets closed - this would update the storage via IBR, but it will be blocked by the block report lock this patch introduces, so that is probably not an issue.
2. A new file gets created - this is basically the same as 1.
3. The balancer or mover changing the location of blocks - these would be updated in the NN via more FBRs or IBRs and should not be an issue due to the block report processing lock.
4. A file gets deleted - a delete will immediately remove the blocks from the storage, but I think the FBR processing code will handle this. It checks to see if the block is present in the NN, and if it is not, it adds it to the invalidate list. The block is likely already on the list from the delete, but that is unlikely to be an issue either.
5. The node going dead - unlikely, as it just sent a FBR.
6. Decommissioning / maintenance mode - these would impact the blocks on the storage only via IBRs or FBRs too.
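To make the two-iterator walk concrete, here is a minimal standalone sketch (not the actual BlockManager code; the class and method names are illustrative only) of merging a sorted block report against a sorted view of the NN's storage: matching IDs are reconciled, IDs only in the report are newly reported, and IDs only in the storage view are candidates for removal/invalidation.

```java
import java.util.*;

// Illustrative sketch of the sorted-merge idea used by FBR processing on
// trunk. Real HDFS walks BlockReportReplica/BlockInfo iterators; here we
// just use arrays of block IDs to show the lockstep walk.
public class SortedMerge {
    // Classify block IDs by advancing both sorted sequences in lockstep.
    public static Map<String, List<Long>> merge(long[] report, long[] storage) {
        Map<String, List<Long>> out = new HashMap<>();
        out.put("matched", new ArrayList<>());
        out.put("onlyInReport", new ArrayList<>());
        out.put("onlyInStorage", new ArrayList<>());
        int i = 0, j = 0;
        while (i < report.length && j < storage.length) {
            if (report[i] == storage[j]) {          // present in both: reconcile
                out.get("matched").add(report[i]);
                i++; j++;
            } else if (report[i] < storage[j]) {    // only in the report: new block
                out.get("onlyInReport").add(report[i++]);
            } else {                                // only in NN view: stale/removed
                out.get("onlyInStorage").add(storage[j++]);
            }
        }
        while (i < report.length) out.get("onlyInReport").add(report[i++]);
        while (j < storage.length) out.get("onlyInStorage").add(storage[j++]);
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<Long>> r = merge(new long[]{1, 2, 4}, new long[]{2, 3, 4});
        System.out.println(r.get("matched"));       // [2, 4]
        System.out.println(r.get("onlyInReport"));  // [1]
        System.out.println(r.get("onlyInStorage")); // [3]
    }
}
```

This also shows why dropping the lock mid-merge is awkward: the walk depends on the storage-side iterator keeping its position, so after a concurrent change it must be reopened or fast-forwarded to the last processed key.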
I'm sure there are other scenarios I have not considered. Can anyone come up with any more?

Aside from the above, in the latest patch I see it will yield the write lock every 5000 blocks. Have you been able to do any tests to see how long it takes to process 5,000, 50,000 and 100,000 blocks? I wonder if we would be better setting the default limit for releasing the lock a lot higher than 5000, like 50k or 100k, depending on the typical processing time of a batch that size. That would reduce the overhead of having to reopen the storage iterator too many times, and also prevent the FBR processing from taking too long if the NN is under pressure with many other threads wanting the write lock.

> Refine NameSystem lock usage during processing FBR
> --------------------------------------------------
>
>                 Key: HDFS-14657
>                 URL: https://issues.apache.org/jira/browse/HDFS-14657
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Chen Zhang
>            Assignee: Chen Zhang
>            Priority: Major
>         Attachments: HDFS-14657-001.patch, HDFS-14657.002.patch
>
> The disk with 12TB capacity is very normal today, which means the FBR size is
> much larger than before. The Namenode holds the NameSystem lock while processing
> the block report for each storage, which might take quite a long time.
> On our production environment, processing a large FBR usually causes a longer
> RPC queue time, which impacts client latency, so we did some simple work on
> refining the lock usage, which improved the p99 latency significantly.
> In our solution, the BlockManager releases the NameSystem write lock and requests
> it again for every 5000 blocks (by default) during FBR processing; with the
> fair lock, all the pending RPC requests can be processed before the BlockManager
> re-acquires the write lock.
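The release-and-reacquire pattern the patch summary describes can be sketched as below. This is a hedged illustration, not the actual patch: the class and method names are hypothetical, and real BlockManager work per block is replaced by a counter. The key point is that a fair `ReentrantReadWriteLock` hands the lock to queued waiters (e.g. RPC handlers) each time the writer releases it between batches.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of yielding the NameSystem write lock every N blocks
// during FBR processing. Not the real HDFS code; processReport/batchSize
// are stand-ins for the patch's mechanism.
public class FbrLockYield {
    // Fair mode: waiting threads acquire roughly in arrival order, so
    // queued RPCs get served each time the write lock is dropped.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
    private long processed = 0;

    public long processReport(long[] blockIds, int batchSize) {
        lock.writeLock().lock();
        try {
            int inBatch = 0;
            for (long id : blockIds) {
                processed++;                 // stand-in for per-block FBR work
                if (++inBatch >= batchSize) {
                    inBatch = 0;
                    // Yield: release and immediately re-request the write lock,
                    // letting other fair-lock waiters run in between.
                    lock.writeLock().unlock();
                    lock.writeLock().lock();
                    // After re-acquiring, the storage iterator may have been
                    // invalidated and would need reopening/fast-forwarding.
                }
            }
        } finally {
            lock.writeLock().unlock();
        }
        return processed;
    }

    public static void main(String[] args) {
        FbrLockYield y = new FbrLockYield();
        System.out.println(y.processReport(new long[12], 5)); // 12
    }
}
```

This also makes the batch-size trade-off visible: a larger `batchSize` means fewer unlock/lock cycles (and fewer iterator reopens), at the cost of holding the write lock longer per batch.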
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org