[ 
https://issues.apache.org/jira/browse/HDFS-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16897220#comment-16897220
 ] 

Stephen O'Donnell commented on HDFS-14657:
------------------------------------------

I did not look back at the 2.x branch FBR processing code, but focused on the 
trunk code.

On trunk, it seems block reports are processed by walking two sorted iterators:

1. The first comes from the block report itself

2. The second is a block iterator over the storage the FBR is for.

The code then makes use of the fact that both are sorted to do a merge of the 
two lists.
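
Roughly, that merge looks like the sketch below. This is only a simplified 
illustration, not the actual BlockManager code: plain block IDs stand in for 
the reported and stored block objects, and the print statements stand in for 
the real add/remove/reconcile handling.

{code:java}
import java.util.Iterator;
import java.util.List;

/**
 * Simplified sketch of merging a sorted block report against a sorted
 * iterator over the blocks the NN already has for that storage.
 */
public class SortedReportMerge {

  /** Walk both sorted sequences once, classifying each block. */
  static void mergeReport(Iterator<Long> reportedBlocks, Iterator<Long> storedBlocks) {
    Long reported = reportedBlocks.hasNext() ? reportedBlocks.next() : null;
    Long stored = storedBlocks.hasNext() ? storedBlocks.next() : null;

    while (reported != null || stored != null) {
      if (stored == null || (reported != null && reported < stored)) {
        // Reported but not stored on this storage: add it, or queue it for
        // invalidation if the NN no longer knows the block.
        System.out.println("add or invalidate reported block " + reported);
        reported = reportedBlocks.hasNext() ? reportedBlocks.next() : null;
      } else if (reported == null || stored < reported) {
        // Stored but not reported: the replica has gone missing from this storage.
        System.out.println("remove stored block " + stored);
        stored = storedBlocks.hasNext() ? storedBlocks.next() : null;
      } else {
        // Present in both: reconcile state, then advance both iterators.
        System.out.println("reconcile block " + reported);
        reported = reportedBlocks.hasNext() ? reportedBlocks.next() : null;
        stored = storedBlocks.hasNext() ? storedBlocks.next() : null;
      }
    }
  }

  public static void main(String[] args) {
    mergeReport(List.of(1L, 2L, 4L).iterator(), List.of(2L, 3L, 4L).iterator());
  }
}
{code}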

The problem with dropping the write lock is therefore that this second 
iterator can be invalidated by a concurrent modification. That is probably 
solvable, however, either by making the iterator keyed and reopening it at the 
correct position after re-acquiring the lock (or after it throws a 
ConcurrentModificationException), or by fast-forwarding it to the correct 
position (we would need to check the overhead of this).
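
As a rough illustration of the "keyed iterator" idea, something like the sketch 
below could be reopened and fast-forwarded after a yield. The TreeSet of block 
IDs is only a stand-in for the real per-storage block structure, so treat this 
as a sketch of the approach rather than a proposal for the actual classes.

{code:java}
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/**
 * Sketch of a "keyed" storage iterator that remembers the last block ID it
 * returned, so it can be reopened and fast-forwarded after the write lock is
 * dropped and re-acquired (or after a ConcurrentModificationException).
 */
public class ResumableStorageIterator {
  private final NavigableSet<Long> storage; // stand-in for the storage's sorted blocks
  private Iterator<Long> it;
  private Long lastKey;                      // last block ID returned before the yield

  ResumableStorageIterator(NavigableSet<Long> storage) {
    this.storage = storage;
    this.it = storage.iterator();
  }

  boolean hasNext() { return it.hasNext(); }

  Long next() {
    lastKey = it.next();
    return lastKey;
  }

  /** Reopen the iterator and skip everything up to and including the last key. */
  void reopenAfterYield() {
    it = (lastKey == null)
        ? storage.iterator()
        : storage.tailSet(lastKey, false).iterator(); // fast-forward past lastKey
  }

  public static void main(String[] args) {
    NavigableSet<Long> blocks = new TreeSet<>(List.of(1L, 2L, 3L, 5L, 8L));
    ResumableStorageIterator rsi = new ResumableStorageIterator(blocks);
    System.out.println(rsi.next()); // 1
    System.out.println(rsi.next()); // 2
    blocks.remove(3L);              // a concurrent change while the lock was dropped
    rsi.reopenAfterYield();         // resume from where we left off
    System.out.println(rsi.next()); // 5
  }
}
{code}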

It probably then makes sense to consider what, outside the current block 
report, can alter the blocks in the storage for a volume:

1. A newly added file gets closed - this would update the storage via IBR, but 
it will be blocked by the block report lock this patch introduces, so that is 
probably not an issue.

2. A new file gets created - this is basically the same as 1.

3. The balancer or mover changing the location of blocks. These would be 
updated in the NN via more FBRs or IBRs and should not be an issue due to the 
block report processing lock.

4. A file gets deleted - A delete will immediately remove the blocks from the 
storage, but I think the FBR processing code will handle this. It checks to see 
if the block is present in the NN, and if it is not, it adds it to the 
invalidate list. The block is likely already on the list from the delete, but 
that is unlikely to be an issue either.

5. The node going dead - unlikely as it just sent a FBR.

6. Decommissioning / maintenance mode - these would also impact the blocks on 
the storage only via IBRs or FBRs.

I'm sure there are other scenarios I have not considered. Can anyone come up 
with any more?

Aside from the above, I see that the latest patch yields the write lock every 
5000 blocks. Have you been able to do any tests to see how long it takes to 
process 5,000, 50,000 and 100,000 blocks? I wonder if we would be better off 
setting the default limit for releasing the lock to something a lot higher than 
5000, like 50k or 100k, depending on the typical processing time for a batch of 
that size. That would reduce the overhead of reopening the storage iterator too 
many times, and it would also prevent the FBR processing from taking too long 
when the NN is under pressure with many other threads wanting the write lock.
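
For reference, the yield pattern under discussion looks roughly like the sketch 
below. The ReentrantReadWriteLock and the method names are stand-ins for the 
real FSNamesystem lock and BlockManager logic, and the batch size constant is 
the knob being discussed: with a fair lock, every yield lets queued RPC 
handlers run before report processing continues, but each yield also means 
reopening the storage iterator.

{code:java}
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Sketch of yielding the namesystem write lock every N blocks while
 * processing a full block report. Names and the lock are placeholders,
 * not the actual FSNamesystem API.
 */
public class YieldingReportProcessor {
  private static final int YIELD_BATCH_SIZE = 5000; // candidate values: 5k, 50k, 100k

  // Fair lock, so waiting RPC handlers get in ahead of us when we re-acquire.
  private final ReentrantReadWriteLock namesystemLock = new ReentrantReadWriteLock(true);

  void processReport(Iterator<Long> reportedBlocks) {
    namesystemLock.writeLock().lock();
    try {
      int processedInBatch = 0;
      while (reportedBlocks.hasNext()) {
        processBlock(reportedBlocks.next());
        if (++processedInBatch >= YIELD_BATCH_SIZE && reportedBlocks.hasNext()) {
          // Yield the write lock so queued RPCs can run, then re-acquire it.
          namesystemLock.writeLock().unlock();
          namesystemLock.writeLock().lock();
          processedInBatch = 0;
          // The storage iterator would need to be reopened/fast-forwarded here.
        }
      }
    } finally {
      namesystemLock.writeLock().unlock();
    }
  }

  private void processBlock(long blockId) {
    // Placeholder for the per-block merge/reconcile work.
  }

  public static void main(String[] args) {
    new YieldingReportProcessor().processReport(List.of(1L, 2L, 3L).iterator());
  }
}
{code}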

> Refine NameSystem lock usage during processing FBR
> --------------------------------------------------
>
>                 Key: HDFS-14657
>                 URL: https://issues.apache.org/jira/browse/HDFS-14657
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Chen Zhang
>            Assignee: Chen Zhang
>            Priority: Major
>         Attachments: HDFS-14657-001.patch, HDFS-14657.002.patch
>
>
> Disks with 12TB capacity are very common today, which means FBRs are much 
> larger than before. The Namenode holds the NameSystem lock while processing 
> the block report for each storage, which might take quite a long time.
> In our production environment, processing a large FBR usually causes longer 
> RPC queue times, which impacts client latency, so we did some simple work on 
> refining the lock usage, which improved the p99 latency significantly.
> In our solution, the BlockManager releases the NameSystem write lock and 
> re-acquires it every 5000 blocks (by default) during FBR processing; with the 
> fair lock, all the pending RPC requests can be processed before the 
> BlockManager re-acquires the write lock.


