[jira] [Comment Edited] (HDFS-14657) Refine NameSystem lock usage during processing FBR

2019-07-31 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896917#comment-16896917
 ] 

Chen Zhang edited comment on HDFS-14657 at 7/31/19 3:20 PM:


Thanks [~shv], but sorry, I can't see any problem with this change on the 2.6 version.
{quote}I believe when you release the lock while iterating over the storage 
blocks, the iterator may find itself in an isolated chain of the list after 
reacquiring the lock
{quote}
It won't happen, because processReport doesn't iterate over the storage blocks in 
2.6. The whole FBR procedure (for each storage) can be simplified like this (a 
rough code sketch follows the list):
 # Insert a delimiter at the head of the block list of this storage (the 
triplets structure; it's actually a doubly linked list, so I'll refer to it as 
the block list for simplicity).
 # Start a loop and iterate through the block report:
 ## Get a block from the report.
 ## Use the block to look up the stored BlockInfo object in the BlockMap.
 ## Check the status of the block, and add it to the corresponding set (toAdd, 
toUc, toInvalidate, toCorrupt).
 ## Move the block to the head of the block list (which places it before the 
delimiter).
 # Start a loop to iterate through the block list, find the blocks after the 
delimiter, and add them to the toRemove set.
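
For illustration, here is a minimal, self-contained sketch of the delimiter trick 
described above. It is not the actual 2.6 BlockManager code; the names (Node, 
blockMap, addToHead, the processReport signature) are made up for the example, and 
the real code keeps per-block triplets rather than plain prev/next pointers.
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified sketch of the 2.6-era per-storage FBR diff; not the real BlockManager code. */
public class FbrDiffSketch {

  /** Hypothetical stand-in for a BlockInfo node in the per-storage block list. */
  static class Node {
    final long blockId;
    Node prev, next;
    Node(long blockId) { this.blockId = blockId; }
  }

  private Node head;                                        // head of the doubly linked block list
  private final Map<Long, Node> blockMap = new HashMap<>(); // stand-in for the BlocksMap

  private void addToHead(Node n) {
    n.prev = null;
    n.next = head;
    if (head != null) head.prev = n;
    head = n;
  }

  private void unlink(Node n) {
    if (n.prev != null) n.prev.next = n.next; else head = n.next;
    if (n.next != null) n.next.prev = n.prev;
    n.prev = n.next = null;
  }

  /** Steps 1-3 of the procedure above: returns the ids that would go to toRemove. */
  public List<Long> processReport(long[] reportedBlockIds) {
    Node delimiter = new Node(-1);                // step 1: delimiter at the head
    addToHead(delimiter);

    for (long id : reportedBlockIds) {            // step 2: walk the report
      Node stored = blockMap.get(id);
      if (stored == null) continue;               // would be handled via toAdd/toInvalidate etc.
      // ... status checks deciding toAdd / toUc / toInvalidate / toCorrupt ...
      unlink(stored);
      addToHead(stored);                          // now placed before the delimiter
    }

    List<Long> toRemove = new ArrayList<>();      // step 3: everything left after the delimiter
    for (Node n = delimiter.next; n != null; n = n.next) {
      toRemove.add(n.blockId);
    }
    unlink(delimiter);
    return toRemove;
  }

  public static void main(String[] args) {
    FbrDiffSketch s = new FbrDiffSketch();
    for (long id = 1; id <= 5; id++) {            // pretend blocks 1-5 are already stored
      Node n = new Node(id);
      s.blockMap.put(id, n);
      s.addToHead(n);
    }
    System.out.println(s.processReport(new long[]{1, 2, 3})); // prints [5, 4]
  }
}
{code}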

My proposal in this Jira is to release and re-acquire the NN lock between step 2.3 
and step 2.4. This solution won't affect the correctness of the block report 
procedure, for the following reasons:
 # In the end, all the reported blocks will have been moved before the delimiter.
 # If any other thread gets the NN lock before step 2.4 and adds some new blocks, 
they will be added at the head of the list.
 # If any other thread gets the NN lock before step 2.4 and removes some blocks, it 
won't affect the loop in step 2. (Please note that the delimiter can't be 
removed by other threads.)
 # All the blocks remaining after the delimiter should indeed be removed.

For the reasons described above, the problem you mentioned also won't happen:
{quote}you may remove replicas that were not supposed to be removed
{quote}
I agree with you that things are tricky here, but this change is quite simple and 
I think we can still make the impact clear.



> Refine NameSystem lock usage during processing FBR
> --
>
> Key: HDFS-14657
> URL: https://issues.apache.org/jira/browse/HDFS-14657
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14657-001.patch, HDFS-14657.002.patch
>
>
> Disks with 12 TB capacity are very common today, which means the FBR size is 
> much larger than before. The Namenode holds the NameSystemLock while processing 
> the block report for each storage, which might take quite a long time.
> On our production environment, processing a large FBR usually causes longer 
> RPC queue times, which impacts client latency, so we did some simple work on 
> refining the lock usage, which improved the p99 latency significantly.
> In our solution, the BlockManager releases the NameSystem write lock and requests 
> it again every 5000 blocks (by default) during FBR processing; with the 
> fair lock, all pending RPC requests can be processed before the BlockManager 
> re-acquires the write lock.
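
A minimal sketch of the lock-yielding pattern this description refers to (release 
and re-acquire a fair write lock every N blocks so queued RPC handlers can run). 
The class, method, and constant names here are hypothetical, not taken from the 
attached patches.
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Rough illustration of yielding a fair write lock every N processed blocks. */
public class LockYieldSketch {
  // Fair lock, so handlers queued behind the report processor run when it yields.
  private final ReentrantReadWriteLock nsLock = new ReentrantReadWriteLock(true);
  private static final int BLOCKS_PER_LOCK_HOLD = 5000;   // hypothetical default

  public void processStorageReport(long[] reportedBlocks) {
    nsLock.writeLock().lock();
    try {
      int processedSinceAcquire = 0;
      for (long block : reportedBlocks) {
        processOneBlock(block);
        if (++processedSinceAcquire >= BLOCKS_PER_LOCK_HOLD) {
          // Yield: release and immediately re-request the write lock; with a fair
          // lock, RPCs already waiting get served before we re-acquire.
          nsLock.writeLock().unlock();
          nsLock.writeLock().lock();
          processedSinceAcquire = 0;
        }
      }
    } finally {
      nsLock.writeLock().unlock();
    }
  }

  private void processOneBlock(long block) {
    // placeholder for the real per-block FBR work
  }
}
{code}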

[jira] [Comment Edited] (HDFS-14657) Refine NameSystem lock usage during processing FBR

2019-07-26 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893844#comment-16893844
 ] 

Chen Zhang edited comment on HDFS-14657 at 7/26/19 2:06 PM:


Thanks [~shv] for your comments. You are right, releasing the NN lock in the middle 
of the loop will cause a ConcurrentModificationException.

This patch was ported from our internal 2.6 branch; the implementation has changed a 
lot on trunk and I didn't check all the details. I just wanted to propose this 
demo solution and hear people's feedback.

If the community thinks this solution is feasible, I'll try to work out a 
complete patch for trunk next week, and I'll also test it on our cluster 
and post some performance numbers.









[jira] [Comment Edited] (HDFS-14657) Refine NameSystem lock usage during processing FBR

2019-07-25 Thread Konstantin Shvachko (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893228#comment-16893228
 ] 

Konstantin Shvachko edited comment on HDFS-14657 at 7/26/19 12:29 AM:
--

Hi [~zhangchen].
Looking at your patch v2, I am not sure I understand how your approach can 
work. Suppose you release the namesystem lock ({{namesystem.writeUnlock()}}) in 
{{reportDiffSorted()}}; then somebody (unrelated to block report processing) 
can modify blocks belonging to the storage, which will invalidate 
{{storageBlocksIterator}}, meaning that calling {{next()}} will throw a 
{{ConcurrentModificationException}}.
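
For readers less familiar with fail-fast iterators, here is a generic (non-HDFS) 
illustration of the failure mode described above:
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Generic fail-fast iterator demo; this is not HDFS code. */
public class CmeDemo {
  public static void main(String[] args) {
    List<Long> storageBlocks = new ArrayList<>(Arrays.asList(1L, 2L, 3L));
    Iterator<Long> it = storageBlocks.iterator();
    it.next();
    // Any structural modification while the iterator is live, e.g. made by another
    // thread during a window where the write lock was released, invalidates it:
    storageBlocks.add(4L);
    it.next(); // throws java.util.ConcurrentModificationException
  }
}
{code}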











[jira] [Comment Edited] (HDFS-14657) Refine NameSystem lock usage during processing FBR

2019-07-21 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889768#comment-16889768
 ] 

Chen Zhang edited comment on HDFS-14657 at 7/21/19 5:17 PM:


Uploaded a quick demo patch (v2) for the BlockManager.blockReportLock idea.









[jira] [Comment Edited] (HDFS-14657) Refine NameSystem lock usage during processing FBR

2019-07-21 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889763#comment-16889763
 ] 

Chen Zhang edited comment on HDFS-14657 at 7/21/19 4:49 PM:


Thanks [~xkrogen] for reminding me of the batch IBR processing feature.

This patch was initially applied to our 2.6 cluster, which doesn't have the batch IBR 
processing feature, so I overlooked this.

Considering batch IBR processing, I suggest we change 
DatanodeDescriptor.reportLock to *BlockManager.blockReportLock*, and each time 
before processing an FBR or IBR, acquire the blockReportLock first; the other 
behavior doesn't change.

I think this solution might be feasible for the following reasons:
 # FBRs and IBRs can only be processed one at a time (holding the FSNameSystem's 
write lock); adding a blockReportLock won't affect this behavior.
 # With the blockReportLock, we can release the FSNameSystem's write lock safely.
 # BlockReportProcessingThread must acquire the blockReportLock before 
starting batch IBR processing.

In addition, the original solution using DatanodeDescriptor.reportLock allows 
multiple FBRs to be processed at the same time, which may introduce some 
race conditions. With blockReportLock, the behavior stays almost the same as 
before; a rough sketch of this locking order follows.
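
Here is a rough, hypothetical sketch of the locking order suggested above; the field 
names (blockReportLock, namesystemLock) and the 5000-block yield interval are 
illustrative, not taken from the attached patches.
{code:java}
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch of the proposed BlockManager.blockReportLock ordering. */
public class BlockReportLockSketch {
  private final ReentrantLock blockReportLock = new ReentrantLock();   // serializes FBR/IBR processing
  private final ReentrantReadWriteLock namesystemLock = new ReentrantReadWriteLock(true);

  public void processFullBlockReport(long[] blocks) {
    blockReportLock.lock();                  // taken first and held across the whole report
    try {
      namesystemLock.writeLock().lock();
      try {
        for (int i = 0; i < blocks.length; i++) {
          // ... per-block processing ...
          if ((i + 1) % 5000 == 0) {
            // Safe to yield the namesystem lock here: blockReportLock still keeps
            // other FBR/IBR processing (including batch IBRs) out.
            namesystemLock.writeLock().unlock();
            namesystemLock.writeLock().lock();
          }
        }
      } finally {
        namesystemLock.writeLock().unlock();
      }
    } finally {
      blockReportLock.unlock();
    }
  }

  public void processIncrementalBlockReportBatch(long[] blocks) {
    blockReportLock.lock();                  // IBRs wait until any in-flight FBR completes
    try {
      namesystemLock.writeLock().lock();
      try {
        // ... batch IBR processing ...
      } finally {
        namesystemLock.writeLock().unlock();
      }
    } finally {
      blockReportLock.unlock();
    }
  }
}
{code}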

 









[jira] [Comment Edited] (HDFS-14657) Refine NameSystem lock usage during processing FBR

2019-07-19 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888568#comment-16888568
 ] 

He Xiaoqiao edited comment on HDFS-14657 at 7/19/19 6:27 AM:
-

Thanks [~zhangchen] for filing this JIRA, it is a very interesting improvement.
{quote}
1.Add a report lock to DatanodeDescriptor
2.Before processing the FBR and IBR, BlockManager should get the report lock 
for that node first
3.IBR must wait until FBR process complete, even the writelock may release and 
re-acquire many times during processing FBR
{quote}
Would you like to offer some more information about this improvement? It would be 
very helpful for reviewers, in my opinion.
IIUC, it changes only #blockReport processing in the NameNode (rather than in the 
DataNode), and only for a single DataNode; it holds a lock per DataNode and ensures 
no meta changes during #blockReport processing. So I think the risk of 
inconsistency is under control. A rough sketch of the quoted idea follows.
[~shv], would you like to share some further suggestions?
+ cc [~linyiqun], [~xkrogen]
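
The quoted three-step idea (a per-DataNode report lock acquired before FBR/IBR 
processing, so IBRs wait for an in-flight FBR even while the write lock is released 
and re-acquired) could look roughly like the following; this is a hypothetical 
sketch, not the attached patch.
{code:java}
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of a per-DataNode report lock; not the attached patch. */
public class DatanodeReportLockSketch {

  /** Stand-in for DatanodeDescriptor carrying the proposed report lock. */
  static class DatanodeDescriptor {
    final ReentrantLock reportLock = new ReentrantLock();
  }

  /**
   * Both FBR and IBR processing take the node's report lock first, so an IBR for a
   * node waits until that node's FBR finishes, even if the namesystem write lock is
   * released and re-acquired many times in between.
   */
  public void processReport(DatanodeDescriptor node, Runnable doProcessing) {
    node.reportLock.lock();
    try {
      doProcessing.run();   // FBR or IBR work, yielding the namesystem lock as needed
    } finally {
      node.reportLock.unlock();
    }
  }
}
{code}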






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org