[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2020-04-13 Thread Masatake Iwasaki (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082107#comment-17082107
 ] 

Masatake Iwasaki commented on HDFS-14476:
-

+1 on the HDFS-14476-branch-2.10.02.patch. Committing this and 
HDFS-15048-branch-2.10.002.patch to branch-2.10.

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 3.3.0, 2.10.1
>
> Attachments: HDFS-14476-branch-2.01.patch, 
> HDFS-14476-branch-2.02.patch, HDFS-14476-branch-2.10.02.patch, 
> HDFS-14476.00.patch, HDFS-14476.002.patch, HDFS-14476.01.patch, 
> HDFS-14476.branch-3.2.001.patch, datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2020-04-12 Thread Sean Chow (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082042#comment-17082042
 ] 

Sean Chow commented on HDFS-14476:
--

I think this patch is safe for 3.3.0 as 
https://issues.apache.org/jira/browse/HDFS-15048 has been merged.

The later discussion is focused on merging it to branch-2. For caused issue 
HDFS-15048, I've add a patch for branch-2.10 in the link 
https://issues.apache.org/jira/browse/HDFS-15048 . Hope it will be committed 
quickly and make this HDFS-14476 *Resolved*

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14476-branch-2.01.patch, 
> HDFS-14476-branch-2.02.patch, HDFS-14476-branch-2.10.02.patch, 
> HDFS-14476.00.patch, HDFS-14476.002.patch, HDFS-14476.01.patch, 
> HDFS-14476.branch-3.2.001.patch, datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2020-04-10 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081127#comment-17081127
 ] 

Wei-Chiu Chuang commented on HDFS-14476:


We need HDFS-15048 too.

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14476-branch-2.01.patch, 
> HDFS-14476-branch-2.02.patch, HDFS-14476-branch-2.10.02.patch, 
> HDFS-14476.00.patch, HDFS-14476.002.patch, HDFS-14476.01.patch, 
> HDFS-14476.branch-3.2.001.patch, datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2020-04-10 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081122#comment-17081122
 ] 

Hadoop QA commented on HDFS-14476:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 15m 
44s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} branch-2.10 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 12m 
35s{color} | {color:green} branch-2.10 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
53s{color} | {color:green} branch-2.10 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
46s{color} | {color:green} branch-2.10 passed with JDK v1.8.0_242 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
29s{color} | {color:green} branch-2.10 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
56s{color} | {color:green} branch-2.10 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
6s{color} | {color:green} branch-2.10 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
8s{color} | {color:green} branch-2.10 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
44s{color} | {color:green} branch-2.10 passed with JDK v1.8.0_242 {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed with JDK v1.8.0_242 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 26s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 4 new + 45 unchanged - 0 fixed = 49 total (was 45) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  2m 
11s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs generated 1 new + 0 
unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
7s{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed with JDK v1.8.0_242 {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 54s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}116m 34s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs-project/hadoop-hdfs |
|  |  org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile() calls 
Thread.sleep() with a lock held  At DirectoryScanner.java:lock held  At 
DirectoryScanner.java:[line 587] |
| Failed junit tests | 
hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
|   | hadoop.hdfs.server.namenode.TestFsck |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 Image:yetus/hadoop:bfeec47759e |
| JIRA Issue | HDFS-14476 |
| JIRA Patch URL | 

[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2020-04-09 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079353#comment-17079353
 ] 

Hadoop QA commented on HDFS-14476:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
42s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} branch-2.10 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  9m 
57s{color} | {color:green} branch-2.10 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
52s{color} | {color:green} branch-2.10 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
51s{color} | {color:green} branch-2.10 passed with JDK v1.8.0_242 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
31s{color} | {color:green} branch-2.10 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
56s{color} | {color:green} branch-2.10 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
1s{color} | {color:green} branch-2.10 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
11s{color} | {color:green} branch-2.10 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
46s{color} | {color:green} branch-2.10 passed with JDK v1.8.0_242 {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
48s{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed with JDK v1.8.0_242 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 27s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 4 new + 45 unchanged - 0 fixed = 49 total (was 45) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  2m  
7s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs generated 1 new + 0 
unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
11s{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed with JDK v1.8.0_242 {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 61m 53s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
31s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 90m 16s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs-project/hadoop-hdfs |
|  |  org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile() calls 
Thread.sleep() with a lock held  At DirectoryScanner.java:lock held  At 
DirectoryScanner.java:[line 587] |
| Failed junit tests | 
hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
|   | hadoop.hdfs.server.namenode.TestFsck |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 Image:yetus/hadoop:bfeec47759e |
| JIRA Issue | HDFS-14476 |
| JIRA Patch URL | 

[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2020-03-24 Thread Sean Chow (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065398#comment-17065398
 ] 

Sean Chow commented on HDFS-14476:
--

Hi [~weichiu], I can't see why it builds failed. Any chance to merge this to 
2.10.1?

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14476-branch-2.01.patch, 
> HDFS-14476-branch-2.02.patch, HDFS-14476.00.patch, HDFS-14476.002.patch, 
> HDFS-14476.01.patch, HDFS-14476.branch-3.2.001.patch, 
> datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2020-02-11 Thread Sean Chow (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034391#comment-17034391
 ] 

Sean Chow commented on HDFS-14476:
--

Hi [~weichiu] , what I can do to make it build pass.

thanks.

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14476-branch-2.01.patch, 
> HDFS-14476-branch-2.02.patch, HDFS-14476.00.patch, HDFS-14476.002.patch, 
> HDFS-14476.01.patch, HDFS-14476.branch-3.2.001.patch, 
> datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2020-01-16 Thread Sean Chow (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017759#comment-17017759
 ] 

Sean Chow commented on HDFS-14476:
--

Build failed:(

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14476-branch-2.01.patch, 
> HDFS-14476-branch-2.02.patch, HDFS-14476.00.patch, HDFS-14476.002.patch, 
> HDFS-14476.01.patch, HDFS-14476.branch-3.2.001.patch, 
> datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2020-01-14 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015457#comment-17015457
 ] 

Hadoop QA commented on HDFS-14476:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red}  0m 
13s{color} | {color:red} Docker failed to build yetus/hadoop:f555aa740b5. 
{color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-14476 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12990832/HDFS-14476-branch-2.02.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28673/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14476-branch-2.01.patch, 
> HDFS-14476-branch-2.02.patch, HDFS-14476.00.patch, HDFS-14476.002.patch, 
> HDFS-14476.01.patch, HDFS-14476.branch-3.2.001.patch, 
> datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2020-01-14 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015410#comment-17015410
 ] 

Hadoop QA commented on HDFS-14476:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red}  0m 
26s{color} | {color:red} Docker failed to build yetus/hadoop:f555aa740b5. 
{color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-14476 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12990832/HDFS-14476-branch-2.02.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28672/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14476-branch-2.01.patch, 
> HDFS-14476-branch-2.02.patch, HDFS-14476.00.patch, HDFS-14476.002.patch, 
> HDFS-14476.01.patch, HDFS-14476.branch-3.2.001.patch, 
> datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2020-01-14 Thread Sean Chow (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014921#comment-17014921
 ] 

Sean Chow commented on HDFS-14476:
--

Attach the patch for branch-2: HDFS-14476-branch-2.02.patch

This patch also fix issue HDFS-14751

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14476-branch-2.01.patch, 
> HDFS-14476-branch-2.02.patch, HDFS-14476.00.patch, HDFS-14476.002.patch, 
> HDFS-14476.01.patch, HDFS-14476.branch-3.2.001.patch, 
> datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2020-01-08 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010514#comment-17010514
 ] 

Wei-Chiu Chuang commented on HDFS-14476:


Sounds good to me, but the patch does not apply in lower branches. Could you 
offer a new version of patch for those branches?

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14476-branch-2.01.patch, HDFS-14476.00.patch, 
> HDFS-14476.002.patch, HDFS-14476.01.patch, HDFS-14476.branch-3.2.001.patch, 
> datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2020-01-06 Thread Sean Chow (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008662#comment-17008662
 ] 

Sean Chow commented on HDFS-14476:
--

Hi [~weichiu], will you merge this to version 2.8.x and 2.9.x?

Thanks for [~leosun08]  fixing the testcase.

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14476-branch-2.01.patch, HDFS-14476.00.patch, 
> HDFS-14476.002.patch, HDFS-14476.01.patch, HDFS-14476.branch-3.2.001.patch, 
> datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-12-11 Thread Takanobu Asanuma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16993324#comment-16993324
 ] 

Takanobu Asanuma commented on HDFS-14476:
-

The patch may generate a findbug. Please see HDFS-15048.

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14476-branch-2.01.patch, HDFS-14476.00.patch, 
> HDFS-14476.002.patch, HDFS-14476.01.patch, HDFS-14476.branch-3.2.001.patch, 
> datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-12-06 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16990227#comment-16990227
 ] 

Hudson commented on HDFS-14476:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17735 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17735/])
HDFS-14476. lock too long when fix inconsistent blocks between disk and 
(weichiu: rev 313b76f8e92643e3412a98dc73f83437729f3984)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DirectoryScanner.java


> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14476-branch-2.01.patch, HDFS-14476.00.patch, 
> HDFS-14476.002.patch, HDFS-14476.01.patch, HDFS-14476.branch-3.2.001.patch, 
> datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-08-21 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912475#comment-16912475
 ] 

Hudson commented on HDFS-14476:
---

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #17160 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17160/])
Revert "HDFS-14476. lock too long when fix inconsistent blocks between 
(weichiu: rev 57f737017465cccb0f6b5ab6e3130ef49a02d4c2)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DirectoryScanner.java


> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 2.10.0, 2.8.6, 2.9.3
>
> Attachments: HDFS-14476-branch-2.01.patch, HDFS-14476.00.patch, 
> HDFS-14476.002.patch, HDFS-14476.01.patch, HDFS-14476.branch-3.2.001.patch, 
> datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-08-21 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912414#comment-16912414
 ] 

Lisheng Sun commented on HDFS-14476:


[~jojochuang] . I am my problem.  The Jira is HDFS-14751. The apache build 
platform 
[https://builds.apache.org/job/PreCommit-HDFS-Build/18334/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 
|https://builds.apache.org/job/PreCommit-HDFS-Build/18334/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt]
 and my local reproduce this problem. 

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 2.10.0, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14476-branch-2.01.patch, HDFS-14476.00.patch, 
> HDFS-14476.002.patch, HDFS-14476.01.patch, HDFS-14476.branch-3.2.001.patch, 
> datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-08-21 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912411#comment-16912411
 ] 

Wei-Chiu Chuang commented on HDFS-14476:


[~leosun08] could you re-confirm? HDFS-11396 was resolved last year.

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 2.10.0, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: HDFS-14476-branch-2.01.patch, HDFS-14476.00.patch, 
> HDFS-14476.002.patch, HDFS-14476.01.patch, HDFS-14476.branch-3.2.001.patch, 
> datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-08-18 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910163#comment-16910163
 ] 

Hadoop QA commented on HDFS-14476:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
44s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} branch-3.2 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
 8s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
56s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
45s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
5s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 42s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
6s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
50s{color} | {color:green} branch-3.2 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 57s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}128m 28s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  1m 
11s{color} | {color:red} The patch generated 1 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}193m  9s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.server.namenode.TestNameNodeMetadataConsistency |
|   | hadoop.hdfs.server.namenode.TestCheckpoint |
|   | hadoop.hdfs.server.namenode.TestReencryptionWithKMS |
|   | hadoop.hdfs.server.diskbalancer.TestDiskBalancer |
|   | hadoop.hdfs.server.balancer.TestBalancer |
|   | hadoop.hdfs.server.namenode.TestStripedINodeFile |
|   | hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics |
|   | hadoop.hdfs.server.namenode.TestFSEditLogLoader |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=18.09.7 Server=18.09.7 Image:yetus/hadoop:63396beab41 |
| JIRA Issue | HDFS-14476 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12977905/HDFS-14476.branch-3.2.001.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux c991b824367d 4.15.0-52-generic #56-Ubuntu SMP Tue Jun 4 
22:49:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | branch-3.2 / e89413d |
| maven | version: Apache Maven 3.3.9 |
| 

[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-08-18 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910103#comment-16910103
 ] 

Wei-Chiu Chuang commented on HDFS-14476:


Pushed to trunk, branch-2, branch-2.9, branch-2.8.
There are conflicts in branch-3.2 and below. Updated the trunk patch and submit 
for precommit build.

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Fix For: 2.10.0, 3.3.0, 2.8.6, 2.9.3
>
> Attachments: HDFS-14476-branch-2.01.patch, HDFS-14476.00.patch, 
> HDFS-14476.002.patch, HDFS-14476.01.patch, HDFS-14476.branch-3.2.001.patch, 
> datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-08-18 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910100#comment-16910100
 ] 

Hadoop QA commented on HDFS-14476:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  6s{color} 
| {color:red} HDFS-14476 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-14476 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12977902/HDFS-14476.002.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/27553/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Attachments: HDFS-14476-branch-2.01.patch, HDFS-14476.00.patch, 
> HDFS-14476.002.patch, HDFS-14476.01.patch, datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-08-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910099#comment-16910099
 ] 

Hudson commented on HDFS-14476:
---

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #17144 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17144/])
HDFS-14476. lock too long when fix inconsistent blocks between disk and 
(weichiu: rev b58a35f374a9a750fddc2fc92e7f7a7ae8a4d3a4)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DirectoryScanner.java


> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Attachments: HDFS-14476-branch-2.01.patch, HDFS-14476.00.patch, 
> HDFS-14476.002.patch, HDFS-14476.01.patch, datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-08-18 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910098#comment-16910098
 ] 

Wei-Chiu Chuang commented on HDFS-14476:


+1
[~seanlook] looks like you're the first time contributor to Hadoop. Welcome.
Note that there is a checkstyle warning. I find this Hadoop code formatter 
works for me most of time: 
https://github.com/cloudera/blog-eclipse/blob/master/hadoop-format.xml For your 
reference in the future.

I've taken care of the checkstyle warning, and committed the code for you. The 
updated patch is [^HDFS-14476.002.patch]  for your reference.



> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Attachments: HDFS-14476-branch-2.01.patch, HDFS-14476.00.patch, 
> HDFS-14476.002.patch, HDFS-14476.01.patch, datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-08-16 Thread Sean Chow (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908781#comment-16908781
 ] 

Sean Chow commented on HDFS-14476:
--

Hi [~jojochuang] , is that patch ok? And what's the next step to have it merged 
to the main trunk?
Thanks.

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0, 3.0.3
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Attachments: HDFS-14476-branch-2.01.patch, HDFS-14476.00.patch, 
> HDFS-14476.01.patch, datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-08-07 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901944#comment-16901944
 ] 

Hadoop QA commented on HDFS-14476:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  9m 
23s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} branch-2 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 13m 
44s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
51s{color} | {color:green} branch-2 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
45s{color} | {color:green} branch-2 passed with JDK v1.8.0_222 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
27s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
59s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
4s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
5s{color} | {color:green} branch-2 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
45s{color} | {color:green} branch-2 passed with JDK v1.8.0_222 {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed with JDK v1.8.0_222 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 23s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 5 new + 45 unchanged - 0 fixed = 50 total (was 45) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
4s{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed with JDK v1.8.0_222 {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 61m 35s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
31s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}101m 45s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
|   | hadoop.hdfs.web.TestWebHdfsTimeouts |
|   | hadoop.hdfs.TestSafeMode |
|   | hadoop.hdfs.server.namenode.TestNamenodeCapacityReport |
|   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:da67579 |
| JIRA Issue | HDFS-14476 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12976908/HDFS-14476-branch-2.01.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  

[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-08-07 Thread Sean Chow (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901864#comment-16901864
 ] 

Sean Chow commented on HDFS-14476:
--

patch updated!

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Attachments: HDFS-14476-branch-2.01.patch, HDFS-14476.00.patch, 
> HDFS-14476.01.patch, datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-08-06 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901645#comment-16901645
 ] 

Wei-Chiu Chuang commented on HDFS-14476:


[~seanlook] please follow the How to contribute wiki 
[https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute] and name 
the patch file according to the branch where it applies. You will also need to 
submit the patch to kick off precommit check. Thank you.

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Attachments: HDFS-14476.00.patch, datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-08-06 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901462#comment-16901462
 ] 

Wei-Chiu Chuang commented on HDFS-14476:


Sean, please offer patch for trunk branch. If the patch applies I can get them 
into lower branches (3.2, 3.1.,...)

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Attachments: HDFS-14476.00.patch, datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-08-05 Thread Sean Chow (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900255#comment-16900255
 ] 

Sean Chow commented on HDFS-14476:
--

Thanks for reviewing. The already added patch works for banch2.6~branch2.9, so 
I will add a rebased patch tomorrow.

By the way, do I need to add this same patch to branch-3? I see the 
{{reconcile}} is a little diffrent.

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Attachments: HDFS-14476.00.patch, datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-07-30 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896635#comment-16896635
 ] 

Wei-Chiu Chuang commented on HDFS-14476:


Retract my +1. The patch no longer applies and need to rebase for trunk.

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Attachments: HDFS-14476.00.patch, datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-07-30 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896560#comment-16896560
 ] 

Wei-Chiu Chuang commented on HDFS-14476:


Looks great! +1 from me

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Sean Chow
>Assignee: Sean Chow
>Priority: Major
> Attachments: HDFS-14476.00.patch, datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run {{checkAndUpdate}} to fix it. However 
> {{FsDatasetImpl.checkAndUpdate}} is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *how to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-07-08 Thread Sean Chow (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880426#comment-16880426
 ] 

Sean Chow commented on HDFS-14476:
--

I've add a patch here to process inconsistent blocks in map with fixed batch 
size 500. And It did work well in my production.
 !datanode-with-patch-14476.png! 

You can see there's a pike periodically before applying this patch. This is the 
jstack output for that spike time:

{code:java}
"DataXceiver for client DFSClient_NONMAPREDUCE_-2048576664_27 at /x.x.x.x:49793 
[Sending block BP-1644920766-x.x.x.x-1450099987967:blk_2423788189_1403044486]" 
daemon prio=10 tid=0x7f8a0ea3d800 nid=0x19a2 waiting for monitor entry 
[0x7f89982a2000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:231)
- waiting to lock <0x0004001d6348> (a 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:529)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:116)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:243)
at java.lang.Thread.run(Thread.java:745)

"369604090@qtp-1069118552-20368" daemon prio=10 tid=0x01f94000 
nid=0x19a1 waiting for monitor entry [0x7f89c0d41000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:295)
- waiting to lock <0x0004001d6348> (a 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getAvailable(FsVolumeImpl.java:334)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.getRemaining(FsVolumeList.java:162)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getRemaining(FsDatasetImpl.java:496)
- locked <0x000400502b70> (a java.lang.Object)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)

"java.util.concurrent.ThreadPoolExecutor$Worker@704adcff[State = -1, empty 
queue]" daemon prio=10 tid=0x7f8a000f3000 nid=0x59f1 runnable 
[0x7f89c9c66000]
   java.lang.Thread.State: RUNNABLE
at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
at java.io.File.exists(File.java:813)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkAndUpdate(FsDatasetImpl.java:1965)
- locked <0x0004001d6348> (a 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
at 
org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:410)
at 
org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:360)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

{code}

All threads are blocked until timeout returned when the directory scanner do 
check blocks. 

Also we know that the scan result {{missing metadata files:23574, missing block 
files:23574, missing blocks in memory:47625}} is not so correct, because the 
scan process may take two hours to finish and the already scanned directories 
may have new files that will be added to inconsistent blocks map.

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Sean Chow
>Priority: Major
> Attachments: HDFS-14476.00.patch, datanode-with-patch-14476.png
>
>
> When directoryScanner have the results of differences 

[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-06-03 Thread Sean Chow (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854448#comment-16854448
 ] 

Sean Chow commented on HDFS-14476:
--

[~jojochuang] I've found out why _every 6hours' scan will have about 25000 
abnormal blocks to fix_.

Because the directory scan and get memory finalized blocks is asynchronously. 
The actual disk scan may take one hour to finish, between that time period 
other blocks is added or removed.

But the long held lock  mentioned in this issue still exists. I've this patch 
my own and will test it.

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Sean Chow
>Priority: Major
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run `checkAndUpdate` to fix it. However 
> `FsDatasetImpl.checkAndUpdate` is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-05-07 Thread Sean Chow (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835265#comment-16835265
 ] 

Sean Chow commented on HDFS-14476:
--

[~jojochuang] I'm sure my disk status is ok. I have more than 60 datanodes that 
directory scan have that lots of abnormal blocks to fix.

I think that because of lots of small files. In my another normal cluster, 
directory scan usually have 50-100 abnormal blocks. That won't be a problem to 
fix it.

 

like [~starphin] said, maybe 6millions blocks is too much for every datanodes 
that raises the amount of abnormal blocks.

 

whatever, the long held lock to checkAndUpdate shall be avoided to allow 
processing read/write requests.

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Sean Chow
>Priority: Major
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run `checkAndUpdate` to fix it. However 
> `FsDatasetImpl.checkAndUpdate` is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-05-07 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834860#comment-16834860
 ] 

Wei-Chiu Chuang commented on HDFS-14476:


bq. every 6hours' scan will have about 25000 abnormal blocks to fix.
I've never seen disk behave like this. Better to repair the disk, even though 
it sounds like an interesting scale bug :)

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Sean Chow
>Priority: Major
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run `checkAndUpdate` to fix it. However 
> `FsDatasetImpl.checkAndUpdate` is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14476) lock too long when fix inconsistent blocks between disk and in-memory

2019-05-07 Thread star (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834675#comment-16834675
 ] 

star commented on HDFS-14476:
-

HDFS-10477 may be a useful reference. Seems like a ubiquitous problem with 
large amount of blocks.

> lock too long when fix inconsistent blocks between disk and in-memory
> -
>
> Key: HDFS-14476
> URL: https://issues.apache.org/jira/browse/HDFS-14476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Sean Chow
>Priority: Major
>
> When directoryScanner have the results of differences between disk and 
> in-memory blocks. it will try to run `checkAndUpdate` to fix it. However 
> `FsDatasetImpl.checkAndUpdate` is a synchronized call
> As I have about 6millions blocks for every datanodes and every 6hours' scan 
> will have about 25000 abnormal blocks to fix. That leads to a long lock 
> holding FsDatasetImpl object.
> let's assume every block need 10ms to fix(because of latency of SAS disk), 
> that will cost 250 seconds to finish. That means all reads and writes will be 
> blocked for 3mins for that datanode.
>  
> {code:java}
> 2019-05-06 08:06:51,704 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1644920766-10.223.143.220-1450099987967 Total blocks: 6850197, missing 
> metadata files:23574, missing block files:23574, missing blocks in 
> memory:47625, mismatched blocks:0
> ...
> 2019-05-06 08:16:41,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Took 588402ms to process 1 commands from NN
> {code}
> Take long time to process command from nn because threads are blocked. And 
> namenode will see long lastContact time for this datanode.
> Maybe this affect all hdfs versions.
> *to fix:*
> just like process invalidate command from namenode with 1000 batch size, fix 
> these abnormal block should be handled with batch too and sleep 2 seconds 
> between the batch to allow normal reading/writing blocks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org