[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile

2015-10-27 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976233#comment-14976233
 ] 

Yi Liu commented on HDFS-9275:
--

Walter, sorry I forgot this JIRA :-) 

For a contiguous block, if n replicas are missing (with 3 replicas in total, at 
most 2 can be missing, so n < 3), we check the number of replicas in 
PendingReplicationBlocks to decide whether we need to schedule new block 
replication.
For reconstruction of a striped block, ideally we should follow the same 
approach: for any missing striped internal block we only need to reconstruct 1, 
so we should check whether there is 1 in PendingReplicationBlocks. But 
currently we track the block group in the list. As a result, we compare the 
total number of missing striped internal blocks with the number in 
PendingReplicationBlocks; if more than two striped internal blocks are missing 
and one is reconstructed first, there may be some unnecessary reconstruction. 
I think we can make a simple improvement for striped blocks: if there is one in 
PendingReplicationBlocks, we don't schedule new reconstruction work, instead of 
comparing against the number of missing striped internal blocks. 
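
The difference between the count-based check and the proposed check can be 
sketched roughly as follows. This is only an illustration: the class and 
method names are hypothetical and are not taken from BlockManager or 
PendingReplicationBlocks, and the real bookkeeping is more involved.

```java
// Hypothetical sketch only: none of these names come from the HDFS code base.
final class PendingCheckSketch {

    /** Contiguous block: schedule work only if live + pending replicas fall short. */
    static boolean needsWorkContiguous(int expected, int live, int pending) {
        return live + pending < expected;
    }

    /** Striped block, count-based check as described for the current behaviour. */
    static boolean needsWorkStripedByCount(int missingInternal, int pendingForGroup) {
        return missingInternal > 0 && pendingForGroup < missingInternal;
    }

    /** Striped block, proposed check: any pending entry for the group suppresses new work. */
    static boolean needsWorkStripedProposed(int missingInternal, int pendingForGroup) {
        return missingInternal > 0 && pendingForGroup == 0;
    }

    public static void main(String[] args) {
        // Three internal blocks missing, one reconstruction already pending.
        System.out.println(needsWorkStripedByCount(3, 1));   // count-based check
        System.out.println(needsWorkStripedProposed(3, 1));  // proposed check
    }
}
```

With three internal blocks missing and one entry already pending, the 
count-based check still asks for more work, while the proposed check waits 
for the pending reconstruction to finish.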

> Fix TestRecoverStripedFile
> --
>
> Key: HDFS-9275
> URL: https://issues.apache.org/jira/browse/HDFS-9275
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Walter Su
>Assignee: Walter Su
> Attachments: HDFS-9275.01.patch, HDFS-9275.02.patch, 
> HDFS-9275.03.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile

2015-10-27 Thread Kai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976289#comment-14976289
 ] 

Kai Zheng commented on HDFS-9275:
-

Thanks Walter for the JIRA and for working on this. Would you add a description 
of the exact issue or cause that this JIRA is trying to address? It would help 
others understand the issue quickly. Thanks.



[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile

2015-10-22 Thread Walter Su (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968820#comment-14968820
 ] 

Walter Su commented on HDFS-9275:
-

Thanks [~hitliuyi] for the review.
bq. please add some description in the JIRA description part about what issue 
in the Test you want to fix? It's better to have Jenkins failure link if there 
is.
https://builds.apache.org/job/PreCommit-HDFS-Build/13110/testReport/org.apache.hadoop.hdfs/TestRecoverStripedFile/testRecoverAnyBlocks1/

Unfortunately, the logs are of no use. We know some blocks didn't get recovered, 
but we can't tell why from the logs because they are truncated.
It took me a while to reproduce locally. I can upload the logs (488 KB) if you 
want. BTW, I'd appreciate it if someone could tell me where I can download the 
full log of the failed Jenkins run.




[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile

2015-10-22 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968797#comment-14968797
 ] 

Yi Liu commented on HDFS-9275:
--

Walter, please add a description to the JIRA description field of the issue in 
the test that you want to fix. It's better to include a Jenkins failure link if 
there is one.

I see you replaced {{setDataNodesDead}} and {{readReplica}} with existing 
methods from other test utilities; that's good.

In the tests, I originally restarted the datanodes at the end of each test 
because I intended to share the same mini cluster to make the tests faster, but 
I forgot to use {{@BeforeClass}} instead of {{@Before}}, so the cluster is 
restarted for each test anyway. I am OK with just removing the restart of the 
shut-down DNs at the end.



[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile

2015-10-22 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970093#comment-14970093
 ] 

Yi Liu commented on HDFS-9275:
--

Thanks Walter for the update. I'm OOO today, so I will review it tomorrow if 
nobody else has reviewed it by then.



[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile

2015-10-22 Thread Walter Su (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969341#comment-14969341
 ] 

Walter Su commented on HDFS-9275:
-

I kept digging, and now I understand the whole sequence:

# While the client is writing blockGroup_0, DN1 sends a heartbeat with 
xceiverCount=3.
# The client finishes writing blockGroup_0 and blockGroup_1.
# DN8~10 are shut down, so idx_6~8 of blockGroup_1 are missing.
# ReplicationMonitor schedules the 1st recovery for blockGroup_1. Because DN1 
is busy (see previous comments), BlockPlacementPolicy chooses DN0 and DN11 as 
targets.
# ErasureCodingWorker recovers idx_6 at DN0 and idx_7 at DN11. (See 
getTargetIndices() to understand why.)
# Before idx_6 and idx_7 are reported, ReplicationMonitor schedules a 2nd 
recovery for blockGroup_1. It chooses DN0 as the target.
# ErasureCodingWorker tries to recover idx_6 at DN0, which fails because DN0 
complains that the replica already exists.

A delayed heartbeat is the direct cause of the failed tests. The root cause is 
not in the test code; it is in the handling of 2 concurrent EC recovery tasks:
# Defect in ReplicationMonitor: it shouldn't choose the same DataNode as a 
target twice for the same block.
# Defect in ErasureCodingWorker: it doesn't know which internal blocks are 
being recovered or have already been recovered. It purely guesses from the 
live nodes.
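
One way to picture a fix for both defects is a small tracker that remembers 
which internal block indices are in flight and on which targets. The sketch 
below is hypothetical: the class name, the method names, and the string 
DataNode IDs are all assumptions for illustration, not the actual 
ReplicationMonitor or ErasureCodingWorker code, and it is not what the 
attached patches do.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: tracks in-flight reconstructions for one block group.
final class EcRecoveryTrackerSketch {
    // internal block index -> target DataNode of an unreported reconstruction
    private final Map<Integer, String> inFlight = new HashMap<>();

    /**
     * Assigns targets to missing indices, skipping indices that are already
     * being recovered and DataNodes that are already in-flight targets.
     */
    Map<Integer, String> schedule(List<Integer> missing, List<String> candidates) {
        Deque<String> free = new ArrayDeque<>();
        for (String dn : candidates) {
            if (!inFlight.containsValue(dn)) {
                free.add(dn);
            }
        }
        Map<Integer, String> assigned = new LinkedHashMap<>();
        for (int idx : missing) {
            if (inFlight.containsKey(idx)) {
                continue; // already being recovered
            }
            String target = free.poll();
            if (target == null) {
                break; // no usable target left in this round
            }
            inFlight.put(idx, target);
            assigned.put(idx, target);
        }
        return assigned;
    }

    /** Called once the recovered internal block has been reported. */
    void reported(int idx) {
        inFlight.remove(idx);
    }
}
```

Replaying the steps above: the first round with candidates DN0 and DN11 
assigns idx_6 and idx_7; a second round before the block reports arrive 
assigns nothing instead of sending idx_6 to DN0 again.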



[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile

2015-10-22 Thread Walter Su (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969353#comment-14969353
 ] 

Walter Su commented on HDFS-9275:
-

Canceled the 02 patch and uploaded the 03 patch. It should fix this. This is 
just an initial patch.



[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile

2015-10-21 Thread Walter Su (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968588#comment-14968588
 ] 

Walter Su commented on HDFS-9275:
-

||DN0||DN1||DN2||DN3||DN4||DN5||DN6||DN7||DN8||DN9||DN10||DN11
| |*|*|*|*|*|*|*|*|*| | |   <-- BlockGroup_0
| | |*|*|*|*|*|*|*|*|*| |   <-- BlockGroup_1

The test case only tests the last block group. Suppose DN8~10 are shut down. 
ReplicationMonitor will schedule a recovery. It first needs to call 
BlockPlacementPolicy to choose targets. DN2~DN10 are excluded because they 
already have internal blocks on them. To recover 3 blocks, it must choose DN0, 
DN1, and DN11.

But DN1 has a block belonging to BlockGroup_0. The last time DN1 sent a 
heartbeat, it reported an {{xceiverCount}} of 3. {{xceiverCount}} equals the 
number of active threads in DataNode.threadGroup, as shown below.

{noformat}
DatanodeRegistration(127.0.0.1:47705, 
datanodeUuid=43e5be32-2066-4057-9b25-8544d2d542bc, infoPort=43445, 
infoSecurePort=0, ipcPort=34036, 
storageInfo=lv=-56;cid=testClusterID;nsid=23260287;c=1445489667626)
java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]

Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@6aa03871,5,dataXceiverServer]
Thread[DataXceiver for client DFSClient_NONMAPREDUCE_-1867405584_1 at 
/127.0.0.1:56717 [Receiving block 
BP-1612020377-9.96.1.34-1445489667626:blk_-9223372036854775791_1001],5,dataXceiverServer]
Thread[PacketResponder: 
BP-1612020377-9.96.1.34-1445489667626:blk_-9223372036854775791_1001, 
type=LAST_IN_PIPELINE, downstreams=0:[],5,dataXceiverServer]
{noformat}
An {{xceiverCount}} of 3 is larger than the average, so DN1 is excluded 
by {{chooseRandom()}}. BlockGroup_1 can then recover only 2 blocks. As 
discussed 
[here|https://issues.apache.org/jira/browse/HDFS-8220?focusedCommentId=14518931=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14518931]
 , PlacementPolicy temporarily doesn't support returning two identical 
storages, i.e. no 2 replicas (internal blocks) on the same storage. 

We could simply add more DNs to fix the test, or we could set 
{{DFS_NAMENODE_REPLICATION_CONSIDERLOAD_KEY}} to false in the test case.
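
The load check that excludes DN1 can be sketched as below. This is a 
simplified model, not the placement policy code: the class and method names 
are hypothetical, the factor of 2.0 over the cluster average is an assumption 
about how the policy compares load, and the per-node counts (1 for an idle 
DataNode, 3 for DN1) are illustrative.

```java
// Hypothetical sketch of a "consider load" check; not the HDFS implementation.
final class ConsiderLoadSketch {
    // Assumed multiplier over the average xceiverCount.
    static final double LOAD_FACTOR = 2.0;

    /** Average xceiverCount across all DataNodes; 0.0 for an empty cluster. */
    static double average(int[] counts) {
        if (counts.length == 0) {
            return 0.0;
        }
        long sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return (double) sum / counts.length;
    }

    /** A node is skipped only when load is considered and it exceeds the threshold. */
    static boolean isExcluded(int xceiverCount, double avg, boolean considerLoad) {
        return considerLoad && xceiverCount > LOAD_FACTOR * avg;
    }

    public static void main(String[] args) {
        int[] counts = new int[12];
        java.util.Arrays.fill(counts, 1); // idle DataNodes (illustrative value)
        counts[1] = 3;                    // DN1's stale heartbeat value
        double avg = average(counts);
        System.out.println(isExcluded(counts[1], avg, true));  // load considered
        System.out.println(isExcluded(counts[1], avg, false)); // load ignored
    }
}
```

Under these assumed numbers DN1 is skipped while load is considered and 
becomes eligible once the check is turned off, which is the effect the 
config change in the test case relies on.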

The 02 patch includes some cleanup. Kindly review. Thanks.



[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile

2015-10-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968587#comment-14968587
 ] 

Hadoop QA commented on HDFS-9275:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |   8m  5s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to include 7 new or modified test files. |
| {color:green}+1{color} | javac |   8m  3s | There were no new javac warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m 24s | There were no new checkstyle issues. |
| {color:red}-1{color} | whitespace |   0m  1s | The patch has 1  line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 28s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   2m 32s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | native |   1m  6s | Pre-build of native portion |
| {color:green}+1{color} | hdfs tests |  50m 37s | Tests passed in hadoop-hdfs. |
| | |  74m 19s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12767952/HDFS-9275.02.patch |
| Optional Tests | javac unit findbugs checkstyle |
| git revision | trunk / 381610d |
| Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13121/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html |
| whitespace | https://builds.apache.org/job/PreCommit-HDFS-Build/13121/artifact/patchprocess/whitespace.txt |
| hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13121/artifact/patchprocess/testrun_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13121/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf900.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13121/console |


This message was automatically generated.
