[jira] [Commented] (HDFS-15576) Erasure Coding: Add rs and rs-legacy codec test for addPolicies

2020-09-15 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196006#comment-17196006
 ] 

Fei Hui commented on HDFS-15576:


Changed the summary and uploaded the v002 patch.


> Erasure Coding: Add rs and rs-legacy codec test for addPolicies
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15576.001.patch, HDFS-15576.002.patch
>
>
> * Add rs and rs-legacy codec test for  TestErasureCodingCLI
> * Add comments for failed test RS
> * Modify UT, change "RS" to "rs", because "RS" is not supported 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15576) Erasure Coding: Add rs and rs-legacy codec test for addPolicies

2020-09-15 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15576:
---
Description: 
* Add rs and rs-legacy codec test for  TestErasureCodingCLI
* Add comments for failed test RS
* Modify UT, change "RS" to "rs", because "RS" is not supported 


  was:
* Add rs and rs-legacy codec test for  TestErasureCodingCLI
* Add comments for failed test 
* Modify UT, change "RS" to "rs", because "RS" is not supported 



> Erasure Coding: Add rs and rs-legacy codec test for addPolicies
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15576.001.patch, HDFS-15576.002.patch
>
>
> * Add rs and rs-legacy codec test for  TestErasureCodingCLI
> * Add comments for failed test RS
> * Modify UT, change "RS" to "rs", because "RS" is not supported 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15576) Erasure Coding: Add rs and rs-legacy codec test for addPolicies

2020-09-15 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15576:
---
Attachment: HDFS-15576.002.patch

> Erasure Coding: Add rs and rs-legacy codec test for addPolicies
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15576.001.patch, HDFS-15576.002.patch
>
>
> * Add rs and rs-legacy codec test for  TestErasureCodingCLI
> * Add comments for failed test RS
> * Modify UT, change "RS" to "rs", because "RS" is not supported 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15576) Erasure Coding: Add rs and rs-legacy codec test for addPolicies

2020-09-15 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15576:
---
Description: 
* Add rs and rs-legacy codec test for  TestErasureCodingCLI
* Add comments for failed test 
* Modify UT, change "RS" to "rs", because "RS" is not supported 


  was:
* Add UT TestECAdmin#testAddPolicies
* Modify UT, change "RS" to "rs", because "RS" is not supported 



> Erasure Coding: Add rs and rs-legacy codec test for addPolicies
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15576.001.patch
>
>
> * Add rs and rs-legacy codec test for  TestErasureCodingCLI
> * Add comments for failed test 
> * Modify UT, change "RS" to "rs", because "RS" is not supported 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15576) Erasure Coding: Add rs and rs-legacy codec test for addPolicies

2020-09-15 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15576:
---
Summary: Erasure Coding: Add rs and rs-legacy codec test for addPolicies  
(was: Erasure Coding: Add rs & rs-legacy codec test for addPolicies)

> Erasure Coding: Add rs and rs-legacy codec test for addPolicies
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15576.001.patch
>
>
> * Add UT TestECAdmin#testAddPolicies
> * Modify UT, change "RS" to "rs", because "RS" is not supported 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15576) Erasure Coding: Add rs & rs-legacy codec test for addPolicies

2020-09-15 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15576:
---
Summary: Erasure Coding: Add rs & rs-legacy codec test for addPolicies  
(was: Erasure Coding: Add test addPolicies to ECAdmin)

> Erasure Coding: Add rs & rs-legacy codec test for addPolicies
> -
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15576.001.patch
>
>
> * Add UT TestECAdmin#testAddPolicies
> * Modify UT, change "RS" to "rs", because "RS" is not supported 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15576) Erasure Coding: Add test addPolicies to ECAdmin

2020-09-15 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195973#comment-17195973
 ] 

Fei Hui commented on HDFS-15576:


[~tasanuma] Thanks a lot, got it!
I will try it with your suggestions.
In addition, I want to add comments for the code below from test_ec_policies.xml; 
it is just for the failed test, because I mistakenly referenced it :(
{quote}
RS
{quote}
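
For context, a minimal sketch (not part of the patch) of why the codec name matters: Hadoop's built-in erasure codecs are registered under lower-case names, so a schema that says "rs" or "rs-legacy" is valid while "RS" is not. The 6/3 layout and the 1 MB cell size below are only example values.
{code:java}
import org.apache.hadoop.hdfs.protocol.ErasureCodingPolicy;
import org.apache.hadoop.io.erasurecode.ECSchema;

public class RsCodecNameExample {
  public static void main(String[] args) {
    // "rs" and "rs-legacy" match the registered codec names; "RS" does not.
    ECSchema rs = new ECSchema("rs", 6, 3);
    ECSchema rsLegacy = new ECSchema("rs-legacy", 6, 3);
    ErasureCodingPolicy policy = new ErasureCodingPolicy(rs, 1024 * 1024);
    System.out.println(policy + " / " + rsLegacy);
  }
}
{code}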

> Erasure Coding: Add test addPolicies to ECAdmin
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15576.001.patch
>
>
> * Add UT TestECAdmin#testAddPolicies
> * Modify UT, change "RS" to "rs", because "RS" is not supported 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15576) Erasure Coding: Add test addPolicies to ECAdmin

2020-09-15 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195906#comment-17195906
 ] 

Fei Hui commented on HDFS-15576:


The failed tests are unrelated.
[~hexiaoqiao][~weichiu] [~aajisaka] Could you please take a look? Thanks

> Erasure Coding: Add test addPolicies to ECAdmin
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15576.001.patch
>
>
> * Add UT TestECAdmin#testAddPolicies
> * Modify UT, change "RS" to "rs", because "RS" is not supported 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15576) Erasure Coding: Add test addPolicies to ECAdmin

2020-09-14 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15576:
---
Summary: Erasure Coding: Add test addPolicies to ECAdmin  (was: Add test 
addPolicies to ECAdmin)

> Erasure Coding: Add test addPolicies to ECAdmin
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15576.001.patch
>
>
> * Add UT TestECAdmin#testAddPolicies
> * Modify UT, change "RS" to "rs", because "RS" is not supported 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15576) Add test addPolicies to ECAdmin

2020-09-14 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15576:
---
Attachment: HDFS-15576.001.patch

> Add test addPolicies to ECAdmin
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15576.001.patch
>
>
> * Add UT TestECAdmin#testAddPolicies
> * Modify UT, change "RS" to "rs", because "RS" is not supported 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15576) Add test addPolicies to ECAdmin

2020-09-14 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15576:
---
Status: Patch Available  (was: Open)

> Add test addPolicies to ECAdmin
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15576.001.patch
>
>
> * Add UT TestECAdmin#testAddPolicies
> * Modify UT, change "RS" to "rs", because "RS" is not supported 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15576) Add test addPolicies to ECAdmin

2020-09-14 Thread Fei Hui (Jira)
Fei Hui created HDFS-15576:
--

 Summary: Add test addPolicies to ECAdmin
 Key: HDFS-15576
 URL: https://issues.apache.org/jira/browse/HDFS-15576
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Fei Hui
Assignee: Fei Hui


* Add UT TestECAdmin#testAddPolicies
* Modify UT, change "RS" to "rs", because "RS" is not supported 




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15564) Add Test annotation for TestPersistBlocks#testRestartDfsWithSync

2020-09-11 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194233#comment-17194233
 ] 

Fei Hui commented on HDFS-15564:


[~hexiaoqiao] Could you please help to commit it? Thanks

> Add Test annotation for TestPersistBlocks#testRestartDfsWithSync
> 
>
> Key: HDFS-15564
> URL: https://issues.apache.org/jira/browse/HDFS-15564
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: hdfs
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15564.001.patch
>
>
> Add Test annotation for TestPersistBlocks#testRestartDfsWithSync,  otherwise 
> it’s dead code



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13293) RBF: The RouterRPCServer should transfer CallerContext and client ip to NamenodeRpcServer

2020-09-10 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17193397#comment-17193397
 ] 

Fei Hui commented on HDFS-13293:


[~aajisaka][~elgoiri] I think we may need to do three things:
* File a new jira to extend CallerContext in hadoop-common so that it can contain 
many key-value pairs.
* Add the real client ip to the caller context in this jira (a rough sketch is below). 
*hadoop.caller.context.enabled* is already used by the audit log; should we add a new 
parameter?
* File a new jira to fix the way Yarn uses CallerContext (add key-value pairs to the 
context).
What do you think?
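
A rough, illustrative sketch of adding the real client ip to the caller context; the "clientIp:" prefix is just an example encoding, not an agreed format, and the key-value extension itself is only the proposal above:
{code:java}
import org.apache.hadoop.ipc.CallerContext;
import org.apache.hadoop.ipc.Server;

public class RouterCallerContextSketch {
  // Called inside a Router RPC handler (sketch only): record the real client
  // ip in the caller context that is forwarded to the NamenodeRpcServer.
  static void setRouterCallerContext() {
    String clientIp = Server.getRemoteAddress();
    CallerContext context =
        new CallerContext.Builder("clientIp:" + clientIp).build();
    CallerContext.setCurrent(context);
  }
}
{code}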

> RBF: The RouterRPCServer should transfer CallerContext and client ip to 
> NamenodeRpcServer
> -
>
> Key: HDFS-13293
> URL: https://issues.apache.org/jira/browse/HDFS-13293
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: maobaolong
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-13293.001.patch
>
>
> Otherwise, the namenode doesn't know the client's callerContext



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-09 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17193291#comment-17193291
 ] 

Fei Hui commented on HDFS-15556:


[~haiyang Hu] It's the same as HDFS-14042. Should we resolve this issue as a 
duplicate?

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode throws an NPE when processing lifeline messages 
> sent by the DataNode, which leads to an incorrect maxLoad being calculated by the NN.
> Because the DataNode is then identified as busy and no available nodes can be 
> allocated when choosing DataNodes, the program loops, which results in high CPU and 
> reduces the processing performance of the cluster.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}
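
One possible guard for the NPE shown above; this is only a sketch based on the quoted snippet, not the attached patch, and how an unknown storage should really be handled is an open question:
{code:java}
// Sketch of DatanodeDescriptor#updateStorageStats with a null check
// (storageMap, checkFailedStorages and failedStorageInfos as in the
// snippet quoted above).
for (StorageReport report : reports) {
  DatanodeStorageInfo storage;
  synchronized (storageMap) {
    storage = storageMap.get(report.getStorage().getStorageID());
  }
  if (storage == null) {
    // The storage is not registered yet (e.g. a lifeline raced with
    // registration); skip the report instead of throwing an NPE.
    continue;
  }
  if (checkFailedStorages) {
    failedStorageInfos.remove(storage);
  }
  storage.receivedHeartbeat(report);
}
{code}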



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15564) Add Test annotation for TestPersistBlocks#testRestartDfsWithSync

2020-09-09 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15564:
---
Status: Patch Available  (was: Open)

> Add Test annotation for TestPersistBlocks#testRestartDfsWithSync
> 
>
> Key: HDFS-15564
> URL: https://issues.apache.org/jira/browse/HDFS-15564
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: hdfs
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15564.001.patch
>
>
> Add Test annotation for TestPersistBlocks#testRestartDfsWithSync,  otherwise 
> it’s dead code



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15564) Add Test annotation for TestPersistBlocks#testRestartDfsWithSync

2020-09-09 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15564:
---
Attachment: HDFS-15564.001.patch

> Add Test annotation for TestPersistBlocks#testRestartDfsWithSync
> 
>
> Key: HDFS-15564
> URL: https://issues.apache.org/jira/browse/HDFS-15564
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: hdfs
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15564.001.patch
>
>
> Add Test annotation for TestPersistBlocks#testRestartDfsWithSync,  otherwise 
> it’s dead code



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15564) Add Test annotation for TestPersistBlocks#testRestartDfsWithSync

2020-09-09 Thread Fei Hui (Jira)
Fei Hui created HDFS-15564:
--

 Summary: Add Test annotation for 
TestPersistBlocks#testRestartDfsWithSync
 Key: HDFS-15564
 URL: https://issues.apache.org/jira/browse/HDFS-15564
 Project: Hadoop HDFS
  Issue Type: Test
  Components: hdfs
Affects Versions: 3.3.0
Reporter: Fei Hui
Assignee: Fei Hui


Add Test annotation for TestPersistBlocks#testRestartDfsWithSync,  otherwise 
it’s dead code
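
For illustration, a minimal sketch of the change; the real test body lives in TestPersistBlocks and is omitted here:
{code:java}
import org.junit.Test;

public class TestPersistBlocksSketch {
  // Without @Test, JUnit never runs the method, so it is effectively dead
  // code; adding the annotation puts it back into the test suite.
  @Test
  public void testRestartDfsWithSync() throws Exception {
    // ... restart DFS with sync, as in the real TestPersistBlocks ...
  }
}
{code}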




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13293) RBF: The RouterRPCServer should transfer CallerContext and client ip to NamenodeRpcServer

2020-09-09 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192646#comment-17192646
 ] 

Fei Hui commented on HDFS-13293:


[~aajisaka][~elgoiri] Thanks for bringing this up again.
If we are agreed on this, I will rebase the patch.

> RBF: The RouterRPCServer should transfer CallerContext and client ip to 
> NamenodeRpcServer
> -
>
> Key: HDFS-13293
> URL: https://issues.apache.org/jira/browse/HDFS-13293
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: maobaolong
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-13293.001.patch
>
>
> Otherwise, the namenode doesn't know the client's callerContext



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14351) RBF: Optimize configuration item resolving for monitor namenode

2020-09-03 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189992#comment-17189992
 ] 

Fei Hui commented on HDFS-14351:


It might be helpful to backport it to the other 3.x branches. Thanks

> RBF: Optimize configuration item resolving for monitor namenode
> ---
>
> Key: HDFS-14351
> URL: https://issues.apache.org/jira/browse/HDFS-14351
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, HDFS-13891
>
> Attachments: HDFS-14351-HDFS-13891.001.patch, 
> HDFS-14351-HDFS-13891.002.patch, HDFS-14351-HDFS-13891.003.patch, 
> HDFS-14351-HDFS-13891.004.patch, HDFS-14351-HDFS-13891.005.patch, 
> HDFS-14351-HDFS-13891.006.patch, HDFS-14351.001.patch, HDFS-14351.002.patch
>
>
> We invoke {{configuration.get}} to resolve the configuration item 
> `dfs.federation.router.monitor.namenode` in `Router.java`, then split the 
> value by comma to get the nsid and nnid. This may confuse users, since the 
> parsing does not tolerate blank spaces while other common parameters do. The 
> following segment shows an example value that fails to resolve.
> {code:java}
>   <property>
>     <name>dfs.federation.router.monitor.namenode</name>
>     <value>nameservice1.nn1, nameservice1.nn2</value>
>     <description>
>       The identifier of the namenodes to monitor and heartbeat.
>     </description>
>   </property>
> {code}
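
An illustrative sketch of the idea (not necessarily the committed patch): resolving the value with Configuration#getTrimmedStrings tolerates the blank space after the comma.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class MonitorNamenodeParsingSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("dfs.federation.router.monitor.namenode",
        "nameservice1.nn1, nameservice1.nn2");
    // getTrimmedStrings splits on commas and trims surrounding whitespace.
    for (String monitored : conf.getTrimmedStrings(
        "dfs.federation.router.monitor.namenode")) {
      String[] parts = monitored.split("\\.");   // nsId.nnId
      System.out.println("nsId=" + parts[0] + ", nnId=" + parts[1]);
    }
  }
}
{code}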



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15540) Directories protected from delete can still be moved to the trash

2020-08-26 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185532#comment-17185532
 ] 

Fei Hui commented on HDFS-15540:


[~sodonnell] Good catch! It looks good!

> Directories protected from delete can still be moved to the trash
> -
>
> Key: HDFS-15540
> URL: https://issues.apache.org/jira/browse/HDFS-15540
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Attachments: HDFS-15540.001.patch
>
>
> With HDFS-8983, HDFS-14802 and HDFS-15243 we are able to list protected 
> directories which cannot be deleted or renamed, provided the following is set:
> fs.protected.directories: 
> dfs.protected.subdirectories.enable: true
> Testing this feature out, I can see it mostly works fine, but protected 
> non-empty folders can still be moved to the trash. In this example 
> /dir/protected is set in fs.protected.directories, and 
> dfs.protected.subdirectories.enable is true.
> {code}
> hadoop fs -ls -R /dir
> drwxr-xr-x - hdfs supergroup 0 2020-08-26 16:52 /dir/protected
> -rw-r--r-- 3 hdfs supergroup 174 2020-08-26 16:52 /dir/protected/file1
> drwxr-xr-x - hdfs supergroup 0 2020-08-26 16:52 /dir/protected/subdir1
> -rw-r--r-- 3 hdfs supergroup 174 2020-08-26 16:52 /dir/protected/subdir1/file1
> drwxr-xr-x - hdfs supergroup 0 2020-08-26 16:52 /dir/protected/subdir2
> -rw-r--r-- 3 hdfs supergroup 174 2020-08-26 16:52 /dir/protected/subdir2/file1
> [hdfs@7d67ed1af9b0 /]$ hadoop fs -rm -r -f -skipTrash /dir/protected/subdir1
> rm: Cannot delete/rename subdirectory under protected subdirectory 
> /dir/protected
> [hdfs@7d67ed1af9b0 /]$ hadoop fs -mv /dir/protected/subdir1 
> /dir/protected/subdir1-moved
> mv: Cannot delete/rename subdirectory under protected subdirectory 
> /dir/protected
> ** ALL GOOD SO FAR **
> [hdfs@7d67ed1af9b0 /]$ hadoop fs -rm -r -f /dir/protected/subdir1
> 2020-08-26 16:54:32,404 INFO fs.TrashPolicyDefault: Moved: 
> 'hdfs://nn1/dir/protected/subdir1' to trash at: 
> hdfs://nn1/user/hdfs/.Trash/Current/dir/protected/subdir1
> ** It moved the protected sub-dir to the trash, where it will be deleted **
> ** Checking the top level dir, it is the same **
> [hdfs@7d67ed1af9b0 /]$ hadoop fs -rm -r -f -skipTrash /dir/protected 
> rm: Cannot delete/rename non-empty protected directory /dir/protected
> [hdfs@7d67ed1af9b0 /]$ hadoop fs -mv /dir/protected /dir/protected-new
> mv: Cannot delete/rename non-empty protected directory /dir/protected
> [hdfs@7d67ed1af9b0 /]$ hadoop fs -rm -r -f /dir/protected 
> 2020-08-26 16:55:32,402 INFO fs.TrashPolicyDefault: Moved: 
> 'hdfs://nn1/dir/protected' to trash at: 
> hdfs://nn1/user/hdfs/.Trash/Current/dir/protected1598460932388
> {code}
> The reason for this seems to be that "move to trash" uses a different rename 
> method in FSNamesystem and FSDirRenameOp which avoids the 
> DFSUtil.checkProtectedDescendants(...) check added in the earlier Jiras.
> I believe that "move to trash" should be protected in the same way as a 
> -skipTrash delete.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues

2020-08-23 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-14852:
---
Attachment: HDFS-14852.007.patch

> Remove of LowRedundancyBlocks do NOT remove the block from all queues
> -
>
> Key: HDFS-14852
> URL: https://issues.apache.org/jira/browse/HDFS-14852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0, 3.0.3, 3.1.2, 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: CorruptBlocksMismatch.png, HDFS-14852.001.patch, 
> HDFS-14852.002.patch, HDFS-14852.003.patch, HDFS-14852.004.patch, 
> HDFS-14852.005.patch, HDFS-14852.006.patch, HDFS-14852.007.patch, 
> screenshot-1.png
>
>
> LowRedundancyBlocks.java
> {code:java}
> // Some comments here
> if(priLevel >= 0 && priLevel < LEVEL
> && priorityQueues.get(priLevel).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
>   + " from priority queue {}",
>   block, priLevel);
>   decrementBlockStat(block, priLevel, oldExpectedReplicas);
>   return true;
> } else {
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
>   for (int i = 0; i < LEVEL; i++) {
> if (i != priLevel && priorityQueues.get(i).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block" +
>   " {} from priority queue {}", block, i);
>   decrementBlockStat(block, i, oldExpectedReplicas);
>   return true;
> }
>   }
> }
> return false;
>   }
> {code}
> The source code is above; the comment says
> {quote}
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
> {quote}
> However, the "remove" function does NOT remove the block from all queues.
> The add function in LowRedundancyBlocks.java is called from several places, so 
> one block may end up in two or more queues.
> We found that corrupt blocks mismatch corrupt files on the NN web UI; maybe it 
> is related to this.
> Uploading an initial patch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues

2020-08-23 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182659#comment-17182659
 ] 

Fei Hui commented on HDFS-14852:


[~hexiaoqiao] Thanks for the review. I forgot to remove the original code; uploaded the v007 
patch.
When transitioning the standby namenode to active, we found corrupt blocks. After 
deleting the corrupt files, we still saw "There are 2 corrupt blocks". I 
think that if we delete a file, its blocks should not stay in any queue. I haven't dug into 
why one block was added to 2 queues, and it doesn't reproduce easily.
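
For the record, a sketch of the intent (an illustration only, not necessarily the v007 patch): sweep every priority queue so that a block accidentally queued at more than one level is removed everywhere.
{code:java}
// Sketch of LowRedundancyBlocks#remove: instead of returning after the first
// hit, check every priority queue (names as in the snippet in the description).
boolean removed = false;
for (int i = 0; i < LEVEL; i++) {
  if (priorityQueues.get(i).remove(block)) {
    NameNode.blockStateChangeLog.debug(
        "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
            + " from priority queue {}", block, i);
    decrementBlockStat(block, i, oldExpectedReplicas);
    removed = true;
  }
}
return removed;
{code}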

> Remove of LowRedundancyBlocks do NOT remove the block from all queues
> -
>
> Key: HDFS-14852
> URL: https://issues.apache.org/jira/browse/HDFS-14852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0, 3.0.3, 3.1.2, 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: CorruptBlocksMismatch.png, HDFS-14852.001.patch, 
> HDFS-14852.002.patch, HDFS-14852.003.patch, HDFS-14852.004.patch, 
> HDFS-14852.005.patch, HDFS-14852.006.patch, screenshot-1.png
>
>
> LowRedundancyBlocks.java
> {code:java}
> // Some comments here
> if(priLevel >= 0 && priLevel < LEVEL
> && priorityQueues.get(priLevel).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
>   + " from priority queue {}",
>   block, priLevel);
>   decrementBlockStat(block, priLevel, oldExpectedReplicas);
>   return true;
> } else {
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
>   for (int i = 0; i < LEVEL; i++) {
> if (i != priLevel && priorityQueues.get(i).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block" +
>   " {} from priority queue {}", block, i);
>   decrementBlockStat(block, i, oldExpectedReplicas);
>   return true;
> }
>   }
> }
> return false;
>   }
> {code}
> The source code is above; the comment says
> {quote}
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
> {quote}
> However, the "remove" function does NOT remove the block from all queues.
> The add function in LowRedundancyBlocks.java is called from several places, so 
> one block may end up in two or more queues.
> We found that corrupt blocks mismatch corrupt files on the NN web UI; maybe it 
> is related to this.
> Uploading an initial patch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues

2020-08-21 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181844#comment-17181844
 ] 

Fei Hui commented on HDFS-14852:


The failed tests are unrelated.

> Remove of LowRedundancyBlocks do NOT remove the block from all queues
> -
>
> Key: HDFS-14852
> URL: https://issues.apache.org/jira/browse/HDFS-14852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0, 3.0.3, 3.1.2, 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: CorruptBlocksMismatch.png, HDFS-14852.001.patch, 
> HDFS-14852.002.patch, HDFS-14852.003.patch, HDFS-14852.004.patch, 
> HDFS-14852.005.patch, HDFS-14852.006.patch, screenshot-1.png
>
>
> LowRedundancyBlocks.java
> {code:java}
> // Some comments here
> if(priLevel >= 0 && priLevel < LEVEL
> && priorityQueues.get(priLevel).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
>   + " from priority queue {}",
>   block, priLevel);
>   decrementBlockStat(block, priLevel, oldExpectedReplicas);
>   return true;
> } else {
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
>   for (int i = 0; i < LEVEL; i++) {
> if (i != priLevel && priorityQueues.get(i).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block" +
>   " {} from priority queue {}", block, i);
>   decrementBlockStat(block, i, oldExpectedReplicas);
>   return true;
> }
>   }
> }
> return false;
>   }
> {code}
> The source code is above; the comment says
> {quote}
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
> {quote}
> However, the "remove" function does NOT remove the block from all queues.
> The add function in LowRedundancyBlocks.java is called from several places, so 
> one block may end up in two or more queues.
> We found that corrupt blocks mismatch corrupt files on the NN web UI; maybe it 
> is related to this.
> Uploading an initial patch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues

2020-08-19 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180967#comment-17180967
 ] 

Fei Hui commented on HDFS-14852:


[~sodonnell] Uploaded the v006 patch with your suggestion. Please review.

> Remove of LowRedundancyBlocks do NOT remove the block from all queues
> -
>
> Key: HDFS-14852
> URL: https://issues.apache.org/jira/browse/HDFS-14852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0, 3.0.3, 3.1.2, 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: CorruptBlocksMismatch.png, HDFS-14852.001.patch, 
> HDFS-14852.002.patch, HDFS-14852.003.patch, HDFS-14852.004.patch, 
> HDFS-14852.005.patch, HDFS-14852.006.patch, screenshot-1.png
>
>
> LowRedundancyBlocks.java
> {code:java}
> // Some comments here
> if(priLevel >= 0 && priLevel < LEVEL
> && priorityQueues.get(priLevel).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
>   + " from priority queue {}",
>   block, priLevel);
>   decrementBlockStat(block, priLevel, oldExpectedReplicas);
>   return true;
> } else {
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
>   for (int i = 0; i < LEVEL; i++) {
> if (i != priLevel && priorityQueues.get(i).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block" +
>   " {} from priority queue {}", block, i);
>   decrementBlockStat(block, i, oldExpectedReplicas);
>   return true;
> }
>   }
> }
> return false;
>   }
> {code}
> The source code is above; the comment says
> {quote}
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
> {quote}
> However, the "remove" function does NOT remove the block from all queues.
> The add function in LowRedundancyBlocks.java is called from several places, so 
> one block may end up in two or more queues.
> We found that corrupt blocks mismatch corrupt files on the NN web UI; maybe it 
> is related to this.
> Uploading an initial patch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues

2020-08-19 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-14852:
---
Attachment: HDFS-14852.006.patch

> Remove of LowRedundancyBlocks do NOT remove the block from all queues
> -
>
> Key: HDFS-14852
> URL: https://issues.apache.org/jira/browse/HDFS-14852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0, 3.0.3, 3.1.2, 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: CorruptBlocksMismatch.png, HDFS-14852.001.patch, 
> HDFS-14852.002.patch, HDFS-14852.003.patch, HDFS-14852.004.patch, 
> HDFS-14852.005.patch, HDFS-14852.006.patch, screenshot-1.png
>
>
> LowRedundancyBlocks.java
> {code:java}
> // Some comments here
> if(priLevel >= 0 && priLevel < LEVEL
> && priorityQueues.get(priLevel).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
>   + " from priority queue {}",
>   block, priLevel);
>   decrementBlockStat(block, priLevel, oldExpectedReplicas);
>   return true;
> } else {
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
>   for (int i = 0; i < LEVEL; i++) {
> if (i != priLevel && priorityQueues.get(i).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block" +
>   " {} from priority queue {}", block, i);
>   decrementBlockStat(block, i, oldExpectedReplicas);
>   return true;
> }
>   }
> }
> return false;
>   }
> {code}
> The source code is above; the comment says
> {quote}
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
> {quote}
> However, the "remove" function does NOT remove the block from all queues.
> The add function in LowRedundancyBlocks.java is called from several places, so 
> one block may end up in two or more queues.
> We found that corrupt blocks mismatch corrupt files on the NN web UI; maybe it 
> is related to this.
> Uploading an initial patch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues

2020-08-19 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180485#comment-17180485
 ] 

Fei Hui commented on HDFS-14852:


[~sodonnell] [~kihwal] Can we move forward and fix this issue?

> Remove of LowRedundancyBlocks do NOT remove the block from all queues
> -
>
> Key: HDFS-14852
> URL: https://issues.apache.org/jira/browse/HDFS-14852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0, 3.0.3, 3.1.2, 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: CorruptBlocksMismatch.png, HDFS-14852.001.patch, 
> HDFS-14852.002.patch, HDFS-14852.003.patch, HDFS-14852.004.patch, 
> HDFS-14852.005.patch, screenshot-1.png
>
>
> LowRedundancyBlocks.java
> {code:java}
> // Some comments here
> if(priLevel >= 0 && priLevel < LEVEL
> && priorityQueues.get(priLevel).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
>   + " from priority queue {}",
>   block, priLevel);
>   decrementBlockStat(block, priLevel, oldExpectedReplicas);
>   return true;
> } else {
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
>   for (int i = 0; i < LEVEL; i++) {
> if (i != priLevel && priorityQueues.get(i).remove(block)) {
>   NameNode.blockStateChangeLog.debug(
>   "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block" +
>   " {} from priority queue {}", block, i);
>   decrementBlockStat(block, i, oldExpectedReplicas);
>   return true;
> }
>   }
> }
> return false;
>   }
> {code}
> The source code is above; the comment says
> {quote}
>   // Try to remove the block from all queues if the block was
>   // not found in the queue for the given priority level.
> {quote}
> However, the "remove" function does NOT remove the block from all queues.
> The add function in LowRedundancyBlocks.java is called from several places, so 
> one block may end up in two or more queues.
> We found that corrupt blocks mismatch corrupt files on the NN web UI; maybe it 
> is related to this.
> Uploading an initial patch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15422) Reported IBR is partially replaced with stored info when queuing.

2020-08-19 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180482#comment-17180482
 ] 

Fei Hui commented on HDFS-15422:


[~kihwal] Thanks for reporting and the fix. Can we push this fix to trunk?

> Reported IBR is partially replaced with stored info when queuing.
> -
>
> Key: HDFS-15422
> URL: https://issues.apache.org/jira/browse/HDFS-15422
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Kihwal Lee
>Priority: Critical
>
> When queueing an IBR (incremental block report) on a standby namenode, some 
> of the reported information is being replaced with the existing stored 
> information.  This can lead to false block corruption.
> We had a namenode that, after transitioning to active, started reporting missing 
> blocks with "SIZE_MISMATCH" as the corrupt reason. These were blocks that were 
> appended and the sizes were actually correct on the datanodes. Upon further 
> investigation, it was determined that the namenode was queueing IBRs with 
> altered information.
> Although it sounds bad, I am not making it a blocker 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error

2020-08-18 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179393#comment-17179393
 ] 

Fei Hui commented on HDFS-15240:


[~marvelrock] I failed to apply your patch on the trunk branch. Could you please rebase 
your patch on trunk?

> Erasure Coding: dirty buffer causes reconstruction block error
> --
>
> Key: HDFS-15240
> URL: https://issues.apache.org/jira/browse/HDFS-15240
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Reporter: HuangTao
>Assignee: HuangTao
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-15240.001.patch, HDFS-15240.002.patch, 
> HDFS-15240.003.patch, HDFS-15240.004.patch, HDFS-15240.005.patch, 
> image-2020-07-16-15-56-38-608.png
>
>
> When reading some lzo files we found that some blocks were broken.
> I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k) from 
> the DN directly, chose 6 blocks (b0-b5) to decode the other 3 (b6', b7', b8'), 
> and found the longest common subsequence (LCS) between b6' (decoded) and 
> b6 (read from the DN), and likewise for b7'/b7 and b8'/b8.
> After selecting 6 blocks of the block group in each combination and 
> iterating through all cases, I found one case where the length of the LCS is 
> the block length - 64KB; 64KB is exactly the length of the ByteBuffer used by 
> StripedBlockReader. So the corrupt reconstruction block was produced by a dirty 
> buffer.
> The following log snippet (only 2 of 28 cases shown) is my check program's 
> output. In my case, I knew the 3rd block was corrupt, so 5 other blocks were 
> needed to decode the other 3 blocks; I then found that the 1st block's LCS 
> substring is the block length - 64KB.
> It means the (0,1,2,4,5,6)th blocks were used to reconstruct the 3rd block, and the 
> dirty buffer was used before reading the 1st block.
> It must be noted that StripedBlockReader reads from offset 0 of the 1st block 
> after the dirty buffer was used.
> {code:java}
> decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8]
> Check Block(1) first 131072 bytes longest common substring length 4
> Check Block(6) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4
> decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8]
> Check Block(1) first 131072 bytes longest common substring length 65536
> CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 
> 27197440  # this one
> Check Block(7) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4{code}
> Now I know the dirty buffer causes the reconstruction block error, but how does 
> the dirty buffer come about?
> After digging into the code and the DN log, I found that the following DN log is the 
> root cause.
> {code:java}
> [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel 
> java.nio.channels.SocketChannel[connected local=/:52586 
> remote=/:50010]. 18 millis timeout left.
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped 
> block: BP-714356632--1519726836856:blk_-YY_3472979393
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834) {code}
> Reading from the DN may time out (held by a future F) and output the INFO log, but 
> the futures map that contains the future F has already been cleared, so 
> {code:java}
> return new StripingChunkReadResult(futures.remove(future),
> StripingChunkReadResult.CANCELLED); {code}
> futures.remove(future) causes the NPE and the EC reconstruction fails. In the 
> finally phase, the code snippet in *getStripedReader().close()* 
> {code:java}
> reconstructor.freeBuffer(reader.getReadBuffer());
> reader.freeReadBuffer();
> reader.closeBlockReader(); {code}
> frees the buffer first, but the StripedBlockReader 

[jira] [Updated] (HDFS-15514) Remove useless dfs.webhdfs.enabled

2020-08-06 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15514:
---
Affects Version/s: 3.0.3
   3.3.0
   3.2.1
   3.1.3

> Remove useless dfs.webhdfs.enabled
> --
>
> Key: HDFS-15514
> URL: https://issues.apache.org/jira/browse/HDFS-15514
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Affects Versions: 3.0.3, 3.3.0, 3.2.1, 3.1.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15514.001.patch
>
>
> After HDFS-7985 & HDFS-8349, "dfs.webhdfs.enabled" is useless. We should 
> remove it from the code base.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15514) Remove useless dfs.webhdfs.enabled

2020-08-06 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15514:
---
Priority: Minor  (was: Major)

> Remove useless dfs.webhdfs.enabled
> --
>
> Key: HDFS-15514
> URL: https://issues.apache.org/jira/browse/HDFS-15514
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15514.001.patch
>
>
> After HDFS-7985 & HDFS-8349, "dfs.webhdfs.enabled" is useless. We should 
> remove it from the code base.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15514) Remove useless dfs.webhdfs.enabled

2020-08-06 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172131#comment-17172131
 ] 

Fei Hui commented on HDFS-15514:


[~aajisaka][~ayushtkn] Could you please take a look? Thanks

> Remove useless dfs.webhdfs.enabled
> --
>
> Key: HDFS-15514
> URL: https://issues.apache.org/jira/browse/HDFS-15514
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15514.001.patch
>
>
> After HDFS-7985 & HDFS-8349, "dfs.webhdfs.enabled" is useless. We should 
> remove it from the code base.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15514) Remove useless dfs.webhdfs.enabled

2020-08-05 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15514:
---
Status: Patch Available  (was: Open)

> Remove useless dfs.webhdfs.enabled
> --
>
> Key: HDFS-15514
> URL: https://issues.apache.org/jira/browse/HDFS-15514
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15514.001.patch
>
>
> After HDFS-7985 & HDFS-8349, "dfs.webhdfs.enabled" is useless. We should 
> remove it from the code base.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15514) Remove useless dfs.webhdfs.enabled

2020-08-05 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15514:
---
Attachment: HDFS-15514.001.patch

> Remove useless dfs.webhdfs.enabled
> --
>
> Key: HDFS-15514
> URL: https://issues.apache.org/jira/browse/HDFS-15514
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15514.001.patch
>
>
> After HDFS-7985 & HDFS-8349, "dfs.webhdfs.enabled" is useless. We should 
> remove it from the code base.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15514) Remove useless dfs.webhdfs.enabled

2020-08-05 Thread Fei Hui (Jira)
Fei Hui created HDFS-15514:
--

 Summary: Remove useless dfs.webhdfs.enabled
 Key: HDFS-15514
 URL: https://issues.apache.org/jira/browse/HDFS-15514
 Project: Hadoop HDFS
  Issue Type: Test
  Components: test
Reporter: Fei Hui
Assignee: Fei Hui


After HDFS-7985 & HDFS-8349, "dfs.webhdfs.enabled" is useless. We should 
remove it from the code base.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2020-07-23 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163518#comment-17163518
 ] 

Fei Hui commented on HDFS-13596:


[~fengwu99] You are right! The failure is related to HDFS-8791.

> NN restart fails after RollingUpgrade from 2.x to 3.x
> -
>
> Key: HDFS-13596
> URL: https://issues.apache.org/jira/browse/HDFS-13596
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Hanisha Koneru
>Assignee: Fei Hui
>Priority: Blocker
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, 
> HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, 
> HDFS-13596.006.patch, HDFS-13596.007.patch, HDFS-13596.008.patch, 
> HDFS-13596.009.patch, HDFS-13596.010.patch
>
>
> After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails 
> while replaying edit logs.
>  * After NN is started with rollingUpgrade, the layoutVersion written to 
> editLogs (before finalizing the upgrade) is the pre-upgrade layout version 
> (so as to support downgrade).
>  * When writing transactions to log, NN writes as per the current layout 
> version. In 3.x, erasureCoding bits are added to the editLog transactions.
>  * So any edit log written after the upgrade and before finalizing the 
> upgrade will have the old layout version but the new format of transactions.
>  * When NN is restarted and the edit logs are replayed, the NN reads the old 
> layout version from the editLog file. When parsing the transactions, it 
> assumes that the transactions are also from the previous layout and hence 
> skips parsing the erasureCoding bits.
>  * This cascades into reading the wrong set of bits for other fields and 
> leads to NN shutting down.
> Sample error output:
> {code:java}
> java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected 
> length 16
>  at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86)
>  at 
> org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163)
>  at 
> org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
> 2018-05-17 19:10:06,522 WARN 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception 
> loading fsimage
> java.io.IOException: java.lang.IllegalStateException: Cannot skip to less 
> than the current value (=16389), where newValue=16388
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
>  at 
> 
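For illustration only, here is a tiny, self-contained sketch of the mechanism described in the bullet points of the description above. The field layout below is invented (it is not the real edit log format): a record is written with an extra erasure-coding byte, but a reader that trusts the stale layout version skips that byte and misreads every later field.

{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class LayoutMismatchSketch {
  public static void main(String[] args) throws IOException {
    // Writer uses the new (3.x-style) layout: an EC byte precedes the inode id.
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bos);
    out.writeByte(7);        // hypothetical erasure-coding policy id
    out.writeLong(16389L);   // hypothetical inode id
    out.flush();

    // Reader trusts the stale (pre-upgrade) layout version, so it does not
    // expect the EC byte and the following long is read one byte off.
    DataInputStream in = new DataInputStream(
        new ByteArrayInputStream(bos.toByteArray()));
    boolean layoutSupportsErasureCoding = false;
    if (layoutSupportsErasureCoding) {
      in.readByte();                   // skipped because of the stale version
    }
    long inodeId = in.readLong();      // garbage, shifted by one byte
    System.out.println("expected 16389, got " + inodeId);
  }
}
{code}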

[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2020-04-08 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078329#comment-17078329
 ] 

Fei Hui commented on HDFS-15079:


Uploaded a workaround patch; a tiny sketch of the idea follows below.
* It uses the RetryCache of the NN.
* A call may still fail when the namenode fails over during a network anomaly, but it will no longer overwrite the file.
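A minimal sketch of the idea behind the workaround, using hypothetical types rather than the real org.apache.hadoop.ipc.RetryCache: if the router forwards the original client's (clientId, callId) pair, a namenode-side cache can recognise a delayed duplicate create and return the cached result instead of executing it again with overwrite=true.

{code:java}
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RetryCacheSketch {
  // Cached results keyed by the original client's (clientId, callId).
  private final Map<String, String> cache = new ConcurrentHashMap<>();

  /** Returns the cached result for a duplicate call, or null if the call is new. */
  public String lookup(byte[] clientId, int callId) {
    return cache.get(key(clientId, callId));
  }

  /** Records the result once the call has been executed. */
  public void record(byte[] clientId, int callId, String result) {
    cache.put(key(clientId, callId), result);
  }

  private static String key(byte[] clientId, int callId) {
    return Arrays.toString(clientId) + "#" + callId;
  }
}
{code}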

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: HDFS-15079.001.patch, HDFS-15079.002.patch, 
> UnexpectedOverWriteUT.patch
>
>
>  I found a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client using RBF (routers r0, r1) creates an HDFS file via r0, gets an 
> exception, and fails over to r1; r0 has already sent the create RPC to the 
> namenode (1st create).
> The client then creates the HDFS file via r1 (2nd create).
> The client writes the HDFS file and finally closes it (3rd close).
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, the file that has just been written is 
> replaced by an empty file. This is a critical problem.
> We have encountered this problem: many Hive and Spark jobs run on our 
> cluster, and it occurs from time to time.
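To make the overwrite behaviour concrete, here is a small sketch against the public FileSystem API (the path and data below are made up): create(path) defaults to overwrite=true, which is why a delayed replay of the 1st create can truncate a file that has already been written and closed, while create(path, false) would fail instead of truncating.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OverwriteRaceSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/overwrite-race-demo");

    // 2nd create + write + 3rd close: the file now holds real data.
    try (FSDataOutputStream out = fs.create(p)) {
      out.writeBytes("real data");
    }

    // Delayed 1st create arriving afterwards: overwrite=true truncates it.
    fs.create(p).close();
    System.out.println("length after late create: "
        + fs.getFileStatus(p).getLen());   // prints 0

    // fs.create(p, false) would throw instead of truncating the file.
  }
}
{code}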






[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2020-04-08 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15079:
---
Attachment: HDFS-15079.002.patch

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: HDFS-15079.001.patch, HDFS-15079.002.patch, 
> UnexpectedOverWriteUT.patch
>
>
>  I found a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client using RBF (routers r0, r1) creates an HDFS file via r0, gets an 
> exception, and fails over to r1; r0 has already sent the create RPC to the 
> namenode (1st create).
> The client then creates the HDFS file via r1 (2nd create).
> The client writes the HDFS file and finally closes it (3rd close).
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, the file that has just been written is 
> replaced by an empty file. This is a critical problem.
> We have encountered this problem: many Hive and Spark jobs run on our 
> cluster, and it occurs from time to time.






[jira] [Updated] (HDFS-15084) RBF: Remove useless param nsId in RouterRpcClient#getConnection

2020-04-05 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15084:
---
Resolution: Won't Fix
Status: Resolved  (was: Patch Available)

> RBF: Remove useless param nsId in RouterRpcClient#getConnection
> ---
>
> Key: HDFS-15084
> URL: https://issues.apache.org/jira/browse/HDFS-15084
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Trivial
> Attachments: HDFS-15084.001.patch
>
>
> The param nsId in RouterRpcClient#getConnection is useless.
> Maybe we should remove it.






[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error

2020-03-26 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067682#comment-17067682
 ] 

Fei Hui commented on HDFS-15240:


[~marvelrock] Good catch! Thanks for reporting and fixing it.
Could you please add a UT?

> Erasure Coding: dirty buffer causes reconstruction block error
> --
>
> Key: HDFS-15240
> URL: https://issues.apache.org/jira/browse/HDFS-15240
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Reporter: HuangTao
>Assignee: HuangTao
>Priority: Major
> Attachments: HDFS-15240.001.patch
>
>
> When reading some lzo files we found that some blocks were broken.
> I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k) 
> directly from the DNs, chose 6 blocks (b0-b5) to decode the other 3 (b6', 
> b7', b8'), and computed the longest common subsequence (LCS) between b6' 
> (decoded) and b6 (read from the DN), and likewise for b7'/b7 and b8'/b8.
> After iterating through all combinations of 6 blocks from the block group, I 
> found one case where the LCS length is the block length minus 64KB; 64KB is 
> exactly the length of the ByteBuffer used by StripedBlockReader. So the 
> corrupt reconstructed block was produced from a dirty buffer.
> The following log snippet (showing only 2 of the 28 cases) is the output of 
> my check program. In my case I knew the 3rd block was corrupt, so 5 other 
> blocks are needed to decode another 3 blocks, and the 1st block's LCS length 
> turned out to be the block length minus 64KB.
> This means blocks (0,1,2,4,5,6) were used to reconstruct the 3rd block, and 
> the dirty buffer was used before the 1st block was read.
> Note that StripedBlockReader reads from offset 0 of the 1st block after the 
> dirty buffer was used.
> {code:java}
> decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8]
> Check Block(1) first 131072 bytes longest common substring length 4
> Check Block(6) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4
> decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8]
> Check Block(1) first 131072 bytes longest common substring length 65536
> CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length 
> 27197440  # this one
> Check Block(7) first 131072 bytes longest common substring length 4
> Check Block(8) first 131072 bytes longest common substring length 4{code}
> Now I know the dirty buffer causes the reconstruction block error, but where 
> does the dirty buffer come from?
> After digging into the code and the DN log, I found that the following DN 
> log entry is the root cause.
> {code:java}
> [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel 
> java.nio.channels.SocketChannel[connected local=/:52586 
> remote=/:50010]. 18 millis timeout left.
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped 
> block: BP-714356632--1519726836856:blk_-YY_3472979393
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834) {code}
> A read from a DN may time out (held by a future F) and emit the INFO log, 
> but the futures map that contains the future F has already been cleared, so 
> {code:java}
> return new StripingChunkReadResult(futures.remove(future),
> StripingChunkReadResult.CANCELLED); {code}
> futures.remove(future) causes the NPE and the EC reconstruction fails. Then, 
> in the finally phase, the code snippet in *getStripedReader().close()* 
> {code:java}
> reconstructor.freeBuffer(reader.getReadBuffer());
> reader.freeReadBuffer();
> reader.closeBlockReader(); {code}
> frees the buffer first, but the StripedBlockReader still holds the buffer 
> and writes into it.
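A hedged, self-contained sketch of the failure mode (the ReadResult class and the map below are stand-ins, not the actual StripingChunkReadResult or futures field): once the map has been cleared, remove() returns null, and unboxing that null into an int index is exactly the NPE shown in the DN log, so the stale future has to be skipped instead.

{code:java}
import java.util.HashMap;
import java.util.Map;

public class DirtyBufferNpeSketch {
  static final class ReadResult {
    final int index;
    final boolean cancelled;
    ReadResult(int index, boolean cancelled) {
      this.index = index;
      this.cancelled = cancelled;
    }
  }

  public static void main(String[] args) {
    Map<String, Integer> futures = new HashMap<>();
    futures.clear();                        // the timed-out read already cleared it
    String staleFuture = "future-F";

    // Unguarded: new ReadResult(futures.remove(staleFuture), true) would throw
    // a NullPointerException while unboxing. Guarded version:
    Integer index = futures.remove(staleFuture);
    if (index == null) {
      System.out.println("stale future, skip it instead of failing reconstruction");
    } else {
      System.out.println("chunk index " + new ReadResult(index, true).index);
    }
  }
}
{code}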




[jira] [Commented] (HDFS-15223) FSCK fails if one namenode is not available

2020-03-15 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059972#comment-17059972
 ] 

Fei Hui commented on HDFS-15223:


[~ayushtkn] Thanks for reporting and fixing.
+1

> FSCK fails if one namenode is not available
> ---
>
> Key: HDFS-15223
> URL: https://issues.apache.org/jira/browse/HDFS-15223
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Attachments: HDFS-15223-01.patch
>
>
> If one namenode is not available, FSCK should try the other namenode and 
> ignore the one that is not available.






[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-27 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046298#comment-17046298
 ] 

Fei Hui commented on HDFS-15186:


+1 for HDFS-15186.005.patch. Failed tests are unrelated.

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Critical
> Attachments: HDFS-15186.001.patch, HDFS-15186.002.patch, 
> HDFS-15186.003.patch, HDFS-15186.004.patch, HDFS-15186.005.patch
>
>
> I can find parity blocks whose content is all zeros when I decommission more 
> than one DataNode from a cluster, and the probability is quite high (parts 
> per thousand). This is a serious problem: if we read data from the all-zero 
> parity block, or use it to recover another block, we end up using corrupted 
> data without even knowing it.
> Some cases are listed below:
> B: busy DataNode,
> D: decommissioning DataNode,
> the others are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, where the block group indices are [0, 1, 2, 3, 4, 5, 
> 6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class 
> StripedReconstructionInfo) of length 2.
> A targets length of 2 means the DataNode needs to recover 2 internal blocks 
> in the current code. But from the liveIndices only 1 missing block can be 
> found, so the method StripedWriter#initTargetIndices uses 0 as the default 
> recovery index without checking whether index 0 is already in the source 
> indices.
> The EC algorithm is then used with source indices [0, 1, 2, 3, 4, 5] to 
> recover target indices [6, 0]. Index 0 appears in both the source indices 
> and the target indices in this case, and the returned target buffer for 
> index 6 is then always all zeros. One could argue this is the EC algorithm's 
> problem, because it should be more fault tolerant; I tried to fix it there, 
> but it is too hard because there are too many cases (the second group above 
> is another example: source indices [1, 2, 3, 4, 5, 7] recovering target 
> indices [0, 6, 0]). So I changed my mind and instead invoke the EC algorithm 
> with correct parameters, which means removing the duplicate target index 0 
> in this case. That is how I finally fixed it.
>  
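A rough sketch of the fix direction described above, using a hypothetical helper rather than the actual StripedWriter#initTargetIndices code: before invoking the EC decoder, drop any target index that is already a source index, so the decoder is never asked to recover [6, 0] from sources [0, 1, 2, 3, 4, 5].

{code:java}
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class TargetIndicesSketch {
  /** Keeps only target indices that are genuinely missing from the sources. */
  static int[] dedupTargets(int[] sources, int[] targets) {
    Set<Integer> sourceSet = new LinkedHashSet<>();
    for (int s : sources) {
      sourceSet.add(s);
    }
    Set<Integer> cleaned = new LinkedHashSet<>();
    for (int t : targets) {
      if (!sourceSet.contains(t)) {
        cleaned.add(t);
      }
    }
    return cleaned.stream().mapToInt(Integer::intValue).toArray();
  }

  public static void main(String[] args) {
    int[] sources = {0, 1, 2, 3, 4, 5};
    int[] targets = {6, 0};   // 0 sneaks in as the default target index
    System.out.println(Arrays.toString(dedupTargets(sources, targets)));  // [6]
  }
}
{code}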






[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-24 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044123#comment-17044123
 ] 

Fei Hui commented on HDFS-15186:


[~yaoguangdong] Thanks for your patch.
In HDFS-15186.002.patch the whole fix looks good. Minor comment:
{quote}
+//4. wait for decommissioning and not busy block to replicate
+Thread.sleep(3000);
{quote}
It would be better to use GenericTestUtils.waitFor here instead of a fixed sleep; a minimal sketch follows below.
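For reference, a self-contained sketch of the GenericTestUtils.waitFor pattern; the counter below is only a stand-in for whatever replication state the test would actually re-check.

{code:java}
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.hadoop.test.GenericTestUtils;

public class WaitForSketch {
  public static void main(String[] args) throws Exception {
    AtomicInteger replicatedBlocks = new AtomicInteger(0);

    // Stand-in for the DataNode doing the replication work in the background.
    new Thread(() -> {
      try {
        Thread.sleep(500);
      } catch (InterruptedException ignored) {
      }
      replicatedBlocks.set(2);
    }).start();

    // Poll every 100 ms, fail with a TimeoutException after 30 s.
    GenericTestUtils.waitFor(() -> replicatedBlocks.get() == 2, 100, 30000);
    System.out.println("condition reached without a fixed sleep");
  }
}
{code}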

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Critical
> Attachments: HDFS-15186.001.patch, HDFS-15186.002.patch
>
>
> I can find parity blocks whose content is all zeros when I decommission more 
> than one DataNode from a cluster, and the probability is quite high (parts 
> per thousand). This is a serious problem: if we read data from the all-zero 
> parity block, or use it to recover another block, we end up using corrupted 
> data without even knowing it.
> Some cases are listed below:
> B: busy DataNode,
> D: decommissioning DataNode,
> the others are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, where the block group indices are [0, 1, 2, 3, 4, 5, 
> 6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class 
> StripedReconstructionInfo) of length 2.
> A targets length of 2 means the DataNode needs to recover 2 internal blocks 
> in the current code. But from the liveIndices only 1 missing block can be 
> found, so the method StripedWriter#initTargetIndices uses 0 as the default 
> recovery index without checking whether index 0 is already in the source 
> indices.
> The EC algorithm is then used with source indices [0, 1, 2, 3, 4, 5] to 
> recover target indices [6, 0]. Index 0 appears in both the source indices 
> and the target indices in this case, and the returned target buffer for 
> index 6 is then always all zeros. One could argue this is the EC algorithm's 
> problem, because it should be more fault tolerant; I tried to fix it there, 
> but it is too hard because there are too many cases (the second group above 
> is another example: source indices [1, 2, 3, 4, 5, 7] recovering target 
> indices [0, 6, 0]). So I changed my mind and instead invoke the EC algorithm 
> with correct parameters, which means removing the duplicate target index 0 
> in this case. That is how I finally fixed it.
>  






[jira] [Comment Edited] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-22 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042780#comment-17042780
 ] 

Fei Hui edited comment on HDFS-15186 at 2/23/20 2:52 AM:
-

[~yaoguangdong] Thanks for reporting this. Good catch.
Sorry for the late reply, I couldn't receive emails these days.
+1 for [~ayushtkn]'s suggestions. I think index 6 is in neither liveIndices 
nor busyIndices, which causes this problem. Maybe we should fix it on the 
namenode side.


was (Author: ferhui):
[~yaoguangdong]Thanks for reporting this !Good Catch!
Sorry for late, I couldn't receive emails these days!
+1 for [~ayushtkn] suggestions. I thinks indice[6] is not in liveindcies and 
busyindices, this cause this problem. Maybe we should fix it in namenode side.

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Critical
> Attachments: HDFS-15186.001.patch
>
>
> I can find parity blocks whose content is all zeros when I decommission more 
> than one DataNode from a cluster, and the probability is quite high (parts 
> per thousand). This is a serious problem: if we read data from the all-zero 
> parity block, or use it to recover another block, we end up using corrupted 
> data without even knowing it.
> Some cases are listed below:
> B: busy DataNode,
> D: decommissioning DataNode,
> the others are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, where the block group indices are [0, 1, 2, 3, 4, 5, 
> 6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class 
> StripedReconstructionInfo) of length 2.
> A targets length of 2 means the DataNode needs to recover 2 internal blocks 
> in the current code. But from the liveIndices only 1 missing block can be 
> found, so the method StripedWriter#initTargetIndices uses 0 as the default 
> recovery index without checking whether index 0 is already in the source 
> indices.
> The EC algorithm is then used with source indices [0, 1, 2, 3, 4, 5] to 
> recover target indices [6, 0]. Index 0 appears in both the source indices 
> and the target indices in this case, and the returned target buffer for 
> index 6 is then always all zeros. One could argue this is the EC algorithm's 
> problem, because it should be more fault tolerant; I tried to fix it there, 
> but it is too hard because there are too many cases (the second group above 
> is another example: source indices [1, 2, 3, 4, 5, 7] recovering target 
> indices [0, 6, 0]). So I changed my mind and instead invoke the EC algorithm 
> with correct parameters, which means removing the duplicate target index 0 
> in this case. That is how I finally fixed it.
>  






[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-22 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042780#comment-17042780
 ] 

Fei Hui commented on HDFS-15186:


[~yaoguangdong] Thanks for reporting this! Good catch!
Sorry for the late reply, I couldn't receive emails these days!
+1 for [~ayushtkn]'s suggestions. I think index 6 is in neither liveIndices 
nor busyIndices, which causes this problem. Maybe we should fix it on the 
namenode side.

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Critical
> Attachments: HDFS-15186.001.patch
>
>
> I can find parity blocks whose content is all zeros when I decommission more 
> than one DataNode from a cluster, and the probability is quite high (parts 
> per thousand). This is a serious problem: if we read data from the all-zero 
> parity block, or use it to recover another block, we end up using corrupted 
> data without even knowing it.
> Some cases are listed below:
> B: busy DataNode,
> D: decommissioning DataNode,
> the others are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, where the block group indices are [0, 1, 2, 3, 4, 5, 
> 6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class 
> StripedReconstructionInfo) of length 2.
> A targets length of 2 means the DataNode needs to recover 2 internal blocks 
> in the current code. But from the liveIndices only 1 missing block can be 
> found, so the method StripedWriter#initTargetIndices uses 0 as the default 
> recovery index without checking whether index 0 is already in the source 
> indices.
> The EC algorithm is then used with source indices [0, 1, 2, 3, 4, 5] to 
> recover target indices [6, 0]. Index 0 appears in both the source indices 
> and the target indices in this case, and the returned target buffer for 
> index 6 is then always all zeros. One could argue this is the EC algorithm's 
> problem, because it should be more fault tolerant; I tried to fix it there, 
> but it is too hard because there are too many cases (the second group above 
> is another example: source indices [1, 2, 3, 4, 5, 7] recovering target 
> indices [0, 6, 0]). So I changed my mind and instead invoke the EC algorithm 
> with correct parameters, which means removing the duplicate target index 0 
> in this case. That is how I finally fixed it.
>  






[jira] [Commented] (HDFS-15092) TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed

2020-01-20 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019445#comment-17019445
 ] 

Fei Hui commented on HDFS-15092:


[~surendrasingh] [~elgoiri] Could you please take a look? Thanks.

> TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
> -
>
> Key: HDFS-15092
> URL: https://issues.apache.org/jira/browse/HDFS-15092
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15092.001.patch, HDFS-15092.002.patch
>
>
> TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
> {quote}
> java.lang.AssertionError: 
> Expected :5
> Actual   :4
>  
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestRedudantBlocks.testProcessOverReplicatedAndRedudantBlock(TestRedudantBlocks.java:138)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {quote}
> Maybe we should increase sleep time






[jira] [Commented] (HDFS-15092) TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed

2020-01-19 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019195#comment-17019195
 ] 

Fei Hui commented on HDFS-15092:


Sorry for the delay.
Uploaded the v002 patch.

> TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
> -
>
> Key: HDFS-15092
> URL: https://issues.apache.org/jira/browse/HDFS-15092
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15092.001.patch, HDFS-15092.002.patch
>
>
> TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
> {quote}
> java.lang.AssertionError: 
> Expected :5
> Actual   :4
>  
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestRedudantBlocks.testProcessOverReplicatedAndRedudantBlock(TestRedudantBlocks.java:138)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {quote}
> Maybe we should increase sleep time






[jira] [Updated] (HDFS-15092) TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed

2020-01-19 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15092:
---
Attachment: HDFS-15092.002.patch

> TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
> -
>
> Key: HDFS-15092
> URL: https://issues.apache.org/jira/browse/HDFS-15092
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15092.001.patch, HDFS-15092.002.patch
>
>
> TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
> {quote}
> java.lang.AssertionError: 
> Expected :5
> Actual   :4
>  
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestRedudantBlocks.testProcessOverReplicatedAndRedudantBlock(TestRedudantBlocks.java:138)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {quote}
> Maybe we should increase sleep time






[jira] [Commented] (HDFS-15084) RBF: Remove useless param nsId in RouterRpcClient#getConnection

2020-01-06 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008726#comment-17008726
 ] 

Fei Hui commented on HDFS-15084:


[~surendrasingh] Is there any progress on HDFS-13522?
The unused code is obvious in the IDE, which is not so good for developers :(

> RBF: Remove useless param nsId in RouterRpcClient#getConnection
> ---
>
> Key: HDFS-15084
> URL: https://issues.apache.org/jira/browse/HDFS-15084
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Trivial
> Attachments: HDFS-15084.001.patch
>
>
> The param nsId in RouterRpcClient#getConnection is useless.
> Maybe we should remove it.






[jira] [Updated] (HDFS-15092) TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed

2020-01-01 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15092:
---
Status: Patch Available  (was: Open)

> TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
> -
>
> Key: HDFS-15092
> URL: https://issues.apache.org/jira/browse/HDFS-15092
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15092.001.patch
>
>
> TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
> {quote}
> java.lang.AssertionError: 
> Expected :5
> Actual   :4
>  
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestRedudantBlocks.testProcessOverReplicatedAndRedudantBlock(TestRedudantBlocks.java:138)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {quote}
> Maybe we should increase sleep time






[jira] [Commented] (HDFS-15092) TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed

2020-01-01 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006532#comment-17006532
 ] 

Fei Hui commented on HDFS-15092:


[~surendrasingh] Could you please take a look? I see you added this UT.

> TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
> -
>
> Key: HDFS-15092
> URL: https://issues.apache.org/jira/browse/HDFS-15092
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15092.001.patch
>
>
> TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
> {quote}
> java.lang.AssertionError: 
> Expected :5
> Actual   :4
>  
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestRedudantBlocks.testProcessOverReplicatedAndRedudantBlock(TestRedudantBlocks.java:138)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {quote}
> Maybe we should increase sleep time






[jira] [Updated] (HDFS-15092) TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed

2020-01-01 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15092:
---
Attachment: HDFS-15092.001.patch

> TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
> -
>
> Key: HDFS-15092
> URL: https://issues.apache.org/jira/browse/HDFS-15092
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15092.001.patch
>
>
> TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
> {quote}
> java.lang.AssertionError: 
> Expected :5
> Actual   :4
>  
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestRedudantBlocks.testProcessOverReplicatedAndRedudantBlock(TestRedudantBlocks.java:138)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {quote}
> Maybe we should increase sleep time






[jira] [Created] (HDFS-15092) TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed

2020-01-01 Thread Fei Hui (Jira)
Fei Hui created HDFS-15092:
--

 Summary: 
TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
 Key: HDFS-15092
 URL: https://issues.apache.org/jira/browse/HDFS-15092
 Project: Hadoop HDFS
  Issue Type: Test
  Components: test
Affects Versions: 3.3.0
Reporter: Fei Hui
Assignee: Fei Hui


TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
{quote}
java.lang.AssertionError: 
Expected :5
Actual   :4
 


at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:834)
at org.junit.Assert.assertEquals(Assert.java:645)
at org.junit.Assert.assertEquals(Assert.java:631)
at 
org.apache.hadoop.hdfs.server.namenode.TestRedudantBlocks.testProcessOverReplicatedAndRedudantBlock(TestRedudantBlocks.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at 
com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at 
com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
{quote}

Maybe we should increase sleep time






[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-31 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15079:
---
Attachment: HDFS-15079.001.patch

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: HDFS-15079.001.patch, UnexpectedOverWriteUT.patch
>
>
>  I found a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client using RBF (routers r0, r1) creates an HDFS file via r0, gets an 
> exception, and fails over to r1; r0 has already sent the create RPC to the 
> namenode (1st create).
> The client then creates the HDFS file via r1 (2nd create).
> The client writes the HDFS file and finally closes it (3rd close).
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, the file that has just been written is 
> replaced by an empty file. This is a critical problem.
> We have encountered this problem: many Hive and Spark jobs run on our 
> cluster, and it occurs from time to time.






[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-31 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15079:
---
Attachment: (was: HDFS-15079.001.patch)

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: UnexpectedOverWriteUT.patch
>
>
>  I found a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client using RBF (routers r0, r1) creates an HDFS file via r0, gets an 
> exception, and fails over to r1; r0 has already sent the create RPC to the 
> namenode (1st create).
> The client then creates the HDFS file via r1 (2nd create).
> The client writes the HDFS file and finally closes it (3rd close).
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, the file that has just been written is 
> replaced by an empty file. This is a critical problem.
> We have encountered this problem: many Hive and Spark jobs run on our 
> cluster, and it occurs from time to time.






[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-31 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15079:
---
Attachment: HDFS-15079.001.patch

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: HDFS-15079.001.patch, UnexpectedOverWriteUT.patch
>
>
>  I found a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client using RBF (routers r0, r1) creates an HDFS file via r0, gets an 
> exception, and fails over to r1; r0 has already sent the create RPC to the 
> namenode (1st create).
> The client then creates the HDFS file via r1 (2nd create).
> The client writes the HDFS file and finally closes it (3rd close).
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, the file that has just been written is 
> replaced by an empty file. This is a critical problem.
> We have encountered this problem: many Hive and Spark jobs run on our 
> cluster, and it occurs from time to time.






[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-31 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15079:
---
Attachment: (was: HDFS-15079.001.patch)

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: UnexpectedOverWriteUT.patch
>
>
>  I found a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client using RBF (routers r0, r1) creates an HDFS file via r0, gets an 
> exception, and fails over to r1; r0 has already sent the create RPC to the 
> namenode (1st create).
> The client then creates the HDFS file via r1 (2nd create).
> The client writes the HDFS file and finally closes it (3rd close).
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, the file that has just been written is 
> replaced by an empty file. This is a critical problem.
> We have encountered this problem: many Hive and Spark jobs run on our 
> cluster, and it occurs from time to time.






[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-30 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15079:
---
Status: Patch Available  (was: Open)

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: HDFS-15079.001.patch, UnexpectedOverWriteUT.patch
>
>
>  I found a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client using RBF (routers r0, r1) creates an HDFS file via r0, gets an 
> exception, and fails over to r1; r0 has already sent the create RPC to the 
> namenode (1st create).
> The client then creates the HDFS file via r1 (2nd create).
> The client writes the HDFS file and finally closes it (3rd close).
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, the file that has just been written is 
> replaced by an empty file. This is a critical problem.
> We have encountered this problem: many Hive and Spark jobs run on our 
> cluster, and it occurs from time to time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-30 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005301#comment-17005301
 ] 

Fei Hui commented on HDFS-15079:


Uploaded a rough fix.

I think CallerContext may be suitable here. If we changed the router client's 
clientId and callId, it would cause problems for router client retry and 
failover. And we may add more fields to the callerContext later if needed.
Here are some open questions:
* How should we normalize the callerContext: JSON format or a plain string?
* Is checking the callId for the same clientId suitable? A delayed callId would 
be dropped.

[~ayushtkn] [~elgoiri] [~hexiaoqiao] Any thoughts?
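
As a strawman for the second question, here is a minimal sketch of the "drop 
delayed callIds" idea, assuming a hypothetical tracker on the NameNode side; 
the class and method names are illustrative only and are not part of the 
uploaded patch:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: remember the largest callId observed per clientId and
// reject calls that arrive with a smaller (i.e. delayed) callId.
public class StaleCallFilter {
  // clientId is a byte[] in Hadoop RPC; a String is used here for brevity
  private final Map<String, Integer> lastCallId = new ConcurrentHashMap<>();

  // Returns true if the call should be processed, false if it is stale.
  public boolean accept(String clientId, int callId) {
    // keep the maximum callId seen so far for this client
    int latest = lastCallId.merge(clientId, callId, Math::max);
    // a delayed call carries a callId smaller than one already accepted
    return callId >= latest;
  }
}
{code}

The obvious trade-off, as noted above, is that a legitimately delayed retry 
would be dropped as well.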

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: HDFS-15079.001.patch, UnexpectedOverWriteUT.patch
>
>
>  I find there is a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client with RBF(r0, r1) creates an HDFS file via r0, gets an exception, and 
> fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns a file that has already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem. There are many Hive and Spark jobs running 
> on our cluster, and it occurs from time to time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-30 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15079:
---
Attachment: HDFS-15079.001.patch

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: HDFS-15079.001.patch, UnexpectedOverWriteUT.patch
>
>
>  I find there is a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client with RBF(r0, r1) creates an HDFS file via r0, gets an exception, and 
> fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns a file that has already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem. There are many Hive and Spark jobs running 
> on our cluster, and it occurs from time to time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15084) RBF: Remove useless param nsId in RouterRpcClient#getConnection

2019-12-30 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005235#comment-17005235
 ] 

Fei Hui commented on HDFS-15084:


[~ayushtkn] [~weichiu] Could you please take a look? Thanks

> RBF: Remove useless param nsId in RouterRpcClient#getConnection
> ---
>
> Key: HDFS-15084
> URL: https://issues.apache.org/jira/browse/HDFS-15084
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Trivial
> Attachments: HDFS-15084.001.patch
>
>
> The param nsId in RouterRpcClient#getConnection is useless.
> Maybe we should remove it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15085) Erasure Coding: some ORC data can not be recovery when partial DataNodes are shut down

2019-12-30 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005191#comment-17005191
 ] 

Fei Hui commented on HDFS-15085:


[~zhangbutao] Thanks for reporting this. Could you please give more details?

> Erasure Coding: some ORC data can not be recovery  when partial DataNodes  
> are shut down
> 
>
> Key: HDFS-15085
> URL: https://issues.apache.org/jira/browse/HDFS-15085
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.1.0
>Reporter: zhangbutao
>Priority: Major
>
> Test environment: Hadoop version 3.1.0, 5 datanodes
> Steps to reproduce:
> 1. Set the EC policy RS-3-2-1024k on all HDFS paths:
> hdfs ec -setPolicy -path / RS-3-2-1024k
> 2. Put a small ORC file into HDFS:
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15084) RBF: Remove useless param nsId in RouterRpcClient#getConnection

2019-12-29 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15084:
---
Attachment: HDFS-15084.001.patch

> RBF: Remove useless param nsId in RouterRpcClient#getConnection
> ---
>
> Key: HDFS-15084
> URL: https://issues.apache.org/jira/browse/HDFS-15084
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Trivial
> Attachments: HDFS-15084.001.patch
>
>
> The param nsId in RouterRpcClient#getConnection is useless.
> Maybe we should remove it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15084) RBF: Remove useless param nsId in RouterRpcClient#getConnection

2019-12-29 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15084:
---
Status: Patch Available  (was: Open)

> RBF: Remove useless param nsId in RouterRpcClient#getConnection
> ---
>
> Key: HDFS-15084
> URL: https://issues.apache.org/jira/browse/HDFS-15084
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Trivial
> Attachments: HDFS-15084.001.patch
>
>
> The param nsId in RouterRpcClient#getConnection is useless.
> Maybe we should remove it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15084) RBF: Remove useless param nsId in RouterRpcClient#getConnection

2019-12-29 Thread Fei Hui (Jira)
Fei Hui created HDFS-15084:
--

 Summary: RBF: Remove useless param nsId in 
RouterRpcClient#getConnection
 Key: HDFS-15084
 URL: https://issues.apache.org/jira/browse/HDFS-15084
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: rbf
Affects Versions: 3.3.0
Reporter: Fei Hui
Assignee: Fei Hui


The param nsId in RouterRpcClient#getConnection is useless.
Maybe we should remove it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15081) Typo in RetryCache#waitForCompletion annotation

2019-12-27 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15081:
---
Environment: (was: x)

> Typo in RetryCache#waitForCompletion annotation
> ---
>
> Key: HDFS-15081
> URL: https://issues.apache.org/jira/browse/HDFS-15081
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15081.001.patch
>
>
> Typo in RetryCache#waitForCompletion annotation
> {code}
> // Previous request has failed, the expectation is is that it will be
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15081) Typo in RetryCache#waitForCompletion annotation

2019-12-27 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15081:
---
Status: Patch Available  (was: Open)

> Typo in RetryCache#waitForCompletion annotation
> ---
>
> Key: HDFS-15081
> URL: https://issues.apache.org/jira/browse/HDFS-15081
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15081.001.patch
>
>
> Typo in RetryCache#waitForCompletion annotation
> {code}
> // Previous request has failed, the expectation is is that it will be
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15081) Typo in RetryCache#waitForCompletion annotation

2019-12-27 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15081:
---
Environment: x  (was: Typo in RetryCache#waitForCompletion annotation
{code}
// Previous request has failed, the expectation is is that it will be
{code})

> Typo in RetryCache#waitForCompletion annotation
> ---
>
> Key: HDFS-15081
> URL: https://issues.apache.org/jira/browse/HDFS-15081
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.0
> Environment: x
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15081.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15081) Typo in RetryCache#waitForCompletion annotation

2019-12-27 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15081:
---
Description: H

> Typo in RetryCache#waitForCompletion annotation
> ---
>
> Key: HDFS-15081
> URL: https://issues.apache.org/jira/browse/HDFS-15081
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.0
> Environment: x
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15081.001.patch
>
>
> H



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15081) Typo in RetryCache#waitForCompletion annotation

2019-12-27 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15081:
---
Description: 
Typo in RetryCache#waitForCompletion annotation
{code}
// Previous request has failed, the expectation is is that it will be
{code}

  was:H


> Typo in RetryCache#waitForCompletion annotation
> ---
>
> Key: HDFS-15081
> URL: https://issues.apache.org/jira/browse/HDFS-15081
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.0
> Environment: x
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15081.001.patch
>
>
> Typo in RetryCache#waitForCompletion annotation
> {code}
> // Previous request has failed, the expectation is is that it will be
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15081) Typo in RetryCache#waitForCompletion annotation

2019-12-27 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15081:
---
Attachment: HDFS-15081.001.patch

> Typo in RetryCache#waitForCompletion annotation
> ---
>
> Key: HDFS-15081
> URL: https://issues.apache.org/jira/browse/HDFS-15081
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.0
> Environment: Typo in RetryCache#waitForCompletion annotation
> {code}
> // Previous request has failed, the expectation is is that it will be
> {code}
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15081.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15081) Typo in RetryCache#waitForCompletion annotation

2019-12-27 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17004006#comment-17004006
 ] 

Fei Hui commented on HDFS-15081:


Uploaded a simple fix.

> Typo in RetryCache#waitForCompletion annotation
> ---
>
> Key: HDFS-15081
> URL: https://issues.apache.org/jira/browse/HDFS-15081
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.0
> Environment: Typo in RetryCache#waitForCompletion annotation
> {code}
> // Previous request has failed, the expectation is is that it will be
> {code}
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15081.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15081) Typo in RetryCache#waitForCompletion annotation

2019-12-27 Thread Fei Hui (Jira)
Fei Hui created HDFS-15081:
--

 Summary: Typo in RetryCache#waitForCompletion annotation
 Key: HDFS-15081
 URL: https://issues.apache.org/jira/browse/HDFS-15081
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 3.3.0
 Environment: Typo in RetryCache#waitForCompletion annotation
{code}
// Previous request has failed, the expectation is is that it will be
{code}
Reporter: Fei Hui
Assignee: Fei Hui






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-26 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003619#comment-17003619
 ] 

Fei Hui commented on HDFS-15079:


Thanks [~hexiaoqiao] 
{quote}
ClientId & CallId of request from Router to NameNode are both created by Router 
itself 
{quote}
Yes, I was wrong; I mistook clientName for clientId :(
Digging in further.

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: UnexpectedOverWriteUT.patch
>
>
>  I find there is a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client with RBF(r0, r1) creates an HDFS file via r0, gets an exception, and 
> fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns a file that has already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem. There are many Hive and Spark jobs running 
> on our cluster, and it occurs from time to time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003430#comment-17003430
 ] 

Fei Hui commented on HDFS-15079:


[~ayushtkn]
{quote}
The namenode Logic that you tend to add, that kind of logic is there in 
Namenode in form of RetryCache, It checks whether the call isn't a repeated one 
due to failover, if so, it doesn't execute it again rather sends the old 
response from the cache. 
{quote}
Great. The callId is essentially the id I intended to add; the callId together 
with the clientId may resolve the problem.
For the NN, the clientId comes from the client, but the callId comes from the 
router client. I will try to dig in more as well.
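
For readers unfamiliar with the RetryCache mentioned above, a simplified sketch 
of the concept follows; it is keyed on the (clientId, callId) pair of the 
incoming RPC and is only an illustration, not the actual 
org.apache.hadoop.ipc.RetryCache implementation:

{code}
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Conceptual sketch of a retry cache keyed on (clientId, callId).
public class SimpleRetryCache {
  private static final class Key {
    final byte[] clientId;
    final int callId;
    Key(byte[] clientId, int callId) {
      this.clientId = clientId;
      this.callId = callId;
    }
    @Override public boolean equals(Object o) {
      if (!(o instanceof Key)) {
        return false;
      }
      Key k = (Key) o;
      return callId == k.callId && Arrays.equals(clientId, k.clientId);
    }
    @Override public int hashCode() {
      return 31 * Arrays.hashCode(clientId) + callId;
    }
  }

  private final Map<Key, Object> responses = new ConcurrentHashMap<>();

  // Returns the cached response for a repeated call, or null if it is new.
  public Object lookup(byte[] clientId, int callId) {
    return responses.get(new Key(clientId, callId));
  }

  // Records the response so a retried call can be answered from the cache.
  public void record(byte[] clientId, int callId, Object response) {
    responses.put(new Key(clientId, callId), response);
  }
}
{code}

The point raised above is that with RBF the clientId and callId the NameNode 
sees belong to the router connection rather than to the original client, so two 
routers forwarding the same client call do not hit the same cache entry.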

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: UnexpectedOverWriteUT.patch
>
>
>  I find there is a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client with RBF(r0, r1) creates an HDFS file via r0, gets an exception, and 
> fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns a file that has already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem. There are many Hive and Spark jobs running 
> on our cluster, and it occurs from time to time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-25 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003264#comment-17003264
 ] 

Fei Hui commented on HDFS-15078:


[~elgoiri]
{quote}
Can we try to do it as an exception handling instead of proactively checking?
{quote}
Sorry, I didn't catch that. Before the check, everything looks fine. Could you 
please give some ideas?

[~ayushtkn]
{quote}
Router is supposed to just receive the call, and if it has received a valid 
call, it should in any case send to namenode. 
{quote}
If the connection between the router and the client is closed, the result cannot 
be sent back to the client. So sending or not sending to the namenode both seem 
reasonable, because the call has already failed from the client's point of view.
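
For illustration, a minimal sketch of the proactive check being discussed, 
assuming a hypothetical handle that lets the router see whether the originating 
client connection is still open; this is not the actual HDFS-15078 patch:

{code}
import java.io.IOException;

// Illustrative only: before proxying a call to the namenode, fail fast if the
// client connection that issued the call has already been closed.
public class ConnectionCheckingInvoker {

  // Hypothetical view of the originating client connection.
  public interface ClientChannel {
    boolean isOpen();
  }

  // Minimal stand-in for the proxied namenode call.
  public interface Invocation<T> {
    T run() throws IOException;
  }

  public <T> T invoke(ClientChannel channel, Invocation<T> call) throws IOException {
    if (!channel.isOpen()) {
      // the client can never receive the result, so do not forward the call
      throw new IOException("Connection channel to client is closed");
    }
    return call.run();
  }
}
{code}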

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking the connection between the client and the router before 
> sending the rpc to the namenode would be better



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003233#comment-17003233
 ] 

Fei Hui edited comment on HDFS-15079 at 12/25/19 11:40 AM:
---

[~ayushtkn][~elgoiri]Upload an overwrite UT, similar to HDFS-15078



was (Author: ferhui):
[~ayushtkn][~elgoiri]Upload a overwrite UT, similar to HDFS-15078


> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: UnexpectedOverWriteUT.patch
>
>
>  I find there is a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client with RBF(r0, r1) creates an HDFS file via r0, gets an exception, and 
> fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns a file that has already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem. There are many Hive and Spark jobs running 
> on our cluster, and it occurs from time to time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003233#comment-17003233
 ] 

Fei Hui commented on HDFS-15079:


[~ayushtkn][~elgoiri]Upload a overwrite UT, similar to HDFS-15078


> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: UnexpectedOverWriteUT.patch
>
>
>  I find there is a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client with RBF(r0, r1) creates an HDFS file via r0, gets an exception, and 
> fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns a file that has already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem. There are many Hive and Spark jobs running 
> on our cluster, and it occurs from time to time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15079:
---
Attachment: UnexpectedOverWriteUT.patch

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: UnexpectedOverWriteUT.patch
>
>
>  I find there is a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client with RBF(r0, r1) creates an HDFS file via r0, gets an exception, and 
> fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns a file that has already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem. There are many Hive and Spark jobs running 
> on our cluster, and it occurs from time to time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003153#comment-17003153
 ] 

Fei Hui edited comment on HDFS-15079 at 12/25/19 9:09 AM:
--

[~elgoiri]
HDFS-15078 has a test case; it is one instance of this.

[~hexiaoqiao]
 The client gets an exception, but it is not the one the router throws. Client 
logs as follows:
{quote}
java.io.EOFException: End of File Exception between local host is: 
"xx.xx.xx.xx"; destination host is: "xx.xx.xx.xx":; : java.io.EOFException; 
For more details see:  http://wiki.apache.org/hadoop/EOFException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765)
at org.apache.hadoop.ipc.Client.call(Client.java:1507)
at org.apache.hadoop.ipc.Client.call(Client.java:1441)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy19.create(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:303)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:253)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
at com.sun.proxy.$Proxy20.create(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:264)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1727)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1662)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:503)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:499)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:514)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:442)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:979)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:872)
at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
at 
org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.initWriter(SparkHadoopWriter.scala:228)
at 
org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:122)
at 
org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
at 
org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1008)
{quote}

I think consistency may not be guaranteed if we do not resolve it on the NN side.


was (Author: ferhui):
[~elgoiri]
HDFS-15078 has a test case, it's one case for this.

[~hexiaoqiao]
 Client gets Exception, but the exception is not that router throws. client 
logs as follow
{quote}
java.io.EOFException: End of File Exception between local host is: 
"xx.xx.xx.xx"; destination host is: "xx.xx.xx.xx":; : java.io.EOFException; 
For more details see:  

[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003153#comment-17003153
 ] 

Fei Hui commented on HDFS-15079:


[~elgoiri]
HDFS-15078 has a test case; it is one instance of this.

[~hexiaoqiao]
 The client gets an exception, but it is not the one the router throws. Client 
logs as follows:
{quote}
java.io.EOFException: End of File Exception between local host is: 
"xx.xx.xx.xx"; destination host is: "xx.xx.xx.xx":; : java.io.EOFException; 
For more details see:  http://wiki.apache.org/hadoop/EOFException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765)
at org.apache.hadoop.ipc.Client.call(Client.java:1507)
at org.apache.hadoop.ipc.Client.call(Client.java:1441)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy19.create(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:303)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:253)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
at com.sun.proxy.$Proxy20.create(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:264)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1727)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1662)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:503)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:499)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:514)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:442)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:979)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:872)
at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
at 
org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.initWriter(SparkHadoopWriter.scala:228)
at 
org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:122)
at 
org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
at 
org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1008)
{quote}

I think maybe consistency is not guaranteed if resolve it on nn side.

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
>
>  I find there is a 

[jira] [Comment Edited] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-24 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002739#comment-17002739
 ] 

Fei Hui edited comment on HDFS-15078 at 12/24/19 10:23 AM:
---

{quote}
The issue is the first router which sent the request that late, That client did 
failover to another router, triggered a new call and the second router 
completed the call, and the first call came after this. 
{quote}
Getting an EOFException makes the client fail over to another router. 
Later the second router completed the call, and only after that did the first 
router send its request. If the first router had merely sent the request late, 
the client would not have got an exception and would not have failed over.

{quote}
If the client crashed post the check, this scenario will again come, This 
doesn't seems to be a problem with the client crashing and the Router sending 
the request still to Namenode,

If such a case where one Router is delaying, I think without client connection 
crashing still issues like these can come up.
{quote}
Yes. This issue can only resolve the problem in some scenarios, and it is just 
an improvement. HDFS-15079 tracks the higher-level problem.

In our scenarios, this fix works.



was (Author: ferhui):
{quote}
The issue is the first router which sent the request that late, That client did 
failover to another router, triggered a new call and the second router 
completed the call, and the first call came after this. 
{quote}
Getting EOFException makes client failover to another router. 
And later the second router completed the call,  the first router sent the 
request late. If just the first router sent the request late, client doesn't 
get exception, it will not failover

{quote}
If such a case where one Router is delaying, I think without client connection 
crashing still issues like these can come up.
{quote}
Yes. This issue only can resolve the problem on some scenarios. HDFS-15079 
tracks the high level problem.

In our  scenarios. This fix works.


> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking the connection between the client and the router before 
> sending the rpc to the namenode would be better



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-24 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002739#comment-17002739
 ] 

Fei Hui edited comment on HDFS-15078 at 12/24/19 10:02 AM:
---

{quote}
The issue is the first router which sent the request that late, That client did 
failover to another router, triggered a new call and the second router 
completed the call, and the first call came after this. 
{quote}
Getting EOFException makes client failover to another router. 
And later the second router completed the call,  the first router sent the 
request late. If just the first router sent the request late, client doesn't 
get exception, it will not failover

{quote}
If such a case where one Router is delaying, I think without client connection 
crashing still issues like these can come up.
{quote}
Yes. This issue only can resolve the problem on some scenarios. HDFS-15079 
tracks the high level problem.

In our  scenarios. This fix works.



was (Author: ferhui):
{quote}
The issue is the first router which c, That client did failover to another 
router, triggered a new call and the second router completed the call, and the 
first call came after this. 
{quote}
Getting EOFException makes client failover to another router. 
And later and the second router completed the call,  the first router the first 
router.

{quote}
If such a case where one Router is delaying, I think without client connection 
crashing still issues like these can come up.
{quote}
Yes. This issue only can resolve the problem on some scenarios. HDFS-15079 
tracks the high level problem.

In our  scenarios. This fix works.


> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking the connection between the client and the router before 
> sending the rpc to the namenode would be better



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-24 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002762#comment-17002762
 ] 

Fei Hui commented on HDFS-15079:


General idea (a rough sketch follows below):
* the client generates an id and sends it with each call to the namenode
* the namenode keeps the last id for the file of each lease
* a call is dropped if its id is less than the last id

[~ayushtkn]  [~elgoiri] [~hexiaoqiao] Any thoughts?
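
A rough sketch of that idea, assuming a hypothetical per-file sequence check on 
the NameNode side; names and structure are illustrative only and not part of any 
attached patch:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: each call carries a monotonically increasing id; the
// namenode remembers the last id accepted per (leaseHolder, file) and drops
// any call whose id is smaller, i.e. a delayed duplicate from another router.
public class PerFileCallOrdering {
  private final Map<String, Long> lastId = new ConcurrentHashMap<>();

  public boolean shouldProcess(String leaseHolder, String path, long callSeqId) {
    String key = leaseHolder + ":" + path;
    // atomically keep the maximum id seen so far for this lease/file pair
    long latest = lastId.merge(key, callSeqId, Math::max);
    // drop the call if a newer id has already been processed
    return callSeqId >= latest;
  }
}
{code}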

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
>
>  I find there is a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client with RBF(r0, r1) creates an HDFS file via r0, gets an exception, and 
> fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns a file that has already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem. There are many Hive and Spark jobs running 
> on our cluster, and it occurs from time to time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-24 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15079:
---
Summary: RBF: Client maybe get an unexpected result with network anomaly   
(was: RBF: Client may get an unexpected result with network anomaly )

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
>
>  I find there is a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client with RBF(r0, r1) creates an HDFS file via r0, gets an exception, and 
> fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns a file that has already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem. There are many Hive and Spark jobs running 
> on our cluster, and it occurs from time to time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-24 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002739#comment-17002739
 ] 

Fei Hui commented on HDFS-15078:


{quote}
The issue is the first router which c, That client did failover to another 
router, triggered a new call and the second router completed the call, and the 
first call came after this. 
{quote}
Getting EOFException makes client failover to another router. 
And later and the second router completed the call,  the first router the first 
router.

{quote}
If such a case where one Router is delaying, I think without client connection 
crashing still issues like these can come up.
{quote}
Yes. This issue only can resolve the problem on some scenarios. HDFS-15079 
tracks the high level problem.

In our  scenarios. This fix works.


> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking the connection between the client and the router before 
> sending the rpc to the namenode would be better



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12999) When reach the end of the block group, it may not need to flush all the data packets(flushAllInternals) twice.

2019-12-24 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002720#comment-17002720
 ] 

Fei Hui commented on HDFS-12999:


Yes, [~figo] doesn't seem active nowadays. Uploaded the v003 patch on his behalf.
[~ayushtkn] please review

> When reach the end of the block group, it may not need to flush all the data 
> packets(flushAllInternals) twice. 
> ---
>
> Key: HDFS-12999
> URL: https://issues.apache.org/jira/browse/HDFS-12999
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding, hdfs-client
>Affects Versions: 3.0.0-beta1, 3.1.0
>Reporter: lufei
>Assignee: lufei
>Priority: Major
> Attachments: HDFS-12999.001.patch, HDFS-12999.002.patch, 
> HDFS-12999.003.patch
>
>
> To simplify the process, there is no need to flush all the data 
> packets (flushAllInternals) twice when reaching the end of the block group.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12999) When reach the end of the block group, it may not need to flush all the data packets(flushAllInternals) twice.

2019-12-24 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-12999:
---
Attachment: HDFS-12999.003.patch

> When reach the end of the block group, it may not need to flush all the data 
> packets(flushAllInternals) twice. 
> ---
>
> Key: HDFS-12999
> URL: https://issues.apache.org/jira/browse/HDFS-12999
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding, hdfs-client
>Affects Versions: 3.0.0-beta1, 3.1.0
>Reporter: lufei
>Assignee: lufei
>Priority: Major
> Attachments: HDFS-12999.001.patch, HDFS-12999.002.patch, 
> HDFS-12999.003.patch
>
>
> To simplify the process, there is no need to flush all the data 
> packets (flushAllInternals) twice when reaching the end of the block group.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15079) RBF: Client may get an unexpected result with network anomaly

2019-12-24 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15079:
---
Issue Type: Bug  (was: Improvement)

> RBF: Client may get an unexpected result with network anomaly 
> --
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
>
>  I find there is a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client with RBF(r0, r1) creates an HDFS file via r0, gets an exception, and 
> fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive these RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns a file that has already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem. There are many Hive and Spark jobs running 
> on our cluster, and it occurs from time to time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15079) RBF: Client may get an unexpected result with network anomaly

2019-12-24 Thread Fei Hui (Jira)
Fei Hui created HDFS-15079:
--

 Summary: RBF: Client may get an unexpected result with network 
anomaly 
 Key: HDFS-15079
 URL: https://issues.apache.org/jira/browse/HDFS-15079
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: rbf
Affects Versions: 3.3.0
Reporter: Fei Hui


 I find there is a critical problem on RBF. HDFS-15078 can resolve it in some 
scenarios, but I have no idea about an overall resolution.
The problem is as follows:

A client with RBF(r0, r1) creates an HDFS file via r0, gets an exception, and 
fails over to r1
r0 has already sent the create RPC to the namenode (1st create)
The client creates the HDFS file via r1 (2nd create)
The client writes the HDFS file and finally closes it (3rd close)
The namenode may receive these RPCs in the following order:

2nd create
3rd close
1st create
Since overwrite is true by default, this turns a file that has already been 
written into an empty file. This is a critical problem.
We have encountered this problem. There are many Hive and Spark jobs running on 
our cluster, and it occurs from time to time.
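
To make the overwrite semantics concrete, here is a minimal client-side sketch 
(the path and configuration are made up, and both creates run in one process 
purely to show the ordering effect, whereas in the real issue they come from two 
different routers):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OverwriteRaceSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/tmp/demo-file");   // hypothetical path

    // 2nd create + write + close: the file now has real content
    try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
      out.writeBytes("real data");
    }

    // 1st create arriving late: overwrite=true truncates the file to zero length
    fs.create(file, true).close();

    System.out.println("length = " + fs.getFileStatus(file).getLen()); // prints 0
  }
}
{code}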



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-23 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002714#comment-17002714
 ] 

Fei Hui edited comment on HDFS-15078 at 12/24/19 7:58 AM:
--

This fix can resolve some scenarios.
Logs are as follows:
{quote}
2019-12-24 15:46:20,717 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
53 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo f
rom 10.xx.xx.xx:60980 Call#18 Retry#0: java.io.IOException: Connection Channel 
to 10.xx.xx.xx of xxx (auth:SIMPLE) is closed!
2019-12-24 15:46:20,718 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
53 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo f
rom 10.xx.xx.xx:60980 Call#18 Retry#0: output error
2019-12-24 15:46:20,718 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
53 on  caught an exception
java.nio.channels.ClosedChannelException
at 
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2738)
at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
at 
org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1096)
at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1168)
at 
org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2116)
at org.apache.hadoop.ipc.Server$Connection.access$500(Server.java:1236)
at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:638)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2252)
{quote}


was (Author: ferhui):
This fix can resolve some
logs as follow
{quote}
2019-12-24 15:46:20,717 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
53 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo f
rom 10.xx.xx.xx:60980 Call#18 Retry#0: java.io.IOException: Connection Channel 
to 10.xx.xx.xx of xxx (auth:SIMPLE) is closed!
2019-12-24 15:46:20,718 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
53 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo f
rom 10.xx.xx.xx:60980 Call#18 Retry#0: output error
2019-12-24 15:46:20,718 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
53 on  caught an exception
java.nio.channels.ClosedChannelException
at 
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2738)
at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
at 
org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1096)
at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1168)
at 
org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2116)
at org.apache.hadoop.ipc.Server$Connection.access$500(Server.java:1236)
at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:638)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2252)
{quote}

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> 

[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-23 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002714#comment-17002714
 ] 

Fei Hui commented on HDFS-15078:


This fix can resolve some scenarios.
Logs are as follows:
{quote}
2019-12-24 15:46:20,717 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
53 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo f
rom 10.xx.xx.xx:60980 Call#18 Retry#0: java.io.IOException: Connection Channel 
to 10.xx.xx.xx of xxx (auth:SIMPLE) is closed!
2019-12-24 15:46:20,718 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
53 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo f
rom 10.xx.xx.xx:60980 Call#18 Retry#0: output error
2019-12-24 15:46:20,718 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
53 on  caught an exception
java.nio.channels.ClosedChannelException
at 
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2738)
at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
at 
org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1096)
at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1168)
at 
org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2116)
at org.apache.hadoop.ipc.Server$Connection.access$500(Server.java:1236)
at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:638)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2252)
{quote}

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking the connection between client and router is better before 
> sending the rpc to namenode



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-23 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002603#comment-17002603
 ] 

Fei Hui edited comment on HDFS-15078 at 12/24/19 4:21 AM:
--

{quote}
if the client has triggered the request, I think it should go to the namenode, 
though it crashed after sending the request.
{quote}
On the client side, if the connection crashed, the client treats the call as a 
failure and fails over to another namenode, even though the call has actually 
succeeded.

{quote}
Moreover the case would be a rare scenario and this check would be done on 
every call, this would add unnecessary overhead to all calls.
{quote}
In a heavily loaded cluster, I see lots of output errors caused by 
java.nio.channels.ClosedChannelException.
There is a similar check on the namenode, in Handler#run:
{quote}
connDropped = !call.isOpen();
{quote}


was (Author: ferhui):
{quote}
Moreover the case would be a rare scenario and this check would be done on 
every call, this would add unnecessary overhead to all calls.
{quote}
In heavy load cluster, I see lots of output error because of 
java.nio.channels.ClosedChannelException.
There is similar check on namenode,Handler#run
{quote}
connDropped = !call.isOpen();
{quote}

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking the connection between client and router is better before 
> sending the rpc to namenode



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-23 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002603#comment-17002603
 ] 

Fei Hui commented on HDFS-15078:


{quote}
Moreover the case would be a rare scenario and this check would be done on 
every call, this would add unnecessary overhead to all calls.
{quote}
In a heavily loaded cluster, I see lots of output errors caused by 
java.nio.channels.ClosedChannelException.
There is a similar check on the namenode, in Handler#run:
{quote}
connDropped = !call.isOpen();
{quote}

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking the connection between client and router is better before 
> sending the rpc to namenode



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-23 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002586#comment-17002586
 ] 

Fei Hui commented on HDFS-15078:


Uploaded the v002 patch. It changes
{code}
if (curCall == null  || !curCall.isOpen()) {
{code}

to
{code}
if (curCall != null  && !curCall.isOpen()) {
{code}
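
To make the intent of that guard concrete, here is a minimal, self-contained 
sketch of the same pattern written against plain java.nio (the class and method 
names are hypothetical; this is not the Hadoop-internal code the patch touches): 
reject the call only when the originating channel is positively known to be 
closed, and let a call without an associated channel pass through.
{code}
import java.io.IOException;
import java.nio.channels.SocketChannel;

// Hypothetical sketch of the guard expressed by the v002 condition
// "curCall != null && !curCall.isOpen()": fail fast instead of forwarding an
// rpc whose response could never be delivered to the client anyway.
public class ConnectionGuardSketch {

  static void checkClientConnection(SocketChannel clientChannel)
      throws IOException {
    // A null channel means the call is not tied to a client connection,
    // so it is allowed through.
    if (clientChannel != null && !clientChannel.isOpen()) {
      throw new IOException(
          "Client connection is already closed; not forwarding rpc to namenode");
    }
  }
}
{code}
In the router, such a check would run right before the call is forwarded, so a 
client that has already disconnected does not trigger a create or close on the 
namenode that it can never observe.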

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking the connection between client and router is better before 
> sending the rpc to namenode



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-23 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15078:
---
Attachment: HDFS-15078.002.patch

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking the connection between client and router is better before 
> sending the rpc to namenode



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-23 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002571#comment-17002571
 ] 

Fei Hui commented on HDFS-15078:


[~ayushtkn] [~elgoiri] I found a critical problem on RBF. This issue can resolve 
it in some scenarios, but I have no overall resolution yet; I plan to file a new 
jira to track it.
The problem is as follows:
# A client using RBF (routers r0, r1) creates an HDFS file via r0, gets an 
exception, and fails over to r1
# r0 has already sent the create rpc to the namenode (1st create)
# The client creates the HDFS file via r1 (2nd create)
# The client writes the HDFS file and finally closes it (3rd close)

The namenode may receive the rpcs in the following order:
# 2nd create
# 3rd close
# 1st create

Since overwrite is true by default, this turns the file that has just been 
written into an empty file. This is a critical problem and we have encountered it

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking the connection between client and router is better before 
> sending the rpc to namenode



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-23 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002374#comment-17002374
 ] 

Fei Hui commented on HDFS-15078:


[~ayushtkn] [~elgoiri] Could you please take a look? Thanks

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking the connection between client and router is better before 
> sending the rpc to namenode



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-23 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15078:
---
Status: Patch Available  (was: Open)

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking the connection between client and router is better before 
> sending the rpc to namenode



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-23 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15078:
---
Attachment: HDFS-15078.001.patch

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking the connection between client and router is better before 
> sending the rpc to namenode



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


