[jira] [Commented] (HDFS-15576) Erasure Coding: Add rs and rs-legacy codec test for addPolicies
[ https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196006#comment-17196006 ]

Fei Hui commented on HDFS-15576:

Changed the caption and uploaded the v002 patch.

> Erasure Coding: Add rs and rs-legacy codec test for addPolicies
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Minor
> Attachments: HDFS-15576.001.patch, HDFS-15576.002.patch
>
> * Add rs and rs-legacy codec test for TestErasureCodingCLI
> * Add comments for failed test RS
> * Modify UT, change "RS" to "rs", because "RS" is not supported

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15576) Erasure Coding: Add rs and rs-legacy codec test for addPolicies
[ https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15576:
---
Description:
* Add rs and rs-legacy codec test for TestErasureCodingCLI
* Add comments for failed test RS
* Modify UT, change "RS" to "rs", because "RS" is not supported

was:
* Add rs and rs-legacy codec test for TestErasureCodingCLI
* Add comments for failed test
* Modify UT, change "RS" to "rs", because "RS" is not supported

> Erasure Coding: Add rs and rs-legacy codec test for addPolicies
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Minor
> Attachments: HDFS-15576.001.patch, HDFS-15576.002.patch
>
> * Add rs and rs-legacy codec test for TestErasureCodingCLI
> * Add comments for failed test RS
> * Modify UT, change "RS" to "rs", because "RS" is not supported
[jira] [Updated] (HDFS-15576) Erasure Coding: Add rs and rs-legacy codec test for addPolicies
[ https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15576:
---
Attachment: HDFS-15576.002.patch

> Erasure Coding: Add rs and rs-legacy codec test for addPolicies
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Minor
> Attachments: HDFS-15576.001.patch, HDFS-15576.002.patch
>
> * Add rs and rs-legacy codec test for TestErasureCodingCLI
> * Add comments for failed test RS
> * Modify UT, change "RS" to "rs", because "RS" is not supported
[jira] [Updated] (HDFS-15576) Erasure Coding: Add rs and rs-legacy codec test for addPolicies
[ https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15576:
---
Description:
* Add rs and rs-legacy codec test for TestErasureCodingCLI
* Add comments for failed test
* Modify UT, change "RS" to "rs", because "RS" is not supported

was:
* Add UT TestECAdmin#testAddPolicies
* Modify UT, change "RS" to "rs", because "RS" is not supported

> Erasure Coding: Add rs and rs-legacy codec test for addPolicies
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Minor
> Attachments: HDFS-15576.001.patch
>
> * Add rs and rs-legacy codec test for TestErasureCodingCLI
> * Add comments for failed test
> * Modify UT, change "RS" to "rs", because "RS" is not supported
[jira] [Updated] (HDFS-15576) Erasure Coding: Add rs and rs-legacy codec test for addPolicies
[ https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15576:
---
Summary: Erasure Coding: Add rs and rs-legacy codec test for addPolicies (was: Erasure Coding: Add rs & rs-legacy codec test for addPolicies)

> Erasure Coding: Add rs and rs-legacy codec test for addPolicies
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Minor
> Attachments: HDFS-15576.001.patch
>
> * Add UT TestECAdmin#testAddPolicies
> * Modify UT, change "RS" to "rs", because "RS" is not supported
[jira] [Updated] (HDFS-15576) Erasure Coding: Add rs & rs-legacy codec test for addPolicies
[ https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15576:
---
Summary: Erasure Coding: Add rs & rs-legacy codec test for addPolicies (was: Erasure Coding: Add test addPolicies to ECAdmin)

> Erasure Coding: Add rs & rs-legacy codec test for addPolicies
> -
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Minor
> Attachments: HDFS-15576.001.patch
>
> * Add UT TestECAdmin#testAddPolicies
> * Modify UT, change "RS" to "rs", because "RS" is not supported
[jira] [Commented] (HDFS-15576) Erasure Coding: Add test addPolicies to ECAdmin
[ https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195973#comment-17195973 ]

Fei Hui commented on HDFS-15576:

[~tasanuma] Thanks a lot, got it! I will try it with your suggestions.

In addition, I want to add comments to the code below from test_ec_policies.xml. It exists only for a failure test; I mistakenly used it as a reference :(
{quote}
RS
{quote}

> Erasure Coding: Add test addPolicies to ECAdmin
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Minor
> Attachments: HDFS-15576.001.patch
>
> * Add UT TestECAdmin#testAddPolicies
> * Modify UT, change "RS" to "rs", because "RS" is not supported
[jira] [Commented] (HDFS-15576) Erasure Coding: Add test addPolicies to ECAdmin
[ https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195906#comment-17195906 ]

Fei Hui commented on HDFS-15576:

The failed tests are unrelated.
[~hexiaoqiao] [~weichiu] [~aajisaka] Could you please take a look? Thanks.

> Erasure Coding: Add test addPolicies to ECAdmin
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Minor
> Attachments: HDFS-15576.001.patch
>
> * Add UT TestECAdmin#testAddPolicies
> * Modify UT, change "RS" to "rs", because "RS" is not supported
[jira] [Updated] (HDFS-15576) Erasure Coding: Add test addPolicies to ECAdmin
[ https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15576:
---
Summary: Erasure Coding: Add test addPolicies to ECAdmin (was: Add test addPolicies to ECAdmin)

> Erasure Coding: Add test addPolicies to ECAdmin
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Minor
> Attachments: HDFS-15576.001.patch
>
> * Add UT TestECAdmin#testAddPolicies
> * Modify UT, change "RS" to "rs", because "RS" is not supported
[jira] [Updated] (HDFS-15576) Add test addPolicies to ECAdmin
[ https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15576:
---
Attachment: HDFS-15576.001.patch

> Add test addPolicies to ECAdmin
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Minor
> Attachments: HDFS-15576.001.patch
>
> * Add UT TestECAdmin#testAddPolicies
> * Modify UT, change "RS" to "rs", because "RS" is not supported
[jira] [Updated] (HDFS-15576) Add test addPolicies to ECAdmin
[ https://issues.apache.org/jira/browse/HDFS-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15576:
---
Status: Patch Available (was: Open)

> Add test addPolicies to ECAdmin
> ---
>
> Key: HDFS-15576
> URL: https://issues.apache.org/jira/browse/HDFS-15576
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Minor
> Attachments: HDFS-15576.001.patch
>
> * Add UT TestECAdmin#testAddPolicies
> * Modify UT, change "RS" to "rs", because "RS" is not supported
[jira] [Created] (HDFS-15576) Add test addPolicies to ECAdmin
Fei Hui created HDFS-15576:
--
Summary: Add test addPolicies to ECAdmin
Key: HDFS-15576
URL: https://issues.apache.org/jira/browse/HDFS-15576
Project: Hadoop HDFS
Issue Type: Test
Reporter: Fei Hui
Assignee: Fei Hui

* Add UT TestECAdmin#testAddPolicies
* Modify UT, change "RS" to "rs", because "RS" is not supported
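The "RS" versus "rs" change in HDFS-15576 comes down to codec names being matched exactly. A minimal sketch of that idea, assuming a case-sensitive lookup against a fixed set of names; the class `CodecNameCheck` and its set are hypothetical stand-ins for illustration and are not Hadoop's actual codec registry:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only: not the Hadoop codec registry.
final class CodecNameCheck {
  // Codec names taken from the issue text ("rs" and "rs-legacy").
  private static final Set<String> SUPPORTED =
      new HashSet<>(Arrays.asList("rs", "rs-legacy"));

  // Exact, case-sensitive match: "RS" is rejected while "rs" is accepted.
  static boolean isSupported(String codecName) {
    return codecName != null && SUPPORTED.contains(codecName);
  }

  public static void main(String[] args) {
    for (String name : new String[] {"rs", "rs-legacy", "RS"}) {
      System.out.println(name + " supported: " + isSupported(name));
    }
  }
}
```

Under this assumption, a test that passes "RS" exercises the rejection path, which is why the unit test needs the lowercase name to cover the success path.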
[jira] [Commented] (HDFS-15564) Add Test annotation for TestPersistBlocks#testRestartDfsWithSync
[ https://issues.apache.org/jira/browse/HDFS-15564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194233#comment-17194233 ]

Fei Hui commented on HDFS-15564:

[~hexiaoqiao] Could you please help commit it? Thanks.

> Add Test annotation for TestPersistBlocks#testRestartDfsWithSync
> 
>
> Key: HDFS-15564
> URL: https://issues.apache.org/jira/browse/HDFS-15564
> Project: Hadoop HDFS
> Issue Type: Test
> Components: hdfs
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Minor
> Attachments: HDFS-15564.001.patch
>
> Add Test annotation for TestPersistBlocks#testRestartDfsWithSync, otherwise it’s dead code
[jira] [Commented] (HDFS-13293) RBF: The RouterRPCServer should transfer CallerContext and client ip to NamenodeRpcServer
[ https://issues.apache.org/jira/browse/HDFS-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17193397#comment-17193397 ]

Fei Hui commented on HDFS-13293:

[~aajisaka] [~elgoiri] I think we may need to do three things.
* File a new jira to extend CallerContext in hadoop common so that it can contain multiple key-value pairs.
* Add the real client ip to the caller context in this jira. *hadoop.caller.context.enabled* is already used by the audit log; should we add a new parameter?
* File a new jira to fix the way Yarn uses CallerContext (add key-value pairs to the context).

What do you think?

> RBF: The RouterRPCServer should transfer CallerContext and client ip to NamenodeRpcServer
> -
>
> Key: HDFS-13293
> URL: https://issues.apache.org/jira/browse/HDFS-13293
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: maobaolong
> Assignee: Fei Hui
> Priority: Major
> Attachments: HDFS-13293.001.patch
>
> Otherwise, the namenode doesn't know the client's callerContext
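The first two proposals in the comment above (a context that holds several key-value pairs, with the real client ip as one of them) can be sketched as follows. This is only an illustration under stated assumptions: the class `KeyValueContext`, the key names, and the ":" and "," separators are all hypothetical and are not taken from Hadoop's CallerContext.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch: not Hadoop's CallerContext class.
final class KeyValueContext {
  // Insertion order is kept so the rendered string is deterministic.
  private final Map<String, String> fields = new LinkedHashMap<>();

  KeyValueContext put(String key, String value) {
    fields.put(key, value);
    return this;
  }

  // Renders "k1:v1,k2:v2"; both separators are assumptions of this sketch.
  String render() {
    StringJoiner joiner = new StringJoiner(",");
    for (Map.Entry<String, String> e : fields.entrySet()) {
      joiner.add(e.getKey() + ":" + e.getValue());
    }
    return joiner.toString();
  }

  public static void main(String[] args) {
    // A router could append the real client address before forwarding a call.
    String context = new KeyValueContext()
        .put("caller", "app_1")
        .put("clientIp", "10.0.0.1")
        .render();
    System.out.println(context);
  }
}
```

With key-value pairs, a downstream reader such as an audit log can pick out the client ip by key instead of relying on the position of a value in a free-form string.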
[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17193291#comment-17193291 ]

Fei Hui commented on HDFS-15556:

[~haiyang Hu] It's the same as HDFS-14042. Should we resolve this issue as Duplicate?

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
> In our cluster, the NameNode hits an NPE when processing lifeline messages
> sent by the DataNode, which causes the maxLoad calculated by the NN to be
> abnormal. Because DataNodes are then identified as busy, no available node
> can be allocated when choosing a DataNode; the program loops, which results
> in high CPU usage and reduces the processing performance of the cluster.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8022, call Call#20535 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from x:34766
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
>     at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
>     at org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
>     at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
>     at org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
>     at org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
>     storage = storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
>     failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report); // NPE exception occurred here
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
>     continue;
>   }
>   ...
> {code}
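The failure mode quoted above is a map lookup that returns null for an unknown storage ID, followed by an unguarded method call. A minimal sketch of one possible guard, assuming that skipping unknown storages is acceptable; the class `StorageStatsSketch` and its methods are hypothetical, and this is not the HDFS-15556 or HDFS-14042 patch:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a null guard; names are illustrative only.
final class StorageStatsSketch {
  // Heartbeat count per known storage ID.
  private final Map<String, Integer> heartbeatsByStorageId = new HashMap<>();

  void addStorage(String storageId) {
    heartbeatsByStorageId.put(storageId, 0);
  }

  // Processes the reported IDs and returns how many were skipped
  // because the storage ID is not in the map.
  int update(String... reportedStorageIds) {
    int skipped = 0;
    for (String id : reportedStorageIds) {
      Integer count = heartbeatsByStorageId.get(id);
      if (count == null) {
        // Without this check, dereferencing the lookup result would throw.
        skipped++;
        continue;
      }
      heartbeatsByStorageId.put(id, count + 1);
    }
    return skipped;
  }

  int heartbeats(String storageId) {
    return heartbeatsByStorageId.getOrDefault(storageId, 0);
  }
}
```

Whether an unknown storage should be skipped, logged, or registered on the fly is a design decision for the issue itself; the sketch only shows where the check sits relative to the lookup.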
[jira] [Updated] (HDFS-15564) Add Test annotation for TestPersistBlocks#testRestartDfsWithSync
[ https://issues.apache.org/jira/browse/HDFS-15564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15564:
---
Status: Patch Available (was: Open)

> Add Test annotation for TestPersistBlocks#testRestartDfsWithSync
> 
>
> Key: HDFS-15564
> URL: https://issues.apache.org/jira/browse/HDFS-15564
> Project: Hadoop HDFS
> Issue Type: Test
> Components: hdfs
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Minor
> Attachments: HDFS-15564.001.patch
>
> Add Test annotation for TestPersistBlocks#testRestartDfsWithSync, otherwise it’s dead code
[jira] [Updated] (HDFS-15564) Add Test annotation for TestPersistBlocks#testRestartDfsWithSync
[ https://issues.apache.org/jira/browse/HDFS-15564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15564:
---
Attachment: HDFS-15564.001.patch

> Add Test annotation for TestPersistBlocks#testRestartDfsWithSync
> 
>
> Key: HDFS-15564
> URL: https://issues.apache.org/jira/browse/HDFS-15564
> Project: Hadoop HDFS
> Issue Type: Test
> Components: hdfs
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Minor
> Attachments: HDFS-15564.001.patch
>
> Add Test annotation for TestPersistBlocks#testRestartDfsWithSync, otherwise it’s dead code
[jira] [Created] (HDFS-15564) Add Test annotation for TestPersistBlocks#testRestartDfsWithSync
Fei Hui created HDFS-15564:
--
Summary: Add Test annotation for TestPersistBlocks#testRestartDfsWithSync
Key: HDFS-15564
URL: https://issues.apache.org/jira/browse/HDFS-15564
Project: Hadoop HDFS
Issue Type: Test
Components: hdfs
Affects Versions: 3.3.0
Reporter: Fei Hui
Assignee: Fei Hui

Add Test annotation for TestPersistBlocks#testRestartDfsWithSync, otherwise it’s dead code
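The "dead code" remark in HDFS-15564 follows from how annotation-driven test runners find tests: a method without the marker annotation is never discovered, so it never runs. A self-contained sketch of that mechanism, using a hypothetical `@Marked` annotation and `MiniRunner` class in place of a real test framework:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a test framework's marker annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Marked {}

final class SampleTests {
  @Marked
  public void testWithAnnotation() {}

  // No annotation: an annotation-driven runner never discovers this method.
  public void testWithoutAnnotation() {}
}

final class MiniRunner {
  // Collects the names of the methods a runner would execute.
  static List<String> discover(Class<?> clazz) {
    List<String> found = new ArrayList<>();
    for (Method m : clazz.getDeclaredMethods()) {
      if (m.isAnnotationPresent(Marked.class)) {
        found.add(m.getName());
      }
    }
    return found;
  }
}
```

Calling `MiniRunner.discover(SampleTests.class)` returns only `testWithAnnotation`; the unannotated method compiles but is silently ignored.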
[jira] [Commented] (HDFS-13293) RBF: The RouterRPCServer should transfer CallerContext and client ip to NamenodeRpcServer
[ https://issues.apache.org/jira/browse/HDFS-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192646#comment-17192646 ]

Fei Hui commented on HDFS-13293:

[~aajisaka] [~elgoiri] Thanks for bringing this up again. If we are agreed on this, I will rebase the patch.

> RBF: The RouterRPCServer should transfer CallerContext and client ip to NamenodeRpcServer
> -
>
> Key: HDFS-13293
> URL: https://issues.apache.org/jira/browse/HDFS-13293
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: maobaolong
> Assignee: Fei Hui
> Priority: Major
> Attachments: HDFS-13293.001.patch
>
> Otherwise, the namenode doesn't know the client's callerContext
[jira] [Commented] (HDFS-14351) RBF: Optimize configuration item resolving for monitor namenode
[ https://issues.apache.org/jira/browse/HDFS-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189992#comment-17189992 ]

Fei Hui commented on HDFS-14351:

It may be helpful to backport it to other 3.x branches. Thanks.

> RBF: Optimize configuration item resolving for monitor namenode
> ---
>
> Key: HDFS-14351
> URL: https://issues.apache.org/jira/browse/HDFS-14351
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: rbf
> Reporter: Xiaoqiao He
> Assignee: Xiaoqiao He
> Priority: Major
> Fix For: 3.3.0, HDFS-13891
> Attachments: HDFS-14351-HDFS-13891.001.patch, HDFS-14351-HDFS-13891.002.patch, HDFS-14351-HDFS-13891.003.patch, HDFS-14351-HDFS-13891.004.patch, HDFS-14351-HDFS-13891.005.patch, HDFS-14351-HDFS-13891.006.patch, HDFS-14351.001.patch, HDFS-14351.002.patch
>
> We invoke {{configuration.get}} to resolve the configuration item
> `dfs.federation.router.monitor.namenode` in `Router.java`, then split the
> value by comma to get the nsid and nnid. This may confuse users, since it
> does not tolerate blank spaces while other common parameters do. The
> following segment shows an example for which resolving fails.
> {code:java}
> <property>
>   <name>dfs.federation.router.monitor.namenode</name>
>   <value>nameservice1.nn1, nameservice1.nn2</value>
>   <description>
>     The identifier of the namenodes to monitor and heartbeat.
>   </description>
> </property>
> {code}
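The problem described in HDFS-14351 is that a plain comma split keeps the blank after the comma as part of the next entry. A small sketch of a whitespace-tolerant parse in plain Java; the class `MonitorNamenodeParser` is hypothetical and does not reproduce the committed change:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the code from the patch.
final class MonitorNamenodeParser {
  // Splits on commas, trims each entry, and drops empty entries.
  static List<String> parse(String value) {
    List<String> ids = new ArrayList<>();
    if (value == null) {
      return ids;
    }
    for (String part : value.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        ids.add(trimmed);
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    String value = "nameservice1.nn1, nameservice1.nn2";
    // Naive split: the second entry starts with a blank.
    System.out.println("[" + value.split(",")[1] + "]");
    // Trimmed parse: both entries are clean identifiers.
    System.out.println(parse(value));
  }
}
```

With the value from the example above, the naive split yields " nameservice1.nn2" with a leading blank, while the trimmed parse yields "nameservice1.nn2".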
[jira] [Commented] (HDFS-15540) Directories protected from delete can still be moved to the trash
[ https://issues.apache.org/jira/browse/HDFS-15540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185532#comment-17185532 ]

Fei Hui commented on HDFS-15540:

[~sodonnell] Good catch! It looks good!

> Directories protected from delete can still be moved to the trash
> -
>
> Key: HDFS-15540
> URL: https://issues.apache.org/jira/browse/HDFS-15540
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.4.0
> Reporter: Stephen O'Donnell
> Assignee: Stephen O'Donnell
> Priority: Major
> Attachments: HDFS-15540.001.patch
>
> With HDFS-8983, HDFS-14802 and HDFS-15243 we are able to list protected
> directories which cannot be deleted or renamed, provided the following is set:
> fs.protected.directories:
> dfs.protected.subdirectories.enable: true
> Testing this feature out, I can see it mostly works fine, but protected
> non-empty folders can still be moved to the trash. In this example
> /dir/protected is set in fs.protected.directories, and
> dfs.protected.subdirectories.enable is true.
> {code}
> hadoop fs -ls -R /dir
> drwxr-xr-x - hdfs supergroup 0 2020-08-26 16:52 /dir/protected
> -rw-r--r-- 3 hdfs supergroup 174 2020-08-26 16:52 /dir/protected/file1
> drwxr-xr-x - hdfs supergroup 0 2020-08-26 16:52 /dir/protected/subdir1
> -rw-r--r-- 3 hdfs supergroup 174 2020-08-26 16:52 /dir/protected/subdir1/file1
> drwxr-xr-x - hdfs supergroup 0 2020-08-26 16:52 /dir/protected/subdir2
> -rw-r--r-- 3 hdfs supergroup 174 2020-08-26 16:52 /dir/protected/subdir2/file1
> [hdfs@7d67ed1af9b0 /]$ hadoop fs -rm -r -f -skipTrash /dir/protected/subdir1
> rm: Cannot delete/rename subdirectory under protected subdirectory /dir/protected
> [hdfs@7d67ed1af9b0 /]$ hadoop fs -mv /dir/protected/subdir1 /dir/protected/subdir1-moved
> mv: Cannot delete/rename subdirectory under protected subdirectory /dir/protected
> ** ALL GOOD SO FAR **
> [hdfs@7d67ed1af9b0 /]$ hadoop fs -rm -r -f /dir/protected/subdir1
> 2020-08-26 16:54:32,404 INFO fs.TrashPolicyDefault: Moved: 'hdfs://nn1/dir/protected/subdir1' to trash at: hdfs://nn1/user/hdfs/.Trash/Current/dir/protected/subdir1
> ** It moved the protected sub-dir to the trash, where it will be deleted **
> ** Checking the top level dir, it is the same **
> [hdfs@7d67ed1af9b0 /]$ hadoop fs -rm -r -f -skipTrash /dir/protected
> rm: Cannot delete/rename non-empty protected directory /dir/protected
> [hdfs@7d67ed1af9b0 /]$ hadoop fs -mv /dir/protected /dir/protected-new
> mv: Cannot delete/rename non-empty protected directory /dir/protected
> [hdfs@7d67ed1af9b0 /]$ hadoop fs -rm -r -f /dir/protected
> 2020-08-26 16:55:32,402 INFO fs.TrashPolicyDefault: Moved: 'hdfs://nn1/dir/protected' to trash at: hdfs://nn1/user/hdfs/.Trash/Current/dir/protected1598460932388
> {code}
> The reason for this seems to be that "move to trash" uses a different rename
> method in FSNamesystem and FSDirRenameOp, which avoids the
> DFSUtil.checkProtectedDescendants(...) check added in the earlier Jiras.
> I believe that "move to trash" should be protected in the same way as a
> -skipTrash delete.
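The gap described in HDFS-15540 is that one code path ("move to trash") skips a check that the other paths (delete, rename) apply. A simplified sketch of routing every path through the same check; the class `ProtectedDirs` is hypothetical, treats everything under a protected directory as protected, and ignores the non-empty condition that the real feature applies:

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: every operation goes through the same protection check.
final class ProtectedDirs {
  private final Set<String> protectedDirs = new TreeSet<>();

  ProtectedDirs(String... dirs) {
    for (String d : dirs) {
      protectedDirs.add(d);
    }
  }

  // True if path is a protected directory or lies beneath one.
  boolean isProtected(String path) {
    for (String dir : protectedDirs) {
      if (path.equals(dir) || path.startsWith(dir + "/")) {
        return true;
      }
    }
    return false;
  }

  // Both entry points share the check, so neither can bypass it.
  boolean canDelete(String path) {
    return !isProtected(path);
  }

  boolean canMoveToTrash(String path) {
    return !isProtected(path);
  }
}
```

For a `ProtectedDirs` built with `/dir/protected`, both `canDelete("/dir/protected/subdir1")` and `canMoveToTrash("/dir/protected/subdir1")` return false, which is the symmetry the report asks for.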
[jira] [Updated] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues
[ https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-14852:
---
Attachment: HDFS-14852.007.patch

> Remove of LowRedundancyBlocks do NOT remove the block from all queues
> -
>
> Key: HDFS-14852
> URL: https://issues.apache.org/jira/browse/HDFS-14852
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0, 3.0.3, 3.1.2, 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: CorruptBlocksMismatch.png, HDFS-14852.001.patch, HDFS-14852.002.patch, HDFS-14852.003.patch, HDFS-14852.004.patch, HDFS-14852.005.patch, HDFS-14852.006.patch, HDFS-14852.007.patch, screenshot-1.png
>
> LowRedundancyBlocks.java
> {code:java}
> // Some comments here
>     if(priLevel >= 0 && priLevel < LEVEL
>         && priorityQueues.get(priLevel).remove(block)) {
>       NameNode.blockStateChangeLog.debug(
>           "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
>               + " from priority queue {}",
>           block, priLevel);
>       decrementBlockStat(block, priLevel, oldExpectedReplicas);
>       return true;
>     } else {
>       // Try to remove the block from all queues if the block was
>       // not found in the queue for the given priority level.
>       for (int i = 0; i < LEVEL; i++) {
>         if (i != priLevel && priorityQueues.get(i).remove(block)) {
>           NameNode.blockStateChangeLog.debug(
>               "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block" +
>                   " {} from priority queue {}", block, i);
>           decrementBlockStat(block, i, oldExpectedReplicas);
>           return true;
>         }
>       }
>     }
>     return false;
>   }
> {code}
> The source code is above; the comment reads as follows:
> {quote}
> // Try to remove the block from all queues if the block was
> // not found in the queue for the given priority level.
> {quote}
> The function "remove" does NOT remove the block from all queues.
> The function add in LowRedundancyBlocks.java is used in several places, so
> one block may end up in two or more queues.
> We found that the corrupt blocks do not match the corrupt files on the NN
> web UI. Maybe it is related to this.
> Uploaded an initial patch.
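The quoted `remove` returns as soon as one queue reports a hit, so a block that sits in two queues survives in the second one. A minimal sketch of the alternative the issue title asks for, removing from every queue before returning; the class `PriorityQueuesSketch` is hypothetical, uses plain sets of strings, and is not the attached patch:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: plain sets stand in for the priority queues.
final class PriorityQueuesSketch {
  private final List<Set<String>> queues = new ArrayList<>();

  PriorityQueuesSketch(int levels) {
    for (int i = 0; i < levels; i++) {
      queues.add(new HashSet<>());
    }
  }

  void add(String block, int level) {
    queues.get(level).add(block);
  }

  // Visits every queue; there is no early return after the first hit.
  boolean removeFromAll(String block) {
    boolean removed = false;
    for (Set<String> queue : queues) {
      if (queue.remove(block)) {
        removed = true;
      }
    }
    return removed;
  }

  boolean contains(String block) {
    for (Set<String> queue : queues) {
      if (queue.contains(block)) {
        return true;
      }
    }
    return false;
  }
}
```

The real method also updates per-level statistics on each removal (`decrementBlockStat` in the quoted code); a full fix would need to do that inside the loop for every queue that held the block.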
[jira] [Commented] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues
[ https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182659#comment-17182659 ]

Fei Hui commented on HDFS-14852:

[~hexiaoqiao] Thanks for the review. I forgot to remove the original code; uploaded the v007 patch.
When transitioning the standby namenode to active, we found corrupt blocks. After deleting the corrupt files, we still saw "There are 2 corrupt blocks". I think that if we delete the file, its blocks should not be in any queue.
I didn't dig into why one block was added to 2 queues, and this is not easy to reproduce.

> Remove of LowRedundancyBlocks do NOT remove the block from all queues
> -
>
> Key: HDFS-14852
> URL: https://issues.apache.org/jira/browse/HDFS-14852
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0, 3.0.3, 3.1.2, 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: CorruptBlocksMismatch.png, HDFS-14852.001.patch, HDFS-14852.002.patch, HDFS-14852.003.patch, HDFS-14852.004.patch, HDFS-14852.005.patch, HDFS-14852.006.patch, screenshot-1.png
>
> LowRedundancyBlocks.java
> {code:java}
> // Some comments here
>     if(priLevel >= 0 && priLevel < LEVEL
>         && priorityQueues.get(priLevel).remove(block)) {
>       NameNode.blockStateChangeLog.debug(
>           "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
>               + " from priority queue {}",
>           block, priLevel);
>       decrementBlockStat(block, priLevel, oldExpectedReplicas);
>       return true;
>     } else {
>       // Try to remove the block from all queues if the block was
>       // not found in the queue for the given priority level.
>       for (int i = 0; i < LEVEL; i++) {
>         if (i != priLevel && priorityQueues.get(i).remove(block)) {
>           NameNode.blockStateChangeLog.debug(
>               "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block" +
>                   " {} from priority queue {}", block, i);
>           decrementBlockStat(block, i, oldExpectedReplicas);
>           return true;
>         }
>       }
>     }
>     return false;
>   }
> {code}
> The source code is above; the comment reads as follows:
> {quote}
> // Try to remove the block from all queues if the block was
> // not found in the queue for the given priority level.
> {quote}
> The function "remove" does NOT remove the block from all queues.
> The function add in LowRedundancyBlocks.java is used in several places, so
> one block may end up in two or more queues.
> We found that the corrupt blocks do not match the corrupt files on the NN
> web UI. Maybe it is related to this.
> Uploaded an initial patch.
[jira] [Commented] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues
[ https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181844#comment-17181844 ]

Fei Hui commented on HDFS-14852:

Failed tests are unrelated.

> Remove of LowRedundancyBlocks do NOT remove the block from all queues
>
> Key: HDFS-14852
> URL: https://issues.apache.org/jira/browse/HDFS-14852
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0, 3.0.3, 3.1.2, 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: CorruptBlocksMismatch.png, HDFS-14852.001.patch, HDFS-14852.002.patch, HDFS-14852.003.patch, HDFS-14852.004.patch, HDFS-14852.005.patch, HDFS-14852.006.patch, screenshot-1.png
>
> LowRedundancyBlocks.java
> {code:java}
>   // Some comments here
>   if(priLevel >= 0 && priLevel < LEVEL
>       && priorityQueues.get(priLevel).remove(block)) {
>     NameNode.blockStateChangeLog.debug(
>         "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
>             + " from priority queue {}",
>         block, priLevel);
>     decrementBlockStat(block, priLevel, oldExpectedReplicas);
>     return true;
>   } else {
>     // Try to remove the block from all queues if the block was
>     // not found in the queue for the given priority level.
>     for (int i = 0; i < LEVEL; i++) {
>       if (i != priLevel && priorityQueues.get(i).remove(block)) {
>         NameNode.blockStateChangeLog.debug(
>             "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block" +
>                 " {} from priority queue {}", block, i);
>         decrementBlockStat(block, i, oldExpectedReplicas);
>         return true;
>       }
>     }
>   }
>   return false;
> }
> {code}
> The source code is above. The comment in it reads as follows:
> {quote}
> // Try to remove the block from all queues if the block was
> // not found in the queue for the given priority level.
> {quote}
> The function "remove" does NOT remove the block from all queues: it returns as soon as one queue reports a successful removal.
> The function "add" in LowRedundancyBlocks.java is called in several places, so one block may end up in two or more queues.
> We found that the corrupt block count does not match the corrupt files shown on the NN web UI. This may be related.
> Uploaded initial patch.
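The difference between the quoted logic and what its comment promises can be shown with a minimal sketch. This is not the HDFS-14852 patch and not Hadoop code: the class `PriorityQueuesSketch`, the methods `removeFirstMatch`, `removeFromAll` and `copies`, and the use of `Set<String>` for the queues are all hypothetical stand-ins, and `LEVEL = 5` is just an illustrative value.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for LowRedundancyBlocks; names and types are assumptions.
class PriorityQueuesSketch {
  static final int LEVEL = 5; // illustrative number of priority queues
  final List<Set<String>> priorityQueues = new ArrayList<>();

  PriorityQueuesSketch() {
    for (int i = 0; i < LEVEL; i++) {
      priorityQueues.add(new HashSet<>());
    }
  }

  void add(String block, int priLevel) {
    priorityQueues.get(priLevel).add(block);
  }

  // Mirrors the quoted logic: returns after the first queue that held the block.
  boolean removeFirstMatch(String block, int priLevel) {
    if (priLevel >= 0 && priLevel < LEVEL
        && priorityQueues.get(priLevel).remove(block)) {
      return true;
    }
    for (int i = 0; i < LEVEL; i++) {
      if (i != priLevel && priorityQueues.get(i).remove(block)) {
        return true;
      }
    }
    return false;
  }

  // One possible alternative: visit every queue so no stale copy survives.
  boolean removeFromAll(String block) {
    boolean removed = false;
    for (Set<String> queue : priorityQueues) {
      removed |= queue.remove(block);
    }
    return removed;
  }

  // Counts how many queues still contain the block.
  int copies(String block) {
    int n = 0;
    for (Set<String> queue : priorityQueues) {
      if (queue.contains(block)) {
        n++;
      }
    }
    return n;
  }

  public static void main(String[] args) {
    PriorityQueuesSketch q = new PriorityQueuesSketch();
    q.add("blk_1", 1);
    q.add("blk_1", 3); // the same block sits in two queues
    q.removeFirstMatch("blk_1", 1);
    System.out.println("copies after first-match remove: " + q.copies("blk_1"));
    q.removeFromAll("blk_1");
    System.out.println("copies after remove-from-all: " + q.copies("blk_1"));
  }
}
```

With a block present in two queues, the first-match variant leaves one copy behind, which is the situation the description suspects behind the mismatched counts.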
[jira] [Commented] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues
[ https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180967#comment-17180967 ]

Fei Hui commented on HDFS-14852:

[~sodonnell] Upload v006 patch with your suggestion, Please review
[jira] [Updated] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues
[ https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-14852:

Attachment: HDFS-14852.006.patch
[jira] [Commented] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues
[ https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180485#comment-17180485 ]

Fei Hui commented on HDFS-14852:

[~sodonnell] [~kihwal] Can we move forward and fix this issue?
[jira] [Commented] (HDFS-15422) Reported IBR is partially replaced with stored info when queuing.
[ https://issues.apache.org/jira/browse/HDFS-15422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180482#comment-17180482 ]

Fei Hui commented on HDFS-15422:

[~kihwal] Thanks for reporting and for the fix. Can we push this fix to trunk?

> Reported IBR is partially replaced with stored info when queuing.
>
> Key: HDFS-15422
> URL: https://issues.apache.org/jira/browse/HDFS-15422
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Reporter: Kihwal Lee
> Priority: Critical
>
> When queueing an IBR (incremental block report) on a standby namenode, some of the reported information is being replaced with the existing stored information. This can lead to false block corruption.
> We had a namenode that, after transitioning to active, started reporting missing blocks with "SIZE_MISMATCH" as the corrupt reason. These were blocks that had been appended, and the sizes were actually correct on the datanodes. Upon further investigation, it was determined that the namenode was queueing IBRs with altered information.
> Although it sounds bad, I am not making it a blocker.
[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179393#comment-17179393 ]

Fei Hui commented on HDFS-15240:

[~marvelrock] Your patch fails to apply on the trunk branch. Could you please rebase it on trunk?

> Erasure Coding: dirty buffer causes reconstruction block error
>
> Key: HDFS-15240
> URL: https://issues.apache.org/jira/browse/HDFS-15240
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, erasure-coding
> Reporter: HuangTao
> Assignee: HuangTao
> Priority: Major
> Fix For: 3.4.0
> Attachments: HDFS-15240.001.patch, HDFS-15240.002.patch, HDFS-15240.003.patch, HDFS-15240.004.patch, HDFS-15240.005.patch, image-2020-07-16-15-56-38-608.png
>
> When reading some lzo files we found that some blocks were broken.
> I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k) directly from the DN, and chose 6 blocks (b0-b5) to decode the other 3 blocks (b6', b7', b8'). I then computed the longest common sequence (LCS) between b6' (decoded) and b6 (read from the DN), and likewise for b7'/b7 and b8'/b8.
> After iterating through every combination of 6 blocks from the block group, I found one case where the length of the LCS is the block length - 64KB. 64KB is exactly the length of the ByteBuffer used by StripedBlockReader, so the corrupt reconstructed block was produced by a dirty buffer.
> The following log snippet (showing only 2 of 28 cases) is the output of my check program. In my case, I knew that block 3 was corrupt, so I needed 5 other blocks to decode another 3 blocks, and then found that the LCS for block 1 is the block length - 64KB.
> This means blocks (0,1,2,4,5,6) were used to reconstruct block 3, and the dirty buffer was used before block 1 was read.
> Note that StripedBlockReader read from offset 0 of block 1 after the dirty buffer was used.
> {code:java} > decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8] > Check Block(1) first 131072 bytes longest common substring length 4 > Check Block(6) first 131072 bytes longest common substring length 4 > Check Block(8) first 131072 bytes longest common substring length 4 > decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8] > Check Block(1) first 131072 bytes longest common substring length 65536 > CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length > 27197440 # this one > Check Block(7) first 131072 bytes longest common substring length 4 > Check Block(8) first 131072 bytes longest common substring length 4{code} > Now I know the dirty buffer causes reconstruction block error, but how does > the dirty buffer come about? > After digging into the code and DN log, I found this following DN log is the > root reason. > {code:java} > [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel > java.nio.channels.SocketChannel[connected local=/:52586 > remote=/:50010]. 18 millis timeout left. 
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped > block: BP-714356632--1519726836856:blk_-YY_3472979393 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) > at > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) {code} > Reading from DN may timeout(hold by a future(F)) and output the INFO log, but > the futures that contains the future(F) is cleared, > {code:java} > return new StripingChunkReadResult(futures.remove(future), > StripingChunkReadResult.CANCELLED); {code} > futures.remove(future) cause NPE. So the EC reconstruction is failed. In the > finally phase, the code snippet in *getStripedReader().close()* > {code:java} > reconstructor.freeBuffer(reader.getReadBuffer()); > reader.freeReadBuffer(); > reader.closeBlockReader(); {code} > free buffer firstly, but the StripedBlockReader
[jira] [Updated] (HDFS-15514) Remove useless dfs.webhdfs.enabled
[ https://issues.apache.org/jira/browse/HDFS-15514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15514:

Affects Version/s: 3.0.3, 3.3.0, 3.2.1, 3.1.3

> Remove useless dfs.webhdfs.enabled
>
> Key: HDFS-15514
> URL: https://issues.apache.org/jira/browse/HDFS-15514
> Project: Hadoop HDFS
> Issue Type: Test
> Components: test
> Affects Versions: 3.0.3, 3.3.0, 3.2.1, 3.1.3
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Minor
> Attachments: HDFS-15514.001.patch
>
> After HDFS-7985 and HDFS-8349, "dfs.webhdfs.enabled" is no longer used. We should remove it from the code base.
[jira] [Updated] (HDFS-15514) Remove useless dfs.webhdfs.enabled
[ https://issues.apache.org/jira/browse/HDFS-15514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15514:

Priority: Minor (was: Major)
[jira] [Commented] (HDFS-15514) Remove useless dfs.webhdfs.enabled
[ https://issues.apache.org/jira/browse/HDFS-15514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172131#comment-17172131 ]

Fei Hui commented on HDFS-15514:

[~aajisaka][~ayushtkn] Could you please take a look? Thanks
[jira] [Updated] (HDFS-15514) Remove useless dfs.webhdfs.enabled
[ https://issues.apache.org/jira/browse/HDFS-15514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15514:

Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-15514) Remove useless dfs.webhdfs.enabled
[ https://issues.apache.org/jira/browse/HDFS-15514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15514:

Attachment: HDFS-15514.001.patch
[jira] [Created] (HDFS-15514) Remove useless dfs.webhdfs.enabled
Fei Hui created HDFS-15514:

Summary: Remove useless dfs.webhdfs.enabled
Key: HDFS-15514
URL: https://issues.apache.org/jira/browse/HDFS-15514
Project: Hadoop HDFS
Issue Type: Test
Components: test
Reporter: Fei Hui
Assignee: Fei Hui

After HDFS-7985 & HDFS-8349, "dfs.webhdfs.enabled" is useless. We should remove it from code base.
[jira] [Commented] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
[ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163518#comment-17163518 ]

Fei Hui commented on HDFS-13596:

[~fengwu99] You are right! The failure is related to HDFS-8791.

> NN restart fails after RollingUpgrade from 2.x to 3.x
>
> Key: HDFS-13596
> URL: https://issues.apache.org/jira/browse/HDFS-13596
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Reporter: Hanisha Koneru
> Assignee: Fei Hui
> Priority: Blocker
> Fix For: 3.3.0, 3.2.1, 3.1.3
> Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, HDFS-13596.006.patch, HDFS-13596.007.patch, HDFS-13596.008.patch, HDFS-13596.009.patch, HDFS-13596.010.patch
>
> After a rolling upgrade of the NN from 2.x to 3.x, if the NN is restarted, it fails while replaying edit logs.
> * After the NN is started with rollingUpgrade, the layoutVersion written to editLogs (before finalizing the upgrade) is the pre-upgrade layout version (so as to support downgrade).
> * When writing transactions to the log, the NN writes as per the current layout version. In 3.x, erasureCoding bits are added to the editLog transactions.
> * So any edit log written after the upgrade and before finalizing the upgrade will have the old layout version but the new format of transactions.
> * When the NN is restarted and the edit logs are replayed, the NN reads the old layout version from the editLog file. When parsing the transactions, it assumes that the transactions are also from the previous layout and hence skips parsing the erasureCoding bits.
> * This cascades into reading the wrong set of bits for other fields and leads to the NN shutting down.
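The mismatch in the bullets above (old version in the header, new record format in the body) can be illustrated with a toy serializer. This is only a sketch under assumptions: the class `LayoutMismatchSketch`, its record layout (one extra byte followed by a 4-byte length) and the version constants are hypothetical and do not reflect the real edit log format or real HDFS layout version numbers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Toy model of a versioned log; not the HDFS edit log format.
class LayoutMismatchSketch {
  static final int OLD_LAYOUT = 1; // hypothetical version numbers
  static final int NEW_LAYOUT = 2;

  // The writer always emits the new record format (with the extra byte),
  // but stamps whatever version it is told to put in the header.
  static byte[] write(int headerVersion, byte ecPolicy, int clientIdLength) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(headerVersion);
      out.writeByte(ecPolicy);      // field that only exists in the new format
      out.writeInt(clientIdLength);
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // The reader trusts the header: it consumes the extra byte only when
  // the header claims the new layout.
  static int readClientIdLength(byte[] log) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(log));
      int version = in.readInt();
      if (version >= NEW_LAYOUT) {
        in.readByte();
      }
      return in.readInt();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] consistent = write(NEW_LAYOUT, (byte) 0, 16);
    byte[] mismatched = write(OLD_LAYOUT, (byte) 0, 16);
    System.out.println("consistent header: " + readClientIdLength(consistent));
    System.out.println("mismatched header: " + readClientIdLength(mismatched));
  }
}
```

In this toy format the reader of the mismatched log is off by one byte, so every later field is decoded from the wrong position. The real failure involves different fields and encodings, but the cascade is of the same kind.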
> Sample error output: > {code:java} > java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected > length 16 > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74) > at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86) > at > org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163) > at > org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710) > 2018-05-17 19:10:06,522 WARN > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception > loading fsimage > 
java.io.IOException: java.lang.IllegalStateException: Cannot skip to less > than the current value (=16389), where newValue=16388 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632) > at >
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078329#comment-17078329 ]

Fei Hui commented on HDFS-15079:

Uploaded a workaround patch.
* Use the RetryCache of the NN
* A call may fail when the namenode fails over during a network anomaly, but it will not overwrite the file

> RBF: Client maybe get an unexpected result with network anomaly
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: rbf
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Priority: Critical
> Attachments: HDFS-15079.001.patch, HDFS-15079.002.patch, UnexpectedOverWriteUT.patch
>
> I found a critical problem in RBF. HDFS-15078 can resolve it in some scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client using RBF (r0, r1) creates an HDFS file via r0, gets an exception and fails over to r1.
> r0 has already sent the create RPC to the namenode (1st create).
> The client creates the HDFS file via r1 (2nd create).
> The client writes the HDFS file and finally closes it (3rd close).
> The namenode may receive the RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, the late 1st create replaces the written file with an empty one. This is a critical problem.
> We have encountered this problem. There are many Hive and Spark jobs running on our cluster, and it occurs from time to time.
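The RPC ordering described above can be replayed against a toy in-memory namespace. This is a sketch under assumptions, not router or namenode code: `CreateReorderSketch`, `create`, `writeAndClose` and `replay` are hypothetical names, and the map stands in for the namenode's file state. It only demonstrates the ordering hazard and is not the approach taken by the attached patches.

```java
import java.util.HashMap;
import java.util.Map;

// Toy namespace: path -> file content. Not Hadoop code.
class CreateReorderSketch {
  private final Map<String, String> files = new HashMap<>();

  // Creates an empty file; refuses if the file exists and overwrite is false.
  boolean create(String path, boolean overwrite) {
    if (files.containsKey(path) && !overwrite) {
      return false;
    }
    files.put(path, "");
    return true;
  }

  void writeAndClose(String path, String data) {
    files.put(path, data);
  }

  String read(String path) {
    return files.get(path);
  }

  // Replays the arrival order from the description: 2nd create, 3rd close, 1st create.
  static String replay(boolean overwrite) {
    CreateReorderSketch nn = new CreateReorderSketch();
    nn.create("/f", overwrite);      // 2nd create (via r1) arrives first
    nn.writeAndClose("/f", "data");  // 3rd: client writes and closes
    nn.create("/f", overwrite);      // delayed 1st create (via r0) arrives last
    return nn.read("/f");
  }

  public static void main(String[] args) {
    System.out.println("overwrite=true  -> \"" + replay(true) + "\"");
    System.out.println("overwrite=false -> \"" + replay(false) + "\"");
  }
}
```

With `overwrite=true` the delayed create truncates the already closed file. With `overwrite=false` the delayed create is rejected in this model, so the written data survives.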
[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15079:

Attachment: HDFS-15079.002.patch
[jira] [Updated] (HDFS-15084) RBF: Remove useless param nsId in RouterRpcClient#getConnection
[ https://issues.apache.org/jira/browse/HDFS-15084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15084:

Resolution: Won't Fix
Status: Resolved (was: Patch Available)

> RBF: Remove useless param nsId in RouterRpcClient#getConnection
>
> Key: HDFS-15084
> URL: https://issues.apache.org/jira/browse/HDFS-15084
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Trivial
> Attachments: HDFS-15084.001.patch
>
> The param nsId in RouterRpcClient#getConnection is useless.
> Maybe we should remove it.
[jira] [Commented] (HDFS-15240) Erasure Coding: dirty buffer causes reconstruction block error
[ https://issues.apache.org/jira/browse/HDFS-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067682#comment-17067682 ]

Fei Hui commented on HDFS-15240:

[~marvelrock] Good catch! Thanks for reporting and fixing. Could you please add a UT?

> Erasure Coding: dirty buffer causes reconstruction block error
>
> Key: HDFS-15240
> URL: https://issues.apache.org/jira/browse/HDFS-15240
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, erasure-coding
> Reporter: HuangTao
> Assignee: HuangTao
> Priority: Major
> Attachments: HDFS-15240.001.patch
>
> When reading some lzo files we found that some blocks were broken.
> I read back all internal blocks (b0-b8) of the block group (RS-6-3-1024k) directly from the DN, and chose 6 blocks (b0-b5) to decode the other 3 blocks (b6', b7', b8'). I then computed the longest common sequence (LCS) between b6' (decoded) and b6 (read from the DN), and likewise for b7'/b7 and b8'/b8.
> After iterating through every combination of 6 blocks from the block group, I found one case where the length of the LCS is the block length - 64KB. 64KB is exactly the length of the ByteBuffer used by StripedBlockReader, so the corrupt reconstructed block was produced by a dirty buffer.
> The following log snippet (showing only 2 of 28 cases) is the output of my check program. In my case, I knew that block 3 was corrupt, so I needed 5 other blocks to decode another 3 blocks, and then found that the LCS for block 1 is the block length - 64KB.
> This means blocks (0,1,2,4,5,6) were used to reconstruct block 3, and the dirty buffer was used before block 1 was read.
> Note that StripedBlockReader read from offset 0 of block 1 after the dirty buffer was used.
> {code:java} > decode from [0, 2, 3, 4, 5, 7] -> [1, 6, 8] > Check Block(1) first 131072 bytes longest common substring length 4 > Check Block(6) first 131072 bytes longest common substring length 4 > Check Block(8) first 131072 bytes longest common substring length 4 > decode from [0, 2, 3, 4, 5, 6] -> [1, 7, 8] > Check Block(1) first 131072 bytes longest common substring length 65536 > CHECK AGAIN: Block(1) all 27262976 bytes longest common substring length > 27197440 # this one > Check Block(7) first 131072 bytes longest common substring length 4 > Check Block(8) first 131072 bytes longest common substring length 4{code} > Now I know the dirty buffer causes reconstruction block error, but how does > the dirty buffer come about? > After digging into the code and DN log, I found this following DN log is the > root reason. > {code:java} > [INFO] [stripedRead-1017] : Interrupted while waiting for IO on channel > java.nio.channels.SocketChannel[connected local=/:52586 > remote=/:50010]. 18 millis timeout left. 
> [WARN] [StripedBlockReconstruction-199] : Failed to reconstruct striped > block: BP-714356632--1519726836856:blk_-YY_3472979393 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:94) > at > org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60) > at > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) {code} > Reading from DN may timeout(hold by a future(F)) and output the INFO log, but > the futures that contains the future(F) is cleared, > {code:java} > return new StripingChunkReadResult(futures.remove(future), > StripingChunkReadResult.CANCELLED); {code} > futures.remove(future) cause NPE. So the EC reconstruction is failed. In the > finally phase, the code snippet in *getStripedReader().close()* > {code:java} > reconstructor.freeBuffer(reader.getReadBuffer()); > reader.freeReadBuffer(); > reader.closeBlockReader(); {code} > free buffer firstly, but the StripedBlockReader still holds the buffer and > write it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To
[jira] [Commented] (HDFS-15223) FSCK fails if one namenode is not available
[ https://issues.apache.org/jira/browse/HDFS-15223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059972#comment-17059972 ] Fei Hui commented on HDFS-15223: [~ayushtkn] Thanks for reporting and fixing. +1 > FSCK fails if one namenode is not available > --- > > Key: HDFS-15223 > URL: https://issues.apache.org/jira/browse/HDFS-15223 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-15223-01.patch > > > If one namenode is not available FSCK should try on other namenode, ignoring > the namenode not available -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046298#comment-17046298 ] Fei Hui commented on HDFS-15186: +1 for HDFS-15186.005.patch. Failed tests are unrelated. > Erasure Coding: Decommission may generate the parity block's content with all > 0 in some case > > > Key: HDFS-15186 > URL: https://issues.apache.org/jira/browse/HDFS-15186 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Yao Guangdong >Assignee: Yao Guangdong >Priority: Critical > Attachments: HDFS-15186.001.patch, HDFS-15186.002.patch, > HDFS-15186.003.patch, HDFS-15186.004.patch, HDFS-15186.005.patch > > > When I decommission more than one DataNode from a cluster, I find parity blocks whose content is all zeros. The probability is high (on the order of parts per thousand). This is a big problem: if we read data from an all-zero parity block, or use it to recover another block, we end up using wrong data without knowing it. > The cases are as follows: > B: Busy DataNode, > D: Decommissioning DataNode, > others are normal. > 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)]. > 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)]. > > In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 6(B,D), > 7, 8(D)], the DN may receive a reconstruct-block command with > liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in > the class StripedReconstructionInfo) of length 2. > A targets length of 2 means that, in the current code, the DataNode needs to recover 2 > internal blocks. But from liveIndices only 1 > missing block can be found, so the method StripedWriter#initTargetIndices uses 0 as > the default block to recover, without checking whether index 0 is already among the source > indices. > The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to recover indices [6, 0]. > Index 0 is in both the > source indices and the target indices in this case, and the target > buffer returned for index 6 is always all zeros. At first I thought this > was a problem in the EC algorithm, because it should be more fault tolerant. I tried > to fix it there, but that is too hard because there are too many cases. The second case > in the example above is another one (source indices [1, 2, 3, 4, 5, 7] are used to > recover indices [0, 6, 0]). So I changed my mind: invoke the EC algorithm > with correct parameters, which means removing the duplicate target > index 0 in this case. In the end, I fixed it this way. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
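The idea behind the fix described above (calling the decoder only with sane target indices) can be sketched as follows. This is not the actual patch: the class TargetIndexSketch and the method sanitizeTargets are hypothetical, and the sketch makes the assumption that a target index should be dropped both when it repeats and when it is already one of the source indices.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

final class TargetIndexSketch {

  // Returns the targets with duplicates removed and without any index that
  // is already used as a source; the original order is preserved.
  static int[] sanitizeTargets(int[] sources, int[] targets) {
    Set<Integer> sourceSet = new HashSet<>();
    for (int s : sources) {
      sourceSet.add(s);
    }
    Set<Integer> kept = new LinkedHashSet<>();
    for (int t : targets) {
      if (!sourceSet.contains(t)) {
        kept.add(t); // a set silently ignores a repeated index
      }
    }
    int[] out = new int[kept.size()];
    int i = 0;
    for (int t : kept) {
      out[i++] = t;
    }
    return out;
  }

  public static void main(String[] args) {
    // Case 1 from the description: index 0 is both a source and a target.
    System.out.println(Arrays.toString(
        sanitizeTargets(new int[] {0, 1, 2, 3, 4, 5}, new int[] {6, 0})));
    // Case 2 from the description: index 0 appears twice among the targets.
    System.out.println(Arrays.toString(
        sanitizeTargets(new int[] {1, 2, 3, 4, 5, 7}, new int[] {0, 6, 0})));
  }
}
```

With the two cases from the description, the first call keeps only index 6 and the second keeps indices 0 and 6, so the decoder is never asked to rebuild a block it is also reading from.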
[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044123#comment-17044123 ] Fei Hui commented on HDFS-15186: [~yaoguangdong] Thanks for your patch. In HDFS-15186.002.patch the whole fix looks good. One minor comment: {quote} +//4. wait for decommissioning and not busy block to replicate +Thread.sleep(3000); {quote} It may be better to use GenericTestUtils.waitFor here instead of Thread.sleep. > Erasure Coding: Decommission may generate the parity block's content with all > 0 in some case > > > Key: HDFS-15186 > URL: https://issues.apache.org/jira/browse/HDFS-15186 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Yao Guangdong >Assignee: Yao Guangdong >Priority: Critical > Attachments: HDFS-15186.001.patch, HDFS-15186.002.patch
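The review suggestion above (poll for a condition instead of sleeping for a fixed time) can be illustrated with a small standalone helper. This is a sketch only: PollingWaitSketch and its waitFor method are hypothetical stand-ins written for this example, not Hadoop's GenericTestUtils implementation, and the parameter names are assumptions.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

final class PollingWaitSketch {

  // Polls the condition every checkEveryMillis until it holds, and gives up
  // with a TimeoutException once waitForMillis have elapsed.
  static void waitFor(BooleanSupplier check, long checkEveryMillis,
      long waitForMillis) throws TimeoutException, InterruptedException {
    long deadline = System.nanoTime() + waitForMillis * 1_000_000L;
    while (!check.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException(
            "Condition not met within " + waitForMillis + " ms");
      }
      Thread.sleep(checkEveryMillis);
    }
  }

  public static void main(String[] args) throws Exception {
    AtomicInteger polls = new AtomicInteger();
    // Stand-in condition: becomes true on the third poll.
    waitFor(() -> polls.incrementAndGet() >= 3, 10, 1000);
    System.out.println("condition met after " + polls.get() + " polls");
  }
}
```

Compared with a fixed Thread.sleep(3000), a test written this way returns as soon as the condition holds and fails with a clear timeout instead of a confusing assertion when the cluster is slow.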
[jira] [Comment Edited] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042780#comment-17042780 ] Fei Hui edited comment on HDFS-15186 at 2/23/20 2:52 AM: - [~yaoguangdong] Thanks for reporting this. Good catch. Sorry for the late reply, I couldn't receive emails these days. +1 for [~ayushtkn]'s suggestions. I think index 6 is in neither liveIndices nor the busy indices, and this causes the problem. Maybe we should fix it on the NameNode side. was (Author: ferhui): [~yaoguangdong]Thanks for reporting this !Good Catch! Sorry for late, I couldn't receive emails these days! +1 for [~ayushtkn] suggestions. I thinks indice[6] is not in liveindcies and busyindices, this cause this problem. Maybe we should fix it in namenode side. > Erasure Coding: Decommission may generate the parity block's content with all > 0 in some case > > > Key: HDFS-15186 > URL: https://issues.apache.org/jira/browse/HDFS-15186 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Yao Guangdong >Assignee: Yao Guangdong >Priority: Critical > Attachments: HDFS-15186.001.patch
[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042780#comment-17042780 ] Fei Hui commented on HDFS-15186: [~yaoguangdong]Thanks for reporting this !Good Catch! Sorry for late, I couldn't receive emails these days! +1 for [~ayushtkn] suggestions. I thinks indice[6] is not in liveindcies and busyindices, this cause this problem. Maybe we should fix it in namenode side. > Erasure Coding: Decommission may generate the parity block's content with all > 0 in some case > > > Key: HDFS-15186 > URL: https://issues.apache.org/jira/browse/HDFS-15186 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Yao Guangdong >Assignee: Yao Guangdong >Priority: Critical > Attachments: HDFS-15186.001.patch
[jira] [Commented] (HDFS-15092) TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
[ https://issues.apache.org/jira/browse/HDFS-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019445#comment-17019445 ] Fei Hui commented on HDFS-15092: [~surendrasingh] [~elgoiri] Could you please take a look? Thanks. > TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed > - > > Key: HDFS-15092 > URL: https://issues.apache.org/jira/browse/HDFS-15092 > Project: Hadoop HDFS > Issue Type: Test > Components: test >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Minor > Attachments: HDFS-15092.001.patch, HDFS-15092.002.patch > > > TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed > {quote} > java.lang.AssertionError: > Expected :5 > Actual :4 > > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at > org.apache.hadoop.hdfs.server.namenode.TestRedudantBlocks.testProcessOverReplicatedAndRedudantBlock(TestRedudantBlocks.java:138) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at >
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at org.junit.runner.JUnitCore.run(JUnitCore.java:137) > at > com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68) > at > com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51) > at > com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242) > at > com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70) > {quote} > Maybe we should increase sleep time -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15092) TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
[ https://issues.apache.org/jira/browse/HDFS-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019195#comment-17019195 ] Fei Hui commented on HDFS-15092: Sorry for late Upload v002 patch > TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed > - > > Key: HDFS-15092 > URL: https://issues.apache.org/jira/browse/HDFS-15092 > Project: Hadoop HDFS > Issue Type: Test > Components: test >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Minor > Attachments: HDFS-15092.001.patch, HDFS-15092.002.patch
[jira] [Updated] (HDFS-15092) TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
[ https://issues.apache.org/jira/browse/HDFS-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15092: --- Attachment: HDFS-15092.002.patch > TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed > - > > Key: HDFS-15092 > URL: https://issues.apache.org/jira/browse/HDFS-15092 > Project: Hadoop HDFS > Issue Type: Test > Components: test >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Minor > Attachments: HDFS-15092.001.patch, HDFS-15092.002.patch
[jira] [Commented] (HDFS-15084) RBF: Remove useless param nsId in RouterRpcClient#getConnection
[ https://issues.apache.org/jira/browse/HDFS-15084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008726#comment-17008726 ] Fei Hui commented on HDFS-15084: [~surendrasingh] Is there any progress on HDFS-13522? Useless code is obvious in the IDE, which is not so good for coders :( > RBF: Remove useless param nsId in RouterRpcClient#getConnection > --- > > Key: HDFS-15084 > URL: https://issues.apache.org/jira/browse/HDFS-15084 > Project: Hadoop HDFS > Issue Type: Improvement > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Trivial > Attachments: HDFS-15084.001.patch > > > The param nsId in RouterRpcClient#getConnection is useless. > Maybe we should remove it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15092) TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
[ https://issues.apache.org/jira/browse/HDFS-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15092: --- Status: Patch Available (was: Open) > TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed > - > > Key: HDFS-15092 > URL: https://issues.apache.org/jira/browse/HDFS-15092 > Project: Hadoop HDFS > Issue Type: Test > Components: test >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Minor > Attachments: HDFS-15092.001.patch
[jira] [Commented] (HDFS-15092) TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
[ https://issues.apache.org/jira/browse/HDFS-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006532#comment-17006532 ] Fei Hui commented on HDFS-15092: [~surendrasingh] Could you please take a look? I see you add this UT > TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed > - > > Key: HDFS-15092 > URL: https://issues.apache.org/jira/browse/HDFS-15092 > Project: Hadoop HDFS > Issue Type: Test > Components: test >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Minor > Attachments: HDFS-15092.001.patch
[jira] [Updated] (HDFS-15092) TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
[ https://issues.apache.org/jira/browse/HDFS-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15092: --- Attachment: HDFS-15092.001.patch > TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed > - > > Key: HDFS-15092 > URL: https://issues.apache.org/jira/browse/HDFS-15092 > Project: Hadoop HDFS > Issue Type: Test > Components: test >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Minor > Attachments: HDFS-15092.001.patch
[jira] [Created] (HDFS-15092) TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
Fei Hui created HDFS-15092: -- Summary: TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed Key: HDFS-15092 URL: https://issues.apache.org/jira/browse/HDFS-15092 Project: Hadoop HDFS Issue Type: Test Components: test Affects Versions: 3.3.0 Reporter: Fei Hui Assignee: Fei Hui TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed {quote} java.lang.AssertionError: Expected :5 Actual :4 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:834) at org.junit.Assert.assertEquals(Assert.java:645) at org.junit.Assert.assertEquals(Assert.java:631) at org.apache.hadoop.hdfs.server.namenode.TestRedudantBlocks.testProcessOverReplicatedAndRedudantBlock(TestRedudantBlocks.java:138) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.junit.runner.JUnitCore.run(JUnitCore.java:137) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68) at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51) at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70) {quote} Maybe we should increase sleep time -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
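The report above suggests increasing the sleep time. A common alternative for this kind of timing-dependent assertion is to poll until the expected state is reached, with an upper bound on the wait. The sketch below is only an illustration of that idea and is not taken from the attached patch; `WaitUtil` and `waitFor` are hypothetical names (Hadoop's own test utilities provide a comparable `GenericTestUtils.waitFor`, which this sketch does not depend on).

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: poll a condition instead of sleeping for a fixed time. */
final class WaitUtil {
  private WaitUtil() {}

  /**
   * Re-evaluates {@code check} every {@code intervalMs} milliseconds until it
   * returns true, or throws IllegalStateException after {@code timeoutMs}.
   */
  static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (!check.getAsBoolean()) {
      // Subtraction keeps the comparison safe if nanoTime() wraps around.
      if (System.nanoTime() - deadline >= 0) {
        throw new IllegalStateException(
            "Condition not met within " + timeoutMs + " ms");
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting", e);
      }
    }
  }
}
```

In a test, the assertion on the block count would then follow a call such as `WaitUtil.waitFor(() -> currentCount() == 5, 100, 10000)`, where `currentCount()` stands for whatever the test already uses to read the value.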
[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15079: --- Attachment: HDFS-15079.001.patch > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: HDFS-15079.001.patch, UnexpectedOverWriteUT.patch > > > I found a critical problem in RBF. HDFS-15078 can resolve it in some > scenarios, but I have no idea about the overall resolution. > The problem is as follows: > A client configured with RBF (r0, r1) creates an HDFS file via r0, gets an exception, and > fails over to r1 > r0 has already sent the create RPC to the namenode (1st create) > The client creates the HDFS file via r1 (2nd create) > The client writes the HDFS file and finally closes it (3rd close) > The namenode may receive the RPCs in the following order: > 2nd create > 3rd close > 1st create > Because overwrite is true by default, the file that had been written > becomes an empty file. This is a critical problem. > We have encountered this problem. Many Hive and Spark jobs run > on our cluster, and it occurs occasionally.
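The ordering described in the issue above can be replayed against a toy in-memory namespace. This is a minimal sketch for illustration only: `ToyNamespace` and `ReorderDemo` are hypothetical classes, not Hadoop code, and the model ignores leases, block allocation, and everything else a real namenode does. It only shows why a late-arriving create with overwrite enabled leaves an empty file behind.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for a namenode namespace; hypothetical, not Hadoop code. */
final class ToyNamespace {
  private final Map<String, String> files = new HashMap<>();

  /** Creates an empty file; replaces an existing one only if overwrite is true. */
  void create(String path, boolean overwrite) {
    if (files.containsKey(path) && !overwrite) {
      throw new IllegalStateException(path + " already exists");
    }
    files.put(path, "");
  }

  /** Appends data to an existing file and treats it as closed afterwards. */
  void writeAndClose(String path, String data) {
    if (!files.containsKey(path)) {
      throw new IllegalStateException(path + " does not exist");
    }
    files.put(path, files.get(path) + data);
  }

  String read(String path) {
    return files.get(path);
  }
}

final class ReorderDemo {
  private ReorderDemo() {}

  /** Replays the arrival order from the issue: 2nd create, 3rd close, 1st create. */
  static String run(boolean overwrite) {
    ToyNamespace nn = new ToyNamespace();
    String path = "/data/part-0";       // illustrative path
    nn.create(path, overwrite);         // 2nd create (via r1) arrives first
    nn.writeAndClose(path, "rows");     // 3rd: write and close
    try {
      nn.create(path, overwrite);       // 1st create (via r0) arrives last
    } catch (IllegalStateException e) {
      // Without overwrite, the toy namespace rejects the delayed create.
    }
    return nn.read(path);
  }
}
```

With `overwrite` set to true, `ReorderDemo.run` returns an empty string because the delayed create truncates the finished file; with false, the written data survives in this toy model.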
[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15079: --- Attachment: (was: HDFS-15079.001.patch) > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I found a critical problem in RBF. HDFS-15078 can resolve it in some > scenarios, but I have no idea about the overall resolution. > The problem is as follows: > A client configured with RBF (r0, r1) creates an HDFS file via r0, gets an exception, and > fails over to r1 > r0 has already sent the create RPC to the namenode (1st create) > The client creates the HDFS file via r1 (2nd create) > The client writes the HDFS file and finally closes it (3rd close) > The namenode may receive the RPCs in the following order: > 2nd create > 3rd close > 1st create > Because overwrite is true by default, the file that had been written > becomes an empty file. This is a critical problem. > We have encountered this problem. Many Hive and Spark jobs run > on our cluster, and it occurs occasionally.
[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15079: --- Status: Patch Available (was: Open) > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: HDFS-15079.001.patch, UnexpectedOverWriteUT.patch > > > I found a critical problem in RBF. HDFS-15078 can resolve it in some > scenarios, but I have no idea about the overall resolution. > The problem is as follows: > A client configured with RBF (r0, r1) creates an HDFS file via r0, gets an exception, and > fails over to r1 > r0 has already sent the create RPC to the namenode (1st create) > The client creates the HDFS file via r1 (2nd create) > The client writes the HDFS file and finally closes it (3rd close) > The namenode may receive the RPCs in the following order: > 2nd create > 3rd close > 1st create > Because overwrite is true by default, the file that had been written > becomes an empty file. This is a critical problem. > We have encountered this problem. Many Hive and Spark jobs run > on our cluster, and it occurs occasionally.
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005301#comment-17005301 ] Fei Hui commented on HDFS-15079: Uploaded a rough fix. I think CallerContext may be suitable here. If we change the router client's clientId and callId, it will cause problems for router client retry or failover. We may also add more fields to CallerContext if needed. Here are some open issues: * How should we normalize the CallerContext: JSON format or plain string? * Is checking callId for the same clientId suitable? Delayed callIds would be dropped. [~ayushtkn] [~elgoiri] [~hexiaoqiao] Any thoughts? > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: HDFS-15079.001.patch, UnexpectedOverWriteUT.patch > > > I found a critical problem in RBF. HDFS-15078 can resolve it in some > scenarios, but I have no idea about the overall resolution. > The problem is as follows: > A client configured with RBF (r0, r1) creates an HDFS file via r0, gets an exception, and > fails over to r1 > r0 has already sent the create RPC to the namenode (1st create) > The client creates the HDFS file via r1 (2nd create) > The client writes the HDFS file and finally closes it (3rd close) > The namenode may receive the RPCs in the following order: > 2nd create > 3rd close > 1st create > Because overwrite is true by default, the file that had been written > becomes an empty file. This is a critical problem. > We have encountered this problem. Many Hive and Spark jobs run > on our cluster, and it occurs occasionally.
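The second open question in the comment above (checking callId per clientId, so that delayed calls are dropped) can be pictured with a small guard that remembers the highest callId seen for each client. This is a hypothetical sketch, not the logic of the attached patch: `CallOrderGuard` and `admit` are invented names, and the sketch assumes callIds from one client increase monotonically.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical guard that rejects calls older than one already seen per client. */
final class CallOrderGuard {
  private final Map<String, Integer> highestSeen = new HashMap<>();

  /**
   * Returns true if the call should be processed, or false if its callId is
   * not newer than the highest callId already admitted for this clientId.
   */
  synchronized boolean admit(String clientId, int callId) {
    Integer last = highestSeen.get(clientId);
    if (last != null && callId <= last) {
      return false; // delayed (or repeated) call: drop it
    }
    highestSeen.put(clientId, callId);
    return true;
  }
}
```

Note that this simple form also rejects a call whose callId equals the last one seen, which would interfere with legitimate retries that reuse the same callId. That tension is part of what the comment asks reviewers about.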
[jira] [Commented] (HDFS-15084) RBF: Remove useless param nsId in RouterRpcClient#getConnection
[ https://issues.apache.org/jira/browse/HDFS-15084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005235#comment-17005235 ] Fei Hui commented on HDFS-15084: [~ayushtkn] [~weichiu] Could you please have a look? Thanks > RBF: Remove useless param nsId in RouterRpcClient#getConnection > --- > > Key: HDFS-15084 > URL: https://issues.apache.org/jira/browse/HDFS-15084 > Project: Hadoop HDFS > Issue Type: Improvement > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Trivial > Attachments: HDFS-15084.001.patch > > > The param nsId in RouterRpcClient#getConnection is unused. > Maybe we should remove it.
[jira] [Commented] (HDFS-15085) Erasure Coding: some ORC data can not be recovery when partial DataNodes are shut down
[ https://issues.apache.org/jira/browse/HDFS-15085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005191#comment-17005191 ] Fei Hui commented on HDFS-15085: [~zhangbutao] Thanks for reporting this. Could you please give more details? > Erasure Coding: some ORC data can not be recovery when partial DataNodes > are shut down > > > Key: HDFS-15085 > URL: https://issues.apache.org/jira/browse/HDFS-15085 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.1.0 >Reporter: zhangbutao >Priority: Major > > Test environment: Hadoop version 3.1.0, 5 DataNodes > Steps to reproduce: > 1. Set the EC policy RS-3-2-1024k on all HDFS paths: > hdfs ec -setPolicy -path / RS-3-2-1024k > 2. Put the small ORC file into HDFS: >
[jira] [Updated] (HDFS-15084) RBF: Remove useless param nsId in RouterRpcClient#getConnection
[ https://issues.apache.org/jira/browse/HDFS-15084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15084: --- Attachment: HDFS-15084.001.patch > RBF: Remove useless param nsId in RouterRpcClient#getConnection > --- > > Key: HDFS-15084 > URL: https://issues.apache.org/jira/browse/HDFS-15084 > Project: Hadoop HDFS > Issue Type: Improvement > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Trivial > Attachments: HDFS-15084.001.patch > > > The param nsId in RouterRpcClient#getConnection is useless. > Maybe we should remove it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15084) RBF: Remove useless param nsId in RouterRpcClient#getConnection
[ https://issues.apache.org/jira/browse/HDFS-15084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15084: --- Status: Patch Available (was: Open) > RBF: Remove useless param nsId in RouterRpcClient#getConnection > --- > > Key: HDFS-15084 > URL: https://issues.apache.org/jira/browse/HDFS-15084 > Project: Hadoop HDFS > Issue Type: Improvement > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Trivial > Attachments: HDFS-15084.001.patch > > > The param nsId in RouterRpcClient#getConnection is useless. > Maybe we should remove it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15084) RBF: Remove useless param nsId in RouterRpcClient#getConnection
Fei Hui created HDFS-15084: -- Summary: RBF: Remove useless param nsId in RouterRpcClient#getConnection Key: HDFS-15084 URL: https://issues.apache.org/jira/browse/HDFS-15084 Project: Hadoop HDFS Issue Type: Improvement Components: rbf Affects Versions: 3.3.0 Reporter: Fei Hui Assignee: Fei Hui The param nsId in RouterRpcClient#getConnection is useless. Maybe we should remove it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15081) Typo in RetryCache#waitForCompletion annotation
[ https://issues.apache.org/jira/browse/HDFS-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15081: --- Environment: (was: x) > Typo in RetryCache#waitForCompletion annotation > --- > > Key: HDFS-15081 > URL: https://issues.apache.org/jira/browse/HDFS-15081 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-15081.001.patch > > > Typo in RetryCache#waitForCompletion annotation > {code} > // Previous request has failed, the expectation is is that it will be > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15081) Typo in RetryCache#waitForCompletion annotation
[ https://issues.apache.org/jira/browse/HDFS-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15081: --- Status: Patch Available (was: Open) > Typo in RetryCache#waitForCompletion annotation > --- > > Key: HDFS-15081 > URL: https://issues.apache.org/jira/browse/HDFS-15081 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-15081.001.patch > > > Typo in RetryCache#waitForCompletion annotation > {code} > // Previous request has failed, the expectation is is that it will be > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15081) Typo in RetryCache#waitForCompletion annotation
[ https://issues.apache.org/jira/browse/HDFS-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15081: --- Environment: x (was: Typo in RetryCache#waitForCompletion annotation {code} // Previous request has failed, the expectation is is that it will be {code}) > Typo in RetryCache#waitForCompletion annotation > --- > > Key: HDFS-15081 > URL: https://issues.apache.org/jira/browse/HDFS-15081 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.0 > Environment: x >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-15081.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15081) Typo in RetryCache#waitForCompletion annotation
[ https://issues.apache.org/jira/browse/HDFS-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15081: --- Description: H > Typo in RetryCache#waitForCompletion annotation > --- > > Key: HDFS-15081 > URL: https://issues.apache.org/jira/browse/HDFS-15081 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.0 > Environment: x >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-15081.001.patch > > > H -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15081) Typo in RetryCache#waitForCompletion annotation
[ https://issues.apache.org/jira/browse/HDFS-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15081: --- Description: Typo in RetryCache#waitForCompletion annotation {code} // Previous request has failed, the expectation is is that it will be {code} was:H > Typo in RetryCache#waitForCompletion annotation > --- > > Key: HDFS-15081 > URL: https://issues.apache.org/jira/browse/HDFS-15081 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.0 > Environment: x >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-15081.001.patch > > > Typo in RetryCache#waitForCompletion annotation > {code} > // Previous request has failed, the expectation is is that it will be > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15081) Typo in RetryCache#waitForCompletion annotation
[ https://issues.apache.org/jira/browse/HDFS-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15081: --- Attachment: HDFS-15081.001.patch > Typo in RetryCache#waitForCompletion annotation > --- > > Key: HDFS-15081 > URL: https://issues.apache.org/jira/browse/HDFS-15081 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.0 > Environment: Typo in RetryCache#waitForCompletion annotation > {code} > // Previous request has failed, the expectation is is that it will be > {code} >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-15081.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15081) Typo in RetryCache#waitForCompletion annotation
[ https://issues.apache.org/jira/browse/HDFS-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17004006#comment-17004006 ] Fei Hui commented on HDFS-15081: Upload simple fix > Typo in RetryCache#waitForCompletion annotation > --- > > Key: HDFS-15081 > URL: https://issues.apache.org/jira/browse/HDFS-15081 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.0 > Environment: Typo in RetryCache#waitForCompletion annotation > {code} > // Previous request has failed, the expectation is is that it will be > {code} >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-15081.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15081) Typo in RetryCache#waitForCompletion annotation
Fei Hui created HDFS-15081: -- Summary: Typo in RetryCache#waitForCompletion annotation Key: HDFS-15081 URL: https://issues.apache.org/jira/browse/HDFS-15081 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 3.3.0 Environment: Typo in RetryCache#waitForCompletion annotation {code} // Previous request has failed, the expectation is is that it will be {code} Reporter: Fei Hui Assignee: Fei Hui -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003619#comment-17003619 ] Fei Hui commented on HDFS-15079: Thanks [~hexiaoqiao] {quote} ClientId & CallId of request from Router to NameNode are both created by Router itself {quote} Yes, I was wrong: I mistook clientName for clientId :( Digging in. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I found a critical problem in RBF. HDFS-15078 can resolve it in some > scenarios, but I have no idea about the overall resolution. > The problem is as follows: > A client configured with RBF (r0, r1) creates an HDFS file via r0, gets an exception, and > fails over to r1 > r0 has already sent the create RPC to the namenode (1st create) > The client creates the HDFS file via r1 (2nd create) > The client writes the HDFS file and finally closes it (3rd close) > The namenode may receive the RPCs in the following order: > 2nd create > 3rd close > 1st create > Because overwrite is true by default, the file that had been written > becomes an empty file. This is a critical problem. > We have encountered this problem. Many Hive and Spark jobs run > on our cluster, and it occurs occasionally.
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003430#comment-17003430 ] Fei Hui commented on HDFS-15079: [~ayushtkn] {quote} The logic that you intend to add already exists in the NameNode in the form of RetryCache. It checks whether the call is a repeated one due to failover; if so, it does not execute it again but sends the old response from the cache. {quote} Great. CallId is the ID I intended to add. CallId and clientId together may resolve the problem. For the NN, clientId comes from the client, but callId comes from the router client. I will try to dig in more too. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I found a critical problem in RBF. HDFS-15078 can resolve it in some > scenarios, but I have no idea about the overall resolution. > The problem is as follows: > A client configured with RBF (r0, r1) creates an HDFS file via r0, gets an exception, and > fails over to r1 > r0 has already sent the create RPC to the namenode (1st create) > The client creates the HDFS file via r1 (2nd create) > The client writes the HDFS file and finally closes it (3rd close) > The namenode may receive the RPCs in the following order: > 2nd create > 3rd close > 1st create > Because overwrite is true by default, the file that had been written > becomes an empty file. This is a critical problem. > We have encountered this problem. Many Hive and Spark jobs run > on our cluster, and it occurs occasionally.
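The retry-cache behaviour quoted in the comment above (a repeated call is answered from the cache rather than executed again) can be sketched as a map keyed by clientId and callId. This is a deliberately simplified illustration and not Hadoop's `RetryCache` implementation: `ToyRetryCache` and `execute` are hypothetical names, and expiry, failure handling, and in-progress calls are all left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical, simplified retry cache keyed by (clientId, callId). */
final class ToyRetryCache {
  private final Map<String, Object> responses = new ConcurrentHashMap<>();

  /**
   * Runs the operation the first time a (clientId, callId) pair is seen and
   * returns the cached response for any later call with the same pair.
   * A null result is not cached, so such an operation would run again.
   */
  @SuppressWarnings("unchecked")
  <T> T execute(String clientId, int callId, Supplier<T> operation) {
    // Assumes clientId contains no '#'; a real key would be a proper pair type.
    String key = clientId + "#" + callId;
    return (T) responses.computeIfAbsent(key, k -> operation.get());
  }
}
```

As the surrounding discussion notes, this only helps when the repeated call carries the same clientId and callId. Calls that reach the namenode through different routers may carry different identifiers, which is the gap being investigated in this thread.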
[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003264#comment-17003264 ]

Fei Hui commented on HDFS-15078:

[~elgoiri]
{quote}
Can we try to do it as an exception handling instead of proactively checking?
{quote}
Sorry, I didn't catch that. Before the check, everything looks fine. Could you please give some ideas?

[~ayushtkn]
{quote}
Router is supposed to just receive the call, and if it has received a valid call, it should in any case send to namenode.
{quote}
If the connection between the router and the client is closed, the result cannot be sent to the client. So both sending and not sending to the namenode seem reasonable, because the call has already failed for the client.

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 125 on caught an exception
> java.nio.channels.ClosedChannelException
> at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking connection between client and router is better before sending rpc to namenode
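The check discussed in HDFS-15078 can be illustrated with a small standalone sketch. It is not the patch itself: `ChannelCheckSketch`, `FakeChannel`, and `forwardIfOpen` are hypothetical names, and the real router works with Hadoop's IPC `Server` classes rather than a bare `java.nio.channels.Channel`. The sketch only shows the control flow, assuming the router can see the client's channel: refuse to forward a call when that channel is already closed. Note that `isOpen()` only reports whether this side has closed the channel, so a check like this narrows the window but cannot close it.

```java
import java.io.IOException;
import java.nio.channels.Channel;
import java.util.function.Supplier;

// Illustrative sketch only; all names here are hypothetical, not Hadoop APIs.
class ChannelCheckSketch {

  /** Minimal in-memory stand-in for the client connection's channel. */
  static final class FakeChannel implements Channel {
    private volatile boolean open = true;

    @Override
    public boolean isOpen() {
      return open;
    }

    @Override
    public void close() {
      open = false;
    }
  }

  /**
   * Runs the downstream call (for example an rpc to the namenode) only if the
   * client's channel is still open; otherwise fails fast without forwarding.
   */
  static <T> T forwardIfOpen(Channel clientChannel, Supplier<T> downstreamCall)
      throws IOException {
    if (!clientChannel.isOpen()) {
      throw new IOException("Connection channel to client is closed, call not forwarded");
    }
    return downstreamCall.get();
  }

  public static void main(String[] args) throws IOException {
    FakeChannel client = new FakeChannel();
    System.out.println(forwardIfOpen(client, () -> "forwarded"));

    client.close();
    try {
      forwardIfOpen(client, () -> "forwarded");
    } catch (IOException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

The client can still disconnect right after the check passes, which is why the thread treats this as an improvement and tracks the general problem in HDFS-15079.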
[jira] [Comment Edited] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003233#comment-17003233 ] Fei Hui edited comment on HDFS-15079 at 12/25/19 11:40 AM: --- [~ayushtkn][~elgoiri]Upload an overwrite UT, similar to HDFS-15078 was (Author: ferhui): [~ayushtkn][~elgoiri]Upload a overwrite UT, similar to HDFS-15078 > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003233#comment-17003233 ] Fei Hui commented on HDFS-15079: [~ayushtkn][~elgoiri]Upload a overwrite UT, similar to HDFS-15078 > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15079: --- Attachment: UnexpectedOverWriteUT.patch > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003153#comment-17003153 ] Fei Hui edited comment on HDFS-15079 at 12/25/19 9:09 AM: -- [~elgoiri] HDFS-15078 has a test case, it's one case for this. [~hexiaoqiao] Client gets Exception, but the exception is not that router throws. client logs as follow {quote} java.io.EOFException: End of File Exception between local host is: "xx.xx.xx.xx"; destination host is: "xx.xx.xx.xx":; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765) at org.apache.hadoop.ipc.Client.call(Client.java:1507) at org.apache.hadoop.ipc.Client.call(Client.java:1441) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy19.create(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:303) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:253) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy20.create(Unknown Source) at 
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:264) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1727) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1662) at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:503) at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:499) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:514) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:442) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:979) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:872) at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135) at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.initWriter(SparkHadoopWriter.scala:228) at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:122) at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at 
org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1008) {quote} I think maybe consistency is not guaranteed if do not resolve it on nn side. was (Author: ferhui): [~elgoiri] HDFS-15078 has a test case, it's one case for this. [~hexiaoqiao] Client gets Exception, but the exception is not that router throws. client logs as follow {quote} java.io.EOFException: End of File Exception between local host is: "xx.xx.xx.xx"; destination host is: "xx.xx.xx.xx":; : java.io.EOFException; For more details see:
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003153#comment-17003153 ] Fei Hui commented on HDFS-15079: [~elgoiri] HDFS-15078 has a test case, it's one case for this. [~hexiaoqiao] Client gets Exception, but the exception is not that router throws. client logs as follow {quote} java.io.EOFException: End of File Exception between local host is: "xx.xx.xx.xx"; destination host is: "xx.xx.xx.xx":; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765) at org.apache.hadoop.ipc.Client.call(Client.java:1507) at org.apache.hadoop.ipc.Client.call(Client.java:1441) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy19.create(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:303) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:253) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy20.create(Unknown Source) at 
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:264) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1727) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1662) at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:503) at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:499) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:514) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:442) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:979) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:872) at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135) at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.initWriter(SparkHadoopWriter.scala:228) at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:122) at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at 
org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1008) {quote} I think maybe consistency is not guaranteed if resolve it on nn side. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > > I find there is a
[jira] [Comment Edited] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002739#comment-17002739 ] Fei Hui edited comment on HDFS-15078 at 12/24/19 10:23 AM: --- {quote} The issue is the first router which sent the request that late, That client did failover to another router, triggered a new call and the second router completed the call, and the first call came after this. {quote} Getting EOFException makes client failover to another router. And later the second router completed the call, the first router sent the request late. If just the first router sent the request late, client doesn't get exception, it will not failover {quote} If the client crashed post the check, this scenario will again come, This doesn't seems to be a problem with the client crashing and the Router sending the request still to Namenode, If such a case where one Router is delaying, I think without client connection crashing still issues like these can come up. {quote} Yes. This issue only can resolve the problem on some scenarios and it's just an improvement. HDFS-15079 tracks the high level problem. In our scenarios. This fix works. was (Author: ferhui): {quote} The issue is the first router which sent the request that late, That client did failover to another router, triggered a new call and the second router completed the call, and the first call came after this. {quote} Getting EOFException makes client failover to another router. And later the second router completed the call, the first router sent the request late. If just the first router sent the request late, client doesn't get exception, it will not failover {quote} If such a case where one Router is delaying, I think without client connection crashing still issues like these can come up. {quote} Yes. This issue only can resolve the problem on some scenarios. HDFS-15079 tracks the high level problem. In our scenarios. This fix works. 
> RBF: Should check connection channel before sending rpc to namenode > --- > > Key: HDFS-15078 > URL: https://issues.apache.org/jira/browse/HDFS-15078 > Project: Hadoop HDFS > Issue Type: Improvement > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch > > > dfsrouter logs show that > {quote} > 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from > 10.83.164.11:56908 Call#2 Retry#0: output error > 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler > 125 on caught an exception > java.nio.channels.ClosedChannelException > at > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461) > at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731) > at org.apache.hadoop.ipc.Server.access$2100(Server.java:134) > at > org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089) > at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161) > at > org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109) > at > org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229) > at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245) > {quote} > Maybe checking connection between client and router is better before > sendingrpc to namenode -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002739#comment-17002739 ] Fei Hui edited comment on HDFS-15078 at 12/24/19 10:02 AM: --- {quote} The issue is the first router which sent the request that late, That client did failover to another router, triggered a new call and the second router completed the call, and the first call came after this. {quote} Getting EOFException makes client failover to another router. And later the second router completed the call, the first router sent the request late. If just the first router sent the request late, client doesn't get exception, it will not failover {quote} If such a case where one Router is delaying, I think without client connection crashing still issues like these can come up. {quote} Yes. This issue only can resolve the problem on some scenarios. HDFS-15079 tracks the high level problem. In our scenarios. This fix works. was (Author: ferhui): {quote} The issue is the first router which c, That client did failover to another router, triggered a new call and the second router completed the call, and the first call came after this. {quote} Getting EOFException makes client failover to another router. And later and the second router completed the call, the first router the first router. {quote} If such a case where one Router is delaying, I think without client connection crashing still issues like these can come up. {quote} Yes. This issue only can resolve the problem on some scenarios. HDFS-15079 tracks the high level problem. In our scenarios. This fix works. 
> RBF: Should check connection channel before sending rpc to namenode > --- > > Key: HDFS-15078 > URL: https://issues.apache.org/jira/browse/HDFS-15078 > Project: Hadoop HDFS > Issue Type: Improvement > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch > > > dfsrouter logs show that > {quote} > 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from > 10.83.164.11:56908 Call#2 Retry#0: output error > 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler > 125 on caught an exception > java.nio.channels.ClosedChannelException > at > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461) > at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731) > at org.apache.hadoop.ipc.Server.access$2100(Server.java:134) > at > org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089) > at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161) > at > org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109) > at > org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229) > at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245) > {quote} > Maybe checking connection between client and router is better before > sendingrpc to namenode -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002762#comment-17002762 ]

Fei Hui commented on HDFS-15079:

General idea:
* The client generates an id and sends it with each call to the namenode
* The namenode keeps the last id for the file of each lease
* The namenode drops a call if its id is less than the last id

[~ayushtkn] [~elgoiri] [~hexiaoqiao] Any thoughts?

> RBF: Client maybe get an unexpected result with network anomaly
> ---
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: rbf
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Priority: Critical
>
> I find there is a critical problem on RBF, HDFS-15078 can resolve it on some Scenarios, but i have no idea about the overall resolution.
> The problem is that
> Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and failovers to r1
> r0 has been send create rpc to namenode(1st create)
> Client create a HDFS file via r1(2nd create)
> Client writes the HDFS file and close it finally(3rd close)
> Maybe namenode receiving the rpc in order as follow
> 2nd create
> 3rd close
> 1st create
> And overwrite is true by default, this would make the file had been written an empty file. This is an critical problem
> We had encountered this problem. There are many hive and spark jobs running on our cluster, sometimes it occurs
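The general idea in the comment above can be sketched as a small filter. This is a minimal illustration under stated assumptions, not a design for the namenode: `CallIdFilterSketch` and `accept` are hypothetical names, the last id is keyed by file path only (the comment speaks of "the file of each lease"), and ids are assumed to be monotonically increasing per client.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and keying scheme are assumptions.
class CallIdFilterSketch {

  // Last accepted call id per file path (assumed key; the proposal ties it to the lease).
  private final Map<String, Long> lastIdByPath = new HashMap<>();

  /**
   * Returns true if the call should be applied, or false if its id is less
   * than the last id seen for the path, in which case it is dropped as stale.
   */
  synchronized boolean accept(String path, long callId) {
    Long last = lastIdByPath.get(path);
    if (last != null && callId < last) {
      return false;
    }
    lastIdByPath.put(path, callId);
    return true;
  }

  public static void main(String[] args) {
    CallIdFilterSketch filter = new CallIdFilterSketch();
    // Arrival order from the issue: 2nd create, 3rd close, then the delayed 1st create.
    System.out.println(filter.accept("/out/part-0", 2)); // applied
    System.out.println(filter.accept("/out/part-0", 3)); // applied
    System.out.println(filter.accept("/out/part-0", 1)); // dropped as stale
  }
}
```

With a filter like this the delayed first create would be rejected instead of truncating the finished file. How ids would be generated and carried through the router is left open in the thread.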
[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15079: --- Summary: RBF: Client maybe get an unexpected result with network anomaly (was: RBF: Client may get an unexpected result with network anomaly ) > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002739#comment-17002739 ]

Fei Hui commented on HDFS-15078:

{quote}
The issue is the first router which sent the request that late, That client did failover to another router, triggered a new call and the second router completed the call, and the first call came after this.
{quote}
Getting an EOFException makes the client fail over to another router. Later the second router completed the call, and the first router sent the request late.
{quote}
If such a case where one Router is delaying, I think without client connection crashing still issues like these can come up.
{quote}
Yes. This issue can only resolve the problem in some scenarios. HDFS-15079 tracks the high-level problem. In our scenarios, this fix works.

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 125 on caught an exception
> java.nio.channels.ClosedChannelException
> at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking connection between client and router is better before sending rpc to namenode
[jira] [Commented] (HDFS-12999) When reach the end of the block group, it may not need to flush all the data packets(flushAllInternals) twice.
[ https://issues.apache.org/jira/browse/HDFS-12999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002720#comment-17002720 ]

Fei Hui commented on HDFS-12999:

Yes, [~figo] doesn't seem to be active nowadays. Uploaded the v003 patch on his behalf.
[~ayushtkn] please review.

> When reach the end of the block group, it may not need to flush all the data packets(flushAllInternals) twice.
> ---
>
> Key: HDFS-12999
> URL: https://issues.apache.org/jira/browse/HDFS-12999
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: erasure-coding, hdfs-client
> Affects Versions: 3.0.0-beta1, 3.1.0
> Reporter: lufei
> Assignee: lufei
> Priority: Major
> Attachments: HDFS-12999.001.patch, HDFS-12999.002.patch, HDFS-12999.003.patch
>
> In order to make the process simplification. It's no need to flush all the data packets(flushAllInternals) twice, when reach the end of the block group.
[jira] [Updated] (HDFS-12999) When reach the end of the block group, it may not need to flush all the data packets(flushAllInternals) twice.
[ https://issues.apache.org/jira/browse/HDFS-12999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-12999: --- Attachment: HDFS-12999.003.patch > When reach the end of the block group, it may not need to flush all the data > packets(flushAllInternals) twice. > --- > > Key: HDFS-12999 > URL: https://issues.apache.org/jira/browse/HDFS-12999 > Project: Hadoop HDFS > Issue Type: Improvement > Components: erasure-coding, hdfs-client >Affects Versions: 3.0.0-beta1, 3.1.0 >Reporter: lufei >Assignee: lufei >Priority: Major > Attachments: HDFS-12999.001.patch, HDFS-12999.002.patch, > HDFS-12999.003.patch > > > In order to make the process simplification. It's no need to flush all the > data packets(flushAllInternals) twice,when reach the end of the block group. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15079) RBF: Client may get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15079: --- Issue Type: Bug (was: Improvement) > RBF: Client may get an unexpected result with network anomaly > -- > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15079) RBF: Client may get an unexpected result with network anomaly
Fei Hui created HDFS-15079:

Summary: RBF: Client may get an unexpected result with network anomaly
Key: HDFS-15079
URL: https://issues.apache.org/jira/browse/HDFS-15079
Project: Hadoop HDFS
Issue Type: Improvement
Components: rbf
Affects Versions: 3.3.0
Reporter: Fei Hui

I found a critical problem in RBF. HDFS-15078 can resolve it in some scenarios, but I have no idea about an overall resolution.

The problem is as follows:
* A client using RBF (routers r0 and r1) creates an HDFS file via r0, gets an exception, and fails over to r1
* r0 has already sent the create rpc to the namenode (1st create)
* The client creates the HDFS file via r1 (2nd create)
* The client writes the HDFS file and finally closes it (3rd close)

The namenode may receive the rpcs in the following order:
* 2nd create
* 3rd close
* 1st create

Since overwrite is true by default, the late 1st create overwrites the file that has already been written and leaves an empty file. This is a critical problem.

We have encountered this problem. Many hive and spark jobs run on our cluster, and it occurs occasionally.
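The reordering described in this issue can be reproduced with a toy model. The sketch below is only an illustration under simplifying assumptions: `ToyNamespace` is a hypothetical in-memory map standing in for the namenode, `create` with `overwrite=true` is modelled as replacing the file with empty content, and `close` is modelled as committing the written bytes. None of these names are HDFS APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative toy model only; not HDFS code.
class ReorderedCreateSketch {

  /** Hypothetical stand-in for the namenode's view of file contents. */
  static final class ToyNamespace {
    private final Map<String, byte[]> files = new HashMap<>();

    void create(String path, boolean overwrite) {
      if (files.containsKey(path) && !overwrite) {
        throw new IllegalStateException("File already exists: " + path);
      }
      files.put(path, new byte[0]); // overwrite replaces the file with an empty one
    }

    void close(String path, byte[] written) {
      files.put(path, written.clone()); // commit what the client wrote
    }

    int length(String path) {
      return files.get(path).length;
    }
  }

  /** Replays the rpcs in the arrival order from the issue and returns the final file length. */
  static int replayIssueOrder() {
    ToyNamespace nn = new ToyNamespace();
    String path = "/out/part-0";
    nn.create(path, true);                                      // 2nd create, via r1
    nn.close(path, "rows".getBytes(StandardCharsets.UTF_8));    // 3rd close, via r1
    nn.create(path, true);                                      // delayed 1st create, via r0
    return nn.length(path);
  }

  public static void main(String[] args) {
    System.out.println("final length = " + replayIssueOrder());
  }
}
```

In this model the client sees a successful write and close, yet the final length is 0 because the delayed create is applied last.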
[jira] [Comment Edited] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002714#comment-17002714 ]

Fei Hui edited comment on HDFS-15078 at 12/24/19 7:58 AM:
--
This fix can resolve some scenarios. The logs are as follows:
{quote}
2019-12-24 15:46:20,717 INFO org.apache.hadoop.ipc.Server: IPC Server handler 53 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.xx.xx.xx:60980 Call#18 Retry#0: java.io.IOException: Connection Channel to 10.xx.xx.xx of xxx (auth:SIMPLE) is closed!
2019-12-24 15:46:20,718 WARN org.apache.hadoop.ipc.Server: IPC Server handler 53 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.xx.xx.xx:60980 Call#18 Retry#0: output error
2019-12-24 15:46:20,718 INFO org.apache.hadoop.ipc.Server: IPC Server handler 53 on caught an exception
java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
    at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2738)
    at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
    at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1096)
    at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1168)
    at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2116)
    at org.apache.hadoop.ipc.Server$Connection.access$500(Server.java:1236)
    at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:638)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2252)
{quote}

was (Author: ferhui):
This fix can resolve some logs, as follows:
{quote}
2019-12-24 15:46:20,717 INFO org.apache.hadoop.ipc.Server: IPC Server handler 53 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.xx.xx.xx:60980 Call#18 Retry#0: java.io.IOException: Connection Channel to 10.xx.xx.xx of xxx (auth:SIMPLE) is closed!
2019-12-24 15:46:20,718 WARN org.apache.hadoop.ipc.Server: IPC Server handler 53 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.xx.xx.xx:60980 Call#18 Retry#0: output error
2019-12-24 15:46:20,718 INFO org.apache.hadoop.ipc.Server: IPC Server handler 53 on caught an exception
java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
    at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2738)
    at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
    at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1096)
    at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1168)
    at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2116)
    at org.apache.hadoop.ipc.Server$Connection.access$500(Server.java:1236)
    at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:638)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2252)
{quote}

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 125 on caught an exception
> java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
>     at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
>     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
>     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
>     at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
>     at org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
>     at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002714#comment-17002714 ]

Fei Hui commented on HDFS-15078:

This fix can resolve some logs, as follows:
{quote}
2019-12-24 15:46:20,717 INFO org.apache.hadoop.ipc.Server: IPC Server handler 53 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.xx.xx.xx:60980 Call#18 Retry#0: java.io.IOException: Connection Channel to 10.xx.xx.xx of xxx (auth:SIMPLE) is closed!
2019-12-24 15:46:20,718 WARN org.apache.hadoop.ipc.Server: IPC Server handler 53 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.xx.xx.xx:60980 Call#18 Retry#0: output error
2019-12-24 15:46:20,718 INFO org.apache.hadoop.ipc.Server: IPC Server handler 53 on caught an exception
java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
    at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2738)
    at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
    at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1096)
    at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1168)
    at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2116)
    at org.apache.hadoop.ipc.Server$Connection.access$500(Server.java:1236)
    at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:638)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2252)
{quote}

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 125 on caught an exception
> java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
>     at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
>     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
>     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
>     at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
>     at org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
>     at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe it is better to check the connection between client and router before sending the rpc to the namenode
[jira] [Comment Edited] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002603#comment-17002603 ]

Fei Hui edited comment on HDFS-15078 at 12/24/19 4:21 AM:
--
{quote}
if the client has triggered the request, I think it should go to the namenode, though it crashed after sending the request.
{quote}
On the client side, if it crashed, the client considers the call a failure and fails over to another namenode, but the call has actually succeeded.
{quote}
Moreover the case would be a rare scenario and this check would be done on every call, this would add unnecessary overhead to all calls.
{quote}
In a heavily loaded cluster, I see many output errors caused by java.nio.channels.ClosedChannelException.
There is a similar check on the namenode, in Handler#run:
{quote}
connDropped = !call.isOpen();
{quote}

was (Author: ferhui):
{quote}
Moreover the case would be a rare scenario and this check would be done on every call, this would add unnecessary overhead to all calls.
{quote}
In a heavily loaded cluster, I see many output errors caused by java.nio.channels.ClosedChannelException.
There is a similar check on the namenode, in Handler#run:
{quote}
connDropped = !call.isOpen();
{quote}

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 125 on caught an exception
> java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
>     at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
>     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
>     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
>     at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
>     at org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
>     at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe it is better to check the connection between client and router before sending the rpc to the namenode
[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002603#comment-17002603 ]

Fei Hui commented on HDFS-15078:

{quote}
Moreover the case would be a rare scenario and this check would be done on every call, this would add unnecessary overhead to all calls.
{quote}
In a heavily loaded cluster, I see many output errors caused by java.nio.channels.ClosedChannelException.
There is a similar check on the namenode, in Handler#run:
{quote}
connDropped = !call.isOpen();
{quote}

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 125 on caught an exception
> java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
>     at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
>     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
>     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
>     at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
>     at org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
>     at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe it is better to check the connection between client and router before sending the rpc to the namenode
[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002586#comment-17002586 ]

Fei Hui commented on HDFS-15078:

The v002 patch changes
{code}
if (curCall == null || !curCall.isOpen()) {
{code}
to
{code}
if (curCall != null && !curCall.isOpen()) {
{code}

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 125 on caught an exception
> java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
>     at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
>     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
>     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
>     at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
>     at org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
>     at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe it is better to check the connection between client and router before sending the rpc to the namenode
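The two conditions from the v001 and v002 patches differ only in how they treat a missing call object. The sketch below compares them on a hypothetical ToyCall interface; it is not the patch itself, and the names ToyCall, ChannelCheckDemo, rejectV001, and rejectV002 are invented for illustration. The comment does not say when curCall is null, so the "no call context" case here is an assumption.

```java
/** Hypothetical stand-in for the server-side call object; only isOpen() matters here. */
interface ToyCall {
    boolean isOpen();
}

class ChannelCheckDemo {
    /** v001 condition: a missing call is treated the same as a closed channel. */
    static boolean rejectV001(ToyCall curCall) {
        return curCall == null || !curCall.isOpen();
    }

    /** v002 condition: reject only when a call exists and its channel is closed. */
    static boolean rejectV002(ToyCall curCall) {
        return curCall != null && !curCall.isOpen();
    }

    public static void main(String[] args) {
        ToyCall open = () -> true;
        ToyCall closed = () -> false;
        System.out.println("null call:   v001=" + rejectV001(null) + " v002=" + rejectV002(null));
        System.out.println("open call:   v001=" + rejectV001(open) + " v002=" + rejectV002(open));
        System.out.println("closed call: v001=" + rejectV001(closed) + " v002=" + rejectV002(closed));
    }
}
```

The two versions agree for open and closed calls and disagree only for null: v001 would reject an invocation that has no call object, while v002 lets it through. Presumably that covers invocations made outside an RPC handler, but the source does not confirm the motivation.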
[jira] [Updated] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15078:
---
Attachment: HDFS-15078.002.patch

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 125 on caught an exception
> java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
>     at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
>     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
>     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
>     at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
>     at org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
>     at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe it is better to check the connection between client and router before sending the rpc to the namenode
[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002571#comment-17002571 ]

Fei Hui commented on HDFS-15078:

[~ayushtkn] [~elgoiri]
I found a critical problem in RBF. This issue can resolve it in some scenarios, but I have no idea about an overall resolution. I plan to file a new jira to track it.
The problem is as follows:
# A client using RBF with two routers (r0, r1) creates an HDFS file via r0, gets an exception, and fails over to r1
# r0 has already sent the create rpc to the namenode (1st create)
# The client creates the HDFS file via r1 (2nd create)
# The client writes the HDFS file and finally closes it (3rd close)

The namenode may receive the rpcs in the following order:
# 2nd create
# 3rd close
# 1st create

Because overwrite is true by default, the delayed 1st create replaces the file that had already been written with an empty file. This is a critical problem, and we have encountered it.

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: HDFS-15078.001.patch
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 125 on caught an exception
> java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
>     at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
>     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
>     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
>     at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
>     at org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
>     at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe it is better to check the connection between client and router before sending the rpc to the namenode
[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002374#comment-17002374 ]

Fei Hui commented on HDFS-15078:

[~ayushtkn] [~elgoiri] Could you please take a look? Thanks

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: HDFS-15078.001.patch
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 125 on caught an exception
> java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
>     at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
>     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
>     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
>     at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
>     at org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
>     at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe it is better to check the connection between client and router before sending the rpc to the namenode
[jira] [Updated] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15078:
---
Status: Patch Available (was: Open)

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: HDFS-15078.001.patch
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 125 on caught an exception
> java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
>     at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
>     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
>     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
>     at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
>     at org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
>     at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe it is better to check the connection between client and router before sending the rpc to the namenode
[jira] [Updated] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-15078:
---
Attachment: HDFS-15078.001.patch

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Affects Versions: 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: HDFS-15078.001.patch
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 125 on caught an exception
> java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
>     at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
>     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
>     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
>     at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
>     at org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
>     at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe it is better to check the connection between client and router before sending the rpc to the namenode