[jira] [Created] (HDFS-15394) Add all available fs.viewfs.overload.scheme.target.<scheme>.impl classes in core-default.xml by default.

2020-06-05 Thread Uma Maheswara Rao G (Jira)
Uma Maheswara Rao G created HDFS-15394:
--

 Summary: Add all available 
fs.viewfs.overload.scheme.target.<scheme>.impl classes in core-default.xml 
by default.
 Key: HDFS-15394
 URL: https://issues.apache.org/jira/browse/HDFS-15394
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: configuration, viewfs, viewfsOverloadScheme
Affects Versions: 3.2.1
Reporter: Uma Maheswara Rao G
Assignee: Uma Maheswara Rao G


This proposes to add all available 
fs.viewfs.overload.scheme.target.<scheme>.impl classes in core-default.xml, so 
that users need not configure them.
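
For illustration, a minimal sketch of what such per-scheme mappings look like when set through the Hadoop Configuration API; the exact key names and target classes below are assumptions, not the proposed core-default.xml contents.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class OverloadSchemeTargetSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Hypothetical per-scheme target impl keys; the proposal is to ship such
    // defaults in core-default.xml so users need not set them explicitly.
    conf.set("fs.viewfs.overload.scheme.target.hdfs.impl",
        "org.apache.hadoop.hdfs.DistributedFileSystem");
    conf.set("fs.viewfs.overload.scheme.target.file.impl",
        "org.apache.hadoop.fs.LocalFileSystem");
    System.out.println(conf.get("fs.viewfs.overload.scheme.target.hdfs.impl"));
  }
}
{code}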



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15389) DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme

2020-06-05 Thread Uma Maheswara Rao G (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uma Maheswara Rao G updated HDFS-15389:
---
Parent: HDFS-15289
Issue Type: Sub-task  (was: Bug)

> DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should 
> work with ViewFSOverloadScheme 
> --
>
> Key: HDFS-15389
> URL: https://issues.apache.org/jira/browse/HDFS-15389
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Attachments: HDFS-15389-01.patch
>
>
> Two issues here:
> First, prior to HDFS-15321, when DFSAdmin was closed, the FileSystem 
> associated with it was closed as part of the close method. But post HDFS-15321, 
> the {{FileSystem}} isn't stored as part of {{FsShell}}, hence during close 
> the FileSystem still stays open and isn't closed.
> * This is the reason for the failure of TestDFSHAAdmin
> Second: {{DfsAdmin -setBalancerBandwidth}} doesn't work with 
> {{ViewFSOverloadScheme}}, since setBalancerBandwidth calls {{getFS()}} 
> rather than {{getDFS()}}, which resolves the scheme in HDFS-15321.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15321) Make DFSAdmin tool to work with ViewFSOverloadScheme

2020-06-05 Thread Uma Maheswara Rao G (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127019#comment-17127019
 ] 

Uma Maheswara Rao G commented on HDFS-15321:


Thanks for reporting it [~ayushtkn], I have reviewed your PR in HDFS-15389. I 
understand this could be an issue for tests.

> Make DFSAdmin tool to work with ViewFSOverloadScheme
> 
>
> Key: HDFS-15321
> URL: https://issues.apache.org/jira/browse/HDFS-15321
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: dfsadmin, fs, viewfs
>Affects Versions: 3.2.1
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
>Priority: Major
>
> When we enable ViewFSOverloadScheme and use the hdfs scheme as the overloaded 
> scheme, users work with hdfs URIs. But here DFSAdmin expects the impl class 
> to be DistributedFileSystem. If the impl class is ViewFSOverloadScheme, it will 
> fail.
> So, when the impl is ViewFSOverloadScheme, we should get the corresponding 
> child hdfs to make DFSAdmin work.
> This Jira makes DFSAdmin work with ViewFSOverloadScheme.
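
As a hedged illustration of the point above (not the actual DFSAdmin code): the admin commands ultimately need a {{DistributedFileSystem}}; how the real change unwraps ViewFSOverloadScheme to its child HDFS is not shown here.
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ResolveDfsSketch {
  // Fails unless the resolved FileSystem really is a DistributedFileSystem,
  // mirroring the "impl class must be DistributedFileSystem" expectation.
  static DistributedFileSystem requireDfs(Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    if (fs instanceof DistributedFileSystem) {
      return (DistributedFileSystem) fs;
    }
    throw new IllegalArgumentException(
        "FileSystem is " + fs.getClass().getName() + ", not a DistributedFileSystem");
  }

  public static void main(String[] args) throws IOException {
    DistributedFileSystem dfs = requireDfs(new Configuration());
    dfs.setBalancerBandwidth(10L * 1024 * 1024); // illustrative value (10 MB/s)
  }
}
{code}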



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15389) DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme

2020-06-05 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127005#comment-17127005
 ] 

Ayush Saxena commented on HDFS-15389:
-

I have fixed the checkstyle issue in the PR: 
https://github.com/apache/hadoop/pull/2057

> DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should 
> work with ViewFSOverloadScheme 
> --
>
> Key: HDFS-15389
> URL: https://issues.apache.org/jira/browse/HDFS-15389
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Attachments: HDFS-15389-01.patch
>
>
> Two issues here:
> First, prior to HDFS-15321, when DFSAdmin was closed, the FileSystem 
> associated with it was closed as part of the close method. But post HDFS-15321, 
> the {{FileSystem}} isn't stored as part of {{FsShell}}, hence during close 
> the FileSystem still stays open and isn't closed.
> * This is the reason for the failure of TestDFSHAAdmin
> Second: {{DfsAdmin -setBalancerBandwidth}} doesn't work with 
> {{ViewFSOverloadScheme}}, since setBalancerBandwidth calls {{getFS()}} 
> rather than {{getDFS()}}, which resolves the scheme in HDFS-15321.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15330) Document the ViewFSOverloadScheme details in ViewFS guide

2020-06-05 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127001#comment-17127001
 ] 

Hudson commented on HDFS-15330:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18332 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/18332/])
HDFS-15330. Document the ViewFSOverloadScheme details in ViewFS guide. (github: 
rev 76fa0222f0d2e2d92b4a1eedba8b3e38002e8c23)
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSCommands.md
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/ViewFs.md
* (edit) hadoop-project/src/site/site.xml
* (add) 
hadoop-hdfs-project/hadoop-hdfs/src/site/resources/images/ViewFSOverloadScheme.png
* (add) 
hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/ViewFsOverloadScheme.md


> Document the ViewFSOverloadScheme details in ViewFS guide
> -
>
> Key: HDFS-15330
> URL: https://issues.apache.org/jira/browse/HDFS-15330
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: viewfs, viewfsOverloadScheme
>Affects Versions: 3.2.1
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
>Priority: Major
> Fix For: 3.4.0
>
>
> This Jira tracks the documentation of the ViewFSOverloadScheme usage guide.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15330) Document the ViewFSOverloadScheme details in ViewFS guide

2020-06-05 Thread Uma Maheswara Rao G (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uma Maheswara Rao G updated HDFS-15330:
---
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Thanks [~ayushsaxena] for the reviews! I have just committed it to trunk.

> Document the ViewFSOverloadScheme details in ViewFS guide
> -
>
> Key: HDFS-15330
> URL: https://issues.apache.org/jira/browse/HDFS-15330
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: viewfs, viewfsOverloadScheme
>Affects Versions: 3.2.1
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
>Priority: Major
> Fix For: 3.4.0
>
>
> This Jira tracks the documentation of the ViewFSOverloadScheme usage guide.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15393) Review of PendingReconstructionBlocks

2020-06-05 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126894#comment-17126894
 ] 

Hadoop QA commented on HDFS-15393:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 26m 
57s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m 20s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
41s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  3m  
3s{color} | {color:blue} Used deprecated FindBugs config; considering switching 
to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  
1s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
37s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
37s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 37s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 42s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 23 new + 126 unchanged - 3 fixed = 149 total (was 129) {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red}  0m 
40s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} shadedclient {color} | {color:red}  4m  
2s{color} | {color:red} patch has errors when building and testing our client 
artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
39s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 40s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
26s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 83m 10s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/hadoop-multibranch/job/PR-2055/1/artifact/out/Dockerfile
 |
| GITHUB PR | https://github.com/apache/hadoop/pull/2055 |
| JIRA Issue | HDFS-15393 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux 61475f59368a 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 
10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 23261237054 |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| mvninstall | 
https://builds.apache.org/job/hadoop-multibranch/job/PR-2055/1/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt
 |
| compile | 

[jira] [Updated] (HDFS-15389) DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme

2020-06-05 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-15389:

Description: 
Two issues here:
First, prior to HDFS-15321, when DFSAdmin was closed, the FileSystem associated 
with it was closed as part of the close method. But post HDFS-15321, the 
{{FileSystem}} isn't stored as part of {{FsShell}}, hence during close the 
FileSystem still stays open and isn't closed.
* This is the reason for the failure of TestDFSHAAdmin

Second: {{DfsAdmin -setBalancerBandwidth}} doesn't work with 
{{ViewFSOverloadScheme}}, since setBalancerBandwidth calls {{getFS()}} 
rather than {{getDFS()}}, which resolves the scheme in HDFS-15321.

  was:
Two Issues Here :
Firstly Prior to HDFS-15321, When DFSAdmin was closed the FileSystem associated 
with it was closed as part of close method, But post HDFS-15321, the 
{{FileSystem}} isn't stored as part of {{FsShell}}, hence during close, the 
FileSystem still stays and isn't close.
** This is the reason for failure of TestDFSHAAdmin

Second : {{DfsAdmin -setBalancerBandwidth}} doesn't work with 
{{ViewFSOverloadScheme}} since the setBalancerBandwidth calls {{getFS()}} 
rather than {{getDFS()}} which resolves the scheme in {{HDFS-15321}}


> DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should 
> work with ViewFSOverloadScheme 
> --
>
> Key: HDFS-15389
> URL: https://issues.apache.org/jira/browse/HDFS-15389
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Attachments: HDFS-15389-01.patch
>
>
> Two issues here:
> First, prior to HDFS-15321, when DFSAdmin was closed, the FileSystem 
> associated with it was closed as part of the close method. But post HDFS-15321, 
> the {{FileSystem}} isn't stored as part of {{FsShell}}, hence during close 
> the FileSystem still stays open and isn't closed.
> * This is the reason for the failure of TestDFSHAAdmin
> Second: {{DfsAdmin -setBalancerBandwidth}} doesn't work with 
> {{ViewFSOverloadScheme}}, since setBalancerBandwidth calls {{getFS()}} 
> rather than {{getDFS()}}, which resolves the scheme in HDFS-15321.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15390) client fails forever when namenode ipaddr changed

2020-06-05 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126883#comment-17126883
 ] 

Ayush Saxena commented on HDFS-15390:
-

Can you extend a UT for the issue?

> client fails forever when namenode ipaddr changed
> -
>
> Key: HDFS-15390
> URL: https://issues.apache.org/jira/browse/HDFS-15390
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfsclient
>Affects Versions: 2.10.0, 2.9.2, 3.2.1
>Reporter: Sean Chow
>Priority: Major
> Attachments: HDFS-15390.01.patch
>
>
> For a machine replacement, I replaced my standby namenode with a new ipaddr and 
> kept the same hostname. I also updated the clients' hosts file to make the 
> hostname resolve correctly.
> When I run a failover to transition to the new namenode (let's say nn2), the 
> client will fail to read or write forever until it's restarted.
> That puts the YARN nodemanager in a sick state. Even new tasks will encounter 
> this exception too, until all nodemanagers are restarted.
>  
> {code:java}
> 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: 
> nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000
> 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to 
> nn2-192-168-1-100/192.168.1.200:9000: Connection refused
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517)
> at org.apache.hadoop.ipc.Client.call(Client.java:1440)
> at org.apache.hadoop.ipc.Client.call(Client.java:1401)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
> at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> {code}
>  
> We can see the client has {{Address change detected}}, but it still fails. I 
> found out that's because when {{updateAddress()}} returns true, the 
> {{handleConnectionFailure()}} call throws an exception that breaks the next 
> retry with the right ipaddr.
> Client.java: setupConnection()
> {code:java}
> } catch (ConnectTimeoutException toe) {
>   /* Check for an address change and update the local reference.
>* Reset the failure counter if the address was changed
>*/
>   if (updateAddress()) {
> timeoutFailures = ioFailures = 0;
>   }
>   handleConnectionTimeout(timeoutFailures++,
>   maxRetriesOnSocketTimeouts, toe);
> } catch (IOException ie) {
>   if (updateAddress()) {
> timeoutFailures = ioFailures = 0;
>   }
> // because the namenode ip changed in updateAddress(), the old namenode 
> // ipaddress cannot be accessed now
> // handleConnectionFailure will throw an exception, so the next retry never has 
> // a chance to use the right server updated in updateAddress()
>   handleConnectionFailure(ioFailures++, ie);
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15390) client fails forever when namenode ipaddr changed

2020-06-05 Thread Sean Chow (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126638#comment-17126638
 ] 

Sean Chow edited comment on HDFS-15390 at 6/5/20, 2:57 PM:
---

It's easy to reproduce. You have set up HA namenodes, and a new machine with the 
same hostname as nn2 (standby) and a copy of its name-data directory.
 # Use {{hdfs dfs -put}} to write a big file (to make it a long-running client)
 # Stop the old nn2, and start the new nn2
 # Update the nn2 hostname to resolve to the new ipaddr on all hosts
 # Failover from nn1 to nn2
 # Now the client keeps erroring continuously. (In the YARN nodemanager 
scenario, this nodemanager is totally sick until restarted)

 

There are two ways to fix this: 
 # When updateAddress() returns true, do not handle the ConnectionFailure in that round 
 # When an address change is detected, update the namenode proxies (only with 
{{ConfiguredFailoverProxyProvider}})

Method one is easy, and within that connection's lifecycle the client will use the 
right {{server}} to connect. But when the client connection is closed and a new 
one is created, it will always try to getConnection with the retired ipaddr, 
because the namenode proxies are still the old ones.

Method two solves the root cause: every time the client fails over namenodes, it 
checks whether the ipaddr changed and, if so, re-initializes the namenode failover 
proxies.
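
A minimal, self-contained sketch of method one, with simplified stand-ins for the real org.apache.hadoop.ipc.Client internals (this is not the attached patch):
{code:java}
import java.io.IOException;
import java.net.ConnectException;

// Sketch: skip the connection-failure handling in the round where
// updateAddress() detected a new ipaddr, so the next retry can use it.
public class AddressChangeRetrySketch {
  private int ioFailures = 0;

  // Stand-in: pretend DNS re-resolution found that the namenode ipaddr changed.
  private boolean updateAddress() {
    return true;
  }

  // Stand-in: in the real client this may rethrow once retries are exhausted.
  private void handleConnectionFailure(int failures, IOException e) throws IOException {
    throw e;
  }

  public void setupConnection() throws IOException {
    try {
      // Simulate the connect attempt against the retired ipaddr being refused.
      throw new ConnectException("Connection refused");
    } catch (IOException ie) {
      if (updateAddress()) {
        // Address changed: reset the counter and return, letting the caller
        // retry with the updated address instead of failing here.
        ioFailures = 0;
        return;
      }
      handleConnectionFailure(ioFailures++, ie);
    }
  }

  public static void main(String[] args) throws IOException {
    new AddressChangeRetrySketch().setupConnection();
    System.out.println("retry will proceed with the updated address");
  }
}
{code}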


was (Author: seanlook):
There 's two way to fix this: 
 # When updateAddress is true, do not handle ConnectionFailure this round 
 # When address change detected, update namenode proxies (only with 
{{ConfiguredFailoverProxyProvider}})

Method one is easy, and in this connection lifecycle the client will use the 
right {{server}} to connect. But when the client connection closed and create a 
new one. It will always try to getConnection from the retired ipaddr, because 
the namenode proxies is still the old one.

Method two solve the root cause. Everytime the client failover namenodes, check 
ipaddr changed or not. If changed, re-initialize the namenode failover proxies.

> client fails forever when namenode ipaddr changed
> -
>
> Key: HDFS-15390
> URL: https://issues.apache.org/jira/browse/HDFS-15390
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfsclient
>Affects Versions: 2.10.0, 2.9.2, 3.2.1
>Reporter: Sean Chow
>Priority: Major
> Attachments: HDFS-15390.01.patch
>
>
> For a machine replacement, I replaced my standby namenode with a new ipaddr and 
> kept the same hostname. I also updated the clients' hosts file to make the 
> hostname resolve correctly.
> When I run a failover to transition to the new namenode (let's say nn2), the 
> client will fail to read or write forever until it's restarted.
> That puts the YARN nodemanager in a sick state. Even new tasks will encounter 
> this exception too, until all nodemanagers are restarted.
>  
> {code:java}
> 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: 
> nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000
> 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to 
> nn2-192-168-1-100/192.168.1.200:9000: Connection refused
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517)
> at org.apache.hadoop.ipc.Client.call(Client.java:1440)
> at org.apache.hadoop.ipc.Client.call(Client.java:1401)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
> at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> {code}
>  
> We can 

[jira] [Commented] (HDFS-15390) client fails forever when namenode ipaddr changed

2020-06-05 Thread Sean Chow (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126837#comment-17126837
 ] 

Sean Chow commented on HDFS-15390:
--

Patch attached.

Now we can see the exception is ignored when the address is updated, and the file 
is written successfully.
{code:java}
20/06/05 20:54:51 WARN ipc.Client: Address change detected. Old: 
nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000
20/06/05 20:54:51 DEBUG ipc.Client: Failed to connect to server: 
nn2-192-168-1-100/192.168.1.200:9000: try once and fail.
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
...
20/06/05 20:54:51 DEBUG hdfs.DFSOutputStream: enqueue full packet seqno: ...
20/06/05 20:54:51 DEBUG hdfs.DataStreamer: Queued packet 100076
20/06/05 20:54:51 WARN ipc.Client: Exception when handle ConnectionFailure: 
Connection refused
{code}

> client fails forever when namenode ipaddr changed
> -
>
> Key: HDFS-15390
> URL: https://issues.apache.org/jira/browse/HDFS-15390
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfsclient
>Affects Versions: 2.10.0, 2.9.2, 3.2.1
>Reporter: Sean Chow
>Priority: Major
> Attachments: HDFS-15390.01.patch
>
>
> For a machine replacement, I replaced my standby namenode with a new ipaddr and 
> kept the same hostname. I also updated the clients' hosts file to make the 
> hostname resolve correctly.
> When I run a failover to transition to the new namenode (let's say nn2), the 
> client will fail to read or write forever until it's restarted.
> That puts the YARN nodemanager in a sick state. Even new tasks will encounter 
> this exception too, until all nodemanagers are restarted.
>  
> {code:java}
> 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: 
> nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000
> 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to 
> nn2-192-168-1-100/192.168.1.200:9000: Connection refused
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517)
> at org.apache.hadoop.ipc.Client.call(Client.java:1440)
> at org.apache.hadoop.ipc.Client.call(Client.java:1401)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
> at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> {code}
>  
> We can see the client has {{Address change detected}}, but it still fails. I 
> found out that's because when {{updateAddress()}} returns true, the 
> {{handleConnectionFailure()}} call throws an exception that breaks the next 
> retry with the right ipaddr.
> Client.java: setupConnection()
> {code:java}
> } catch (ConnectTimeoutException toe) {
>   /* Check for an address change and update the local reference.
>* Reset the failure counter if the address was changed
>*/
>   if (updateAddress()) {
> timeoutFailures = ioFailures = 0;
>   }
>   handleConnectionTimeout(timeoutFailures++,
>   maxRetriesOnSocketTimeouts, toe);
> } catch (IOException ie) {
>   if (updateAddress()) {
> timeoutFailures = ioFailures = 0;
>   }
> // because the namenode ip changed in updateAddress(), the old namenode 
> // ipaddress cannot be accessed now
> // handleConnectionFailure will throw an exception, so the next retry never has 
> // a chance to use the right server updated in updateAddress()
>   

[jira] [Updated] (HDFS-15390) client fails forever when namenode ipaddr changed

2020-06-05 Thread Sean Chow (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Chow updated HDFS-15390:
-
Attachment: HDFS-15390.01.patch

> client fails forever when namenode ipaddr changed
> -
>
> Key: HDFS-15390
> URL: https://issues.apache.org/jira/browse/HDFS-15390
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfsclient
>Affects Versions: 2.10.0, 2.9.2, 3.2.1
>Reporter: Sean Chow
>Priority: Major
> Attachments: HDFS-15390.01.patch
>
>
> For a machine replacement, I replaced my standby namenode with a new ipaddr and 
> kept the same hostname. I also updated the clients' hosts file to make the 
> hostname resolve correctly.
> When I run a failover to transition to the new namenode (let's say nn2), the 
> client will fail to read or write forever until it's restarted.
> That puts the YARN nodemanager in a sick state. Even new tasks will encounter 
> this exception too, until all nodemanagers are restarted.
>  
> {code:java}
> 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: 
> nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000
> 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to 
> nn2-192-168-1-100/192.168.1.200:9000: Connection refused
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517)
> at org.apache.hadoop.ipc.Client.call(Client.java:1440)
> at org.apache.hadoop.ipc.Client.call(Client.java:1401)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
> at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> {code}
>  
> We can see the client has {{Address change detected}}, but it still fails. I 
> found out that's because when {{updateAddress()}} returns true, the 
> {{handleConnectionFailure()}} call throws an exception that breaks the next 
> retry with the right ipaddr.
> Client.java: setupConnection()
> {code:java}
> } catch (ConnectTimeoutException toe) {
>   /* Check for an address change and update the local reference.
>* Reset the failure counter if the address was changed
>*/
>   if (updateAddress()) {
> timeoutFailures = ioFailures = 0;
>   }
>   handleConnectionTimeout(timeoutFailures++,
>   maxRetriesOnSocketTimeouts, toe);
> } catch (IOException ie) {
>   if (updateAddress()) {
> timeoutFailures = ioFailures = 0;
>   }
> // because the namenode ip changed in updateAddress(), the old namenode 
> // ipaddress cannot be accessed now
> // handleConnectionFailure will throw an exception, so the next retry never has 
> // a chance to use the right server updated in updateAddress()
>   handleConnectionFailure(ioFailures++, ie);
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15393) Review of PendingReconstructionBlocks

2020-06-05 Thread David Mollitor (Jira)
David Mollitor created HDFS-15393:
-

 Summary: Review of PendingReconstructionBlocks
 Key: HDFS-15393
 URL: https://issues.apache.org/jira/browse/HDFS-15393
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


I started looking at this class based on [HDFS-15351].

* Uses {{java.sql.Time}} unnecessarily. This is confusing since Java ships with 
time formatters out of the box in JDK 8, and I believe it will cause issues later 
when upgrading to JDK 9+ since java.sql is a separate module there.
* Remove code where appropriate
* Use the Java concurrent library for higher concurrent access to the underlying map
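
As a hedged illustration of the first point (this is not code from PendingReconstructionBlocks), the same timestamp formatting can be done with java.time instead of java.sql.Time:
{code:java}
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class TimeFormattingSketch {
  public static void main(String[] args) {
    long timestampMillis = System.currentTimeMillis();

    // Legacy style: java.sql.Time pulls in the SQL types just to print a time of day.
    System.out.println(new java.sql.Time(timestampMillis));

    // JDK 8+ java.time equivalent, with no dependency on the java.sql module.
    DateTimeFormatter fmt = DateTimeFormatter.ofPattern("HH:mm:ss")
        .withZone(ZoneId.systemDefault());
    System.out.println(fmt.format(Instant.ofEpochMilli(timestampMillis)));
  }
}
{code}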



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15359) EC: Allow closing a file with committed blocks

2020-06-05 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126794#comment-17126794
 ] 

Hudson commented on HDFS-15359:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18331 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/18331/])
HDFS-15359. EC: Allow closing a file with committed blocks. Contributed 
(ayushsaxena: rev 2326123705445dee534ac2c298038831b5d04a0a)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/INodeFile.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDistributedFileSystem.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java


> EC: Allow closing a file with committed blocks
> --
>
> Key: HDFS-15359
> URL: https://issues.apache.org/jira/browse/HDFS-15359
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-15359-01.patch, HDFS-15359-02.patch, 
> HDFS-15359-03.patch, HDFS-15359-04.patch, HDFS-15359-05.patch
>
>
> Presently, {{dfs.namenode.file.close.num-committed-allowed}} is ignored in 
> the case of EC blocks. But under heavy load, IBRs from the Datanode may get 
> delayed and cause the file write to fail. So, we can allow EC files to close 
> with blocks in the committed state, as is done for replicated (REP) files.
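
For reference, a hedged sketch of how the existing knob is set through the Configuration API; the value is illustrative, and whether EC files honor it is exactly what this change addresses.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class CommittedAllowedSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Allow closing a file while up to one block is still COMMITTED (not yet COMPLETE).
    conf.setInt("dfs.namenode.file.close.num-committed-allowed", 1);
    System.out.println(conf.getInt("dfs.namenode.file.close.num-committed-allowed", 0));
  }
}
{code}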



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15359) EC: Allow closing a file with committed blocks

2020-06-05 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-15359:

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> EC: Allow closing a file with committed blocks
> --
>
> Key: HDFS-15359
> URL: https://issues.apache.org/jira/browse/HDFS-15359
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-15359-01.patch, HDFS-15359-02.patch, 
> HDFS-15359-03.patch, HDFS-15359-04.patch, HDFS-15359-05.patch
>
>
> Presently, {{dfs.namenode.file.close.num-committed-allowed}} is ignored in 
> the case of EC blocks. But under heavy load, IBRs from the Datanode may get 
> delayed and cause the file write to fail. So, we can allow EC files to close 
> with blocks in the committed state, as is done for replicated (REP) files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15359) EC: Allow closing a file with committed blocks

2020-06-05 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126777#comment-17126777
 ] 

Ayush Saxena commented on HDFS-15359:
-

Committed to trunk.
Thanx [~vinayakumarb] and [~weichiu] for the reviews!!!

> EC: Allow closing a file with committed blocks
> --
>
> Key: HDFS-15359
> URL: https://issues.apache.org/jira/browse/HDFS-15359
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Attachments: HDFS-15359-01.patch, HDFS-15359-02.patch, 
> HDFS-15359-03.patch, HDFS-15359-04.patch, HDFS-15359-05.patch
>
>
> Presently, {{dfs.namenode.file.close.num-committed-allowed}} is ignored in 
> the case of EC blocks. But under heavy load, IBRs from the Datanode may get 
> delayed and cause the file write to fail. So, we can allow EC files to close 
> with blocks in the committed state, as is done for replicated (REP) files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15351) Blocks Scheduled Count was wrong on Truncate

2020-06-05 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126748#comment-17126748
 ] 

David Mollitor commented on HDFS-15351:
---

Thanks for pinging me [~hemanthboyina] a few times. I have been a bit all over 
the place, so thanks for your persistence and patience.

We should probably be using {{Collection}} classes instead of native arrays, but 
that's not for this ticket.
{code:java}
// Decrement the blocks-scheduled count on each target storage of the removed block.
PendingBlockInfo remove = pendingReconstruction.remove(lastBlock);
if (remove != null) {
  List<DatanodeStorageInfo> locations = remove.getTargets();
  DatanodeStorageInfo.decrementBlocksScheduled(
      locations.toArray(new DatanodeStorageInfo[0]));
}
{code}

> Blocks Scheduled Count was wrong on Truncate 
> -
>
> Key: HDFS-15351
> URL: https://issues.apache.org/jira/browse/HDFS-15351
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15351.001.patch, HDFS-15351.002.patch, 
> HDFS-15351.003.patch
>
>
> On truncate and append we remove the blocks from the reconstruction queue. 
> On removing the blocks from pending reconstruction, we need to decrement the 
> Blocks Scheduled count.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15391) Standby NameNode cannot load the edit log correctly due to edit log corruption, resulting in the service exiting abnormally and unable to restart

2020-06-05 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15391:
-
Summary: Standby NameNode cannot load the edit log correctly due to edit 
log corruption, resulting in the service exiting abnormally and unable to 
restart  (was: Due to edit log corruption, Standby NameNode could not properly 
load the Ediltog log, result in abnormal exit of the service and failure to 
restart)

> Standby NameNode cannot load the edit log correctly due to edit log 
> corruption, resulting in the service exiting abnormally and unable to restart
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In a version 3.2.0 production cluster,
>  we found that due to edit log corruption, the Standby NameNode could not 
> properly load the edit log, resulting in abnormal exit of the service and 
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS(replication file), and in 
> the case of an exception to the write file, the following operations are 
> performed :
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}
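
A hedged sketch of the close/open/truncate/append sequence described above, using the generic FileSystem API (the path and sizes are made up, this is not the Flink writer, and it assumes fs.defaultFS points at an HDFS cluster):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CloseTruncateAppendSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/tmp/close-truncate-append.dat"); // hypothetical path

    // 1. write and close the file
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write(new byte[1024]);
    }

    // 2./3. truncate the file to a smaller length; truncate() returns false
    // while block recovery for the last block is still in progress.
    boolean done = fs.truncate(file, 512);

    // 4. append more data; a real client would wait for the truncate recovery
    // to finish before appending.
    if (done) {
      try (FSDataOutputStream out = fs.append(file)) {
        out.write(new byte[256]);
      }
    }
  }
}
{code}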



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart

2020-06-05 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126740#comment-17126740
 ] 

huhaiyang commented on HDFS-15391:
--

[~ayushtkn] Thank you for the reply.
{quote}
   These are two different traces, correct?
   You tried restarting the namenode twice, and once it failed for CLOSE_OP and 
other time with TRUNCATE, 
   Correct?
{quote}
Yes, these are two different traces; I'll add more details later.

> Due to edit log corruption, Standby NameNode could not properly load the 
> Ediltog log, result in abnormal exit of the service and failure to restart
> ---
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In a version 3.2.0 production cluster,
>  we found that due to edit log corruption, the Standby NameNode could not 
> properly load the edit log, resulting in abnormal exit of the service and 
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS(replication file), and in 
> the case of an exception to the write file, the following operations are 
> performed :
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart

2020-06-05 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126730#comment-17126730
 ] 

Ayush Saxena edited comment on HDFS-15391 at 6/5/20, 12:31 PM:
---

Thanx,
These are two different traces, correct?
You tried restarting the namenode twice, and once it failed for CLOSE_OP and 
the other time for TRUNCATE, correct?

What was the exception during write?


was (Author: ayushtkn):
Thanx,
These are two different traces, correct?
You tried restarting the namenode twice, and once it failed for CLOSE_OP and 
other time with TRUNCATE, Correct?

> Due to edit log corruption, Standby NameNode could not properly load the 
> Ediltog log, result in abnormal exit of the service and failure to restart
> ---
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In a version 3.2.0 production cluster,
>  we found that due to edit log corruption, the Standby NameNode could not 
> properly load the edit log, resulting in abnormal exit of the service and 
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS(replication file), and in 
> the case of an exception to the write file, the following operations are 
> performed :
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15392) DistributedFileSystem#concat API can create a large number of small blocks

2020-06-05 Thread Lokesh Jain (Jira)
Lokesh Jain created HDFS-15392:
--

 Summary: DistributedFileSystem#concat API can create a large number 
of small blocks
 Key: HDFS-15392
 URL: https://issues.apache.org/jira/browse/HDFS-15392
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Lokesh Jain


DistributedFileSystem#concat moves blocks from the source files to the target. If 
the API is repeatedly used on small files, it can create a large number of small 
blocks in the target file. This Jira aims to optimize the API to avoid the 
small-blocks issue.
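
A hedged sketch of the usage pattern that triggers this (the paths are made up and it assumes fs.defaultFS points at an HDFS cluster):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcatSmallFilesSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path target = new Path("/data/merged.dat"); // hypothetical paths

    // Each concat moves the source's existing (small, final) blocks into the
    // target as-is, so repeated calls on small files leave the target with
    // many blocks far below the configured block size.
    for (int i = 0; i < 100; i++) {
      fs.concat(target, new Path[] { new Path("/data/part-" + i) });
    }
  }
}
{code}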



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart

2020-06-05 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126724#comment-17126724
 ] 

huhaiyang edited comment on HDFS-15391 at 6/5/20, 12:31 PM:


Standby NameNode  exception log:
{noformat}
 2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation CloseOp [length=0, inodeId=0, path=path, replication=3, 
mtime=1591266620287, atime=1591264800229, blockSize=134217728, 
blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, 
blk_11382041307_10353383098, blk_11382049845_10353392031, 
blk_11382057341_10353399899, blk_11382071544_10353415171, 
blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, 
aclEntries=null, clientName=, clientMachine=, overwrite=false, 
storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585]
 java.io.IOException: File is not under construction: hdfs://path
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
 at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
 2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error 
encountered while tailing edits. Shutting down standby NN.
{noformat}
 
{noformat}
020-06-04 22:28:04,025 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation TruncateOp [src=xxxpath, 
clientName=DFSClient_NONMAPREDUCE_-295521672_77, clientMachine=xxx, 
newLength=3210623016, timestamp=1591270219348, 
truncateBlock=blk_11382198393_10355810378, opCode=OP_TRUNCATE, 
txid=126074587217]
 java.lang.IllegalStateException: file is already under construction
 at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
 at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.toUnderConstruction(INodeFile.java:329)
 at 
org.apache.hadoop.hdfs.server.namenode.FSDirTruncateOp.prepareFileForTruncate(FSDirTruncateOp.java:222)
 at 
org.apache.hadoop.hdfs.server.namenode.FSDirTruncateOp.unprotectedTruncate(FSDirTruncateOp.java:183)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:986)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753)
 at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1123)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:730)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:669)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:731)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:974)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:947)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1680)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1747)
 2020-06-04 22:28:04,027 WARN 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception 
loading fsimage
 java.io.IOException: java.lang.IllegalStateException: file is already under 
construction
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:268)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753)
 at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331)
 at 

[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart

2020-06-05 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15391:
-
Description: 
In a version 3.2.0 production cluster,
 we found that due to edit log corruption, the Standby NameNode could not properly 
load the edit log, resulting in abnormal exit of the service and failure to 
restart.
{code:java}
The specific scenario is that Flink writes to HDFS(replication file), and in 
the case of an exception to the write file, the following operations are 
performed :
1.close file
2.open file
3.truncate file
4.append file
{code}

  was:
In the cluster version 3.2.0 production environment,
 We found that due to edit log corruption, Standby NameNode could not properly 
load the Ediltog log, result in abnormal exit of the service and failure to 
restart
{code:java}
The specific scenario is that Flink writes to HDFS(replication file), and in 
the case of an exception to the write file, the following operations are 
performed 

{code}
 # Close file 
 # 2. Open file 
 # 3. truncate file 
 # 4. append file


> Due to edit log corruption, Standby NameNode could not properly load the 
> Ediltog log, result in abnormal exit of the service and failure to restart
> ---
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In a version 3.2.0 production cluster,
>  we found that due to edit log corruption, the Standby NameNode could not 
> properly load the edit log, resulting in abnormal exit of the service and 
> failure to restart.
> {code:java}
> The specific scenario is that Flink writes to HDFS(replication file), and in 
> the case of an exception to the write file, the following operations are 
> performed :
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart

2020-06-05 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15391:
-
Description: 
In a version 3.2.0 production cluster,
 we found that due to edit log corruption, the Standby NameNode could not properly 
load the edit log, resulting in abnormal exit of the service and failure to 
restart.

{noformat}
The specific scenario is that Flink writes to HDFS(replication file), and in 
the case of an exception to the write file, the following operations are 
performed :
1.close file
2.open file
3.truncate file
4.append file
{noformat}

  was:
In the cluster version 3.2.0 production environment,
 We found that due to edit log corruption, Standby NameNode could not properly 
load the Ediltog log, result in abnormal exit of the service and failure to 
restart
{code:java}
The specific scenario is that Flink writes to HDFS(replication file), and in 
the case of an exception to the write file, the following operations are 
performed :
1.close file
2.open file
3.truncate file
4.append file
{code}


> Due to edit log corruption, Standby NameNode could not properly load the 
> Ediltog log, result in abnormal exit of the service and failure to restart
> ---
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In a version 3.2.0 production cluster,
>  we found that due to edit log corruption, the Standby NameNode could not 
> properly load the edit log, resulting in abnormal exit of the service and 
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS(replication file), and in 
> the case of an exception to the write file, the following operations are 
> performed :
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart

2020-06-05 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15391:
-
Description: 
In a version 3.2.0 production cluster,
 we found that due to edit log corruption, the Standby NameNode could not properly 
load the edit log, resulting in abnormal exit of the service and failure to 
restart.
{code:java}
The specific scenario is that Flink writes to HDFS (a replicated file), and when
an exception occurs while writing the file, the following operations are
performed:
{code}
 # Close file
 # Open file
 # Truncate file
 # Append file

  was:
In the cluster version 3.2.0 production environment,
 We found that due to edit log corruption, Standby NameNode could not properly 
load the Ediltog log, result in abnormal exit of the service and failure to 
restart

 

The specific scenario is that Flink writes to HDFS, and in the case of an 
exception to the write file, the following operations are performed
1. Close file
2. Open file 
3. truncate file
4. append file 

 


> Due to edit log corruption, Standby NameNode could not properly load the 
> Ediltog log, result in abnormal exit of the service and failure to restart
> ---
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In the cluster version 3.2.0 production environment,
>  We found that due to edit log corruption, Standby NameNode could not 
> properly load the Ediltog log, result in abnormal exit of the service and 
> failure to restart
> {code:java}
> The specific scenario is that Flink writes to HDFS(replication file), and in 
> the case of an exception to the write file, the following operations are 
> performed 
> {code}
>  # Close file 
>  # 2. Open file 
>  # 3. truncate file 
>  # 4. append file



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart

2020-06-05 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126730#comment-17126730
 ] 

Ayush Saxena commented on HDFS-15391:
-

Thanks,
These are two different traces, correct?
You tried restarting the NameNode twice, and it failed once with CLOSE_OP and
the other time with TRUNCATE, correct?

> Due to edit log corruption, Standby NameNode could not properly load the 
> Ediltog log, result in abnormal exit of the service and failure to restart
> ---
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In the cluster version 3.2.0 production environment,
>  We found that due to edit log corruption, Standby NameNode could not 
> properly load the Ediltog log, result in abnormal exit of the service and 
> failure to restart
>  
> The specific scenario is that Flink writes to HDFS, and in the case of an 
> exception to the write file, the following operations are performed
> 1. Close file
> 2. Open file 
> 3. truncate file
> 4. append file 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15390) client fails forever when namenode ipaddr changed

2020-06-05 Thread Sean Chow (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Chow updated HDFS-15390:
-
Description: 
For a machine replacement, I replaced my standby namenode with a new IP address
and kept the same hostname. I also updated the clients' hosts files so the name
resolves correctly.

When I try to fail over to the new namenode (let's say nn2), the client fails to
read or write forever until it is restarted.

That leaves the YARN NodeManagers in a sick state; even new tasks encounter this
exception, until all NodeManagers are restarted.

 
{code:java}
20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: 
nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000
20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to 
nn2-192-168-1-100/192.168.1.200:9000: Connection refused
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517)
at org.apache.hadoop.ipc.Client.call(Client.java:1440)
at org.apache.hadoop.ipc.Client.call(Client.java:1401)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
{code}
 

We can see the client logs {{Address change detected}}, but it still fails. I
found that this is because, when {{updateAddress()}} returns true,
{{handleConnectionFailure()}} throws an exception that breaks the next retry
with the right IP address.

Client.java: setupConnection()
{code:java}
} catch (ConnectTimeoutException toe) {
  /* Check for an address change and update the local reference.
   * Reset the failure counter if the address was changed
   */
  if (updateAddress()) {
timeoutFailures = ioFailures = 0;
  }
  handleConnectionTimeout(timeoutFailures++,
  maxRetriesOnSocketTimeouts, toe);
} catch (IOException ie) {
  if (updateAddress()) {
timeoutFailures = ioFailures = 0;
  }
  // Because the namenode IP changed in updateAddress(), the old namenode
  // address can no longer be reached. handleConnectionFailure() will throw an
  // exception, so the next retry never gets a chance to use the updated
  // server address from updateAddress().
  handleConnectionFailure(ioFailures++, ie);
}
{code}
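
As an illustration of the reporter's point (a sketch, not the actual patch): if
{{updateAddress()}} reports that the address changed, one could skip the failure
handling for that attempt and let the retry loop in {{setupConnection()}}
reconnect with the freshly resolved address. This assumes the catch block sits
inside setupConnection()'s while(true) retry loop.
{code:java}
} catch (IOException ie) {
  if (updateAddress()) {
    // Address changed: reset the failure counters and retry immediately with
    // the newly resolved address instead of failing on the stale one.
    timeoutFailures = ioFailures = 0;
    continue;
  }
  handleConnectionFailure(ioFailures++, ie);
}
{code}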
 

  was:
For machine replacement, I replace my standby namenode with a new ipaddr and 
keep the same hostname. Also update the client's hosts to make it resolve 
correctly

When I try to run failover to transite the new namenode(let's say nn2), the 
client will fail to read or write forever until it's restarted.

That make yarn nodemanager in sick state. Even the new tasks will encounter 
this exception  too. Until all nodemanager restart.

 
{code:java}
20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: 
nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000
20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to 
nn2-192-168-1-100/192.168.1.200:9000: Connection refused
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
at 

[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart

2020-06-05 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15391:
-
Description: 
In our production environment running cluster version 3.2.0,
we found that, due to edit log corruption, the Standby NameNode could not properly
load the edit log, resulting in abnormal exit of the service and failure to restart.

 

The specific scenario is that Flink writes to HDFS, and when an exception occurs
while writing the file, the following operations are performed:
1. Close file
2. Open file
3. Truncate file
4. Append file

 

  was:
In the cluster version 3.2.0 production environment,
 We found that due to edit log corruption, Standby NameNode could not properly 
load the Ediltog log, result in abnormal exit of the service and failure to 
restart

 

 

 


> Due to edit log corruption, Standby NameNode could not properly load the 
> Ediltog log, result in abnormal exit of the service and failure to restart
> ---
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In the cluster version 3.2.0 production environment,
>  We found that due to edit log corruption, Standby NameNode could not 
> properly load the Ediltog log, result in abnormal exit of the service and 
> failure to restart
>  
> The specific scenario is that Flink writes to HDFS, and in the case of an 
> exception to the write file, the following operations are performed
> 1. Close file
> 2. Open file 
> 3. truncate file
> 4. append file 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart

2020-06-05 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15391:
-
Description: 
In our production environment running cluster version 3.2.0,
we found that, due to edit log corruption, the Standby NameNode could not properly
load the edit log, resulting in abnormal exit of the service and failure to restart.

 

 

 

  was:
In the cluster version 3.2.0 production environment,
 We found that due to edit log corruption, Standby NameNode could not properly 
load the Ediltog log, result in abnormal exit of the service and failure to 
restart

 

 


> Due to edit log corruption, Standby NameNode could not properly load the 
> Ediltog log, result in abnormal exit of the service and failure to restart
> ---
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In the cluster version 3.2.0 production environment,
>  We found that due to edit log corruption, Standby NameNode could not 
> properly load the Ediltog log, result in abnormal exit of the service and 
> failure to restart
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart

2020-06-05 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15391:
-
Description: 
In our production environment running cluster version 3.2.0,
we found that, due to edit log corruption, the Standby NameNode could not properly
load the edit log, resulting in abnormal exit of the service and failure to restart.

 

 

  was:
In the cluster version 3.2.0 production environment,
 We found that due to edit log corruption, Standby NameNode could not properly 
load the Ediltog log, result in abnormal exit of the service and failure to 
restart

 Standby NameNode  exception log:
 2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation CloseOp [length=0, inodeId=0, path=path, replication=3, 
mtime=1591266620287, atime=1591264800229, blockSize=134217728, 
blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, 
blk_11382041307_10353383098, blk_11382049845_10353392031, 
blk_11382057341_10353399899, blk_11382071544_10353415171, 
blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, 
aclEntries=null, clientName=, clientMachine=, overwrite=false, 
storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585]
 java.io.IOException: File is not under construction: hdfs://path
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
 at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
 2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error 
encountered while tailing edits. Shutting down standby NN.

 

2020-06-04 22:28:04,025 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation TruncateOp [src=xxxpath, 
clientName=DFSClient_NONMAPREDUCE_-295521672_77, clientMachine=xxx, 
newLength=3210623016, timestamp=1591270219348, 
truncateBlock=blk_11382198393_10355810378, opCode=OP_TRUNCATE, 
txid=126074587217]
java.lang.IllegalStateException: file is already under construction
 at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
 at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.toUnderConstruction(INodeFile.java:329)
 at 
org.apache.hadoop.hdfs.server.namenode.FSDirTruncateOp.prepareFileForTruncate(FSDirTruncateOp.java:222)
 at 
org.apache.hadoop.hdfs.server.namenode.FSDirTruncateOp.unprotectedTruncate(FSDirTruncateOp.java:183)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:986)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753)
 at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1123)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:730)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:669)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:731)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:974)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:947)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1680)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1747)
2020-06-04 22:28:04,027 WARN 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception 
loading fsimage
java.io.IOException: java.lang.IllegalStateException: file is already under 
construction
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:268)
 at 

[jira] [Commented] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart

2020-06-05 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126724#comment-17126724
 ] 

huhaiyang commented on HDFS-15391:
--

Standby NameNode  exception log:
 2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation CloseOp [length=0, inodeId=0, path=path, replication=3, 
mtime=1591266620287, atime=1591264800229, blockSize=134217728, 
blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, 
blk_11382041307_10353383098, blk_11382049845_10353392031, 
blk_11382057341_10353399899, blk_11382071544_10353415171, 
blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, 
aclEntries=null, clientName=, clientMachine=, overwrite=false, 
storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585]
 java.io.IOException: File is not under construction: hdfs://path
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
 at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
 2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error 
encountered while tailing edits. Shutting down standby NN.

 

2020-06-04 22:28:04,025 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation TruncateOp [src=xxxpath, 
clientName=DFSClient_NONMAPREDUCE_-295521672_77, clientMachine=xxx, 
newLength=3210623016, timestamp=1591270219348, 
truncateBlock=blk_11382198393_10355810378, opCode=OP_TRUNCATE, 
txid=126074587217]
 java.lang.IllegalStateException: file is already under construction
 at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
 at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.toUnderConstruction(INodeFile.java:329)
 at 
org.apache.hadoop.hdfs.server.namenode.FSDirTruncateOp.prepareFileForTruncate(FSDirTruncateOp.java:222)
 at 
org.apache.hadoop.hdfs.server.namenode.FSDirTruncateOp.unprotectedTruncate(FSDirTruncateOp.java:183)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:986)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753)
 at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1123)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:730)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:669)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:731)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:974)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:947)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1680)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1747)
 2020-06-04 22:28:04,027 WARN 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception 
loading fsimage
 java.io.IOException: java.lang.IllegalStateException: file is already under 
construction
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:268)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753)
 at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1123)
 at 

[jira] [Commented] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart

2020-06-05 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126723#comment-17126723
 ] 

Ayush Saxena commented on HDFS-15391:
-

Have you backported HDFS-7663 into that build?
If yes, HDFS-14581 may help.
If not, can you share more background, the audit logs, or any other details?

> Due to edit log corruption, Standby NameNode could not properly load the 
> Ediltog log, result in abnormal exit of the service and failure to restart
> ---
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In the cluster version 3.2.0 production environment,
> We found that due to edit log corruption, Standby NameNode could not properly 
> load the Ediltog log, result in abnormal exit of the service and failure to 
> restart
> This is the exception it throws:
> 2020-06-04 18:32:11,561 ERROR 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
> on operation CloseOp [length=0, inodeId=0, path=path, replication=3, 
> mtime=1591266620287, atime=1591264800229, blockSize=134217728, 
> blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, 
> blk_11382041307_10353383098, blk_11382049845_10353392031, 
> blk_11382057341_10353399899, blk_11382071544_10353415171, 
> blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, 
> aclEntries=null, clientName=, clientMachine=, overwrite=false, 
> storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, 
> txid=126060943585]
> java.io.IOException: File is not under construction: path
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
> at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
> 2020-06-04 18:32:11,561 ERROR 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error 
> encountered while tailing edits. Shutting down standby NN.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart

2020-06-05 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15391:
-
Description: 
In our production environment running cluster version 3.2.0,
we found that, due to edit log corruption, the Standby NameNode could not properly
load the edit log, resulting in abnormal exit of the service and failure to restart.

 Standby NameNode  exception log:
 2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation CloseOp [length=0, inodeId=0, path=path, replication=3, 
mtime=1591266620287, atime=1591264800229, blockSize=134217728, 
blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, 
blk_11382041307_10353383098, blk_11382049845_10353392031, 
blk_11382057341_10353399899, blk_11382071544_10353415171, 
blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, 
aclEntries=null, clientName=, clientMachine=, overwrite=false, 
storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585]
 java.io.IOException: File is not under construction: hdfs://path
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
 at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484)
 at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
 2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error 
encountered while tailing edits. Shutting down standby NN.

 

2020-06-04 22:28:04,025 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation TruncateOp [src=xxxpath, 
clientName=DFSClient_NONMAPREDUCE_-295521672_77, clientMachine=xxx, 
newLength=3210623016, timestamp=1591270219348, 
truncateBlock=blk_11382198393_10355810378, opCode=OP_TRUNCATE, 
txid=126074587217]
java.lang.IllegalStateException: file is already under construction
 at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
 at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.toUnderConstruction(INodeFile.java:329)
 at 
org.apache.hadoop.hdfs.server.namenode.FSDirTruncateOp.prepareFileForTruncate(FSDirTruncateOp.java:222)
 at 
org.apache.hadoop.hdfs.server.namenode.FSDirTruncateOp.unprotectedTruncate(FSDirTruncateOp.java:183)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:986)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753)
 at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1123)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:730)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:669)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:731)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:974)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:947)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1680)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1747)
2020-06-04 22:28:04,027 WARN 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception 
loading fsimage
java.io.IOException: java.lang.IllegalStateException: file is already under 
construction
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:268)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753)
 at 

[jira] [Commented] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2020-06-05 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126722#comment-17126722
 ] 

Stephen O'Donnell commented on HDFS-13179:
--

Ah, good to know. It seems I have been wasting some time pulling changes onto
branch-3.0 then :-(

At least the branch can be compiled now, but we probably don't need to bother 
committing the updated patch I just uploaded.

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> --
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 3.0.0
>Reporter: Gabor Bota
>Assignee: Ahmed Hussein
>Priority: Critical
> Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-13179-branch-2.10.003.patch, 
> HDFS-13179-branch-3.0.003.patch, HDFS-13179.001.patch, HDFS-13179.002.patch, 
> HDFS-13179.003.patch, test runs.zip
>
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
> try {
>   ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
>   LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>   ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: 

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2020-06-05 Thread Stephen O'Donnell (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen O'Donnell updated HDFS-13179:
-
Attachment: HDFS-13179-branch-3.0.003.patch

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> --
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 3.0.0
>Reporter: Gabor Bota
>Assignee: Ahmed Hussein
>Priority: Critical
> Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-13179-branch-2.10.003.patch, 
> HDFS-13179-branch-3.0.003.patch, HDFS-13179.001.patch, HDFS-13179.002.patch, 
> HDFS-13179.003.patch, test runs.zip
>
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
> try {
>   ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
>   LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>   ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart

2020-06-05 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15391:
-
Description: 
In our production environment running cluster version 3.2.0,
we found that, due to edit log corruption, the Standby NameNode could not properly
load the edit log, resulting in abnormal exit of the service and failure to restart.

This is the exception it throws:
2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation CloseOp [length=0, inodeId=0, path=path, replication=3, 
mtime=1591266620287, atime=1591264800229, blockSize=134217728, 
blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, 
blk_11382041307_10353383098, blk_11382049845_10353392031, 
blk_11382057341_10353399899, blk_11382071544_10353415171, 
blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, 
aclEntries=null, clientName=, clientMachine=, overwrite=false, 
storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585]
java.io.IOException: File is not under construction: path
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error 
encountered while tailing edits. Shutting down standby NN.



  was:
In the cluster version 3.2.0 production environment,
We found that due to edit log corruption, Standby NameNode could not properly 
load the Ediltog log, resulting in abnormal exit of the service and failure to 
restart

This is the exception it throws:
2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation CloseOp [length=0, inodeId=0, path=path, replication=3, 
mtime=1591266620287, atime=1591264800229, blockSize=134217728, 
blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, 
blk_11382041307_10353383098, blk_11382049845_10353392031, 
blk_11382057341_10353399899, blk_11382071544_10353415171, 
blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, 
aclEntries=null, clientName=, clientMachine=, overwrite=false, 
storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585]
java.io.IOException: File is not under construction: path
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error 
encountered while tailing edits. Shutting down standby NN.




> Due to edit log corruption, Standby NameNode could not properly load the 
> Ediltog log, result in abnormal exit of the service and failure to restart
> ---
>
> Key: HDFS-15391
> URL: 

[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart

2020-06-05 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15391:
-
Summary: Due to edit log corruption, Standby NameNode could not properly 
load the Ediltog log, result in abnormal exit of the service and failure to 
restart  (was: Due to edit log corruption, Standby NameNode could not properly 
load the Ediltog log, resulting in abnormal exit of the service and failure to 
restart)

> Due to edit log corruption, Standby NameNode could not properly load the 
> Ediltog log, result in abnormal exit of the service and failure to restart
> ---
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In the cluster version 3.2.0 production environment,
> We found that due to edit log corruption, Standby NameNode could not properly 
> load the Ediltog log, resulting in abnormal exit of the service and failure 
> to restart
> This is the exception it throws:
> 2020-06-04 18:32:11,561 ERROR 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
> on operation CloseOp [length=0, inodeId=0, path=path, replication=3, 
> mtime=1591266620287, atime=1591264800229, blockSize=134217728, 
> blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, 
> blk_11382041307_10353383098, blk_11382049845_10353392031, 
> blk_11382057341_10353399899, blk_11382071544_10353415171, 
> blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, 
> aclEntries=null, clientName=, clientMachine=, overwrite=false, 
> storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, 
> txid=126060943585]
> java.io.IOException: File is not under construction: path
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
> at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
> 2020-06-04 18:32:11,561 ERROR 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error 
> encountered while tailing edits. Shutting down standby NN.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, resulting in abnormal exit of the service and failure to restart

2020-06-05 Thread huhaiyang (Jira)
huhaiyang created HDFS-15391:


 Summary: Due to edit log corruption, Standby NameNode could not 
properly load the Ediltog log, resulting in abnormal exit of the service and 
failure to restart
 Key: HDFS-15391
 URL: https://issues.apache.org/jira/browse/HDFS-15391
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.2.0
Reporter: huhaiyang


In our production environment running cluster version 3.2.0,
we found that, due to edit log corruption, the Standby NameNode could not properly
load the edit log, resulting in abnormal exit of the service and failure to restart.

This is the exception it throws:
2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation CloseOp [length=0, inodeId=0, path=path, replication=3, 
mtime=1591266620287, atime=1591264800229, blockSize=134217728, 
blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, 
blk_11382041307_10353383098, blk_11382049845_10353392031, 
blk_11382057341_10353399899, blk_11382071544_10353415171, 
blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, 
aclEntries=null, clientName=, clientMachine=, overwrite=false, 
storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585]
java.io.IOException: File is not under construction: path
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error 
encountered while tailing edits. Shutting down standby NN.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2020-06-05 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126718#comment-17126718
 ] 

Ayush Saxena commented on HDFS-13179:
-

branch-3.0 is EOL? 
https://cwiki.apache.org/confluence/display/HADOOP/EOL+%28End-of-life%29+Release+Branches

There doesn't seem to be any need to push to branch-3.0 then?

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> --
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 3.0.0
>Reporter: Gabor Bota
>Assignee: Ahmed Hussein
>Priority: Critical
> Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-13179-branch-2.10.003.patch, HDFS-13179.001.patch, 
> HDFS-13179.002.patch, HDFS-13179.003.patch, test runs.zip
>
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
> try {
>   ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
>   LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>   ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15386) ReplicaNotFoundException keeps happening in DN after removing multiple DN's data directories

2020-06-05 Thread Stephen O'Donnell (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen O'Donnell updated HDFS-15386:
-
Fix Version/s: 3.1.5
   3.4.0
   3.3.1
   3.2.2
   3.0.4

> ReplicaNotFoundException keeps happening in DN after removing multiple DN's 
> data directories
> 
>
> Key: HDFS-15386
> URL: https://issues.apache.org/jira/browse/HDFS-15386
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Toshihiro Suzuki
>Assignee: Toshihiro Suzuki
>Priority: Major
> Fix For: 3.0.4, 3.2.2, 3.3.1, 3.4.0, 3.1.5
>
>
> When removing volumes, we need to invalidate all the blocks in those 
> volumes. In the following code (FsDatasetImpl), we keep the blocks to be 
> invalidated in the *blkToInvalidate* map. However, as the key of the map is 
> the *bpid* (Block Pool ID), the entry is overwritten by each removed volume. 
> As a result, the map ends up holding only the blocks of the last volume 
> being removed, and only those are invalidated:
> {code:java}
> for (String bpid : volumeMap.getBlockPoolList()) {
>   List<ReplicaInfo> blocks = new ArrayList<>();
>   for (Iterator<ReplicaInfo> it =
>       volumeMap.replicas(bpid).iterator(); it.hasNext();) {
>     ReplicaInfo block = it.next();
>     final StorageLocation blockStorageLocation =
>         block.getVolume().getStorageLocation();
>     LOG.trace("checking for block " + block.getBlockId() +
>         " with storageLocation " + blockStorageLocation);
>     if (blockStorageLocation.equals(sdLocation)) {
>       blocks.add(block);
>       it.remove();
>     }
>   }
>   blkToInvalidate.put(bpid, blocks);
> }
> {code}
> [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595]
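A minimal, self-contained sketch of the fix direction implied above: merge the
per-bpid lists across removed volumes (for example with computeIfAbsent)
instead of overwriting the previous volume's entry with put(). The class and
names below are illustrative only; this is not the committed FsDatasetImpl
patch.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergePerBpidSketch {
  public static void main(String[] args) {
    Map<String, List<String>> blkToInvalidate = new HashMap<>();
    String bpid = "BP-1";

    // Blocks found while removing two different volumes for the same bpid.
    List<String> volume1Blocks = List.of("blk_1001", "blk_1002");
    List<String> volume2Blocks = List.of("blk_2001");

    // put(bpid, volume2Blocks) would drop volume1Blocks; merging keeps both.
    blkToInvalidate.computeIfAbsent(bpid, k -> new ArrayList<>()).addAll(volume1Blocks);
    blkToInvalidate.computeIfAbsent(bpid, k -> new ArrayList<>()).addAll(volume2Blocks);

    // Prints {BP-1=[blk_1001, blk_1002, blk_2001]}
    System.out.println(blkToInvalidate);
  }
}
{code}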



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15386) ReplicaNotFoundException keeps happening in DN after removing multiple DN's data directories

2020-06-05 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126715#comment-17126715
 ] 

Stephen O'Donnell commented on HDFS-15386:
--

Please create a PR for branch-2.10 and then we can backport to the other 2.x 
branches from there.

This is now committed to branches 3.0, 3.1, 3.2, 3.3 and trunk.

> ReplicaNotFoundException keeps happening in DN after removing multiple DN's 
> data directories
> 
>
> Key: HDFS-15386
> URL: https://issues.apache.org/jira/browse/HDFS-15386
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Toshihiro Suzuki
>Assignee: Toshihiro Suzuki
>Priority: Major
>
> When removing volumes, we need to invalidate all the blocks in those 
> volumes. In the following code (FsDatasetImpl), we keep the blocks to be 
> invalidated in the *blkToInvalidate* map. However, as the key of the map is 
> the *bpid* (Block Pool ID), the entry is overwritten by each removed volume. 
> As a result, the map ends up holding only the blocks of the last volume 
> being removed, and only those are invalidated:
> {code:java}
> for (String bpid : volumeMap.getBlockPoolList()) {
>   List<ReplicaInfo> blocks = new ArrayList<>();
>   for (Iterator<ReplicaInfo> it =
>       volumeMap.replicas(bpid).iterator(); it.hasNext();) {
>     ReplicaInfo block = it.next();
>     final StorageLocation blockStorageLocation =
>         block.getVolume().getStorageLocation();
>     LOG.trace("checking for block " + block.getBlockId() +
>         " with storageLocation " + blockStorageLocation);
>     if (blockStorageLocation.equals(sdLocation)) {
>       blocks.add(block);
>       it.remove();
>     }
>   }
>   blkToInvalidate.put(bpid, blocks);
> }
> {code}
> [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2020-06-05 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126713#comment-17126713
 ] 

Stephen O'Donnell commented on HDFS-13179:
--

This is now reverted from branch-3.0. I will post a new patch here shortly, as 
it's a trivial change.

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> --
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 3.0.0
>Reporter: Gabor Bota
>Assignee: Ahmed Hussein
>Priority: Critical
> Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-13179-branch-2.10.003.patch, HDFS-13179.001.patch, 
> HDFS-13179.002.patch, HDFS-13179.003.patch, test runs.zip
>
>
> The error is caused by a TimeoutException: the test waits to ensure that the 
> file is replicated to DISK storage, but the replication to DISK cannot 
> finish within the 30s timeout in ensureFileReplicasOnStorageType(). The file 
> is still on RAM_DISK, so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness.
> {code:java}
> try {
>   ensureFileReplicasOnStorageType(path1, DEFAULT);
> } catch (TimeoutException t) {
>   LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on RAM_DISK");
>   ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
> {code}
> Some thoughts:
> * Successful and failed tests run similarly up to the point when the 
> datanode restarts. The restart line in the log is: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test failure. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}
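As a side note, a hedged sketch of the polling pattern behind the 30s timeout
discussed in the description above: GenericTestUtils.waitFor retries a
condition at a fixed interval until a deadline, so a larger waitForMillis is
one way to reduce this kind of flakiness. The timeout values and the condition
here are illustrative only, not the values used by LazyPersistTestCase.

{code:java}
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.hadoop.test.GenericTestUtils;

public class WaitForSketch {
  public static void main(String[] args)
      throws TimeoutException, InterruptedException {
    AtomicInteger attempts = new AtomicInteger();

    // Poll every 500 ms, give up after 60 s instead of 30 s (values illustrative).
    GenericTestUtils.waitFor(() -> attempts.incrementAndGet() >= 5, 500, 60_000);

    System.out.println("Condition met after " + attempts.get() + " checks");
  }
}
{code}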



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15386) ReplicaNotFoundException keeps happening in DN after removing multiple DN's data directories

2020-06-05 Thread Toshihiro Suzuki (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126647#comment-17126647
 ] 

Toshihiro Suzuki commented on HDFS-15386:
-

[~sodonnell] Thank you for merging the PR to trunk!

For branch 2, which branch should I create a PR for? Thanks.

> ReplicaNotFoundException keeps happening in DN after removing multiple DN's 
> data directories
> 
>
> Key: HDFS-15386
> URL: https://issues.apache.org/jira/browse/HDFS-15386
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Toshihiro Suzuki
>Assignee: Toshihiro Suzuki
>Priority: Major
>
> When removing volumes, we need to invalidate all the blocks in those 
> volumes. In the following code (FsDatasetImpl), we keep the blocks to be 
> invalidated in the *blkToInvalidate* map. However, as the key of the map is 
> the *bpid* (Block Pool ID), the entry is overwritten by each removed volume. 
> As a result, the map ends up holding only the blocks of the last volume 
> being removed, and only those are invalidated:
> {code:java}
> for (String bpid : volumeMap.getBlockPoolList()) {
>   List<ReplicaInfo> blocks = new ArrayList<>();
>   for (Iterator<ReplicaInfo> it =
>       volumeMap.replicas(bpid).iterator(); it.hasNext();) {
>     ReplicaInfo block = it.next();
>     final StorageLocation blockStorageLocation =
>         block.getVolume().getStorageLocation();
>     LOG.trace("checking for block " + block.getBlockId() +
>         " with storageLocation " + blockStorageLocation);
>     if (blockStorageLocation.equals(sdLocation)) {
>       blocks.add(block);
>       it.remove();
>     }
>   }
>   blkToInvalidate.put(bpid, blocks);
> }
> {code}
> [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15390) client fails forever when namenode ipaddr changed

2020-06-05 Thread Sean Chow (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126638#comment-17126638
 ] 

Sean Chow commented on HDFS-15390:
--

There are two ways to fix this:
 # When updateAddress() returns true, do not handle the connection failure in that round
 # When an address change is detected, update the namenode proxies (only with 
{{ConfiguredFailoverProxyProvider}})

Method one is easy, and within that connection's lifecycle the client will use 
the right {{server}} to connect. But when the client connection is closed and a 
new one is created, it will again try to getConnection with the retired ipaddr, 
because the namenode proxies are still the old ones.

Method two solves the root cause: every time the client fails over between 
namenodes, check whether the ipaddr has changed, and if so, re-initialize the 
namenode failover proxies.
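A minimal, self-contained sketch of method one, assuming a simplified retry
loop: when an address change is detected, reset the failure counters and retry
with the freshly resolved address instead of letting the connection-failure
handler abort the attempt. The class and method names below are illustrative
only; this is not the actual org.apache.hadoop.ipc.Client code.

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;

/** Illustrative only; not the real org.apache.hadoop.ipc.Client. */
public class RetryOnAddressChangeSketch {
  private static final int MAX_IO_FAILURES = 3;

  private InetSocketAddress server;
  private int ioFailures = 0;

  public RetryOnAddressChangeSketch(String host, int port) {
    this.server = new InetSocketAddress(host, port);
  }

  /** Re-resolve the hostname; return true if the cached address changed. */
  private boolean updateAddress() {
    InetSocketAddress fresh =
        new InetSocketAddress(server.getHostName(), server.getPort());
    if (fresh.equals(server)) {
      return false;
    }
    server = fresh;
    return true;
  }

  /** Stand-in for the real socket connect. */
  private void connectOnce() throws IOException {
    throw new IOException("Connection refused: " + server);
  }

  public void setupConnection() throws IOException {
    while (true) {
      try {
        connectOnce();
        return;
      } catch (IOException ie) {
        if (updateAddress()) {
          // Method one: the address changed, so reset the counters and retry
          // with the new address instead of handling this as a plain failure.
          ioFailures = 0;
          continue;
        }
        if (++ioFailures >= MAX_IO_FAILURES) {
          throw ie; // ordinary failure handling after the retry budget
        }
      }
    }
  }
}
{code}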

> client fails forever when namenode ipaddr changed
> -
>
> Key: HDFS-15390
> URL: https://issues.apache.org/jira/browse/HDFS-15390
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfsclient
>Affects Versions: 2.10.0, 2.9.2, 3.2.1
>Reporter: Sean Chow
>Priority: Major
>
> For a machine replacement, I replaced my standby namenode with a new ipaddr 
> and kept the same hostname. I also updated the clients' hosts file so that 
> the name resolves correctly.
> When I run a failover to transition to the new namenode (let's say nn2), the 
> client fails to read or write forever, until it is restarted.
> That leaves the YARN nodemanagers in a sick state. Even new tasks encounter 
> this exception too, until all nodemanagers are restarted.
>  
> {code:java}
> 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: 
> nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000
> 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to 
> nn2-192-168-1-100/192.168.1.200:9000: Connection refused
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517)
> at org.apache.hadoop.ipc.Client.call(Client.java:1440)
> at org.apache.hadoop.ipc.Client.call(Client.java:1401)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
> at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> {code}
>  
> We can see the client logs {{Address change detected}}, but it still fails. 
> I found out that's because when {{updateAddress()}} returns true, 
> {{handleConnectionFailure()}} throws an exception that breaks the next retry 
> with the right ipaddr.
> Client.java: setupConnection()
> {code:java}
> } catch (ConnectTimeoutException toe) {
>   /* Check for an address change and update the local reference.
>* Reset the failure counter if the address was changed
>*/
>   if (updateAddress()) {
> timeoutFailures = ioFailures = 0;
>   }
>   handleConnectionTimeout(timeoutFailures++,
>   maxRetriesOnSocketTimeouts, toe);
> } catch (IOException ie) {
>   if (updateAddress()) {
> timeoutFailures = ioFailures = 0;
>   }
> // Because the namenode ip changed in updateAddress(), the old namenode
> // ipaddress cannot be accessed now.
> // handleConnectionFailure will throw an exception, so the next retry never
> // has a chance to use the right server updated in updateAddress().
>   handleConnectionFailure(ioFailures++, ie);
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HDFS-15386) ReplicaNotFoundException keeps happening in DN after removing multiple DN's data directories

2020-06-05 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126630#comment-17126630
 ] 

Hudson commented on HDFS-15386:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18329 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/18329/])
HDFS-15386 ReplicaNotFoundException keeps happening in DN after removing 
(github: rev 545a0a147c5256c44911ba57b4898e01d786d836)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java


> ReplicaNotFoundException keeps happening in DN after removing multiple DN's 
> data directories
> 
>
> Key: HDFS-15386
> URL: https://issues.apache.org/jira/browse/HDFS-15386
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Toshihiro Suzuki
>Assignee: Toshihiro Suzuki
>Priority: Major
>
> When removing volumes, we need to invalidate all the blocks in those 
> volumes. In the following code (FsDatasetImpl), we keep the blocks to be 
> invalidated in the *blkToInvalidate* map. However, as the key of the map is 
> the *bpid* (Block Pool ID), the entry is overwritten by each removed volume. 
> As a result, the map ends up holding only the blocks of the last volume 
> being removed, and only those are invalidated:
> {code:java}
> for (String bpid : volumeMap.getBlockPoolList()) {
>   List<ReplicaInfo> blocks = new ArrayList<>();
>   for (Iterator<ReplicaInfo> it =
>       volumeMap.replicas(bpid).iterator(); it.hasNext();) {
>     ReplicaInfo block = it.next();
>     final StorageLocation blockStorageLocation =
>         block.getVolume().getStorageLocation();
>     LOG.trace("checking for block " + block.getBlockId() +
>         " with storageLocation " + blockStorageLocation);
>     if (blockStorageLocation.equals(sdLocation)) {
>       blocks.add(block);
>       it.remove();
>     }
>   }
>   blkToInvalidate.put(bpid, blocks);
> }
> {code}
> [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15390) client fails forever when namenode ipaddr changed

2020-06-05 Thread Sean Chow (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Chow updated HDFS-15390:
-
Description: 
For a machine replacement, I replaced my standby namenode with a new ipaddr 
and kept the same hostname. I also updated the clients' hosts file so that the 
name resolves correctly.

When I run a failover to transition to the new namenode (let's say nn2), the 
client fails to read or write forever, until it is restarted.

That leaves the YARN nodemanagers in a sick state. Even new tasks encounter 
this exception too, until all nodemanagers are restarted.

 
{code:java}
20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: 
nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000
20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to 
nn2-192-168-1-100/192.168.1.200:9000: Connection refused
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517)
at org.apache.hadoop.ipc.Client.call(Client.java:1440)
at org.apache.hadoop.ipc.Client.call(Client.java:1401)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
{code}
 

We can see the client logs {{Address change detected}}, but it still fails. I 
found out that's because when {{updateAddress()}} returns true, 
{{handleConnectionFailure()}} throws an exception that breaks the next retry 
with the right ipaddr.

Client.java: setupConnection()
{code:java}
} catch (ConnectTimeoutException toe) {
  /* Check for an address change and update the local reference.
   * Reset the failure counter if the address was changed
   */
  if (updateAddress()) {
timeoutFailures = ioFailures = 0;
  }
  handleConnectionTimeout(timeoutFailures++,
  maxRetriesOnSocketTimeouts, toe);
} catch (IOException ie) {
  if (updateAddress()) {
timeoutFailures = ioFailures = 0;
  }
// Because the namenode ip changed in updateAddress(), the old namenode
// ipaddress cannot be accessed now.
// handleConnectionFailure will throw an exception, so the next retry never
// has a chance to use the right server updated in updateAddress().
  handleConnectionFailure(ioFailures++, ie);
}
{code}
 

  was:
For machine replacement, I replace my standby namenode with a new ipaddr and 
keep the same hostname. Also update the client's hosts to make it resolve 
correctly

When I try to run failover to transite the new namenode(let's say nn2), the 
client will fail to read or write forever until it's restarted.

That make yarn nodemanager in sick state. Even the new tasks will encounter 
this exception  too. Until all nodemanager restart.

 
{code:java}
20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: 
nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000
20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to 
nn2-192-168-1-100/192.168.1.200:9000: Connection refused
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
at 

[jira] [Commented] (HDFS-15389) DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme

2020-06-05 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126581#comment-17126581
 ] 

Hadoop QA commented on HDFS-15389:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
23s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m 30s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  3m  
2s{color} | {color:blue} Used deprecated FindBugs config; considering switching 
to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  
0s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 44s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 1 new + 179 unchanged - 0 fixed = 180 total (was 179) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 40s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  
5s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}119m  3s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
35s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}190m 42s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy 
|
|   | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure |
|   | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks |
|   | hadoop.hdfs.TestMultipleNNPortQOP |
|   | hadoop.hdfs.TestErasureCodingPoliciesWithRandomECPolicy |
|   | hadoop.hdfs.TestReconstructStripedFile |
|   | hadoop.hdfs.TestStripedFileAppend |
|   | hadoop.hdfs.TestRollingUpgrade |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://builds.apache.org/job/PreCommit-HDFS-Build/29404/artifact/out/Dockerfile
 |
| JIRA Issue | HDFS-15389 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13004879/HDFS-15389-01.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux 2d7d59b9ec89 

[jira] [Updated] (HDFS-15390) client fails forever when namenode ipaddr changed

2020-06-05 Thread Sean Chow (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Chow updated HDFS-15390:
-
Description: 
For a machine replacement, I replaced my standby namenode with a new ipaddr 
and kept the same hostname. I also updated the clients' hosts file so that the 
name resolves correctly.

When I run a failover to transition to the new namenode (let's say nn2), the 
client fails to read or write forever, until it is restarted.

That leaves the YARN nodemanagers in a sick state. Even new tasks encounter 
this exception too, until all nodemanagers are restarted.

 
{code:java}
20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: 
nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000
20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to 
nn2-192-168-1-100/192.168.1.200:9000: Connection refused
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517)
at org.apache.hadoop.ipc.Client.call(Client.java:1440)
at org.apache.hadoop.ipc.Client.call(Client.java:1401)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
{code}
 

We can see the client logs {{Address change detected}}, but it still fails. I 
found out that's because when {{updateAddress()}} returns true, 
{{handleConnectionFailure()}} throws an exception that breaks the next retry 
with the right ipaddr.

 

  was:
For machine replacement, I replace my standby namenode with a new ipaddr and 
keep the same hostname. Also update the client's hosts to make it resolve 
correctly

When I try to run failover to transite the new namenode(let's say nn2), the 
client will fail to read or write forever until it's restarted.

That make yarn nodemanager in sick state. Even the new tasks will encounter 
this exception  too. Until all nodemanager restart.

 

 
{code:java}
20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: 
nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000
20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to 
nn2-192-168-1-100/192.168.1.200:9000: Connection refused
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517)
at org.apache.hadoop.ipc.Client.call(Client.java:1440)
at org.apache.hadoop.ipc.Client.call(Client.java:1401)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 

[jira] [Created] (HDFS-15390) client fails forever when namenode ipaddr changed

2020-06-05 Thread Sean Chow (Jira)
Sean Chow created HDFS-15390:


 Summary: client fails forever when namenode ipaddr changed
 Key: HDFS-15390
 URL: https://issues.apache.org/jira/browse/HDFS-15390
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: dfsclient
Affects Versions: 3.2.1, 2.9.2, 2.10.0
Reporter: Sean Chow


For a machine replacement, I replaced my standby namenode with a new ipaddr 
and kept the same hostname. I also updated the clients' hosts file so that the 
name resolves correctly.

When I run a failover to transition to the new namenode (let's say nn2), the 
client fails to read or write forever, until it is restarted.

That leaves the YARN nodemanagers in a sick state. Even new tasks encounter 
this exception too, until all nodemanagers are restarted.

 

 
{code:java}
20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: 
nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000
20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to 
nn2-192-168-1-100/192.168.1.200:9000: Connection refused
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517)
at org.apache.hadoop.ipc.Client.call(Client.java:1440)
at org.apache.hadoop.ipc.Client.call(Client.java:1401)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
{code}
 

 

We can see the client logs Address change detected, but it still fails. I 
found out that's because when updateAddress() returns true, 
handleConnectionFailure() throws an exception that breaks the next retry with 
the right ipaddr.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15389) DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme

2020-06-05 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126436#comment-17126436
 ] 

Ayush Saxena commented on HDFS-15389:
-

[~umamaheswararao] / [~rakeshr] / [~vinayakumarb] can you give this a check once?

> DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should 
> work with ViewFSOverloadScheme 
> --
>
> Key: HDFS-15389
> URL: https://issues.apache.org/jira/browse/HDFS-15389
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Attachments: HDFS-15389-01.patch
>
>
> Two issues here:
> First, prior to HDFS-15321, when DFSAdmin was closed the FileSystem 
> associated with it was closed as part of the close method. But post 
> HDFS-15321, the {{FileSystem}} isn't stored as part of {{FsShell}}, hence 
> during close the FileSystem still stays open and isn't closed.
> ** This is the reason for the failure of TestDFSHAAdmin
> Second, {{DfsAdmin -setBalancerBandwidth}} doesn't work with 
> {{ViewFSOverloadScheme}}, since setBalancerBandwidth calls {{getFS()}} 
> rather than {{getDFS()}}, which resolves the scheme as of HDFS-15321.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15321) Make DFSAdmin tool to work with ViewFSOverloadScheme

2020-06-05 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126430#comment-17126430
 ] 

Ayush Saxena commented on HDFS-15321:
-

There was one more problem here.
{{DFSAdmin -setBalancerBandwidth}} wasn't working with {{ViewFSOverloadScheme}}.
I have raised HDFS-15389 for both issues.

> Make DFSAdmin tool to work with ViewFSOverloadScheme
> 
>
> Key: HDFS-15321
> URL: https://issues.apache.org/jira/browse/HDFS-15321
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: dfsadmin, fs, viewfs
>Affects Versions: 3.2.1
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
>Priority: Major
>
> When we enable ViewFSOverloadScheme and use the hdfs scheme as the 
> overloaded scheme, users work with hdfs URIs. But here DFSAdmin expects the 
> impl class to be DistributedFileSystem; if the impl class is 
> ViewFSOverloadScheme, it will fail.
> So, when the impl is ViewFSOverloadScheme, we should get the corresponding 
> child hdfs to make DFSAdmin work.
> This Jira makes DFSAdmin work with ViewFSOverloadScheme.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15389) DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme

2020-06-05 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-15389:

Attachment: HDFS-15389-01.patch

> DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should 
> work with ViewFSOverloadScheme 
> --
>
> Key: HDFS-15389
> URL: https://issues.apache.org/jira/browse/HDFS-15389
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Attachments: HDFS-15389-01.patch
>
>
> Two issues here:
> First, prior to HDFS-15321, when DFSAdmin was closed the FileSystem 
> associated with it was closed as part of the close method. But post 
> HDFS-15321, the {{FileSystem}} isn't stored as part of {{FsShell}}, hence 
> during close the FileSystem still stays open and isn't closed.
> ** This is the reason for the failure of TestDFSHAAdmin
> Second, {{DfsAdmin -setBalancerBandwidth}} doesn't work with 
> {{ViewFSOverloadScheme}}, since setBalancerBandwidth calls {{getFS()}} 
> rather than {{getDFS()}}, which resolves the scheme as of HDFS-15321.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15389) DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme

2020-06-05 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-15389:

Status: Patch Available  (was: Open)

> DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should 
> work with ViewFSOverloadScheme 
> --
>
> Key: HDFS-15389
> URL: https://issues.apache.org/jira/browse/HDFS-15389
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Attachments: HDFS-15389-01.patch
>
>
> Two issues here:
> First, prior to HDFS-15321, when DFSAdmin was closed the FileSystem 
> associated with it was closed as part of the close method. But post 
> HDFS-15321, the {{FileSystem}} isn't stored as part of {{FsShell}}, hence 
> during close the FileSystem still stays open and isn't closed.
> ** This is the reason for the failure of TestDFSHAAdmin
> Second, {{DfsAdmin -setBalancerBandwidth}} doesn't work with 
> {{ViewFSOverloadScheme}}, since setBalancerBandwidth calls {{getFS()}} 
> rather than {{getDFS()}}, which resolves the scheme as of HDFS-15321.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15389) DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme

2020-06-05 Thread Ayush Saxena (Jira)
Ayush Saxena created HDFS-15389:
---

 Summary: DFSAdmin should close filesystem and dfsadmin 
-setBalancerBandwidth should work with ViewFSOverloadScheme 
 Key: HDFS-15389
 URL: https://issues.apache.org/jira/browse/HDFS-15389
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Ayush Saxena
Assignee: Ayush Saxena


Two issues here:
First, prior to HDFS-15321, when DFSAdmin was closed the FileSystem associated 
with it was closed as part of the close method. But post HDFS-15321, the 
{{FileSystem}} isn't stored as part of {{FsShell}}, hence during close the 
FileSystem still stays open and isn't closed.
** This is the reason for the failure of TestDFSHAAdmin

Second, {{DfsAdmin -setBalancerBandwidth}} doesn't work with 
{{ViewFSOverloadScheme}}, since setBalancerBandwidth calls {{getFS()}} rather 
than {{getDFS()}}, which resolves the scheme as of HDFS-15321.
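A minimal, self-contained sketch of the first point, assuming a simplified
admin tool: cache the FileSystem reference when it is first obtained and close
it in close(), so closing the tool also releases the underlying FileSystem.
The AdminToolSketch class and its methods are hypothetical and are not the
actual DFSAdmin/FsShell code.

{code:java}
import java.io.Closeable;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical sketch; not the actual DFSAdmin/FsShell implementation. */
public class AdminToolSketch implements Closeable {
  private final Configuration conf = new Configuration();
  private FileSystem fs; // cached so close() can release it

  protected FileSystem getFS() throws IOException {
    if (fs == null) {
      fs = FileSystem.get(conf); // resolve fs.defaultFS (may be viewfs or hdfs)
    }
    return fs;
  }

  public void printHomeDirectory() throws IOException {
    Path home = getFS().getHomeDirectory();
    System.out.println("Home directory: " + home);
  }

  @Override
  public void close() throws IOException {
    if (fs != null) {
      fs.close(); // release the cached FileSystem when the tool is closed
      fs = null;
    }
  }
}
{code}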



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org