[jira] [Created] (HDFS-15394) Add all available fs.viewfs.overload.scheme.target..impl classes in core-default.xml by default.
Uma Maheswara Rao G created HDFS-15394: -- Summary: Add all available fs.viewfs.overload.scheme.target..impl classes in core-default.xml by default. Key: HDFS-15394 URL: https://issues.apache.org/jira/browse/HDFS-15394 Project: Hadoop HDFS Issue Type: Sub-task Components: configuration, viewfs, viewfsOverloadScheme Affects Versions: 3.2.1 Reporter: Uma Maheswara Rao G Assignee: Uma Maheswara Rao G This proposes to add all available fs.viewfs.overload.scheme.target..impl classes in core-default.xml, so that users need not configure them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
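For illustration, an entry of the kind being proposed might look like the following in core-default.xml. The property name (shown here for the hdfs scheme) and the default value are assumptions based on the description above, not the committed defaults.

{code:xml}
<!-- Illustrative sketch only: one possible default of the kind proposed above.
     The property name and value are assumed for the hdfs scheme; the entries
     actually shipped in core-default.xml may differ. -->
<property>
  <name>fs.viewfs.overload.scheme.target.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>Target FileSystem implementation used by ViewFileSystemOverloadScheme
    when the overloaded scheme is hdfs.</description>
</property>
{code}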
[jira] [Updated] (HDFS-15389) DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme
[ https://issues.apache.org/jira/browse/HDFS-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uma Maheswara Rao G updated HDFS-15389: --- Parent: HDFS-15289 Issue Type: Sub-task (was: Bug) > DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should > work with ViewFSOverloadScheme > -- > > Key: HDFS-15389 > URL: https://issues.apache.org/jira/browse/HDFS-15389 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-15389-01.patch > > > Two issues here: > Firstly, prior to HDFS-15321, when DFSAdmin was closed, the FileSystem > associated with it was closed as part of the close method. But post HDFS-15321, > the {{FileSystem}} isn't stored as part of {{FsShell}}, hence during close, > the FileSystem still stays open and isn't closed. > * This is the reason for the failure of TestDFSHAAdmin > Second: {{DfsAdmin -setBalancerBandwidth}} doesn't work with > {{ViewFSOverloadScheme}} since setBalancerBandwidth calls {{getFS()}} > rather than {{getDFS()}}, which resolves the scheme in {{HDFS-15321}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
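To make the two fixes concrete, here is a minimal sketch. It assumes DFSAdmin-style helpers {{getFS()}} and {{getDFS()}} behave as named in the description (getDFS() resolving the underlying DistributedFileSystem even when the configured impl is ViewFSOverloadScheme); the actual HDFS-15389 patch may be structured differently.

{code:java}
import java.io.Closeable;
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Sketch only, not the actual HDFS-15389 patch. The getFS()/getDFS() helpers are
// assumed to behave as described in the issue above.
public abstract class DfsAdminSketch implements Closeable {

  /** Assumed helper: the configured FileSystem (may be ViewFSOverloadScheme). */
  protected abstract FileSystem getFS() throws IOException;

  /** Assumed helper: resolves the underlying DistributedFileSystem (per HDFS-15321). */
  protected abstract DistributedFileSystem getDFS() throws IOException;

  /** Fix 2: use getDFS() so -setBalancerBandwidth works with ViewFSOverloadScheme. */
  public int setBalancerBandwidth(long bandwidth) throws IOException {
    getDFS().setBalancerBandwidth(bandwidth);
    System.out.println("Balancer bandwidth is set to " + bandwidth);
    return 0;
  }

  /** Fix 1: close the FileSystem we used, since post HDFS-15321 FsShell no longer stores it. */
  @Override
  public void close() throws IOException {
    FileSystem fs = getFS();
    if (fs != null) {
      fs.close();
    }
  }
}
{code}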
[jira] [Commented] (HDFS-15321) Make DFSAdmin tool to work with ViewFSOverloadScheme
[ https://issues.apache.org/jira/browse/HDFS-15321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127019#comment-17127019 ] Uma Maheswara Rao G commented on HDFS-15321: Thanks for reporting it [~ayushtkn], I have reviewed your PR in HDFS-15389. I understand this could be an issue for tests. > Make DFSAdmin tool to work with ViewFSOverloadScheme > > > Key: HDFS-15321 > URL: https://issues.apache.org/jira/browse/HDFS-15321 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: dfsadmin, fs, viewfs >Affects Versions: 3.2.1 >Reporter: Uma Maheswara Rao G >Assignee: Uma Maheswara Rao G >Priority: Major > > When we enable ViewFSOverloadScheme and use the hdfs scheme as the overloaded > scheme, users work with hdfs URIs. But here DFSAdmin expects the impl class > to be DistributedFileSystem. If the impl class is ViewFSOverloadScheme, it will > fail. > So, when the impl is ViewFSOverloadScheme, we should get the corresponding child hdfs > file system to make DFSAdmin work. > This Jira makes DFSAdmin work with ViewFSOverloadScheme. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
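A rough sketch of the idea (getting the child hdfs file system when the configured impl is the overload scheme) is below. It leans on the generic {{FileSystem#getChildFileSystems()}} hook as an assumption and is not the committed HDFS-15321 code.

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Sketch only: if the configured file system is not already a DistributedFileSystem
// (e.g. it is ViewFSOverloadScheme), look through its child file systems for the DFS
// so that DFSAdmin's HDFS-specific commands still have something to talk to.
public final class ResolveDfsSketch {
  private ResolveDfsSketch() {
  }

  static DistributedFileSystem resolveDfs(FileSystem fs) throws IOException {
    if (fs instanceof DistributedFileSystem) {
      return (DistributedFileSystem) fs;
    }
    // Assumption: the overload-scheme file system exposes its resolved mount targets
    // through the generic getChildFileSystems() hook.
    for (FileSystem child : fs.getChildFileSystems()) {
      if (child instanceof DistributedFileSystem) {
        return (DistributedFileSystem) child;
      }
    }
    throw new IOException("No DistributedFileSystem found behind " + fs.getUri());
  }
}
{code}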
[jira] [Commented] (HDFS-15389) DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme
[ https://issues.apache.org/jira/browse/HDFS-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127005#comment-17127005 ] Ayush Saxena commented on HDFS-15389: - Have fixed the checkstyle in PR https://github.com/apache/hadoop/pull/2057 > DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should > work with ViewFSOverloadScheme > -- > > Key: HDFS-15389 > URL: https://issues.apache.org/jira/browse/HDFS-15389 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-15389-01.patch > > > Two Issues Here : > Firstly Prior to HDFS-15321, When DFSAdmin was closed the FileSystem > associated with it was closed as part of close method, But post HDFS-15321, > the {{FileSystem}} isn't stored as part of {{FsShell}}, hence during close, > the FileSystem still stays and isn't close. > * This is the reason for failure of TestDFSHAAdmin > Second : {{DfsAdmin -setBalancerBandwidth}} doesn't work with > {{ViewFSOverloadScheme}} since the setBalancerBandwidth calls {{getFS()}} > rather than {{getDFS()}} which resolves the scheme in {{HDFS-15321}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15330) Document the ViewFSOverloadScheme details in ViewFS guide
[ https://issues.apache.org/jira/browse/HDFS-15330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127001#comment-17127001 ] Hudson commented on HDFS-15330: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18332 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18332/]) HDFS-15330. Document the ViewFSOverloadScheme details in ViewFS guide. (github: rev 76fa0222f0d2e2d92b4a1eedba8b3e38002e8c23) * (edit) hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSCommands.md * (edit) hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/ViewFs.md * (edit) hadoop-project/src/site/site.xml * (add) hadoop-hdfs-project/hadoop-hdfs/src/site/resources/images/ViewFSOverloadScheme.png * (add) hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/ViewFsOverloadScheme.md > Document the ViewFSOverloadScheme details in ViewFS guide > - > > Key: HDFS-15330 > URL: https://issues.apache.org/jira/browse/HDFS-15330 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: viewfs, viewfsOverloadScheme >Affects Versions: 3.2.1 >Reporter: Uma Maheswara Rao G >Assignee: Uma Maheswara Rao G >Priority: Major > Fix For: 3.4.0 > > > This Jira to track for documentation of ViewFSOverloadScheme usage guide. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15330) Document the ViewFSOverloadScheme details in ViewFS guide
[ https://issues.apache.org/jira/browse/HDFS-15330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uma Maheswara Rao G updated HDFS-15330: --- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed Status: Resolved (was: Patch Available) Thanks [~ayushsaxena] for reviews! I have just committed it to trunk. > Document the ViewFSOverloadScheme details in ViewFS guide > - > > Key: HDFS-15330 > URL: https://issues.apache.org/jira/browse/HDFS-15330 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: viewfs, viewfsOverloadScheme >Affects Versions: 3.2.1 >Reporter: Uma Maheswara Rao G >Assignee: Uma Maheswara Rao G >Priority: Major > Fix For: 3.4.0 > > > This Jira to track for documentation of ViewFSOverloadScheme usage guide. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15393) Review of PendingReconstructionBlocks
[ https://issues.apache.org/jira/browse/HDFS-15393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126894#comment-17126894 ] Hadoop QA commented on HDFS-15393: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 26m 57s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 55s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 20s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 41s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 3m 3s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 1s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 37s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 37s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 37s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 42s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 23 new + 126 unchanged - 3 fixed = 149 total (was 129) {color} | | {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 40s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:red}-1{color} | {color:red} shadedclient {color} | {color:red} 4m 2s{color} | {color:red} patch has errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 39s{color} | {color:red} hadoop-hdfs in the patch failed. 
{color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 40s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 26s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 83m 10s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-2055/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/2055 | | JIRA Issue | HDFS-15393 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 61475f59368a 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 23261237054 | | Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 | | mvninstall | https://builds.apache.org/job/hadoop-multibranch/job/PR-2055/1/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt | | compile |
[jira] [Updated] (HDFS-15389) DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme
[ https://issues.apache.org/jira/browse/HDFS-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena updated HDFS-15389: Description: Two Issues Here : Firstly Prior to HDFS-15321, When DFSAdmin was closed the FileSystem associated with it was closed as part of close method, But post HDFS-15321, the {{FileSystem}} isn't stored as part of {{FsShell}}, hence during close, the FileSystem still stays and isn't close. * This is the reason for failure of TestDFSHAAdmin Second : {{DfsAdmin -setBalancerBandwidth}} doesn't work with {{ViewFSOverloadScheme}} since the setBalancerBandwidth calls {{getFS()}} rather than {{getDFS()}} which resolves the scheme in {{HDFS-15321}} was: Two Issues Here : Firstly Prior to HDFS-15321, When DFSAdmin was closed the FileSystem associated with it was closed as part of close method, But post HDFS-15321, the {{FileSystem}} isn't stored as part of {{FsShell}}, hence during close, the FileSystem still stays and isn't close. ** This is the reason for failure of TestDFSHAAdmin Second : {{DfsAdmin -setBalancerBandwidth}} doesn't work with {{ViewFSOverloadScheme}} since the setBalancerBandwidth calls {{getFS()}} rather than {{getDFS()}} which resolves the scheme in {{HDFS-15321}} > DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should > work with ViewFSOverloadScheme > -- > > Key: HDFS-15389 > URL: https://issues.apache.org/jira/browse/HDFS-15389 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-15389-01.patch > > > Two Issues Here : > Firstly Prior to HDFS-15321, When DFSAdmin was closed the FileSystem > associated with it was closed as part of close method, But post HDFS-15321, > the {{FileSystem}} isn't stored as part of {{FsShell}}, hence during close, > the FileSystem still stays and isn't close. > * This is the reason for failure of TestDFSHAAdmin > Second : {{DfsAdmin -setBalancerBandwidth}} doesn't work with > {{ViewFSOverloadScheme}} since the setBalancerBandwidth calls {{getFS()}} > rather than {{getDFS()}} which resolves the scheme in {{HDFS-15321}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15390) client fails forever when namenode ipaddr changed
[ https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126883#comment-17126883 ] Ayush Saxena commented on HDFS-15390: - Can you extend a UT for the issue? > client fails forever when namenode ipaddr changed > - > > Key: HDFS-15390 > URL: https://issues.apache.org/jira/browse/HDFS-15390 > Project: Hadoop HDFS > Issue Type: Bug > Components: dfsclient >Affects Versions: 2.10.0, 2.9.2, 3.2.1 >Reporter: Sean Chow >Priority: Major > Attachments: HDFS-15390.01.patch > > > For machine replacement, I replace my standby namenode with a new ipaddr and > keep the same hostname. Also update the client's hosts to make it resolve > correctly > When I try to run failover to transite the new namenode(let's say nn2), the > client will fail to read or write forever until it's restarted. > That make yarn nodemanager in sick state. Even the new tasks will encounter > this exception too. Until all nodemanager restart. > > {code:java} > 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: > nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000 > 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to > nn2-192-168-1-100/192.168.1.200:9000: Connection refused > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) > at > org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707) > at > org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517) > at org.apache.hadoop.ipc.Client.call(Client.java:1440) > at org.apache.hadoop.ipc.Client.call(Client.java:1401) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy9.addBlock(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399) > at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > {code} > > We can see the client has {{Address change detected}}, but it still fails. I > find out that's because when method {{updateAddress()}} return true, the > {{handleConnectionFailure()}} thow an exception that break the next retry > with the right ipaddr. > Client.java: setupConnection() > {code:java} > } catch (ConnectTimeoutException toe) { > /* Check for an address change and update the local reference. 
>* Reset the failure counter if the address was changed >*/ > if (updateAddress()) { > timeoutFailures = ioFailures = 0; > } > handleConnectionTimeout(timeoutFailures++, > maxRetriesOnSocketTimeouts, toe); > } catch (IOException ie) { > if (updateAddress()) { > timeoutFailures = ioFailures = 0; > } > // because the namenode ip changed in updateAddress(), the old namenode > ipaddress cannot be accessed now > // handleConnectionFailure will thow an exception, the next retry never have > a chance to use the right server updated in updateAddress() > handleConnectionFailure(ioFailures++, ie); > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15390) client fails forever when namenode ipaddr changed
[ https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126638#comment-17126638 ] Sean Chow edited comment on HDFS-15390 at 6/5/20, 2:57 PM: --- It's easy to reproduce. You have set up HA namenodes, and a new machine with the same hostname as nn2 (standby), with the name-data directory copied. # Use {{hdfs dfs put}} to write a big file (to make it a long-running client) # Stop the old nn2, and start the new nn2 # Update the nn2 hostname to resolve to the new ipaddr on all hosts # Failover from nn1 to nn2 # Now the client hits errors continuously. (In the yarn nodemanager scenario, this nodemanager is totally sick until restarted) There are two ways to fix this: # When updateAddress is true, do not handle ConnectionFailure this round # When an address change is detected, update the namenode proxies (only with {{ConfiguredFailoverProxyProvider}}) Method one is easy, and in this connection lifecycle the client will use the right {{server}} to connect. But when the client connection is closed and a new one is created, it will always try to getConnection from the retired ipaddr, because the namenode proxies are still the old ones. Method two solves the root cause: every time the client fails over namenodes, check whether the ipaddr changed or not. If changed, re-initialize the namenode failover proxies. was (Author: seanlook): There are two ways to fix this: # When updateAddress is true, do not handle ConnectionFailure this round # When an address change is detected, update the namenode proxies (only with {{ConfiguredFailoverProxyProvider}}) Method one is easy, and in this connection lifecycle the client will use the right {{server}} to connect. But when the client connection is closed and a new one is created, it will always try to getConnection from the retired ipaddr, because the namenode proxies are still the old ones. Method two solves the root cause: every time the client fails over namenodes, check whether the ipaddr changed or not. If changed, re-initialize the namenode failover proxies. > client fails forever when namenode ipaddr changed > - > > Key: HDFS-15390 > URL: https://issues.apache.org/jira/browse/HDFS-15390 > Project: Hadoop HDFS > Issue Type: Bug > Components: dfsclient >Affects Versions: 2.10.0, 2.9.2, 3.2.1 >Reporter: Sean Chow >Priority: Major > Attachments: HDFS-15390.01.patch > > > For machine replacement, I replace my standby namenode with a new ipaddr and > keep the same hostname. Also update the client's hosts to make it resolve > correctly > When I try to run failover to transite the new namenode(let's say nn2), the > client will fail to read or write forever until it's restarted. > That make yarn nodemanager in sick state. Even the new tasks will encounter > this exception too. Until all nodemanager restart. > > {code:java} > 20/06/02 15:12:25 WARN ipc.Client: Address change detected. 
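A sketch of what "method one" could look like in the {{setupConnection()}} catch block is below. It is illustrative only and not necessarily identical to HDFS-15390.01.patch; the idea is simply that a connection failure is not allowed to abort the retry once {{updateAddress()}} has already refreshed the NameNode address.

{code:java}
// Fragment of ipc.Client#setupConnection(), sketching "method one" above.
// Not necessarily the exact content of HDFS-15390.01.patch.
} catch (IOException ie) {
  boolean addressChanged = updateAddress();
  if (addressChanged) {
    timeoutFailures = ioFailures = 0;
  }
  try {
    handleConnectionFailure(ioFailures++, ie);
  } catch (IOException e) {
    if (!addressChanged) {
      throw e; // genuine failure: behave as before
    }
    // The address was just refreshed; swallow this failure so the next retry
    // can connect to the new NameNode ipaddr instead of giving up here.
    LOG.warn("Exception when handle ConnectionFailure: " + e.getMessage());
  }
}
{code}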
Old: > nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000 > 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to > nn2-192-168-1-100/192.168.1.200:9000: Connection refused > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) > at > org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707) > at > org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517) > at org.apache.hadoop.ipc.Client.call(Client.java:1440) > at org.apache.hadoop.ipc.Client.call(Client.java:1401) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy9.addBlock(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399) > at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > {code} > > We can
[jira] [Commented] (HDFS-15390) client fails forever when namenode ipaddr changed
[ https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126837#comment-17126837 ] Sean Chow commented on HDFS-15390: -- Patch attached. Now we can see the exception is ignored when address updated, and the file is written successfully. {code:java} 20/06/05 20:54:51 WARN ipc.Client: Address change detected. Old: nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000 20/06/05 20:54:51 DEBUG ipc.Client: Failed to connect to server: nn2-192-168-1-100/192.168.1.200:9000: try once and fail. java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ... 20/06/05 20:54:51 DEBUG hdfs.DFSOutputStream: enqueue full packet seqno: ... 20/06/05 20:54:51 DEBUG hdfs.DataStreamer: Queued packet 100076 20/06/05 20:54:51 WARN ipc.Client: Exception when handle ConnectionFailure: Connection refused {code} > client fails forever when namenode ipaddr changed > - > > Key: HDFS-15390 > URL: https://issues.apache.org/jira/browse/HDFS-15390 > Project: Hadoop HDFS > Issue Type: Bug > Components: dfsclient >Affects Versions: 2.10.0, 2.9.2, 3.2.1 >Reporter: Sean Chow >Priority: Major > Attachments: HDFS-15390.01.patch > > > For machine replacement, I replace my standby namenode with a new ipaddr and > keep the same hostname. Also update the client's hosts to make it resolve > correctly > When I try to run failover to transite the new namenode(let's say nn2), the > client will fail to read or write forever until it's restarted. > That make yarn nodemanager in sick state. Even the new tasks will encounter > this exception too. Until all nodemanager restart. > > {code:java} > 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: > nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000 > 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to > nn2-192-168-1-100/192.168.1.200:9000: Connection refused > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) > at > org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707) > at > org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517) > at org.apache.hadoop.ipc.Client.call(Client.java:1440) > at org.apache.hadoop.ipc.Client.call(Client.java:1401) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy9.addBlock(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399) > at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > {code} > > We can see the client has {{Address change detected}}, but it 
still fails. I > find out that's because when method {{updateAddress()}} return true, the > {{handleConnectionFailure()}} thow an exception that break the next retry > with the right ipaddr. > Client.java: setupConnection() > {code:java} > } catch (ConnectTimeoutException toe) { > /* Check for an address change and update the local reference. >* Reset the failure counter if the address was changed >*/ > if (updateAddress()) { > timeoutFailures = ioFailures = 0; > } > handleConnectionTimeout(timeoutFailures++, > maxRetriesOnSocketTimeouts, toe); > } catch (IOException ie) { > if (updateAddress()) { > timeoutFailures = ioFailures = 0; > } > // because the namenode ip changed in updateAddress(), the old namenode > ipaddress cannot be accessed now > // handleConnectionFailure will thow an exception, the next retry never have > a chance to use the right server updated in updateAddress() >
[jira] [Updated] (HDFS-15390) client fails forever when namenode ipaddr changed
[ https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Chow updated HDFS-15390: - Attachment: HDFS-15390.01.patch > client fails forever when namenode ipaddr changed > - > > Key: HDFS-15390 > URL: https://issues.apache.org/jira/browse/HDFS-15390 > Project: Hadoop HDFS > Issue Type: Bug > Components: dfsclient >Affects Versions: 2.10.0, 2.9.2, 3.2.1 >Reporter: Sean Chow >Priority: Major > Attachments: HDFS-15390.01.patch > > > For machine replacement, I replace my standby namenode with a new ipaddr and > keep the same hostname. Also update the client's hosts to make it resolve > correctly > When I try to run failover to transite the new namenode(let's say nn2), the > client will fail to read or write forever until it's restarted. > That make yarn nodemanager in sick state. Even the new tasks will encounter > this exception too. Until all nodemanager restart. > > {code:java} > 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: > nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000 > 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to > nn2-192-168-1-100/192.168.1.200:9000: Connection refused > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) > at > org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707) > at > org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517) > at org.apache.hadoop.ipc.Client.call(Client.java:1440) > at org.apache.hadoop.ipc.Client.call(Client.java:1401) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy9.addBlock(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399) > at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > {code} > > We can see the client has {{Address change detected}}, but it still fails. I > find out that's because when method {{updateAddress()}} return true, the > {{handleConnectionFailure()}} thow an exception that break the next retry > with the right ipaddr. > Client.java: setupConnection() > {code:java} > } catch (ConnectTimeoutException toe) { > /* Check for an address change and update the local reference. 
>* Reset the failure counter if the address was changed >*/ > if (updateAddress()) { > timeoutFailures = ioFailures = 0; > } > handleConnectionTimeout(timeoutFailures++, > maxRetriesOnSocketTimeouts, toe); > } catch (IOException ie) { > if (updateAddress()) { > timeoutFailures = ioFailures = 0; > } > // because the namenode ip changed in updateAddress(), the old namenode > ipaddress cannot be accessed now > // handleConnectionFailure will thow an exception, the next retry never have > a chance to use the right server updated in updateAddress() > handleConnectionFailure(ioFailures++, ie); > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15393) Review of PendingReconstructionBlocks
David Mollitor created HDFS-15393: - Summary: Review of PendingReconstructionBlocks Key: HDFS-15393 URL: https://issues.apache.org/jira/browse/HDFS-15393 Project: Hadoop HDFS Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor I started looking at this class based on [HDFS-15351]. * Uses {{java.sql.Time}} unnecessarily. This is confusing since Java ships with time formatters out of the box in JDK 8, and I believe it will cause issues later when trying to upgrade to JDK 9+ since java.sql is a separate module there. * Remove code where appropriate * Use the Java concurrency library for higher concurrent access to the underlying map -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
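As a rough illustration of the direction suggested (JDK 8 {{java.time}} formatting instead of {{java.sql.Time}}, plus a concurrent map), consider the sketch below. The key/value types are simplified stand-ins; this is not the actual patch.

{code:java}
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch only (simplified key/value types, not the HDFS-15393 patch).
public class PendingReconstructionSketch {
  // java.time replaces java.sql.Time for log-friendly timestamps, avoiding the
  // java.sql module when moving to JDK 9+.
  private static final DateTimeFormatter TIME_FORMAT =
      DateTimeFormatter.ofPattern("HH:mm:ss").withZone(ZoneId.systemDefault());

  // A concurrent map allows higher concurrent access than a synchronized HashMap.
  private final ConcurrentMap<Long, Long> pendingTimestamps = new ConcurrentHashMap<>();

  void markPending(long blockId, long nowMillis) {
    pendingTimestamps.put(blockId, nowMillis);
  }

  String pendingSince(long blockId) {
    Long ts = pendingTimestamps.get(blockId);
    return ts == null ? "not pending" : TIME_FORMAT.format(Instant.ofEpochMilli(ts));
  }
}
{code}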
[jira] [Commented] (HDFS-15359) EC: Allow closing a file with committed blocks
[ https://issues.apache.org/jira/browse/HDFS-15359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126794#comment-17126794 ] Hudson commented on HDFS-15359: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18331 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18331/]) HDFS-15359. EC: Allow closing a file with committed blocks. Contributed (ayushsaxena: rev 2326123705445dee534ac2c298038831b5d04a0a) * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/INodeFile.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml * (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDistributedFileSystem.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java > EC: Allow closing a file with committed blocks > -- > > Key: HDFS-15359 > URL: https://issues.apache.org/jira/browse/HDFS-15359 > Project: Hadoop HDFS > Issue Type: Improvement > Components: erasure-coding >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15359-01.patch, HDFS-15359-02.patch, > HDFS-15359-03.patch, HDFS-15359-04.patch, HDFS-15359-05.patch > > > Presently, {{dfs.namenode.file.close.num-committed-allowed}} is ignored in > case of EC blocks. But in case of heavy loads, IBR's from Datanode may get > delayed and cause the file write to fail. So, can allow EC files to close > with blocks in committed state as REP files -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15359) EC: Allow closing a file with committed blocks
[ https://issues.apache.org/jira/browse/HDFS-15359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena updated HDFS-15359: Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed Status: Resolved (was: Patch Available) > EC: Allow closing a file with committed blocks > -- > > Key: HDFS-15359 > URL: https://issues.apache.org/jira/browse/HDFS-15359 > Project: Hadoop HDFS > Issue Type: Improvement > Components: erasure-coding >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15359-01.patch, HDFS-15359-02.patch, > HDFS-15359-03.patch, HDFS-15359-04.patch, HDFS-15359-05.patch > > > Presently, {{dfs.namenode.file.close.num-committed-allowed}} is ignored in > case of EC blocks. But in case of heavy loads, IBR's from Datanode may get > delayed and cause the file write to fail. So, can allow EC files to close > with blocks in committed state as REP files -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15359) EC: Allow closing a file with committed blocks
[ https://issues.apache.org/jira/browse/HDFS-15359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126777#comment-17126777 ] Ayush Saxena commented on HDFS-15359: - Committed to trunk. Thanx [~vinayakumarb] and [~weichiu] for the reviews!!! > EC: Allow closing a file with committed blocks > -- > > Key: HDFS-15359 > URL: https://issues.apache.org/jira/browse/HDFS-15359 > Project: Hadoop HDFS > Issue Type: Improvement > Components: erasure-coding >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-15359-01.patch, HDFS-15359-02.patch, > HDFS-15359-03.patch, HDFS-15359-04.patch, HDFS-15359-05.patch > > > Presently, {{dfs.namenode.file.close.num-committed-allowed}} is ignored in > case of EC blocks. But in case of heavy loads, IBR's from Datanode may get > delayed and cause the file write to fail. So, can allow EC files to close > with blocks in committed state as REP files -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
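A simplified sketch of the behaviour described above (honouring {{dfs.namenode.file.close.num-committed-allowed}} for EC files just as for replicated ones) follows; the block-state model is heavily reduced and this is not the actual BlockManager/INodeFile change.

{code:java}
// Simplified sketch, not the committed HDFS-15359 change: a file may be closed when
// at most numCommittedAllowed of its blocks are still COMMITTED (i.e. not yet
// COMPLETE), regardless of whether the file is erasure-coded or replicated.
public final class CommittedCloseSketch {
  enum BlockState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

  static boolean canClose(BlockState[] blocks, int numCommittedAllowed) {
    int committed = 0;
    for (BlockState state : blocks) {
      if (state == BlockState.COMMITTED) {
        committed++;
      } else if (state != BlockState.COMPLETE) {
        return false; // a block is still under construction: cannot close yet
      }
    }
    return committed <= numCommittedAllowed;
  }
}
{code}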
[jira] [Commented] (HDFS-15351) Blocks Scheduled Count was wrong on Truncate
[ https://issues.apache.org/jira/browse/HDFS-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126748#comment-17126748 ] David Mollitor commented on HDFS-15351: --- Thanks for pinging me [~hemanthboyina] a few times. I have been a bit all over the place so thanks for your persistence and patience. Probably should be using {{Collection}} classes instead of native arrays, but that's not for this ticket. {code:java} PendingBlockInfo remove = pendingReconstruction.remove(lastBlock); if (remove != null) { List<DatanodeStorageInfo> locations = remove.getTargets(); DatanodeStorageInfo.decrementBlocksScheduled(locations.toArray(new DatanodeStorageInfo[0])); } {code} > Blocks Scheduled Count was wrong on Truncate > - > > Key: HDFS-15351 > URL: https://issues.apache.org/jira/browse/HDFS-15351 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15351.001.patch, HDFS-15351.002.patch, > HDFS-15351.003.patch > > > On truncate and append we remove the blocks from Reconstruction Queue > On removing the blocks from pending reconstruction, we need to decrement > Blocks Scheduled -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15391) Standby NameNode cannot load the edit log correctly due to edit log corruption, resulting in the service exiting abnormally and unable to restart
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15391: - Summary: Standby NameNode cannot load the edit log correctly due to edit log corruption, resulting in the service exiting abnormally and unable to restart (was: Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart) > Standby NameNode cannot load the edit log correctly due to edit log > corruption, resulting in the service exiting abnormally and unable to restart > - > > Key: HDFS-15391 > URL: https://issues.apache.org/jira/browse/HDFS-15391 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > > In the cluster version 3.2.0 production environment, > We found that due to edit log corruption, Standby NameNode could not > properly load the Ediltog log, result in abnormal exit of the service and > failure to restart > {noformat} > The specific scenario is that Flink writes to HDFS(replication file), and in > the case of an exception to the write file, the following operations are > performed : > 1.close file > 2.open file > 3.truncate file > 4.append file > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126740#comment-17126740 ] huhaiyang commented on HDFS-15391: -- [~ayushtkn] Thank you for reply {quote} These are two different traces, correct? You tried restarting the namenode twice, and once it failed for CLOSE_OP and other time with TRUNCATE, Correct? {quote} Yes, These are two different traces, I'll add more details later. > Due to edit log corruption, Standby NameNode could not properly load the > Ediltog log, result in abnormal exit of the service and failure to restart > --- > > Key: HDFS-15391 > URL: https://issues.apache.org/jira/browse/HDFS-15391 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > > In the cluster version 3.2.0 production environment, > We found that due to edit log corruption, Standby NameNode could not > properly load the Ediltog log, result in abnormal exit of the service and > failure to restart > {noformat} > The specific scenario is that Flink writes to HDFS(replication file), and in > the case of an exception to the write file, the following operations are > performed : > 1.close file > 2.open file > 3.truncate file > 4.append file > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126730#comment-17126730 ] Ayush Saxena edited comment on HDFS-15391 at 6/5/20, 12:31 PM: --- Thanx, These are two different traces, correct? You tried restarting the namenode twice, and once it failed for CLOSE_OP and other time with TRUNCATE, Correct? What was the exception during write? was (Author: ayushtkn): Thanx, These are two different traces, correct? You tried restarting the namenode twice, and once it failed for CLOSE_OP and other time with TRUNCATE, Correct? > Due to edit log corruption, Standby NameNode could not properly load the > Ediltog log, result in abnormal exit of the service and failure to restart > --- > > Key: HDFS-15391 > URL: https://issues.apache.org/jira/browse/HDFS-15391 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > > In the cluster version 3.2.0 production environment, > We found that due to edit log corruption, Standby NameNode could not > properly load the Ediltog log, result in abnormal exit of the service and > failure to restart > {noformat} > The specific scenario is that Flink writes to HDFS(replication file), and in > the case of an exception to the write file, the following operations are > performed : > 1.close file > 2.open file > 3.truncate file > 4.append file > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15392) DistributedFileSystem#concat API can create a large number of small blocks
Lokesh Jain created HDFS-15392: -- Summary: DistributedFileSystem#concat API can create a large number of small blocks Key: HDFS-15392 URL: https://issues.apache.org/jira/browse/HDFS-15392 Project: Hadoop HDFS Issue Type: Bug Reporter: Lokesh Jain DistributedFileSystem#concat moves blocks from the source files to the target. If the API is repeatedly used on small files, it can create a large number of small blocks in the target file. This Jira aims to optimize the API to avoid the small-blocks issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
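For illustration, a small client-side usage sketch of the behaviour described above is shown below (paths and counts are made up). Each {{concat}} moves the source's existing block(s) into the target instead of rewriting data, which is how the target ends up with many small blocks.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Usage sketch only; paths are illustrative. Repeatedly concatenating small files
// moves one small block per source file into the target, so the target accumulates
// a large number of small blocks.
public class ConcatSmallBlocksExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path target = new Path("/tmp/concat-target");
    for (int i = 0; i < 1000; i++) {
      Path src = new Path("/tmp/small-" + i);   // e.g. a few KB each, one block per file
      fs.concat(target, new Path[] { src });    // blocks are moved, not rewritten
    }
    // target now holds ~1000 tiny blocks, which is the issue this Jira aims to optimize
  }
}
{code}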
[jira] [Comment Edited] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126724#comment-17126724 ] huhaiyang edited comment on HDFS-15391 at 6/5/20, 12:31 PM: Standby NameNode exception log: {noformat} 2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=path, replication=3, mtime=1591266620287, atime=1591264800229, blockSize=134217728, blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, blk_11382041307_10353383098, blk_11382049845_10353392031, blk_11382057341_10353399899, blk_11382071544_10353415171, blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585] java.io.IOException: File is not under construction: hdfs://path at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423) 2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error encountered while tailing edits. Shutting down standby NN. 
{noformat} {noformat} 020-06-04 22:28:04,025 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation TruncateOp [src=xxxpath, clientName=DFSClient_NONMAPREDUCE_-295521672_77, clientMachine=xxx, newLength=3210623016, timestamp=1591270219348, truncateBlock=blk_11382198393_10355810378, opCode=OP_TRUNCATE, txid=126074587217] java.lang.IllegalStateException: file is already under construction at com.google.common.base.Preconditions.checkState(Preconditions.java:145) at org.apache.hadoop.hdfs.server.namenode.INodeFile.toUnderConstruction(INodeFile.java:329) at org.apache.hadoop.hdfs.server.namenode.FSDirTruncateOp.prepareFileForTruncate(FSDirTruncateOp.java:222) at org.apache.hadoop.hdfs.server.namenode.FSDirTruncateOp.unprotectedTruncate(FSDirTruncateOp.java:183) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:986) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1123) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:730) at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:669) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:731) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:974) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:947) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1680) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1747) 2020-06-04 22:28:04,027 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage java.io.IOException: java.lang.IllegalStateException: file is already under construction at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:268) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331) at
[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15391: - Description: In the cluster version 3.2.0 production environment, We found that due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart {code:java} The specific scenario is that Flink writes to HDFS(replication file), and in the case of an exception to the write file, the following operations are performed : 1.close file 2.open file 3.truncate file 4.append file {code} was: In the cluster version 3.2.0 production environment, We found that due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart {code:java} The specific scenario is that Flink writes to HDFS(replication file), and in the case of an exception to the write file, the following operations are performed {code} # Close file # 2. Open file # 3. truncate file # 4. append file > Due to edit log corruption, Standby NameNode could not properly load the > Ediltog log, result in abnormal exit of the service and failure to restart > --- > > Key: HDFS-15391 > URL: https://issues.apache.org/jira/browse/HDFS-15391 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > > In the cluster version 3.2.0 production environment, > We found that due to edit log corruption, Standby NameNode could not > properly load the Ediltog log, result in abnormal exit of the service and > failure to restart > {code:java} > The specific scenario is that Flink writes to HDFS(replication file), and in > the case of an exception to the write file, the following operations are > performed : > 1.close file > 2.open file > 3.truncate file > 4.append file > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
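For reference, a minimal client-side sketch of the operation sequence described above (close, open, truncate, append) against a replicated file is shown below; the path and sizes are illustrative, and running this alone does not necessarily reproduce the edit-log corruption.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of the write sequence described above. Illustrative only: the real
// scenario involves Flink's failure handling, and this by itself may not corrupt the
// edit log.
public class CloseTruncateAppendSequence {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/tmp/flink-like-file");

    FSDataOutputStream out = fs.create(file, true);
    out.write(new byte[4096]);
    out.close();                           // 1. close file

    // 2. the file is (re)opened for mutation as part of the next two calls
    if (!fs.truncate(file, 1024)) {        // 3. truncate file
      Thread.sleep(1000);                  // truncate returned false: block recovery in progress
    }

    out = fs.append(file);                 // 4. append file
    out.write(new byte[512]);
    out.close();
  }
}
{code}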
[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15391: - Description: In the cluster version 3.2.0 production environment, We found that due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart {noformat} The specific scenario is that Flink writes to HDFS(replication file), and in the case of an exception to the write file, the following operations are performed : 1.close file 2.open file 3.truncate file 4.append file {noformat} was: In the cluster version 3.2.0 production environment, We found that due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart {code:java} The specific scenario is that Flink writes to HDFS(replication file), and in the case of an exception to the write file, the following operations are performed : 1.close file 2.open file 3.truncate file 4.append file {code} > Due to edit log corruption, Standby NameNode could not properly load the > Ediltog log, result in abnormal exit of the service and failure to restart > --- > > Key: HDFS-15391 > URL: https://issues.apache.org/jira/browse/HDFS-15391 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > > In the cluster version 3.2.0 production environment, > We found that due to edit log corruption, Standby NameNode could not > properly load the Ediltog log, result in abnormal exit of the service and > failure to restart > {noformat} > The specific scenario is that Flink writes to HDFS(replication file), and in > the case of an exception to the write file, the following operations are > performed : > 1.close file > 2.open file > 3.truncate file > 4.append file > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15391: - Description: In the cluster version 3.2.0 production environment, We found that due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart {code:java} The specific scenario is that Flink writes to HDFS(replication file), and in the case of an exception to the write file, the following operations are performed {code} # Close file # 2. Open file # 3. truncate file # 4. append file was: In the cluster version 3.2.0 production environment, We found that due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart The specific scenario is that Flink writes to HDFS, and in the case of an exception to the write file, the following operations are performed 1. Close file 2. Open file 3. truncate file 4. append file > Due to edit log corruption, Standby NameNode could not properly load the > Ediltog log, result in abnormal exit of the service and failure to restart > --- > > Key: HDFS-15391 > URL: https://issues.apache.org/jira/browse/HDFS-15391 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > > In the cluster version 3.2.0 production environment, > We found that due to edit log corruption, Standby NameNode could not > properly load the Ediltog log, result in abnormal exit of the service and > failure to restart > {code:java} > The specific scenario is that Flink writes to HDFS(replication file), and in > the case of an exception to the write file, the following operations are > performed > {code} > # Close file > # 2. Open file > # 3. truncate file > # 4. append file -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126730#comment-17126730 ] Ayush Saxena commented on HDFS-15391: - Thanks. These are two different traces, correct? You tried restarting the namenode twice, and it failed once on CLOSE_OP and the other time on TRUNCATE, correct? > Due to edit log corruption, Standby NameNode could not properly load the > Ediltog log, result in abnormal exit of the service and failure to restart > --- > > Key: HDFS-15391 > URL: https://issues.apache.org/jira/browse/HDFS-15391 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > > In the cluster version 3.2.0 production environment, > We found that due to edit log corruption, Standby NameNode could not > properly load the Ediltog log, result in abnormal exit of the service and > failure to restart > > The specific scenario is that Flink writes to HDFS, and in the case of an > exception to the write file, the following operations are performed > 1. Close file > 2. Open file > 3. truncate file > 4. append file > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15390) client fails forever when namenode ipaddr changed
[ https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Chow updated HDFS-15390: - Description: For machine replacement, I replace my standby namenode with a new ipaddr and keep the same hostname. Also update the client's hosts to make it resolve correctly When I try to run failover to transite the new namenode(let's say nn2), the client will fail to read or write forever until it's restarted. That make yarn nodemanager in sick state. Even the new tasks will encounter this exception too. Until all nodemanager restart. {code:java} 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to nn2-192-168-1-100/192.168.1.200:9000: Connection refused java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517) at org.apache.hadoop.ipc.Client.call(Client.java:1440) at org.apache.hadoop.ipc.Client.call(Client.java:1401) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy9.addBlock(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399) at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) {code} We can see the client has {{Address change detected}}, but it still fails. I find out that's because when method {{updateAddress()}} return true, the {{handleConnectionFailure()}} thow an exception that break the next retry with the right ipaddr. Client.java: setupConnection() {code:java} } catch (ConnectTimeoutException toe) { /* Check for an address change and update the local reference. * Reset the failure counter if the address was changed */ if (updateAddress()) { timeoutFailures = ioFailures = 0; } handleConnectionTimeout(timeoutFailures++, maxRetriesOnSocketTimeouts, toe); } catch (IOException ie) { if (updateAddress()) { timeoutFailures = ioFailures = 0; } // because the namenode ip changed in updateAddress(), the old namenode ipaddress cannot be accessed now // handleConnectionFailure will thow an exception, the next retry never have a chance to use the right server updated in updateAddress() handleConnectionFailure(ioFailures++, ie); } {code} was: For machine replacement, I replace my standby namenode with a new ipaddr and keep the same hostname. 
Also update the client's hosts to make it resolve correctly When I try to run failover to transite the new namenode(let's say nn2), the client will fail to read or write forever until it's restarted. That make yarn nodemanager in sick state. Even the new tasks will encounter this exception too. Until all nodemanager restart. {code:java} 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to nn2-192-168-1-100/192.168.1.200:9000: Connection refused java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707) at
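The analysis in the description above says the retry loop dies because handleConnectionFailure() throws on the same round in which updateAddress() re-resolved the server, so the new IP is never used. Outside of Hadoop's Client internals, the intended behaviour can be shown with a small self-contained helper; ReResolvingConnector below is a made-up class for illustration, not a Hadoop API, and the JVM's own DNS cache (networkaddress.cache.ttl) has to expire before a fresh lookup can observe the new IP:

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Illustrative helper only: retry against the re-resolved address instead of
 *  charging the stale-address failure to the retry budget. */
public class ReResolvingConnector {
  public static Socket connect(String host, int port, int maxFailures)
      throws IOException, InterruptedException {
    InetSocketAddress addr = new InetSocketAddress(host, port);   // resolves host -> IP
    int failures = 0;
    while (true) {
      try {
        Socket s = new Socket();
        s.connect(addr, 10_000);
        return s;
      } catch (IOException ioe) {
        InetSocketAddress fresh = new InetSocketAddress(host, port);  // re-resolve DNS
        if (fresh.getAddress() != null && !fresh.getAddress().equals(addr.getAddress())) {
          // Address change detected: retry against the new IP and reset the counter.
          addr = fresh;
          failures = 0;
          continue;
        }
        if (++failures > maxFailures) {
          throw ioe;          // same address, retries exhausted
        }
        Thread.sleep(1_000);  // simple backoff before the next attempt
      }
    }
  }
}
{code}

In Client.java itself the equivalent adjustment would presumably be to skip (or retry past) handleConnectionFailure() on the round where updateAddress() returned true, which appears to be what the inline comment in the quoted snippet is pointing at.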
[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15391: - Description: In the cluster version 3.2.0 production environment, We found that due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart The specific scenario is that Flink writes to HDFS, and in the case of an exception to the write file, the following operations are performed 1. Close file 2. Open file 3. truncate file 4. append file was: In the cluster version 3.2.0 production environment, We found that due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart > Due to edit log corruption, Standby NameNode could not properly load the > Ediltog log, result in abnormal exit of the service and failure to restart > --- > > Key: HDFS-15391 > URL: https://issues.apache.org/jira/browse/HDFS-15391 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > > In the cluster version 3.2.0 production environment, > We found that due to edit log corruption, Standby NameNode could not > properly load the Ediltog log, result in abnormal exit of the service and > failure to restart > > The specific scenario is that Flink writes to HDFS, and in the case of an > exception to the write file, the following operations are performed > 1. Close file > 2. Open file > 3. truncate file > 4. append file > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15391: - Description: In the cluster version 3.2.0 production environment, We found that due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart was: In the cluster version 3.2.0 production environment, We found that due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart > Due to edit log corruption, Standby NameNode could not properly load the > Ediltog log, result in abnormal exit of the service and failure to restart > --- > > Key: HDFS-15391 > URL: https://issues.apache.org/jira/browse/HDFS-15391 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > > In the cluster version 3.2.0 production environment, > We found that due to edit log corruption, Standby NameNode could not > properly load the Ediltog log, result in abnormal exit of the service and > failure to restart > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15391: - Description: In the cluster version 3.2.0 production environment, We found that due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart was: In the cluster version 3.2.0 production environment, We found that due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart Standby NameNode exception log: 2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=path, replication=3, mtime=1591266620287, atime=1591264800229, blockSize=134217728, blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, blk_11382041307_10353383098, blk_11382049845_10353392031, blk_11382057341_10353399899, blk_11382071544_10353415171, blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585] java.io.IOException: File is not under construction: hdfs://path at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423) 2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error encountered while tailing edits. Shutting down standby NN. 
020-06-04 22:28:04,025 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation TruncateOp [src=xxxpath, clientName=DFSClient_NONMAPREDUCE_-295521672_77, clientMachine=xxx, newLength=3210623016, timestamp=1591270219348, truncateBlock=blk_11382198393_10355810378, opCode=OP_TRUNCATE, txid=126074587217] java.lang.IllegalStateException: file is already under construction at com.google.common.base.Preconditions.checkState(Preconditions.java:145) at org.apache.hadoop.hdfs.server.namenode.INodeFile.toUnderConstruction(INodeFile.java:329) at org.apache.hadoop.hdfs.server.namenode.FSDirTruncateOp.prepareFileForTruncate(FSDirTruncateOp.java:222) at org.apache.hadoop.hdfs.server.namenode.FSDirTruncateOp.unprotectedTruncate(FSDirTruncateOp.java:183) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:986) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1123) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:730) at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:669) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:731) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:974) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:947) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1680) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1747) 2020-06-04 22:28:04,027 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage java.io.IOException: java.lang.IllegalStateException: file is already under construction at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:268) at
[jira] [Commented] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126724#comment-17126724 ] huhaiyang commented on HDFS-15391: -- Standby NameNode exception log: 2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=path, replication=3, mtime=1591266620287, atime=1591264800229, blockSize=134217728, blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, blk_11382041307_10353383098, blk_11382049845_10353392031, blk_11382057341_10353399899, blk_11382071544_10353415171, blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585] java.io.IOException: File is not under construction: hdfs://path at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423) 2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error encountered while tailing edits. Shutting down standby NN. 
020-06-04 22:28:04,025 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation TruncateOp [src=xxxpath, clientName=DFSClient_NONMAPREDUCE_-295521672_77, clientMachine=xxx, newLength=3210623016, timestamp=1591270219348, truncateBlock=blk_11382198393_10355810378, opCode=OP_TRUNCATE, txid=126074587217] java.lang.IllegalStateException: file is already under construction at com.google.common.base.Preconditions.checkState(Preconditions.java:145) at org.apache.hadoop.hdfs.server.namenode.INodeFile.toUnderConstruction(INodeFile.java:329) at org.apache.hadoop.hdfs.server.namenode.FSDirTruncateOp.prepareFileForTruncate(FSDirTruncateOp.java:222) at org.apache.hadoop.hdfs.server.namenode.FSDirTruncateOp.unprotectedTruncate(FSDirTruncateOp.java:183) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:986) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1123) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:730) at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:669) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:731) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:974) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:947) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1680) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1747) 2020-06-04 22:28:04,027 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage java.io.IOException: java.lang.IllegalStateException: file is already under construction at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:268) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1123) at
[jira] [Commented] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126723#comment-17126723 ] Ayush Saxena commented on HDFS-15391: - Do you have backported HDFS-7663 in that? If yes, HDFS-14581 may help. Else, can you give more background or Audit Logs or anything more. > Due to edit log corruption, Standby NameNode could not properly load the > Ediltog log, result in abnormal exit of the service and failure to restart > --- > > Key: HDFS-15391 > URL: https://issues.apache.org/jira/browse/HDFS-15391 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > > In the cluster version 3.2.0 production environment, > We found that due to edit log corruption, Standby NameNode could not properly > load the Ediltog log, result in abnormal exit of the service and failure to > restart > This is the exception it throws: > 2020-06-04 18:32:11,561 ERROR > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception > on operation CloseOp [length=0, inodeId=0, path=path, replication=3, > mtime=1591266620287, atime=1591264800229, blockSize=134217728, > blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, > blk_11382041307_10353383098, blk_11382049845_10353392031, > blk_11382057341_10353399899, blk_11382071544_10353415171, > blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, > aclEntries=null, clientName=, clientMachine=, overwrite=false, > storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, > txid=126060943585] > java.io.IOException: File is not under construction: path > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427) > at > org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423) > 2020-06-04 18:32:11,561 ERROR > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error > encountered while tailing edits. Shutting down standby NN. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15391: - Description: In the cluster version 3.2.0 production environment, We found that due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart Standby NameNode exception log: 2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=path, replication=3, mtime=1591266620287, atime=1591264800229, blockSize=134217728, blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, blk_11382041307_10353383098, blk_11382049845_10353392031, blk_11382057341_10353399899, blk_11382071544_10353415171, blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585] java.io.IOException: File is not under construction: hdfs://path at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423) 2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error encountered while tailing edits. Shutting down standby NN. 
020-06-04 22:28:04,025 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation TruncateOp [src=xxxpath, clientName=DFSClient_NONMAPREDUCE_-295521672_77, clientMachine=xxx, newLength=3210623016, timestamp=1591270219348, truncateBlock=blk_11382198393_10355810378, opCode=OP_TRUNCATE, txid=126074587217] java.lang.IllegalStateException: file is already under construction at com.google.common.base.Preconditions.checkState(Preconditions.java:145) at org.apache.hadoop.hdfs.server.namenode.INodeFile.toUnderConstruction(INodeFile.java:329) at org.apache.hadoop.hdfs.server.namenode.FSDirTruncateOp.prepareFileForTruncate(FSDirTruncateOp.java:222) at org.apache.hadoop.hdfs.server.namenode.FSDirTruncateOp.unprotectedTruncate(FSDirTruncateOp.java:183) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:986) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1123) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:730) at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:669) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:731) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:974) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:947) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1680) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1747) 2020-06-04 22:28:04,027 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage java.io.IOException: java.lang.IllegalStateException: file is already under construction at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:268) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:753) at
[jira] [Commented] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126722#comment-17126722 ] Stephen O'Donnell commented on HDFS-13179: -- Ah good to know. Seems I have been wasting some time pulling changes onto branch 3.0 then :-( At least the branch can be compiled now, but we probably don't need to bother committing the updated patch I just uploaded. > TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails > intermittently > -- > > Key: HDFS-13179 > URL: https://issues.apache.org/jira/browse/HDFS-13179 > Project: Hadoop HDFS > Issue Type: Bug > Components: fs >Affects Versions: 3.0.0 >Reporter: Gabor Bota >Assignee: Ahmed Hussein >Priority: Critical > Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2, 2.10.1 > > Attachments: HDFS-13179-branch-2.10.003.patch, > HDFS-13179-branch-3.0.003.patch, HDFS-13179.001.patch, HDFS-13179.002.patch, > HDFS-13179.003.patch, test runs.zip > > > The error caused by TimeoutException because the test is waiting to ensure > that the file is replicated to DISK storage but the replication can't be > finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), > but the file is still on RAM_DISK - so there is no data loss. > Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially > fixes the flakiness. > {code:java} > try { > ensureFileReplicasOnStorageType(path1, DEFAULT); > }catch (TimeoutException t){ > LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on > RAM_DISK"); > ensureFileReplicasOnStorageType(path1, RAM_DISK); > } > } > {code} > Some thoughts: > * Successful and failed tests run similar to the point when datanode > restarts. Restart line is the following in the log: LazyPersistTestCase - > Restarting the DataNode > * There is a line which only occurs in the failed test: *addStoredBlock: > Redundant addStoredBlock request received for blk_1073741825_1001 on node > 127.0.0.1:49455 size 5242880* > * This redundant request at BlockManager#addStoredBlock could be the main > reason for the test fail. Something wrong with the gen stamp? Corrupt > replicas? > = > Current fail ratio based on my test of TestLazyPersistReplicaRecovery: > 1000 runs, 34 failures (3.4% fail) > Failure rate analysis: > TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4% > 33 failures caused by: {noformat} > java.util.concurrent.TimeoutException: Timed out waiting for condition. > Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 > on 39589" > {noformat} > 1 failure caused by: {noformat} > java.net.BindException: Problem binding to [localhost:56729] > java.net.BindException: Address already in use; For more details see: > http://wiki.apache.org/hadoop/BindException at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49) > Caused by: java.net.BindException: Address already in use at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49) > {noformat} > = > Example stacktrace: > {noformat} > Timed out waiting for condition. 
Thread diagnostics: > Timestamp: 2017-11-01 10:36:49,499 > "Thread-1" prio=5 tid=13 runnable > java.lang.Thread.State: RUNNABLE > at java.lang.Thread.dumpThreads(Native Method) > at java.lang.Thread.getAllStackTraces(Thread.java:1610) > at > org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87) > at > org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73) > at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > ... > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail:
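The workaround quoted in the description boils down to a generic "wait for the preferred condition, otherwise accept a fallback" pattern. As a rough, self-contained restatement of that idea in plain Java (independent of LazyPersistTestCase and its helpers, which are not reproduced here):

{code:java}
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Generic form of the fallback described above: wait for the preferred condition,
 *  and only if it never holds within the timeout, accept the fallback condition. */
public final class WaitWithFallback {
  public static void waitForEither(BooleanSupplier preferred, BooleanSupplier fallback,
      long timeoutMillis) throws TimeoutException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (System.currentTimeMillis() < deadline) {
      if (preferred.getAsBoolean()) {
        return;                          // e.g. replicas reached DISK in time
      }
      Thread.sleep(100);
    }
    if (fallback.getAsBoolean()) {
      return;                            // e.g. replicas still on RAM_DISK: no data loss
    }
    throw new TimeoutException("neither the preferred nor the fallback condition held");
  }
}
{code}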
[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen O'Donnell updated HDFS-13179: - Attachment: HDFS-13179-branch-3.0.003.patch > TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails > intermittently > -- > > Key: HDFS-13179 > URL: https://issues.apache.org/jira/browse/HDFS-13179 > Project: Hadoop HDFS > Issue Type: Bug > Components: fs >Affects Versions: 3.0.0 >Reporter: Gabor Bota >Assignee: Ahmed Hussein >Priority: Critical > Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2, 2.10.1 > > Attachments: HDFS-13179-branch-2.10.003.patch, > HDFS-13179-branch-3.0.003.patch, HDFS-13179.001.patch, HDFS-13179.002.patch, > HDFS-13179.003.patch, test runs.zip > > > The error caused by TimeoutException because the test is waiting to ensure > that the file is replicated to DISK storage but the replication can't be > finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), > but the file is still on RAM_DISK - so there is no data loss. > Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially > fixes the flakiness. > {code:java} > try { > ensureFileReplicasOnStorageType(path1, DEFAULT); > }catch (TimeoutException t){ > LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on > RAM_DISK"); > ensureFileReplicasOnStorageType(path1, RAM_DISK); > } > } > {code} > Some thoughts: > * Successful and failed tests run similar to the point when datanode > restarts. Restart line is the following in the log: LazyPersistTestCase - > Restarting the DataNode > * There is a line which only occurs in the failed test: *addStoredBlock: > Redundant addStoredBlock request received for blk_1073741825_1001 on node > 127.0.0.1:49455 size 5242880* > * This redundant request at BlockManager#addStoredBlock could be the main > reason for the test fail. Something wrong with the gen stamp? Corrupt > replicas? > = > Current fail ratio based on my test of TestLazyPersistReplicaRecovery: > 1000 runs, 34 failures (3.4% fail) > Failure rate analysis: > TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4% > 33 failures caused by: {noformat} > java.util.concurrent.TimeoutException: Timed out waiting for condition. > Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 > on 39589" > {noformat} > 1 failure caused by: {noformat} > java.net.BindException: Problem binding to [localhost:56729] > java.net.BindException: Address already in use; For more details see: > http://wiki.apache.org/hadoop/BindException at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49) > Caused by: java.net.BindException: Address already in use at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49) > {noformat} > = > Example stacktrace: > {noformat} > Timed out waiting for condition. 
Thread diagnostics: > Timestamp: 2017-11-01 10:36:49,499 > "Thread-1" prio=5 tid=13 runnable > java.lang.Thread.State: RUNNABLE > at java.lang.Thread.dumpThreads(Native Method) > at java.lang.Thread.getAllStackTraces(Thread.java:1610) > at > org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87) > at > org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73) > at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > ... > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15391: - Description: In the cluster version 3.2.0 production environment, We found that due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart This is the exception it throws: 2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=path, replication=3, mtime=1591266620287, atime=1591264800229, blockSize=134217728, blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, blk_11382041307_10353383098, blk_11382049845_10353392031, blk_11382057341_10353399899, blk_11382071544_10353415171, blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585] java.io.IOException: File is not under construction: path at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423) 2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error encountered while tailing edits. Shutting down standby NN. 
was: In the cluster version 3.2.0 production environment, We found that due to edit log corruption, Standby NameNode could not properly load the Ediltog log, resulting in abnormal exit of the service and failure to restart This is the exception it throws: 2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=path, replication=3, mtime=1591266620287, atime=1591264800229, blockSize=134217728, blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, blk_11382041307_10353383098, blk_11382049845_10353392031, blk_11382057341_10353399899, blk_11382071544_10353415171, blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585] java.io.IOException: File is not under construction: path at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423) 2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error encountered while tailing edits. Shutting down standby NN. > Due to edit log corruption, Standby NameNode could not properly load the > Ediltog log, result in abnormal exit of the service and failure to restart > --- > > Key: HDFS-15391 > URL:
[jira] [Updated] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15391: - Summary: Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, result in abnormal exit of the service and failure to restart (was: Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, resulting in abnormal exit of the service and failure to restart) > Due to edit log corruption, Standby NameNode could not properly load the > Ediltog log, result in abnormal exit of the service and failure to restart > --- > > Key: HDFS-15391 > URL: https://issues.apache.org/jira/browse/HDFS-15391 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > > In the cluster version 3.2.0 production environment, > We found that due to edit log corruption, Standby NameNode could not properly > load the Ediltog log, resulting in abnormal exit of the service and failure > to restart > This is the exception it throws: > 2020-06-04 18:32:11,561 ERROR > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception > on operation CloseOp [length=0, inodeId=0, path=path, replication=3, > mtime=1591266620287, atime=1591264800229, blockSize=134217728, > blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, > blk_11382041307_10353383098, blk_11382049845_10353392031, > blk_11382057341_10353399899, blk_11382071544_10353415171, > blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, > aclEntries=null, clientName=, clientMachine=, overwrite=false, > storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, > txid=126060943585] > java.io.IOException: File is not under construction: path > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427) > at > org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423) > 2020-06-04 18:32:11,561 ERROR > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error > encountered while tailing edits. Shutting down standby NN. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15391) Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, resulting in abnormal exit of the service and failure to restart
huhaiyang created HDFS-15391: Summary: Due to edit log corruption, Standby NameNode could not properly load the Ediltog log, resulting in abnormal exit of the service and failure to restart Key: HDFS-15391 URL: https://issues.apache.org/jira/browse/HDFS-15391 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 3.2.0 Reporter: huhaiyang In the cluster version 3.2.0 production environment, We found that due to edit log corruption, Standby NameNode could not properly load the Ediltog log, resulting in abnormal exit of the service and failure to restart This is the exception it throws: 2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=path, replication=3, mtime=1591266620287, atime=1591264800229, blockSize=134217728, blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, blk_11382041307_10353383098, blk_11382049845_10353392031, blk_11382057341_10353399899, blk_11382071544_10353415171, blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585] java.io.IOException: File is not under construction: path at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423) 2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error encountered while tailing edits. Shutting down standby NN. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126718#comment-17126718 ] Ayush Saxena commented on HDFS-13179: - branch-3.0 is EOL? https://cwiki.apache.org/confluence/display/HADOOP/EOL+%28End-of-life%29+Release+Branches There doesn't seems to be any need to push to branch-3.0 then? > TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails > intermittently > -- > > Key: HDFS-13179 > URL: https://issues.apache.org/jira/browse/HDFS-13179 > Project: Hadoop HDFS > Issue Type: Bug > Components: fs >Affects Versions: 3.0.0 >Reporter: Gabor Bota >Assignee: Ahmed Hussein >Priority: Critical > Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2, 2.10.1 > > Attachments: HDFS-13179-branch-2.10.003.patch, HDFS-13179.001.patch, > HDFS-13179.002.patch, HDFS-13179.003.patch, test runs.zip > > > The error caused by TimeoutException because the test is waiting to ensure > that the file is replicated to DISK storage but the replication can't be > finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), > but the file is still on RAM_DISK - so there is no data loss. > Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially > fixes the flakiness. > {code:java} > try { > ensureFileReplicasOnStorageType(path1, DEFAULT); > }catch (TimeoutException t){ > LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on > RAM_DISK"); > ensureFileReplicasOnStorageType(path1, RAM_DISK); > } > } > {code} > Some thoughts: > * Successful and failed tests run similar to the point when datanode > restarts. Restart line is the following in the log: LazyPersistTestCase - > Restarting the DataNode > * There is a line which only occurs in the failed test: *addStoredBlock: > Redundant addStoredBlock request received for blk_1073741825_1001 on node > 127.0.0.1:49455 size 5242880* > * This redundant request at BlockManager#addStoredBlock could be the main > reason for the test fail. Something wrong with the gen stamp? Corrupt > replicas? > = > Current fail ratio based on my test of TestLazyPersistReplicaRecovery: > 1000 runs, 34 failures (3.4% fail) > Failure rate analysis: > TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4% > 33 failures caused by: {noformat} > java.util.concurrent.TimeoutException: Timed out waiting for condition. > Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 > on 39589" > {noformat} > 1 failure caused by: {noformat} > java.net.BindException: Problem binding to [localhost:56729] > java.net.BindException: Address already in use; For more details see: > http://wiki.apache.org/hadoop/BindException at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49) > Caused by: java.net.BindException: Address already in use at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49) > {noformat} > = > Example stacktrace: > {noformat} > Timed out waiting for condition. 
Thread diagnostics: > Timestamp: 2017-11-01 10:36:49,499 > "Thread-1" prio=5 tid=13 runnable > java.lang.Thread.State: RUNNABLE > at java.lang.Thread.dumpThreads(Native Method) > at java.lang.Thread.getAllStackTraces(Thread.java:1610) > at > org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87) > at > org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73) > at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > ... > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15386) ReplicaNotFoundException keeps happening in DN after removing multiple DN's data directories
[ https://issues.apache.org/jira/browse/HDFS-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen O'Donnell updated HDFS-15386: - Fix Version/s: 3.1.5 3.4.0 3.3.1 3.2.2 3.0.4 > ReplicaNotFoundException keeps happening in DN after removing multiple DN's > data directories > > > Key: HDFS-15386 > URL: https://issues.apache.org/jira/browse/HDFS-15386 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > Fix For: 3.0.4, 3.2.2, 3.3.1, 3.4.0, 3.1.5 > > > When removing volumes, we need to invalidate all the blocks in the volumes. > In the following code (FsDatasetImpl), we keep the blocks that will be > invalidate in *blkToInvalidate* map. However as the key of the map is *bpid* > (Block Pool ID), it will be overwritten by other removed volumes. As a > result, the map will have only the blocks of the last volume we are removing, > and invalidate only them: > {code:java} > for (String bpid : volumeMap.getBlockPoolList()) { > List blocks = new ArrayList<>(); > for (Iterator it = > volumeMap.replicas(bpid).iterator(); it.hasNext();) { > ReplicaInfo block = it.next(); > final StorageLocation blockStorageLocation = > block.getVolume().getStorageLocation(); > LOG.trace("checking for block " + block.getBlockId() + > " with storageLocation " + blockStorageLocation); > if (blockStorageLocation.equals(sdLocation)) { > blocks.add(block); > it.remove(); > } > } > blkToInvalidate.put(bpid, blocks); > } > {code} > [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
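The overwrite described in the quoted report is easy to see in isolation. The toy program below is not the committed patch; it just contrasts the put()-per-volume pattern from the snippet with an accumulate-per-block-pool alternative, using plain longs in place of ReplicaInfo:

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlkToInvalidateDemo {
  public static void main(String[] args) {
    Map<String, List<Long>> blkToInvalidate = new HashMap<>();
    String bpid = "BP-1";                       // one block pool, two removed volumes

    List<Long> volume1Blocks = List.of(1L, 2L); // blocks found on the first removed volume
    List<Long> volume2Blocks = List.of(3L);     // blocks found on the second removed volume

    // Pattern from the description: the second put() discards volume1's blocks.
    blkToInvalidate.put(bpid, new ArrayList<>(volume1Blocks));
    blkToInvalidate.put(bpid, new ArrayList<>(volume2Blocks));
    System.out.println("put():             " + blkToInvalidate.get(bpid)); // [3]

    // Accumulating form: merge per block-pool id so no volume's blocks are lost.
    blkToInvalidate.clear();
    blkToInvalidate.computeIfAbsent(bpid, k -> new ArrayList<>()).addAll(volume1Blocks);
    blkToInvalidate.computeIfAbsent(bpid, k -> new ArrayList<>()).addAll(volume2Blocks);
    System.out.println("computeIfAbsent(): " + blkToInvalidate.get(bpid)); // [1, 2, 3]
  }
}
{code}

With the accumulating form, every removed volume's replicas end up in the invalidation list for their block pool, which is the behaviour the report asks for.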
[jira] [Commented] (HDFS-15386) ReplicaNotFoundException keeps happening in DN after removing multiple DN's data directories
[ https://issues.apache.org/jira/browse/HDFS-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126715#comment-17126715 ] Stephen O'Donnell commented on HDFS-15386: -- Please create a PR for branch-2.10 and then we can backport to other 2.x branches from there. This is now committed branch 3.0, 3.1, 3.2, 3.3 and trunk. > ReplicaNotFoundException keeps happening in DN after removing multiple DN's > data directories > > > Key: HDFS-15386 > URL: https://issues.apache.org/jira/browse/HDFS-15386 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > When removing volumes, we need to invalidate all the blocks in the volumes. > In the following code (FsDatasetImpl), we keep the blocks that will be > invalidate in *blkToInvalidate* map. However as the key of the map is *bpid* > (Block Pool ID), it will be overwritten by other removed volumes. As a > result, the map will have only the blocks of the last volume we are removing, > and invalidate only them: > {code:java} > for (String bpid : volumeMap.getBlockPoolList()) { > List blocks = new ArrayList<>(); > for (Iterator it = > volumeMap.replicas(bpid).iterator(); it.hasNext();) { > ReplicaInfo block = it.next(); > final StorageLocation blockStorageLocation = > block.getVolume().getStorageLocation(); > LOG.trace("checking for block " + block.getBlockId() + > " with storageLocation " + blockStorageLocation); > if (blockStorageLocation.equals(sdLocation)) { > blocks.add(block); > it.remove(); > } > } > blkToInvalidate.put(bpid, blocks); > } > {code} > [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126713#comment-17126713 ] Stephen O'Donnell commented on HDFS-13179: -- This is now reverted from branch-3.0. I will post a new patch here shortly, as its a trivial change. > TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails > intermittently > -- > > Key: HDFS-13179 > URL: https://issues.apache.org/jira/browse/HDFS-13179 > Project: Hadoop HDFS > Issue Type: Bug > Components: fs >Affects Versions: 3.0.0 >Reporter: Gabor Bota >Assignee: Ahmed Hussein >Priority: Critical > Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2, 2.10.1 > > Attachments: HDFS-13179-branch-2.10.003.patch, HDFS-13179.001.patch, > HDFS-13179.002.patch, HDFS-13179.003.patch, test runs.zip > > > The error caused by TimeoutException because the test is waiting to ensure > that the file is replicated to DISK storage but the replication can't be > finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), > but the file is still on RAM_DISK - so there is no data loss. > Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially > fixes the flakiness. > {code:java} > try { > ensureFileReplicasOnStorageType(path1, DEFAULT); > }catch (TimeoutException t){ > LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on > RAM_DISK"); > ensureFileReplicasOnStorageType(path1, RAM_DISK); > } > } > {code} > Some thoughts: > * Successful and failed tests run similar to the point when datanode > restarts. Restart line is the following in the log: LazyPersistTestCase - > Restarting the DataNode > * There is a line which only occurs in the failed test: *addStoredBlock: > Redundant addStoredBlock request received for blk_1073741825_1001 on node > 127.0.0.1:49455 size 5242880* > * This redundant request at BlockManager#addStoredBlock could be the main > reason for the test fail. Something wrong with the gen stamp? Corrupt > replicas? > = > Current fail ratio based on my test of TestLazyPersistReplicaRecovery: > 1000 runs, 34 failures (3.4% fail) > Failure rate analysis: > TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4% > 33 failures caused by: {noformat} > java.util.concurrent.TimeoutException: Timed out waiting for condition. > Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 > on 39589" > {noformat} > 1 failure caused by: {noformat} > java.net.BindException: Problem binding to [localhost:56729] > java.net.BindException: Address already in use; For more details see: > http://wiki.apache.org/hadoop/BindException at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49) > Caused by: java.net.BindException: Address already in use at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49) > {noformat} > = > Example stacktrace: > {noformat} > Timed out waiting for condition. 
Thread diagnostics: > Timestamp: 2017-11-01 10:36:49,499 > "Thread-1" prio=5 tid=13 runnable > java.lang.Thread.State: RUNNABLE > at java.lang.Thread.dumpThreads(Native Method) > at java.lang.Thread.getAllStackTraces(Thread.java:1610) > at > org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87) > at > org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73) > at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > ... > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
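For readers unfamiliar with the waiting pattern the quoted fix relies on, here is a generic, self-contained sketch of "poll for a primary condition with a timeout, then fall back to a secondary check". The helper and its names are invented for illustration; this is not the Hadoop test utility code itself:
{code:java}
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

// Poll a condition until it holds or a deadline passes; on timeout, run a
// fallback check (mirroring "look for the replicas on RAM_DISK instead").
public class WaitWithFallback {

  static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs)
      throws TimeoutException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!check.getAsBoolean()) {
      if (System.currentTimeMillis() > deadline) {
        throw new TimeoutException("Timed out waiting for condition.");
      }
      Thread.sleep(intervalMs);
    }
  }

  public static void main(String[] args) throws Exception {
    long start = System.currentTimeMillis();
    try {
      // Primary expectation: condition becomes true within 3 seconds (here it never does).
      waitFor(() -> false, 100, 3_000);
    } catch (TimeoutException te) {
      System.out.println("Primary check timed out after "
          + (System.currentTimeMillis() - start) + " ms, running fallback check");
      // Fallback check: succeeds immediately in this toy example.
      waitFor(() -> true, 100, 3_000);
    }
  }
}
{code}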
[jira] [Commented] (HDFS-15386) ReplicaNotFoundException keeps happening in DN after removing multiple DN's data directories
[ https://issues.apache.org/jira/browse/HDFS-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126647#comment-17126647 ] Toshihiro Suzuki commented on HDFS-15386: - [~sodonnell] Thank you for merging the PR to trunk! For branch 2, which branch should I create a PR for? Thanks. > ReplicaNotFoundException keeps happening in DN after removing multiple DN's > data directories > > > Key: HDFS-15386 > URL: https://issues.apache.org/jira/browse/HDFS-15386 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > When removing volumes, we need to invalidate all the blocks in the volumes. > In the following code (FsDatasetImpl), we keep the blocks that will be > invalidated in the *blkToInvalidate* map. However, as the key of the map is *bpid* > (Block Pool ID), the entry will be overwritten by other removed volumes. As a > result, the map will have only the blocks of the last volume we are removing, > and we invalidate only them: > {code:java} > for (String bpid : volumeMap.getBlockPoolList()) { > List<ReplicaInfo> blocks = new ArrayList<>(); > for (Iterator<ReplicaInfo> it = > volumeMap.replicas(bpid).iterator(); it.hasNext();) { > ReplicaInfo block = it.next(); > final StorageLocation blockStorageLocation = > block.getVolume().getStorageLocation(); > LOG.trace("checking for block " + block.getBlockId() + > " with storageLocation " + blockStorageLocation); > if (blockStorageLocation.equals(sdLocation)) { > blocks.add(block); > it.remove(); > } > } > blkToInvalidate.put(bpid, blocks); > } > {code} > [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15390) client fails forever when namenode ipaddr changed
[ https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126638#comment-17126638 ] Sean Chow commented on HDFS-15390: -- There are two ways to fix this: # When updateAddress() is true, do not handle the connection failure this round # When an address change is detected, update the namenode proxies (only with {{ConfiguredFailoverProxyProvider}}) Method one is easy, and within this connection's lifecycle the client will use the right {{server}} to connect. But when the client connection is closed and a new one is created, it will still try to getConnection to the retired ipaddr, because the namenode proxies are still the old ones. Method two solves the root cause: every time the client fails over between namenodes, check whether the ipaddr changed; if it did, re-initialize the namenode failover proxies. > client fails forever when namenode ipaddr changed > - > > Key: HDFS-15390 > URL: https://issues.apache.org/jira/browse/HDFS-15390 > Project: Hadoop HDFS > Issue Type: Bug > Components: dfsclient >Affects Versions: 2.10.0, 2.9.2, 3.2.1 >Reporter: Sean Chow >Priority: Major > > For machine replacement, I replace my standby namenode with a new ipaddr and > keep the same hostname. I also update the client's hosts file so the hostname > resolves correctly. > When I run a failover to transition to the new namenode (let's say nn2), the > client will fail to read or write forever until it's restarted. > That puts the YARN NodeManagers in a sick state. Even new tasks will encounter > this exception too, until all NodeManagers are restarted. > > {code:java} > 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: > nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000 > 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to > nn2-192-168-1-100/192.168.1.200:9000: Connection refused > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) > at > org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707) > at > org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517) > at org.apache.hadoop.ipc.Client.call(Client.java:1440) > at org.apache.hadoop.ipc.Client.call(Client.java:1401) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy9.addBlock(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399) > at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > {code} > > We can see the client has {{Address change detected}}, but it still fails. I > find that's because when {{updateAddress()}} returns true, the > {{handleConnectionFailure()}} throws an exception that breaks the next retry > with the right ipaddr. > Client.java: setupConnection() > {code:java} > } catch (ConnectTimeoutException toe) { > /* Check for an address change and update the local reference. >* Reset the failure counter if the address was changed >*/ > if (updateAddress()) { > timeoutFailures = ioFailures = 0; > } > handleConnectionTimeout(timeoutFailures++, > maxRetriesOnSocketTimeouts, toe); > } catch (IOException ie) { > if (updateAddress()) { > timeoutFailures = ioFailures = 0; > } > // because the namenode ip changed in updateAddress(), the old namenode > ipaddress cannot be accessed now > // handleConnectionFailure will throw an exception, so the next retry never has > a chance to use the right server updated in updateAddress() > handleConnectionFailure(ioFailures++, ie); > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
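A rough, self-contained sketch of method one above (skip the failure handling for the round in which the resolved address changed, so the next attempt uses the fresh address). The class, helper names and retry policy are invented for illustration; this is not the actual {{org.apache.hadoop.ipc.Client}} code:
{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;

public class RetryWithAddressUpdate {

  // The hostname stays the same; the IP behind it may change.
  private InetSocketAddress server =
      new InetSocketAddress("nn2-192-168-1-100", 9000);

  /** Re-resolve the hostname; return true if the address behind it changed. */
  private boolean updateAddress() {
    InetSocketAddress fresh =
        new InetSocketAddress(server.getHostName(), server.getPort());
    if (fresh.getAddress() != null && !fresh.equals(server)) {
      server = fresh;
      return true;
    }
    return false;
  }

  public void setupConnection(int maxRetries) throws IOException {
    int ioFailures = 0;
    while (true) {
      try {
        connect(server);        // placeholder for the real socket connect
        return;
      } catch (IOException ie) {
        if (updateAddress()) {
          ioFailures = 0;       // address changed: reset the counter and
          continue;             // retry immediately against the new address
        }
        if (++ioFailures >= maxRetries) {
          throw ie;             // same address keeps failing: give up
        }
      }
    }
  }

  private void connect(InetSocketAddress addr) throws IOException {
    throw new IOException("stub: replace with a real connection attempt to " + addr);
  }
}
{code}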
[jira] [Commented] (HDFS-15386) ReplicaNotFoundException keeps happening in DN after removing multiple DN's data directories
[ https://issues.apache.org/jira/browse/HDFS-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126630#comment-17126630 ] Hudson commented on HDFS-15386: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18329 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18329/]) HDFS-15386 ReplicaNotFoundException keeps happening in DN after removing (github: rev 545a0a147c5256c44911ba57b4898e01d786d836) * (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java > ReplicaNotFoundException keeps happening in DN after removing multiple DN's > data directories > > > Key: HDFS-15386 > URL: https://issues.apache.org/jira/browse/HDFS-15386 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Major > > When removing volumes, we need to invalidate all the blocks in the volumes. > In the following code (FsDatasetImpl), we keep the blocks that will be > invalidated in the *blkToInvalidate* map. However, as the key of the map is *bpid* > (Block Pool ID), the entry will be overwritten by other removed volumes. As a > result, the map will have only the blocks of the last volume we are removing, > and we invalidate only them: > {code:java} > for (String bpid : volumeMap.getBlockPoolList()) { > List<ReplicaInfo> blocks = new ArrayList<>(); > for (Iterator<ReplicaInfo> it = > volumeMap.replicas(bpid).iterator(); it.hasNext();) { > ReplicaInfo block = it.next(); > final StorageLocation blockStorageLocation = > block.getVolume().getStorageLocation(); > LOG.trace("checking for block " + block.getBlockId() + > " with storageLocation " + blockStorageLocation); > if (blockStorageLocation.equals(sdLocation)) { > blocks.add(block); > it.remove(); > } > } > blkToInvalidate.put(bpid, blocks); > } > {code} > [https://github.com/apache/hadoop/blob/704409d53bf7ebf717a3c2e988ede80f623bbad3/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java#L580-L595] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15390) client fails forever when namenode ipaddr changed
[ https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Chow updated HDFS-15390: - Description: For machine replacement, I replace my standby namenode with a new ipaddr and keep the same hostname. Also update the client's hosts to make it resolve correctly When I try to run failover to transite the new namenode(let's say nn2), the client will fail to read or write forever until it's restarted. That make yarn nodemanager in sick state. Even the new tasks will encounter this exception too. Until all nodemanager restart. {code:java} 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to nn2-192-168-1-100/192.168.1.200:9000: Connection refused java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517) at org.apache.hadoop.ipc.Client.call(Client.java:1440) at org.apache.hadoop.ipc.Client.call(Client.java:1401) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy9.addBlock(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399) at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) {code} We can see the client has {{Address change detected}}, but it still fails. I find out that's because when method {{updateAddress()}} return true, the {{handleConnectionFailure()}} thow an exception that break the next retry with the right ipaddr. Client.java: setupConnection() {code:java} } catch (ConnectTimeoutException toe) { /* Check for an address change and update the local reference. * Reset the failure counter if the address was changed */ if (updateAddress()) { timeoutFailures = ioFailures = 0; } handleConnectionTimeout(timeoutFailures++, maxRetriesOnSocketTimeouts, toe); } catch (IOException ie) { if (updateAddress()) { timeoutFailures = ioFailures = 0; } // because the namenode ip changed in updateAddress(), the old namenode ipaddress cannot be accessed now // handleConnectionFailure will thow an exception, the next retry never have a change to use the right server updated in updateAddress() handleConnectionFailure(ioFailures++, ie); } {code} was: For machine replacement, I replace my standby namenode with a new ipaddr and keep the same hostname. 
Also update the client's hosts to make it resolve correctly When I try to run failover to transite the new namenode(let's say nn2), the client will fail to read or write forever until it's restarted. That make yarn nodemanager in sick state. Even the new tasks will encounter this exception too. Until all nodemanager restart. {code:java} 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to nn2-192-168-1-100/192.168.1.200:9000: Connection refused java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707) at
[jira] [Commented] (HDFS-15389) DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme
[ https://issues.apache.org/jira/browse/HDFS-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126581#comment-17126581 ] Hadoop QA commented on HDFS-15389: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 23s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 42s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 48s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 15s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 30s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 40s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 3m 2s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 0s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 44s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 179 unchanged - 0 fixed = 180 total (was 179) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 40s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 5s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}119m 3s{color} | {color:red} hadoop-hdfs in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 35s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}190m 42s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy | | | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure | | | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks | | | hadoop.hdfs.TestMultipleNNPortQOP | | | hadoop.hdfs.TestErasureCodingPoliciesWithRandomECPolicy | | | hadoop.hdfs.TestReconstructStripedFile | | | hadoop.hdfs.TestStripedFileAppend | | | hadoop.hdfs.TestRollingUpgrade | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-HDFS-Build/29404/artifact/out/Dockerfile | | JIRA Issue | HDFS-15389 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004879/HDFS-15389-01.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 2d7d59b9ec89
[jira] [Updated] (HDFS-15390) client fails forever when namenode ipaddr changed
[ https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Chow updated HDFS-15390: - Description: For machine replacement, I replace my standby namenode with a new ipaddr and keep the same hostname. Also update the client's hosts to make it resolve correctly When I try to run failover to transite the new namenode(let's say nn2), the client will fail to read or write forever until it's restarted. That make yarn nodemanager in sick state. Even the new tasks will encounter this exception too. Until all nodemanager restart. {code:java} 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to nn2-192-168-1-100/192.168.1.200:9000: Connection refused java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517) at org.apache.hadoop.ipc.Client.call(Client.java:1440) at org.apache.hadoop.ipc.Client.call(Client.java:1401) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy9.addBlock(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399) at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) {code} We can see the client has {{Address change detected}}, but it still fails. I find out that's because when method {{updateAddress()}} return true, the {{handleConnectionFailure()}} thow an exception that break the next retry with the right ipaddr. was: For machine replacement, I replace my standby namenode with a new ipaddr and keep the same hostname. Also update the client's hosts to make it resolve correctly When I try to run failover to transite the new namenode(let's say nn2), the client will fail to read or write forever until it's restarted. That make yarn nodemanager in sick state. Even the new tasks will encounter this exception too. Until all nodemanager restart. {code:java} 20/06/02 15:12:25 WARN ipc.Client: Address change detected. 
Old: nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to nn2-192-168-1-100/192.168.1.200:9000: Connection refused java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517) at org.apache.hadoop.ipc.Client.call(Client.java:1440) at org.apache.hadoop.ipc.Client.call(Client.java:1401) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy9.addBlock(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399) at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at
[jira] [Created] (HDFS-15390) client fails forever when namenode ipaddr changed
Sean Chow created HDFS-15390: Summary: client fails forever when namenode ipaddr changed Key: HDFS-15390 URL: https://issues.apache.org/jira/browse/HDFS-15390 Project: Hadoop HDFS Issue Type: Bug Components: dfsclient Affects Versions: 3.2.1, 2.9.2, 2.10.0 Reporter: Sean Chow For machine replacement, I replace my standby namenode with a new ipaddr and keep the same hostname. I also update the client's hosts file so the hostname resolves correctly. When I run a failover to transition to the new namenode (let's say nn2), the client will fail to read or write forever until it's restarted. That puts the YARN NodeManagers in a sick state. Even new tasks will encounter this exception too, until all NodeManagers are restarted. {code:java} 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to nn2-192-168-1-100/192.168.1.200:9000: Connection refused java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517) at org.apache.hadoop.ipc.Client.call(Client.java:1440) at org.apache.hadoop.ipc.Client.call(Client.java:1401) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy9.addBlock(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399) at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) {code} We can see the client has Address change detected, but it still fails. I find out that's because when method updateAddress() returns true, the handleConnectionFailure() throws an exception that breaks the next retry with the right ipaddr. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15389) DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme
[ https://issues.apache.org/jira/browse/HDFS-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126436#comment-17126436 ] Ayush Saxena commented on HDFS-15389: - [~umamaheswararao] / [~rakeshr] / [~vinayakumarb] could you take a look once? > DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should > work with ViewFSOverloadScheme > -- > > Key: HDFS-15389 > URL: https://issues.apache.org/jira/browse/HDFS-15389 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-15389-01.patch > > > Two issues here: > First, prior to HDFS-15321, when DFSAdmin was closed, the FileSystem > associated with it was closed as part of the close method. But post HDFS-15321, > the {{FileSystem}} isn't stored as part of {{FsShell}}, hence during close, > the FileSystem still stays open and isn't closed. > ** This is the reason for the failure of TestDFSHAAdmin > Second, {{DfsAdmin -setBalancerBandwidth}} doesn't work with > {{ViewFSOverloadScheme}}, since setBalancerBandwidth calls {{getFS()}} > rather than {{getDFS()}}, which resolves the scheme as of {{HDFS-15321}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15321) Make DFSAdmin tool to work with ViewFSOverloadScheme
[ https://issues.apache.org/jira/browse/HDFS-15321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126430#comment-17126430 ] Ayush Saxena commented on HDFS-15321: - There was one more problem here. {{DFSAdmin -setBalancerBandwidth}} wasn't working with {{ViewFSOverloadScheme}}. I have raised HDFS-15389 for both issues. > Make DFSAdmin tool to work with ViewFSOverloadScheme > > > Key: HDFS-15321 > URL: https://issues.apache.org/jira/browse/HDFS-15321 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: dfsadmin, fs, viewfs >Affects Versions: 3.2.1 >Reporter: Uma Maheswara Rao G >Assignee: Uma Maheswara Rao G >Priority: Major > > When we enable ViewFSOverloadScheme and use the hdfs scheme as the overloaded > scheme, users work with hdfs URIs. But here DFSAdmin expects the impl class > to be DistributedFileSystem. If the impl class is ViewFSOverloadScheme, it will > fail. > So, when the impl is ViewFSOverloadScheme, we should get the corresponding child hdfs > to make DFSAdmin work. > This Jira makes DFSAdmin work with ViewFSOverloadScheme. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15389) DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme
[ https://issues.apache.org/jira/browse/HDFS-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena updated HDFS-15389: Attachment: HDFS-15389-01.patch > DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should > work with ViewFSOverloadScheme > -- > > Key: HDFS-15389 > URL: https://issues.apache.org/jira/browse/HDFS-15389 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-15389-01.patch > > > Two issues here: > First, prior to HDFS-15321, when DFSAdmin was closed, the FileSystem > associated with it was closed as part of the close method. But post HDFS-15321, > the {{FileSystem}} isn't stored as part of {{FsShell}}, hence during close, > the FileSystem still stays open and isn't closed. > ** This is the reason for the failure of TestDFSHAAdmin > Second, {{DfsAdmin -setBalancerBandwidth}} doesn't work with > {{ViewFSOverloadScheme}}, since setBalancerBandwidth calls {{getFS()}} > rather than {{getDFS()}}, which resolves the scheme as of {{HDFS-15321}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15389) DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme
[ https://issues.apache.org/jira/browse/HDFS-15389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena updated HDFS-15389: Status: Patch Available (was: Open) > DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should > work with ViewFSOverloadScheme > -- > > Key: HDFS-15389 > URL: https://issues.apache.org/jira/browse/HDFS-15389 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-15389-01.patch > > > Two issues here: > First, prior to HDFS-15321, when DFSAdmin was closed, the FileSystem > associated with it was closed as part of the close method. But post HDFS-15321, > the {{FileSystem}} isn't stored as part of {{FsShell}}, hence during close, > the FileSystem still stays open and isn't closed. > ** This is the reason for the failure of TestDFSHAAdmin > Second, {{DfsAdmin -setBalancerBandwidth}} doesn't work with > {{ViewFSOverloadScheme}}, since setBalancerBandwidth calls {{getFS()}} > rather than {{getDFS()}}, which resolves the scheme as of {{HDFS-15321}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15389) DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme
Ayush Saxena created HDFS-15389: --- Summary: DFSAdmin should close filesystem and dfsadmin -setBalancerBandwidth should work with ViewFSOverloadScheme Key: HDFS-15389 URL: https://issues.apache.org/jira/browse/HDFS-15389 Project: Hadoop HDFS Issue Type: Bug Reporter: Ayush Saxena Assignee: Ayush Saxena Two issues here: First, prior to HDFS-15321, when DFSAdmin was closed, the FileSystem associated with it was closed as part of the close method. But post HDFS-15321, the {{FileSystem}} isn't stored as part of {{FsShell}}, hence during close, the FileSystem still stays open and isn't closed. ** This is the reason for the failure of TestDFSHAAdmin Second, {{DfsAdmin -setBalancerBandwidth}} doesn't work with {{ViewFSOverloadScheme}}, since setBalancerBandwidth calls {{getFS()}} rather than {{getDFS()}}, which resolves the scheme as of {{HDFS-15321}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
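For illustration, one way the second issue can be approached is to resolve a DistributedFileSystem behind whatever FileSystem the hdfs scheme is bound to. The helper below is hypothetical and simplified (it assumes {{FileSystem#getChildFileSystems()}} exposes the mounted child file systems); it is not the committed HDFS-15321/HDFS-15389 change:
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Hypothetical sketch of the getDFS() idea: if the FileSystem bound to the
// configured URI is not a DistributedFileSystem (e.g. ViewFSOverloadScheme),
// look for a child file system that is, so admin commands such as
// -setBalancerBandwidth have a real DFS to talk to.
public final class DfsResolver {
  private DfsResolver() {}

  public static DistributedFileSystem resolveDfs(Configuration conf)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    if (fs instanceof DistributedFileSystem) {
      return (DistributedFileSystem) fs;
    }
    for (FileSystem child : fs.getChildFileSystems()) {
      if (child instanceof DistributedFileSystem) {
        return (DistributedFileSystem) child;
      }
    }
    throw new IOException("No DistributedFileSystem found behind " + fs.getUri());
  }
}
{code}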