[jira] [Comment Edited] (HDFS-11096) Support rolling upgrade between 2.x and 3.x
[ https://issues.apache.org/jira/browse/HDFS-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234197#comment-16234197 ] Sean Mackrory edited comment on HDFS-11096 at 11/1/17 3:16 PM: --- From an HDFS standpoint, definitely - I've run many successful rolling-upgrade and distcp-over-webhdfs tests this week and updated the patch. The only thing remaining is to get the automation itself in place after this is committed. I looked into the YARN issues. I'm still seeing very similar symptoms to the YARN-6457 issue mentioned above in both branch-3.0 and trunk. In trunk I'm also seeing this:

{code}
17/10/31 23:05:49 INFO security.AMRMTokenSecretManager: Creating password for appattempt_1509490231144_0628_02
17/10/31 23:05:49 INFO amlauncher.AMLauncher: Error launching appattempt_1509490231144_0628_02. Got exception: org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid container token used for starting container on : container-5.docker:35151
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.verifyAndGetContainerTokenIdentifier(ContainerManagerImpl.java:974)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:789)
	at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:70)
	at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:127)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2455)
	at sun.reflect.GeneratedConstructorAccessor70.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
	at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
	at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
	at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:131)
	at sun.reflect.GeneratedMethodAccessor85.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy89.startContainers(Unknown Source)
	at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:123)
	at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:304)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid container token used for starting container on : container-5.docker:35151
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.verifyAndGetContainerTokenIdentifier(ContainerManagerImpl.java:974)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:789)
	at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:70)
	...
{code}
[ https://issues.apache.org/jira/browse/HDFS-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16157192#comment-16157192 ] Allen Wittenauer edited comment on HDFS-11096 at 9/7/17 4:42 PM: -

{code}
set -e
{code}

I'm really not a fan of using set -e unless one absolutely must. Using it eliminates any possible use of failure-handling mechanisms, including in if tests. There are a lot of caveats when it is in play.

{code}
set -x
{code}

Is this just temporary?

{code}
for hostname in ${HOSTNAMES[@]}; do
  ssh -i ${ID_FILE} root@${hostname} ". /tmp/env.sh
{code}

It seems there are a few functions like this that have implementations in hadoop-functions.sh. Shouldn't this just leverage that code? [See also HADOOP-14009.] The bash settings in place (see above) will be an issue, though.

{code}
cd ${HADOOP_3}
sbin/hadoop-daemon.sh start namenode -rollingUpgrade started
{code}

If it's Hadoop 3.x, shouldn't this be using non-deprecated commands?

{code}
sudo apt-get install -y git
{code}

This is kind of an interesting one. If I'm using this code, then I'm either already in a git repo or I've got a source tarball. Given that the git hash is encoded at build time, I think there might be an implicit requirement that git is already installed. In the case of some of the other Ubuntu-isms (apt-get install of wget), there are likely generic ways to deal with them (e.g., use the installed perl/python/java). If the intent is to just use the docker images that ship with Hadoop, git is pretty much a requirement for Apache Yetus.

{code}
# Tested on an Ubuntu 16.04 host
{code}

Probably worth mentioning that HADOOP-14816 upgrades the Dockerfile to Xenial.

{code}
mvn clean package -DskipTests -Pdist -Dtar
{code}

Shouldn't this just call create-release --docker --native so that we get something closer to what we ship?
{code}
HDFS_NAMENODE_USER=root \
HDFS_DATANODE_USER=root \
HDFS_JOURNALNODE_USER=root \
HDFS_ZKFC_USER=root \
{code}

*dances with glee that someone else is using this feature*
> Support rolling upgrade between 2.x and 3.x > --- > > Key: HDFS-11096 > URL: https://issues.apache.org/jira/browse/HDFS-11096 > Project: Hadoop HDFS > Issue Type: Improvement > Components: rolling upgrades >Affects Versions: 3.0.0-alpha1 >Reporter: Andrew Wang >Assignee: Sean Mackrory >Priority: Blocker > Attachments: HDFS-11096.001.patch, HDFS-11096.002.patch > > > trunk has a minimum software version of 3.0.0-alpha1. This means we can't > rolling upgrade between branch-2 and trunk. > This is a showstopper for large deployments. Unless there are very compelling > reasons to break compatibility, let's restore the ability to rolling upgrade > to 3.x releases. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[ https://issues.apache.org/jira/browse/HDFS-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143965#comment-16143965 ] Sean Mackrory edited comment on HDFS-11096 at 8/28/17 4:40 PM: --- So Docker support has been added for the rolling-upgrade and pull-over-http tests. They use the same Docker image as the Yetus builds, etc., and they've been really robust lately. I've corrected the copyright headers at the top of the files, and I think dev-support/compat is a good place for these tests to live - but I'm open to other ideas as well. I've also added to the README - now that the scripts spin up the clusters on Docker, it's *really* easy to run these. The Python tests are all still working, but they did not catch the previous incompatibility that prevented older clients from writing to newer DataNodes. There are also still a few TODOs and things that don't work where it's not clear why. So there's definitely more work to be done, but there's value in the existing CLI compatibility tests. I'd like to get this into the codebase and get some Jenkins jobs running on it soon.
[ https://issues.apache.org/jira/browse/HDFS-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852343#comment-15852343 ] Allen Wittenauer edited comment on HDFS-11096 at 2/3/17 11:47 PM: -- I hope folks hoping to do a rolling upgrade with automated tools understand that hadoop-env.sh/yarn-env.sh, log files, pid files, classpath, and a few other things outside of the Java code were purposefully made incompatible, and will do the correct thing when trying to roll forward.
[ https://issues.apache.org/jira/browse/HDFS-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15816405#comment-15816405 ] Sean Mackrory edited comment on HDFS-11096 at 1/10/17 10:46 PM: After looking at where SortedMapWritable is used some more, I'm more convinced it's only a concern if we care about source compatibility, which is not required for rolling upgrades. I also had a second look at wire compatibility and found a few concerning things I'll look at and possibly fix:
* The message getHdfsBlockLocations has disappeared, as have related types
* The field nonDfsUsed in DatanodeInfoProto changed from index 9 to index 15

For YARN (CC [~kasha]), the field nodeLabels in several structures in yarn_protos changed from string to a custom type, and memory in ResourceProto changed from int32 to int64 (I'm not sure, but in protobuf that may not actually be incompatible?). There are also a lot of messages moving between files, but not otherwise changing in any incompatible way. That's not a concern, is it?

If anyone else wants to see the changes in protobuf, this is what I did (if anything, you'll want to replace meld with your own diff-tool-of-choice):

{code}
#!/usr/bin/env bash
cd /tmp
OLD=branch-2.7
NEW=trunk
mkdir new
mkdir old
git clone git://git.apache.org/hadoop.git

function gather_protos() {
  SOURCE=${1}
  TARGET=${2}
  for proto in $(cd ${SOURCE} && find . -name \*.proto | sed -e 's|^\./||'); do
    #flattened=${proto//\//_} # Trips up on files that moved
    flattened=$(basename ${proto})
    cp ${SOURCE}/${proto} ${TARGET}/${flattened}
  done
}

(cd hadoop; git checkout ${OLD})
gather_protos hadoop old
(cd hadoop; git checkout ${NEW})
gather_protos hadoop new
meld old new
{code}
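On why the two protobuf changes above differ in severity: in the protobuf wire format, each field is prefixed with a key encoded as (field_number << 3) | wire_type, so renumbering a field changes the bytes on the wire and old readers will treat the new number as an unknown field, while int32 and int64 share the same varint wire type and so widening is wire-compatible for values that fit in 32 bits. A small sketch using shell arithmetic (the field numbers besides 9 and 15 are illustrative, not taken from the actual .proto files):

```shell
#!/usr/bin/env bash
# Protobuf field key = (field_number << 3) | wire_type; varints use wire type 0.

# Renumbering nonDfsUsed from field 9 to field 15 changes the key byte,
# so a reader expecting field 9 no longer finds it.
printf 'field 9 key byte:  0x%02x\n' $(( (9 << 3) | 0 ))
printf 'field 15 key byte: 0x%02x\n' $(( (15 << 3) | 0 ))

# int32 -> int64 keeps the same wire type, so the key byte for a given
# field number (3 here, purely illustrative) is unchanged.
int32_key=$(( (3 << 3) | 0 ))
int64_key=$(( (3 << 3) | 0 ))
[ "${int32_key}" -eq "${int64_key}" ] && echo "same key byte for int32 and int64"
```

This is consistent with the comment's hunch: the index change on nonDfsUsed is a real wire incompatibility, while the ResourceProto widening likely is not (subject to the usual caveat about values overflowing 32 bits).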
[ https://issues.apache.org/jira/browse/HDFS-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15799802#comment-15799802 ] Sean Mackrory edited comment on HDFS-11096 at 1/5/17 8:03 PM: -- I've been doing a lot of testing. I've posted some automation here, which we may want to hook into a Jenkins job or something: https://github.com/mackrorysd/hadoop-compatibility. I've tested running a bunch of MapReduce jobs while doing a rolling upgrade of HDFS, and haven't had any failures that indicate an incompatibility. I've also tested pulling data from an old cluster onto a new cluster. I'll keep adding other aspects to the tests to improve coverage. I haven't seen a way to whitelist stuff; filed an issue with jacc: https://github.com/lvc/japi-compliance-checker/issues/36. As for the incompatibilities, I think there's relatively little action to be taken, so I'll file JIRAs for those. In detail: metrics and s3a are technically violating the contract, but in all cases keeping the old behavior would be some serious baggage, and given their nature I think that's acceptable. I think SortedMapWritable should be put back but deprecated (I'm sure someone's depending on it somewhere, and it should be trivial), and FileStatus should still implement Comparable. I'm not so sure about NameNodeMXBean, the missing configuration keys, or the cases of reduced visibility. I'm inclined to leave these as-is unless we know they break something that people care about. They are technically incompatibilities, so maybe someone else feels differently (or is aware of applications they are likely to break), but it would be nice to shed baggage and poor practices where we can. For all the other issues I feel more confident that they either don't actually break the contract or are extremely unlikely to break anything enough to warrant sticking with the old way. I'll sleep on some of these one more night and file JIRAs tomorrow to start addressing the issues I think are important enough.