[jira] [Comment Edited] (HDFS-11096) Support rolling upgrade between 2.x and 3.x

2017-11-01 Thread Sean Mackrory (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234197#comment-16234197 ]

Sean Mackrory edited comment on HDFS-11096 at 11/1/17 3:16 PM:
---

From an HDFS standpoint, definitely - I've run many successful rolling-upgrade 
and distcp-over-webhdfs tests this week and updated the patch. The only thing 
remaining is to get the automation itself in place after this is committed.
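
For context, the cross-version copy pattern these tests exercise looks 
something like the following sketch (hostname and paths are hypothetical; 
reading over webhdfs avoids RPC version coupling between the 2.x and 3.x 
clusters):

{code}
# Pull data from an old 2.x cluster into the local 3.x cluster over webhdfs;
# 50070 is the default 2.x NameNode HTTP port.
hadoop distcp webhdfs://old-nn.example.com:50070/data hdfs:///data
{code}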

I looked into the YARN issues. I'm still seeing symptoms very similar to the 
YARN-6457 issue mentioned above on both branch-3.0 and trunk. On trunk I'm also 
seeing this:

{code}
17/10/31 23:05:49 INFO security.AMRMTokenSecretManager: Creating password for appattempt_1509490231144_0628_02
17/10/31 23:05:49 INFO amlauncher.AMLauncher: Error launching appattempt_1509490231144_0628_02. Got exception: org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid container token used for starting container on : container-5.docker:35151
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.verifyAndGetContainerTokenIdentifier(ContainerManagerImpl.java:974)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:789)
    at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:70)
    at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:127)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2455)

    at sun.reflect.GeneratedConstructorAccessor70.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
    at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:131)
    at sun.reflect.GeneratedMethodAccessor85.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy89.startContainers(Unknown Source)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:123)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:304)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid container token used for starting container on : container-5.docker:35151
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.verifyAndGetContainerTokenIdentifier(ContainerManagerImpl.java:974)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:789)
    at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:70)
    at ...
{code}

[jira] [Comment Edited] (HDFS-11096) Support rolling upgrade between 2.x and 3.x

2017-09-07 Thread Allen Wittenauer (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157192#comment-16157192 ]

Allen Wittenauer edited comment on HDFS-11096 at 9/7/17 4:42 PM:
-

{code}
set -e
{code}

I'm really not a fan of using set -e unless one absolutely must.  Using it 
eliminates any possible use of failure mechanisms, including in if tests. There 
are a lot of caveats when it is in play.  
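
A minimal sketch of the if-test caveat (the {{check}} function is hypothetical):

{code}
#!/usr/bin/env bash
set -e

check() {
  grep -q pattern /nonexistent   # exits non-zero
}

if check; then       # set -e is suspended while evaluating an if condition,
  echo "found"       # so the failure here does NOT abort the script...
fi

check                # ...but the identical bare call here kills it,
echo "never reached" # so this line never runs.
{code}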

{code}
set -x
{code}

Is this just temporary?
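
If the tracing is meant to stay, one option is to gate it behind a variable (a sketch; the {{DEBUG}} variable is hypothetical):

{code}
# Only trace when explicitly requested, e.g. DEBUG=true ./test.sh
if [[ "${DEBUG:-false}" == "true" ]]; then
  set -x
fi
{code}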

{code} 
for hostname in ${HOSTNAMES[@]}; do
ssh -i ${ID_FILE} root@${hostname} ". /tmp/env.sh
{code}

It seems there are a few functions like this that already have implementations 
in hadoop-functions.sh. Shouldn't this just leverage that code? [See also 
HADOOP-14009.] The bash settings in place (see above) will be an issue, though.
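
Separate from the reuse question, the loop needs quotes around its expansions - a sketch, with the remote command body elided as in the excerpt:

{code}
for hostname in "${HOSTNAMES[@]}"; do
  ssh -i "${ID_FILE}" "root@${hostname}" ". /tmp/env.sh
    ..."
done
{code}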

{code}
  cd ${HADOOP_3}
  sbin/hadoop-daemon.sh start namenode -rollingUpgrade started
{code}

If it's hadoop 3.x, shouldn't this be using non-deprecated commands?
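
For reference, the non-deprecated 3.x form would be something like this (the 
sbin/hadoop-daemon.sh wrapper is deprecated in favor of the {{--daemon}} option):

{code}
  cd ${HADOOP_3}
  # --daemon replaces the deprecated sbin/hadoop-daemon.sh wrapper
  bin/hdfs --daemon start namenode -rollingUpgrade started
{code}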

{code}
   sudo apt-get install -y git
{code}

This is kind of an interesting one. If I'm using this code, then I'm either 
already in a git repo or I've got a source tarball. Given that the git hash is 
encoded at build time, I think there might be an implicit requirement that git 
is already installed. In the case of some of the other Ubuntu-isms (apt-install 
of wget), there are likely generic ways to deal with them (e.g., use the 
installed perl/python/java). If the intent is to just use the Docker images 
that ship with Hadoop, git is pretty much a requirement for Apache Yetus anyway.
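
For example, a generic guard could fail fast instead of assuming an apt-based host (a sketch):

{code}
# 'command -v' is POSIX, so this works regardless of the package manager.
if ! command -v git >/dev/null 2>&1; then
  echo "ERROR: git is required (the git hash is encoded at build time)" >&2
  exit 1
fi
{code}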

{code}
# Tested on an Ubuntu 16.04 host
{code}

Probably worth mentioning HADOOP-14816 upgrades the Dockerfile to Xenial.

{code}
mvn clean package -DskipTests -Pdist -Dtar
{code}

Shouldn't this just call create-release --docker --native so that we get 
something closer to what we ship?
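
i.e., something like the following (assuming the script's usual location under dev-support/bin):

{code}
# Build the release artifacts the same way the project does: inside the
# standard build container, with the native bits included.
dev-support/bin/create-release --docker --native
{code}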

{code}
  HDFS_NAMENODE_USER=root \
  HDFS_DATANODE_USER=root \
  HDFS_JOURNALNODE_USER=root \
  HDFS_ZKFC_USER=root \
{code}

*dances with glee that someone else is using this feature*

> Support rolling upgrade between 2.x and 3.x
> ---
>
> Key: HDFS-11096
> URL: https://issues.apache.org/jira/browse/HDFS-11096
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rolling upgrades
>Affects Versions: 3.0.0-alpha1
>Reporter: Andrew Wang
>Assignee: Sean Mackrory
>Priority: Blocker
> Attachments: HDFS-11096.001.patch, HDFS-11096.002.patch
>
>
> trunk has a minimum software version of 3.0.0-alpha1. This means we can't 
> rolling upgrade between branch-2 and trunk.
> This is a showstopper for large deployments. Unless there are very compelling 
> reasons to break compatibility, let's restore the ability to rolling upgrade 
> to 3.x releases.






[jira] [Comment Edited] (HDFS-11096) Support rolling upgrade between 2.x and 3.x

2017-08-28 Thread Sean Mackrory (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143965#comment-16143965 ]

Sean Mackrory edited comment on HDFS-11096 at 8/28/17 4:40 PM:
---

So Docker support has been added for the rolling-upgrade and pull-over-http 
tests. They're using the same Docker image as the Yetus builds, etc., and 
they've been really robust lately. I've corrected the copyright headers at the 
top of the files, and I think dev-support/compat is a good place for these 
tests to live - but I'm open to other ideas as well. I've also added to the 
README - now that the scripts spin up the clusters on Docker, it's *really* 
easy to run these.

The Python tests are all still working, but they did not seem to catch the 
earlier incompatibility that prevented older clients from writing to newer 
DataNodes. There are also still a few TODOs and things that don't work for 
reasons that aren't clear yet. So there's definitely more work to be done, but 
there's value in the existing CLI compatibility tests.

I'd like to get this put in the codebase and get some Jenkins jobs running on 
it soon.





[jira] [Comment Edited] (HDFS-11096) Support rolling upgrade between 2.x and 3.x

2017-02-03 Thread Allen Wittenauer (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15852343#comment-15852343 ]

Allen Wittenauer edited comment on HDFS-11096 at 2/3/17 11:47 PM:
--

I hope folks hoping to do a rolling upgrade with automated tools understand 
that hadoop-env.sh/yarn-env.sh, log files, pid files, the classpath, and a few 
other things outside of the Java code were purposefully made incompatible, and 
that their tooling will need to do the correct thing when trying to roll 
forward.





[jira] [Comment Edited] (HDFS-11096) Support rolling upgrade between 2.x and 3.x

2017-01-10 Thread Sean Mackrory (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15816405#comment-15816405 ]

Sean Mackrory edited comment on HDFS-11096 at 1/10/17 10:46 PM:


After looking some more at where SortedMapWritable is used, I'm more convinced 
it's only a concern if we care about source compatibility, which is not 
required for rolling upgrades. I also had a second look at wire compatibility 
and found a few concerning things I'll look at and possibly fix:

* The getHdfsBlockLocations message has disappeared, along with its related 
types.
* The nonDfsUsed field in DatanodeInfoProto changed from index 9 to index 15.

For YARN (CC [~kasha]), the nodeLabels field in several yarn_protos structures 
changed from string to a custom type, and memory in ResourceProto changed from 
int32 to int64 (I'm not sure that last one is actually incompatible - int32 
and int64 share protobuf's varint wire format, so it may be fine?).

There are also a lot of messages moving between files without otherwise 
changing in any incompatible way. That's not a concern, is it?

If anyone else wants to see the protobuf changes, this is what I did (you'll 
probably want to replace meld with your own diff-tool-of-choice):

{code}
#!/usr/bin/env bash

cd /tmp

OLD=branch-2.7
NEW=trunk

mkdir new
mkdir old
git clone git://git.apache.org/hadoop.git

# Copy every .proto file from a checkout into one flat directory so the
# two branches can be diffed side by side.
function gather_protos() {
  SOURCE=${1}
  TARGET=${2}
  for proto in $(cd "${SOURCE}" && find . -name \*.proto | sed -e 's|^\./||'); do
    #flattened=${proto//\//_} # Trips up on files that moved
    flattened=$(basename "${proto}")
    cp "${SOURCE}/${proto}" "${TARGET}/${flattened}"
  done
}

(cd hadoop; git checkout ${OLD})
gather_protos hadoop old

(cd hadoop; git checkout ${NEW})
gather_protos hadoop new

meld old new
{code}





[jira] [Comment Edited] (HDFS-11096) Support rolling upgrade between 2.x and 3.x

2017-01-05 Thread Sean Mackrory (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799802#comment-15799802 ]

Sean Mackrory edited comment on HDFS-11096 at 1/5/17 8:03 PM:
--

I've been doing a lot of testing. I've posted some automation here, which we 
may want to hook into a Jenkins job or something: 
https://github.com/mackrorysd/hadoop-compatibility. I've tested running a bunch 
of MapReduce jobs while doing a rolling upgrade of HDFS, and haven't had any 
failures that indicate an incompatibility. I've also tested pulling data from 
an old cluster onto a new cluster. I'll keep adding other aspects to the tests 
to improve coverage.

I haven't seen a way to whitelist stuff. Filed an issue with jacc: 
https://github.com/lvc/japi-compliance-checker/issues/36.

As for the incompatibilities, I think there's relatively little action to be 
taken, so I'll file JIRAs for those. In detail: metrics and s3a are technically 
violating the contract, but in all cases keeping compatibility would mean 
carrying some serious baggage, and given their nature I think the break is 
acceptable. I think SortedMapWritable should be put back but deprecated (I'm 
sure someone's depending on it somewhere, and it should be trivial), and 
FileStatus should still implement Comparable. I'm not so sure about 
NamenodeMXBean, the missing configuration keys, or the cases of reduced 
visibility. I'm inclined to leave these as-is unless we know they break 
something people actually care about. They are technically incompatibilities, 
so maybe someone else feels differently (or is aware of applications they are 
likely to break), but it would be nice to shed baggage and poor practices 
where we can. For all the other issues, I feel more confident that they're 
either not actually breaking the contract or are extremely unlikely to break 
anything badly enough to warrant sticking with the old way. I'll sleep on some 
of these one more night and file JIRAs tomorrow to start addressing the issues 
I think are important enough.

