[ 
https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167535#comment-14167535
 ] 

Allen Wittenauer edited comment on HDFS-7231 at 10/10/14 9:50 PM:
------------------------------------------------------------------

We had an admin perform an upgrade that went south due to rolling upgrade 
interfering with the previous method of upgrading.  The series of events as 
given to me (I was out of town, so didn't witness firsthand) was:

# Build a 2.0.5/2.1.0/etc cluster where rolling upgrade is not an option.
# Upgrade it to 2.2
# Do not \-finalizeUpgrade
# Upgrade binaries to 2.4.1 
# Run namenode \-upgrade
# Watch it fail.
# Leave 2.4.1 DNs running
# Downgrade binaries on NN to 2.2
# Start NN
# DNs now do a partial roll-forward, rendering them unable to continue
# Admins manually repair VERSION files on those broken directories

...

There were clearly a few mistakes made in the above procedure, most of which 
were driven by a belief that the NN *had* to be out of safemode to do a 
finalize.  So they attempted to do that, which of course led to other things 
going wrong. I'm not sure what triggered the DNs to basically render their 
VERSION files broken.  I haven't been able to duplicate it, but I've only tried 
on a much smaller scale, so that might be related.  I also suspect there was an 
attempt to roll back the binaries on the DNs, and I haven't tried that yet 
either. 
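
For anyone trying to spot which DN directories ended up with broken VERSION files, here is a rough sketch of a consistency check. It is not HDFS code. It assumes each storage directory keeps its metadata in current/VERSION as plain key=value lines and that layoutVersion is the field worth comparing; the function names and the directory list are hypothetical.

```python
from collections import Counter
from pathlib import Path


def read_version_file(path):
    """Parse a key=value VERSION file into a dict, skipping '#' comment lines."""
    props = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def find_mismatched_dirs(storage_dirs, key="layoutVersion"):
    """Return {dir: value} for directories whose value for `key` differs
    from the most common value. A missing VERSION file is reported as None."""
    values = {}
    for d in storage_dirs:
        version_file = Path(d) / "current" / "VERSION"  # assumed on-disk layout
        if version_file.is_file():
            values[d] = read_version_file(version_file).get(key)
        else:
            values[d] = None
    if not values:
        return {}
    expected, _ = Counter(values.values()).most_common(1)[0]
    return {d: v for d, v in values.items() if v != expected}
```

Calling find_mismatched_dirs() with the list of data directories configured on one DN returns only the directories that disagree with the majority, which is a starting point for the manual repair described above rather than a repair tool in itself.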

My own testing of this scenario has given me a few insights.

* DNs should not start rolling while there is a directory there to finalize.  

Beyond the inconsistent filesystem state, there is a second problem.  If you do 
what appears (to me, at least) to be the way to get the system back up (bring 
down the namenode, run namenode \-finalize, bring the namenode back up), running 
hdfs dfsadmin \-finalizeUpgrade afterward doesn't appear to send the message to 
the DNs to clean up their space, requiring manual intervention. 
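
A guard rail of this kind could be sketched roughly as follows. This is an illustration only, not the HDFS implementation: it assumes an un-finalized upgrade leaves a "previous" directory next to "current" in each storage directory, and the function names are made up for the example.

```python
from pathlib import Path

# Assumption: a non-rolling upgrade that has not been finalized leaves this
# directory beside "current" in every storage directory.
PENDING_MARKER = "previous"


def pending_finalize_dirs(storage_dirs):
    """Return the storage directories that still hold an un-finalized upgrade."""
    return [str(d) for d in storage_dirs if (Path(d) / PENDING_MARKER).is_dir()]


def may_start_rolling_upgrade(storage_dirs):
    """Refuse to start rolling while any directory is still waiting on finalize."""
    return not pending_finalize_dirs(storage_dirs)
```

The point of the check is to fail early and loudly: if pending_finalize_dirs() returns anything, the DN would decline to roll and report those directories instead of attempting a partial roll-forward.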

* Surprise! DNs exit if the 'proper' NN is brought up with \-upgrade.

Doing the 2.2 NN \-finalize and then bringing the 2.4.1 NN up with \-upgrade 
results in the 2.4.1 DNs all coming down.  This was a bit of a surprise given 
they were perfectly happy staying up with a broken 2.2 NN in \-upgrade mode 
before.

I'm sure there are other things here, but these are the two big ones that stuck 
out. I'm doing some other manual testing using the above procedures with a few 
other changes to see what else sticks out.


> rollingupgrade needs some guard rails
> -------------------------------------
>
>                 Key: HDFS-7231
>                 URL: https://issues.apache.org/jira/browse/HDFS-7231
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Allen Wittenauer
>
> See first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
