Re: Ozone non-rolling upgrades

Aravindan Vijayan Tue, 25 Aug 2020 10:05:48 -0700

Hi Marton,

Thanks for the questions. Answers below.


*On-line upgrade vs offline-upgrade*
The "Pre-Finalized" state is not meant to be a read-only state in Ozone.
All existing read/write APIs will be allowed, since they are guaranteed to
be backward compatible. The only APIs that will not be allowed before
finalization are those that are new or that cause a layout change, for
example creating an EC file or Truncate. Hence, this is not really an
"online" upgrade.

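To make the gating concrete, here is a minimal Java sketch. It is not the
actual Ozone implementation: the class name, the feature list and the
version numbers are all hypothetical. It only assumes that each feature
records the layout version that introduced it, and that a feature is
allowed once the Metadata Layout Version (MLV) has reached that version.

```java
// Hypothetical sketch; none of these names come from the Ozone code base.
class LayoutGateSketch {

    // Each feature records the (made-up) layout version that introduced it.
    enum Feature {
        CREATE_KEY(0),
        READ_KEY(0),
        CREATE_EC_FILE(1),
        TRUNCATE(1);

        final int introducedIn;

        Feature(int introducedIn) {
            this.introducedIn = introducedIn;
        }
    }

    private final int softwareLayoutVersion; // SLV: newest layout the binaries understand
    private int metadataLayoutVersion;       // MLV: layout currently persisted

    LayoutGateSketch(int mlv, int slv) {
        if (mlv > slv) {
            throw new IllegalArgumentException("MLV cannot be newer than SLV");
        }
        this.metadataLayoutVersion = mlv;
        this.softwareLayoutVersion = slv;
    }

    // Pre-finalized: new binaries are running, but the layout is still the old one.
    boolean isPreFinalized() {
        return metadataLayoutVersion < softwareLayoutVersion;
    }

    // Existing APIs stay available; only features newer than the MLV are refused.
    boolean isAllowed(Feature feature) {
        return feature.introducedIn <= metadataLayoutVersion;
    }

    // Finalizing moves the MLV up to the SLV, which unlocks the new features.
    void finalizeUpgrade() {
        metadataLayoutVersion = softwareLayoutVersion;
    }

    public static void main(String[] args) {
        LayoutGateSketch gate = new LayoutGateSketch(0, 1);
        System.out.println("pre-finalized: " + gate.isPreFinalized());
        for (Feature f : Feature.values()) {
            System.out.println(f + " allowed: " + gate.isAllowed(f));
        }
        gate.finalizeUpgrade();
        System.out.println("after finalize, CREATE_EC_FILE allowed: "
                + gate.isAllowed(Feature.CREATE_EC_FILE));
    }
}
```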

*Enable independent upgrade of datanodes which can make it way more easier
to upgrade a big cluster.*
From the examples you have mentioned, I do see some advantages to
supporting separate datanode upgrades. The logic we went with now is meant
to be restrictive since it is a full non-rolling upgrade (master +
workers). However, keeping rolling upgrades in mind, we will implement it
in such a way that it can easily support the use case you mention in the
future. Instead of keeping 1 HDDS version, we can fork off the Datanode
layout version separately, and maintain a code level compatibility matrix
between SCM and Datanodes in the future. That way, SCM can support
Datanodes of multiple layout versions together, with the only restriction
that an active pipeline (Ratis/EC) can be created only with those of the
same layout version.
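
As a rough illustration of that restriction (a sketch only, with
hypothetical names; nothing here is existing SCM code), pipeline placement
could group the datanodes by layout version and pick all members from a
single group. Preferring the newest version that has enough nodes is an
assumption of the sketch, not something we have decided.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; these names are not taken from the Ozone code base.
class PipelinePlacementSketch {

    static final class Datanode {
        final String id;
        final int layoutVersion;

        Datanode(String id, int layoutVersion) {
            this.id = id;
            this.layoutVersion = layoutVersion;
        }
    }

    /**
     * Picks `replication` datanodes that all share one layout version,
     * trying the newest version first. Datanodes whose layout version is
     * not in SCM's supported set (the "compatibility matrix") are ignored.
     */
    static Optional<List<Datanode>> choosePipeline(
            List<Datanode> nodes, Set<Integer> supportedByScm, int replication) {
        Map<Integer, List<Datanode>> byVersion =
                new TreeMap<Integer, List<Datanode>>(Comparator.<Integer>reverseOrder());
        for (Datanode dn : nodes) {
            if (supportedByScm.contains(dn.layoutVersion)) {
                byVersion.computeIfAbsent(dn.layoutVersion, v -> new ArrayList<>()).add(dn);
            }
        }
        for (List<Datanode> group : byVersion.values()) {
            if (group.size() >= replication) {
                return Optional.of(new ArrayList<>(group.subList(0, replication)));
            }
        }
        return Optional.empty(); // no single layout version has enough nodes
    }

    public static void main(String[] args) {
        List<Datanode> nodes = Arrays.asList(
                new Datanode("dn-a", 1), new Datanode("dn-b", 2),
                new Datanode("dn-c", 1), new Datanode("dn-d", 2),
                new Datanode("dn-e", 1));
        Set<Integer> supported = new HashSet<>(Arrays.asList(1, 2));
        choosePipeline(nodes, supported, 3).ifPresent(pipeline -> {
            for (Datanode dn : pipeline) {
                System.out.println(dn.id + " (layout v" + dn.layoutVersion + ")");
            }
        });
    }
}
```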


*Finalize*
As mentioned earlier, the Pre-Finalized state is not a complete standstill
state for Ozone. Only new features/APIs/layout changes will be disallowed
until the user decides to Finalize. This state will serve as an "insurance"
for the user (and the Ozone team) to allow downgrade to an older version if
basic compat is broken or there is a serious regression. The name
"finalize" has been borrowed from the HDFS world. IMHO, it is a more
intuitive user experience to have a CLI-driven step (in the case of a
CM-managed cluster, it will be a clickable UI option) rather than having
the user restart the cluster again with a specific config change (which is
an Ozone internal detail) for the layout update.

*During your presentation you talked about the downgrade/rollback. I felt
that there could be a lot of tricky corner cases related to ratis +
snapshot. *
*As a concept I like it (but my 2nd point is more important for me, if
possible), but I think we will see tricky technical problems on the code
level.*
Yes, with respect to Ratis, it will be a challenge to guarantee that the
same "version" of the code "applies the transaction" on all the 3 nodes
during the upgrade. We can approach the problem by doing the following:
  1. Treating changes to Ratis request handling as layout changes.
  2. Tagging every Ratis request with the current layout version.
  3. Introducing a "factory" in the Ratis request handler which looks at
     the version of the request from the log, and then supplies the
     correct implementation for that request.
In the future, there is also a plan to move the handling of Ratis request
versioning to a version hierarchy separate from MLV/SLV. I will be adding
more details to the v2.0 doc that will be uploaded later this week to
HDDS-3698.
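
A minimal Java sketch of the "factory" idea, under stated assumptions: all
names are hypothetical, the request payload is reduced to a String, and the
rule "use the newest handler introduced at or before the request's layout
version" is my assumption for the illustration. The real design will be in
the doc.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch; not the Ozone or Ratis API.
class VersionedHandlerFactorySketch {

    interface RequestHandler {
        String apply(String payload);
    }

    // A request as read back from the Ratis log, tagged with the layout
    // version that was current when it was written.
    static final class LoggedRequest {
        final int layoutVersion;
        final String payload;

        LoggedRequest(int layoutVersion, String payload) {
            this.layoutVersion = layoutVersion;
            this.payload = payload;
        }
    }

    // Key: the layout version in which a handler implementation was introduced.
    private final NavigableMap<Integer, RequestHandler> handlers = new TreeMap<>();

    void register(int sinceLayoutVersion, RequestHandler handler) {
        handlers.put(sinceLayoutVersion, handler);
    }

    // Supplies the newest implementation that is not newer than the request.
    RequestHandler handlerFor(int requestLayoutVersion) {
        Map.Entry<Integer, RequestHandler> entry = handlers.floorEntry(requestLayoutVersion);
        if (entry == null) {
            throw new IllegalStateException(
                    "No handler for layout version " + requestLayoutVersion);
        }
        return entry.getValue();
    }

    // Every node replaying the same log entry selects the same implementation.
    String applyTransaction(LoggedRequest request) {
        return handlerFor(request.layoutVersion).apply(request.payload);
    }

    public static void main(String[] args) {
        VersionedHandlerFactorySketch factory = new VersionedHandlerFactorySketch();
        factory.register(0, payload -> "v0 handler applied " + payload);
        factory.register(2, payload -> "v2 handler applied " + payload);
        System.out.println(factory.applyTransaction(new LoggedRequest(1, "createKey")));
        System.out.println(factory.applyTransaction(new LoggedRequest(2, "createKey")));
    }
}
```

The point of keying on the version stored with the request is that the
choice of implementation no longer depends on which software version the
node replaying the log happens to be running.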

On Tue, Aug 25, 2020 at 5:22 AM Elek, Marton <e...@apache.org> wrote:

>
> Bumping this thread.
>
> If you have any opinion, please let me know.
>
> Thanks a lot,
> Marton
>
>
>
>
> On 6/26/20 2:51 PM, Elek, Marton wrote:
> >
> > Thank you very much for working on this, Aravindan.
> >
> > Finally, I collected my thoughts about the proposal.
> >
> > First of all, I really like the concept in general, and I like the style
> > of the documentation. It clearly explains a lot of existing behavior of
> > Ozone to make it easier to understand the problems.
> >
> > I like the abstraction of Software Layout Version vs. Metadata
> > Layout Version.
> >
> > I have some comments, but most of them are about technical details (not
> > about the concept itself). And they are questions and ideas not strong
> > opinions.
> >
> > 1. On-line upgrade vs offline-upgrade
> >
> > There is an option to do the upgrade offline: instead of calling an RPC,
> > executing a CLI.
> >
> > a) for online upgrade we need to introduce a very specific running mode
> > which means that nobody can use the cluster (or just in read only mode?)
> > until the server is "finalized"
> >
> > b) CLI can do any migration and upgrade the MLV inside the database. The
> > only question is the old / persisted data in the raft log, but IMHO it
> > shouldn't be a problem:
> >
> >   1. we should commit the MLV upgrade with a raft transaction anyway
> >   2. ratis log entries are like client calls, and we are supposed to be
> > backward compatible with old clients
> >
> > I am not sure if the CLI approach is better (it seems to be simpler
> > to me) but at least we can compare the two approaches and explain why
> > we prefer the RPC based method (if that is the better one).
> >
> > 2. I had an interesting conversation about why HDFS clusters are not
> > upgraded to Hadoop 3 and got some thoughts.
> >
> > This document proposes to always use the same version for SCM and
> > datanode, which makes it simple.
> >
> > I agree that it simplifies our job, but I think it can make the upgrade
> > harder. Especially for a 1-2000 node cluster.
> >
> > After the storage-class proposal I have a different mental model:
> >
> >   I think there can be different types of containers with different
> > replication strategies. Containers are classified with storage-class and
> > storage-class defines the container replication type.
> >
> > In this model it's very easy to imagine that different datanodes can
> > support different replication types (or replication versions).
> >
> > Let's say I have 1000 nodes and I upgrade 500 of them to a specific
> > datanode version which can support EC containers. SCM can easily manage
> > this problem if it's already prepared to support different types of
> > containers / replications (which is our goal, IMHO) based on node
> > capabilities.
> >
> > In this model it should be easy to enable independent upgrade of
> > datanodes which can make it way more easier to upgrade a big cluster.
> > (but I agree to require OM/SCM/RECON upgrade at the same time)
> >
> >
> > What do you think about this?
> >
> >
> > 3. Finalize
> >
> > Personally I don't like the "finalize" word. It suggests that we have an
> > upgrade process which can be "finalized", but in fact we don't have such
> > a process. We start to do any work only AFTER the finalize button is
> > pushed.
> >
> > I know that it comes from the HDFS history, but I would prefer to use
> > more generic and expressive words. (For example: jar/binary upgrade vs.
> > metadata upgrade).
> >
> > In the end I learned what "finalize" means (thanks to your patient
> > explanation during offline conversation ;-) ), but we can make the
> > understanding easier for the next users.
> >
> > 4. During your presentation you talked about the downgrade/rollback. I
> > felt that there could be a lot of tricky corner cases related to ratis +
> > snapshot. As a concept I like it (but my 2nd point is more important for
> > me, if possible), but I think we will see tricky technical problems on
> > the code level.
> >
> >
> > Thanks again for the great work,
> > Marton
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: ozone-dev-unsubscr...@hadoop.apache.org
> > For additional commands, e-mail: ozone-dev-h...@hadoop.apache.org
> >
>

-- 
Thanks & Regards,
Aravindan
