> I'm -1 for reverting HDFS-9427 in 3.x.
I'm -1 on not reverting. If yahoo/oath had the cycles to begin testing 3.0
prior to release, I would have -1'ed this change immediately. It's already
broken our QE testing pipeline.
> The port number is configurable, so if you want to use 8020 for NN RPC port
in Hadoop 3.x, you configure this to 8020.
No, it's not that easy. THE DEFAULT IS HARDCODED. You can only "configure"
the port by hardcoding it into all paths, which, ironically, multiple people
think shouldn't be done. Let's start thinking about the impact on those not
running just one isolated cluster with fully managed services that can take
downtime and be fully upgraded in sync.
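(For reference, the conf-side knob the quoted text refers to looks roughly like the fragment below. The hostname is made up; the property names are the standard Hadoop keys, with fs.defaultFS belonging in core-site.xml and dfs.namenode.rpc-address in hdfs-site.xml.)

```xml
<!-- Hypothetical fragment: pin the NN RPC endpoint to 8020 on a 3.x cluster.
     "nn.example.com" is a placeholder hostname. -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://nn.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address</name>
  <value>nn.example.com:8020</value>
</property>
```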
If the community doesn't revert, I'm not going to tell users to put the port in
all their paths. I'll hack the default back to 8020. Then I'll have to deal
with other users, or closed software stacks bundled with a stock 3.0 Hadoop
client, or those using a different 3.0 distro, contacting the wrong port. They
will break unless they hardcode port 8020 into their paths.
Let's say I do change to the new port: I still have to tell all my users with
2.x clients to hardcode the new port, but only after the upgrade. If the
"solution" is listening on both the old and new ports, it only proves that the
port change is frivolous, with zero added value.
Can someone please explain to me how any heterogeneous multi-cluster
environment benefits from this change? How does a single-cluster environment
benefit from this change? If there are no benefits to anyone, why are we even
debating a revert? Taking a hard line, under the guise of worrying about
compatibility for a tiny number of users, is either naive or political,
because this will potentially be disastrous for existing deployments.
Daryn
On Fri, Jan 19, 2018 at 3:06 AM, Akira Ajisaka <aajis...@apache.org> wrote:
I'm -1 for reverting HDFS-9427 in 3.x.
The port number is configurable, so if you want to use 8020 for
NN RPC port in Hadoop 3.x, you configure this to 8020. That's fine.
I don't think it is a critical problem.
If we are to revert this in 3.x, it causes an additional incompatible change.
-Akira
On 2018/01/18 11:03, Tsz Wo (Nicholas), Sze wrote:
(Re-sent. Just found that my previous email seems not to have been delivered
to common-dev.)
> The question is: how are we going to fix it?
> What do you propose? -C
First of all, let's state clearly what is the problem about. Please
help me out if I have missed anything.
The problem reported by HDFS-12990 is that HDFS-9427 changed the NN default
RPC port from 8020 to 9820. HDFS-12990 claimed, “the NN RPC port change is
painful for downstream on migrating to Hadoop 3.”
Note 1: This isn't a problem for HA clusters.
Note 2: The port is configurable; users can set it to any value.
Note 3: HDFS-9427 also changed many other HTTP/RPC ports, as shown below:

Namenode ports: 50470 --> 9871, 50070 --> 9870, 8020 --> 9820
Secondary NN ports: 50091 --> 9869, 50090 --> 9868
Datanode ports: 50020 --> 9867, 50010 --> 9866, 50475 --> 9865, 50075 --> 9864
The other port changes probably also affect downstream projects and give them
a “painful” experience; for example, the NN UI and WebHDFS use different ports
now. The problem is one of convenience, not anything serious like a security
bug.
There are a few possible solutions:
1) Consider that the port changes are not limited to NN RPC, and that default
port values should not be hardcoded in the first place. Downstream projects
probably need to fix other hardcoded ports (e.g. WebHDFS) anyway. Let's just
keep all the port changes and document them clearly (we may throw an exception
if an application tries to connect to an old port). In this way, 3.0.1 is
compatible with 3.0.0.
2) Further change the NN RPC server so that the NN listens on both 8020 and
9820 by default. Listening on two ports simultaneously is a new feature with
other benefits, e.g. one of the ports could be reserved for high-priority
applications so that they get better response times. It is compatible with
both 2.x and 3.0.0. Of course, users could choose to set it back to a single
port in the conf.
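Solution #2's dual-port listening could look roughly like the sketch below: plain Python sockets for illustration only, not Hadoop's actual RPC server, with port 0 meaning an OS-assigned port.

```python
# Sketch: one server process accepting connections on several ports at once,
# as solution #2 proposes for the NN RPC server. Illustrative only.
import selectors
import socket

def open_listeners(host, ports):
    """Bind one listening socket per requested port (port 0 = OS-assigned)."""
    socks = []
    for port in ports:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind((host, port))
        s.listen()
        socks.append(s)
    return socks

def accept_n(listeners, n):
    """Accept n connections total, from whichever listener becomes ready."""
    sel = selectors.DefaultSelector()
    for s in listeners:
        sel.register(s, selectors.EVENT_READ)
    conns = []
    while len(conns) < n:
        for key, _events in sel.select():
            conn, _addr = key.fileobj.accept()
            conns.append(conn)
    sel.close()
    return conns
```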
3) Revert the NN RPC port back to 8020. We need to ask where the revert should
happen:
3.1) Revert it in 3.0.1, as proposed by HDFS-12990. However, this is an
incompatible change between dot releases 3.0.0 and 3.0.1, and it violates our
policy. Being compatible is very important: users expect 3.0.0 and 3.0.1 to be
compatible. How could we explain that 3.0.0 and 3.0.1 are incompatible merely
for convenience?
3.2) Revert it in 4.0.0. There is no compatibility issue, since 3.0.0 and
4.0.0 are allowed to have incompatible changes according to our policy.
Since compatibility is more important than convenience, solution #3.1 is
impermissible. Of the remaining solutions, both #1 and #2 are fine with me.
Thanks.
Tsz-Wo
On Friday, January 12, 2018, 12:26:47 PM GMT+8, Chris Douglas
<cdoug...@apache.org> wrote:
On Thu, Jan 11, 2018 at 6:34 PM Tsz Wo Sze <szets...@yahoo.com> wrote:
The question is: how are we going to fix it?
What do you propose? -C
No incompatible changes are allowed between 3.0.0 and 3.0.1. Dot
releases only allow bug fixes.
We may not like the statement above but it is our compatibility policy.
We should either follow the policy or revise it.
Some more questions:
- What if someone is already using 3.0.0 and has changed all their scripts to
9820? Just let them fail?
- Compared to 2.x, 3.0.0 has many incompatible changes. Are we going to have
other incompatible changes in future minor and dot releases? What are the
criteria for deciding which incompatible changes are allowed?
- I hate that we prematurely released 3.0.0 and would now make 3.0.1
incompatible with 3.0.0. If the "bug" is that serious, why not fix it in 4.0.0
and declare 3.x dead?
- It seems obvious that no one seriously tested this, which is why the problem
was not uncovered until now. Are there bugs in our current release procedure?
Thanks,
Tsz-Wo
On Thursday, January 11, 2018, 11:36:33 AM GMT+8, Chris Douglas
<cdoug...@apache.org> wrote:
Isn't this limited to reverting the 8020 -> 9820 change? -C
On Wed, Jan 10, 2018 at 6:13 PM Eric Yang <ey...@hortonworks.com> wrote:
The fix in HDFS-9427 can potentially bring in new customers, because there is
less chance of a newcomer encountering the “port already in use” problem. If
we make the change according to HDFS-12990, this incompatible change does not
make the original incompatible change compatible. The other ports are not
reverted by HDFS-12990, so users will still encounter the bad taste in the
mouth that HDFS-9427 attempted to solve. Please do consider the negative side
effects of reverting as well as the incompatible minor release change.
Thanks
Regards,
Eric
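(The “port already in use” failure mode mentioned above is just a failed TCP bind: two processes may hold a port, but only one can listen on it. A minimal illustration, not Hadoop code:)

```python
# Sketch: attempting to listen on a port that already has a listener fails
# with EADDRINUSE -- the "port already in use" problem discussed above.
import socket

def try_listen(host, port):
    """Return a listening socket on success, or the OS errno on failure."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        s.listen()
        return s
    except OSError as e:
        s.close()
        return e.errno
```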
From: larry mccay <lmc...@apache.org>
Date: Wednesday, January 10, 2018 at 10:53 AM
To: Daryn Sharp <da...@oath.com>
Cc: "Aaron T. Myers" <a...@apache.org>, Eric Yang <ey...@hortonworks.com>,
Chris Douglas <cdoug...@apache.org>,
Hadoop Common <common-dev@hadoop.apache.org>
Subject: Re: When are incompatible changes acceptable (HDFS-12990)
On Wed, Jan 10, 2018 at 1:34 PM, Daryn Sharp <da...@oath.com> wrote:
I fully agree the port changes should be reverted. Although "incompatible",
the potential impact to existing 2.x deploys is huge. I'd rather inconvenience
3.0 deploys, which comprise <1% of customers. An incompatible change to revert
an incompatible change is called compatibility.
+1
Most importantly, consider that there is no good upgrade path for existing
deploys, esp. large and/or multi-cluster environments. It's only feasible for
first-time deploys or simple single-cluster upgrades willing to take downtime.
Let's consider a few reasons why:
1. RU (rolling upgrade) is completely broken. Running jobs will fail. If MR on
HDFS bundles the configs, there's no way to transparently coordinate the
switch to the new bundle with the port changed. Job submissions will fail.
2. Users generally do not add the RPC port number to URIs, so unless their
configs are updated they will contact the wrong port. Seamlessly coordinating
the conf change without massive failures is impossible.
3. Even if client confs are updated, they will break in a multi-cluster env
with NNs using different ports. Users/services will be forced to add the port.
The cited Hive "issue" is not a bug, since it's the only way to work in a
multi-port env.
4. Coordinating the port addition/change in URIs in systems everywhere (you
know something will be missed), updating confs, restarting all services, and
requiring customers to redeploy their workflows in sync with the NN upgrade
will cause mass disruption and downtime that is unacceptable for production
environments.
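Point 2 above (clients omitting the port and relying on conf) can be illustrated with a tiny sketch of how the effective port gets resolved; this is generic URI handling, not HdfsClient code, and the URIs are made up:

```python
# Sketch: clients usually write hdfs://nn1/path with no port, so the port
# actually contacted comes from configuration. Changing the configured
# default silently redirects every such URI.
from urllib.parse import urlparse

def effective_port(uri, configured_default):
    """Port a client would contact: an explicit port in the URI wins,
    otherwise fall back to the configured default."""
    port = urlparse(uri).port
    return port if port is not None else configured_default
```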
This is a solution to a non-existent problem. Ports can be bound by multiple
processes, but only one can listen. Maybe multiple listeners are an issue for
compute nodes, but not for responsibly managed service nodes. I.e., who runs
arbitrary services on the NNs that bind to random ports? Besides, the default
port is and was ephemeral, so the change solved nothing.
This either standardizes the ports to a particular customer's ports or is a
poorly thought-out whim. In either case, the needs of the many outweigh the
needs of the few/none (3.0 users). The only logical conclusion is to revert.
If a particular site wants to change default ports and deal with the massive
fallout, they can explicitly change the ports themselves.
Daryn
On Tue, Jan 9, 2018 at 11:22 PM, Aaron T. Myers <a...@apache.org> wrote:
On Tue, Jan 9, 2018 at 3:15 PM, Eric Yang <ey...@hortonworks.com> wrote:
While I agree the original port change was unnecessary, I don't think the
Hadoop NN port change is a bad thing.
I worked for a Hadoop distro whose NN RPC port defaulted to 9000. When we
migrated from BigInsights to IOP, and now to HDP, we had to move customer Hive
metadata to the new NN RPC port. It only took one developer (myself) to write
the tool for the migration. The resulting workload was not as bad as most
people anticipated, because Hadoop depends on a configuration file for
referencing the namenode, so most of the code works transparently. It also
helped harden the downstream testing tools to be more robust.
While there are of course ways to deal with this, the question really should
be whether or not it's a desirable thing to do to our users.
We will never know how many people are actively working on Hadoop 3.0.0.
Perhaps a couple hundred developers, or thousands.
You're right that we can't know for sure, but I strongly suspect that this is
a substantial overestimate. Given how conservative Hadoop operators tend to
be, I view it as exceptionally unlikely that many deployments have been
created on or upgraded to Hadoop 3.0.0 since it was released less than a month
ago.
Further, I hope you'll agree that the number of
users/developers/deployments/applications currently on Hadoop 2.x is *vastly*
greater than the number who might have jumped on Hadoop 3.0.0 so quickly. When
all of those users upgrade to any 3.x version, they will encounter this
needless incompatible change and be forced to work around it.
I think the switch back may save a few developers some work, but more people
could be impacted by unexpected changes in minor releases in the future. I
recommend keeping the current values to avoid rule-bending and future
frustrations.
That we allow this incompatible change now does not mean that we are
categorically allowing more incompatible changes in the future. My point is
that we should in all instances evaluate the merit of an incompatible change
on a case-by-case basis. This is not an exceptional circumstance: we've made
incompatible changes in the past when appropriate, e.g. breaking some clients
to address a security issue. I and others believe that in this case the
benefits greatly outweigh the downsides of changing this back to what it has
always been.
Best,
Aaron
Regards,
Eric
On 1/9/18, 11:21 AM, "Chris Douglas" <cdoug...@apache.org> wrote:
Particularly since 9820 isn't in the contiguous range of ports in HDFS-9427,
is there any value in this change?
Let's change it back to prevent the disruption to users, but downstream
projects should treat this as a bug in their tests. Please open JIRAs in
affected projects. -C
On Tue, Jan 9, 2018 at 5:18 AM, larry mccay <lmc...@apache.org> wrote:
> On Mon, Jan 8, 2018 at 11:28 PM, Aaron T. Myers <a...@apache.org> wrote:
>
>> Thanks a lot for the response, Larry. Comments inline.
>>
>> On Mon, Jan 8, 2018 at 6:44 PM, larry mccay <lmc...@apache.org> wrote:
>>
>>> Question...
>>>
>>> Can this be addressed in some way during or before upgrade that allows it
>>> to only affect new installs?
>>> Even a config-based workaround prior to upgrade might make this change
>>> less disruptive.
>>>
>>> If part of the upgrade process includes a step (maybe even a script) to
>>> set the NN RPC port explicitly beforehand, then it would allow existing
>>> deployments and related clients to remain whole - otherwise they will
>>> pick up the new default port.
>>>
>>
>> Perhaps something like this could be done, but I think there are downsides
>> to anything like this. For example, I'm sure there are plenty of
>> applications written on top of Hadoop that have tests which hard-code the
>> port number. Nothing we do in a setup script will help here. If we don't
>> change the default port back to what it was, these tests will likely all
>> have to be updated.
>>
>>
>
> I may not have made my point clear enough.
> What I meant to say is to fix the default port but direct folks to
> explicitly set the port they are using in a deployment (the current
> default) so that it doesn't change out from under them - unless they are
> fine with it changing.
>
>
>>
>>> Meta note: we shouldn't be so pedantic about policy that we can't back
>>> out something that is considered a bug or even a mistake.
>>>
>>
>> This is my bigger point. Rigidly adhering to the compat guidelines in this
>> instance helps almost no one, while hurting many folks.
>>
>> We basically made a mistake when we decided to change the default NN port
>> with little upside, even between major versions. We discovered this very
>> quickly, and we have an opportunity to fix it now and in so doing likely
>> disrupt very, very few users and downstream applications. If we don't
>> change it, we'll be causing difficulty for our users, downstream
>> developers, and ourselves, potentially for years.
>>
>
> Agreed.
>
>
>>
>> Best,
>> Aaron
>>
---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org