Re: When are incompatible changes acceptable (HDFS-12990)

Akira Ajisaka Fri, 19 Jan 2018 01:07:13 -0800

I'm -1 for reverting HDFS-9427 in 3.x.

The port number is configurable, so if you want to use 8020 for
NN RPC port in Hadoop 3.x, you configure this to 8020. That's fine.
I don't think it is a critical problem.


If we revert this in 3.x, it causes an additional incompatible change.

-Akira

On 2018/01/18 11:03, Tsz Wo (Nicholas), Sze wrote:

  (Re-sent. I just found that my previous email seems not to have been
delivered to common-dev.)

> > The question is: how are we going to fix it?
> What do you propose? -C

First of all, let's state clearly what the problem is. Please help me
out if I have missed anything.

The problem reported by HDFS-12990 is that HDFS-9427 changed the NN default
RPC port from 8020 to 9820. HDFS-12990 claimed, “the NN RPC port change is
painful for downstream on migrating to Hadoop 3.”

Note 1: This isn't a problem for HA clusters.
Note 2: The port is configurable. Users can set it to any value.
Note 3: HDFS-9427 also changed many other HTTP/RPC ports, as shown below.

Namenode ports: 50470 --> 9871, 50070 --> 9870, 8020 --> 9820
Secondary NN ports: 50091 --> 9869, 50090 --> 9868
Datanode ports: 50020 --> 9867, 50010 --> 9866, 50475 --> 9865, 50075 --> 9864
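
As a rough illustration, the port changes listed above can be captured as a
small lookup table. This is only a sketch in Python; the names `PORT_CHANGES`
and `new_default_port` are hypothetical and not part of Hadoop.

```python
# Default port changes made by HDFS-9427, as listed in this thread
# (2.x default -> 3.0.0 default). The grouping keys are illustrative only.
PORT_CHANGES = {
    "namenode": {50470: 9871, 50070: 9870, 8020: 9820},
    "secondary_namenode": {50091: 9869, 50090: 9868},
    "datanode": {50020: 9867, 50010: 9866, 50475: 9865, 50075: 9864},
}


def new_default_port(old_port):
    """Return the 3.0.0 default for a 2.x default port, or None if not listed."""
    for mapping in PORT_CHANGES.values():
        if old_port in mapping:
            return mapping[old_port]
    return None
```

A downstream test suite could use a table like this to flag hard-coded 2.x
defaults; for example, `new_default_port(8020)` gives `9820`.
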

The other port changes probably also affect downstream projects and give them a
“painful” experience. For example, the NN UI and WebHDFS use a different port.
The problem is a matter of convenience, not anything serious like a security bug.

There are a few possible solutions:

1) Consider that the port changes are not limited to NN RPC and that the default
port value should not be hardcoded. Also, downstream projects probably need to
fix other hardcoded ports (e.g. WebHDFS) anyway. Let’s just keep all the port
changes and document them clearly (we may throw an exception if some
applications try to connect to the old ports). In this way, 3.0.1 is
compatible with 3.0.0.

2) Further change the NN RPC so that the NN listens on both 8020 and 9820 by
default. Having the NN listen on two ports simultaneously is a new feature. The
feature has other benefits, e.g. one of the ports can be reserved for
high-priority applications so that they get a better response time. It is
compatible with both 2.x and 3.0.0. Of course, users could choose to set it
back to one of the ports in the conf.

3) Revert the NN RPC port back to 8020. We need to ask where the revert should
happen.

3.1) Revert it in 3.0.1 as proposed by HDFS-12990. However, this is an
incompatible change between dot releases 3.0.0 and 3.0.1, and it violates our
policy. Being compatible is very important. Users expect 3.0.0 and 3.0.1 to be
compatible. How could we explain that 3.0.0 and 3.0.1 are incompatible for the
sake of convenience?

3.2) Revert it in 4.0.0. There is no compatibility issue, since 3.0.0 and 4.0.0
are allowed to have incompatible changes according to our policy.

Since compatibility is more important than convenience, Solution #3.1 is
impermissible. Of the remaining solutions, both #1 and #2 are fine with me.
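
To make Solution #2 a little more concrete, here is a minimal sketch in Python
of one process listening on two ports at once. It is not how the NN RPC server
is implemented; `open_listeners` is a hypothetical helper, and binding to
127.0.0.1 is an assumption made for the example.

```python
import socket


def open_listeners(ports, host="127.0.0.1"):
    """Bind and listen on every port in `ports`; return the listening sockets.

    If any bind fails (e.g. the port is already in use), all sockets opened
    so far are closed and the error is re-raised.
    """
    listeners = []
    try:
        for port in ports:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            listeners.append(sock)  # track first so it is closed on failure
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            sock.bind((host, port))
            sock.listen()
    except OSError:
        for sock in listeners:
            sock.close()
        raise
    return listeners
```

A server following this idea would call something like
`open_listeners([8020, 9820])` and accept connections from both sockets, so
clients configured for either default could still connect.
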
Thanks.
Tsz-Wo

On Friday, January 12, 2018, 12:26:47 PM GMT+8, Chris Douglas
<cdoug...@apache.org> wrote:
On Thu, Jan 11, 2018 at 6:34 PM Tsz Wo Sze <szets...@yahoo.com> wrote:

The question is: how are we going to fix it?



What do you propose? -C

No incompatible changes are allowed between 3.0.0 and 3.0.1. Dot releases only 
allow bug fixes.


We may not like the statement above but it is our compatibility policy.  We 
should either follow the policy or revise it.

Some more questions:

- What if someone is already using 3.0.0 and has changed all the scripts to
  9820? Just let them fail?
- Compared to 2.x, 3.0.0 has many incompatible changes. Are we going to
  have other incompatible changes in future minor and dot releases? What are
  the criteria for deciding which incompatible changes are allowed?
- I hate that we have prematurely released 3.0.0 and would now make 3.0.1
  incompatible with 3.0.0. If the "bug" is that serious, why not fix it in
  4.0.0 and declare 3.x dead?
- It seems obvious that no one has seriously tested it, so the problem
  was not uncovered until now. Are there bugs in our current release procedure?

Thanks
Tsz-Wo


     On Thursday, January 11, 2018, 11:36:33 AM GMT+8, Chris Douglas 
<cdoug...@apache.org> wrote:

Isn't this limited to reverting the 8020 -> 9820 change? -C


On Wed, Jan 10, 2018 at 6:13 PM Eric Yang <ey...@hortonworks.com> wrote:

The fix in HDFS-9427 can potentially bring in new customers because
newcomers are less likely to encounter the “port already in use” problem.
If we make the change proposed in HDFS-12990, that incompatible change does
not make the original incompatible change compatible.  The other ports are
not reverted under HDFS-12990.  Users will still be left with the bad taste
in the mouth that HDFS-9427 attempted to remove.  Please consider both the
negative side effects of reverting and those of an incompatible change in a
minor release.  Thanks

Regards,
Eric

From: larry mccay <lmc...@apache.org>
Date: Wednesday, January 10, 2018 at 10:53 AM
To: Daryn Sharp <da...@oath.com>
Cc: "Aaron T. Myers" <a...@apache.org>, Eric Yang <ey...@hortonworks.com>,
Chris Douglas <cdoug...@apache.org>,
Hadoop Common <common-dev@hadoop.apache.org>
Subject: Re: When are incompatible changes acceptable (HDFS-12990)

On Wed, Jan 10, 2018 at 1:34 PM, Daryn Sharp <da...@oath.com> wrote:

I fully agree the port changes should be reverted.  Although
"incompatible", the potential impact to existing 2.x deploys is huge.  I'd
rather inconvenience 3.0 deploys that comprise <1% of customers.  An
incompatible change to revert an incompatible change is called
compatibility.

+1




Most importantly, consider that there is no good upgrade path for existing
deploys, esp. large and/or multi-cluster environments.  It’s only feasible
for first-time deploys or simple single-cluster upgrades willing to take
downtime.  Let's consider a few reasons why:



1. RU is completely broken.  Running jobs will fail.  If MR on hdfs
bundles the configs, there's no way to transparently coordinate the switch
to the new bundle with the port changed.  Job submissions will fail.



2. Users generally do not add the rpc port number to uris so unless their
configs are updated they will contact the wrong port.  Seamlessly
coordinating the conf change without massive failures is impossible.



3. Even if client confs are updated, they will break in a multi-cluster
env with NNs using different ports.  Users/services will be forced to add
the port.  The cited hive "issue" is not a bug since it's the only way to
work in a multi-port env.



4. Coordinating the port add/change of uris in systems everywhere (you
know something will be missed), updating of confs, restarting all services,
requiring customers to redeploy their workflows in sync with the NN
upgrade, will cause mass disruption and downtime that will be unacceptable
for production environments.
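
Points 2 and 3 can be sketched as follows: a client that sees a URI without a
port falls back to whatever default its own configuration or version implies.
This is a simplified Python model, not Hadoop's actual resolution code;
`resolve_nn_address` and the host name are hypothetical.

```python
from urllib.parse import urlsplit

# NN RPC defaults discussed in this thread.
DEFAULT_NN_RPC_PORT_2X = 8020  # default before HDFS-9427
DEFAULT_NN_RPC_PORT_30 = 9820  # default in 3.0.0


def resolve_nn_address(uri, default_port):
    """Return (host, port) for an hdfs:// URI, using default_port if none is given."""
    parts = urlsplit(uri)
    port = parts.port if parts.port is not None else default_port
    return parts.hostname, port
```

With `hdfs://nn1.example.com/data`, a client using the 2.x default resolves
port 8020 while one using the 3.0.0 default resolves 9820; only a URI with an
explicit port is resolved the same way by both.
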



This is a solution to a non-existent problem.  Ports can be bound by
multiple processes but only 1 can listen.  Maybe multiple listeners is an
issue for compute nodes but not responsibly managed service nodes.  I.e., who
runs arbitrary services on the NNs that bind to random ports?  Besides, the
default port is and was ephemeral so it solved nothing.



This either standardizes ports to a particular customer's ports or is a
poorly thought out whim.  In either case, the needs of the many outweigh
the needs of the few/none (3.0 users).  The only logical conclusion is
revert.  If a particular site wants to change default ports and deal with
the massive fallout, they can explicitly change the ports themselves.



Daryn

On Tue, Jan 9, 2018 at 11:22 PM, Aaron T. Myers <a...@apache.org> wrote:
On Tue, Jan 9, 2018 at 3:15 PM, Eric Yang <ey...@hortonworks.com> wrote:

While I agree the original port change was unnecessary, I don’t think
Hadoop NN port change is a bad thing.

I worked for a Hadoop distro whose NN RPC port defaulted to 9000.
When we migrated from BigInsights to IOP and now to HDP, we had to move
customer Hive metadata to the new NN RPC port.  It only took one developer
(myself) to write the tool for the migration.  The incurred workload was
not as bad as most people anticipated because Hadoop depends on
configuration files for referencing the namenode.  Most of the code can work
transparently.  It helped to harden the downstream testing tools to be
more robust.


While there are of course ways to deal with this, the question really
should be whether or not it's a desirable thing to do to our users.


We will never know how many people are actively working on Hadoop 3.0.0.
Perhaps a couple hundred developers, or thousands.



You're right that we can't know for sure, but I strongly suspect that this
is a substantial overestimate. Given how conservative Hadoop operators tend
to be, I view it as exceptionally unlikely that many deployments have been
created on or upgraded to Hadoop 3.0.0 since it was released less than a
month ago.

Further, I hope you'll agree that the number of
users/developers/deployments/applications which are currently on Hadoop 2.x
is *vastly* greater than anyone who might have jumped on Hadoop 3.0.0 so
quickly. When all of those users upgrade to any 3.x version, they will
encounter this needless incompatible change and be forced to work around
it.

I think the switch back may save a few developers work, but more people
could be impacted by unexpected minor-release changes in the future.  I
recommend keeping the current values to avoid rule bending and future
frustrations.


That we allow this incompatible change now does not mean that we are
categorically allowing more incompatible changes in the future. My point is
that we should in all instances evaluate the merit of any incompatible
change on a case-by-case basis. This is not an exceptional circumstance -
we've made incompatible changes in the past when appropriate, e.g. breaking
some clients to address a security issue. I and others believe that in this
case the benefits greatly outweigh the downsides of changing this back to
what it has always been.

Best,
Aaron


Regards,
Eric

On 1/9/18, 11:21 AM, "Chris Douglas" <cdoug...@apache.org> wrote:


     Particularly since 9820 isn't in the contiguous range of ports in
     HDFS-9427, is there any value in this change?

     Let's change it back to prevent the disruption to users, but
     downstream projects should treat this as a bug in their tests. Please
     open JIRAs in affected projects. -C


     On Tue, Jan 9, 2018 at 5:18 AM, larry mccay <lmc...@apache.org> wrote:

     > On Mon, Jan 8, 2018 at 11:28 PM, Aaron T. Myers <a...@apache.org> wrote:
     >
     >> Thanks a lot for the response, Larry. Comments inline.
     >>
     >> On Mon, Jan 8, 2018 at 6:44 PM, larry mccay <lmc...@apache.org> wrote:
     >>
     >>> Question...
     >>>
     >>> Can this be addressed in some way during or before upgrade that allows it
     >>> to only affect new installs?
     >>> Even a config based workaround prior to upgrade might make this change
     >>> less disruptive.
     >>>
     >>> If part of the upgrade process includes a step (maybe even a script) to
     >>> set the NN RPC port explicitly beforehand then it would allow existing
     >>> deployments and related clients to remain whole - otherwise it will pick up
     >>> the new default port.
     >>>
     >>
     >> Perhaps something like this could be done, but I think there are downsides
     >> to anything like this. For example, I'm sure there are plenty of
     >> applications written on top of Hadoop that have tests which hard-code the
     >> port number. Nothing we do in a setup script will help here. If we don't
     >> change the default port back to what it was, these tests will likely all
     >> have to be updated.
     >>
     >>
     >
     > I may not have made my point clear enough.
     > What I meant to say is to fix the default port but direct folks to
     > explicitly set the port they are using in a deployment (the current
     > default) so that it doesn't change out from under them - unless

they

are
     > fine with it changing.
     >
     >
     >>
     >>> Meta note: we shouldn't be so pedantic about policy that we can't back
     >>> out something that is considered a bug or even a mistake.
     >>>
     >>
     >> This is my bigger point. Rigidly adhering to the compat guidelines in this
     >> instance helps almost no one, while hurting many folks.
     >>
     >> We basically made a mistake when we decided to change the default NN port
     >> with little upside, even between major versions. We discovered this very
     >> quickly, and we have an opportunity to fix it now and in so doing likely
     >> disrupt very, very few users and downstream applications. If we don't
     >> change it, we'll be causing difficulty for our users, downstream
     >> developers, and ourselves, potentially for years.
     >>
     >
     > Agreed.
     >
     >
     >>
     >> Best,
     >> Aaron
     >>

     ---------------------------------------------------------------------
     To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
     For additional commands, e-mail: common-dev-h...@hadoop.apache.org

