[
https://issues.apache.org/jira/browse/HADOOP-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976731#comment-15976731
]
Steve Loughran commented on HADOOP-11656:
-----------------------------------------
[~ctubbsii]
bq. It seems like you're trying to do dependency convergence for a target
environment, in the upstream project, without knowing what the downstream
environment is yet
Server side, we know and care about the users: HBase, Hive, Accumulo, Spark,
etc. We make sure that things are in sync (the great protobuf upgrade of 2013,
the leveldb sync, Jackson, etc.). But it's getting harder and harder to produce
binaries which work for HDFS operations (e.g. RPC calls to HDFS, YARN) and
still allow those downstream projects to actually upgrade their own code.
bq. Apache ships source. Projects should keep that in mind. If the source is
good, then users can build/patch/tweak for their downstream environment, and
packagers/vendors can do this as an intermediary for their own users. If the
project focuses too much on their own "convenience" binaries as the primary
artifact, they might risk making it harder for the source to be reusable
downstream.
We also imply that users of our software will be able to engage in
authenticated IPC calls with a Hadoop cluster built by somebody else. It's
really hard to pull that off once things like protobuf, Avro, Kryo (at the
Hive/Spark layer) and Jackson all become part of the story. Locking things down
helps us make that interop guarantee, but currently does it by imposing
inflexibility downstream. A shading option will give those downstream projects
the option of controlling their own versions of (non-native) things without
waiting for Java 9.
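To make the shading point concrete, here is a rough sketch of what relocation
buys a downstream project. Guava is just an example dependency, and the
relocated package name in the comments is an assumption for illustration, not a
committed layout:
{code:java}
// Illustrative sketch only: the relocated package prefix mentioned below is an
// assumption for this example, not an actual Hadoop artifact layout.
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class DownstreamClient {
  public static void main(String[] args) {
    // The downstream project codes against whatever Guava version it ships itself...
    Cache<String, String> cache = CacheBuilder.newBuilder()
        .maximumSize(1_000)
        .build();
    cache.put("table", "metadata");
    System.out.println(cache.getIfPresent("table"));

    // ...while a shaded Hadoop client jar would keep its own copy under a
    // relocated name such as org.apache.hadoop.shaded.com.google.common.cache.Cache,
    // so the two versions never meet on the same classpath.
  }
}
{code}
The same goes for Jackson, protobuf and the rest of the non-native
dependencies: the wire formats still have to stay compatible for the IPC
guarantee, but the jars on either side no longer have to be the same versions.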
bq. Adoption of semantic versioning only requires a commitment by the
community to communicate, through the versioning semantics, the extent to which
the APIs have changed. For example, if the API results in a "we broke it"
situation compared to 4.1.0, you call it 5.0.0, not 4.2.0. The only real
requirement to use semantic versioning is to define what you consider "public
API" for the purposes of breakage. This can be refined over several releases.
And, following some sort of semantic versioning (not necessarily semver.org's
definition) doesn't have to be perfect. It just has to be a goal to strive for
with each release.
Like I said, we can discuss this at length. Probably over beer.
The concept of _Interface_ was first introduced in _On the Criteria To Be Used
in Decomposing Systems into Modules_, D.L. Parnas, 1972, who defined the
interface to be not just the binary signature between two modules: the API,
calling conventions & scope, but also the behaviour, _the semantics_. All too
often the latter is the harder part to maintain: while IDL & programming
language compilers can keep data formats and API-layer interfaces consistent
and check the static linking, all we have for verifying behavioural compliance
is our test suites, which inevitably operate in a small subset of the
Hilbert/configuration space whose dimensionality is determined by the number of
configuration options in our code and by the deployment environment itself. And
of course there are epiphenomena: the side effects people accidentally code
against, like the time it takes to enumerate all filesystem implementations on
the classpath, the ordering of entries on a classpath, etc. Stuff we don't even
know people are using until we change something and it breaks.
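As an entirely illustrative example of how that kind of accidental coupling
gets written (this is not code from Hadoop itself), consider a caller that
walks the FileSystem service registrations directly:
{code:java}
import java.util.ServiceLoader;

import org.apache.hadoop.fs.FileSystem;

// Illustrative anti-pattern, not Hadoop code: a caller that quietly couples
// itself to whatever happens to be on the classpath.
public class ClasspathCoupling {
  public static void main(String[] args) {
    // Neither the time this enumeration takes nor the order in which
    // implementations come back is part of any documented contract,
    // yet both change whenever the classpath does.
    for (FileSystem fs : ServiceLoader.load(FileSystem.class)) {
      System.out.println(fs.getClass().getName());
    }
  }
}
{code}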
All Semver says is "A new major version is a 100% guarantee we've broken
stuff". I take that as a given anyway. We also pretty much consider that all
minor point changes break things, and once we know how something breaks, it's
often easier to leave it alone for a while.
Take, for example, HADOOP-9623, upgrading jets3t to 0.9.0. It seems to work,
but includes a single-line change: _read to the end of the current GET when
closing a stream_. It achieves the same directly observable outcome, "stream is
closed", and adds a benefit, "reuse of HTTP 1.1 connections". Unfortunately,
that little change has a consequence: it turns out to raise HADOOP-12376, where
performance dies on any seek() of a large file, as irrespective of file length
the entire file is read down. See? No change in any of Parnas's criteria, yet
enough of a regression to make the thing unusable. The worst part: it passed
the tests. Everything worked, even our tests, because yes, the observed state
of the SUT appeared the same. It's only on very large files that the issues
arise. (We closed that as a WONTFIX, BTW, moved to S3A, fixed the same problem
when it recurred, and now have a test which implicitly verifies it never comes
back by seeking around a 20MB file of Amazon's and verifying that the tests
don't time out.)
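For anyone who hasn't hit it, here is a stripped-down sketch of the failure
mode. The class below is hypothetical, not the actual jets3t or S3A code, but
it shows how "drain the GET on close()", combined with a close-and-reopen
seek(), means a seek near the start of a huge object reads the rest of the
object down first:
{code:java}
import java.io.IOException;
import java.io.InputStream;

// Hypothetical object-store input stream, not the actual jets3t or S3A code.
abstract class ObjectStoreInputStream extends InputStream {
  protected InputStream httpStream;   // body of the currently open GET request
  protected long pos;

  /** Issue a new (ranged) GET starting at targetPos; details omitted. */
  protected abstract InputStream reopen(long targetPos) throws IOException;

  /** Naive seek: close the current request, then open a new one. */
  public void seek(long targetPos) throws IOException {
    closeCurrentRequest();
    httpStream = reopen(targetPos);
    pos = targetPos;
  }

  private void closeCurrentRequest() throws IOException {
    // The "single line change": drain to the end of the GET so the HTTP/1.1
    // connection can go back into the pool for reuse. If the object is huge
    // and we are near its start, this loop pulls the rest of the object down
    // before the new request is even issued.
    byte[] scratch = new byte[8192];
    while (httpStream.read(scratch) != -1) {
      // discard: we only want the connection back in a reusable state
    }
    httpStream.close();
  }

  @Override
  public int read() throws IOException {
    int b = httpStream.read();
    if (b >= 0) {
      pos++;
    }
    return b;
  }

  @Override
  public void close() throws IOException {
    closeCurrentRequest();
  }
}
{code}
Whether close() should drain the connection or abort it depends entirely on how
much data is left to read, which is exactly the kind of behavioural trade-off
that no signature or linkage check will ever catch.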
bq. I do hope that some of the Hadoop developer community will consider
helping with downstream community packaging, and feeding back lessons learned
into the upstream source
I've personally reached the view that most updates are dangerous, with [a list
of what I fear the
most|https://steveloughran.blogspot.co.uk/2016/05/fear-of-dependencies.html].
As a developer, I want the new stuff and would love to use the latest releases.
As someone who fields version-related upgrade support calls, sometimes with the
word "Kerberos" in them, I'm happy with the stuff that mostly works, as long as
its failures are things we know of.
> Classpath isolation for downstream clients
> ------------------------------------------
>
> Key: HADOOP-11656
> URL: https://issues.apache.org/jira/browse/HADOOP-11656
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: Sean Busbey
> Assignee: Sean Busbey
> Priority: Blocker
> Labels: classloading, classpath, dependencies, scripts, shell
> Attachments: HADOOP-11656_proposal.md
>
>
> Currently, Hadoop exposes downstream clients to a variety of third party
> libraries. As our code base grows and matures we increase the set of
> libraries we rely on. At the same time, as our user base grows we increase
> the likelihood that some downstream project will run into a conflict while
> attempting to use a different version of some library we depend on. This has
> already happened with, e.g., Guava several times for HBase, Accumulo, and Spark
> (and I'm sure others).
> While YARN-286 and MAPREDUCE-1700 provided an initial effort, they default to
> off and they don't do anything to help dependency conflicts on the driver
> side or for folks talking to HDFS directly. This should serve as an umbrella
> for changes needed to do things thoroughly on the next major version.
> We should ensure that downstream clients
> 1) can depend on a client artifact for each of HDFS, YARN, and MapReduce that
> doesn't pull in any third party dependencies
> 2) only see our public API classes (or as close to this as feasible) when
> executing user provided code, whether client side in a launcher/driver or on
> the cluster in a container or within MR.
> This provides us with a double benefit: users get less grief when they want
> to run substantially ahead or behind the versions we need and the project is
> freer to change our own dependency versions because they'll no longer be in
> our compatibility promises.
> Project specific task jiras to follow after I get some justifying use cases
> written in the comments.