[
https://issues.apache.org/jira/browse/HADOOP-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976731#comment-15976731
]
Steve Loughran commented on HADOOP-11656:
-----------------------------------------
[~ctubbsii]
bq. It seems like you're trying to do dependency convergence for a target
environment, in the upstream project, without knowing what the downstream
environment is yet
Server side, we know and care about the users: HBase, Hive, Accumulo, Spark,
etc. We make sure that things are in sync (the great protobuf upgrade of 2013,
the leveldb sync, Jackson, etc.). But it's getting harder and harder to produce
binaries which work for HDFS operations (e.g. RPC calls to HDFS, YARN) and
still allow those downstream projects to actually upgrade their own code.
bq. Apache ships source. Projects should keep that in mind. If the source is
good, then users can build/patch/tweak for their downstream environment, and
packagers/vendors can do this as an intermediary for their own users. If the
project focuses too much on their own "convenience" binaries as the primary
artifact, they might risk making it harder for the source to be reusable
downstream.
We also imply that users of our software will be able to engage in
authenticated IPC calls with a Hadoop cluster built by somebody else. It's
really hard to pull that off once things like protobuf, Avro, Kryo (at the
Hive/Spark layer) and Jackson all become part of the story. Locking things down
helps us make that interop guarantee, but currently does it by imposing
inflexibility downstream. A shading option will give those downstream projects
the option of controlling their own versions of (non-native) things without
waiting for Java 9.
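To make the shading point concrete, here is a rough sketch of what relocation
buys a downstream project. Guava is just an example dependency, and the
relocated package name in the comments is an assumption for illustration, not a
committed layout:
{code:java}
// Illustrative sketch only: the relocated package prefix mentioned below is an
// assumption for this example, not an actual Hadoop artifact layout.
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class DownstreamClient {
  public static void main(String[] args) {
    // The downstream project codes against whatever Guava version it ships itself...
    Cache<String, String> cache = CacheBuilder.newBuilder()
        .maximumSize(1_000)
        .build();
    cache.put("table", "metadata");
    System.out.println(cache.getIfPresent("table"));

    // ...while a shaded Hadoop client jar would keep its own copy under a
    // relocated name such as org.apache.hadoop.shaded.com.google.common.cache.Cache,
    // so the two versions never meet on the same classpath.
  }
}
{code}
The same goes for Jackson, protobuf and the rest of the non-native
dependencies: the wire formats still have to stay compatible for the IPC
guarantee, but the jars on either side no longer have to be the same versions.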
bq. Adoption of semantic versioning only requires a commitment by the
community to communicate, through the versioning semantics, the extent to which
the APIs have changed. For example, if the API results in a "we broke it"
situation compared to 4.1.0, you call it 5.0.0, not 4.2.0. The only real
requirement to use semantic versioning is to define what you consider "public
API" for the purposes of breakage. This can be refined over several releases.
And, following some sort of semantic versioning (not necessarily semver.org's
definition) doesn't have to be perfect. It just has to be a goal to strive for
with each release.
Like I said, we can discuss this at length. Probably over beer.
The concept of _Interface_ was first introduced in _On the Criteria To Be Used
in Decomposing Systems into Modules_, D.L. Parnas, 1972, who defined the
interface to be not just the binary signature between two modules: the API,
calling conventions & scope, but also the behaviour, _the semantics_. All too
often the latter is the harder part to maintain: while IDL & programming
language compilers can keep data formats and API-layer interfaces consistent
and check the static linking, all we have for verifying behavioural compliance
is our test suites, which inevitably operate in a small subset of the
Hilbert/configuration space whose dimensionality is determined by the number of
configuration options in our code and by the deployment environment itself. And
of course there are epiphenomena: the side effects people accidentally code
against, like the time it takes to enumerate all filesystem implementations on
the classpath, the ordering of entries on a classpath, etc. Stuff we don't even
know people are using until we change something and it breaks.
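As an entirely illustrative example of how that kind of accidental coupling
gets written (this is not code from Hadoop itself), consider a caller that
walks the FileSystem service registrations directly:
{code:java}
import java.util.ServiceLoader;

import org.apache.hadoop.fs.FileSystem;

// Illustrative anti-pattern, not Hadoop code: a caller that quietly couples
// itself to whatever happens to be on the classpath.
public class ClasspathCoupling {
  public static void main(String[] args) {
    // Neither the time this enumeration takes nor the order in which
    // implementations come back is part of any documented contract,
    // yet both change whenever the classpath does.
    for (FileSystem fs : ServiceLoader.load(FileSystem.class)) {
      System.out.println(fs.getClass().getName());
    }
  }
}
{code}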
All Semver says is "A new major version is a 100% guarantee we've broken
stuff". I take that as a given anyway. We also pretty much consider that all
minor point changes break things, and once we know how something breaks, it's
often easier to leave it alone for a while.
Take, for example, HADOOP-9623, upgrading jets3t to 0.9.0. It seems to work,
but includes a single-line change: _read to the end of the current GET when
closing a stream_. It achieves the same directly observable outcome, "stream is
closed", and adds a benefit, "reuse of HTTP 1.1 connections". Unfortunately,
that little change has a consequence: it turns out to raise HADOOP-12376, where
performance dies on any seek() of a large file, as irrespective of file length
the entire file is read down. See? No change in any of Parnas's criteria, yet
enough of a regression to make the thing unusable. The worst part: it passed
the tests. Everything worked, even our tests, because yes, the observed state
of the SUT appeared the same. It's only on very large files that the issues
arise. (We closed that as a WONTFIX, BTW, moved to S3A, fixed the same problem
when it recurred, and now have a test which implicitly verifies it never comes
back by seeking around a 20MB file of Amazon's and verifying that the tests
don't time out.)
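For anyone who hasn't hit it, here is a stripped-down sketch of the failure
mode. The class below is hypothetical, not the actual jets3t or S3A code, but
it shows how "drain the GET on close()", combined with a close-and-reopen
seek(), means a seek near the start of a huge object reads the rest of the
object down first:
{code:java}
import java.io.IOException;
import java.io.InputStream;

// Hypothetical object-store input stream, not the actual jets3t or S3A code.
abstract class ObjectStoreInputStream extends InputStream {
  protected InputStream httpStream;   // body of the currently open GET request
  protected long pos;

  /** Issue a new (ranged) GET starting at targetPos; details omitted. */
  protected abstract InputStream reopen(long targetPos) throws IOException;

  /** Naive seek: close the current request, then open a new one. */
  public void seek(long targetPos) throws IOException {
    closeCurrentRequest();
    httpStream = reopen(targetPos);
    pos = targetPos;
  }

  private void closeCurrentRequest() throws IOException {
    // The "single line change": drain to the end of the GET so the HTTP/1.1
    // connection can go back into the pool for reuse. If the object is huge
    // and we are near its start, this loop pulls the rest of the object down
    // before the new request is even issued.
    byte[] scratch = new byte[8192];
    while (httpStream.read(scratch) != -1) {
      // discard: we only want the connection back in a reusable state
    }
    httpStream.close();
  }

  @Override
  public int read() throws IOException {
    int b = httpStream.read();
    if (b >= 0) {
      pos++;
    }
    return b;
  }

  @Override
  public void close() throws IOException {
    closeCurrentRequest();
  }
}
{code}
Whether close() should drain the connection or abort it depends entirely on how
much data is left to read, which is exactly the kind of behavioural trade-off
that no signature or linkage check will ever catch.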
bq. I do hope that some of the Hadoop developer community will consider
helping with downstream community packaging, and feeding back lessons learned
into the upstream source
I've personally reached the view that most updates are dangerous, with [a list
of what I fear the
most|https://steveloughran.blogspot.co.uk/2016/05/fear-of-dependencies.html].
As a developer, I want the new stuff and would love to use the latest releases.
As someone who fields version-related upgrade support calls, sometimes with the
word "Kerberos" in them, I'm happy with the stuff that mostly works, as long as
its failures are things we know of.
> Classpath isolation for downstream clients
> ------------------------------------------
>
> Key: HADOOP-11656
> URL: https://issues.apache.org/jira/browse/HADOOP-11656
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: Sean Busbey
> Assignee: Sean Busbey
> Priority: Blocker
> Labels: classloading, classpath, dependencies, scripts, shell
> Attachments: HADOOP-11656_proposal.md
>
>
> Currently, Hadoop exposes downstream clients to a variety of third party
> libraries. As our code base grows and matures we increase the set of
> libraries we rely on. At the same time, as our user base grows we increase
> the likelihood that some downstream project will run into a conflict while
> attempting to use a different version of some library we depend on. This has
> already happened with, e.g., Guava several times for HBase, Accumulo, and Spark
> (and I'm sure others).
> While YARN-286 and MAPREDUCE-1700 provided an initial effort, they default to
> off and they don't do anything to help dependency conflicts on the driver
> side or for folks talking to HDFS directly. This should serve as an umbrella
> for changes needed to do things thoroughly on the next major version.
> We should ensure that downstream clients
> 1) can depend on a client artifact for each of HDFS, YARN, and MapReduce that
> doesn't pull in any third party dependencies
> 2) only see our public API classes (or as close to this as feasible) when
> executing user provided code, whether client side in a launcher/driver or on
> the cluster in a container or within MR.
> This provides us with a double benefit: users get less grief when they want
> to run substantially ahead or behind the versions we need and the project is
> freer to change our own dependency versions because they'll no longer be in
> our compatibility promises.
> Project specific task jiras to follow after I get some justifying use cases
> written in the comments.