[
https://issues.apache.org/jira/browse/HADOOP-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975137#comment-15975137
]
Christopher Tubbs commented on HADOOP-11656:
--------------------------------------------
Re: [[email protected]]:
bq. Christopher Tubbs: this is about client side classpath dependencies, not
server. If you want to know why, look at HADOOP-10101 to see coverage of just
one JAR, then consider also Jackson 1.x, Jackson 2.x, Jersey, and other widely
used things. The ones which cause the most problems are those for IPC:
protobuf, avro, where the generated code has to be in perfect sync with
the version of classes generated by the protoc compiler and compiled into the
archives.
It seems like you're trying to achieve dependency convergence for a target
environment, in the upstream project, without knowing yet what the downstream
environment is. I understand dependency convergence is a big problem. But it is
generally understood and expected that downstream packagers and vendors will
have to resolve that for their own projects. This is typically made easy if the
upstream project resolves it for its own code, and if the project uses
reasonably recent versions of its dependencies. Then, downstream users/clients
can much more easily converge by either downgrading to the version Hadoop is
built with natively, or by upgrading the dependencies Hadoop is using (which
might involve a few patches here and there). This is the role of the
packager/vendor. Trying to do it once in the upstream project for all possible
downstream users seems way out of scope for an upstream project to me, and
nearly impossible to achieve without taking on all the serious risks of static
compilation.
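To make the "converge downstream" idea a bit more concrete, here is a tiny
diagnostic sketch of my own (nothing from Hadoop) that a downstream packager
can use to see which jar actually supplied a contested class after pinning or
swapping versions. The Guava class is only an example of a commonly conflicting
dependency, and the snippet assumes some version of Guava is on the classpath:
{code:java}
import java.net.URL;
import java.security.CodeSource;

public class WhichJar {
    /** Print the jar (or directory) a class was actually loaded from. */
    static void report(Class<?> clazz) {
        CodeSource source = clazz.getProtectionDomain().getCodeSource();
        URL location = (source == null) ? null : source.getLocation();
        System.out.println(clazz.getName() + " loaded from " + location);
    }

    public static void main(String[] args) throws Exception {
        // Example: check which copy of a commonly conflicting dependency won.
        report(Class.forName("com.google.common.collect.ImmutableList"));
    }
}
{code}
Running this inside the converged environment quickly tells you whether the
downgraded/upgraded version is the one actually being picked up.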
Apache ships source. Projects should keep that in mind. If the source is good,
then users can build/patch/tweak for their downstream environment, and
packagers/vendors can do this as an intermediary for their own users. If a
project focuses too much on its own "convenience" binaries as the primary
artifact, it risks making it harder for the source to be reusable downstream.
bq. requires fundamental change across the entire java stack
No. Adoption of semantic versioning only requires a commitment by the community
to communicate, through the versioning semantics, the extent to which the APIs
have changed. For example, if the API results in a "we broke it" situation
compared to 4.1.0, you call it 5.0.0, not 4.2.0. The only real requirement to
use semantic versioning is to define what you consider "public API" for the
purposes of breakage. This can be refined over several releases. And, following
some sort of semantic versioning (not necessarily semver.org's definition)
doesn't have to be perfect. It just has to be a goal to strive for with each
release.
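To illustrate the rule I mean, here is a purely hypothetical sketch of the
convention (the class and method names below are mine and have nothing to do
with Hadoop's own APIs):
{code:java}
// Minimal sketch of the "breaking change => major bump" rule described above.
public final class NextVersion {

    /** A release is MAJOR.MINOR.PATCH, e.g. 4.1.0. */
    public static String next(int major, int minor, int patch,
                              boolean breaksPublicApi, boolean addsPublicApi) {
        if (breaksPublicApi) {
            return (major + 1) + ".0.0";              // "we broke it": 4.1.0 -> 5.0.0
        } else if (addsPublicApi) {
            return major + "." + (minor + 1) + ".0";  // compatible additions: 4.1.0 -> 4.2.0
        } else {
            return major + "." + minor + "." + (patch + 1); // fixes only: 4.1.0 -> 4.1.1
        }
    }

    public static void main(String[] args) {
        System.out.println(next(4, 1, 0, true, false));  // 5.0.0
        System.out.println(next(4, 1, 0, false, true));  // 4.2.0
    }
}
{code}
The only project-specific part is deciding what counts as "public API" for the
breaksPublicApi test; everything else is just consistent communication.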
bq. what exactly do you mean here? \[wrt. modularity\]
I just mean that you can reduce transitive dependencies at compile time by
using modules to isolate API from runtime. It doesn't help with runtime
dependency convergence if there's a conflict, but it can help users identify
the points of conflict by making it easier to see which dependencies are
direct, compile-time ones and which are transitive/runtime ones.
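A rough sketch of what "isolate API from runtime" could look like (hypothetical
types, not real Hadoop classes): the interface downstream code compiles against
exposes only JDK types, while the implementation behind it is free to use a
third-party library (Guava here, purely as an example) that never shows up on
downstream's compile-time classpath:
{code:java}
import java.io.IOException;
import java.util.List;

// Conceptually the API module: no third-party types in any signature,
// so downstream code can compile against it without pulling extra jars.
public interface RecordClient {
    List<String> fetch(String path) throws IOException;
}

// Conceptually the implementation module: free to use Guava (or anything else)
// internally; those types never leak into what downstream compiles against.
class GuavaBackedRecordClient implements RecordClient {
    @Override
    public List<String> fetch(String path) throws IOException {
        // Stand-in for real work; the Guava type remains an implementation detail.
        return com.google.common.collect.ImmutableList.of(path);
    }
}
{code}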
bq. see HADOOP-9991
Nice! Kudos. In general, I think sweeping updates to the latest stable of
everything should be done with each major version. This helps keep things
modern, and reduces conflict with newly written software using the latest
version of your software. Previous releases might be left behind a bit... but
can usually be updated to the latest compatible "patch" version (a semver
term), if the dependencies are using some sort of semantic versioning.
bq. ... keep an eye on HADOOP-11123 \[wrt. user-defined classpath\]
Mostly, I think I just meant rely on downstream a bit to do their own
dependency convergence for their environment, rather than try to take on that
huge responsibility for everyone upstream.
bq. we are not fans of shading - we recognise its fundamental wrongness, ....
right now we don't have any way to stop changes in Hadoop's dependencies from
breaking things downstream
This is true of any project (though, understandably, Hadoop is more complicated
than most), but most open source projects still dynamically link to their
dependencies in their builds. Dependency convergence is largely a downstream
vendor/community packager responsibility. When I took over the Hadoop packaging
in the Fedora project, there was discussion of dropping Hadoop from Fedora
entirely, because upstream decisions had made downstream builds from source so
difficult and there wasn't sufficient upstream support for building from source
downstream. I don't know to what extent that is true... I
know only what I read in the discussions... but I hope it's not true. I hope
Hadoop will make additional improvements to the build to help downstream build
from source, and will not treat the "convenience" binaries as the primary
release artifacts over the source.
bq. You can even skip the shading by building with -DskipShade
Awesome. I'll keep that one in mind! Thanks! (Also, thanks, in general, for the
information and discussion.)
Re: [~sjlee0]:
bq. You can do that only so much from your side. Maybe swapping 1.2.3 of
something with 1.2.4 would work, but the Hadoop community cannot guarantee that
things will work if the version jump is sufficiently large.
Yes, there are limitations for simple swapping. But, it is also normal to patch
downstream for dependency convergence. Look at the existing Hadoop packaging in
Fedora, which is heavily patched to converge dependencies:
http://pkgs.fedoraproject.org/cgit/rpms/hadoop.git/tree/hadoop.spec
In any case, I appreciate the dialogue, and the lengths the Hadoop community is
going to in trying to address some of the outstanding problems that have
plagued Hadoop in the past, particularly with regard to its dependencies. I
just hope that some of the decisions made will make things easier on downstream
packagers like myself, and that there will be some acknowledgment that
dependency convergence is not necessarily a problem upstream has to deal with
on its own. Upstream can work with downstream packagers (vendors and community
packagers) to make convergence easier for them, by prioritizing the ability for
packagers to easily modify the build to do their own dependency convergence,
rather than trying to do it all upstream in a way that constrains downstream to
follow upstream decisions for particular versions.
FWIW, most of my concerns are completely alleviated now that I see that Hadoop
has aggressively modernized its dependencies in 3.0.0 and that it has a
{{-DskipShade}} option. I think there's still room for improvement, but nothing
too urgent with these two facts in place. I do hope that some of the Hadoop
developer community will consider helping with downstream community packaging,
and feeding back lessons learned into the upstream source. After all, it's a
great way to make your software available to a larger audience, and a quality
downstream experience is good marketing that draws new, experienced developers
to contribute back upstream.
> Classpath isolation for downstream clients
> ------------------------------------------
>
> Key: HADOOP-11656
> URL: https://issues.apache.org/jira/browse/HADOOP-11656
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: Sean Busbey
> Assignee: Sean Busbey
> Priority: Blocker
> Labels: classloading, classpath, dependencies, scripts, shell
> Attachments: HADOOP-11656_proposal.md
>
>
> Currently, Hadoop exposes downstream clients to a variety of third party
> libraries. As our code base grows and matures we increase the set of
> libraries we rely on. At the same time, as our user base grows we increase
> the likelihood that some downstream project will run into a conflict while
> attempting to use a different version of some library we depend on. This has
> already happened with, e.g., Guava several times for HBase, Accumulo, and Spark
> (and I'm sure others).
> While YARN-286 and MAPREDUCE-1700 provided an initial effort, they default to
> off and they don't do anything to help dependency conflicts on the driver
> side or for folks talking to HDFS directly. This should serve as an umbrella
> for changes needed to do things thoroughly on the next major version.
> We should ensure that downstream clients
> 1) can depend on a client artifact for each of HDFS, YARN, and MapReduce that
> doesn't pull in any third party dependencies
> 2) only see our public API classes (or as close to this as feasible) when
> executing user provided code, whether client side in a launcher/driver or on
> the cluster in a container or within MR.
> This provides us with a double benefit: users get less grief when they want
> to run substantially ahead or behind the versions we need and the project is
> freer to change our own dependency versions because they'll no longer be in
> our compatibility promises.
> Project specific task jiras to follow after I get some justifying use cases
> written in the comments.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)