[
https://issues.apache.org/jira/browse/HADOOP-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975137#comment-15975137
]
Christopher Tubbs commented on HADOOP-11656:
--------------------------------------------
Re: [[email protected]]:
bq. Christopher Tubbs: this is about client side classpath dependencies, not
server. If you want to know why, look at HADOOP-10101 to see coverage of just
one JAR, then consider also Jackson 1.x, Jackson 2.x, Jersey, and other widely
used things. The ones which cause the most problems are those for IPC:
protobuf, avro, where the generated code has to be in perfect sync with
the version of classes generated by the protoc compiler and compiled into the
archives.
It seems like you're trying to achieve dependency convergence for a target
environment, in the upstream project, without knowing yet what the downstream
environment is. I understand dependency convergence is a big problem. But it is
generally understood and expected that downstream packagers and vendors will
have to resolve that for their own projects. This is typically made easy if the
upstream project resolves it for its own code, and if the project uses
reasonably recent versions of its dependencies. Then, downstream users/clients
can much more easily converge by either downgrading to the version Hadoop is
built with natively, or by upgrading the dependencies Hadoop is using (which
might involve a few patches here and there). This is the role of the
packager/vendor. Trying to do it once in the upstream project for all possible
downstream users seems way out of scope for an upstream project to me, and
nearly impossible to achieve without taking on all the serious risks of static
compilation.
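To make the "converge downstream" idea a bit more concrete, here is a tiny
diagnostic sketch of my own (nothing from Hadoop) that a downstream packager
can use to see which jar actually supplied a contested class after pinning or
swapping versions. The Guava class is only an example of a commonly conflicting
dependency, and the snippet assumes some version of Guava is on the classpath:
{code:java}
import java.net.URL;
import java.security.CodeSource;

public class WhichJar {
    /** Print the jar (or directory) a class was actually loaded from. */
    static void report(Class<?> clazz) {
        CodeSource source = clazz.getProtectionDomain().getCodeSource();
        URL location = (source == null) ? null : source.getLocation();
        System.out.println(clazz.getName() + " loaded from " + location);
    }

    public static void main(String[] args) throws Exception {
        // Example: check which copy of a commonly conflicting dependency won.
        report(Class.forName("com.google.common.collect.ImmutableList"));
    }
}
{code}
Running this inside the converged environment quickly tells you whether the
downgraded/upgraded version is the one actually being picked up.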
Apache ships source. Projects should keep that in mind. If the source is good,
then users can build/patch/tweak for their downstream environment, and
packagers/vendors can do this as an intermediary for their own users. If a
project focuses too much on its own "convenience" binaries as the primary
artifact, it risks making it harder for the source to be reusable downstream.
bq. requires fundamental change across the entire java stack
No. Adoption of semantic versioning only requires a commitment by the community
to communicate, through the versioning semantics, the extent to which the APIs
have changed. For example, if the API results in a "we broke it" situation
compared to 4.1.0, you call it 5.0.0, not 4.2.0. The only real requirement to
use semantic versioning is to define what you consider "public API" for the
purposes of breakage. This can be refined over several releases. And, following
some sort of semantic versioning (not necessarily semver.org's definition)
doesn't have to be perfect. It just has to be a goal to strive for with each
release.
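To illustrate the rule I mean, here is a purely hypothetical sketch of the
convention (the class and method names below are mine and have nothing to do
with Hadoop's own APIs):
{code:java}
// Minimal sketch of the "breaking change => major bump" rule described above.
public final class NextVersion {

    /** A release is MAJOR.MINOR.PATCH, e.g. 4.1.0. */
    public static String next(int major, int minor, int patch,
                              boolean breaksPublicApi, boolean addsPublicApi) {
        if (breaksPublicApi) {
            return (major + 1) + ".0.0";              // "we broke it": 4.1.0 -> 5.0.0
        } else if (addsPublicApi) {
            return major + "." + (minor + 1) + ".0";  // compatible additions: 4.1.0 -> 4.2.0
        } else {
            return major + "." + minor + "." + (patch + 1); // fixes only: 4.1.0 -> 4.1.1
        }
    }

    public static void main(String[] args) {
        System.out.println(next(4, 1, 0, true, false));  // 5.0.0
        System.out.println(next(4, 1, 0, false, true));  // 4.2.0
    }
}
{code}
The only project-specific part is deciding what counts as "public API" for the
breaksPublicApi test; everything else is just consistent communication.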
bq. what exactly do you mean here? \[wrt. modularity\]
I just mean that you can reduce transitive dependencies at compile time by
using modules to isolate API from runtime. It doesn't help with runtime
dependency convergence if there's a conflict, but it can help users identify
the points of conflict by making it easier to see which dependencies are
direct, compile-time ones and which are transitive/runtime ones.
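A rough sketch of what "isolate API from runtime" could look like (hypothetical
types, not real Hadoop classes): the interface downstream code compiles against
exposes only JDK types, while the implementation behind it is free to use a
third-party library (Guava here, purely as an example) that never shows up on
downstream's compile-time classpath:
{code:java}
import java.io.IOException;
import java.util.List;

// Conceptually the API module: no third-party types in any signature,
// so downstream code can compile against it without pulling extra jars.
public interface RecordClient {
    List<String> fetch(String path) throws IOException;
}

// Conceptually the implementation module: free to use Guava (or anything else)
// internally; those types never leak into what downstream compiles against.
class GuavaBackedRecordClient implements RecordClient {
    @Override
    public List<String> fetch(String path) throws IOException {
        // Stand-in for real work; the Guava type remains an implementation detail.
        return com.google.common.collect.ImmutableList.of(path);
    }
}
{code}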
bq. see HADOOP-9991
Nice! Kudos. In general, I think sweeping updates to the latest stable of
everything should be done with each major version. This helps keep things
modern, and reduces conflict with newly written software using the latest
version of your software. Previous releases might be left behind a bit... but
can usually be updated to the latest compatible "patch" version (a semver
term), if the dependencies are using some sort of semantic versioning.
bq. ... keep an eye on HADOOP-11123 \[wrt. user-defined classpath\]
Mostly, I think I just meant rely on downstream a bit to do their own
dependency convergence for their environment, rather than try to take on that
huge responsibility for everyone upstream.
bq. we are not fans of shading - we recognise its fundamental wrongness, ....
right now we don't have any way to stop changes in Hadoop's dependencies from
breaking things downstream
This is true of any project (though, understandably, Hadoop is more complicated
than most), but most open source projects still dynamically link to their
dependencies in their builds. Dependency convergence is largely a downstream
vendor/community packager responsibility. When I took over the Hadoop packaging
in the Fedora project, there was discussion of dropping Hadoop from Fedora
entirely, because upstream decisions had made downstream builds from source so
difficult and there wasn't sufficient upstream support for building from source
downstream. I don't know to what extent that is true... I
know only what I read in the discussions... but I hope it's not true. I hope
Hadoop will make additional improvements to the build to help downstream build
from source, and will not treat the "convenience" binaries as the primary
release artifacts over the source.
bq. You can even skip the shading by building with -DskipShade
Awesome. I'll keep that one in mind! Thanks! (Also, thanks, in general, for the
information and discussion.)
Re: [~sjlee0]:
bq. You can do that only so much from your side. Maybe swapping 1.2.3 of
something with 1.2.4 would work, but the Hadoop community cannot guarantee that
things will work if the version jump is sufficiently large.
Yes, there are limitations for simple swapping. But, it is also normal to patch
downstream for dependency convergence. Look at the existing Hadoop packaging in
Fedora, which is heavily patched to converge dependencies:
http://pkgs.fedoraproject.org/cgit/rpms/hadoop.git/tree/hadoop.spec
In any case, I appreciate the dialogue, and the lengths the Hadoop community is
going to in trying to address some of the outstanding problems that have
plagued Hadoop in the past, particularly with regard to its dependencies. I
just hope that some of the decisions made will make things easier on downstream
packagers like myself, and that there will be some acknowledgment that
dependency convergence is not necessarily a problem upstream has to deal with
on its own. Upstream can work with downstream packagers (vendors and community
packagers) to make convergence easier for them, by prioritizing the ability for
packagers to easily modify the build to do their own dependency convergence,
rather than trying to do it all upstream in a way that constrains downstream to
follow upstream decisions for particular versions.
FWIW, most of my concerns are completely alleviated now that I see that Hadoop
has aggressively modernized its dependencies in 3.0.0 and that it has a
{{-DskipShade}} option. I think there's still room for improvement, but nothing
too urgent with these two facts in place. I do hope that some of the Hadoop
developer community will consider helping with downstream community packaging,
and feeding back lessons learned into the upstream source. After all, it's a
great way to make your software available to a larger audience, and a quality
downstream experience is good marketing that draws new, experienced developers
to contribute back upstream.
> Classpath isolation for downstream clients
> ------------------------------------------
>
> Key: HADOOP-11656
> URL: https://issues.apache.org/jira/browse/HADOOP-11656
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: Sean Busbey
> Assignee: Sean Busbey
> Priority: Blocker
> Labels: classloading, classpath, dependencies, scripts, shell
> Attachments: HADOOP-11656_proposal.md
>
>
> Currently, Hadoop exposes downstream clients to a variety of third party
> libraries. As our code base grows and matures we increase the set of
> libraries we rely on. At the same time, as our user base grows we increase
> the likelihood that some downstream project will run into a conflict while
> attempting to use a different version of some library we depend on. This has
> already happened with, e.g., Guava several times for HBase, Accumulo, and Spark
> (and I'm sure others).
> While YARN-286 and MAPREDUCE-1700 provided an initial effort, they default to
> off and they don't do anything to help dependency conflicts on the driver
> side or for folks talking to HDFS directly. This should serve as an umbrella
> for changes needed to do things thoroughly on the next major version.
> We should ensure that downstream clients
> 1) can depend on a client artifact for each of HDFS, YARN, and MapReduce that
> doesn't pull in any third party dependencies
> 2) only see our public API classes (or as close to this as feasible) when
> executing user provided code, whether client side in a launcher/driver or on
> the cluster in a container or within MR.
> This provides us with a double benefit: users get less grief when they want
> to run substantially ahead or behind the versions we need and the project is
> freer to change our own dependency versions because they'll no longer be in
> our compatibility promises.
> Project specific task jiras to follow after I get some justifying use cases
> written in the comments.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)