[ 
https://issues.apache.org/jira/browse/HADOOP-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343926#comment-14343926
 ] 

Colin Patrick McCabe commented on HADOOP-11656:
-----------------------------------------------

bq. I'm not trying to stop this work, I do agree that it needs fixing, just 
wondering how to do this in a way which has (a) tangible immediate benefits in 
2015 (b) keeps Hadoop 3.x a low-cost, low-risk update, not a Perl 6 or python 3.

Sure.

bq. Hadoop works across all shipping guava versions, so update it in 2.8 
(giving a warning in 2.7 that this is the last)

A Guava version bump is very destabilizing... much more so than most version 
bumps.  The Guava developers are very aggressive about removing and changing 
APIs, so many applications will break.  Since the sun is starting to set on 
Hadoop 2.x, it's unclear if the pain is worth the gain at this point.

bq. get the OSGI patches in, so that anyone who wants to use Hadoop 2.x code 
within an OSGi-enabled JVM, can.

Agreed.

bq. split client/server artifacts with a leaner client (which can still use 
guava, protobuf, SLF4J &c), just strip out the pure-server side stuff from 
HDFS, so at least introduce less there.

Honestly, I don't see a lot of value in that work.  It's a lot of refactoring 
and restructuring, and at the end of the day, you end up with (almost) as many 
library dependencies in the client as you had before (if you keep guava, 
protobuf, jackson, etc. etc. in the client).  It's nice to have fewer jars on 
the classpath, but that's a pretty minor benefit compared with the amount of 
work it would take to split this out.  The split would also introduce more 
complexity in the build... you can expect that a lot of "class not found" 
errors would be popping up for a while.  The size of the Hadoop install would 
probably not even shrink, since nearly every Hadoop install co-locates the DNs 
and the clients.

If we really want to shrink the size of the Hadoop install, a more effective 
way would be to de-duplicate some of the dependencies that we're shipping 
currently.  Maven seems to have a nasty habit of shipping the same jars in 
multiple directories, when symlinks would do just as well.

For example:
{code}
cmccabe@keter:/h> ls -lh ./share/hadoop/yarn/lib/jackson-mapper-asl-1.9.13.jar
-rw-r--r-- 1 cmccabe users 763K Feb 10 15:05 ./share/hadoop/yarn/lib/jackson-mapper-asl-1.9.13.jar
cmccabe@keter:/h> ls -lh ./share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar
-rw-r--r-- 1 cmccabe users 763K Feb 10 15:05 ./share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar
{code}
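
To get a rough idea of how much duplication like this exists in an install, 
something along the lines of the following could be used.  This is only a 
sketch and not part of Hadoop: the function names are made up for 
illustration, and the default ./share/hadoop root is an assumption taken from 
the listing above.  It groups jar files by content hash, so only byte-identical 
copies count as duplicates.

```python
import hashlib
import os
import sys
from collections import defaultdict


def find_duplicate_jars(root):
    """Group .jar files under root by SHA-256; keep groups with 2+ paths.

    Hypothetical helper for illustration, not a Hadoop tool.
    """
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".jar"):
                continue
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # a symlink is already a de-duplicated copy
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large jars are not loaded whole.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            by_digest[digest.hexdigest()].append(path)
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}


def wasted_bytes(duplicates):
    """Bytes that would be saved by keeping a single copy per group."""
    return sum(
        os.path.getsize(paths[0]) * (len(paths) - 1)
        for paths in duplicates.values()
    )


if __name__ == "__main__":
    # Assumed default; pass the real install root as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "./share/hadoop"
    dups = find_duplicate_jars(root)
    for paths in dups.values():
        print(len(paths), "copies:", *paths)
    print("reclaimable bytes:", wasted_bytes(dups))
```

Hashing contents rather than comparing file names avoids treating two 
different builds of the same jar name as interchangeable.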

bq. maybe a pure-REST client built on Jersey (and its dependencies), supporting 
SPNEGO authed interaction with WebHDFS, YARN, other apps. This will 
underperform compared to in-cluster HDFS apps, but should be sufficient for 
remote interaction.

Yeah, we could create a Java REST client with stripped-down dependencies.  But 
it serves a narrow and somewhat odd use case.  You have to be willing to 
install Hadoop jars, but only certain Hadoop jars.  And you have to be willing 
to accept reduced performance.  In practice, users usually just roll out all 
the jars together using Puppet or Chef or something.  Rolling out 50 MB of 
Hadoop jars rather than 5 MB is not substantially more work for a sysadmin.  
And if the REST client doesn't encapsulate its dependencies, you have the same 
problem in miniature as with the full client.

Probably the nicest thing you can say about WebHDFS is that it let us transfer 
data between old and new Hadoops, back in the days when native RPC was changing 
in incompatible ways all the time.  Now that we've been on RPCv9 for a while, 
you don't necessarily need to use WebHDFS just to be able to do a rolling 
upgrade of your cluster + external clients.

So while I wouldn't oppose a stripped-down REST client, I don't think it solves 
the same problems as this JIRA.

bq. classpath isolation as proposed here (somehow)

I think this is the way to go.

> Classpath isolation for downstream clients
> ------------------------------------------
>
>                 Key: HADOOP-11656
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11656
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>              Labels: classloading, classpath, dependencies
>
> Currently, Hadoop exposes downstream clients to a variety of third party 
> libraries. As our code base grows and matures we increase the set of 
> libraries we rely on. At the same time, as our user base grows we increase 
> the likelihood that some downstream project will run into a conflict while 
> attempting to use a different version of some library we depend on. This has 
> already happened with e.g. Guava several times for HBase, Accumulo, and Spark 
> (and I'm sure others).
> While YARN-286 and MAPREDUCE-1700 provided an initial effort, they default to 
> off and they don't do anything to help dependency conflicts on the driver 
> side or for folks talking to HDFS directly. This should serve as an umbrella 
> for changes needed to do things thoroughly on the next major version.
> We should ensure that downstream clients
> 1) can depend on a client artifact for each of HDFS, YARN, and MapReduce that 
> doesn't pull in any third party dependencies
> 2) only see our public API classes (or as close to this as feasible) when 
> executing user provided code, whether client side in a launcher/driver or on 
> the cluster in a container or within MR.
> This provides us with a double benefit: users get less grief when they want 
> to run substantially ahead or behind the versions we need and the project is 
> freer to change our own dependency versions because they'll no longer be in 
> our compatibility promises.
> Project specific task jiras to follow after I get some justifying use cases 
> written in the comments.



