[ https://issues.apache.org/jira/browse/MAPREDUCE-3378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173437#comment-13173437 ]

Scott Carey commented on MAPREDUCE-3378:
----------------------------------------

{quote}IMO the root issue is that we are not using dependencies correctly. 
{quote}

Absolutely.  Hadoop's dependency setup is atrocious in 0.20.205 and 0.22.  I 
haven't looked at 0.23 in enough detail yet, but I would love to see the 
situation fixed.

I have a project that needs to read from and write to HDFS.  Declaring a 
dependency on hadoop pulls in all of Jetty, the Tomcat compiler, and a dozen 
other jars that I have to exclude manually.

The above needs to be avoided for mapreduce.
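For illustration, this is roughly the exclusion boilerplate a client POM needs 
today just to get a clean HDFS client classpath (the excluded group/artifact 
IDs below are typical examples, not an exhaustive or exact list):

{code:xml}
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.205.0</version>
  <exclusions>
    <!-- the servlet container is only needed to run the daemons, not by clients -->
    <exclusion>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>jetty</artifactId>
    </exclusion>
    <!-- likewise the tomcat JSP compiler -->
    <exclusion>
      <groupId>tomcat</groupId>
      <artifactId>jasper-compiler</artifactId>
    </exclusion>
    <!-- ...and a dozen more exclusions in the same vein -->
  </exclusions>
</dependency>
{code}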

Building larger jars that package dependencies inside them is OK for some use 
cases, but absolutely worthless for any real application that has any chance 
of a dependency conflict.  Things like Jetty should be marked as provided 
scope, not compile scope (or perhaps made optional).
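Concretely, in the hadoop POMs that would look something like the following 
(the Jetty version here is only illustrative):

{code:xml}
<dependency>
  <groupId>org.mortbay.jetty</groupId>
  <artifactId>jetty</artifactId>
  <version>6.1.26</version>
  <!-- provided: needed to run the daemons, but not forced onto clients -->
  <scope>provided</scope>
</dependency>
{code}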

{quote}There should be a hadoop-client that allows me to code and run HDFS/MR 
client apps (with the exact set of transitive dependencies, ie you don't need 
jetty stuff there).{quote}

:-D  YES!
IMO, we need an hdfs-api.jar and a mapreduce-api.jar that pull in only what is 
needed to build an application that uses those APIs as a client.  A user should 
be able to declare those in their project and have only the transitive 
dependencies needed for those use cases pulled in, and nothing extra.  One 
could even go to the extreme of having a mapred-api.jar and a mapreduce-api.jar 
with the old and new APIs separated (and a mapreduce-common-api.jar they both 
depend on) if that were a bigger use case.  More modularization will be a great 
benefit to users when combined with using dependencies properly in hadoop 
itself. 
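Under a scheme like that, a client project's POM could be as small as this 
(the artifact IDs are the hypothetical ones proposed above, not artifacts that 
exist today):

{code:xml}
<dependencies>
  <!-- hypothetical client-facing API artifacts; only client-side transitive deps come along -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hdfs-api</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>mapreduce-api</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
</dependencies>
{code}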

{quote}
The fact that under the hood these 'hadoop-client' & 'hadoop-test' component 
pull 1 or 100 hadoop JARs is irrelevant (although IMO I think we have too many 
JARs).
{quote}

Yes, if the artifacts are configured properly with the right dependencies in 
the correct scope (e.g. jetty in provided scope, since only someone running the 
framework needs it, not clients), then there is only one artifact to declare 
for each use.  It is not the total number of jars, it is the total _size_ of 
jars that matters.  Finer grained control of dependencies by users is a good 
thing.  As a user I want to declare what I need as simply as possible ("I need 
to launch a mini-mr during tests, so I need hadoop-mr-test.jar"; "I need to 
submit a job to a cluster, so I need mr-client.jar"); what that means behind 
the scenes in the total jar count of transitive dependencies is a different 
issue entirely, as long as it pulls in only what is needed and no useless 
baggage (jetty, tomcat's compiler, etc.).
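For the testing case, for example, that would just be a test-scoped dependency 
(again, hadoop-mr-test is the hypothetical artifact name from above):

{code:xml}
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mr-test</artifactId>
  <version>${hadoop.version}</version>
  <!-- only on the test classpath; brings in the mini-mr cluster and its deps -->
  <scope>test</scope>
</dependency>
{code}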

There is no need to package 'fat jars' unless you wish to have a single 
artifact for uses where tooling does not build the classpath for you.

{quote}
Regarding my prev second bullet item, it seems via a classifier this is 
possible ( 
http://maven.apache.org/plugins/maven-shade-plugin/examples/attached-artifact.html
 ), still this is kind of uncommon for commonly used artifacts.
{quote}

I support using an attached artifact with a classifier for any jars containing 
dependencies.  It is an anti-pattern, however, to put a jar with dependencies 
into a Maven repo as the primary artifact (unless you relocate those 
dependencies into a private package namespace to avoid conflicts).
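The attached-artifact setup from the linked shade-plugin example is small; it 
amounts to something along these lines (the classifier name is whatever we 
choose to publish under):

{code:xml}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <!-- keep the plain jar as the main artifact, attach the fat jar under a classifier -->
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>jar-with-dependencies</shadedClassifierName>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}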


                
> Create a single 'hadoop-mapreduce' Maven artifact
> -------------------------------------------------
>
>                 Key: MAPREDUCE-3378
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3378
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: build
>    Affects Versions: 0.23.0
>            Reporter: Tom White
>         Attachments: MAPREDUCE-3378.patch
>
>
> In 0.23.0 there are multiple artifacts (hadoop-mapreduce-client-app, 
> hadoop-mapreduce-client-common, hadoop-mapreduce-client-core, etc). It would 
> be simpler for users to declare a dependency on hadoop-mapreduce (much like 
> there's hadoop-common and hadoop-hdfs). (This would also be a step towards 
> MAPREDUCE-2600.)
