[
https://issues.apache.org/jira/browse/MAPREDUCE-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173416#comment-13173416
]
Scott Carey commented on MAPREDUCE-2600:
----------------------------------------
I am a little late to the party here but:
{quote}
It is a big issue for downstream users. Projects that use Hadoop already pick
up a lot of jars and increasing the set when all of the versions are the same
is a problem. We'll also have users using different versions of the jars, which
won't be useful.
Having a source structure that requires an IDE to use isn't making the code
easy for people to browse, use, and modify. It will also become a maintenance
problem as the dependency graph between the components changes.
Yes, you can munge the results together into a single jar as part of the build,
but I don't see how it makes development easier or faster to have lots of
little directories.
{quote}
I disagree. It is a huge issue as a downstream user when the jar granularity
is not fine enough. You don't have to manually pick each jar, so the total
number is not the issue. If set up correctly, a user picks only the _one_
or maybe _two_ jars needed for their use case, and maven/ivy/etc. pulls in the
transitive dependencies with the correct versions. It is a MUCH bigger
risk if, as a user, I don't have the ability to build the package I want that
_excludes_ the stuff I don't need without a lot of trouble. It is not the
_number_ of jars that is the problem; it is the total _size_ of all of them and
the likelihood of version mismatches with transitive dependencies. The current
issue is not that projects that use Hadoop 'pick up a lot of jars'; it is that
they 'pick up a lot of jars that are not needed at all'.
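As a rough sketch of what this looks like from the downstream side (the
artifact coordinates here are illustrative, not the actual hadoop ones), the
user's pom declares a single client artifact and the tooling resolves the rest:
{code:xml}
<!-- Hypothetical downstream pom.xml fragment: declare only the one client
     artifact for your use case; maven resolves the shared 'common' jars and
     other transitive dependencies at the correct, matching versions. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client</artifactId>
  <version>0.23.0</version>
</dependency>
{code}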
A few 'top level' jars that are useful for various use cases as single points
of inclusion would be perfect. This does not imply few jars total; it implies
a few that you choose to declare for your use cases -- they can pull in any
number of other shared hadoop jars that are required for those use cases. It
doesn't matter whether they are 'the same version'; the user does not need to
know, since maven handles that, and maven best practices make many jars with
the 'same version' a non-issue.
A user pulls in a mapreduce client jar, and that might also pull in a couple
of 'common' jars from the same project. That is the intended best practice of
maven. If the mapreduce client jar were to bundle common stuff in it, and that
same common stuff were bundled in, say, an hdfs-client jar, then you risk all
sorts of trouble as a downstream user: multiple colliding classes on your
classpath, the inability of the tooling (maven) to detect and deal with
conflicts appropriately, etc. If it were to bundle stuff that is not useful to
a client, that would bloat client application jars and potentially pull in
useless transitive dependencies.
If the jars are reduced to only a few big blobs, it will end up more like the
absolutely atrocious maven dependency management in 0.20.205 and 0.22.x, where
a user who just wants to build a mapreduce program pulls in 20MB of jars that
are not needed, unless they manually exclude them.
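For contrast, the workaround that coarse-grained jars force on users looks
roughly like this (coordinates abridged and partly illustrative), repeated in
every consuming pom:
{code:xml}
<!-- Sketch of the manual-exclusion workaround: each unwanted transitive
     dependency of the monolithic jar (the jetty exclusion is just an
     example) must be listed by hand in every downstream pom. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.205.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>jetty</artifactId>
    </exclusion>
    <!-- ...and so on for every server-side jar a client does not need -->
  </exclusions>
</dependency>
{code}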
Having more source trees is a slight development burden, but it enforces the
right encapsulation and organization of dependencies. One of the benefits of
organizing modules in maven is that the end result almost always leads to
clearer code boundaries and better architectural separation of concerns. It
also helps define API boundaries and prevents creating leaky
abstractions/APIs by accident.
{quote}
Thinking more on it, I am inclined to keep the modules separate as they are
currently, instead of combining the source tree.
I count the number of modules to be 10-12, so the number of source trees
should not be 59, or am I missing something?
The separate modules do help identify the boundaries more clearly and help in
enforcing them. Separation based only on java packages is loose. I know this
from the unnecessary pain I went through when I was working on the project
split 2 years ago. In future, refactoring code or doing things like rewriting
the NM in C++ will be least intrusive with the current module structure.
If the number of jars is the problem, can we just merge the jars at build time
the way we want, using the maven shade plugin or some such?
{quote}
I agree. You can use the shade plugin to make a few 'fat' jars for some use
cases that live _alongside_ the normal artifacts that do not embed any
dependencies.
Please, please don't put any jars in a maven repo that bundle dependencies
unless they are attached artifacts and not the primary artifact.
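A minimal sketch of such a shade setup (configuration abridged):
{{shadedArtifactAttached}} publishes the fat jar as an attached artifact under
a classifier, so the primary artifact in the repo stays a normal thin jar:
{code:xml}
<!-- maven-shade-plugin configured so the shaded 'fat' jar is attached
     under a classifier, alongside the unmodified primary artifact. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>bundle</shadedClassifierName>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}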
Please, please declare the dependencies properly, using 'optional' or
'provided' scope as appropriate to prevent downstream users from pulling in
artifacts transitively that a client user does not need.
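Concretely, that means declarations along these lines in the module poms (the
second dependency is a made-up placeholder):
{code:xml}
<!-- 'provided' keeps container-supplied jars off downstream classpaths;
     'optional' stops a rarely-used feature from propagating transitively. -->
<dependency>
  <groupId>javax.servlet</groupId>
  <artifactId>servlet-api</artifactId>
  <version>2.5</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <!-- hypothetical optional feature dependency -->
  <groupId>org.example</groupId>
  <artifactId>optional-codec</artifactId>
  <version>1.0</version>
  <optional>true</optional>
</dependency>
{code}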
I believe that too few jars is worse than too many, when the two items above
are done correctly (i.e. maven best practices are followed). Then, as a
downstream user, I can easily select the features I want and trust that the
dependencies pulled into my project transitively as a consequence of, say,
pulling in a mapreduce client jar, are only the jars needed as a mapreduce
client and not the entire freaking hadoop framework or any other extra
unnecessary baggage.
> MR-279: simplify the jars
> --------------------------
>
> Key: MAPREDUCE-2600
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2600
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mrv2
> Affects Versions: 0.23.0
> Reporter: Owen O'Malley
> Assignee: Luke Lu
>
> Currently the MR-279 mapreduce project generates 59 jars from 59 source
> roots, which can be dramatically simplified.