[
https://issues.apache.org/jira/browse/MAPREDUCE-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173416#comment-13173416
]
Scott Carey commented on MAPREDUCE-2600:
----------------------------------------
I am a little late to the party here but:
{quote}
It is a big issue for downstream users. Projects that use Hadoop already pick
up a lot of jars and increasing the set when all of the versions are the same
is a problem. We'll also have users using different versions of the jars, which
won't be useful.
Having a source structure that requires an IDE to use isn't making the code
easy for people to browse, use, and modify. It will also become a maintenance
problem as the dependency graph between the components changes.
Yes, you can munge the results together into a single jar as part of the build,
but I don't see how it makes development easier or faster to have lots of
little directories.
{quote}
I disagree. It is a huge issue as a downstream user when the jar granularity
is not fine enough. You don't have to manually pick each jar, so the total
number is not the issue. If set up correctly, a user picks only the _one_
or maybe _two_ jars needed for their use case, and maven/ivy/etc. pulls in the
transitive dependencies with the correct versions. It is a MUCH bigger
risk if, as a user, I don't have the ability to build the package I want that
_excludes_ the stuff I don't need without a lot of trouble. It is not the
_number_ of jars that is the problem; it is the total _size_ of all of them and
the likelihood of version mismatches with transitive dependencies. The current
issue is not that projects that use Hadoop 'pick up a lot of jars'; it is that
they 'pick up a lot of jars that are not needed at all'.
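As a rough sketch of what this looks like from the downstream side (the
artifact coordinates here are illustrative, not the actual hadoop ones), the
user's pom declares a single client artifact and the tooling resolves the rest:
{code:xml}
<!-- Hypothetical downstream pom.xml fragment: declare only the one client
     artifact for your use case; maven resolves the shared 'common' jars and
     other transitive dependencies at the correct, matching versions. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client</artifactId>
  <version>0.23.0</version>
</dependency>
{code}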
A few 'top level' jars that are useful for various use cases as single points
of inclusion would be perfect. This does not imply few jars total; it implies
a few that you choose to declare for your use cases -- they can pull in any
number of other shared hadoop jars that are required for those use cases. It
doesn't matter whether they are 'the same version'; the user does not need to
know, since maven handles that, and maven best practices make many jars with
the 'same version' a non-issue.
A user pulls in a mapreduce client jar, and that might also pull in a couple
of 'common' jars from the same project. That is the intended best practice of
maven. If the mapreduce client jar were to bundle common stuff in it, and that
same common stuff were bundled in, say, an hdfs-client jar, then you risk all
sorts of trouble as a downstream user: multiple colliding classes on your
classpath, the inability of the tooling (maven) to detect and deal with
conflicts appropriately, etc. If it were to bundle stuff that is not useful to
a client, that would bloat client application jars and potentially pull in
useless transitive dependencies.
If the jars are reduced to only a few big blobs, it will end up more like the
absolutely atrocious maven dependency management in 0.20.205 and 0.22.x, where
a user who just wants to build a mapreduce program pulls in 20MB of jars that
are not needed, unless they manually exclude them.
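For contrast, the workaround that coarse-grained jars force on users looks
roughly like this (coordinates abridged and partly illustrative), repeated in
every consuming pom:
{code:xml}
<!-- Sketch of the manual-exclusion workaround: each unwanted transitive
     dependency of the monolithic jar (the jetty exclusion is just an
     example) must be listed by hand in every downstream pom. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.205.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>jetty</artifactId>
    </exclusion>
    <!-- ...and so on for every server-side jar a client does not need -->
  </exclusions>
</dependency>
{code}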
Having more source trees is a slight development burden, but it enforces the
right encapsulation and organization of dependencies. One of the benefits of
organizing modules in maven is that the end result almost always leads to
clearer code boundaries and better architectural separation of concerns. It
also helps define API boundaries and prevents creating leaky
abstractions/APIs by accident.
{quote}
Thinking more on it, I am inclined to keep the modules separate as they are
currently, instead of combining the source tree.
I count the number of modules to be 10-12, so the number of source trees
should not be 59, or am I missing something?
The separate modules do help identify the boundaries more clearly and help in
enforcing them. Separation based only on java packages is loose. I know this
from the unnecessary pain I went through when I was working on the project
split 2 years ago. In future, refactoring code or doing things like rewriting
the NM in C++ will be least intrusive with the current module structure.
If the number of jars is the problem, can we just merge the jars at build time
the way we want, using the maven shade plugin or some such?
{quote}
I agree. You can use the shade plugin to make a few 'fat' jars for some use
cases that live _alongside_ the normal artifacts that do not embed any
dependencies.
Please, please don't put any jars in a maven repo that bundle dependencies
unless they are attached artifacts and not the primary artifact.
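A minimal sketch of such a shade setup (configuration abridged):
{{shadedArtifactAttached}} publishes the fat jar as an attached artifact under
a classifier, so the primary artifact in the repo stays a normal thin jar:
{code:xml}
<!-- maven-shade-plugin configured so the shaded 'fat' jar is attached
     under a classifier, alongside the unmodified primary artifact. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>bundle</shadedClassifierName>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}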
Please, please declare the dependencies properly, using 'optional' or
'provided' scope as appropriate to prevent downstream users from pulling in
artifacts transitively that a client user does not need.
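Concretely, that means declarations along these lines in the module poms (the
second dependency is a made-up placeholder):
{code:xml}
<!-- 'provided' keeps container-supplied jars off downstream classpaths;
     'optional' stops a rarely-used feature from propagating transitively. -->
<dependency>
  <groupId>javax.servlet</groupId>
  <artifactId>servlet-api</artifactId>
  <version>2.5</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <!-- hypothetical optional feature dependency -->
  <groupId>org.example</groupId>
  <artifactId>optional-codec</artifactId>
  <version>1.0</version>
  <optional>true</optional>
</dependency>
{code}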
I believe that too few jars is worse than too many, when the two items above
are done correctly (i.e. maven best practices are followed). Then, as a
downstream user, I can easily select the features I want and trust that the
dependencies pulled into my project transitively as a consequence of, say,
pulling in a mapreduce client jar, are only the jars needed as a mapreduce
client and not the entire freaking hadoop framework or any other extra
unnecessary baggage.
> MR-279: simplify the jars
> --------------------------
>
> Key: MAPREDUCE-2600
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2600
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mrv2
> Affects Versions: 0.23.0
> Reporter: Owen O'Malley
> Assignee: Luke Lu
>
> Currently the MR-279 mapreduce project generates 59 jars from 59 source
> roots, which can be dramatically simplified.