does maven support cross building for different scala versions? we do this in-house all the time with sbt. i know spark does not cross build at this point, but is it guaranteed to stay that way?
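For reference, a minimal build.sbt sketch of the kind of sbt cross building referred to above; the project name and Scala versions are illustrative, not Spark's actual matrix:

    name := "spark-inhouse-build"   // hypothetical project

    scalaVersion := "2.10.3"
    crossScalaVersions := Seq("2.9.3", "2.10.3")

    // prefixing a task with "+" (e.g. sbt +package or sbt +publish) runs it once
    // per listed Scala version and suffixes the artifacts accordingly (_2.9.3, _2.10)

As far as I know, Maven has no direct equivalent of the "+" prefix; cross publishing there is usually handled with separate profiles or invocations that adjust the artifact suffix per Scala version.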
On Sat, Mar 1, 2014 at 12:02 PM, Koert Kuipers <ko...@tresata.com> wrote:
> i am still unsure what is wrong with sbt assembly. i would like a real-world example of where it does not work, that i can run.
>
> this is what i know:
>
> 1) sbt assembly works fine for version conflicts for an artifact. no exclusion rules are needed.
>
> 2) if artifacts have the same classes inside yet are not recognized as different versions of the same artifact (due to renaming of artifacts, typically, or due to the inclusion of classes from another jar), then a manual exclusion rule will be needed, or else sbt will apply a simple but programmable rule to pick one class and drop the rest. i do not see how maven could do this better or without manual exclusion rules.
>
> On Sat, Mar 1, 2014 at 1:00 AM, Mridul Muralidharan <mri...@gmail.com> wrote:
>> On Sat, Mar 1, 2014 at 2:05 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>> > Hey,
>> >
>> > Thanks everyone for chiming in on this. I wanted to summarize these issues a bit, particularly wrt the constituents involved - does this seem accurate?
>> >
>> > = Spark Users =
>> > In general, those linking against Spark should be totally unaffected by the build choice. Spark will continue to publish well-formed poms and jars to maven central. This is a no-op wrt this decision.
>> >
>> > = Spark Developers =
>> > There are two concerns: (a) general day-to-day development and packaging, and (b) Spark binaries and packages for distribution.
>> >
>> > For (a) - sbt seems better because it's just nicer for doing scala development (incremental compilation is simple, we have some home-baked tools for compiling Spark vs. the spark deps, etc). The arguments that maven has more "general know-how", at least so far, haven't affected us in the ~2 years we've maintained both builds - where adding stuff for Maven is typically just as annoying/difficult as with sbt.
>> >
>> > For (b) - some non-specific concerns were raised about bugs with the sbt assembly package - we should look into this and see what is going on. Maven has better out-of-the-box support for publishing to Maven central; we'd have to do some manual work on our end to make this work well with sbt.
>>
>> Not non-specific concerns - assembly via sbt is fragile; the (manual) exclusion rules in the sbt project are testament to this.
>>
>> In particular, I don't see any quantifiable benefits in using sbt over maven. Incremental compilation, compiling only a subproject, running specific tests, etc. are all available even with maven - so they are not differentiators. On the other hand, sbt does introduce further manual overhead in dependency management for assembled/shaded jar creation.
>>
>> Regards,
>> Mridul
>>
>> > = Downstream Integrators =
>> > On this one it seems that Maven is the universal favorite, largely because of community awareness of Maven and comfort with Maven builds. Some things, like restructuring the Spark build to inherit config values from a vendor build, will not be possible with sbt (though fairly straightforward to work around). Other cases where vendors have directly modified or inherited the Spark build won't work anymore if we standardize on sbt. These have no obvious workaround at this point, as far as I see.
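Koert's point 2) above - sbt applying "a simple but programmable rule to pick one class and drop the rest" - corresponds to the merge strategy exposed by the sbt-assembly plugin. A minimal, hedged build.sbt sketch; setting names differ slightly between plugin releases, and a real build would normally fall back to the plugin's default strategy instead of MergeStrategy.first for everything:

    // assumes the sbt-assembly plugin is already on the build classpath
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", _*) => MergeStrategy.discard  // drop manifests and signature files
      case "reference.conf"         => MergeStrategy.concat   // merge Typesafe/Akka config fragments
      case _                        => MergeStrategy.first    // pick one class, drop the rest
    }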
>> > - Patrick
>> >
>> > On Wed, Feb 26, 2014 at 7:09 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
>> >> On Feb 26, 2014 11:12 PM, "Patrick Wendell" <pwend...@gmail.com> wrote:
>> >>> @mridul - As far as I know, both Maven and sbt use fairly similar processes for building the assembly/uber jar. We actually used to package spark with sbt, there were no specific issues we encountered, and AFAIK sbt respects versioning of transitive dependencies correctly. Do you have a specific bug listing for sbt that indicates something is broken?
>> >>
>> >> Slightly longish ...
>> >>
>> >> The assembled jar generated via sbt broke all over the place while I was adding yarn support in 0.6 - and I had to fix the sbt project a fair bit to get it to work: we need the assembled jar to submit a yarn job.
>> >>
>> >> When I finally submitted those changes to 0.7, it broke even more - since dependencies changed: someone else had thankfully already added maven support by then - which worked remarkably well out of the box (with some minor tweaks)!
>> >>
>> >> In theory, they might be expected to work the same, but practically they did not: as I mentioned, it must just have been luck that maven worked that well; but given multiple past nasty experiences with sbt, and the fact that it does not bring anything compelling or new in contrast, I am fairly against the idea of using only sbt - in spite of maven being unintuitive at times.
>> >>
>> >> Regards,
>> >> Mridul
>> >>
>> >>> @sandy - It sounds like you are saying that the CDH build would be easier with Maven because you can inherit the POM. However, is this just a matter of convenience for packagers, or would standardizing on sbt limit capabilities in some way? I assume that it would just mean a bit more manual work for packagers having to figure out how to set the hadoop version in SBT and exclude certain dependencies. For instance, what does CDH do about other components like Impala that are not based on Maven at all?
>> >>>
>> >>> On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan <e...@ooyala.com> wrote:
>> >>> > I'd like to propose the following way to move forward, based on the comments I've seen:
>> >>> >
>> >>> > 1. Aggressively clean up the giant dependency graph. One ticket I might work on if I have time is SPARK-681, which might remove the giant fastutil dependency (~15MB by itself).
>> >>> >
>> >>> > 2. Take an intermediate step by having only ONE source of truth w.r.t. dependencies and versions. This means either:
>> >>> >    a) Using a maven POM as the spec for dependencies, Hadoop version, etc. Then, use sbt-pom-reader to import it.
>> >>> >    b) Using the build.scala as the spec, and "sbt make-pom" to generate the pom.xml for the dependencies.
>> >>> >
>> >>> > The idea is to remove the pain and errors associated with manual translation of dependency specs from one system to another, while still maintaining the things which are hard to translate (plugins).
>> >>> >
>> >>> > On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers <ko...@tresata.com> wrote:
>> >>> >> We maintain an in-house spark build using sbt. We have no problem using sbt assembly. We did add a few exclude statements for transitive dependencies.
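As an illustration of the kind of "exclude statements" mentioned above, a hedged sketch of transitive-dependency excludes in an sbt build - the coordinates and exclusions are examples only, not Spark's (or anyone's) actual rules:

    libraryDependencies ++= Seq(
      // drop individual transitive artifacts that clash with versions managed elsewhere
      ("org.apache.hadoop" % "hadoop-client" % "2.2.0")
        .exclude("org.slf4j", "slf4j-log4j12")
        .exclude("javax.servlet", "servlet-api"),
      // or drop everything pulled in from a particular organization
      ("com.esotericsoftware.kryo" % "kryo" % "2.21")
        .excludeAll(ExclusionRule(organization = "ch.qos.logback"))
    )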
>> >>> >> The main enemies of assemblies are jars that include stuff they shouldn't (kryo comes to mind - I think they include logback?), new versions of jars that change the provider/artifact without changing the package (asm), and incompatible new releases (protobuf). These break the transitive resolution process. I imagine that's true for any build tool.
>> >>> >>
>> >>> >> Besides shading, I don't see anything maven can do that sbt cannot, and if I understand it correctly, shading is not currently done using the build tool.
>> >>> >>
>> >>> >> Since spark is primarily scala/akka based, the main developer base will be familiar with sbt (I think?). Switching build tools is always painful. I personally think it is smarter to put this burden on a limited number of upstream integrators than on the community. That said, I don't think it's a problem for us to maintain an sbt build in-house if spark switched to maven.
>> >>> >>
>> >>> >> The problem is, the complete spark dependency graph is fairly large, and there are a lot of conflicting versions in there - in particular when we bump versions of dependencies - making managing this messy at best.
>> >>> >>
>> >>> >> Now, I have not looked in detail at how maven manages this - it might just be accidental that we get a decent out-of-the-box assembled shaded jar (since we don't do anything great to configure it). With the current state of sbt in spark, it definitely is not a good solution: if we can enhance it (or it already is?), while keeping the management of the version/dependency graph manageable, I don't have any objections to using sbt or maven! Too many exclude versions, pinned versions, etc. would just make things unmanageable in future.
>> >>> >>
>> >>> >> Regards,
>> >>> >> Mridul
>> >>> >>
>> >>> >> On Wed, Feb 26, 2014 at 8:56 AM, Evan Chan <e...@ooyala.com> wrote:
>> >>> >>> Actually you can control exactly how sbt assembly merges or resolves conflicts. I believe the default settings, however, lead to an order which cannot be controlled.
>> >>> >>>
>> >>> >>> I do wish for a smarter fat jar plugin.
>> >>> >>>
>> >>> >>> -Evan
>> >>> >>> To be free is not merely to cast off one's chains, but to live in a way that respects & enhances the freedom of others. (#NelsonMandela)
>> >>> >>>
>> >>> >>> On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
>> >>> >>>> On Wed, Feb 26, 2014 at 5:31 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>> >>> >>>>> Evan - this is a good thing to bring up. Wrt the shader plug-in - right now we don't actually use it for bytecode shading - we simply use it for creating the uber jar with excludes (which sbt supports just fine via assembly).
>> >>> >>>>
>> >>> >>>> Not really - as I mentioned initially in this thread, sbt's assembly does not take dependencies into account properly, and can overwrite newer classes with older versions.
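One hedged way to address the concern just raised - an assembly mixing classes from two releases of the same library - is to pin the versions sbt resolves, so only one copy ever reaches the fat jar. A sketch assuming sbt 0.13.x, where dependencyOverrides takes a Set (sbt 1.x takes a Seq); the coordinates are only examples of the conflicts named in this thread (protobuf, asm):

    // force a single resolved version for libraries known to conflict, so the
    // assembly cannot end up with an older class shadowing a newer one
    dependencyOverrides ++= Set(
      "com.google.protobuf" % "protobuf-java" % "2.5.0",
      "org.ow2.asm"         % "asm"           % "4.0"
    )

Inspecting the resolution report (e.g. sbt "show update") helps spot which versions were evicted in the first place.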
>> >>> >>>> From an assembly point of view, sbt is not very good: we are yet to try it after the 2.10 shift though (and probably won't, given the mess it created last time).
>> >>> >>>>
>> >>> >>>> Regards,
>> >>> >>>> Mridul
>> >>> >>>>
>> >>> >>>>> I was wondering actually, do you know if it's possible to add shaded artifacts to the *spark jar* using this plug-in (e.g. not an uber jar)? That's something I could see being really handy in the future.
>> >>> >>>>>
>> >>> >>>>> - Patrick
>> >>> >>>>>
>> >>> >>>>> On Tue, Feb 25, 2014 at 3:39 PM, Evan Chan <e...@ooyala.com> wrote:
>> >>> >>>>>> The problem is that plugins are not equivalent. There is AFAIK no equivalent to the maven shader plugin for SBT. There is an SBT plugin which can apparently read POM XML files (sbt-pom-reader). However, it can't possibly handle plugins, which is still problematic.
>> >>> >>>>>>
>> >>> >>>>>> On Tue, Feb 25, 2014 at 3:31 PM, yao <yaosheng...@gmail.com> wrote:
>> >>> >>>>>>> I would prefer to keep both of them; it would be better even if that means pom.xml will be generated using sbt. Some companies, like my current one, have their own build infrastructure built on top of maven. It is not easy to support sbt for these potential spark clients. But I do agree to only keep one if there is a promising way to generate a correct configuration from the other.
>> >>> >>>>>>>
>> >>> >>>>>>> -Shengzhe
>> >>> >>>>>>>
>> >>> >>>>>>> On Tue, Feb 25, 2014 at 3:20 PM, Evan Chan <e...@ooyala.com> wrote:
>> >>> >>>>>>>> The correct way to exclude dependencies in SBT is actually to declare a dependency as "provided". I'm not familiar with Maven or its dependencySet, but provided will mark the entire dependency tree as excluded. It is also possible to exclude jar by jar, but this is pretty error-prone and messy.
>> >>> >>>>>>>>
>> >>> >>>>>>>> On Tue, Feb 25, 2014 at 2:45 PM, Koert Kuipers <ko...@tresata.com> wrote:
>> >>> >>>>>>>>> yes, in sbt assembly you can exclude jars (although i never had a need for this) and files in jars.
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> for example i frequently remove log4j.properties, because for whatever reason hadoop decided to include it, making it very difficult to use our own logging config.
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> On Tue, Feb 25, 2014 at 4:24 PM, Konstantin Boudnik <c...@apache.org> wrote:
>> >>> >>>>>>>>>> On Fri, Feb 21, 2014 at 11:11 AM, Patrick Wendell wrote:
>> >>> >>>>>>>>>>> Kos - thanks for chiming in. Could you be more specific about what is available in maven and not in sbt for these issues? I took a look at the bigtop code relating to Spark.
>> >>> >>>>>>>>>>> As far as I could tell, [1] was the main point of integration with the build system (maybe there are other integration points)?
>> >>> >>>>>>>>>>>
>> >>> >>>>>>>>>>>> - in order to integrate Spark well into the existing Hadoop stack it was necessary to have a way to avoid transitive dependency duplications and possible conflicts.
>> >>> >>>>>>>>>>>>
>> >>> >>>>>>>>>>>> E.g. Maven assembly allows us to avoid adding _all_ Hadoop libs and later merely declare the Spark package dependency on standard Bigtop Hadoop packages. And yes - Bigtop packaging means the naming and layout would be standard across all commercial Hadoop distributions that are worth mentioning: ASF Bigtop convenience binary packages, and Cloudera or Hortonworks packages. Hence, the downstream user doesn't need to spend any effort to make sure that Spark "clicks in" properly.
>> >>> >>>>>>>>>>>
>> >>> >>>>>>>>>>> The sbt build also allows you to plug in a Hadoop version, similar to the maven build.
>> >>> >>>>>>>>>>
>> >>> >>>>>>>>>> I am actually talking about the ability to exclude a set of dependencies from an assembly, similarly to what's happening in the dependencySet sections of assembly/src/main/assembly/assembly.xml. If there is comparable functionality in Sbt, that would help quite a bit, apparently.
>> >>> >>>>>>>>>>
>> >>> >>>>>>>>>> Cos
>> >>> >>>>>>>>>>
>> >>> >>>>>>>>>>>> - Maven provides a relatively easy way to deal with the jar-hell problem, although the original maven build was just Shader'ing everything into a huge lump of class files, oftentimes ending up with classes slamming on top of each other from different transitive dependencies.
>> >>> >>>>>>>>>>>
>> >>> >>>>>>>>>>> AFAIK we are only using the shade plug-in to deal with conflict resolution in the assembly jar. These are dealt with in sbt via the sbt assembly plug-in in an identical way. Is there a difference?
>> >>> >>>>>>>>>>
>> >>> >>>>>>>>>> I am bringing up the Shader because it is an awful hack, which can't be used in a real, controlled deployment.
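On Cos's question about comparable functionality in sbt: a hedged sketch of how a set of dependencies is typically kept out of an sbt-assembly jar, combining the "provided" approach Evan describes with the jar-level excludes Koert mentions. Key names vary across sbt-assembly releases, and the coordinates are illustrative only:

    // 1) scope the Hadoop tree as "provided": it stays on the compile classpath,
    //    but sbt-assembly leaves it out of the fat jar, so a Bigtop-style package
    //    can depend on the platform's own Hadoop packages instead
    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0" % "provided"

    // 2) filter out individual jars by name (older plugin versions call this
    //    "excludedJars in assembly")
    assemblyExcludedJars in assembly := {
      val cp = (fullClasspath in assembly).value
      cp.filter(_.data.getName.startsWith("servlet-api"))
    }

Single files that other jars ship, such as the stray log4j.properties Koert mentions, can be dropped with a MergeStrategy.discard case in a merge strategy like the one sketched earlier in the thread.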
>> >>> >>>>>>>>>> Cos
>> >>> >>>>>>>>>>
>> >>> >>>>>>>>>>> [1] https://git-wip-us.apache.org/repos/asf?p=bigtop.git;a=blob;f=bigtop-packages/src/common/spark/do-component-build;h=428540e0f6aa56cd7e78eb1c831aa7fe9496a08f;hb=master
>> >>> >>>>>>>>
>> >>> >>>>>>>> --
>> >>> >>>>>>>> Evan Chan
>> >>> >>>>>>>> Staff Engineer
>> >>> >>>>>>>> e...@ooyala.com |
>> >>> >>>>>>
>> >>> >>>>>> --
>> >>> >>>>>> Evan Chan
>> >>> >>>>>> Staff Engineer
>> >>> >>>>>> e...@ooyala.com |
>> >>> >
>> >>> > --
>> >>> > Evan Chan
>> >>> > Staff Engineer
>> >>> > e...@ooyala.com |