We would like to cross-build Spark for Scala 2.11 and 2.10 eventually (they're a lot closer than 2.10 and 2.9). In Maven this might mean creating two POMs or a special variable for the version or something.

Matei
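For the cross-building question below, a minimal sketch of what this looks like on the sbt side (version numbers are illustrative, not Spark's actual settings); Maven has no built-in equivalent, which is why the "special variable" / per-version POM approach comes up:

    // build.sbt fragment: `sbt +package` runs the task once per listed Scala version.
    scalaVersion := "2.10.3"
    crossScalaVersions := Seq("2.10.3", "2.11.0")
    // Artifacts pick up the binary-version suffix automatically (e.g. spark-core_2.10,
    // spark-core_2.11); a Maven build has to reproduce this by hand, typically with a
    // scala.binary.version property spliced into each artifactId.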
On Mar 1, 2014, at 12:15 PM, Koert Kuipers <ko...@tresata.com> wrote:

> does maven support cross building for different scala versions?

> we do this inhouse all the time with sbt. i know spark does not cross build at this point, but is it guaranteed to stay that way?

> On Sat, Mar 1, 2014 at 12:02 PM, Koert Kuipers <ko...@tresata.com> wrote:

>> i am still unsure what is wrong with sbt assembly. i would like a real-world example of where it does not work, that i can run.

>> this is what i know:

>> 1) sbt assembly works fine for version conflicts for an artifact. no exclusion rules are needed.

>> 2) if artifacts have the same classes inside yet are not recognized as different versions of the same artifact (due to renaming of artifacts typically, or due to the inclusion of classes from another jar) then a manual exclusion rule will be needed, or else sbt will apply a simple but programmable rule to pick one class and drop the rest. i do not see how maven could do this better or without manual exclusion rules.

>> On Sat, Mar 1, 2014 at 1:00 AM, Mridul Muralidharan <mri...@gmail.com> wrote:

>>> On Sat, Mar 1, 2014 at 2:05 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>>>> Hey,

>>>> Thanks everyone for chiming in on this. I wanted to summarize these issues a bit, particularly wrt the constituents involved - does this seem accurate?

>>>> = Spark Users =
>>>> In general those linking against Spark should be totally unaffected by the build choice. Spark will continue to publish well-formed poms and jars to maven central. This is a no-op wrt this decision.

>>>> = Spark Developers =
>>>> There are two concerns: (a) general day-to-day development and packaging, and (b) Spark binaries and packages for distribution.

>>>> For (a) - sbt seems better because it's just nicer for doing scala development (incremental compilation is simple, we have some home-baked tools for compiling Spark vs. the spark deps, etc). The argument that maven has more "general know-how" hasn't, at least so far, affected us in the ~2 years we've maintained both builds - adding stuff to the Maven build is typically just as annoying/difficult as it is with sbt.

>>>> For (b) - Some non-specific concerns were raised about bugs with the sbt assembly package - we should look into this and see what is going on. Maven has better out-of-the-box support for publishing to maven central; we'd have to do some manual work on our end to make this work well with sbt.

>>> Not non-specific concerns - assembly via sbt is fragile; the (manual) exclusion rules in the sbt project are testament to this.

>>> In particular, I don't see any quantifiable benefits in using sbt over maven. Incremental compilation, compiling only a subproject, running specific tests, etc. are all available even with maven - so they are not differentiators. On the other hand, sbt does introduce further manual overhead in dependency management for assembled/shaded jar creation.

>>> Regards,
>>> Mridul

>>>> = Downstream Integrators =
>>>> On this one it seems that Maven is the universal favorite, largely because of community awareness of Maven and comfort with Maven builds. Some things, like restructuring the Spark build to inherit config values from a vendor build, will not be possible with sbt (though fairly straightforward to work around).
>>>> Other cases where vendors have directly modified or inherited the Spark build won't work anymore if we standardize on SBT. These have no obvious workaround at this point, as far as I can see.

>>>> - Patrick

>>>> On Wed, Feb 26, 2014 at 7:09 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
>>>>> On Feb 26, 2014 11:12 PM, "Patrick Wendell" <pwend...@gmail.com> wrote:

>>>>>> @mridul - As far as I know both Maven and sbt use fairly similar processes for building the assembly/uber jar. We actually used to package spark with sbt and there were no specific issues we encountered, and AFAIK sbt respects versioning of transitive dependencies correctly. Do you have a specific bug listing for sbt that indicates something is broken?

>>>>> Slightly longish ...

>>>>> The assembled jar generated via sbt broke all over the place while I was adding yarn support in 0.6 - and I had to fix the sbt project a fair bit to get it to work: we need the assembled jar to submit a yarn job.

>>>>> When I finally submitted those changes to 0.7, it broke even more - since dependencies changed: someone else had thankfully already added maven support by then - which worked remarkably well out of the box (with some minor tweaks)!

>>>>> In theory, they might be expected to work the same, but practically they did not: as I mentioned, it must just have been luck that maven worked that well; but given multiple past nasty experiences with sbt, and the fact that it does not bring anything compelling or new in contrast, I am fairly against the idea of using only sbt - in spite of maven being unintuitive at times.

>>>>> Regards,
>>>>> Mridul

>>>>>> @sandy - It sounds like you are saying that the CDH build would be easier with Maven because you can inherit the POM. However, is this just a matter of convenience for packagers, or would standardizing on sbt limit capabilities in some way? I assume it would just mean a bit more manual work for packagers having to figure out how to set the hadoop version in SBT and exclude certain dependencies. For instance, what does CDH do about other components like Impala that are not based on Maven at all?

>>>>>> On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan <e...@ooyala.com> wrote:
>>>>>>> I'd like to propose the following way to move forward, based on the comments I've seen:

>>>>>>> 1. Aggressively clean up the giant dependency graph. One ticket I might work on if I have time is SPARK-681, which might remove the giant fastutil dependency (~15MB by itself).

>>>>>>> 2. Take an intermediate step by having only ONE source of truth w.r.t. dependencies and versions. This means either:
>>>>>>>    a) Using a maven POM as the spec for dependencies, Hadoop version, etc. Then, use sbt-pom-reader to import it.
>>>>>>>    b) Using the build.scala as the spec, and "sbt make-pom" to generate the pom.xml for the dependencies.

>>>>>>> The idea is to remove the pain and errors associated with manual translation of dependency specs from one system to another, while still maintaining the things which are hard to translate (plugins).
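A rough sketch of the two options above (`make-pom` is a built-in sbt task; the sbt-pom-reader coordinates and versions shown are illustrative, not a tested setup):

    // Option (b): the sbt build definition is the source of truth; generate the POM from it.
    //   $ sbt make-pom     # writes target/scala-2.10/<artifact>-<version>.pom
    // The generated XML can be post-processed via the pomPostProcess setting:
    pomPostProcess := { (node: scala.xml.Node) => node }   // identity here; hook for rewrites

    // Option (a): pom.xml is the source of truth; load it into sbt with sbt-pom-reader
    // (project/plugins.sbt -- coordinates/version illustrative):
    // addSbtPlugin("com.typesafe.sbt" % "sbt-pom-reader" % "1.0.0")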
>>>>>>> On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>> We maintain an in-house spark build using sbt. We have no problem using sbt assembly. We did add a few exclude statements for transitive dependencies.

>>>>>>>> The main enemy of assemblies are jars that include stuff they shouldn't (kryo comes to mind, I think they include logback?), new versions of jars that change the provider/artifact without changing the package (asm), and incompatible new releases (protobuf). These break the transitive resolution process. I imagine that's true for any build tool.

>>>>>>>> Besides shading, I don't see anything maven can do that sbt cannot, and if I understand it correctly shading is not currently done using the build tool.

>>>>>>>> Since spark is primarily scala/akka based, the main developer base will be familiar with sbt (I think?). Switching build tools is always painful. I personally think it is smarter to put this burden on a limited number of upstream integrators than on the community. However, that said, I don't think it's a problem for us to maintain an sbt build in-house if spark switched to maven.

>>>>>>>> The problem is, the complete spark dependency graph is fairly large, and there are a lot of conflicting versions in there - in particular when we bump versions of dependencies - making managing this messy at best.

>>>>>>>> Now, I have not looked in detail at how maven manages this - it might just be accidental that we get a decent out-of-the-box assembled shaded jar (since we don't do anything great to configure it). With the current state of sbt in spark, it definitely is not a good solution: if we can enhance it (or it already is?), while keeping the management of the version/dependency graph manageable, I don't have any objections to using sbt or maven! Too many excludes, pinned versions, etc. would just make things unmanageable in future.

>>>>>>>> Regards,
>>>>>>>> Mridul

>>>>>>>> On Wed, Feb 26, 2014 at 8:56 AM, Evan Chan <e...@ooyala.com> wrote:
>>>>>>>>> Actually you can control exactly how sbt assembly merges or resolves conflicts. I believe the default settings, however, lead to an ordering which cannot be controlled.

>>>>>>>>> I do wish for a smarter fat jar plugin.

>>>>>>>>> -Evan
>>>>>>>>> To be free is not merely to cast off one's chains, but to live in a way that respects & enhances the freedom of others. (#NelsonMandela)

>>>>>>>>> On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan <mri...@gmail.com> wrote:

>>>>>>>>>> On Wed, Feb 26, 2014 at 5:31 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>>>>>>>>>>> Evan - this is a good thing to bring up. Wrt the shader plug-in - right now we don't actually use it for bytecode shading - we simply use it for creating the uber jar with excludes (which sbt supports just fine via assembly).
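For concreteness, the per-file conflict handling being discussed looks roughly like this in sbt-assembly (key names differ across plugin versions; this is a generic sketch, not Spark's actual configuration):

    // build.sbt fragment, assuming the sbt-assembly plugin is on the build classpath.
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard   // drop manifests/signature files
      case "reference.conf"              => MergeStrategy.concat    // merge Akka/Typesafe configs
      case _                             => MergeStrategy.first     // otherwise keep the first copy seen
    }

Note that MergeStrategy.first is exactly the classpath-order-dependent behaviour Mridul objects to below: which copy is "first" depends on dependency ordering, not on which version is newer.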
>>>>>>>>>> Not really - as I mentioned initially in this thread, sbt's assembly does not take dependencies into account properly, and can overwrite newer classes with older versions. From an assembly point of view, sbt is not very good: we have yet to try it after the 2.10 shift though (and probably won't, given the mess it created last time).

>>>>>>>>>> Regards,
>>>>>>>>>> Mridul

>>>>>>>>>>> I was wondering actually, do you know if it's possible to add shaded artifacts to the *spark jar* using this plug-in (e.g. not an uber jar)? That's something I could see being really handy in the future.

>>>>>>>>>>> - Patrick

>>>>>>>>>>> On Tue, Feb 25, 2014 at 3:39 PM, Evan Chan <e...@ooyala.com> wrote:
>>>>>>>>>>>> The problem is that plugins are not equivalent. There is AFAIK no equivalent to the maven shader plugin for SBT. There is an SBT plugin which can apparently read POM XML files (sbt-pom-reader). However, it can't possibly handle plugins, which is still problematic.

>>>>>>>>>>>> On Tue, Feb 25, 2014 at 3:31 PM, yao <yaosheng...@gmail.com> wrote:
>>>>>>>>>>>>> I would prefer to keep both of them; it would be better even if that means pom.xml will be generated using sbt. Some companies, like my current one, have their own build infrastructure built on top of maven. It is not easy to support sbt for these potential spark clients. But I do agree to keep only one if there is a promising way to generate a correct configuration from the other.

>>>>>>>>>>>>> -Shengzhe

>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 3:20 PM, Evan Chan <e...@ooyala.com> wrote:
>>>>>>>>>>>>>> The correct way to exclude dependencies in SBT is actually to declare a dependency as "provided". I'm not familiar with Maven or its dependencySet, but provided will mark the entire dependency tree as excluded. It is also possible to exclude jar by jar, but this is pretty error prone and messy.

>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 2:45 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>>>>>>> yes in sbt assembly you can exclude jars (although i never had a need for this) and files in jars.

>>>>>>>>>>>>>>> for example i frequently remove log4j.properties, because for whatever reason hadoop decided to include it, making it very difficult to use our own logging config.
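A sketch of the exclusion mechanisms Evan and Koert describe (coordinates and jar names are illustrative, and exact key names vary across sbt-assembly versions):

    // 1) "provided" keeps a dependency and its transitive tree out of the assembly
    //    while leaving it on the compile classpath:
    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0" % "provided"

    // 2) Jar-by-jar exclusion from the fat jar (error prone, as noted above):
    assemblyExcludedJars in assembly := {
      (fullClasspath in assembly).value.filter(_.data.getName.startsWith("asm-"))
    }

    // 3) Dropping a single bundled file such as Hadoop's log4j.properties is just
    //    another pattern in the merge strategy sketched earlier:
    //      case "log4j.properties" => MergeStrategy.discard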
>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 4:24 PM, Konstantin Boudnik <c...@apache.org> wrote:
>>>>>>>>>>>>>>>> On Fri, Feb 21, 2014 at 11:11 AM, Patrick Wendell wrote:
>>>>>>>>>>>>>>>>> Kos - thanks for chiming in. Could you be more specific about what is available in maven and not in sbt for these issues? I took a look at the bigtop code relating to Spark. As far as I could tell [1] was the main point of integration with the build system (maybe there are other integration points)?

>>>>>>>>>>>>>>>>>> - in order to integrate Spark well into the existing Hadoop stack it was necessary to have a way to avoid transitive dependency duplications and possible conflicts.

>>>>>>>>>>>>>>>>>> E.g. Maven assembly allows us to avoid adding _all_ Hadoop libs and later merely declare the Spark package dependency on standard Bigtop Hadoop packages. And yes - Bigtop packaging means the naming and layout would be standard across all commercial Hadoop distributions that are worth mentioning: ASF Bigtop convenience binary packages, and Cloudera or Hortonworks packages. Hence, the downstream user doesn't need to spend any effort to make sure that Spark "clicks in" properly.

>>>>>>>>>>>>>>>>> The sbt build also allows you to plug in a Hadoop version, similar to the maven build.

>>>>>>>>>>>>>>>> I am actually talking about an ability to exclude a set of dependencies from an assembly, similarly to what's happening in the dependencySet sections of assembly/src/main/assembly/assembly.xml. If there is comparable functionality in sbt, that would help quite a bit, apparently.

>>>>>>>>>>>>>>>> Cos

>>>>>>>>>>>>>>>>>> - Maven provides a relatively easy way to deal with the jar-hell problem, although the original maven build was just Shader'ing everything into a huge lump of class files, oftentimes ending up with classes slamming on top of each other from different transitive dependencies.

>>>>>>>>>>>>>>>>> AFAIK we are only using the shade plug-in to deal with conflict resolution in the assembly jar. These are dealt with in sbt via the sbt assembly plug-in in an identical way. Is there a difference?

>>>>>>>>>>>>>>>> I am bringing up the Shader because it is an awful hack which can't be used in a real controlled deployment.

>>>>>>>>>>>>>>>> Cos
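On the question of comparable functionality in sbt: the closest idiom is the "provided" scoping described above - keep the whole Hadoop subtree out of the assembly and let the Bigtop/vendor package depend on the distribution's Hadoop jars instead. A sketch with illustrative coordinates, not Spark's or Bigtop's actual configuration:

    val hadoopVersion = "2.2.0"   // example only
    libraryDependencies ++= Seq(
      // never enters the fat jar; supplied at runtime by the system Hadoop install
      "org.apache.hadoop" % "hadoop-client" % hadoopVersion % "provided"
    )
    // Roughly the role a <dependencySet> with <excludes> plays in a Maven assembly
    // descriptor, except the classes are kept out at dependency-scope level rather
    // than filtered from the assembly afterwards.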
>>>>>>>>>>>>>>>>> [1] https://git-wip-us.apache.org/repos/asf?p=bigtop.git;a=blob;f=bigtop-packages/src/common/spark/do-component-build;h=428540e0f6aa56cd7e78eb1c831aa7fe9496a08f;hb=master

>>>>>>> --
>>>>>>> Evan Chan
>>>>>>> Staff Engineer
>>>>>>> e...@ooyala.com |