We would like to cross-build Spark for Scala 2.11 and 2.10 eventually (they're a lot closer than 2.10 and 2.9). In Maven this might mean creating two POMs or a special variable for the version or something.

Matei
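For the cross-building question below, a minimal sketch of what this looks like on the sbt side (version numbers are illustrative, not Spark's actual settings); Maven has no built-in equivalent, which is why the "special variable" / per-version POM approach comes up:

    // build.sbt fragment: `sbt +package` runs the task once per listed Scala version.
    scalaVersion := "2.10.3"
    crossScalaVersions := Seq("2.10.3", "2.11.0")
    // Artifacts pick up the binary-version suffix automatically (e.g. spark-core_2.10,
    // spark-core_2.11); a Maven build has to reproduce this by hand, typically with a
    // scala.binary.version property spliced into each artifactId.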
On Mar 1, 2014, at 12:15 PM, Koert Kuipers <ko...@tresata.com> wrote:

> does maven support cross building for different scala versions?

> we do this inhouse all the time with sbt. i know spark does not cross build at this point, but is it guaranteed to stay that way?

> On Sat, Mar 1, 2014 at 12:02 PM, Koert Kuipers <ko...@tresata.com> wrote:

>> i am still unsure what is wrong with sbt assembly. i would like a real-world example of where it does not work, that i can run.

>> this is what i know:

>> 1) sbt assembly works fine for version conflicts for an artifact. no exclusion rules are needed.

>> 2) if artifacts have the same classes inside yet are not recognized as different versions of the same artifact (due to renaming of artifacts typically, or due to the inclusion of classes from another jar) then a manual exclusion rule will be needed, or else sbt will apply a simple but programmable rule to pick one class and drop the rest. i do not see how maven could do this better or without manual exclusion rules.

>> On Sat, Mar 1, 2014 at 1:00 AM, Mridul Muralidharan <mri...@gmail.com> wrote:

>>> On Sat, Mar 1, 2014 at 2:05 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>>>> Hey,

>>>> Thanks everyone for chiming in on this. I wanted to summarize these issues a bit, particularly wrt the constituents involved - does this seem accurate?

>>>> = Spark Users =
>>>> In general those linking against Spark should be totally unaffected by the build choice. Spark will continue to publish well-formed poms and jars to maven central. This is a no-op wrt this decision.

>>>> = Spark Developers =
>>>> There are two concerns: (a) general day-to-day development and packaging, and (b) Spark binaries and packages for distribution.

>>>> For (a) - sbt seems better because it's just nicer for doing scala development (incremental compilation is simple, we have some home-baked tools for compiling Spark vs. the spark deps, etc). The argument that maven has more "general know-how" hasn't, at least so far, affected us in the ~2 years we've maintained both builds - adding stuff to the Maven build is typically just as annoying/difficult as it is with sbt.

>>>> For (b) - Some non-specific concerns were raised about bugs with the sbt assembly package - we should look into this and see what is going on. Maven has better out-of-the-box support for publishing to maven central; we'd have to do some manual work on our end to make this work well with sbt.

>>> Not non-specific concerns - assembly via sbt is fragile; the (manual) exclusion rules in the sbt project are testament to this.

>>> In particular, I don't see any quantifiable benefits in using sbt over maven. Incremental compilation, compiling only a subproject, running specific tests, etc. are all available even with maven - so they are not differentiators. On the other hand, sbt does introduce further manual overhead in dependency management for assembled/shaded jar creation.

>>> Regards,
>>> Mridul

>>>> = Downstream Integrators =
>>>> On this one it seems that Maven is the universal favorite, largely because of community awareness of Maven and comfort with Maven builds. Some things, like restructuring the Spark build to inherit config values from a vendor build, will not be possible with sbt (though fairly straightforward to work around).
>>>> Other cases where vendors have directly modified or inherited the Spark build won't work anymore if we standardize on SBT. These have no obvious workaround at this point, as far as I can see.

>>>> - Patrick

>>>> On Wed, Feb 26, 2014 at 7:09 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
>>>>> On Feb 26, 2014 11:12 PM, "Patrick Wendell" <pwend...@gmail.com> wrote:

>>>>>> @mridul - As far as I know both Maven and sbt use fairly similar processes for building the assembly/uber jar. We actually used to package spark with sbt and there were no specific issues we encountered, and AFAIK sbt respects versioning of transitive dependencies correctly. Do you have a specific bug listing for sbt that indicates something is broken?

>>>>> Slightly longish ...

>>>>> The assembled jar generated via sbt broke all over the place while I was adding yarn support in 0.6 - and I had to fix the sbt project a fair bit to get it to work: we need the assembled jar to submit a yarn job.

>>>>> When I finally submitted those changes to 0.7, it broke even more - since dependencies changed: someone else had thankfully already added maven support by then - which worked remarkably well out of the box (with some minor tweaks)!

>>>>> In theory, they might be expected to work the same, but practically they did not: as I mentioned, it must just have been luck that maven worked that well; but given multiple past nasty experiences with sbt, and the fact that it does not bring anything compelling or new in contrast, I am fairly against the idea of using only sbt - in spite of maven being unintuitive at times.

>>>>> Regards,
>>>>> Mridul

>>>>>> @sandy - It sounds like you are saying that the CDH build would be easier with Maven because you can inherit the POM. However, is this just a matter of convenience for packagers, or would standardizing on sbt limit capabilities in some way? I assume it would just mean a bit more manual work for packagers having to figure out how to set the hadoop version in SBT and exclude certain dependencies. For instance, what does CDH do about other components like Impala that are not based on Maven at all?

>>>>>> On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan <e...@ooyala.com> wrote:
>>>>>>> I'd like to propose the following way to move forward, based on the comments I've seen:

>>>>>>> 1. Aggressively clean up the giant dependency graph. One ticket I might work on if I have time is SPARK-681, which might remove the giant fastutil dependency (~15MB by itself).

>>>>>>> 2. Take an intermediate step by having only ONE source of truth w.r.t. dependencies and versions. This means either:
>>>>>>>    a) Using a maven POM as the spec for dependencies, Hadoop version, etc. Then, use sbt-pom-reader to import it.
>>>>>>>    b) Using the build.scala as the spec, and "sbt make-pom" to generate the pom.xml for the dependencies.

>>>>>>> The idea is to remove the pain and errors associated with manual translation of dependency specs from one system to another, while still maintaining the things which are hard to translate (plugins).
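A rough sketch of the two options above (`make-pom` is a built-in sbt task; the sbt-pom-reader coordinates and versions shown are illustrative, not a tested setup):

    // Option (b): the sbt build definition is the source of truth; generate the POM from it.
    //   $ sbt make-pom     # writes target/scala-2.10/<artifact>-<version>.pom
    // The generated XML can be post-processed via the pomPostProcess setting:
    pomPostProcess := { (node: scala.xml.Node) => node }   // identity here; hook for rewrites

    // Option (a): pom.xml is the source of truth; load it into sbt with sbt-pom-reader
    // (project/plugins.sbt -- coordinates/version illustrative):
    // addSbtPlugin("com.typesafe.sbt" % "sbt-pom-reader" % "1.0.0")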
>>>>>>> On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>> We maintain an in-house spark build using sbt. We have no problem using sbt assembly. We did add a few exclude statements for transitive dependencies.

>>>>>>>> The main enemy of assemblies are jars that include stuff they shouldn't (kryo comes to mind, I think they include logback?), new versions of jars that change the provider/artifact without changing the package (asm), and incompatible new releases (protobuf). These break the transitive resolution process. I imagine that's true for any build tool.

>>>>>>>> Besides shading, I don't see anything maven can do that sbt cannot, and if I understand it correctly shading is not currently done using the build tool.

>>>>>>>> Since spark is primarily scala/akka based, the main developer base will be familiar with sbt (I think?). Switching build tools is always painful. I personally think it is smarter to put this burden on a limited number of upstream integrators than on the community. However, that said, I don't think it's a problem for us to maintain an sbt build in-house if spark switched to maven.

>>>>>>>> The problem is, the complete spark dependency graph is fairly large, and there are a lot of conflicting versions in there - in particular when we bump versions of dependencies - making managing this messy at best.

>>>>>>>> Now, I have not looked in detail at how maven manages this - it might just be accidental that we get a decent out-of-the-box assembled shaded jar (since we don't do anything great to configure it). With the current state of sbt in spark, it definitely is not a good solution: if we can enhance it (or it already is?), while keeping the management of the version/dependency graph manageable, I don't have any objections to using sbt or maven! Too many excludes, pinned versions, etc. would just make things unmanageable in future.

>>>>>>>> Regards,
>>>>>>>> Mridul

>>>>>>>> On Wed, Feb 26, 2014 at 8:56 AM, Evan Chan <e...@ooyala.com> wrote:
>>>>>>>>> Actually you can control exactly how sbt assembly merges or resolves conflicts. I believe the default settings, however, lead to an ordering which cannot be controlled.

>>>>>>>>> I do wish for a smarter fat jar plugin.

>>>>>>>>> -Evan
>>>>>>>>> To be free is not merely to cast off one's chains, but to live in a way that respects & enhances the freedom of others. (#NelsonMandela)

>>>>>>>>> On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan <mri...@gmail.com> wrote:

>>>>>>>>>> On Wed, Feb 26, 2014 at 5:31 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>>>>>>>>>>> Evan - this is a good thing to bring up. Wrt the shader plug-in - right now we don't actually use it for bytecode shading - we simply use it for creating the uber jar with excludes (which sbt supports just fine via assembly).
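For concreteness, the per-file conflict handling being discussed looks roughly like this in sbt-assembly (key names differ across plugin versions; this is a generic sketch, not Spark's actual configuration):

    // build.sbt fragment, assuming the sbt-assembly plugin is on the build classpath.
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard   // drop manifests/signature files
      case "reference.conf"              => MergeStrategy.concat    // merge Akka/Typesafe configs
      case _                             => MergeStrategy.first     // otherwise keep the first copy seen
    }

Note that MergeStrategy.first is exactly the classpath-order-dependent behaviour Mridul objects to below: which copy is "first" depends on dependency ordering, not on which version is newer.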
>>>>>>>>>> Not really - as I mentioned initially in this thread, sbt's assembly does not take dependencies into account properly, and can overwrite newer classes with older versions. From an assembly point of view, sbt is not very good: we have yet to try it after the 2.10 shift though (and probably won't, given the mess it created last time).

>>>>>>>>>> Regards,
>>>>>>>>>> Mridul

>>>>>>>>>>> I was wondering actually, do you know if it's possible to add shaded artifacts to the *spark jar* using this plug-in (e.g. not an uber jar)? That's something I could see being really handy in the future.

>>>>>>>>>>> - Patrick

>>>>>>>>>>> On Tue, Feb 25, 2014 at 3:39 PM, Evan Chan <e...@ooyala.com> wrote:
>>>>>>>>>>>> The problem is that plugins are not equivalent. There is AFAIK no equivalent to the maven shader plugin for SBT. There is an SBT plugin which can apparently read POM XML files (sbt-pom-reader). However, it can't possibly handle plugins, which is still problematic.

>>>>>>>>>>>> On Tue, Feb 25, 2014 at 3:31 PM, yao <yaosheng...@gmail.com> wrote:
>>>>>>>>>>>>> I would prefer to keep both of them; it would be better even if that means pom.xml will be generated using sbt. Some companies, like my current one, have their own build infrastructure built on top of maven. It is not easy to support sbt for these potential spark clients. But I do agree to keep only one if there is a promising way to generate a correct configuration from the other.

>>>>>>>>>>>>> -Shengzhe

>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 3:20 PM, Evan Chan <e...@ooyala.com> wrote:
>>>>>>>>>>>>>> The correct way to exclude dependencies in SBT is actually to declare a dependency as "provided". I'm not familiar with Maven or its dependencySet, but provided will mark the entire dependency tree as excluded. It is also possible to exclude jar by jar, but this is pretty error prone and messy.

>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 2:45 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>>>>>>> yes in sbt assembly you can exclude jars (although i never had a need for this) and files in jars.

>>>>>>>>>>>>>>> for example i frequently remove log4j.properties, because for whatever reason hadoop decided to include it, making it very difficult to use our own logging config.
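A sketch of the exclusion mechanisms Evan and Koert describe (coordinates and jar names are illustrative, and exact key names vary across sbt-assembly versions):

    // 1) "provided" keeps a dependency and its transitive tree out of the assembly
    //    while leaving it on the compile classpath:
    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0" % "provided"

    // 2) Jar-by-jar exclusion from the fat jar (error prone, as noted above):
    assemblyExcludedJars in assembly := {
      (fullClasspath in assembly).value.filter(_.data.getName.startsWith("asm-"))
    }

    // 3) Dropping a single bundled file such as Hadoop's log4j.properties is just
    //    another pattern in the merge strategy sketched earlier:
    //      case "log4j.properties" => MergeStrategy.discard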
>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 4:24 PM, Konstantin Boudnik <c...@apache.org> wrote:
>>>>>>>>>>>>>>>> On Fri, Feb 21, 2014 at 11:11 AM, Patrick Wendell wrote:
>>>>>>>>>>>>>>>>> Kos - thanks for chiming in. Could you be more specific about what is available in maven and not in sbt for these issues? I took a look at the bigtop code relating to Spark. As far as I could tell [1] was the main point of integration with the build system (maybe there are other integration points)?

>>>>>>>>>>>>>>>>>> - in order to integrate Spark well into the existing Hadoop stack it was necessary to have a way to avoid transitive dependency duplications and possible conflicts.

>>>>>>>>>>>>>>>>>> E.g. Maven assembly allows us to avoid adding _all_ Hadoop libs and later merely declare the Spark package dependency on standard Bigtop Hadoop packages. And yes - Bigtop packaging means the naming and layout would be standard across all commercial Hadoop distributions that are worth mentioning: ASF Bigtop convenience binary packages, and Cloudera or Hortonworks packages. Hence, the downstream user doesn't need to spend any effort to make sure that Spark "clicks in" properly.

>>>>>>>>>>>>>>>>> The sbt build also allows you to plug in a Hadoop version, similar to the maven build.

>>>>>>>>>>>>>>>> I am actually talking about an ability to exclude a set of dependencies from an assembly, similarly to what's happening in the dependencySet sections of assembly/src/main/assembly/assembly.xml. If there is comparable functionality in sbt, that would help quite a bit, apparently.

>>>>>>>>>>>>>>>> Cos

>>>>>>>>>>>>>>>>>> - Maven provides a relatively easy way to deal with the jar-hell problem, although the original maven build was just Shader'ing everything into a huge lump of class files, oftentimes ending up with classes slamming on top of each other from different transitive dependencies.

>>>>>>>>>>>>>>>>> AFAIK we are only using the shade plug-in to deal with conflict resolution in the assembly jar. These are dealt with in sbt via the sbt assembly plug-in in an identical way. Is there a difference?

>>>>>>>>>>>>>>>> I am bringing up the Shader because it is an awful hack which can't be used in a real controlled deployment.

>>>>>>>>>>>>>>>> Cos
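On the question of comparable functionality in sbt: the closest idiom is the "provided" scoping described above - keep the whole Hadoop subtree out of the assembly and let the Bigtop/vendor package depend on the distribution's Hadoop jars instead. A sketch with illustrative coordinates, not Spark's or Bigtop's actual configuration:

    val hadoopVersion = "2.2.0"   // example only
    libraryDependencies ++= Seq(
      // never enters the fat jar; supplied at runtime by the system Hadoop install
      "org.apache.hadoop" % "hadoop-client" % hadoopVersion % "provided"
    )
    // Roughly the role a <dependencySet> with <excludes> plays in a Maven assembly
    // descriptor, except the classes are kept out at dependency-scope level rather
    // than filtered from the assembly afterwards.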
>>>>>>>>>>>>>>>>> [1] https://git-wip-us.apache.org/repos/asf?p=bigtop.git;a=blob;f=bigtop-packages/src/common/spark/do-component-build;h=428540e0f6aa56cd7e78eb1c831aa7fe9496a08f;hb=master

>>>>>>> --
>>>>>>> Evan Chan
>>>>>>> Staff Engineer
>>>>>>> e...@ooyala.com |