Github reviews now going to separate reviews@ mailing list
Hey All, We've created a new list called revi...@spark.apache.org which will carry the contents of the github pull requests and comments. Note that these e-mails will no longer appear on the dev list. Thanks to Apache Infra for helping us set this up. To subscribe to this list, e-mail: reviews-subscr...@spark.apache.org - Patrick
Help vote for Spark talks at the Hadoop Summit
Hey All, The Hadoop Summit uses community choice voting to decide which talks to feature. It would be great if the community could help vote for Spark talks so that Spark has a good showing at this event. You can make three votes on each track. Below I've listed Spark talks in each of the tracks - voting closes tomorrow so vote now!!

- Building a Unified Data Pipeline in Apache Spark bit.ly/O8USIq (Committer Track)
- Building a Data Processing System for Real Time Auctions bit.ly/1ij3XJJ (Business Apps Track)
- SparkR: Enabling Interactive Data Science at Scale on Hadoop bit.ly/1kPQUlG (Data Science Track)
- Recent Developments in Spark MLlib and Beyond bit.ly/1hgZW5D (The Future of Apache Hadoop Track)

Cheers, - Patrick
Re: Spark 0.9.0 and log4j
Evan, I actually remembered that Paul Brown (who also reported this issue) tested it and found that it worked. I'm going to merge this into master and branch 0.9, so please give it a spin when you have a chance. - Patrick

On Sat, Mar 8, 2014 at 2:00 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Evan, This is being tracked here: https://spark-project.atlassian.net/browse/SPARK-1190 That patch didn't get merged, but I've just opened a new one here: https://github.com/apache/spark/pull/107/files Would you have any interest in testing this? I want to make sure it works for users who are using logback. I'd like to get this merged quickly since it's one of the only remaining blockers for Spark 0.9.1. - Patrick

On Fri, Mar 7, 2014 at 11:04 AM, Evan Chan e...@ooyala.com wrote: Hey guys, This is a follow-up to this semi-recent thread: http://apache-spark-developers-list.1001551.n3.nabble.com/0-9-0-forces-log4j-usage-td532.html 0.9.0 final is causing issues for us as well because we use Logback as our backend and Spark now requires Log4j. I see Patrick has PR #560 to incubator-spark; was that merged in or left out? Also, I see references to a new PR that might fix this, but I can't seem to find it in the github open PR page. Anybody have a link? As a last resort we can switch to Log4j, but we would rather not have to do that if possible. thanks, Evan -- -- Evan Chan Staff Engineer e...@ooyala.com |
Re: 0.9.0 forces log4j usage
The fix for this was just merged into branch 0.9 (will be in 0.9.1+) and master.

On Sun, Feb 9, 2014 at 11:44 PM, Patrick Wendell pwend...@gmail.com wrote: Thanks Paul - it isn't meant to be a full solution, just a fix for the 0.9 branch - for the full solution there is another PR by Sean Owen.

On Sun, Feb 9, 2014 at 11:35 PM, Paul Brown p...@mult.ifario.us wrote: Hi, Patrick -- I gave that a go locally, and it works as desired. Best. -- Paul -- p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/

On Fri, Feb 7, 2014 at 6:10 PM, Patrick Wendell pwend...@gmail.com wrote: Ah okay, sounds good. This is what I meant earlier by "You have some other application that directly calls log4j", i.e. you have for historical reasons installed log4j-over-slf4j. Would you mind trying out this fix and seeing if it works? This is designed to be a hotfix for 0.9, not a general solution where we rip out log4j from our published dependencies: https://github.com/apache/incubator-spark/pull/560/files - Patrick

On Fri, Feb 7, 2014 at 5:57 PM, Paul Brown p...@mult.ifario.us wrote: Hi, Patrick -- I forget which other component is responsible, but we're using log4j-over-slf4j as part of an overall requirement to centralize logging, i.e., *someone* else is logging over log4j and we're pulling that in. (There's also some jul logging from Jersey, etc.)

Goals:
- Fully control/capture all possible logging. (God forbid we have to grab System.out/err, but we'd do it if needed.)
- Use the backend we like best at the moment. (Happens to be logback.)

Possible cases:
- If Spark used Log4j at all, we would pull in that logging via log4j-over-slf4j.
- If Spark used only slf4j and referenced no backend, we would use it as-is, although we'd still have log4j-over-slf4j because of other libraries.
- If Spark used only slf4j and referenced the slf4j-log4j12 backend, we would exclude that one dependency (via our POM).

Best. -- Paul -- p...@mult.ifario.us | Multifarious, Inc.
| http://mult.ifario.us/

On Fri, Feb 7, 2014 at 5:38 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Paul, So if your goal is ultimately to output to logback, then why don't you just use slf4j and logback-classic.jar as described here [1]? Why involve log4j-over-slf4j at all? Let's say we refactored the spark build so it didn't advertise slf4j-log4j12 as a dependency. Would you still be using log4j-over-slf4j... or is this just a fix to deal with the fact that Spark is somewhat log4j dependent at this point? [1] http://www.slf4j.org/manual.html - Patrick

On Fri, Feb 7, 2014 at 5:14 PM, Paul Brown p...@mult.ifario.us wrote: Hi, Patrick -- That's close but not quite it. The issue that occurs is not the delegation loop mentioned in the slf4j documentation. The stack overflow is entirely within the code in the Spark trait:

  at org.apache.spark.Logging$class.initializeLogging(Logging.scala:112)
  at org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:97)
  at org.apache.spark.Logging$class.log(Logging.scala:36)
  at org.apache.spark.SparkEnv$.log(SparkEnv.scala:94)

And then that repeats. As for our situation, we exclude the slf4j-log4j12 dependency when we import the Spark library (because we don't want to use log4j) and have log4j-over-slf4j already in place to ensure that all of the logging in the overall application runs through slf4j and then out through logback. (We also, as another poster mentioned, force jcl and jul through slf4j.) The zen of slf4j for libraries is that the library uses the slf4j API and the enclosing application can route logging as it sees fit. The Spark master CLI would log via slf4j and include the slf4j-log4j12 backend; same for the Spark worker CLI. Spark as a library (versus as a container) would not include any backend to the slf4j API and would leave this up to the application. (FWIW, this would also avoid your log4j warning message.)
But as I was saying before, I'd be happy with a situation where I can avoid log4j being enabled or configured, and I think you'll find an existing choice of logging framework to be a common scenario for those embedding Spark in other systems. Best. -- Paul -- p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/

On Fri, Feb 7, 2014 at 3:01 PM, Patrick Wendell pwend...@gmail.com wrote: Paul, Looking back at your problem, I think it's the one here: http://www.slf4j.org/codes.html#log4jDelegationLoop So let me just be clear about what you are doing so I understand: you have some other application that directly calls log4j, so you have to include log4j-over-slf4j to route those logs through slf4j to logback. At the same time you embed Spark in this application
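The exclusion Paul describes - depending on Spark but dropping its slf4j-log4j12 backend so that logback-classic is the only slf4j binding on the classpath - might look roughly like the following POM fragment. The coordinates and version shown are illustrative for the Spark 0.9 era, not taken from Paul's actual build:

```xml
<!-- Sketch: pull in Spark but exclude its slf4j-log4j12 binding, so the
     application's own logback-classic binding is the only one slf4j finds.
     Coordinates/version are illustrative. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>0.9.0-incubating</version>
  <exclusions>
    <exclusion>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

With log4j-over-slf4j also on the classpath (as Paul's setup has), any direct log4j calls from other libraries are routed into slf4j and out through logback as well.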
Updated Developer Docs
Hey All, Just a heads up that there are a bunch of updated developer docs on the wiki, including the dates for the current merge window. Some of the new docs might be useful for developers/committers: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage Cheers, - Patrick
Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark
Hey, Thanks everyone for chiming in on this. I wanted to summarize these issues a bit, particularly wrt the constituents involved - does this seem accurate?

= Spark Users =
In general, those linking against Spark should be totally unaffected by the build choice. Spark will continue to publish well-formed poms and jars to Maven Central. This is a no-op wrt this decision.

= Spark Developers =
There are two concerns: (a) general day-to-day development and packaging, and (b) Spark binaries and packages for distribution. For (a), sbt seems better because it's just nicer for doing Scala development (incremental compilation is simple, we have some home-baked tools for compiling Spark vs. the Spark deps, etc.). The argument that Maven has more general know-how hasn't, at least so far, affected us in the ~2 years we've maintained both builds - adding stuff is typically just as annoying/difficult in Maven as with sbt. For (b), some non-specific concerns were raised about bugs with the sbt assembly package - we should look into this and see what is going on. Maven has better out-of-the-box support for publishing to Maven Central; we'd have to do some manual work on our end to make this work well with sbt.

= Downstream Integrators =
On this one it seems that Maven is the universal favorite, largely because of community awareness of Maven and comfort with Maven builds. Some things, like restructuring the Spark build to inherit config values from a vendor build, will not be possible with sbt (though fairly straightforward to work around). Other cases where vendors have directly modified or inherited the Spark build won't work anymore if we standardize on sbt. These have no obvious workaround at this point as far as I can see.
- Patrick

On Wed, Feb 26, 2014 at 7:09 PM, Mridul Muralidharan mri...@gmail.com wrote: On Feb 26, 2014 11:12 PM, Patrick Wendell pwend...@gmail.com wrote: @mridul - As far as I know both Maven and sbt use fairly similar processes for building the assembly/uber jar. We actually used to package Spark with sbt and there were no specific issues we encountered, and AFAIK sbt respects versioning of transitive dependencies correctly. Do you have a specific bug listing for sbt that indicates something is broken?

Slightly longish ... The assembled jar generated via sbt broke all over the place while I was adding yarn support in 0.6 - and I had to fix the sbt project a fair bit to get it to work: we need the assembled jar to submit a yarn job. When I finally submitted those changes to 0.7, it broke even more - since dependencies changed. Someone else had thankfully already added maven support by then - which worked remarkably well out of the box (with some minor tweaks)! In theory they might be expected to work the same, but practically they did not: as I mentioned, it must just have been luck that maven worked that well; but given multiple past nasty experiences with sbt, and the fact that it does not bring anything compelling or new in contrast, I am fairly against the idea of using only sbt - in spite of maven being unintuitive at times. Regards, Mridul

@sandy - It sounds like you are saying that the CDH build would be easier with Maven because you can inherit the POM. However, is this just a matter of convenience for packagers, or would standardizing on sbt limit capabilities in some way? I assume it would just mean a bit more manual work for packagers having to figure out how to set the hadoop version in sbt and exclude certain dependencies. For instance, what does CDH do about other components like Impala that are not based on Maven at all?
On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan e...@ooyala.com wrote: I'd like to propose the following way to move forward, based on the comments I've seen:

1. Aggressively clean up the giant dependency graph. One ticket I might work on if I have time is SPARK-681, which might remove the giant fastutil dependency (~15MB by itself).

2. Take an intermediate step by having only ONE source of truth w.r.t. dependencies and versions. This means either:
a) Using a maven POM as the spec for dependencies, Hadoop version, etc. Then use sbt-pom-reader to import it.
b) Using the build.scala as the spec, and sbt make-pom to generate the pom.xml for the dependencies.

The idea is to remove the pain and errors associated with manual translation of dependency specs from one system to another, while still maintaining the things which are hard to translate (plugins).

On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers ko...@tresata.com wrote: We maintain an in-house Spark build using sbt. We have no problem using sbt assembly. We did add a few exclude statements for transitive dependencies. The main enemies of assemblies are jars that include stuff they shouldn't (kryo comes to mind, I think they include logback?), and new versions of jars that change the provider/artifact
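The "exclude statements for transitive dependencies" Koert mentions can be expressed in an sbt build roughly as follows. This is a sketch: the excluded coordinates are illustrative examples of the kinds of conflicts discussed in this thread (slf4j bindings, renamed asm artifacts), not a list from the Tresata build:

```scala
// build.sbt sketch: depend on Spark but strip transitive dependencies that
// conflict with the rest of the application. Coordinates are illustrative.
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating" excludeAll(
  ExclusionRule(organization = "org.slf4j", name = "slf4j-log4j12"), // use our own slf4j backend
  ExclusionRule(organization = "asm")                                // avoid old asm/asm clashes
)
```

The same mechanism also works per-dependency via `exclude(org, name)` when only a single artifact needs to be dropped.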
Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark
@mridul - As far as I know both Maven and Sbt use fairly similar processes for building the assembly/uber jar. We actually used to package spark with sbt and there were no specific issues we encountered and AFAIK sbt respects versioning of transitive dependencies correctly. Do you have a specific bug listing for sbt that indicates something is broken? @sandy - It sounds like you are saying that the CDH build would be easier with Maven because you can inherit the POM. However, is this just a matter of convenience for packagers or would standardizing on sbt limit capabilities in some way? I assume that it would just mean a bit more manual work for packagers having to figure out how to set the hadoop version in SBT and exclude certain dependencies. For instance, what does CDH about other components like Impala that are not based on Maven at all? On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan e...@ooyala.com wrote: I'd like to propose the following way to move forward, based on the comments I've seen: 1. Aggressively clean up the giant dependency graph. One ticket I might work on if I have time is SPARK-681 which might remove the giant fastutil dependency (~15MB by itself). 2. Take an intermediate step by having only ONE source of truth w.r.t. dependencies and versions. This means either: a) Using a maven POM as the spec for dependencies, Hadoop version, etc. Then, use sbt-pom-reader to import it. b) Using the build.scala as the spec, and sbt make-pom to generate the pom.xml for the dependencies The idea is to remove the pain and errors associated with manual translation of dependency specs from one system to another, while still maintaining the things which are hard to translate (plugins). On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers ko...@tresata.com wrote: We maintain in house spark build using sbt. We have no problem using sbt assembly. We did add a few exclude statements for transitive dependencies. 
The main enemies of assemblies are jars that include stuff they shouldn't (kryo comes to mind, I think they include logback?), new versions of jars that change the provider/artifact without changing the package (asm), and incompatible new releases (protobuf). These break the transitive resolution process. I imagine that's true for any build tool. Besides shading, I don't see anything maven can do that sbt cannot, and if I understand it correctly, shading is not currently done using the build tool. Since spark is primarily scala/akka based, the main developer base will be familiar with sbt (I think?). Switching build tools is always painful. I personally think it is smarter to put this burden on a limited number of upstream integrators than on the community. That said, I don't think it's a problem for us to maintain an sbt build in-house if spark switched to maven.

The problem is, the complete spark dependency graph is fairly large, and there are a lot of conflicting versions in there - in particular when we bump versions of dependencies - making managing this messy at best. Now, I have not looked in detail at how maven manages this - it might just be accidental that we get a decent out-of-the-box assembled shaded jar (since we don't do anything great to configure it). With the current state of sbt in spark, it definitely is not a good solution; if we can enhance it (or is it already enhanced?), while keeping the management of the version/dependency graph manageable, I don't have any objections to using sbt or maven! Too many excluded versions, pinned versions, etc. would just make things unmanageable in the future. Regards, Mridul

On Wed, Feb 26, 2014 at 8:56 AM, Evan Chan e...@ooyala.com wrote: Actually you can control exactly how sbt assembly merges or resolves conflicts. I believe the default settings, however, lead to an order which cannot be controlled. I do wish for a smarter fat jar plugin.
-Evan
To be free is not merely to cast off one's chains, but to live in a way that respects and enhances the freedom of others. (#NelsonMandela)

On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan mri...@gmail.com wrote: On Wed, Feb 26, 2014 at 5:31 AM, Patrick Wendell pwend...@gmail.com wrote: Evan - this is a good thing to bring up. Wrt the shader plug-in - right now we don't actually use it for bytecode shading - we simply use it for creating the uber jar with excludes (which sbt supports just fine via assembly).

Not really - as I mentioned initially in this thread, sbt's assembly does not take dependencies into account properly and can overwrite newer classes with older versions. From an assembly point of view, sbt is not very good; we have yet to try it after the 2.10 shift (and probably won't, given the mess it created last time). Regards, Mridul

I was wondering actually, do you know if it's possible to add shaded artifacts to the *spark jar* using this plug-in (e.g. not an uber jar)? That's something I could see being
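Evan's point that sbt assembly's conflict resolution is controllable refers to the plugin's merge-strategy setting, which decides what happens when the same path appears in multiple jars. A rough sketch against the sbt-assembly plugin of that era (the API changed across versions, so treat this as illustrative rather than exact):

```scala
// build.sbt sketch: resolve assembly conflicts deliberately instead of
// relying on classpath order. Paths and strategies are illustrative.
mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
  {
    case x if x.startsWith("META-INF/") => MergeStrategy.discard // drop jar signatures/manifests
    case "reference.conf"               => MergeStrategy.concat  // merge Akka default configs
    case x                              => old(x)                // fall back to the default
  }
}
```

This addresses duplicate files, though, not Mridul's complaint: choosing a strategy per path still doesn't make the plugin version-aware, so an older class can win unless the conflicting dependency is excluded or pinned upstream.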