Github reviews now going to separate reviews@ mailing list

2014-03-16 Thread Patrick Wendell
Hey All,

We've created a new list called revi...@spark.apache.org which will
contain the contents from the github pull requests and comments.

Note that these e-mails will no longer appear on the dev list. Thanks
to Apache Infra for helping us set this up.

To subscribe, e-mail:
reviews-subscr...@spark.apache.org

- Patrick


Help vote for Spark talks at the Hadoop Summit

2014-03-13 Thread Patrick Wendell
Hey All,

The Hadoop Summit uses community choice voting to decide which talks
to feature. It would be great if the community could help vote for
Spark talks so that Spark has a good showing at this event. You can
cast three votes in each track. Below I've listed Spark talks in each
of the tracks - voting closes tomorrow so vote now!!

Building a Unified Data Pipeline in Apache Spark
bit.ly/O8USIq
(Committer Track)

Building a Data Processing System for Real Time Auctions
bit.ly/1ij3XJJ
(Business Apps Track)

SparkR: Enabling Interactive Data Science at Scale on Hadoop
bit.ly/1kPQUlG
(Data Science Track)

Recent Developments in Spark MLlib and Beyond
bit.ly/1hgZW5D
(The Future of Apache Hadoop Track)

Cheers,
- Patrick


Re: Spark 0.9.0 and log4j

2014-03-08 Thread Patrick Wendell
Evan - I actually remembered that Paul Brown (who also reported this
issue) tested it and found that it worked. I'm going to merge this
into master and branch 0.9, so please give it a spin when you have a
chance.

- Patrick

On Sat, Mar 8, 2014 at 2:00 PM, Patrick Wendell pwend...@gmail.com wrote:
 Hey Evan,

 This is being tracked here:
 https://spark-project.atlassian.net/browse/SPARK-1190

 That patch didn't get merged but I've just opened a new one here:
 https://github.com/apache/spark/pull/107/files

 Would you have any interest in testing this? I want to make sure it
 works for users who are using logback.

 I'd like to get this merged quickly since it's one of the only
 remaining blockers for Spark 0.9.1.

 - Patrick



 On Fri, Mar 7, 2014 at 11:04 AM, Evan Chan e...@ooyala.com wrote:
 Hey guys,

 This is a follow-up to this semi-recent thread:
 http://apache-spark-developers-list.1001551.n3.nabble.com/0-9-0-forces-log4j-usage-td532.html

 0.9.0 final is causing issues for us as well because we use Logback as
 our backend and Spark requires Log4j now.

 I see Patrick has a PR #560 to incubator-spark, was that merged in or
 left out?

 Also I see references to a new PR that might fix this, but I can't
 seem to find it in the github open PR page.   Anybody have a link?

 As a last resort we can switch to Log4j, but would rather not have to
 do that if possible.

 thanks,
 Evan

 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |
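
For readers in the same situation, the workaround discussed in this thread -
dropping Spark's slf4j-log4j12 binding and supplying logback instead - would
look roughly like the following in an sbt build. This is only a sketch; the
coordinates and versions are illustrative of the 0.9.x era and are not taken
from the thread itself:

  libraryDependencies ++= Seq(
    // Depend on Spark but exclude its log4j binding so slf4j stays backend-free
    "org.apache.spark" %% "spark-core" % "0.9.0-incubating" exclude("org.slf4j", "slf4j-log4j12"),
    // Route any direct log4j calls from other libraries through slf4j
    "org.slf4j" % "log4j-over-slf4j" % "1.7.5",
    // Supply logback as the single logging backend
    "ch.qos.logback" % "logback-classic" % "1.0.13"
  )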


Re: 0.9.0 forces log4j usage

2014-03-08 Thread Patrick Wendell
The fix for this was just merged into branch 0.9 (will be in 0.9.1+) and master.

On Sun, Feb 9, 2014 at 11:44 PM, Patrick Wendell pwend...@gmail.com wrote:
 Thanks Paul - it isn't meant to be a full solution but just a fix for
 the 0.9 branch - for the full solution there is another PR by Sean
 Owen.

 On Sun, Feb 9, 2014 at 11:35 PM, Paul Brown p...@mult.ifario.us wrote:
 Hi, Patrick --

 I gave that a go locally, and it works as desired.

 Best.
 -- Paul

 --
 p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


 On Fri, Feb 7, 2014 at 6:10 PM, Patrick Wendell pwend...@gmail.com wrote:

  Ah okay sounds good. This is what I meant earlier by "You have
  some other application that directly calls log4j" - i.e. you have,
  for historical reasons, installed the log4j-over-slf4j.

 Would you mind trying out this fix and seeing if it works? This is
 designed to be a hotfix for 0.9, not a general solution where we rip
 out log4j from our published dependencies:

 https://github.com/apache/incubator-spark/pull/560/files

 - Patrick

 On Fri, Feb 7, 2014 at 5:57 PM, Paul Brown p...@mult.ifario.us wrote:
  Hi, Patrick --
 
  I forget which other component is responsible, but we're using the
  log4j-over-slf4j as part of an overall requirement to centralize logging,
  i.e., *someone* else is logging over log4j and we're pulling that in.
   (There's also some jul logging from Jersey, etc.)
 
  Goals:
 
  - Fully control/capture all possible logging.  (God forbid we have to grab
  System.out/err, but we'd do it if needed.)
  - Use the backend we like best at the moment.  (Happens to be logback.)
 
  Possible cases:
 
  - If Spark used Log4j at all, we would pull in that logging via
  log4j-over-slf4j.
  - If Spark used only slf4j and referenced no backend, we would use it as-is,
  although we'd still have the log4j-over-slf4j because of other libraries.
  - If Spark used only slf4j and referenced the slf4j-log4j12 backend, we
  would exclude that one dependency (via our POM).
 
  Best.
  -- Paul
 
 
  --
  p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
 
 
  On Fri, Feb 7, 2014 at 5:38 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  Hey Paul,
 
  So if your goal is ultimately to output to logback, then why don't you
  just use slf4j and logback-classic.jar as described here [1]? Why
  involve log4j-over-slf4j at all?
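
As a sketch of what Patrick is suggesting here - code against the plain slf4j
API with logback-classic on the classpath as the backend - the application side
would look something like this (class and message names are purely
illustrative):

  import org.slf4j.LoggerFactory

  object MyApp {
    // The application codes only against the slf4j API...
    private val log = LoggerFactory.getLogger(getClass)

    def main(args: Array[String]): Unit = {
      // ...and whichever binding is on the classpath (logback-classic here,
      // configured via logback.xml) decides where the output actually goes.
      log.info("logged through slf4j, rendered by logback")
    }
  }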
 
  Let's say we refactored the spark build so it didn't advertise
  slf4j-log4j12 as a dependency. Would you still be using
  log4j-over-slf4j... or is this just a fix to deal with the fact that
  Spark is somewhat log4j-dependent at this point?
 
  [1] http://www.slf4j.org/manual.html
 
  - Patrick
 
  On Fri, Feb 7, 2014 at 5:14 PM, Paul Brown p...@mult.ifario.us wrote:
   Hi, Patrick --
  
   That's close but not quite it.
  
   The issue that occurs is not the delegation loop mentioned in slf4j
   documentation.  The stack overflow is entirely within the code in the
  Spark
   trait:
  
   at org.apache.spark.Logging$class.initializeLogging(Logging.scala:112)
   at
 org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:97)
   at org.apache.spark.Logging$class.log(Logging.scala:36)
   at org.apache.spark.SparkEnv$.log(SparkEnv.scala:94)
  
  
   And then that repeats.
  
   As for our situation, we exclude the slf4j-log4j12 dependency when we
   import the Spark library (because we don't want to use log4j) and have
   log4j-over-slf4j already in place to ensure that all of the logging in
   the overall application runs through slf4j and then out through logback.
   (We also, as another poster already mentioned, force jcl and jul through
   slf4j.)
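
A hedged sketch of the "force jcl and jul through slf4j" part Paul mentions:
the jul bridge needs a programmatic install, while jcl only needs the
jcl-over-slf4j jar in place of commons-logging. The object and method names
below are illustrative, not from the thread:

  import org.slf4j.bridge.SLF4JBridgeHandler  // from the jul-to-slf4j artifact

  object LoggingSetup {
    def routeJulThroughSlf4j(): Unit = {
      SLF4JBridgeHandler.removeHandlersForRootLogger() // drop jul's default console handler
      SLF4JBridgeHandler.install()                     // send java.util.logging records to slf4j
    }
  }
  // jcl needs no code: shipping org.slf4j:jcl-over-slf4j on the classpath
  // instead of commons-logging is enough.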
  
   The zen of slf4j for libraries is that the library uses the slf4j API and
   then the enclosing application can route logging as it sees fit.  Spark
   master CLI would log via slf4j and include the slf4j-log4j12 backend; same
   for Spark worker CLI.  Spark as a library (versus as a container) would not
   include any backend to the slf4j API and leave this up to the application.
   (FWIW, this would also avoid your log4j warning message.)
  
   But as I was saying before, I'd be happy with a situation where I can avoid
   log4j being enabled or configured, and I think you'll find an existing
   choice of logging framework to be a common scenario for those embedding
   Spark in other systems.
  
   Best.
   -- Paul
  
   --
   p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
  
  
   On Fri, Feb 7, 2014 at 3:01 PM, Patrick Wendell pwend...@gmail.com
  wrote:
  
   Paul,
  
   Looking back at your problem. I think it's the one here:
   http://www.slf4j.org/codes.html#log4jDelegationLoop
  
   So let me just be clear what you are doing so I understand. You have
   some other application that directly calls log4j. So you have to
    include log4j-over-slf4j to route those logs through slf4j to logback.
  
   At the same time you embed Spark in this application

Updated Developer Docs

2014-03-04 Thread Patrick Wendell
Hey All,

Just a heads up that there are a bunch of updated developer docs on
the wiki, including the dates for the current merge window.
Some of the new docs might be useful for developers/committers:

https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage

Cheers,
- Patrick


Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-28 Thread Patrick Wendell
Hey,

Thanks everyone for chiming in on this. I wanted to summarize these
issues a bit particularly wrt the constituents involved - does this
seem accurate?

= Spark Users =
In general those linking against Spark should be totally unaffected by
the build choice. Spark will continue to publish well-formed poms and
jars to maven central. This is a no-op wrt this decision.
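
(Concretely, "linking against Spark" here just means pulling the published
artifact from Maven Central - e.g. in sbt, with the then-current release shown
purely for illustration:

  libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"

or the equivalent dependency block in a Maven POM. Either way, the consumer's
build tool is independent of whichever tool Spark itself uses.)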

= Spark Developers =
There are two concerns. (a) General day-to-day development and
packaging and (b) Spark binaries and packages for distribution.

For (a) - sbt seems better because it's just nicer for doing scala
development (incremental compilation is simple, we have some
home-baked tools for compiling Spark vs. the spark deps, etc). The
argument that maven has more general know-how hasn't, at least so far,
affected us in the ~2 years we've maintained both builds -
adding stuff to the Maven build is typically just as annoying/difficult
as with sbt.

For (b) - Some non-specific concerns were raised about bugs with the
sbt assembly package - we should look into this and see what is going
on. Maven has better out-of-the-box support for publishing to Maven
central, we'd have to do some manual work on our end to make this work
well with sbt.

= Downstream Integrators =
On this one it seems that Maven is the universal favorite, largely
because of community awareness of Maven and comfort with Maven builds.
Some things, like restructuring the Spark build to inherit config
values from a vendor build, will not be possible with sbt (though
fairly straightforward to work around). Other cases where vendors have
directly modified or inherited the Spark build won't work anymore if
we standardize on SBT. These have no obvious workaround at this point
as far as I can see.

- Patrick

On Wed, Feb 26, 2014 at 7:09 PM, Mridul Muralidharan mri...@gmail.com wrote:
 On Feb 26, 2014 11:12 PM, Patrick Wendell pwend...@gmail.com wrote:

 @mridul - As far as I know both Maven and Sbt use fairly similar
 processes for building the assembly/uber jar. We actually used to
 package spark with sbt and there were no specific issues we
 encountered and AFAIK sbt respects versioning of transitive
 dependencies correctly. Do you have a specific bug listing for sbt
 that indicates something is broken?

 Slightly longish ...

 The assembled jar, generated via sbt, broke all over the place while I was
 adding yarn support in 0.6 - and I had to fix the sbt project a fair bit to get
 it to work : we need the assembled jar to submit a yarn job.

 When I finally submitted those changes to 0.7, it broke even more - since
 dependencies changed : someone else had thankfully already added maven
 support by then - which worked remarkably well out of the box (with some
 minor tweaks) !

 In theory, they might be expected to work the same, but practically they
 did not : as I mentioned,  it must just have been luck that maven worked
 that well; but given multiple past nasty experiences with sbt, and the fact
 that it does not bring anything compelling or new in contrast, I am fairly
 against the idea of using only sbt - in spite of maven being unintuitive at
 times.

 Regards,
 Mridul


 @sandy - It sounds like you are saying that the CDH build would be
 easier with Maven because you can inherit the POM. However, is this
 just a matter of convenience for packagers or would standardizing on
 sbt limit capabilities in some way? I assume that it would just mean a
 bit more manual work for packagers having to figure out how to set the
 hadoop version in SBT and exclude certain dependencies. For instance,
 what does CDH do about other components like Impala that are not based on
 Maven at all?

 On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan e...@ooyala.com wrote:
  I'd like to propose the following way to move forward, based on the
  comments I've seen:
 
  1.  Aggressively clean up the giant dependency graph.   One ticket I
  might work on if I have time is SPARK-681 which might remove the giant
  fastutil dependency (~15MB by itself).
 
  2.  Take an intermediate step by having only ONE source of truth
  w.r.t. dependencies and versions.  This means either:
 a)  Using a maven POM as the spec for dependencies, Hadoop version,
  etc.   Then, use sbt-pom-reader to import it.
 b)  Using the build.scala as the spec, and sbt make-pom to
  generate the pom.xml for the dependencies
 
  The idea is to remove the pain and errors associated with manual
  translation of dependency specs from one system to another, while
  still maintaining the things which are hard to translate (plugins).
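
As a sketch of option (b) in Evan's list above: sbt can already derive a POM
from the build definition via its built-in makePom task (spelled make-pom at
the sbt shell in the 0.12 era). The project names, versions, and output path
below are illustrative, not Spark's actual build:

  // A minimal build.sbt from which `make-pom` can emit a pom.xml for the
  // project's dependency graph.
  organization := "org.apache.spark"

  name := "spark-core"

  version := "0.9.1-SNAPSHOT"

  scalaVersion := "2.10.3"

  libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.5"

  // Running `make-pom` from the sbt shell then typically writes
  // target/scala-2.10/spark-core_2.10-0.9.1-SNAPSHOT.pom. The generated POM
  // covers dependencies only; plugins - the part that is hard to translate -
  // still have to be maintained by hand. Option (a) goes the other way:
  // keep the Maven POM as the source of truth and load it into sbt with the
  // sbt-pom-reader plugin.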
 
 
  On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers ko...@tresata.com
 wrote:
  We maintain an in-house Spark build using sbt. We have no problem using
  sbt assembly. We did add a few exclude statements for transitive
  dependencies.
 
  The main enemies of assemblies are jars that include stuff they shouldn't
  (kryo comes to mind, I think they include logback?), new versions of jars
  that change the provider/artifact

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-26 Thread Patrick Wendell
@mridul - As far as I know both Maven and Sbt use fairly similar
processes for building the assembly/uber jar. We actually used to
package spark with sbt and there were no specific issues we
encountered and AFAIK sbt respects versioning of transitive
dependencies correctly. Do you have a specific bug listing for sbt
that indicates something is broken?

@sandy - It sounds like you are saying that the CDH build would be
easier with Maven because you can inherit the POM. However, is this
just a matter of convenience for packagers or would standardizing on
sbt limit capabilities in some way? I assume that it would just mean a
bit more manual work for packagers having to figure out how to set the
hadoop version in SBT and exclude certain dependencies. For instance,
 what does CDH do about other components like Impala that are not based on
Maven at all?

On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan e...@ooyala.com wrote:
 I'd like to propose the following way to move forward, based on the
 comments I've seen:

 1.  Aggressively clean up the giant dependency graph.   One ticket I
 might work on if I have time is SPARK-681 which might remove the giant
 fastutil dependency (~15MB by itself).

 2.  Take an intermediate step by having only ONE source of truth
 w.r.t. dependencies and versions.  This means either:
a)  Using a maven POM as the spec for dependencies, Hadoop version,
 etc.   Then, use sbt-pom-reader to import it.
b)  Using the build.scala as the spec, and sbt make-pom to
 generate the pom.xml for the dependencies

 The idea is to remove the pain and errors associated with manual
 translation of dependency specs from one system to another, while
 still maintaining the things which are hard to translate (plugins).


 On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers ko...@tresata.com wrote:
 We maintain an in-house Spark build using sbt. We have no problem using sbt
 assembly. We did add a few exclude statements for transitive dependencies.

 The main enemies of assemblies are jars that include stuff they shouldn't
 (kryo comes to mind, I think they include logback?), new versions of jars
 that change the provider/artifact without changing the package (asm), and
 incompatible new releases (protobuf). These break the transitive resolution
 process. I imagine that's true for any build tool.

 Besides shading, I don't see anything maven can do that sbt cannot, and if I
 understand it correctly, shading is not currently done using the build tool.

 Since spark is primarily scala/akka based, the main developer base will be
 familiar with sbt (I think?). Switching build tools is always painful. I
 personally think it is smarter to put this burden on a limited number of
 upstream integrators than on the community. That said, I don't think
 it's a problem for us to maintain an sbt build in-house if spark switched to
 maven.
 The problem is, the complete spark dependency graph is fairly large,
 and there are a lot of conflicting versions in there - in particular
 when we bump versions of dependencies - making managing this messy at best.

 Now, I have not looked in detail at how maven manages this - it might
 just be accidental that we get a decent out-of-the-box assembled
 shaded jar (since we don't do anything great to configure it).
 With the current state of sbt in spark, it definitely is not a good
 solution : if we can enhance it (or it already is ?), while keeping
 the management of the version/dependency graph manageable, I don't have
 any objections to using sbt or maven !
 Too many exclude versions, pinned versions, etc. would just make things
 unmanageable in the future.


 Regards,
 Mridul




 On Wed, Feb 26, 2014 at 8:56 AM, Evan chan e...@ooyala.com wrote:
 Actually you can control exactly how sbt assembly merges or resolves
 conflicts.  I believe the default settings, however, lead to an ordering which
 cannot be controlled.
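
For context, a sketch of the kind of control Evan means, using sbt-assembly's
merge strategies. This is written against the newer plugin API, where the key
is assemblyMergeStrategy (the 2014-era plugin spelled it mergeStrategy in
assembly); the specific cases are illustrative defaults, not Spark's settings:

  import sbtassembly.{MergeStrategy, PathList}
  import sbtassembly.AssemblyPlugin.autoImport._

  assemblyMergeStrategy in assembly := {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard // drop manifests and signature files
    case "reference.conf"              => MergeStrategy.concat  // concatenate Typesafe config defaults
    case _                             => MergeStrategy.first   // otherwise keep the first copy encountered
  }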

 I do wish for a smarter fat jar plugin.

 -Evan
 To be free is not merely to cast off one's chains, but to live in a way
 that respects & enhances the freedom of others. (#NelsonMandela)

 On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan mri...@gmail.com
 wrote:

 On Wed, Feb 26, 2014 at 5:31 AM, Patrick Wendell pwend...@gmail.com
 wrote:
 Evan - this is a good thing to bring up. Wrt the shader plug-in -
 right now we don't actually use it for bytecode shading - we simply
 use it for creating the uber jar with excludes (which sbt supports
 just fine via assembly).


 Not really - as I mentioned initially in this thread, sbt's assembly
 does not take dependencies into account properly : and can overwrite
 newer classes with older versions.
 From an assembly point of view, sbt is not very good : we are yet to
 try it after the 2.10 shift though (and probably won't, given the mess it
 created last time).

 Regards,
 Mridul






 I was wondering actually, do you know if it's possible to add shaded
 artifacts to the *spark jar* using this plug-in (e.g. not an uber
 jar)? That's something I could see being
