Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-26 Thread Patrick Wendell
@mridul - As far as I know both Maven and Sbt use fairly similar
processes for building the assembly/uber jar. We actually used to
package spark with sbt and there were no specific issues we
encountered and AFAIK sbt respects versioning of transitive
dependencies correctly. Do you have a specific bug listing for sbt
that indicates something is broken?

@sandy - It sounds like you are saying that the CDH build would be
easier with Maven because you can inherit the POM. However, is this
just a matter of convenience for packagers or would standardizing on
sbt limit capabilities in some way? I assume that it would just mean a
bit more manual work for packagers having to figure out how to set the
hadoop version in SBT and exclude certain dependencies. For instance,
what does CDH do about other components like Impala that are not based on
Maven at all?

On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan e...@ooyala.com wrote:
 I'd like to propose the following way to move forward, based on the
 comments I've seen:

 1.  Aggressively clean up the giant dependency graph.   One ticket I
 might work on if I have time is SPARK-681 which might remove the giant
 fastutil dependency (~15MB by itself).

 2.  Take an intermediate step by having only ONE source of truth
 w.r.t. dependencies and versions.  This means either:
a)  Using a maven POM as the spec for dependencies, Hadoop version,
 etc.   Then, use sbt-pom-reader to import it.
b)  Using the build.scala as the spec, and sbt make-pom to
 generate the pom.xml for the dependencies

 The idea is to remove the pain and errors associated with manual
 translation of dependency specs from one system to another, while
 still maintaining the things which are hard to translate (plugins).
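
 For concreteness, a minimal sketch of what option (b) could look like: one
 build object in sbt as the spec, with sbt make-pom emitting a pom.xml from
 it. The module names and versions below are placeholders, not Spark's
 actual dependency list:

   // project/SparkBuild.scala -- illustrative sketch only, not the real Spark build
   import sbt._
   import Keys._

   object SparkBuild extends Build {
     // the one place versions are pinned; "sbt make-pom" can then generate a
     // pom.xml carrying these same dependencies for Maven-based consumers
     val hadoopVersion = sys.props.getOrElse("hadoop.version", "1.0.4")

     lazy val core = Project("core", file("core")).settings(
       libraryDependencies ++= Seq(
         "org.apache.hadoop" % "hadoop-client" % hadoopVersion,
         "com.typesafe.akka" %% "akka-actor"   % "2.2.3"
       )
     )
   }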


 On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers ko...@tresata.com wrote:
 We maintain an in-house spark build using sbt. We have no problem using sbt
 assembly. We did add a few exclude statements for transitive dependencies.
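
 For illustration, excludes of that kind look roughly like the following in
 sbt; the coordinates and versions are examples, not the actual build:

   // sketch: excluding troublesome transitive dependencies in sbt
   libraryDependencies += ("org.apache.hadoop" % "hadoop-client" % "2.2.0").excludeAll(
     ExclusionRule(organization = "org.slf4j", name = "slf4j-log4j12"),
     ExclusionRule(organization = "asm", name = "asm")
   )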

 The main enemies of assemblies are jars that include stuff they shouldn't
 (kryo comes to mind, I think they include logback?), new versions of jars
 that change the provider/artifact without changing the package (asm), and
 incompatible new releases (protobuf). These break the transitive resolution
 process. I imagine that's true for any build tool.

 Besides shading, I don't see anything maven can do that sbt cannot, and if I
 understand it correctly shading is not currently done using the build tool.

 Since spark is primarily scala/akka based, the main developer base will be
 familiar with sbt (I think?). Switching build tools is always painful. I
 personally think it is smarter to put this burden on a limited number of
 upstream integrators than on the community. However, that said, I don't think
 it's a problem for us to maintain an sbt build in-house if spark switched to
 maven.
 The problem is, the complete spark dependency graph is fairly large,
 and there are a lot of conflicting versions in there.
 In particular when we bump versions of dependencies, managing this
 becomes messy at best.

 Now, I have not looked in detail at how maven manages this - it might
 just be accidental that we get a decent out-of-the-box assembled
 shaded jar (since we don't do anything great to configure it).
 With the current state of sbt in spark, it definitely is not a good
 solution: if we can enhance it (or it already is?), while keeping
 the management of the version/dependency graph manageable, I don't have
 any objections to using sbt or maven!
 Too many excludes, pinned versions, etc. would just make things
 unmanageable in the future.


 Regards,
 Mridul




 On Wed, Feb 26, 2014 at 8:56 AM, Evan chan e...@ooyala.com wrote:
 Actually you can control exactly how sbt assembly merges or resolves
 conflicts.  I believe the default settings, however, lead to an ordering
 which cannot be controlled.
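
 For example, sbt-assembly's merge behavior can be customized roughly like
 this (a sketch; the setting keys are from the 0.x plugin era and may differ
 by version):

   // build.sbt sketch: a custom merge strategy for sbt-assembly
   import sbtassembly.Plugin._
   import AssemblyKeys._

   assemblySettings

   mergeStrategy in assembly := {
     case PathList("META-INF", xs @ _*) => MergeStrategy.discard  // drop signature files etc.
     case "reference.conf"              => MergeStrategy.concat   // merge Typesafe configs
     case _                             => MergeStrategy.last     // otherwise keep one copy
   }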

 I do wish for a smarter fat jar plugin.

 -Evan
 To be free is not merely to cast off one's chains, but to live in a way
 that respects & enhances the freedom of others. (#NelsonMandela)

 On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan mri...@gmail.com
 wrote:

 On Wed, Feb 26, 2014 at 5:31 AM, Patrick Wendell pwend...@gmail.com
 wrote:
 Evan - this is a good thing to bring up. Wrt the shader plug-in -
 right now we don't actually use it for bytecode shading - we simply
 use it for creating the uber jar with excludes (which sbt supports
 just fine via assembly).


 Not really - as I mentioned initially in this thread, sbt's assembly
 does not take dependencies into account properly and can overwrite
 newer classes with older versions.
 From an assembly point of view, sbt is not very good: we are yet to
 try it after the 2.10 shift though (and probably won't, given the mess it
 created last time).

 Regards,
 Mridul






 I was wondering actually, do you know if it's possible to add shaded
 artifacts to the *spark jar* using this plug-in (e.g. not an uber
 jar)? That's something I could see being

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-28 Thread Patrick Wendell
Hey,

Thanks everyone for chiming in on this. I wanted to summarize these
issues a bit particularly wrt the constituents involved - does this
seem accurate?

= Spark Users =
In general those linking against Spark should be totally unaffected by
the build choice. Spark will continue to publish well-formed poms and
jars to maven central. This is a no-op wrt this decision.

= Spark Developers =
There are two concerns. (a) General day-to-day development and
packaging and (b) Spark binaries and packages for distribution.

For (a) - sbt seems better because it's just nicer for doing scala
development (incremental compilation is simple, we have some
home-baked tools for compiling Spark vs. the spark deps, etc). The
argument that maven has more general know-how hasn't, at least so far,
affected us in the ~2 years we've maintained both builds - adding
stuff to the Maven build is typically just as annoying/difficult as
with sbt.

For (b) - Some non-specific concerns were raised about bugs with the
sbt assembly package - we should look into this and see what is going
on. Maven has better out-of-the-box support for publishing to Maven
central; we'd have to do some manual work on our end to make this work
well with sbt.
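
For reference, the kind of manual sbt settings this involves might look
roughly like the following (the repository URL and credentials handling are
assumptions, not a worked-out release setup):

  // hypothetical sbt publishing settings -- a sketch, not Spark's actual release config
  publishMavenStyle := true
  publishTo := Some("apache-staging" at
    "https://repository.apache.org/service/local/staging/deploy/maven2")
  credentials += Credentials(Path.userHome / ".ivy2" / ".credentials")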

= Downstream Integrators =
On this one it seems that Maven is the universal favorite, largely
because of community awareness of Maven and comfort with Maven builds.
Some things like restructuring the Spark build to inherit config
values from a vendor build will not be possible with sbt (though
fairly straightforward to work around). Other cases where vendors have
directly modified or inherited the Spark build won't work anymore if
we standardize on SBT. These have no obvious workaround at this point
as far as I can see.

- Patrick

On Wed, Feb 26, 2014 at 7:09 PM, Mridul Muralidharan mri...@gmail.com wrote:
 On Feb 26, 2014 11:12 PM, Patrick Wendell pwend...@gmail.com wrote:

 @mridul - As far as I know both Maven and Sbt use fairly similar
 processes for building the assembly/uber jar. We actually used to
 package spark with sbt and there were no specific issues we
 encountered and AFAIK sbt respects versioning of transitive
 dependencies correctly. Do you have a specific bug listing for sbt
 that indicates something is broken?

 Slightly longish ...

 The assembled jar, generated via sbt, broke all over the place while I was
 adding yarn support in 0.6 - and I had to fix the sbt project a fair bit to get
 it to work: we need the assembled jar to submit a yarn job.

 When I finally submitted those changes to 0.7, it broke even more - since
 dependencies changed: someone else had thankfully already added maven
 support by then - which worked remarkably well out of the box (with some
 minor tweaks)!

 In theory, they might be expected to work the same, but practically they
 did not: as I mentioned, it must just have been luck that maven worked
 that well; but given multiple past nasty experiences with sbt, and the fact
 that it does not bring anything compelling or new in contrast, I am fairly
 against the idea of using only sbt - in spite of maven being unintuitive at
 times.

 Regards,
 Mridul


 @sandy - It sounds like you are saying that the CDH build would be
 easier with Maven because you can inherit the POM. However, is this
 just a matter of convenience for packagers or would standardizing on
 sbt limit capabilities in some way? I assume that it would just mean a
 bit more manual work for packagers having to figure out how to set the
 hadoop version in SBT and exclude certain dependencies. For instance,
 what does CDH do about other components like Impala that are not based on
 Maven at all?

 On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan e...@ooyala.com wrote:
  I'd like to propose the following way to move forward, based on the
  comments I've seen:
 
  1.  Aggressively clean up the giant dependency graph.   One ticket I
  might work on if I have time is SPARK-681 which might remove the giant
  fastutil dependency (~15MB by itself).
 
  2.  Take an intermediate step by having only ONE source of truth
  w.r.t. dependencies and versions.  This means either:
 a)  Using a maven POM as the spec for dependencies, Hadoop version,
  etc.   Then, use sbt-pom-reader to import it.
 b)  Using the build.scala as the spec, and sbt make-pom to
  generate the pom.xml for the dependencies
 
  The idea is to remove the pain and errors associated with manual
  translation of dependency specs from one system to another, while
  still maintaining the things which are hard to translate (plugins).
 
 
  On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers ko...@tresata.com
 wrote:
  We maintain an in-house spark build using sbt. We have no problem using sbt
  assembly. We did add a few exclude statements for transitive dependencies.
 
  The main enemies of assemblies are jars that include stuff they shouldn't
  (kryo comes to mind, I think they include logback?), new versions of jars
  that change the provider/artifact

Updated Developer Docs

2014-03-04 Thread Patrick Wendell
Hey All,

Just a heads up that there are a bunch of updated developer docs on
the wiki including posting the dates around the current merge window.
Some of the new docs might be useful for developers/committers:

https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage

Cheers,
- Patrick


Re: Spark 0.9.0 and log4j

2014-03-08 Thread Patrick Wendell
Evan I actually remembered that Paul Brown (who also reported this
issue) tested it and found that it worked. I'm going to merge this
into master and branch 0.9, so please give it a spin when you have a
chance.

- Patrick

On Sat, Mar 8, 2014 at 2:00 PM, Patrick Wendell pwend...@gmail.com wrote:
 Hey Evan,

 This is being tracked here:
 https://spark-project.atlassian.net/browse/SPARK-1190

 That patch didn't get merged but I've just opened a new one here:
 https://github.com/apache/spark/pull/107/files

 Would you have any interest in testing this? I want to make sure it
 works for users who are using logback.

 I'd like to get this merged quickly since it's one of the only
 remaining blockers for Spark 0.9.1.

 - Patrick



 On Fri, Mar 7, 2014 at 11:04 AM, Evan Chan e...@ooyala.com wrote:
 Hey guys,

 This is a follow-up to this semi-recent thread:
 http://apache-spark-developers-list.1001551.n3.nabble.com/0-9-0-forces-log4j-usage-td532.html

 0.9.0 final is causing issues for us as well because we use Logback as
 our backend and Spark requires Log4j now.

 I see Patrick has a PR #560 to incubator-spark, was that merged in or
 left out?

 Also I see references to a new PR that might fix this, but I can't
 seem to find it in the github open PR page.   Anybody have a link?

 As a last resort we can switch to Log4j, but would rather not have to
 do that if possible.

 thanks,
 Evan

 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |


Re: 0.9.0 forces log4j usage

2014-03-08 Thread Patrick Wendell
The fix for this was just merged into branch 0.9 (will be in 0.9.1+) and master.

On Sun, Feb 9, 2014 at 11:44 PM, Patrick Wendell pwend...@gmail.com wrote:
 Thanks Paul - it isn't meant to be a full solution but just a fix for
 the 0.9 branch - for the full solution there is another PR by Sean
 Owen.

 On Sun, Feb 9, 2014 at 11:35 PM, Paul Brown p...@mult.ifario.us wrote:
 Hi, Patrick --

 I gave that a go locally, and it works as desired.

 Best.
 -- Paul

 --
 p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


 On Fri, Feb 7, 2014 at 6:10 PM, Patrick Wendell pwend...@gmail.com wrote:

 Ah okay sounds good. This is what I meant earlier by "You have
 some other application that directly calls log4j", i.e. you have
 for historical reasons installed the log4j-over-slf4j.

 Would you mind trying out this fix and seeing if it works? This is
 designed to be a hotfix for 0.9, not a general solution where we rip
 out log4j from our published dependencies:

 https://github.com/apache/incubator-spark/pull/560/files

 - Patrick

 On Fri, Feb 7, 2014 at 5:57 PM, Paul Brown p...@mult.ifario.us wrote:
  Hi, Patrick --
 
  I forget which other component is responsible, but we're using the
  log4j-over-slf4j as part of an overall requirement to centralize logging,
  i.e., *someone* else is logging over log4j and we're pulling that in.
   (There's also some jul logging from Jersey, etc.)
 
  Goals:
 
  - Fully control/capture all possible logging.  (God forbid we have to
 grab
  System.out/err, but we'd do it if needed.)
  - Use the backend we like best at the moment.  (Happens to be logback.)
 
  Possible cases:
 
  - If Spark used Log4j at all, we would pull in that logging via
  log4j-over-slf4j.
  - If Spark used only slf4j and referenced no backend, we would use it
 as-is
  although we'd still have the log4j-over-slf4j because of other libraries.
  - If Spark used only slf4j and referenced the slf4j-log4j12 backend, we
  would exclude that one dependency (via our POM).
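
   As an illustration, the sbt equivalent of that POM exclusion might look
   roughly like this (coordinates and versions here are examples):

     // sketch: depend on Spark but swap the logging backend
     libraryDependencies += ("org.apache.spark" %% "spark-core" % "0.9.0")
       .exclude("org.slf4j", "slf4j-log4j12")
     libraryDependencies ++= Seq(
       "ch.qos.logback" % "logback-classic"  % "1.1.1",  // the backend we actually want
       "org.slf4j"      % "log4j-over-slf4j" % "1.7.5"   // route direct log4j calls into slf4j
     )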
 
  Best.
  -- Paul
 
 
  --
  p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
 
 
  On Fri, Feb 7, 2014 at 5:38 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  Hey Paul,
 
   So if your goal is ultimately to output to logback, then why don't you
   just use slf4j and logback-classic.jar as described here [1]? Why
   involve log4j-over-slf4j at all?
 
  Let's say we refactored the spark build so it didn't advertise
  slf4j-log4j12 as a dependency. Would you still be using
   log4j-over-slf4j... or is this just a fix to deal with the fact that
   Spark is somewhat log4j-dependent at this point?
 
  [1] http://www.slf4j.org/manual.html
 
  - Patrick
 
  On Fri, Feb 7, 2014 at 5:14 PM, Paul Brown p...@mult.ifario.us wrote:
   Hi, Patrick --
  
   That's close but not quite it.
  
   The issue that occurs is not the delegation loop mentioned in slf4j
   documentation.  The stack overflow is entirely within the code in the
  Spark
   trait:
  
   at org.apache.spark.Logging$class.initializeLogging(Logging.scala:112)
   at
 org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:97)
   at org.apache.spark.Logging$class.log(Logging.scala:36)
   at org.apache.spark.SparkEnv$.log(SparkEnv.scala:94)
  
  
   And then that repeats.
  
   As for our situation, we exclude the slf4j-log4j12 dependency when we
   import the Spark library (because we don't want to use log4j) and have
   log4j-over-slf4j already in place to ensure that all of the logging in
  the
   overall application runs through slf4j and then out through logback.
  (We
    also, as another poster already mentioned, force jcl and jul
 through
   slf4j.)
  
   The zen of slf4j for libraries is that the library uses the slf4j API
 and
   then the enclosing application can route logging as it sees fit.
  Spark
   master CLI would log via slf4j and include the slf4j-log4j12 backend;
  same
   for Spark worker CLI.  Spark as a library (versus as a container)
 would
  not
   include any backend to the slf4j API and leave this up to the
  application.
(FWIW, this would also avoid your log4j warning message.)
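
    In code, the pattern being described is simply the following (a generic
    sketch, not Spark's actual Logging trait):

      // a library class that only talks to the slf4j API; the application picks the backend
      import org.slf4j.{Logger, LoggerFactory}

      class MyLibraryComponent {
        private val log: Logger = LoggerFactory.getLogger(classOf[MyLibraryComponent])
        def doWork(): Unit = log.info("no logging backend is chosen here")
      }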
  
   But as I was saying before, I'd be happy with a situation where I can
  avoid
   log4j being enabled or configured, and I think you'll find an existing
   choice of logging framework to be a common scenario for those
 embedding
   Spark in other systems.
  
   Best.
   -- Paul
  
   --
   p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
  
  
   On Fri, Feb 7, 2014 at 3:01 PM, Patrick Wendell pwend...@gmail.com
  wrote:
  
   Paul,
  
   Looking back at your problem. I think it's the one here:
   http://www.slf4j.org/codes.html#log4jDelegationLoop
  
   So let me just be clear what you are doing so I understand. You have
   some other application that directly calls log4j. So you have to
   include log4j-over-slf4j to route those logs through slf4j to
 logback.
  
   At the same time you embed Spark in this application

Help vote for Spark talks at the Hadoop Summit

2014-03-13 Thread Patrick Wendell
Hey All,

The Hadoop Summit uses community choice voting to decide which talks
to feature. It would be great if the community could help vote for
Spark talks so that Spark has a good showing at this event. You can
make three votes on each track. Below I've listed Spark talks in each
of the tracks - voting closes tomorrow so vote now!!

Building a Unified Data Pipeline in Apache Spark
bit.ly/O8USIq
(Committer Track)

Building a Data Processing System for Real Time Auctions
bit.ly/1ij3XJJ
(Business Apps Track)

SparkR: Enabling Interactive Data Science at Scale on Hadoop
bit.ly/1kPQUlG
(Data Science Track)

Recent Developments in Spark MLlib and Beyond
bit.ly/1hgZW5D
(The Future of Apache Hadoop Track)

Cheers,
- Patrick


Github reviews now going to separate reviews@ mailing list

2014-03-16 Thread Patrick Wendell
Hey All,

We've created a new list called revi...@spark.apache.org which will
contain the contents from the github pull requests and comments.

Note that these e-mails will no longer appear on the dev list. Thanks
to Apache Infra for helping us set this up.

To subscribe to this e-mail:
reviews-subscr...@spark.apache.org

- Patrick


Re: repositories for spark jars

2014-03-17 Thread Patrick Wendell
Hey Nathan,

I don't think this would be possible because there are at least dozens
of permutations of Hadoop versions (different vendor distros X
different versions X YARN vs not YARN, etc) and maybe hundreds. So
publishing new artifacts for each would be really difficult.

What is the exact problem you ran into? Maybe we need to improve the
documentation to make it more clear how to correctly link against
spark/hadoop for user applications. Basically the model we have now is
users link against Spark and then link against the hadoop-client
relevant to their version of Hadoop.
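
In sbt terms that model looks roughly like the following; the versions here
are only examples:

  // sketch: an application links Spark plus the hadoop-client matching its cluster
  libraryDependencies ++= Seq(
    "org.apache.spark"  %% "spark-core"    % "0.9.0",
    "org.apache.hadoop" %  "hadoop-client" % "2.2.0"  // pick the version your cluster runs
  )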

- Patrick

On Mon, Mar 17, 2014 at 9:50 AM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
 After just spending a couple days fighting with a new spark installation,
 getting spark and hadoop version numbers matching everywhere, I have a
 suggestion I'd like to put out there.

 Can we put the hadoop version against which the spark jars were built into
 the version number?

 I noticed that the Cloudera maven repo has started to do this (
 https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-core_2.10/)
 - sadly, though, only with the cdh5.x versions, not with the 4.x versions
 for which they also have spark parcels.  But I see no signs of it in the
 central maven repo.

 Is this already done in some other repo about which I don't know, perhaps?

 I know it would save us a lot of time and grief simply to be able to point
 a project we build at the right version, and not have to rebuild and deploy
 spark manually.

 --
 Nathan Kronenfeld
 Senior Visualization Developer
 Oculus Info Inc
 2 Berkeley Street, Suite 600,
 Toronto, Ontario M5A 4J5
 Phone:  +1-416-203-3003 x 238
 Email:  nkronenf...@oculusinfo.com


Re: Announcing the official Spark Job Server repo

2014-03-19 Thread Patrick Wendell
Evan - yep definitely open a JIRA. It would be nice to have a contrib
repo set-up for the 1.0 release.

On Tue, Mar 18, 2014 at 11:28 PM, Evan Chan e...@ooyala.com wrote:
 Matei,

 Maybe it's time to explore the spark-contrib idea again?   Should I
 start a JIRA ticket?

 -Evan


 On Tue, Mar 18, 2014 at 4:04 PM, Matei Zaharia matei.zaha...@gmail.com 
 wrote:
 Cool, glad to see this posted! I've added a link to it at 
 https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark.

 Matei

 On Mar 18, 2014, at 1:51 PM, Evan Chan e...@ooyala.com wrote:

 Dear Spark developers,

 Ooyala is happy to announce that we have pushed our official, Spark
 0.9.0 / Scala 2.10-compatible, job server as a github repo:

 https://github.com/ooyala/spark-jobserver

 Complete with unit tests, deploy scripts, and examples.

 The original PR (#222) on incubator-spark is now closed.

 Please have a look; pull requests are very welcome.
 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |




 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |


Re: Spark 0.9.1 release

2014-03-20 Thread Patrick Wendell
Hey Tom,

 I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on 
 YARN - JIRA and  [SPARK-1051] On Yarn, executors don't doAs as submitting 
 user - JIRA in.  The pyspark one I would consider more of an enhancement so 
 might not be appropriate for a point release.

Someone recently sent me a personal e-mail reporting some problems
with this. I'll ask them to forward it to you/the dev list. Might be
worth looking into before merging.

  [SPARK-1051] On Yarn, executors don't doAs as submitting user - JIRA
 This means that they can't write/read from files that the yarn user doesn't 
 have permissions to but the submitting user does.

Good call on this one.

- Patrick


Re: Spark 0.9.1 release

2014-03-24 Thread Patrick Wendell
Hey Evan and TD,

Spark's dependency graph in a maintenance release seems potentially
harmful, especially when upgrading a minor version (not just a patch
version) like this. This could affect other downstream users. For
instance, their fastutil dependency gets bumped without their knowing it
and they hit some new problem in fastutil 6.5.

- Patrick

On Mon, Mar 24, 2014 at 12:02 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 @Shivaram, That is a useful patch but I am a bit afraid to merge it in.
 Randomizing the executor offers has performance implications, especially for Spark
 Streaming. The non-randomized ordering of allocating machines to tasks was
 subtly helping to speed up certain window-based shuffle operations.  For
 example, corresponding shuffle partitions in multiple shuffles using the
 same partitioner were likely to be co-located, that is, shuffle partition 0
 was likely to be on the same machine for multiple shuffles. While this is
 not a reliable mechanism to rely on, randomization may lead to
 performance degradation. So I am afraid to merge this one without
 understanding the consequences.

 @Evan, I have already cut a release! You can submit the PR and we can merge
 it into branch-0.9. If we have to cut another release, then we can include it.



 On Sun, Mar 23, 2014 at 11:42 PM, Evan Chan e...@ooyala.com wrote:

 I also have a really minor fix for SPARK-1057  (upgrading fastutil),
 could that also make it in?

 -Evan


 On Sun, Mar 23, 2014 at 11:01 PM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  Sorry this request is coming in a bit late, but would it be possible to
  backport SPARK-979[1] to branch-0.9 ? This is the patch for randomizing
  executor offers and I would like to use this in a release sooner rather
  than later.
 
  Thanks
  Shivaram
 
  [1]
 
 https://github.com/apache/spark/commit/556c56689bbc32c6cec0d07b57bd3ec73ceb243e#diff-8ef3258646b0e6a4793d6ad99848eacd
 
 
  On Thu, Mar 20, 2014 at 10:18 PM, Bhaskar Dutta bhas...@gmail.com
 wrote:
 
  Thank You! We plan to test out 0.9.1 on YARN once it is out.
 
  Regards,
  Bhaskar
 
  On Fri, Mar 21, 2014 at 12:42 AM, Tom Graves tgraves...@yahoo.com
 wrote:
 
   I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running
   on YARN - JIRA and [SPARK-1051] On Yarn, executors don't doAs as
   submitting user - JIRA in.  The pyspark one I would consider more of an
   enhancement so might not be appropriate for a point release.
  
  
[SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on
 YA...
   org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set at
  
 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:49)
   at org.apache.spark.schedule...
  
  
[SPARK-1051] On Yarn, executors don't doAs as submitting user - JIRA
   This means that they can't write/read from files that the yarn user
   doesn't have permissions to but the submitting user does.
  
  
  
  
  
   On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta bhas...@gmail.com
 
   wrote:
  
   It will be great if
   SPARK-1101https://spark-project.atlassian.net/browse/SPARK-1101:
   Umbrella
   for hardening Spark on YARN can get into 0.9.1.
  
   Thanks,
   Bhaskar
  
  
   On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
   tathagata.das1...@gmail.comwrote:
  
 Hello everyone,
   
 Since the release of Spark 0.9, we have received a number of important bug
 fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
 going to cut a release candidate soon and we would love it if people test
 it out. We have backported several bug fixes into the 0.9 branch and updated
 JIRA accordingly:

 https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)

 Please let me know if there are fixes that were not backported but you
 would like to see them in 0.9.1.
   
Thanks!
   
TD
   
  
 



 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |



Re: Spark 0.9.1 release

2014-03-24 Thread Patrick Wendell
 Spark's dependency graph in a maintenance
*Modifying* Spark's dependency graph...


Re: Travis CI

2014-03-25 Thread Patrick Wendell
That's not correct - like Michael said the Jenkins build remains the
reference build for now.

On Tue, Mar 25, 2014 at 7:03 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
 I assume the Jenkins is not working now?

 Best,

 --
 Nan Zhu


 On Tuesday, March 25, 2014 at 6:42 PM, Michael Armbrust wrote:

 Just a quick note to everyone that Patrick and I are playing around with
 Travis CI on the Spark github repository. For now, travis does not run all
 of the test cases, so will only be turned on experimentally. Long term it
 looks like Travis might give better integration with github, so we are
 going to see if it is feasible to get all of our tests running on it.

 *Jenkins remains the reference CI and should be consulted before merging
 pull requests, independent of what Travis says.*

 If you have any questions or want to help out with the investigation, let
 me know!

 Michael




Re: Travis CI

2014-03-25 Thread Patrick Wendell
Ya, it's been a little bit slow lately because of a high error rate in
interactions with the GitHub API. Unfortunately we are pretty slammed
for the release and haven't had a ton of time to do further debugging.

- Patrick

On Tue, Mar 25, 2014 at 7:13 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
 I just found that the Jenkins is not working from this afternoon

 for one PR, the first build failed after 90 minutes; the second has been
 running for more than 2 hours with no result returned

 Best,

 --
 Nan Zhu


 On Tuesday, March 25, 2014 at 10:06 PM, Patrick Wendell wrote:

 That's not correct - like Michael said the Jenkins build remains the
 reference build for now.

 On Tue, Mar 25, 2014 at 7:03 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

 I assume the Jenkins is not working now?

 Best,

 --
 Nan Zhu


 On Tuesday, March 25, 2014 at 6:42 PM, Michael Armbrust wrote:

 Just a quick note to everyone that Patrick and I are playing around with
 Travis CI on the Spark github repository. For now, travis does not run all
 of the test cases, so will only be turned on experimentally. Long term it
 looks like Travis might give better integration with github, so we are
 going to see if it is feasible to get all of our tests running on it.

 *Jenkins remains the reference CI and should be consulted before merging
 pull requests, independent of what Travis says.*

 If you have any questions or want to help out with the investigation, let
 me know!

 Michael




Re: Spark 0.9.1 release

2014-03-26 Thread Patrick Wendell
Hey TD,

This one we just merged into master this morning:
https://spark-project.atlassian.net/browse/SPARK-1322

It should definitely go into the 0.9 branch because there was a bug in the
semantics of top() which at this point is unreleased in Python.

I didn't backport it yet because I figured you might want to do this at a
specific time. So please go ahead and backport it. Not sure whether this
warrants another RC.

- Patrick


On Tue, Mar 25, 2014 at 10:47 PM, Mridul Muralidharan mri...@gmail.comwrote:

 On Wed, Mar 26, 2014 at 10:53 AM, Tathagata Das
 tathagata.das1...@gmail.com wrote:
  PR 159 seems like a fairly big patch to me. And quite recent, so its
 impact
  on the scheduling is not clear. It may also depend on other changes that
  may have gotten into the DAGScheduler but not pulled into branch 0.9. I
 am
  not sure it is a good idea to pull that in. We can pull those changes
 later
  for 0.9.2 if required.


 There is no impact on scheduling: it only has an impact on error
 handling - it ensures that you can actually use spark on yarn in
 multi-tenant clusters more reliably.
 Currently, any reasonably long-running job (30 mins+) working on a
 non-trivial dataset will fail due to accumulated failures in spark.


 Regards,
 Mridul


 
  TD
 
 
 
 
  On Tue, Mar 25, 2014 at 8:44 PM, Mridul Muralidharan mri...@gmail.com
 wrote:
 
  Forgot to mention this in the earlier request for PR's.
  If there is another RC being cut, please add
  https://github.com/apache/spark/pull/159 to it too (if not done
  already !).
 
  Thanks,
  Mridul
 
  On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
  tathagata.das1...@gmail.com wrote:
Hello everyone,
  
    Since the release of Spark 0.9, we have received a number of important bug
    fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
    going to cut a release candidate soon and we would love it if people test
    it out. We have backported several bug fixes into the 0.9 branch and updated
    JIRA accordingly:

  https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)

    Please let me know if there are fixes that were not backported but you
    would like to see them in 0.9.1.
  
   Thanks!
  
   TD
 



Re: JIRA. github and asf updates

2014-03-29 Thread Patrick Wendell
Mridul,

You can unsubscribe yourself from any of these sources, right?

- Patrick


On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan mri...@gmail.comwrote:

 Hi,

   So we are now receiving updates from three sources for each change to
 the PR.
  While each of them handles a corner case which the others might miss, it
 would be great if we could minimize the volume of duplicated
 communication.


 Regards,
 Mridul



Could you undo the JIRA dev list e-mails?

2014-03-29 Thread Patrick Wendell
Hey Chris,

I don't think our JIRA has been fully migrated to Apache infra, so it's
really confusing to send people e-mails referring to the new JIRA since we
haven't announced it yet. There is some content there because we've been
trying to do the migration, but I'm not sure it's entirely finished.

Also, right now our github comments go to a commits@ list. I'm actually -1
copying all of these to JIRA because we do a bunch of review level comments
that are going to pollute the JIRA a bunch.

In any case, can you revert the change whatever it was that sent these to
the dev list? We should have a coordinated plan about this transition and
the e-mail changes we plan to make.

- Patrick


Re: Could you undo the JIRA dev list e-mails?

2014-03-29 Thread Patrick Wendell
Okay I think I managed to revert this by just removing jira@a.o from our
dev list.


On Sat, Mar 29, 2014 at 11:37 AM, Patrick Wendell pwend...@gmail.comwrote:

 Hey Chris,

 I don't think our JIRA has been fully migrated to Apache infra, so it's
 really confusing to send people e-mails referring to the new JIRA since we
 haven't announced it yet. There is some content there because we've been
 trying to do the migration, but I'm not sure it's entirely finished.

 Also, right now our github comments go to a commits@ list. I'm actually
 -1 copying all of these to JIRA because we do a bunch of review level
 comments that are going to pollute the JIRA a bunch.

 In any case, can you revert the change whatever it was that sent these to
 the dev list? We should have a coordinated plan about this transition and
 the e-mail changes we plan to make.

 - Patrick



Re: JIRA. github and asf updates

2014-03-29 Thread Patrick Wendell
I'm working with infra to get the following set-up:

1. Don't post github updates to jira comments (they are too low level). If
users want these they can subscribe to commits@s.a.o.
2. Jira comment stream will go to issues@s.a.o so people can opt into that.

One thing the YARN project has set up is to e-mail *new* JIRAs to the dev
list. That might be cool to set up for Spark in the future.


On Sat, Mar 29, 2014 at 1:15 PM, Mridul Muralidharan mri...@gmail.comwrote:

 If the PR comments are going to be replicated into the JIRAs and they
 are going to be sent to dev@, then we could keep that and remove the
 [Github] updates?
 The last was added since discussions were happening off apache lists -
 which should be handled by the jira updates?

 I don't mind the mails if they have content - this is just duplication
 of the same message in three mails :-)
 Btw, this is a good problem to have - a vibrant and very actively
 engaged community generates a lot of meaningful traffic!
 I just don't want to get distracted from it by repetitions.

 Regards,
 Mridul


 On Sat, Mar 29, 2014 at 11:46 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Ah sorry I see - Jira updates are going to the dev list. Maybe that's not
  desirable. I think we should send them to the issues@ list.
 
 
  On Sat, Mar 29, 2014 at 11:16 AM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  Mridul,
 
  You can unsubscribe yourself from any of these sources, right?
 
  - Patrick
 
 
  On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan mri...@gmail.com
 wrote:
 
  Hi,
 
So we are now receiving updates from three sources for each change to
  the PR.
   While each of them handles a corner case which the others might miss, it
  would be great if we could minimize the volume of duplicated
  communication.
 
 
  Regards,
  Mridul
 
 
 



Re: [VOTE] Release Apache Spark 0.9.1 (RC3)

2014-03-30 Thread Patrick Wendell
TD - I downloaded and did some local testing. Looks good to me!

+1

You should cast your own vote - at that point it's enough to pass.

- Patrick


On Sun, Mar 30, 2014 at 9:47 PM, prabeesh k prabsma...@gmail.com wrote:

 +1
 tested on Ubuntu12.04 64bit


 On Mon, Mar 31, 2014 at 3:56 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

  +1 tested on Mac OS X.
 
  Matei
 
  On Mar 27, 2014, at 1:32 AM, Tathagata Das tathagata.das1...@gmail.com
  wrote:
 
   Please vote on releasing the following candidate as Apache Spark
 version
  0.9.1
  
   A draft of the release notes along with the CHANGES.txt file is
   attached to this e-mail.
  
   The tag to be voted on is v0.9.1-rc3 (commit 4c43182b):
  
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4c43182b6d1b0b7717423f386c0214fe93073208
  
   The release files, including signatures, digests, etc. can be found at:
   http://people.apache.org/~tdas/spark-0.9.1-rc3/
  
   Release artifacts are signed with the following key:
   https://people.apache.org/keys/committer/tdas.asc
  
   The staging repository for this release can be found at:
  
 https://repository.apache.org/content/repositories/orgapachespark-1009/
  
   The documentation corresponding to this release can be found at:
   http://people.apache.org/~tdas/spark-0.9.1-rc3-docs/
  
   Please vote on releasing this package as Apache Spark 0.9.1!
  
   The vote is open until Sunday, March 30, at 10:00 UTC and passes if
   a majority of at least 3 +1 PMC votes are cast.
  
   [ ] +1 Release this package as Apache Spark 0.9.1
   [ ] -1 Do not release this package because ...
  
   To learn more about Apache Spark, please see
   http://spark.apache.org/
   CHANGES.txtRELEASE_NOTES.txt
 
 



Re: [VOTE] Release Apache Spark 0.9.1 (RC3)

2014-03-31 Thread Patrick Wendell
Yeah good point. Let's just extend this vote another few days?


On Mon, Mar 31, 2014 at 8:12 AM, Tom Graves tgraves...@yahoo.com wrote:

 I should probably pull this off into another thread, but going forward can
 we try to not have the release votes end on a weekend? Since we only seem
 to give 3 days, it makes it really hard for anyone who is offline for the
  weekend to try it out.   Either that or extend the voting for more than 3
 days.

 Tom
 On Monday, March 31, 2014 12:50 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 TD - I downloaded and did some local testing. Looks good to me!

 +1

 You should cast your own vote - at that point it's enough to pass.

 - Patrick



 On Sun, Mar 30, 2014 at 9:47 PM, prabeesh k prabsma...@gmail.com wrote:

  +1
  tested on Ubuntu12.04 64bit
 
 
  On Mon, Mar 31, 2014 at 3:56 AM, Matei Zaharia matei.zaha...@gmail.com
  wrote:
 
   +1 tested on Mac OS X.
  
   Matei
  
   On Mar 27, 2014, at 1:32 AM, Tathagata Das 
 tathagata.das1...@gmail.com
   wrote:
  
Please vote on releasing the following candidate as Apache Spark
  version
   0.9.1
   
A draft of the release notes along with the CHANGES.txt file is
attached to this e-mail.
   
The tag to be voted on is v0.9.1-rc3 (commit 4c43182b):
   
  
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4c43182b6d1b0b7717423f386c0214fe93073208
   
The release files, including signatures, digests, etc. can be found
 at:
http://people.apache.org/~tdas/spark-0.9.1-rc3/
   
Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/tdas.asc
   
The staging repository for this release can be found at:
   
  https://repository.apache.org/content/repositories/orgapachespark-1009/
   
The documentation corresponding to this release can be found at:
http://people.apache.org/~tdas/spark-0.9.1-rc3-docs/
   
Please vote on releasing this package as Apache Spark 0.9.1!
   
The vote is open until Sunday, March 30, at 10:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.
   
[ ] +1 Release this package as Apache Spark 0.9.1
[ ] -1 Do not release this package because ...
   
To learn more about Apache Spark, please see
http://spark.apache.org/
CHANGES.txtRELEASE_NOTES.txt
  
  
 



Re: sbt-package-bin

2014-04-01 Thread Patrick Wendell
And there is a deb target as well - ah didn't see Mark's email.


On Tue, Apr 1, 2014 at 11:36 AM, Patrick Wendell pwend...@gmail.com wrote:

 Ya there is already some fragmentation here. Maven has some dist targets
 and there is also ./make-distribution.sh.


 On Tue, Apr 1, 2014 at 11:31 AM, Mark Hamstra m...@clearstorydata.comwrote:

 A basic Debian package can already be created from the Maven build: mvn
 -Pdeb ...


 On Tue, Apr 1, 2014 at 11:24 AM, Evan Chan e...@ooyala.com wrote:

  Also, I understand this is the last week / merge window for 1.0, so if
  folks are interested I'd like to get in a PR quickly.
 
  thanks,
  Evan
 
 
 
  On Tue, Apr 1, 2014 at 11:24 AM, Evan Chan e...@ooyala.com wrote:
 
   Hey folks,
  
   We are in the middle of creating a Chef recipe for Spark.   As part of
   that we want to create a Debian package for Spark.
  
   What do folks think of adding the sbt-package-bin plugin to allow easy
   creation of a Spark .deb file?  I believe it adds all dependency jars
  into
   a single lib/ folder, so in some ways it's even easier to manage than
 the
   assembly.
  
   Also I'm not sure if there's an equivalent plugin for Maven.
  
   thanks,
   Evan
  
  
   --
   --
Evan Chan
   Staff Engineer
   e...@ooyala.com  |
  
   http://www.ooyala.com/ http://www.facebook.com/ooyala
   http://www.linkedin.com/company/ooyala | http://www.twitter.com/ooyala
  
  
 
 
  --
  --
  Evan Chan
  Staff Engineer
  e...@ooyala.com  |
 
  http://www.ooyala.com/
   http://www.facebook.com/ooyala | http://www.linkedin.com/company/ooyala
 
  http://www.twitter.com/ooyala
 





Re: sbt-package-bin

2014-04-01 Thread Patrick Wendell
Ya there is already some fragmentation here. Maven has some dist targets
and there is also ./make-distribution.sh.


On Tue, Apr 1, 2014 at 11:31 AM, Mark Hamstra m...@clearstorydata.comwrote:

 A basic Debian package can already be created from the Maven build: mvn
 -Pdeb ...


 On Tue, Apr 1, 2014 at 11:24 AM, Evan Chan e...@ooyala.com wrote:

  Also, I understand this is the last week / merge window for 1.0, so if
  folks are interested I'd like to get in a PR quickly.
 
  thanks,
  Evan
 
 
 
  On Tue, Apr 1, 2014 at 11:24 AM, Evan Chan e...@ooyala.com wrote:
 
   Hey folks,
  
   We are in the middle of creating a Chef recipe for Spark.   As part of
   that we want to create a Debian package for Spark.
  
   What do folks think of adding the sbt-package-bin plugin to allow easy
   creation of a Spark .deb file?  I believe it adds all dependency jars
  into
   a single lib/ folder, so in some ways it's even easier to manage than
 the
   assembly.
  
   Also I'm not sure if there's an equivalent plugin for Maven.
  
   thanks,
   Evan
  
  
   --
   --
Evan Chan
   Staff Engineer
   e...@ooyala.com  |
  
   http://www.ooyala.com/ http://www.facebook.com/ooyala
   http://www.linkedin.com/company/ooyala | http://www.twitter.com/ooyala
  
  
 
 
  --
  --
  Evan Chan
  Staff Engineer
  e...@ooyala.com  |
 
  http://www.ooyala.com/
   http://www.facebook.com/ooyala | http://www.linkedin.com/company/ooyala
 
  http://www.twitter.com/ooyala
 



Re: Would anyone mind having a quick look at PR#288?

2014-04-02 Thread Patrick Wendell
Hey Evan,

Ya thanks, this is a pretty small patch. Should definitely be doable for
1.0.

- Patrick


On Wed, Apr 2, 2014 at 10:25 AM, Evan Chan e...@ooyala.com wrote:

 https://github.com/apache/spark/pull/288

 It's for fixing SPARK-1154, which would help Spark be a better citizen for
 most deploys, and should be really small and easy to review.

 thanks,
 Evan


 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |

 http://www.ooyala.com/
  http://www.facebook.com/ooyala | http://www.linkedin.com/company/ooyala
 http://www.twitter.com/ooyala



Re: Recent heartbeats

2014-04-04 Thread Patrick Wendell
I answered this over on the user list...


On Fri, Apr 4, 2014 at 6:13 PM, Debasish Das debasish.da...@gmail.comwrote:

 Hi,

 Also posted it on user but then I realized it might be more involved.

 In my ALS runs I am noticing messages that complain about heart beats:

 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
 BlockManagerId(17, machine1, 53419, 0) with no recent heart beats: 48476ms
 exceeds 45000ms
 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
 BlockManagerId(12, machine2, 60714, 0) with no recent heart beats: 45328ms
 exceeds 45000ms
 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
 BlockManagerId(19, machine3, 39496, 0) with no recent heart beats: 53259ms
 exceeds 45000ms

 Is this some issue with the underlying jvm over which akka is run? Can I
 increase the heartbeat timeout somehow to get these messages resolved?

 Any more insight about the possible cause of these heartbeat warnings would
 be helpful...

 Thanks.
 Deb



Re: Flaky streaming tests

2014-04-07 Thread Patrick Wendell
TD - do you know what is going on here?

I looked into this a bit and at least a few of these tests use
Thread.sleep() and assume the sleep will be exact, which is wrong. We
should disable all the tests that do, and they should probably be rewritten
to virtualize time.
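
The idea behind virtualizing time, as a rough sketch (this is a generic
manual clock, not the streaming module's actual test clock):

  // tests advance a manual clock deterministically instead of sleeping
  trait Clock { def currentTime(): Long }

  class ManualClock(private var time: Long = 0L) extends Clock {
    def currentTime(): Long = time
    def advance(ms: Long): Unit = { time += ms }  // the test drives time forward
  }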

- Patrick


On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout k...@eecs.berkeley.eduwrote:

 Hi all,

 The InputStreamsSuite seems to have some serious flakiness issues -- I've
 seen the file input stream fail many times and now I'm seeing some actor
 input stream test failures (

 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull
 )
 on what I think is an unrelated change.  Does anyone know anything about
 these?  Should we just remove some of these tests since they seem to be
 constantly failing?

 -Kay



Re: It seems that jenkins for PR is not working

2014-04-15 Thread Patrick Wendell
There are a few things going on here wrt tests.

1. I fixed up the RAT issues with a hotfix.

2. The Hive tests were actually disabled for a while accidentally. A recent
fix correctly re-enabled them. Without Hive, the Spark tests run in about 40
minutes; with Hive they run in 1 hour and 15 minutes, so it's a big
difference.

To ease things I committed a patch today that only runs the Hive tests if
the change touches Spark SQL. So this should make it simpler for normal
tests.

We can actually generalize this to do much finer grained testing, e.g. if
something in MLLib changes we don't need to re-run the streaming tests.
I've added this JIRA to track it:
https://issues.apache.org/jira/browse/SPARK-1455

3. Overall we've experienced more race conditions with tests recently. I
noticed a few zombie test processes on Jenkins hogging up 100% of CPU so I
think this has triggered several previously unseen races due to CPU
contention on the test cluster. I killed them and we'll see if they crop up
again.

4. Please try to keep an eye on the length of new tests that get committed.
It's common to see people commit tests that e.g. sleep for several seconds
or do things that take a long time. Almost always this can be avoided and
usually avoiding it makes the test cleaner anyway (e.g. use proper
synchronization instead of sleeping).
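
For example, instead of sleeping for a fixed interval a test can block on a
latch with a bounded timeout (a sketch; startAsyncWork is a made-up stand-in
for whatever the test kicks off):

  import java.util.concurrent.{CountDownLatch, TimeUnit}

  // stand-in for the asynchronous operation under test
  def startAsyncWork(onComplete: () => Unit): Unit =
    new Thread(new Runnable { def run(): Unit = onComplete() }).start()

  val done = new CountDownLatch(1)
  startAsyncWork(() => done.countDown())
  assert(done.await(30, TimeUnit.SECONDS), "async work did not complete in time")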

- Patrick


On Tue, Apr 15, 2014 at 9:34 AM, Mark Hamstra m...@clearstorydata.comwrote:

 The RAT path issue is now fixed, but it appears to me that some recent
 change has dramatically altered the behavior of the testing framework, so
 that I am now seeing many individual tests taking more than a minute to run
 and the complete test run taking a very, very long time.  I expect that
 this is what is causing Jenkins to now timeout repeatedly.


 On Mon, Apr 14, 2014 at 1:32 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

  +1
 
  --
  Nan Zhu
 
 
  On Friday, April 11, 2014 at 5:35 PM, DB Tsai wrote:
 
   I always got
  
 =
  
   Could not find Apache license headers in the following files:
   !? /root/workspace/SparkPullRequestBuilder/python/metastore/db.lck
   !?
 
 /root/workspace/SparkPullRequestBuilder/python/metastore/service.properties
  
  
   Sincerely,
  
   DB Tsai
   ---
   My Blog: https://www.dbtsai.com
   LinkedIn: https://www.linkedin.com/in/dbtsai
  
  
 
 
 



Spark 1.0.0 rc3

2014-04-29 Thread Patrick Wendell
Hey All,

This is not an official vote, but I wanted to cut an RC so that people can
test against the Maven artifacts, test building with their configuration,
etc. We are still chasing down a few issues and updating docs, etc.

If you have issues or bug reports for this release, please send an e-mail
to the Spark dev list and/or file a JIRA.

Commit: d636772 (v1.0.0-rc3)
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221

Binaries:
http://people.apache.org/~pwendell/spark-1.0.0-rc3/

Docs:
http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/

Repository:
https://repository.apache.org/content/repositories/orgapachespark-1012/

== API Changes ==
If you want to test building against Spark there are some minor API
changes. We'll get these written up for the final release but I'm noting a
few here (not comprehensive):

changes to ML vector specification:
http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10

changes to the Java API:
http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior

Streaming classes have been renamed:
NetworkReceiver - Receiver
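
For example, 0.9-era code that expected Seqs from cogroup can be adapted
like this (a sketch; sc is an existing SparkContext):

  import org.apache.spark.SparkContext._  // pair-RDD implicits (pre-1.3 style)

  val rdd1 = sc.parallelize(Seq(("a", 1), ("b", 2)))
  val rdd2 = sc.parallelize(Seq(("a", "x")))
  val grouped  = rdd1.cogroup(rdd2)  // values are now Iterable, not Seq
  val asBefore = grouped.mapValues { case (vs, ws) => (vs.toSeq, ws.toSeq) }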


Re: Spark 1.0.0 rc3

2014-04-29 Thread Patrick Wendell
 What are the expectations / guarantees on binary compatibility between
 0.9 and 1.0?

There are no guarantees.


Re: Spark 1.0.0 rc3

2014-04-29 Thread Patrick Wendell
Hi Dean,

We always used the Hadoop libraries here to read and write local
files. In Spark 1.0 we started enforcing the rule that you can't
overwrite an existing directory because it can cause
confusing/undefined behavior if multiple jobs output to the directory
(they partially clobber each other's output).

https://issues.apache.org/jira/browse/SPARK-1100
https://github.com/apache/spark/pull/11

In the JIRA I actually proposed slightly deviating from Hadoop
semantics and allowing the directory to exist if it is empty, but I
think in the end we decided to just go with the exact same semantics
as Hadoop (i.e. empty directories are a problem).
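
If a job really does need to write into the same location again, the usual
workaround is to remove the output path first, e.g. (a sketch reusing the sc
and wc2 names from Dean's script below; this is not an officially supported
overwrite flag):

  // delete the output directory up front, since 1.0 refuses to overwrite it
  import org.apache.hadoop.fs.{FileSystem, Path}

  val out = new Path("output/some/directory")
  val fs  = FileSystem.get(sc.hadoopConfiguration)
  if (fs.exists(out)) fs.delete(out, true)  // recursive delete
  wc2.saveAsTextFile(out.toString)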

- Patrick

On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler deanwamp...@gmail.com wrote:
 I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using
 HDFS classes for file I/O, while the same script compiled and run with
 0.9.1 uses only the local-mode file I/O.

 The script is a variation of the Word Count script. Here are the guts:

 object WordCount2 {
   def main(args: Array[String]) = {

     val sc = new SparkContext("local", "Word Count (2)")

     val input = sc.textFile(".../some/local/file").map(line =>
       line.toLowerCase)
     input.cache

     val wc2 = input
       .flatMap(line => line.split("""\W+"""))
       .map(word => (word, 1))
       .reduceByKey((count1, count2) => count1 + count2)

     wc2.saveAsTextFile("output/some/directory")

     sc.stop()

 It works fine compiled and executed with 0.9.1. If I recompile and run with
 1.0.0-RC1, where the same output directory still exists, I get this
 familiar Hadoop-ish exception:

 [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException:
 Output directory
 file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
 already exists
 org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
 file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
 already exists
  at
 org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
 at
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
  at
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
 at
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
 at spark.activator.WordCount2$.main(WordCount2.scala:42)
  at spark.activator.WordCount2.main(WordCount2.scala)
 ...

 Thoughts?


 On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell pwend...@gmail.com wrote:

 Hey All,

 This is not an official vote, but I wanted to cut an RC so that people can
 test against the Maven artifacts, test building with their configuration,
 etc. We are still chasing down a few issues and updating docs, etc.

 If you have issues or bug reports for this release, please send an e-mail
 to the Spark dev list and/or file a JIRA.

 Commit: d636772 (v1.0.0-rc3)

 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221

 Binaries:
 http://people.apache.org/~pwendell/spark-1.0.0-rc3/

 Docs:
 http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/

 Repository:
 https://repository.apache.org/content/repositories/orgapachespark-1012/

 == API Changes ==
 If you want to test building against Spark there are some minor API
 changes. We'll get these written up for the final release but I'm noting a
 few here (not comprehensive):

 changes to ML vector specification:

 http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10

 changes to the Java API:

 http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

 coGroup and related functions now return Iterable[T] instead of Seq[T]
 == Call toSeq on the result to restore the old behavior

 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
 == Call toSeq on the result to restore old behavior

 Streaming classes have been renamed:
 NetworkReceiver - Receiver




 --
 Dean Wampler, Ph.D.
 Typesafe
 @deanwampler
 http://typesafe.com
 http://polyglotprogramming.com


Re: Spark 1.0.0 rc3

2014-04-29 Thread Patrick Wendell
That suggestion got lost along the way and IIRC the patch didn't have
that. It's a good idea though, if nothing else to provide a simple
means for backwards compatibility.

I created a JIRA for this. It's very straightforward so maybe someone
can pick it up quickly:
https://issues.apache.org/jira/browse/SPARK-1677


On Tue, Apr 29, 2014 at 2:20 PM, Dean Wampler deanwamp...@gmail.com wrote:
 Thanks. I'm fine with the logic change, although I was a bit surprised to
 see Hadoop used for file I/O.

 Anyway, the jira issue and pull request discussions mention a flag to
 enable overwrites. That would be very convenient for a tutorial I'm
 writing, although I wouldn't recommend it for normal use, of course.
 However, I can't figure out if this actually exists. I found the
 spark.files.overwrite property, but that doesn't apply.  Does this override
 flag, method call, or method argument actually exist?

 Thanks,
 Dean


 On Tue, Apr 29, 2014 at 1:54 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hi Dean,

 We always used the Hadoop libraries here to read and write local
 files. In Spark 1.0 we started enforcing the rule that you can't
 over-write an existing directory because it can cause
 confusing/undefined behavior if multiple jobs output to the directory
 (they partially clobber each other's output).

 https://issues.apache.org/jira/browse/SPARK-1100
 https://github.com/apache/spark/pull/11

 In the JIRA I actually proposed slightly deviating from Hadoop
 semantics and allowing the directory to exist if it is empty, but I
 think in the end we decided to just go with the exact same semantics
 as Hadoop (i.e. empty directories are a problem).

 - Patrick

 On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler deanwamp...@gmail.com
 wrote:
   I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using
   HDFS classes for file I/O, while the same script compiled and run with
   0.9.1 uses only the local-mode file I/O.
 
  The script is a variation of the Word Count script. Here are the guts:
 
   object WordCount2 {
     def main(args: Array[String]) = {

       val sc = new SparkContext("local", "Word Count (2)")

       val input = sc.textFile(".../some/local/file").map(line =>
         line.toLowerCase)
       input.cache

       val wc2 = input
         .flatMap(line => line.split("""\W+"""))
         .map(word => (word, 1))
         .reduceByKey((count1, count2) => count1 + count2)

       wc2.saveAsTextFile("output/some/directory")

       sc.stop()
 
  It works fine compiled and executed with 0.9.1. If I recompile and run
 with
  1.0.0-RC1, where the same output directory still exists, I get this
  familiar Hadoop-ish exception:
 
  [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException:
  Output directory
 
 file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
  already exists
  org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
 
 file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc
  already exists
   at
 
 org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
  at
 
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
   at
 
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
  at
 
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
   at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
  at spark.activator.WordCount2$.main(WordCount2.scala:42)
   at spark.activator.WordCount2.main(WordCount2.scala)
  ...
 
  Thoughts?
 
 
  On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  Hey All,
 
  This is not an official vote, but I wanted to cut an RC so that people
 can
  test against the Maven artifacts, test building with their
 configuration,
  etc. We are still chasing down a few issues and updating docs, etc.
 
  If you have issues or bug reports for this release, please send an
 e-mail
  to the Spark dev list and/or file a JIRA.
 
  Commit: d636772 (v1.0.0-rc3)
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
 
  Binaries:
  http://people.apache.org/~pwendell/spark-1.0.0-rc3/
 
  Docs:
  http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
 
  Repository:
  https://repository.apache.org/content/repositories/orgapachespark-1012/
 
  == API Changes ==
  If you want to test building against Spark there are some minor API
  changes. We'll get these written up for the final release but I'm
 noting a
  few here (not comprehensive):
 
  changes to ML vector specification:
 
 
 http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
 
  changes to the Java API:
 
 
 http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
 
  coGroup and related functions now return Iterable[T] instead of Seq[T]
  ==> Call toSeq

Re: SparkSubmit and --driver-java-options

2014-04-30 Thread Patrick Wendell
I added a fix for this recently and it didn't require adding -J
notation - are you trying it with this patch?

https://issues.apache.org/jira/browse/SPARK-1654

 ./bin/spark-shell --driver-java-options "-Dfoo=a -Dbar=b"
scala> sys.props.get("foo")
res0: Option[String] = Some(a)
scala> sys.props.get("bar")
res1: Option[String] = Some(b)

- Patrick

On Wed, Apr 30, 2014 at 11:29 AM, Marcelo Vanzin van...@cloudera.com wrote:
 Hello all,

 Maybe my brain is not evolved enough to be able to trace through what
 happens with command-line arguments as they're parsed through all the
 shell scripts... but I really can't figure out how to pass more than a
 single JVM option on the command line.

 Unless someone has an obvious workaround that I'm missing, I'd like to
 propose something that is actually pretty standard in JVM tools: using
 -J. From javac:

   -J<flag>   Pass <flag> directly to the runtime system

 So javac -J-Xmx1g would pass -Xmx1g to the underlying JVM. You can
 use several of those to pass multiple options (unlike
 --driver-java-options), so it helps that it's a short syntax.

 Unless someone has some issue with that I'll work on a patch for it...
 (well, I'm going to do it locally for me anyway because I really can't
 figure out how to do what I want to otherwise.)


 --
 Marcelo


Re: SparkSubmit and --driver-java-options

2014-04-30 Thread Patrick Wendell
Yeah I think the problem is that the spark-submit script doesn't pass
the argument array to spark-class in the right way, so any quoted
strings get flattened.

We do:
ORIG_ARGS=$@
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit $ORIG_ARGS

This works:
# remove all the code relating to `shift`ing the arguments
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

Not sure, but I think the issue is that when you make a copy of $@ in
bash the type actually changes from an array to something else.

My patch fixes this for spark-shell but I didn't realize that
spark-submit does the same thing.
https://github.com/apache/spark/pull/576/files#diff-bc287993dfd11fd18794041e169ffd72L23

I think we'll need to figure out how to do this correctly in the bash
script so that quoted strings get passed in the right way.

On Wed, Apr 30, 2014 at 1:06 PM, Marcelo Vanzin van...@cloudera.com wrote:
 Just pulled again just in case. Verified your fix is there.

 $ ./bin/spark-submit --master yarn --deploy-mode client
 --driver-java-options "-Dfoo -Dbar" blah blah blah
 error: Unrecognized option '-Dbar'.
 run with --help for more information or --verbose for debugging output


 On Wed, Apr 30, 2014 at 12:49 PM, Patrick Wendell pwend...@gmail.com wrote:
 I added a fix for this recently and it didn't require adding -J
 notation - are you trying it with this patch?

 https://issues.apache.org/jira/browse/SPARK-1654

  ./bin/spark-shell --driver-java-options "-Dfoo=a -Dbar=b"
 scala> sys.props.get("foo")
 res0: Option[String] = Some(a)
 scala> sys.props.get("bar")
 res1: Option[String] = Some(b)

 - Patrick

 On Wed, Apr 30, 2014 at 11:29 AM, Marcelo Vanzin van...@cloudera.com wrote:
 Hello all,

 Maybe my brain is not evolved enough to be able to trace through what
 happens with command-line arguments as they're parsed through all the
 shell scripts... but I really can't figure out how to pass more than a
 single JVM option on the command line.

 Unless someone has an obvious workaround that I'm missing, I'd like to
 propose something that is actually pretty standard in JVM tools: using
 -J. From javac:

   -J<flag>   Pass <flag> directly to the runtime system

 So javac -J-Xmx1g would pass -Xmx1g to the underlying JVM. You can
 use several of those to pass multiple options (unlike
 --driver-java-options), so it helps that it's a short syntax.

 Unless someone has some issue with that I'll work on a patch for it...
 (well, I'm going to do it locally for me anyway because I really can't
 figure out how to do what I want to otherwise.)


 --
 Marcelo



 --
 Marcelo


Re: SparkSubmit and --driver-java-options

2014-04-30 Thread Patrick Wendell
So I reproduced the problem here:

== test.sh ==
#!/bin/bash
for x in "$@"; do
  echo arg: $x
done
ARGS_COPY="$@"
for x in "$ARGS_COPY"; do
  echo arg_copy: $x
done
==

./test.sh a b "c d e" f
arg: a
arg: b
arg: c d e
arg: f
arg_copy: a b c d e f

I'll dig around a bit more and see if we can fix it. Pretty sure we
aren't passing these argument arrays around correctly in bash.

On Wed, Apr 30, 2014 at 1:48 PM, Marcelo Vanzin van...@cloudera.com wrote:
 On Wed, Apr 30, 2014 at 1:41 PM, Patrick Wendell pwend...@gmail.com wrote:
 Yeah I think the problem is that the spark-submit script doesn't pass
 the argument array to spark-class in the right way, so any quoted
 strings get flattened.

 I think we'll need to figure out how to do this correctly in the bash
 script so that quoted strings get passed in the right way.

 I tried a few different approaches but finally ended up giving up; my
 bash-fu is apparently not strong enough. If you can make it work
 great, but I have -J working locally in case you give up like me.
 :-)

 --
 Marcelo


Re: SparkSubmit and --driver-java-options

2014-04-30 Thread Patrick Wendell
Marcelo - Mind trying the following diff locally? If it works I can
send a patch:

patrick@patrick-t430s:~/Documents/spark$ git diff bin/spark-submit
diff --git a/bin/spark-submit b/bin/spark-submit
index dd0d95d..49bc262 100755
--- a/bin/spark-submit
+++ b/bin/spark-submit
@@ -18,7 +18,7 @@
 #

 export SPARK_HOME=$(cd `dirname $0`/..; pwd)
-ORIG_ARGS=$@
+ORIG_ARGS=("$@")

 while (($#)); do
   if [ $1 = --deploy-mode ]; then
@@ -39,5 +39,5 @@ if [ ! -z $DRIVER_MEMORY ] && [ ! -z $DEPLOY_MODE ] && [ $DEPLOY_MODE = client
   export SPARK_MEM=$DRIVER_MEMORY
 fi

-$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit $ORIG_ARGS
+$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit "${ORIG_ARGS[@]}"

On Wed, Apr 30, 2014 at 1:51 PM, Patrick Wendell pwend...@gmail.com wrote:
 So I reproduced the problem here:

 == test.sh ==
 #!/bin/bash
 for x in "$@"; do
   echo arg: $x
 done
 ARGS_COPY="$@"
 for x in "$ARGS_COPY"; do
   echo arg_copy: $x
 done
 ==

 ./test.sh a b "c d e" f
 arg: a
 arg: b
 arg: c d e
 arg: f
 arg_copy: a b c d e f

 I'll dig around a bit more and see if we can fix it. Pretty sure we
 aren't passing these argument arrays around correctly in bash.

 On Wed, Apr 30, 2014 at 1:48 PM, Marcelo Vanzin van...@cloudera.com wrote:
 On Wed, Apr 30, 2014 at 1:41 PM, Patrick Wendell pwend...@gmail.com wrote:
 Yeah I think the problem is that the spark-submit script doesn't pass
 the argument array to spark-class in the right way, so any quoted
 strings get flattened.

 I think we'll need to figure out how to do this correctly in the bash
 script so that quoted strings get passed in the right way.

 I tried a few different approaches but finally ended up giving up; my
 bash-fu is apparently not strong enough. If you can make it work
 great, but I have -J working locally in case you give up like me.
 :-)

 --
 Marcelo


Re: SparkSubmit and --driver-java-options

2014-04-30 Thread Patrick Wendell
Dean - our e-mails crossed, but thanks for the tip. Was independently
arriving at your solution :)

Okay I'll submit something.

- Patrick

On Wed, Apr 30, 2014 at 2:14 PM, Marcelo Vanzin van...@cloudera.com wrote:
 Cool, that seems to work. Thanks!

 On Wed, Apr 30, 2014 at 2:09 PM, Patrick Wendell pwend...@gmail.com wrote:
 Marcelo - Mind trying the following diff locally? If it works I can
 send a patch:

 patrick@patrick-t430s:~/Documents/spark$ git diff bin/spark-submit
 diff --git a/bin/spark-submit b/bin/spark-submit
 index dd0d95d..49bc262 100755
 --- a/bin/spark-submit
 +++ b/bin/spark-submit
 @@ -18,7 +18,7 @@
  #

  export SPARK_HOME=$(cd `dirname $0`/..; pwd)
 -ORIG_ARGS=$@
 +ORIG_ARGS=("$@")

  while (($#)); do
    if [ $1 = --deploy-mode ]; then
 @@ -39,5 +39,5 @@ if [ ! -z $DRIVER_MEMORY ] && [ ! -z $DEPLOY_MODE ] && [ $DEPLOY_MODE = client
    export SPARK_MEM=$DRIVER_MEMORY
  fi

 -$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit $ORIG_ARGS
 +$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit "${ORIG_ARGS[@]}"

 On Wed, Apr 30, 2014 at 1:51 PM, Patrick Wendell pwend...@gmail.com wrote:
 So I reproduced the problem here:

 == test.sh ==
 #!/bin/bash
 for x in "$@"; do
   echo arg: $x
 done
 ARGS_COPY="$@"
 for x in "$ARGS_COPY"; do
   echo arg_copy: $x
 done
 ==

 ./test.sh a b "c d e" f
 arg: a
 arg: b
 arg: c d e
 arg: f
 arg_copy: a b c d e f

 I'll dig around a bit more and see if we can fix it. Pretty sure we
 aren't passing these argument arrays around correctly in bash.

 On Wed, Apr 30, 2014 at 1:48 PM, Marcelo Vanzin van...@cloudera.com wrote:
 On Wed, Apr 30, 2014 at 1:41 PM, Patrick Wendell pwend...@gmail.com 
 wrote:
 Yeah I think the problem is that the spark-submit script doesn't pass
 the argument array to spark-class in the right way, so any quoted
 strings get flattened.

 I think we'll need to figure out how to do this correctly in the bash
 script so that quoted strings get passed in the right way.

 I tried a few different approaches but finally ended up giving up; my
 bash-fu is apparently not strong enough. If you can make it work
 great, but I have -J working locally in case you give up like me.
 :-)

 --
 Marcelo



 --
 Marcelo


Re: SparkSubmit and --driver-java-options

2014-04-30 Thread Patrick Wendell
Patch here:
https://github.com/apache/spark/pull/609

On Wed, Apr 30, 2014 at 2:26 PM, Patrick Wendell pwend...@gmail.com wrote:
 Dean - our e-mails crossed, but thanks for the tip. Was independently
 arriving at your solution :)

 Okay I'll submit something.

 - Patrick

 On Wed, Apr 30, 2014 at 2:14 PM, Marcelo Vanzin van...@cloudera.com wrote:
 Cool, that seems to work. Thanks!

 On Wed, Apr 30, 2014 at 2:09 PM, Patrick Wendell pwend...@gmail.com wrote:
 Marcelo - Mind trying the following diff locally? If it works I can
 send a patch:

 patrick@patrick-t430s:~/Documents/spark$ git diff bin/spark-submit
 diff --git a/bin/spark-submit b/bin/spark-submit
 index dd0d95d..49bc262 100755
 --- a/bin/spark-submit
 +++ b/bin/spark-submit
 @@ -18,7 +18,7 @@
  #

  export SPARK_HOME=$(cd `dirname $0`/..; pwd)
 -ORIG_ARGS=$@
 +ORIG_ARGS=("$@")

  while (($#)); do
    if [ $1 = --deploy-mode ]; then
 @@ -39,5 +39,5 @@ if [ ! -z $DRIVER_MEMORY ] && [ ! -z $DEPLOY_MODE ] && [ $DEPLOY_MODE = client
    export SPARK_MEM=$DRIVER_MEMORY
  fi

 -$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit $ORIG_ARGS
 +$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit "${ORIG_ARGS[@]}"

 On Wed, Apr 30, 2014 at 1:51 PM, Patrick Wendell pwend...@gmail.com wrote:
 So I reproduced the problem here:

 == test.sh ==
 #!/bin/bash
 for x in "$@"; do
   echo arg: $x
 done
 ARGS_COPY="$@"
 for x in "$ARGS_COPY"; do
   echo arg_copy: $x
 done
 ==

 ./test.sh a b "c d e" f
 arg: a
 arg: b
 arg: c d e
 arg: f
 arg_copy: a b c d e f

 I'll dig around a bit more and see if we can fix it. Pretty sure we
 aren't passing these argument arrays around correctly in bash.

 On Wed, Apr 30, 2014 at 1:48 PM, Marcelo Vanzin van...@cloudera.com 
 wrote:
 On Wed, Apr 30, 2014 at 1:41 PM, Patrick Wendell pwend...@gmail.com 
 wrote:
 Yeah I think the problem is that the spark-submit script doesn't pass
 the argument array to spark-class in the right way, so any quoted
 strings get flattened.

 I think we'll need to figure out how to do this correctly in the bash
 script so that quoted strings get passed in the right way.

 I tried a few different approaches but finally ended up giving up; my
 bash-fu is apparently not strong enough. If you can make it work
 great, but I have -J working locally in case you give up like me.
 :-)

 --
 Marcelo



 --
 Marcelo


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-15 Thread Patrick Wendell
I'm cancelling this vote in favor of rc6.

On Tue, May 13, 2014 at 8:01 AM, Sean Owen so...@cloudera.com wrote:
 On Tue, May 13, 2014 at 2:49 PM, Sean Owen so...@cloudera.com wrote:
 On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell pwend...@gmail.com wrote:
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.0.0-rc5/

 Good news is that the sigs, MD5 and SHA are all correct.

 Tiny note: the Maven artifacts use SHA1, while the binary artifacts
 use SHA512, which took me a bit of head-scratching to figure out.

 If another RC comes out, I might suggest making it SHA1 everywhere?
 But there is nothing wrong with these signatures and checksums.

 Now to look at the contents...

 This is a bit of drudgery that probably needs to be done too: a review
 of the LICENSE and NOTICE file. Having dumped the licenses of
 dependencies, I don't believe these reflect all of the software that's
 going to be distributed in 1.0.

 (Good news is there's no forbidden license stuff included AFAICT.)

 And good news is that NOTICE can be auto-generated, largely, with a
 Maven plugin. This can be done manually for now.

 And there is a license plugin that will list all known licenses of
 transitive dependencies so that LICENSE can be filled out fairly
 easily.

 What say? want a JIRA with details?


[VOTE] Release Apache Spark 1.0.0 (rc6)

2014-05-15 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.0.0!

This patch has a few minor fixes on top of rc5. I've also built the
binary artifacts with Hive support enabled so people can test this
configuration. When we release 1.0 we might just release both vanilla
and Hive-enabled binaries.

The tag to be voted on is v1.0.0-rc6 (commit 54133a):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=54133abdce0246f6643a1112a5204afb2c4caa82

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc6/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachestratos-1011

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc6-docs/

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Saturday, May 17, at 20:58 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

changes to ML vector specification:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10

changes to the Java API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

changes to the streaming API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

changes to the GraphX API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior
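
To make the two changes above concrete, here is a minimal sketch of
code adapted to the 1.0 API (the RDD contents, app name, and object
name below are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object ApiMigrationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local").setAppName("api-migration-sketch"))

    val a = sc.parallelize(Seq(("k", 1), ("k", 2)))
    val b = sc.parallelize(Seq(("k", "x")))

    // cogroup now yields Iterables; call toSeq where Seq semantics are needed.
    val grouped = a.cogroup(b).mapValues { case (ints, strs) =>
      (ints.toSeq, strs.toSeq)
    }
    grouped.collect().foreach(println)

    // jarOfClass now returns Option[String]; toSeq restores the old shape.
    val jars: Seq[String] = SparkContext.jarOfClass(getClass).toSeq
    println(jars)

    sc.stop()
  }
}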


Re: [VOTE] Release Apache Spark 1.0.0 (rc7)

2014-05-16 Thread Patrick Wendell
I'll start the voting with a +1.

On Thu, May 15, 2014 at 1:14 AM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.0.0!

 This patch has minor documentation changes and fixes on top of rc6.

 The tag to be voted on is v1.0.0-rc7 (commit 9212b3e):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=9212b3e5bb5545ccfce242da8d89108e6fb1c464

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.0.0-rc7/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1015

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/

 Please vote on releasing this package as Apache Spark 1.0.0!

 The vote is open until Sunday, May 18, at 09:12 UTC and passes if a
 majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.

 changes to ML vector specification:
 http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10

 changes to the Java API:
 http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

 changes to the streaming API:
 http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

 changes to the GraphX API:
 http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

 coGroup and related functions now return Iterable[T] instead of Seq[T]
 ==> Call toSeq on the result to restore the old behavior

 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
 ==> Call toSeq on the result to restore old behavior


[RESULT][VOTE] Release Apache Spark 1.0.0 (rc6)

2014-05-16 Thread Patrick Wendell
This vote is cancelled in favor of rc7.

On Wed, May 14, 2014 at 1:02 PM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.0.0!

 This patch has a few minor fixes on top of rc5. I've also built the
 binary artifacts with Hive support enabled so people can test this
 configuration. When we release 1.0 we might just release both vanilla
 and Hive-enabled binaries.

 The tag to be voted on is v1.0.0-rc6 (commit 54133a):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=54133abdce0246f6643a1112a5204afb2c4caa82

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.0.0-rc6/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachestratos-1011

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.0.0-rc6-docs/

 Please vote on releasing this package as Apache Spark 1.0.0!

 The vote is open until Saturday, May 17, at 20:58 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.

 changes to ML vector specification:
 http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10

 changes to the Java API:
 http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

 changes to the streaming API:
 http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

 changes to the GraphX API:
 http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

 coGroup and related functions now return Iterable[T] instead of Seq[T]
 ==> Call toSeq on the result to restore the old behavior

 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
 ==> Call toSeq on the result to restore old behavior


[VOTE] Release Apache Spark 1.0.0 (rc8)

2014-05-16 Thread Patrick Wendell
[Due to the ASF e-mail outage, I'm not sure if anyone will actually receive this.]

Please vote on releasing the following candidate as Apache Spark version 1.0.0!
This has only minor changes on top of rc7.

The tag to be voted on is v1.0.0-rc8 (commit 80eea0f):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=80eea0f111c06260ffaa780d2f3f7facd09c17bc

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc8/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1016/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Monday, May 19, at 10:15 UTC and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

changes to ML vector specification:
http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10

changes to the Java API:
http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

changes to the streaming API:
http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

changes to the GraphX API:
http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior


[VOTE] Release Apache Spark 1.0.0 (rc7)

2014-05-16 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.0.0!

This patch has minor documentation changes and fixes on top of rc6.

The tag to be voted on is v1.0.0-rc7 (commit 9212b3e):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=9212b3e5bb5545ccfce242da8d89108e6fb1c464

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc7/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1015

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Sunday, May 18, at 09:12 UTC and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

changes to ML vector specification:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10

changes to the Java API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

changes to the streaming API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

changes to the GraphX API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior


Re: [VOTE] Release Apache Spark 1.0.0 (rc7)

2014-05-16 Thread Patrick Wendell
Hey all,

My vote threads seem to be running about 24 hours behind and/or
getting swallowed by infra e-mail.

I sent RC8 yesterday and we might send one tonight as well. I'll make
sure to close all existing ones.

There have been only small polish changes in the recent RCs since
RC5, so testing any of these should be pretty equivalent. I'll make
sure I close all the other threads by tonight.

- Patrick

On Fri, May 16, 2014 at 1:10 PM, Mark Hamstra m...@clearstorydata.com wrote:
 Sorry for the duplication, but I think this is the current VOTE candidate
 -- we're not voting on rc8 yet?

 +1, but just barely.  We've got quite a number of outstanding bugs
 identified, and many of them have fixes in progress.  I'd hate to see those
 efforts get lost in a post-1.0.0 flood of new features targeted at 1.1.0 --
 in other words, I'd like to see 1.0.1 retain a high priority relative to
 1.1.0.

 Looking through the unresolved JIRAs, it doesn't look like any of the
 identified bugs are show-stoppers or strictly regressions (although I will
 note that one that I have in progress, SPARK-1749, is a bug that we
 introduced with recent work -- it's not strictly a regression because we
 had equally bad but different behavior when the DAGScheduler exceptions
 weren't previously being handled at all vs. being slightly mis-handled
 now), so I'm not currently seeing a reason not to release.


 On Fri, May 16, 2014 at 11:42 AM, Henry Saputra 
 henry.sapu...@gmail.com wrote:

 Ah ok, thanks Aaron

 Just to make sure we VOTE the right RC.

 Thanks,

 Henry

 On Fri, May 16, 2014 at 11:37 AM, Aaron Davidson ilike...@gmail.com
 wrote:
  It was, but due to the apache infra issues, some may not have received
 the
  email yet...
 
  On Fri, May 16, 2014 at 10:48 AM, Henry Saputra henry.sapu...@gmail.com
 
  wrote:
 
  Hi Patrick,
 
  Just want to make sure that VOTE for rc6 also cancelled?
 
 
  Thanks,
 
  Henry
 
  On Thu, May 15, 2014 at 1:15 AM, Patrick Wendell pwend...@gmail.com
  wrote:
   I'll start the voting with a +1.
  
   On Thu, May 15, 2014 at 1:14 AM, Patrick Wendell pwend...@gmail.com
   wrote:
   Please vote on releasing the following candidate as Apache Spark
   version 1.0.0!
  
   This patch has minor documentation changes and fixes on top of rc6.
  
   The tag to be voted on is v1.0.0-rc7 (commit 9212b3e):
  
  
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=9212b3e5bb5545ccfce242da8d89108e6fb1c464
  
   The release files, including signatures, digests, etc. can be found
 at:
   http://people.apache.org/~pwendell/spark-1.0.0-rc7/
  
   Release artifacts are signed with the following key:
   https://people.apache.org/keys/committer/pwendell.asc
  
   The staging repository for this release can be found at:
  
 https://repository.apache.org/content/repositories/orgapachespark-1015
  
   The documentation corresponding to this release can be found at:
   http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/
  
   Please vote on releasing this package as Apache Spark 1.0.0!
  
   The vote is open until Sunday, May 18, at 09:12 UTC and passes if a
   majority of at least 3 +1 PMC votes are cast.
  
   [ ] +1 Release this package as Apache Spark 1.0.0
   [ ] -1 Do not release this package because ...
  
   To learn more about Apache Spark, please see
   http://spark.apache.org/
  
   == API Changes ==
   We welcome users to compile Spark applications against 1.0. There are
   a few API changes in this release. Here are links to the associated
   upgrade guides - user facing changes have been kept as small as
   possible.
  
   changes to ML vector specification:
  
  
 http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10
  
   changes to the Java API:
  
  
 http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
  
   changes to the streaming API:
  
  
 http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
  
   changes to the GraphX API:
  
  
 http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
  
   coGroup and related functions now return Iterable[T] instead of
 Seq[T]
   ==> Call toSeq on the result to restore the old behavior

   SparkContext.jarOfClass returns Option[String] instead of Seq[String]
   ==> Call toSeq on the result to restore old behavior
 
 



[RESULT] [VOTE] Release Apache Spark 1.0.0 (rc8)

2014-05-17 Thread Patrick Wendell
Cancelled in favor of rc9.

On Sat, May 17, 2014 at 12:51 AM, Patrick Wendell pwend...@gmail.com wrote:
 Due to the issue discovered by Michael, this vote is cancelled in favor of 
 rc9.

 On Fri, May 16, 2014 at 6:22 PM, Michael Armbrust
 mich...@databricks.com wrote:
 -1

 We found a regression in the way configuration is passed to executors.

 https://issues.apache.org/jira/browse/SPARK-1864
 https://github.com/apache/spark/pull/808

 Michael


 On Fri, May 16, 2014 at 3:57 PM, Mark Hamstra m...@clearstorydata.com
 wrote:

 +1


 On Fri, May 16, 2014 at 2:16 AM, Patrick Wendell pwend...@gmail.com
 wrote:

  [Due to the ASF e-mail outage, I'm not sure if anyone will actually
  receive this.]
 
  Please vote on releasing the following candidate as Apache Spark version
  1.0.0!
  This has only minor changes on top of rc7.
 
  The tag to be voted on is v1.0.0-rc8 (commit 80eea0f):
 
 
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=80eea0f111c06260ffaa780d2f3f7facd09c17bc
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.0.0-rc8/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1016/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/
 
  Please vote on releasing this package as Apache Spark 1.0.0!
 
  The vote is open until Monday, May 19, at 10:15 UTC and passes if a
  majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.0.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == API Changes ==
  We welcome users to compile Spark applications against 1.0. There are
  a few API changes in this release. Here are links to the associated
  upgrade guides - user facing changes have been kept as small as
  possible.
 
  changes to ML vector specification:
 
 
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10
 
  changes to the Java API:
 
 
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
 
  changes to the streaming API:
 
 
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
 
  changes to the GraphX API:
 
 
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
 
  coGroup and related functions now return Iterable[T] instead of Seq[T]
  ==> Call toSeq on the result to restore the old behavior

  SparkContext.jarOfClass returns Option[String] instead of Seq[String]
  ==> Call toSeq on the result to restore old behavior
 




Re: [VOTE] Release Apache Spark 1.0.0 (rc8)

2014-05-17 Thread Patrick Wendell
Due to the issue discovered by Michael, this vote is cancelled in favor of rc9.

On Fri, May 16, 2014 at 6:22 PM, Michael Armbrust
mich...@databricks.com wrote:
 -1

 We found a regression in the way configuration is passed to executors.

 https://issues.apache.org/jira/browse/SPARK-1864
 https://github.com/apache/spark/pull/808

 Michael


 On Fri, May 16, 2014 at 3:57 PM, Mark Hamstra m...@clearstorydata.com
 wrote:

 +1


 On Fri, May 16, 2014 at 2:16 AM, Patrick Wendell pwend...@gmail.com
 wrote:

  [Due to the ASF e-mail outage, I'm not sure if anyone will actually
  receive this.]
 
  Please vote on releasing the following candidate as Apache Spark version
  1.0.0!
  This has only minor changes on top of rc7.
 
  The tag to be voted on is v1.0.0-rc8 (commit 80eea0f):
 
 
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=80eea0f111c06260ffaa780d2f3f7facd09c17bc
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.0.0-rc8/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1016/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/
 
  Please vote on releasing this package as Apache Spark 1.0.0!
 
  The vote is open until Monday, May 19, at 10:15 UTC and passes if a
  majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.0.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == API Changes ==
  We welcome users to compile Spark applications against 1.0. There are
  a few API changes in this release. Here are links to the associated
  upgrade guides - user facing changes have been kept as small as
  possible.
 
  changes to ML vector specification:
 
 
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10
 
  changes to the Java API:
 
 
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
 
  changes to the streaming API:
 
 
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
 
  changes to the GraphX API:
 
 
  http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
 
  coGroup and related functions now return Iterable[T] instead of Seq[T]
  ==> Call toSeq on the result to restore the old behavior

  SparkContext.jarOfClass returns Option[String] instead of Seq[String]
  ==> Call toSeq on the result to restore old behavior
 




Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-17 Thread Patrick Wendell
I'll start the voting with a +1.

On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.0.0!
 This has one bug fix and one minor feature on top of rc8:
 SPARK-1864: https://github.com/apache/spark/pull/808
 SPARK-1808: https://github.com/apache/spark/pull/799

 The tag to be voted on is v1.0.0-rc9 (commit 920f947):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.0.0-rc9/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1017/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/

 Please vote on releasing this package as Apache Spark 1.0.0!

 The vote is open until Tuesday, May 20, at 08:56 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.

 changes to ML vector specification:
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10

 changes to the Java API:
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

 changes to the streaming API:
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

 changes to the GraphX API:
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

 coGroup and related functions now return Iterable[T] instead of Seq[T]
 ==> Call toSeq on the result to restore the old behavior

 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
 ==> Call toSeq on the result to restore old behavior


Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Patrick Wendell
@db - it's possible that you aren't including the jar in the classpath
of your driver program (I think this is what mridul was suggesting).
It would be helpful to see the stack trace of the CNFE.

- Patrick

On Sun, May 18, 2014 at 11:54 AM, Patrick Wendell pwend...@gmail.com wrote:
 @xiangrui - we don't expect these to be present on the system
 classpath, because they get dynamically added by Spark (e.g. your
 application can call sc.addJar well after the JVM's have started).

 @db - I'm pretty surprised to see that behavior. It's definitely not
 intended that users need reflection to instantiate their classes -
 something odd is going on in your case. If you could create an
 isolated example and post it to the JIRA, that would be great.

 On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng men...@gmail.com wrote:
 I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870

 DB, could you add more info to that JIRA? Thanks!

 -Xiangrui

 On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng men...@gmail.com wrote:
 Btw, I tried

 rdd.map { i =>
   System.getProperty("java.class.path")
 }.collect()

 but didn't see the jars added via --jars on the executor classpath.

 -Xiangrui

 On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng men...@gmail.com wrote:
 I can reproduce the error with Spark 1.0-RC and YARN (CDH-5). The
 reflection approach mentioned by DB didn't work either. I checked the
 distributed cache on a worker node and found the jar there. It is also
 in the Environment tab of the WebUI. The workaround is making an
 assembly jar.

 DB, could you create a JIRA and describe what you have found so far? 
 Thanks!

 Best,
 Xiangrui

 On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan mri...@gmail.com 
 wrote:
 Can you try moving your mapPartitions to another class/object which is
 referenced only after sc.addJar ?

 I would suspect CNFEx is coming while loading the class containing
 mapPartitions before addJars is executed.

 In general though, dynamic loading of classes means you use reflection to
 instantiate it since expectation is you don't know which implementation
 provides the interface ... If you statically know it apriori, you bundle 
 it
 in your classpath.

 Regards
 Mridul
 On 17-May-2014 7:28 am, DB Tsai dbt...@stanford.edu wrote:

 Finally found a way out of the ClassLoader maze! It took me some time to
 understand how it works; I think it's worth documenting in a separate
 thread.

 We're trying to add an external utility.jar which contains CSVRecordParser,
 and we added the jar to the executors through the sc.addJar API.

 If the instance of CSVRecordParser is created without reflection, it
 raises a *ClassNotFoundException*.

 data.mapPartitions(lines => {
   val csvParser = new CSVRecordParser(delimiter.charAt(0))
   lines.foreach(line => {
     val lineElems = csvParser.parseLine(line)
   })
   ...
   ...
 })


 If the instance of CSVRecordParser is created through reflection, it
 works.

 data.mapPartitions(lines => {
   val loader = Thread.currentThread.getContextClassLoader
   val CSVRecordParser =
     loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")

   val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
     .newInstance(delimiter.charAt(0).asInstanceOf[Character])

   val parseLine = CSVRecordParser
     .getDeclaredMethod("parseLine", classOf[String])

   lines.foreach(line => {
     val lineElems = parseLine.invoke(csvParser,
       line).asInstanceOf[Array[String]]
   })
   ...
   ...
 })


 This is identical to this question,

 http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection

 It's not intuitive for users to load external classes through reflection,
 but the couple of available solutions, including 1) messing around with the
 systemClassLoader by calling systemClassLoader.addURL through reflection, or
 2) forking another JVM to add jars to the classpath before the bootstrap
 loader, are very tricky.

 Any thought on fixing it properly?

 @Xiangrui,
 netlib-java jniloader is loaded from netlib-java through reflection, so
 this problem will not be seen.

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai



Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-18 Thread Patrick Wendell
@xiangrui - we don't expect these to be present on the system
classpath, because they get dynamically added by Spark (e.g. your
application can call sc.addJar well after the JVM's have started).

@db - I'm pretty surprised to see that behavior. It's definitely not
intended that users need reflection to instantiate their classes -
something odd is going on in your case. If you could create an
isolated example and post it to the JIRA, that would be great.

On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng men...@gmail.com wrote:
 I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870

 DB, could you add more info to that JIRA? Thanks!

 -Xiangrui

 On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng men...@gmail.com wrote:
 Btw, I tried

 rdd.map { i =>
   System.getProperty("java.class.path")
 }.collect()

 but didn't see the jars added via --jars on the executor classpath.

 -Xiangrui

 On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng men...@gmail.com wrote:
 I can reproduce the error with Spark 1.0-RC and YARN (CDH-5). The
 reflection approach mentioned by DB didn't work either. I checked the
 distributed cache on a worker node and found the jar there. It is also
 in the Environment tab of the WebUI. The workaround is making an
 assembly jar.

 DB, could you create a JIRA and describe what you have found so far? Thanks!

 Best,
 Xiangrui

 On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan mri...@gmail.com 
 wrote:
 Can you try moving your mapPartitions to another class/object which is
 referenced only after sc.addJar ?

 I would suspect CNFEx is coming while loading the class containing
 mapPartitions before addJars is executed.

 In general though, dynamic loading of classes means you use reflection to
 instantiate it since expectation is you don't know which implementation
 provides the interface ... If you statically know it apriori, you bundle it
 in your classpath.

 Regards
 Mridul
 On 17-May-2014 7:28 am, DB Tsai dbt...@stanford.edu wrote:

 Finally found a way out of the ClassLoader maze! It took me some time to
 understand how it works; I think it's worth documenting in a separate
 thread.

 We're trying to add an external utility.jar which contains CSVRecordParser,
 and we added the jar to the executors through the sc.addJar API.

 If the instance of CSVRecordParser is created without reflection, it
 raises a *ClassNotFoundException*.

 data.mapPartitions(lines => {
   val csvParser = new CSVRecordParser(delimiter.charAt(0))
   lines.foreach(line => {
     val lineElems = csvParser.parseLine(line)
   })
   ...
   ...
 })


 If the instance of CSVRecordParser is created through reflection, it
 works.

 data.mapPartitions(lines => {
   val loader = Thread.currentThread.getContextClassLoader
   val CSVRecordParser =
     loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")

   val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
     .newInstance(delimiter.charAt(0).asInstanceOf[Character])

   val parseLine = CSVRecordParser
     .getDeclaredMethod("parseLine", classOf[String])

   lines.foreach(line => {
     val lineElems = parseLine.invoke(csvParser,
       line).asInstanceOf[Array[String]]
   })
   ...
   ...
 })


 This is identical to this question,

 http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection

 It's not intuitive for users to load external classes through reflection,
 but the couple of available solutions, including 1) messing around with the
 systemClassLoader by calling systemClassLoader.addURL through reflection, or
 2) forking another JVM to add jars to the classpath before the bootstrap
 loader, are very tricky.

 Any thought on fixing it properly?

 @Xiangrui,
 netlib-java jniloader is loaded from netlib-java through reflection, so
 this problem will not be seen.

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai



Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-18 Thread Patrick Wendell
Hey Matei - the issue you found is not related to security. This patch
a few days ago broke builds for Hadoop 1 with YARN support enabled.
The patch directly altered the way we deal with commons-lang
dependency, which is what is at the base of this stack trace.

https://github.com/apache/spark/pull/754

- Patrick

On Sun, May 18, 2014 at 5:28 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 Alright, I've opened https://github.com/apache/spark/pull/819 with the 
 Windows fixes. I also found one other likely bug, 
 https://issues.apache.org/jira/browse/SPARK-1875, in the binary packages for 
 Hadoop1 built in this RC. I think this is due to Hadoop 1's security code 
 depending on a different version of org.apache.commons than Hadoop 2, but it 
 needs investigation. Tom, any thoughts on this?

 Matei

 On May 18, 2014, at 12:33 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 I took the always fun task of testing it on Windows, and unfortunately, I 
 found some small problems with the prebuilt packages due to recent changes 
 to the launch scripts: bin/spark-class2.cmd looks in ./jars instead of ./lib 
 for the assembly JAR, and bin/run-example2.cmd doesn't quite match the 
 master-setting behavior of the Unix based one. I'll send a pull request to 
 fix them soon.

 Matei


 On May 17, 2014, at 11:32 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 +1

 Reran my tests from rc5:

 * Built the release from source.
 * Compiled Java and Scala apps that interact with HDFS against it.
 * Ran them in local mode.
 * Ran them against a pseudo-distributed YARN cluster in both yarn-client
 mode and yarn-cluster mode.


 On Sat, May 17, 2014 at 10:08 AM, Andrew Or and...@databricks.com wrote:

 +1


 2014-05-17 8:53 GMT-07:00 Mark Hamstra m...@clearstorydata.com:

 +1


 On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 I'll start the voting with a +1.

 On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell pwend...@gmail.com
 wrote:
 Please vote on releasing the following candidate as Apache Spark
 version
 1.0.0!
 This has one bug fix and one minor feature on top of rc8:
 SPARK-1864: https://github.com/apache/spark/pull/808
 SPARK-1808: https://github.com/apache/spark/pull/799

 The tag to be voted on is v1.0.0-rc9 (commit 920f947):



 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75

 The release files, including signatures, digests, etc. can be found
 at:
 http://people.apache.org/~pwendell/spark-1.0.0-rc9/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:

 https://repository.apache.org/content/repositories/orgapachespark-1017/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/

 Please vote on releasing this package as Apache Spark 1.0.0!

 The vote is open until Tuesday, May 20, at 08:56 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.

 changes to ML vector specification:



 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10

 changes to the Java API:



 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

 changes to the streaming API:



 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

 changes to the GraphX API:



 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

 coGroup and related functions now return Iterable[T] instead of
 Seq[T]
 ==> Call toSeq on the result to restore the old behavior

 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
 ==> Call toSeq on the result to restore old behavior







Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread Patrick Wendell
Having a user define a custom class inside an added jar and
instantiate it directly inside an executor is definitely supported
in Spark and has been for a really long time (several years). This is
something we do all the time in Spark.

DB - I'd hold off on a re-architecting of this until we identify
exactly what is causing the bug you are running into.

In a nutshell, when the bytecode new Foo() is run on the executor,
it will ask the driver for the class over HTTP using a custom
classloader. Something in that pipeline is breaking here, possibly
related to the YARN deployment stuff.


On Mon, May 19, 2014 at 12:29 AM, Sean Owen so...@cloudera.com wrote:
 I don't think a custom classloader is necessary.

 Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc
 all run custom user code that creates new user objects without
 reflection. I should go see how that's done. Maybe it's totally valid
 to set the thread's context classloader for just this purpose, and I
 am not thinking clearly.

 On Mon, May 19, 2014 at 8:26 AM, Andrew Ash and...@andrewash.com wrote:
 Sounds like the problem is that classloaders always look in their parents
 before themselves, and Spark users want executors to pick up classes from
 their custom code before the ones in Spark plus its dependencies.

 Would a custom classloader that delegates to the parent after first
 checking itself fix this up?
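
 As a rough illustration of that idea, a child-first loader could look
 something like the sketch below (illustrative only, not Spark's actual
 implementation):

 import java.net.{URL, URLClassLoader}

 // Illustrative child-first classloader: check the added jars before the parent.
 class ChildFirstClassLoader(urls: Array[URL], parent: ClassLoader)
   extends URLClassLoader(urls, parent) {

   override def loadClass(name: String, resolve: Boolean): Class[_] = {
     val c = Option(findLoadedClass(name)).getOrElse {
       try {
         findClass(name)                  // look in the added jars first
       } catch {
         case _: ClassNotFoundException =>
           super.loadClass(name, resolve) // fall back to the usual parent-first path
       }
     }
     if (resolve) resolveClass(c)
     c
   }
 }

 The trade-off is the usual one with child-first delegation: user jars
 can then shadow classes that Spark itself depends on.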


 On Mon, May 19, 2014 at 12:17 AM, DB Tsai dbt...@stanford.edu wrote:

 Hi Sean,

 It's true that the issue here is the classloader, and due to the classloader
 delegation model, users have to use reflection in the executors to pick up
 the right classloader in order to use the classes added by the sc.addJar API.
 However, it's very inconvenient for users, and not documented in Spark.

 I'm working on a patch to solve it by calling the protected method addURL
 in URLClassLoader to update the current default classloader, so no
 customClassLoader anymore. I wonder if this is a good way to go.

   private def addURL(url: URL, loader: URLClassLoader) {
     try {
       val method: Method =
         classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
       method.setAccessible(true)
       method.invoke(loader, url)
     } catch {
       case t: Throwable =>
         throw new IOException("Error, could not add URL to system classloader")
     }
   }



 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Sun, May 18, 2014 at 11:57 PM, Sean Owen so...@cloudera.com wrote:

  I might be stating the obvious for everyone, but the issue here is not
  reflection or the source of the JAR, but the ClassLoader. The basic
  rules are this.
 
  new Foo will use the ClassLoader that defines Foo. This is usually
  the ClassLoader that loaded whatever it is that first referenced Foo
  and caused it to be loaded -- usually the ClassLoader holding your
  other app classes.
 
  ClassLoaders can have a parent-child relationship. ClassLoaders always
  look in their parent before themselves.
 
  (Careful then -- in contexts like Hadoop or Tomcat where your app is
  loaded in a child ClassLoader, and you reference a class that Hadoop
  or Tomcat also has (like a lib class) you will get the container's
  version!)
 
  When you load an external JAR it has a separate ClassLoader which does
  not necessarily bear any relation to the one containing your app
  classes, so yeah it is not generally going to make new Foo work.
 
  Reflection lets you pick the ClassLoader, yes.
 
  I would not call setContextClassLoader.
 
  On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza sandy.r...@cloudera.com
  wrote:
   I spoke with DB offline about this a little while ago and he confirmed
  that
   he was able to access the jar from the driver.
  
   The issue appears to be a general Java issue: you can't directly
   instantiate a class from a dynamically loaded jar.
  
   I reproduced it locally outside of Spark with:
   ---
   URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
   File("myotherjar.jar").toURI().toURL() }, null);
   Thread.currentThread().setContextClassLoader(urlClassLoader);
   MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
   ---
  
   I was able to load the class with reflection.
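   For reference, the reflection-based variant that does work in the same
   setup might look roughly like this (the jar and class names are
   placeholders from the reproduction above, not real artifacts):

     import java.io.File
     import java.net.URLClassLoader

     // Same setup as above: a loader that only knows about the other jar.
     val urlClassLoader = new URLClassLoader(
       Array(new File("myotherjar.jar").toURI.toURL), null)
     Thread.currentThread().setContextClassLoader(urlClassLoader)

     // new MyClassFromMyOtherJar() fails because the calling class was loaded
     // by a different classloader, but going through reflection works:
     val clazz = Class.forName("MyClassFromMyOtherJar", true,
       Thread.currentThread().getContextClassLoader)
     val obj = clazz.newInstance()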
 



Re: spark 1.0 standalone application

2014-05-19 Thread Patrick Wendell
Whenever we publish a release candidate, we create a temporary maven
repository that hosts the artifacts. We do this precisely for the case
you are running into (where a user wants to build an application
against it to test).

You can build against the release candidate by just adding that
repository in your sbt build, then linking against spark-core
version 1.0.0. For rc9 the repository is in the vote e-mail:

http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc9-td6629.html
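
A minimal build.sbt along those lines might look like the following sketch;
the resolver URL is the rc9 staging repository from that vote thread, so
substitute whichever RC you are actually testing:

  // build.sbt - sketch for compiling an application against a Spark RC.
  name := "my-spark-app"

  scalaVersion := "2.10.4"

  // Temporary staging repository from the RC vote e-mail (rc9 shown here).
  resolvers += "Apache Spark RC staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1017/"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"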

On Mon, May 19, 2014 at 7:03 PM, Mark Hamstra m...@clearstorydata.com wrote:
 That's the crude way to do it.  If you run `sbt/sbt publishLocal`, then you
 can resolve the artifact from your local cache in the same way that you
 would resolve it if it were deployed to a remote cache.  That's just the
 build step.  Actually running the application will require the necessary
 jars to be accessible by the cluster nodes.


 On Mon, May 19, 2014 at 7:04 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

 Yeah, you have to put spark-assembly-*.jar in the lib directory of your
 application

 Best,

 --
 Nan Zhu


 On Monday, May 19, 2014 at 9:48 PM, nit wrote:

  I am not very comfortable with sbt. I want to build a standalone
 application
  using spark 1.0 RC9. I can build sbt assembly for my application with
 Spark
  0.9.1, and I think in that case Spark is pulled from the Akka repository?
 
  Now if I want to use 1.0 RC9 for my application; what is the process ?
  (FYI, I was able to build spark-1.0 via sbt/assembly and I can see
  sbt-assembly jar; and I think I will have to copy my jar somewhere? and
  update build.sbt?)
 
  PS: I am not sure if this is the right place for this question; but since
  1.0 is still RC, I felt that this may be appropriate forum.
 
  thanks!
 
 
 
  --
  View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/spark-1-0-standalone-application-tp6698.html
  Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.
 
 





Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-19 Thread Patrick Wendell
We're cancelling this RC in favor of rc10. There were two blockers: an
issue with Windows run scripts and an issue with the packaging for
Hadoop 1 when hive support is bundled.

https://issues.apache.org/jira/browse/SPARK-1875
https://issues.apache.org/jira/browse/SPARK-1876

Thanks everyone for the testing. TD will be cutting rc10, since I'm
travelling this week (thanks TD!).

- Patrick

On Mon, May 19, 2014 at 7:06 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
 just rerun my test on rc5

 everything works

 build applications with sbt and the spark-*.jar which is compiled with Hadoop 
 2.3

 +1

 --
 Nan Zhu


 On Sunday, May 18, 2014 at 11:07 PM, witgo wrote:

 How to reproduce this bug?


 -- Original --
 From: Patrick Wendell pwend...@gmail.com
 Date: Mon, May 19, 2014 10:08 AM
 To: dev@spark.apache.org
 Cc: Tom Graves tgraves...@yahoo.com
 Subject: Re: [VOTE] Release Apache Spark 1.0.0 (rc9)



 Hey Matei - the issue you found is not related to security. This patch
 from a few days ago broke builds for Hadoop 1 with YARN support enabled.
 The patch directly altered the way we deal with commons-lang
 dependency, which is what is at the base of this stack trace.

 https://github.com/apache/spark/pull/754

 - Patrick

 On Sun, May 18, 2014 at 5:28 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
  Alright, I've opened https://github.com/apache/spark/pull/819 with the 
  Windows fixes. I also found one other likely bug, 
  https://issues.apache.org/jira/browse/SPARK-1875, in the binary packages 
  for Hadoop1 built in this RC. I think this is due to Hadoop 1's security 
  code depending on a different version of org.apache.commons than Hadoop 2, 
  but it needs investigation. Tom, any thoughts on this?
 
  Matei
 
  On May 18, 2014, at 12:33 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 
   I took the always fun task of testing it on Windows, and unfortunately, 
   I found some small problems with the prebuilt packages due to recent 
   changes to the launch scripts: bin/spark-class2.cmd looks in ./jars 
   instead of ./lib for the assembly JAR, and bin/run-example2.cmd doesn't 
   quite match the master-setting behavior of the Unix based one. I'll send 
   a pull request to fix them soon.
  
   Matei
  
  
   On May 17, 2014, at 11:32 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
  
+1
   
Reran my tests from rc5:
   
* Built the release from source.
* Compiled Java and Scala apps that interact with HDFS against it.
* Ran them in local mode.
* Ran them against a pseudo-distributed YARN cluster in both 
yarn-client
mode and yarn-cluster mode.
   
   
 On Sat, May 17, 2014 at 10:08 AM, Andrew Or and...@databricks.com wrote:
   
 +1


  2014-05-17 8:53 GMT-07:00 Mark Hamstra m...@clearstorydata.com:

  +1
 
 
   On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell pwend...@gmail.com wrote:
 
 
   I'll start the voting with a +1.
  
    On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell pwend...@gmail.com wrote:
Please vote on releasing the following candidate as Apache 
Spark
  
  
 
  version
   1.0.0!
This has one bug fix and one minor feature on top of rc8:
SPARK-1864: https://github.com/apache/spark/pull/808
SPARK-1808: https://github.com/apache/spark/pull/799
   
The tag to be voted on is v1.0.0-rc9 (commit 920f947):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75
   
The release files, including signatures, digests, etc. can be 
found
 at:
http://people.apache.org/~pwendell/spark-1.0.0-rc9/
   
Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
   
The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1017/
   
The documentation corresponding to this release can be found 
at:
http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/
   
Please vote on releasing this package as Apache Spark 1.0.0!
   
The vote is open until Tuesday, May 20, at 08:56 UTC and 
passes if
amajority of at least 3 +1 PMC votes are cast.
   
[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...
   
To learn more about Apache Spark, please see

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-21 Thread Patrick Wendell
...@stanford.edu
 wrote:
   
Good summary! We fixed it in branch 0.9 since our production is
 still
  in
0.9. I'm porting it to 1.0 now, and hopefully will submit PR for
 1.0
tonight.
   
   
Sincerely,
   
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
   
   
On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza 
 sandy.r...@cloudera.com
   wrote:
   
It just hit me why this problem is showing up on YARN and not on
standalone.
   
The relevant difference between YARN and standalone is that, on
 YARN,
   the
app jar is loaded by the system classloader instead of Spark's
 custom
   URL
classloader.
   
On YARN, the system classloader knows about [the classes in the
 spark
jars,
the classes in the primary app jar].   The custom classloader
 knows
   about
[the classes in secondary app jars] and has the system
 classloader as
   its
parent.
   
A few relevant facts (mostly redundant with what Sean pointed
 out):
* Every class has a classloader that loaded it.
* When an object of class B is instantiated inside of class A, the
classloader used for loading B is the classloader that was used
 for
loading
A.
* When a classloader fails to load a class, it lets its parent
   classloader
try.  If its parent succeeds, its parent becomes the classloader
  that
loaded it.
   
So suppose class B is in a secondary app jar and class A is in the
   primary
app jar:
1. The custom classloader will try to load class A.
2. It will fail, because it only knows about the secondary jars.
3. It will delegate to its parent, the system classloader.
4. The system classloader will succeed, because it knows about the
   primary
app jar.
5. A's classloader will be the system classloader.
6. A tries to instantiate an instance of class B.
7. B will be loaded with A's classloader, which is the system
   classloader.
8. Loading B will fail, because A's classloader, which is the
 system
classloader, doesn't know about the secondary app jars.
   
In Spark standalone, A and B are both loaded by the custom
   classloader, so
this issue doesn't come up.
   
-Sandy
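    A tiny standalone sketch of steps 1-8 above (the jar and class names are
    hypothetical, just to make the delegation concrete):

      import java.io.File
      import java.net.URLClassLoader

      // Parent = a loader that knows the primary app jar (here, the loader of
      // this class); child = a loader that only knows secondary.jar.
      val secondaryLoader = new URLClassLoader(
        Array(new File("secondary.jar").toURI.toURL), getClass.getClassLoader)

      // Class A lives in the primary jar, so the lookup resolves it through
      // the parent, and A's defining classloader ends up being the parent.
      val a = Class.forName("com.example.A", true, secondaryLoader)

      // If A now does `new B()` where B lives only in secondary.jar, B is
      // looked up with A's defining loader (the parent), which knows nothing
      // about secondary.jar, so it fails with ClassNotFoundException.
      a.newInstance()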
   
On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell 
 pwend...@gmail.com
  
wrote:
   
  Having a user define a custom class inside of an added jar and
 instantiate it directly inside of an executor is definitely
  supported
 in Spark and has been for a really long time (several years).
 This
  is
 something we do all the time in Spark.

 DB - I'd hold off on a re-architecting of this until we identify
 exactly what is causing the bug you are running into.

 In a nutshell, when the bytecode new Foo() is run on the
  executor,
 it will ask the driver for the class over HTTP using a custom
 classloader. Something in that pipeline is breaking here,
 possibly
 related to the YARN deployment stuff.


 On Mon, May 19, 2014 at 12:29 AM, Sean Owen so...@cloudera.com
 
   wrote:
   I don't think a custom classloader is necessary.
 
  Well, it occurs to me that this is no new problem. Hadoop,
  Tomcat,
   etc
  all run custom user code that creates new user objects without
  reflection. I should go see how that's done. Maybe it's
 totally
   valid
  to set the thread's context classloader for just this purpose,
  and
   I
  am not thinking clearly.
 
  On Mon, May 19, 2014 at 8:26 AM, Andrew Ash 
  and...@andrewash.com
 wrote:
  Sounds like the problem is that classloaders always look in
  their
 parents
  before themselves, and Spark users want executors to pick up
   classes
 from
  their custom code before the ones in Spark plus its
  dependencies.
 
  Would a custom classloader that delegates to the parent after
   first
  checking itself fix this up?
 
 
  On Mon, May 19, 2014 at 12:17 AM, DB Tsai 
 dbt...@stanford.edu
wrote:
 
  Hi Sean,
 
  It's true that the issue here is classloader, and due to the
 classloader
  delegation model, users have to use reflection in the
 executors
   to
 pick up
  the classloader in order to use those classes added by
  sc.addJars
APIs.
  However, it's very inconvenience for users, and not
 documented
  in
 spark.
 
  I'm working on a patch to solve it by calling the protected
   method
 addURL
  in URLClassLoader to update the current default
 classloader, so
   no
  customClassLoader anymore. I wonder if this is an good way
 to
  go.
 
private def addURL(url: URL, loader: URLClassLoader){
  try {
val method: Method =
  classOf[URLClassLoader].getDeclaredMethod(addURL,
  classOf[URL])
method.setAccessible(true)
method.invoke

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-21 Thread Patrick Wendell
Hey I just looked at the fix here:
https://github.com/apache/spark/pull/848

Given that this is quite simple, maybe it's best to just go with this
and just explain that we don't support adding jars dynamically in YARN
in Spark 1.0. That seems like a reasonable thing to do.

On Wed, May 21, 2014 at 3:15 PM, Patrick Wendell pwend...@gmail.com wrote:
 Of these two solutions I'd definitely prefer 2 in the short term. I'd
 imagine the fix is very straightforward (it would mostly just be
 remove code), and we'd be making this more consistent with the
 standalone mode which makes things way easier to reason about.

 In the long term we'll definitely want to exploit the distributed
 cache more, but at this point it's premature optimization at a high
 complexity cost. Writing stuff to HDFS, though, is so slow anyways I'd
 guess that serving it directly from the driver is still faster in most
 cases (though for very large jar sizes or very large clusters, yes,
 we'll need the distributed cache).

 - Patrick

 On Wed, May 21, 2014 at 2:41 PM, Xiangrui Meng men...@gmail.com wrote:
 That's a good example. If we really want to cover that case, there are
 two solutions:

 1. Follow DB's patch, adding jars to the system classloader. Then we
 cannot put a user class in front of an existing class.
 2. Do not send the primary jar and secondary jars to executors'
 distributed cache. Instead, add them to spark.jars in SparkSubmit
 and serve them via http by calling sc.addJar in SparkContext.

 What is your preference?

 On Wed, May 21, 2014 at 2:27 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
 Is that an assumption we can make?  I think we'd run into an issue in this
 situation:

 *In primary jar:*
 def makeDynamicObject(clazz: String) = Class.forName(clazz).newInstance()

 *In app code:*
  sc.addJar("dynamicjar.jar")
  ...
  rdd.map(x => makeDynamicObject("some.class.from.DynamicJar"))

 It might be fair to say that the user should make sure to use the context
 classloader when instantiating dynamic classes, but I think it's weird that
 this code would work on Spark standalone but not on YARN.

 -Sandy
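  For what it's worth, the context-classloader variant alluded to above would
  look roughly like this (a sketch, not something Spark requires today):

    // Hypothetical helper: resolve the class through the thread's context
    // classloader instead of the caller's defining classloader.
    def makeDynamicObject(clazz: String): AnyRef =
      Class.forName(clazz, true, Thread.currentThread().getContextClassLoader)
        .newInstance()
        .asInstanceOf[AnyRef]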


 On Wed, May 21, 2014 at 2:10 PM, Xiangrui Meng men...@gmail.com wrote:

 I think adding jars dynamically should work as long as the primary jar
 and the secondary jars do not depend on dynamically added jars, which
 should be the correct logic. -Xiangrui

 On Wed, May 21, 2014 at 1:40 PM, DB Tsai dbt...@stanford.edu wrote:
  This will be another separate story.
 
   Since in the YARN deployment, as Sandy said, app.jar will always be in
   the system classloader, which means any object instantiated in app.jar
   will have the system classloader as its parent loader instead of the
   custom one. As a result, the custom class loader in YARN will never work
   without specifically using reflection.
  
   One solution would be to not use the system classloader in the
   classloader hierarchy at all, and instead add all the resources from the
   system one into the custom one. This is the approach Tomcat takes.
  
   Or we can directly overwrite the system class loader by calling the
   protected method `addURL`, which will not work and will throw an
   exception if the code is running under a security manager.
 
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
  On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
 
  This will solve the issue for jars added upon application submission,
 but,
  on top of this, we need to make sure that anything dynamically added
  through sc.addJar works as well.
 
  To do so, we need to make sure that any jars retrieved via the driver's
  HTTP server are loaded by the same classloader that loads the jars
 given on
  app submission.  To achieve this, we need to either use the same
  classloader for both system jars and user jars, or make sure that the
 user
  jars given on app submission are under the same classloader used for
  dynamically added jars.
 
  On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng men...@gmail.com
 wrote:
 
   Talked with Sandy and DB offline. I think the best solution is sending
   the secondary jars to the distributed cache of all containers rather
   than just the master, and set the classpath to include spark jar,
   primary app jar, and secondary jars before executor starts. In this
   way, user only needs to specify secondary jars via --jars instead of
   calling sc.addJar inside the code. It also solves the scalability
   problem of serving all the jars via http.
  
   If this solution sounds good, I can try to make a patch.
  
   Best,
   Xiangrui
  
   On Mon, May 19, 2014 at 10:04 PM, DB Tsai dbt...@stanford.edu
 wrote:
 In 1.0, there is a new option for users to choose which classloader has
 higher priority via spark.files.userClassPathFirst. I decided to submit the
 PR for 0.9 first. We use this patch in our lab and we can use those jars
 added by sc.addJar without reflection.
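 For reference, that option is a plain boolean configuration entry; setting it
 in application code would look roughly like this (sketch only, the app name
 is a placeholder):

   import org.apache.spark.SparkConf

   // Ask executors to prefer user-added jars over Spark's own classpath.
   val conf = new SparkConf()
     .setAppName("user-classpath-first-example")
     .set("spark.files.userClassPathFirst", "true")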

Re: No output from Spark Streaming program with Spark 1.0

2014-05-23 Thread Patrick Wendell
Also, one other thing to try: remove all of the logic from inside of
foreach and just print something. It could be that somehow an exception is
being triggered inside of your foreach block and, as a result, the output
goes away.
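For instance, something as small as the following (using the msgstream from
the program quoted below) would confirm whether the foreach is being invoked
at all:

  // Minimal check: if even this prints nothing, the problem is upstream of
  // the per-batch logic (receiver, threads, or how the app is launched).
  msgstream.foreach(rdd => println("batch arrived, count = " + rdd.count()))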

On Fri, May 23, 2014 at 6:00 PM, Patrick Wendell pwend...@gmail.com wrote:
 Hey Jim,

 Do you see the same behavior if you run this outside of eclipse?

 Also, what happens if you print something to standard out when setting
 up your streams (i.e. not inside of the foreach) do you see that? This
 could be a streaming issue, but it could also be something related to
 the way it's running in eclipse.

 - Patrick

 On Fri, May 23, 2014 at 2:57 PM, Jim Donahue jdona...@adobe.com wrote:
 I'm trying out 1.0 on a set of small Spark Streaming tests and am running
 into problems.  Here's one of the little programs I've used for a long
 time -- it reads a Kafka stream that contains Twitter JSON tweets and does
 some simple counting.  The program starts OK (it connects to the Kafka
 stream fine) and generates a stream of INFO logging messages, but never
 generates any output. :-(

 I'm running this in Eclipse, so there may be some class loading issue
 (loading the wrong class or something like that), but I'm not seeing
 anything in the console output.

 Thanks,

 Jim Donahue
 Adobe



  val kafka_messages =
    KafkaUtils.createStream[Array[Byte], Array[Byte],
      kafka.serializer.DefaultDecoder, kafka.serializer.DefaultDecoder](ssc,
      propsMap, topicMap, StorageLevel.MEMORY_AND_DISK)


  val messages = kafka_messages.map(_._2)


  val total = ssc.sparkContext.accumulator(0)


  val startTime = new java.util.Date().getTime()


  val jsonstream = messages.map[JSONObject](message => {
    val string = new String(message)
    val json = new JSONObject(string)
    total += 1
    json
  })


  val deleted = ssc.sparkContext.accumulator(0)


  val msgstream = jsonstream.filter(json =>
    if (!json.has("delete")) true else { deleted += 1; false }
  )


  msgstream.foreach(rdd => {
    if (rdd.count() > 0) {
      val data = rdd.map(json => (json.has("entities"), json.length())).collect()
      val entities: Double = data.count(t => t._1)
      val fieldCounts = data.sortBy(_._2)
      val minFields = fieldCounts(0)._2
      val maxFields = fieldCounts(fieldCounts.size - 1)._2
      val now = new java.util.Date()
      val interval = (now.getTime() - startTime) / 1000
      System.out.println(now.toString)
      System.out.println("processing time: " + interval + " seconds")
      System.out.println("total messages: " + total.value)
      System.out.println("deleted messages: " + deleted.value)
      System.out.println("message receipt rate: " + (total.value / interval) + " per second")
      System.out.println("messages this interval: " + data.length)
      System.out.println("message fields varied between: " + minFields + " and " + maxFields)
      System.out.println("fraction with entities is " + (entities / data.length))
    }
  })

 ssc.start()



Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-26 Thread Patrick Wendell
Hey Ankur,

That does seem like a good fix, but right now we are only blocking the
release on major regressions that affect all components. So I don't
think this is sufficient to block it from going forward and cutting a
new candidate. This is because we are in the very late stage of the
release.

We can slot that for the 1.0.1 release and merge it into the 1.0
branch so people can get access to the fix easily.

On Mon, May 26, 2014 at 6:50 PM, ankurdave ankurd...@gmail.com wrote:
 -1

 I just fixed  SPARK-1931 https://issues.apache.org/jira/browse/SPARK-1931
 , which was a critical bug in Graph#partitionBy. Since this is an important
 part of the GraphX API, I think Spark 1.0.0 should include the fix:
 https://github.com/apache/spark/pull/885.



 --
 View this message in context: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-RC11-tp6797p6802.html
 Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Patrick Wendell
+1

I spun up a few EC2 clusters and ran my normal audit checks. Tests
passing, sigs, CHANGES and NOTICE look good

Thanks TD for helping cut this RC!

On Wed, May 28, 2014 at 9:38 PM, Kevin Markey kevin.mar...@oracle.com wrote:
 +1

 Built -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0
 Ran current version of one of my applications on 1-node pseudocluster
 (sorry, unable to test on full cluster).
 yarn-cluster mode
 Ran regression tests.

 Thanks
 Kevin


 On 05/28/2014 09:55 PM, Krishna Sankar wrote:

 +1
  Pulled & built on Mac OS X, EC2 Amazon Linux
 Ran test programs on OS X, 5 node c3.4xlarge cluster
 Cheers
 k/


 On Wed, May 28, 2014 at 7:36 PM, Andy Konwinski
 andykonwin...@gmail.comwrote:

 +1
 On May 28, 2014 7:05 PM, Xiangrui Meng men...@gmail.com wrote:

 +1

 Tested apps with standalone client mode and yarn cluster and client

 modes.

 Xiangrui

 On Wed, May 28, 2014 at 1:07 PM, Sean McNamara
 sean.mcnam...@webtrends.com wrote:

 Pulled down, compiled, and tested examples on OS X and ubuntu.
 Deployed app we are building on spark and poured data through it.

 +1

 Sean


 On May 26, 2014, at 8:39 AM, Tathagata Das 

 tathagata.das1...@gmail.com

 wrote:

 Please vote on releasing the following candidate as Apache Spark

 version 1.0.0!

 This has a few important bug fixes on top of rc10:
 SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
 SPARK-1870: https://github.com/apache/spark/pull/848
 SPARK-1897: https://github.com/apache/spark/pull/849

 The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):


 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a

 The release files, including signatures, digests, etc. can be found

 at:

 http://people.apache.org/~tdas/spark-1.0.0-rc11/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/tdas.asc

 The staging repository for this release can be found at:

 https://repository.apache.org/content/repositories/orgapachespark-1019/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/

 Please vote on releasing this package as Apache Spark 1.0.0!

 The vote is open until Thursday, May 29, at 16:00 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.

 Changes to ML vector specification:


 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10

 Changes to the Java API:


 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

 Changes to the streaming API:


 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

 Changes to the GraphX API:


 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

 Other changes:
 coGroup and related functions now return Iterable[T] instead of Seq[T]
 ==> Call toSeq on the result to restore the old behavior

 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
 ==> Call toSeq on the result to restore old behavior
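
 As a small illustration of the Iterable change above (a sketch, assuming an
 existing SparkContext named sc):

   // In 1.0, grouping operations return Iterable[V] instead of Seq[V].
   val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
   val grouped = pairs.groupByKey()       // RDD[(String, Iterable[Int])]

   // Code written against the old Seq-based signature can call toSeq:
   val asSeq = grouped.mapValues(_.toSeq) // RDD[(String, Seq[Int])]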




Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-29 Thread Patrick Wendell
[tl;dr stable API's are important - sorry, this is slightly meandering]

Hey - just wanted to chime in on this as I was travelling. Sean, you
bring up great points here about the velocity and stability of Spark.
Many projects have fairly customized semantics around what versions
actually mean (HBase is a good, if somewhat hard-to-comprehend,
example).

What the 1.X label means to Spark is that we are willing to guarantee
stability for Spark's core API. This is something that actually, Spark
has been doing for a while already (we've made few or no breaking
changes to the Spark core API for several years) and we want to codify
this for application developers. In this regard Spark has made a bunch
of changes to enforce the integrity of our API's:

- We went through and clearly annotated internal, or experimental
API's. This was a huge project-wide effort and included Scaladoc and
several other components to make it clear to users.
- We implemented automated byte-code verification of all proposed pull
requests that they don't break public API's. Pull requests after 1.0
will fail if they break API's that are not explicitly declared private
or experimental.

I can't possibly emphasize enough the importance of API stability.
What we want to avoid is the Hadoop approach. Candidly, Hadoop does a
poor job on this. There really isn't a well defined stable API for any
of the Hadoop components, for a few reasons:

1. Hadoop projects don't do any rigorous checking that new patches
don't break API's. Of course, this results in regular API breaks and a
poor understanding of what is a public API.
2. In several cases it's not possible to do basic things in Hadoop
without using deprecated or private API's.
3. There is significant vendor fragmentation of API's.

The main focus of the Hadoop vendors is making consistent cuts of the
core projects work together (HDFS/Pig/Hive/etc) - so API breaks are
sometimes considered fixed as long as the other projects work around
them (see [1]). We also regularly need to do archaeology (see [2]) and
directly interact with Hadoop committers to understand what API's are
stable and in which versions.

One goal of Spark is to deal with the pain of inter-operating with
Hadoop so that application writers don't have to. We'd like to retain the
property that if you build an application against the (well defined,
stable) Spark API's right now, you'll be able to run it across many
Hadoop vendors and versions for the entire Spark 1.X release cycle.

Writing apps against Hadoop can be very difficult... consider how much
more engineering effort we spent maintaining YARN support than Mesos
support. There are many factors, but one is that Mesos has a single,
narrow, stable API. We've never had to make a change in Mesos due to
an API change, for several years. YARN on the other hand, there are at
least 3 YARN API's that currently exist, which are all binary
incompatible. We'd like to offer apps the ability to build against
Spark's API and just let us deal with it.

As more vendors package Spark, I'd like to see us put tools in the
upstream Spark repo that do validation for vendor packages of Spark,
so that we don't end up with fragmentation. Of course, vendors can
enhance the API and are encouraged to, but we need a kernel of API's
that vendors must maintain (think POSIX) to be considered compliant
with Apache Spark. I believe some other projects like OpenStack have
done this to avoid fragmentation.

- Patrick

[1] https://issues.apache.org/jira/browse/MAPREDUCE-5830
[2] 
http://2.bp.blogspot.com/-GO6HF0OAFHw/UOfNEH-4sEI/AD0/dEWFFYTRgYw/s1600/output-file.png

On Sun, May 18, 2014 at 2:13 AM, Mridul Muralidharan mri...@gmail.com wrote:
 So I think I need to clarify a few things here - particularly since
 this mail went to the wrong mailing list and a much wider audience
 than I intended it for :-)


 Most of the issues I mentioned are internal implementation details of
 Spark core, which means we can enhance them in the future without
 disruption to our userbase (e.g., the ability to support a large number of
 input/output partitions. Note: this is on the order of 100k input and
 output partitions with a uniform spread of keys - very rarely seen
 outside of some crazy jobs).

 Some of the issues I mentioned would require DeveloperApi changes -
 which are not user exposed: they would impact developer use of these
 api's - which are mostly internally provided by spark. (Like fixing
 blocks > 2G, which would require a change to the Serializer api.)

 A smaller fraction might require interface changes - note, I am
 referring specifically to configuration changes (removing/deprecating
 some) and possibly newer options to submit/env, etc - I don't envision
 any programming api change itself.
 The only api change we did was from Seq -> Iterable - which is
 actually to address some of the issues I mentioned (join/cogroup).

 The remaining ones are bugs which need to be addressed, or the feature
 removed/enhanced (like shuffle consolidation).

 There might be 

Announcing Spark 1.0.0

2014-05-30 Thread Patrick Wendell
I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
is a milestone release as the first in the 1.0 line of releases,
providing API stability for Spark's core interfaces.

Spark 1.0.0 is Spark's largest release ever, with contributions from
117 developers. I'd like to thank everyone involved in this release -
it was truly a community effort with fixes, features, and
optimizations contributed from dozens of organizations.

This release expands Spark's standard libraries, introducing a new SQL
package (SparkSQL) which lets users integrate SQL queries into
existing Spark workflows. MLlib, Spark's machine learning library, is
expanded with sparse vector support and several new algorithms. The
GraphX and Streaming libraries also introduce new features and
optimizations. Spark's core engine adds support for secured YARN
clusters, a unified tool for submitting Spark applications, and
several performance and stability improvements. Finally, Spark adds
support for Java 8 lambda syntax and improves coverage of the Java and
Python API's.

Those features only scratch the surface - check out the release notes here:
http://spark.apache.org/releases/spark-release-1-0-0.html

Note that since release artifacts were posted recently, certain
mirrors may not have working downloads for a few hours.

- Patrick


Re: Streaming example stops outputting (Java, Kafka at least)

2014-05-30 Thread Patrick Wendell
Yeah - Spark streaming needs at least two threads to run. I actually
thought we warned the user if they only use one (@tdas?) but the
warning might not be working correctly - or I'm misremembering.
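
In code that just means giving the local master at least two threads when
constructing the context, e.g. (a sketch; batch interval and app name are
arbitrary):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // "local" gives a single thread, which the receiver occupies entirely;
  // "local[2]" (or more) leaves a thread free to process the received batches.
  val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-test")
  val ssc = new StreamingContext(conf, Seconds(2))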

On Fri, May 30, 2014 at 6:38 AM, Sean Owen so...@cloudera.com wrote:
 Thanks Nan, that does appear to fix it. I was using "local". Can
 anyone say whether that's to be expected or whether it could be a bug
 somewhere?

 On Fri, May 30, 2014 at 2:42 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
 Hi, Sean

 I was in the same problem

 but when I changed MASTER=local to MASTER=local[2]

 everything back to the normal

 Hasn't get a chance to ask here

 Best,

 --
 Nan Zhu



Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-30 Thread Patrick Wendell
Hey guys, thanks for the insights. Also, I realize Hadoop has gotten
way better about this with 2.2+ and I think it's great progress.

We have well defined API levels in Spark and also automated checking
of API violations for new pull requests. When doing code reviews we
always enforce the narrowest possible visibility:

1. private
2. private[spark]
3. @Experimental or @DeveloperApi
4. public

Our automated checks exclude 1-3. Anything that breaks 4 will trigger
a build failure.
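
As a rough Scala illustration of those four levels (the names are made up,
and this assumes the annotation classes in org.apache.spark.annotation):

  package org.apache.spark.example  // hypothetical package, purely illustrative

  import org.apache.spark.annotation.{DeveloperApi, Experimental}

  class Illustration {
    private def internalDetail(): Unit = ()        // 1. not visible outside this class
    private[spark] def sparkInternal(): Unit = ()  // 2. visible only under org.apache.spark
  }

  @DeveloperApi
  class LowLevelExtensionPoint   // 3. public bytecode, but the API may change

  @Experimental
  class NewFeaturePreview        // 3. user-facing feature still being tried out

  class StableUserFacingApi      // 4. covered by the 1.x compatibility guarantee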

The Scala compiler prevents anyone external from using 1 or 2. We do
have bytecode public but annotated (3) API's that we might change.
We spent a lot of time looking into whether these can offer compiler
warnings, but we haven't found a way to do this and do not see a
better alternative at this point.

Regarding Scala compatibility, Scala 2.11+ is source code
compatible, meaning we'll be able to cross-compile Spark for
different versions of Scala. We've already been in touch with Typesafe
about this and they've offered to integrate Spark into their
compatibility test suite. They've also committed to patching 2.11 with
a minor release if bugs are found.

Anyways, my point is we've actually thought a lot about this already.

The CLASSPATH thing is different than API stability, but indeed also a
form of compatibility. This is something where I'd also like to see
Spark have better isolation of user classes from Spark's own
execution...

- Patrick



On Fri, May 30, 2014 at 12:30 PM, Marcelo Vanzin van...@cloudera.com wrote:
 On Fri, May 30, 2014 at 12:05 PM, Colin McCabe cmcc...@alumni.cmu.edu wrote:
 I don't know if Scala provides any mechanisms to do this beyond what Java 
 provides.

 In fact it does. You can say something like private[foo] and the
 annotated element will be visible for all classes under foo (where
 foo is any package in the hierarchy leading up to the class). That's
 used a lot in Spark.

 I haven't fully looked at how the @DeveloperApi is used, but I agree
 with you - annotations are not a good way to do this. The Scala
 feature above would be much better, but it might still leak things at
 the Java bytecode level (don't know how Scala implements it under the
 cover, but I assume it's not by declaring the element as a Java
 private).

 Another thing is that in Scala the default visibility is public, which
 makes it very easy to inadvertently add things to the API. I'd like to
 see more care in making things have the proper visibility - I
 generally declare things private first, and relax that as needed.
 Using @VisibleForTesting would be great too, when the Scala
 private[foo] approach doesn't work.

 Does Spark also expose its CLASSPATH in
 this way to executors?  I was under the impression that it did.

 If you're using the Spark assemblies, yes, there are a lot of things
 that your app gets exposed to. For example, you can see Guava and
 Jetty (and many other things) there. This is something that has always
 bugged me, but I don't really have a good suggestion of how to fix it;
 shading goes a certain way, but it also breaks code that uses
 reflection (e.g. Class.forName()-style class loading).

 What is worse is that Spark doesn't even agree with the Hadoop code it
 depends on; e.g., Spark uses Guava 14.x while Hadoop is still in Guava
 11.x. So when you run your Scala app, what gets loaded?

 At some point we will also have to confront the Scala version issue.  Will
 there be flag days where Spark jobs need to be upgraded to a new,
 incompatible version of Scala to run on the latest Spark?

 Yes, this could be an issue - I'm not sure Scala has a policy towards
 this, but updates (at least minor, e.g. 2.9 -> 2.10) tend to break
 binary compatibility.

 Scala also makes some API updates tricky - e.g., adding a new named
 argument to a Scala method is not a binary compatible change (while,
 e.g., adding a new keyword argument in a python method is just fine).
 The use of implicits and other Scala features make this even more
 opaque...

 Anyway, not really any solutions in this message, just a few comments
 I wanted to throw out there. :-)

 --
 Marcelo


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-30 Thread Patrick Wendell
Spark is a bit different than Hadoop MapReduce, so maybe that's a
source of some confusion. Spark is often used as a substrate for
building different types of analytics applications, so @DeveloperAPI
are internal API's that we'd like to expose to application writers,
but that might be more volatile. This is like the internal API's in
the linux kernel, they aren't stable, but of course we try to minimize
changes to them. If people want to write lower-level modules against
them, that's fine with us, but they know the interfaces might change.

This has worked pretty well over the years, even with many different
companies writing against those API's.

@Experimental are user-facing features we are trying out. Hopefully
that one is more clear.

In terms of making a big jar that shades all of our dependencies - I'm
curious how that would actually work in practice. It would be good to
explore. There are a few potential challenges I see:

1. If any of our dependencies encode class name information in IPC
messages, this would break. E.g. can you definitely shade the Hadoop
client, protobuf, hbase client, etc and have them send messages over
the wire? This could break things if class names are ever encoded in a
wire format.
2. Many libraries like logging subsystems, configuration systems, etc
rely on static state and initialization. I'm not totally sure how e.g.
slf4j initializes itself if you have both a shaded and non-shaded copy
of slf4j present.
3. This would mean the spark-core jar would be really massive because
it would inline all of our deps. We've actually been thinking of
avoiding the current assembly jar approach because, due to scala
specialized classes, our assemblies now have more than 65,000 class
files in them leading to all kinds of bad issues. We'd have to stick
with a big uber assembly-like jar if we decide to shade stuff.
4. I'm not totally sure how this would work when people want to e.g.
build Spark with different Hadoop versions. Would we publish different
shaded uber-jars for every Hadoop version? Would the Hadoop dep just
not be shaded... if so, what about all its dependencies?

Anyways just some things to consider... simplifying our classpath is
definitely an avenue worth exploring!




On Fri, May 30, 2014 at 2:56 PM, Colin McCabe cmcc...@alumni.cmu.edu wrote:
 On Fri, May 30, 2014 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey guys, thanks for the insights. Also, I realize Hadoop has gotten
 way better about this with 2.2+ and I think it's great progress.

 We have well defined API levels in Spark and also automated checking
 of API violations for new pull requests. When doing code reviews we
 always enforce the narrowest possible visibility:

 1. private
 2. private[spark]
 3. @Experimental or @DeveloperApi
 4. public

 Our automated checks exclude 1-3. Anything that breaks 4 will trigger
 a build failure.


 That's really excellent.  Great job.

 I like the private[spark] visibility level-- sounds like this is another
 way Scala has greatly improved on Java.

 The Scala compiler prevents anyone external from using 1 or 2. We do
 have bytecode public but annotated (3) API's that we might change.
 We spent a lot of time looking into whether these can offer compiler
 warnings, but we haven't found a way to do this and do not see a
 better alternative at this point.


 It would be nice if the production build could strip this stuff out.
  Otherwise, it feels a lot like a @private, @unstable Hadoop API... and we
 know how those turned out.


 Regarding Scala compatibility, Scala 2.11+ is source code
 compatible, meaning we'll be able to cross-compile Spark for
 different versions of Scala. We've already been in touch with Typesafe
 about this and they've offered to integrate Spark into their
 compatibility test suite. They've also committed to patching 2.11 with
 a minor release if bugs are found.


 Thanks, I hadn't heard about this plan.  Hopefully we can get everyone on
 2.11 ASAP.


 Anyways, my point is we've actually thought a lot about this already.

 The CLASSPATH thing is different than API stability, but indeed also a
 form of compatibility. This is something where I'd also like to see
 Spark have better isolation of user classes from Spark's own
 execution...


 I think the best thing to do is just shade all the dependencies.  Then
 they will be in a different namespace, and clients can have their own
 versions of whatever dependencies they like without conflicting.  As
 Marcelo mentioned, there might be a few edge cases where this breaks
 reflection, but I don't think that's an issue for most libraries.  So at
 worst case we could end up needing apps to follow us in lockstep for Kryo
 or maybe Akka, but not the whole kit and caboodle like with Hadoop.

 best,
 Colin


 - Patrick



 On Fri, May 30, 2014 at 12:30 PM, Marcelo Vanzin van...@cloudera.com
 wrote:
  On Fri, May 30, 2014 at 12:05 PM, Colin McCabe cmcc...@alumni.cmu.edu
 wrote:
  I don't know if Scala

Re: Unable to execute saveAsTextFile on multi node mesos

2014-05-31 Thread Patrick Wendell
Can you look at the logs from the executor or in the UI? They should
give an exception with the reason for the task failure. Also in the
future, for this type of e-mail please only e-mail the user@ list
and not both lists.

- Patrick

On Sat, May 31, 2014 at 3:22 AM, prabeesh k prabsma...@gmail.com wrote:
 Hi,

 scenario: read data from HDFS, apply a Hive query on it, and write the result
 back to HDFS.

  Schema creation, querying, and saveAsTextFile are working fine in the
 following modes:

 local mode
 mesos cluster with single node
 spark cluster with multi node

 Schema creation and querying are working fine with mesos multi node cluster.
 But  while trying to write back to HDFS using saveAsTextFile, the following
 error occurs

  14/05/30 10:16:35 INFO DAGScheduler: The failed fetch was from Stage 4
 (mapPartitionsWithIndex at Operator.scala:333); marking it for resubmission
 14/05/30 10:16:35 INFO DAGScheduler: Executor lost:
 201405291518-3644595722-5050-17933-1 (epoch 148)

 Let me know your thoughts regarding this.

 Regards,
 prabeesh


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-31 Thread Patrick Wendell
One other consideration popped into my head:

5. Shading our dependencies could mess up our external API's if we
ever return types that are outside of the spark package because we'd
then be returning shaded types that users have to deal with. E.g. where
before we returned an o.a.flume.AvroFlumeEvent we'd have to return a
some.namespace.AvroFlumeEvent. Then users downstream would have to
deal with converting our types into the correct namespace if they want
to inter-operate with other libraries. We generally try to avoid ever
returning types from other libraries, but it would be good to audit
our API's and see if we ever do this.

On Fri, May 30, 2014 at 10:54 PM, Patrick Wendell pwend...@gmail.com wrote:
 Spark is a bit different than Hadoop MapReduce, so maybe that's a
 source of some confusion. Spark is often used as a substrate for
 building different types of analytics applications, so @DeveloperAPI
 are internal API's that we'd like to expose to application writers,
 but that might be more volatile. This is like the internal API's in
 the linux kernel, they aren't stable, but of course we try to minimize
 changes to them. If people want to write lower-level modules against
 them, that's fine with us, but they know the interfaces might change.

 This has worked pretty well over the years, even with many different
 companies writing against those API's.

 @Experimental are user-facing features we are trying out. Hopefully
 that one is more clear.

 In terms of making a big jar that shades all of our dependencies - I'm
 curious how that would actually work in practice. It would be good to
 explore. There are a few potential challenges I see:

 1. If any of our dependencies encode class name information in IPC
 messages, this would break. E.g. can you definitely shade the Hadoop
 client, protobuf, hbase client, etc and have them send messages over
 the wire? This could break things if class names are ever encoded in a
 wire format.
 2. Many libraries like logging subsystems, configuration systems, etc
 rely on static state and initialization. I'm not totally sure how e.g.
 slf4j initializes itself if you have both a shaded and non-shaded copy
 of slf4j present.
 3. This would mean the spark-core jar would be really massive because
 it would inline all of our deps. We've actually been thinking of
 avoiding the current assembly jar approach because, due to scala
 specialized classes, our assemblies now have more than 65,000 class
 files in them leading to all kinds of bad issues. We'd have to stick
 with a big uber assembly-like jar if we decide to shade stuff.
 4. I'm not totally sure how this would work when people want to e.g.
 build Spark with different Hadoop versions. Would we publish different
 shaded uber-jars for every Hadoop version? Would the Hadoop dep just
 not be shaded... if so what about all it's dependencies.

 Anyways just some things to consider... simplifying our classpath is
 definitely an avenue worth exploring!




 On Fri, May 30, 2014 at 2:56 PM, Colin McCabe cmcc...@alumni.cmu.edu wrote:
 On Fri, May 30, 2014 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey guys, thanks for the insights. Also, I realize Hadoop has gotten
 way better about this with 2.2+ and I think it's great progress.

 We have well defined API levels in Spark and also automated checking
 of API violations for new pull requests. When doing code reviews we
 always enforce the narrowest possible visibility:

 1. private
 2. private[spark]
 3. @Experimental or @DeveloperApi
 4. public

 Our automated checks exclude 1-3. Anything that breaks 4 will trigger
 a build failure.


 That's really excellent.  Great job.

 I like the private[spark] visibility level-- sounds like this is another
 way Scala has greatly improved on Java.

 The Scala compiler prevents anyone external from using 1 or 2. We do
 have bytecode public but annotated (3) API's that we might change.
 We spent a lot of time looking into whether these can offer compiler
 warnings, but we haven't found a way to do this and do not see a
 better alternative at this point.


 It would be nice if the production build could strip this stuff out.
  Otherwise, it feels a lot like a @private, @unstable Hadoop API... and we
 know how those turned out.


 Regarding Scala compatibility, Scala 2.11+ is source code
 compatible, meaning we'll be able to cross-compile Spark for
 different versions of Scala. We've already been in touch with Typesafe
 about this and they've offered to integrate Spark into their
 compatibility test suite. They've also committed to patching 2.11 with
 a minor release if bugs are found.


 Thanks, I hadn't heard about this plan.  Hopefully we can get everyone on
 2.11 ASAP.


 Anyways, my point is we've actually thought a lot about this already.

 The CLASSPATH thing is different than API stability, but indeed also a
 form of compatibility. This is something where I'd also like to see
 Spark have better isolation of user classes

Re: SCALA_HOME or SCALA_LIBRARY_PATH not set during build

2014-06-01 Thread Patrick Wendell
This is a false error message actually - the Maven build no longer
requires SCALA_HOME but the message/check was still there. This was
fixed recently in master:

https://github.com/apache/spark/commit/d8c005d5371f81a2a06c5d27c7021e1ae43d7193

I can back port that fix into branch-1.0 so it will be in 1.0.1 as
well. For other people running into this, you can export SCALA_HOME to
any value and it will work.

- Patrick

On Sat, May 31, 2014 at 8:34 PM, Colin McCabe cmcc...@alumni.cmu.edu wrote:
 Spark currently supports two build systems, sbt and maven.  sbt will
 download the correct version of scala, but with Maven you need to supply it
 yourself and set SCALA_HOME.

 It sounds like the instructions need to be updated-- perhaps create a JIRA?

 best,
 Colin


 On Sat, May 31, 2014 at 7:06 PM, Soren Macbeth so...@yieldbot.com wrote:

 Hello,

 Following the instructions for building spark 1.0.0, I encountered the
 following error:

 [ERROR] Failed to execute goal
 org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project
 spark-core_2.10: An Ant BuildException has occured: Please set the
 SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment
 variables and retry.
 [ERROR] around Ant part ...fail message=Please set the SCALA_HOME (or
 SCALA_LIBRARY_PATH if scala is on the path) environment variables and
 retry @ 6:126 in
 /Users/soren/src/spark-1.0.0/core/target/antrun/build-main.xml

  Nowhere in the documentation does it mention that scala needs to be installed
  and either of these env vars set, nor what version should be installed.
 Setting these env vars wasn't required for 0.9.1 with sbt.

 I was able to get past it by downloading the scala 2.10.4 binary package to
 a temp dir and setting SCALA_HOME to that dir.

 Ideally, it would be nice to not have to require people to have a
 standalone scala installation but at a minimum this requirement should be
 documented in the build instructions no?

 -Soren



Re: SCALA_HOME or SCALA_LIBRARY_PATH not set during build

2014-06-01 Thread Patrick Wendell
I went ahead and created a JIRA for this and back ported the
improvement into branch-1.0. This wasn't a regression per-se because
the behavior existed in all previous versions, but it's annoying
behavior so best to fix it.

https://issues.apache.org/jira/browse/SPARK-1984

- Patrick

On Sun, Jun 1, 2014 at 11:13 AM, Patrick Wendell pwend...@gmail.com wrote:
 This is a false error message actually - the Maven build no longer
 requires SCALA_HOME but the message/check was still there. This was
 fixed recently in master:

 https://github.com/apache/spark/commit/d8c005d5371f81a2a06c5d27c7021e1ae43d7193

 I can back port that fix into branch-1.0 so it will be in 1.0.1 as
 well. For other people running into this, you can export SCALA_HOME to
 any value and it will work.

 - Patrick

 On Sat, May 31, 2014 at 8:34 PM, Colin McCabe cmcc...@alumni.cmu.edu wrote:
 Spark currently supports two build systems, sbt and maven.  sbt will
 download the correct version of scala, but with Maven you need to supply it
 yourself and set SCALA_HOME.

 It sounds like the instructions need to be updated-- perhaps create a JIRA?

 best,
 Colin


 On Sat, May 31, 2014 at 7:06 PM, Soren Macbeth so...@yieldbot.com wrote:

 Hello,

 Following the instructions for building spark 1.0.0, I encountered the
 following error:

 [ERROR] Failed to execute goal
 org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project
 spark-core_2.10: An Ant BuildException has occured: Please set the
 SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment
 variables and retry.
 [ERROR] around Ant part ...fail message=Please set the SCALA_HOME (or
 SCALA_LIBRARY_PATH if scala is on the path) environment variables and
 retry @ 6:126 in
 /Users/soren/src/spark-1.0.0/core/target/antrun/build-main.xml

  Nowhere in the documentation does it mention that scala needs to be installed
  and either of these env vars set, nor what version should be installed.
 Setting these env vars wasn't required for 0.9.1 with sbt.

 I was able to get past it by downloading the scala 2.10.4 binary package to
 a temp dir and setting SCALA_HOME to that dir.

 Ideally, it would be nice to not have to require people to have a
 standalone scala installation but at a minimum this requirement should be
 documented in the build instructions no?

 -Soren



Re: Which version does the binary compatibility test against by default?

2014-06-02 Thread Patrick Wendell
Yeah - check out sparkPreviousArtifact in the build:
https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L325

- Patrick

On Mon, Jun 2, 2014 at 5:30 PM, Xiangrui Meng men...@gmail.com wrote:
 Is there a way to specify the target version? -Xiangrui


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-06-04 Thread Patrick Wendell
Received!

On Wed, Jun 4, 2014 at 10:47 AM, Tom Graves
tgraves...@yahoo.com.invalid wrote:
 Testing... Resending as it appears my message didn't go through last week.

 Tom


 On Wednesday, May 28, 2014 4:12 PM, Tom Graves tgraves...@yahoo.com wrote:



 +1. Tested spark on yarn (cluster mode, client mode, pyspark, spark-shell) on 
 hadoop 0.23 and 2.4.

 Tom


 On Wednesday, May 28, 2014 3:07 PM, Sean McNamara 
 sean.mcnam...@webtrends.com wrote:



 Pulled down, compiled, and tested examples on OS X and ubuntu.
 Deployed app we are building on spark and poured data through it.

 +1

 Sean



 On May 26, 2014, at 8:39 AM, Tathagata Das tathagata.das1...@gmail.com 
 wrote:

 Please vote on releasing the following candidate as Apache Spark version 
 1.0.0!

 This has a few important bug fixes on top of rc10:
 SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
 SPARK-1870: https://github.com/apache/spark/pull/848
 SPARK-1897: https://github.com/apache/spark/pull/849

 The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc11/

 Release
  artifacts are signed with the following key:
 https://people.apache.org/keys/committer/tdas.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1019/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/

 Please vote on releasing this package as Apache Spark 1.0.0!

 The vote is open until
  Thursday, May 29, at 16:00 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.

 Changes to ML vector specification:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10

 Changes to the Java API:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

 Changes to the streaming API:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

 Changes to the GraphX API:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

 Other changes:
 coGroup and related functions now return Iterable[T] instead of Seq[T]
 == Call toSeq on the result to restore the old behavior

 SparkContext.jarOfClass returns Option[String] instead of
  Seq[String]
 == Call toSeq on the result to restore old behavior


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-06-04 Thread Patrick Wendell
Hey There,

The best way is to use the v1.0.0 tag:
https://github.com/apache/spark/releases/tag/v1.0.0

- Patrick

On Wed, Jun 4, 2014 at 12:19 PM, Debasish Das debasish.da...@gmail.com wrote:
 Hi Patrick,

 We maintain internal Spark mirror in sync with Spark github master...

 What's the way to get the 1.0.0 stable release from github to deploy on our
 production cluster ? Is there a tag for 1.0.0 that I should use to deploy ?

 Thanks.
 Deb



 On Wed, Jun 4, 2014 at 10:49 AM, Patrick Wendell pwend...@gmail.com wrote:

 Received!

 On Wed, Jun 4, 2014 at 10:47 AM, Tom Graves
 tgraves...@yahoo.com.invalid wrote:
  Testing... Resending as it appears my message didn't go through last
 week.
 
  Tom
 
 
  On Wednesday, May 28, 2014 4:12 PM, Tom Graves tgraves...@yahoo.com
 wrote:
 
 
 
  +1. Tested spark on yarn (cluster mode, client mode, pyspark,
 spark-shell) on hadoop 0.23 and 2.4.
 
  Tom
 
 
  On Wednesday, May 28, 2014 3:07 PM, Sean McNamara
 sean.mcnam...@webtrends.com wrote:
 
 
 
  Pulled down, compiled, and tested examples on OS X and ubuntu.
  Deployed app we are building on spark and poured data through it.
 
  +1
 
  Sean
 
 
 
  On May 26, 2014, at 8:39 AM, Tathagata Das tathagata.das1...@gmail.com
 wrote:
 
  Please vote on releasing the following candidate as Apache Spark
 version 1.0.0!
 
  This has a few important bug fixes on top of rc10:
  SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
  SPARK-1870: https://github.com/apache/spark/pull/848
  SPARK-1897: https://github.com/apache/spark/pull/849
 
  The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~tdas/spark-1.0.0-rc11/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/tdas.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1019/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/
 
  Please vote on releasing this package as Apache Spark 1.0.0!
 
  The vote is open until Thursday, May 29, at 16:00 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.0.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == API Changes ==
  We welcome users to compile Spark applications against 1.0. There are
  a few API changes in this release. Here are links to the associated
  upgrade guides - user facing changes have been kept as small as
  possible.
 
  Changes to ML vector specification:
 
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10
 
  Changes to the Java API:
 
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
 
  Changes to the streaming API:
 
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
 
  Changes to the GraphX API:
 
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
 
  Other changes:
  coGroup and related functions now return Iterable[T] instead of Seq[T]
  ==> Call toSeq on the result to restore the old behavior
 
  SparkContext.jarOfClass returns Option[String] instead of Seq[String]
  ==> Call toSeq on the result to restore old behavior



Re: Announcing Spark 1.0.0

2014-06-04 Thread Patrick Wendell
Hey Rahul,

The v1.0.0 tag is correct. When we release Spark we create multiple
candidates. One of the candidates is promoted to the full release. So
rc11 is also the same as the official v1.0.0 release.

- Patrick

On Wed, Jun 4, 2014 at 8:29 PM, Rahul Singhal rahul.sing...@guavus.com wrote:
 Could someone please clarify my confusion or is this not an issue that we
 should be concerned about?

 Thanks,
 Rahul Singhal





 On 30/05/14 5:28 PM, Rahul Singhal rahul.sing...@guavus.com wrote:

Is it intentional/ok that the tag v1.0.0 is behind tag v1.0.0-rc11?


Thanks,
Rahul Singhal





On 30/05/14 3:43 PM, Patrick Wendell pwend...@gmail.com wrote:

I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
is a milestone release as the first in the 1.0 line of releases,
providing API stability for Spark's core interfaces.

Spark 1.0.0 is Spark's largest release ever, with contributions from
117 developers. I'd like to thank everyone involved in this release -
it was truly a community effort with fixes, features, and
optimizations contributed from dozens of organizations.

This release expands Spark's standard libraries, introducing a new SQL
package (SparkSQL) which lets users integrate SQL queries into
existing Spark workflows. MLlib, Spark's machine learning library, is
expanded with sparse vector support and several new algorithms. The
GraphX and Streaming libraries also introduce new features and
optimizations. Spark's core engine adds support for secured YARN
clusters, a unified tool for submitting Spark applications, and
several performance and stability improvements. Finally, Spark adds
support for Java 8 lambda syntax and improves coverage of the Java and
Python API's.

Those features only scratch the surface - check out the release notes
here:
http://spark.apache.org/releases/spark-release-1-0-0.html

Note that since release artifacts were posted recently, certain
mirrors may not have working downloads for a few hours.

- Patrick




MIMA Compatibility Checks

2014-06-08 Thread Patrick Wendell
Hey All,

Some people may have noticed PR failures due to binary compatibility
checks. We've had these enabled in several of the sub-modules since
the 0.9.0 release, but we've turned them on in Spark core post-1.0.0,
which has much higher churn.

The checks are based on the migration manager tool from Typesafe.
One issue is that the tool doesn't support package-private declarations of
classes or methods. Prashant Sharma has built instrumentation that
adds partial support for package-privacy (via a workaround) but since
there isn't really native support for this in MIMA we are still
finding cases in which we trigger false positives.

In the next week or two we'll make it a priority to handle more of
these false-positive cases. In the meantime, users can add manual
excludes to:

project/MimaExcludes.scala

to avoid triggering warnings for certain issues.
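
For reference, a manual exclude has roughly the following shape; this is a
sketch of the general MIMA filter syntax, and the filter types and fully
qualified names below are illustrative placeholders rather than real
exclusions from the Spark build:

  // project/MimaExcludes.scala (sketch only, not the file's exact structure)
  import com.typesafe.tools.mima.core._

  object MimaExcludes {
    val excludes = Seq(
      ProblemFilters.exclude[MissingMethodProblem](
        "org.apache.spark.somepackage.SomeClass.somePackagePrivateMethod"),
      ProblemFilters.exclude[IncompatibleResultTypeProblem](
        "org.apache.spark.somepackage.OtherClass.someMethod")
    )
  }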

This is definitely annoying - sorry about that. Unfortunately we are
the first open source Scala project to ever do this, so we are dealing
with uncharted territory.

Longer term I'd actually like to see us write our own sbt-based
tool to do this in a better way (we've had trouble trying to extend
MIMA itself; e.g. it contains copy-pasted code from an old version of
the Scala compiler). If someone in the community is a Scala fan and
wants to take that on, I'm happy to give more details.

- Patrick


Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-08 Thread Patrick Wendell
Paul,

Could you give the version of Java that you are building with and the
version of Java you are running with? Are they the same?

Just off the cuff, I wonder if this is related to:
https://issues.apache.org/jira/browse/SPARK-1520

If it is, it could appear that certain functions are not in the jar
because they go beyond the extended zip boundary `jar tvf` won't list
them.

- Patrick

On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown p...@mult.ifario.us wrote:
 Moving over to the dev list, as this isn't a user-scope issue.

 I just ran into this issue with the missing saveAsTextFile, and here's a
 little additional information:

 - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases.
 - Driver built as an uberjar via Maven.
 - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with
 Spark 1.0.0-hadoop1 downloaded from Apache.

 Given that it functions correctly in local mode but not in a standalone
 cluster, this suggests to me that the issue is in a difference between the
 Maven version and the hadoop1 version.

 In the spirit of taking the computer at its word, we can just have a look
 in the JAR files.  Here's what's in the Maven dep as of 1.0.0:

 jar tvf
 ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
 | grep 'rdd/RDD' | grep 'saveAs'
   1519 Mon May 26 13:57:58 PDT 2014
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
   1560 Mon May 26 13:57:58 PDT 2014
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class


 And here's what's in the hadoop1 distribution:

 jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'


 I.e., it's not there.  It is in the hadoop2 distribution:

 jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
   1519 Mon May 26 07:29:54 PDT 2014
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
   1560 Mon May 26 07:29:54 PDT 2014
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class


 So something's clearly broken with the way that the distribution assemblies
 are created.

 FWIW and IMHO, the right way to publish the hadoop1 and hadoop2 flavors
 of Spark to Maven Central would be as *entirely different* artifacts
 (spark-core-h1, spark-core-h2).
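
 Under that proposal a downstream sbt build would pick the flavor explicitly,
 something like the following - note that these artifact ids are hypothetical
 (they are the suggestion above, not artifacts that exist today):

   // build.sbt sketch with hypothetical artifact ids
   libraryDependencies += "org.apache.spark" %% "spark-core-h1" % "1.0.0"
   // ...or, for the hadoop2 flavor:
   // libraryDependencies += "org.apache.spark" %% "spark-core-h2" % "1.0.0"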

 Logged as SPARK-2075 https://issues.apache.org/jira/browse/SPARK-2075.

 Cheers.
 -- Paul



 --
 p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


 On Fri, Jun 6, 2014 at 2:45 AM, HenriV henri.vanh...@vdab.be wrote:

 I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0.
 I'm using Google Compute Engine and Cloud Storage, but saveAsTextFile is
 returning errors whether saving to the cloud or saving locally. When I start a
 job in the cluster I do get an error, but after this error it keeps on
 running fine until the saveAsTextFile. (I don't know if the two are
 connected)

 ---Error at job startup---
  ERROR metrics.MetricsSystem: Sink class
 org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized
 java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
 at

 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at

 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at

 org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136)
 at

 org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130)
 at
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
 at
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
 at
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
 at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
 at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
 at

 org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130)
 at
 org.apache.spark.metrics.MetricsSystem.init(MetricsSystem.scala:84)
 at

 org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230)
 at org.apache.spark.SparkContext.init(SparkContext.scala:202)
 at Hello$.main(Hello.scala:101)
 at Hello.main(Hello.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at sbt.Run.invokeMain(Run.scala:72)
 at sbt.Run.run0(Run.scala:65)
 at 

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-08 Thread Patrick Wendell
Also I should add - thanks for taking time to help narrow this down!

On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell pwend...@gmail.com wrote:
 Paul,

 Could you give the version of Java that you are building with and the
 version of Java you are running with? Are they the same?

 Just off the cuff, I wonder if this is related to:
 https://issues.apache.org/jira/browse/SPARK-1520

 If it is, it could appear that certain functions are not in the jar
 because they go beyond the extended zip boundary `jar tvf` won't list
 them.

 - Patrick

 On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown p...@mult.ifario.us wrote:
 Moving over to the dev list, as this isn't a user-scope issue.

  I just ran into this issue with the missing saveAsTextFile, and here's a
 little additional information:

 - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases.
 - Driver built as an uberjar via Maven.
 - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with
 Spark 1.0.0-hadoop1 downloaded from Apache.

 Given that it functions correctly in local mode but not in a standalone
 cluster, this suggests to me that the issue is in a difference between the
 Maven version and the hadoop1 version.

 In the spirit of taking the computer at its word, we can just have a look
 in the JAR files.  Here's what's in the Maven dep as of 1.0.0:

 jar tvf
 ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
 | grep 'rdd/RDD' | grep 'saveAs'
   1519 Mon May 26 13:57:58 PDT 2014
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
   1560 Mon May 26 13:57:58 PDT 2014
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class


 And here's what's in the hadoop1 distribution:

 jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'


 I.e., it's not there.  It is in the hadoop2 distribution:

 jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
   1519 Mon May 26 07:29:54 PDT 2014
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
   1560 Mon May 26 07:29:54 PDT 2014
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class


 So something's clearly broken with the way that the distribution assemblies
 are created.

 FWIW and IMHO, the right way to publish the hadoop1 and hadoop2 flavors
 of Spark to Maven Central would be as *entirely different* artifacts
 (spark-core-h1, spark-core-h2).

 Logged as SPARK-2075 https://issues.apache.org/jira/browse/SPARK-2075.

 Cheers.
 -- Paul



 --
 p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


 On Fri, Jun 6, 2014 at 2:45 AM, HenriV henri.vanh...@vdab.be wrote:

 I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0.
  I'm using Google Compute Engine and Cloud Storage, but saveAsTextFile is
  returning errors whether saving to the cloud or saving locally. When I start a
  job in the cluster I do get an error, but after this error it keeps on
  running fine until the saveAsTextFile. (I don't know if the two are
  connected)

 ---Error at job startup---
  ERROR metrics.MetricsSystem: Sink class
 org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized
 java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
 at

 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at

 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at

 org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136)
 at

 org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130)
 at
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
 at
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
 at
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
 at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
 at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
 at

 org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130)
 at
 org.apache.spark.metrics.MetricsSystem.init(MetricsSystem.scala:84)
 at

 org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230)
 at org.apache.spark.SparkContext.init(SparkContext.scala:202)
 at Hello$.main(Hello.scala:101)
 at Hello.main(Hello.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-08 Thread Patrick Wendell
Okay I think I've isolated this a bit more. Let's discuss over on the JIRA:

https://issues.apache.org/jira/browse/SPARK-2075

On Sun, Jun 8, 2014 at 1:16 PM, Paul Brown p...@mult.ifario.us wrote:

 Hi, Patrick --

 Java 7 on the development machines:

 » java -version
 java version "1.7.0_51"
 Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)


 And on the deployed boxes:

 $ java -version
 java version "1.7.0_55"
 OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)
 OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)


 Also, unzip -l in place of jar tvf gives the same results, so I don't
 think it's an issue with jar not reporting the files.  Also, the classes do
 get correctly packaged into the uberjar:

 unzip -l /target/[deleted]-driver.jar | grep 'rdd/RDD' | grep 'saveAs'
  1519  06-08-14 12:05
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
  1560  06-08-14 12:05
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class


 Best.
 -- Paul

 —
 p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


 On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell pwend...@gmail.com wrote:

 Paul,

 Could you give the version of Java that you are building with and the
 version of Java you are running with? Are they the same?

 Just off the cuff, I wonder if this is related to:
 https://issues.apache.org/jira/browse/SPARK-1520

 If it is, it could appear that certain functions are not in the jar
 because they go beyond the extended zip boundary `jar tvf` won't list
 them.

 - Patrick

 On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown p...@mult.ifario.us wrote:
  Moving over to the dev list, as this isn't a user-scope issue.
 
  I just ran into this issue with the missing saveAsTextFile, and here's a
  little additional information:
 
  - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases.
  - Driver built as an uberjar via Maven.
  - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with
  Spark 1.0.0-hadoop1 downloaded from Apache.
 
  Given that it functions correctly in local mode but not in a standalone
  cluster, this suggests to me that the issue is in a difference between
  the
  Maven version and the hadoop1 version.
 
  In the spirit of taking the computer at its word, we can just have a
  look
  in the JAR files.  Here's what's in the Maven dep as of 1.0.0:
 
  jar tvf
 
  ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
  | grep 'rdd/RDD' | grep 'saveAs'
1519 Mon May 26 13:57:58 PDT 2014
  org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
1560 Mon May 26 13:57:58 PDT 2014
  org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
 
 
  And here's what's in the hadoop1 distribution:
 
  jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep
  'saveAs'
 
 
  I.e., it's not there.  It is in the hadoop2 distribution:
 
  jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep
  'saveAs'
1519 Mon May 26 07:29:54 PDT 2014
  org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
1560 Mon May 26 07:29:54 PDT 2014
  org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
 
 
  So something's clearly broken with the way that the distribution
  assemblies
  are created.
 
  FWIW and IMHO, the right way to publish the hadoop1 and hadoop2
  flavors
  of Spark to Maven Central would be as *entirely different* artifacts
  (spark-core-h1, spark-core-h2).
 
  Logged as SPARK-2075 https://issues.apache.org/jira/browse/SPARK-2075.
 
  Cheers.
  -- Paul
 
 
 
  --
  p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
 
 
  On Fri, Jun 6, 2014 at 2:45 AM, HenriV henri.vanh...@vdab.be wrote:
 
  I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0.
  I'm using Google Compute Engine and Cloud Storage, but saveAsTextFile is
  returning errors whether saving to the cloud or saving locally. When I
  start a
  job in the cluster I do get an error, but after this error it keeps on
  running fine until the saveAsTextFile. (I don't know if the two are
  connected)
 
  ---Error at job startup---
   ERROR metrics.MetricsSystem: Sink class
  org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized
  java.lang.reflect.InvocationTargetException
  at
  sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
  Method)
  at
 
 
  sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at
 
 
  sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at
  java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at
 
 
  org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136)
  at
 
 
  org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130

Emergency maintenance on jenkins

2014-06-09 Thread Patrick Wendell
Just a heads up - due to an outage at UCB we've lost several of the
Jenkins slaves. I'm trying to spin up new slaves on EC2 in order to
compensate, but this might fail some ongoing builds.

The good news is if we do get it working with EC2 workers, then we
will have burst capability in the future - e.g. on release deadlines.
So it's not all bad!

- Patrick


Re: Emergency maintenance on jenkins

2014-06-10 Thread Patrick Wendell
No luck with this tonight - unfortunately our Python tests aren't
working well with Python 2.6 and some other issues made it hard to get
the EC2 worker up to speed. Hopefully we can have this up and running
tomorrow.

- Patrick

On Mon, Jun 9, 2014 at 10:17 PM, Patrick Wendell pwend...@gmail.com wrote:
 Just a heads up - due to an outage at UCB we've lost several of the
 Jenkins slaves. I'm trying to spin up new slaves on EC2 in order to
 compensate, but this might fail some ongoing builds.

 The good news is if we do get it working with EC2 workers, then we
 will have burst capability in the future - e.g. on release deadlines.
 So it's not all bad!

 - Patrick


Re: Emergency maintenance on jenkins

2014-06-10 Thread Patrick Wendell
Hey just to update people - as of around 1pm PT we were back up and
running with Jenkins slaves on EC2. Sorry about the disruption.

- Patrick

On Tue, Jun 10, 2014 at 1:15 AM, Patrick Wendell pwend...@gmail.com wrote:
 No luck with this tonight - unfortunately our Python tests aren't
 working well with Python 2.6 and some other issues made it hard to get
 the EC2 worker up to speed. Hopefully we can have this up and running
 tomorrow.

 - Patrick

 On Mon, Jun 9, 2014 at 10:17 PM, Patrick Wendell pwend...@gmail.com wrote:
 Just a heads up - due to an outage at UCB we've lost several of the
 Jenkins slaves. I'm trying to spin up new slaves on EC2 in order to
 compensate, but this might fail some ongoing builds.

 The good news is if we do get it working with EC2 workers, then we
 will have burst capability in the future - e.g. on release deadlines.
 So it's not all bad!

 - Patrick


Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-17 Thread Patrick Wendell
Out of curiosity - are you guys using speculation, shuffle
consolidation, or any other non-default option? If so that would help
narrow down what's causing this corruption.
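
For anyone cross-checking their own job setup, these are the config keys in
question; a minimal sketch, with values shown only to illustrate the
non-default settings being asked about:

  val conf = new org.apache.spark.SparkConf()
    .set("spark.speculation", "true")              // speculative execution, off by default
    .set("spark.shuffle.consolidateFiles", "true") // shuffle file consolidation, off by default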

On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
 Matt/Ryan,

 Did you make any headway on this? My team is running into this also.
 Doesn't happen on smaller datasets. Our input set is about 10 GB but we
 generate 100s of GBs in the flow itself.

 -Suren




 On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton compton.r...@gmail.com wrote:

 Just ran into this today myself. I'm on branch-1.0 using a CDH3
 cluster (no modifications to Spark or its dependencies). The error
 appeared trying to run GraphX's .connectedComponents() on a ~200GB
 edge list (GraphX worked beautifully on smaller data).

 Here's the stacktrace (it's quite similar to yours
 https://imgur.com/7iBA4nJ ).

 14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed
 4 times; aborting job
 14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at
 VertexRDD.scala:100
 Exception in thread "main" org.apache.spark.SparkException: Job
 aborted due to stage failure: Task 5.599:39 failed 4 times, most
 recent failure: Exception failure in TID 29735 on host node18:
 java.io.StreamCorruptedException: invalid type code: AC
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)

 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)

 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)

 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192)

 org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78)

 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)

 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)

 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 org.apache.spark.scheduler.Task.run(Task.scala:51)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662)
 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-18 Thread Patrick Wendell
Just wondering, do you get this particular exception if you are not
consolidating shuffle data?

On Wed, Jun 18, 2014 at 12:15 PM, Mridul Muralidharan mri...@gmail.com wrote:
 On Wed, Jun 18, 2014 at 6:19 PM, Surendranauth Hiraman
 suren.hira...@velos.io wrote:
 Patrick,

 My team is using shuffle consolidation but not speculation. We are also
 using persist(DISK_ONLY) for caching.


 Use of shuffle consolidation is probably what is causing the issue.
 It would be a good idea to try again with it turned off (which is the default).

 It should get fixed most likely in 1.1 timeframe.


 Regards,
 Mridul



 Here are some config changes that are in our work-in-progress.

 We've been trying for 2 weeks to get our production flow (maybe around
 50-70 stages, a few forks and joins with up to 20 branches in the forks) to
 run end to end without any success, running into other problems besides
 this one as well. For example, we have run into situations where saving to
 HDFS just hangs on a couple of tasks, which are printing out nothing in
 their logs and not taking any CPU. For testing, our input data is 10 GB
 across 320 input splits and generates maybe around 200-300 GB of
 intermediate and final data.


 conf.set("spark.executor.memory", "14g") // TODO make this configurable

 // shuffle configs
 conf.set("spark.default.parallelism", "320") // TODO make this configurable
 conf.set("spark.shuffle.consolidateFiles", "true")

 conf.set("spark.shuffle.file.buffer.kb", "200")
 conf.set("spark.reducer.maxMbInFlight", "96")

 conf.set("spark.rdd.compress", "true")

 // we ran into a problem with the default timeout of 60 seconds
 // this is also being set in the master's spark-env.sh. Not sure if
 // it needs to be in both places
 conf.set("spark.worker.timeout", "180")

 // akka settings
 conf.set("spark.akka.threads", "300")
 conf.set("spark.akka.timeout", "180")
 conf.set("spark.akka.frameSize", "100")
 conf.set("spark.akka.batchSize", "30")
 conf.set("spark.akka.askTimeout", "30")

 // block manager
 conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
 conf.set("spark.blockManagerHeartBeatMs", "8")

 -Suren



 On Wed, Jun 18, 2014 at 1:42 AM, Patrick Wendell pwend...@gmail.com wrote:

 Out of curiosity - are you guys using speculation, shuffle
 consolidation, or any other non-default option? If so that would help
 narrow down what's causing this corruption.

 On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman
 suren.hira...@velos.io wrote:
  Matt/Ryan,
 
  Did you make any headway on this? My team is running into this also.
  Doesn't happen on smaller datasets. Our input set is about 10 GB but we
  generate 100s of GBs in the flow itself.
 
  -Suren
 
 
 
 
  On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton compton.r...@gmail.com
 wrote:
 
  Just ran into this today myself. I'm on branch-1.0 using a CDH3
  cluster (no modifications to Spark or its dependencies). The error
  appeared trying to run GraphX's .connectedComponents() on a ~200GB
  edge list (GraphX worked beautifully on smaller data).
 
  Here's the stacktrace (it's quite similar to yours
  https://imgur.com/7iBA4nJ ).
 
  14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed
  4 times; aborting job
  14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at
  VertexRDD.scala:100
  Exception in thread "main" org.apache.spark.SparkException: Job
  aborted due to stage failure: Task 5.599:39 failed 4 times, most
  recent failure: Exception failure in TID 29735 on host node18:
  java.io.StreamCorruptedException: invalid type code: AC
 
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355)
  java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
 
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 
 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
 
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  scala.collection.Iterator$class.foreach(Iterator.scala:727)
  scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192)
 
 
 org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78)
 
 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
 
 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73

Re: Scala examples for Spark do not work as written in documentation

2014-06-20 Thread Patrick Wendell
Those are pretty old - but I think the reason Matei did that was to
make it less confusing for brand new users. `spark` is actually a
valid identifier because it's just a variable name (val spark = new
SparkContext()) but I agree this could be confusing for users who want
to drop into the shell.

On Fri, Jun 20, 2014 at 12:04 PM, Will Benton wi...@redhat.com wrote:
 Hey, sorry to reanimate this thread, but just a quick question: why do the
 examples (on http://spark.apache.org/examples.html) use "spark" for the
 SparkContext reference? This is minor, but it seems like it could be a
 little confusing for people who want to run them in the shell and need to
 change "spark" to "sc". (I noticed because this was a speedbump for a
 colleague who is trying out Spark.)


 thanks,
 wb

 - Original Message -
 From: Andy Konwinski andykonwin...@gmail.com
 To: dev@spark.apache.org
 Sent: Tuesday, May 20, 2014 4:06:33 PM
 Subject: Re: Scala examples for Spark do not work as written in documentation

 I fixed the bug, but I kept the parameter i instead of _ since that (1)
 keeps it more parallel to the python and java versions which also use
 functions with a named variable and (2) doesn't require readers to know
 this particular use of the _ syntax in Scala.

 Thanks for catching this Glenn.

 Andy


 On Fri, May 16, 2014 at 12:38 PM, Mark Hamstra
 m...@clearstorydata.comwrote:

  Sorry, looks like an extra line got inserted in there.  One more try:
 
   val count = spark.parallelize(1 to NUM_SAMPLES).map { _ =>
     val x = Math.random()
     val y = Math.random()
     if (x*x + y*y < 1) 1 else 0
   }.reduce(_ + _)
 
 
 
  On Fri, May 16, 2014 at 12:36 PM, Mark Hamstra m...@clearstorydata.com
  wrote:
 
   Actually, the better way to write the multi-line closure would be:
  
    val count = spark.parallelize(1 to NUM_SAMPLES).map { _ =>
      val x = Math.random()
      val y = Math.random()
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
  
  
   On Fri, May 16, 2014 at 9:41 AM, GlennStrycker glenn.stryc...@gmail.com
  wrote:
  
   On the webpage http://spark.apache.org/examples.html, there is an
  example
   written as
  
    val count = spark.parallelize(1 to NUM_SAMPLES).map(i =>
      val x = Math.random()
      val y = Math.random()
      if (x*x + y*y < 1) 1 else 0
    ).reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)
  
   This does not execute in Spark, which gives me an error:
    <console>:2: error: illegal start of simple expression
val x = Math.random()
^
  
   If I rewrite the query slightly, adding in {}, it works:
  
    val count = spark.parallelize(1 to 1).map(i =>
      {
        val x = Math.random()
        val y = Math.random()
        if (x*x + y*y < 1) 1 else 0
      }
    ).reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / 1.0)
  
  
  
  
  
   --
   View this message in context:
  
  http://apache-spark-developers-list.1001551.n3.nabble.com/Scala-examples-for-Spark-do-not-work-as-written-in-documentation-tp6593.html
   Sent from the Apache Spark Developers List mailing list archive at
   Nabble.com.
  
  
  
 



Assorted project updates (tests, build, etc)

2014-06-22 Thread Patrick Wendell
Hey All,

1. The original test infrastructure hosted by the AMPLab has been
fully restored and also expanded with many more executor slots for
tests. Thanks to Matt Massie at the Amplab for helping with this.

2. We now have a nightly build matrix across different Hadoop
versions. It appears that the Maven build is failing tests with some
of the newer Hadoop versions. If people from the community are
interested, diagnosing and fixing test issues would be welcome patches
(they are all dependency related).

https://issues.apache.org/jira/browse/SPARK-2232

3. Prashant Sharma has spent a lot of time to make it possible for our
sbt build to read dependencies from Maven. This will save us a huge
amount of headache keeping the builds consistent. I just wanted to
give a heads up to users about this - we should retain compatibility
with features of the sbt build, but if you are e.g. hooking into deep
internals of our build it may affect you. I'm hoping this can be
updated and merged in the next week:

https://github.com/apache/spark/pull/77

4. We've moved most of the documentation over to recommending users
build with Maven when creating official packages. This is just to
provide a single reference build of Spark since it's the one we test
and package for releases, we make sure all recursive dependencies are
correct, etc. I'd recommend that all downstream packagers use this
build.

For day-to-day development I imagine sbt will remain more popular
(repl, incremental builds, etc). Prashant's work allows us to get the
best of both worlds which is great.

- Patrick


[VOTE] Release Apache Spark 1.0.1 (RC1)

2014-06-26 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.0.1!

The tag to be voted on is v1.0.1-rc1 (commit 7feeda3):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7feeda3d729f9397aa15ee8750c01ef5aa601962

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.0.1-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1020/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.1-rc1-docs/

Please vote on releasing this package as Apache Spark 1.0.1!

The vote is open until Monday, June 30, at 03:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

=== About this release ===
This release fixes a few high-priority bugs in 1.0 and has a variety
of smaller fixes. The full list is here: http://s.apache.org/b45. Some
of the more visible patches are:

SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size.
SPARK-1790: Support r3 instance types on EC2.

This is the first maintenance release on the 1.0 line. We plan to make
additional maintenance releases as new fixes come in.

- Patrick


Re: Errors from Sbt Test

2014-07-01 Thread Patrick Wendell
Do those also happen if you run other hadoop versions (e.g. try 1.0.4)?

On Tue, Jul 1, 2014 at 1:00 AM, Taka Shinagawa taka.epsi...@gmail.com wrote:
 Since Spark 1.0.0, I've been seeing multiple errors when running sbt test.

 I ran the following commands from Spark 1.0.1 RC1 on Mac OSX 10.9.2.

 $ sbt/sbt clean
 $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
 $ sbt/sbt test


 I'm attaching the log file generated by the sbt test.

 Here's the summary part of the test.

 [info] Run completed in 30 minutes, 57 seconds.
 [info] Total number of tests run: 605
 [info] Suites: completed 83, aborted 0
 [info] Tests: succeeded 600, failed 5, canceled 0, ignored 5, pending 0
 [info] *** 5 TESTS FAILED ***
 [error] Failed: Total 653, Failed 5, Errors 0, Passed 648, Ignored 5
 [error] Failed tests:
 [error] org.apache.spark.ShuffleNettySuite
 [error] org.apache.spark.ShuffleSuite
 [error] org.apache.spark.FileServerSuite
 [error] org.apache.spark.DistributedSuite
 [error] (core/test:test) sbt.TestsFailedException: Tests unsuccessful
 [error] Total time: 2033 s, completed Jul 1, 2014 12:08:03 AM

 Is anyone else seeing errors like this?


 Thanks,
 Taka


Re: Eliminate copy while sending data : any Akka experts here ?

2014-07-01 Thread Patrick Wendell
Yeah I created a JIRA a while back to piggy-back the map status info
on top of the task (I honestly think it will be a small change). There
isn't a good reason to broadcast the entire array and it can be an
issue during large shuffles.

- Patrick

On Mon, Jun 30, 2014 at 7:58 PM, Aaron Davidson ilike...@gmail.com wrote:
 I don't know of any way to avoid Akka doing a copy, but I would like to
 mention that it's on the priority list to piggy-back only the map statuses
 relevant to a particular map task on the task itself, thus reducing the
 total amount of data sent over the wire by a factor of N for N physical
 machines in your cluster. Ideally we would also avoid Akka entirely when
 sending the tasks, as these can get somewhat large and Akka doesn't work
 well with large messages.

 Do note that your solution of using broadcast to send the map tasks is very
 similar to how the executor returns the result of a task when it's too big
 for akka. We were thinking of refactoring this too, as using the block
 manager has much higher latency than a direct TCP send.


 On Mon, Jun 30, 2014 at 12:13 PM, Mridul Muralidharan mri...@gmail.com
 wrote:

 Our current hack is to use Broadcast variables when serialized
 statuses are above some (configurable) size, and have the workers
 directly pull them from the master.
 This is a workaround, so it would be great if there were a
 better/principled solution.

 Please note that the responses are going to different workers
 requesting for the output statuses for shuffle (after map) - so not
 sure if back pressure buffers, etc would help.


 Regards,
 Mridul


 On Mon, Jun 30, 2014 at 11:07 PM, Mridul Muralidharan mri...@gmail.com
 wrote:
  Hi,
 
While sending map output tracker result, the same serialized byte
  array is sent multiple times - but the akka implementation copies it
  to a private byte array within ByteString for each send.
  Caching a ByteString instead of Array[Byte] did not help, since akka
  does not special-case ByteString: it serializes the
  ByteString and copies the result out to an array before creating a
  ByteString out of it (serializing an Array[Byte] thankfully just
  returns the same array, so there is only one copy).
 
 
  Given the need to send immutable data large number of times, is there
  any way to do it in akka without copying internally in akka ?
 
 
  To see how expensive it is: for 200 nodes with a large number of
  mappers and reducers, the status becomes something like 30 mb for us -
  and pulling this about 200 to 300 times results in OOM due to the
  large number of copies sent out.
 
 
  Thanks,
  Mridul



Re: Eliminate copy while sending data : any Akka experts here ?

2014-07-01 Thread Patrick Wendell
 b) Instead of pulling this information, push it to executors as part
 of task submission. (What Patrick mentioned ?)
 (1) a.1 from above is still an issue for this.

I don't understand what problem a.1 is. In this case, we don't need to do
caching, right?

 (2) Serialized task size is also a concern : we have already seen
 users hitting akka limits for task size - this will be an additional
 vector which might exacerbate it.

This would add only a small, constant amount of data to each task. It's
strictly better than before. Previously, if the map output status array was
size M x R, we sent a single akka message of size M x R to every
node... this basically scales quadratically with the size of the RDD. The
new approach is constant per task... it's much better. And the total amount of
data sent over the wire is likely much less.
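
As a rough back-of-envelope version of that argument (all numbers are invented
for illustration, and the per-entry cost is an assumption, not a measurement):

  val m = 10000L           // map tasks
  val r = 1000L            // reduce tasks
  val bytesPerEntry = 4L   // assumed size of one block-size entry, illustrative only
  val fullArray    = m * r * bytesPerEntry  // pulled by each requesting executor today (~40 MB here)
  val perTaskSlice = m * bytesPerEntry      // one entry per map output a task reads (~40 KB here)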

- Patrick


[RESULT] [VOTE] Release Apache Spark 1.0.1 (RC1)

2014-07-04 Thread Patrick Wendell
This vote is cancelled in favor of RC2. Thanks to everyone who voted.

On Sun, Jun 29, 2014 at 11:23 PM, Andrew Ash and...@andrewash.com wrote:
 Ok that's reasonable -- it's certainly more of an enhancement than a
 critical bug-fix.  I would like to get this in for 1.1.0 though, so let's
 talk through the right way to do that on the PR.

 In the meantime the best alternative is running with lax firewall settings,
 which can be somewhat mitigated by modifying the ephemeral port range.

 Thanks!
 Andrew


 On Sun, Jun 29, 2014 at 11:14 PM, Reynold Xin r...@databricks.com wrote:

 Hi Andrew,

 The port stuff is great to have, but they are pretty big changes to the
 core that are introducing new features and are not exactly fixing important
 bugs. For this reason, it probably can't block a release (I'm not even sure
 if it should go into a maintenance release where we fix critical bugs for
 Spark core).

 We should definitely include them for 1.1.0 though (~Aug).




 On Sun, Jun 29, 2014 at 11:09 PM, Andrew Ash and...@andrewash.com wrote:

  Thanks for helping shepherd the voting on 1.0.1 Patrick.
 
  I'd like to call attention to
  https://issues.apache.org/jira/browse/SPARK-2157 and
  https://github.com/apache/spark/pull/1107 -- Ability to write tight
  firewall rules for Spark
 
  I'm currently unable to run Spark on some projects because our cloud ops
  team is uncomfortable with the firewall situation around Spark at the
  moment.  Currently Spark starts listening on random ephemeral ports and
  does server to server communication on them.  This keeps the team from
  writing tight firewall rules between the services -- they get real queasy
  when asked to open inbound connections to the entire ephemeral port range
  of a cluster.  We can tighten the size of the ephemeral range using
 kernel
  settings to mitigate the issue, but it doesn't actually solve the
 problem.
 
  The PR above aims to make every listening port on JVMs in a Spark
  standalone cluster configurable with an option.  If not set, the current
  behavior stands (start listening on an ephemeral port).  Is this
 something
  the Spark team would consider merging into 1.0.1?
 
  Thanks!
  Andrew
 
 
 
  On Sun, Jun 29, 2014 at 10:54 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
   Hey All,
  
   We're going to move onto another rc because of this vote.
   Unfortunately with the summit activities I haven't been able to usher
   in the necessary patches and cut the RC. I will do so as soon as
   possible and we can commence official voting.
  
   - Patrick
  
   On Sun, Jun 29, 2014 at 4:56 PM, Reynold Xin r...@databricks.com
  wrote:
We should make sure we include the following two patches:
   
https://github.com/apache/spark/pull/1264
   
https://github.com/apache/spark/pull/1263
   
   
   
   
On Fri, Jun 27, 2014 at 8:39 PM, Krishna Sankar ksanka...@gmail.com
 
   wrote:
   
+1
Compiled for CentOS 6.5, deployed in our 4 node cluster (Hadoop 2.2,
   YARN)
Smoke Tests (sparkPi,spark-shell, web UI) successful
   
Cheers
k/
   
   
On Thu, Jun 26, 2014 at 7:06 PM, Patrick Wendell 
 pwend...@gmail.com
wrote:
   
 Please vote on releasing the following candidate as Apache Spark
   version
 1.0.1!

 The tag to be voted on is v1.0.1-rc1 (commit 7feeda3):


   
  
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7feeda3d729f9397aa15ee8750c01ef5aa601962

 The release files, including signatures, digests, etc. can be
 found
   at:
 http://people.apache.org/~pwendell/spark-1.0.1-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:

  
 https://repository.apache.org/content/repositories/orgapachespark-1020/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.0.1-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.0.1!

 The vote is open until Monday, June 30, at 03:00 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 === About this release ===
 This release fixes a few high-priority bugs in 1.0 and has a
 variety
 of smaller fixes. The full list is here: http://s.apache.org/b45.
   Some
 of the more visible patches are:

 SPARK-2043: ExternalAppendOnlyMap doesn't always find matching
 keys
 SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka
  frame
size.
 SPARK-1790: Support r3 instance types on EC2.

 This is the first maintenance release on the 1.0 line. We plan to
  make
 additional maintenance releases as new fixes come

[VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-04 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.0.1!

The tag to be voted on is v1.0.1-rc1 (commit 7d1043c):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.0.1-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1021/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/

Please vote on releasing this package as Apache Spark 1.0.1!

The vote is open until Monday, July 07, at 20:45 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

=== Differences from RC1 ===
This release includes only one blocking patch from rc1:
https://github.com/apache/spark/pull/1255

There are also smaller fixes which came in over the last week.

=== About this release ===
This release fixes a few high-priority bugs in 1.0 and has a variety
of smaller fixes. The full list is here: http://s.apache.org/b45. Some
of the more visible patches are:

SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size.
SPARK-1790: Support r3 instance types on EC2.

This is the first maintenance release on the 1.0 line. We plan to make
additional maintenance releases as new fixes come in.


Testing period for better jenkins integration

2014-07-09 Thread Patrick Wendell
Just a heads up - I've added some better Jenkins integration that
posts more useful messages on pull requests. We'll run this
side-by-side with the current Jenkins messages for a while to make
sure it's working well. Things may be a bit chatty while we are
testing this - we can migrate over as soon as we feel it's stable.

- Patrick


Changes to sbt build have been merged

2014-07-10 Thread Patrick Wendell
Just a heads up, we merged Prashant's work on having the sbt build read all
dependencies from Maven. Please report any issues you find on the dev list
or on JIRA.

One note here for developers, going forward the sbt build will use the same
configuration style as the maven build (-D for options and -P for maven
profiles). So this will be a change for developers:

sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly

For now, we'll continue to support the old env-var options with a
deprecation warning.

- Patrick


Re: what is the difference between org.spark-project.hive and org.apache.hadoop.hive

2014-07-11 Thread Patrick Wendell
There are two differences:

1. We publish hive with a shaded protobuf dependency to avoid
conflicts with some Hadoop versions.
2. We publish a proper hive-exec jar that only includes hive packages.
The upstream version of hive-exec bundles a bunch of other random
dependencies in it which makes it really hard for third-party projects
to use it.
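
So a third-party project that wants the trimmed-down jar depends on the
org.spark-project.hive coordinates instead of upstream Hive; a sketch in sbt
(the version shown is illustrative - check Maven Central for the versions that
are actually published):

  libraryDependencies += "org.spark-project.hive" % "hive-exec" % "0.12.0"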

On Thu, Jul 10, 2014 at 11:29 PM, kingfly wangf...@huawei.com wrote:

 --

 Best Regards
 Frank Wang | Software Engineer

 Mobile: +86 18505816792
 Phone: +86 571 63547
 Fax:
 Email: wangf...@huawei.com
 
 Huawei Technologies Co., Ltd.
 Hangzhou RD Center
 NO.410, JiangHong Road, Binjiang Area, Hangzhou, 310052, P. R. China




Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-11 Thread Patrick Wendell
Hey Gary,

Why do you think the akka frame size changed? It didn't change - we
added some fixes for cases where users were setting non-default
values.
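
For anyone double-checking their own configuration, the knob in question is
spark.akka.frameSize, which is specified in MB in the 1.x line; a sketch with
an arbitrary non-default value:

  val conf = new org.apache.spark.SparkConf().set("spark.akka.frameSize", "64") // MB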

On Fri, Jul 11, 2014 at 9:31 AM, Gary Malouf malouf.g...@gmail.com wrote:
 Hi Matei,

 We have not had time to re-deploy the rc today, but one thing that jumps
 out is the shrinking of the default akka frame size from 10MB to around
 128KB by default.  That is my first suspicion for our issue - could imagine
 that biting others as well.

 I'll try to re-test that today - either way, understand moving forward at
 this point.

 Gary


 On Fri, Jul 11, 2014 at 12:08 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 Unless you can diagnose the problem quickly, Gary, I think we need to go
 ahead with this release as is. This release didn't touch the Mesos support
 as far as I know, so the problem might be a nondeterministic issue with
 your application. But on the other hand the release does fix some critical
 bugs that affect all users. We can always do 1.0.2 later if we discover a
 problem.

 Matei

 On Jul 10, 2014, at 9:40 PM, Patrick Wendell pwend...@gmail.com wrote:

  Hey Gary,
 
  The vote technically doesn't close until I send the vote summary
  e-mail, but I was planning to close and package this tonight. It's too
  bad if there is a regression, it might be worth holding the release
  but it really requires narrowing down the issue to get more
  information about the scope and severity. Could you fork another
  thread for this?
 
  - Patrick
 
  On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf malouf.g...@gmail.com
 wrote:
  -1 I honestly do not know the voting rules for the Spark community, so
  please excuse me if I am out of line or if Mesos compatibility is not a
  concern at this point.
 
  We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos
  0.18.2.  All of our jobs with data above a few gigabytes hung
 indefinitely.
  Downgrading back to the 1.0.0 stable release of Spark built the same way
  worked for us.
 
 
  On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves tgraves...@yahoo.com.invalid
 
  wrote:
 
  +1. Ran some Spark on yarn jobs on a hadoop 2.4 cluster with
  authentication on.
 
  Tom
 
 
  On Friday, July 4, 2014 2:39 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
 
 
  Please vote on releasing the following candidate as Apache Spark
 version
  1.0.1!
 
  The tag to be voted on is v1.0.1-rc1 (commit 7d1043c):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.0.1-rc2/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
 
 https://repository.apache.org/content/repositories/orgapachespark-1021/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/
 
  Please vote on releasing this package as Apache Spark 1.0.1!
 
  The vote is open until Monday, July 07, at 20:45 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.0.1
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  === Differences from RC1 ===
  This release includes only one blocking patch from rc1:
  https://github.com/apache/spark/pull/1255
 
  There are also smaller fixes which came in over the last week.
 
  === About this release ===
  This release fixes a few high-priority bugs in 1.0 and has a variety
  of smaller fixes. The full list is here: http://s.apache.org/b45. Some
  of the more visible patches are:
 
  SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
  SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame
 size.
  SPARK-1790: Support r3 instance types on EC2.
 
  This is the first maintenance release on the 1.0 line. We plan to make
  additional maintenance releases as new fixes come in.
 




Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-11 Thread Patrick Wendell
Okay just FYI - I'm closing this vote since many people are waiting on
the release and I was hoping to package it today. If we find a
reproducible Mesos issue here, we can definitely spin the fix into a
subsequent release.



On Fri, Jul 11, 2014 at 9:37 AM, Patrick Wendell pwend...@gmail.com wrote:
 Hey Gary,

 Why do you think the akka frame size changed? It didn't change - we
 added some fixes for cases where users were setting non-default
 values.

 On Fri, Jul 11, 2014 at 9:31 AM, Gary Malouf malouf.g...@gmail.com wrote:
 Hi Matei,

 We have not had time to re-deploy the rc today, but one thing that jumps
 out is the shrinking of the default akka frame size from 10MB to around
 128KB.  That is my first suspicion for our issue - I could imagine that
 biting others as well.

 I'll try to re-test that today - either way, understand moving forward at
 this point.

 Gary
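
For context, the setting at issue here is spark.akka.frameSize, which in Spark 1.x
is given in MB (default 10). Below is a minimal sketch of a driver that sets a
non-default value, which is the case Patrick says the 1.0.1 fixes were aimed at;
the app name and the value 64 are illustrative only, not a recommendation.

    import org.apache.spark.{SparkConf, SparkContext}

    object FrameSizeExample {
      def main(args: Array[String]): Unit = {
        // spark.akka.frameSize is specified in MB; 10 is the 1.x default.
        // Raising it lets larger serialized tasks/results cross the wire.
        val conf = new SparkConf()
          .setAppName("frame-size-example")   // illustrative name
          .set("spark.akka.frameSize", "64")  // non-default value, in MB
        val sc = new SparkContext(conf)
        // ... run jobs whose serialized tasks or results exceed the default ...
        sc.stop()
      }
    }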


 On Fri, Jul 11, 2014 at 12:08 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 Unless you can diagnose the problem quickly, Gary, I think we need to go
 ahead with this release as is. This release didn't touch the Mesos support
 as far as I know, so the problem might be a nondeterministic issue with
 your application. But on the other hand the release does fix some critical
 bugs that affect all users. We can always do 1.0.2 later if we discover a
 problem.

 Matei





[RESULT] [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-11 Thread Patrick Wendell
This vote has passed with 9 +1 votes (5 binding) and 1 -1 vote (0 binding).

+1:
Patrick Wendell*
Mark Hamstra*
DB Tsai
Krishna Sankar
Soren Macbeth
Andrew Or
Matei Zaharia*
Xiangrui Meng*
Tom Graves*

0:

-1:
Gary Malouf


Announcing Spark 1.0.1

2014-07-11 Thread Patrick Wendell
I am happy to announce the availability of Spark 1.0.1! This release
includes contributions from 70 developers. Spark 1.0.1 includes fixes
across several areas of Spark, including the core API, PySpark, and
MLlib. It also includes new features in Spark's (alpha) SQL library,
including support for JSON data and performance and stability fixes.

Visit the release notes[1] to read about this release or download[2]
the release today.

[1] http://spark.apache.org/releases/spark-release-1-0-1.html
[2] http://spark.apache.org/downloads.html


Re: how to run the program compiled with spark 1.0.0 in the branch-0.1-jdbc cluster

2014-07-14 Thread Patrick Wendell
 1. The first error I met is the different SerializationVersionUID in 
 ExecuterStatus

 I resolved by explicitly declare SerializationVersionUID in 
 ExecuterStatus.scala and recompile branch-0.1-jdbc


I don't think there is a class in Spark named ExecuterStatus (sic) ...
or ExecutorStatus. Is this a class you made?
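
For anyone hitting the same mismatch: in Scala the serial version UID of a
Serializable class can be pinned explicitly with the @SerialVersionUID
annotation, so recompiling on another branch does not change the generated UID.
A minimal sketch, using the class name from the report above (which, as noted,
is not a Spark class) and illustrative fields:

    // Pinning the UID keeps instances compiled from different branches
    // wire-compatible, as long as the field layout itself does not change.
    @SerialVersionUID(1L)
    class ExecuterStatus(val executorId: String, val running: Boolean)
      extends Serializable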


Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Hey Cody,

This Jstack seems truncated, would you mind giving the entire stack
trace? For the second thread, for instance, we can't see where the
lock is being acquired.

- Patrick

On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
cody.koenin...@mediacrossing.com wrote:
 Hi all, just wanted to give a heads up that we're seeing a reproducible
 deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2

 If jira is a better place for this, apologies in advance - figured talking
 about it on the mailing list was friendlier than randomly (re)opening jira
 tickets.

 I know Gary had mentioned some issues with 1.0.1 on the mailing list, once
 we got a thread dump I wanted to follow up.

 The thread dump shows the deadlock occurs in the synchronized block of code
 that was changed in HadoopRDD.scala, for the Spark-1097 issue

 Relevant portions of the thread dump are summarized below, we can provide
 the whole dump if it's useful.

 Found one Java-level deadlock:
 =============================
 Executor task launch worker-1:
   waiting to lock monitor 0x7f250400c520 (object 0xfae7dc30, a org.apache.hadoop.conf.Configuration),
   which is held by Executor task launch worker-0
 Executor task launch worker-0:
   waiting to lock monitor 0x7f2520495620 (object 0xfaeb4fc8, a java.lang.Class),
   which is held by Executor task launch worker-1

 Executor task launch worker-1:
         at org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
         - waiting to lock 0xfae7dc30 (a org.apache.hadoop.conf.Configuration)
         at org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
         - locked 0xfaca6ff8 (a java.lang.Class for org.apache.hadoop.conf.Configuration)
         at org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
         at org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110)
         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
         at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
         at java.lang.Class.newInstance0(Class.java:374)
         at java.lang.Class.newInstance(Class.java:327)
         at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
         at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
         at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
         - locked 0xfaeb4fc8 (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
         at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
         at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
         at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
         at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
         at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
         at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
         at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)



 ...elided...


 Executor task launch worker-0 daemon prio=10 tid=0x01e71800 nid=0x2d97 waiting for monitor entry [0x7f24d2bf1000]
    java.lang.Thread.State: BLOCKED (on object monitor)
         at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
         - waiting to lock 0xfaeb4fc8 (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
         at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
         at

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Hey Nishkam,

Aaron's fix should prevent two concurrent accesses to getJobConf (and
the Hadoop code therein). But if there is code elsewhere that tries to
mutate the configuration, then I could see how we might still have the
ConcurrentModificationException.

I looked at your patch for HADOOP-10456 and the only example you give
is of the data being accessed inside of getJobConf. Is it accessed
somewhere else too from Spark that you are aware of?

https://issues.apache.org/jira/browse/HADOOP-10456

- Patrick
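
Roughly, the locking shape under discussion looks like the sketch below. This is
not the actual HadoopRDD code, just an illustration: JobConf construction from
the shared, broadcast Configuration is funneled through a single lock, so two
tasks never read that Configuration concurrently, but any other code path that
mutates the same Configuration without taking the lock can still race, which is
the concern raised above.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapred.JobConf

    object JobConfFactory {
      // single, process-wide lock guarding reads of the shared Configuration
      private val confLock = new Object

      def getJobConf(broadcastConf: Configuration): JobConf =
        confLock.synchronized {
          // copying the shared conf is the read that must not overlap with
          // a concurrent, unlocked mutation of broadcastConf elsewhere
          new JobConf(broadcastConf)
        }
    }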

On Mon, Jul 14, 2014 at 3:28 PM, Nishkam Ravi nr...@cloudera.com wrote:
 Hi Aaron, I'm not sure if synchronizing on an arbitrary lock object would
 help. I suspect we will start seeing the ConcurrentModificationException
 again. The right fix has gone into Hadoop through 10456. Unfortunately, I
 don't have any bright ideas on how to synchronize this at the Spark level
 without the risk of deadlocks.


 On Mon, Jul 14, 2014 at 3:07 PM, Aaron Davidson ilike...@gmail.com wrote:

 The full jstack would still be useful, but our current working theory is
 that this is due to the fact that Configuration#loadDefaults goes through
 every Configuration object that was ever created (via
 Configuration.REGISTRY) and locks it, thus introducing a dependency from
 new Configuration to old, otherwise unrelated, Configuration objects that
 our locking did not anticipate.

 I have created https://github.com/apache/spark/pull/1409 to hopefully fix
 this bug.
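
Stripped of the Hadoop specifics, the inversion described above reduces to the
classic two-lock pattern in the sketch below (not Spark or Hadoop code; the
sleeps only make the interleaving reliable): one thread takes a class-level
lock and then tries to lock an older Configuration-like object, while another
thread holds that object and then tries to take the class-level lock.

    object DeadlockSketch {
      // stands in for the org.apache.hadoop.fs.FileSystem class monitor
      private val classLevelLock = new Object
      // stands in for an older Configuration already in Configuration.REGISTRY
      private val oldConf = new Object

      def main(args: Array[String]): Unit = {
        val workerA = new Thread(new Runnable {
          def run(): Unit = classLevelLock.synchronized { // like FileSystem.loadFileSystems
            Thread.sleep(100)
            oldConf.synchronized { () }                   // like reloadConfiguration walking the registry
          }
        })
        val workerB = new Thread(new Runnable {
          def run(): Unit = oldConf.synchronized {        // like the synchronized block on the shared conf
            Thread.sleep(100)
            classLevelLock.synchronized { () }            // like FileSystem.getFileSystemClass
          }
        })
        workerA.start(); workerB.start()                  // each ends up waiting on the other's lock
      }
    }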



Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Andrew is your issue also a regression from 1.0.0 to 1.0.1? The
immediate priority is addressing regressions between these two
releases.

On Mon, Jul 14, 2014 at 9:05 PM, Andrew Ash and...@andrewash.com wrote:
 I'm not sure either of those PRs will fix the concurrent adds to
 Configuration issue I observed. I've got a stack trace and writeup I'll
 share in an hour or two (traveling today).
 On Jul 14, 2014 9:50 PM, scwf wangf...@huawei.com wrote:

 hi, Cody
   I met this issue a few days ago and posted a PR for it (
 https://github.com/apache/spark/pull/1385).
 It's very strange that if I synchronize on conf it deadlocks, but it is OK
 when I synchronize on initLocalJobConfFuncOpt.


  Here's the entire jstack output.



Re: Catalyst dependency on Spark Core

2014-07-14 Thread Patrick Wendell
Adding new build modules is pretty high overhead, so if this is a case
where a small amount of duplicated code could get rid of the
dependency, that could also be a good short-term option.

- Patrick

On Mon, Jul 14, 2014 at 2:15 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 Yeah, I'd just add a spark-util that has these things.

 Matei

 On Jul 14, 2014, at 1:04 PM, Michael Armbrust mich...@databricks.com
 wrote:

 Yeah, sadly this dependency was introduced when someone consolidated the
 logging infrastructure.  However, the dependency should be very small and
 thus easy to remove, and I would like catalyst to be usable outside of
 Spark.  A pull request to make this possible would be welcome.

 Ideally, we'd create some sort of spark common package that has things like
 logging.  That way catalyst could depend on that, without pulling in all of
 Hadoop, etc.  Maybe others have opinions though, so I'm cc-ing the dev list.
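
A rough sketch of the kind of thing a standalone spark-util (or spark-common)
module could expose, assuming only an slf4j dependency. The package name is
hypothetical and this is not the existing org.apache.spark.Logging trait, just
an illustration of the idea.

    package org.apache.spark.util.logging  // hypothetical module/package

    import org.slf4j.{Logger, LoggerFactory}

    // A dependency-light logging mixin that catalyst (or any other module)
    // could use without pulling in spark-core or Hadoop.
    trait Logging {
      @transient private lazy val log: Logger =
        LoggerFactory.getLogger(this.getClass.getName.stripSuffix("$"))

      protected def logInfo(msg: => String): Unit =
        if (log.isInfoEnabled) log.info(msg)

      protected def logWarning(msg: => String): Unit =
        if (log.isWarnEnabled) log.warn(msg)

      protected def logError(msg: => String, t: Throwable = null): Unit =
        if (t == null) log.error(msg) else log.error(msg, t)
    }

Catalyst would then depend only on that artifact (plus slf4j), keeping
spark-core and the Hadoop client libraries off its compile classpath.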


 On Mon, Jul 14, 2014 at 12:21 AM, Yanbo Liang yanboha...@gmail.com wrote:

 Making Catalyst independent of Spark is the goal for Catalyst; it may need
 time and evolution.
 I noticed that the package org.apache.spark.sql.catalyst.util imports
 org.apache.spark.util.{Utils => SparkUtils},
 so Catalyst has a dependency on Spark core.
 I'm not sure whether it will be replaced by another component independent of
 Spark in a later release.


 2014-07-14 11:51 GMT+08:00 Aniket Bhatnagar aniket.bhatna...@gmail.com:

 As per the recent presentation given in Scala days
 (http://people.apache.org/~marmbrus/talks/SparkSQLScalaDays2014.pdf), it was
 mentioned that Catalyst is independent of Spark. But on inspecting pom.xml
 of sql/catalyst module, it seems it has a dependency on Spark Core. Any
 particular reason for the dependency? I would love to use Catalyst outside
 Spark

 (reposted as previous email bounced. Sorry if this is a duplicate).





