Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark
@mridul - As far as I know both Maven and Sbt use fairly similar processes for building the assembly/uber jar. We actually used to package spark with sbt and there were no specific issues we encountered and AFAIK sbt respects versioning of transitive dependencies correctly. Do you have a specific bug listing for sbt that indicates something is broken? @sandy - It sounds like you are saying that the CDH build would be easier with Maven because you can inherit the POM. However, is this just a matter of convenience for packagers or would standardizing on sbt limit capabilities in some way? I assume that it would just mean a bit more manual work for packagers having to figure out how to set the hadoop version in SBT and exclude certain dependencies. For instance, what does CDH do about other components like Impala that are not based on Maven at all? On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan e...@ooyala.com wrote: I'd like to propose the following way to move forward, based on the comments I've seen: 1. Aggressively clean up the giant dependency graph. One ticket I might work on if I have time is SPARK-681 which might remove the giant fastutil dependency (~15MB by itself). 2. Take an intermediate step by having only ONE source of truth w.r.t. dependencies and versions. This means either: a) Using a maven POM as the spec for dependencies, Hadoop version, etc. Then, use sbt-pom-reader to import it. b) Using the build.scala as the spec, and sbt make-pom to generate the pom.xml for the dependencies The idea is to remove the pain and errors associated with manual translation of dependency specs from one system to another, while still maintaining the things which are hard to translate (plugins). On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers ko...@tresata.com wrote: We maintain an in-house spark build using sbt. We have no problem using sbt assembly. We did add a few exclude statements for transitive dependencies. The main enemy of assemblies is jars that include stuff they shouldn't (kryo comes to mind, I think they include logback?), new versions of jars that change the provider/artifact without changing the package (asm), and incompatible new releases (protobuf). These break the transitive resolution process. I imagine that's true for any build tool. Besides shading I don't see anything maven can do that sbt cannot, and if I understand it correctly shading is not currently done using the build tool. Since spark is primarily scala/akka based, the main developer base will be familiar with sbt (I think?). Switching build tools is always painful. I personally think it is smarter to put this burden on a limited number of upstream integrators than on the community. However, that said, I don't think it's a problem for us to maintain an sbt build in-house if spark switched to maven. The problem is, the complete spark dependency graph is fairly large, and there are a lot of conflicting versions in there. In particular, this shows up when we bump versions of dependencies - making managing this messy at best. Now, I have not looked in detail at how maven manages this - it might just be accidental that we get a decent out-of-the-box assembled shaded jar (since we don't do anything great to configure it). With the current state of sbt in spark, it definitely is not a good solution: if we can enhance it (or it already is?), while keeping the management of the version/dependency graph manageable, I don't have any objections to using sbt or maven! Too many exclude versions, pinned versions, etc. would just make things unmanageable in the future.
Regards, Mridul On Wed, Feb 26, 2014 at 8:56 AM, Evan Chan e...@ooyala.com wrote: Actually you can control exactly how sbt assembly merges or resolves conflicts. I believe the default settings, however, lead to an ordering which cannot be controlled. I do wish for a smarter fat jar plugin. -Evan To be free is not merely to cast off one's chains, but to live in a way that respects and enhances the freedom of others. (#NelsonMandela) On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan mri...@gmail.com wrote: On Wed, Feb 26, 2014 at 5:31 AM, Patrick Wendell pwend...@gmail.com wrote: Evan - this is a good thing to bring up. Wrt the shader plug-in - right now we don't actually use it for bytecode shading - we simply use it for creating the uber jar with excludes (which sbt supports just fine via assembly). Not really - as I mentioned initially in this thread, sbt's assembly does not take dependencies into account properly: it can overwrite newer classes with older versions. From an assembly point of view, sbt is not very good: we are yet to try it after the 2.10 shift though (and probably won't, given the mess it created last time). Regards, Mridul I was wondering actually, do you know if it's possible to add shaded artifacts to the *spark jar* using this plug-in (e.g. not an uber jar)? That's something I could see being
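To make the merge-strategy point concrete: in an sbt-assembly build the conflict-resolution rules can be spelled out explicitly rather than left to classpath order. The sketch below is illustrative only - the setting key names vary across sbt-assembly versions, and it is not taken from the Spark build:

    // Hypothetical build.sbt fragment: resolve duplicate entries explicitly.
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard  // drop manifests and signature files
      case "reference.conf"              => MergeStrategy.concat   // Akka/Typesafe config files must be concatenated
      case _                             => MergeStrategy.first    // otherwise keep the first copy found
    }

A custom case can also compare the conflicting entries (for example by version), which addresses the concern about an older class silently overwriting a newer one, but that logic has to be written and maintained by hand.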
Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark
Hey, Thanks everyone for chiming in on this. I wanted to summarize these issues a bit, particularly wrt the constituents involved - does this seem accurate? = Spark Users = In general those linking against Spark should be totally unaffected by the build choice. Spark will continue to publish well-formed poms and jars to maven central. This is a no-op wrt this decision. = Spark Developers = There are two concerns. (a) General day-to-day development and packaging and (b) Spark binaries and packages for distribution. For (a) - sbt seems better because it's just nicer for doing scala development (incremental compilation is simple, we have some home-baked tools for compiling Spark vs. the spark deps etc). The argument that maven has more general know-how hasn't, at least so far, affected us in the ~2 years we've maintained both builds - adding stuff to the Maven build is typically just as annoying/difficult as adding it to sbt. For (b) - Some non-specific concerns were raised about bugs with the sbt assembly package - we should look into this and see what is going on. Maven has better out-of-the-box support for publishing to Maven central; we'd have to do some manual work on our end to make this work well with sbt. = Downstream Integrators = On this one it seems that Maven is the universal favorite, largely because of community awareness of Maven and comfort with Maven builds. Some things like restructuring the Spark build to inherit config values from a vendor build will not be possible with sbt (though fairly straightforward to work around). Other cases where vendors have directly modified or inherited the Spark build won't work anymore if we standardize on SBT. These have no obvious workaround at this point as far as I see. - Patrick On Wed, Feb 26, 2014 at 7:09 PM, Mridul Muralidharan mri...@gmail.com wrote: On Feb 26, 2014 11:12 PM, Patrick Wendell pwend...@gmail.com wrote: @mridul - As far as I know both Maven and Sbt use fairly similar processes for building the assembly/uber jar. We actually used to package spark with sbt and there were no specific issues we encountered and AFAIK sbt respects versioning of transitive dependencies correctly. Do you have a specific bug listing for sbt that indicates something is broken? Slightly longish ... The assembled jar, generated via sbt, broke all over the place while I was adding yarn support in 0.6 - and I had to fix the sbt project a fair bit to get it to work: we need the assembled jar to submit a yarn job. When I finally submitted those changes to 0.7, it broke even more - since dependencies changed: someone else had thankfully already added maven support by then - which worked remarkably well out of the box (with some minor tweaks)! In theory, they might be expected to work the same, but practically they did not: as I mentioned, it must just have been luck that maven worked that well; but given multiple past nasty experiences with sbt, and the fact that it does not bring anything compelling or new in contrast, I am fairly against the idea of using only sbt - in spite of maven being unintuitive at times. Regards, Mridul @sandy - It sounds like you are saying that the CDH build would be easier with Maven because you can inherit the POM. However, is this just a matter of convenience for packagers or would standardizing on sbt limit capabilities in some way? I assume that it would just mean a bit more manual work for packagers having to figure out how to set the hadoop version in SBT and exclude certain dependencies. 
For instance, what does CDH do about other components like Impala that are not based on Maven at all? On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan e...@ooyala.com wrote: I'd like to propose the following way to move forward, based on the comments I've seen: 1. Aggressively clean up the giant dependency graph. One ticket I might work on if I have time is SPARK-681 which might remove the giant fastutil dependency (~15MB by itself). 2. Take an intermediate step by having only ONE source of truth w.r.t. dependencies and versions. This means either: a) Using a maven POM as the spec for dependencies, Hadoop version, etc. Then, use sbt-pom-reader to import it. b) Using the build.scala as the spec, and sbt make-pom to generate the pom.xml for the dependencies The idea is to remove the pain and errors associated with manual translation of dependency specs from one system to another, while still maintaining the things which are hard to translate (plugins). On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers ko...@tresata.com wrote: We maintain an in-house spark build using sbt. We have no problem using sbt assembly. We did add a few exclude statements for transitive dependencies. The main enemy of assemblies is jars that include stuff they shouldn't (kryo comes to mind, I think they include logback?), new versions of jars that change the provider/artifact
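For readers wondering what the extra manual work for packagers would look like on the sbt side, the fragment below sketches pinning a Hadoop version and excluding transitive dependencies by hand. The coordinates and versions are illustrative examples, not values from the Spark build:

    // Hypothetical build.sbt fragment.
    val hadoopVersion = sys.props.getOrElse("hadoop.version", "2.2.0")

    libraryDependencies ++= Seq(
      ("org.apache.hadoop" % "hadoop-client" % hadoopVersion)
        .exclude("org.slf4j", "slf4j-log4j12"),                        // keep a single logging backend
      ("org.apache.avro" % "avro" % "1.7.4")
        .excludeAll(ExclusionRule(organization = "org.mortbay.jetty")) // drop a conflicting transitive stack
    )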
Updated Developer Docs
Hey All, Just a heads up that there are a bunch of updated developer docs on the wiki including posting the dates around the current merge window. Some of the new docs might be useful for developers/committers: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage Cheers, - Patrick
Re: Spark 0.9.0 and log4j
Evan I actually remembered that Paul Brown (who also reported this issue) tested it and found that it worked. I'm going to merge this into master and branch 0.9, so please give it a spin when you have a chance. - Patrick On Sat, Mar 8, 2014 at 2:00 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Evan, This is being tracked here: https://spark-project.atlassian.net/browse/SPARK-1190 That patch didn't get merged but I've just opened a new one here: https://github.com/apache/spark/pull/107/files Would you have any interest in testing this? I want to make sure it works for users who are using logback. I'd like to get this merged quickly since it's one of the only remaining blockers for Spark 0.9.1. - Patrick On Fri, Mar 7, 2014 at 11:04 AM, Evan Chan e...@ooyala.com wrote: Hey guys, This is a follow-up to this semi-recent thread: http://apache-spark-developers-list.1001551.n3.nabble.com/0-9-0-forces-log4j-usage-td532.html 0.9.0 final is causing issues for us as well because we use Logback as our backend and Spark requires Log4j now. I see Patrick has a PR #560 to incubator-spark, was that merged in or left out? Also I see references to a new PR that might fix this, but I can't seem to find it in the github open PR page. Anybody have a link? As a last resort we can switch to Log4j, but would rather not have to do that if possible. thanks, Evan -- -- Evan Chan Staff Engineer e...@ooyala.com |
Re: 0.9.0 forces log4j usage
The fix for this was just merged into branch 0.9 (will be in 0.9.1+) and master. On Sun, Feb 9, 2014 at 11:44 PM, Patrick Wendell pwend...@gmail.com wrote: Thanks Paul - it isn't mean to be a full solution but just a fix for the 0.9 branch - for the full solution there is another PR by Sean Owen. On Sun, Feb 9, 2014 at 11:35 PM, Paul Brown p...@mult.ifario.us wrote: Hi, Patrick -- I gave that a go locally, and it works as desired. Best. -- Paul -- p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Fri, Feb 7, 2014 at 6:10 PM, Patrick Wendell pwend...@gmail.com wrote: Ah okay sounds good. This is what I meant earlier by You have some other application that directly calls log4j i.e. you have for historical reasons installed the log4j-over-slf4j. Would you mind trying out this fix and seeing if it works? This is designed to be a hotfix for 0.9, not a general solution where we rip out log4j from our published dependencies: https://github.com/apache/incubator-spark/pull/560/files - Patrick On Fri, Feb 7, 2014 at 5:57 PM, Paul Brown p...@mult.ifario.us wrote: Hi, Patrick -- I forget which other component is responsible, but we're using the log4j-over-slf4j as part of an overall requirement to centralize logging, i.e., *someone* else is logging over log4j and we're pulling that in. (There's also some jul logging from Jersey, etc.) Goals: - Fully control/capture all possible logging. (God forbid we have to grab System.out/err, but we'd do it if needed.) - Use the backend we like best at the moment. (Happens to be logback.) Possible cases: - If Spark used Log4j at all, we would pull in that logging via log4j-over-slf4j. - If Spark used only slf4j and referenced no backend, we would use it as-is although we'd still have the log4j-over-slf4j because of other libraries. - If Spark used only slf4j and referenced the slf4j-log4j12 backend, we would exclude that one dependency (via our POM). Best. -- Paul -- p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Fri, Feb 7, 2014 at 5:38 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Paul, So if your goal is ultimately to output to logback. Then why don't you just use slf4j and logback-classic.jar as described here [1]. Why involve log4j-over-slf4j at all? Let's say we refactored the spark build so it didn't advertise slf4j-log4j12 as a dependency. Would you still be using log4j-over-slf4j... or is this just a fix to deal with the fact that Spark is somewhat log4j dependent at this point. [1] http://www.slf4j.org/manual.html - Patrick On Fri, Feb 7, 2014 at 5:14 PM, Paul Brown p...@mult.ifario.us wrote: Hi, Patrick -- That's close but not quite it. The issue that occurs is not the delegation loop mentioned in slf4j documentation. The stack overflow is entirely within the code in the Spark trait: at org.apache.spark.Logging$class.initializeLogging(Logging.scala:112) at org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:97) at org.apache.spark.Logging$class.log(Logging.scala:36) at org.apache.spark.SparkEnv$.log(SparkEnv.scala:94) And then that repeats. As for our situation, we exclude the slf4j-log4j12 dependency when we import the Spark library (because we don't want to use log4j) and have log4j-over-slf4j already in place to ensure that all of the logging in the overall application runs through slf4j and then out through logback. (We also, as another poster already mentioned, also force jcl and jul through slf4j.) 
The zen of slf4j for libraries is that the library uses the slf4j API and then the enclosing application can route logging as it sees fit. Spark master CLI would log via slf4j and include the slf4j-log4j12 backend; same for Spark worker CLI. Spark as a library (versus as a container) would not include any backend to the slf4j API and leave this up to the application. (FWIW, this would also avoid your log4j warning message.) But as I was saying before, I'd be happy with a situation where I can avoid log4j being enabled or configured, and I think you'll find an existing choice of logging framework to be a common scenario for those embedding Spark in other systems. Best. -- Paul -- p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Fri, Feb 7, 2014 at 3:01 PM, Patrick Wendell pwend...@gmail.com wrote: Paul, Looking back at your problem. I think it's the one here: http://www.slf4j.org/codes.html#log4jDelegationLoop So let me just be clear what you are doing so I understand. You have some other application that directly calls log4j. So you have to include log4j-over-slf4j to route those logs through slf4j to logback. At the same time you embed Spark in this application
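The arrangement Paul describes - slf4j as the API, the log4j-over-slf4j bridge for libraries that call log4j directly, and logback as the single backend - would look roughly like this on the sbt side. The versions are illustrative, and the same exclusion can be expressed in a Maven POM with an exclusions block:

    // Hypothetical dependencies for embedding Spark while logging through logback.
    libraryDependencies ++= Seq(
      ("org.apache.spark" %% "spark-core" % "0.9.1")
        .exclude("org.slf4j", "slf4j-log4j12"),        // don't pull in Spark's log4j binding
      "org.slf4j"      % "log4j-over-slf4j" % "1.7.5", // bridge direct log4j calls into slf4j
      "ch.qos.logback" % "logback-classic"  % "1.1.1"  // the backend the application actually wants
    )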
Help vote for Spark talks at the Hadoop Summit
Hey All, The Hadoop Summit uses community choice voting to decide which talks to feature. It would be great if the community could help vote for Spark talks so that Spark has a good showing at this event. You can make three votes on each track. Below I've listed Spark talks in each of the tracks - voting closes tomorrow so vote now!! Building a Unified Data Pipeline in Apache Spark bit.ly/O8USIq (Committer Track) Building a Data Processing System for Real Time Auctions bit.ly/1ij3XJJ (Business Apps Track) SparkR: Enabling Interactive Data Science at Scale on Hadoop bit.ly/1kPQUlG (Data Science Track) Recent Developments in Spark MLlib and Beyond bit.ly/1hgZW5D (The Future of Apache Hadoop Track) Cheers, - Patrick
Github reviews now going to separate reviews@ mailing list
Hey All, We've created a new list called revi...@spark.apache.org which will contain the contents from the github pull requests and comments. Note that these e-mails will no longer appear on the dev list. Thanks to Apache Infra for helping us set this up. To subscribe to this e-mail: reviews-subscr...@spark.apache.org - Patrick
Re: repositories for spark jars
Hey Nathan, I don't think this would be possible because there are at least dozens of permutations of Hadoop versions (different vendor distros X different versions X YARN vs not YARN, etc) and maybe hundreds. So publishing new artifacts for each would be really difficult. What is the exact problem you ran into? Maybe we need to improve the documentation to make it more clear how to correctly link against spark/hadoop for user applications. Basically the model we have now is users link against Spark and then link against the hadoop-client relevant to their version of Hadoop. - Patrick On Mon, Mar 17, 2014 at 9:50 AM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: After just spending a couple days fighting with a new spark installation, getting spark and hadoop version numbers matching everywhere, I have a suggestion I'd like to put out there. Can we put the hadoop version against which the spark jars were built into the version number? I noticed that the Cloudera maven repo has started to do this ( https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-core_2.10/) - sadly, though, only with the cdh5.x versions, not with the 4.x versions for which they also have spark parcels. But I see no signs of it in the central maven repo. Is this already done in some other repo about which I don't know, perhaps? I know it would save us a lot of time and grief simply to be able to point a project we build at the right version, and not have to rebuild and deploy spark manually. -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com
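As a sketch of the linking model described above, a user application's sbt build would declare something like the following, with the hadoop-client version swapped for whatever the target cluster runs (the versions here are examples only):

    // Hypothetical application dependencies: published Spark artifact plus a matching hadoop-client.
    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-core"    % "0.9.0-incubating",
      "org.apache.hadoop" %  "hadoop-client" % "2.2.0"  // replace with your cluster's Hadoop version
    )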
Re: Announcing the official Spark Job Server repo
Evan - yep definitely open a JIRA. It would be nice to have a contrib repo set-up for the 1.0 release. On Tue, Mar 18, 2014 at 11:28 PM, Evan Chan e...@ooyala.com wrote: Matei, Maybe it's time to explore the spark-contrib idea again? Should I start a JIRA ticket? -Evan On Tue, Mar 18, 2014 at 4:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Cool, glad to see this posted! I've added a link to it at https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark. Matei On Mar 18, 2014, at 1:51 PM, Evan Chan e...@ooyala.com wrote: Dear Spark developers, Ooyala is happy to announce that we have pushed our official, Spark 0.9.0 / Scala 2.10-compatible, job server as a github repo: https://github.com/ooyala/spark-jobserver Complete with unit tests, deploy scripts, and examples. The original PR (#222) on incubator-spark is now closed. Please have a look; pull requests are very welcome. -- -- Evan Chan Staff Engineer e...@ooyala.com | -- -- Evan Chan Staff Engineer e...@ooyala.com |
Re: Spark 0.9.1 release
Hey Tom, I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on YARN - JIRA and [SPARK-1051] On Yarn, executors don't doAs as submitting user - JIRA in. The pyspark one I would consider more of an enhancement so might not be appropriate for a point release. Someone recently sent me a personal e-mail reporting some problems with this. I'll ask them to forward it to you/the dev list. Might be worth looking into before merging. [SPARK-1051] On Yarn, executors don't doAs as submitting user - JIRA This means that they can't write/read from files that the yarn user doesn't have permissions to but the submitting user does. Good call on this one. - Patrick
Re: Spark 0.9.1 release
Hey Evan and TD, Modifying Spark's dependency graph in a maintenance release seems potentially harmful, especially upgrading a minor version (not just a patch version) like this. This could affect other downstream users. For instance, their fastutil dependency would get bumped without them knowing it, and they might hit some new problem in fastutil 6.5. - Patrick On Mon, Mar 24, 2014 at 12:02 AM, Tathagata Das tathagata.das1...@gmail.com wrote: @Shivaram, That is a useful patch but I am a bit afraid to merge it in. Randomizing the executor offers has performance implications, especially for Spark Streaming. The non-randomized ordering of allocating machines to tasks was subtly helping to speed up certain window-based shuffle operations. For example, corresponding shuffle partitions in multiple shuffles using the same partitioner were likely to be co-located, that is, shuffle partition 0 was likely to be on the same machine for multiple shuffles. While this is not a reliable mechanism to rely on, randomization may lead to performance degradation. So I am afraid to merge this one without understanding the consequences. @Evan, I have already cut a release! You can submit the PR and we can merge it into branch-0.9. If we have to cut another release, then we can include it. On Sun, Mar 23, 2014 at 11:42 PM, Evan Chan e...@ooyala.com wrote: I also have a really minor fix for SPARK-1057 (upgrading fastutil), could that also make it in? -Evan On Sun, Mar 23, 2014 at 11:01 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Sorry this request is coming in a bit late, but would it be possible to backport SPARK-979[1] to branch-0.9 ? This is the patch for randomizing executor offers and I would like to use this in a release sooner rather than later. Thanks Shivaram [1] https://github.com/apache/spark/commit/556c56689bbc32c6cec0d07b57bd3ec73ceb243e#diff-8ef3258646b0e6a4793d6ad99848eacd On Thu, Mar 20, 2014 at 10:18 PM, Bhaskar Dutta bhas...@gmail.com wrote: Thank You! We plan to test out 0.9.1 on YARN once it is out. Regards, Bhaskar On Fri, Mar 21, 2014 at 12:42 AM, Tom Graves tgraves...@yahoo.com wrote: I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on YARN - JIRA and [SPARK-1051] On Yarn, executors don't doAs as submitting user - JIRA in. The pyspark one I would consider more of an enhancement so might not be appropriate for a point release. [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on YA... org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:49) at org.apache.spark.schedule... [SPARK-1051] On Yarn, executors don't doAs as submitting user - JIRA This means that they can't write/read from files that the yarn user doesn't have permissions to but the submitting user does. On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta bhas...@gmail.com wrote: It will be great if SPARK-1101 https://spark-project.atlassian.net/browse/SPARK-1101: Umbrella for hardening Spark on YARN can get into 0.9.1. Thanks, Bhaskar On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Hello everyone, Since the release of Spark 0.9, we have received a number of important bug fixes and we would like to make a bug-fix release of Spark 0.9.1. We are going to cut a release candidate soon and we would love it if people test it out. 
We have backported several bug fixes into the 0.9 and updated JIRA accordingly https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed) . Please let me know if there are fixes that were not backported but you would like to see them in 0.9.1. Thanks! TD -- -- Evan Chan Staff Engineer e...@ooyala.com |
Re: Spark 0.9.1 release
Spark's dependency graph in a maintenance *Modifying* Spark's dependency graph...
Re: Travis CI
That's not correct - like Michael said the Jenkins build remains the reference build for now. On Tue, Mar 25, 2014 at 7:03 PM, Nan Zhu zhunanmcg...@gmail.com wrote: I assume the Jenkins is not working now? Best, -- Nan Zhu On Tuesday, March 25, 2014 at 6:42 PM, Michael Armbrust wrote: Just a quick note to everyone that Patrick and I are playing around with Travis CI on the Spark github repository. For now, travis does not run all of the test cases, so will only be turned on experimentally. Long term it looks like Travis might give better integration with github, so we are going to see if it is feasible to get all of our tests running on it. *Jenkins remains the reference CI and should be consulted before merging pull requests, independent of what Travis says.* If you have any questions or want to help out with the investigation, let me know! Michael
Re: Travis CI
Ya, it's been a little bit slow lately because of a high error rate in interactions with the GitHub API. Unfortunately we are pretty slammed for the release and haven't had a ton of time to do further debugging. - Patrick On Tue, Mar 25, 2014 at 7:13 PM, Nan Zhu zhunanmcg...@gmail.com wrote: I just found that the Jenkins is not working from this afternoon for one PR, the first time the build failed after 90 minutes, the second time it has run for more than 2 hours and no result is returned Best, -- Nan Zhu On Tuesday, March 25, 2014 at 10:06 PM, Patrick Wendell wrote: That's not correct - like Michael said the Jenkins build remains the reference build for now. On Tue, Mar 25, 2014 at 7:03 PM, Nan Zhu zhunanmcg...@gmail.com wrote: I assume the Jenkins is not working now? Best, -- Nan Zhu On Tuesday, March 25, 2014 at 6:42 PM, Michael Armbrust wrote: Just a quick note to everyone that Patrick and I are playing around with Travis CI on the Spark github repository. For now, travis does not run all of the test cases, so will only be turned on experimentally. Long term it looks like Travis might give better integration with github, so we are going to see if it is feasible to get all of our tests running on it. *Jenkins remains the reference CI and should be consulted before merging pull requests, independent of what Travis says.* If you have any questions or want to help out with the investigation, let me know! Michael
Re: Spark 0.9.1 release
Hey TD, This one we just merged into master this morning: https://spark-project.atlassian.net/browse/SPARK-1322 It should definitely go into the 0.9 branch because there was a bug in the semantics of top() which at this point is unreleased in Python. I didn't backport it yet because I figured you might want to do this at a specific time. So please go ahead and backport it. Not sure whether this warrants another RC. - Patrick On Tue, Mar 25, 2014 at 10:47 PM, Mridul Muralidharan mri...@gmail.com wrote: On Wed, Mar 26, 2014 at 10:53 AM, Tathagata Das tathagata.das1...@gmail.com wrote: PR 159 seems like a fairly big patch to me. And quite recent, so its impact on the scheduling is not clear. It may also depend on other changes that may have gotten into the DAGScheduler but not pulled into branch 0.9. I am not sure it is a good idea to pull that in. We can pull those changes later for 0.9.2 if required. There is no impact on scheduling : it only has an impact on error handling - it ensures that you can actually use spark on yarn in multi-tenant clusters more reliably. Currently, any reasonably long running job (30 mins+) working on a non-trivial dataset will fail due to accumulated failures in spark. Regards, Mridul TD On Tue, Mar 25, 2014 at 8:44 PM, Mridul Muralidharan mri...@gmail.com wrote: Forgot to mention this in the earlier request for PR's. If there is another RC being cut, please add https://github.com/apache/spark/pull/159 to it too (if not done already !). Thanks, Mridul On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Hello everyone, Since the release of Spark 0.9, we have received a number of important bug fixes and we would like to make a bug-fix release of Spark 0.9.1. We are going to cut a release candidate soon and we would love it if people test it out. We have backported several bug fixes into the 0.9 branch and updated JIRA accordingly https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed) . Please let me know if there are fixes that were not backported but you would like to see them in 0.9.1. Thanks! TD
Re: JIRA. github and asf updates
Mridul, You can unsubscribe yourself from any of these sources, right? - Patrick On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan mri...@gmail.comwrote: Hi, So we are now receiving updates from three sources for each change to the PR. While each of them handles a corner case which others might miss, would be great if we could minimize the volume of duplicated communication. Regards, Mridul
Could you undo the JIRA dev list e-mails?
Hey Chris, I don't think our JIRA has been fully migrated to Apache infra, so it's really confusing to send people e-mails referring to the new JIRA since we haven't announced it yet. There is some content there because we've been trying to do the migration, but I'm not sure it's entirely finished. Also, right now our github comments go to a commits@ list. I'm actually -1 copying all of these to JIRA because we do a bunch of review level comments that are going to pollute the JIRA a bunch. In any case, can you revert the change whatever it was that sent these to the dev list? We should have a coordinated plan about this transition and the e-mail changes we plan to make. - Patrick
Re: Could you undo the JIRA dev list e-mails?
Okay I think I managed to revert this by just removing jira@a.o from our dev list. On Sat, Mar 29, 2014 at 11:37 AM, Patrick Wendell pwend...@gmail.comwrote: Hey Chris, I don't think our JIRA has been fully migrated to Apache infra, so it's really confusing to send people e-mails referring to the new JIRA since we haven't announced it yet. There is some content there because we've been trying to do the migration, but I'm not sure it's entirely finished. Also, right now our github comments go to a commits@ list. I'm actually -1 copying all of these to JIRA because we do a bunch of review level comments that are going to pollute the JIRA a bunch. In any case, can you revert the change whatever it was that sent these to the dev list? We should have a coordinated plan about this transition and the e-mail changes we plan to make. - Patrick
Re: JIRA. github and asf updates
I'm working with infra to get the following set-up: 1. Don't post github updates to jira comments (they are too low level). If users want these they can subscribe to commits@s.a.o. 2. Jira comment stream will go to issues@s.a.o so people can opt into that. One thing YARN has set up is to e-mail *new* JIRAs to the dev list. That might be cool to set up in the future. On Sat, Mar 29, 2014 at 1:15 PM, Mridul Muralidharan mri...@gmail.com wrote: If the PR comments are going to be replicated into the jira's and they are going to be sent to dev@, then we could keep that and remove [Github] updates ? The last was added since discussions were happening off apache lists - which should be handled by the jira updates ? I don't mind the mails if they had content - this is just duplication of the same message in three mails :-) Btw, this is a good problem to have - a vibrant and very actively engaged community generated a lot of meaningful traffic ! I just don't want to get distracted from it by repetitions. Regards, Mridul On Sat, Mar 29, 2014 at 11:46 PM, Patrick Wendell pwend...@gmail.com wrote: Ah sorry I see - Jira updates are going to the dev list. Maybe that's not desirable. I think we should send them to the issues@ list. On Sat, Mar 29, 2014 at 11:16 AM, Patrick Wendell pwend...@gmail.com wrote: Mridul, You can unsubscribe yourself from any of these sources, right? - Patrick On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan mri...@gmail.com wrote: Hi, So we are now receiving updates from three sources for each change to the PR. While each of them handles a corner case which others might miss, would be great if we could minimize the volume of duplicated communication. Regards, Mridul
Re: [VOTE] Release Apache Spark 0.9.1 (RC3)
TD - I downloaded and did some local testing. Looks good to me! +1 You should cast your own vote - at that point it's enough to pass. - Patrick On Sun, Mar 30, 2014 at 9:47 PM, prabeesh k prabsma...@gmail.com wrote: +1 tested on Ubuntu12.04 64bit On Mon, Mar 31, 2014 at 3:56 AM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 tested on Mac OS X. Matei On Mar 27, 2014, at 1:32 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 0.9.1 A draft of the release notes along with the CHANGES.txt file is attached to this e-mail. The tag to be voted on is v0.9.1-rc3 (commit 4c43182b): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4c43182b6d1b0b7717423f386c0214fe93073208 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~tdas/spark-0.9.1-rc3/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/tdas.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1009/ The documentation corresponding to this release can be found at: http://people.apache.org/~tdas/spark-0.9.1-rc3-docs/ Please vote on releasing this package as Apache Spark 0.9.1! The vote is open until Sunday, March 30, at 10:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 0.9.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ CHANGES.txtRELEASE_NOTES.txt
Re: [VOTE] Release Apache Spark 0.9.1 (RC3)
Yeah good point. Let's just extend this vote another few days? On Mon, Mar 31, 2014 at 8:12 AM, Tom Graves tgraves...@yahoo.com wrote: I should probably pull this off into another thread, but going forward can we try to not have the release votes end on a weekend? Since we only seem to give 3 days, it makes it really hard for anyone who is offline for the weekend to try it out. Either that or extend the voting for more then 3 days. Tom On Monday, March 31, 2014 12:50 AM, Patrick Wendell pwend...@gmail.com wrote: TD - I downloaded and did some local testing. Looks good to me! +1 You should cast your own vote - at that point it's enough to pass. - Patrick On Sun, Mar 30, 2014 at 9:47 PM, prabeesh k prabsma...@gmail.com wrote: +1 tested on Ubuntu12.04 64bit On Mon, Mar 31, 2014 at 3:56 AM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 tested on Mac OS X. Matei On Mar 27, 2014, at 1:32 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 0.9.1 A draft of the release notes along with the CHANGES.txt file is attached to this e-mail. The tag to be voted on is v0.9.1-rc3 (commit 4c43182b): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4c43182b6d1b0b7717423f386c0214fe93073208 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~tdas/spark-0.9.1-rc3/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/tdas.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1009/ The documentation corresponding to this release can be found at: http://people.apache.org/~tdas/spark-0.9.1-rc3-docs/ Please vote on releasing this package as Apache Spark 0.9.1! The vote is open until Sunday, March 30, at 10:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 0.9.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ CHANGES.txtRELEASE_NOTES.txt
Re: sbt-package-bin
And there is a deb target as well - ah didn't see Mark's email. On Tue, Apr 1, 2014 at 11:36 AM, Patrick Wendell pwend...@gmail.com wrote: Ya there is already some fragmentation here. Maven has some dist targets and there is also ./make-distribution.sh. On Tue, Apr 1, 2014 at 11:31 AM, Mark Hamstra m...@clearstorydata.comwrote: A basic Debian package can already be created from the Maven build: mvn -Pdeb ... On Tue, Apr 1, 2014 at 11:24 AM, Evan Chan e...@ooyala.com wrote: Also, I understand this is the last week / merge window for 1.0, so if folks are interested I'd like to get in a PR quickly. thanks, Evan On Tue, Apr 1, 2014 at 11:24 AM, Evan Chan e...@ooyala.com wrote: Hey folks, We are in the middle of creating a Chef recipe for Spark. As part of that we want to create a Debian package for Spark. What do folks think of adding the sbt-package-bin plugin to allow easy creation of a Spark .deb file? I believe it adds all dependency jars into a single lib/ folder, so in some ways it's even easier to manage than the assembly. Also I'm not sure if there's an equivalent plugin for Maven. thanks, Evan -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyala http://www.linkedin.com/company/ooyalahttp://www.twitter.com/ooyala -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyalahttp://www.linkedin.com/company/ooyala http://www.twitter.com/ooyala
Re: sbt-package-bin
Ya there is already some fragmentation here. Maven has some dist targets and there is also ./make-distribution.sh. On Tue, Apr 1, 2014 at 11:31 AM, Mark Hamstra m...@clearstorydata.comwrote: A basic Debian package can already be created from the Maven build: mvn -Pdeb ... On Tue, Apr 1, 2014 at 11:24 AM, Evan Chan e...@ooyala.com wrote: Also, I understand this is the last week / merge window for 1.0, so if folks are interested I'd like to get in a PR quickly. thanks, Evan On Tue, Apr 1, 2014 at 11:24 AM, Evan Chan e...@ooyala.com wrote: Hey folks, We are in the middle of creating a Chef recipe for Spark. As part of that we want to create a Debian package for Spark. What do folks think of adding the sbt-package-bin plugin to allow easy creation of a Spark .deb file? I believe it adds all dependency jars into a single lib/ folder, so in some ways it's even easier to manage than the assembly. Also I'm not sure if there's an equivalent plugin for Maven. thanks, Evan -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyala http://www.linkedin.com/company/ooyalahttp://www.twitter.com/ooyala -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyalahttp://www.linkedin.com/company/ooyala http://www.twitter.com/ooyala
Re: Would anyone mind having a quick look at PR#288?
Hey Evan, Ya thanks this is a pretty small patch. Should definitely be do-able for 1.0. - Patrick On Wed, Apr 2, 2014 at 10:25 AM, Evan Chan e...@ooyala.com wrote: https://github.com/apache/spark/pull/288 It's for fixing SPARK-1154, which would help Spark be a better citizen for most deploys, and should be really small and easy to review. thanks, Evan -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyalahttp://www.linkedin.com/company/ooyala http://www.twitter.com/ooyala
Re: Recent heartbeats
I answered this over on the user list... On Fri, Apr 4, 2014 at 6:13 PM, Debasish Das debasish.da...@gmail.comwrote: Hi, Also posted it on user but then I realized it might be more involved. In my ALS runs I am noticing messages that complain about heart beats: 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(17, machine1, 53419, 0) with no recent heart beats: 48476ms exceeds 45000ms 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(12, machine2, 60714, 0) with no recent heart beats: 45328ms exceeds 45000ms 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(19, machine3, 39496, 0) with no recent heart beats: 53259ms exceeds 45000ms Is this some issue with the underlying jvm over which akka is run ? Can I increase the heartbeat somehow to get these messages resolved ? Any more insight about the possible cause for the heartbeat will be helpful... Thanks. Deb
Re: Flaky streaming tests
TD - do you know what is going on here? I looked into this a bit and at least a few of these use Thread.sleep() and assume the sleep will be exact, which is wrong. We should disable all the tests that do this, and they should probably be re-written to virtualize time. - Patrick On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout k...@eecs.berkeley.edu wrote: Hi all, The InputStreamsSuite seems to have some serious flakiness issues -- I've seen the file input stream fail many times and now I'm seeing some actor input stream test failures ( https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull ) on what I think is an unrelated change. Does anyone know anything about these? Should we just remove some of these tests since they seem to be constantly failing? -Kay
Re: It seems that jenkins for PR is not working
There are a few things going on here wrt tests. 1. I fixed up the RAT issues with a hotfix. 2. The Hive tests were actually disabled for a while accidentally. A recent fix correctly re-enabled them. Without Hive Spark tests run in about 40 minutes and with Hive it runs in 1 hour and 15 minutes, so it's a big difference. To ease things I committed a patch today that only runs the Hive tests if the change touches Spark SQL. So this should make it simpler for normal tests. We can actually generalize this to do much finer grained testing, e.g. if something in MLLib changes we don't need to re-run the streaming tests. I've added this JIRA to track it: https://issues.apache.org/jira/browse/SPARK-1455 3. Overall we've experienced more race conditions with tests recently. I noticed a few zombie test processes on Jenkins hogging up 100% of CPU so I think this has triggered several previously unseen races due to CPU contention on the test cluster. I killed them and we'll see if they crop up again. 4. Please try to keep an eye on the length of new tests that get committed. It's common to see people commit tests that e.g. sleep for several seconds or do things that take a long time. Almost always this can be avoided and usually avoiding it makes the test cleaner anyways (e.g. use proper synchronization instead of sleeping). - Patrick On Tue, Apr 15, 2014 at 9:34 AM, Mark Hamstra m...@clearstorydata.comwrote: The RAT path issue is now fixed, but it appears to me that some recent change has dramatically altered the behavior of the testing framework, so that I am now seeing many individual tests taking more than a minute to run and the complete test run taking a very, very long time. I expect that this is what is causing Jenkins to now timeout repeatedly. On Mon, Apr 14, 2014 at 1:32 PM, Nan Zhu zhunanmcg...@gmail.com wrote: +1 -- Nan Zhu On Friday, April 11, 2014 at 5:35 PM, DB Tsai wrote: I always got = Could not find Apache license headers in the following files: !? /root/workspace/SparkPullRequestBuilder/python/metastore/db.lck !? /root/workspace/SparkPullRequestBuilder/python/metastore/service.properties Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai
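On point 4, one way to replace a fixed sleep with proper synchronization is a polling assertion such as ScalaTest's Eventually, assuming it is available on the test classpath; collectedResults and expectedCount below are placeholders for whatever condition the sleep was really waiting on:

    import org.scalatest.concurrent.Eventually._
    import org.scalatest.time.{Seconds, Span}

    // Poll until the condition holds instead of sleeping a fixed amount of time;
    // the test finishes as soon as it passes and only fails after the timeout.
    eventually(timeout(Span(10, Seconds))) {
      assert(collectedResults.size == expectedCount)
    }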
Spark 1.0.0 rc3
Hey All, This is not an official vote, but I wanted to cut an RC so that people can test against the Maven artifacts, test building with their configuration, etc. We are still chasing down a few issues and updating docs, etc. If you have issues or bug reports for this release, please send an e-mail to the Spark dev list and/or file a JIRA. Commit: d636772 (v1.0.0-rc3) https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221 Binaries: http://people.apache.org/~pwendell/spark-1.0.0-rc3/ Docs: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/ Repository: https://repository.apache.org/content/repositories/orgapachespark-1012/ == API Changes == If you want to test building against Spark there are some minor API changes. We'll get these written up for the final release but I'm noting a few here (not comprehensive): changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark coGroup and related functions now return Iterable[T] instead of Seq[T] ==> Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] ==> Call toSeq on the result to restore old behavior Streaming classes have been renamed: NetworkReceiver -> Receiver
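A toy illustration of the coGroup change, assuming an existing SparkContext named sc (this is a sketch, not code from the release notes): in 1.0 the grouped values come back as Iterable[T], and calling toSeq restores the old Seq-based shape for code that relied on it.

    val left    = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c")))
    val right   = sc.parallelize(Seq((1, "x")))
    val grouped = left.cogroup(right).mapValues {
      case (ls, rs) => (ls.toSeq, rs.toSeq)  // ls and rs are Iterable[String] in 1.0
    }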
Re: Spark 1.0.0 rc3
What are the expectations / guarantees on binary compatibility between 0.9 and 1.0? There are no guarantees.
Re: Spark 1.0.0 rc3
Hi Dean, We always used the Hadoop libraries here to read and write local files. In Spark 1.0 we started enforcing the rule that you can't over-write an existing directory because it can cause confusing/undefined behavior if multiple jobs output to the directory (they partially clobber each other's output). https://issues.apache.org/jira/browse/SPARK-1100 https://github.com/apache/spark/pull/11 In the JIRA I actually proposed slightly deviating from Hadoop semantics and allowing the directory to exist if it is empty, but I think in the end we decided to just go with the exact same semantics as Hadoop (i.e. empty directories are a problem). - Patrick On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler deanwamp...@gmail.com wrote: I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using HDFS classes for file I/O, while the same script compiled and running with 0.9.1 uses only the local-mode File IO. The script is a variation of the Word Count script. Here are the guts: object WordCount2 { def main(args: Array[String]) = { val sc = new SparkContext("local", "Word Count (2)") val input = sc.textFile(".../some/local/file").map(line => line.toLowerCase) input.cache val wc2 = input .flatMap(line => line.split("""\W+""")) .map(word => (word, 1)) .reduceByKey((count1, count2) => count1 + count2) wc2.saveAsTextFile("output/some/directory") sc.stop() } } It works fine compiled and executed with 0.9.1. If I recompile and run with 1.0.0-RC1, where the same output directory still exists, I get this familiar Hadoop-ish exception: [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057) at spark.activator.WordCount2$.main(WordCount2.scala:42) at spark.activator.WordCount2.main(WordCount2.scala) ... Thoughts? On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, This is not an official vote, but I wanted to cut an RC so that people can test against the Maven artifacts, test building with their configuration, etc. We are still chasing down a few issues and updating docs, etc. If you have issues or bug reports for this release, please send an e-mail to the Spark dev list and/or file a JIRA. Commit: d636772 (v1.0.0-rc3) https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221 Binaries: http://people.apache.org/~pwendell/spark-1.0.0-rc3/ Docs: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/ Repository: https://repository.apache.org/content/repositories/orgapachespark-1012/ == API Changes == If you want to test building against Spark there are some minor API changes. 
We'll get these written up for the final release but I'm noting a few here (not comprehensive): changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] == Call toSeq on the result to restore old behavior Streaming classes have been renamed: NetworkReceiver - Receiver -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com
Re: Spark 1.0.0 rc3
That suggestion got lost along the way and IIRC the patch didn't have that. It's a good idea though, if nothing else to provide a simple means for backwards compatibility. I created a JIRA for this. It's very straightforward so maybe someone can pick it up quickly: https://issues.apache.org/jira/browse/SPARK-1677 On Tue, Apr 29, 2014 at 2:20 PM, Dean Wampler deanwamp...@gmail.com wrote: Thanks. I'm fine with the logic change, although I was a bit surprised to see Hadoop used for file I/O. Anyway, the jira issue and pull request discussions mention a flag to enable overwrites. That would be very convenient for a tutorial I'm writing, although I wouldn't recommend it for normal use, of course. However, I can't figure out if this actually exists. I found the spark.files.overwrite property, but that doesn't apply. Does this override flag, method call, or method argument actually exist? Thanks, Dean On Tue, Apr 29, 2014 at 1:54 PM, Patrick Wendell pwend...@gmail.com wrote: Hi Dean, We always used the Hadoop libraries here to read and write local files. In Spark 1.0 we started enforcing the rule that you can't over-write an existing directory because it can cause confusing/undefined behavior if multiple jobs output to the directory (they partially clobber each other's output). https://issues.apache.org/jira/browse/SPARK-1100 https://github.com/apache/spark/pull/11 In the JIRA I actually proposed slightly deviating from Hadoop semantics and allowing the directory to exist if it is empty, but I think in the end we decided to just go with the exact same semantics as Hadoop (i.e. empty directories are a problem). - Patrick On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler deanwamp...@gmail.com wrote: I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using HDFS classes for file I/O, while the same script compiled and running with 0.9.1 uses only the local-mode File IO. The script is a variation of the Word Count script. Here are the guts: object WordCount2 { def main(args: Array[String]) = { val sc = new SparkContext(local, Word Count (2)) val input = sc.textFile(.../some/local/file).map(line = line.toLowerCase) input.cache val wc2 = input .flatMap(line = line.split(\W+)) .map(word = (word, 1)) .reduceByKey((count1, count2) = count1 + count2) wc2.saveAsTextFile(output/some/directory) sc.stop() It works fine compiled and executed with 0.9.1. If I recompile and run with 1.0.0-RC1, where the same output directory still exists, I get this familiar Hadoop-ish exception: [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057) at spark.activator.WordCount2$.main(WordCount2.scala:42) at spark.activator.WordCount2.main(WordCount2.scala) ... Thoughts? 
On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, This is not an official vote, but I wanted to cut an RC so that people can test against the Maven artifacts, test building with their configuration, etc. We are still chasing down a few issues and updating docs, etc. If you have issues or bug reports for this release, please send an e-mail to the Spark dev list and/or file a JIRA. Commit: d636772 (v1.0.0-rc3) https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221 Binaries: http://people.apache.org/~pwendell/spark-1.0.0-rc3/ Docs: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/ Repository: https://repository.apache.org/content/repositories/orgapachespark-1012/ == API Changes == If you want to test building against Spark there are some minor API changes. We'll get these written up for the final release but I'm noting a few here (not comprehensive): changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq
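Until an overwrite option like the one tracked in SPARK-1677 exists, a common workaround is to delete the old output directory yourself before saving. A sketch using the Hadoop FileSystem API, reusing sc and wc2 from Dean's example above (this is application-side code, not behavior Spark provides):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Remove the previous output, if any, before writing the new one.
    val outputDir  = "output/some/directory"
    val outputPath = new Path(outputDir)
    val fs = outputPath.getFileSystem(sc.hadoopConfiguration)
    if (fs.exists(outputPath)) {
      fs.delete(outputPath, true)  // recursive delete
    }
    wc2.saveAsTextFile(outputDir)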
Re: SparkSubmit and --driver-java-options
I added a fix for this recently and it didn't require adding -J notation - are you trying it with this patch? https://issues.apache.org/jira/browse/SPARK-1654 ./bin/spark-shell --driver-java-options "-Dfoo=a -Dbar=b" scala> sys.props.get("foo") res0: Option[String] = Some(a) scala> sys.props.get("bar") res1: Option[String] = Some(b) - Patrick On Wed, Apr 30, 2014 at 11:29 AM, Marcelo Vanzin van...@cloudera.com wrote: Hello all, Maybe my brain is not evolved enough to be able to trace through what happens with command-line arguments as they're parsed through all the shell scripts... but I really can't figure out how to pass more than a single JVM option on the command line. Unless someone has an obvious workaround that I'm missing, I'd like to propose something that is actually pretty standard in JVM tools: using -J. From javac: -Jflag Pass flag directly to the runtime system So javac -J-Xmx1g would pass -Xmx1g to the underlying JVM. You can use several of those to pass multiple options (unlike --driver-java-options), so it helps that it's a short syntax. Unless someone has some issue with that I'll work on a patch for it... (well, I'm going to do it locally for me anyway because I really can't figure out how to do what I want to otherwise.) -- Marcelo
Re: SparkSubmit and --driver-java-options
Yeah I think the problem is that the spark-submit script doesn't pass the argument array to spark-class in the right way, so any quoted strings get flattened. We do: ORIG_ARGS="$@" $SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit $ORIG_ARGS This works: // remove all the code relating to `shift`ing the arguments $SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@" Not sure, but I think the issue is that when you make a copy of $@ in bash the type actually changes from an array to something else. My patch fixes this for spark-shell but I didn't realize that spark-submit does the same thing. https://github.com/apache/spark/pull/576/files#diff-bc287993dfd11fd18794041e169ffd72L23 I think we'll need to figure out how to do this correctly in the bash script so that quoted strings get passed in the right way. On Wed, Apr 30, 2014 at 1:06 PM, Marcelo Vanzin van...@cloudera.com wrote: Just pulled again just in case. Verified your fix is there. $ ./bin/spark-submit --master yarn --deploy-mode client --driver-java-options "-Dfoo -Dbar" blah blah blah error: Unrecognized option '-Dbar'. run with --help for more information or --verbose for debugging output On Wed, Apr 30, 2014 at 12:49 PM, Patrick Wendell pwend...@gmail.com wrote: I added a fix for this recently and it didn't require adding -J notation - are you trying it with this patch? https://issues.apache.org/jira/browse/SPARK-1654 ./bin/spark-shell --driver-java-options "-Dfoo=a -Dbar=b" scala> sys.props.get("foo") res0: Option[String] = Some(a) scala> sys.props.get("bar") res1: Option[String] = Some(b) - Patrick On Wed, Apr 30, 2014 at 11:29 AM, Marcelo Vanzin van...@cloudera.com wrote: Hello all, Maybe my brain is not evolved enough to be able to trace through what happens with command-line arguments as they're parsed through all the shell scripts... but I really can't figure out how to pass more than a single JVM option on the command line. Unless someone has an obvious workaround that I'm missing, I'd like to propose something that is actually pretty standard in JVM tools: using -J. From javac: -Jflag Pass flag directly to the runtime system So javac -J-Xmx1g would pass -Xmx1g to the underlying JVM. You can use several of those to pass multiple options (unlike --driver-java-options), so it helps that it's a short syntax. Unless someone has some issue with that I'll work on a patch for it... (well, I'm going to do it locally for me anyway because I really can't figure out how to do what I want to otherwise.) -- Marcelo -- Marcelo
Re: SparkSubmit and --driver-java-options
So I reproduced the problem here: == test.sh == #!/bin/bash for x in "$@"; do echo arg: $x done ARGS_COPY="$@" for x in "$ARGS_COPY"; do echo arg_copy: $x done == ./test.sh a b "c d e" f arg: a arg: b arg: c d e arg: f arg_copy: a b c d e f I'll dig around a bit more and see if we can fix it. Pretty sure we aren't passing these argument arrays around correctly in bash. On Wed, Apr 30, 2014 at 1:48 PM, Marcelo Vanzin van...@cloudera.com wrote: On Wed, Apr 30, 2014 at 1:41 PM, Patrick Wendell pwend...@gmail.com wrote: Yeah I think the problem is that the spark-submit script doesn't pass the argument array to spark-class in the right way, so any quoted strings get flattened. I think we'll need to figure out how to do this correctly in the bash script so that quoted strings get passed in the right way. I tried a few different approaches but finally ended up giving up; my bash-fu is apparently not strong enough. If you can make it work great, but I have -J working locally in case you give up like me. :-) -- Marcelo
Re: SparkSubmit and --driver-java-options
Marcelo - Mind trying the following diff locally? If it works I can send a patch: patrick@patrick-t430s:~/Documents/spark$ git diff bin/spark-submit diff --git a/bin/spark-submit b/bin/spark-submit index dd0d95d..49bc262 100755 --- a/bin/spark-submit +++ b/bin/spark-submit @@ -18,7 +18,7 @@ # export SPARK_HOME=$(cd `dirname $0`/..; pwd) -ORIG_ARGS=$@ +ORIG_ARGS=("$@") while (($#)); do if [ $1 = "--deploy-mode" ]; then @@ -39,5 +39,5 @@ if [ ! -z $DRIVER_MEMORY ] && [ ! -z $DEPLOY_MODE ] && [ $DEPLOY_MODE = "client" ]; then export SPARK_MEM=$DRIVER_MEMORY fi -$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit $ORIG_ARGS +$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit "${ORIG_ARGS[@]}" On Wed, Apr 30, 2014 at 1:51 PM, Patrick Wendell pwend...@gmail.com wrote: So I reproduced the problem here: == test.sh == #!/bin/bash for x in "$@"; do echo arg: $x done ARGS_COPY="$@" for x in "$ARGS_COPY"; do echo arg_copy: $x done == ./test.sh a b "c d e" f arg: a arg: b arg: c d e arg: f arg_copy: a b c d e f I'll dig around a bit more and see if we can fix it. Pretty sure we aren't passing these argument arrays around correctly in bash. On Wed, Apr 30, 2014 at 1:48 PM, Marcelo Vanzin van...@cloudera.com wrote: On Wed, Apr 30, 2014 at 1:41 PM, Patrick Wendell pwend...@gmail.com wrote: Yeah I think the problem is that the spark-submit script doesn't pass the argument array to spark-class in the right way, so any quoted strings get flattened. I think we'll need to figure out how to do this correctly in the bash script so that quoted strings get passed in the right way. I tried a few different approaches but finally ended up giving up; my bash-fu is apparently not strong enough. If you can make it work great, but I have -J working locally in case you give up like me. :-) -- Marcelo
Re: SparkSubmit and --driver-java-options
Dean - our e-mails crossed, but thanks for the tip. Was independently arriving at your solution :) Okay I'll submit something. - Patrick On Wed, Apr 30, 2014 at 2:14 PM, Marcelo Vanzin van...@cloudera.com wrote: Cool, that seems to work. Thanks! On Wed, Apr 30, 2014 at 2:09 PM, Patrick Wendell pwend...@gmail.com wrote: Marcelo - Mind trying the following diff locally? If it works I can send a patch: patrick@patrick-t430s:~/Documents/spark$ git diff bin/spark-submit diff --git a/bin/spark-submit b/bin/spark-submit index dd0d95d..49bc262 100755 --- a/bin/spark-submit +++ b/bin/spark-submit @@ -18,7 +18,7 @@ # export SPARK_HOME=$(cd `dirname $0`/..; pwd) -ORIG_ARGS=$@ +ORIG_ARGS=($@) while (($#)); do if [ $1 = --deploy-mode ]; then @@ -39,5 +39,5 @@ if [ ! -z $DRIVER_MEMORY ] [ ! -z $DEPLOY_MODE ] [ $DEPLOY_MODE = client export SPARK_MEM=$DRIVER_MEMORY fi -$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit $ORIG_ARGS +$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit ${ORIG_ARGS[@]} On Wed, Apr 30, 2014 at 1:51 PM, Patrick Wendell pwend...@gmail.com wrote: So I reproduced the problem here: == test.sh == #!/bin/bash for x in $@; do echo arg: $x done ARGS_COPY=$@ for x in $ARGS_COPY; do echo arg_copy: $x done == ./test.sh a b c d e f arg: a arg: b arg: c d e arg: f arg_copy: a b c d e f I'll dig around a bit more and see if we can fix it. Pretty sure we aren't passing these argument arrays around correctly in bash. On Wed, Apr 30, 2014 at 1:48 PM, Marcelo Vanzin van...@cloudera.com wrote: On Wed, Apr 30, 2014 at 1:41 PM, Patrick Wendell pwend...@gmail.com wrote: Yeah I think the problem is that the spark-submit script doesn't pass the argument array to spark-class in the right way, so any quoted strings get flattened. I think we'll need to figure out how to do this correctly in the bash script so that quoted strings get passed in the right way. I tried a few different approaches but finally ended up giving up; my bash-fu is apparently not strong enough. If you can make it work great, but I have -J working locally in case you give up like me. :-) -- Marcelo -- Marcelo
Re: SparkSubmit and --driver-java-options
Patch here: https://github.com/apache/spark/pull/609 On Wed, Apr 30, 2014 at 2:26 PM, Patrick Wendell pwend...@gmail.com wrote: Dean - our e-mails crossed, but thanks for the tip. Was independently arriving at your solution :) Okay I'll submit something. - Patrick On Wed, Apr 30, 2014 at 2:14 PM, Marcelo Vanzin van...@cloudera.com wrote: Cool, that seems to work. Thanks! On Wed, Apr 30, 2014 at 2:09 PM, Patrick Wendell pwend...@gmail.com wrote: Marcelo - Mind trying the following diff locally? If it works I can send a patch: patrick@patrick-t430s:~/Documents/spark$ git diff bin/spark-submit diff --git a/bin/spark-submit b/bin/spark-submit index dd0d95d..49bc262 100755 --- a/bin/spark-submit +++ b/bin/spark-submit @@ -18,7 +18,7 @@ # export SPARK_HOME=$(cd `dirname $0`/..; pwd) -ORIG_ARGS=$@ +ORIG_ARGS=($@) while (($#)); do if [ $1 = --deploy-mode ]; then @@ -39,5 +39,5 @@ if [ ! -z $DRIVER_MEMORY ] [ ! -z $DEPLOY_MODE ] [ $DEPLOY_MODE = client export SPARK_MEM=$DRIVER_MEMORY fi -$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit $ORIG_ARGS +$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit ${ORIG_ARGS[@]} On Wed, Apr 30, 2014 at 1:51 PM, Patrick Wendell pwend...@gmail.com wrote: So I reproduced the problem here: == test.sh == #!/bin/bash for x in $@; do echo arg: $x done ARGS_COPY=$@ for x in $ARGS_COPY; do echo arg_copy: $x done == ./test.sh a b c d e f arg: a arg: b arg: c d e arg: f arg_copy: a b c d e f I'll dig around a bit more and see if we can fix it. Pretty sure we aren't passing these argument arrays around correctly in bash. On Wed, Apr 30, 2014 at 1:48 PM, Marcelo Vanzin van...@cloudera.com wrote: On Wed, Apr 30, 2014 at 1:41 PM, Patrick Wendell pwend...@gmail.com wrote: Yeah I think the problem is that the spark-submit script doesn't pass the argument array to spark-class in the right way, so any quoted strings get flattened. I think we'll need to figure out how to do this correctly in the bash script so that quoted strings get passed in the right way. I tried a few different approaches but finally ended up giving up; my bash-fu is apparently not strong enough. If you can make it work great, but I have -J working locally in case you give up like me. :-) -- Marcelo -- Marcelo
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
I'm cancelling this vote in favor of rc6. On Tue, May 13, 2014 at 8:01 AM, Sean Owen so...@cloudera.com wrote: On Tue, May 13, 2014 at 2:49 PM, Sean Owen so...@cloudera.com wrote: On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell pwend...@gmail.com wrote: The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc5/ Good news is that the sigs, MD5 and SHA are all correct. Tiny note: the Maven artifacts use SHA1, while the binary artifacts use SHA512, which took me a bit of head-scratching to figure out. If another RC comes out, I might suggest making it SHA1 everywhere? But there is nothing wrong with these signatures and checksums. Now to look at the contents... This is a bit of drudgery that probably needs to be done too: a review of the LICENSE and NOTICE file. Having dumped the licenses of dependencies, I don't believe these reflect all of the software that's going to be distributed in 1.0. (Good news is there's no forbidden license stuff included AFAICT.) And good news is that NOTICE can be auto-generated, largely, with a Maven plugin. This can be done manually for now. And there is a license plugin that will list all known licenses of transitive dependencies so that LICENSE can be filled out fairly easily. What say? want a JIRA with details?
[VOTE] Release Apache Spark 1.0.0 (rc6)
Please vote on releasing the following candidate as Apache Spark version 1.0.0! This patch has a few minor fixes on top of rc5. I've also built the binary artifacts with Hive support enabled so people can test this configuration. When we release 1.0 we might just release both vanilla and Hive-enabled binaries. The tag to be voted on is v1.0.0-rc6 (commit 54133a): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=54133abdce0246f6643a1112a5204afb2c4caa82 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc6/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachestratos-1011 The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc6-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Saturday, May 17, at 20:58 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark changes to the streaming API: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x changes to the GraphX API: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 coGroup and related functions now return Iterable[T] instead of Seq[T] ==> Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] ==> Call toSeq on the result to restore old behavior
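A minimal Scala sketch of the two toSeq migration notes above (rdd1 and rdd2 are assumed to be pair RDDs, and MyApp is a placeholder application class, so treat the snippet as illustrative only):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // pair-RDD functions such as cogroup

// rdd1: RDD[(K, V)], rdd2: RDD[(K, W)] -- hypothetical inputs
// cogroup values are now Iterable[_]; call toSeq on them where the old Seq behavior is needed
val grouped = rdd1.cogroup(rdd2).mapValues { case (vs, ws) => (vs.toSeq, ws.toSeq) }

// jarOfClass now returns Option[String]; toSeq restores a Seq[String]
val appJars: Seq[String] = SparkContext.jarOfClass(classOf[MyApp]).toSeq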
Re: [VOTE] Release Apache Spark 1.0.0 (rc7)
I'll start the voting with a +1. On Thu, May 15, 2014 at 1:14 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This patch has minor documentation changes and fixes on top of rc6. The tag to be voted on is v1.0.0-rc7 (commit 9212b3e): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=9212b3e5bb5545ccfce242da8d89108e6fb1c464 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc7/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1015 The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Sunday, May 18, at 09:12 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark changes to the streaming API: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x changes to the GraphX API: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] == Call toSeq on the result to restore old behavior
[RESULT][VOTE] Release Apache Spark 1.0.0 (rc6)
This vote is cancelled in favor of rc7. On Wed, May 14, 2014 at 1:02 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This patch has a few minor fixes on top of rc5. I've also built the binary artifacts with Hive support enabled so people can test this configuration. When we release 1.0 we might just release both vanilla and Hive-enabled binaries. The tag to be voted on is v1.0.0-rc6 (commit 54133a): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=54133abdce0246f6643a1112a5204afb2c4caa82 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc6/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachestratos-1011 The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc6-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Saturday, May 17, at 20:58 UTC and passes if amajority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark changes to the streaming API: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x changes to the GraphX API: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] == Call toSeq on the result to restore old behavior
[VOTE] Release Apache Spark 1.0.0 (rc8)
[Due to ASF e-mail outage, I'm not sure if anyone will actually receive this.] Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has only minor changes on top of rc7. The tag to be voted on is v1.0.0-rc8 (commit 80eea0f): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=80eea0f111c06260ffaa780d2f3f7facd09c17bc The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc8/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1016/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Monday, May 19, at 10:15 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark changes to the streaming API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x changes to the GraphX API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 coGroup and related functions now return Iterable[T] instead of Seq[T] ==> Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] ==> Call toSeq on the result to restore old behavior
[VOTE] Release Apache Spark 1.0.0 (rc7)
Please vote on releasing the following candidate as Apache Spark version 1.0.0! This patch has minor documentation changes and fixes on top of rc6. The tag to be voted on is v1.0.0-rc7 (commit 9212b3e): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=9212b3e5bb5545ccfce242da8d89108e6fb1c464 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc7/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1015 The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Sunday, May 18, at 09:12 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark changes to the streaming API: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x changes to the GraphX API: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] == Call toSeq on the result to restore old behavior
Re: [VOTE] Release Apache Spark 1.0.0 (rc7)
Hey all, My vote threads seem to be running about 24 hours behind and/or getting swallowed by infra e-mail. I sent RC8 yesterday and we might send one tonight as well. I'll make sure to close all existing ones. There have been only small polish changes in the recent RCs since RC5, so testing any of these should be pretty equivalent. I'll make sure I close all the other threads by tonight. - Patrick On Fri, May 16, 2014 at 1:10 PM, Mark Hamstra m...@clearstorydata.com wrote: Sorry for the duplication, but I think this is the current VOTE candidate -- we're not voting on rc8 yet? +1, but just barely. We've got quite a number of outstanding bugs identified, and many of them have fixes in progress. I'd hate to see those efforts get lost in a post-1.0.0 flood of new features targeted at 1.1.0 -- in other words, I'd like to see 1.0.1 retain a high priority relative to 1.1.0. Looking through the unresolved JIRAs, it doesn't look like any of the identified bugs are show-stoppers or strictly regressions (although I will note that one that I have in progress, SPARK-1749, is a bug that we introduced with recent work -- it's not strictly a regression because we had equally bad but different behavior when the DAGScheduler exceptions weren't previously being handled at all vs. being slightly mis-handled now), so I'm not currently seeing a reason not to release. On Fri, May 16, 2014 at 11:42 AM, Henry Saputra henry.sapu...@gmail.com wrote: Ah ok, thanks Aaron Just to make sure we VOTE the right RC. Thanks, Henry On Fri, May 16, 2014 at 11:37 AM, Aaron Davidson ilike...@gmail.com wrote: It was, but due to the apache infra issues, some may not have received the email yet... On Fri, May 16, 2014 at 10:48 AM, Henry Saputra henry.sapu...@gmail.com wrote: Hi Patrick, Just want to make sure that VOTE for rc6 also cancelled? Thanks, Henry On Thu, May 15, 2014 at 1:15 AM, Patrick Wendell pwend...@gmail.com wrote: I'll start the voting with a +1. On Thu, May 15, 2014 at 1:14 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This patch has minor documentation changes and fixes on top of rc6. The tag to be voted on is v1.0.0-rc7 (commit 9212b3e): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=9212b3e5bb5545ccfce242da8d89108e6fb1c464 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc7/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1015 The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Sunday, May 18, at 09:12 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. 
changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark changes to the streaming API: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x changes to the GraphX API: http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] == Call toSeq on the result to restore old behavior
[RESULT] [VOTE] Release Apache Spark 1.0.0 (rc8)
Cancelled in favor of rc9. On Sat, May 17, 2014 at 12:51 AM, Patrick Wendell pwend...@gmail.com wrote: Due to the issue discovered by Michael, this vote is cancelled in favor of rc9. On Fri, May 16, 2014 at 6:22 PM, Michael Armbrust mich...@databricks.com wrote: -1 We found a regression in the way configuration is passed to executors. https://issues.apache.org/jira/browse/SPARK-1864 https://github.com/apache/spark/pull/808 Michael On Fri, May 16, 2014 at 3:57 PM, Mark Hamstra m...@clearstorydata.com wrote: +1 On Fri, May 16, 2014 at 2:16 AM, Patrick Wendell pwend...@gmail.com wrote: [Due to ASF e-mail outage, I'm not if anyone will actually receive this.] Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has only minor changes on top of rc7. The tag to be voted on is v1.0.0-rc8 (commit 80eea0f): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=80eea0f111c06260ffaa780d2f3f7facd09c17bc The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc8/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1016/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Monday, May 19, at 10:15 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark changes to the streaming API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x changes to the GraphX API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] == Call toSeq on the result to restore old behavior
Re: [VOTE] Release Apache Spark 1.0.0 (rc8)
Due to the issue discovered by Michael, this vote is cancelled in favor of rc9. On Fri, May 16, 2014 at 6:22 PM, Michael Armbrust mich...@databricks.com wrote: -1 We found a regression in the way configuration is passed to executors. https://issues.apache.org/jira/browse/SPARK-1864 https://github.com/apache/spark/pull/808 Michael On Fri, May 16, 2014 at 3:57 PM, Mark Hamstra m...@clearstorydata.com wrote: +1 On Fri, May 16, 2014 at 2:16 AM, Patrick Wendell pwend...@gmail.com wrote: [Due to ASF e-mail outage, I'm not if anyone will actually receive this.] Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has only minor changes on top of rc7. The tag to be voted on is v1.0.0-rc8 (commit 80eea0f): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=80eea0f111c06260ffaa780d2f3f7facd09c17bc The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc8/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1016/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Monday, May 19, at 10:15 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark changes to the streaming API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x changes to the GraphX API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] == Call toSeq on the result to restore old behavior
Re: [VOTE] Release Apache Spark 1.0.0 (rc9)
I'll start the voting with a +1. On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has one bug fix and one minor feature on top of rc8: SPARK-1864: https://github.com/apache/spark/pull/808 SPARK-1808: https://github.com/apache/spark/pull/799 The tag to be voted on is v1.0.0-rc9 (commit 920f947): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc9/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1017/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Tuesday, May 20, at 08:56 UTC and passes if amajority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark changes to the streaming API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x changes to the GraphX API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] == Call toSeq on the result to restore old behavior
Re: Calling external classes added by sc.addJar needs to be through reflection
@db - it's possible that you aren't including the jar in the classpath of your driver program (I think this is what mridul was suggesting). It would be helpful to see the stack trace of the CNFE. - Patrick On Sun, May 18, 2014 at 11:54 AM, Patrick Wendell pwend...@gmail.com wrote: @xiangrui - we don't expect these to be present on the system classpath, because they get dynamically added by Spark (e.g. your application can call sc.addJar well after the JVM's have started). @db - I'm pretty surprised to see that behavior. It's definitely not intended that users need reflection to instantiate their classes - something odd is going on in your case. If you could create an isolated example and post it to the JIRA, that would be great. On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng men...@gmail.com wrote: I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870 DB, could you add more info to that JIRA? Thanks! -Xiangrui On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng men...@gmail.com wrote: Btw, I tried rdd.map { i = System.getProperty(java.class.path) }.collect() but didn't see the jars added via --jars on the executor classpath. -Xiangrui On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng men...@gmail.com wrote: I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The reflection approach mentioned by DB didn't work either. I checked the distributed cache on a worker node and found the jar there. It is also in the Environment tab of the WebUI. The workaround is making an assembly jar. DB, could you create a JIRA and describe what you have found so far? Thanks! Best, Xiangrui On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan mri...@gmail.com wrote: Can you try moving your mapPartitions to another class/object which is referenced only after sc.addJar ? I would suspect CNFEx is coming while loading the class containing mapPartitions before addJars is executed. In general though, dynamic loading of classes means you use reflection to instantiate it since expectation is you don't know which implementation provides the interface ... If you statically know it apriori, you bundle it in your classpath. Regards Mridul On 17-May-2014 7:28 am, DB Tsai dbt...@stanford.edu wrote: Finally find a way out of the ClassLoader maze! It took me some times to understand how it works; I think it worths to document it in a separated thread. We're trying to add external utility.jar which contains CSVRecordParser, and we added the jar to executors through sc.addJar APIs. If the instance of CSVRecordParser is created without reflection, it raises *ClassNotFound Exception*. data.mapPartitions(lines = { val csvParser = new CSVRecordParser((delimiter.charAt(0)) lines.foreach(line = { val lineElems = csvParser.parseLine(line) }) ... ... ) If the instance of CSVRecordParser is created through reflection, it works. data.mapPartitions(lines = { val loader = Thread.currentThread.getContextClassLoader val CSVRecordParser = loader.loadClass(com.alpine.hadoop.ext.CSVRecordParser) val csvParser = CSVRecordParser.getConstructor(Character.TYPE) .newInstance(delimiter.charAt(0).asInstanceOf[Character]) val parseLine = CSVRecordParser .getDeclaredMethod(parseLine, classOf[String]) lines.foreach(line = { val lineElems = parseLine.invoke(csvParser, line).asInstanceOf[Array[String]] }) ... ... 
) This is identical to this question, http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection It's not intuitive for users to load external classes through reflection, but couple available solutions including 1) messing around systemClassLoader by calling systemClassLoader.addURI through reflection or 2) forking another JVM to add jars into classpath before bootstrap loader are very tricky. Any thought on fixing it properly? @Xiangrui, netlib-java jniloader is loaded from netlib-java through reflection, so this problem will not be seen. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai
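For readability, DB's reflection workaround from the quoted message, reconstructed as a sketch (data, delimiter, and the CSVRecordParser class name are placeholders taken from his example; data is assumed to be an RDD[String]):

data.mapPartitions { lines =>
  // The jar containing the parser was shipped via sc.addJar, so it is only visible
  // to the context classloader, not to the loader that defined this closure's class
  val loader = Thread.currentThread.getContextClassLoader
  val parserClass = loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
  val parser = parserClass.getConstructor(Character.TYPE)
    .newInstance(delimiter.charAt(0).asInstanceOf[Character])
    .asInstanceOf[AnyRef]
  val parseLine = parserClass.getDeclaredMethod("parseLine", classOf[String])
  lines.map(line => parseLine.invoke(parser, line).asInstanceOf[Array[String]])
}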
Re: Calling external classes added by sc.addJar needs to be through reflection
@xiangrui - we don't expect these to be present on the system classpath, because they get dynamically added by Spark (e.g. your application can call sc.addJar well after the JVM's have started). @db - I'm pretty surprised to see that behavior. It's definitely not intended that users need reflection to instantiate their classes - something odd is going on in your case. If you could create an isolated example and post it to the JIRA, that would be great. On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng men...@gmail.com wrote: I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870 DB, could you add more info to that JIRA? Thanks! -Xiangrui On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng men...@gmail.com wrote: Btw, I tried rdd.map { i = System.getProperty(java.class.path) }.collect() but didn't see the jars added via --jars on the executor classpath. -Xiangrui On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng men...@gmail.com wrote: I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The reflection approach mentioned by DB didn't work either. I checked the distributed cache on a worker node and found the jar there. It is also in the Environment tab of the WebUI. The workaround is making an assembly jar. DB, could you create a JIRA and describe what you have found so far? Thanks! Best, Xiangrui On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan mri...@gmail.com wrote: Can you try moving your mapPartitions to another class/object which is referenced only after sc.addJar ? I would suspect CNFEx is coming while loading the class containing mapPartitions before addJars is executed. In general though, dynamic loading of classes means you use reflection to instantiate it since expectation is you don't know which implementation provides the interface ... If you statically know it apriori, you bundle it in your classpath. Regards Mridul On 17-May-2014 7:28 am, DB Tsai dbt...@stanford.edu wrote: Finally find a way out of the ClassLoader maze! It took me some times to understand how it works; I think it worths to document it in a separated thread. We're trying to add external utility.jar which contains CSVRecordParser, and we added the jar to executors through sc.addJar APIs. If the instance of CSVRecordParser is created without reflection, it raises *ClassNotFound Exception*. data.mapPartitions(lines = { val csvParser = new CSVRecordParser((delimiter.charAt(0)) lines.foreach(line = { val lineElems = csvParser.parseLine(line) }) ... ... ) If the instance of CSVRecordParser is created through reflection, it works. data.mapPartitions(lines = { val loader = Thread.currentThread.getContextClassLoader val CSVRecordParser = loader.loadClass(com.alpine.hadoop.ext.CSVRecordParser) val csvParser = CSVRecordParser.getConstructor(Character.TYPE) .newInstance(delimiter.charAt(0).asInstanceOf[Character]) val parseLine = CSVRecordParser .getDeclaredMethod(parseLine, classOf[String]) lines.foreach(line = { val lineElems = parseLine.invoke(csvParser, line).asInstanceOf[Array[String]] }) ... ... ) This is identical to this question, http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection It's not intuitive for users to load external classes through reflection, but couple available solutions including 1) messing around systemClassLoader by calling systemClassLoader.addURI through reflection or 2) forking another JVM to add jars into classpath before bootstrap loader are very tricky. Any thought on fixing it properly? 
@Xiangrui, netlib-java jniloader is loaded from netlib-java through reflection, so this problem will not be seen. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai
Re: [VOTE] Release Apache Spark 1.0.0 (rc9)
Hey Matei - the issue you found is not related to security. This patch a few days ago broke builds for Hadoop 1 with YARN support enabled. The patch directly altered the way we deal with commons-lang dependency, which is what is at the base of this stack trace. https://github.com/apache/spark/pull/754 - Patrick On Sun, May 18, 2014 at 5:28 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Alright, I've opened https://github.com/apache/spark/pull/819 with the Windows fixes. I also found one other likely bug, https://issues.apache.org/jira/browse/SPARK-1875, in the binary packages for Hadoop1 built in this RC. I think this is due to Hadoop 1's security code depending on a different version of org.apache.commons than Hadoop 2, but it needs investigation. Tom, any thoughts on this? Matei On May 18, 2014, at 12:33 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I took the always fun task of testing it on Windows, and unfortunately, I found some small problems with the prebuilt packages due to recent changes to the launch scripts: bin/spark-class2.cmd looks in ./jars instead of ./lib for the assembly JAR, and bin/run-example2.cmd doesn't quite match the master-setting behavior of the Unix based one. I'll send a pull request to fix them soon. Matei On May 17, 2014, at 11:32 AM, Sandy Ryza sandy.r...@cloudera.com wrote: +1 Reran my tests from rc5: * Built the release from source. * Compiled Java and Scala apps that interact with HDFS against it. * Ran them in local mode. * Ran them against a pseudo-distributed YARN cluster in both yarn-client mode and yarn-cluster mode. On Sat, May 17, 2014 at 10:08 AM, Andrew Or and...@databricks.com wrote: +1 2014-05-17 8:53 GMT-07:00 Mark Hamstra m...@clearstorydata.com: +1 On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell pwend...@gmail.com wrote: I'll start the voting with a +1. On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has one bug fix and one minor feature on top of rc8: SPARK-1864: https://github.com/apache/spark/pull/808 SPARK-1808: https://github.com/apache/spark/pull/799 The tag to be voted on is v1.0.0-rc9 (commit 920f947): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc9/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1017/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Tuesday, May 20, at 08:56 UTC and passes if amajority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. 
changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark changes to the streaming API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x changes to the GraphX API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] == Call toSeq on the result to restore old behavior
Re: Calling external classes added by sc.addJar needs to be through reflection
Having a user add define a custom class inside of an added jar and instantiate it directly inside of an executor is definitely supported in Spark and has been for a really long time (several years). This is something we do all the time in Spark. DB - I'd hold off on a re-architecting of this until we identify exactly what is causing the bug you are running into. In a nutshell, when the bytecode new Foo() is run on the executor, it will ask the driver for the class over HTTP using a custom classloader. Something in that pipeline is breaking here, possibly related to the YARN deployment stuff. On Mon, May 19, 2014 at 12:29 AM, Sean Owen so...@cloudera.com wrote: I don't think a customer classloader is necessary. Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc all run custom user code that creates new user objects without reflection. I should go see how that's done. Maybe it's totally valid to set the thread's context classloader for just this purpose, and I am not thinking clearly. On Mon, May 19, 2014 at 8:26 AM, Andrew Ash and...@andrewash.com wrote: Sounds like the problem is that classloaders always look in their parents before themselves, and Spark users want executors to pick up classes from their custom code before the ones in Spark plus its dependencies. Would a custom classloader that delegates to the parent after first checking itself fix this up? On Mon, May 19, 2014 at 12:17 AM, DB Tsai dbt...@stanford.edu wrote: Hi Sean, It's true that the issue here is classloader, and due to the classloader delegation model, users have to use reflection in the executors to pick up the classloader in order to use those classes added by sc.addJars APIs. However, it's very inconvenience for users, and not documented in spark. I'm working on a patch to solve it by calling the protected method addURL in URLClassLoader to update the current default classloader, so no customClassLoader anymore. I wonder if this is an good way to go. private def addURL(url: URL, loader: URLClassLoader){ try { val method: Method = classOf[URLClassLoader].getDeclaredMethod(addURL, classOf[URL]) method.setAccessible(true) method.invoke(loader, url) } catch { case t: Throwable = { throw new IOException(Error, could not add URL to system classloader) } } } Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, May 18, 2014 at 11:57 PM, Sean Owen so...@cloudera.com wrote: I might be stating the obvious for everyone, but the issue here is not reflection or the source of the JAR, but the ClassLoader. The basic rules are this. new Foo will use the ClassLoader that defines Foo. This is usually the ClassLoader that loaded whatever it is that first referenced Foo and caused it to be loaded -- usually the ClassLoader holding your other app classes. ClassLoaders can have a parent-child relationship. ClassLoaders always look in their parent before themselves. (Careful then -- in contexts like Hadoop or Tomcat where your app is loaded in a child ClassLoader, and you reference a class that Hadoop or Tomcat also has (like a lib class) you will get the container's version!) When you load an external JAR it has a separate ClassLoader which does not necessarily bear any relation to the one containing your app classes, so yeah it is not generally going to make new Foo work. Reflection lets you pick the ClassLoader, yes. I would not call setContextClassLoader. 
On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza sandy.r...@cloudera.com wrote: I spoke with DB offline about this a little while ago and he confirmed that he was able to access the jar from the driver. The issue appears to be a general Java issue: you can't directly instantiate a class from a dynamically loaded jar. I reproduced it locally outside of Spark with: --- URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new File("myotherjar.jar").toURI().toURL() }, null); Thread.currentThread().setContextClassLoader(urlClassLoader); MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar(); --- I was able to load the class with reflection.
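A small Scala sketch of the delegation behavior Sandy reproduces above (secondary.jar and com.example.B are hypothetical names): asking the child loader explicitly works, while a lookup made by a class owned by the parent loader does not.

import java.io.File
import java.net.URLClassLoader

// A jar that is NOT on the JVM's system classpath (hypothetical path)
val extraJar = new File("secondary.jar").toURI.toURL

// Child loader that knows about the extra jar, parented to the current loader
val child = new URLClassLoader(Array(extraJar), getClass.getClassLoader)

// Works: the child loader is consulted explicitly
val ok = Class.forName("com.example.B", true, child).newInstance()

// Fails with ClassNotFoundException: a plain `new B()` (or Class.forName with no
// loader argument) inside a class defined by the parent loader never consults the child
// val broken = Class.forName("com.example.B")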
Re: spark 1.0 standalone application
Whenever we publish a release candidate, we create a temporary maven repository that hosts the artifacts. We do this precisely for the case you are running into (where a user wants to build an application against it to test). You can build against the release candidate by just adding that repository in your sbt build, then linking against spark-core version 1.0.0. For rc9 the repository is in the vote e-mail: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc9-td6629.html On Mon, May 19, 2014 at 7:03 PM, Mark Hamstra m...@clearstorydata.com wrote: That's the crude way to do it. If you run `sbt/sbt publishLocal`, then you can resolve the artifact from your local cache in the same way that you would resolve it if it were deployed to a remote cache. That's just the build step. Actually running the application will require the necessary jars to be accessible by the cluster nodes. On Mon, May 19, 2014 at 7:04 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Well, you have to put spark-assembly-*.jar in the lib directory of your application Best, -- Nan Zhu On Monday, May 19, 2014 at 9:48 PM, nit wrote: I am not very comfortable with sbt. I want to build a standalone application using spark 1.0 RC9. I can build an sbt assembly for my application with Spark 0.9.1, and I think in that case spark is pulled from the Akka repository? Now if I want to use 1.0 RC9 for my application, what is the process? (FYI, I was able to build spark-1.0 via sbt/assembly and I can see the sbt-assembly jar; and I think I will have to copy my jar somewhere? and update build.sbt?) PS: I am not sure if this is the right place for this question; but since 1.0 is still RC, I felt that this may be an appropriate forum. thanks! -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/spark-1-0-standalone-application-tp6698.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com (http://Nabble.com).
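A build.sbt sketch of what Patrick describes, using the rc9 staging repository URL from the vote thread (the Scala version and other settings here are assumptions for illustration):

// build.sbt for compiling an application against the 1.0.0 release candidate
scalaVersion := "2.10.4"

// staging repository listed in the rc9 vote e-mail
resolvers += "Apache Spark 1.0.0 RC staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1017/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"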
Re: [VOTE] Release Apache Spark 1.0.0 (rc9)
We're cancelling this RC in favor of rc10. There were two blockers: an issue with Windows run scripts and an issue with the packaging for Hadoop 1 when hive support is bundled. https://issues.apache.org/jira/browse/SPARK-1875 https://issues.apache.org/jira/browse/SPARK-1876 Thanks everyone for the testing. TD will be cutting rc10, since I'm travelling this week (thanks TD!). - Patrick On Mon, May 19, 2014 at 7:06 PM, Nan Zhu zhunanmcg...@gmail.com wrote: just rerun my test on rc5 everything works build applications with sbt and the spark-*.jar which is compiled with Hadoop 2.3 +1 -- Nan Zhu On Sunday, May 18, 2014 at 11:07 PM, witgo wrote: How to reproduce this bug? -- Original -- From: Patrick Wendell;pwend...@gmail.com (mailto:pwend...@gmail.com); Date: Mon, May 19, 2014 10:08 AM To: dev@spark.apache.org (mailto:dev@spark.apache.org)dev@spark.apache.org (mailto:dev@spark.apache.org); Cc: Tom Gravestgraves...@yahoo.com (mailto:tgraves...@yahoo.com); Subject: Re: [VOTE] Release Apache Spark 1.0.0 (rc9) Hey Matei - the issue you found is not related to security. This patch a few days ago broke builds for Hadoop 1 with YARN support enabled. The patch directly altered the way we deal with commons-lang dependency, which is what is at the base of this stack trace. https://github.com/apache/spark/pull/754 - Patrick On Sun, May 18, 2014 at 5:28 PM, Matei Zaharia matei.zaha...@gmail.com (mailto:matei.zaha...@gmail.com) wrote: Alright, I've opened https://github.com/apache/spark/pull/819 with the Windows fixes. I also found one other likely bug, https://issues.apache.org/jira/browse/SPARK-1875, in the binary packages for Hadoop1 built in this RC. I think this is due to Hadoop 1's security code depending on a different version of org.apache.commons than Hadoop 2, but it needs investigation. Tom, any thoughts on this? Matei On May 18, 2014, at 12:33 PM, Matei Zaharia matei.zaha...@gmail.com (mailto:matei.zaha...@gmail.com) wrote: I took the always fun task of testing it on Windows, and unfortunately, I found some small problems with the prebuilt packages due to recent changes to the launch scripts: bin/spark-class2.cmd looks in ./jars instead of ./lib for the assembly JAR, and bin/run-example2.cmd doesn't quite match the master-setting behavior of the Unix based one. I'll send a pull request to fix them soon. Matei On May 17, 2014, at 11:32 AM, Sandy Ryza sandy.r...@cloudera.com (mailto:sandy.r...@cloudera.com) wrote: +1 Reran my tests from rc5: * Built the release from source. * Compiled Java and Scala apps that interact with HDFS against it. * Ran them in local mode. * Ran them against a pseudo-distributed YARN cluster in both yarn-client mode and yarn-cluster mode. On Sat, May 17, 2014 at 10:08 AM, Andrew Or and...@databricks.com (mailto:and...@databricks.com) wrote: +1 2014-05-17 8:53 GMT-07:00 Mark Hamstra m...@clearstorydata.com (mailto:m...@clearstorydata.com): +1 On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell pwend...@gmail.com (mailto:pwend...@gmail.com) wrote: I'll start the voting with a +1. On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell pwend...@gmail.com (mailto:pwend...@gmail.com) wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.0! 
This has one bug fix and one minor feature on top of rc8: SPARK-1864: https://github.com/apache/spark/pull/808 SPARK-1808: https://github.com/apache/spark/pull/799 The tag to be voted on is v1.0.0-rc9 (commit 920f947): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc9/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1017/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Tuesday, May 20, at 08:56 UTC and passes if amajority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see
Re: Calling external classes added by sc.addJar needs to be through reflection
...@stanford.edu wrote: Good summary! We fixed it in branch 0.9 since our production is still in 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for 1.0 tonight. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza sandy.r...@cloudera.com wrote: It just hit me why this problem is showing up on YARN and not on standalone. The relevant difference between YARN and standalone is that, on YARN, the app jar is loaded by the system classloader instead of Spark's custom URL classloader. On YARN, the system classloader knows about [the classes in the spark jars, the classes in the primary app jar]. The custom classloader knows about [the classes in secondary app jars] and has the system classloader as its parent. A few relevant facts (mostly redundant with what Sean pointed out): * Every class has a classloader that loaded it. * When an object of class B is instantiated inside of class A, the classloader used for loading B is the classloader that was used for loading A. * When a classloader fails to load a class, it lets its parent classloader try. If its parent succeeds, its parent becomes the classloader that loaded it. So suppose class B is in a secondary app jar and class A is in the primary app jar: 1. The custom classloader will try to load class A. 2. It will fail, because it only knows about the secondary jars. 3. It will delegate to its parent, the system classloader. 4. The system classloader will succeed, because it knows about the primary app jar. 5. A's classloader will be the system classloader. 6. A tries to instantiate an instance of class B. 7. B will be loaded with A's classloader, which is the system classloader. 8. Loading B will fail, because A's classloader, which is the system classloader, doesn't know about the secondary app jars. In Spark standalone, A and B are both loaded by the custom classloader, so this issue doesn't come up. -Sandy On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell pwend...@gmail.com wrote: Having a user add define a custom class inside of an added jar and instantiate it directly inside of an executor is definitely supported in Spark and has been for a really long time (several years). This is something we do all the time in Spark. DB - I'd hold off on a re-architecting of this until we identify exactly what is causing the bug you are running into. In a nutshell, when the bytecode new Foo() is run on the executor, it will ask the driver for the class over HTTP using a custom classloader. Something in that pipeline is breaking here, possibly related to the YARN deployment stuff. On Mon, May 19, 2014 at 12:29 AM, Sean Owen so...@cloudera.com wrote: I don't think a customer classloader is necessary. Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc all run custom user code that creates new user objects without reflection. I should go see how that's done. Maybe it's totally valid to set the thread's context classloader for just this purpose, and I am not thinking clearly. On Mon, May 19, 2014 at 8:26 AM, Andrew Ash and...@andrewash.com wrote: Sounds like the problem is that classloaders always look in their parents before themselves, and Spark users want executors to pick up classes from their custom code before the ones in Spark plus its dependencies. Would a custom classloader that delegates to the parent after first checking itself fix this up? 
On Mon, May 19, 2014 at 12:17 AM, DB Tsai dbt...@stanford.edu wrote: Hi Sean, It's true that the issue here is the classloader, and due to the classloader delegation model, users have to use reflection in the executors to pick up the classloader in order to use those classes added by the sc.addJar API. However, it's very inconvenient for users, and not documented in spark. I'm working on a patch to solve it by calling the protected method addURL in URLClassLoader to update the current default classloader, so no customClassLoader anymore. I wonder if this is a good way to go. private def addURL(url: URL, loader: URLClassLoader) { try { val method: Method = classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL]) method.setAccessible(true) method.invoke
Re: Calling external classes added by sc.addJar needs to be through reflection
Hey I just looked at the fix here: https://github.com/apache/spark/pull/848 Given that this is quite simple, maybe it's best to just go with this and just explain that we don't support adding jars dynamically in YARN in Spark 1.0. That seems like a reasonable thing to do. On Wed, May 21, 2014 at 3:15 PM, Patrick Wendell pwend...@gmail.com wrote: Of these two solutions I'd definitely prefer 2 in the short term. I'd imagine the fix is very straightforward (it would mostly just be removing code), and we'd be making this more consistent with the standalone mode, which makes things way easier to reason about. In the long term we'll definitely want to exploit the distributed cache more, but at this point it's premature optimization at a high complexity cost. Writing stuff to HDFS, though, is so slow anyways that I'd guess serving it directly from the driver is still faster in most cases (though for very large jar sizes or very large clusters, yes, we'll need the distributed cache). - Patrick On Wed, May 21, 2014 at 2:41 PM, Xiangrui Meng men...@gmail.com wrote: That's a good example. If we really want to cover that case, there are two solutions: 1. Follow DB's patch, adding jars to the system classloader. Then we cannot put a user class in front of an existing class. 2. Do not send the primary jar and secondary jars to executors' distributed cache. Instead, add them to spark.jars in SparkSubmit and serve them via http by calling sc.addJar in SparkContext. What is your preference? On Wed, May 21, 2014 at 2:27 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Is that an assumption we can make? I think we'd run into an issue in this situation: *In primary jar:* def makeDynamicObject(clazz: String) = Class.forName(clazz).newInstance() *In app code:* sc.addJar("dynamicjar.jar") ... rdd.map(x => makeDynamicObject("some.class.from.DynamicJar")) It might be fair to say that the user should make sure to use the context classloader when instantiating dynamic classes, but I think it's weird that this code would work on Spark standalone but not on YARN. -Sandy On Wed, May 21, 2014 at 2:10 PM, Xiangrui Meng men...@gmail.com wrote: I think adding jars dynamically should work as long as the primary jar and the secondary jars do not depend on dynamically added jars, which should be the correct logic. -Xiangrui On Wed, May 21, 2014 at 1:40 PM, DB Tsai dbt...@stanford.edu wrote: This will be another separate story. Since in the yarn deployment, as Sandy said, the app.jar will always be in the system classloader, which means any object instantiated in app.jar will have the system classloader as its parent loader instead of the custom one. As a result, the custom classloader in yarn will never work without specifically using reflection. The solution would be to not use the system classloader in the classloader hierarchy, and to add all the resources from the system one into the custom one. This is the approach Tomcat takes. Or we can directly overwrite the system classloader by calling the protected method `addURL`, which will not work and will throw an exception if the code is wrapped in a security manager. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza sandy.r...@cloudera.com wrote: This will solve the issue for jars added upon application submission, but, on top of this, we need to make sure that anything dynamically added through sc.addJar works as well.
To do so, we need to make sure that any jars retrieved via the driver's HTTP server are loaded by the same classloader that loads the jars given on app submission. To achieve this, we need to either use the same classloader for both system jars and user jars, or make sure that the user jars given on app submission are under the same classloader used for dynamically added jars. On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng men...@gmail.com wrote: Talked with Sandy and DB offline. I think the best solution is sending the secondary jars to the distributed cache of all containers rather than just the master, and setting the classpath to include the spark jar, the primary app jar, and the secondary jars before the executor starts. In this way, users only need to specify secondary jars via --jars instead of calling sc.addJar inside the code. It also solves the scalability problem of serving all the jars via http. If this solution sounds good, I can try to make a patch. Best, Xiangrui On Mon, May 19, 2014 at 10:04 PM, DB Tsai dbt...@stanford.edu wrote: In 1.0, there is a new option for users to choose which classloader has higher priority via spark.files.userClassPathFirst, so I decided to submit the PR for 0.9 first. We use this patch in our lab and we can use those jars added by sc.addJar without reflection
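Sandy's makeDynamicObject example earlier in the thread also illustrates the user-side workaround under discussion: resolve dynamically added classes against the thread's context classloader, which the executors point at the loader that knows about sc.addJar'd jars, rather than relying on the caller's own loader. A minimal sketch, with hypothetical class names:

    def makeDynamicObject(className: String): AnyRef = {
      // Use the context classloader so classes from sc.addJar'd jars are visible
      // even when this method itself was loaded by the system classloader (YARN).
      val loader = Thread.currentThread().getContextClassLoader
      Class.forName(className, true, loader).newInstance().asInstanceOf[AnyRef]
    }

    // usage inside a closure:
    // rdd.map(x => makeDynamicObject("some.class.from.DynamicJar"))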
Re: No output from Spark Streaming program with Spark 1.0
Also one other thing to try: try removing all of the logic from inside of foreach and just printing something. It could be that somehow an exception is being triggered inside of your foreach block and as a result the output goes away. On Fri, May 23, 2014 at 6:00 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Jim, Do you see the same behavior if you run this outside of eclipse? Also, what happens if you print something to standard out when setting up your streams (i.e. not inside of the foreach) - do you see that? This could be a streaming issue, but it could also be something related to the way it's running in eclipse. - Patrick On Fri, May 23, 2014 at 2:57 PM, Jim Donahue jdona...@adobe.com wrote: I'm trying out 1.0 on a set of small Spark Streaming tests and am running into problems. Here's one of the little programs I've used for a long time - it reads a Kafka stream that contains Twitter JSON tweets and does some simple counting. The program starts OK (it connects to the Kafka stream fine) and generates a stream of INFO logging messages, but never generates any output. :-( I'm running this in Eclipse, so there may be some class loading issue (loading the wrong class or something like that), but I'm not seeing anything in the console output. Thanks, Jim Donahue Adobe

    val kafka_messages = KafkaUtils.createStream[Array[Byte], Array[Byte], kafka.serializer.DefaultDecoder, kafka.serializer.DefaultDecoder](ssc, propsMap, topicMap, StorageLevel.MEMORY_AND_DISK)
    val messages = kafka_messages.map(_._2)
    val total = ssc.sparkContext.accumulator(0)
    val startTime = new java.util.Date().getTime()
    val jsonstream = messages.map[JSONObject](message => { val string = new String(message); val json = new JSONObject(string); total += 1; json })
    val deleted = ssc.sparkContext.accumulator(0)
    val msgstream = jsonstream.filter(json => if (!json.has("delete")) true else { deleted += 1; false })
    msgstream.foreach(rdd => {
      if (rdd.count() > 0) {
        val data = rdd.map(json => (json.has("entities"), json.length())).collect()
        val entities: Double = data.count(t => t._1)
        val fieldCounts = data.sortBy(_._2)
        val minFields = fieldCounts(0)._2
        val maxFields = fieldCounts(fieldCounts.size - 1)._2
        val now = new java.util.Date()
        val interval = (now.getTime() - startTime) / 1000
        System.out.println(now.toString)
        System.out.println("processing time: " + interval + " seconds")
        System.out.println("total messages: " + total.value)
        System.out.println("deleted messages: " + deleted.value)
        System.out.println("message receipt rate: " + (total.value / interval) + " per second")
        System.out.println("messages this interval: " + data.length)
        System.out.println("message fields varied between: " + minFields + " and " + maxFields)
        System.out.println("fraction with entities is " + (entities / data.length))
      }
    })
    ssc.start()
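A minimal version of the experiment Patrick suggests, replacing the whole foreach body with a single print so an exception in the original logic cannot silently swallow the output:

    msgstream.foreach { rdd =>
      println("batch at " + new java.util.Date() + ", count = " + rdd.count())
    }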
Re: [VOTE] Release Apache Spark 1.0.0 (RC11)
Hey Ankur, That does seem like a good fix, but right now we are only blocking the release on major regressions that affect all components. So I don't think this is sufficient to block the release from going forward or to justify cutting a new candidate, because we are at a very late stage of the release. We can slot that fix for the 1.0.1 release and merge it into the 1.0 branch so people can get access to it easily. On Mon, May 26, 2014 at 6:50 PM, ankurdave ankurd...@gmail.com wrote: -1 I just fixed SPARK-1931 (https://issues.apache.org/jira/browse/SPARK-1931), which was a critical bug in Graph#partitionBy. Since this is an important part of the GraphX API, I think Spark 1.0.0 should include the fix: https://github.com/apache/spark/pull/885.
Re: [VOTE] Release Apache Spark 1.0.0 (RC11)
+1 I spun up a few EC2 clusters and ran my normal audit checks. Tests passing, sigs, CHANGES and NOTICE look good Thanks TD for helping cut this RC! On Wed, May 28, 2014 at 9:38 PM, Kevin Markey kevin.mar...@oracle.com wrote: +1 Built -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 Ran current version of one of my applications on 1-node pseudocluster (sorry, unable to test on full cluster). yarn-cluster mode Ran regression tests. Thanks Kevin On 05/28/2014 09:55 PM, Krishna Sankar wrote: +1 Pulled built on MacOS X, EC2 Amazon Linux Ran test programs on OS X, 5 node c3.4xlarge cluster Cheers k/ On Wed, May 28, 2014 at 7:36 PM, Andy Konwinski andykonwin...@gmail.comwrote: +1 On May 28, 2014 7:05 PM, Xiangrui Meng men...@gmail.com wrote: +1 Tested apps with standalone client mode and yarn cluster and client modes. Xiangrui On Wed, May 28, 2014 at 1:07 PM, Sean McNamara sean.mcnam...@webtrends.com wrote: Pulled down, compiled, and tested examples on OS X and ubuntu. Deployed app we are building on spark and poured data through it. +1 Sean On May 26, 2014, at 8:39 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has a few important bug fixes on top of rc10: SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853 SPARK-1870: https://github.com/apache/spark/pull/848 SPARK-1897: https://github.com/apache/spark/pull/849 The tag to be voted on is v1.0.0-rc11 (commit c69d97cd): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~tdas/spark-1.0.0-rc11/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/tdas.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1019/ The documentation corresponding to this release can be found at: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Thursday, May 29, at 16:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. Changes to ML vector specification: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10 Changes to the Java API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark Changes to the streaming API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x Changes to the GraphX API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 Other changes: coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] == Call toSeq on the result to restore old behavior
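To make the coGroup change listed above concrete, a small migration sketch (rdd1 and rdd2 are assumed to be existing pair RDDs):

    // 0.9.x: cogroup returned (K, (Seq[V], Seq[W])); in 1.0 it returns Iterables.
    val grouped = rdd1.cogroup(rdd2)   // RDD[(K, (Iterable[V], Iterable[W]))]

    // Where downstream code still expects Seqs, restore the old shape explicitly:
    val legacy = grouped.mapValues { case (vs, ws) => (vs.toSeq, ws.toSeq) }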
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
[tl;dr stable API's are important - sorry, this is slightly meandering] Hey - just wanted to chime in on this as I was travelling. Sean, you bring up great points here about the velocity and stability of Spark. Many projects have fairly customized semantics around what versions actually mean (HBase is a good, if somewhat hard-to-comprehend, example). What the 1.X label means to Spark is that we are willing to guarantee stability for Spark's core API. This is something that Spark has actually been doing for a while already (we've made few or no breaking changes to the Spark core API for several years) and we want to codify this for application developers. In this regard Spark has made a bunch of changes to enforce the integrity of our API's: - We went through and clearly annotated internal, or experimental API's. This was a huge project-wide effort and included Scaladoc and several other components to make it clear to users. - We implemented automated byte-code verification that proposed pull requests don't break public API's. Pull requests after 1.0 will fail if they break API's that are not explicitly declared private or experimental. I can't possibly emphasize enough the importance of API stability. What we want to avoid is the Hadoop approach. Candidly, Hadoop does a poor job on this. There really isn't a well defined stable API for any of the Hadoop components, for a few reasons: 1. Hadoop projects don't do any rigorous checking that new patches don't break API's. Of course, this results in regular API breaks and a poor understanding of what is a public API. 2. In several cases it's not possible to do basic things in Hadoop without using deprecated or private API's. 3. There is significant vendor fragmentation of API's. The main focus of the Hadoop vendors is making consistent cuts of the core projects work together (HDFS/Pig/Hive/etc) - so API breaks are sometimes considered fixed as long as the other projects work around them (see [1]). We also regularly need to do archaeology (see [2]) and directly interact with Hadoop committers to understand what API's are stable and in which versions. One goal of Spark is to deal with the pain of inter-operating with Hadoop so that application writers don't have to. We'd like to retain the property that if you build an application against the (well defined, stable) Spark API's right now, you'll be able to run it across many Hadoop vendors and versions for the entire Spark 1.X release cycle. Writing apps against Hadoop can be very difficult... consider how much more engineering effort we spent maintaining YARN support than Mesos support. There are many factors, but one is that Mesos has a single, narrow, stable API. We haven't had to make a change due to a Mesos API change in several years. YARN, on the other hand, has at least 3 API's that currently exist, all of which are binary incompatible. We'd like to offer apps the ability to build against Spark's API and just let us deal with it. As more vendors package Spark, I'd like to see us put tools in the upstream Spark repo that do validation for vendor packages of Spark, so that we don't end up with fragmentation. Of course, vendors can enhance the API and are encouraged to, but we need a kernel of API's that vendors must maintain (think POSIX) to be considered compliant with Apache Spark. I believe some other projects like OpenStack have done this to avoid fragmentation.
- Patrick [1] https://issues.apache.org/jira/browse/MAPREDUCE-5830 [2] http://2.bp.blogspot.com/-GO6HF0OAFHw/UOfNEH-4sEI/AD0/dEWFFYTRgYw/s1600/output-file.png On Sun, May 18, 2014 at 2:13 AM, Mridul Muralidharan mri...@gmail.com wrote: So I think I need to clarify a few things here - particularly since this mail went to the wrong mailing list and a much wider audience than I intended it for :-) Most of the issues I mentioned are internal implementation details of spark core: which means, we can enhance them in future without disruption to our userbase (ability to support a large number of input/output partitions. Note: this is of the order of 100k input and output partitions with uniform spread of keys - very rarely seen outside of some crazy jobs). Some of the issues I mentioned would require DeveloperApi changes - which are not user exposed: they would impact developer use of these api's - which are mostly internally provided by spark. (Like fixing blocks > 2G would require a change to the Serializer api) A smaller fraction might require interface changes - note, I am referring specifically to configuration changes (removing/deprecating some) and possibly newer options to submit/env, etc - I don't envision any programming api change itself. The only api change we did was from Seq -> Iterable - which is actually to address some of the issues I mentioned (join/cogroup). Remaining are bugs which need to be addressed or the feature removed/enhanced, like shuffle consolidation. There might be
Announcing Spark 1.0.0
I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 is a milestone release as the first in the 1.0 line of releases, providing API stability for Spark's core interfaces. Spark 1.0.0 is Spark's largest release ever, with contributions from 117 developers. I'd like to thank everyone involved in this release - it was truly a community effort with fixes, features, and optimizations contributed from dozens of organizations. This release expands Spark's standard libraries, introducing a new SQL package (SparkSQL) which lets users integrate SQL queries into existing Spark workflows. MLlib, Spark's machine learning library, is expanded with sparse vector support and several new algorithms. The GraphX and Streaming libraries also introduce new features and optimizations. Spark's core engine adds support for secured YARN clusters, a unified tool for submitting Spark applications, and several performance and stability improvements. Finally, Spark adds support for Java 8 lambda syntax and improves coverage of the Java and Python API's. Those features only scratch the surface - check out the release notes here: http://spark.apache.org/releases/spark-release-1-0-0.html Note that since release artifacts were posted recently, certain mirrors may not have working downloads for a few hours. - Patrick
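As a taste of the new SQL package, a minimal sketch along the lines of the 1.0 programming guide (the input file and the Person fields are made up for the example):

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD -> SchemaRDD conversion

    val people = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
    people.registerAsTable("people")

    val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    teenagers.collect().foreach(println)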
Re: Streaming example stops outputting (Java, Kafka at least)
Yeah - Spark streaming needs at least two threads to run. I actually thought we warned the user if they only use one (@tdas?) but the warning might not be working correctly - or I'm misremembering. On Fri, May 30, 2014 at 6:38 AM, Sean Owen so...@cloudera.com wrote: Thanks Nan, that does appear to fix it. I was using local. Can anyone say whether that's to be expected or whether it could be a bug somewhere? On Fri, May 30, 2014 at 2:42 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, Sean I was in the same problem but when I changed MASTER=local to MASTER=local[2] everything back to the normal Hasn't get a chance to ask here Best, -- Nan Zhu
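In other words, the receiver occupies one of the local threads, so the master URL has to leave at least one more thread free for batch processing. A minimal sketch:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // "local" gives the receiver the only thread and starves the batches;
    // "local[2]" (or more) leaves a thread free to process them.
    val ssc = new StreamingContext("local[2]", "StreamingSmokeTest", Seconds(1))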
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
Hey guys, thanks for the insights. Also, I realize Hadoop has gotten way better about this with 2.2+ and I think it's great progress. We have well defined API levels in Spark and also automated checking of API violations for new pull requests. When doing code reviews we always enforce the narrowest possible visibility: 1. private 2. private[spark] 3. @Experimental or @DeveloperApi 4. public Our automated checks exclude 1-3. Anything that breaks 4 will trigger a build failure. The Scala compiler prevents anyone external from using 1 or 2. We do have bytecode public but annotated (3) API's that we might change. We spent a lot of time looking into whether these can offer compiler warnings, but we haven't found a way to do this and do not see a better alternative at this point. Regarding Scala compatibility, Scala 2.11+ is source code compatible, meaning we'll be able to cross-compile Spark for different versions of Scala. We've already been in touch with Typesafe about this and they've offered to integrate Spark into their compatibility test suite. They've also committed to patching 2.11 with a minor release if bugs are found. Anyways, my point is we've actually thought a lot about this already. The CLASSPATH thing is different than API stability, but indeed also a form of compatibility. This is something where I'd also like to see Spark have better isolation of user classes from Spark's own execution... - Patrick On Fri, May 30, 2014 at 12:30 PM, Marcelo Vanzin van...@cloudera.com wrote: On Fri, May 30, 2014 at 12:05 PM, Colin McCabe cmcc...@alumni.cmu.edu wrote: I don't know if Scala provides any mechanisms to do this beyond what Java provides. In fact it does. You can say something like private[foo] and the annotated element will be visible for all classes under foo (where foo is any package in the hierarchy leading up to the class). That's used a lot in Spark. I haven't fully looked at how the @DeveloperApi is used, but I agree with you - annotations are not a good way to do this. The Scala feature above would be much better, but it might still leak things at the Java bytecode level (don't know how Scala implements it under the cover, but I assume it's not by declaring the element as a Java private). Another thing is that in Scala the default visibility is public, which makes it very easy to inadvertently add things to the API. I'd like to see more care in making things have the proper visibility - I generally declare things private first, and relax that as needed. Using @VisibleForTesting would be great too, when the Scala private[foo] approach doesn't work. Does Spark also expose its CLASSPATH in this way to executors? I was under the impression that it did. If you're using the Spark assemblies, yes, there is a lot of things that your app gets exposed to. For example, you can see Guava and Jetty (and many other things) there. This is something that has always bugged me, but I don't really have a good suggestion of how to fix it; shading goes a certain way, but it also breaks codes that uses reflection (e.g. Class.forName()-style class loading). What is worse is that Spark doesn't even agree with the Hadoop code it depends on; e.g., Spark uses Guava 14.x while Hadoop is still in Guava 11.x. So when you run your Scala app, what gets loaded? At some point we will also have to confront the Scala version issue. Will there be flag days where Spark jobs need to be upgraded to a new, incompatible version of Scala to run on the latest Spark? 
Yes, this could be an issue - I'm not sure Scala has a policy towards this, but updates (at least minor ones, e.g. 2.9 -> 2.10) tend to break binary compatibility. Scala also makes some API updates tricky - e.g., adding a new named argument to a Scala method is not a binary compatible change (while, e.g., adding a new keyword argument in a python method is just fine). The use of implicits and other Scala features makes this even more opaque... Anyway, not really any solutions in this message, just a few comments I wanted to throw out there. :-) -- Marcelo
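For readers less familiar with Scala, the four visibility levels Patrick lists read roughly as follows in code; the class and method names are invented, and @DeveloperApi/@Experimental are the Spark annotations being discussed:

    package org.apache.spark.example

    import org.apache.spark.annotation.{DeveloperApi, Experimental}

    class StablePublicApi {                      // 4. public: covered by the 1.x guarantee
      private def helper(): Unit = ()            // 1. private: invisible to everything else
      private[spark] def internal(): Unit = ()   // 2. visible anywhere under org.apache.spark
    }

    @DeveloperApi class LowLevelExtension        // 3. public bytecode, but may change between releases
    @Experimental class TrialFeature             // 3. public bytecode, feature still being trialled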
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
Spark is a bit different than Hadoop MapReduce, so maybe that's a source of some confusion. Spark is often used as a substrate for building different types of analytics applications, so @DeveloperAPI are internal API's that we'd like to expose to application writers, but that might be more volatile. This is like the internal API's in the linux kernel, they aren't stable, but of course we try to minimize changes to them. If people want to write lower-level modules against them, that's fine with us, but they know the interfaces might change. This has worked pretty well over the years, even with many different companies writing against those API's. @Experimental are user-facing features we are trying out. Hopefully that one is more clear. In terms of making a big jar that shades all of our dependencies - I'm curious how that would actually work in practice. It would be good to explore. There are a few potential challenges I see: 1. If any of our dependencies encode class name information in IPC messages, this would break. E.g. can you definitely shade the Hadoop client, protobuf, hbase client, etc and have them send messages over the wire? This could break things if class names are ever encoded in a wire format. 2. Many libraries like logging subsystems, configuration systems, etc rely on static state and initialization. I'm not totally sure how e.g. slf4j initializes itself if you have both a shaded and non-shaded copy of slf4j present. 3. This would mean the spark-core jar would be really massive because it would inline all of our deps. We've actually been thinking of avoiding the current assembly jar approach because, due to scala specialized classes, our assemblies now have more than 65,000 class files in them leading to all kinds of bad issues. We'd have to stick with a big uber assembly-like jar if we decide to shade stuff. 4. I'm not totally sure how this would work when people want to e.g. build Spark with different Hadoop versions. Would we publish different shaded uber-jars for every Hadoop version? Would the Hadoop dep just not be shaded... if so what about all it's dependencies. Anyways just some things to consider... simplifying our classpath is definitely an avenue worth exploring! On Fri, May 30, 2014 at 2:56 PM, Colin McCabe cmcc...@alumni.cmu.edu wrote: On Fri, May 30, 2014 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote: Hey guys, thanks for the insights. Also, I realize Hadoop has gotten way better about this with 2.2+ and I think it's great progress. We have well defined API levels in Spark and also automated checking of API violations for new pull requests. When doing code reviews we always enforce the narrowest possible visibility: 1. private 2. private[spark] 3. @Experimental or @DeveloperApi 4. public Our automated checks exclude 1-3. Anything that breaks 4 will trigger a build failure. That's really excellent. Great job. I like the private[spark] visibility level-- sounds like this is another way Scala has greatly improved on Java. The Scala compiler prevents anyone external from using 1 or 2. We do have bytecode public but annotated (3) API's that we might change. We spent a lot of time looking into whether these can offer compiler warnings, but we haven't found a way to do this and do not see a better alternative at this point. It would be nice if the production build could strip this stuff out. Otherwise, it feels a lot like a @private, @unstable Hadoop API... and we know how those turned out. 
Regarding Scala compatibility, Scala 2.11+ is source code compatible, meaning we'll be able to cross-compile Spark for different versions of Scala. We've already been in touch with Typesafe about this and they've offered to integrate Spark into their compatibility test suite. They've also committed to patching 2.11 with a minor release if bugs are found. Thanks, I hadn't heard about this plan. Hopefully we can get everyone on 2.11 ASAP. Anyways, my point is we've actually thought a lot about this already. The CLASSPATH thing is different than API stability, but indeed also a form of compatibility. This is something where I'd also like to see Spark have better isolation of user classes from Spark's own execution... I think the best thing to do is just shade all the dependencies. Then they will be in a different namespace, and clients can have their own versions of whatever dependencies they like without conflicting. As Marcelo mentioned, there might be a few edge cases where this breaks reflection, but I don't think that's an issue for most libraries. So at worst case we could end up needing apps to follow us in lockstep for Kryo or maybe Akka, but not the whole kit and caboodle like with Hadoop. best, Colin - Patrick On Fri, May 30, 2014 at 12:30 PM, Marcelo Vanzin van...@cloudera.com wrote: On Fri, May 30, 2014 at 12:05 PM, Colin McCabe cmcc...@alumni.cmu.edu wrote: I don't know if Scala
Re: Unable to execute saveAsTextFile on multi node mesos
Can you look at the logs from the executor or in the UI? They should give an exception with the reason for the task failure. Also, in the future, for this type of e-mail please only e-mail the user@ list and not both lists. - Patrick On Sat, May 31, 2014 at 3:22 AM, prabeesh k prabsma...@gmail.com wrote: Hi, Scenario: read data from HDFS, apply a hive query to it, and write the result back to HDFS. Schema creation, querying and saveAsTextFile are working fine in the following modes: local mode, a mesos cluster with a single node, and a spark cluster with multiple nodes. Schema creation and querying are also working fine on a mesos multi-node cluster. But while trying to write back to HDFS using saveAsTextFile there, the following error occurs: 14/05/30 10:16:35 INFO DAGScheduler: The failed fetch was from Stage 4 (mapPartitionsWithIndex at Operator.scala:333); marking it for resubmission 14/05/30 10:16:35 INFO DAGScheduler: Executor lost: 201405291518-3644595722-5050-17933-1 (epoch 148) Let me know your thoughts regarding this. Regards, prabeesh
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
One other consideration popped into my head: 5. Shading our dependencies could mess up our external API's if we ever return types that are outside of the spark package because we'd then be returned shaded types that users have to deal with. E.g. where before we returned an o.a.flume.AvroFlumeEvent we'd have to return a some.namespace.AvroFlumeEvent. Then users downstream would have to deal with converting our types into the correct namespace if they want to inter-operate with other libraries. We generally try to avoid ever returning types from other libraries, but it would be good to audit our API's and see if we ever do this. On Fri, May 30, 2014 at 10:54 PM, Patrick Wendell pwend...@gmail.com wrote: Spark is a bit different than Hadoop MapReduce, so maybe that's a source of some confusion. Spark is often used as a substrate for building different types of analytics applications, so @DeveloperAPI are internal API's that we'd like to expose to application writers, but that might be more volatile. This is like the internal API's in the linux kernel, they aren't stable, but of course we try to minimize changes to them. If people want to write lower-level modules against them, that's fine with us, but they know the interfaces might change. This has worked pretty well over the years, even with many different companies writing against those API's. @Experimental are user-facing features we are trying out. Hopefully that one is more clear. In terms of making a big jar that shades all of our dependencies - I'm curious how that would actually work in practice. It would be good to explore. There are a few potential challenges I see: 1. If any of our dependencies encode class name information in IPC messages, this would break. E.g. can you definitely shade the Hadoop client, protobuf, hbase client, etc and have them send messages over the wire? This could break things if class names are ever encoded in a wire format. 2. Many libraries like logging subsystems, configuration systems, etc rely on static state and initialization. I'm not totally sure how e.g. slf4j initializes itself if you have both a shaded and non-shaded copy of slf4j present. 3. This would mean the spark-core jar would be really massive because it would inline all of our deps. We've actually been thinking of avoiding the current assembly jar approach because, due to scala specialized classes, our assemblies now have more than 65,000 class files in them leading to all kinds of bad issues. We'd have to stick with a big uber assembly-like jar if we decide to shade stuff. 4. I'm not totally sure how this would work when people want to e.g. build Spark with different Hadoop versions. Would we publish different shaded uber-jars for every Hadoop version? Would the Hadoop dep just not be shaded... if so what about all it's dependencies. Anyways just some things to consider... simplifying our classpath is definitely an avenue worth exploring! On Fri, May 30, 2014 at 2:56 PM, Colin McCabe cmcc...@alumni.cmu.edu wrote: On Fri, May 30, 2014 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote: Hey guys, thanks for the insights. Also, I realize Hadoop has gotten way better about this with 2.2+ and I think it's great progress. We have well defined API levels in Spark and also automated checking of API violations for new pull requests. When doing code reviews we always enforce the narrowest possible visibility: 1. private 2. private[spark] 3. @Experimental or @DeveloperApi 4. public Our automated checks exclude 1-3. 
Anything that breaks 4 will trigger a build failure. That's really excellent. Great job. I like the private[spark] visibility level-- sounds like this is another way Scala has greatly improved on Java. The Scala compiler prevents anyone external from using 1 or 2. We do have bytecode public but annotated (3) API's that we might change. We spent a lot of time looking into whether these can offer compiler warnings, but we haven't found a way to do this and do not see a better alternative at this point. It would be nice if the production build could strip this stuff out. Otherwise, it feels a lot like a @private, @unstable Hadoop API... and we know how those turned out. Regarding Scala compatibility, Scala 2.11+ is source code compatible, meaning we'll be able to cross-compile Spark for different versions of Scala. We've already been in touch with Typesafe about this and they've offered to integrate Spark into their compatibility test suite. They've also committed to patching 2.11 with a minor release if bugs are found. Thanks, I hadn't heard about this plan. Hopefully we can get everyone on 2.11 ASAP. Anyways, my point is we've actually thought a lot about this already. The CLASSPATH thing is different than API stability, but indeed also a form of compatibility. This is something where I'd also like to see Spark have better isolation of user classes
Re: SCALA_HOME or SCALA_LIBRARY_PATH not set during build
This is a false error message actually - the Maven build no longer requires SCALA_HOME, but the message/check was still there. This was fixed recently in master: https://github.com/apache/spark/commit/d8c005d5371f81a2a06c5d27c7021e1ae43d7193 I can backport that fix into branch-1.0 so it will be in 1.0.1 as well. For other people running into this, you can export SCALA_HOME to any value and it will work. - Patrick On Sat, May 31, 2014 at 8:34 PM, Colin McCabe cmcc...@alumni.cmu.edu wrote: Spark currently supports two build systems, sbt and maven. sbt will download the correct version of scala, but with Maven you need to supply it yourself and set SCALA_HOME. It sounds like the instructions need to be updated-- perhaps create a JIRA? best, Colin On Sat, May 31, 2014 at 7:06 PM, Soren Macbeth so...@yieldbot.com wrote: Hello, Following the instructions for building spark 1.0.0, I encountered the following error: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project spark-core_2.10: An Ant BuildException has occured: Please set the SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment variables and retry. [ERROR] around Ant part ...fail message=Please set the SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment variables and retry @ 6:126 in /Users/soren/src/spark-1.0.0/core/target/antrun/build-main.xml Nowhere in the documentation does it mention that scala needs to be installed, that either of these env vars needs to be set, or what version should be installed. Setting these env vars wasn't required for 0.9.1 with sbt. I was able to get past it by downloading the scala 2.10.4 binary package to a temp dir and setting SCALA_HOME to that dir. Ideally, it would be nice to not have to require people to have a standalone scala installation, but at a minimum this requirement should be documented in the build instructions, no? -Soren
Re: SCALA_HOME or SCALA_LIBRARY_PATH not set during build
I went ahead and created a JIRA for this and back ported the improvement into branch-1.0. This wasn't a regression per-se because the behavior existed in all previous versions, but it's annoying behavior so best to fix it. https://issues.apache.org/jira/browse/SPARK-1984 - Patrick On Sun, Jun 1, 2014 at 11:13 AM, Patrick Wendell pwend...@gmail.com wrote: This is a false error message actually - the Maven build no longer requires SCALA_HOME but the message/check was still there. This was fixed recently in master: https://github.com/apache/spark/commit/d8c005d5371f81a2a06c5d27c7021e1ae43d7193 I can back port that fix into branch-1.0 so it will be in 1.0.1 as well. For other people running into this, you can export SCALA_HOME to any value and it will work. - Patrick On Sat, May 31, 2014 at 8:34 PM, Colin McCabe cmcc...@alumni.cmu.edu wrote: Spark currently supports two build systems, sbt and maven. sbt will download the correct version of scala, but with Maven you need to supply it yourself and set SCALA_HOME. It sounds like the instructions need to be updated-- perhaps create a JIRA? best, Colin On Sat, May 31, 2014 at 7:06 PM, Soren Macbeth so...@yieldbot.com wrote: Hello, Following the instructions for building spark 1.0.0, I encountered the following error: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project spark-core_2.10: An Ant BuildException has occured: Please set the SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment variables and retry. [ERROR] around Ant part ...fail message=Please set the SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment variables and retry @ 6:126 in /Users/soren/src/spark-1.0.0/core/target/antrun/build-main.xml No where in the documentation does it mention that having scala installed and either of these env vars set nor what version should be installed. Setting these env vars wasn't required for 0.9.1 with sbt. I was able to get past it by downloading the scala 2.10.4 binary package to a temp dir and setting SCALA_HOME to that dir. Ideally, it would be nice to not have to require people to have a standalone scala installation but at a minimum this requirement should be documented in the build instructions no? -Soren
Re: Which version does the binary compatibility test against by default?
Yeah - check out sparkPreviousArtifact in the build: https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L325 - Patrick On Mon, Jun 2, 2014 at 5:30 PM, Xiangrui Meng men...@gmail.com wrote: Is there a way to specify the target version? -Xiangrui
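For anyone wiring the same check into their own sbt build, the underlying MiMa setting looks roughly like this; the coordinates below are illustrative, and Spark's real values sit behind the sparkPreviousArtifact helper linked above:

    import com.typesafe.tools.mima.plugin.MimaKeys.previousArtifact

    // sbt-mima-plugin: compare the current code against a previously released artifact.
    previousArtifact := Some("org.apache.spark" % "spark-core_2.10" % "0.9.0-incubating")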
Re: [VOTE] Release Apache Spark 1.0.0 (RC11)
Received! On Wed, Jun 4, 2014 at 10:47 AM, Tom Graves tgraves...@yahoo.com.invalid wrote: Testing... Resending as it appears my message didn't go through last week. Tom On Wednesday, May 28, 2014 4:12 PM, Tom Graves tgraves...@yahoo.com wrote: +1. Tested spark on yarn (cluster mode, client mode, pyspark, spark-shell) on hadoop 0.23 and 2.4. Tom On Wednesday, May 28, 2014 3:07 PM, Sean McNamara sean.mcnam...@webtrends.com wrote: Pulled down, compiled, and tested examples on OS X and ubuntu. Deployed app we are building on spark and poured data through it. +1 Sean On May 26, 2014, at 8:39 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has a few important bug fixes on top of rc10: SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853 SPARK-1870: https://github.com/apache/spark/pull/848 SPARK-1897: https://github.com/apache/spark/pull/849 The tag to be voted on is v1.0.0-rc11 (commit c69d97cd): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~tdas/spark-1.0.0-rc11/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/tdas.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1019/ The documentation corresponding to this release can be found at: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Thursday, May 29, at 16:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. Changes to ML vector specification: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10 Changes to the Java API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark Changes to the streaming API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x Changes to the GraphX API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 Other changes: coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] == Call toSeq on the result to restore old behavior
Re: [VOTE] Release Apache Spark 1.0.0 (RC11)
Hey There, The best way is to use the v1.0.0 tag: https://github.com/apache/spark/releases/tag/v1.0.0 - Patrick On Wed, Jun 4, 2014 at 12:19 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Patrick, We maintain internal Spark mirror in sync with Spark github master... What's the way to get the 1.0.0 stable release from github to deploy on our production cluster ? Is there a tag for 1.0.0 that I should use to deploy ? Thanks. Deb On Wed, Jun 4, 2014 at 10:49 AM, Patrick Wendell pwend...@gmail.com wrote: Received! On Wed, Jun 4, 2014 at 10:47 AM, Tom Graves tgraves...@yahoo.com.invalid wrote: Testing... Resending as it appears my message didn't go through last week. Tom On Wednesday, May 28, 2014 4:12 PM, Tom Graves tgraves...@yahoo.com wrote: +1. Tested spark on yarn (cluster mode, client mode, pyspark, spark-shell) on hadoop 0.23 and 2.4. Tom On Wednesday, May 28, 2014 3:07 PM, Sean McNamara sean.mcnam...@webtrends.com wrote: Pulled down, compiled, and tested examples on OS X and ubuntu. Deployed app we are building on spark and poured data through it. +1 Sean On May 26, 2014, at 8:39 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has a few important bug fixes on top of rc10: SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853 SPARK-1870: https://github.com/apache/spark/pull/848 SPARK-1897: https://github.com/apache/spark/pull/849 The tag to be voted on is v1.0.0-rc11 (commit c69d97cd): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~tdas/spark-1.0.0-rc11/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/tdas.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1019/ The documentation corresponding to this release can be found at: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Thursday, May 29, at 16:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. Changes to ML vector specification: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10 Changes to the Java API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark Changes to the streaming API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x Changes to the GraphX API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 Other changes: coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] == Call toSeq on the result to restore old behavior
Re: Announcing Spark 1.0.0
Hey Rahul, The v1.0.0 tag is correct. When we release Spark we create multiple candidates. One of the candidates is promoted to the full release. So rc11 is also the same as the official v1.0.0 release. - Patrick On Wed, Jun 4, 2014 at 8:29 PM, Rahul Singhal rahul.sing...@guavus.com wrote: Could someone please clarify my confusion or is this not an issue that we should be concerned about? Thanks, Rahul Singhal On 30/05/14 5:28 PM, Rahul Singhal rahul.sing...@guavus.com wrote: Is it intentional/ok that the tag v1.0.0 is behind tag v1.0.0-rc11? Thanks, Rahul Singhal On 30/05/14 3:43 PM, Patrick Wendell pwend...@gmail.com wrote: I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 is a milestone release as the first in the 1.0 line of releases, providing API stability for Spark's core interfaces. Spark 1.0.0 is Spark's largest release ever, with contributions from 117 developers. I'd like to thank everyone involved in this release - it was truly a community effort with fixes, features, and optimizations contributed from dozens of organizations. This release expands Spark's standard libraries, introducing a new SQL package (SparkSQL) which lets users integrate SQL queries into existing Spark workflows. MLlib, Spark's machine learning library, is expanded with sparse vector support and several new algorithms. The GraphX and Streaming libraries also introduce new features and optimizations. Spark's core engine adds support for secured YARN clusters, a unified tool for submitting Spark applications, and several performance and stability improvements. Finally, Spark adds support for Java 8 lambda syntax and improves coverage of the Java and Python API's. Those features only scratch the surface - check out the release notes here: http://spark.apache.org/releases/spark-release-1-0-0.html Note that since release artifacts were posted recently, certain mirrors may not have working downloads for a few hours. - Patrick
MIMA Compatibility Checks
Hey All, Some people may have noticed PR failures due to binary compatibility checks. We've had these enabled in several of the sub-modules since the 0.9.0 release, but we've turned them on for Spark core post 1.0.0, which has much higher churn. The checks are based on the migration manager tool from Typesafe. One issue is that the tool doesn't support package-private declarations of classes or methods. Prashant Sharma has built instrumentation that adds partial support for package-privacy (via a workaround), but since there isn't really native support for this in MIMA we are still finding cases in which we trigger false positives. In the next week or two we'll make it a priority to handle more of these false-positive cases. In the meantime, users can add manual excludes to project/MimaExcludes.scala to avoid triggering warnings for certain issues. This is definitely annoying - sorry about that. Unfortunately we are the first open source Scala project to ever do this, so we are dealing with uncharted territory. Longer term I'd actually like to see us just write our own sbt-based tool to do this in a better way (we've had trouble trying to extend MIMA itself; e.g., it has copy-pasted code in it from an old version of the scala compiler). If someone in the community is a Scala fan and wants to take that on, I'm happy to give more details. - Patrick
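For anyone hitting one of these false positives, an exclude entry in project/MimaExcludes.scala looks roughly like this; the class and method names below are placeholders rather than real Spark API's:

    import com.typesafe.tools.mima.core._

    // Suppress a spurious report about a package-private member MiMa cannot see as private.
    ProblemFilters.exclude[MissingMethodProblem](
      "org.apache.spark.somepackage.SomeClass.someMethod")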
Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0
Paul, Could you give the version of Java that you are building with and the version of Java you are running with? Are they the same? Just off the cuff, I wonder if this is related to: https://issues.apache.org/jira/browse/SPARK-1520 If it is, it could appear that certain functions are not in the jar because they go beyond the extended zip boundary `jar tvf` won't list them. - Patrick On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown p...@mult.ifario.us wrote: Moving over to the dev list, as this isn't a user-scope issue. I just ran into this issue with the missing saveAsTestFile, and here's a little additional information: - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases. - Driver built as an uberjar via Maven. - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with Spark 1.0.0-hadoop1 downloaded from Apache. Given that it functions correctly in local mode but not in a standalone cluster, this suggests to me that the issue is in a difference between the Maven version and the hadoop1 version. In the spirit of taking the computer at its word, we can just have a look in the JAR files. Here's what's in the Maven dep as of 1.0.0: jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class And here's what's in the hadoop1 distribution: jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' I.e., it's not there. It is in the hadoop2 distribution: jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class So something's clearly broken with the way that the distribution assemblies are created. FWIW and IMHO, the right way to publish the hadoop1 and hadoop2 flavors of Spark to Maven Central would be as *entirely different* artifacts (spark-core-h1, spark-core-h2). Logged as SPARK-2075 https://issues.apache.org/jira/browse/SPARK-2075. Cheers. -- Paul -- p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Fri, Jun 6, 2014 at 2:45 AM, HenriV henri.vanh...@vdab.be wrote: I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0. Im using google compute engine and cloud storage. but saveAsTextFile is returning errors while saving in the cloud or saving local. When i start a job in the cluster i do get an error but after this error it keeps on running fine untill the saveAsTextFile. 
( I don't know if the two are connected) ---Error at job startup--- ERROR metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136) at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130) at org.apache.spark.metrics.MetricsSystem.init(MetricsSystem.scala:84) at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230) at org.apache.spark.SparkContext.init(SparkContext.scala:202) at Hello$.main(Hello.scala:101) at Hello.main(Hello.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at sbt.Run.invokeMain(Run.scala:72) at sbt.Run.run0(Run.scala:65) at
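A quick way to check from a shell on the affected cluster whether the closure classes discussed in the jar listings above actually made it into the deployed assembly (the class name is taken from those listings):

    try {
      Class.forName("org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1")
      println("saveAsTextFile closure class found on the classpath")
    } catch {
      case _: ClassNotFoundException => println("closure class missing from the assembly")
    }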
Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0
Also I should add - thanks for taking time to help narrow this down! On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell pwend...@gmail.com wrote: Paul, Could you give the version of Java that you are building with and the version of Java you are running with? Are they the same? Just off the cuff, I wonder if this is related to: https://issues.apache.org/jira/browse/SPARK-1520 If it is, it could appear that certain functions are not in the jar because they go beyond the extended zip boundary `jar tvf` won't list them. - Patrick On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown p...@mult.ifario.us wrote: Moving over to the dev list, as this isn't a user-scope issue. I just ran into this issue with the missing saveAsTestFile, and here's a little additional information: - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases. - Driver built as an uberjar via Maven. - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with Spark 1.0.0-hadoop1 downloaded from Apache. Given that it functions correctly in local mode but not in a standalone cluster, this suggests to me that the issue is in a difference between the Maven version and the hadoop1 version. In the spirit of taking the computer at its word, we can just have a look in the JAR files. Here's what's in the Maven dep as of 1.0.0: jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class And here's what's in the hadoop1 distribution: jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' I.e., it's not there. It is in the hadoop2 distribution: jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class So something's clearly broken with the way that the distribution assemblies are created. FWIW and IMHO, the right way to publish the hadoop1 and hadoop2 flavors of Spark to Maven Central would be as *entirely different* artifacts (spark-core-h1, spark-core-h2). Logged as SPARK-2075 https://issues.apache.org/jira/browse/SPARK-2075. Cheers. -- Paul -- p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Fri, Jun 6, 2014 at 2:45 AM, HenriV henri.vanh...@vdab.be wrote: I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0. Im using google compute engine and cloud storage. but saveAsTextFile is returning errors while saving in the cloud or saving local. When i start a job in the cluster i do get an error but after this error it keeps on running fine untill the saveAsTextFile. 
( I don't know if the two are connected) ---Error at job startup--- ERROR metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136) at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130) at org.apache.spark.metrics.MetricsSystem.init(MetricsSystem.scala:84) at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230) at org.apache.spark.SparkContext.init(SparkContext.scala:202) at Hello$.main(Hello.scala:101) at Hello.main(Hello.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43
Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0
Okay I think I've isolated this a bit more. Let's discuss over on the JIRA: https://issues.apache.org/jira/browse/SPARK-2075 On Sun, Jun 8, 2014 at 1:16 PM, Paul Brown p...@mult.ifario.us wrote: Hi, Patrick -- Java 7 on the development machines: » java -version 1 ↵ java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) And on the deployed boxes: $ java -version java version 1.7.0_55 OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1) OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode) Also, unzip -l in place of jar tvf gives the same results, so I don't think it's an issue with jar not reporting the files. Also, the classes do get correctly packaged into the uberjar: unzip -l /target/[deleted]-driver.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 06-08-14 12:05 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 06-08-14 12:05 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class Best. -- Paul — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell pwend...@gmail.com wrote: Paul, Could you give the version of Java that you are building with and the version of Java you are running with? Are they the same? Just off the cuff, I wonder if this is related to: https://issues.apache.org/jira/browse/SPARK-1520 If it is, it could appear that certain functions are not in the jar because they go beyond the extended zip boundary `jar tvf` won't list them. - Patrick On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown p...@mult.ifario.us wrote: Moving over to the dev list, as this isn't a user-scope issue. I just ran into this issue with the missing saveAsTestFile, and here's a little additional information: - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases. - Driver built as an uberjar via Maven. - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with Spark 1.0.0-hadoop1 downloaded from Apache. Given that it functions correctly in local mode but not in a standalone cluster, this suggests to me that the issue is in a difference between the Maven version and the hadoop1 version. In the spirit of taking the computer at its word, we can just have a look in the JAR files. Here's what's in the Maven dep as of 1.0.0: jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class And here's what's in the hadoop1 distribution: jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' I.e., it's not there. It is in the hadoop2 distribution: jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class So something's clearly broken with the way that the distribution assemblies are created. FWIW and IMHO, the right way to publish the hadoop1 and hadoop2 flavors of Spark to Maven Central would be as *entirely different* artifacts (spark-core-h1, spark-core-h2). Logged as SPARK-2075 https://issues.apache.org/jira/browse/SPARK-2075. Cheers. -- Paul -- p...@mult.ifario.us | Multifarious, Inc. 
| http://mult.ifario.us/ On Fri, Jun 6, 2014 at 2:45 AM, HenriV henri.vanh...@vdab.be wrote: I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0. Im using google compute engine and cloud storage. but saveAsTextFile is returning errors while saving in the cloud or saving local. When i start a job in the cluster i do get an error but after this error it keeps on running fine untill the saveAsTextFile. ( I don't know if the two are connected) ---Error at job startup--- ERROR metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136) at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130
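A quick way to double-check the same thing from inside a running driver, instead of listing jars by hand, is to ask the classloader for one of the saveAsTextFile closure classes. This is only an illustrative sketch (the anonfun class name follows Scala 2.10's naming and may differ slightly from the jar listings above):

  object AssemblyCheck {
    def main(args: Array[String]): Unit = {
      // Resource path for one of the closures the compiler generates for RDD.saveAsTextFile
      val resource = "org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$1.class"
      val found = Option(getClass.getClassLoader.getResource(resource))
      // Prints which jar (or directory) the class came from, or a clear "not found"
      println(found.map(url => resource + " found in " + url)
                   .getOrElse(resource + " NOT found on classpath"))
    }
  }

Running this with the assembly on the classpath makes it obvious whether the missing-method errors come from the jar contents themselves or from something else such as classpath ordering.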
Emergency maintenance on jenkins
Just a heads up - due to an outage at UCB we've lost several of the Jenkins slaves. I'm trying to spin up new slaves on EC2 in order to compensate, but this might fail some ongoing builds. The good news is if we do get it working with EC2 workers, then we will have burst capability in the future - e.g. on release deadlines. So it's not all bad! - Patrick
Re: Emergency maintenance on jenkins
No luck with this tonight - unfortunately our Python tests aren't working well with Python 2.6, and some other issues made it hard to get the EC2 worker up to speed. Hopefully we can have this up and running tomorrow. - Patrick On Mon, Jun 9, 2014 at 10:17 PM, Patrick Wendell pwend...@gmail.com wrote: Just a heads up - due to an outage at UCB we've lost several of the Jenkins slaves. I'm trying to spin up new slaves on EC2 in order to compensate, but this might fail some ongoing builds. The good news is if we do get it working with EC2 workers, then we will have burst capability in the future - e.g. on release deadlines. So it's not all bad! - Patrick
Re: Emergency maintenance on jenkins
Hey just to update people - as of around 1pm PT we were back up and running with Jenkins slaves on EC2. Sorry about the disruption. - Patrick On Tue, Jun 10, 2014 at 1:15 AM, Patrick Wendell pwend...@gmail.com wrote: No luck with this tonight - unfortunately our Python tests aren't working well with Python 2.6, and some other issues made it hard to get the EC2 worker up to speed. Hopefully we can have this up and running tomorrow. - Patrick On Mon, Jun 9, 2014 at 10:17 PM, Patrick Wendell pwend...@gmail.com wrote: Just a heads up - due to an outage at UCB we've lost several of the Jenkins slaves. I'm trying to spin up new slaves on EC2 in order to compensate, but this might fail some ongoing builds. The good news is if we do get it working with EC2 workers, then we will have burst capability in the future - e.g. on release deadlines. So it's not all bad! - Patrick
Re: Java IO Stream Corrupted - Invalid Type AC?
Out of curiosity - are you guys using speculation, shuffle consolidation, or any other non-default option? If so that would help narrow down what's causing this corruption. On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Matt/Ryan, Did you make any headway on this? My team is running into this also. Doesn't happen on smaller datasets. Our input set is about 10 GB but we generate 100s of GBs in the flow itself. -Suren On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton compton.r...@gmail.com wrote: Just ran into this today myself. I'm on branch-1.0 using a CDH3 cluster (no modifications to Spark or its dependencies). The error appeared trying to run GraphX's .connectedComponents() on a ~200GB edge list (GraphX worked beautifully on smaller data). Here's the stacktrace (it's quite similar to yours https://imgur.com/7iBA4nJ ). 14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed 4 times; aborting job 14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at VertexRDD.scala:100 Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 5.599:39 failed 4 times, most recent failure: Exception failure in TID 29735 on host node18: java.io.StreamCorruptedException: invalid type code: AC java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355) java.io.ObjectInputStream.readObject(ObjectInputStream.java:350) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192) org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78) org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75) org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at
Re: Java IO Stream Corrupted - Invalid Type AC?
Just wondering, do you get this particular exception if you are not consolidating shuffle data? On Wed, Jun 18, 2014 at 12:15 PM, Mridul Muralidharan mri...@gmail.com wrote: On Wed, Jun 18, 2014 at 6:19 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: Patrick, My team is using shuffle consolidation but not speculation. We are also using persist(DISK_ONLY) for caching. Use of shuffle consolidation is probably what is causing the issue. It would be a good idea to try again with that turned off (which is the default). It should most likely get fixed in the 1.1 timeframe. Regards, Mridul Here are some config changes that are in our work-in-progress. We've been trying for 2 weeks to get our production flow (maybe around 50-70 stages, a few forks and joins with up to 20 branches in the forks) to run end to end without any success, running into other problems besides this one as well. For example, we have run into situations where saving to HDFS just hangs on a couple of tasks, which are printing out nothing in their logs and not taking any CPU. For testing, our input data is 10 GB across 320 input splits and generates maybe around 200-300 GB of intermediate and final data. conf.set("spark.executor.memory", "14g") // TODO make this configurable // shuffle configs conf.set("spark.default.parallelism", "320") // TODO make this configurable conf.set("spark.shuffle.consolidateFiles", "true") conf.set("spark.shuffle.file.buffer.kb", "200") conf.set("spark.reducer.maxMbInFlight", "96") conf.set("spark.rdd.compress", "true") // we ran into a problem with the default timeout of 60 seconds // this is also being set in the master's spark-env.sh. Not sure if it needs to be in both places conf.set("spark.worker.timeout", "180") // akka settings conf.set("spark.akka.threads", "300") conf.set("spark.akka.timeout", "180") conf.set("spark.akka.frameSize", "100") conf.set("spark.akka.batchSize", "30") conf.set("spark.akka.askTimeout", "30") // block manager conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18") conf.set("spark.blockManagerHeartBeatMs", "8") -Suren On Wed, Jun 18, 2014 at 1:42 AM, Patrick Wendell pwend...@gmail.com wrote: Out of curiosity - are you guys using speculation, shuffle consolidation, or any other non-default option? If so that would help narrow down what's causing this corruption. On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Matt/Ryan, Did you make any headway on this? My team is running into this also. Doesn't happen on smaller datasets. Our input set is about 10 GB but we generate 100s of GBs in the flow itself. -Suren On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton compton.r...@gmail.com wrote: Just ran into this today myself. I'm on branch-1.0 using a CDH3 cluster (no modifications to Spark or its dependencies). The error appeared trying to run GraphX's .connectedComponents() on a ~200GB edge list (GraphX worked beautifully on smaller data). Here's the stacktrace (it's quite similar to yours https://imgur.com/7iBA4nJ ).
14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed 4 times; aborting job 14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at VertexRDD.scala:100 Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 5.599:39 failed 4 times, most recent failure: Exception failure in TID 29735 on host node18: java.io.StreamCorruptedException: invalid type code: AC java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355) java.io.ObjectInputStream.readObject(ObjectInputStream.java:350) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192) org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78) org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75) org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73
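For anyone else hitting this, Mridul's suggestion above is a one-line change; shuffle consolidation is off by default in this line of releases, so removing the setting has the same effect. A minimal sketch using the property name from Suren's snippet:

  import org.apache.spark.{SparkConf, SparkContext}

  // Explicitly disable shuffle file consolidation to test whether it is the source of the corruption
  val conf = new SparkConf()
    .setAppName("consolidation-off-test")
    .set("spark.shuffle.consolidateFiles", "false")
  val sc = new SparkContext(conf)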
Re: Scala examples for Spark do not work as written in documentation
Those are pretty old - but I think the reason Matei did that was to make it less confusing for brand new users. `spark` is actually a valid identifier because it's just a variable name (val spark = new SparkContext()) but I agree this could be confusing for users who want to drop into the shell. On Fri, Jun 20, 2014 at 12:04 PM, Will Benton wi...@redhat.com wrote: Hey, sorry to reanimate this thread, but just a quick question: why do the examples (on http://spark.apache.org/examples.html) use spark for the SparkContext reference? This is minor, but it seems like it could be a little confusing for people who want to run them in the shell and need to change spark to sc. (I noticed because this was a speedbump for a colleague who is trying out Spark.) thanks, wb - Original Message - From: Andy Konwinski andykonwin...@gmail.com To: dev@spark.apache.org Sent: Tuesday, May 20, 2014 4:06:33 PM Subject: Re: Scala examples for Spark do not work as written in documentation I fixed the bug, but I kept the parameter i instead of _ since that (1) keeps it more parallel to the python and java versions which also use functions with a named variable and (2) doesn't require readers to know this particular use of the _ syntax in Scala. Thanks for catching this Glenn. Andy On Fri, May 16, 2014 at 12:38 PM, Mark Hamstra m...@clearstorydata.com wrote: Sorry, looks like an extra line got inserted in there. One more try: val count = spark.parallelize(1 to NUM_SAMPLES).map { _ => val x = Math.random() val y = Math.random() if (x*x + y*y < 1) 1 else 0 }.reduce(_ + _) On Fri, May 16, 2014 at 12:36 PM, Mark Hamstra m...@clearstorydata.com wrote: Actually, the better way to write the multi-line closure would be: val count = spark.parallelize(1 to NUM_SAMPLES).map { _ => val x = Math.random() val y = Math.random() if (x*x + y*y < 1) 1 else 0 }.reduce(_ + _) On Fri, May 16, 2014 at 9:41 AM, GlennStrycker glenn.stryc...@gmail.com wrote: On the webpage http://spark.apache.org/examples.html, there is an example written as val count = spark.parallelize(1 to NUM_SAMPLES).map(i => val x = Math.random() val y = Math.random() if (x*x + y*y < 1) 1 else 0 ).reduce(_ + _) println("Pi is roughly " + 4.0 * count / NUM_SAMPLES) This does not execute in Spark, which gives me an error: <console>:2: error: illegal start of simple expression val x = Math.random() ^ If I rewrite the query slightly, adding in {}, it works: val count = spark.parallelize(1 to 1).map(i => { val x = Math.random() val y = Math.random() if (x*x + y*y < 1) 1 else 0 } ).reduce(_ + _) println("Pi is roughly " + 4.0 * count / 1.0) -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Scala-examples-for-Spark-do-not-work-as-written-in-documentation-tp6593.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
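For reference, the corrected multi-line closure in full - a self-contained sketch assuming a SparkContext bound to spark (in the shell it is sc):

  val NUM_SAMPLES = 100000
  val count = spark.parallelize(1 to NUM_SAMPLES).map { _ =>
    val x = Math.random()
    val y = Math.random()
    if (x*x + y*y < 1) 1 else 0
  }.reduce(_ + _)
  println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)

Wrapping the body in braces (or using the { _ => ... } form directly) is what makes the multi-statement closure parse; the one-liner on the examples page fails because the parenthesized form only admits a single expression after the arrow.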
Assorted project updates (tests, build, etc)
Hey All, 1. The original test infrastructure hosted by the AMPLab has been fully restored and also expanded with many more executor slots for tests. Thanks to Matt Massie at the Amplab for helping with this. 2. We now have a nightly build matrix across different Hadoop versions. It appears that the Maven build is failing tests with some of the newer Hadoop versions. If people from the community are interested, diagnosing and fixing test issues would be welcome patches (they are all dependency related). https://issues.apache.org/jira/browse/SPARK-2232 3. Prashant Sharma has spent a lot of time to make it possible for our sbt build to read dependencies from Maven. This will save us a huge amount of headache keeping the builds consistent. I just wanted to give a heads up to users about this - we should retain compatibility with features of the sbt build, but if you are e.g. hooking into deep internals of our build it may affect you. I'm hoping this can be updated and merged in the next week: https://github.com/apache/spark/pull/77 4. We've moved most of the documentation over to recommending users build with Maven when creating official packages. This is just to provide a single reference build of Spark since it's the one we test and package for releases, we make sure all recursive dependencies are correct, etc. I'd recommend that all downstream packagers use this build. For day-to-day development I imagine sbt will remain more popular (repl, incremental builds, etc). Prashant's work allows us to get the best of both worlds which is great. - Patrick
[VOTE] Release Apache Spark 1.0.1 (RC1)
Please vote on releasing the following candidate as Apache Spark version 1.0.1! The tag to be voted on is v1.0.1-rc1 (commit 7feeda3): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7feeda3d729f9397aa15ee8750c01ef5aa601962 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.1-rc1/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1020/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.1-rc1-docs/ Please vote on releasing this package as Apache Spark 1.0.1! The vote is open until Monday, June 30, at 03:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ === About this release === This release fixes a few high-priority bugs in 1.0 and has a variety of smaller fixes. The full list is here: http://s.apache.org/b45. Some of the more visible patches are: SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size. SPARK-1790: Support r3 instance types on EC2. This is the first maintenance release on the 1.0 line. We plan to make additional maintenance releases as new fixes come in. - Patrick
Re: Errors from Sbt Test
Do those also happen if you run other hadoop versions (e.g. try 1.0.4)? On Tue, Jul 1, 2014 at 1:00 AM, Taka Shinagawa taka.epsi...@gmail.com wrote: Since Spark 1.0.0, I've been seeing multiple errors when running sbt test. I ran the following commands from Spark 1.0.1 RC1 on Mac OSX 10.9.2. $ sbt/sbt clean $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly $ sbt/sbt test I'm attaching the log file generated by the sbt test. Here's the summary part of the test. [info] Run completed in 30 minutes, 57 seconds. [info] Total number of tests run: 605 [info] Suites: completed 83, aborted 0 [info] Tests: succeeded 600, failed 5, canceled 0, ignored 5, pending 0 [info] *** 5 TESTS FAILED *** [error] Failed: Total 653, Failed 5, Errors 0, Passed 648, Ignored 5 [error] Failed tests: [error] org.apache.spark.ShuffleNettySuite [error] org.apache.spark.ShuffleSuite [error] org.apache.spark.FileServerSuite [error] org.apache.spark.DistributedSuite [error] (core/test:test) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 2033 s, completed Jul 1, 2014 12:08:03 AM Is anyone else seeing errors like this? Thanks, Taka
Re: Eliminate copy while sending data : any Akka experts here ?
Yeah I created a JIRA a while back to piggy-back the map status info on top of the task (I honestly think it will be a small change). There isn't a good reason to broadcast the entire array and it can be an issue during large shuffles. - Patrick On Mon, Jun 30, 2014 at 7:58 PM, Aaron Davidson ilike...@gmail.com wrote: I don't know of any way to avoid Akka doing a copy, but I would like to mention that it's on the priority list to piggy-back only the map statuses relevant to a particular map task on the task itself, thus reducing the total amount of data sent over the wire by a factor of N for N physical machines in your cluster. Ideally we would also avoid Akka entirely when sending the tasks, as these can get somewhat large and Akka doesn't work well with large messages. Do note that your solution of using broadcast to send the map tasks is very similar to how the executor returns the result of a task when it's too big for akka. We were thinking of refactoring this too, as using the block manager has much higher latency than a direct TCP send. On Mon, Jun 30, 2014 at 12:13 PM, Mridul Muralidharan mri...@gmail.com wrote: Our current hack is to use Broadcast variables when serialized statuses are above some (configurable) size : and have the workers directly pull them from master. This is a workaround : so would be great if there was a better/principled solution. Please note that the responses are going to different workers requesting for the output statuses for shuffle (after map) - so not sure if back pressure buffers, etc would help. Regards, Mridul On Mon, Jun 30, 2014 at 11:07 PM, Mridul Muralidharan mri...@gmail.com wrote: Hi, While sending map output tracker result, the same serialized byte array is sent multiple times - but the akka implementation copies it to a private byte array within ByteString for each send. Caching a ByteString instead of Array[Byte] did not help, since akka does not support special casing ByteString : serializes the ByteString, and copies the result out to an array before creating ByteString out of it (in Array[Byte] serializing is thankfully simply returning same array - so one copy only). Given the need to send immutable data a large number of times, is there any way to do it in akka without copying internally in akka ? To see how expensive it is, for 200 nodes with a large number of mappers and reducers, the status becomes something like 30 MB for us - and pulling this about 200 to 300 times results in OOM due to the large number of copies sent out. Thanks, Mridul
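Mridul's workaround (switch to a broadcast variable once the serialized statuses cross a size threshold, and let workers pull the bytes themselves) can be sketched roughly as below; the method name and threshold are illustrative, not the actual MapOutputTracker internals:

  import org.apache.spark.SparkContext
  import org.apache.spark.broadcast.Broadcast

  // Small payloads are returned inline; large ones are published once as a broadcast
  // and only the handle is sent, so each worker fetches the bytes itself
  def publishStatuses(sc: SparkContext, serialized: Array[Byte], thresholdBytes: Int = 1 << 20)
      : Either[Array[Byte], Broadcast[Array[Byte]]] = {
    if (serialized.length <= thresholdBytes) Left(serialized)
    else Right(sc.broadcast(serialized))
  }

This avoids re-serializing and re-copying the same multi-megabyte array for every requesting worker, at the cost of an extra fetch hop through the broadcast mechanism.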
Re: Eliminate copy while sending data : any Akka experts here ?
b) Instead of pulling this information, push it to executors as part of task submission. (What Patrick mentioned ?) (1) a.1 from above is still an issue for this. I don't understand what problem a.1 is. In this case, we don't need to do caching, right? (2) Serialized task size is also a concern : we have already seen users hitting akka limits for task size - this will be an additional vector which might exacerbate it. This would add only a small, constant amount of data to the task. It's strictly better than before. Before, if the map output status array was size M x R, we sent a single akka message to every node of size M x R... this basically scales quadratically with the size of the RDD. The new approach is constant... it's much better. And the total amount of data sent over the wire is likely much less. - Patrick
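To put rough numbers on that, using the figures from earlier in the thread: a ~30 MB serialized status array pulled 200-300 times is on the order of 30 MB x 250 ≈ 7.5 GB served out of the driver for a single shuffle. Piggy-backing each task's own slice of the statuses sends roughly one copy of the array in total, spread across all the tasks - the factor-of-N reduction Aaron describes above - instead of one full copy per requesting worker.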
[RESULT] [VOTE] Release Apache Spark 1.0.1 (RC1)
This vote is cancelled in favor of RC2. Thanks to everyone who voted. On Sun, Jun 29, 2014 at 11:23 PM, Andrew Ash and...@andrewash.com wrote: Ok that's reasonable -- it's certainly more of an enhancement than a critical bug-fix. I would like to get this in for 1.1.0 though, so let's talk through the right way to do that on the PR. In the meantime the best alternative is running with lax firewall settings, which can be somewhat mitigated by modifying the ephemeral port range. Thanks! Andrew On Sun, Jun 29, 2014 at 11:14 PM, Reynold Xin r...@databricks.com wrote: Hi Andrew, The port stuff is great to have, but they are pretty big changes to the core that are introducing new features and are not exactly fixing important bugs. For this reason, it probably can't block a release (I'm not even sure if it should go into a maintenance release where we fix critical bugs for Spark core). We should definitely include them for 1.1.0 though (~Aug). On Sun, Jun 29, 2014 at 11:09 PM, Andrew Ash and...@andrewash.com wrote: Thanks for helping shepherd the voting on 1.0.1 Patrick. I'd like to call attention to https://issues.apache.org/jira/browse/SPARK-2157 and https://github.com/apache/spark/pull/1107 -- Ability to write tight firewall rules for Spark I'm currently unable to run Spark on some projects because our cloud ops team is uncomfortable with the firewall situation around Spark at the moment. Currently Spark starts listening on random ephemeral ports and does server to server communication on them. This keeps the team from writing tight firewall rules between the services -- they get real queasy when asked to open inbound connections to the entire ephemeral port range of a cluster. We can tighten the size of the ephemeral range using kernel settings to mitigate the issue, but it doesn't actually solve the problem. The PR above aims to make every listening port on JVMs in a Spark standalone cluster configurable with an option. If not set, the current behavior stands (start listening on an ephemeral port). Is this something the Spark team would consider merging into 1.0.1? Thanks! Andrew On Sun, Jun 29, 2014 at 10:54 PM, Patrick Wendell pwend...@gmail.com wrote: Hey All, We're going to move onto another rc because of this vote. Unfortunately with the summit activities I haven't been able to usher in the necessary patches and cut the RC. I will do so as soon as possible and we can commence official voting. - Patrick On Sun, Jun 29, 2014 at 4:56 PM, Reynold Xin r...@databricks.com wrote: We should make sure we include the following two patches: https://github.com/apache/spark/pull/1264 https://github.com/apache/spark/pull/1263 On Fri, Jun 27, 2014 at 8:39 PM, Krishna Sankar ksanka...@gmail.com wrote: +1 Compiled for CentOS 6.5, deployed in our 4 node cluster (Hadoop 2.2, YARN) Smoke Tests (sparkPi,spark-shell, web UI) successful Cheers k/ On Thu, Jun 26, 2014 at 7:06 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.1! The tag to be voted on is v1.0.1-rc1 (commit 7feeda3): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7feeda3d729f9397aa15ee8750c01ef5aa601962 The release files, including signatures, digests, etc. 
can be found at: http://people.apache.org/~pwendell/spark-1.0.1-rc1/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1020/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.1-rc1-docs/ Please vote on releasing this package as Apache Spark 1.0.1! The vote is open until Monday, June 30, at 03:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ === About this release === This release fixes a few high-priority bugs in 1.0 and has a variety of smaller fixes. The full list is here: http://s.apache.org/b45. Some of the more visible patches are: SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size. SPARK-1790: Support r3 instance types on EC2. This is the first maintenance release on the 1.0 line. We plan to make additional maintenance releases as new fixes come
[VOTE] Release Apache Spark 1.0.1 (RC2)
Please vote on releasing the following candidate as Apache Spark version 1.0.1! The tag to be voted on is v1.0.1-rc1 (commit 7d1043c): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.1-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1021/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/ Please vote on releasing this package as Apache Spark 1.0.1! The vote is open until Monday, July 07, at 20:45 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ === Differences from RC1 === This release includes only one blocking patch from rc1: https://github.com/apache/spark/pull/1255 There are also smaller fixes which came in over the last week. === About this release === This release fixes a few high-priority bugs in 1.0 and has a variety of smaller fixes. The full list is here: http://s.apache.org/b45. Some of the more visible patches are: SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size. SPARK-1790: Support r3 instance types on EC2. This is the first maintenance release on the 1.0 line. We plan to make additional maintenance releases as new fixes come in.
Testing period for better jenkins integration
Just a heads up - I've added some better Jenkins integration that posts more useful messages on pull requests. We'll run this side-by-side with the current Jenkins messages for a while to make sure it's working well. Things may be a bit chatty while we are testing this - we can migrate over as soon as we feel it's stable. - Patrick
Changes to sbt build have been merged
Just a heads up, we merged Prashant's work on having the sbt build read all dependencies from Maven. Please report any issues you find on the dev list or on JIRA. One note here for developers, going forward the sbt build will use the same configuration style as the maven build (-D for options and -P for maven profiles). So this will be a change for developers: sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly For now, we'll continue to support the old env-var options with a deprecation warning. - Patrick
Re: what is the difference between org.spark-project.hive and org.apache.hadoop.hive
There are two differences: 1. We publish hive with a shaded protobuf dependency to avoid conflicts with some Hadoop versions. 2. We publish a proper hive-exec jar that only includes hive packages. The upstream version of hive-exec bundles a bunch of other random dependencies in it which makes it really hard for third-party projects to use it.
Re: [VOTE] Release Apache Spark 1.0.1 (RC2)
Hey Gary, Why do you think the akka frame size changed? It didn't change - we added some fixes for cases where users were setting non-default values. On Fri, Jul 11, 2014 at 9:31 AM, Gary Malouf malouf.g...@gmail.com wrote: Hi Matei, We have not had time to re-deploy the rc today, but one thing that jumps out is the shrinking of the default akka frame size from 10MB to around 128KB by default. That is my first suspicion for our issue - could imagine that biting others as well. I'll try to re-test that today - either way, understand moving forward at this point. Gary On Fri, Jul 11, 2014 at 12:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Unless you can diagnose the problem quickly, Gary, I think we need to go ahead with this release as is. This release didn't touch the Mesos support as far as I know, so the problem might be a nondeterministic issue with your application. But on the other hand the release does fix some critical bugs that affect all users. We can always do 1.0.2 later if we discover a problem. Matei On Jul 10, 2014, at 9:40 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Gary, The vote technically doesn't close until I send the vote summary e-mail, but I was planning to close and package this tonight. It's too bad if there is a regression, it might be worth holding the release but it really requires narrowing down the issue to get more information about the scope and severity. Could you fork another thread for this? - Patrick On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf malouf.g...@gmail.com wrote: -1 I honestly do not know the voting rules for the Spark community, so please excuse me if I am out of line or if Mesos compatibility is not a concern at this point. We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos 0.18.2. All of our jobs with data above a few gigabytes hung indefinitely. Downgrading back to the 1.0.0 stable release of Spark built the same way worked for us. On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves tgraves...@yahoo.com.invalid wrote: +1. Ran some Spark on yarn jobs on a hadoop 2.4 cluster with authentication on. Tom On Friday, July 4, 2014 2:39 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.1! The tag to be voted on is v1.0.1-rc1 (commit 7d1043c): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.1-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1021/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/ Please vote on releasing this package as Apache Spark 1.0.1! The vote is open until Monday, July 07, at 20:45 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ === Differences from RC1 === This release includes only one blocking patch from rc1: https://github.com/apache/spark/pull/1255 There are also smaller fixes which came in over the last week. === About this release === This release fixes a few high-priority bugs in 1.0 and has a variety of smaller fixes. 
The full list is here: http://s.apache.org/b45. Some of the more visible patches are: SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size. SPARK-1790: Support r3 instance types on EC2. This is the first maintenance release on the 1.0 line. We plan to make additional maintenance releases as new fixes come in.
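For anyone who wants to rule the frame size out while testing a candidate, it can be pinned explicitly rather than left at the default. A sketch, with the value in MB and 10 chosen purely as an example:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("frame-size-check")
    .set("spark.akka.frameSize", "10") // MB; pins the Akka frame size instead of relying on the default
  val sc = new SparkContext(conf)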
Re: [VOTE] Release Apache Spark 1.0.1 (RC2)
Okay just FYI - I'm closing this vote since many people are waiting on the release and I was hoping to package it today. If we find a reproducible Mesos issue here, we can definitely spin the fix into a subsequent release. On Fri, Jul 11, 2014 at 9:37 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Gary, Why do you think the akka frame size changed? It didn't change - we added some fixes for cases where users were setting non-default values. On Fri, Jul 11, 2014 at 9:31 AM, Gary Malouf malouf.g...@gmail.com wrote: Hi Matei, We have not had time to re-deploy the rc today, but one thing that jumps out is the shrinking of the default akka frame size from 10MB to around 128KB by default. That is my first suspicion for our issue - could imagine that biting others as well. I'll try to re-test that today - either way, understand moving forward at this point. Gary On Fri, Jul 11, 2014 at 12:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Unless you can diagnose the problem quickly, Gary, I think we need to go ahead with this release as is. This release didn't touch the Mesos support as far as I know, so the problem might be a nondeterministic issue with your application. But on the other hand the release does fix some critical bugs that affect all users. We can always do 1.0.2 later if we discover a problem. Matei On Jul 10, 2014, at 9:40 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Gary, The vote technically doesn't close until I send the vote summary e-mail, but I was planning to close and package this tonight. It's too bad if there is a regression, it might be worth holding the release but it really requires narrowing down the issue to get more information about the scope and severity. Could you fork another thread for this? - Patrick On Thu, Jul 10, 2014 at 6:28 PM, Gary Malouf malouf.g...@gmail.com wrote: -1 I honestly do not know the voting rules for the Spark community, so please excuse me if I am out of line or if Mesos compatibility is not a concern at this point. We just tried to run this version built against 2.3.0-cdh5.0.2 on mesos 0.18.2. All of our jobs with data above a few gigabytes hung indefinitely. Downgrading back to the 1.0.0 stable release of Spark built the same way worked for us. On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves tgraves...@yahoo.com.invalid wrote: +1. Ran some Spark on yarn jobs on a hadoop 2.4 cluster with authentication on. Tom On Friday, July 4, 2014 2:39 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.1! The tag to be voted on is v1.0.1-rc1 (commit 7d1043c): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.1-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1021/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/ Please vote on releasing this package as Apache Spark 1.0.1! The vote is open until Monday, July 07, at 20:45 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.1 [ ] -1 Do not release this package because ... 
To learn more about Apache Spark, please see http://spark.apache.org/ === Differences from RC1 === This release includes only one blocking patch from rc1: https://github.com/apache/spark/pull/1255 There are also smaller fixes which came in over the last week. === About this release === This release fixes a few high-priority bugs in 1.0 and has a variety of smaller fixes. The full list is here: http://s.apache.org/b45. Some of the more visible patches are: SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size. SPARK-1790: Support r3 instance types on EC2. This is the first maintenance release on the 1.0 line. We plan to make additional maintenance releases as new fixes come in.
[RESULT] [VOTE] Release Apache Spark 1.0.1 (RC2)
This vote has passed with 9 +1 votes (5 binding) and 1 -1 vote (0 binding). +1: Patrick Wendell* Mark Hamstra* DB Tsai Krishna Sankar Soren Macbeth Andrew Or Matei Zaharia* Xiangrui Meng* Tom Graves* 0: -1: Gary Malouf
Announcing Spark 1.0.1
I am happy to announce the availability of Spark 1.0.1! This release includes contributions from 70 developers. Spark 1.0.1 includes fixes across several areas of Spark, including the core API, PySpark, and MLlib. It also includes new features in Spark's (alpha) SQL library, including support for JSON data and performance and stability fixes. Visit the release notes[1] to read about this release or download[2] the release today. [1] http://spark.apache.org/releases/spark-release-1-0-1.html [2] http://spark.apache.org/downloads.html
Re: how to run the program compiled with spark 1.0.0 in the branch-0.1-jdbc cluster
1. The first error I met is a different SerializationVersionUID in ExecuterStatus, which I resolved by explicitly declaring SerializationVersionUID in ExecuterStatus.scala and recompiling branch-0.1-jdbc. I don't think there is a class in Spark named ExecuterStatus (sic) ... or ExecutorStatus. Is this a class you made?
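For reference, pinning a serial version UID on a Scala class is done with the @SerialVersionUID annotation. A minimal sketch - the class name simply mirrors the one mentioned above and is not a Spark class:

  // Fixing the UID keeps differently-compiled builds of the same class wire-compatible
  @SerialVersionUID(1L)
  class ExecuterStatus(val executorId: String, val state: String) extends Serializable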
Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097
Hey Cody, This Jstack seems truncated, would you mind giving the entire stack trace? For the second thread, for instance, we can't see where the lock is being acquired. - Patrick On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger cody.koenin...@mediacrossing.com wrote: Hi all, just wanted to give a heads up that we're seeing a reproducible deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2 If jira is a better place for this, apologies in advance - figured talking about it on the mailing list was friendlier than randomly (re)opening jira tickets. I know Gary had mentioned some issues with 1.0.1 on the mailing list, once we got a thread dump I wanted to follow up. The thread dump shows the deadlock occurs in the synchronized block of code that was changed in HadoopRDD.scala, for the Spark-1097 issue Relevant portions of the thread dump are summarized below, we can provide the whole dump if it's useful. Found one Java-level deadlock: = Executor task launch worker-1: waiting to lock monitor 0x7f250400c520 (object 0xfae7dc30, a org.apache.hadoop.co nf.Configuration), which is held by Executor task launch worker-0 Executor task launch worker-0: waiting to lock monitor 0x7f2520495620 (object 0xfaeb4fc8, a java.lang.Class), which is held by Executor task launch worker-1 Executor task launch worker-1: at org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791) - waiting to lock 0xfae7dc30 (a org.apache.hadoop.conf.Configuration) at org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690) - locked 0xfaca6ff8 (a java.lang.Class for org.apache.hadoop.conf.Configurati on) at org.apache.hadoop.hdfs.HdfsConfiguration.clinit(HdfsConfiguration.java:34) at org.apache.hadoop.hdfs.DistributedFileSystem.clinit(DistributedFileSystem.java:110 ) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl. java:57) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl. 
java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAcces sorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at java.lang.Class.newInstance0(Class.java:374) at java.lang.Class.newInstance(Class.java:327) at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373) at java.util.ServiceLoader$1.next(ServiceLoader.java:445) at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364) - locked 0xfaeb4fc8 (a java.lang.Class for org.apache.hadoop.fs.FileSystem) at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167) at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288) at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546) at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145) ...elided... Executor task launch worker-0 daemon prio=10 tid=0x01e71800 nid=0x2d97 waiting for monitor entry [0x7f24d2bf1000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362) - waiting to lock 0xfaeb4fc8 (a java.lang.Class for org.apache.hadoop.fs.FileSystem) at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167) at
Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097
Hey Nishkam, Aaron's fix should prevent two concurrent accesses to getJobConf (and the Hadoop code therein). But if there is code elsewhere that tries to mutate the configuration, then I could see how we might still have the ConcurrentModificationException. I looked at your patch for HADOOP-10456 and the only example you give is of the data being accessed inside of getJobConf. Is it accessed somewhere else too from Spark that you are aware of? https://issues.apache.org/jira/browse/HADOOP-10456 - Patrick On Mon, Jul 14, 2014 at 3:28 PM, Nishkam Ravi nr...@cloudera.com wrote: Hi Aaron, I'm not sure if synchronizing on an arbitrary lock object would help. I suspect we will start seeing the ConcurrentModificationException again. The right fix has gone into Hadoop through 10456. Unfortunately, I don't have any bright ideas on how to synchronize this at the Spark level without the risk of deadlocks. On Mon, Jul 14, 2014 at 3:07 PM, Aaron Davidson ilike...@gmail.com wrote: The full jstack would still be useful, but our current working theory is that this is due to the fact that Configuration#loadDefaults goes through every Configuration object that was ever created (via Configuration.REGISTRY) and locks it, thus introducing a dependency from new Configuration to old, otherwise unrelated, Configuration objects that our locking did not anticipate. I have created https://github.com/apache/spark/pull/1409 to hopefully fix this bug. On Mon, Jul 14, 2014 at 2:44 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Cody, This Jstack seems truncated, would you mind giving the entire stack trace? For the second thread, for instance, we can't see where the lock is being acquired. - Patrick On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger cody.koenin...@mediacrossing.com wrote: Hi all, just wanted to give a heads up that we're seeing a reproducible deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2 If jira is a better place for this, apologies in advance - figured talking about it on the mailing list was friendlier than randomly (re)opening jira tickets. I know Gary had mentioned some issues with 1.0.1 on the mailing list, once we got a thread dump I wanted to follow up. The thread dump shows the deadlock occurs in the synchronized block of code that was changed in HadoopRDD.scala, for the Spark-1097 issue Relevant portions of the thread dump are summarized below, we can provide the whole dump if it's useful. Found one Java-level deadlock: = Executor task launch worker-1: waiting to lock monitor 0x7f250400c520 (object 0xfae7dc30, a org.apache.hadoop.co nf.Configuration), which is held by Executor task launch worker-0 Executor task launch worker-0: waiting to lock monitor 0x7f2520495620 (object 0xfaeb4fc8, a java.lang.Class), which is held by Executor task launch worker-1 Executor task launch worker-1: at org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791) - waiting to lock 0xfae7dc30 (a org.apache.hadoop.conf.Configuration) at org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690) - locked 0xfaca6ff8 (a java.lang.Class for org.apache.hadoop.conf.Configurati on) at org.apache.hadoop.hdfs.HdfsConfiguration.clinit(HdfsConfiguration.java:34) at org.apache.hadoop.hdfs.DistributedFileSystem.clinit(DistributedFileSystem.java:110 ) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl. 
java:57) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl. java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAcces sorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at java.lang.Class.newInstance0(Class.java:374) at java.lang.Class.newInstance(Class.java:327) at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373) at java.util.ServiceLoader$1.next(ServiceLoader.java:445) at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364) - locked 0xfaeb4fc8 (a java.lang.Class for org.apache.hadoop.fs.FileSystem) at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89
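The fix under discussion - synchronizing on one dedicated lock object when constructing job confs, rather than on the Configuration instances themselves - looks roughly like the sketch below; the names are illustrative, not the actual HadoopRDD code:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.mapred.JobConf

  object JobConfLockSketch {
    // One process-wide lock: every JobConf is built while holding it, so two tasks can no
    // longer interleave Configuration class-loading/registry work and deadlock on each other
    private val CONF_LOCK = new Object

    def newJobConf(base: Configuration): JobConf = CONF_LOCK.synchronized {
      new JobConf(base)
    }
  }

As Nishkam points out, this only helps if every mutation path goes through the same lock, which is why the Hadoop-side fix (HADOOP-10456) is the more complete solution.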
Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097
Andrew is your issue also a regression from 1.0.0 to 1.0.1? The immediate priority is addressing regressions between these two releases. On Mon, Jul 14, 2014 at 9:05 PM, Andrew Ash and...@andrewash.com wrote: I'm not sure either of those PRs will fix the concurrent adds to Configuration issue I observed. I've got a stack trace and writeup I'll share in an hour or two (traveling today). On Jul 14, 2014 9:50 PM, scwf wangf...@huawei.com wrote: hi,Cody i met this issue days before and i post a PR for this( https://github.com/apache/spark/pull/1385) it's very strange that if i synchronize conf it will deadlock but it is ok when synchronize initLocalJobConfFuncOpt Here's the entire jstack output. On Mon, Jul 14, 2014 at 4:44 PM, Patrick Wendell pwend...@gmail.com mailto:pwend...@gmail.com wrote: Hey Cody, This Jstack seems truncated, would you mind giving the entire stack trace? For the second thread, for instance, we can't see where the lock is being acquired. - Patrick On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger cody.koenin...@mediacrossing.com mailto:cody.koeninger@ mediacrossing.com wrote: Hi all, just wanted to give a heads up that we're seeing a reproducible deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2 If jira is a better place for this, apologies in advance - figured talking about it on the mailing list was friendlier than randomly (re)opening jira tickets. I know Gary had mentioned some issues with 1.0.1 on the mailing list, once we got a thread dump I wanted to follow up. The thread dump shows the deadlock occurs in the synchronized block of code that was changed in HadoopRDD.scala, for the Spark-1097 issue Relevant portions of the thread dump are summarized below, we can provide the whole dump if it's useful. Found one Java-level deadlock: = Executor task launch worker-1: waiting to lock monitor 0x7f250400c520 (object 0xfae7dc30, a org.apache.hadoop.co http://org.apache.hadoop.co nf.Configuration), which is held by Executor task launch worker-0 Executor task launch worker-0: waiting to lock monitor 0x7f2520495620 (object 0xfaeb4fc8, a java.lang.Class), which is held by Executor task launch worker-1 Executor task launch worker-1: at org.apache.hadoop.conf.Configuration.reloadConfiguration( Configuration.java:791) - waiting to lock 0xfae7dc30 (a org.apache.hadoop.conf.Configuration) at org.apache.hadoop.conf.Configuration.addDefaultResource( Configuration.java:690) - locked 0xfaca6ff8 (a java.lang.Class for org.apache.hadoop.conf.Configurati on) at org.apache.hadoop.hdfs.HdfsConfiguration.clinit( HdfsConfiguration.java:34) at org.apache.hadoop.hdfs.DistributedFileSystem.clinit (DistributedFileSystem.java:110 ) at sun.reflect.NativeConstructorAccessorImpl. newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance( NativeConstructorAccessorImpl. java:57) at sun.reflect.NativeConstructorAccessorImpl. newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance( NativeConstructorAccessorImpl. java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance( DelegatingConstructorAcces sorImpl.java:45) at java.lang.reflect.Constructor. 
newInstance(Constructor.java:525) at java.lang.Class.newInstance0(Class.java:374) at java.lang.Class.newInstance(Class.java:327) at java.util.ServiceLoader$LazyIterator.next( ServiceLoader.java:373) at java.util.ServiceLoader$1.next(ServiceLoader.java:445) at org.apache.hadoop.fs.FileSystem.loadFileSystems( FileSystem.java:2364) - locked 0xfaeb4fc8 (a java.lang.Class for org.apache.hadoop.fs.FileSystem) at org.apache.hadoop.fs.FileSystem.getFileSystemClass( FileSystem.java:2375) at org.apache.hadoop.fs.FileSystem.createFileSystem( FileSystem.java:2392) at org.apache.hadoop.fs.FileSystem.access$200( FileSystem.java:89) at org.apache.hadoop.fs.FileSystem$Cache.getInternal( FileSystem.java:2431) at org.apache.hadoop.fs.FileSystem$Cache.get( FileSystem.java:2413) at org.apache.hadoop.fs.FileSystem.get(FileSystem. java:368) at org.apache.hadoop.fs.FileSystem.get(FileSystem. java:167
Re: Catalyst dependency on Spark Core
Adding new build modules is pretty high overhead, so if this is a case where a small amount of duplicated code could get rid of the dependency, that could also be a good short-term option. - Patrick On Mon, Jul 14, 2014 at 2:15 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yeah, I'd just add a spark-util that has these things. Matei On Jul 14, 2014, at 1:04 PM, Michael Armbrust mich...@databricks.com wrote: Yeah, sadly this dependency was introduced when someone consolidated the logging infrastructure. However, the dependency should be very small and thus easy to remove, and I would like catalyst to be usable outside of Spark. A pull request to make this possible would be welcome. Ideally, we'd create some sort of spark common package that has things like logging. That way catalyst could depend on that, without pulling in all of Hadoop, etc. Maybe others have opinions though, so I'm cc-ing the dev list. On Mon, Jul 14, 2014 at 12:21 AM, Yanbo Liang yanboha...@gmail.com wrote: Making Catalyst independent of Spark is the goal of Catalyst, but it may need time and evolution. I noticed that the package org.apache.spark.sql.catalyst.util imports org.apache.spark.util.{Utils => SparkUtils}, so that Catalyst has a dependency on Spark core. I'm not sure whether it will be replaced by another component independent of Spark in a later release. 2014-07-14 11:51 GMT+08:00 Aniket Bhatnagar aniket.bhatna...@gmail.com: As per the recent presentation given at Scala Days (http://people.apache.org/~marmbrus/talks/SparkSQLScalaDays2014.pdf), it was mentioned that Catalyst is independent of Spark. But on inspecting pom.xml of the sql/catalyst module, it seems it has a dependency on Spark Core. Any particular reason for the dependency? I would love to use Catalyst outside Spark (reposted as previous email bounced. Sorry if this is a duplicate).
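The "spark common" module Michael and Matei describe could be as small as a thin logging trait with no Spark or Hadoop dependencies. A rough sketch, with hypothetical names (this is not an existing Spark module or class):

  import org.slf4j.{Logger, LoggerFactory}

  // Minimal logging mix-in that a module like catalyst could depend on without pulling in spark-core
  trait UtilLogging {
    @transient protected lazy val log: Logger = LoggerFactory.getLogger(getClass.getName)
    protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
    protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
    protected def logError(msg: => String, t: Throwable = null): Unit =
      if (t == null) log.error(msg) else log.error(msg, t)
  }

Catalyst would then mix in something like UtilLogging instead of the spark-core logging trait, and the build dependency on core could be dropped.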