Re: A proposal for Spark 2.0
I see. My concern is / was that cluster operators will be reluctant to upgrade to 2.0, meaning that developers using those clusters need to stay on 1.x, and, if they want to move to DataFrames, essentially need to port their app twice. I misunderstood and thought part of the proposal was to drop support for 2.10 though. If your broad point is that there aren't changes in 2.0 that will make it less palatable to cluster administrators than releases in the 1.x line, then yes, 2.0 as the next release sounds fine to me. -Sandy On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote: > What are the other breaking changes in 2.0 though? Note that we're not > removing Scala 2.10, we're just making the default build be against Scala > 2.11 instead of 2.10. There seem to be very few changes that people would > worry about. If people are going to update their apps, I think it's better > to make the other small changes in 2.0 at the same time than to update once > for Dataset and another time for 2.0. > > BTW just refer to Reynold's original post for the other proposed API > changes. > > Matei > > On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote: > > I think that Kostas' logic still holds. The majority of Spark users, and > likely an even vaster majority of people running vaster jobs, are still on > RDDs and on the cusp of upgrading to DataFrames. Users will probably want > to upgrade to the stable version of the Dataset / DataFrame API so they > don't need to do so twice. Requiring that they absorb all the other ways > that Spark breaks compatibility in the move to 2.0 makes it much more > difficult for them to make this transition. > > Using the same set of APIs also means that it will be easier to backport > critical fixes to the 1.x line. > > It's not clear to me that avoiding breakage of an experimental API in the > 1.x line outweighs these issues. > > -Sandy > > On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <r...@databricks.com> wrote: > >> I actually think the next one (after 1.6) should be Spark 2.0. The reason >> is that I already know we have to break some part of the DataFrame/Dataset >> API as part of the Dataset design. (e.g. DataFrame.map should return >> Dataset rather than RDD). In that case, I'd rather break this sooner (in >> one release) than later (in two releases). so the damage is smaller. >> >> I don't think whether we call Dataset/DataFrame experimental or not >> matters too much for 2.0. We can still call Dataset experimental in 2.0 and >> then mark them as stable in 2.1. Despite being "experimental", there has >> been no breaking changes to DataFrame from 1.3 to 1.6. >> >> >> >> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <m...@clearstorydata.com> >> wrote: >> >>> Ah, got it; by "stabilize" you meant changing the API, not just bug >>> fixing. We're on the same page now. >>> >>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <kos...@cloudera.com> >>> wrote: >>> >>>> A 1.6.x release will only fix bugs - we typically don't change APIs in >>>> z releases. The Dataset API is experimental and so we might be changing the >>>> APIs before we declare it stable. This is why I think it is important to >>>> first stabilize the Dataset API with a Spark 1.7 release before moving to >>>> Spark 2.0. This will benefit users that would like to use the new Dataset >>>> APIs but can't move to Spark 2.0 because of the backwards incompatible >>>> changes, like removal of deprecated APIs, Scala 2.11 etc. 
>>>> >>>> Kostas >>>> >>>> >>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <m...@clearstorydata.com >>>> > wrote: >>>> >>>>> Why does stabilization of those two features require a 1.7 release >>>>> instead of 1.6.1? >>>>> >>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <kos...@cloudera.com >>>>> > wrote: >>>>> >>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes we >>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like >>>>>> to propose we have one more 1.x release after Spark 1.6. This will allow >>>>>> us >>>>>> to stabilize a few of the new features that were added in 1.6: >>>>>> >>>>>> 1) the experimental Datasets API >>>>>> 2) the new unified memory manager. >>>>>> >
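To make the API change Reynold mentions above concrete, here is a rough Scala sketch of the type-level difference. The 1.x signature shown is the existing one; the 2.0 shape is only the proposal under discussion, not a shipped API, and the column access is purely illustrative.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Spark 1.x: a lambda map on a DataFrame drops you back into the RDD world,
// so everything downstream loses the Catalyst/Tungsten optimizations.
def firstColumnLengths(df: DataFrame): RDD[Int] =
  df.map(row => row.getString(0).length)

// Proposed 2.0 shape (sketch only): the same call would return Dataset[Int],
// keeping the computation inside the optimized engine.
//   def firstColumnLengths(df: DataFrame): Dataset[Int] =
//     df.map(row => row.getString(0).length)   // requires an Encoder[Int]
```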
Re: A proposal for Spark 2.0
I think that Kostas' logic still holds. The majority of Spark users, and likely an even vaster majority of people running vaster jobs, are still on RDDs and on the cusp of upgrading to DataFrames. Users will probably want to upgrade to the stable version of the Dataset / DataFrame API so they don't need to do so twice. Requiring that they absorb all the other ways that Spark breaks compatibility in the move to 2.0 makes it much more difficult for them to make this transition. Using the same set of APIs also means that it will be easier to backport critical fixes to the 1.x line. It's not clear to me that avoiding breakage of an experimental API in the 1.x line outweighs these issues. -Sandy On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xinwrote: > I actually think the next one (after 1.6) should be Spark 2.0. The reason > is that I already know we have to break some part of the DataFrame/Dataset > API as part of the Dataset design. (e.g. DataFrame.map should return > Dataset rather than RDD). In that case, I'd rather break this sooner (in > one release) than later (in two releases). so the damage is smaller. > > I don't think whether we call Dataset/DataFrame experimental or not > matters too much for 2.0. We can still call Dataset experimental in 2.0 and > then mark them as stable in 2.1. Despite being "experimental", there has > been no breaking changes to DataFrame from 1.3 to 1.6. > > > > On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra > wrote: > >> Ah, got it; by "stabilize" you meant changing the API, not just bug >> fixing. We're on the same page now. >> >> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis >> wrote: >> >>> A 1.6.x release will only fix bugs - we typically don't change APIs in z >>> releases. The Dataset API is experimental and so we might be changing the >>> APIs before we declare it stable. This is why I think it is important to >>> first stabilize the Dataset API with a Spark 1.7 release before moving to >>> Spark 2.0. This will benefit users that would like to use the new Dataset >>> APIs but can't move to Spark 2.0 because of the backwards incompatible >>> changes, like removal of deprecated APIs, Scala 2.11 etc. >>> >>> Kostas >>> >>> >>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra >>> wrote: >>> Why does stabilization of those two features require a 1.7 release instead of 1.6.1? On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis wrote: > We have veered off the topic of Spark 2.0 a little bit here - yes we > can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like > to propose we have one more 1.x release after Spark 1.6. This will allow > us > to stabilize a few of the new features that were added in 1.6: > > 1) the experimental Datasets API > 2) the new unified memory manager. > > I understand our goal for Spark 2.0 is to offer an easy transition but > there will be users that won't be able to seamlessly upgrade given what we > have discussed as in scope for 2.0. For these users, having a 1.x release > with these new features/APIs stabilized will be very beneficial. This > might > make Spark 1.7 a lighter release but that is not necessarily a bad thing. > > Any thoughts on this timeline? > > Kostas Sakellis > > > > On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao > wrote: > >> Agree, more features/apis/optimization need to be added in DF/DS. >> >> >> >> I mean, we need to think about what kind of RDD APIs we have to >> provide to developer, maybe the fundamental API is enough, like, the >> ShuffledRDD etc.. 
But PairRDDFunctions probably not in this category, as >> we can do the same thing easily with DF/DS, even better performance. >> >> >> >> *From:* Mark Hamstra [mailto:m...@clearstorydata.com] >> *Sent:* Friday, November 13, 2015 11:23 AM >> *To:* Stephen Boesch >> >> *Cc:* dev@spark.apache.org >> *Subject:* Re: A proposal for Spark 2.0 >> >> >> >> Hmmm... to me, that seems like precisely the kind of thing that >> argues for retaining the RDD API but not as the first thing presented to >> new Spark developers: "Here's how to use groupBy with DataFrames >> Until >> the optimizer is more fully developed, that won't always get you the best >> performance that can be obtained. In these particular circumstances, >> ..., >> you may want to use the low-level RDD API while setting >> preservesPartitioning to true. Like this" >> >> >> >> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch >> wrote: >> >> My understanding is that the RDD's presently have more support for >> complete control of partitioning which is a key
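Mark's "like this" about keeping control of partitioning in the low-level RDD API can be illustrated with a small sketch; the RDD, key names, and partition count below are hypothetical.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Co-locate records by key once, then transform values while telling Spark the
// partitioning is preserved, so a later reduceByKey/join on the same key can
// skip another shuffle.
def scaleCounts(counts: RDD[(String, Int)]): RDD[(String, Int)] = {
  val partitioned = counts.partitionBy(new HashPartitioner(200))
  partitioned.mapPartitions(
    iter => iter.map { case (key, value) => (key, value * 2) },
    preservesPartitioning = true)
}
```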
Re: Dropping support for earlier Hadoop versions in Spark 2.0?
To answer your fourth question from Cloudera's perspective, we would never support a customer running Spark 2.0 on a Hadoop version < 2.6. -Sandy On Fri, Nov 20, 2015 at 1:39 PM, Reynold Xin wrote: > OK I'm not exactly asking for a vote here :) > > I don't think we should look at it from only a maintenance point of view -- > because in that case the answer is clearly supporting as few versions as > possible (or just rm -rf spark source code and call it a day). It is a > tradeoff between the number of users impacted and the maintenance burden. > > So a few questions for those more familiar with Hadoop: > > 1. Can a Hadoop 2.6 client read Hadoop 2.4 / 2.3? > > 2. If the answer to 1 is yes, are there known, major issues with backward > compatibility? > > 3. Can Hadoop 2.6+ YARN work on older versions of YARN clusters? > > 4. (for Hadoop vendors) When did/will support for Hadoop 2.4 and below > stop? To what extent do you care about running Spark on older Hadoop > clusters? > > > > On Fri, Nov 20, 2015 at 7:52 AM, Steve Loughran > wrote: >> >> On 20 Nov 2015, at 14:28, ches...@alpinenow.com wrote: >> >> Assuming we have 1.6 and 1.7 releases, then Spark 2.0 is about 9 months >> away. >> >> Customers will need to upgrade their Hadoop clusters to Apache 2.6 or >> later to leverage the new Spark 2.0 within one year. I think this is possible as >> the latest releases of CDH 5.x and HDP 2.x are both on Apache 2.6.0 already. >> Companies will have enough time to upgrade their clusters. >> >> +1 for me as well >> >> Chester >> >> >> Now, if you are looking that far ahead, the other big issue is "when to >> retire Java 7 support"? >> >> That's a tough decision for all projects. Hadoop 3.x will be Java 8 only, >> but nobody has committed the patch to the trunk codebase to force a Java 8 >> build; and most of *today's* Hadoop clusters are Java 7. But as you can't even >> download a Java 7 JDK for the desktop from Oracle any more today, 2016 is the >> time to look at the language support and decide what the baseline >> version should be. >> >> Commentary from Twitter here - as they point out, it's not just the server >> farm that matters, it's all the apps that talk to it: >> >> >> >> http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201503.mbox/%3ccab7mwte+kefcxsr6n46-ztcs19ed7cwc9vobtr1jqewdkye...@mail.gmail.com%3E >> >> -Steve >> > >
Re: A proposal for Spark 2.0
Another +1 to Reynold's proposal. Maybe this is obvious, but I'd like to advocate against a blanket removal of deprecated / developer APIs. Many APIs can likely be removed without material impact (e.g. the SparkContext constructor that takes preferred node location data), while others likely see heavier usage (e.g. I wouldn't be surprised if mapPartitionsWithContext was baked into a number of apps) and merit a little extra consideration. Maybe also obvious, but I think a migration guide with API equivlents and the like would be incredibly useful in easing the transition. -Sandy On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xinwrote: > Echoing Shivaram here. I don't think it makes a lot of sense to add more > features to the 1.x line. We should still do critical bug fixes though. > > > On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman < > shiva...@eecs.berkeley.edu> wrote: > >> +1 >> >> On a related note I think making it lightweight will ensure that we >> stay on the current release schedule and don't unnecessarily delay 2.0 >> to wait for new features / big architectural changes. >> >> In terms of fixes to 1.x, I think our current policy of back-porting >> fixes to older releases would still apply. I don't think developing >> new features on both 1.x and 2.x makes a lot of sense as we would like >> users to switch to 2.x. >> >> Shivaram >> >> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis >> wrote: >> > +1 on a lightweight 2.0 >> > >> > What is the thinking around the 1.x line after Spark 2.0 is released? >> If not >> > terminated, how will we determine what goes into each major version >> line? >> > Will 1.x only be for stability fixes? >> > >> > Thanks, >> > Kostas >> > >> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell >> wrote: >> >> >> >> I also feel the same as Reynold. I agree we should minimize API breaks >> and >> >> focus on fixing things around the edge that were mistakes (e.g. >> exposing >> >> Guava and Akka) rather than any overhaul that could fragment the >> community. >> >> Ideally a major release is a lightweight process we can do every >> couple of >> >> years, with minimal impact for users. >> >> >> >> - Patrick >> >> >> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas >> >> wrote: >> >>> >> >>> > For this reason, I would *not* propose doing major releases to break >> >>> > substantial API's or perform large re-architecting that prevent >> users from >> >>> > upgrading. Spark has always had a culture of evolving architecture >> >>> > incrementally and making changes - and I don't think we want to >> change this >> >>> > model. >> >>> >> >>> +1 for this. The Python community went through a lot of turmoil over >> the >> >>> Python 2 -> Python 3 transition because the upgrade process was too >> painful >> >>> for too long. The Spark community will benefit greatly from our >> explicitly >> >>> looking to avoid a similar situation. >> >>> >> >>> > 3. Assembly-free distribution of Spark: don’t require building an >> >>> > enormous assembly jar in order to run Spark. >> >>> >> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free >> >>> distribution means. >> >>> >> >>> Nick >> >>> >> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin >> wrote: >> >> I’m starting a new thread since the other one got intermixed with >> feature requests. Please refrain from making feature request in this >> thread. >> Not that we shouldn’t be adding features, but we can always add >> features in >> 1.7, 2.1, 2.2, ... 
>> >> First - I want to propose a premise for how to think about Spark 2.0 >> and >> major releases in Spark, based on discussion with several members of >> the >> community: a major release should be low overhead and minimally >> disruptive >> to the Spark community. A major release should not be very different >> from a >> minor release and should not be gated based on new features. The main >> purpose of a major release is an opportunity to fix things that are >> broken >> in the current API and remove certain deprecated APIs (examples >> follow). >> >> For this reason, I would *not* propose doing major releases to break >> substantial API's or perform large re-architecting that prevent >> users from >> upgrading. Spark has always had a culture of evolving architecture >> incrementally and making changes - and I don't think we want to >> change this >> model. In fact, we’ve released many architectural changes on the 1.X >> line. >> >> If the community likes the above model, then to me it seems >> reasonable >> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or >> immediately >> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A >> cadence of >> major releases every 2 years seems doable within the above
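As one example of the kind of migration-guide entry Sandy suggests above: the deprecated mapPartitionsWithContext has a straightforward replacement via TaskContext.get(). A hedged sketch, with an illustrative RDD and names:

```scala
import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD

// Before (deprecated, a removal candidate for 2.0):
//   rdd.mapPartitionsWithContext((ctx, iter) => iter.map(line => (ctx.partitionId, line)))
// After: fetch the TaskContext inside an ordinary mapPartitions.
def tagWithPartition(rdd: RDD[String]): RDD[(Int, String)] =
  rdd.mapPartitions { iter =>
    val partition = TaskContext.get().partitionId()
    iter.map(line => (partition, line))
  }
```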
Re: A proposal for Spark 2.0
Oh and another question - should Spark 2.0 support Java 7? On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote: > Another +1 to Reynold's proposal. > > Maybe this is obvious, but I'd like to advocate against a blanket removal > of deprecated / developer APIs. Many APIs can likely be removed without > material impact (e.g. the SparkContext constructor that takes preferred > node location data), while others likely see heavier usage (e.g. I wouldn't > be surprised if mapPartitionsWithContext was baked into a number of apps) > and merit a little extra consideration. > > Maybe also obvious, but I think a migration guide with API equivlents and > the like would be incredibly useful in easing the transition. > > -Sandy > > On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin <r...@databricks.com> wrote: > >> Echoing Shivaram here. I don't think it makes a lot of sense to add more >> features to the 1.x line. We should still do critical bug fixes though. >> >> >> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman < >> shiva...@eecs.berkeley.edu> wrote: >> >>> +1 >>> >>> On a related note I think making it lightweight will ensure that we >>> stay on the current release schedule and don't unnecessarily delay 2.0 >>> to wait for new features / big architectural changes. >>> >>> In terms of fixes to 1.x, I think our current policy of back-porting >>> fixes to older releases would still apply. I don't think developing >>> new features on both 1.x and 2.x makes a lot of sense as we would like >>> users to switch to 2.x. >>> >>> Shivaram >>> >>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis <kos...@cloudera.com> >>> wrote: >>> > +1 on a lightweight 2.0 >>> > >>> > What is the thinking around the 1.x line after Spark 2.0 is released? >>> If not >>> > terminated, how will we determine what goes into each major version >>> line? >>> > Will 1.x only be for stability fixes? >>> > >>> > Thanks, >>> > Kostas >>> > >>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell <pwend...@gmail.com> >>> wrote: >>> >> >>> >> I also feel the same as Reynold. I agree we should minimize API >>> breaks and >>> >> focus on fixing things around the edge that were mistakes (e.g. >>> exposing >>> >> Guava and Akka) rather than any overhaul that could fragment the >>> community. >>> >> Ideally a major release is a lightweight process we can do every >>> couple of >>> >> years, with minimal impact for users. >>> >> >>> >> - Patrick >>> >> >>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas >>> >> <nicholas.cham...@gmail.com> wrote: >>> >>> >>> >>> > For this reason, I would *not* propose doing major releases to >>> break >>> >>> > substantial API's or perform large re-architecting that prevent >>> users from >>> >>> > upgrading. Spark has always had a culture of evolving architecture >>> >>> > incrementally and making changes - and I don't think we want to >>> change this >>> >>> > model. >>> >>> >>> >>> +1 for this. The Python community went through a lot of turmoil over >>> the >>> >>> Python 2 -> Python 3 transition because the upgrade process was too >>> painful >>> >>> for too long. The Spark community will benefit greatly from our >>> explicitly >>> >>> looking to avoid a similar situation. >>> >>> >>> >>> > 3. Assembly-free distribution of Spark: don’t require building an >>> >>> > enormous assembly jar in order to run Spark. >>> >>> >>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free >>> >>> distribution means. 
>>> >>> >>> >>> Nick >>> >>> >>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <r...@databricks.com> >>> wrote: >>> >>>> >>> >>>> I’m starting a new thread since the other one got intermixed with >>> >>>> feature requests. Please refrain from making feature request in >>> this thread. >>> >>>> Not that we shouldn’t be add
Re: Info about Dataset
Hi Justin, The Dataset API proposal is available here: https://issues.apache.org/jira/browse/SPARK-. -Sandy On Tue, Nov 3, 2015 at 1:41 PM, Justin Uang wrote: > Hi, > > I was looking through some of the PRs slated for 1.6.0 and I noted > something called a Dataset, which looks like a new concept based off of the > scaladoc for the class. Can anyone point me to some references/design docs > regarding the choice to introduce the new concept? I presume it is probably > something to do with performance optimizations? > > Thanks! > > Justin >
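For readers landing on this thread without the JIRA handy, a minimal sketch of the idea based on the 1.6-era preview (the final API may differ, and the case class is hypothetical): a Dataset is a typed, Encoder-backed collection that runs on the same optimized engine as DataFrames.

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

def datasetSketch(sqlContext: SQLContext): Unit = {
  import sqlContext.implicits._
  // Typed collection backed by Tungsten/Catalyst rather than plain Java objects.
  val people = Seq(Person("alice", 29), Person("bob", 31)).toDS()
  // Lambdas keep compile-time types, unlike the DataFrame's untyped Row API.
  people.filter(_.age >= 30).show()
}
```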
Re: [VOTE] Release Apache Spark 1.5.0 (RC2)
+1 (non-binding) built from source and ran some jobs against YARN -Sandy On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan vaquar.k...@gmail.com wrote: +1 (1.5.0 RC2)Compiled on Windows with YARN. Regards, Vaquar khan +1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min mvn clean package -Pyarn -Phadoop-2.6 -DskipTests 2. Tested pyspark, mllib 2.1. statistics (min,max,mean,Pearson,Spearman) OK 2.2. Linear/Ridge/Laso Regression OK 2.3. Decision Tree, Naive Bayes OK 2.4. KMeans OK Center And Scale OK 2.5. RDD operations OK State of the Union Texts - MapReduce, Filter,sortByKey (word count) 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK Model evaluation/optimization (rank, numIter, lambda) with itertools OK 3. Scala - MLlib 3.1. statistics (min,max,mean,Pearson,Spearman) OK 3.2. LinearRegressionWithSGD OK 3.3. Decision Tree OK 3.4. KMeans OK 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK 3.6. saveAsParquetFile OK 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile, registerTempTable, sql OK 3.8. result = sqlContext.sql(SELECT OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID) OK 4.0. Spark SQL from Python OK 4.1. result = sqlContext.sql(SELECT * from people WHERE State = 'WA') OK 5.0. Packages 5.1. com.databricks.spark.csv - read/write OK (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But com.databricks:spark-csv_2.11:1.2.0 worked) 6.0. DataFrames 6.1. cast,dtypes OK 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK 6.3. joins,sql,set operations,udf OK Cheers k/ On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin r...@databricks.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ The tag to be voted on is v1.5.0-rc2: https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release (published as 1.5.0-rc2) can be found at: https://repository.apache.org/content/repositories/orgapachespark-1141/ The staging repository for this release (published as 1.5.0) can be found at: https://repository.apache.org/content/repositories/orgapachespark-1140/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/ === How can I help test this release? === If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. What justifies a -1 vote for this release? This vote is happening towards the end of the 1.5 QA period, so -1 votes should only occur for significant regressions from 1.4. Bugs already present in 1.4, minor regressions, or bugs related to new features will not block this release. === What should happen to JIRA tickets still targeting 1.5.0? === 1. 
It is OK for documentation patches to target 1.5.0 and still go into branch-1.5, since documentations will be packaged separately from the release. 2. New features for non-alpha-modules should target 1.6+. 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target version. == Major changes to help you focus your testing == As of today, Spark 1.5 contains more than 1000 commits from 220+ contributors. I've curated a list of important changes for 1.5. For the complete list, please refer to Apache JIRA changelog. RDD/DataFrame/SQL APIs - New UDAF interface - DataFrame hints for broadcast join - expr function for turning a SQL expression into DataFrame column - Improved support for NaN values - StructType now supports ordering - TimestampType precision is reduced to 1us - 100 new built-in expressions, including date/time, string, math - memory and local disk only checkpointing DataFrame/SQL Backend Execution - Code generation on by default - Improved join,
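Two of the SQL items listed above lend themselves to a short, hedged sketch (the DataFrames and column names are hypothetical): the broadcast-join hint and the expr helper for turning a SQL expression into a DataFrame column.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{broadcast, expr}

// Hint that the small dimension table should be shipped to every executor,
// and compute a derived column from a SQL expression string.
def enrichOrders(orders: DataFrame, countries: DataFrame): DataFrame =
  orders
    .join(broadcast(countries), "countryCode")
    .withColumn("lineTotal", expr("unitPrice * qty * (1 - discount)"))
```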
Re: [VOTE] Release Apache Spark 1.5.0 (RC1)
I see that there's an 1.5.0-rc2 tag in github now. Is that the official RC2 tag to start trying out? -Sandy On Mon, Aug 24, 2015 at 7:23 AM, Sean Owen so...@cloudera.com wrote: PS Shixiong Zhu is correct that this one has to be fixed: https://issues.apache.org/jira/browse/SPARK-10168 For example you can see assemblies like this are nearly empty: https://repository.apache.org/content/repositories/orgapachespark-1137/org/apache/spark/spark-streaming-flume-assembly_2.10/1.5.0-rc1/ Just a publishing glitch but worth a few more eyes on. On Fri, Aug 21, 2015 at 5:28 PM, Sean Owen so...@cloudera.com wrote: Signatures, license, etc. look good. I'm getting some fairly consistent failures using Java 7 + Ubuntu 15 + -Pyarn -Phive -Phive-thriftserver -Phadoop-2.6 -- does anyone else see these? they are likely just test problems, but worth asking. Stack traces are at the end. There are currently 79 issues targeted for 1.5.0, of which 19 are bugs, of which 1 is a blocker. (1032 have been resolved for 1.5.0.) That's significantly better than at the last release. I presume a lot of what's still targeted is not critical and can now be untargeted/retargeted. It occurs to me that the flurry of planning that took place at the start of the 1.5 QA cycle a few weeks ago was quite helpful, and is the kind of thing that would be even more useful at the start of a release cycle. So would be great to do this for 1.6 in a few weeks. Indeed there are already 267 issues targeted for 1.6.0 -- a decent roadmap already. Test failures: Core - Unpersisting TorrentBroadcast on executors and driver in distributed mode *** FAILED *** java.util.concurrent.TimeoutException: Can't find 2 executors before 1 milliseconds elapsed at org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:561) at org.apache.spark.broadcast.BroadcastSuite.testUnpersistBroadcast(BroadcastSuite.scala:313) at org.apache.spark.broadcast.BroadcastSuite.org $apache$spark$broadcast$BroadcastSuite$$testUnpersistTorrentBroadcast(BroadcastSuite.scala:287) at org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply$mcV$sp(BroadcastSuite.scala:165) at org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165) at org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... Streaming - stop slow receiver gracefully *** FAILED *** 0 was not greater than 0 (StreamingContextSuite.scala:324) Kafka - offset recovery *** FAILED *** The code passed to eventually never returned normally. Attempted 191 times over 10.043196973 seconds. Last failure message: strings.forall({ ((elem: Any) = DirectKafkaStreamSuite.collectedData.contains(elem)) }) was false. (DirectKafkaStreamSuite.scala:249) On Fri, Aug 21, 2015 at 5:37 AM, Reynold Xin r...@databricks.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.5.0! The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.0 [ ] -1 Do not release this package because ... 
To learn more about Apache Spark, please see http://spark.apache.org/ The tag to be voted on is v1.5.0-rc1: https://github.com/apache/spark/tree/4c56ad772637615cc1f4f88d619fac6c372c8552 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1137/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-docs/ === == How can I help test this release? == === If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. == What justifies a -1 vote for this release? == This vote is happening towards the end of the 1.5 QA period, so -1 votes should only occur for significant regressions from 1.4. Bugs already present in 1.4, minor regressions, or bugs related to new features will not block this
Re: [VOTE] Release Apache Spark 1.5.0 (RC1)
Cool, thanks! On Mon, Aug 24, 2015 at 2:07 PM, Reynold Xin r...@databricks.com wrote: Nope --- I cut that last Friday but had an error. I will remove it and cut a new one. On Mon, Aug 24, 2015 at 2:06 PM, Sandy Ryza sandy.r...@cloudera.com wrote: I see that there's an 1.5.0-rc2 tag in github now. Is that the official RC2 tag to start trying out? -Sandy On Mon, Aug 24, 2015 at 7:23 AM, Sean Owen so...@cloudera.com wrote: PS Shixiong Zhu is correct that this one has to be fixed: https://issues.apache.org/jira/browse/SPARK-10168 For example you can see assemblies like this are nearly empty: https://repository.apache.org/content/repositories/orgapachespark-1137/org/apache/spark/spark-streaming-flume-assembly_2.10/1.5.0-rc1/ Just a publishing glitch but worth a few more eyes on. On Fri, Aug 21, 2015 at 5:28 PM, Sean Owen so...@cloudera.com wrote: Signatures, license, etc. look good. I'm getting some fairly consistent failures using Java 7 + Ubuntu 15 + -Pyarn -Phive -Phive-thriftserver -Phadoop-2.6 -- does anyone else see these? they are likely just test problems, but worth asking. Stack traces are at the end. There are currently 79 issues targeted for 1.5.0, of which 19 are bugs, of which 1 is a blocker. (1032 have been resolved for 1.5.0.) That's significantly better than at the last release. I presume a lot of what's still targeted is not critical and can now be untargeted/retargeted. It occurs to me that the flurry of planning that took place at the start of the 1.5 QA cycle a few weeks ago was quite helpful, and is the kind of thing that would be even more useful at the start of a release cycle. So would be great to do this for 1.6 in a few weeks. Indeed there are already 267 issues targeted for 1.6.0 -- a decent roadmap already. Test failures: Core - Unpersisting TorrentBroadcast on executors and driver in distributed mode *** FAILED *** java.util.concurrent.TimeoutException: Can't find 2 executors before 1 milliseconds elapsed at org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:561) at org.apache.spark.broadcast.BroadcastSuite.testUnpersistBroadcast(BroadcastSuite.scala:313) at org.apache.spark.broadcast.BroadcastSuite.org $apache$spark$broadcast$BroadcastSuite$$testUnpersistTorrentBroadcast(BroadcastSuite.scala:287) at org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply$mcV$sp(BroadcastSuite.scala:165) at org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165) at org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... Streaming - stop slow receiver gracefully *** FAILED *** 0 was not greater than 0 (StreamingContextSuite.scala:324) Kafka - offset recovery *** FAILED *** The code passed to eventually never returned normally. Attempted 191 times over 10.043196973 seconds. Last failure message: strings.forall({ ((elem: Any) = DirectKafkaStreamSuite.collectedData.contains(elem)) }) was false. (DirectKafkaStreamSuite.scala:249) On Fri, Aug 21, 2015 at 5:37 AM, Reynold Xin r...@databricks.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.5.0! The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. 
[ ] +1 Release this package as Apache Spark 1.5.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ The tag to be voted on is v1.5.0-rc1: https://github.com/apache/spark/tree/4c56ad772637615cc1f4f88d619fac6c372c8552 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1137/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-docs/ === == How can I help test this release? == === If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. == What justifies a -1 vote for this release
Re: Compact RDD representation
Edit: the first line should read: val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _) On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza sandy.r...@cloudera.com wrote: This functionality already basically exists in Spark. To create the grouped RDD, one can run: val groupedRdd = rdd.reduceByKey(_ + _) To get it back into the original form: groupedRdd.flatMap(x => List.fill(x._2)(x._1)) -Sandy On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман sergliho...@gmail.com wrote: Hi, I am looking for a suitable issue for a Master's degree project (it sounds like scalability problems and improvements for Spark Streaming), and it seems like the introduction of a grouped RDD (for example: don't store Spark, Spark, Spark; instead store (Spark, 3)) can: 1. Reduce the memory needed for the RDD (roughly, used memory will be: % of unique messages) 2. Improve performance (no need to apply a function several times for the same message). Can I create a ticket and introduce an API for grouped RDDs? Does it make sense? I would also be very appreciative of critique and ideas
Re: Compact RDD representation
The user gets to choose what they want to reside in memory. If they call rdd.cache() on the original RDD, it will be in memory. If they call rdd.cache() on the compact RDD, it will be in memory. If cache() is called on both, they'll both be in memory. -Sandy On Sun, Jul 19, 2015 at 11:09 AM, Сергей Лихоман sergliho...@gmail.com wrote: Thanks for the answer! Could you please answer one more question? Will we have the original RDD and the grouped RDD in memory at the same time? 2015-07-19 21:04 GMT+03:00 Sandy Ryza sandy.r...@cloudera.com: Edit: the first line should read: val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _) On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza sandy.r...@cloudera.com wrote: This functionality already basically exists in Spark. To create the grouped RDD, one can run: val groupedRdd = rdd.reduceByKey(_ + _) To get it back into the original form: groupedRdd.flatMap(x => List.fill(x._2)(x._1)) -Sandy On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман sergliho...@gmail.com wrote: Hi, I am looking for a suitable issue for a Master's degree project (it sounds like scalability problems and improvements for Spark Streaming), and it seems like the introduction of a grouped RDD (for example: don't store Spark, Spark, Spark; instead store (Spark, 3)) can: 1. Reduce the memory needed for the RDD (roughly, used memory will be: % of unique messages) 2. Improve performance (no need to apply a function several times for the same message). Can I create a ticket and introduce an API for grouped RDDs? Does it make sense? I would also be very appreciative of critique and ideas
Re: Compact RDD representation
This functionality already basically exists in Spark. To create the grouped RDD, one can run: val groupedRdd = rdd.reduceByKey(_ + _) To get it back into the original form: groupedRdd.flatMap(x => List.fill(x._2)(x._1)) -Sandy On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман sergliho...@gmail.com wrote: Hi, I am looking for a suitable issue for a Master's degree project (it sounds like scalability problems and improvements for Spark Streaming), and it seems like the introduction of a grouped RDD (for example: don't store Spark, Spark, Spark; instead store (Spark, 3)) can: 1. Reduce the memory needed for the RDD (roughly, used memory will be: % of unique messages) 2. Improve performance (no need to apply a function several times for the same message). Can I create a ticket and introduce an API for grouped RDDs? Does it make sense? I would also be very appreciative of critique and ideas
Re: Compact RDD representation
In the Spark model, constructing an RDD does not mean storing all its contents in memory. Rather, an RDD is a description of a dataset that enables iterating over its contents, record by record (in parallel). The only time the full contents of an RDD are stored in memory is when a user explicitly calls cache or persist on it. -Sandy On Sun, Jul 19, 2015 at 11:41 AM, Сергей Лихоман sergliho...@gmail.com wrote: Sorry, maybe I am saying something completely wrong... we have a stream, and we digitize it to create an RDD. The RDD in this case will be just an array of Any. Then we apply a transformation to create a new grouped RDD, and GC should remove the original RDD from memory (if we don't persist it). Will we have a GC step in val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _) ? My suggestion is to remove the creation and reclaiming of the unneeded RDD and create the already-grouped one 2015-07-19 21:26 GMT+03:00 Sandy Ryza sandy.r...@cloudera.com: The user gets to choose what they want to reside in memory. If they call rdd.cache() on the original RDD, it will be in memory. If they call rdd.cache() on the compact RDD, it will be in memory. If cache() is called on both, they'll both be in memory. -Sandy On Sun, Jul 19, 2015 at 11:09 AM, Сергей Лихоман sergliho...@gmail.com wrote: Thanks for the answer! Could you please answer one more question? Will we have the original RDD and the grouped RDD in memory at the same time? 2015-07-19 21:04 GMT+03:00 Sandy Ryza sandy.r...@cloudera.com: Edit: the first line should read: val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _) On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza sandy.r...@cloudera.com wrote: This functionality already basically exists in Spark. To create the grouped RDD, one can run: val groupedRdd = rdd.reduceByKey(_ + _) To get it back into the original form: groupedRdd.flatMap(x => List.fill(x._2)(x._1)) -Sandy On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман sergliho...@gmail.com wrote: Hi, I am looking for a suitable issue for a Master's degree project (it sounds like scalability problems and improvements for Spark Streaming), and it seems like the introduction of a grouped RDD (for example: don't store Spark, Spark, Spark; instead store (Spark, 3)) can: 1. Reduce the memory needed for the RDD (roughly, used memory will be: % of unique messages) 2. Improve performance (no need to apply a function several times for the same message). Can I create a ticket and introduce an API for grouped RDDs? Does it make sense? I would also be very appreciative of critique and ideas
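Putting the pieces of this thread together, a minimal end-to-end sketch (the input path is hypothetical): only the compact (message, count) RDD is cached, while the original RDD remains just a recipe that is re-evaluated if a downstream step needs it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CompactRddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("compact-rdd").setMaster("local[*]"))
    val messages = sc.textFile("hdfs:///path/to/messages")

    // Compact representation: one (message, count) pair per distinct message.
    val grouped = messages.map((_, 1)).reduceByKey(_ + _).cache()

    // Expand back to the original multiset only where needed.
    val expanded = grouped.flatMap { case (msg, n) => List.fill(n)(msg) }

    println(s"distinct messages: ${grouped.count()}, total messages: ${expanded.count()}")
    sc.stop()
  }
}
```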
Re: How to Read Excel file in Spark 1.4
Hi Su, Spark can't read Excel files directly. Your best bet is probably to export the contents as a CSV and use the csvFile API. -Sandy On Mon, Jul 13, 2015 at 9:22 AM, spark user spark_u...@yahoo.com.invalid wrote: Hi, I need your help to save Excel data in Hive. 1. How to read an Excel file in Spark using Spark 1.4 2. How to save it using a data frame If you have some sample code, please send it. Thanks, su
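A minimal sketch of the CSV route Sandy describes, assuming the external spark-csv package is on the classpath (the file path, option, package version, and table name are illustrative, and option names can vary between spark-csv releases):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// e.g. launched with: spark-shell --packages com.databricks:spark-csv_2.10:1.0.3
def loadExportedSheet(sqlContext: SQLContext): DataFrame = {
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")   // first row carries the column names
    .load("hdfs:///data/exported_from_excel.csv")
  df.registerTempTable("excel_export")   // queryable from Spark SQL, or write into Hive from here
  df
}
```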
Re: External Shuffle service over yarn
Hi Yash, One of the main advantages is that, if you turn dynamic allocation on, and executors are discarded, your application is still able to get at the shuffle data that they wrote out. -Sandy On Thu, Jun 25, 2015 at 11:08 PM, yash datta sau...@gmail.com wrote: Hi devs, Can someone point out if there are any distinct advantages of using external shuffle service over yarn (runs on node manager as an auxiliary service https://issues.apache.org/jira/browse/SPARK-3797) instead of the default execution in the executor containers ? Please also mention if you have seen any differences having used both ways ? Thanks and Best Regards Yash -- When events unfold with calm and ease When the winds that blow are merely breeze Learn from nature, from birds and bees Live your life in love, and let joy not cease.
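A sketch of the client-side settings that pair dynamic allocation with the external shuffle service (values are illustrative; the YARN NodeManagers also need the spark_shuffle auxiliary service from SPARK-3797 configured separately in yarn-site.xml):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dynamic-allocation-example")
  .set("spark.shuffle.service.enabled", "true")     // shuffle files served outside the executors
  .set("spark.dynamicAllocation.enabled", "true")   // executors can be released and re-acquired
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")
val sc = new SparkContext(conf)
```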
Re: Increase partition count (repartition) without shuffle
Hi Alexander, There is currently no way to create an RDD with more partitions than its parent RDD without causing a shuffle. However, if the files are splittable, you can set the Hadoop configurations that control split size to something smaller so that the HadoopRDD ends up with more partitions. -Sandy On Thu, Jun 18, 2015 at 2:26 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, Is there a way to increase the amount of partition of RDD without causing shuffle? I’ve found JIRA issue https://issues.apache.org/jira/browse/SPARK-5997 however there is no implementation yet. Just in case, I am reading data from ~300 big binary files, which results in 300 partitions, then I need to sort my RDD, but it crashes with outofmemory exception. If I change the number of partitions to 2000, sort works OK, but repartition itself takes a lot of time due to shuffle. Best regards, Alexander
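A hedged sketch of the split-size approach Sandy describes (the paths and sizes are illustrative, and which Hadoop key applies depends on the InputFormat in use):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("smaller-splits"))

// For splittable files read through sc.textFile, the minPartitions hint alone
// produces smaller splits and hence more partitions, with no shuffle:
val text = sc.textFile("hdfs:///data/input/*", 2000)
println(text.partitions.length)

// For inputs read through the newer Hadoop InputFormats, capping the split
// size has a similar effect (64 MB here is illustrative):
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize",
  (64 * 1024 * 1024).toString)
```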
Re: [SparkScore] Performance portal for Apache Spark
This looks really awesome. On Tue, Jun 16, 2015 at 10:27 AM, Huang, Jie jie.hu...@intel.com wrote: Hi All We are happy to announce Performance portal for Apache Spark http://01org.github.io/sparkscore/ ! The Performance Portal for Apache Spark provides performance data on the Spark upsteam to the community to help identify issues, better understand performance differentials between versions, and help Spark customers get across the finish line faster. The Performance Portal generates two reports, regular (weekly) report and release based regression test report. We are currently using two benchmark suites which include HiBench ( http://github.com/intel-bigdata/HiBench) and Spark-perf ( https://github.com/databricks/spark-perf ). We welcome and look forward to your suggestions and feedbacks. More information and details provided below Abount Performance Portal for Apache Spark Our goal is to work with the Apache Spark community to further enhance the scalability and reliability of the Apache Spark. The data available on this site allows community members and potential Spark customers to closely track performance trend of the Apache Spark. Ultimately, we hope that this project will help community to fix performance issue quickly, thus providing better Apache spark code to end customers. The current workloads used in the benchmarking include HiBench (a benchmark suite to evaluate big data framework like Hadoop MR, Spark from Intel) and Spark-perf (a performance testing framework for Apache Spark from Databricks). Additional benchmarks will be added as they become available Description -- Each data point represents each workload runtime percent compared with the previous week. Different lines represents different workloads running on spark yarn-client mode. Hardware -- CPU type: Intel® Xeon® CPU E5-2697 v2 @ 2.70GHz Memory: 128GB NIC: 10GbE Disk(s): 8 x 1TB SATA HDD Software -- JAVA ver sion: 1.8.0_25 Hadoop version: 2.5.0-CDH5.3.2 HiBench version: 4.0 Spark on yarn-client mode Cluster -- 1 node for Master 10 nodes for Slave Summary The lower percent the better performance. -- *Group* *ww19 * *ww20 * *ww22 * *ww23 * *ww24 * *ww25 * HiBench 9.1% 6.6% 6.0% 7.9% -6.5% -3.1% spark-perf 4.1% 4.4% -1.8% 4.1% -4.7% -4.6% *Y-Axis: normalized completion time; X-Axis: Work Week. * * The commit number can be found in the result table. The performance score for each workload is normalized based on the elapsed time for 1.2 release.The lower the better.* HiBench -- *JOB* *ww19 * *ww20 * *ww22 * *ww23 * *ww24 * *ww25 * *commit* *489700c8 * *8e3822a0 * *530efe3e * *90c60692 * *db81b9d8 * *4eb48ed1 * sleep % % -2.1% -2.9% -4.1% 12.8% wordcount 17.6% 11.4% 8.0% 8.3% -18.6% -10.9% kmeans 92.1% 61.5% 72.1% 92.9% 86.9% 95.8% scan -4.9% -7.2% % -1.1% -25.5% -21.0% bayes -24.3% -20.1% -18.3% -11.1% -29.7% -31.3% aggregation 5.6% 10.5% % 9.2% -15.3% -15.0% join 4.5% 1.2% % 1.0% -12.7% -13.9% sort -3.3% -0.5% -11.9% -12.5% -17.5% -17.3% pagerank 2.2% 3.2% 4.0% 2.9% -11.4% -13.0% terasort -7.1% -0.2% -9.5% -7.3% -16.7% -17.0% Comments: null means no such workload running or workload failed in this time. *Y-Axis: normalized completion time; X-Axis: Work Week. * * The commit number can be found in the result table. 
The performance score for each workload is normalized based on the elapsed time for 1.2 release.The lower the better.* spark-perf -- *JOB* *ww19 * *ww20 * *ww22 * *ww23 * *ww24 * *ww25 * *commit* *489700c8 * *8e3822a0 * *530efe3e * *90c60692 * *db81b9d8 * *4eb48ed1 * agg 13.2% 7.0% % 18.3% 5.2% 2.5% agg-int 16.4% 21.2% % 9.6% 4.0% 8.2% agg-naive 4.3% -2.4% % -0.8% -6.7% -6.8 % scheduling -6.1% -8.9% -14.5% -2.1% -6.4% -6.5% count-filter 4.1% 1.0% 6.6% 6.8% -10.2% -10.4% count 4.8% 4.6% 6.7% 8.0% -7.3% -7.0% sort -8.1% -2.5% -6.2% -7.0% -14.6% -14.4% sort-int 4.5% 15.3% -1.6% -0.1% -1.5% -2.2% Comments: null means no such workload running or workload failed in this time. *Y-Axis: normalized completion time; X-Axis: Work Week. * * The commit number can be found in the result table. The pe rformance score for each workload is normalized based on the elapsed time for 1.2 release.The lower the better.* Release Summary The lower percent the better performance. -- *Group* *1.2.1 * *1.3.0 * *1.3.1 * *1.4.0 * HiBench -1.0% 10.5% 8.4% 8.6% spark-perf 3.2%
Re: [VOTE] Release Apache Spark 1.4.0 (RC4)
+1 (non-binding) Built from source and ran some jobs against a pseudo-distributed YARN cluster. -Sandy On Fri, Jun 5, 2015 at 11:05 AM, Ram Sriharsha sriharsha@gmail.com wrote: +1 , tested with hadoop 2.6/ yarn on centos 6.5 after building w/ -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver and ran a few SQL tests and the ML examples On Fri, Jun 5, 2015 at 10:55 AM, Hari Shreedharan hshreedha...@cloudera.com wrote: +1. Build looks good, ran a couple apps on YARN Thanks, Hari On Fri, Jun 5, 2015 at 10:52 AM, Yin Huai yh...@databricks.com wrote: Sean, Can you add -Phive -Phive-thriftserver and try those Hive tests? Thanks, Yin On Fri, Jun 5, 2015 at 5:19 AM, Sean Owen so...@cloudera.com wrote: Everything checks out again, and the tests pass for me on Ubuntu + Java 7 with '-Pyarn -Phadoop-2.6', except that I always get SparkSubmitSuite errors like ... - success sanity check *** FAILED *** java.lang.RuntimeException: [download failed: org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: commons-net#commons-net;3.1!commons-net.jar] at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978) at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62) ... I also can't get hive tests to pass. Is anyone else seeing anything like this? if not I'll assume this is something specific to the env -- or that I don't have the build invocation just right. It's puzzling since it's so consistent, but I presume others' tests pass and Jenkins does. On Wed, Jun 3, 2015 at 5:53 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.0! The tag to be voted on is v1.4.0-rc3 (commit 22596c5): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h= 22596c534a38cfdda91aef18aa9037ab101e4251 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: [published as version: 1.4.0] https://repository.apache.org/content/repositories/orgapachespark-/ [published as version: 1.4.0-rc4] https://repository.apache.org/content/repositories/orgapachespark-1112/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/ Please vote on releasing this package as Apache Spark 1.4.0! The vote is open until Saturday, June 06, at 05:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.4.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == What has changed since RC3 == In addition to may smaller fixes, three blocker issues were fixed: 4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make metadataHive get constructed too early 6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise() 78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be singleton == How can I help test this release? == If you are a Spark user, you can help us test this release by taking a Spark 1.3 workload and running on this release candidate, then reporting any regressions. == What justifies a -1 vote for this release? 
== This vote is happening towards the end of the 1.4 QA period, so -1 votes should only occur for significant regressions from 1.3.1. Bugs already present in 1.3.X, minor regressions, or bugs related to new features will not block this release. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.4.0 (RC3)
+1 (non-binding) Launched against a pseudo-distributed YARN cluster running Hadoop 2.6.0 and ran some jobs. -Sandy On Sat, May 30, 2015 at 3:44 PM, Krishna Sankar ksanka...@gmail.com wrote: +1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:07 min mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests 2. Tested pyspark, mlib - running as well as compare results with 1.3.1 2.1. statistics (min,max,mean,Pearson,Spearman) OK 2.2. Linear/Ridge/Laso Regression OK 2.3. Decision Tree, Naive Bayes OK 2.4. KMeans OK Center And Scale OK 2.5. RDD operations OK State of the Union Texts - MapReduce, Filter,sortByKey (word count) 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK Model evaluation/optimization (rank, numIter, lambda) with itertools OK 3. Scala - MLlib 3.1. statistics (min,max,mean,Pearson,Spearman) OK 3.2. LinearRegressionWithSGD OK 3.3. Decision Tree OK 3.4. KMeans OK 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK 3.6. saveAsParquetFile OK 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile, registerTempTable, sql OK 3.8. result = sqlContext.sql(SELECT OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID) OK 4.0. Spark SQL from Python OK 4.1. result = sqlContext.sql(SELECT * from people WHERE State = 'WA') OK Cheers k/ On Fri, May 29, 2015 at 4:40 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.0! The tag to be voted on is v1.4.0-rc3 (commit dd109a8): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=dd109a8746ec07c7c83995890fc2c0cd7a693730 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: [published as version: 1.4.0] https://repository.apache.org/content/repositories/orgapachespark-1109/ [published as version: 1.4.0-rc3] https://repository.apache.org/content/repositories/orgapachespark-1110/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-docs/ Please vote on releasing this package as Apache Spark 1.4.0! The vote is open until Tuesday, June 02, at 00:32 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.4.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == What has changed since RC1 == Below is a list of bug fixes that went into this RC: http://s.apache.org/vN == How can I help test this release? == If you are a Spark user, you can help us test this release by taking a Spark 1.3 workload and running on this release candidate, then reporting any regressions. == What justifies a -1 vote for this release? == This vote is happening towards the end of the 1.4 QA period, so -1 votes should only occur for significant regressions from 1.3.1. Bugs already present in 1.3.X, minor regressions, or bugs related to new features will not block this release. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: YARN mode startup takes too long (10+ secs)
Wow, I hadn't noticed this, but 5 seconds is really long. It's true that it's configurable, but I think we need to provide a decent out-of-the-box experience. For comparison, the MapReduce equivalent is 1 second. I filed https://issues.apache.org/jira/browse/SPARK-7533 for this. -Sandy On Mon, May 11, 2015 at 9:03 AM, Mridul Muralidharan mri...@gmail.com wrote: For tiny/small clusters (particularly single tenet), you can set it to lower value. But for anything reasonably large or multi-tenet, the request storm can be bad if large enough number of applications start aggressively polling RM. That is why the interval is set to configurable. - Mridul On Mon, May 11, 2015 at 6:54 AM, Zoltán Zvara zoltan.zv...@gmail.com wrote: Isn't this issue something that should be improved? Based on the following discussion, there are two places were YARN's heartbeat interval is respected on job start-up, but do we really need to respect it on start-up? On Fri, May 8, 2015 at 12:14 PM Taeyun Kim taeyun@innowireless.com wrote: I think so. In fact, the flow is: allocator.allocateResources() - sleep - allocator.allocateResources() - sleep … But I guess that on the first allocateResources() the allocation is not fulfilled. So sleep occurs. *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com] *Sent:* Friday, May 08, 2015 4:25 PM *To:* Taeyun Kim; u...@spark.apache.org *Subject:* Re: YARN mode startup takes too long (10+ secs) So is this sleep occurs before allocating resources for the first few executors to start the job? On Fri, May 8, 2015 at 6:23 AM Taeyun Kim taeyun@innowireless.com wrote: I think I’ve found the (maybe partial, but major) reason. It’s between the following lines, (it’s newly captured, but essentially the same place that Zoltán Zvara picked: 15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager 15/05/08 11:36:38 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster04 :55237/user/Executor#-149550753] with ID 1 When I read the logs on cluster side, the following lines were found: (the exact time is different with above line, but it’s the difference between machines) 15/05/08 11:36:23 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 5000 15/05/08 11:36:28 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454 It seemed that Spark deliberately sleeps 5 secs. I’ve read the Spark source code, and in org.apache.spark.deploy.yarn.ApplicationMaster.scala, launchReporterThread() had the code for that. It loops calling allocator.allocateResources() and Thread.sleep(). For sleep, it reads the configuration variable spark.yarn.scheduler.heartbeat.interval-ms (the default value is 5000, which is 5 secs). According to the comment, “we want to be reasonably responsive without causing too many requests to RM”. So, unless YARN immediately fulfill the allocation request, it seems that 5 secs will be wasted. When I modified the configuration variable to 1000, it only waited for 1 sec. Here is the log lines after the change: 15/05/08 11:47:21 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 1000 15/05/08 11:47:22 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454 4 secs saved. So, when one does not want to wait 5 secs, one can change the spark.yarn.scheduler.heartbeat.interval-ms. I hope that the additional overhead it incurs would be negligible. 
*From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com] *Sent:* Thursday, May 07, 2015 10:05 PM *To:* Taeyun Kim; u...@spark.apache.org *Subject:* Re: YARN mode startup takes too long (10+ secs) Without considering everything, just a few hints: You are running on YARN. From 09:18:34 to 09:18:37 your application is in state ACCEPTED. There is a noticeable overhead introduced due to communicating with YARN's ResourceManager, NodeManager and given that the YARN scheduler needs time to make a decision. I guess somewhere from 09:18:38 to 09:18:43 your application JAR gets copied to another container requested by the Spark ApplicationMaster deployed on YARN's container 0. Deploying an executor needs further resource negotiations with the ResourceManager usually. Also, as I said, your JAR and Executor's code requires copying to the container's local directory - execution blocked until that is complete. On Thu, May 7, 2015 at 3:09 AM Taeyun Kim taeyun@innowireless.com wrote: Hi, I’m running a spark application with YARN-client or YARN-cluster mode. But it seems to take too long to startup. It takes 10+ seconds to initialize the spark context. Is this normal? Or can it be optimized? The environment is as follows: - Hadoop:
Re: Regarding KryoSerialization in Spark
Hi Twinkle, Registering the class makes it so that writeClass only writes out a couple of bytes, instead of a full String of the class name. -Sandy On Thu, Apr 30, 2015 at 4:13 AM, twinkle sachdeva twinkle.sachd...@gmail.com wrote: Hi, As per the code, KryoSerialization uses the writeClassAndObject method, which internally calls the writeClass method, which will write the class of the object during serialization. As per the documentation in the tuning page of Spark, it says that registering the class will avoid that. Am I missing something, or is there some issue with the documentation? Thanks, Twinkle
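A minimal sketch of what registering a class looks like in practice; MyRecord and the values are illustrative, the config keys are standard Spark settings.

import org.apache.spark.SparkConf

case class MyRecord(id: Long, name: String)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Once registered, Kryo writes a small numeric class ID instead of the full
  // class-name String that writeClass would otherwise emit for every object.
  .registerKryoClasses(Array(classOf[MyRecord]))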
Re: Using memory mapped file for shuffle
Spark currently doesn't allocate any memory off of the heap for shuffle objects. When the in-memory data gets too large, it will write it out to a file, and then merge the spilled files later. What exactly do you mean by storing shuffle data in HDFS? -Sandy On Tue, Apr 14, 2015 at 10:15 AM, Kannan Rajah kra...@maprtech.com wrote: Sandy, Can you clarify how it won't cause OOM? Is it in any way related to memory being allocated outside the heap - native space? The reason I ask is that I have a use case to store shuffle data in HDFS. Since there is no notion of memory mapped files, I need to store it as a byte buffer. I want to make sure this will not cause OOM when the file size is large. -- Kannan On Tue, Apr 14, 2015 at 9:07 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Kannan, Both in MapReduce and Spark, the amount of shuffle data a task produces can exceed the task's memory without risk of OOM. -Sandy On Tue, Apr 14, 2015 at 6:47 AM, Imran Rashid iras...@cloudera.com wrote: That limit doesn't have anything to do with the amount of available memory. It's just a tuning parameter, as one version is more efficient for smaller files, the other is better for bigger files. I suppose the comment is a little better in FileSegmentManagedBuffer: https://github.com/apache/spark/blob/master/network/common/src/main/java/org/apache/spark/network/buffer/FileSegmentManagedBuffer.java#L62 On Tue, Apr 14, 2015 at 12:01 AM, Kannan Rajah kra...@maprtech.com wrote: DiskStore.getBytes uses memory mapped files if the length is more than a configured limit. This code path is used during map side shuffle in ExternalSorter. I want to know if it's possible for the length to exceed the limit in the case of shuffle. The reason I ask is that in the case of Hadoop, each map task is supposed to produce only data that can fit within the task's configured max memory. Otherwise it will result in OOM. Is the behavior the same in Spark, or can the size of data generated by a map task exceed what can fit in memory? if (length < minMemoryMapBytes) { val buf = ByteBuffer.allocate(length.toInt) } else { Some(channel.map(MapMode.READ_ONLY, offset, length)) } -- Kannan
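For reference, the "configured limit" mentioned above appears to be the standard spark.storage.memoryMapThreshold setting that DiskStore consults; a minimal sketch of tuning it (the value shown is the 1.x default and is illustrative).

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Disk segments at or above this size are memory-mapped;
  // smaller ones are read into a heap ByteBuffer instead.
  .set("spark.storage.memoryMapThreshold", "2097152") // 2 MB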
Re: Design docs: consolidation and discoverability
: Only catch there is it requires commit access to the repo. We need a way for people who aren't committers to write and collaborate (for point #1) On Fri, Apr 24, 2015 at 3:56 PM, Punyashloka Biswal punya.bis...@gmail.com wrote: Sandy, doesn't keeping (in-progress) design docs in Git satisfy the history requirement? Referring back to my Gradle example, it seems that https://github.com/gradle/gradle/commits/master/design-docs/build-comparison.md is a really good way to see why the design doc evolved the way it did. When keeping the doc in Jira (presumably as an attachment) it's not easy to see what changed between successive versions of the doc. Punya On Fri, Apr 24, 2015 at 3:53 PM Sandy Ryza sandy.r...@cloudera.com wrote: I think there are maybe two separate things we're talking about? 1. Design discussions and in-progress design docs. My two cents are that JIRA is the best place for this. It allows tracking the progression of a design across multiple PRs and contributors. A piece of useful feedback that I've gotten in the past is to make design docs immutable. When updating them in response to feedback, post a new version rather than editing the existing one. This enables tracking the history of a design and makes it possible to read comments about previous designs in context. Otherwise it's really difficult to understand why particular approaches were chosen or abandoned. 2. Completed design docs for features that we've implemented. Perhaps less essential to project progress, but it would be really lovely to have a central repository to all the projects design doc. If anyone wants to step up to maintain it, it would be cool to have a wiki page with links to all the final design docs posted on JIRA. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Design docs: consolidation and discoverability
I think there are maybe two separate things we're talking about? 1. Design discussions and in-progress design docs. My two cents are that JIRA is the best place for this. It allows tracking the progression of a design across multiple PRs and contributors. A piece of useful feedback that I've gotten in the past is to make design docs immutable. When updating them in response to feedback, post a new version rather than editing the existing one. This enables tracking the history of a design and makes it possible to read comments about previous designs in context. Otherwise it's really difficult to understand why particular approaches were chosen or abandoned. 2. Completed design docs for features that we've implemented. Perhaps less essential to project progress, but it would be really lovely to have a central repository to all the projects design doc. If anyone wants to step up to maintain it, it would be cool to have a wiki page with links to all the final design docs posted on JIRA. -Sandy On Fri, Apr 24, 2015 at 12:01 PM, Punyashloka Biswal punya.bis...@gmail.com wrote: The Gradle dev team keep their design documents *checked into* their Git repository -- see https://github.com/gradle/gradle/blob/master/design-docs/build-comparison.md for example. The advantages I see to their approach are: - design docs stay on ASF property (since Github is synced to the Apache-run Git repository) - design docs have a lifetime across PRs, but can still be modified and commented on through the mechanism of PRs - keeping a central location helps people to find good role models and converge on conventions Sean, I find it hard to use the central Jira as a jumping-off point for understanding ongoing design work because a tiny fraction of the tickets actually relate to design docs, and it's not easy from the outside to figure out which ones are relevant. Punya On Fri, Apr 24, 2015 at 2:49 PM Sean Owen so...@cloudera.com wrote: I think it's OK to have design discussions on github, as emails go to ASF lists. After all, loads of PR discussions happen there. It's easy for anyone to follow. I also would rather just discuss on Github, except for all that noise. It's not great to put discussions in something like Google Docs actually; the resulting doc needs to be pasted back to JIRA promptly if so. I suppose it's still better than a private conversation or not talking at all, but the principle is that one should be able to access any substantive decision or conversation by being tuned in to only the project systems of record -- mailing list, JIRA. On Fri, Apr 24, 2015 at 2:30 PM, Reynold Xin r...@databricks.com wrote: I'd love to see more design discussions consolidated in a single place as well. That said, there are many practical challenges to overcome. Some of them are out of our control: 1. For large features, it is fairly common to open a PR for discussion, close the PR taking some feedback into account, and reopen another one. You sort of lose the discussions that way. 2. With the way Jenkins is setup currently, Jenkins testing introduces a lot of noise to GitHub pull requests, making it hard to differentiate legitimate comments from noise. This is unfortunately due to the fact that ASF won't allow our Jenkins bot to have API privilege to post messages. 3. The Apache Way is that all development discussions need to happen on ASF property, i.e. dev lists and JIRA. As a result, technically we are not allowed to have development discussions on GitHub. 
On Fri, Apr 24, 2015 at 7:09 AM, Cody Koeninger c...@koeninger.org wrote: My 2 cents - I'd rather see design docs in github pull requests (using plain text / markdown). That doesn't require changing access or adding people, and github PRs already allow for conversation / email notifications. Conversation is already split between jira and github PRs. Having a third stream of conversation in Google Docs just leads to things being ignored. On Fri, Apr 24, 2015 at 7:21 AM, Sean Owen so...@cloudera.com wrote: That would require giving wiki access to everyone or manually adding people any time they make a doc. I don't see how this helps though. They're still docs on the internet and they're still linked from the central project JIRA, which is what you should follow. On Apr 24, 2015 8:14 AM, Punyashloka Biswal punya.bis...@gmail.com wrote: Dear Spark devs, Right now, design docs are stored on Google docs and linked from tickets. For someone new to the project, it's hard to figure out what subjects are being discussed, what organization to follow for new feature proposals, etc. Would it make sense to consolidate future design docs in either a designated area on the Apache Confluence
Re: Should we let everyone set Assignee?
I think one of the benefits of assignee fields that I've seen in other projects is their potential to coordinate and prevent duplicate work. It's really frustrating to put a lot of work into a patch and then find out that someone has been doing the same. It's helpful for the project etiquette to include a way to signal to others that you are working or intend to work on a patch. Obviously there are limits to how long someone should be able to hold on to a JIRA without making progress on it, but a signal is still useful. Historically, in other projects, the assignee field serves as this signal. If we don't want to use the assignee field for this, I think it's important to have some alternative, even if it's just encouraging contributors to comment "I'm planning to work on this" on JIRA. -Sandy On Wed, Apr 22, 2015 at 1:30 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: As a contributor, I've never felt shut out from the Spark community, nor have I seen any examples of territorial behavior. A few times I've expressed interest in more challenging work and the response I received was generally "go ahead and give it a shot, just understand that this is sensitive code so we may end up modifying the PR substantially." Honestly, that seems fine, and in general, I think it's completely fair to go with the PR model - e.g. if a JIRA has an open PR then it's an active effort, otherwise it's fair game unless otherwise stated. At the end of the day, it's about moving the project forward and the only way to do that is to have actual code in the pipes - speculation and intent don't really help, and there's nothing preventing an interested party from submitting a PR against an issue. Thank you, Ilya Ganelin On 4/22/15, 1:25 PM, Mark Hamstra m...@clearstorydata.com wrote: Agreed. The Spark project and community that Vinod describes do not resemble the ones with which I am familiar. On Wed, Apr 22, 2015 at 1:20 PM, Patrick Wendell pwend...@gmail.com wrote: Hi Vinod, Thanks for your thoughts - however, I do not agree with your sentiment and implications. Spark is broadly quite an inclusive project and we spend a lot of effort culturally to help make newcomers feel welcome. - Patrick On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: Actually what this community got away with is pretty much an anti-pattern compared to every other Apache project I have seen. And may I say in a not so Apache way. Waiting for a committer to assign a patch to someone leaves it as a privilege to a committer. Not alluding to anything fishy in practice, but this also leaves a lot of open ground for self-interest. Committers defining notions of good fit / level of experience do not work; they are highly subjective and lead to group control. In terms of semantics, here is what most other projects (dare I say every Apache project?) that I have seen do - A new contributor comes in who is not yet added to the JIRA project. He/she requests one of the project's JIRA admins to add him/her. - After that, he or she is free to assign tickets to themselves. - What this means -- Assigning a ticket to oneself is a signal to the rest of the community that he/she is actively working on the said patch. -- If multiple contributors want to work on the same patch, it needs to be resolved amicably through open communication. On JIRA, or on mailing lists. Not by the whim of a committer. - Common issues -- Land grabbing: Other contributors can nudge him/her in case of inactivity and take them over.
Again, amicably instead of a committer making subjective decisions. -- Progress stalling: One contributor assigns the ticket to himself/herself, is actively debating, but with no real code/docs contribution or any real intention of making progress. Here workable, reviewable code usually wins. Assigning patches is not a privilege. Contributors at Apache are a bunch of volunteers; the PMC should let volunteers contribute as they see fit. We do not assign work at Apache. +Vinod On Apr 22, 2015, at 12:32 PM, Patrick Wendell pwend...@gmail.com wrote: One overarching issue is that it's pretty unclear what "Assigned to X" in JIRA means from a process perspective. Personally I actually feel it's better for this to be more historical - i.e. who ended up submitting a patch for this feature that was merged - rather than creating an exclusive reservation for a particular user to work on something. If an issue is assigned to person X, but some other person Y submits a great patch for it, I think we have some obligation to Spark users and to the community to merge the better patch. So the idea of reserving the right to add a feature, it just seems overall off to me. IMO, it's fine if multiple people want to submit competing patches
Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
+1 Built against Hadoop 2.6 and ran some jobs against a pseudo-distributed YARN cluster. -Sandy On Wed, Apr 8, 2015 at 12:49 PM, Patrick Wendell pwend...@gmail.com wrote: Oh I see - ah okay I'm guessing it was a transient build error and I'll get it posted ASAP. On Wed, Apr 8, 2015 at 3:41 PM, Denny Lee denny.g@gmail.com wrote: Oh, it appears the 2.4 bits without hive are there but not the 2.4 bits with hive. Cool stuff on the 2.6. On Wed, Apr 8, 2015 at 12:30 Patrick Wendell pwend...@gmail.com wrote: Hey Denny, I beleive the 2.4 bits are there. The 2.6 bits I had done specially (we haven't merge that into our upstream build script). I'll do it again now for RC2. - Patrick On Wed, Apr 8, 2015 at 1:53 PM, Timothy Chen tnac...@gmail.com wrote: +1 Tested on 4 nodes Mesos cluster with fine-grain and coarse-grain mode. Tim On Wed, Apr 8, 2015 at 9:32 AM, Denny Lee denny.g@gmail.com wrote: The RC2 bits are lacking Hadoop 2.4 and Hadoop 2.6 - was that intended (they were included in RC1)? On Wed, Apr 8, 2015 at 9:01 AM Tom Graves tgraves...@yahoo.com.invalid wrote: +1. Tested spark on yarn against hadoop 2.6. Tom On Wednesday, April 8, 2015 6:15 AM, Sean Owen so...@cloudera.com wrote: Still a +1 from me; same result (except that now of course the UISeleniumSuite test does not fail) On Wed, Apr 8, 2015 at 1:46 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.1! The tag to be voted on is v1.3.1-rc2 (commit 7c4473a): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h= 7c4473aa5a7f5de0323394aaedeefbf9738e8eb5 The list of fixes present in this release can be found at: http://bit.ly/1C2nVPY The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1083/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/ The patches on top of RC1 are: [SPARK-6737] Fix memory leak in OutputCommitCoordinator https://github.com/apache/spark/pull/5397 [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py https://github.com/apache/spark/pull/5302 [SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError https://github.com/apache/spark/pull/4933 Please vote on releasing this package as Apache Spark 1.3.1! The vote is open until Saturday, April 11, at 07:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.3.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: RDD.count
I definitely see the value in this. However, I think at this point it would be an incompatible behavioral change. People often use count in Spark to exercise their DAG. Omitting processing steps that were previously included would likely mislead many users into thinking their pipeline was running faster. It's possible there might be room for something like a new smartCount API or a new argument to count that allows it to avoid unnecessary transformations. -Sandy On Sat, Mar 28, 2015 at 6:10 AM, Sean Owen so...@cloudera.com wrote: No, I'm not saying side effects change the count. But not executing the map() function at all certainly has an effect on the side effects of that function: the side effects which should take place never do. I am not sure that is something to be 'fixed'; it's a legitimate question. You can persist an RDD if you do not want to compute it twice. On Sat, Mar 28, 2015 at 1:05 PM, jimfcarroll jimfcarr...@gmail.com wrote: Hi Sean, Thanks for the response. I can't imagine a case (though my imagination may be somewhat limited) where even map side effects could change the number of elements in the resulting map. I guess count wouldn't officially be an 'action' if it were implemented this way. At least it wouldn't ALWAYS be one. My example was contrived. We're passing RDDs to functions. If that RDD is an instance of my class, then its count() may take a shortcut. If I map/zip/zipWithIndex/mapPartition/etc. first then I'm stuck with a call that literally takes 100s to 1000s of times longer (seconds vs hours on some of our datasets) and since my custom RDDs are immutable they cache the count call so a second invocation is the cost of a method call's overhead. I could fix this in Spark if there's any interest in that change. Otherwise I'll need to overload more RDD methods for my own purposes (like all of the transformations). Of course, that will be more difficult because those intermediate classes (like MappedRDD) are private, so I can't extend them. Jim -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-count-tp11298p11302.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
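A minimal sketch of the persist() workaround Sean mentions, so that count() and a later action only run the upstream map() once; the data is illustrative and an existing SparkContext sc is assumed.

val mapped = sc.parallelize(1 to 1000).map(_ * 2).persist()
val n = mapped.count()            // materializes the map() (and any side effects) exactly once
val total = mapped.reduce(_ + _)  // reuses the cached partitions; map() is not re-executed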
Re: hadoop input/output format advanced control
Regarding Patrick's question, you can just do new Configuration(oldConf) to get a cloned Configuration object and add any new properties to it. -Sandy On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid iras...@cloudera.com wrote: Hi Nick, I don't remember the exact details of these scenarios, but I think the user wanted a lot more control over how the files got grouped into partitions, to group the files together by some arbitrary function. I didn't think that was possible w/ CombineFileInputFormat, but maybe there is a way? thanks On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Imran, on your point to read multiple files together in a partition, is it not simpler to use the approach of copy Hadoop conf and set per-RDD settings for min split to control the input size per partition, together with something like CombineFileInputFormat? On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com wrote: I think this would be a great addition, I totally agree that you need to be able to set these at a finer context than just the SparkContext. Just to play devil's advocate, though -- the alternative is for you just subclass HadoopRDD yourself, or make a totally new RDD, and then you could expose whatever you need. Why is this solution better? IMO the criteria are: (a) common operations (b) error-prone / difficult to implement (c) non-obvious, but important for performance I think this case fits (a) (c), so I think its still worthwhile. But its also worth asking whether or not its too difficult for a user to extend HadoopRDD right now. There have been several cases in the past week where we've suggested that a user should read from hdfs themselves (eg., to read multiple files together in one partition) -- with*out* reusing the code in HadoopRDD, though they would lose things like the metric tracking preferred locations you get from HadoopRDD. Does HadoopRDD need to some refactoring to make that easier to do? Or do we just need a good example? Imran (sorry for hijacking your thread, Koert) On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com wrote: see email below. reynold suggested i send it to dev instead of user -- Forwarded message -- From: Koert Kuipers ko...@tresata.com Date: Mon, Mar 23, 2015 at 4:36 PM Subject: hadoop input/output format advanced control To: u...@spark.apache.org u...@spark.apache.org currently its pretty hard to control the Hadoop Input/Output formats used in Spark. The conventions seems to be to add extra parameters to all methods and then somewhere deep inside the code (for example in PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into settings on the Hadoop Configuration object. for example for compression i see codec: Option[Class[_ : CompressionCodec]] = None added to a bunch of methods. how scalable is this solution really? for example i need to read from a hadoop dataset and i dont want the input (part) files to get split up. the way to do this is to set mapred.min.split.size. now i dont want to set this at the level of the SparkContext (which can be done), since i dont want it to apply to input formats in general. i want it to apply to just this one specific input dataset i need to read. which leaves me with no options currently. i could go add yet another input parameter to all the methods (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile, etc.). but that seems ineffective. 
why can we not expose a Map[String, String] or some other generic way to manipulate settings for hadoop input/output formats? it would require adding one more parameter to all methods to deal with hadoop input/output formats, but after that it's done. one parameter to rule them all. then i could do: val x = sc.textFile("/some/path", formatSettings = Map("mapred.min.split.size" -> "12345")) or rdd.saveAsTextFile("/some/path", formatSettings = Map("mapred.output.compress" -> "true", "mapred.output.compression.codec" -> "somecodec"))
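A rough sketch of how this can be approximated today with the existing API, following Sandy's note about cloning the Configuration; the path and split size are illustrative, and an existing SparkContext sc is assumed.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Clone the SparkContext-wide Hadoop conf so the setting applies to this dataset only.
val perInputConf = new Configuration(sc.hadoopConfiguration)
perInputConf.set("mapred.min.split.size", "12345")

val lines = sc.newAPIHadoopFile(
    "/some/path",
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    perInputConf)
  .map(_._2.toString)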
Re: Spark Executor resources
That's correct. What's the reason this information is needed? -Sandy On Tue, Mar 24, 2015 at 11:41 AM, Zoltán Zvara zoltan.zv...@gmail.com wrote: Thank you for your response! I guess the (Spark)AM, who gives the container leash to the NM (along with the executor JAR and command to run), must know how much CPU or RAM that container is capped and isolated at. There must be a resource vector along the encrypted container leash, if I'm right, that describes this. Or maybe there is a way for the ExecutorBackend to fetch this information directly from the environment? Then, the ExecutorBackend would be able to hand over this information to the actual Executor who creates the TaskRunner. Zvara Zoltán mail, hangout, skype: zoltan.zv...@gmail.com mobile, viber: +36203129543 bank: 10918001-0021-50480008 address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a elte: HSKSJZ (ZVZOAAI.ELTE) 2015-03-24 16:30 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com: Hi Zoltan, If running on YARN, the YARN NodeManager starts executors. I don't think there's a 100% precise way for the Spark executor to know how many resources are allotted to it. It can come close by looking at the Spark configuration options used to request it (spark.executor.memory and spark.yarn.executor.memoryOverhead), but it can't necessarily account for the amount that YARN has rounded up if those configuration properties (yarn.scheduler.minimum-allocation-mb and yarn.scheduler.increment-allocation-mb) are not present on the node. -Sandy On Mon, Mar 23, 2015 at 5:08 PM, Zoltán Zvara zoltan.zv...@gmail.com wrote: Let's say I'm an Executor instance in a Spark system. Who started me and where, when I run on a worker node supervised by (a) Mesos, (b) YARN? I suppose I'm the only Executor on a worker node for a given framework scheduler (driver). If I'm an Executor instance, who is the closest object to me who can tell me how many resources I have on (a) Mesos, (b) YARN? Thank you for your kind input! Zvara Zoltán mail, hangout, skype: zoltan.zv...@gmail.com mobile, viber: +36203129543 bank: 10918001-0021-50480008 address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a elte: HSKSJZ (ZVZOAAI.ELTE)
Re: Spark Executor resources
Hi Zoltan, If running on YARN, the YARN NodeManager starts executors. I don't think there's a 100% precise way for the Spark executor to know how many resources are allotted to it. It can come close by looking at the Spark configuration options used to request it (spark.executor.memory and spark.yarn.executor.memoryOverhead), but it can't necessarily account for the amount that YARN has rounded up if those configuration properties (yarn.scheduler.minimum-allocation-mb and yarn.scheduler.increment-allocation-mb) are not present on the node. -Sandy On Mon, Mar 23, 2015 at 5:08 PM, Zoltán Zvara zoltan.zv...@gmail.com wrote: Let's say I'm an Executor instance in a Spark system. Who started me and where, when I run on a worker node supervised by (a) Mesos, (b) YARN? I suppose I'm the only Executor on a worker node for a given framework scheduler (driver). If I'm an Executor instance, who is the closest object to me who can tell me how many resources I have on (a) Mesos, (b) YARN? Thank you for your kind input! Zvara Zoltán mail, hangout, skype: zoltan.zv...@gmail.com mobile, viber: +36203129543 bank: 10918001-0021-50480008 address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a elte: HSKSJZ (ZVZOAAI.ELTE)
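A rough sketch of the "come close by looking at the configuration" idea above, run from code already executing on an executor (e.g. inside a task); note these are the requested sizes, not whatever YARN rounded the container up to, and the default strings are illustrative.

import org.apache.spark.SparkEnv

val conf = SparkEnv.get.conf
val executorMemory = conf.get("spark.executor.memory", "1g")
val memoryOverhead = conf.get("spark.yarn.executor.memoryOverhead", "<default>")
val executorCores  = conf.get("spark.executor.cores", "1")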
Re: Directly broadcasting (sort of) RDDs
Hi Guillaume, I've long thought something like this would be useful - i.e. the ability to broadcast RDDs directly without first pulling data through the driver. If I understand correctly, your requirement to block a matrix up and only fetch the needed parts could be implemented on top of this by splitting an RDD into a set of smaller RDDs and then broadcasting each one on its own. Unfortunately nobody is working on this currently (and I couldn't promise to have bandwidth to review it at the moment either), but I suspect we'll eventually need to add something like this for map joins in Hive on Spark and Spark SQL. -Sandy On Sat, Mar 21, 2015 at 3:11 AM, Guillaume Pitel guillaume.pi...@exensa.com wrote: Hi, Thanks for your answer. This is precisely the use case I'm interested in, but I know it already, I should have mentioned it. Unfortunately this implementation of BlockMatrix has (in my opinion) some disadvantages (the fact that it splits the matrix by range instead of using a modulo is bad for block skewness). Besides, and more importantly, as I was writing, it uses the join solution (actually a cogroup: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala, line 361). The duplication of the elements of the dense matrix is thus dependent on the block size. Actually I'm wondering if what I want to achieve could be done with a simple modification to the join, allowing a partition to be weakly cached after being retrieved. Guillaume There is block matrix in Spark 1.3 - http://spark.apache.org/docs/latest/mllib-data-types.html#blockmatrix However I believe it only supports dense matrix blocks. Still, might be possible to use it or extend it. JIRAs: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-3434 Was based on https://github.com/amplab/ml-matrix Another lib: https://github.com/PasaLab/marlin/blob/master/README.md — Sent from Mailbox On Sat, Mar 21, 2015 at 12:24 AM, Guillaume Pitel guillaume.pi...@exensa.com wrote: Hi, I have an idea that I would like to discuss with the Spark devs. The idea comes from a very real problem that I have struggled with for almost a year. My problem is very simple, it's a dense matrix * sparse matrix operation. I have a dense matrix RDD[(Int,FloatMatrix)] which is divided in X large blocks (one block per partition), and a sparse matrix RDD[((Int,Int),Array[Array[(Int,Float)]])], divided in X * Y blocks. The most efficient way to perform the operation is to collectAsMap() the dense matrix and broadcast it, then perform the block-local multiplications, and combine the results by column. This is quite fine, unless the matrix is too big to fit in memory (especially since the multiplication is performed several times iteratively, and the broadcasts are not always cleaned from memory as I would naively expect). When the dense matrix is too big, a second solution is to split the big sparse matrix into several RDDs and do several broadcasts. Doing this creates quite a big overhead, but it mostly works, even though I often face some problems with inaccessible broadcast files, for instance. Then there is the terrible but apparently very effective good old join. Since X blocks of the sparse matrix use the same block from the dense matrix, I suspect that the dense matrix is somehow replicated X times (either on disk or in the network), which is the reason why the join takes so much time.
After this bit of a context, here is my idea : would it be possible to somehow broadcast (or maybe more accurately, share or serve) a persisted RDD which is distributed on all workers, in a way that would, a bit like the IndexedRDD, allow a task to access a partition or an element of a partition in the closure, with a worker-local memory cache . i.e. the information about where each block resides would be distributed on the workers, to allow them to access parts of the RDD directly. I think that's already a bit how RDD are shuffled ? The RDD could stay distributed (no need to collect then broadcast), and only necessary transfers would be required. Is this a bad idea, is it already implemented somewhere (I would love it !) ?or is it something that could add efficiency not only for my use case, but maybe for others ? Could someone give me some hint about how I could add this possibility to Spark ? I would probably try to extend a RDD into a specific SharedIndexedRDD with a special lookup that would be allowed from tasks as a special case, and that would try to contact the blockManager and reach the corresponding data from the right worker. Thanks in advance for your advices Guillaume -- eXenSa *Guillaume PITEL, Président* +33(0)626 222 431 eXenSa S.A.S. http://www.exensa.com/ http://www.exensa.com/ 41, rue Périer - 92120 Montrouge - FRANCE Tel +33(0)184
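A rough sketch of the "collectAsMap + broadcast" pattern described in this thread, with the matrix blocks reduced to plain arrays so only the shape of the computation is visible; the block contents and the "multiply" step are illustrative, not Guillaume's actual code, and an existing SparkContext sc is assumed.

// Dense matrix: one block per column-block id.
val denseBlocks = sc.parallelize(Seq(0 -> Array(1.0, 2.0), 1 -> Array(3.0, 4.0)))
// Sparse matrix: blocks keyed by (rowBlock, colBlock), each a list of (index, value).
val sparseBlocks = sc.parallelize(Seq(
  ((0, 0), Seq(0 -> 0.5)),
  ((0, 1), Seq(1 -> 2.0)),
  ((1, 0), Seq(1 -> 1.5))))

// Pull the dense side to the driver once and broadcast it, instead of joining.
val denseLocal = sc.broadcast(denseBlocks.collectAsMap())

val products = sparseBlocks.map { case ((rowBlock, colBlock), entries) =>
  val dense = denseLocal.value(colBlock)                       // worker-local lookup, no join
  rowBlock -> entries.map { case (i, v) => v * dense(i) }.sum  // block-local "multiply"
}.reduceByKey(_ + _)                                           // combine partials by row block

products.collect().foreach(println)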
Re: multi-line comment style
+1 to what Andrew said, I think both make sense in different situations and trusting developer discretion here is reasonable. On Mon, Feb 9, 2015 at 1:48 PM, Andrew Or and...@databricks.com wrote: In my experience I find it much more natural to use // for short multi-line comments (2 or 3 lines), and /* */ for long multi-line comments involving one or more paragraphs. For short multi-line comments, there is no reason not to use // if it just so happens that your first line exceeded 100 characters and you have to wrap it. For long multi-line comments, however, using // all the way looks really awkward especially if you have multiple paragraphs. Thus, I would actually suggest that we don't try to pick a favorite and document that both are acceptable. I don't expect developers to follow my exact usage (i.e. with a tipping point of 2-3 lines) so I wouldn't enforce anything specific either. 2015-02-09 13:36 GMT-08:00 Reynold Xin r...@databricks.com: Why don't we just pick // as the default (by encouraging it in the style guide), since it is mostly used, and then do not disallow /* */? I don't think it is that big of a deal to have slight deviations here since it is dead simple to understand what's going on. On Mon, Feb 9, 2015 at 1:33 PM, Patrick Wendell pwend...@gmail.com wrote: Clearly there isn't a strictly optimal commenting format (pros and cons for both '//' and '/*'). My thought is for consistency we should just choose one and put it in the style guide. On Mon, Feb 9, 2015 at 12:25 PM, Xiangrui Meng men...@gmail.com wrote: Btw, I think allowing `/* ... */` without the leading `*` in lines is also useful. Check this line: https://github.com/apache/spark/pull/4259/files#diff-e9dcb3b5f3de77fc31b3aff7831110eaR55 , where we put the R commands that can reproduce the test result. It is easier if we write in the following style: ~~~ /* Using the following R code to load the data and train the model using glmnet package. library(glmnet) data <- read.csv("path", header=FALSE, stringsAsFactors=FALSE) features <- as.matrix(data.frame(as.numeric(data$V2), as.numeric(data$V3))) label <- as.numeric(data$V1) weights <- coef(glmnet(features, label, family="gaussian", alpha = 0, lambda = 0)) */ ~~~ So people can copy-paste the R commands directly. Xiangrui On Mon, Feb 9, 2015 at 12:18 PM, Xiangrui Meng men...@gmail.com wrote: I like the `/* .. */` style more. Because it is easier for IDEs to recognize it as a block comment. If you press enter in the comment block with the `//` style, IDEs won't add `//` for you. -Xiangrui On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin r...@databricks.com wrote: We should update the style doc to reflect what we have in most places (which I think is //). On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: FWIW I like the multi-line // over /* */ from a purely style standpoint. The Google Java style guide[1] has some comment about code formatting tools working better with /* */ but there doesn't seem to be any strong argument for one over the other that I can find. Thanks Shivaram [1] https://google-styleguide.googlecode.com/svn/trunk/javaguide.html#s4.8.6.1-block-comment-style On Wed, Feb 4, 2015 at 2:05 PM, Patrick Wendell pwend...@gmail.com wrote: Personally I have no opinion, but agree it would be nice to standardize. - Patrick On Wed, Feb 4, 2015 at 1:58 PM, Sean Owen so...@cloudera.com wrote: One thing Marcelo pointed out to me is that the // style does not interfere with commenting out blocks of code with /* */, which is a small good thing.
I am also accustomed to // style for multiline, and reserve /** */ for javadoc / scaladoc. Meaning, seeing the /* */ style inline always looks a little funny to me. On Wed, Feb 4, 2015 at 3:53 PM, Kay Ousterhout kayousterh...@gmail.com wrote: Hi all, The Spark Style Guide https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide says multi-line comments should formatted as: /* * This is a * very * long comment. */ But in my experience, we almost always use // for multi-line comments: // This is a // very // long comment. Here are some examples: - Recent commit by Reynold, king of style: https://github.com/apache/spark/commit/bebf4c42bef3e75d31ffce9bfdb331c16f34ddb1#diff-d616b5496d1a9f648864f4ab0db5a026R58 - RDD.scala:
Re: Improving metadata in Spark JIRA
JIRA updates don't go to this list, they go to iss...@spark.apache.org. I don't think many are signed up for that list, and those that are probably have a flood of emails anyway. So I'd definitely be in favor of any JIRA cleanup that you're up for. -Sandy On Fri, Feb 6, 2015 at 6:45 AM, Sean Owen so...@cloudera.com wrote: I've wasted no time in wielding the commit bit to complete a number of small, uncontroversial changes. I wouldn't commit anything that didn't already appear to have review, consensus and little risk, but please let me know if anything looked a little too bold, so I can calibrate. Anyway, I'd like to continue some small house-cleaning by improving the state of JIRA's metadata, in order to let it give us a little clearer view on what's happening in the project: a. Add Component to every (open) issue that's missing one b. Review all Critical / Blocker issues to de-escalate ones that seem obviously neither c. Correct open issues that list a Fix version that has already been released d. Close all issues Resolved for a release that has already been released The problem with doing so is that it will create a tremendous amount of email to the list, like, several hundred. It's possible to make bulk changes and suppress e-mail though, which could be done for all but b. Better to suppress the emails when making such changes? or just not bother on some of these? - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Issue with repartition and cache
Hi Dirceu, Does the issue not show up if you run map(f = f(1).asInstanceOf[Int]).sum on the train RDD? It appears that f(1) is an String, not an Int. If you're looking to parse and convert it, toInt should be used instead of asInstanceOf. -Sandy On Wed, Jan 21, 2015 at 8:43 AM, Dirceu Semighini Filho dirceu.semigh...@gmail.com wrote: Hi guys, have anyone find something like this? I have a training set, and when I repartition it, if I call cache it throw a classcastexception when I try to execute anything that access it val rep120 = train.repartition(120) val cached120 = rep120.cache cached120.map(f = f(1).asInstanceOf[Int]).sum Cell Toolbar: In [1]: ClusterSettings.executorMemory=Some(28g) ClusterSettings.maxResultSize = 20g ClusterSettings.resume=true ClusterSettings.coreInstanceType=r3.xlarge ClusterSettings.coreInstanceCount = 30 ClusterSettings.clusterName=UberdataContextCluster-Dirceu uc.applyDateFormat(YYMMddHH) Searching for existing cluster UberdataContextCluster-Dirceu ... Spark standalone cluster started at http://ec2-54-68-91-64.us-west-2.compute.amazonaws.com:8080 Found 1 master(s), 30 slaves Ganglia started at http://ec2-54-68-91-64.us-west-2.compute.amazonaws.com:5080/ganglia In [37]: import org.apache.spark.sql.catalyst.types._ import eleflow.uberdata.util.IntStringImplicitTypeConverter._ import eleflow.uberdata.enums.SupportedAlgorithm._ import eleflow.uberdata.data._ import org.apache.spark.mllib.tree.DecisionTree import eleflow.uberdata.enums.DateSplitType._ import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.classification._ import eleflow.uberdata.model._ import eleflow.uberdata.data.stat.Statistics import eleflow.uberdata.enums.ValidationMethod._ import org.apache.spark.rdd.RDD In [5]: val train = uc.load(uc.toHDFSURI(/tmp/data/input/train_rev4.csv)).applyColumnTypes(Seq(DecimalType(), LongType,TimestampType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, LongType, LongType,StringType, StringType,StringType, StringType,StringType)) .formatDateValues(2,DayOfAWeek | Period).slice(excludes = Seq(12,13)) Out[5]: idclickhour1hour2C1banner_possite_idsite_domainsite_categoryapp_idapp_domain app_categorydevice_modeldevice_typedevice_conn_typeC14C15C16C17C18C19C20C21 10941815109427302.03.0100501fbe01fef384576728905ebdecad23867801e8d9 07d7df2244956a241215706320501722035-179116934911786371502.03.010050 1fbe01fef384576728905ebdecad23867801e8d907d7df22711ee1201015704320501722035 10008479137190421511948602.03.0100501fbe01fef384576728905ebdecad2386 7801e8d907d7df228a4875bd101570432050172203510008479164072448083837602.0 3.0100501fbe01fef384576728905ebdecad23867801e8d907d7df226332421a101570632050 172203510008479167905641704209602.03.010051fe8cc4489166c1610569f928 ecad23867801e8d907d7df22779d90c21018993320502161035-11571720757801103869 02.03.010050d6137915bb1ef334f028772becad23867801e8d907d7df228a4875bd1016920 3205018990431100077117172472998854491102.03.0100508fda644b25d4cfcd f028772becad23867801e8d907d7df22be6db1d71020362320502333039-1157 In [7]: val test = uc.load(uc.toHDFSURI(/tmp/data/input/test_rev4.csv)).applyColumnTypes(Seq(DecimalType(), TimestampType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, 
StringType, LongType, LongType,StringType, StringType,StringType, StringType,StringType)). formatDateValues(1,DayOfAWeek | Period).slice(excludes =Seq(11,12)) Out[7]: idhour1hour2C1banner_possite_idsite_domainsite_categoryapp_idapp_domain app_categorydevice_modeldevice_typedevice_conn_typeC14C15C16C17C18C19C20C21 11740588092635695.03.010050235ba823f6ebf28ef028772becad23867801e8d9 07d7df220eb711ec1083303205076131751000752311825269208554285.03.010050 1fbe01fef384576728905ebdecad23867801e8d907d7df22ecb851b21022676320502616035 1000835115541398292139845.03.0100501fbe01fef384576728905ebdecad2386 7801e8d907d7df221f0bc64f102267632050261603510008351100010946378097988455.0 3.01005085f751fdc4e18dd650e219e051cedd4eaefc06bd0f2161f8542422a7101864832050 1092380910015661100013770415586707455.03.01005085f751fdc4e18dd650e219e0 9c13b4192347f47af95efa071f0bc64f1023160320502667047-122110001521204153353724 5.03.01005157fe1b205b626596f028772becad23867801e8d907d7df2268b6db2c106563320 50572239-132100019110567070233785.03.0100501fbe01fef384576728905ebdecad2386
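A minimal sketch of Sandy's point about parsing rather than casting the string column; the data is illustrative and an existing SparkContext sc is assumed.

val rows = sc.parallelize(Seq(Array("a", "1"), Array("b", "2")))
// row(1) is a String; asInstanceOf[Int] would fail at runtime with a ClassCastException.
val total = rows.map(row => row(1).toInt).reduce(_ + _)   // 3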
Re: Semantics of LGTM
I think clarifying these semantics is definitely worthwhile. Maybe this complicates the process with additional terminology, but the way I've used these has been: +1 - I think this is safe to merge and, barring objections from others, would merge it immediately. LGTM - I have no concerns about this patch, but I don't necessarily feel qualified to make a final call about it. The TM part acknowledges the judgment as a little more subjective. I think having some concise way to express both of these is useful. -Sandy On Jan 17, 2015, at 5:40 PM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Just wanted to ping about a minor issue - but one that ends up having consequence given Spark's volume of reviews and commits. As much as possible, I think that we should try and gear towards Google Style LGTM on reviews. What I mean by this is that LGTM has the following semantics: I know this code well, or I've looked at it close enough to feel confident it should be merged. If there are issues/bugs with this code later on, I feel confident I can help with them. Here is an alternative semantic: Based on what I know about this part of the code, I don't see any show-stopper problems with this patch. The issue with the latter is that it ultimately erodes the significance of LGTM, since subsequent reviewers need to reason about what the person meant by saying LGTM. In contrast, having strong semantics around LGTM can help streamline reviews a lot, especially as reviewers get more experienced and gain trust from the comittership. There are several easy ways to give a more limited endorsement of a patch: - I'm not familiar with this code, but style, etc look good (general endorsement) - The build changes in this code LGTM, but I haven't reviewed the rest (limited LGTM) If people are okay with this, I might add a short note on the wiki. I'm sending this e-mail first, though, to see whether anyone wants to express agreement or disagreement with this approach. - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Semantics of LGTM
Yeah, the ASF +1 has become partly overloaded to mean both I would like to see this feature and this patch should be committed, although, at least in Hadoop, using +1 on JIRA (as opposed to, say, in a release vote) should unambiguously mean the latter unless qualified in some other way. I don't have any opinion on the specific characters, but I agree with Aaron that it would be nice to have some sort of abbreviation for both the strong and weak forms of approval. -Sandy On Jan 17, 2015, at 7:25 PM, Patrick Wendell pwend...@gmail.com wrote: I think the ASF +1 is *slightly* different than Google's LGTM, because it might convey wanting the patch/feature to be merged but not necessarily saying you did a thorough review and stand behind it's technical contents. For instance, I've seen people pile on +1's to try and indicate support for a feature or patch in some projects, even though they didn't do a thorough technical review. This +1 is definitely a useful mechanism. There is definitely much overlap though in the meaning, though, and it's largely because Spark had it's own culture around reviews before it was donated to the ASF, so there is a mix of two styles. Nonetheless, I'd prefer to stick with the stronger LGTM semantics I proposed originally (unlike the one Sandy proposed, e.g.). This is what I've seen every project using the LGTM convention do (Google, and some open source projects such as Impala) to indicate technical sign-off. - Patrick On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson ilike...@gmail.com wrote: I think I've seen something like +2 = strong LGTM and +1 = weak LGTM; someone else should review before. It's nice to have a shortcut which isn't a sentence when talking about weaker forms of LGTM. On Sat, Jan 17, 2015 at 6:59 PM, sandy.r...@cloudera.com wrote: I think clarifying these semantics is definitely worthwhile. Maybe this complicates the process with additional terminology, but the way I've used these has been: +1 - I think this is safe to merge and, barring objections from others, would merge it immediately. LGTM - I have no concerns about this patch, but I don't necessarily feel qualified to make a final call about it. The TM part acknowledges the judgment as a little more subjective. I think having some concise way to express both of these is useful. -Sandy On Jan 17, 2015, at 5:40 PM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Just wanted to ping about a minor issue - but one that ends up having consequence given Spark's volume of reviews and commits. As much as possible, I think that we should try and gear towards Google Style LGTM on reviews. What I mean by this is that LGTM has the following semantics: I know this code well, or I've looked at it close enough to feel confident it should be merged. If there are issues/bugs with this code later on, I feel confident I can help with them. Here is an alternative semantic: Based on what I know about this part of the code, I don't see any show-stopper problems with this patch. The issue with the latter is that it ultimately erodes the significance of LGTM, since subsequent reviewers need to reason about what the person meant by saying LGTM. In contrast, having strong semantics around LGTM can help streamline reviews a lot, especially as reviewers get more experienced and gain trust from the comittership. 
There are several easy ways to give a more limited endorsement of a patch: - I'm not familiar with this code, but style, etc look good (general endorsement) - The build changes in this code LGTM, but I haven't reviewed the rest (limited LGTM) If people are okay with this, I might add a short note on the wiki. I'm sending this e-mail first, though, to see whether anyone wants to express agreement or disagreement with this approach. - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Spark Dev
Hi Harikrishna, A good place to start is taking a look at the wiki page on contributing: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -Sandy On Fri, Dec 19, 2014 at 2:43 PM, Harikrishna Kamepalli harikrishna.kamepa...@gmail.com wrote: i am interested to contribute to spark
Re: one hot encoding
Hi Lochana, We haven't yet added this in 1.2. https://issues.apache.org/jira/browse/SPARK-4081 tracks adding categorical feature indexing, which one-hot encoding can be built on. https://issues.apache.org/jira/browse/SPARK-1216 also tracks a version of this prior to the ML pipelines work. -Sandy On Fri, Dec 12, 2014 at 6:16 PM, Lochana Menikarachchi locha...@gmail.com wrote: Do we have one-hot encoding in spark MLLib 1.1.1 or 1.2.0 ? It wasn't available in 1.1.0. Thanks. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
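In the meantime, a rough hand-rolled sketch of one-hot encoding a single low-cardinality categorical column with the existing MLlib vector types; the column values are illustrative and an existing SparkContext sc is assumed.

import org.apache.spark.mllib.linalg.Vectors

val colors = sc.parallelize(Seq("red", "green", "blue", "green"))

// Build a category -> index map on the driver (fine for low-cardinality columns).
val index = colors.distinct().collect().sorted.zipWithIndex.toMap
val numCategories = index.size

// Each value becomes a sparse vector with a single 1.0 at its category's index.
val encoded = colors.map(c => Vectors.sparse(numCategories, Seq((index(c), 1.0))))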
Re: [VOTE] Release Apache Spark 1.2.0 (RC2)
+1 (non-binding). Tested on Ubuntu against YARN. On Thu, Dec 11, 2014 at 9:38 AM, Reynold Xin r...@databricks.com wrote: +1 Tested on OS X. On Wednesday, December 10, 2014, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.2.0! The tag to be voted on is v1.2.0-rc2 (commit a428c446e2): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.2.0-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1055/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/ Please vote on releasing this package as Apache Spark 1.2.0! The vote is open until Saturday, December 13, at 21:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.2.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == What justifies a -1 vote for this release? == This vote is happening relatively late into the QA period, so -1 votes should only occur for significant regressions from 1.0.2. Bugs already present in 1.1.X, minor regressions, or bugs related to new features will not block this release. == What default changes should I be aware of? == 1. The default value of spark.shuffle.blockTransferService has been changed to netty -- Old behavior can be restored by switching to nio 2. The default value of spark.shuffle.manager has been changed to sort. -- Old behavior can be restored by setting spark.shuffle.manager to hash. == How does this differ from RC1 == This has fixes for a handful of issues identified - some of the notable fixes are: [Core] SPARK-4498: Standalone Master can fail to recognize completed/failed applications [SQL] SPARK-4552: Query for empty parquet table in spark sql hive get IllegalArgumentException SPARK-4753: Parquet2 does not prune based on OR filters on partition columns SPARK-4761: With JDBC server, set Kryo as default serializer and disable reference tracking SPARK-4785: When called with arguments referring column fields, PMOD throws NPE - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org javascript:; For additional commands, e-mail: dev-h...@spark.apache.org javascript:;
Re: HA support for Spark
I think that if we were able to maintain the full set of created RDDs as well as some scheduler and block manager state, it would be enough for most apps to recover. On Wed, Dec 10, 2014 at 5:30 AM, Jun Feng Liu liuj...@cn.ibm.com wrote: Well, it should not be mission impossible, considering there are so many HA solutions existing today. I would be interested to know if there are any specific difficulties. Best Regards *Jun Feng Liu* IBM China Systems Technology Laboratory in Beijing -- *Phone: *86-10-82452683 *E-mail:* *liuj...@cn.ibm.com* BLD 28,ZGC Software Park No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193 China *Reynold Xin r...@databricks.com* 2014/12/10 16:30 To Jun Feng Liu/China/IBM@IBMCN, cc dev@spark.apache.org Subject Re: HA support for Spark This would be plausible for specific purposes such as Spark Streaming or Spark SQL, but I don't think it is doable for a general Spark driver, since it is just a normal JVM process with arbitrary program state. On Wed, Dec 10, 2014 at 12:25 AM, Jun Feng Liu liuj...@cn.ibm.com wrote: Do we have any high availability support at the Spark driver level? For example, if we want the Spark driver to be able to move to another node and continue execution when a failure happens. I can see that RDD checkpointing can help serialize the state of an RDD. I can imagine loading the checkpoint from another node when an error happens, but it seems we would lose track of all task statuses and even the executor information maintained in the SparkContext. I am not sure if there is any existing stuff I can leverage to do that. Thanks for any suggestions. Best Regards *Jun Feng Liu* IBM China Systems Technology Laboratory in Beijing -- *Phone: *86-10-82452683 *E-mail:* *liuj...@cn.ibm.com* BLD 28,ZGC Software Park No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193 China
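For reference, a minimal sketch of the RDD checkpoint mechanism mentioned in this thread (it persists RDD data, not scheduler or executor state); the path is illustrative and an existing SparkContext sc is assumed.

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
val data = sc.parallelize(1 to 1000000).map(_ * 2)
data.checkpoint()   // marked for checkpointing; written out when first materialized
data.count()        // materializes the RDD and saves it under the checkpoint dir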
Re: [VOTE] Release Apache Spark 1.2.0 (RC1)
+1 (non-binding) built from source fired up a spark-shell against YARN cluster ran some jobs using parallelize ran some jobs that read files clicked around the web UI On Sun, Nov 30, 2014 at 1:10 AM, GuoQiang Li wi...@qq.com wrote: +1 (non-binding) -- Original -- From: Patrick Wendell;pwend...@gmail.com; Date: Sat, Nov 29, 2014 01:16 PM To: dev@spark.apache.orgdev@spark.apache.org; Subject: [VOTE] Release Apache Spark 1.2.0 (RC1) Please vote on releasing the following candidate as Apache Spark version 1.2.0! The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.2.0-rc1/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1048/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/ Please vote on releasing this package as Apache Spark 1.2.0! The vote is open until Tuesday, December 02, at 05:15 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.1.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == What justifies a -1 vote for this release? == This vote is happening very late into the QA period compared with previous votes, so -1 votes should only occur for significant regressions from 1.0.2. Bugs already present in 1.1.X, minor regressions, or bugs related to new features will not block this release. == What default changes should I be aware of? == 1. The default value of spark.shuffle.blockTransferService has been changed to netty -- Old behavior can be restored by switching to nio 2. The default value of spark.shuffle.manager has been changed to sort. -- Old behavior can be restored by setting spark.shuffle.manager to hash. == Other notes == Because this vote is occurring over a weekend, I will likely extend the vote if this RC survives until the end of the vote period. - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
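A rough sketch of the kind of spark-shell smoke test described above; the file path is illustrative and an existing SparkContext sc (as in spark-shell) is assumed.

val nums = sc.parallelize(1 to 100000)
println(nums.map(_ * 2).filter(_ % 3 == 0).count())

val lines = sc.textFile("hdfs:///tmp/sample.txt")
println(lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _).count())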
Re: Too many open files error
Quizhang, This is a known issue that ExternalAppendOnlyMap can do tons of tiny spills in certain situations. SPARK-4452 aims to deal with this issue, but we haven't finalized a solution yet. Dinesh's solution should help as a workaround, but you'll likely experience suboptimal performance when trying to merge tons of small files from disk. -Sandy On Wed, Nov 19, 2014 at 10:10 PM, Dinesh J. Weerakkody dineshjweerakk...@gmail.com wrote: Hi Qiuzhuang, This is a linux related issue. Please go through this [1] and change the limits. hope this will solve your problem. [1] https://rtcamp.com/tutorials/linux/increase-open-files-limit/ On Thu, Nov 20, 2014 at 9:45 AM, Qiuzhuang Lian qiuzhuang.l...@gmail.com wrote: Hi All, While doing some ETL, I run into error of 'Too many open files' as following logs, Thanks, Qiuzhuang 4/11/20 20:12:02 INFO collection.ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 100.8 KB to disk (953 times so far) 14/11/20 20:12:02 ERROR storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /tmp/spark-local-20141120200455-4137/2f/temp_local_f83cbf2f-60a4-4fbd-b5d2-32a0c569311b java.io.FileNotFoundException: /tmp/spark-local-20141120200455-4137/2f/temp_local_f83cbf2f-60a4-4fbd-b5d2-32a0c569311b (Too many open files) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.init(FileOutputStream.java:221) at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(BlockObjectWriter.scala:178) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:203) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:63) at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:77) at org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:63) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:131) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at 
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 14/11/20 20:12:02 ERROR executor.Executor: Exception in task 0.0 in stage 36.0 (TID 20) java.io.FileNotFoundException:
Re: Spark Hadoop 2.5.1
You're the second person to request this today. Planning to include this in my PR for SPARK-4338. -Sandy On Nov 14, 2014, at 8:48 AM, Corey Nolet cjno...@gmail.com wrote: In the past, I've built it by providing -Dhadoop.version=2.5.1 exactly like you've mentioned. What prompted me to write this email was that I did not see any documentation that told me Hadoop 2.5.1 was officially supported by Spark (i.e. the community has been using it, any bugs are being fixed, etc...). It builds, tests pass, etc... but there could be other implications that I have not run into based on my own use of the framework. If we are saying that the standard procedure is to build with the hadoop-2.4 profile and override the -Dhadoop.version property, should we provide that in the build instructions [1] at least? [1] http://spark.apache.org/docs/latest/building-with-maven.html On Fri, Nov 14, 2014 at 10:46 AM, Sean Owen so...@cloudera.com wrote: I don't think it's necessary. You're looking at the hadoop-2.4 profile, which works with anything >= 2.4. AFAIK there is no further specialization needed beyond that. The profile sets hadoop.version to 2.4.0 by default, but this can be overridden. On Fri, Nov 14, 2014 at 3:43 PM, Corey Nolet cjno...@gmail.com wrote: I noticed Spark 1.2.0-SNAPSHOT still has 2.4.x in the pom. Since 2.5.x is the current stable Hadoop 2.x, would it make sense for us to update the poms?
Re: [NOTICE] [BUILD] Minor changes to Spark's build
Currently there are no mandatory profiles required to build Spark. I.e. mvn package just works. It seems sad that we would need to break this. On Wed, Nov 12, 2014 at 10:59 PM, Patrick Wendell pwend...@gmail.com wrote: I think printing an error that says -Pscala-2.10 must be enabled is probably okay. It's a slight regression but it's super obvious to users. That could be a more elegant solution than the somewhat complicated monstrosity I proposed on the JIRA. On Wed, Nov 12, 2014 at 10:37 PM, Prashant Sharma scrapco...@gmail.com wrote: One thing we can do it is print a helpful error and break. I don't know about how this can be done, but since now I can write groovy inside maven build so we have more control. (Yay!!) Prashant Sharma On Thu, Nov 13, 2014 at 12:05 PM, Patrick Wendell pwend...@gmail.com wrote: Yeah Sandy and I were chatting about this today and din't realize -Pscala-2.10 was mandatory. This is a fairly invasive change, so I was thinking maybe we could try to remove that. Also if someone doesn't give -Pscala-2.10 it fails in a way that is initially silent, which is bad because most people won't know to do this. https://issues.apache.org/jira/browse/SPARK-4375 On Wed, Nov 12, 2014 at 10:29 PM, Prashant Sharma scrapco...@gmail.com wrote: Thanks Patrick, I have one suggestion that we should make passing -Pscala-2.10 mandatory for maven users. I am sorry for not mentioning this before. There is no way around not passing that option for maven users(only). However, this is unnecessary for sbt users because it is added automatically if -Pscala-2.11 is absent. Prashant Sharma On Wed, Nov 12, 2014 at 3:53 PM, Sean Owen so...@cloudera.com wrote: - Tip: when you rebase, IntelliJ will temporarily think things like the Kafka module are being removed. Say 'no' when it asks if you want to remove them. - Can we go straight to Scala 2.11.4? On Wed, Nov 12, 2014 at 5:47 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, I've just merged a patch that adds support for Scala 2.11 which will have some minor implications for the build. These are due to the complexities of supporting two versions of Scala in a single project. 1. The JDBC server will now require a special flag to build -Phive-thriftserver on top of the existing flag -Phive. This is because some build permutations (only in Scala 2.11) won't support the JDBC server yet due to transitive dependency conflicts. 2. The build now uses non-standard source layouts in a few additional places (we already did this for the Hive project) - the repl and the examples modules. This is just fine for maven/sbt, but it may affect users who import the build in IDE's that are using these projects and want to build Spark from the IDE. I'm going to update our wiki to include full instructions for making this work well in IntelliJ. If there are any other build related issues please respond to this thread and we'll make sure they get sorted out. Thanks to Prashant Sharma who is the author of this feature! - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: proposal / discuss: multiple Serializers within a SparkContext?
Ah awesome. Passing customer serializers when persisting an RDD is exactly one of the things I was thinking of. -Sandy On Fri, Nov 7, 2014 at 1:19 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Yup, the JIRA for this was https://issues.apache.org/jira/browse/SPARK-540 (one of our older JIRAs). I think it would be interesting to explore this further. Basically the way to add it into the API would be to add a version of persist() that takes another class than StorageLevel, say StorageStrategy, which allows specifying a custom serializer or perhaps even a transformation to turn each partition into another representation before saving it. It would also be interesting if this could work directly on an InputStream or ByteBuffer to deal with off-heap data. One issue we've found with our current Serializer interface by the way is that a lot of type information is lost when you pass data to it, so the serializers spend a fair bit of time figuring out what class each object written is. With this model, it would be possible for a serializer to know that all its data is of one type, which is pretty cool, but we might also consider ways of expanding the current Serializer interface to take more info. Matei On Nov 7, 2014, at 1:09 AM, Reynold Xin r...@databricks.com wrote: Technically you can already do custom serializer for each shuffle operation (it is part of the ShuffledRDD). I've seen Matei suggesting on jira issues (or github) in the past a storage policy in which you can specify how data should be stored. I think that would be a great API to have in the long run. Designing it won't be trivial though. On Fri, Nov 7, 2014 at 1:05 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hey all, Was messing around with Spark and Google FlatBuffers for fun, and it got me thinking about Spark and serialization. I know there's been work / talk about in-memory columnar formats Spark SQL, so maybe there are ways to provide this flexibility already that I've missed? Either way, my thoughts: Java and Kryo serialization are really nice in that they require almost no extra work on the part of the user. They can also represent complex object graphs with cycles etc. There are situations where other serialization frameworks are more efficient: * A Hadoop Writable style format that delineates key-value boundaries and allows for raw comparisons can greatly speed up some shuffle operations by entirely avoiding deserialization until the object hits user code. Writables also probably ser / deser faster than Kryo. * No-deserialization formats like FlatBuffers and Cap'n Proto address the tradeoff between (1) Java objects that offer fast access but take lots of space and stress GC and (2) Kryo-serialized buffers that are more compact but take time to deserialize. The drawbacks of these frameworks are that they require more work from the user to define types. And that they're more restrictive in the reference graphs they can represent. In large applications, there are probably a few points where a specialized serialization format is useful. But requiring Writables everywhere because they're needed in a particularly intense shuffle is cumbersome. In light of that, would it make sense to enable varying Serializers within an app? It could make sense to choose a serialization framework both based on the objects being serialized and what they're being serialized for (caching vs. shuffle). 
It might be possible to implement this underneath the Serializer interface with some sort of multiplexing serializer that chooses between subserializers. Nothing urgent here, but curious to hear other's opinions. -Sandy
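For context, a minimal sketch of the status quo the thread wants to relax: the serializer is chosen once per application through configuration, and every cache and shuffle goes through it. The app name is made up, and this is only the baseline, not the proposed per-persist or per-shuffle API.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch of today's application-wide serializer choice.
val conf = new SparkConf()
  .setAppName("serializer-choice")                                        // hypothetical
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  // one choice for the whole app
val sc = new SparkContext(conf)

// Everything this context caches or shuffles now uses Kryo; the proposal above
// is about letting individual persist/shuffle calls pick something different.
val sums = sc.parallelize(1 to 1000).map(i => (i % 10, i)).reduceByKey(_ + _)
sums.persist()
```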
proposal / discuss: multiple Serializers within a SparkContext?
Hey all, Was messing around with Spark and Google FlatBuffers for fun, and it got me thinking about Spark and serialization. I know there's been work / talk about in-memory columnar formats Spark SQL, so maybe there are ways to provide this flexibility already that I've missed? Either way, my thoughts: Java and Kryo serialization are really nice in that they require almost no extra work on the part of the user. They can also represent complex object graphs with cycles etc. There are situations where other serialization frameworks are more efficient: * A Hadoop Writable style format that delineates key-value boundaries and allows for raw comparisons can greatly speed up some shuffle operations by entirely avoiding deserialization until the object hits user code. Writables also probably ser / deser faster than Kryo. * No-deserialization formats like FlatBuffers and Cap'n Proto address the tradeoff between (1) Java objects that offer fast access but take lots of space and stress GC and (2) Kryo-serialized buffers that are more compact but take time to deserialize. The drawbacks of these frameworks are that they require more work from the user to define types. And that they're more restrictive in the reference graphs they can represent. In large applications, there are probably a few points where a specialized serialization format is useful. But requiring Writables everywhere because they're needed in a particularly intense shuffle is cumbersome. In light of that, would it make sense to enable varying Serializers within an app? It could make sense to choose a serialization framework both based on the objects being serialized and what they're being serialized for (caching vs. shuffle). It might be possible to implement this underneath the Serializer interface with some sort of multiplexing serializer that chooses between subserializers. Nothing urgent here, but curious to hear other's opinions. -Sandy
Re: [VOTE] Designating maintainers for some Spark components
It looks like the difference between the proposed Spark model and the CloudStack / SVN model is: * In the former, maintainers / partial committers are a way of centralizing oversight over particular components among committers * In the latter, maintainers / partial committers are a way of giving non-committers some power to make changes -Sandy On Thu, Nov 6, 2014 at 5:17 PM, Corey Nolet cjno...@gmail.com wrote: PMC [1] is responsible for oversight and does not designate partial or full committer. There are projects where all committers become PMC and others where PMC is reserved for committers with the most merit (and willingness to take on the responsibility of project oversight, releases, etc...). Community maintains the codebase through committers. Committers to mentor, roll in patches, and spread the project throughout other communities. Adding someone's name to a list as a maintainer is not a barrier. With a community as large as Spark's, and myself not being a committer on this project, I see it as a welcome opportunity to find a mentor in the areas in which I'm interested in contributing. We'd expect the list of names to grow as more volunteers gain more interest, correct? To me, that seems quite contrary to a barrier. [1] http://www.apache.org/dev/pmc.html On Thu, Nov 6, 2014 at 7:49 PM, Matei Zaharia matei.zaha...@gmail.com wrote: So I don't understand, Greg, are the partial committers committers, or are they not? Spark also has a PMC, but our PMC currently consists of all committers (we decided not to have a differentiation when we left the incubator). I see the Subversion partial committers listed as committers on https://people.apache.org/committers-by-project.html#subversion, so I assume they are committers. As far as I can see, CloudStack is similar. Matei On Nov 6, 2014, at 4:43 PM, Greg Stein gst...@gmail.com wrote: Partial committers are people invited to work on a particular area, and they do not require sign-off to work on that area. They can get a sign-off and commit outside that area. That approach doesn't compare to this proposal. Full committers are PMC members. As each PMC member is responsible for *every* line of code, then every PMC member should have complete rights to every line of code. Creating disparity flies in the face of a PMC member's responsibility. If I am a Spark PMC member, then I have responsibility for GraphX code, whether my name is Ankur, Joey, Reynold, or Greg. And interposing a barrier inhibits my responsibility to ensure GraphX is designed, maintained, and delivered to the Public. 
Cheers, -g (and yes, I'm aware of COMMITTERS; I've been changing that file for the past 12 years :-) ) On Thu, Nov 6, 2014 at 6:28 PM, Patrick Wendell pwend...@gmail.com wrote: In fact, if you look at the subversion committer list, the majority of people here have commit access only for particular areas of the project: http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Greg, Regarding subversion - I think the reference is to partial vs full committers here: https://subversion.apache.org/docs/community-guide/roles.html - Patrick On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein gst...@gmail.com wrote: -1 (non-binding) This is an idea that runs COMPLETELY counter to the Apache Way, and is to be severely frowned upon. This creates *unequal* ownership of the codebase. Each Member of the PMC should have *equal* rights to all areas of the codebase under their purview. It should not be subjected to others' ownership except through the standard mechanisms of reviews and, if/when absolutely necessary, vetoes. Apache does not want leads, benevolent dictators or assigned maintainers, no matter how you may dress it up with multiple maintainers per component. The fact is that this creates an unequal level of ownership and responsibility. The Board has shut down projects that attempted or allowed for Leads. Just a few months ago, there was a problem with somebody calling themselves a Lead. I don't know why you suggest that Apache Subversion does this. We absolutely do not. Never have. Never will. The Subversion codebase is owned by all of us, and we all care for every line of it. Some people know more than others, of course. But any one of us can change any part, without being subjected to a maintainer. Of course, we ask people with more knowledge of the component when we feel uncomfortable, but we also know when it is safe or not to make a specific
Re: [VOTE] Designating maintainers for some Spark components
This seems like a good idea. An area that wasn't listed, but that I think could strongly benefit from maintainers, is the build. Having consistent oversight over Maven, SBT, and dependencies would allow us to avoid subtle breakages. Component maintainers have come up several times within the Hadoop project, and I think one of the main reasons the proposals have been rejected is that, structurally, its effect is to slow down development. As you mention, this is somewhat mitigated if being a maintainer leads committers to take on more responsibility, but it might be worthwhile to draw up more specific ideas on how to combat this? E.g. do obvious changes, doc fixes, test fixes, etc. always require a maintainer? -Sandy On Wed, Nov 5, 2014 at 5:36 PM, Michael Armbrust mich...@databricks.com wrote: +1 (binding) On Wed, Nov 5, 2014 at 5:33 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW, my own vote is obviously +1 (binding). Matei On Nov 5, 2014, at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects. As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more. In this kind of large project, one common way to scale development is to assign maintainers to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python. This is also by-and-large how Spark operates today -- most components have a de-facto maintainer. IMO, adopting this model would have two benefits: 1) Consistent oversight of design for that component, especially regarding architecture and API. This process would ensure that the component's maintainers see all proposed changes and consider them to fit together in a good way. 2) More structure for new contributors and committers -- in particular, it would be easy to look up who’s responsible for each module and ask them for reviews, etc, rather than having patches slip between the cracks. We'd like to start with in a light-weight manner, where the model only applies to certain key components (e.g. scheduler, shuffle) and user-facing APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand it if we deem it useful. The specific mechanics would be as follows: - Some components in Spark will have maintainers assigned to them, where one of the maintainers needs to sign off on each patch to the component. - Each component with maintainers will have at least 2 maintainers. - Maintainers will be assigned from the most active and knowledgeable committers on that component by the PMC. The PMC can vote to add / remove maintainers, and maintained components, through consensus. 
- Maintainers are expected to be active in responding to patches for their components, though they do not need to be the main reviewers for them (e.g. they might just sign off on architecture / API). To prevent inactive maintainers from blocking the project, if a maintainer isn't responding in a reasonable time period (say 2 weeks), other committers can merge the patch, and the PMC will want to discuss adding another maintainer. If you'd like to see examples for this model, check out the following projects: - CloudStack: https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide - Subversion: https://subversion.apache.org/docs/community-guide/roles.html https://subversion.apache.org/docs/community-guide/roles.html Finally, I wanted to list our current proposal for initial components and maintainers. It would be good to get feedback on other components we might add, but please note that personnel discussions (e.g. I don't think Matei should maintain *that* component) should only happen on the private list. The initial components were chosen to include all public APIs and the main core components, and the maintainers were chosen
Re: A couple questions about shared variables
MapReduce counters do not count duplications. In MapReduce, if a task needs to be re-run, the value of the counter from the second task overwrites the value from the first task. -Sandy On Mon, Sep 22, 2014 at 4:55 AM, Nan Zhu zhunanmcg...@gmail.com wrote: If you think it as necessary to fix, I would like to resubmit that PR (seems to have some conflicts with the current DAGScheduler) My suggestion is to make it as an option in accumulator, e.g. some algorithms utilizing accumulator for result calculation, it needs a deterministic accumulator, while others implementing something like Hadoop counters may need the current implementation (count everything happened, including the duplications) Your thoughts? -- Nan Zhu On Sunday, September 21, 2014 at 6:35 PM, Matei Zaharia wrote: Hmm, good point, this seems to have been broken by refactorings of the scheduler, but it worked in the past. Basically the solution is simple -- in a result stage, we should not apply the update for each task ID more than once -- the same way we don't call job.listener.taskSucceeded more than once. Your PR also tried to avoid this for resubmitted shuffle stages, but I don't think we need to do that necessarily (though we could). Matei On September 21, 2014 at 1:11:13 PM, Nan Zhu (zhunanmcg...@gmail.com) wrote: Hi, Matei, Can you give some hint on how the current implementation guarantee the accumulator is only applied for once? There is a pending PR trying to achieving this ( https://github.com/apache/spark/pull/228/files), but from the current implementation, I didn’t see this has been done? (maybe I missed something) Best, -- Nan Zhu On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote: Hey Sandy, On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com) wrote: Hey All, A couple questions came up about shared variables recently, and I wanted to confirm my understanding and update the doc to be a little more clear. *Broadcast variables* Now that tasks data is automatically broadcast, the only occasions where it makes sense to explicitly broadcast are: * You want to use a variable from tasks in multiple stages. * You want to have the variable stored on the executors in deserialized form. * You want tasks to be able to modify the variable and have those modifications take effect for other tasks running on the same executor (usually a very bad idea). Is that right? Yeah, pretty much. Reason 1 above is probably the biggest, but 2 also matters. (We might later factor tasks in a different way to avoid 2, but it's hard due to things like Hadoop JobConf objects in the tasks). *Accumulators* Values are only counted for successful tasks. Is that right? KMeans seems to use it in this way. What happens if a node goes away and successful tasks need to be resubmitted? Or the stage runs again because a different job needed it. Accumulators are guaranteed to give a deterministic result if you only increment them in actions. For each result stage, the accumulator's update from each task is only applied once, even if that task runs multiple times. If you use accumulators in transformations (i.e. in a stage that may be part of multiple jobs), then you may see multiple updates, from each run. This is kind of confusing but it was useful for people who wanted to use these for debugging. Matei thanks, Sandy
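A minimal sketch of the semantics Matei describes, assuming an existing SparkContext `sc`; the accumulator names and data are made up.

```scala
// Hedged sketch: updates applied in a result stage (the foreach below) are
// counted once per task even if tasks are re-run, while updates made in an
// earlier shuffle map stage (the map below) can be applied again whenever
// that stage is recomputed.
val inTransform = sc.accumulator(0, "updated in a transformation")
val inAction = sc.accumulator(0, "updated in an action")

val pairs = sc.parallelize(1 to 1000).map { x => inTransform += 1; (x % 10, x) }
val sums = pairs.reduceByKey(_ + _)
sums.foreach(_ => inAction += 1)   // deterministic: one update per result element
// If the map stage is ever recomputed (e.g. after an executor is lost),
// inTransform may count some elements more than once.
```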
Re: Spark authenticate enablement
Hi Jun, I believe that's correct that Spark authentication only works against YARN. -Sandy On Thu, Sep 11, 2014 at 2:14 AM, Jun Feng Liu liuj...@cn.ibm.com wrote: Hi there, I am trying to enable authentication on Spark in standalone mode. It seems like only SparkSubmit loads the properties from spark-defaults.conf; org.apache.spark.deploy.master.Master does not really load the default settings from spark-defaults.conf. Does it mean that Spark authentication only works for the YARN mode? Or have I missed something with standalone mode? Best Regards *Jun Feng Liu* IBM China Systems Technology Laboratory in Beijing -- *Phone: *86-10-82452683 * E-mail:* *liuj...@cn.ibm.com* BLD 28,ZGC Software Park No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193 China
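For reference, a hedged sketch of the two properties involved, with illustrative values. On YARN the shared secret is generated and distributed by the framework, so only spark.authenticate is needed; on standalone, the gap described above, the secret would also have to be configured consistently on every daemon.

```scala
import org.apache.spark.SparkConf

// Hedged sketch with made-up values; on YARN only spark.authenticate is required.
val conf = new SparkConf()
  .set("spark.authenticate", "true")
  .set("spark.authenticate.secret", "not-a-real-secret")  // needed outside YARN
```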
Reporting serialized task size after task broadcast change?
After the change to broadcast all task data, is there any easy way to discover the serialized size of the data getting sent down for a task? thanks, -Sandy
Re: Reporting serialized task size after task broadcast change?
It used to be available on the UI, no? On Thu, Sep 11, 2014 at 6:26 PM, Reynold Xin r...@databricks.com wrote: I don't think so. We should probably add a line to log it. On Thursday, September 11, 2014, Sandy Ryza sandy.r...@cloudera.com wrote: After the change to broadcast all task data, is there any easy way to discover the serialized size of the data getting sent down for a task? thanks, -Sandy
Re: Reporting serialized task size after task broadcast change?
Hmm, well I can't find it now, must have been hallucinating. Do you know off the top of your head where I'd be able to find the size to log it? On Thu, Sep 11, 2014 at 6:33 PM, Reynold Xin r...@databricks.com wrote: I didn't know about that On Thu, Sep 11, 2014 at 6:29 PM, Sandy Ryza sandy.r...@cloudera.com wrote: It used to be available on the UI, no? On Thu, Sep 11, 2014 at 6:26 PM, Reynold Xin r...@databricks.com wrote: I don't think so. We should probably add a line to log it. On Thursday, September 11, 2014, Sandy Ryza sandy.r...@cloudera.com wrote: After the change to broadcast all task data, is there any easy way to discover the serialized size of the data getting sent down for a task? thanks, -Sandy
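Absent a built-in log line, one hedged way to approximate this from application code is to serialize a closure with the same closure serializer Spark uses; this goes through SparkEnv, a developer-level API, only measures the function rather than the full task, and the captured `lookup` map is made up to give the closure some weight.

```scala
import org.apache.spark.SparkEnv

// Hedged sketch: estimate how large a closure serializes to. Assumes a running
// driver (so SparkEnv.get is populated); `lookup` is an illustrative captured value.
val lookup = (1 to 100000).map(i => i -> i.toString).toMap
val func = (x: Int) => lookup.getOrElse(x, "").length + x

val ser = SparkEnv.get.closureSerializer.newInstance()
val bytes = ser.serialize(func)
println(s"closure serializes to ${bytes.limit()} bytes")
```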
Re: Lost executor on YARN ALS iterations
That's right On Tue, Sep 9, 2014 at 2:04 PM, Debasish Das debasish.da...@gmail.com wrote: Last time it did not show up on environment tab but I will give it another shot...Expected behavior is that this env variable will show up right ? On Tue, Sep 9, 2014 at 12:15 PM, Sandy Ryza sandy.r...@cloudera.com wrote: I would expect 2 GB would be enough or more than enough for 16 GB executors (unless ALS is using a bunch of off-heap memory?). You mentioned earlier in this thread that the property wasn't showing up in the Environment tab. Are you sure it's making it in? -Sandy On Tue, Sep 9, 2014 at 11:58 AM, Debasish Das debasish.da...@gmail.com wrote: Hmm...I did try it increase to few gb but did not get a successful run yet... Any idea if I am using say 40 executors, each running 16GB, what's the typical spark.yarn.executor.memoryOverhead for say 100M x 10 M large matrices with say few billion ratings... On Tue, Sep 9, 2014 at 10:49 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Deb, The current state of the art is to increase spark.yarn.executor.memoryOverhead until the job stops failing. We do have plans to try to automatically scale this based on the amount of memory requested, but it will still just be a heuristic. -Sandy On Tue, Sep 9, 2014 at 7:32 AM, Debasish Das debasish.da...@gmail.com wrote: Hi Sandy, Any resolution for YARN failures ? It's a blocker for running spark on top of YARN. Thanks. Deb On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng men...@gmail.com wrote: Hi Deb, I think this may be the same issue as described in https://issues.apache.org/jira/browse/SPARK-2121 . We know that the container got killed by YARN because it used much more memory that it requested. But we haven't figured out the root cause yet. +Sandy Best, Xiangrui On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, During the 4th ALS iteration, I am noticing that one of the executor gets disconnected: 14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found 14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5 disconnected, so removing it 14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost executor 5 on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client disassociated 14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 12) Any idea if this is a bug related to akka on YARN ? I am using master Thanks. Deb
Re: Lost executor on YARN ALS iterations
Hi Deb, The current state of the art is to increase spark.yarn.executor.memoryOverhead until the job stops failing. We do have plans to try to automatically scale this based on the amount of memory requested, but it will still just be a heuristic. -Sandy On Tue, Sep 9, 2014 at 7:32 AM, Debasish Das debasish.da...@gmail.com wrote: Hi Sandy, Any resolution for YARN failures ? It's a blocker for running spark on top of YARN. Thanks. Deb On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng men...@gmail.com wrote: Hi Deb, I think this may be the same issue as described in https://issues.apache.org/jira/browse/SPARK-2121 . We know that the container got killed by YARN because it used much more memory that it requested. But we haven't figured out the root cause yet. +Sandy Best, Xiangrui On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, During the 4th ALS iteration, I am noticing that one of the executor gets disconnected: 14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found 14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5 disconnected, so removing it 14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost executor 5 on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client disassociated 14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 12) Any idea if this is a bug related to akka on YARN ? I am using master Thanks. Deb
Re: Lost executor on YARN ALS iterations
I would expect 2 GB would be enough or more than enough for 16 GB executors (unless ALS is using a bunch of off-heap memory?). You mentioned earlier in this thread that the property wasn't showing up in the Environment tab. Are you sure it's making it in? -Sandy On Tue, Sep 9, 2014 at 11:58 AM, Debasish Das debasish.da...@gmail.com wrote: Hmm...I did try it increase to few gb but did not get a successful run yet... Any idea if I am using say 40 executors, each running 16GB, what's the typical spark.yarn.executor.memoryOverhead for say 100M x 10 M large matrices with say few billion ratings... On Tue, Sep 9, 2014 at 10:49 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Deb, The current state of the art is to increase spark.yarn.executor.memoryOverhead until the job stops failing. We do have plans to try to automatically scale this based on the amount of memory requested, but it will still just be a heuristic. -Sandy On Tue, Sep 9, 2014 at 7:32 AM, Debasish Das debasish.da...@gmail.com wrote: Hi Sandy, Any resolution for YARN failures ? It's a blocker for running spark on top of YARN. Thanks. Deb On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng men...@gmail.com wrote: Hi Deb, I think this may be the same issue as described in https://issues.apache.org/jira/browse/SPARK-2121 . We know that the container got killed by YARN because it used much more memory that it requested. But we haven't figured out the root cause yet. +Sandy Best, Xiangrui On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, During the 4th ALS iteration, I am noticing that one of the executor gets disconnected: 14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found 14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5 disconnected, so removing it 14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost executor 5 on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client disassociated 14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 12) Any idea if this is a bug related to akka on YARN ? I am using master Thanks. Deb
Re: about spark assembly jar
This doesn't help for every dependency, but Spark provides an option to build the assembly jar without Hadoop and its dependencies. We make use of this in CDH packaging. -Sandy On Tue, Sep 2, 2014 at 2:12 AM, scwf wangf...@huawei.com wrote: Hi sean owen, here are some problems when i used assembly jar 1 i put spark-assembly-*.jar to the lib directory of my application, it throw compile error Error:scalac: Error: class scala.reflect.BeanInfo not found. scala.tools.nsc.MissingRequirementError: class scala.reflect.BeanInfo not found. at scala.tools.nsc.symtab.Definitions$definitions$. getModuleOrClass(Definitions.scala:655) at scala.tools.nsc.symtab.Definitions$definitions$. getClass(Definitions.scala:608) at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator. init(GenJVM.scala:127) at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM. scala:85) at scala.tools.nsc.Global$Run.compileSources(Global.scala:953) at scala.tools.nsc.Global$Run.compile(Global.scala:1041) at xsbt.CachedCompiler0.run(CompilerInterface.scala:126) at xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:102) at xsbt.CachedCompiler0.run(CompilerInterface.scala:102) at xsbt.CompilerInterface.run(CompilerInterface.scala:27) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke( NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke( DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at sbt.compiler.AnalyzingCompiler.call( AnalyzingCompiler.scala:102) at sbt.compiler.AnalyzingCompiler.compile( AnalyzingCompiler.scala:48) at sbt.compiler.AnalyzingCompiler.compile( AnalyzingCompiler.scala:41) at org.jetbrains.jps.incremental.scala.local. IdeaIncrementalCompiler.compile(IdeaIncrementalCompiler.scala:28) at org.jetbrains.jps.incremental.scala.local.LocalServer. compile(LocalServer.scala:25) at org.jetbrains.jps.incremental.scala.remote.Main$.make(Main. scala:58) at org.jetbrains.jps.incremental.scala.remote.Main$.nailMain( Main.scala:21) at org.jetbrains.jps.incremental.scala.remote.Main.nailMain( Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke( NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke( DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.martiansoftware.nailgun.NGSession.run(NGSession.java:319) 2 i test my branch which updated hive version to org.apache.hive 0.13.1 it run successfully when use a bag of 3rd jars as dependency but throw error using assembly jar, it seems assembly jar lead to conflict ERROR DDLTask: java.lang.NoSuchFieldError: doubleTypeInfo at org.apache.hadoop.hive.ql.io.parquet.serde. ArrayWritableObjectInspector.getObjectInspector( ArrayWritableObjectInspector.java:66) at org.apache.hadoop.hive.ql.io.parquet.serde. ArrayWritableObjectInspector.init(ArrayWritableObjectInspector.java:59) at org.apache.hadoop.hive.ql.io.parquet.serde. ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113) at org.apache.hadoop.hive.metastore.MetaStoreUtils. getDeserializer(MetaStoreUtils.java:339) at org.apache.hadoop.hive.ql.metadata.Table. 
getDeserializerFromMetaStore(Table.java:283) at org.apache.hadoop.hive.ql.metadata.Table.checkValidity( Table.java:189) at org.apache.hadoop.hive.ql.metadata.Hive.createTable( Hive.java:597) at org.apache.hadoop.hive.ql.exec.DDLTask.createTable( DDLTask.java:4194) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask. java:281) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential( TaskRunner.java:85) On 2014/9/2 16:45, Sean Owen wrote: Hm, are you suggesting that the Spark distribution be a bag of 100 JARs? It doesn't quite seem reasonable. It does not remove version conflicts, just pushes them to run-time, which isn't good. The assembly is also necessary because that's where shading happens. In development, you want to run against exactly what will be used in a real Spark distro. On Tue, Sep 2, 2014 at 9:39 AM, scwf wangf...@huawei.com wrote: hi, all I suggest spark not use assembly jar as default run-time dependency(spark-submit/spark-class depend on assembly jar),use a library of all 3rd dependency jar like hadoop/hive/hbase more reasonable. 1 assembly jar packaged all 3rd jars into a big one, so we need rebuild this jar if we want to update the version of some component(such as hadoop) 2 in
Re: Lost executor on YARN ALS iterations
Hi Debasish, The fix is to raise spark.yarn.executor.memoryOverhead until this goes away. This controls the buffer between the JVM heap size and the amount of memory requested from YARN (JVMs can take up memory beyond their heap size). You should also make sure that, in the YARN NodeManager configuration, yarn.nodemanager.vmem-check-enabled is set to false. -Sandy On Wed, Aug 20, 2014 at 12:27 AM, Debasish Das debasish.da...@gmail.com wrote: I could reproduce the issue in both 1.0 and 1.1 using YARN...so this is definitely a YARN related problem... At least for me right now only deployment option possible is standalone... On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng men...@gmail.com wrote: Hi Deb, I think this may be the same issue as described in https://issues.apache.org/jira/browse/SPARK-2121 . We know that the container got killed by YARN because it used much more memory that it requested. But we haven't figured out the root cause yet. +Sandy Best, Xiangrui On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, During the 4th ALS iteration, I am noticing that one of the executor gets disconnected: 14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found 14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5 disconnected, so removing it 14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost executor 5 on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client disassociated 14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 12) Any idea if this is a bug related to akka on YARN ? I am using master Thanks. Deb
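A hedged sketch of that workaround with illustrative numbers: 16 GB executor heaps plus a 2 GB cushion requested from YARN for off-heap usage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch; the app name and sizes are made up. memoryOverhead is in MB,
// and the advice above is to keep raising it until containers stop being killed.
val conf = new SparkConf()
  .setAppName("als-on-yarn")
  .set("spark.executor.memory", "16g")
  .set("spark.yarn.executor.memoryOverhead", "2048")
val sc = new SparkContext(conf)
```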
Re: Fine-Grained Scheduler on Yarn
Hi Jun, Spark currently doesn't have that feature, i.e. it aims for a fixed number of executors per application regardless of resource usage, but it's definitely worth considering. We could start more executors when we have a large backlog of tasks and shut some down when we're underutilized. The fine-grained task scheduling is blocked on work from YARN that will allow changing the CPU allocation of a YARN container dynamically. The relevant JIRA for this dependency is YARN-1197, though YARN-1488 might serve this purpose as well if it comes first. -Sandy On Thu, Aug 7, 2014 at 10:56 PM, Jun Feng Liu liuj...@cn.ibm.com wrote: Thanks for echo on this. Possible to adjust resource based on container numbers? e.g to allocate more container when driver need more resources and return some resource by delete some container when parts of container already have enough cores/memory Best Regards *Jun Feng Liu* IBM China Systems Technology Laboratory in Beijing -- [image: 2D barcode - encoded with contact information] *Phone: *86-10-82452683 * E-mail:* *liuj...@cn.ibm.com* liuj...@cn.ibm.com [image: IBM] BLD 28,ZGC Software Park No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193 China *Patrick Wendell pwend...@gmail.com pwend...@gmail.com* 2014/08/08 13:10 To Jun Feng Liu/China/IBM@IBMCN, cc dev@spark.apache.org dev@spark.apache.org Subject Re: Fine-Grained Scheduler on Yarn Hey sorry about that - what I said was the opposite of what is true. The current YARN mode is equivalent to coarse grained mesos. There is no fine-grained scheduling on YARN at the moment. I'm not sure YARN supports scheduling in units other than containers. Fine-grained scheduling requires scheduling at the granularity of individual cores. On Thu, Aug 7, 2014 at 9:43 PM, Patrick Wendell *pwend...@gmail.com* pwend...@gmail.com wrote: The current YARN is equivalent to what is called fine grained mode in Mesos. The scheduling of tasks happens totally inside of the Spark driver. On Thu, Aug 7, 2014 at 7:50 PM, Jun Feng Liu *liuj...@cn.ibm.com* liuj...@cn.ibm.com wrote: Any one know the answer? Best Regards * Jun Feng Liu* IBM China Systems Technology Laboratory in Beijing -- *Phone: *86-10-82452683 * E-mail:* *liuj...@cn.ibm.com* liuj...@cn.ibm.com BLD 28,ZGC Software Park No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193 China *Jun Feng Liu/China/IBM* 2014/08/07 15:37 To *dev@spark.apache.org* dev@spark.apache.org, cc Subject Fine-Grained Scheduler on Yarn Hi, there Just aware right now Spark only support fine grained scheduler on Mesos with MesosSchedulerBackend. The Yarn schedule sounds like only works on coarse-grained model. Is there any plan to implement fine-grained scheduler for YARN? Or there is any technical issue block us to do that. Best Regards * Jun Feng Liu* IBM China Systems Technology Laboratory in Beijing -- *Phone: *86-10-82452683 * E-mail:* *liuj...@cn.ibm.com* liuj...@cn.ibm.com BLD 28,ZGC Software Park No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193 China
Re: Fine-Grained Scheduler on Yarn
I think that would be useful work. I don't know the minute details of this code, but in general TaskSchedulerImpl keeps track of pending tasks. Tasks are organized into TaskSets, each of which corresponds to a particular stage. Each TaskSet has a TaskSetManager, which directly tracks the pending tasks for that stage. -Sandy On Fri, Aug 8, 2014 at 12:37 AM, Jun Feng Liu liuj...@cn.ibm.com wrote: Yes, I think we need both level resource control (container numbers and dynamically change container resources), which can make the resource utilization much more effective, especially when we have more types work load share the same infrastructure. Is there anyway I can observe the tasks backlog in schedulerbackend? Sounds like scheduler backend be triggered during new taskset submitted. I did not figured if there is a way to check the whole backlog tasks inside it. I am interesting to implement some policy in schedulerbackend and test to see how useful it is going to be. Best Regards *Jun Feng Liu* IBM China Systems Technology Laboratory in Beijing -- [image: 2D barcode - encoded with contact information] *Phone: *86-10-82452683 * E-mail:* *liuj...@cn.ibm.com* liuj...@cn.ibm.com [image: IBM] BLD 28,ZGC Software Park No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193 China *Sandy Ryza sandy.r...@cloudera.com sandy.r...@cloudera.com* 2014/08/08 15:14 To Jun Feng Liu/China/IBM@IBMCN, cc Patrick Wendell pwend...@gmail.com, dev@spark.apache.org dev@spark.apache.org Subject Re: Fine-Grained Scheduler on Yarn Hi Jun, Spark currently doesn't have that feature, i.e. it aims for a fixed number of executors per application regardless of resource usage, but it's definitely worth considering. We could start more executors when we have a large backlog of tasks and shut some down when we're underutilized. The fine-grained task scheduling is blocked on work from YARN that will allow changing the CPU allocation of a YARN container dynamically. The relevant JIRA for this dependency is YARN-1197, though YARN-1488 might serve this purpose as well if it comes first. -Sandy On Thu, Aug 7, 2014 at 10:56 PM, Jun Feng Liu liuj...@cn.ibm.com wrote: Thanks for echo on this. Possible to adjust resource based on container numbers? e.g to allocate more container when driver need more resources and return some resource by delete some container when parts of container already have enough cores/memory Best Regards *Jun Feng Liu* IBM China Systems Technology Laboratory in Beijing -- [image: 2D barcode - encoded with contact information] *Phone: *86-10-82452683 * E-mail:* *liuj...@cn.ibm.com* liuj...@cn.ibm.com [image: IBM] BLD 28,ZGC Software Park No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193 China *Patrick Wendell pwend...@gmail.com pwend...@gmail.com* 2014/08/08 13:10 To Jun Feng Liu/China/IBM@IBMCN, cc dev@spark.apache.org dev@spark.apache.org Subject Re: Fine-Grained Scheduler on Yarn Hey sorry about that - what I said was the opposite of what is true. The current YARN mode is equivalent to coarse grained mesos. There is no fine-grained scheduling on YARN at the moment. I'm not sure YARN supports scheduling in units other than containers. Fine-grained scheduling requires scheduling at the granularity of individual cores. On Thu, Aug 7, 2014 at 9:43 PM, Patrick Wendell *pwend...@gmail.com* pwend...@gmail.com wrote: The current YARN is equivalent to what is called fine grained mode in Mesos. The scheduling of tasks happens totally inside of the Spark driver. 
On Thu, Aug 7, 2014 at 7:50 PM, Jun Feng Liu *liuj...@cn.ibm.com* liuj...@cn.ibm.com wrote: Any one know the answer? Best Regards * Jun Feng Liu* IBM China Systems Technology Laboratory in Beijing -- *Phone: *86-10-82452683 * E-mail:* *liuj...@cn.ibm.com* liuj...@cn.ibm.com BLD 28,ZGC Software Park No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193 China *Jun Feng Liu/China/IBM* 2014/08/07 15:37 To *dev@spark.apache.org* dev@spark.apache.org, cc Subject Fine-Grained Scheduler on Yarn Hi, there Just aware right now Spark only support fine grained scheduler on Mesos with MesosSchedulerBackend. The Yarn schedule sounds like only works on coarse-grained model. Is there any plan to implement fine-grained scheduler for YARN? Or there is any technical issue block us to do that. Best Regards * Jun Feng Liu* IBM China Systems Technology Laboratory in Beijing -- *Phone: *86-10-82452683 * E-mail:* *liuj...@cn.ibm.com* liuj...@cn.ibm.com BLD 28,ZGC Software Park No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193 China
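For the question of observing the task backlog from outside the scheduler backend, one hedged option is a SparkListener that counts submitted versus finished tasks, as a rough proxy for the pending-task state that TaskSchedulerImpl and TaskSetManager track internally; this is only a sketch and ignores stage resubmission and task failures.

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted, SparkListenerTaskEnd}

// Hedged sketch: approximate the backlog from listener events. Register it with
// sc.addSparkListener(new BacklogListener) and poll backlog from a policy loop.
class BacklogListener extends SparkListener {
  private val submitted = new AtomicLong(0)
  private val finished = new AtomicLong(0)

  override def onStageSubmitted(stage: SparkListenerStageSubmitted): Unit = {
    submitted.addAndGet(stage.stageInfo.numTasks)
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    finished.incrementAndGet()
  }

  def backlog: Long = submitted.get() - finished.get()
}
```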
Re: RFC: Supporting the Scala drop Method for Spark RDDs
Yeah, the input format doesn't support this behavior. But it does tell you the byte position of each record in the file. On Mon, Jul 21, 2014 at 10:55 PM, Reynold Xin r...@databricks.com wrote: Yes, that could work. But it is not as simple as just a binary flag. We might want to skip the first row for every file, or the header only for the first file. The former is not really supported out of the box by the input format I think? On Mon, Jul 21, 2014 at 10:50 PM, Sandy Ryza sandy.r...@cloudera.com wrote: It could make sense to add a skipHeader argument to SparkContext.textFile? On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin r...@databricks.com wrote: If the purpose is for dropping csv headers, perhaps we don't really need a common drop and only one that drops the first line in a file? I'd really try hard to avoid a common drop/dropWhile because they can be expensive to do. Note that I think we will be adding this functionality (ignoring headers) to the CsvRDD functionality in Spark SQL. https://github.com/apache/spark/pull/1351 On Mon, Jul 21, 2014 at 1:45 PM, Mark Hamstra m...@clearstorydata.com wrote: You can find some of the prior, related discussion here: https://issues.apache.org/jira/browse/SPARK-1021 On Mon, Jul 21, 2014 at 1:25 PM, Erik Erlandson e...@redhat.com wrote: - Original Message - Rather than embrace non-lazy transformations and add more of them, I'd rather we 1) try to fully characterize the needs that are driving their creation/usage; and 2) design and implement new Spark abstractions that will allow us to meet those needs and eliminate existing non-lazy transformation. In the case of drop, obtaining the index of the boundary partition can be viewed as the action forcing compute -- one that happens to be invoked inside of a transform. The concept of a lazy action, that is only triggered if the result rdd has compute invoked on it, might be sufficient to restore laziness to the drop transform. For that matter, I might find some way to make use of Scala lazy values directly and achieve the same goal for drop. They really mess up things like creation of asynchronous FutureActions, job cancellation and accounting of job resource usage, etc., so I'd rather we seek a way out of the existing hole rather than make it deeper. On Mon, Jul 21, 2014 at 10:24 AM, Erik Erlandson e...@redhat.com wrote: - Original Message - Sure, drop() would be useful, but breaking the transformations are lazy; only actions launch jobs model is abhorrent -- which is not to say that we haven't already broken that model for useful operations (cf. RangePartitioner, which is used for sorted RDDs), but rather that each such exception to the model is a significant source of pain that can be hard to work with or work around. A thought that comes to my mind here is that there are in fact already two categories of transform: ones that are truly lazy, and ones that are not. A possible option is to embrace that, and commit to documenting the two categories as such, with an obvious bias towards favoring lazy transforms (to paraphrase Churchill, we're down to haggling over the price). I really wouldn't like to see another such model-breaking transformation added to the API. On the other hand, being able to write transformations with dependencies on these kind of internal jobs is sometimes very useful, so a significant reworking of Spark's Dependency model that would allow for lazily running such internal jobs and making the results available to subsequent stages may be something worth pursuing. 
This seems like a very interesting angle. I don't have much feel for what a solution would look like, but it sounds as if it would involve caching all operations embodied by RDD transform method code for provisional execution. I believe that these levels of invocation are currently executed in the master, not executor nodes. On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash and...@andrewash.com wrote: Personally I'd find the method useful -- I've often had a .csv file with a header row that I want to drop so filter it out, which touches all partitions anyway. I don't have any comments on the implementation quite yet though
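One hedged sketch of a header-dropping workaround that already works with today's API (distinct from both a general drop(n) and the filter approach Andrew mentions): skip the first record of the first partition only. The path is made up and an existing SparkContext `sc` is assumed.

```scala
// Hedged sketch: drop only the file header by index, touching just partition 0.
val lines = sc.textFile("hdfs:///data/input.csv")
val withoutHeader = lines.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
```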
Re: RFC: Supporting the Scala drop Method for Spark RDDs
It could make sense to add a skipHeader argument to SparkContext.textFile? On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin r...@databricks.com wrote: If the purpose is for dropping csv headers, perhaps we don't really need a common drop and only one that drops the first line in a file? I'd really try hard to avoid a common drop/dropWhile because they can be expensive to do. Note that I think we will be adding this functionality (ignoring headers) to the CsvRDD functionality in Spark SQL. https://github.com/apache/spark/pull/1351 On Mon, Jul 21, 2014 at 1:45 PM, Mark Hamstra m...@clearstorydata.com wrote: You can find some of the prior, related discussion here: https://issues.apache.org/jira/browse/SPARK-1021 On Mon, Jul 21, 2014 at 1:25 PM, Erik Erlandson e...@redhat.com wrote: - Original Message - Rather than embrace non-lazy transformations and add more of them, I'd rather we 1) try to fully characterize the needs that are driving their creation/usage; and 2) design and implement new Spark abstractions that will allow us to meet those needs and eliminate existing non-lazy transformation. In the case of drop, obtaining the index of the boundary partition can be viewed as the action forcing compute -- one that happens to be invoked inside of a transform. The concept of a lazy action, that is only triggered if the result rdd has compute invoked on it, might be sufficient to restore laziness to the drop transform. For that matter, I might find some way to make use of Scala lazy values directly and achieve the same goal for drop. They really mess up things like creation of asynchronous FutureActions, job cancellation and accounting of job resource usage, etc., so I'd rather we seek a way out of the existing hole rather than make it deeper. On Mon, Jul 21, 2014 at 10:24 AM, Erik Erlandson e...@redhat.com wrote: - Original Message - Sure, drop() would be useful, but breaking the transformations are lazy; only actions launch jobs model is abhorrent -- which is not to say that we haven't already broken that model for useful operations (cf. RangePartitioner, which is used for sorted RDDs), but rather that each such exception to the model is a significant source of pain that can be hard to work with or work around. A thought that comes to my mind here is that there are in fact already two categories of transform: ones that are truly lazy, and ones that are not. A possible option is to embrace that, and commit to documenting the two categories as such, with an obvious bias towards favoring lazy transforms (to paraphrase Churchill, we're down to haggling over the price). I really wouldn't like to see another such model-breaking transformation added to the API. On the other hand, being able to write transformations with dependencies on these kind of internal jobs is sometimes very useful, so a significant reworking of Spark's Dependency model that would allow for lazily running such internal jobs and making the results available to subsequent stages may be something worth pursuing. This seems like a very interesting angle. I don't have much feel for what a solution would look like, but it sounds as if it would involve caching all operations embodied by RDD transform method code for provisional execution. I believe that these levels of invocation are currently executed in the master, not executor nodes. 
On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash and...@andrewash.com wrote: Personally I'd find the method useful -- I've often had a .csv file with a header row that I want to drop, so I filter it out, which touches all partitions anyway. I don't have any comments on the implementation quite yet, though. On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson e...@redhat.com wrote: A few weeks ago I submitted a PR for supporting rdd.drop(n), under SPARK-2315: https://issues.apache.org/jira/browse/SPARK-2315 Supporting the drop method would make some operations convenient, however it forces computation of >= 1 partition of the parent RDD, and so it would behave like a partial action that returns an RDD as the result. I wrote up a discussion of these trade-offs here: http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
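For readers landing on this thread later: absent an rdd.drop(n), the workaround commonly used for the CSV-header case is mapPartitionsWithIndex, which stays lazy. This is only a sketch, assuming the header lives in the first partition; the file path is a placeholder.

val lines = sc.textFile("hdfs:///data/input.csv")
// Drop the first line of partition 0 only; no other partitions are touched and nothing is computed eagerly.
val withoutHeader = lines.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}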
Re: Possible bug in ClientBase.scala?
/YarnAllocationHandler.scala:508: not found: type ContainerRequest [error] ): ArrayBuffer[ContainerRequest] = { [error]^ [error] /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:446: not found: type ContainerRequest [error] val hostContainerRequests = new ArrayBuffer[ContainerRequest](preferredHostToCount.size) [error] ^ [error] /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:458: not found: type ContainerRequest [error] val rackContainerRequests: List[ContainerRequest] = createRackResourceRequests( [error] ^ [error] /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:467: not found: type ContainerRequest [error] val containerRequestBuffer = new ArrayBuffer[ContainerRequest]( [error] ^ [error] /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:542: not found: type ContainerRequest [error] ): ArrayBuffer[ContainerRequest] = { [error]^ [error] /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:545: value newInstance is not a member of object org.apache.hadoop.yarn.api.records.Resource [error] val resource = Resource.newInstance(memoryRequest, executorCores) [error] ^ [error] /Users/chester/projects/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:550: not found: type ContainerRequest [error] val requests = new ArrayBuffer[ContainerRequest]() [error]^ [error] 40 errors found [error] (yarn-stable/compile:compile) Compilation failed [error] Total time: 98 s, completed Jul 16, 2014 5:14:44 PM On Wed, Jul 16, 2014 at 4:19 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Ron, I just checked and this bug is fixed in recent releases of Spark. -Sandy On Sun, Jul 13, 2014 at 8:15 PM, Chester Chen ches...@alpinenow.com wrote: Ron, Which distribution and Version of Hadoop are you using ? I just looked at CDH5 ( hadoop-mapreduce-client-core- 2.3.0-cdh5.0.0), MRJobConfig does have the field : java.lang.String DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH; Chester On Sun, Jul 13, 2014 at 6:49 PM, Ron Gonzalez zlgonza...@yahoo.com wrote: Hi, I was doing programmatic submission of Spark yarn jobs and I saw code in ClientBase.getDefaultYarnApplicationClasspath(): val field = classOf[MRJobConfig].getField(DEFAULT_YARN_APPLICATION_CLASSPATH) MRJobConfig doesn't have this field so the created launch env is incomplete. Workaround is to set yarn.application.classpath with the value from YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH. This results in having the spark job hang if the submission config is different from the default config. For example, if my resource manager port is 8050 instead of 8030, then the spark app is not able to register itself and stays in ACCEPTED state. I can easily fix this by changing this to YarnConfiguration instead of MRJobConfig but was wondering what the steps are for submitting a fix. Thanks, Ron Sent from my iPhone
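For context, a minimal sketch of the workaround Ron describes: look the default application classpath up on YarnConfiguration reflectively instead of MRJobConfig, falling back gracefully when the field is absent. This is an illustrative sketch, not the actual ClientBase code; the helper name is made up.

import org.apache.hadoop.yarn.conf.YarnConfiguration
import scala.util.Try

// Resolve the default YARN application classpath across Hadoop versions that may
// or may not expose the constant; returns None when the field does not exist.
def defaultYarnAppClasspath: Option[Seq[String]] = Try {
  val field = classOf[YarnConfiguration].getField("DEFAULT_YARN_APPLICATION_CLASSPATH")
  field.get(null).asInstanceOf[Array[String]].toSeq
}.toOption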
Re: better compression codecs for shuffle blocks?
Stephen, Often the shuffle is bound by writes to disk, so even if disks have enough space to store the uncompressed data, the shuffle can complete faster by writing less data. Reynold, This isn't a big help in the short term, but if we switch to a sort-based shuffle, we'll only need a single LZFOutputStream per map task. On Mon, Jul 14, 2014 at 3:30 PM, Stephen Haberman stephen.haber...@gmail.com wrote: Just a comment from the peanut gallery, but these buffers are a real PITA for us as well. Probably 75% of our non-user-error job failures are related to them. Just naively, what about not doing compression on the fly? E.g. during the shuffle just write straight to disk, uncompressed? For us, we always have plenty of disk space, and if you're concerned about network transmission, you could add a separate compress step after the blocks have been written to disk, but before being sent over the wire. Granted, IANAE, so perhaps this is a bad idea; either way, awesome to see work in this area! - Stephen
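For anyone who wants to experiment along the lines discussed above, the codec and shuffle-compression knobs are ordinary Spark configs. The sketch below only shows where they are set; choosing Snappy (or disabling compression) is an example, not a recommendation from this thread.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Pick the codec used for shuffle blocks and other compressed I/O.
  .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
// Alternative: skip shuffle compression entirely, per Stephen's suggestion, and pay in disk/network instead of buffers.
// conf.set("spark.shuffle.compress", "false")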
Re: Changes to sbt build have been merged
Woot! On Thu, Jul 10, 2014 at 11:15 AM, Patrick Wendell patr...@databricks.com wrote: Just a heads up, we merged Prashant's work on having the sbt build read all dependencies from Maven. Please report any issues you find on the dev list or on JIRA. One note here for developers, going forward the sbt build will use the same configuration style as the maven build (-D for options and -P for maven profiles). So this will be a change for developers: sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly For now, we'll continue to support the old env-var options with a deprecation warning. - Patrick
Re: Data Locality In Spark
Hi Anish, Spark, like MapReduce, makes an effort to schedule tasks on the same nodes and racks that the input blocks reside on. -Sandy On Tue, Jul 8, 2014 at 12:27 PM, anishs...@yahoo.co.in anishs...@yahoo.co.in wrote: Hi All My apologies for very basic question, do we have full support of data locality in Spark MapReduce. Please suggest. -- Anish Sneh Experience is the best teacher. http://in.linkedin.com/in/anishsneh
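A quick way to see this locality information for yourself: each partition of a Hadoop-backed RDD carries the hosts of its underlying block, which Spark uses when scheduling tasks. A small sketch; the HDFS path is a placeholder.

val rdd = sc.textFile("hdfs:///data/input")
rdd.partitions.foreach { p =>
  // preferredLocations returns the hosts Spark will try to schedule the partition's task on.
  println(p.index + " -> " + rdd.preferredLocations(p).mkString(", "))
}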
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Having a common framework for clustering makes sense to me. While we should be careful about what algorithms we include, having solid implementations of minibatch clustering and hierarchical clustering seems like a worthwhile goal, and we should reuse as much code and APIs as reasonable. On Tue, Jul 8, 2014 at 1:19 PM, RJ Nowling rnowl...@gmail.com wrote: Thanks, Hector! Your feedback is useful. On Tuesday, July 8, 2014, Hector Yee hector@gmail.com wrote: I would say for bigdata applications the most useful would be hierarchical k-means with back tracking and the ability to support k nearest centroids. On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, MLlib currently has one clustering algorithm implementation, KMeans. It would benefit from having implementations of other clustering algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical Clustering, and Affinity Propagation. I recently submitted a PR [1] for a MiniBatch KMeans implementation, and I saw an email on this list about interest in implementing Fuzzy C-Means. Based on Sean Owen's review of my MiniBatch KMeans code, it became apparent that before I implement more clustering algorithms, it would be useful to hammer out a framework to reduce code duplication and implement a consistent API. I'd like to gauge the interest and goals of the MLlib community: 1. Are you interested in having more clustering algorithms available? 2. Is the community interested in specifying a common framework? Thanks! RJ [1] - https://github.com/apache/spark/pull/1248 -- em rnowl...@gmail.com c 954.496.2314 -- Yee Yang Li Hector http://google.com/+HectorYee -- em rnowl...@gmail.com c 954.496.2314
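For readers who want the concrete starting point being discussed, this is roughly what the one clustering API MLlib ships today looks like; any shared framework would presumably keep a similarly small surface. A sketch only: the path and parameter values are placeholders.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated numeric features into MLlib vectors.
val points = sc.textFile("hdfs:///data/points")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()
// k and maxIterations are arbitrary example values.
val model = KMeans.train(points, k = 3, maxIterations = 20)
model.clusterCenters.foreach(println)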
Re: Contributing to MLlib on GLM
Hi Xiaokai, I think MLLib is definitely interested in supporting additional GLMs. I'm not aware of anybody working on this at the moment. -Sandy On Tue, Jun 17, 2014 at 5:00 PM, Xiaokai Wei x...@palantir.com wrote: Hi, I am an intern at PalantirTech and we are building some stuff on top of MLlib. In particular, GLM is of great interest to us. While GeneralizedLinearModel in MLlib 1.0.0 covers some important GLMs such as Logistic Regression and Linear Regression, some other important GLMs like Poisson Regression are still missing. I am curious whether anyone is already working on other GLMs (e.g. Poisson, Gamma). If not, we would like to contribute to MLlib on GLM. Is adding more GLMs on the roadmap of MLlib? Sincerely, Xiaokai
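For context, the GLMs that do exist in MLlib 1.0.0 are exposed through companion-object train methods like the sketch below; a Poisson or Gamma regression would presumably slot into the same GeneralizedLinearAlgorithm pattern. Paths and parameter values here are placeholders, not from this thread.

import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors

// Each input line: label followed by comma-separated features.
val training = sc.textFile("hdfs:///data/labeled").map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache()
val model = LinearRegressionWithSGD.train(training, numIterations = 100)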
Re: Please change instruction about Launching Applications Inside the Cluster
They should be - in the sense that the docs now recommend using spark-submit and thus include entirely different invocations. On Fri, May 30, 2014 at 12:46 AM, Reynold Xin r...@databricks.com wrote: Can you take a look at the latest Spark 1.0 docs and see if they are fixed? https://github.com/apache/spark/tree/master/docs Thanks. On Thu, May 29, 2014 at 5:29 AM, Lizhengbing (bing, BIPA) zhengbing...@huawei.com wrote: The instructions are at http://spark.apache.org/docs/0.9.0/spark-standalone.html#launching-applications-inside-the-cluster or http://spark.apache.org/docs/0.9.1/spark-standalone.html#launching-applications-inside-the-cluster The original instruction is: ./bin/spark-class org.apache.spark.deploy.Client launch [client-options] \ cluster-url application-jar-url main-class \ [application-options] If I follow this instruction, my program does not run properly when deployed on a Spark standalone cluster. Based on the source code, this instruction should be changed to ./bin/spark-class org.apache.spark.deploy.Client [client-options] launch \ cluster-url application-jar-url main-class \ [application-options] That is to say: [client-options] must be put ahead of launch.
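For anyone who lands on the old 0.9.x page: under the 1.0 docs the equivalent is a spark-submit invocation along these lines (the class name, master URL, and jar path below are placeholders, not taken from this thread):

./bin/spark-submit --class com.example.MyApp --master spark://host:7077 --deploy-mode cluster /path/to/app.jar [application-options]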
Re: [VOTE] Release Apache Spark 1.0.0 (RC11)
+1 On Mon, May 26, 2014 at 7:38 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has a few important bug fixes on top of rc10: SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853 SPARK-1870: https://github.com/apache/spark/pull/848 SPARK-1897: https://github.com/apache/spark/pull/849 The tag to be voted on is v1.0.0-rc11 (commit c69d97cd): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~tdas/spark-1.0.0-rc11/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/tdas.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1019/ The documentation corresponding to this release can be found at: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Thursday, May 29, at 16:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. Changes to ML vector specification: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10 Changes to the Java API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark Changes to the streaming API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x Changes to the GraphX API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 Other changes: coGroup and related functions now return Iterable[T] instead of Seq[T] ==> Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] ==> Call toSeq on the result to restore old behavior
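To make the coGroup note above concrete, a small migration sketch (the RDD contents are made up; the import is only needed for the pair-RDD implicits in a compiled app): code that relied on Seq semantics can simply call toSeq on the Iterables it now gets back.

import org.apache.spark.SparkContext._

val rdd1 = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val rdd2 = sc.parallelize(Seq(("a", "x"), ("b", "y")))
// In 1.0 this is RDD[(String, (Iterable[Int], Iterable[String]))] rather than Seqs.
val grouped = rdd1.cogroup(rdd2)
val asSeqs = grouped.mapValues { case (ints, strs) => (ints.toSeq, strs.toSeq) }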
Re: Calling external classes added by sc.addJar needs to be through reflection
This will solve the issue for jars added upon application submission, but, on top of this, we need to make sure that anything dynamically added through sc.addJar works as well. To do so, we need to make sure that any jars retrieved via the driver's HTTP server are loaded by the same classloader that loads the jars given on app submission. To achieve this, we need to either use the same classloader for both system jars and user jars, or make sure that the user jars given on app submission are under the same classloader used for dynamically added jars. On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng men...@gmail.com wrote: Talked with Sandy and DB offline. I think the best solution is sending the secondary jars to the distributed cache of all containers rather than just the master, and set the classpath to include spark jar, primary app jar, and secondary jars before executor starts. In this way, user only needs to specify secondary jars via --jars instead of calling sc.addJar inside the code. It also solves the scalability problem of serving all the jars via http. If this solution sounds good, I can try to make a patch. Best, Xiangrui On Mon, May 19, 2014 at 10:04 PM, DB Tsai dbt...@stanford.edu wrote: In 1.0, there is a new option for users to choose which classloader has higher priority via spark.files.userClassPathFirst, I decided to submit the PR for 0.9 first. We use this patch in our lab and we can use those jars added by sc.addJar without reflection. https://github.com/apache/spark/pull/834 Can anyone comment if it's a good approach? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, May 19, 2014 at 7:42 PM, DB Tsai dbt...@stanford.edu wrote: Good summary! We fixed it in branch 0.9 since our production is still in 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for 1.0 tonight. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza sandy.r...@cloudera.com wrote: It just hit me why this problem is showing up on YARN and not on standalone. The relevant difference between YARN and standalone is that, on YARN, the app jar is loaded by the system classloader instead of Spark's custom URL classloader. On YARN, the system classloader knows about [the classes in the spark jars, the classes in the primary app jar]. The custom classloader knows about [the classes in secondary app jars] and has the system classloader as its parent. A few relevant facts (mostly redundant with what Sean pointed out): * Every class has a classloader that loaded it. * When an object of class B is instantiated inside of class A, the classloader used for loading B is the classloader that was used for loading A. * When a classloader fails to load a class, it lets its parent classloader try. If its parent succeeds, its parent becomes the classloader that loaded it. So suppose class B is in a secondary app jar and class A is in the primary app jar: 1. The custom classloader will try to load class A. 2. It will fail, because it only knows about the secondary jars. 3. It will delegate to its parent, the system classloader. 4. The system classloader will succeed, because it knows about the primary app jar. 5. A's classloader will be the system classloader. 6. A tries to instantiate an instance of class B. 7. B will be loaded with A's classloader, which is the system classloader. 8. 
Loading B will fail, because A's classloader, which is the system classloader, doesn't know about the secondary app jars. In Spark standalone, A and B are both loaded by the custom classloader, so this issue doesn't come up. -Sandy On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell pwend...@gmail.com wrote: Having a user define a custom class inside of an added jar and instantiate it directly inside of an executor is definitely supported in Spark and has been for a really long time (several years). This is something we do all the time in Spark. DB - I'd hold off on a re-architecting of this until we identify exactly what is causing the bug you are running into. In a nutshell, when the bytecode new Foo() is run on the executor, it will ask the driver for the class over HTTP using a custom classloader. Something in that pipeline is breaking here, possibly related to the YARN deployment stuff. On Mon, May 19, 2014 at 12:29 AM, Sean Owen so...@cloudera.com wrote: I don't think a custom classloader is necessary. Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc all run custom user code that creates new user objects without
Re: Calling external classes added by sc.addJar needs to be through reflection
Is that an assumption we can make? I think we'd run into an issue in this situation: *In primary jar:* def makeDynamicObject(clazz: String) = Class.forName(clazz).newInstance() *In app code:* sc.addJar("dynamicjar.jar") ... rdd.map(x => makeDynamicObject("some.class.from.DynamicJar")) It might be fair to say that the user should make sure to use the context classloader when instantiating dynamic classes, but I think it's weird that this code would work on Spark standalone but not on YARN. -Sandy On Wed, May 21, 2014 at 2:10 PM, Xiangrui Meng men...@gmail.com wrote: I think adding jars dynamically should work as long as the primary jar and the secondary jars do not depend on dynamically added jars, which should be the correct logic. -Xiangrui On Wed, May 21, 2014 at 1:40 PM, DB Tsai dbt...@stanford.edu wrote: This will be another separate story. Since in the yarn deployment, as Sandy said, the app.jar will always be in the system classloader, which means any object instantiated in app.jar will have the system classloader as its parent loader instead of the custom one. As a result, the custom class loader in yarn will never work without specifically using reflection. The solution would be to not use the system classloader in the classloader hierarchy, and to add all the resources in the system one into the custom one. This is the approach Tomcat takes. Or we can directly overwrite the system class loader by calling the protected method `addURL`, which will not work and will throw an exception if the code is wrapped in a security manager. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza sandy.r...@cloudera.com wrote: This will solve the issue for jars added upon application submission, but, on top of this, we need to make sure that anything dynamically added through sc.addJar works as well. To do so, we need to make sure that any jars retrieved via the driver's HTTP server are loaded by the same classloader that loads the jars given on app submission. To achieve this, we need to either use the same classloader for both system jars and user jars, or make sure that the user jars given on app submission are under the same classloader used for dynamically added jars. On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng men...@gmail.com wrote: Talked with Sandy and DB offline. I think the best solution is sending the secondary jars to the distributed cache of all containers rather than just the master, and set the classpath to include spark jar, primary app jar, and secondary jars before executor starts. In this way, user only needs to specify secondary jars via --jars instead of calling sc.addJar inside the code. It also solves the scalability problem of serving all the jars via http. If this solution sounds good, I can try to make a patch. Best, Xiangrui On Mon, May 19, 2014 at 10:04 PM, DB Tsai dbt...@stanford.edu wrote: In 1.0, there is a new option for users to choose which classloader has higher priority via spark.files.userClassPathFirst, I decided to submit the PR for 0.9 first. We use this patch in our lab and we can use those jars added by sc.addJar without reflection. https://github.com/apache/spark/pull/834 Can anyone comment if it's a good approach? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, May 19, 2014 at 7:42 PM, DB Tsai dbt...@stanford.edu wrote: Good summary! We fixed it in branch 0.9 since our production is still in 0.9.
I'm porting it to 1.0 now, and hopefully will submit PR for 1.0 tonight. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza sandy.r...@cloudera.com wrote: It just hit me why this problem is showing up on YARN and not on standalone. The relevant difference between YARN and standalone is that, on YARN, the app jar is loaded by the system classloader instead of Spark's custom URL classloader. On YARN, the system classloader knows about [the classes in the spark jars, the classes in the primary app jar]. The custom classloader knows about [the classes in secondary app jars] and has the system classloader as its parent. A few relevant facts (mostly redundant with what Sean pointed out): * Every class has a classloader that loaded it. * When an object of class B is instantiated inside of class A, the classloader used for loading B
Re: [VOTE] Release Apache Spark 1.0.0 (RC10)
+1 On Tue, May 20, 2014 at 5:26 PM, Andrew Or and...@databricks.com wrote: +1 2014-05-20 13:13 GMT-07:00 Tathagata Das tathagata.das1...@gmail.com: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has a few bug fixes on top of rc9: SPARK-1875: https://github.com/apache/spark/pull/824 SPARK-1876: https://github.com/apache/spark/pull/819 SPARK-1878: https://github.com/apache/spark/pull/822 SPARK-1879: https://github.com/apache/spark/pull/823 The tag to be voted on is v1.0.0-rc10 (commit d8070234): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d807023479ce10aec28ef3c1ab646ddefc2e663c The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~tdas/spark-1.0.0-rc10/ The release artifacts are signed with the following key: https://people.apache.org/keys/committer/tdas.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1018/ The documentation corresponding to this release can be found at: http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/ The full list of changes in this release can be found at: https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=d807023479ce10aec28ef3c1ab646ddefc2e663c Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Friday, May 23, at 20:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. Changes to ML vector specification: http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/mllib-guide.html#from-09-to-10 Changes to the Java API: http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark Changes to the streaming API: http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x Changes to the GraphX API: http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 Other changes: coGroup and related functions now return Iterable[T] instead of Seq[T] ==> Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] ==> Call toSeq on the result to restore old behavior
Re: Calling external classes added by sc.addJar needs to be through reflection
or Tomcat where your app is loaded in a child ClassLoader, and you reference a class that Hadoop or Tomcat also has (like a lib class) you will get the container's version!) When you load an external JAR it has a separate ClassLoader which does not necessarily bear any relation to the one containing your app classes, so yeah it is not generally going to make new Foo work. Reflection lets you pick the ClassLoader, yes. I would not call setContextClassLoader. On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza sandy.r...@cloudera.com wrote: I spoke with DB offline about this a little while ago and he confirmed that he was able to access the jar from the driver. The issue appears to be a general Java issue: you can't directly instantiate a class from a dynamically loaded jar. I reproduced it locally outside of Spark with: --- URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new File(myotherjar.jar).toURI().toURL() }, null); Thread.currentThread().setContextClassLoader(urlClassLoader); MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar(); --- I was able to load the class with reflection.
Re: Calling external classes added by sc.addJar needs to be through reflection
Hey Xiangrui, If the jars are placed in the distributed cache and loaded statically, as the primary app jar is in YARN, then it shouldn't be an issue. Other jars, however, including additional jars that are sc.addJar'd and jars specified with the spark-submit --jars argument, are loaded dynamically by executors with a URLClassLoader. These jars aren't next to the executors when they start - the executors fetch them from the driver's HTTP server. On Sun, May 18, 2014 at 4:05 PM, Xiangrui Meng men...@gmail.com wrote: Hi Sandy, It is hard to imagine that a user needs to create an object in that way. Since the jars are already in distributed cache before the executor starts, is there any reason we cannot add the locally cached jars to classpath directly? Best, Xiangrui On Sun, May 18, 2014 at 4:00 PM, Sandy Ryza sandy.r...@cloudera.com wrote: I spoke with DB offline about this a little while ago and he confirmed that he was able to access the jar from the driver. The issue appears to be a general Java issue: you can't directly instantiate a class from a dynamically loaded jar. I reproduced it locally outside of Spark with: --- URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new File(myotherjar.jar).toURI().toURL() }, null); Thread.currentThread().setContextClassLoader(urlClassLoader); MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar(); --- I was able to load the class with reflection. On Sun, May 18, 2014 at 11:58 AM, Patrick Wendell pwend...@gmail.com wrote: @db - it's possible that you aren't including the jar in the classpath of your driver program (I think this is what mridul was suggesting). It would be helpful to see the stack trace of the CNFE. - Patrick On Sun, May 18, 2014 at 11:54 AM, Patrick Wendell pwend...@gmail.com wrote: @xiangrui - we don't expect these to be present on the system classpath, because they get dynamically added by Spark (e.g. your application can call sc.addJar well after the JVM's have started). @db - I'm pretty surprised to see that behavior. It's definitely not intended that users need reflection to instantiate their classes - something odd is going on in your case. If you could create an isolated example and post it to the JIRA, that would be great. On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng men...@gmail.com wrote: I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870 DB, could you add more info to that JIRA? Thanks! -Xiangrui On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng men...@gmail.com wrote: Btw, I tried rdd.map { i = System.getProperty(java.class.path) }.collect() but didn't see the jars added via --jars on the executor classpath. -Xiangrui On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng men...@gmail.com wrote: I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The reflection approach mentioned by DB didn't work either. I checked the distributed cache on a worker node and found the jar there. It is also in the Environment tab of the WebUI. The workaround is making an assembly jar. DB, could you create a JIRA and describe what you have found so far? Thanks! Best, Xiangrui On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan mri...@gmail.com wrote: Can you try moving your mapPartitions to another class/object which is referenced only after sc.addJar ? I would suspect CNFEx is coming while loading the class containing mapPartitions before addJars is executed. 
In general though, dynamic loading of classes means you use reflection to instantiate it since expectation is you don't know which implementation provides the interface ... If you statically know it apriori, you bundle it in your classpath. Regards Mridul On 17-May-2014 7:28 am, DB Tsai dbt...@stanford.edu wrote: Finally find a way out of the ClassLoader maze! It took me some times to understand how it works; I think it worths to document it in a separated thread. We're trying to add external utility.jar which contains CSVRecordParser, and we added the jar to executors through sc.addJar APIs. If the instance of CSVRecordParser is created without reflection, it raises *ClassNotFound Exception*. data.mapPartitions(lines = { val csvParser = new CSVRecordParser((delimiter.charAt(0)) lines.foreach(line = { val lineElems = csvParser.parseLine(line) }) ... ... ) If the instance of CSVRecordParser is created through reflection, it works. data.mapPartitions(lines = { val loader = Thread.currentThread.getContextClassLoader val CSVRecordParser = loader.loadClass(com.alpine.hadoop.ext.CSVRecordParser) val csvParser = CSVRecordParser.getConstructor(Character.TYPE
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
+1 (non-binding) * Built the release from source. * Compiled Java and Scala apps that interact with HDFS against it. * Ran them in local mode. * Ran them against a pseudo-distributed YARN cluster in both yarn-client mode and yarn-cluster mode. On Tue, May 13, 2014 at 9:09 PM, witgo wi...@qq.com wrote: You need to set: spark.akka.frameSize 5 spark.default.parallelism 1 -- Original -- From: Madhu;ma...@madhu.com; Date: Wed, May 14, 2014 09:15 AM To: devd...@spark.incubator.apache.org; Subject: Re: [VOTE] Release Apache Spark 1.0.0 (rc5) I just built rc5 on Windows 7 and tried to reproduce the problem described in https://issues.apache.org/jira/browse/SPARK-1712 It works on my machine: 14/05/13 21:06:47 INFO DAGScheduler: Stage 1 (sum at <console>:17) finished in 4.548 s 14/05/13 21:06:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 14/05/13 21:06:47 INFO SparkContext: Job finished: sum at <console>:17, took 4.814991993 s res1: Double = 5.05E11 I used all defaults, no config files were changed. Not sure if that makes a difference... -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6560.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
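If it helps anyone reproducing this, the properties witgo mentions can also be set programmatically; the values below are just the ones quoted in the reply, and the app name is a placeholder.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("spark-1712-repro")
  .set("spark.akka.frameSize", "5")
  .set("spark.default.parallelism", "1")
val sc = new SparkContext(conf)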
Re: Apache Spark running out of the spark shell
Hi AJ, You might find this helpful - http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/ -Sandy On Sat, May 3, 2014 at 8:42 AM, Ajay Nair prodig...@gmail.com wrote: Hi, I have written code that works just about fine in the spark shell on EC2. The ec2 script helped me configure my master and worker nodes. Now I want to run the Scala Spark code outside the interactive shell. How do I go about doing it? I was referring to the instructions mentioned here: https://spark.apache.org/docs/0.9.1/quick-start.html But this is confusing because it mentions a simple project jar file which I am not sure how to generate. I only have the file that runs directly on my spark shell. Any easy instructions to get this quickly running as a job? Thanks AJ -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Spark-running-out-of-the-spark-shell-tp6459.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
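A minimal sketch of what the quick-start means by a project jar, in case it helps: an sbt build file plus a SparkContext that ships the packaged jar to the workers. Names, versions, and the master URL below are examples for the 0.9.1 line, not taken from this thread.

// build.sbt, then run `sbt package`
name := "simple-spark-app"
version := "0.1"
scalaVersion := "2.10.3"
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"

// In the application itself, list the packaged jar so executors can fetch the class files.
import org.apache.spark.SparkContext
val sc = new SparkContext("spark://<master>:7077", "SimpleApp", System.getenv("SPARK_HOME"),
  Seq("target/scala-2.10/simple-spark-app_2.10-0.1.jar"))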
Re: Any plans for new clustering algorithms?
Thanks Matei. I added a section How to contribute page. On Mon, Apr 21, 2014 at 7:25 PM, Matei Zaharia matei.zaha...@gmail.comwrote: The wiki is actually maintained separately in https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. We restricted editing of the wiki because bots would automatically add stuff. I've given you permissions now. Matei On Apr 21, 2014, at 6:22 PM, Nan Zhu zhunanmcg...@gmail.com wrote: I thought those are files of spark.apache.org? -- Nan Zhu On Monday, April 21, 2014 at 9:09 PM, Xiangrui Meng wrote: The markdown files are under spark/docs. You can submit a PR for changes. -Xiangrui On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza sandy.r...@cloudera.com(mailto: sandy.r...@cloudera.com) wrote: How do I get permissions to edit the wiki? On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng men...@gmail.com(mailto: men...@gmail.com) wrote: Cannot agree more with your words. Could you add one section about how and what to contribute to MLlib's guide? -Xiangrui On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath nick.pentre...@gmail.com (mailto:nick.pentre...@gmail.com) wrote: I'd say a section in the how to contribute page would be a good place to put this. In general I'd say that the criteria for inclusion of an algorithm is it should be high quality, widely known, used and accepted (citations and concrete use cases as examples of this), scalable and parallelizable, well documented and with reasonable expectation of dev support Sent from my iPhone On 21 Apr 2014, at 19:59, Sandy Ryza sandy.r...@cloudera.com(mailto: sandy.r...@cloudera.com) wrote: If it's not done already, would it make sense to codify this philosophy somewhere? I imagine this won't be the first time this discussion comes up, and it would be nice to have a doc to point to. I'd be happy to take a stab at this. On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng men...@gmail.com(mailto: men...@gmail.com) wrote: +1 on Sean's comment. MLlib covers the basic algorithms but we definitely need to spend more time on how to make the design scalable. For example, think about current ProblemWithAlgorithm naming scheme. That being said, new algorithms are welcomed. I wish they are well-established and well-understood by users. They shouldn't be research algorithms tuned to work well with a particular dataset but not tested widely. You see the change log from Mahout: === The following algorithms that were marked deprecated in 0.8 have been removed in 0.9: From Clustering: Switched LDA implementation from using Gibbs Sampling to Collapsed Variational Bayes (CVB) Meanshift MinHash - removed due to poor performance, lack of support and lack of usage From Classification (both are sequential implementations) Winnow - lack of actual usage and support Perceptron - lack of actual usage and support Collaborative Filtering SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and org.apache.mahout.cf.taste.impl.recommender.slopeone Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender Mahout Math Hadoop entropy stuff in org.apache.mahout.math.stats.entropy === In MLlib, we should include the algorithms users know how to use and we can provide support rather than letting algorithms come and go. 
My $0.02, Xiangrui On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen so...@cloudera.com(mailto: so...@cloudera.com) wrote: On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us(mailto: p...@mult.ifario.us) wrote: - MLlib as Mahout.next would be a unfortunate. There are some gems in Mahout, but there are also lots of rocks. Setting a minimal bar of working, correctly implemented, and documented requires a surprising amount of work. As someone with first-hand knowledge, this is correct. To Sang's question, I can't see value in 'porting' Mahout since it is based on a quite different paradigm. About the only part that translates is the algorithm concept itself. This is also the cautionary tale. The contents of the project have ended up being a number of drive-by contributions of implementations that, while individually perhaps brilliant (perhaps), didn't necessarily match any other implementation in structure, input/output, libraries used. The implementations were often a touch academic. The result was hard to document, maintain, evolve or use. Far more of the structure of the MLlib implementations are consistent by virtue of being built around Spark core already. That's great. One can't wait to completely build the foundation before building any implementations. To me, the existing implementations are almost exactly the basics I
Re: all values for a key must fit in memory
Thanks Matei and Mridul - was basically wondering whether we would be able to change the shuffle to accommodate this after 1.0, and from your answers it sounds like we can. On Mon, Apr 21, 2014 at 12:31 AM, Mridul Muralidharan mri...@gmail.comwrote: As Matei mentioned, the Values is now an Iterable : which can be disk backed. Does that not address the concern ? @Patrick - we do have cases where the length of the sequence is large and size per value is also non trivial : so we do need this :-) Note that join is a trivial example where this is required (in our current implementation). Regards, Mridul On Mon, Apr 21, 2014 at 6:25 AM, Sandy Ryza sandy.r...@cloudera.com wrote: The issue isn't that the Iterator[P] can't be disk-backed. It's that, with a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read into memory at once. The ShuffledRDD is agnostic to what goes inside P. On Sun, Apr 20, 2014 at 11:36 AM, Mridul Muralidharan mri...@gmail.com wrote: An iterator does not imply data has to be memory resident. Think merge sort output as an iterator (disk backed). Tom is actually planning to work on something similar with me on this hopefully this or next month. Regards, Mridul On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hey all, After a shuffle / groupByKey, Hadoop MapReduce allows the values for a key to not all fit in memory. The current ShuffleFetcher.fetch API, which doesn't distinguish between keys and values, only returning an Iterator[P], seems incompatible with this. Any thoughts on how we could achieve parity here? -Sandy
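Not a substitute for changing the shuffle itself, but worth noting alongside this thread: when a full value list per key isn't actually needed, the usual way to stay within memory today is to fold values as they arrive with reduceByKey or combineByKey rather than groupByKey. The data below is made up, and the import is only needed for the pair-RDD implicits in a compiled app.

import org.apache.spark.SparkContext._

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
// Combines map-side and never materializes all of a key's values at once.
val sums = pairs.reduceByKey(_ + _)
// combineByKey covers cases where the result type differs from the value type.
val valueSets = pairs.combineByKey(
  (v: Int) => Set(v),
  (s: Set[Int], v: Int) => s + v,
  (s1: Set[Int], s2: Set[Int]) => s1 ++ s2)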
Re: Any plans for new clustering algorithms?
How do I get permissions to edit the wiki? On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng men...@gmail.com wrote: Cannot agree more with your words. Could you add one section about how and what to contribute to MLlib's guide? -Xiangrui On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath nick.pentre...@gmail.com wrote: I'd say a section in the how to contribute page would be a good place to put this. In general I'd say that the criteria for inclusion of an algorithm is it should be high quality, widely known, used and accepted (citations and concrete use cases as examples of this), scalable and parallelizable, well documented and with reasonable expectation of dev support Sent from my iPhone On 21 Apr 2014, at 19:59, Sandy Ryza sandy.r...@cloudera.com wrote: If it's not done already, would it make sense to codify this philosophy somewhere? I imagine this won't be the first time this discussion comes up, and it would be nice to have a doc to point to. I'd be happy to take a stab at this. On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng men...@gmail.com wrote: +1 on Sean's comment. MLlib covers the basic algorithms but we definitely need to spend more time on how to make the design scalable. For example, think about current ProblemWithAlgorithm naming scheme. That being said, new algorithms are welcomed. I wish they are well-established and well-understood by users. They shouldn't be research algorithms tuned to work well with a particular dataset but not tested widely. You see the change log from Mahout: === The following algorithms that were marked deprecated in 0.8 have been removed in 0.9: From Clustering: Switched LDA implementation from using Gibbs Sampling to Collapsed Variational Bayes (CVB) Meanshift MinHash - removed due to poor performance, lack of support and lack of usage From Classification (both are sequential implementations) Winnow - lack of actual usage and support Perceptron - lack of actual usage and support Collaborative Filtering SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and org.apache.mahout.cf.taste.impl.recommender.slopeone Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender Mahout Math Hadoop entropy stuff in org.apache.mahout.math.stats.entropy === In MLlib, we should include the algorithms users know how to use and we can provide support rather than letting algorithms come and go. My $0.02, Xiangrui On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen so...@cloudera.com wrote: On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us wrote: - MLlib as Mahout.next would be a unfortunate. There are some gems in Mahout, but there are also lots of rocks. Setting a minimal bar of working, correctly implemented, and documented requires a surprising amount of work. As someone with first-hand knowledge, this is correct. To Sang's question, I can't see value in 'porting' Mahout since it is based on a quite different paradigm. About the only part that translates is the algorithm concept itself. This is also the cautionary tale. The contents of the project have ended up being a number of drive-by contributions of implementations that, while individually perhaps brilliant (perhaps), didn't necessarily match any other implementation in structure, input/output, libraries used. The implementations were often a touch academic. The result was hard to document, maintain, evolve or use. 
Far more of the structure of the MLlib implementations are consistent by virtue of being built around Spark core already. That's great. One can't wait to completely build the foundation before building any implementations. To me, the existing implementations are almost exactly the basics I would choose. They cover the bases and will exercise the abstractions and structure. So that's also great IMHO.
Re: Any plans for new clustering algorithms?
I thought this might be a good thing to add to the wiki's How to contribute pagehttps://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark, as it's not tied to a release. On Mon, Apr 21, 2014 at 6:09 PM, Xiangrui Meng men...@gmail.com wrote: The markdown files are under spark/docs. You can submit a PR for changes. -Xiangrui On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza sandy.r...@cloudera.com wrote: How do I get permissions to edit the wiki? On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng men...@gmail.com wrote: Cannot agree more with your words. Could you add one section about how and what to contribute to MLlib's guide? -Xiangrui On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath nick.pentre...@gmail.com wrote: I'd say a section in the how to contribute page would be a good place to put this. In general I'd say that the criteria for inclusion of an algorithm is it should be high quality, widely known, used and accepted (citations and concrete use cases as examples of this), scalable and parallelizable, well documented and with reasonable expectation of dev support Sent from my iPhone On 21 Apr 2014, at 19:59, Sandy Ryza sandy.r...@cloudera.com wrote: If it's not done already, would it make sense to codify this philosophy somewhere? I imagine this won't be the first time this discussion comes up, and it would be nice to have a doc to point to. I'd be happy to take a stab at this. On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng men...@gmail.com wrote: +1 on Sean's comment. MLlib covers the basic algorithms but we definitely need to spend more time on how to make the design scalable. For example, think about current ProblemWithAlgorithm naming scheme. That being said, new algorithms are welcomed. I wish they are well-established and well-understood by users. They shouldn't be research algorithms tuned to work well with a particular dataset but not tested widely. You see the change log from Mahout: === The following algorithms that were marked deprecated in 0.8 have been removed in 0.9: From Clustering: Switched LDA implementation from using Gibbs Sampling to Collapsed Variational Bayes (CVB) Meanshift MinHash - removed due to poor performance, lack of support and lack of usage From Classification (both are sequential implementations) Winnow - lack of actual usage and support Perceptron - lack of actual usage and support Collaborative Filtering SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and org.apache.mahout.cf.taste.impl.recommender.slopeone Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender Mahout Math Hadoop entropy stuff in org.apache.mahout.math.stats.entropy === In MLlib, we should include the algorithms users know how to use and we can provide support rather than letting algorithms come and go. My $0.02, Xiangrui On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen so...@cloudera.com wrote: On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us wrote: - MLlib as Mahout.next would be a unfortunate. There are some gems in Mahout, but there are also lots of rocks. Setting a minimal bar of working, correctly implemented, and documented requires a surprising amount of work. As someone with first-hand knowledge, this is correct. To Sang's question, I can't see value in 'porting' Mahout since it is based on a quite different paradigm. About the only part that translates is the algorithm concept itself. This is also the cautionary tale. 
The contents of the project have ended up being a number of drive-by contributions of implementations that, while individually perhaps brilliant (perhaps), didn't necessarily match any other implementation in structure, input/output, libraries used. The implementations were often a touch academic. The result was hard to document, maintain, evolve or use. Far more of the structure of the MLlib implementations are consistent by virtue of being built around Spark core already. That's great. One can't wait to completely build the foundation before building any implementations. To me, the existing implementations are almost exactly the basics I would choose. They cover the bases and will exercise the abstractions and structure. So that's also great IMHO.
Re: all values for a key must fit in memory
The issue isn't that the Iterator[P] can't be disk-backed. It's that, with a groupBy, each P is a (Key, Values) tuple, and the entire tuple is read into memory at once. The ShuffledRDD is agnostic to what goes inside P. On Sun, Apr 20, 2014 at 11:36 AM, Mridul Muralidharan mri...@gmail.comwrote: An iterator does not imply data has to be memory resident. Think merge sort output as an iterator (disk backed). Tom is actually planning to work on something similar with me on this hopefully this or next month. Regards, Mridul On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hey all, After a shuffle / groupByKey, Hadoop MapReduce allows the values for a key to not all fit in memory. The current ShuffleFetcher.fetch API, which doesn't distinguish between keys and values, only returning an Iterator[P], seems incompatible with this. Any thoughts on how we could achieve parity here? -Sandy
Re: Suggestion
Hi Priya, Here's a good place to start: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -Sandy On Fri, Apr 11, 2014 at 12:05 PM, priya arora arora.priya4...@gmail.comwrote: Hi, May I know how one can contribute in this project http://spark.apache.org/mllib/ or in any other project. I am very eager to contribute. Do let me know. Thanks Regards, Priya Arora
Re: cloudera repo down again - mqtt
Our guys are looking into it. I'll post when things are back up. -Sandy On Mar 14, 2014, at 7:37 AM, Tom Graves tgraves...@yahoo.com wrote: It appears the cloudera repo for the mqtt stuff is down again. Did someone ping them the last time? Can we pick this up from some other repo? [ERROR] Failed to execute goal org.apache.maven.plugins:maven-remote-resources-plugin:1.4:process (default) on project spark-examples_2.10: Error resolving project artifact: Could not transfer artifact org.eclipse.paho:mqtt-client:pom:0.4.0 from/to cloudera-repo (https://repository.cloudera.com/artifactory/cloudera-repos): peer not authenticated for project org.eclipse.paho:mqtt-client:jar:0.4.0 Tom
Re: Assigning JIRA's to self
In the mean time, you don't need to wait for the task to be assigned to you to start work. If you're worried about someone else picking it up, you can drop a short comment on the JIRA saying that you're working on it. On Wed, Mar 12, 2014 at 3:25 PM, Konstantin Boudnik c...@apache.org wrote: Someone with proper karma needs to add you to the contributor list in the JIRA. Cos On Wed, Mar 12, 2014 at 02:19PM, Sujeet Varakhedi wrote: Hi, I am new to Spark and would like to contribute. I wanted to assign a task to myself but looks like I do not have permission. What is the process if I want to work on a JIRA? Thanks Sujeet
Re: YARN Maven build questions
Hi Lars, Unfortunately, due to some incompatible changes we pulled in to be closer to YARN trunk, Spark-on-YARN does not work against CDH 4.4+ (but does work against CDH5) -Sandy On Tue, Mar 4, 2014 at 6:33 AM, Tom Graves tgraves...@yahoo.com wrote: What is your question about Any hints? The maven build worked for me yesterday again fine. You should create a jira for any pull request like the documentation states. The jira thing is new so I think people are still getting used to it. Tom On Tuesday, March 4, 2014 2:51 AM, Lars Francke lars.fran...@gmail.com wrote: Hi, sorry to bother again. As a newbie to the project it's hard to judge whether I'm doing anything wrong, the documentation is outdated or the Maven/SBT files have diverged from the actual code by defining older/now incompatible versions or something else going wrong. Any hints? Also an unrelated note/question: I see tons of pull requests being accepted without a JIRA but the documentation says to create a JIRA issue first[1]. So I assume it's okay to just send pull requests? Thanks for your help. Cheers, Lars [1] https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark On Fri, Feb 28, 2014 at 6:41 PM, Lars Francke lars.fran...@gmail.com wrote: Hey, so currently it doesn't work because of https://github.com/apache/spark/pull/6#issuecomment-36343187 IntelliJ reports a lot of warnings with default settings and I haven't found a way to tell IntellJ to use different Hadoop versions yet. mvn clean compile -Pyarn fails as well (compilation errror Your command works indeed. Default yarn version is 0.23.7 which doesn't seem to work with the default 2.2.0 Hadoop version (anymore?) I was basically trying to follow the documentation: http://spark.incubator.apache.org/docs/latest/building-with-maven.html mvn clean compile -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.5.0 -Dyarn.version=2.0.0-cdh4.5.0 fails as well as does mvn clean compile -Pyarn-alpha Thanks for showing me a configuration that works. Unfortunately the default ones and at least one of the documented ones fail. Cheers, Lars On Fri, Feb 28, 2014 at 3:05 PM, Tom Graves tgraves...@yahoo.com wrote: what build command are you using?What do you mean when you say YARN branch? The yarn builds have been working fine for me with maven. Build command I use against hadoop 2.2 or higher: mvn -Dyarn.version=2.2.0 -Dhadoop.version=2.2.0 -Pyarn clean package -DskipTests Tom On Friday, February 28, 2014 6:14 AM, Lars Francke lars.fran...@gmail.com wrote: Hey, I'm trying to dig into Spark's code but am running into a couple of problems. 1) The yarn-common directory is not included in the Maven build causing things to fail because the dependency is missing. If I see the history correct it used to be a Maven module but is not anymore. 2) When I try to include the yarn-common directory in the build things start going bad. Compilation failures all over the place and I think there are some dependency issues in there as well. This leads me to believe that either the Maven build system isn't maintained for YARN or the whole YARN branch isn't. What's the status here? Without YARN things build fine for me using Maven. Thanks for your help. Cheers, Lars
Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark
@patrick - It seems like my point about being able to inherit the root pom was addressed and there's a way to handle this. The larger point I meant to make is that Maven is by far the most common build tool in projects that are likely to share contributors with Spark. I personally know 10 people I can go to if I run into a Maven problem, and maybe 1 if I run into an SBT problem. Google is also much more likely to be able to answer a question about Maven than SBT. I don't know of any particular functionality Maven has that SBT is lacking, but if it was up to me I'd still lean towards Maven on the grounds that it's way easier to find familiarity and expertise with. On Wed, Feb 26, 2014 at 9:42 AM, Patrick Wendell pwend...@gmail.com wrote: @mridul - As far as I know both Maven and Sbt use fairly similar processes for building the assembly/uber jar. We actually used to package spark with sbt and there were no specific issues we encountered and AFAIK sbt respects versioning of transitive dependencies correctly. Do you have a specific bug listing for sbt that indicates something is broken? @sandy - It sounds like you are saying that the CDH build would be easier with Maven because you can inherit the POM. However, is this just a matter of convenience for packagers or would standardizing on sbt limit capabilities in some way? I assume that it would just mean a bit more manual work for packagers having to figure out how to set the hadoop version in SBT and exclude certain dependencies. For instance, what does CDH about other components like Impala that are not based on Maven at all? On Wed, Feb 26, 2014 at 9:31 AM, Evan Chan e...@ooyala.com wrote: I'd like to propose the following way to move forward, based on the comments I've seen: 1. Aggressively clean up the giant dependency graph. One ticket I might work on if I have time is SPARK-681 which might remove the giant fastutil dependency (~15MB by itself). 2. Take an intermediate step by having only ONE source of truth w.r.t. dependencies and versions. This means either: a) Using a maven POM as the spec for dependencies, Hadoop version, etc. Then, use sbt-pom-reader to import it. b) Using the build.scala as the spec, and sbt make-pom to generate the pom.xml for the dependencies The idea is to remove the pain and errors associated with manual translation of dependency specs from one system to another, while still maintaining the things which are hard to translate (plugins). On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers ko...@tresata.com wrote: We maintain in house spark build using sbt. We have no problem using sbt assembly. We did add a few exclude statements for transitive dependencies. The main enemy of assemblies are jars that include stuff they shouldn't (kryo comes to mind, I think they include logback?), new versions of jars that change the provider/artifact without changing the package (asm), and incompatible new releases (protobuf). These break the transitive resolution process. I imagine that's true for any build tool. Besides shading I don't see anything maven can do sbt cannot, and if I understand it correctly shading is not done currently using the build tool. Since spark is primarily scala/akka based the main developer base will be familiar with sbt (I think?). Switching build tool is always painful. I personally think it is smarter to put this burden on a limited number of upstream integrators than on the community. However that said I don't think its a problem for us to maintain an sbt build in-house if spark switched to maven. 
The problem is, the complete spark dependency graph is fairly large, and there are lot of conflicting versions in there. In particular, when we bump versions of dependencies - making managing this messy at best. Now, I have not looked in detail at how maven manages this - it might just be accidental that we get a decent out-of-the-box assembled shaded jar (since we dont do anything great to configure it). With current state of sbt in spark, it definitely is not a good solution : if we can enhance it (or it already is ?), while keeping the management of the version/dependency graph manageable, I dont have any objections to using sbt or maven ! Too many exclude versions, pinned versions, etc would just make things unmanageable in future. Regards, Mridul On Wed, Feb 26, 2014 at 8:56 AM, Evan chan e...@ooyala.com wrote: Actually you can control exactly how sbt assembly merges or resolves conflicts. I believe the default settings however lead to order which cannot be controlled. I do wish for a smarter fat jar plugin. -Evan To be free is not merely to cast off one's chains, but to live in a way that respects enhances the freedom of others. (#NelsonMandela) On Feb 25, 2014, at 6:50 PM, Mridul Muralidharan mri...@gmail.com wrote: