Re: Joining the spark dev community
Hi Saurabh,

A good way to start is to use Spark with your applications and file issues you find, and maybe provide patches for those or for existing ones. Please take a look at Spark's how-to-contribute page [1] to help you get started. Hope this helps.

- Henry

[1] https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

On Sat, Oct 18, 2014 at 1:46 PM, Saurabh Wadhawan saurabh.wadha...@guavus.com wrote: How can I become a Spark contributor? What's a good path I can follow to go from newbie to active code submitter for Spark? Regards - Saurabh
Re: Breaking the previous large-scale sort record with Spark
Congrats to Reynold et al. for leading this effort! - Henry

On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project: we've been able to use Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x fewer nodes. There's a detailed writeup at http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html. Summary: while Hadoop MapReduce held last year's 100 TB world record by sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 206 nodes; we also scaled up to sort 1 PB in 234 minutes.

I want to thank Reynold Xin for leading this effort over the past few weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for providing the machines to make this possible. Finally, this result would of course not be possible without the many, many other contributions, testing and feature requests from throughout the community. For an engine to scale from these multi-hour petabyte batch jobs down to 100-millisecond streaming and interactive queries is quite uncommon, and it's thanks to all of you folks that we are able to make this happen.

Matei
Re: [VOTE] Release Apache Spark 1.1.0 (RC4)
LICENSE and NOTICE files are good
Hash files are good
Signature files are good
No third-party executables
Source compiled
Ran local and standalone tests
Tested persist off-heap with Tachyon; looks good

+1

- Henry

On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell pwend...@gmail.com wrote:

Please vote on releasing the following candidate as Apache Spark version 1.1.0!

The tag to be voted on is v1.1.0-rc4 (commit 2f9b2bd): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=2f9b2bd7844ee8393dc9c319f4fefedf95f5e460

The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc4/

Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1031/

The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/

Please vote on releasing this package as Apache Spark 1.1.0! The vote is open until Saturday, September 06, at 08:30 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.1.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

== Regressions fixed since RC3 ==
SPARK-3332 - Issue with tagging in EC2 scripts
SPARK-3358 - Issue with regression for m3.XX instances

== What justifies a -1 vote for this release? ==
This vote is happening very late into the QA period compared with previous votes, so -1 votes should only occur for significant regressions from 1.0.2. Bugs already present in 1.0.X will not block this release.

== What default changes should I be aware of? ==
1. The default value of spark.io.compression.codec is now snappy
   -- Old behavior can be restored by switching to lzf
2. PySpark now performs external spilling during aggregations.
   -- Old behavior can be restored by setting spark.shuffle.spill to false.
3. PySpark uses a new heuristic for determining the parallelism of shuffle operations.
   -- Old behavior can be restored by setting spark.default.parallelism to the number of cores in the cluster.
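[Editor's note] For readers who want the 1.0.x behavior back, a minimal sketch of setting the three properties above from application code; the app name and the parallelism value are placeholders, not values from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: restores the pre-1.1 defaults called out in the vote email above.
    object RestoreOldDefaults {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("restore-1.0-defaults")        // placeholder app name
          .set("spark.io.compression.codec", "lzf")  // 1.1 default is snappy
          .set("spark.shuffle.spill", "false")       // disable external spilling during aggregations
          .set("spark.default.parallelism", "16")    // e.g. the total number of cores in the cluster
        val sc = new SparkContext(conf)
        // ... application code ...
        sc.stop()
      }
    }

The same keys can also be placed in conf/spark-defaults.conf so they apply to every job submitted through spark-submit.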
Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab
Welcome Shane =) - Henry

On Tue, Sep 2, 2014 at 10:35 AM, shane knapp skn...@berkeley.edu wrote:

so, i had a meeting w/the databricks guys on friday and they recommended i send an email out to the list to say 'hi' and give you guys a quick intro. :) hi! i'm shane knapp, the new AMPLab devops engineer, and will be spending time getting the jenkins build infrastructure up to production quality. much of this will be 'under the covers' work, like better system level auth, backups, etc, but some will definitely be user facing: timely jenkins updates, debugging broken build infrastructure and some plugin support.

i've been working in the bay area now since 1997 at many different companies, and my last 10 years has been split between google and palantir. i'm a huge proponent of OSS, and am really happy to be able to help with the work you guys are doing! if anyone has any requests/questions/comments, feel free to drop me a line!

shane
Re: [Spark SQL] off-heap columnar store
Hi Michael,

This is great news. Any initial proposal or design for the caching to Tachyon that you can share so far? I don't think there is a JIRA ticket open to track this feature yet.

- Henry

On Mon, Aug 25, 2014 at 1:13 PM, Michael Armbrust mich...@databricks.com wrote:

> What is the plan for getting Tachyon/off-heap support for the columnar compressed store? It's not in 1.1, is it?

It is not in 1.1 and there are no concrete plans for adding it at this point. Currently, there is more engineering investment going into caching Parquet data in Tachyon instead. This approach is going to have much better support for nested data, leverages other work being done on Parquet, and alleviates your concerns about wire-format compatibility. That said, if someone really wants to try to implement it, I don't think it would be very hard. The primary issue is going to be designing a clean interface that is not too tied to this one implementation.

> Also, how likely is the wire format for the columnar compressed data to change? That would be a problem for write-through or persistence.

We aren't making any guarantees at the moment that it won't change. It's currently only intended for temporary caching of data.
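[Editor's note] For context, a minimal sketch of the "Parquet data in Tachyon" approach Michael describes, using Spark SQL APIs of that era; the Tachyon URI and the Record schema are made-up placeholders and not part of the original thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Record(key: Int, value: String)   // hypothetical schema for illustration

    object TachyonParquetSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("tachyon-parquet-sketch"))
        val sqlContext = new SQLContext(sc)
        import sqlContext._                        // brings in createSchemaRDD

        val path = "tachyon://localhost:19998/tmp/records"   // placeholder Tachyon URI

        // Write the data as Parquet onto Tachyon, read it back, and cache the table.
        sc.parallelize(1 to 100).map(i => Record(i, s"val_$i")).saveAsParquetFile(path)
        val records = sqlContext.parquetFile(path)
        records.registerTempTable("records")       // registerAsTable in Spark 1.0.x
        sqlContext.cacheTable("records")

        println(sqlContext.sql("SELECT COUNT(*) FROM records").collect().mkString)
        sc.stop()
      }
    }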
Re: Spark Contribution
The Apache Spark wiki on how to contribute should be a great place to start: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark - Henry

On Thu, Aug 21, 2014 at 3:25 AM, Maisnam Ns maisnam...@gmail.com wrote: Hi, can someone help me with some links on how to contribute to Spark? Regards, mns
Re: [VOTE] Release Apache Spark 1.0.2 (RC1)
NOTICE and LICENSE files look good
Hashes and sigs look good
No executables in the source distribution
Compiled source and ran standalone

+1

- Henry

On Fri, Jul 25, 2014 at 4:08 PM, Tathagata Das tathagata.das1...@gmail.com wrote:

Please vote on releasing the following candidate as Apache Spark version 1.0.2. This release fixes a number of bugs in Spark 1.0.1. Some of the notable ones are:
- SPARK-2452: Known issue in Spark 1.0.1 caused by the attempted fix for SPARK-1199. The fix was reverted for 1.0.2.
- SPARK-2576: NoClassDefFoundError when executing a Spark QL query on an HDFS CSV file.
The full list is at http://s.apache.org/9NJ

The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f

The release files, including signatures, digests, etc., can be found at: http://people.apache.org/~tdas/spark-1.0.2-rc1/

Release artifacts are signed with the following key: https://people.apache.org/keys/committer/tdas.asc

The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1024/

The documentation corresponding to this release can be found at: http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/

Please vote on releasing this package as Apache Spark 1.0.2! The vote is open until Tuesday, July 29, at 23:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/
Re: -1s on pull requests?
There are ASF guidelines about voting, including code review for patches: http://www.apache.org/foundation/voting.html

Some ASF projects require three +1 votes (on the issue, i.e. JIRA or the GitHub PR in this case) for a patch unless it is tagged with lazy consensus [1] of roughly 48 hours. For patches that are not critical, waiting a while to give additional committers time to review would be the best way to go. Another thing is that all contributors need to be patient once their patches have been submitted and are pending review. This is part of being in an open community. Hope this helps.

- Henry

[1] http://www.apache.org/foundation/glossary.html#LazyConsensus

On Mon, Jul 21, 2014 at 1:59 PM, Patrick Wendell pwend...@gmail.com wrote:

I've always operated under the assumption that if a committer makes a comment on a PR and it's not addressed, that should block the PR from being merged (even without a specific -1). I don't know of any cases where this has intentionally been violated, but I do think this happens accidentally sometimes. Unfortunately, we are not allowed to use those GitHub hooks because of the way the ASF GitHub integration works. I've lately been using a custom-made tool to help review pull requests. One thing I could do is add a feature here saying which committers have said LGTM on a PR (vs. the ones that have only commented). We could also indicate the latest test status as Green/Yellow/Red based on the Jenkins comments: http://pwendell.github.io/spark-github-shim/ As a warning to potential users, my tool might crash your browser. - Patrick

On Mon, Jul 21, 2014 at 1:44 PM, Kay Ousterhout k...@eecs.berkeley.edu wrote:

Hi all, As the number of committers / contributors on Spark has increased, there are cases where pull requests get merged before all the review comments have been addressed. This happens, say, when one committer points out a problem with the pull request, and another committer doesn't see the earlier comment and merges the PR before the comment has been addressed. This is especially tricky for pull requests with a large number of comments, because it can be difficult to notice early comments describing blocking issues. This also happens when something accidentally gets merged after the tests have started but before tests have passed. Do folks have ideas on how we can handle this issue? Are there other projects that have good ways of handling this? It looks like for Hadoop, people can -1 / +1 on the JIRA. -Kay
Re: Announcing Spark 1.0.1
Congrats to the Spark community! On Friday, July 11, 2014, Patrick Wendell pwend...@gmail.com wrote:

I am happy to announce the availability of Spark 1.0.1! This release includes contributions from 70 developers. Spark 1.0.1 includes fixes across several areas of Spark, including the core API, PySpark, and MLlib. It also includes new features in Spark's (alpha) SQL library, including support for JSON data and performance and stability fixes. Visit the release notes [1] to read about this release or download [2] the release today.

[1] http://spark.apache.org/releases/spark-release-1-0-1.html
[2] http://spark.apache.org/downloads.html
Re: Run ScalaTest inside Intellij IDEA
I got stuck on this one too after doing a git pull from master. Have not been able to resolve it yet =( - Henry

On Wed, Jun 11, 2014 at 6:51 AM, Yijie Shen henry.yijies...@gmail.com wrote:

Thx Qiuzhuang, the problems disappeared after I added the assembly jar at the head of the dependency list in *.iml, but while running tests in Spark SQL (SQLQuerySuite in sql-core), two other errors occur:

Error 1:

Error:scalac: while compiling: /Users/yijie/code/apache.spark.master/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala
during phase: jvm
library version: version 2.10.4
compiler version: version 2.10.4
reconstructed args: -Xmax-classfile-name 120 -deprecation -P:genjavadoc:out=/Users/yijie/code/apache.spark.master/sql/core/target/java -feature -classpath /Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/ant-javafx.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/dt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/javafx-doclet.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/javafx-mx.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/jconsole.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/sa-jdi.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/tools.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Conte… … ... /Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/classes:/Users/yijie/code/apache.spark.master/lib_managed/jars/scala-library-2.10.4.jar -Xplugin:/Users/yijie/code/apache.spark.master/lib_managed/jars/genjavadoc-plugin_2.10.4-0.5.jar -Xplugin:/Users/yijie/code/apache.spark.master/lib_managed/jars/genjavadoc-plugin_2.10.4-0.5.jar
last tree to typer: Literal(Constant(parquet.io.api.Converter))
symbol: null
symbol definition: null
tpe: Class(classOf[parquet.io.api.Converter])
symbol owners:
context owners: object TestSQLContext - package test
== Enclosing template or block ==
Template( // val local TestSQLContext: notype in object TestSQLContext, tree.tpe=org.apache.spark.sql.test.TestSQLContext.type org.apache.spark.sql.SQLContext // parents ValDef( private _ tpt empty ) // 2 statements DefDef( // private def readResolve(): Object in object TestSQLContext method private synthetic readResolve [] List(Nil) tpt // tree.tpe=Object test.this.TestSQLContext // object TestSQLContext in package test, tree.tpe=org.apache.spark.sql.test.TestSQLContext.type ) DefDef( // def init(): org.apache.spark.sql.test.TestSQLContext.type in object TestSQLContext method init [] List(Nil) tpt // tree.tpe=org.apache.spark.sql.test.TestSQLContext.type Block( // tree.tpe=Unit Apply( // def init(sparkContext: org.apache.spark.SparkContext): org.apache.spark.sql.SQLContext in class SQLContext, tree.tpe=org.apache.spark.sql.SQLContext TestSQLContext.super.init // def init(sparkContext: org.apache.spark.SparkContext): org.apache.spark.sql.SQLContext in class SQLContext, tree.tpe=(sparkContext: org.apache.spark.SparkContext)org.apache.spark.sql.SQLContext Apply( // def init(master: String,appName: String,conf: org.apache.spark.SparkConf): org.apache.spark.SparkContext in class SparkContext, tree.tpe=org.apache.spark.SparkContext new org.apache.spark.SparkContext.init // def init(master: 
String,appName: String,conf: org.apache.spark.SparkConf): org.apache.spark.SparkContext in class SparkContext, tree.tpe=(master: String, appName: String, conf: org.apache.spark.SparkConf)org.apache.spark.SparkContext // 3 arguments local TestSQLContext Apply( // def init(): org.apache.spark.SparkConf in class SparkConf, tree.tpe=org.apache.spark.SparkConf new org.apache.spark.SparkConf.init // def init(): org.apache.spark.SparkConf in class SparkConf, tree.tpe=()org.apache.spark.SparkConf Nil ) ) ) () ) ) ) == Expanded type of tree == ConstantType(value = Constant(parquet.io.api.Converter)) uncaught exception during compilation: java.lang.AssertionError Error 2: Error:scalac: Error: assertion failed: List(object package$DebugNode, object package$DebugNode) java.lang.AssertionError: assertion failed: List(object package$DebugNode, object package$DebugNode) at scala.reflect.internal.Symbols$Symbol.suchThat(Symbols.scala:1678) at scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:2988)
Re: Emergency maintenance on Jenkins
Thanks for letting us know, Patrick. - Henry On Monday, June 9, 2014, Patrick Wendell pwend...@gmail.com wrote: Just a heads up - due to an outage at UCB we've lost several of the Jenkins slaves. I'm trying to spin up new slaves on EC2 to compensate, but this might fail some ongoing builds. The good news is that if we do get it working with EC2 workers, then we will have burst capability in the future - e.g. on release deadlines. So it's not all bad! - Patrick
Removing spark-debugger.md file from master?
Hi All, It seems the spark-debugger.md doc is no longer accurate (see http://spark.apache.org/docs/latest/spark-debugger.html): since it was originally written, Spark has evolved in ways that make the doc obsolete. There is already work pending for a new replay debugger (I could not find the PR links for it), so I think we should remove the old page for now. With version control we could always reinstate the old doc if needed, but as of today the doc no longer reflects the current state of Spark's RDDs. If there is no objection I could send a PR to remove the md file in master. Thoughts? - Henry
Re: Removing spark-debugger.md file from master?
Cool, thanks Ankur, sounds good. PR is coming. - Henry On Tue, Jun 3, 2014 at 11:11 AM, Ankur Dave ankurd...@gmail.com wrote: I agree, let's go ahead and remove it. Ankur http://www.ankurdave.com/
Add my JIRA username (hsaputra) to Spark's contributors list
Hi, Could someone with the right karma kindly add my username (hsaputra) to Spark's contributors list? I was added before, but somehow now I can no longer assign tickets to myself nor update tickets I am working on. Thanks, - Henry
Re: [VOTE] Release Apache Spark 1.0.0 (RC11)
NOTICE and LICENSE files look good
Signatures look good
Hashes look good
No external executables in the source distributions
Source compiled with sbt
Ran local and standalone examples; look good

+1

- Henry

On Mon, May 26, 2014 at 7:38 AM, Tathagata Das tathagata.das1...@gmail.com wrote:

Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has a few important bug fixes on top of rc10:
SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
SPARK-1870: https://github.com/apache/spark/pull/848
SPARK-1897: https://github.com/apache/spark/pull/849

The tag to be voted on is v1.0.0-rc11 (commit c69d97cd): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a

The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~tdas/spark-1.0.0-rc11/

Release artifacts are signed with the following key: https://people.apache.org/keys/committer/tdas.asc

The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1019/

The documentation corresponding to this release can be found at: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/

Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Thursday, May 29, at 16:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user-facing changes have been kept as small as possible.

Changes to ML vector specification: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10
Changes to the Java API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
Changes to the streaming API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
Changes to the GraphX API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

Other changes:
coGroup and related functions now return Iterable[T] instead of Seq[T]
-- Call toSeq on the result to restore the old behavior
SparkContext.jarOfClass returns Option[String] instead of Seq[String]
-- Call toSeq on the result to restore old behavior
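[Editor's note] To make the last two API changes concrete, a minimal sketch of adapting application code when moving to 1.0; the RDD contents here are made up purely for illustration (note that pair-RDD operations still need the SparkContext._ implicits in this release):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD implicits (needed before Spark 1.3)

    object UpgradeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("upgrade-sketch").setMaster("local"))

        val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b"))
        val right = sc.parallelize(Seq(1 -> "x", 1 -> "y"))

        // cogroup now yields Iterable[String] per key; call toSeq to get the old Seq back.
        val grouped = left.cogroup(right)
          .mapValues { case (ls, rs) => (ls.toSeq, rs.toSeq) }
          .collect()

        // jarOfClass now returns Option[String]; toSeq restores the old Seq[String] shape.
        val jars: Seq[String] = SparkContext.jarOfClass(this.getClass).toSeq

        println(grouped.mkString(", "))
        println(jars.mkString(", "))
        sc.stop()
      }
    }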
Re: [VOTE] Release Apache Spark 1.0.0 (RC10)
Looks like SPARK-1900 is a blocker for YARN, and we might as well add SPARK-1870 while at it. TD or Patrick, could you kindly send an email with [CANCEL] prefixed in the subject for the RC10 vote, to help people follow the active VOTE threads? The VOTE emails are getting a bit hard to follow. - Henry

On Thu, May 22, 2014 at 2:05 PM, Tathagata Das tathagata.das1...@gmail.com wrote:

Hey all, On further testing, I came across a bug that breaks execution of pyspark scripts on YARN. https://issues.apache.org/jira/browse/SPARK-1900 This is a blocker and worth cutting a new RC. We also found a fix for a known issue that prevents additional jar files from being specified through spark-submit on YARN. https://issues.apache.org/jira/browse/SPARK-1870 This has been fixed and will be in the next RC. We are canceling this vote for now. We will post RC11 shortly. Thanks everyone for testing! TD
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
Hi Sandy, Just curious whether the vote is for rc5 or rc6? Gmail shows me that you replied to the rc5 thread. Thanks, - Henry

On Wed, May 14, 2014 at 1:28 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

+1 (non-binding)
* Built the release from source.
* Compiled Java and Scala apps that interact with HDFS against it.
* Ran them in local mode.
* Ran them against a pseudo-distributed YARN cluster in both yarn-client mode and yarn-cluster mode.

On Tue, May 13, 2014 at 9:09 PM, witgo wi...@qq.com wrote:

You need to set:
spark.akka.frameSize 5
spark.default.parallelism 1

-- Original --
From: Madhu; ma...@madhu.com
Date: Wed, May 14, 2014 09:15 AM
To: devd...@spark.incubator.apache.org
Subject: Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

I just built rc5 on Windows 7 and tried to reproduce the problem described in https://issues.apache.org/jira/browse/SPARK-1712 It works on my machine:

14/05/13 21:06:47 INFO DAGScheduler: Stage 1 (sum at console:17) finished in 4.548 s
14/05/13 21:06:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
14/05/13 21:06:47 INFO SparkContext: Job finished: sum at console:17, took 4.814991993 s
res1: Double = 5.05E11

I used all defaults, no config files were changed. Not sure if that makes a difference...

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6560.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: minor optimizations to get my feet wet
Hi Ignacio, Thank you for your contribution. Just a friendly reminder: in case you have not contributed to Apache Software Foundation projects before, please submit the ASF ICLA form [1], or, if you are sponsored by your company, also ask the company to send a CCLA [2] to clear the intellectual property for your contributions. You can ignore the preferred Apache id section for now. Thank you, Henry Saputra

[1] https://www.apache.org/licenses/icla.txt
[2] http://www.apache.org/licenses/cla-corporate.txt

On Thu, Apr 10, 2014 at 1:48 PM, Ignacio Zendejas ignacio.zendejas...@gmail.com wrote:

Hi, all - First off, I want to say that I love Spark and am very excited about MLBase. I'd love to contribute now that I have some time, but before I do that I'd like to familiarize myself with the process. In looking for a few projects and settling on one which I'll discuss in another thread, I found some very minor optimizations I could contribute, again, as part of this first step. Before I initiate a PR, I've gone ahead and tested style, ran tests, etc. per the instructions, but I'd still like to have someone quickly glance over it and ensure that these are JIRA worthy.

Commit: https://github.com/izendejas/spark/commit/81065aed9987c1b08cd5784b7a6153e26f3f7402

To summarize:
* I got rid of some SeqLike.reverse calls when sorting in descending order
* replaced slice(1, length) calls with the much safer (avoids IOOBEs) and more readable .tail calls
* used a foldLeft to avoid using mutable variables in the NaiveBayes code

This last one is meant to understand what's valued more between idiomatic Scala development and readability. I'm personally a fan of foldLefts where applicable, but do think they're a bit less readable.

Thanks, Ignacio
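[Editor's note] For readers following along, a small illustrative sketch of the three idioms Ignacio mentions -- not his actual patch, just standalone Scala showing the shapes being discussed:

    object IdiomSketch {
      def main(args: Array[String]): Unit = {
        val xs = Seq(3, 1, 4, 1, 5, 9)

        // Sort descending directly instead of sorting ascending and calling reverse.
        val desc = xs.sortBy(x => -x)          // equivalently: xs.sorted(Ordering[Int].reverse)

        // tail is a more readable equivalent of slice(1, xs.length) for non-empty sequences.
        val rest = xs.tail

        // foldLeft replaces a mutable running total.
        val total = xs.foldLeft(0)((acc, x) => acc + x)

        println(s"desc=$desc rest=$rest total=$total")
      }
    }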
Re: minor optimizations to get my feet wet
You are welcome, thanks again for contributing =) - Henry

On Thu, Apr 10, 2014 at 3:17 PM, Ignacio Zendejas ignacio.zendejas...@gmail.com wrote:

I don't think there's a noticeable performance hit from the use of reverse in those cases. It was a quick set of changes and it helped me understand what you look for. I didn't intend to nitpick, so I'll leave it as is. I could have used a scala.Ordering implicitly/explicitly as well, but that seems like overkill and I don't necessarily want to start a discussion about what's best--unless one of the admins deems this important. I'll only keep the use of take and tail over slice, and switch over to math.min where indicated. I'll do this after I follow Henry's timely advice--thanks, Henry. Cheers.

On Thu, Apr 10, 2014 at 2:10 PM, Reynold Xin r...@databricks.com wrote:

Thanks for contributing! I think often, unless the feature is gigantic, you can send a pull request directly for discussion. One rule of thumb in the Spark code base is that we typically prefer readability over conciseness, and thus we tend to avoid using too much Scala magic or operator overloading. In this specific case, do you know if using - instead of reverse improves performance? I personally find it slightly awkward to use an underscore right after a negation ... The tail change looks good to me. For foldLeft, I agree with you that the old way is more readable (although less idiomatic Scala).

On Thu, Apr 10, 2014 at 1:48 PM, Ignacio Zendejas ignacio.zendejas...@gmail.com wrote:

Hi, all - First off, I want to say that I love Spark and am very excited about MLBase. I'd love to contribute now that I have some time, but before I do that I'd like to familiarize myself with the process. In looking for a few projects and settling on one which I'll discuss in another thread, I found some very minor optimizations I could contribute, again, as part of this first step. Before I initiate a PR, I've gone ahead and tested style, ran tests, etc. per the instructions, but I'd still like to have someone quickly glance over it and ensure that these are JIRA worthy.

Commit: https://github.com/izendejas/spark/commit/81065aed9987c1b08cd5784b7a6153e26f3f7402

To summarize:
* I got rid of some SeqLike.reverse calls when sorting in descending order
* replaced slice(1, length) calls with the much safer (avoids IOOBEs) and more readable .tail calls
* used a foldLeft to avoid using mutable variables in the NaiveBayes code

This last one is meant to understand what's valued more between idiomatic Scala development and readability. I'm personally a fan of foldLefts where applicable, but do think they're a bit less readable.

Thanks, Ignacio
Re: JIRA. github and asf updates
With the speed of comment updates in JIRA by the Spark dev community, +1 for the issues@ list. - Henry

On Saturday, March 29, 2014, Patrick Wendell pwend...@gmail.com wrote: Ah sorry, I see - JIRA updates are going to the dev list. Maybe that's not desirable. I think we should send them to the issues@ list.

On Sat, Mar 29, 2014 at 11:16 AM, Patrick Wendell pwend...@gmail.com wrote: Mridul, You can unsubscribe yourself from any of these sources, right? - Patrick

On Sat, Mar 29, 2014 at 11:05 AM, Mridul Muralidharan mri...@gmail.com wrote: Hi, So we are now receiving updates from three sources for each change to a PR. While each of them handles a corner case which the others might miss, it would be great if we could minimize the volume of duplicated communication. Regards, Mridul
Re: Largest input data set observed for Spark.
Reynold, just curious, did you guys run it on AWS? - Henry

On Thu, Mar 20, 2014 at 11:08 AM, Reynold Xin r...@databricks.com wrote: Actually we just ran a job with 70TB+ of compressed data on 28 worker nodes - I didn't count the size of the uncompressed data, but I am guessing it is somewhere between 200TB and 700TB.

On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani us...@platfora.com wrote: All, What is the largest input data set y'all have come across that has been successfully processed in production using Spark? Ballpark?
Re: Announcing the official Spark Job Server repo
W00t! Thanks for releasing this, Evan. - Henry

On Tue, Mar 18, 2014 at 1:51 PM, Evan Chan e...@ooyala.com wrote: Dear Spark developers, Ooyala is happy to announce that we have pushed our official, Spark 0.9.0 / Scala 2.10-compatible job server as a GitHub repo: https://github.com/ooyala/spark-jobserver Complete with unit tests, deploy scripts, and examples. The original PR (#222) on incubator-spark is now closed. Please have a look; pull requests are very welcome. -- Evan Chan, Staff Engineer, e...@ooyala.com
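[Editor's note] For anyone curious what a job looks like against this server, here is a sketch modeled on the repo's word-count example at the time; treat the package, trait, and method names as approximate rather than authoritative, since the API may have changed after this announcement:

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

    import scala.util.Try

    // Sketch of a job the server can run; the "input.string" config key mirrors the README example.
    object WordCountSketch extends SparkJob {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation =
        Try(config.getString("input.string"))
          .map(_ => SparkJobValid)
          .getOrElse(SparkJobInvalid("No input.string config param"))

      override def runJob(sc: SparkContext, config: Config): Any =
        sc.parallelize(config.getString("input.string").split(" ").toSeq).countByValue()
    }

The server packages jobs like this into a jar, uploads it over the REST API, and then runs them by class name with the job configuration supplied at request time.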