Re: Mail to u...@spark.apache.org failing
Ah - we should update it to suggest mailing the dev@ list (and if there is enough traffic maybe do something else). I'm happy to add you if you can give an organization name, URL, a list of which Spark components you are using, and a short description of your use case. On Mon, Feb 9, 2015 at 9:00 PM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi, The mail id given in https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark seems to be failing. Can anyone tell me how to get added to the Powered By Spark list? -- Regards, *Meethu* - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: New Metrics Sink class not packaged in spark-assembly jar
Hi Judy, If you have added source files in the sink/ source folder, they should appear in the assembly jar when you build. One thing I noticed is that you are looking inside the /dist folder. That only gets populated if you run make-distribution. The normal development process is just to do mvn package and then look at the assembly jar that is contained in core/target. - Patrick On Mon, Feb 9, 2015 at 10:02 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Hello, Working on SPARK-5708 https://issues.apache.org/jira/browse/SPARK-5708 - Add Slf4jSink to Spark Metrics Sink. Wrote a new Slf4jSink class (see patch attached), but the new class is not packaged as part of spark-assembly jar. Do I need to update build config somewhere to have this packaged? Current packaged class: Thought I must have missed something basic but can't figure out why. Thanks! Judy - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
New Metrics Sink class not packaged in spark-assembly jar
Hello, Working on SPARK-5708 https://issues.apache.org/jira/browse/SPARK-5708 - Add Slf4jSink to Spark Metrics Sink. Wrote a new Slf4jSink class (see patch attached), but the new class is not packaged as part of spark-assembly jar. Do I need to update build config somewhere to have this packaged? Current packaged class: [inline screenshot omitted] Thought I must have missed something basic but can't figure out why. Thanks! Judy - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
RE: New Metrics Sink class not packaged in spark-assembly jar
Thanks Patrick! That was the issue. Built the jars on windows env with mvn and forgot to run make-distributions.ps1 afterward, so was looking at old jars. From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Monday, February 9, 2015 10:43 PM To: Judy Nash Cc: dev@spark.apache.org Subject: Re: New Metrics Sink class not packaged in spark-assembly jar Actually, to correct myself, the assembly jar is in assembly/target/scala-2.11 (I think). On Mon, Feb 9, 2015 at 10:42 PM, Patrick Wendell pwend...@gmail.com wrote: Hi Judy, If you have added source files in the sink/ source folder, they should appear in the assembly jar when you build. One thing I noticed is that you are looking inside the /dist folder. That only gets populated if you run make-distribution. The normal development process is just to do mvn package and then look at the assembly jar that is contained in core/target. - Patrick On Mon, Feb 9, 2015 at 10:02 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Hello, Working on SPARK-5708 https://issues.apache.org/jira/browse/SPARK-5708 - Add Slf4jSink to Spark Metrics Sink. Wrote a new Slf4jSink class (see patch attached), but the new class is not packaged as part of spark-assembly jar. Do I need to update build config somewhere to have this packaged? Current packaged class: [inline screenshot omitted] Thought I must have missed something basic but can't figure out why. Thanks! Judy - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
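A quick way to verify the fix described in this thread is to rebuild and grep the assembly for the new class. This is a minimal sketch assuming a Spark 1.2-era source tree; the Scala-version directory in the glob and the exact jar name vary by build, per Patrick's correction above:

~~~
# Rebuild, then confirm the new sink class is inside the assembly jar.
# Paths are assumptions; adjust the scala-2* glob to your build.
mvn -DskipTests package
ls assembly/target/scala-2*/spark-assembly-*.jar
jar tf assembly/target/scala-2*/spark-assembly-*.jar | grep Slf4jSink
~~~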
Re: Using CUDA within Spark / boosting linear algebra
Great - perhaps we can move this discussion off-list and onto a JIRA ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705) It seems like this is going to be somewhat exploratory for a while (and there's probably only a handful of us who really care about fast linear algebra!) - Evan On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Evan, Thank you for the explanation and useful link. I am going to build OpenBLAS, link it with Netlib-java and perform the benchmark again. Do I understand correctly that BIDMat binaries contain statically linked Intel MKL BLAS? It might be the reason why I am able to run BIDMat without having MKL BLAS installed on my server. If it is true, I wonder if it is OK because Intel sells this library. Nevertheless, it seems that in my case precompiled MKL BLAS performs better than precompiled OpenBLAS given that BIDMat and Netlib-java are supposed to be on par with JNI overheads. Though, it might be interesting to link Netlib-java with Intel MKL, as you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday (Netlib-java) would be interested in comparing their libraries. Best regards, Alexander *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com] *Sent:* Friday, February 06, 2015 5:58 PM *To:* Ulanov, Alexander *Cc:* Joseph Bradley; dev@spark.apache.org *Subject:* Re: Using CUDA within Spark / boosting linear algebra I would build OpenBLAS yourself, since good BLAS performance comes from getting cache sizes, etc. set up correctly for your particular hardware - this is often a very tricky process (see, e.g. ATLAS), but we found that on relatively modern Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL. To make sure the right library is getting used, you have to make sure it's first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here. For some examples of getting netlib-java set up on an ec2 node and some example benchmarking code we ran a while back, see: https://github.com/shivaram/matrix-bench In particular - build-openblas-ec2.sh shows you how to build the library and set up symlinks correctly, and scala/run-netlib.sh shows you how to get the path setup and get that library picked up by netlib-java. In this way - you could probably get cuBLAS set up to be used by netlib-java as well. - Evan On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Evan, could you elaborate on how to force BIDMat and netlib-java to load the right blas? For netlib, there are a few JVM flags, such as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can force it to use the Java implementation. Not sure I understand how to force the use of a specific blas (not a specific wrapper for blas). Btw. I have installed openblas (yum install openblas), so I suppose that netlib is using it. *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com] *Sent:* Friday, February 06, 2015 5:19 PM *To:* Ulanov, Alexander *Cc:* Joseph Bradley; dev@spark.apache.org *Subject:* Re: Using CUDA within Spark / boosting linear algebra Getting breeze to pick up the right blas library is critical for performance. I recommend using OpenBLAS (or MKL, if you already have it). It might make sense to force BIDMat to use the same underlying BLAS library as well.
On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Evan, Joseph, I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than netlib-java+breeze (sorry for the weird table formatting):

|A*B size | BIDMat MKL | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
+---+
|100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556 |
|1000x1000*1000x1000 | 0,018320947 | 0,51803557 | 1,638475459 |
|10000x10000*10000x10000 | 23,78046632 | 445,0935211 | 1569,233228 |

Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11. Later I will make tests with Cuda. I need to install a new Cuda version for this purpose. Do you have any ideas why breeze-netlib with native blas is so much slower than BIDMat MKL? Best regards, Alexander From: Joseph Bradley [mailto:jos...@databricks.com] Sent: Thursday, February 05, 2015 5:29 PM To: Ulanov, Alexander Cc: Evan R. Sparks; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra Hi Alexander, Using GPUs with Spark would be very exciting. Small comment: Concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU
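For readers who want to reproduce the setup Evan describes, here is a minimal shell sketch. The install prefix and the benchmark entry point are assumptions for illustration, not details from the thread:

~~~
# Build OpenBLAS from source so it tunes itself to the local CPU.
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS && make && sudo make install PREFIX=/opt/openblas

# Put the fresh library first on the loader search path so netlib-java
# picks it up instead of a stale system BLAS.
export LD_LIBRARY_PATH=/opt/openblas/lib:$LD_LIBRARY_PATH

# For comparison, the JVM flag quoted in the thread forces the pure-Java
# fallback instead of any native BLAS:
java -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS \
  -cp my-benchmark.jar BenchmarkMain   # hypothetical benchmark driver
~~~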
Re: Powered by Spark: Concur
Thanks Denny; added you. Matei On Feb 9, 2015, at 10:11 PM, Denny Lee denny.g@gmail.com wrote: Forgot to add Concur to the Powered by Spark wiki: Concur https://www.concur.com Spark SQL, MLlib Using Spark for travel and expenses analytics and personalization Thanks! Denny - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Powered by Spark: Concur
Thanks Matei - much appreciated! On Mon Feb 09 2015 at 10:23:57 PM Matei Zaharia matei.zaha...@gmail.com wrote: Thanks Denny; added you. Matei On Feb 9, 2015, at 10:11 PM, Denny Lee denny.g@gmail.com wrote: Forgot to add Concur to the Powered by Spark wiki: Concur https://www.concur.com Spark SQL, MLlib Using Spark for travel and expenses analytics and personalization Thanks! Denny
Re: Using CUDA within Spark / boosting linear algebra
Maybe you can ask Prof. John Canny himself :-) as I invited him to give a talk at Alpine Data Labs in March's meetup (SF Big Analytics & SF Machine Learning joint meetup), 3/11. To be announced in the next day or so. Chester Sent from my iPhone On Feb 9, 2015, at 4:48 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Evan, Thank you for the explanation and useful link. I am going to build OpenBLAS, link it with Netlib-java and perform the benchmark again. Do I understand correctly that BIDMat binaries contain statically linked Intel MKL BLAS? It might be the reason why I am able to run BIDMat without having MKL BLAS installed on my server. If it is true, I wonder if it is OK because Intel sells this library. Nevertheless, it seems that in my case precompiled MKL BLAS performs better than precompiled OpenBLAS given that BIDMat and Netlib-java are supposed to be on par with JNI overheads. Though, it might be interesting to link Netlib-java with Intel MKL, as you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday (Netlib-java) would be interested in comparing their libraries. Best regards, Alexander From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Friday, February 06, 2015 5:58 PM To: Ulanov, Alexander Cc: Joseph Bradley; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra I would build OpenBLAS yourself, since good BLAS performance comes from getting cache sizes, etc. set up correctly for your particular hardware - this is often a very tricky process (see, e.g. ATLAS), but we found that on relatively modern Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL. To make sure the right library is getting used, you have to make sure it's first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here. For some examples of getting netlib-java set up on an ec2 node and some example benchmarking code we ran a while back, see: https://github.com/shivaram/matrix-bench In particular - build-openblas-ec2.sh shows you how to build the library and set up symlinks correctly, and scala/run-netlib.sh shows you how to get the path setup and get that library picked up by netlib-java. In this way - you could probably get cuBLAS set up to be used by netlib-java as well. - Evan On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Evan, could you elaborate on how to force BIDMat and netlib-java to load the right blas? For netlib, there are a few JVM flags, such as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can force it to use the Java implementation. Not sure I understand how to force the use of a specific blas (not a specific wrapper for blas). Btw. I have installed openblas (yum install openblas), so I suppose that netlib is using it. From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Friday, February 06, 2015 5:19 PM To: Ulanov, Alexander Cc: Joseph Bradley; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra Getting breeze to pick up the right blas library is critical for performance. I recommend using OpenBLAS (or MKL, if you already have it). It might make sense to force BIDMat to use the same underlying BLAS library as well.
On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Evan, Joseph, I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than netlib-java+breeze (sorry for the weird table formatting):

|A*B size | BIDMat MKL | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
+---+
|100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556 |
|1000x1000*1000x1000 | 0,018320947 | 0,51803557 | 1,638475459 |
|10000x10000*10000x10000 | 23,78046632 | 445,0935211 | 1569,233228 |

Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11. Later I will make tests with Cuda. I need to install a new Cuda version for this purpose. Do you have any ideas why breeze-netlib with native blas is so much slower than BIDMat MKL? Best regards, Alexander From: Joseph Bradley [mailto:jos...@databricks.com] Sent: Thursday, February 05, 2015 5:29 PM To: Ulanov, Alexander Cc: Evan R. Sparks; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra Hi Alexander, Using GPUs with Spark would be very exciting. Small comment: Concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good
Re: Keep or remove Debian packaging in Spark?
"it sounds like nobody intends these to be used to actually deploy Spark" I wouldn't go quite that far. What we have now can serve as useful input to a deployment tool like Chef, but the user is then going to need to add some customization or configuration within the context of that tooling to get Spark installed just the way they want. So it is not so much that the current Debian packaging can't be used as that it has never really been intended to be a completely finished product that a newcomer could, for example, use to install Spark completely and quickly to Ubuntu and have a fully-functional environment in which they could then run all of the examples, tutorials, etc. Getting to that level of packaging (and maintenance) is something that I'm not sure we want to do since that is a better fit with Bigtop and the efforts of Cloudera, Hortonworks, MapR, etc. to distribute Spark. On Mon, Feb 9, 2015 at 2:41 AM, Sean Owen so...@cloudera.com wrote: This is a straw poll to assess whether there is support to keep and fix, or remove, the Debian packaging-related config in Spark. I see several oldish outstanding JIRAs relating to problems in the packaging: https://issues.apache.org/jira/browse/SPARK-1799 https://issues.apache.org/jira/browse/SPARK-2614 https://issues.apache.org/jira/browse/SPARK-3624 https://issues.apache.org/jira/browse/SPARK-4436 (and a similar idea about making RPMs) https://issues.apache.org/jira/browse/SPARK-665 The original motivation seems related to Chef: https://issues.apache.org/jira/browse/SPARK-2614?focusedCommentId=14070908&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14070908 Mark's recent comments cast some doubt on whether it is essential: https://github.com/apache/spark/pull/4277#issuecomment-72114226 and in recent conversations I didn't hear dissent to the idea of removing this. Is this still useful enough to fix up? All else equal I'd like to start to walk back some of the complexity of the build, but I don't know how all-else-equal it is. Certainly, it sounds like nobody intends these to be used to actually deploy Spark. I don't doubt it's useful to someone, but can they maintain the packaging logic elsewhere? - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: multi-line comment style
I like the `/* .. */` style more, because it is easier for IDEs to recognize it as a block comment. If you press enter in the comment block with the `//` style, IDEs won't add `//` for you. -Xiangrui On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin r...@databricks.com wrote: We should update the style doc to reflect what we have in most places (which I think is //). On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: FWIW I like the multi-line // over /* */ from a purely style standpoint. The Google Java style guide[1] has some comment about code formatting tools working better with /* */, but there don't seem to be any strong arguments for one over the other that I can find. Thanks Shivaram [1] https://google-styleguide.googlecode.com/svn/trunk/javaguide.html#s4.8.6.1-block-comment-style On Wed, Feb 4, 2015 at 2:05 PM, Patrick Wendell pwend...@gmail.com wrote: Personally I have no opinion, but agree it would be nice to standardize. - Patrick On Wed, Feb 4, 2015 at 1:58 PM, Sean Owen so...@cloudera.com wrote: One thing Marcelo pointed out to me is that the // style does not interfere with commenting out blocks of code with /* */, which is a small good thing. I am also accustomed to // style for multiline, and reserve /** */ for javadoc / scaladoc. Meaning, seeing the /* */ style inline always looks a little funny to me. On Wed, Feb 4, 2015 at 3:53 PM, Kay Ousterhout kayousterh...@gmail.com wrote: Hi all, The Spark Style Guide https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide says multi-line comments should be formatted as:

/*
 * This is a
 * very
 * long comment.
 */

But in my experience, we almost always use // for multi-line comments:

// This is a
// very
// long comment.

Here are some examples:
- Recent commit by Reynold, king of style: https://github.com/apache/spark/commit/bebf4c42bef3e75d31ffce9bfdb331c16f34ddb1#diff-d616b5496d1a9f648864f4ab0db5a026R58
- RDD.scala: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L361
- DAGScheduler.scala: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L281

Any objections to me updating the style guide to reflect this? As with other style issues, I think consistency here is helpful (and formatting multi-line comments as // does nicely visually distinguish code comments from doc comments). -Kay - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Powered by Spark: Concur
Hi, I checked the Powered By wiki too, and Agile Labs should be Agile Lab. The link is wrong too; it should be www.agilelab.it. The description is correct. Thanks a lot Paolo Sent from my Windows Phone From: Denny Lee denny.g@gmail.com Sent: 10/02/2015 07:41 To: Matei Zaharia matei.zaha...@gmail.com Cc: dev@spark.apache.org Subject: Re: Powered by Spark: Concur Thanks Matei - much appreciated! On Mon Feb 09 2015 at 10:23:57 PM Matei Zaharia matei.zaha...@gmail.com wrote: Thanks Denny; added you. Matei On Feb 9, 2015, at 10:11 PM, Denny Lee denny.g@gmail.com wrote: Forgot to add Concur to the Powered by Spark wiki: Concur https://www.concur.com Spark SQL, MLlib Using Spark for travel and expenses analytics and personalization Thanks! Denny
Re: Pull Requests on github
Cool, thanks! Let me know if there are any more core numerical libraries that you'd like to see supporting Spark with optimised natives, using a similar packaging model to netlib-java. I'm interested in fast random number generation next, and I keep wondering if anybody would be interested in paying for FPGA or GPU / APU backends for netlib-java. It would be a *lot* of work but I'd be very interested to talk to an organisation with such a requirement and I'd be able to do it in less time than they would internally. On 10 Feb 2015 04:12, Andrew Ash [via Apache Spark Developers List] wrote: Sam, I see your PR was merged -- many thanks for sending it in and getting it merged! In general for future reference, the most effective way to contribute is outlined on this wiki page: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark On Mon, Feb 9, 2015 at 1:04 AM, Akhil Das [hidden email] wrote: You can open a Jira issue pointing to this PR to get it processed faster. :) Thanks Best Regards On Sat, Feb 7, 2015 at 7:07 AM, fommil [hidden email] wrote: Hi all, I'm the author of netlib-java and I noticed that the documentation in MLlib was out of date and misleading, so I submitted a pull request on github which will hopefully make things easier for everybody to understand the benefits of system optimised natives and how to use them :-) https://github.com/apache/spark/pull/4448 However, it looks like there are a *lot* of outstanding PRs and that this is just a mirror repository. Will somebody please look at my PR and merge into the canonical source (and let me know)? Best regards, Sam -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Pull-Requests-on-github-tp10502.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Pull-Requests-on-github-tp10502p10558.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: Unit tests
Hi Iulian, I think the AkkaUtilsSuite failure that you observed has been fixed in https://issues.apache.org/jira/browse/SPARK-5548 / https://github.com/apache/spark/pull/4343 On February 9, 2015 at 5:47:59 AM, Iulian Dragoș (iulian.dra...@typesafe.com) wrote: Hi Patrick, Thanks for the heads up. I was trying to set up our own infrastructure for testing Spark (essentially, running `run-tests` every night) on EC2. I stumbled upon a number of flaky tests, but none of them look similar to anything in Jira with the flaky-test tag. I wonder if there's something wrong with our infrastructure, or I should simply open Jira tickets with the failures I find. For example, one that appears fairly often on our setup is in AkkaUtilsSuite remote fetch ssl on - untrusted server (exception `ActorNotFound`, instead of `TimeoutException`). thanks, iulian On Fri, Feb 6, 2015 at 9:55 PM, Patrick Wendell pwend...@gmail.com wrote: Hey All, The tests are in a not-amazing state right now due to a few compounding factors: 1. We've merged a large volume of patches recently. 2. The load on jenkins has been relatively high, exposing races and other behavior not seen at lower load. For those not familiar, the main issue is flaky (non-deterministic) test failures. Right now I'm trying to prioritize keeping the PullRequestBuilder in good shape since it will block development if it is down. For other tests, let's try to keep filing JIRAs when we see issues and use the flaky-test label (see http://bit.ly/1yRif9S): I may contact people regarding specific tests. This is a very high priority to get in good shape. This kind of thing is no one's fault but just the result of a lot of concurrent development, and everyone needs to pitch in to get back in a good place. - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org -- -- Iulian Dragos -- Reactive Apps on the JVM www.typesafe.com
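When chasing a flaky test like this, it is usually faster to loop the one suite locally than to rerun all of run-tests. A hedged sketch, assuming the scalatest-maven-plugin wiring in Spark builds of this era supports suite wildcards (the suite name is the one Iulian mentions):

~~~
# Run only AkkaUtilsSuite in the core module; -Dtest=none skips the
# JUnit tests so the ScalaTest wildcard is all that executes.
mvn -pl core -Dtest=none \
  -DwildcardSuites=org.apache.spark.util.AkkaUtilsSuite test
~~~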
Re: Keep or remove Debian packaging in Spark?
I have wondered whether we should sort of deprecate it more officially, since otherwise I think people have the reasonable expectation based on the current code that Spark intends to support complete Debian packaging as part of the upstream build. Having something that's sort-of maintained but no one is helping review and merge patches on it or make it fully functional, IMO that doesn't benefit us or our users. There are a bunch of other projects that are specifically devoted to packaging, so it seems like there is a clear separation of concerns here. On Mon, Feb 9, 2015 at 7:31 AM, Mark Hamstra m...@clearstorydata.com wrote: "it sounds like nobody intends these to be used to actually deploy Spark" I wouldn't go quite that far. What we have now can serve as useful input to a deployment tool like Chef, but the user is then going to need to add some customization or configuration within the context of that tooling to get Spark installed just the way they want. So it is not so much that the current Debian packaging can't be used as that it has never really been intended to be a completely finished product that a newcomer could, for example, use to install Spark completely and quickly to Ubuntu and have a fully-functional environment in which they could then run all of the examples, tutorials, etc. Getting to that level of packaging (and maintenance) is something that I'm not sure we want to do since that is a better fit with Bigtop and the efforts of Cloudera, Hortonworks, MapR, etc. to distribute Spark. On Mon, Feb 9, 2015 at 2:41 AM, Sean Owen so...@cloudera.com wrote: This is a straw poll to assess whether there is support to keep and fix, or remove, the Debian packaging-related config in Spark. I see several oldish outstanding JIRAs relating to problems in the packaging: https://issues.apache.org/jira/browse/SPARK-1799 https://issues.apache.org/jira/browse/SPARK-2614 https://issues.apache.org/jira/browse/SPARK-3624 https://issues.apache.org/jira/browse/SPARK-4436 (and a similar idea about making RPMs) https://issues.apache.org/jira/browse/SPARK-665 The original motivation seems related to Chef: https://issues.apache.org/jira/browse/SPARK-2614?focusedCommentId=14070908&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14070908 Mark's recent comments cast some doubt on whether it is essential: https://github.com/apache/spark/pull/4277#issuecomment-72114226 and in recent conversations I didn't hear dissent to the idea of removing this. Is this still useful enough to fix up? All else equal I'd like to start to walk back some of the complexity of the build, but I don't know how all-else-equal it is. Certainly, it sounds like nobody intends these to be used to actually deploy Spark. I don't doubt it's useful to someone, but can they maintain the packaging logic elsewhere? - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Keep or remove Debian packaging in Spark?
+1 to an official deprecation + redirecting users to some other project that will or already is taking this on. Nate? On Mon Feb 09 2015 at 10:08:27 AM Patrick Wendell pwend...@gmail.com wrote: I have wondered whether we should sort of deprecate it more officially, since otherwise I think people have the reasonable expectation based on the current code that Spark intends to support complete Debian packaging as part of the upstream build. Having something that's sort-of maintained but no one is helping review and merge patches on it or make it fully functional, IMO that doesn't benefit us or our users. There are a bunch of other projects that are specifically devoted to packaging, so it seems like there is a clear separation of concerns here. On Mon, Feb 9, 2015 at 7:31 AM, Mark Hamstra m...@clearstorydata.com wrote: "it sounds like nobody intends these to be used to actually deploy Spark" I wouldn't go quite that far. What we have now can serve as useful input to a deployment tool like Chef, but the user is then going to need to add some customization or configuration within the context of that tooling to get Spark installed just the way they want. So it is not so much that the current Debian packaging can't be used as that it has never really been intended to be a completely finished product that a newcomer could, for example, use to install Spark completely and quickly to Ubuntu and have a fully-functional environment in which they could then run all of the examples, tutorials, etc. Getting to that level of packaging (and maintenance) is something that I'm not sure we want to do since that is a better fit with Bigtop and the efforts of Cloudera, Hortonworks, MapR, etc. to distribute Spark. On Mon, Feb 9, 2015 at 2:41 AM, Sean Owen so...@cloudera.com wrote: This is a straw poll to assess whether there is support to keep and fix, or remove, the Debian packaging-related config in Spark. I see several oldish outstanding JIRAs relating to problems in the packaging: https://issues.apache.org/jira/browse/SPARK-1799 https://issues.apache.org/jira/browse/SPARK-2614 https://issues.apache.org/jira/browse/SPARK-3624 https://issues.apache.org/jira/browse/SPARK-4436 (and a similar idea about making RPMs) https://issues.apache.org/jira/browse/SPARK-665 The original motivation seems related to Chef: https://issues.apache.org/jira/browse/SPARK-2614?focusedCommentId=14070908&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14070908 Mark's recent comments cast some doubt on whether it is essential: https://github.com/apache/spark/pull/4277#issuecomment-72114226 and in recent conversations I didn't hear dissent to the idea of removing this. Is this still useful enough to fix up? All else equal I'd like to start to walk back some of the complexity of the build, but I don't know how all-else-equal it is. Certainly, it sounds like nobody intends these to be used to actually deploy Spark. I don't doubt it's useful to someone, but can they maintain the packaging logic elsewhere? - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
RE: Using CUDA within Spark / boosting linear algebra
Hi Evan, Thank you for the explanation and useful link. I am going to build OpenBLAS, link it with Netlib-java and perform the benchmark again. Do I understand correctly that BIDMat binaries contain statically linked Intel MKL BLAS? It might be the reason why I am able to run BIDMat without having MKL BLAS installed on my server. If it is true, I wonder if it is OK because Intel sells this library. Nevertheless, it seems that in my case precompiled MKL BLAS performs better than precompiled OpenBLAS given that BIDMat and Netlib-java are supposed to be on par with JNI overheads. Though, it might be interesting to link Netlib-java with Intel MKL, as you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday (Netlib-java) would be interested in comparing their libraries. Best regards, Alexander From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Friday, February 06, 2015 5:58 PM To: Ulanov, Alexander Cc: Joseph Bradley; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra I would build OpenBLAS yourself, since good BLAS performance comes from getting cache sizes, etc. set up correctly for your particular hardware - this is often a very tricky process (see, e.g. ATLAS), but we found that on relatively modern Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL. To make sure the right library is getting used, you have to make sure it's first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here. For some examples of getting netlib-java set up on an ec2 node and some example benchmarking code we ran a while back, see: https://github.com/shivaram/matrix-bench In particular - build-openblas-ec2.sh shows you how to build the library and set up symlinks correctly, and scala/run-netlib.sh shows you how to get the path setup and get that library picked up by netlib-java. In this way - you could probably get cuBLAS set up to be used by netlib-java as well. - Evan On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Evan, could you elaborate on how to force BIDMat and netlib-java to load the right blas? For netlib, there are a few JVM flags, such as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can force it to use the Java implementation. Not sure I understand how to force the use of a specific blas (not a specific wrapper for blas). Btw. I have installed openblas (yum install openblas), so I suppose that netlib is using it. From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Friday, February 06, 2015 5:19 PM To: Ulanov, Alexander Cc: Joseph Bradley; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra Getting breeze to pick up the right blas library is critical for performance. I recommend using OpenBLAS (or MKL, if you already have it). It might make sense to force BIDMat to use the same underlying BLAS library as well.
On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Evan, Joseph, I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than netlib-java+breeze (sorry for the weird table formatting):

|A*B size | BIDMat MKL | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
+---+
|100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556 |
|1000x1000*1000x1000 | 0,018320947 | 0,51803557 | 1,638475459 |
|10000x10000*10000x10000 | 23,78046632 | 445,0935211 | 1569,233228 |

Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11. Later I will make tests with Cuda. I need to install a new Cuda version for this purpose. Do you have any ideas why breeze-netlib with native blas is so much slower than BIDMat MKL? Best regards, Alexander From: Joseph Bradley [mailto:jos...@databricks.com] Sent: Thursday, February 05, 2015 5:29 PM To: Ulanov, Alexander Cc: Evan R. Sparks; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra Hi Alexander, Using GPUs with Spark would be very exciting. Small comment: Concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice). Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking. Joseph On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Thank you for
adding some temporary jenkins worker nodes...
...to help w/the build backlog. let's all welcome amp-jenkins-slave-{01..03} back to the fray!
pyspark.daemon issues?
I've noticed a couple of oddities with the pyspark.daemons which are causing us some memory problems within some of our heavy spark jobs, especially when they run at the same time... It seems that there is typically a 1-to-1 ratio of pyspark.daemons to cores per executor during aggregations. By default, spark.python.worker.memory is left at 512MB, after which the remainder of the aggregations are supposed to spill to disk. However: *1)* I'm not entirely sure what cases would result in random numbers of pyspark daemons which do not respect the python worker memory limit. I've seen some go as far as 2GB each (well over the 512MB limit), which is when we run into some crazy memory problems for jobs making use of many cores on each executor. To be clear here, they ARE spilling to disk as well, but also blowing past the memory limits at the same time somehow. *2)* Another scenario specifically relates to when we want to join RDDs, where for example, say there are 4 cores per executor, and therefore 4 pyspark daemons during most aggregations. It seems that if a join occurs, it will spawn 4 additional pyspark daemons as opposed to simply re-using the ones that were already present during the aggregation stage that occurred before it. This, combined with the case where the python worker memory limit is not strictly respected, can lead to using far more memory per node than expected. The fact that the python worker memory appears to use memory *outside* of the executor memory is what poses the biggest challenge for preventing memory depletion on a node. Is there something obvious, or some environment variable I may have missed, that could potentially help with one/both of the above memory concerns? Alternatively, any suggestions would be greatly appreciated! :) Thanks, Mark. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/pyspark-daemon-issues-tp10533.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
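For anyone hitting the same wall, a hedged sketch of the two knobs discussed above; the numbers are illustrative and my_job.py is a hypothetical application, not a recommendation from the list:

~~~
# Cap each daemon's pre-spill aggregation buffer, and reduce cores per
# executor so fewer pyspark daemons run concurrently on each node.
bin/spark-submit \
  --conf spark.python.worker.memory=256m \
  --executor-memory 4g \
  --executor-cores 2 \
  my_job.py
~~~

Since spark.python.worker.memory is held outside the executor JVM heap, per-node capacity planning has to budget roughly cores-per-executor times that value on top of spark.executor.memory.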
Re: multi-line comment style
Btw, I think allowing `/* ... */` without the leading `*` in lines is also useful. Check this line: https://github.com/apache/spark/pull/4259/files#diff-e9dcb3b5f3de77fc31b3aff7831110eaR55, where we put the R commands that can reproduce the test result. It is easier if we write in the following style:

~~~
/*
  Using the following R code to load the data and train the model using glmnet package.

  library(glmnet)
  data <- read.csv("path", header=FALSE, stringsAsFactors=FALSE)
  features <- as.matrix(data.frame(as.numeric(data$V2), as.numeric(data$V3)))
  label <- as.numeric(data$V1)
  weights <- coef(glmnet(features, label, family="gaussian", alpha = 0, lambda = 0))
*/
~~~

So people can copy-paste the R commands directly. Xiangrui On Mon, Feb 9, 2015 at 12:18 PM, Xiangrui Meng men...@gmail.com wrote: I like the `/* .. */` style more, because it is easier for IDEs to recognize it as a block comment. If you press enter in the comment block with the `//` style, IDEs won't add `//` for you. -Xiangrui On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin r...@databricks.com wrote: We should update the style doc to reflect what we have in most places (which I think is //). On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: FWIW I like the multi-line // over /* */ from a purely style standpoint. The Google Java style guide[1] has some comment about code formatting tools working better with /* */, but there don't seem to be any strong arguments for one over the other that I can find. Thanks Shivaram [1] https://google-styleguide.googlecode.com/svn/trunk/javaguide.html#s4.8.6.1-block-comment-style On Wed, Feb 4, 2015 at 2:05 PM, Patrick Wendell pwend...@gmail.com wrote: Personally I have no opinion, but agree it would be nice to standardize. - Patrick On Wed, Feb 4, 2015 at 1:58 PM, Sean Owen so...@cloudera.com wrote: One thing Marcelo pointed out to me is that the // style does not interfere with commenting out blocks of code with /* */, which is a small good thing. I am also accustomed to // style for multiline, and reserve /** */ for javadoc / scaladoc. Meaning, seeing the /* */ style inline always looks a little funny to me. On Wed, Feb 4, 2015 at 3:53 PM, Kay Ousterhout kayousterh...@gmail.com wrote: Hi all, The Spark Style Guide https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide says multi-line comments should be formatted as:

/*
 * This is a
 * very
 * long comment.
 */

But in my experience, we almost always use // for multi-line comments:

// This is a
// very
// long comment.

Here are some examples:
- Recent commit by Reynold, king of style: https://github.com/apache/spark/commit/bebf4c42bef3e75d31ffce9bfdb331c16f34ddb1#diff-d616b5496d1a9f648864f4ab0db5a026R58
- RDD.scala: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L361
- DAGScheduler.scala: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L281

Any objections to me updating the style guide to reflect this? As with other style issues, I think consistency here is helpful (and formatting multi-line comments as // does nicely visually distinguish code comments from doc comments).
-Kay - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: run time exceptions in Spark 1.2.0 manual build together with OpenStack hadoop driver
Old releases can't be changed, but new ones can. This was merged into the 1.3 branch for the upcoming 1.3.0 release. If you really had to, you could do some surgery on existing distributions to swap in/out Jackson. On Mon, Feb 9, 2015 at 11:22 AM, Gil Vernik g...@il.ibm.com wrote: Hi All, I understand that https://github.com/apache/spark/pull/3938 was closed and merged into Spark? And this is supposed to fix the Jackson issue. If so, is there any way to update binary distributions of Spark so that they will contain this fix? Current binary versions of Spark available for download were built with Jackson 1.8.8, which makes them impossible to use with Hadoop 2.6.0 jars. Thanks, Gil Vernik. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: run time exceptions in Spark 1.2.0 manual build together with OpenStack hadoop driver
Hi All, I understand that https://github.com/apache/spark/pull/3938 was closed and merged into Spark? And this is supposed to fix the Jackson issue. If so, is there any way to update binary distributions of Spark so that they will contain this fix? Current binary versions of Spark available for download were built with Jackson 1.8.8, which makes them impossible to use with Hadoop 2.6.0 jars. Thanks, Gil Vernik. From: Sean Owen so...@cloudera.com To: Ted Yu yuzhih...@gmail.com Cc: Gil Vernik/Haifa/IBM@IBMIL, dev dev@spark.apache.org Date: 18/01/2015 08:23 PM Subject: Re: run time exceptions in Spark 1.2.0 manual build together with OpenStack hadoop driver Agree, I think this can / should be fixed with a slightly more conservative version of https://github.com/apache/spark/pull/3938 related to SPARK-5108. On Sun, Jan 18, 2015 at 3:41 PM, Ted Yu yuzhih...@gmail.com wrote: Please take a look at SPARK-4048 and SPARK-5108 Cheers On Sat, Jan 17, 2015 at 10:26 PM, Gil Vernik g...@il.ibm.com wrote: Hi, I took the source code of Spark 1.2.0 and tried to build it together with hadoop-openstack.jar (to allow Spark access to OpenStack Swift). I used Hadoop 2.6.0. The build was fine without problems, however at run time, while trying to access the swift:// namespace, I got an exception:

java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass
  at org.codehaus.jackson.map.introspect.JacksonAnnotationIntrospector.findDeserializationType(JacksonAnnotationIntrospector.java:524)
  at org.codehaus.jackson.map.deser.BasicDeserializerFactory.modifyTypeByAnnotation(BasicDeserializerFactory.java:732)
  ...and the long stack trace goes here

Digging into the problem, I saw the following: Jackson versions 1.9.X are not backward compatible; in particular, they removed the JsonClass annotation. Hadoop 2.6.0 uses jackson-asl version 1.9.13, while Spark references an older version of Jackson. This is in the main pom.xml of Spark 1.2.0:

<dependency> <!-- Matches the version of jackson-core-asl pulled in by avro -->
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-mapper-asl</artifactId>
  <version>1.8.8</version>
</dependency>

It references version 1.8.8, which is not compatible with Hadoop 2.6.0. If we change the version to 1.9.13, then all will work fine and there will be no run time exceptions while accessing Swift. The following change will solve the problem:

<dependency> <!-- Matches the version of jackson-core-asl pulled in by avro -->
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-mapper-asl</artifactId>
  <version>1.9.13</version>
</dependency>

I am trying to resolve this somehow so people will not get into this issue. Is there any particular need in Spark for Jackson 1.8.8 and not 1.9.13? Can we remove 1.8.8 and put 1.9.13 for Avro? It looks to me that all works fine when Spark is built with Jackson 1.9.13, but I am not an expert and not sure what should be tested. Thanks, Gil Vernik. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
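For anyone attempting the jar surgery Sean mentions, a quick sanity check is to look for the exact class from the stack trace inside the assembly. A minimal sketch; the lib/ path is an assumption about the binary-distribution layout:

~~~
# JsonClass exists in Jackson 1.8.x but was removed in 1.9.x, so its
# presence tells you which generation of Jackson classes is bundled.
unzip -l lib/spark-assembly-*.jar | grep jackson/annotate/JsonClass

# After editing the pom, confirm what Maven actually resolves:
mvn dependency:tree -Dincludes=org.codehaus.jackson
~~~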
Keep or remove Debian packaging in Spark?
This is a straw poll to assess whether there is support to keep and fix, or remove, the Debian packaging-related config in Spark. I see several oldish outstanding JIRAs relating to problems in the packaging: https://issues.apache.org/jira/browse/SPARK-1799 https://issues.apache.org/jira/browse/SPARK-2614 https://issues.apache.org/jira/browse/SPARK-3624 https://issues.apache.org/jira/browse/SPARK-4436 (and a similar idea about making RPMs) https://issues.apache.org/jira/browse/SPARK-665 The original motivation seems related to Chef: https://issues.apache.org/jira/browse/SPARK-2614?focusedCommentId=14070908&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14070908 Mark's recent comments cast some doubt on whether it is essential: https://github.com/apache/spark/pull/4277#issuecomment-72114226 and in recent conversations I didn't hear dissent to the idea of removing this. Is this still useful enough to fix up? All else equal I'd like to start to walk back some of the complexity of the build, but I don't know how all-else-equal it is. Certainly, it sounds like nobody intends these to be used to actually deploy Spark. I don't doubt it's useful to someone, but can they maintain the packaging logic elsewhere? - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: multi-line comment style
Clearly there isn't a strictly optimal commenting format (pros and cons for both '//' and '/*'). My thought is for consistency we should just choose one and put it in the style guide. On Mon, Feb 9, 2015 at 12:25 PM, Xiangrui Meng men...@gmail.com wrote: Btw, I think allowing `/* ... */` without the leading `*` in lines is also useful. Check this line: https://github.com/apache/spark/pull/4259/files#diff-e9dcb3b5f3de77fc31b3aff7831110eaR55, where we put the R commands that can reproduce the test result. It is easier if we write in the following style:

~~~
/*
  Using the following R code to load the data and train the model using glmnet package.

  library(glmnet)
  data <- read.csv("path", header=FALSE, stringsAsFactors=FALSE)
  features <- as.matrix(data.frame(as.numeric(data$V2), as.numeric(data$V3)))
  label <- as.numeric(data$V1)
  weights <- coef(glmnet(features, label, family="gaussian", alpha = 0, lambda = 0))
*/
~~~

So people can copy-paste the R commands directly. Xiangrui On Mon, Feb 9, 2015 at 12:18 PM, Xiangrui Meng men...@gmail.com wrote: I like the `/* .. */` style more, because it is easier for IDEs to recognize it as a block comment. If you press enter in the comment block with the `//` style, IDEs won't add `//` for you. -Xiangrui On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin r...@databricks.com wrote: We should update the style doc to reflect what we have in most places (which I think is //). On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: FWIW I like the multi-line // over /* */ from a purely style standpoint. The Google Java style guide[1] has some comment about code formatting tools working better with /* */, but there don't seem to be any strong arguments for one over the other that I can find. Thanks Shivaram [1] https://google-styleguide.googlecode.com/svn/trunk/javaguide.html#s4.8.6.1-block-comment-style On Wed, Feb 4, 2015 at 2:05 PM, Patrick Wendell pwend...@gmail.com wrote: Personally I have no opinion, but agree it would be nice to standardize. - Patrick On Wed, Feb 4, 2015 at 1:58 PM, Sean Owen so...@cloudera.com wrote: One thing Marcelo pointed out to me is that the // style does not interfere with commenting out blocks of code with /* */, which is a small good thing. I am also accustomed to // style for multiline, and reserve /** */ for javadoc / scaladoc. Meaning, seeing the /* */ style inline always looks a little funny to me. On Wed, Feb 4, 2015 at 3:53 PM, Kay Ousterhout kayousterh...@gmail.com wrote: Hi all, The Spark Style Guide https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide says multi-line comments should be formatted as:

/*
 * This is a
 * very
 * long comment.
 */

But in my experience, we almost always use // for multi-line comments:

// This is a
// very
// long comment.

Here are some examples:
- Recent commit by Reynold, king of style: https://github.com/apache/spark/commit/bebf4c42bef3e75d31ffce9bfdb331c16f34ddb1#diff-d616b5496d1a9f648864f4ab0db5a026R58
- RDD.scala: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L361
- DAGScheduler.scala: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L281

Any objections to me updating the style guide to reflect this? As with other style issues, I think consistency here is helpful (and formatting multi-line comments as // does nicely visually distinguish code comments from doc comments).
-Kay - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[ANNOUNCE] Apache Spark 1.2.1 Released
Hi All, I've just posted the 1.2.1 maintenance release of Apache Spark. We recommend all 1.2.0 users upgrade to this release, as this release includes stability fixes across all components of Spark. - Download this release: http://spark.apache.org/downloads.html - View the release notes: http://spark.apache.org/releases/spark-release-1-2-1.html - Full list of JIRA issues resolved in this release: http://s.apache.org/Mpn Thanks to everyone who helped work on this release! - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: multi-line comment style
Why don't we just pick // as the default (by encouraging it in the style guide), since it is mostly used, and then do not disallow /* */? I don't think it is that big of a deal to have slight deviations here since it is dead simple to understand what's going on. On Mon, Feb 9, 2015 at 1:33 PM, Patrick Wendell pwend...@gmail.com wrote: Clearly there isn't a strictly optimal commenting format (pros and cons for both '//' and '/*'). My thought is for consistency we should just choose one and put it in the style guide. On Mon, Feb 9, 2015 at 12:25 PM, Xiangrui Meng men...@gmail.com wrote: Btw, I think allowing `/* ... */` without the leading `*` in lines is also useful. Check this line: https://github.com/apache/spark/pull/4259/files#diff-e9dcb3b5f3de77fc31b3aff7831110eaR55, where we put the R commands that can reproduce the test result. It is easier if we write in the following style:

~~~
/*
  Using the following R code to load the data and train the model using glmnet package.

  library(glmnet)
  data <- read.csv("path", header=FALSE, stringsAsFactors=FALSE)
  features <- as.matrix(data.frame(as.numeric(data$V2), as.numeric(data$V3)))
  label <- as.numeric(data$V1)
  weights <- coef(glmnet(features, label, family="gaussian", alpha = 0, lambda = 0))
*/
~~~

So people can copy-paste the R commands directly. Xiangrui On Mon, Feb 9, 2015 at 12:18 PM, Xiangrui Meng men...@gmail.com wrote: I like the `/* .. */` style more, because it is easier for IDEs to recognize it as a block comment. If you press enter in the comment block with the `//` style, IDEs won't add `//` for you. -Xiangrui On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin r...@databricks.com wrote: We should update the style doc to reflect what we have in most places (which I think is //). On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: FWIW I like the multi-line // over /* */ from a purely style standpoint. The Google Java style guide[1] has some comment about code formatting tools working better with /* */, but there don't seem to be any strong arguments for one over the other that I can find. Thanks Shivaram [1] https://google-styleguide.googlecode.com/svn/trunk/javaguide.html#s4.8.6.1-block-comment-style On Wed, Feb 4, 2015 at 2:05 PM, Patrick Wendell pwend...@gmail.com wrote: Personally I have no opinion, but agree it would be nice to standardize. - Patrick On Wed, Feb 4, 2015 at 1:58 PM, Sean Owen so...@cloudera.com wrote: One thing Marcelo pointed out to me is that the // style does not interfere with commenting out blocks of code with /* */, which is a small good thing. I am also accustomed to // style for multiline, and reserve /** */ for javadoc / scaladoc. Meaning, seeing the /* */ style inline always looks a little funny to me. On Wed, Feb 4, 2015 at 3:53 PM, Kay Ousterhout kayousterh...@gmail.com wrote: Hi all, The Spark Style Guide https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide says multi-line comments should be formatted as:

/*
 * This is a
 * very
 * long comment.
 */

But in my experience, we almost always use // for multi-line comments:

// This is a
// very
// long comment.
Here are some examples:
- Recent commit by Reynold, king of style: https://github.com/apache/spark/commit/bebf4c42bef3e75d31ffce9bfdb331c16f34ddb1#diff-d616b5496d1a9f648864f4ab0db5a026R58
- RDD.scala: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L361
- DAGScheduler.scala: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L281

Any objections to me updating the style guide to reflect this? As with other style issues, I think consistency here is helpful (and formatting multi-line comments as // does nicely visually distinguish code comments from doc comments). -Kay - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: multi-line comment style
+1 to what Andrew said: I think both make sense in different situations, and trusting developer discretion here is reasonable.
Re: multi-line comment style
In my experience I find it much more natural to use // for short multi-line comments (2 or 3 lines), and /* */ for long multi-line comments involving one or more paragraphs. For short multi-line comments, there is no reason not to use // if it just so happens that your first line exceeded 100 characters and you had to wrap it. For long multi-line comments, however, using // all the way looks really awkward, especially if you have multiple paragraphs. Thus, I would actually suggest that we don't try to pick a favorite, and instead document that both are acceptable. I don't expect developers to follow my exact usage (i.e. with a tipping point of 2-3 lines), so I wouldn't enforce anything specific either.
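A short sketch (hypothetical code) of the convention Andrew describes, with // below the two-to-three-line tipping point and a block comment above it:

~~~
object TippingPoint {

  // A short note that just happened to run past the 100-character line
  // limit and wrap once; rewriting it as a block comment buys nothing.
  def shortCase(): Unit = ()

  /*
   * A longer discussion. The first paragraph might lay out the invariant
   * this method relies on and why callers must hold a lock before entering.
   *
   * A second paragraph might document the failure mode when they do not.
   * Separating paragraphs like this inside // comments reads awkwardly.
   */
  def longCase(): Unit = ()
}
~~~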
Re: Unit tests
Hi Patrick, Thanks for the heads up. I was trying to set up our own infrastructure for testing Spark (essentially, running `run-tests` every night) on EC2. I stumbled upon a number of flaky tests, but none of them look similar to anything in JIRA with the flaky-test tag. I wonder if there's something wrong with our infrastructure, or whether I should simply open JIRA tickets for the failures I find. For example, one that appears fairly often on our setup is in AkkaUtilsSuite, "remote fetch ssl on - untrusted server" (it fails with `ActorNotFound` instead of the expected `TimeoutException`). thanks, iulian

On Fri, Feb 6, 2015 at 9:55 PM, Patrick Wendell pwend...@gmail.com wrote: Hey All, The tests are in a not-amazing state right now due to a few compounding factors:

1. We've merged a large volume of patches recently.
2. The load on Jenkins has been relatively high, exposing races and other behavior not seen at lower load.

For those not familiar, the main issue is flaky (non-deterministic) test failures. Right now I'm trying to prioritize keeping the PullRequestBuilder in good shape, since it will block development if it is down. For other tests, let's keep filing JIRAs when we see issues and use the flaky-test label (see http://bit.ly/1yRif9S). I may contact people regarding specific tests. Getting this in good shape is a very high priority. This kind of thing is no one's fault; it is just the result of a lot of concurrent development, and everyone needs to pitch in to get back to a good place. - Patrick

-- Iulian Dragos -- Reactive Apps on the JVM www.typesafe.com
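For readers unfamiliar with how this class of flakiness shows up, here is a minimal ScalaTest sketch. It is emphatically not the real AkkaUtilsSuite code: `fetchFromUntrustedServer` and its fail-fast flag are hypothetical stand-ins, with `IllegalStateException` standing in for Akka's `ActorNotFound`. The point is only that a test pinning one exact exception type passes or fails depending on which error the environment surfaces first.

~~~
import java.util.concurrent.TimeoutException
import org.scalatest.FunSuite

class RemoteFetchSketchSuite extends FunSuite {

  // Stand-in for the remote fetch. In the real test, whether the failure
  // is fast (lookup rejected) or slow (timeout) is decided by the
  // environment, not by the test itself.
  def fetchFromUntrustedServer(failFast: Boolean): Nothing =
    if (failFast) throw new IllegalStateException("actor not found")
    else throw new TimeoutException("handshake timed out")

  test("remote fetch ssl on - untrusted server") {
    // Passes when the failure is slow; had the environment produced the
    // fast failure instead, intercept would report the wrong exception
    // type, which is the shape of the flakiness described above.
    intercept[TimeoutException] {
      fetchFromUntrustedServer(failFast = false)
    }
  }
}
~~~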
Re: Keep or remove Debian packaging in Spark?
What about this straw-man proposal: deprecate in 1.3 with some kind of message in the build, and remove for 1.4? And add a pointer to any third-party packaging that might provide similar functionality?

On Mon, Feb 9, 2015 at 6:47 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: +1 to an official deprecation + redirecting users to some other project that will take this on, or already is. Nate?

On Mon Feb 09 2015 at 10:08:27 AM Patrick Wendell pwend...@gmail.com wrote: I have wondered whether we should deprecate it more officially, since otherwise I think people have the reasonable expectation, based on the current code, that Spark intends to support complete Debian packaging as part of the upstream build. Having something that's sort-of maintained, but that no one is helping to review and merge patches on or make fully functional, IMO doesn't benefit us or our users. There are a bunch of other projects specifically devoted to packaging, so it seems like there is a clear separation of concerns here.

On Mon, Feb 9, 2015 at 7:31 AM, Mark Hamstra m...@clearstorydata.com wrote: "it sounds like nobody intends these to be used to actually deploy Spark": I wouldn't go quite that far. What we have now can serve as useful input to a deployment tool like Chef, but the user is then going to need to add some customization or configuration within the context of that tooling to get Spark installed just the way they want. So it is not so much that the current Debian packaging can't be used as that it has never really been intended to be a completely finished product that a newcomer could, for example, use to install Spark completely and quickly on Ubuntu and have a fully functional environment in which they could then run all of the examples, tutorials, etc. Getting to that level of packaging (and maintenance) is something I'm not sure we want to do, since it is a better fit for Bigtop and the efforts of Cloudera, Hortonworks, MapR, etc. to distribute Spark.

On Mon, Feb 9, 2015 at 2:41 AM, Sean Owen so...@cloudera.com wrote: This is a straw poll to assess whether there is support to keep and fix, or remove, the Debian packaging-related config in Spark. I see several oldish outstanding JIRAs relating to problems in the packaging:

https://issues.apache.org/jira/browse/SPARK-1799
https://issues.apache.org/jira/browse/SPARK-2614
https://issues.apache.org/jira/browse/SPARK-3624
https://issues.apache.org/jira/browse/SPARK-4436
(and a similar idea about making RPMs) https://issues.apache.org/jira/browse/SPARK-665

The original motivation seems related to Chef: https://issues.apache.org/jira/browse/SPARK-2614?focusedCommentId=14070908&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14070908

Mark's recent comments cast some doubt on whether it is essential: https://github.com/apache/spark/pull/4277#issuecomment-72114226 and in recent conversations I didn't hear dissent to the idea of removing this. Is this still useful enough to fix up? All else equal, I'd like to start to walk back some of the complexity of the build, but I don't know how all-else-equal it is. Certainly, it sounds like nobody intends these to be used to actually deploy Spark. I don't doubt it's useful to someone, but can they maintain the packaging logic elsewhere?
Re: spark-ec2 licensing clarification
+spark dev list

Yes, we should add an Apache license to it -- feel free to open a PR for it. BTW, though it is part of the mesos GitHub account, it is almost exclusively used by the Spark project AFAIK. Longer term it may make sense to move it to a more appropriate GitHub account (we could move it to amplab/, for instance, as the AMPLab provides the Jenkins support etc. too). Thanks Shivaram

On Mon, Feb 9, 2015 at 3:26 PM, Florian Verhein flor...@arkig.com wrote: Hi guys, Are there any plans to add licensing information to the mesos/spark-ec2 repo? I'd assumed it would be Apache 2.0, but then noticed there's no info in the repo. Background: https://issues.apache.org/jira/browse/SPARK-5676 Regards, Florian
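For reference, this is the standard ASF source header such a PR would typically add to each file, shown here as a Scala/Java-style block comment; the mostly Python and shell files in spark-ec2 would carry the same text behind # markers:

~~~
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
~~~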
RE: Keep or remove Debian packaging in Spark?
This could be something: if the Spark community wanted to stop maintaining debs/rpms directly, the project could direct interested efforts towards Apache Bigtop. Right now debs/rpms of the Bigtop components, as well as related tests, are a focus there. Something that would be great is if at least one Spark committer with an interest in config/packaging/testing could be a liaison and point of contact for the Bigtop efforts. The current focus is Bigtop 0.9, which includes Spark 1.2. The JIRA for items included in 0.9 can be found here: https://issues.apache.org/jira/browse/BIGTOP-1480
Mail to u...@spark.apache.org failing
Hi, The mail id given on https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark seems to be failing. Can anyone tell me how to get added to the Powered By Spark list? -- Regards, Meethu