Re: Should new YARN shuffle service work with yarn-alpha?
I think you might be conflating two things. The first error you posted was because YARN didn't standardize the shuffle API in its alpha versions, so our spark-network-yarn module won't compile. We should just disable that module if yarn-alpha is used. spark-network-yarn is a leaf in the intra-module dependency graph, and core doesn't depend on it.

The second error is something else. Maybe you are excluding network-shuffle instead of spark-network-yarn?

On Fri, Nov 7, 2014 at 11:50 PM, Sean Owen so...@cloudera.com wrote:

Hm. Problem is, core depends directly on it:

[error] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SecurityManager.scala:25: object sasl is not a member of package org.apache.spark.network
[error] import org.apache.spark.network.sasl.SecretKeyHolder
[error] ^
[error] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SecurityManager.scala:147: not found: type SecretKeyHolder
[error] private[spark] class SecurityManager(sparkConf: SparkConf) extends Logging with SecretKeyHolder {
[error] ^
[error] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala:29: object RetryingBlockFetcher is not a member of package org.apache.spark.network.shuffle
[error] import org.apache.spark.network.shuffle.{RetryingBlockFetcher, BlockFetchingListener, OneForOneBlockFetcher}
[error] ^
[error] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/deploy/worker/StandaloneWorkerShuffleService.scala:23: object sasl is not a member of package org.apache.spark.network
[error] import org.apache.spark.network.sasl.SaslRpcHandler
[error] ...
[error] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/storage/BlockManager.scala:124: too many arguments for constructor ExternalShuffleClient: (x$1: org.apache.spark.network.util.TransportConf, x$2: String)org.apache.spark.network.shuffle.ExternalShuffleClient
[error] new ExternalShuffleClient(SparkTransportConf.fromSparkConf(conf), securityManager,
[error] ^
[error] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/storage/BlockManager.scala:39: object protocol is not a member of package org.apache.spark.network.shuffle
[error] import org.apache.spark.network.shuffle.protocol.ExecutorShuffleInfo
[error] ^
[error] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/storage/BlockManager.scala:214: not found: type ExecutorShuffleInfo
[error] val shuffleConfig = new ExecutorShuffleInfo(
[error] ...

More refactoring needed? Either to support YARN alpha as a separate shuffle module, or sever this dependency? Of course this goes away when yarn-alpha goes away too.

On Sat, Nov 8, 2014 at 7:45 AM, Patrick Wendell pwend...@gmail.com wrote:

I bet it doesn't work. +1 on isolating its inclusion to only the newer YARN APIs.

- Patrick

On Fri, Nov 7, 2014 at 11:43 PM, Sean Owen so...@cloudera.com wrote:

I noticed that this doesn't compile:

mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package

[error] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[error] /Users/srowen/Documents/spark/network/yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:26: error: cannot find symbol
[error] import org.apache.hadoop.yarn.server.api.AuxiliaryService;
[error] ^
[error] symbol: class AuxiliaryService
[error] location: package org.apache.hadoop.yarn.server.api
[error] /Users/srowen/Documents/spark/network/yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:27: error: cannot find symbol
[error] import org.apache.hadoop.yarn.server.api.ApplicationInitializationContext;
[error] ^
...

Should it work? If not, shall I propose to enable the service only with -Pyarn?
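As a concrete illustration of the fix Patrick describes, here is a rough sketch written as a generic sbt build fragment, not Spark's actual Maven profiles or SparkBuild; the yarn.alpha property and exact module names are assumptions for illustration. Because spark-network-yarn is a leaf module, the aggregate build can simply drop it when targeting the alpha YARN API, and nothing else, including core, has to change.

// Hypothetical build.sbt-style sketch, not Spark's real build definition.
lazy val networkCommon  = project.in(file("network/common"))
lazy val networkShuffle = project.in(file("network/shuffle")).dependsOn(networkCommon)
lazy val core           = project.in(file("core")).dependsOn(networkShuffle)

// network/yarn needs the stable AuxiliaryService API, which yarn-alpha lacks.
lazy val networkYarn    = project.in(file("network/yarn")).dependsOn(networkShuffle)

// Assumed switch for illustration, e.g. sbt -Dyarn.alpha=true package
val yarnAlpha: Boolean = sys.props.get("yarn.alpha").exists(_.toBoolean)

// Drop the leaf module under yarn-alpha; everything else builds as usual.
lazy val buildableProjects: Seq[ProjectReference] =
  Seq[ProjectReference](networkCommon, networkShuffle, core) ++
    (if (yarnAlpha) Nil else Seq[ProjectReference](networkYarn))

lazy val root = (project in file("."))
  .aggregate(buildableProjects: _*)

The Maven-side equivalent is simply not listing network/yarn under the yarn-alpha profile, which is essentially what the follow-up fix in this thread ends up doing.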
Re: EC2 clusters ready in launch time + 30 seconds
I've posted an initial proposal and implementation of using Packer to automate generating Spark AMIs to SPARK-3821 (https://issues.apache.org/jira/browse/SPARK-3821): https://issues.apache.org/jira/browse/SPARK-3821?focusedCommentId=14203280&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14203280

On Mon, Oct 6, 2014 at 7:40 PM, David Rowe davidr...@gmail.com wrote:

I agree with this - there is also the issue of different-sized masters and slaves, numbers of executors for hefty machines (e.g. r3.8xlarges), tagging of instances and volumes (we use this for cost attribution at my workplace), and running in VPCs. I think it might be useful to take a layered approach: the first step could be getting a good, reliable image produced - Nick's ticket - then doing some work on the launch script. Regarding the EMR-like service - I think I heard that AWS is planning to add Spark support to EMR, but as usual there's nothing firm until it's released.

On Tue, Oct 7, 2014 at 7:48 AM, Daniil Osipov daniil.osi...@shazam.com wrote:

I've also been looking at this. Basically, the Spark EC2 script is excellent for small development clusters of several nodes, but isn't suitable for production. It handles instance setup in a single-threaded manner, though it could easily be parallelized. It also doesn't handle failure well, e.g. when an instance fails to start or is taking too long to respond. Our desire was to have an equivalent of the Amazon EMR [1] API that would trigger Spark jobs, including the specified cluster setup. I've done some work towards that end, and it would benefit greatly from an updated AMI.

Dan

[1] http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html

On Sat, Oct 4, 2014 at 7:28 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Thanks for posting that script, Patrick. It looks like a good place to start. Regarding Docker vs. Packer, as I understand it you can use Packer to create Docker containers at the same time as AMIs and other image types.

Nick

On Sat, Oct 4, 2014 at 2:49 AM, Patrick Wendell pwend...@gmail.com wrote:

Hey All,

Just a couple notes. I recently posted a shell script for creating the AMIs from a clean Amazon Linux AMI:

https://github.com/mesos/spark-ec2/blob/v3/create_image.sh

I think I will update the AMIs soon to get the most recent security updates. For spark-ec2's purposes this is probably sufficient (we'll only need to re-create them every few months). However, it would be cool if someone wanted to tackle providing a more general mechanism for defining Spark-friendly images. I had thought that Docker might be a good way to go for something like this - but maybe this Packer thing is good too. For one thing, if we had a standard image we could use it to create containers for running Spark's unit tests, which would be really cool. This would help a lot with the random issues around port and filesystem contention we have in unit tests. I'm not sure whether the long-term place for this would be inside the Spark codebase or a community library or what, but it would definitely be very valuable to have if someone wanted to take it on.

- Patrick

On Fri, Oct 3, 2014 at 5:20 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

FYI: There is an existing issue -- SPARK-3314 (https://issues.apache.org/jira/browse/SPARK-3314) -- about scripting the creation of Spark AMIs. With Packer, it looks like we may be able to script the creation of multiple image types (VMWare, GCE, AMI, Docker, etc.) at once from a single Packer template. That's very cool. I'll be looking into this.

Nick

On Thu, Oct 2, 2014 at 8:23 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Thanks for the update, Nate. I'm looking forward to seeing how these projects turn out. David, Packer looks very, very interesting. I'm gonna look into it more next week.

Nick

On Thu, Oct 2, 2014 at 8:00 PM, Nate D'Amico n...@reactor8.com wrote:

A bit of progress on our end, and a bit of lagging as well. Our guy leading the effort got a little bogged down on a client project updating a Hive/SQL testbed to the latest Spark/Spark SQL, and we're also launching a public service, so we have been a bit scattered recently. Will have some more updates probably after next week. We are planning on taking our client work around Hive/Spark, plus taking over the Bigtop automation work, to modernize it and get it fit for human consumption outside our org. All our work and Puppet modules will be open sourced and documented; hopefully we can start to rally some other folks around the effort who find it useful. Side note: another effort we are looking into is Gradle tests/support.
Re: Should new YARN shuffle service work with yarn-alpha?
Oops, that was my mistake. I had moved network/shuffle into the yarn profile, when it's just network/yarn that should be removed from yarn-alpha. That makes yarn-alpha work. I'll run tests and open a quick JIRA / PR for the change.

On Sat, Nov 8, 2014 at 8:23 AM, Patrick Wendell pwend...@gmail.com wrote:

The second error is something else. Maybe you are excluding network-shuffle instead of spark-network-yarn?
MLlib related query
Hi All,

I would like to contribute some additional ML algorithms to the MLlib library, but I was wondering: are there any research papers that led to the development of these libraries, which are built on Breeze? I see papers for Apache Spark, but not for MLlib.

Thanks,
Manu

--
The greater danger for most of us lies not in setting our aim too high and falling short; but in setting our aim too low, and achieving our mark. - Michelangelo
Re: Should new YARN shuffle service work with yarn-alpha?
Great - I think that should work, but if there are any issues we can definitely fix them up.

On Sat, Nov 8, 2014 at 12:47 AM, Sean Owen so...@cloudera.com wrote:

Oops, that was my mistake. I had moved network/shuffle into the yarn profile, when it's just network/yarn that should be removed from yarn-alpha. That makes yarn-alpha work. I'll run tests and open a quick JIRA / PR for the change.

On Sat, Nov 8, 2014 at 8:23 AM, Patrick Wendell pwend...@gmail.com wrote:

The second error is something else. Maybe you are excluding network-shuffle instead of spark-network-yarn?
Re: Replacing Spark's native scheduler with Sparrow
However, I haven't seen it get as high as the 100ms Michael quoted (maybe that was for jobs with tasks that have much larger objects which take a long time to deserialize?). I was thinking more about the average end-to-end latency for launching a query that has 100s of partitions. It's also quite possible that Spark SQL's task launch overhead is higher, since we have never profiled how much is getting pulled into the closures.
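For anyone who wants to reproduce this kind of number, here is a rough, self-contained sketch (my own illustration, not code from this thread; the partition count, run count, and app name are arbitrary assumptions). It times the end-to-end latency of a trivial job with a few hundred empty tasks, so the measurement is dominated by scheduling, closure serialization, and task launch rather than real work:

import org.apache.spark.{SparkConf, SparkContext}

object TaskLaunchOverhead {
  def main(args: Array[String]): Unit = {
    // Fall back to local mode if no master is supplied via spark-submit.
    val conf = new SparkConf()
      .setAppName("task-launch-overhead")
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)

    val numPartitions = 400 // "100s of partitions", arbitrary choice
    val runs = 20

    // One tiny element per partition; each task does essentially nothing.
    val data = sc.parallelize(1 to numPartitions, numPartitions)

    val latenciesMs = (1 to runs).map { _ =>
      val start = System.nanoTime()
      data.map(_ => ()).count()
      (System.nanoTime() - start) / 1e6
    }

    // Drop the first few runs to let the JVM and executor connections warm up.
    val steady = latenciesMs.drop(5)
    println(f"mean job latency: ${steady.sum / steady.size}%.1f ms over ${steady.size} runs")

    sc.stop()
  }
}

Dividing the steady-state job latency by the number of tasks gives only a crude feel for per-task launch cost; profiling how large the serialized closures actually are would be a separate exercise.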
Re: proposal / discuss: multiple Serializers within a SparkContext?
Ah awesome. Passing custom serializers when persisting an RDD is exactly one of the things I was thinking of.

-Sandy

On Fri, Nov 7, 2014 at 1:19 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

Yup, the JIRA for this was https://issues.apache.org/jira/browse/SPARK-540 (one of our older JIRAs). I think it would be interesting to explore this further. Basically the way to add it into the API would be to add a version of persist() that takes a class other than StorageLevel, say StorageStrategy, which allows specifying a custom serializer or perhaps even a transformation to turn each partition into another representation before saving it. It would also be interesting if this could work directly on an InputStream or ByteBuffer to deal with off-heap data.

One issue we've found with our current Serializer interface, by the way, is that a lot of type information is lost when you pass data to it, so the serializers spend a fair bit of time figuring out what class each written object is. With this model, it would be possible for a serializer to know that all its data is of one type, which is pretty cool, but we might also consider ways of expanding the current Serializer interface to take more info.

Matei

On Nov 7, 2014, at 1:09 AM, Reynold Xin r...@databricks.com wrote:

Technically you can already use a custom serializer for each shuffle operation (it is part of the ShuffledRDD). I've seen Matei suggest in the past, on JIRA issues (or GitHub), a storage policy in which you can specify how data should be stored. I think that would be a great API to have in the long run. Designing it won't be trivial though.

On Fri, Nov 7, 2014 at 1:05 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

Hey all,

Was messing around with Spark and Google FlatBuffers for fun, and it got me thinking about Spark and serialization. I know there's been work / talk about in-memory columnar formats in Spark SQL, so maybe there are ways to provide this flexibility already that I've missed? Either way, my thoughts:

Java and Kryo serialization are really nice in that they require almost no extra work on the part of the user. They can also represent complex object graphs with cycles etc. There are situations where other serialization frameworks are more efficient:

* A Hadoop Writable-style format that delineates key-value boundaries and allows for raw comparisons can greatly speed up some shuffle operations by entirely avoiding deserialization until the object hits user code. Writables also probably ser / deser faster than Kryo.
* No-deserialization formats like FlatBuffers and Cap'n Proto address the tradeoff between (1) Java objects that offer fast access but take lots of space and stress GC, and (2) Kryo-serialized buffers that are more compact but take time to deserialize.

The drawbacks of these frameworks are that they require more work from the user to define types, and that they're more restrictive in the reference graphs they can represent.

In large applications, there are probably a few points where a specialized serialization format is useful. But requiring Writables everywhere because they're needed in a particularly intense shuffle is cumbersome. In light of that, would it make sense to enable varying Serializers within an app? It could make sense to choose a serialization framework based both on the objects being serialized and on what they're being serialized for (caching vs. shuffle).

It might be possible to implement this underneath the Serializer interface with some sort of multiplexing serializer that chooses between sub-serializers. Nothing urgent here, but curious to hear others' opinions.

-Sandy
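To make the multiplexing idea concrete, here is a rough sketch, my own illustration rather than an existing Spark API; the SimpleSerializer trait below is a simplified stand-in for Spark's Serializer/SerializerInstance interfaces, which also work on streams. A wrapper serializer picks a sub-serializer based on the runtime class of each record and writes a one-byte tag so the read side knows which sub-serializer to use:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.nio.ByteBuffer

// Simplified stand-in for a serializer; Spark's real interface also exposes
// stream-based methods and ClassLoader-aware deserialization.
trait SimpleSerializer {
  def serialize(obj: AnyRef): ByteBuffer
  def deserialize(bytes: ByteBuffer): AnyRef
}

// Plain Java serialization, used here as a concrete fallback so the sketch runs.
class JavaSimpleSerializer extends SimpleSerializer {
  override def serialize(obj: AnyRef): ByteBuffer = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(obj)
    oos.close()
    ByteBuffer.wrap(bos.toByteArray)
  }
  override def deserialize(bytes: ByteBuffer): AnyRef = {
    val arr = new Array[Byte](bytes.remaining())
    bytes.get(arr)
    new ObjectInputStream(new ByteArrayInputStream(arr)).readObject()
  }
}

// Routes each record to a sub-serializer chosen by class, tagging the bytes so
// deserialization can pick the matching sub-serializer. Assumes < 128 entries.
class MultiplexingSerializer(
    subSerializers: Seq[(Class[_], SimpleSerializer)],
    fallback: SimpleSerializer) extends SimpleSerializer {

  private val FallbackTag: Byte = -1

  private def pick(obj: AnyRef): (Byte, SimpleSerializer) = {
    val idx = subSerializers.indexWhere { case (cls, _) => cls.isInstance(obj) }
    if (idx >= 0) (idx.toByte, subSerializers(idx)._2) else (FallbackTag, fallback)
  }

  override def serialize(obj: AnyRef): ByteBuffer = {
    val (tag, ser) = pick(obj)
    val payload = ser.serialize(obj)
    val out = ByteBuffer.allocate(1 + payload.remaining())
    out.put(tag).put(payload)
    out.flip()
    out
  }

  override def deserialize(bytes: ByteBuffer): AnyRef = {
    val tag = bytes.get()
    val ser = if (tag >= 0) subSerializers(tag.toInt)._2 else fallback
    ser.deserialize(bytes.slice())
  }
}

For example, new MultiplexingSerializer(Seq(classOf[String] -> someStringSerializer), new JavaSimpleSerializer), where someStringSerializer is a hypothetical specialized implementation, would route Strings through the specialized format and everything else through Java serialization. Hooking this in underneath Spark's actual Serializer interface, or choosing per use (caching vs. shuffle) via something like the StorageStrategy Matei describes, is the part that would still need design work.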
[RESULT] [VOTE] Designating maintainers for some Spark components
Thanks everyone for voting on this. With all of the PMC votes being in favor, the vote passes, but there were some concerns that I wanted to address, both for everyone who brought them up and in the wording we will use for this policy.

First, like every Apache project, Spark follows the Apache voting process (http://www.apache.org/foundation/voting.html), wherein all code changes are done by consensus. This means that any PMC member can block a code change on technical grounds, and thus that there is consensus when something goes in. It's absolutely true that every PMC member is responsible for the whole codebase, as Greg said (not least for legal reasons, e.g. making sure it complies with licensing rules), and this idea will not change that. To make this clear, I will include that in the wording on the project page, to make sure new committers and other community members are all aware of it.

What the maintainer model does, instead, is to change the review process, by having a required review from some people on some types of code changes (assuming those people respond in time). Projects can have their own diverse review processes (e.g. some do commit-then-review and others do review-then-commit, some point people to specific reviewers, etc.). This kind of process seems useful to try (and to refine) as the project grows. We will of course evaluate how it goes and respond to any problems.

So to summarize:

- Every committer is responsible for, and more than welcome to review and vote on, every code change. In fact all community members are welcome to do this, and lots are doing it.
- Everyone has the same voting rights on these code changes (namely consensus as described at http://www.apache.org/foundation/voting.html).
- Committers will be asked to run patches that make architectural and API changes by the maintainers before merging.

In practice, none of this matters too much because we are not exactly a hotbed of discord ;), and even in the case of discord, the point of the ASF voting process is to create consensus. The goal is just to have a better structure for reviewing and to minimize the chance of errors.

Here is a tally of the votes:

Binding votes (from PMC): 17 +1, no 0 or -1
Matei Zaharia
Michael Armbrust
Reynold Xin
Patrick Wendell
Andrew Or
Prashant Sharma
Mark Hamstra
Xiangrui Meng
Ankur Dave
Imran Rashid
Jason Dai
Tom Graves
Sean McNamara
Nick Pentreath
Josh Rosen
Kay Ousterhout
Tathagata Das

Non-binding votes: 18 +1, one +0, one -1

+1:
Nan Zhu
Nicholas Chammas
Denny Lee
Cheng Lian
Timothy Chen
Jeremy Freeman
Cheng Hao
Jackylk Likun
Kousuke Saruta
Reza Zadeh
Xuefeng Wu
Witgo
Manoj Babu
Ravindra Pesala
Liquan Pei
Kushal Datta
Davies Liu
Vaquar Khan

+0: Corey Nolet

-1: Greg Stein

I'll send another email when I have a more detailed writeup of this on the website.

Matei