Re: Should new YARN shuffle service work with yarn-alpha?

2014-11-08 Thread Patrick Wendell
I think you might be conflating two things. The first error you posted
was because YARN didn't standardize the shuffle API in alpha versions
so our spark-network-yarn module won't compile. We should just disable
that module if yarn alpha is used. spark-network-yarn is a leaf in the
intra-module dependency graph, and core doesn't depend on it.

This second error is something else. Maybe you are excluding
network-shuffle instead of spark-network-yarn?



On Fri, Nov 7, 2014 at 11:50 PM, Sean Owen so...@cloudera.com wrote:
 Hm. Problem is, core depends directly on it:

 [error] 
 /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SecurityManager.scala:25:
 object sasl is not a member of package org.apache.spark.network
 [error] import org.apache.spark.network.sasl.SecretKeyHolder
 [error] ^
 [error] 
 /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SecurityManager.scala:147:
 not found: type SecretKeyHolder
 [error] private[spark] class SecurityManager(sparkConf: SparkConf)
 extends Logging with SecretKeyHolder {
 [error]
  ^
 [error] 
 /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala:29:
 object RetryingBlockFetcher is not a member of package
 org.apache.spark.network.shuffle
 [error] import org.apache.spark.network.shuffle.{RetryingBlockFetcher,
 BlockFetchingListener, OneForOneBlockFetcher}
 [error]^
 [error] 
 /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/deploy/worker/StandaloneWorkerShuffleService.scala:23:
 object sasl is not a member of package org.apache.spark.network
 [error] import org.apache.spark.network.sasl.SaslRpcHandler
 [error]

 ...

 [error] 
 /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/storage/BlockManager.scala:124:
 too many arguments for constructor ExternalShuffleClient: (x$1:
 org.apache.spark.network.util.TransportConf, x$2:
 String)org.apache.spark.network.shuffle.ExternalShuffleClient
 [error] new
 ExternalShuffleClient(SparkTransportConf.fromSparkConf(conf),
 securityManager,
 [error] ^
 [error] 
 /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/storage/BlockManager.scala:39:
 object protocol is not a member of package
 org.apache.spark.network.shuffle
 [error] import org.apache.spark.network.shuffle.protocol.ExecutorShuffleInfo
 [error] ^
 [error] 
 /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/storage/BlockManager.scala:214:
 not found: type ExecutorShuffleInfo
 [error] val shuffleConfig = new ExecutorShuffleInfo(
 [error]
 ...


 More refactoring needed? Either to support YARN alpha as a separate
 shuffle module, or sever this dependency?

 Of course this goes away when yarn-alpha goes away too.


 On Sat, Nov 8, 2014 at 7:45 AM, Patrick Wendell pwend...@gmail.com wrote:
 I bet it doesn't work. +1 on isolating its inclusion to only the newer
 YARN APIs.

 - Patrick

 On Fri, Nov 7, 2014 at 11:43 PM, Sean Owen so...@cloudera.com wrote:
 I noticed that this doesn't compile:

  mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package

 [error] warning: [options] bootstrap class path not set in conjunction
 with -source 1.6
 [error] 
 /Users/srowen/Documents/spark/network/yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:26:
 error: cannot find symbol
 [error] import org.apache.hadoop.yarn.server.api.AuxiliaryService;
 [error] ^
 [error]   symbol:   class AuxiliaryService
 [error]   location: package org.apache.hadoop.yarn.server.api
 [error] 
 /Users/srowen/Documents/spark/network/yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:27:
 error: cannot find symbol
 [error] import 
 org.apache.hadoop.yarn.server.api.ApplicationInitializationContext;
 [error] ^
 ...

  Should it work? If not, shall I propose to enable the service only with -Pyarn?




Re: EC2 clusters ready in launch time + 30 seconds

2014-11-08 Thread Nicholas Chammas
I've posted an initial proposal and implementation of using Packer to automate
generating Spark AMIs to SPARK-3821
(https://issues.apache.org/jira/browse/SPARK-3821):
https://issues.apache.org/jira/browse/SPARK-3821?focusedCommentId=14203280&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14203280

On Mon, Oct 6, 2014 at 7:40 PM, David Rowe davidr...@gmail.com wrote:

 I agree with this - there are also the issues of different-sized masters and
 slaves, the number of executors for hefty machines (e.g. r3.8xlarges),
 tagging of instances and volumes (we use this for cost attribution at my
 workplace), and running in VPCs.

 I think it might be useful to take a layered approach: the first step could
 be getting a good, reliable image produced - Nick's ticket - then doing some
 work on the launch script.

 Regarding the EMR-like service - I think I heard that AWS is planning to
 add Spark support to EMR, but as usual there's nothing firm until it's
 released.


 On Tue, Oct 7, 2014 at 7:48 AM, Daniil Osipov daniil.osi...@shazam.com
 wrote:

  I've also been looking at this. Basically, the Spark EC2 script is
  excellent for small development clusters of several nodes, but isn't
  suitable for production. It handles instance setup in a single-threaded
  manner, while it could easily be parallelized. It also doesn't handle
  failure well, e.g. when an instance fails to start or is taking too long to
  respond.

  Our desire was to have an equivalent of the Amazon EMR API [1] that would
  trigger Spark jobs, including specified cluster setup. I've done some work
  towards that end, and it would benefit greatly from an updated AMI.

 Dan

 [1]

 http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html

 On Sat, Oct 4, 2014 at 7:28 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com
  wrote:

  Thanks for posting that script, Patrick. It looks like a good place to
  start.
 
  Regarding Docker vs. Packer, as I understand it you can use Packer to
  create Docker containers at the same time as AMIs and other image types.
 
  Nick
 
 
  On Sat, Oct 4, 2014 at 2:49 AM, Patrick Wendell pwend...@gmail.com
  wrote:
 
   Hey All,
  
    Just a couple of notes. I recently posted a shell script for creating the
    AMIs from a clean Amazon Linux AMI.
  
   https://github.com/mesos/spark-ec2/blob/v3/create_image.sh
  
    I think I will update the AMIs soon to get the most recent security
    updates. For spark-ec2's purposes this is probably sufficient (we'll
    only need to re-create them every few months).
  
    However, it would be cool if someone wanted to tackle providing a more
    general mechanism for defining Spark-friendly images that could be used
    elsewhere as well. I had thought that Docker might be a good way to go
    for something like this - but maybe this Packer thing is good too.
  
    For one thing, if we had a standard image we could use it to create
    containers for running Spark's unit tests, which would be really cool.
    This would help a lot with the random issues around port and filesystem
    contention we have in unit tests.
  
    I'm not sure if the long-term place for this would be inside the Spark
    codebase or a community library or what. But it would definitely be
    very valuable to have if someone wanted to take it on.
  
   - Patrick
  
   On Fri, Oct 3, 2014 at 5:20 PM, Nicholas Chammas
   nicholas.cham...@gmail.com wrote:
FYI: There is an existing issue -- SPARK-3314
https://issues.apache.org/jira/browse/SPARK-3314 -- about scripting the
creation of Spark AMIs.

With Packer, it looks like we may be able to script the creation of
multiple image types (VMWare, GCE, AMI, Docker, etc.) at once from a
single Packer template. That's very cool.

I'll be looking into this.

Nick
   
   
On Thu, Oct 2, 2014 at 8:23 PM, Nicholas Chammas 
   nicholas.cham...@gmail.com
wrote:
   
Thanks for the update, Nate. I'm looking forward to seeing how these
projects turn out.

David, Packer looks very, very interesting. I'm gonna look into it more
next week.

Nick
   
   
On Thu, Oct 2, 2014 at 8:00 PM, Nate D'Amico n...@reactor8.com
  wrote:
   
A bit of progress on our end, a bit of lagging as well. Our guy leading the
effort got a little bogged down on a client project to update a hive/sql
testbed to the latest spark/sparkSQL, and we are also launching a public
service, so we have been a bit scattered recently.

Will have some more updates probably after next week. We are planning on
taking our client work around hive/spark, plus taking over the bigtop
automation work, to modernize it and get it fit for human consumption outside
our org. All our work and puppet modules will be open sourced and documented;
hopefully we'll start to rally some other folks around the effort who find it
useful.

Side note: another effort we are looking into is gradle tests/support.

Re: Should new YARN shuffle service work with yarn-alpha?

2014-11-08 Thread Sean Owen
Oops, that was my mistake. I moved network/shuffle into yarn, when
it's just that network/yarn should be removed from yarn-alpha. That
makes yarn-alpha work. I'll run tests and open a quick JIRA / PR for
the change.

On Sat, Nov 8, 2014 at 8:23 AM, Patrick Wendell pwend...@gmail.com wrote:
 This second error is something else. Maybe you are excluding
 network-shuffle instead of spark-network-yarn?




MLlib related query

2014-11-08 Thread Manu Kaul
Hi All,
I would like to contribute code to the MLlib library with some other ML
algorithms, but I was wondering whether there are any research papers that
led to the development of these libraries using Breeze. I see papers for
Apache Spark, but not for MLlib.

Thanks,
Manu

-- 

The greater danger for most of us lies not in setting our aim too high and
falling short; but in setting our aim too low, and achieving our mark.
- Michelangelo


Re: Should new YARN shuffle service work with yarn-alpha?

2014-11-08 Thread Patrick Wendell
Great - I think that should work, but if there are any issues we can
definitely fix them up.

On Sat, Nov 8, 2014 at 12:47 AM, Sean Owen so...@cloudera.com wrote:
 Oops, that was my mistake. I moved network/shuffle into yarn, when
 it's just that network/yarn should be removed from yarn-alpha. That
 makes yarn-alpha work. I'll run tests and open a quick JIRA / PR for
 the change.

 On Sat, Nov 8, 2014 at 8:23 AM, Patrick Wendell pwend...@gmail.com wrote:
 This second error is something else. Maybe you are excluding
 network-shuffle instead of spark-network-yarn?




Re: Replacing Spark's native scheduler with Sparrow

2014-11-08 Thread Michael Armbrust

 However, I haven't seen it be as
 high as the 100ms Michael quoted (maybe this was for jobs with tasks that
 have much larger objects that take a long time to deserialize?).


I was thinking more about the average end-to-end latency for launching a
query that has 100s of partitions. It's also quite possible that SQL's task
launch overhead is higher since we have never profiled how much is getting
pulled into the closures.


Re: proposal / discuss: multiple Serializers within a SparkContext?

2014-11-08 Thread Sandy Ryza
Ah awesome.  Passing custom serializers when persisting an RDD is exactly
one of the things I was thinking of.

-Sandy

On Fri, Nov 7, 2014 at 1:19 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Yup, the JIRA for this was https://issues.apache.org/jira/browse/SPARK-540
 (one of our older JIRAs). I think it would be interesting to explore this
 further. Basically the way to add it into the API would be to add a version
 of persist() that takes another class than StorageLevel, say
 StorageStrategy, which allows specifying a custom serializer or perhaps
 even a transformation to turn each partition into another representation
 before saving it. It would also be interesting if this could work directly
 on an InputStream or ByteBuffer to deal with off-heap data.
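
 To make the shape of that concrete, here is a purely hypothetical sketch --
 neither StorageStrategy nor this persist() overload exists in Spark today; it
 just illustrates the idea described above:

    // Hypothetical sketch only: these names are not an existing Spark API.
    import org.apache.spark.serializer.Serializer

    case class StorageStrategy(
        useDisk: Boolean = false,
        useMemory: Boolean = true,
        serializer: Option[Serializer] = None,                    // custom serializer for this RDD only
        transform: Option[Iterator[Any] => Iterator[Any]] = None) // re-encode a partition before storing it

    // A persist() overload could then look roughly like:
    //   def persist(strategy: StorageStrategy): RDD[T]
    // e.g. rdd.persist(StorageStrategy(serializer = Some(new KryoSerializer(conf))))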

 One issue we've found with our current Serializer interface by the way is
 that a lot of type information is lost when you pass data to it, so the
 serializers spend a fair bit of time figuring out what class each object
 written is. With this model, it would be possible for a serializer to know
 that all its data is of one type, which is pretty cool, but we might also
 consider ways of expanding the current Serializer interface to take more
 info.

 Matei

  On Nov 7, 2014, at 1:09 AM, Reynold Xin r...@databricks.com wrote:
 
  Technically you can already use a custom serializer for each shuffle
  operation (it is part of the ShuffledRDD). I've seen Matei suggest in the
  past, on JIRA issues (or GitHub), a storage policy in which you can specify
  how data should be stored. I think that would be a great API to have in the
  long run. Designing it won't be trivial though.
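
 A rough sketch of the per-shuffle serializer hook mentioned above (this uses
 the Spark 1.1-era ShuffledRDD developer API, so details may differ across
 versions):

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.rdd.ShuffledRDD
    import org.apache.spark.serializer.KryoSerializer

    val conf = new SparkConf().setAppName("per-shuffle-serializer").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))

    // Build the shuffle dependency explicitly so this one shuffle uses Kryo,
    // independently of whatever spark.serializer is set to globally.
    val shuffled = new ShuffledRDD[Int, Int, Int](pairs, new HashPartitioner(4))
      .setSerializer(new KryoSerializer(conf))

    shuffled.collect()
    sc.stop()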
 
 
  On Fri, Nov 7, 2014 at 1:05 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
 
  Hey all,
 
  Was messing around with Spark and Google FlatBuffers for fun, and it got me
  thinking about Spark and serialization.  I know there's been work / talk
  about in-memory columnar formats for Spark SQL, so maybe there are ways to
  provide this flexibility already that I've missed?  Either way, my
  thoughts:
 
  Java and Kryo serialization are really nice in that they require almost
 no
  extra work on the part of the user.  They can also represent complex
 object
  graphs with cycles etc.
 
  There are situations where other serialization frameworks are more
  efficient:
  * A Hadoop Writable style format that delineates key-value boundaries
 and
  allows for raw comparisons can greatly speed up some shuffle operations
 by
  entirely avoiding deserialization until the object hits user code.
  Writables also probably ser / deser faster than Kryo (see the
  raw-comparison sketch just after this list).
  * No-deserialization formats like FlatBuffers and Cap'n Proto address
 the
  tradeoff between (1) Java objects that offer fast access but take lots
 of
  space and stress GC and (2) Kryo-serialized buffers that are more
 compact
  but take time to deserialize.
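
 As a rough illustration of the raw-comparison point in the first bullet: two
 serialized keys can be ordered byte-by-byte without deserializing either one,
 provided the encoding preserves sort order (this is what Hadoop's
 RawComparator enables; nothing below is an existing Spark API):

    // Compare two serialized keys lexicographically as unsigned bytes,
    // without deserializing them.
    def compareRaw(a: Array[Byte], b: Array[Byte]): Int = {
      val len = math.min(a.length, b.length)
      var i = 0
      while (i < len) {
        val cmp = (a(i) & 0xff) - (b(i) & 0xff)
        if (cmp != 0) return cmp
        i += 1
      }
      a.length - b.length
    }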
 
  The drawbacks of these frameworks are that they require more work from the
  user to define types, and that they're more restrictive in the reference
  graphs they can represent.
 
  In large applications, there are probably a few points where a
  specialized serialization format is useful. But requiring Writables
  everywhere because they're needed in a particularly intense shuffle is
  cumbersome.
 
  In light of that, would it make sense to enable varying Serializers
 within
  an app? It could make sense to choose a serialization framework both
 based
  on the objects being serialized and what they're being serialized for
  (caching vs. shuffle).  It might be possible to implement this
 underneath
  the Serializer interface with some sort of multiplexing serializer that
  chooses between subserializers.
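
 A very rough sketch of what that multiplexing idea could look like -- this
 deliberately uses a simplified interface rather than Spark's actual
 Serializer/SerializerInstance API, just to show the dispatch-by-type shape:

    import java.nio.ByteBuffer

    trait SimpleSerializer {
      def serialize(obj: AnyRef): ByteBuffer
      def deserialize(bytes: ByteBuffer): AnyRef
    }

    // Picks a sub-serializer by the runtime class of the object and prefixes a
    // one-byte tag so deserialization knows which sub-serializer to use.
    class MultiplexingSerializer(
        subSerializers: Seq[(Class[_], SimpleSerializer)],
        fallback: SimpleSerializer) extends SimpleSerializer {

      private def choose(cls: Class[_]): (Byte, SimpleSerializer) = {
        val idx = subSerializers.indexWhere { case (c, _) => c.isAssignableFrom(cls) }
        if (idx >= 0) (idx.toByte, subSerializers(idx)._2) else ((-1).toByte, fallback)
      }

      override def serialize(obj: AnyRef): ByteBuffer = {
        val (tag, ser) = choose(obj.getClass)
        val payload = ser.serialize(obj)
        val out = ByteBuffer.allocate(1 + payload.remaining())
        out.put(tag).put(payload)
        out.flip()
        out
      }

      override def deserialize(bytes: ByteBuffer): AnyRef = {
        val tag = bytes.get()
        val ser = if (tag >= 0) subSerializers(tag)._2 else fallback
        ser.deserialize(bytes)
      }
    }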
 
  Nothing urgent here, but curious to hear others' opinions.
 
  -Sandy
 




[RESULT] [VOTE] Designating maintainers for some Spark components

2014-11-08 Thread Matei Zaharia
Thanks everyone for voting on this. With all of the PMC votes being in favor, the
vote passes, but there were some concerns that I wanted to address, both for
everyone who brought them up and in the wording we will use for this policy.

First, like every Apache project, Spark follows the Apache voting process 
(http://www.apache.org/foundation/voting.html), wherein all code changes are 
done by consensus. This means that any PMC member can block a code change on 
technical grounds, and thus that there is consensus when something goes in. 
It's absolutely true that every PMC member is responsible for the whole 
codebase, as Greg said (not least due to legal reasons, e.g. making sure it 
complies with licensing rules), and this idea will not change that. To make this
clear, I will include that in the wording on the project page, to make sure new 
committers and other community members are all aware of it.

What the maintainer model does, instead, is to change the review process, by 
having a required review from some people on some types of code changes 
(assuming those people respond in time). Projects can have their own diverse 
review processes (e.g. some do commit-then-review and others do 
review-then-commit, some point people to specific reviewers, etc). This kind of 
process seems useful to try (and to refine) as the project grows. We will of 
course evaluate how it goes and respond to any problems.

So to summarize,

- Every committer is responsible for, and more than welcome to review and vote 
on, every code change. In fact all community members are welcome to do this, 
and lots are doing it.
- Everyone has the same voting rights on these code changes (namely consensus 
as described at http://www.apache.org/foundation/voting.html)
- Committers will be asked to run patches that make architectural or API
changes by the maintainers before merging.

In practice, none of this matters too much because we are not exactly a 
hotbed of discord ;), and even in the case of discord, the point of the ASF
voting process is to create consensus. The goal is just to have a better 
structure for reviewing and minimize the chance of errors.

Here is a tally of the votes:

Binding votes (from PMC): 17 +1, no 0 or -1

Matei Zaharia
Michael Armbrust
Reynold Xin
Patrick Wendell
Andrew Or
Prashant Sharma
Mark Hamstra
Xiangrui Meng
Ankur Dave
Imran Rashid
Jason Dai
Tom Graves
Sean McNamara
Nick Pentreath
Josh Rosen
Kay Ousterhout
Tathagata Das

Non-binding votes: 18 +1, one +0, one -1

+1:
Nan Zhu
Nicholas Chammas
Denny Lee
Cheng Lian
Timothy Chen
Jeremy Freeman
Cheng Hao
Jackylk Likun
Kousuke Saruta
Reza Zadeh
Xuefeng Wu
Witgo
Manoj Babu
Ravindra Pesala
Liquan Pei
Kushal Datta
Davies Liu
Vaquar Khan

+0: Corey Nolet

-1: Greg Stein

I'll send another email when I have a more detailed writeup of this on the 
website.

Matei