Wow yes, that PR #230 looks like exactly what I outlined in #2! I'll leave some comments on there.
Anything going on for service reliability (#3), since apparently someone is reading my mind?

On Thu, Dec 19, 2013 at 2:02 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Some good things to look at, though hopefully #2 will be largely addressed by: https://github.com/apache/incubator-spark/pull/230
>
> — Sent from Mailbox for iPhone
>
> On Thu, Dec 19, 2013 at 8:57 PM, Andrew Ash <and...@andrewash.com> wrote:
>
> > I think there are also some improvements that could be made to deployability in an enterprise setting. From my experience:
> >
> > 1. Most places I deploy Spark in don't have internet access, so I can't build from source, compile against a different version of Hadoop, etc. without doing it locally and then getting that onto my servers manually. This is less of a problem with Spark now that there are binary distributions, but it's still a problem for using Mesos with Spark.
> >
> > 2. Configuration of Spark is confusing -- you can set configuration via Java system properties, environment variables, or command-line parameters, and for the standalone cluster deployment mode you need to worry about whether these need to be set on the master, the worker, the executor, or the application/driver program. Also, because spark-shell automatically instantiates a SparkContext, you have to set up any system properties in the init scripts or on the command line with JAVA_OPTS="-Dspark.executor.memory=8g" etc. I'm not sure what needs to be done, but it feels like there are gains to be made in the configuration options here. Ideally, I would have one configuration file that can be used in all 4 places, and that's the only place to make configuration changes.
> >
> > 3. Standalone cluster mode could use improved resiliency for starting, stopping, and keeping alive a service -- there are custom init scripts that call each other in a mess of ways: spark-shell, spark-daemon.sh, spark-daemons.sh, spark-config.sh, spark-env.sh, compute-classpath.sh, spark-executor, spark-class, run-example, and several others in the bin/ directory. I would love it if Spark used the Tanuki Service Wrapper, which is widely used for Java service daemons and supports retries, installation as init scripts that can be chkconfig'd, etc. Let's not re-solve the "how do I keep a service running?" problem when it's been done so well by Tanuki -- we use it at my day job for all our services, plus it's used by Elasticsearch. This would help solve the problem where a quick bounce of the master causes all the workers to self-destruct.
> >
> > 4. Sensitivity to hostname vs. FQDN vs. IP address in the Spark URL -- this is entirely an Akka bug based on previous mailing list discussion with Matei, but it'd be awesome if you could use either the hostname or the FQDN or the IP address in the Spark URL and not have Akka barf at you.
> >
> > I've been telling myself I'd look into these at some point but just haven't gotten around to them myself yet. Some day! I would prioritize these requests from most- to least-important as 3, 2, 4, 1.
> >
> > Andrew
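To make point 2 above concrete: with the current (pre-#230) API, those settings live in JVM system properties that have to be in place before the SparkContext is constructed, which is why spark-shell users end up reaching for JAVA_OPTS. Below is a minimal sketch of a driver program setting them directly; the master URL, app name, and values are placeholders, not recommendations, and this is presumably the scattering that the pull request above aims to consolidate.

    import org.apache.spark.SparkContext

    object ConfigSketch {
      def main(args: Array[String]) {
        // System properties are read when the SparkContext is constructed,
        // so they must be set before it exists; spark-shell creates the
        // context automatically, hence the JAVA_OPTS workaround.
        System.setProperty("spark.executor.memory", "8g") // heap per executor
        System.setProperty("spark.cores.max", "16")       // cap on cores for this app

        // Placeholder master URL and app name for a standalone cluster.
        val sc = new SparkContext("spark://master:7077", "ConfigSketch")
        println(sc.parallelize(1 to 1000).count())
        sc.stop()
      }
    }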
> > On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> >
> >> Or, if you're extremely ambitious, work on implementing Spark Streaming in Python.
> >>
> >> — Sent from Mailbox for iPhone
> >>
> >> On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >>
> >> > Hi Matt,
> >> >
> >> > If you want to get started looking at Spark, I recommend the following resources:
> >> >
> >> > - Our issue tracker at http://spark-project.atlassian.net contains some issues marked “Starter” that are good places to jump into. You might be able to take one of those and extend it into a bigger project.
> >> > - The “contributing to Spark” wiki page covers how to send patches and set up development: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> >> > - This talk has an intro to Spark internals (video and slides are in the comments): http://www.meetup.com/spark-users/events/94101942/
> >> >
> >> > For a longer project, here are some possible ones:
> >> >
> >> > - Create a tool that automatically checks which Scala API methods are missing in Python. We had a similar one for Java that was very useful. Even better would be to automatically create wrappers for the Scala ones.
> >> > - Extend the Spark monitoring UI with profiling information (to sample the workers and say where they’re spending time, or what data structures consume the most memory).
> >> > - Pick and implement a new machine learning algorithm for MLlib.
> >> >
> >> > Matei
> >> >
> >> > On Dec 17, 2013, at 10:43 AM, Matthew Cheah <mcch...@uwaterloo.ca> wrote:
> >> >
> >> >> Hi everyone,
> >> >>
> >> >> During my most recent internship, I worked extensively with Apache Spark, integrating it into a company's data analytics platform. I've now become interested in contributing to Apache Spark.
> >> >>
> >> >> I'm returning to undergraduate studies in January, and there is an academic course which is simply a standalone software engineering project. I was thinking that some contribution to Apache Spark would satisfy my curiosity, help continue to support the company I interned at, and give me the academic credits required to graduate, all at the same time. It seems like too good an opportunity to pass up.
> >> >>
> >> >> With that in mind, I have the following questions:
> >> >>
> >> >> 1. At this point, is there any self-contained project that I could work on within Spark? Ideally, I would work on it independently, in about a three-month time frame. This time also needs to accommodate ramping up on the Spark codebase and adjusting to the Scala programming language and paradigms. The company I worked at primarily used the Java APIs. The output needs to be a technical report describing the project requirements and the design process I took to engineer the solution for those requirements. In particular, it cannot just be a series of haphazard patches.
> >> >> 2. How can I get started with contributing to Spark?
> >> >> 3. Is there a high-level UML or some other design specification for the Spark architecture?
> >> >>
> >> >> Thanks! I hope to be of some help =)
> >> >>
> >> >> -Matt Cheah
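On Matei's first longer-project suggestion above (a tool that checks which Scala API methods are missing in Python), here is a rough sketch of one possible starting point for the comparison, assuming spark-core is on the classpath and that the PySpark method names have been dumped to a text file beforehand (for example with Python's dir()); the class chosen and the input file are illustrative, not an existing tool.

    import scala.io.Source

    import org.apache.spark.rdd.RDD

    object ApiCoverageSketch {
      def main(args: Array[String]) {
        // Public methods on the Scala RDD class, via plain Java reflection.
        val scalaMethods = classOf[RDD[_]].getMethods.map(_.getName).toSet

        // Method names exposed by the Python RDD class, one per line in a
        // text file produced separately (e.g. with Python's dir(RDD)).
        val pythonMethods = Source.fromFile(args(0)).getLines().toSet

        val missing = (scalaMethods -- pythonMethods).toSeq.sorted
        println("Scala RDD methods with no Python counterpart:")
        missing.foreach(name => println("  " + name))
      }
    }

A real version would also need to translate naming conventions between the two languages and filter out internal or inherited methods, but the core comparison stays about this small.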