Some good things to look at, though hopefully #2 will be largely addressed
by https://github.com/apache/incubator-spark/pull/230
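For reference, here is a minimal sketch of the configuration pain in #2
below, assuming that PR replaces ambient system properties with an explicit
configuration object (a SparkConf-style API is sketched here; the actual
names in the PR may differ):

    import org.apache.spark.{SparkConf, SparkContext}

    object ConfigStyles {
      // Today: ambient JVM system properties, which must be set before the
      // SparkContext is created (hence the JAVA_OPTS tricks described below).
      def oldStyle(): SparkContext = {
        System.setProperty("spark.executor.memory", "8g")
        new SparkContext("local", "old-style")
      }

      // With an explicit configuration object: settings travel with the
      // context, so nothing has to be global or set in init scripts.
      def newStyle(): SparkContext = {
        val conf = new SparkConf()
          .setMaster("local")
          .setAppName("new-style")
          .set("spark.executor.memory", "8g")
        new SparkContext(conf)
      }
    }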
On Thu, Dec 19, 2013 at 8:57 PM, Andrew Ash <and...@andrewash.com> wrote:

> I think there are also some improvements that could be made to
> deployability in an enterprise setting. From my experience:
>
> 1. Most places I deploy Spark don't have internet access, so I can't
> build from source, compile against a different version of Hadoop, etc.,
> without doing it locally and then getting the result onto my servers
> manually. This is less of a problem with Spark now that there are binary
> distributions, but it's still a problem for using Mesos with Spark.
>
> 2. Configuration of Spark is confusing -- you can set configuration via
> Java system properties, environment variables, and command-line
> parameters, and for the standalone cluster deployment mode you need to
> worry about whether these need to be set on the master, the worker, the
> executor, or the application/driver program. Also, because spark-shell
> automatically instantiates a SparkContext, you have to set any system
> properties in the init scripts or on the command line with
> JAVA_OPTS="-Dspark.executor.memory=8g", etc. I'm not sure exactly what
> needs to be done, but it feels like there are gains to be made in the
> configuration options here. Ideally, I would have one configuration file
> that can be used in all four places and that is the only place to make
> configuration changes.
>
> 3. Standalone cluster mode could use improved resiliency for starting,
> stopping, and keeping a service alive -- there are custom init scripts
> that call each other in a mess of ways: spark-shell, spark-daemon.sh,
> spark-daemons.sh, spark-config.sh, spark-env.sh, compute-classpath.sh,
> spark-executor, spark-class, run-example, and several others in the bin/
> directory. I would love it if Spark used the Tanuki Service Wrapper,
> which is widely used for Java service daemons and supports retries,
> installation as init scripts that can be chkconfig'd, etc. (a hypothetical
> wrapper.conf sketch follows this message). Let's not re-solve the "how do
> I keep a service running?" problem when it's been done so well by Tanuki
> -- we use it at my day job for all our services, plus it's used by
> Elasticsearch. This would help solve the problem where a quick bounce of
> the master causes all the workers to self-destruct.
>
> 4. Sensitivity to hostname vs. FQDN vs. IP address in the Spark URL --
> this is entirely an Akka bug, based on a previous mailing list discussion
> with Matei, but it'd be awesome if you could use the hostname, the FQDN,
> or the IP address in the Spark URL and not have Akka barf at you.
>
> I've been telling myself I'd look into these at some point but just
> haven't gotten around to them yet. Some day! I would prioritize these
> requests from most- to least-important as 3, 2, 4, 1.
>
> Andrew
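To make the Tanuki suggestion in point 3 concrete, here is a hypothetical
wrapper.conf sketch for running the standalone master under the Java
Service Wrapper; the paths are illustrative assumptions, not Spark's
actual layout:

    # Hypothetical wrapper.conf sketch for a Spark standalone master.
    wrapper.java.command=java
    wrapper.java.mainclass=org.tanukisoftware.wrapper.WrapperSimpleApp

    # Classpath: the wrapper's own jar plus Spark's conf dir and jars
    # (paths here are made up for illustration).
    wrapper.java.classpath.1=/opt/wrapper/lib/wrapper.jar
    wrapper.java.classpath.2=/opt/spark/conf
    wrapper.java.classpath.3=/opt/spark/lib/*.jar

    # The class to run, passed as the first application parameter.
    wrapper.app.parameter.1=org.apache.spark.deploy.master.Master

    # JVM sizing lives here instead of in ad-hoc JAVA_OPTS exports.
    wrapper.java.maxmemory=1024

    # The resiliency piece: restart the JVM whenever it exits abnormally.
    wrapper.on_exit.default=RESTART
    wrapper.restart.delay=5

An init script generated around a file like this can be chkconfig'd like
any other service, with the wrapper handling the keep-alive retries.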
On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath
<nick.pentre...@gmail.com> wrote:

>> Or if you're extremely ambitious, work on implementing Spark Streaming
>> in Python.
>>
>> On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> wrote:
>>
>> > Hi Matt,
>> >
>> > If you want to get started looking at Spark, I recommend the following
>> > resources:
>> >
>> > - Our issue tracker at http://spark-project.atlassian.net contains some
>> > issues marked “Starter” that are good places to jump into. You might be
>> > able to take one of those and extend it into a bigger project.
>> > - The “contributing to Spark” wiki page covers how to send patches and
>> > set up development:
>> > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>> > - This talk has an intro to Spark internals (video and slides are in
>> > the comments): http://www.meetup.com/spark-users/events/94101942/
>> >
>> > For a longer project, here are some possible ones:
>> >
>> > - Create a tool that automatically checks which Scala API methods are
>> > missing in Python (a rough sketch appears at the end of this thread).
>> > We had a similar one for Java that was very useful. Even better would
>> > be to automatically create wrappers for the Scala ones.
>> > - Extend the Spark monitoring UI with profiling information (to sample
>> > the workers and say where they’re spending time, or what data
>> > structures consume the most memory).
>> > - Pick and implement a new machine learning algorithm for MLlib.
>> >
>> > Matei
>> >
>> > On Dec 17, 2013, at 10:43 AM, Matthew Cheah <mcch...@uwaterloo.ca>
>> > wrote:
>> >
>> >> Hi everyone,
>> >>
>> >> During my most recent internship, I worked extensively with Apache
>> >> Spark, integrating it into a company's data analytics platform. I've
>> >> now become interested in contributing to Apache Spark.
>> >>
>> >> I'm returning to undergraduate studies in January, and there is an
>> >> academic course that is simply a standalone software engineering
>> >> project. I was thinking that some contribution to Apache Spark would
>> >> satisfy my curiosity, help continue to support the company I interned
>> >> at, and give me the academic credits required to graduate, all at the
>> >> same time. It seems like too good an opportunity to pass up.
>> >>
>> >> With that in mind, I have the following questions:
>> >>
>> >> 1. At this point, is there any self-contained project that I could
>> >> work on within Spark? Ideally, I would work on it independently, in
>> >> about a three-month time frame. This time also needs to accommodate
>> >> ramping up on the Spark codebase and adjusting to the Scala
>> >> programming language and its paradigms. The company I worked at
>> >> primarily used the Java APIs. The output needs to be a technical
>> >> report describing the project requirements and the design process I
>> >> took to engineer the solution for those requirements. In particular,
>> >> it cannot just be a series of haphazard patches.
>> >> 2. How can I get started with contributing to Spark?
>> >> 3. Is there a high-level UML or some other design specification for
>> >> the Spark architecture?
>> >>
>> >> Thanks! I hope to be of some help =)
>> >>
>> >> -Matt Cheah
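As a rough, hypothetical starting point for the API-coverage tool Matei
suggests above: reflect over the Scala RDD class and diff its public
methods against names pulled from the Python side. The RDD class is real;
the input format and everything else here is an illustrative assumption.

    import java.lang.reflect.Modifier
    import org.apache.spark.rdd.RDD

    // Sketch: enumerate the public methods on the Scala RDD and report the
    // ones not found in a list of Python method names (one per line on
    // stdin, e.g. scraped from python/pyspark/rdd.py).
    object ApiCoverage {
      def main(args: Array[String]) {
        val scalaMethods = classOf[RDD[_]].getMethods
          .filter(m => Modifier.isPublic(m.getModifiers))
          .map(_.getName)
          .toSet
        val pythonMethods = scala.io.Source.stdin.getLines().toSet
        (scalaMethods -- pythonMethods).toSeq.sorted.foreach { name =>
          println("missing in Python: " + name)
        }
      }
    }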