I think there are also some improvements that could be made to
deployability in an enterprise setting.  From my experience:

1. Most places where I deploy Spark don't have internet access.  So I can't
build from source, compile against a different version of Hadoop, etc.,
without doing it locally and then getting the result onto my servers
manually.  This is less of a problem with Spark now that there are binary
distributions, but it's still a problem for using Mesos with Spark.
2. Configuration of Spark is confusing -- settings can live in Java system
properties, environment variables, and command-line parameters, and for the
standalone cluster deployment mode you also need to worry about whether each
setting belongs on the master, the worker, the executor, or the
application/driver program.  And because spark-shell automatically
instantiates a SparkContext, you have to set any system properties in the
init scripts or on the command line with
JAVA_OPTS="-Dspark.executor.memory=8g" and the like (a rough sketch of this
follows the list).  I'm not sure exactly what needs to be done, but it feels
like there are gains to be made in the configuration options here.  Ideally,
I would have one configuration file that works in all four of those places,
and that file would be the only place to make configuration changes.
3. Standalone cluster mode could use improved resiliency for starting,
stopping, and keeping a service alive -- there are custom init scripts that
call each other in a mess of ways: spark-shell, spark-daemon.sh,
spark-daemons.sh, spark-config.sh, spark-env.sh, compute-classpath.sh,
spark-executor, spark-class, run-example, and several others in the bin/
directory.  I would love it if Spark used the Tanuki Service Wrapper, which
is widely used for Java service daemons, supports retries, and installs as
an init script that can be chkconfig'd (a rough sketch of what I have in
mind is at the end of this message).  Let's not re-solve the "how do I keep
a service running?" problem when it's been solved so well by Tanuki -- we
use it at my day job for all our services, and it's also used by
Elasticsearch.  This would help fix the problem where a quick bounce of the
master causes all the workers to self-destruct.
4. Sensitivity to hostname vs. FQDN vs. IP address in the Spark URL -- based
on a previous mailing-list discussion with Matei, this is entirely an Akka
bug, but it'd be awesome if you could use the hostname, the FQDN, or the IP
address in the Spark URL and not have Akka barf at you.
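
To make the confusion in point 2 concrete, here is a rough, untested sketch
of the same executor-memory setting being expressed in the different places
it can live today.  The master hostname, URL, and paths below are just
placeholders, not anything from a real deployment:

    // Minimal sketch of a standalone driver program: the setting has to be a
    // Java system property, and it has to be set before the SparkContext is
    // constructed for it to take effect.
    import org.apache.spark.SparkContext

    object ConfiguredDriver {
      def main(args: Array[String]) {
        System.setProperty("spark.executor.memory", "8g")
        // "spark://master-host:7077" is a placeholder standalone master URL.
        val sc = new SparkContext("spark://master-host:7077", "ConfiguredDriver")
        // ... actual job ...
        sc.stop()
      }
    }

    // For spark-shell the same setting has to come in through the environment,
    // because the shell creates its SparkContext before any user code runs:
    //
    //   JAVA_OPTS="-Dspark.executor.memory=8g" ./spark-shell
    //
    // And for the standalone daemons it typically ends up in conf/spark-env.sh:
    //
    //   SPARK_JAVA_OPTS="-Dspark.executor.memory=8g"

Three different mechanisms for one setting is exactly the kind of thing a
single shared configuration file would clean up.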

I've been telling myself I'd look into these at some point but just haven't
gotten around to them yet.  Some day!  I would prioritize these requests
from most to least important as 3, 2, 4, 1.
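
Since 3 tops that list, here is the rough, untested shape of the wrapper.conf
I have in mind for the standalone master -- the wrapper.* keys are Tanuki's,
but the paths, classpath entries, and log location below are placeholders:

    # Hypothetical wrapper.conf for running the standalone master under the
    # Tanuki Java Service Wrapper (all paths below are placeholders).
    wrapper.java.command=java
    wrapper.java.mainclass=org.tanukisoftware.wrapper.WrapperSimpleApp
    wrapper.java.classpath.1=/opt/spark/conf
    wrapper.java.classpath.2=/opt/spark/lib/spark-assembly.jar
    # WrapperSimpleApp invokes the main() of the first app parameter:
    wrapper.app.parameter.1=org.apache.spark.deploy.master.Master
    # Restart the JVM whenever it exits, rather than letting a bounced master
    # take the rest of the cluster down with it:
    wrapper.on_exit.default=RESTART
    wrapper.restart.delay=5
    wrapper.logfile=/var/log/spark/spark-master-wrapper.log

The workers would get an equivalent file pointing at
org.apache.spark.deploy.worker.Worker with the master URL as a parameter.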

Andrew


On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Or, if you're extremely ambitious, work on implementing Spark Streaming in
> Python.
>
> Sent from Mailbox for iPhone
>
> On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
> > Hi Matt,
> >
> > If you want to get started looking at Spark, I recommend the following
> > resources:
> >
> > - Our issue tracker at http://spark-project.atlassian.net contains some
> > issues marked “Starter” that are good places to jump into. You might be
> > able to take one of those and extend it into a bigger project.
> > - The “contributing to Spark” wiki page covers how to send patches and
> > set up development:
> > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> > - This talk has an intro to Spark internals (video and slides are in the
> > comments): http://www.meetup.com/spark-users/events/94101942/
> >
> > For a longer project, here are some possible ones:
> >
> > - Create a tool that automatically checks which Scala API methods are
> > missing in Python. We had a similar one for Java that was very useful.
> > Even better would be to automatically create wrappers for the Scala ones.
> > - Extend the Spark monitoring UI with profiling information (to sample
> > the workers and say where they’re spending time, or what data structures
> > consume the most memory).
> > - Pick and implement a new machine learning algorithm for MLlib.
> >
> > Matei
> >
> > On Dec 17, 2013, at 10:43 AM, Matthew Cheah <mcch...@uwaterloo.ca> wrote:
> >> Hi everyone,
> >>
> >> During my most recent internship, I worked extensively with Apache Spark,
> >> integrating it into a company's data analytics platform. I've now become
> >> interested in contributing to Apache Spark.
> >>
> >> I'm returning to undergraduate studies in January, and there is an
> >> academic course which is simply a standalone software engineering project.
> >> I was thinking that some contribution to Apache Spark would satisfy my
> >> curiosity, help continue to support the company I interned at, and give me
> >> the academic credits required to graduate, all at the same time. It seems
> >> like too good an opportunity to pass up.
> >>
> >> With that in mind, I have the following questions:
> >>
> >>   1. At this point, is there any self-contained project that I could work
> >>   on within Spark? Ideally, I would work on it independently, in about a
> >>   three-month time frame. This time also needs to accommodate ramping up
> >>   on the Spark codebase and adjusting to the Scala programming language
> >>   and paradigms. The company I worked at primarily used the Java APIs.
> >>   The output needs to be a technical report describing the project
> >>   requirements and the design process I took to engineer the solution for
> >>   those requirements. In particular, it cannot just be a series of
> >>   haphazard patches.
> >>   2. How can I get started with contributing to Spark?
> >>   3. Is there a high-level UML or some other design specification for the
> >>   Spark architecture?
> >>
> >> Thanks! I hope to be of some help =)
> >>
> >> -Matt Cheah
>
