Wow yes, that PR #230 looks like exactly what I outlined in #2! I'll leave some comments on there.
Anything going on for service reliability (#3), since apparently someone is reading my mind?

On Thu, Dec 19, 2013 at 2:02 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Some good things to look at, though hopefully #2 will be largely addressed by: https://github.com/apache/incubator-spark/pull/230
>
> — Sent from Mailbox for iPhone
>
> On Thu, Dec 19, 2013 at 8:57 PM, Andrew Ash <and...@andrewash.com> wrote:
>
> > I think there are also some improvements that could be made to deployability in an enterprise setting. From my experience:
> >
> > 1. Most places I deploy Spark in don't have internet access, so I can't build from source, compile against a different version of Hadoop, etc. without doing it locally and then getting that onto my servers manually. This is less of a problem with Spark now that there are binary distributions, but it's still a problem for using Mesos with Spark.
> >
> > 2. Configuration of Spark is confusing -- you can set configuration via Java system properties, environment variables, or command-line parameters, and for the standalone cluster deployment mode you need to worry about whether these need to be set on the master, the worker, the executor, or the application/driver program. Also, because spark-shell automatically instantiates a SparkContext, you have to set up any system properties in the init scripts or on the command line with JAVA_OPTS="-Dspark.executor.memory=8g" etc. I'm not sure what needs to be done, but it feels like there are gains to be made in the configuration options here. Ideally, I would have one configuration file that can be used in all 4 places, and that's the only place to make configuration changes.
> >
> > 3. Standalone cluster mode could use improved resiliency for starting, stopping, and keeping alive a service -- there are custom init scripts that call each other in a mess of ways: spark-shell, spark-daemon.sh, spark-daemons.sh, spark-config.sh, spark-env.sh, compute-classpath.sh, spark-executor, spark-class, run-example, and several others in the bin/ directory. I would love it if Spark used the Tanuki Service Wrapper, which is widely used for Java service daemons and supports retries, installation as init scripts that can be chkconfig'd, etc. Let's not re-solve the "how do I keep a service running?" problem when it's been done so well by Tanuki -- we use it at my day job for all our services, plus it's used by Elasticsearch. This would help solve the problem where a quick bounce of the master causes all the workers to self-destruct.
> >
> > 4. Sensitivity to hostname vs. FQDN vs. IP address in the Spark URL -- this is entirely an Akka bug based on previous mailing list discussion with Matei, but it'd be awesome if you could use either the hostname or the FQDN or the IP address in the Spark URL and not have Akka barf at you.
> >
> > I've been telling myself I'd look into these at some point but just haven't gotten around to them myself yet. Some day! I would prioritize these requests from most- to least-important as 3, 2, 4, 1.
> >
> > Andrew
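To make point 2 above concrete: with the current (pre-#230) API, those settings live in JVM system properties that have to be in place before the SparkContext is constructed, which is why spark-shell users end up reaching for JAVA_OPTS. Below is a minimal sketch of a driver program setting them directly; the master URL, app name, and values are placeholders, not recommendations, and this is presumably the scattering that the pull request above aims to consolidate.

    import org.apache.spark.SparkContext

    object ConfigSketch {
      def main(args: Array[String]) {
        // System properties are read when the SparkContext is constructed,
        // so they must be set before it exists; spark-shell creates the
        // context automatically, hence the JAVA_OPTS workaround.
        System.setProperty("spark.executor.memory", "8g") // heap per executor
        System.setProperty("spark.cores.max", "16")       // cap on cores for this app

        // Placeholder master URL and app name for a standalone cluster.
        val sc = new SparkContext("spark://master:7077", "ConfigSketch")
        println(sc.parallelize(1 to 1000).count())
        sc.stop()
      }
    }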
> > On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> >
> >> Or, if you're extremely ambitious, work on implementing Spark Streaming in Python.
> >>
> >> — Sent from Mailbox for iPhone
> >>
> >> On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >>
> >> > Hi Matt,
> >> >
> >> > If you want to get started looking at Spark, I recommend the following resources:
> >> >
> >> > - Our issue tracker at http://spark-project.atlassian.net contains some issues marked “Starter” that are good places to jump into. You might be able to take one of those and extend it into a bigger project.
> >> > - The “contributing to Spark” wiki page covers how to send patches and set up development: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> >> > - This talk has an intro to Spark internals (video and slides are in the comments): http://www.meetup.com/spark-users/events/94101942/
> >> >
> >> > For a longer project, here are some possible ones:
> >> >
> >> > - Create a tool that automatically checks which Scala API methods are missing in Python. We had a similar one for Java that was very useful. Even better would be to automatically create wrappers for the Scala ones.
> >> > - Extend the Spark monitoring UI with profiling information (to sample the workers and say where they’re spending time, or what data structures consume the most memory).
> >> > - Pick and implement a new machine learning algorithm for MLlib.
> >> >
> >> > Matei
> >> >
> >> > On Dec 17, 2013, at 10:43 AM, Matthew Cheah <mcch...@uwaterloo.ca> wrote:
> >> >
> >> >> Hi everyone,
> >> >>
> >> >> During my most recent internship, I worked extensively with Apache Spark, integrating it into a company's data analytics platform. I've now become interested in contributing to Apache Spark.
> >> >>
> >> >> I'm returning to undergraduate studies in January, and there is an academic course which is simply a standalone software engineering project. I was thinking that some contribution to Apache Spark would satisfy my curiosity, help continue to support the company I interned at, and give me the academic credits required to graduate, all at the same time. It seems like too good an opportunity to pass up.
> >> >>
> >> >> With that in mind, I have the following questions:
> >> >>
> >> >> 1. At this point, is there any self-contained project that I could work on within Spark? Ideally, I would work on it independently, in about a three-month time frame. This time also needs to accommodate ramping up on the Spark codebase and adjusting to the Scala programming language and paradigms. The company I worked at primarily used the Java APIs. The output needs to be a technical report describing the project requirements and the design process I took to engineer the solution for those requirements. In particular, it cannot just be a series of haphazard patches.
> >> >> 2. How can I get started with contributing to Spark?
> >> >> 3. Is there a high-level UML or some other design specification for the Spark architecture?
> >> >>
> >> >> Thanks! I hope to be of some help =)
> >> >>
> >> >> -Matt Cheah
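On Matei's first longer-project suggestion above (a tool that checks which Scala API methods are missing in Python), here is a rough sketch of one possible starting point for the comparison, assuming spark-core is on the classpath and that the PySpark method names have been dumped to a text file beforehand (for example with Python's dir()); the class chosen and the input file are illustrative, not an existing tool.

    import scala.io.Source

    import org.apache.spark.rdd.RDD

    object ApiCoverageSketch {
      def main(args: Array[String]) {
        // Public methods on the Scala RDD class, via plain Java reflection.
        val scalaMethods = classOf[RDD[_]].getMethods.map(_.getName).toSet

        // Method names exposed by the Python RDD class, one per line in a
        // text file produced separately (e.g. with Python's dir(RDD)).
        val pythonMethods = Source.fromFile(args(0)).getLines().toSet

        val missing = (scalaMethods -- pythonMethods).toSeq.sorted
        println("Scala RDD methods with no Python counterpart:")
        missing.foreach(name => println("  " + name))
      }
    }

A real version would also need to translate naming conventions between the two languages and filter out internal or inherited methods, but the core comparison stays about this small.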