+1 to that (assuming that by "online" Andrew meant an MLlib algorithm that
works with Spark Streaming).
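Something you can look into is implementing a streaming KMeans. Maybe you
can re-use a lot of the offline KMeans code in MLlib. For concreteness, the
per-batch update step could look something like this rough sketch (the
StreamingKMeans class and its API here are made up, not existing MLlib
code): assign each point in the batch to its nearest centroid, then move
each centroid toward its batch mean with a fixed weight.

    // Hypothetical sketch of a streaming KMeans update, not existing
    // MLlib code. Each mini-batch nudges the centroids toward the means
    // of the points assigned to them.
    class StreamingKMeans(var centroids: Array[Array[Double]], alpha: Double) {

      private def distSq(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      // Index of the centroid nearest to point p.
      private def closest(p: Array[Double]): Int =
        centroids.indices.minBy(i => distSq(p, centroids(i)))

      // Fold one mini-batch into the model: for each cluster that
      // received points, move its centroid toward the batch mean by alpha.
      def update(batch: Seq[Array[Double]]): Unit =
        batch.groupBy(closest).foreach { case (i, pts) =>
          val mean = pts.transpose.map(_.sum / pts.size).toArray
          centroids(i) = centroids(i).zip(mean).map {
            case (c, m) => (1 - alpha) * c + alpha * m
          }
        }
    }

Wiring that into Spark Streaming would then mean applying update to the
points of each batch (collected to the driver here, which is only
reasonable for modest batch sizes), and the offline KMeans code in MLlib
could supply the initial centroids.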
TD

On Thu, Dec 19, 2013 at 5:33 PM, Andrew Ash <and...@andrewash.com> wrote:
> Sounds like a great choice. It would be particularly impressive if you
> could add the first online learning algorithm (all the current ones are
> offline, I believe) to pave the way for future contributions.
>
> On Thu, Dec 19, 2013 at 8:27 PM, Matthew Cheah <mcch...@uwaterloo.ca> wrote:
> > Thanks a lot everyone! I'm looking into adding an algorithm to MLlib
> > for the project. Nice and self-contained.
> >
> > -Matt Cheah
> >
> > On Thu, Dec 19, 2013 at 12:52 PM, Christopher Nguyen <c...@adatao.com> wrote:
> > > +1 to most of Andrew's suggestions here, and while we're in that
> > > neighborhood, how about generalizing something like "wtf-spark" from
> > > the Bizo team (http://youtu.be/6Sn1xs5DN1Y?t=38m36s)? It may not be
> > > of high academic interest, but it's something people would use many
> > > times a debugging day.
> > >
> > > Or am I behind and something like that is already there in 0.8?
> > >
> > > --
> > > Christopher T. Nguyen
> > > Co-founder & CEO, Adatao <http://adatao.com>
> > > linkedin.com/in/ctnguyen
> > >
> > > On Thu, Dec 19, 2013 at 10:56 AM, Andrew Ash <and...@andrewash.com> wrote:
> > > > I think there are also some improvements that could be made to
> > > > deployability in an enterprise setting. From my experience:
> > > >
> > > > 1. Most places I deploy Spark don't have internet access, so I
> > > > can't build from source, compile against a different version of
> > > > Hadoop, etc. without doing it locally and then getting that onto my
> > > > servers manually. This is less of a problem with Spark now that
> > > > there are binary distributions, but it's still a problem when using
> > > > Mesos with Spark.
> > > > 2. Configuration of Spark is confusing -- you can set configuration
> > > > via Java system properties, environment variables, and command-line
> > > > parameters, and for the standalone cluster deployment mode you need
> > > > to worry about whether these need to be set on the master, the
> > > > worker, the executor, or the application/driver program. Also,
> > > > because spark-shell automatically instantiates a SparkContext, you
> > > > have to set up any system properties in the init scripts or on the
> > > > command line with JAVA_OPTS="-Dspark.executor.memory=8g" etc. (see
> > > > the sketch below this message). I'm not sure what needs to be done,
> > > > but it feels like there are gains to be made in configuration
> > > > options here. Ideally, I would have one configuration file that can
> > > > be used in all four places, and that's the only place to make
> > > > configuration changes.
> > > > 3. Standalone cluster mode could use improved resiliency for
> > > > starting, stopping, and keeping alive a service -- there are custom
> > > > init scripts that call each other in a mess of ways: spark-shell,
> > > > spark-daemon.sh, spark-daemons.sh, spark-config.sh, spark-env.sh,
> > > > compute-classpath.sh, spark-executor, spark-class, run-example, and
> > > > several others in the bin/ directory. I would love it if Spark used
> > > > the Tanuki Service Wrapper, which is widely used for Java service
> > > > daemons and supports retries, installation as init scripts that can
> > > > be chkconfig'd, etc. Let's not re-solve the "how do I keep a
> > > > service running?" problem when it's been done so well by Tanuki --
> > > > we use it at my day job for all our services, plus it's used by
> > > > Elasticsearch. This would help solve the problem where a quick
> > > > bounce of the master causes all the workers to self-destruct.
> > > > 4. Sensitivity to hostname vs. FQDN vs. IP address in the Spark URL
> > > > -- this is entirely an Akka bug based on previous mailing list
> > > > discussion with Matei, but it'd be awesome if you could use either
> > > > the hostname or the FQDN or the IP address in the Spark URL and not
> > > > have Akka barf at you.
> > > >
> > > > I've been telling myself I'd look into these at some point but just
> > > > haven't gotten around to them myself yet. Some day! I would
> > > > prioritize these requests from most- to least-important as 3, 2,
> > > > 4, 1.
> > > >
> > > > Andrew
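Regarding Andrew's point 2 above: the root of the pain is that the current
(0.8-era) API reads settings like spark.executor.memory from JVM system
properties at the moment the SparkContext is constructed, so they have to
be in place before that. A minimal sketch of the driver-side workaround
(the master URL and memory value are placeholders):

    import org.apache.spark.SparkContext

    object ConfigSketch {
      def main(args: Array[String]): Unit = {
        // Settings are read from JVM system properties when the
        // SparkContext is constructed; setting them afterwards has no
        // effect.
        System.setProperty("spark.executor.memory", "8g")

        // "spark://master:7077" is a placeholder standalone-cluster URL.
        val sc = new SparkContext("spark://master:7077", "ConfigSketch")
        try {
          // ... job code ...
        } finally {
          sc.stop()
        }
      }
    }

spark-shell offers no such hook, since it builds the SparkContext for you;
that is why the JAVA_OPTS workaround is needed there.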
> > > > On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> > > > > Or, if you're extremely ambitious, work on implementing Spark
> > > > > Streaming in Python.
> > > > >
> > > > > Sent from Mailbox for iPhone
> > > > >
> > > > > On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> > > > > > Hi Matt,
> > > > > >
> > > > > > If you want to get started looking at Spark, I recommend the
> > > > > > following resources:
> > > > > >
> > > > > > - Our issue tracker at http://spark-project.atlassian.net
> > > > > > contains some issues marked "Starter" that are good places to
> > > > > > jump into. You might be able to take one of those and extend it
> > > > > > into a bigger project.
> > > > > > - The "Contributing to Spark" wiki page covers how to send
> > > > > > patches and set up development:
> > > > > > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> > > > > > - This talk has an intro to Spark internals (video and slides
> > > > > > are in the comments):
> > > > > > http://www.meetup.com/spark-users/events/94101942/
> > > > > >
> > > > > > For a longer project, here are some possible ones:
> > > > > >
> > > > > > - Create a tool that automatically checks which Scala API
> > > > > > methods are missing in Python (see the sketch below this
> > > > > > message). We had a similar one for Java that was very useful.
> > > > > > Even better would be to automatically create wrappers for the
> > > > > > Scala ones.
> > > > > > - Extend the Spark monitoring UI with profiling information (to
> > > > > > sample the workers and say where they're spending time, or what
> > > > > > data structures consume the most memory).
> > > > > > - Pick and implement a new machine learning algorithm for MLlib.
> > > > > >
> > > > > > Matei
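Regarding the first of Matei's longer project ideas: the Scala half could
start as a simple reflection pass over the public API. A rough sketch
(PythonApiCoverage is a made-up name, and the Python method list is assumed
to come from a file produced with Python's dir() on PySpark's RDD class):

    import org.apache.spark.rdd.RDD

    // Rough sketch of an API-coverage checker: list RDD's public Scala
    // methods, then diff against method names extracted from PySpark
    // (passed in as a newline-separated file).
    object PythonApiCoverage {
      def main(args: Array[String]): Unit = {
        val scalaMethods = classOf[RDD[_]].getMethods
          .map(_.getName)
          .filterNot(_.contains("$")) // drop compiler-generated names
          .toSet

        val pythonMethods = scala.io.Source.fromFile(args(0)).getLines().toSet

        (scalaMethods -- pythonMethods).toSeq.sorted
          .foreach(name => println("missing in Python: " + name))
      }
    }

Going from that name diff to auto-generated wrappers would be the harder,
more interesting part of the project.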
> > > > > > On Dec 17, 2013, at 10:43 AM, Matthew Cheah <mcch...@uwaterloo.ca> wrote:
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > During my most recent internship, I worked extensively with
> > > > > > > Apache Spark, integrating it into a company's data analytics
> > > > > > > platform. I've now become interested in contributing to
> > > > > > > Apache Spark.
> > > > > > >
> > > > > > > I'm returning to undergraduate studies in January, and there
> > > > > > > is an academic course which is simply a standalone software
> > > > > > > engineering project. I was thinking that some contribution to
> > > > > > > Apache Spark would satisfy my curiosity, help continue
> > > > > > > supporting the company I interned at, and give me the
> > > > > > > academic credits required to graduate, all at the same time.
> > > > > > > It seems like too good an opportunity to pass up.
> > > > > > >
> > > > > > > With that in mind, I have the following questions:
> > > > > > >
> > > > > > > 1. At this point, is there any self-contained project that I
> > > > > > > could work on within Spark? Ideally, I would work on it
> > > > > > > independently, in about a three-month time frame. This time
> > > > > > > also needs to accommodate ramping up on the Spark codebase
> > > > > > > and adjusting to the Scala programming language and its
> > > > > > > paradigms. The company I worked at primarily used the Java
> > > > > > > APIs. The output needs to be a technical report describing
> > > > > > > the project requirements and the design process I took to
> > > > > > > engineer the solution for those requirements. In particular,
> > > > > > > it cannot just be a series of haphazard patches.
> > > > > > > 2. How can I get started with contributing to Spark?
> > > > > > > 3. Is there a high-level UML or some other design
> > > > > > > specification for the Spark architecture?
> > > > > > >
> > > > > > > Thanks! I hope to be of some help =)
> > > > > > >
> > > > > > > -Matt Cheah