Decision trees, random forests, and Professor Hastie's GBDT R package would also be nice to have... A GBDT algorithm is in 0xdata, and even there the Netflix guys were complaining that it does not scale beyond 1000 trees :)
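The scaling pain is easy to see from the shape of the algorithm:
boosting is inherently sequential, because each new tree is fit to the
residuals of the ensemble built so far. A toy sketch of that loop in
Scala (depth-1 stumps on 1-D data; everything here is made up for
illustration and is not any particular library's API):

    // Toy gradient-boosted regression: each stump fits the residuals.
    object ToyGbdt {
      // Depth-1 tree: predict leftVal if x <= threshold, else rightVal.
      case class Stump(threshold: Double, leftVal: Double, rightVal: Double) {
        def predict(x: Double): Double =
          if (x <= threshold) leftVal else rightVal
      }

      // Pick the split minimizing squared error on the residuals.
      def fitStump(xs: Array[Double], residuals: Array[Double]): Stump = {
        def mean(p: Array[(Double, Double)]): Double =
          if (p.isEmpty) 0.0 else p.map(_._2).sum / p.length
        xs.distinct.sorted.map { t =>
          val (l, r) = xs.zip(residuals).partition(_._1 <= t)
          val stump = Stump(t, mean(l), mean(r))
          val err = xs.zip(residuals).map { case (x, res) =>
            val d = res - stump.predict(x); d * d
          }.sum
          (err, stump)
        }.minBy(_._1)._2
      }

      def main(args: Array[String]): Unit = {
        val xs = Array(1.0, 2.0, 3.0, 4.0, 5.0)
        val ys = Array(1.2, 1.9, 3.1, 3.9, 5.2)
        val shrinkage = 0.5
        var ensemble = List.empty[Stump]
        var residuals = ys.clone()
        for (_ <- 1 to 20) {           // tree i depends on trees 1..i-1
          val stump = fitStump(xs, residuals)
          ensemble ::= stump
          residuals = xs.zip(residuals).map { case (x, r) =>
            r - shrinkage * stump.predict(x)
          }
        }
        def predict(x: Double) =
          ensemble.map(s => shrinkage * s.predict(x)).sum
        xs.foreach(x => println(f"f($x%.1f) = ${predict(x)}%.3f"))
      }
    }

Each tree is cheap to fit in parallel over the data, but the trees
themselves form a chain, so 1000 of them means 1000 dependent passes.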
On Dec 19, 2013 10:18 PM, "Nick Pentreath" <nick.pentre...@gmail.com> wrote:

> Another option would be:
> 1. Add another recommendation model based on mrec's SGD-based model:
>    https://github.com/mendeley/mrec
> 2. Look at the streaming K-means from Mahout and see if that might be
>    integrated or adapted into MLlib
> 3. Work on adding to or refactoring the existing linear model
>    framework, for example adaptive learning rate schedules and the
>    adaptive-norm work from John Langford et al.
> 4. Adding sparse vector/matrix support to MLlib?
>
> Sent from my iPad
>
> On 20 Dec 2013, at 3:46 AM, Tathagata Das
> <tathagata.das1...@gmail.com> wrote:
>
>> +1 to that (assuming that by 'online' Andrew meant an MLlib algorithm
>> fed from Spark Streaming).
>>
>> Something you can look into is implementing a streaming KMeans. Maybe
>> you can re-use a lot of the offline KMeans code in MLlib.
>>
>> TD
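On the streaming KMeans idea: one common formulation treats every
incoming mini-batch as a weighted update to the current centroids, with
a decay factor so old data gradually fades out. A minimal sketch of that
update rule in plain Scala (no Spark plumbing; the names and the decay
scheme are my own assumptions, not anything in MLlib today):

    // Decayed mini-batch update for a streaming k-means model.
    object StreamingKMeansSketch {
      type Point = Array[Double]

      def dist2(a: Point, b: Point): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      def closest(centers: Array[Point], p: Point): Int =
        centers.indices.minBy(i => dist2(centers(i), p))

      // counts carry how much "mass" each center has seen so far;
      // decay < 1.0 down-weights that history before the batch lands.
      def update(centers: Array[Point], counts: Array[Double],
                 batch: Seq[Point], decay: Double): Unit = {
        for (i <- counts.indices) counts(i) *= decay
        for (p <- batch) {
          val i = closest(centers, p)
          counts(i) += 1.0
          val lr = 1.0 / counts(i)  // per-center step size
          for (d <- centers(i).indices)
            centers(i)(d) += lr * (p(d) - centers(i)(d))
        }
      }

      def main(args: Array[String]): Unit = {
        val centers = Array(Array(0.0, 0.0), Array(5.0, 5.0))
        val counts  = Array(1.0, 1.0)
        val batch   = Seq(Array(0.2, -0.1), Array(4.8, 5.3), Array(5.1, 4.9))
        update(centers, counts, batch, decay = 0.9)
        centers.foreach(c => println(c.mkString("(", ", ", ")")))
      }
    }

The per-batch work (assigning points, accumulating per-center sums) is
exactly what the offline KMeans code already does, so re-using it as TD
suggests seems plausible.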
>> On Thu, Dec 19, 2013 at 5:33 PM, Andrew Ash <and...@andrewash.com> wrote:
>>
>>> Sounds like a great choice. It would be particularly impressive if
>>> you could add the first online learning algorithm (all the current
>>> ones are offline, I believe) to pave the way for future
>>> contributions.
>>>
>>> On Thu, Dec 19, 2013 at 8:27 PM, Matthew Cheah <mcch...@uwaterloo.ca> wrote:
>>>
>>>> Thanks a lot, everyone! I'm looking into adding an algorithm to
>>>> MLlib for the project. Nice and self-contained.
>>>>
>>>> -Matt Cheah
>>>>
>>>> On Thu, Dec 19, 2013 at 12:52 PM, Christopher Nguyen <c...@adatao.com> wrote:
>>>>
>>>>> +1 to most of Andrew's suggestions here, and while we're in that
>>>>> neighborhood, how about generalizing something like "wtf-spark"
>>>>> from the Bizo team (http://youtu.be/6Sn1xs5DN1Y?t=38m36s)? It may
>>>>> not be of high academic interest, but it's something people would
>>>>> use many times a day when debugging.
>>>>>
>>>>> Or am I behind, and something like that is already there in 0.8?
>>>>>
>>>>> --
>>>>> Christopher T. Nguyen
>>>>> Co-founder & CEO, Adatao <http://adatao.com>
>>>>> linkedin.com/in/ctnguyen
>>>>>
>>>>> On Thu, Dec 19, 2013 at 10:56 AM, Andrew Ash <and...@andrewash.com> wrote:
>>>>>
>>>>>> I think there are also some improvements that could be made to
>>>>>> deployability in an enterprise setting. From my experience:
>>>>>>
>>>>>> 1. Most places where I deploy Spark don't have internet access, so
>>>>>> I can't build from source, compile against a different version of
>>>>>> Hadoop, etc. without doing it locally and then getting the result
>>>>>> onto my servers manually. This is less of a problem with Spark
>>>>>> itself now that there are binary distributions, but it's still a
>>>>>> problem for using Mesos with Spark.
>>>>>> 2. Configuration of Spark is confusing -- you can set
>>>>>> configuration in Java system properties, environment variables,
>>>>>> and command line parameters, and for the standalone cluster
>>>>>> deployment mode you need to worry about whether these must be set
>>>>>> on the master, the worker, the executor, or the application/driver
>>>>>> program. Also, because spark-shell automatically instantiates a
>>>>>> SparkContext, you have to set up any system properties in the init
>>>>>> scripts or on the command line with
>>>>>> JAVA_OPTS="-Dspark.executor.memory=8g", etc. I'm not sure what
>>>>>> needs to be done, but it feels like there are gains to be made in
>>>>>> the configuration options here. Ideally, I would have one
>>>>>> configuration file that can be used in all four places, and that
>>>>>> would be the only place to make configuration changes.
>>>>>> 3. Standalone cluster mode could use improved resiliency for
>>>>>> starting, stopping, and keeping alive a service -- there are
>>>>>> custom init scripts that call each other in a mess of ways:
>>>>>> spark-shell, spark-daemon.sh, spark-daemons.sh, spark-config.sh,
>>>>>> spark-env.sh, compute-classpath.sh, spark-executor, spark-class,
>>>>>> run-example, and several others in the bin/ directory. I would
>>>>>> love it if Spark used the Tanuki Service Wrapper, which is widely
>>>>>> used for Java service daemons and supports retries, installation
>>>>>> as init scripts that can be chkconfig'd, etc. Let's not re-solve
>>>>>> the "how do I keep a service running?" problem when it's been done
>>>>>> so well by Tanuki -- we use it at my day job for all our services,
>>>>>> and it's also used by Elasticsearch. This would help solve the
>>>>>> problem where a quick bounce of the master causes all the workers
>>>>>> to self-destruct.
>>>>>> 4. Sensitivity to hostname vs. FQDN vs. IP address in the Spark
>>>>>> URL -- this is entirely an Akka bug, based on previous mailing
>>>>>> list discussion with Matei, but it'd be awesome if you could use
>>>>>> either the hostname, the FQDN, or the IP address in the Spark URL
>>>>>> and not have Akka barf at you.
>>>>>>
>>>>>> I've been telling myself I'd look into these at some point but
>>>>>> just haven't gotten around to them yet. Some day! I would
>>>>>> prioritize these requests from most- to least-important as 3, 2,
>>>>>> 4, 1.
>>>>>>
>>>>>> Andrew
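On Andrew's point 2: today the only programmatic knob is Java system
properties, which are read when the SparkContext is constructed, so in
your own driver you can at least keep everything in one place by setting
them just before creating the context. A small example against the
0.8-style API (the master URL and property values are illustrative):

    import org.apache.spark.SparkContext

    object ConfigExample {
      def main(args: Array[String]): Unit = {
        // Properties must be set *before* the SparkContext exists --
        // which is exactly why spark-shell, which builds the context
        // for you, forces them into JAVA_OPTS instead.
        System.setProperty("spark.executor.memory", "8g")
        System.setProperty("spark.serializer",
          "org.apache.spark.serializer.KryoSerializer")

        val sc = new SparkContext("spark://master:7077", "ConfigExample")
        println(sc.parallelize(1 to 1000).count())
        sc.stop()
      }
    }

That still does nothing for the master and worker daemons, though, so
Andrew's "one configuration file read in all four places" is the right
end state.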
>>>>>> On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>>>>>
>>>>>>> Or, if you're extremely ambitious, work on implementing Spark
>>>>>>> Streaming in Python.
>>>>>>>
>>>>>>> Sent from Mailbox for iPhone
>>>>>>>
>>>>>>> On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Matt,
>>>>>>>>
>>>>>>>> If you want to get started looking at Spark, I recommend the
>>>>>>>> following resources:
>>>>>>>>
>>>>>>>> - Our issue tracker at http://spark-project.atlassian.net
>>>>>>>> contains some issues marked "Starter" that are good places to
>>>>>>>> jump into. You might be able to take one of those and extend it
>>>>>>>> into a bigger project.
>>>>>>>> - The "contributing to Spark" wiki page covers how to send
>>>>>>>> patches and set up development:
>>>>>>>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>>>>>>>> - This talk has an intro to Spark internals (video and slides
>>>>>>>> are in the comments):
>>>>>>>> http://www.meetup.com/spark-users/events/94101942/
>>>>>>>>
>>>>>>>> For a longer project, here are some possible ones:
>>>>>>>>
>>>>>>>> - Create a tool that automatically checks which Scala API
>>>>>>>> methods are missing in Python. We had a similar one for Java
>>>>>>>> that was very useful. Even better would be to automatically
>>>>>>>> create wrappers for the Scala ones.
>>>>>>>> - Extend the Spark monitoring UI with profiling information (to
>>>>>>>> sample the workers and say where they're spending time, or what
>>>>>>>> data structures consume the most memory).
>>>>>>>> - Pick and implement a new machine learning algorithm for MLlib.
>>>>>>>>
>>>>>>>> Matei
>>>>>>>>
>>>>>>>> On Dec 17, 2013, at 10:43 AM, Matthew Cheah <mcch...@uwaterloo.ca> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> During my most recent internship, I worked extensively with
>>>>>>>>> Apache Spark, integrating it into a company's data analytics
>>>>>>>>> platform. I've now become interested in contributing to Apache
>>>>>>>>> Spark.
>>>>>>>>>
>>>>>>>>> I'm returning to undergraduate studies in January, and there is
>>>>>>>>> an academic course which is simply a standalone software
>>>>>>>>> engineering project. I was thinking that a contribution to
>>>>>>>>> Apache Spark would satisfy my curiosity, continue to support
>>>>>>>>> the company I interned at, and earn the academic credits I need
>>>>>>>>> to graduate, all at the same time. It seems like too good an
>>>>>>>>> opportunity to pass up.
>>>>>>>>>
>>>>>>>>> With that in mind, I have the following questions:
>>>>>>>>>
>>>>>>>>> 1. At this point, is there any self-contained project that I
>>>>>>>>> could work on within Spark? Ideally, I would work on it
>>>>>>>>> independently, in about a three-month time frame. This time
>>>>>>>>> also needs to accommodate ramping up on the Spark codebase and
>>>>>>>>> adjusting to the Scala programming language and its paradigms
>>>>>>>>> (the company I worked at primarily used the Java APIs). The
>>>>>>>>> output needs to be a technical report describing the project
>>>>>>>>> requirements and the design process I took to engineer the
>>>>>>>>> solution for those requirements. In particular, it cannot just
>>>>>>>>> be a series of haphazard patches.
>>>>>>>>> 2. How can I get started with contributing to Spark?
>>>>>>>>> 3. Is there a high-level UML diagram or some other design
>>>>>>>>> specification for the Spark architecture?
>>>>>>>>>
>>>>>>>>> Thanks! I hope to be of some help =)
>>>>>>>>>
>>>>>>>>> -Matt Cheah
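P.S. On Matei's idea of a tool that checks which Scala API methods are
missing in Python: a crude first cut could reflect over the Scala RDD
class and diff the method names against a dump of PySpark's. A sketch
(the two class names are real; the diffing is deliberately naive -- it
ignores overloads, implicit conversions like PairRDDFunctions, and
intentional naming differences):

    import org.apache.spark.rdd.RDD

    object ApiCoverage {
      def main(args: Array[String]): Unit = {
        // Public methods on the Scala RDD class, via plain Java
        // reflection; dropping $-mangled names removes compiler noise.
        val scalaMethods = classOf[RDD[_]].getMethods
          .map(_.getName)
          .filterNot(_.contains("$"))
          .toSet

        // Produced on the Python side with, e.g.:
        //   python -c "import pyspark; print '\n'.join(dir(pyspark.rdd.RDD))"
        // and passed in as a file with one name per line.
        val pythonMethods = scala.io.Source.fromFile(args(0)).getLines().toSet

        val missing = (scalaMethods -- pythonMethods).toSeq.sorted
        println(missing.size + " Scala RDD methods with no Python counterpart:")
        missing.foreach(m => println("  " + m))
      }
    }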