Decision trees, random forests, and Professor Hastie's GBDT R package would also be nice to have... A GBDT algorithm is in 0xdata, and even there the Netflix guys were complaining that it does not scale beyond 1000 trees :)
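The scaling pain is easy to see from the shape of the algorithm:
boosting is inherently sequential, because each new tree is fit to the
residuals of the ensemble built so far. A toy sketch of that loop in
Scala (depth-1 stumps on 1-D data; everything here is made up for
illustration and is not any particular library's API):

    // Toy gradient-boosted regression: each stump fits the residuals.
    object ToyGbdt {
      // Depth-1 tree: predict leftVal if x <= threshold, else rightVal.
      case class Stump(threshold: Double, leftVal: Double, rightVal: Double) {
        def predict(x: Double): Double =
          if (x <= threshold) leftVal else rightVal
      }

      // Pick the split minimizing squared error on the residuals.
      def fitStump(xs: Array[Double], residuals: Array[Double]): Stump = {
        def mean(p: Array[(Double, Double)]): Double =
          if (p.isEmpty) 0.0 else p.map(_._2).sum / p.length
        xs.distinct.sorted.map { t =>
          val (l, r) = xs.zip(residuals).partition(_._1 <= t)
          val stump = Stump(t, mean(l), mean(r))
          val err = xs.zip(residuals).map { case (x, res) =>
            val d = res - stump.predict(x); d * d
          }.sum
          (err, stump)
        }.minBy(_._1)._2
      }

      def main(args: Array[String]): Unit = {
        val xs = Array(1.0, 2.0, 3.0, 4.0, 5.0)
        val ys = Array(1.2, 1.9, 3.1, 3.9, 5.2)
        val shrinkage = 0.5
        var ensemble = List.empty[Stump]
        var residuals = ys.clone()
        for (_ <- 1 to 20) {           // tree i depends on trees 1..i-1
          val stump = fitStump(xs, residuals)
          ensemble ::= stump
          residuals = xs.zip(residuals).map { case (x, r) =>
            r - shrinkage * stump.predict(x)
          }
        }
        def predict(x: Double) =
          ensemble.map(s => shrinkage * s.predict(x)).sum
        xs.foreach(x => println(f"f($x%.1f) = ${predict(x)}%.3f"))
      }
    }

Each tree is cheap to fit in parallel over the data, but the trees
themselves form a chain, so 1000 of them means 1000 dependent passes.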
On Dec 19, 2013 10:18 PM, "Nick Pentreath" <nick.pentre...@gmail.com> wrote:

> Another option would be:
> 1. Add another recommendation model based on mrec's SGD-based model:
>    https://github.com/mendeley/mrec
> 2. Look at the streaming K-means from Mahout and see if that might be
>    integrated or adapted into MLlib
> 3. Work on adding to or refactoring the existing linear model
>    framework, for example adaptive learning rate schedules and the
>    adaptive-norm work from John Langford et al.
> 4. Adding sparse vector/matrix support to MLlib?
>
> Sent from my iPad
>
> On 20 Dec 2013, at 3:46 AM, Tathagata Das
> <tathagata.das1...@gmail.com> wrote:
>
>> +1 to that (assuming that by 'online' Andrew meant an MLlib algorithm
>> fed from Spark Streaming).
>>
>> Something you can look into is implementing a streaming KMeans. Maybe
>> you can re-use a lot of the offline KMeans code in MLlib.
>>
>> TD
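On the streaming KMeans idea: one common formulation treats every
incoming mini-batch as a weighted update to the current centroids, with
a decay factor so old data gradually fades out. A minimal sketch of that
update rule in plain Scala (no Spark plumbing; the names and the decay
scheme are my own assumptions, not anything in MLlib today):

    // Decayed mini-batch update for a streaming k-means model.
    object StreamingKMeansSketch {
      type Point = Array[Double]

      def dist2(a: Point, b: Point): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      def closest(centers: Array[Point], p: Point): Int =
        centers.indices.minBy(i => dist2(centers(i), p))

      // counts carry how much "mass" each center has seen so far;
      // decay < 1.0 down-weights that history before the batch lands.
      def update(centers: Array[Point], counts: Array[Double],
                 batch: Seq[Point], decay: Double): Unit = {
        for (i <- counts.indices) counts(i) *= decay
        for (p <- batch) {
          val i = closest(centers, p)
          counts(i) += 1.0
          val lr = 1.0 / counts(i)  // per-center step size
          for (d <- centers(i).indices)
            centers(i)(d) += lr * (p(d) - centers(i)(d))
        }
      }

      def main(args: Array[String]): Unit = {
        val centers = Array(Array(0.0, 0.0), Array(5.0, 5.0))
        val counts  = Array(1.0, 1.0)
        val batch   = Seq(Array(0.2, -0.1), Array(4.8, 5.3), Array(5.1, 4.9))
        update(centers, counts, batch, decay = 0.9)
        centers.foreach(c => println(c.mkString("(", ", ", ")")))
      }
    }

The per-batch work (assigning points, accumulating per-center sums) is
exactly what the offline KMeans code already does, so re-using it as TD
suggests seems plausible.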
>> On Thu, Dec 19, 2013 at 5:33 PM, Andrew Ash <and...@andrewash.com> wrote:
>>
>>> Sounds like a great choice. It would be particularly impressive if
>>> you could add the first online learning algorithm (all the current
>>> ones are offline, I believe) to pave the way for future
>>> contributions.
>>>
>>> On Thu, Dec 19, 2013 at 8:27 PM, Matthew Cheah <mcch...@uwaterloo.ca> wrote:
>>>
>>>> Thanks a lot, everyone! I'm looking into adding an algorithm to
>>>> MLlib for the project. Nice and self-contained.
>>>>
>>>> -Matt Cheah
>>>>
>>>> On Thu, Dec 19, 2013 at 12:52 PM, Christopher Nguyen <c...@adatao.com> wrote:
>>>>
>>>>> +1 to most of Andrew's suggestions here, and while we're in that
>>>>> neighborhood, how about generalizing something like "wtf-spark"
>>>>> from the Bizo team (http://youtu.be/6Sn1xs5DN1Y?t=38m36s)? It may
>>>>> not be of high academic interest, but it's something people would
>>>>> use many times a day when debugging.
>>>>>
>>>>> Or am I behind, and something like that is already there in 0.8?
>>>>>
>>>>> --
>>>>> Christopher T. Nguyen
>>>>> Co-founder & CEO, Adatao <http://adatao.com>
>>>>> linkedin.com/in/ctnguyen
>>>>>
>>>>> On Thu, Dec 19, 2013 at 10:56 AM, Andrew Ash <and...@andrewash.com> wrote:
>>>>>
>>>>>> I think there are also some improvements that could be made to
>>>>>> deployability in an enterprise setting. From my experience:
>>>>>>
>>>>>> 1. Most places where I deploy Spark don't have internet access, so
>>>>>> I can't build from source, compile against a different version of
>>>>>> Hadoop, etc. without doing it locally and then getting the result
>>>>>> onto my servers manually. This is less of a problem with Spark
>>>>>> itself now that there are binary distributions, but it's still a
>>>>>> problem for using Mesos with Spark.
>>>>>> 2. Configuration of Spark is confusing -- you can set
>>>>>> configuration in Java system properties, environment variables,
>>>>>> and command line parameters, and for the standalone cluster
>>>>>> deployment mode you need to worry about whether these must be set
>>>>>> on the master, the worker, the executor, or the application/driver
>>>>>> program. Also, because spark-shell automatically instantiates a
>>>>>> SparkContext, you have to set up any system properties in the init
>>>>>> scripts or on the command line with
>>>>>> JAVA_OPTS="-Dspark.executor.memory=8g", etc. I'm not sure what
>>>>>> needs to be done, but it feels like there are gains to be made in
>>>>>> the configuration options here. Ideally, I would have one
>>>>>> configuration file that can be used in all four places, and that
>>>>>> would be the only place to make configuration changes.
>>>>>> 3. Standalone cluster mode could use improved resiliency for
>>>>>> starting, stopping, and keeping alive a service -- there are
>>>>>> custom init scripts that call each other in a mess of ways:
>>>>>> spark-shell, spark-daemon.sh, spark-daemons.sh, spark-config.sh,
>>>>>> spark-env.sh, compute-classpath.sh, spark-executor, spark-class,
>>>>>> run-example, and several others in the bin/ directory. I would
>>>>>> love it if Spark used the Tanuki Service Wrapper, which is widely
>>>>>> used for Java service daemons and supports retries, installation
>>>>>> as init scripts that can be chkconfig'd, etc. Let's not re-solve
>>>>>> the "how do I keep a service running?" problem when it's been done
>>>>>> so well by Tanuki -- we use it at my day job for all our services,
>>>>>> and it's also used by Elasticsearch. This would help solve the
>>>>>> problem where a quick bounce of the master causes all the workers
>>>>>> to self-destruct.
>>>>>> 4. Sensitivity to hostname vs. FQDN vs. IP address in the Spark
>>>>>> URL -- this is entirely an Akka bug, based on previous mailing
>>>>>> list discussion with Matei, but it'd be awesome if you could use
>>>>>> either the hostname, the FQDN, or the IP address in the Spark URL
>>>>>> and not have Akka barf at you.
>>>>>>
>>>>>> I've been telling myself I'd look into these at some point but
>>>>>> just haven't gotten around to them yet. Some day! I would
>>>>>> prioritize these requests from most- to least-important as 3, 2,
>>>>>> 4, 1.
>>>>>>
>>>>>> Andrew
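On Andrew's point 2: today the only programmatic knob is Java system
properties, which are read when the SparkContext is constructed, so in
your own driver you can at least keep everything in one place by setting
them just before creating the context. A small example against the
0.8-style API (the master URL and property values are illustrative):

    import org.apache.spark.SparkContext

    object ConfigExample {
      def main(args: Array[String]): Unit = {
        // Properties must be set *before* the SparkContext exists --
        // which is exactly why spark-shell, which builds the context
        // for you, forces them into JAVA_OPTS instead.
        System.setProperty("spark.executor.memory", "8g")
        System.setProperty("spark.serializer",
          "org.apache.spark.serializer.KryoSerializer")

        val sc = new SparkContext("spark://master:7077", "ConfigExample")
        println(sc.parallelize(1 to 1000).count())
        sc.stop()
      }
    }

That still does nothing for the master and worker daemons, though, so
Andrew's "one configuration file read in all four places" is the right
end state.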
>>>>>> On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>>>>>
>>>>>>> Or, if you're extremely ambitious, work on implementing Spark
>>>>>>> Streaming in Python.
>>>>>>>
>>>>>>> Sent from Mailbox for iPhone
>>>>>>>
>>>>>>> On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Matt,
>>>>>>>>
>>>>>>>> If you want to get started looking at Spark, I recommend the
>>>>>>>> following resources:
>>>>>>>>
>>>>>>>> - Our issue tracker at http://spark-project.atlassian.net
>>>>>>>> contains some issues marked "Starter" that are good places to
>>>>>>>> jump into. You might be able to take one of those and extend it
>>>>>>>> into a bigger project.
>>>>>>>> - The "contributing to Spark" wiki page covers how to send
>>>>>>>> patches and set up development:
>>>>>>>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>>>>>>>> - This talk has an intro to Spark internals (video and slides
>>>>>>>> are in the comments):
>>>>>>>> http://www.meetup.com/spark-users/events/94101942/
>>>>>>>>
>>>>>>>> For a longer project, here are some possible ones:
>>>>>>>>
>>>>>>>> - Create a tool that automatically checks which Scala API
>>>>>>>> methods are missing in Python. We had a similar one for Java
>>>>>>>> that was very useful. Even better would be to automatically
>>>>>>>> create wrappers for the Scala ones.
>>>>>>>> - Extend the Spark monitoring UI with profiling information (to
>>>>>>>> sample the workers and say where they're spending time, or what
>>>>>>>> data structures consume the most memory).
>>>>>>>> - Pick and implement a new machine learning algorithm for MLlib.
>>>>>>>>
>>>>>>>> Matei
>>>>>>>>
>>>>>>>> On Dec 17, 2013, at 10:43 AM, Matthew Cheah <mcch...@uwaterloo.ca> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> During my most recent internship, I worked extensively with
>>>>>>>>> Apache Spark, integrating it into a company's data analytics
>>>>>>>>> platform. I've now become interested in contributing to Apache
>>>>>>>>> Spark.
>>>>>>>>>
>>>>>>>>> I'm returning to undergraduate studies in January, and there is
>>>>>>>>> an academic course which is simply a standalone software
>>>>>>>>> engineering project. I was thinking that a contribution to
>>>>>>>>> Apache Spark would satisfy my curiosity, continue to support
>>>>>>>>> the company I interned at, and earn the academic credits I need
>>>>>>>>> to graduate, all at the same time. It seems like too good an
>>>>>>>>> opportunity to pass up.
>>>>>>>>>
>>>>>>>>> With that in mind, I have the following questions:
>>>>>>>>>
>>>>>>>>> 1. At this point, is there any self-contained project that I
>>>>>>>>> could work on within Spark? Ideally, I would work on it
>>>>>>>>> independently, in about a three-month time frame. This time
>>>>>>>>> also needs to accommodate ramping up on the Spark codebase and
>>>>>>>>> adjusting to the Scala programming language and its paradigms
>>>>>>>>> (the company I worked at primarily used the Java APIs). The
>>>>>>>>> output needs to be a technical report describing the project
>>>>>>>>> requirements and the design process I took to engineer the
>>>>>>>>> solution for those requirements. In particular, it cannot just
>>>>>>>>> be a series of haphazard patches.
>>>>>>>>> 2. How can I get started with contributing to Spark?
>>>>>>>>> 3. Is there a high-level UML diagram or some other design
>>>>>>>>> specification for the Spark architecture?
>>>>>>>>>
>>>>>>>>> Thanks! I hope to be of some help =)
>>>>>>>>>
>>>>>>>>> -Matt Cheah
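P.S. On Matei's idea of a tool that checks which Scala API methods are
missing in Python: a crude first cut could reflect over the Scala RDD
class and diff the method names against a dump of PySpark's. A sketch
(the two class names are real; the diffing is deliberately naive -- it
ignores overloads, implicit conversions like PairRDDFunctions, and
intentional naming differences):

    import org.apache.spark.rdd.RDD

    object ApiCoverage {
      def main(args: Array[String]): Unit = {
        // Public methods on the Scala RDD class, via plain Java
        // reflection; dropping $-mangled names removes compiler noise.
        val scalaMethods = classOf[RDD[_]].getMethods
          .map(_.getName)
          .filterNot(_.contains("$"))
          .toSet

        // Produced on the Python side with, e.g.:
        //   python -c "import pyspark; print '\n'.join(dir(pyspark.rdd.RDD))"
        // and passed in as a file with one name per line.
        val pythonMethods = scala.io.Source.fromFile(args(0)).getLines().toSet

        val missing = (scalaMethods -- pythonMethods).toSeq.sorted
        println(missing.size + " Scala RDD methods with no Python counterpart:")
        missing.foreach(m => println("  " + m))
      }
    }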