+1 to that (assuming that by "online" Andrew meant an MLlib algorithm that
works with Spark Streaming).
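Something you can look into is implementing a streaming KMeans. Maybe you
can re-use a lot of the offline KMeans code in MLlib. For concreteness, the
per-batch update step could look something like this rough sketch (the
StreamingKMeans class and its API here are made up, not existing MLlib
code): assign each point in the batch to its nearest centroid, then move
each centroid toward its batch mean with a fixed weight.

    // Hypothetical sketch of a streaming KMeans update, not existing
    // MLlib code. Each mini-batch nudges the centroids toward the means
    // of the points assigned to them.
    class StreamingKMeans(var centroids: Array[Array[Double]], alpha: Double) {

      private def distSq(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      // Index of the centroid nearest to point p.
      private def closest(p: Array[Double]): Int =
        centroids.indices.minBy(i => distSq(p, centroids(i)))

      // Fold one mini-batch into the model: for each cluster that
      // received points, move its centroid toward the batch mean by alpha.
      def update(batch: Seq[Array[Double]]): Unit =
        batch.groupBy(closest).foreach { case (i, pts) =>
          val mean = pts.transpose.map(_.sum / pts.size).toArray
          centroids(i) = centroids(i).zip(mean).map {
            case (c, m) => (1 - alpha) * c + alpha * m
          }
        }
    }

Wiring that into Spark Streaming would then mean applying update to the
points of each batch (collected to the driver here, which is only
reasonable for modest batch sizes), and the offline KMeans code in MLlib
could supply the initial centroids.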
TD

On Thu, Dec 19, 2013 at 5:33 PM, Andrew Ash <and...@andrewash.com> wrote:
> Sounds like a great choice. It would be particularly impressive if you
> could add the first online learning algorithm (all the current ones are
> offline, I believe) to pave the way for future contributions.
>
> On Thu, Dec 19, 2013 at 8:27 PM, Matthew Cheah <mcch...@uwaterloo.ca> wrote:
> > Thanks a lot everyone! I'm looking into adding an algorithm to MLlib
> > for the project. Nice and self-contained.
> >
> > -Matt Cheah
> >
> > On Thu, Dec 19, 2013 at 12:52 PM, Christopher Nguyen <c...@adatao.com> wrote:
> > > +1 to most of Andrew's suggestions here, and while we're in that
> > > neighborhood, how about generalizing something like "wtf-spark" from
> > > the Bizo team (http://youtu.be/6Sn1xs5DN1Y?t=38m36s)? It may not be
> > > of high academic interest, but it's something people would use many
> > > times a debugging day.
> > >
> > > Or am I behind and something like that is already there in 0.8?
> > >
> > > --
> > > Christopher T. Nguyen
> > > Co-founder & CEO, Adatao <http://adatao.com>
> > > linkedin.com/in/ctnguyen
> > >
> > > On Thu, Dec 19, 2013 at 10:56 AM, Andrew Ash <and...@andrewash.com> wrote:
> > > > I think there are also some improvements that could be made to
> > > > deployability in an enterprise setting. From my experience:
> > > >
> > > > 1. Most places I deploy Spark don't have internet access, so I
> > > > can't build from source, compile against a different version of
> > > > Hadoop, etc. without doing it locally and then getting that onto my
> > > > servers manually. This is less of a problem with Spark now that
> > > > there are binary distributions, but it's still a problem when using
> > > > Mesos with Spark.
> > > > 2. Configuration of Spark is confusing -- you can set configuration
> > > > via Java system properties, environment variables, and command-line
> > > > parameters, and for the standalone cluster deployment mode you need
> > > > to worry about whether these need to be set on the master, the
> > > > worker, the executor, or the application/driver program. Also,
> > > > because spark-shell automatically instantiates a SparkContext, you
> > > > have to set up any system properties in the init scripts or on the
> > > > command line with JAVA_OPTS="-Dspark.executor.memory=8g" etc. (see
> > > > the sketch below this message). I'm not sure what needs to be done,
> > > > but it feels like there are gains to be made in configuration
> > > > options here. Ideally, I would have one configuration file that can
> > > > be used in all four places, and that's the only place to make
> > > > configuration changes.
> > > > 3. Standalone cluster mode could use improved resiliency for
> > > > starting, stopping, and keeping alive a service -- there are custom
> > > > init scripts that call each other in a mess of ways: spark-shell,
> > > > spark-daemon.sh, spark-daemons.sh, spark-config.sh, spark-env.sh,
> > > > compute-classpath.sh, spark-executor, spark-class, run-example, and
> > > > several others in the bin/ directory. I would love it if Spark used
> > > > the Tanuki Service Wrapper, which is widely used for Java service
> > > > daemons and supports retries, installation as init scripts that can
> > > > be chkconfig'd, etc. Let's not re-solve the "how do I keep a
> > > > service running?" problem when it's been done so well by Tanuki --
> > > > we use it at my day job for all our services, plus it's used by
> > > > Elasticsearch. This would help solve the problem where a quick
> > > > bounce of the master causes all the workers to self-destruct.
> > > > 4. Sensitivity to hostname vs. FQDN vs. IP address in the Spark URL
> > > > -- this is entirely an Akka bug based on previous mailing list
> > > > discussion with Matei, but it'd be awesome if you could use either
> > > > the hostname or the FQDN or the IP address in the Spark URL and not
> > > > have Akka barf at you.
> > > >
> > > > I've been telling myself I'd look into these at some point but just
> > > > haven't gotten around to them myself yet. Some day! I would
> > > > prioritize these requests from most- to least-important as 3, 2,
> > > > 4, 1.
> > > >
> > > > Andrew
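Regarding Andrew's point 2 above: the root of the pain is that the current
(0.8-era) API reads settings like spark.executor.memory from JVM system
properties at the moment the SparkContext is constructed, so they have to
be in place before that. A minimal sketch of the driver-side workaround
(the master URL and memory value are placeholders):

    import org.apache.spark.SparkContext

    object ConfigSketch {
      def main(args: Array[String]): Unit = {
        // Settings are read from JVM system properties when the
        // SparkContext is constructed; setting them afterwards has no
        // effect.
        System.setProperty("spark.executor.memory", "8g")

        // "spark://master:7077" is a placeholder standalone-cluster URL.
        val sc = new SparkContext("spark://master:7077", "ConfigSketch")
        try {
          // ... job code ...
        } finally {
          sc.stop()
        }
      }
    }

spark-shell offers no such hook, since it builds the SparkContext for you;
that is why the JAVA_OPTS workaround is needed there.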
> > > > On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> > > > > Or, if you're extremely ambitious, work on implementing Spark
> > > > > Streaming in Python.
> > > > >
> > > > > Sent from Mailbox for iPhone
> > > > >
> > > > > On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> > > > > > Hi Matt,
> > > > > >
> > > > > > If you want to get started looking at Spark, I recommend the
> > > > > > following resources:
> > > > > >
> > > > > > - Our issue tracker at http://spark-project.atlassian.net
> > > > > > contains some issues marked "Starter" that are good places to
> > > > > > jump into. You might be able to take one of those and extend it
> > > > > > into a bigger project.
> > > > > > - The "Contributing to Spark" wiki page covers how to send
> > > > > > patches and set up development:
> > > > > > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> > > > > > - This talk has an intro to Spark internals (video and slides
> > > > > > are in the comments):
> > > > > > http://www.meetup.com/spark-users/events/94101942/
> > > > > >
> > > > > > For a longer project, here are some possible ones:
> > > > > >
> > > > > > - Create a tool that automatically checks which Scala API
> > > > > > methods are missing in Python (see the sketch below this
> > > > > > message). We had a similar one for Java that was very useful.
> > > > > > Even better would be to automatically create wrappers for the
> > > > > > Scala ones.
> > > > > > - Extend the Spark monitoring UI with profiling information (to
> > > > > > sample the workers and say where they're spending time, or what
> > > > > > data structures consume the most memory).
> > > > > > - Pick and implement a new machine learning algorithm for MLlib.
> > > > > >
> > > > > > Matei
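Regarding the first of Matei's longer project ideas: the Scala half could
start as a simple reflection pass over the public API. A rough sketch
(PythonApiCoverage is a made-up name, and the Python method list is assumed
to come from a file produced with Python's dir() on PySpark's RDD class):

    import org.apache.spark.rdd.RDD

    // Rough sketch of an API-coverage checker: list RDD's public Scala
    // methods, then diff against method names extracted from PySpark
    // (passed in as a newline-separated file).
    object PythonApiCoverage {
      def main(args: Array[String]): Unit = {
        val scalaMethods = classOf[RDD[_]].getMethods
          .map(_.getName)
          .filterNot(_.contains("$")) // drop compiler-generated names
          .toSet

        val pythonMethods = scala.io.Source.fromFile(args(0)).getLines().toSet

        (scalaMethods -- pythonMethods).toSeq.sorted
          .foreach(name => println("missing in Python: " + name))
      }
    }

Going from that name diff to auto-generated wrappers would be the harder,
more interesting part of the project.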
> > > > > > On Dec 17, 2013, at 10:43 AM, Matthew Cheah <mcch...@uwaterloo.ca> wrote:
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > During my most recent internship, I worked extensively with
> > > > > > > Apache Spark, integrating it into a company's data analytics
> > > > > > > platform. I've now become interested in contributing to
> > > > > > > Apache Spark.
> > > > > > >
> > > > > > > I'm returning to undergraduate studies in January, and there
> > > > > > > is an academic course which is simply a standalone software
> > > > > > > engineering project. I was thinking that some contribution to
> > > > > > > Apache Spark would satisfy my curiosity, help continue
> > > > > > > supporting the company I interned at, and give me the
> > > > > > > academic credits required to graduate, all at the same time.
> > > > > > > It seems like too good an opportunity to pass up.
> > > > > > >
> > > > > > > With that in mind, I have the following questions:
> > > > > > >
> > > > > > > 1. At this point, is there any self-contained project that I
> > > > > > > could work on within Spark? Ideally, I would work on it
> > > > > > > independently, in about a three-month time frame. This time
> > > > > > > also needs to accommodate ramping up on the Spark codebase
> > > > > > > and adjusting to the Scala programming language and its
> > > > > > > paradigms. The company I worked at primarily used the Java
> > > > > > > APIs. The output needs to be a technical report describing
> > > > > > > the project requirements and the design process I took to
> > > > > > > engineer the solution for those requirements. In particular,
> > > > > > > it cannot just be a series of haphazard patches.
> > > > > > > 2. How can I get started with contributing to Spark?
> > > > > > > 3. Is there a high-level UML or some other design
> > > > > > > specification for the Spark architecture?
> > > > > > >
> > > > > > > Thanks! I hope to be of some help =)
> > > > > > >
> > > > > > > -Matt Cheah