Hopefully that's not too ambitious a title. I'm starting a new thread here to discuss, at least conceptually, possible implementations of Scala traits and/or abstract classes for Classifiers/Clusterers/Recommenders. The idea would be to lay these out wherever possible, so that porting existing algorithms to the Scala DSL and developing new ones in it is as easy and as uniform as possible.
See below for Ted's initial proposals regarding Classifiers, and for Pat's work implementing a Scala-based cooccurrence recommender with a CLI wrapper and import/export functionality, along with his proposal for an API to serve recommenders. Any input is appreciated.

> Subject: Re: do we really need scala still
> From: [email protected]
> Date: Thu, 29 May 2014 08:58:04 -0700
> To: [email protected]
>
> Regarding recommenders, drivers, and import/export:
>
> I’ve got Sebastian’s cooccurrence code wrapped with a driver that reads text
> delimited files into a DRM for use with cooccurrence. Then it writes the
> indicator matrix(es) as text delimited files with user specified IDs. It also
> has a proposed Driver base class, Scala based option parser and
> ReadStore/WriteStore traits. The CLI will be mostly a superset of the
> itemsimilarity in legacy MR. The read/write stuff is meant to be pretty
> generic so I was planning to do a DB and maybe JSON example (some day). There
> is still a bit of functional programming refactoring and the docs are not up
> to date.
>
> With cooccurrence working we could do something that replaces all the
> cooccurrence recommenders (in-memory and MR) with one codebase. Add Solr and
> you have a single machine server based recommender that we can supply with an
> API similar to the legacy in-memory recommender. The cool thing is that it
> will scale out to a cluster with Solr and HDFS, requiring only config
> changes. The downside is that it requires at least a standalone local version
> of Spark to do the cooccurrence. BTW this would give us something people have
> been asking for: a recommender service.
>
> Is anyone else interested in CLI, drivers, read/write in the import/export
> sense? Or a new architecture for the recommenders? If so, maybe a separate
> thread?
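To make the read/write part of Pat's description a little more concrete, here is a rough sketch of what a ReadStore/WriteStore pair for text delimited files might look like. Only the two trait names come from Pat's message. The method signatures, `IdDictionary`, `IndexedMatrix` and `DelimitedTextStore` are hypothetical names made up for illustration, and a plain in-memory `Map` stands in for the DRM, so this is not the actual driver code.

```scala
import scala.collection.mutable

// Hypothetical sketch: only the trait names ReadStore/WriteStore come from the
// thread; everything else here is assumed for illustration.

/** Assigns dense integer indices to external string IDs, in first-seen order. */
final class IdDictionary {
  private val indices = mutable.Map.empty[String, Int]
  private val ids = mutable.ArrayBuffer.empty[String]

  def indexOf(id: String): Int =
    indices.getOrElseUpdate(id, { ids += id; ids.size - 1 })

  def idOf(index: Int): String = ids(index)
}

/** In-memory stand-in for a DRM: sparse cells plus the dictionaries needed to restore IDs. */
final case class IndexedMatrix(
    cells: Map[(Int, Int), Double],
    rowIds: IdDictionary,
    colIds: IdDictionary)

trait ReadStore  { def read(lines: Seq[String]): IndexedMatrix }
trait WriteStore { def write(matrix: IndexedMatrix): Seq[String] }

/** Reads and writes lines of the form rowId<delimiter>colId<delimiter>value. */
final class DelimitedTextStore(delimiter: Char = ',') extends ReadStore with WriteStore {

  def read(lines: Seq[String]): IndexedMatrix = {
    val rowIds = new IdDictionary
    val colIds = new IdDictionary
    val cells = lines.filter(_.trim.nonEmpty).map { line =>
      line.split(delimiter).map(_.trim) match {
        case Array(row, col, value) =>
          (rowIds.indexOf(row), colIds.indexOf(col)) -> value.toDouble
        case _ =>
          throw new IllegalArgumentException(s"expected 3 fields: $line")
      }
    }.toMap
    IndexedMatrix(cells, rowIds, colIds)
  }

  // Cells are written in (row, column) index order, using the original external IDs.
  def write(matrix: IndexedMatrix): Seq[String] =
    matrix.cells.toSeq.sortBy(_._1).map { case ((r, c), v) =>
      Seq(matrix.rowIds.idOf(r), matrix.colIds.idOf(c), v.toString).mkString(delimiter.toString)
    }
}
```

Keeping the ID dictionaries next to the matrix is what would let a driver write results back out with user specified IDs; a DB or JSON store could implement the same two traits.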
>
> On May 29, 2014, at 7:03 AM, Ted Dunning <[email protected]> wrote:
>
> Andrew,
>
> Sebastian and I were talking yesterday and guessing that you would be
> interested in this soon. Glad to know the world is as expected.
>
> Yes. This needs to happen at least at a very conceptual level. For
> instance, for classifiers, I think that we need to have something like:
>
> - progressively train against a batch of data
>   questions: should this do multiple epochs? Throw an exception if
>   on-line training not supported? throw an exception if too little data
>   provided?
>
> - classify a batch of data
>
> - serialize a model
>
> - de-serialize a model
>
> Note that a batch listed above should be either a bunch of observations or
> just one.
>
> Question: does this handle the following cases:
>
> - naive bayes
> - SGD trained on continuous data
> - batch trained <mumble> classifiers
> - downpour type classifier training
>
> ?
>
>
> On Wed, May 28, 2014 at 6:25 PM, Andrew Palumbo <[email protected]> wrote:
>
> > This may be somewhat tangential to this thread, but would now be a good
> > time to start laying out some scala traits for
> > Classifiers/Clusterers/Recommenders? I am totally scala-naive, but have
> > been trying to keep up with the discussions.
> >
> > I don't know if this is premature but it seems that now that the DSL data
> > structures have been at least sketched out if not fully implemented, it
> > would be useful to have these in place before people start porting too much
> > over. It might be helpful in bringing in new contributions as well.
> >
> > It could also help regarding people's questions of integrating a future
> > wrapper layer.
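As a starting point for discussion, Ted's four operations (progressively train on a batch, classify a batch, serialize, de-serialize) could be sketched as a trait roughly like the one below. Everything here is an assumption made for illustration: `Classifier`, `Observation`, `CentroidModel` and `NearestCentroid` are hypothetical names, nothing uses the Mahout DSL types, and the toy nearest-centroid model is only there because it is easy to train incrementally. It does not try to answer the open questions about epochs or about algorithms that cannot train on-line.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical sketch: none of these names exist in Mahout.

/** One labelled observation. A "batch" is a Seq of these, possibly of length one. */
final case class Observation(features: Vector[Double], label: Int)

/** The four operations from the proposal, parameterised by the model type M. */
trait Classifier[M] {
  /** Progressive training: fold a batch into an existing model and return the updated model. */
  def train(model: M, batch: Seq[Observation]): M
  def classify(model: M, batch: Seq[Vector[Double]]): Seq[Int]
  def serialize(model: M): Array[Byte]
  def deserialize(bytes: Array[Byte]): M
}

/** Per-class feature sums and counts; the centroid of a class is sum / count. */
final case class CentroidModel(sums: Map[Int, Vector[Double]], counts: Map[Int, Long])

/** Toy nearest-centroid classifier, chosen because it supports incremental updates. */
object NearestCentroid extends Classifier[CentroidModel] {
  val empty: CentroidModel = CentroidModel(Map.empty, Map.empty)

  def train(model: CentroidModel, batch: Seq[Observation]): CentroidModel =
    batch.foldLeft(model) { (m, obs) =>
      val sum = m.sums.get(obs.label) match {
        case Some(s) => s.zip(obs.features).map { case (a, b) => a + b }
        case None    => obs.features
      }
      val count = m.counts.getOrElse(obs.label, 0L) + 1L
      CentroidModel(m.sums.updated(obs.label, sum), m.counts.updated(obs.label, count))
    }

  def classify(model: CentroidModel, batch: Seq[Vector[Double]]): Seq[Int] = {
    // One possible answer to "what if there is too little data": fail loudly.
    require(model.sums.nonEmpty, "model has not been trained")
    batch.map { x =>
      model.sums.minBy { case (label, sum) =>
        val n = model.counts(label).toDouble
        // Squared Euclidean distance from x to this class's centroid.
        sum.zip(x).map { case (s, xi) => val d = s / n - xi; d * d }.sum
      }._1
    }
  }

  def serialize(model: CentroidModel): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeInt(model.sums.size)
    model.sums.foreach { case (label, sum) =>
      out.writeInt(label)
      out.writeLong(model.counts(label))
      out.writeInt(sum.length)
      sum.foreach(v => out.writeDouble(v))
    }
    out.flush()
    bytes.toByteArray
  }

  def deserialize(bytes: Array[Byte]): CentroidModel = {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))
    val entries = Seq.fill(in.readInt()) {
      val label = in.readInt()
      val count = in.readLong()
      val sum = Vector.fill(in.readInt())(in.readDouble())
      (label, sum, count)
    }
    CentroidModel(entries.map(e => e._1 -> e._2).toMap, entries.map(e => e._1 -> e._3).toMap)
  }
}
```

Having `train` take and return a model value keeps a single observation and a large batch on the same code path, which matches the note that a batch may be just one observation. Whether batch-only or downpour-style training would fit the same signature is exactly the kind of question this thread is meant to settle.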
