On Tue, Jun 22, 2010 at 8:33 AM, Grant Ingersoll <[email protected]> wrote:

>
> On Jun 21, 2010, at 1:12 PM, Ted Dunning wrote:
>
> > We really need to have a simple way to integrate all of the input
> processing
> > options easily into new and old code
>
> More or less, what we need is a pipeline that can ingest many different
> kinds of things and output Vectors, right (assuming bayes is converted to
> use vectors).  Ideally it would be easy to configure, work well in a cluster
> and can output various formats (for instance freq. item set as well).
>

Yes.

But classifiers need to be able to do the conversion on the fly as well.
Just recently a client had a model with almost 20 interaction variables
among categorical variables with large numbers of possible values.  Very
soon, there will be interaction variables against text.  This means that the
vector form of the training or test examples will be 2-3x larger than the
original form.  SGD is already likely to be I/O bound, and killing
performance further seems a very bad idea.
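To make the on-the-fly idea concrete: a categorical interaction can be
hashed straight into the vector at training time, so the expanded form never
has to be materialized on disk.  A toy sketch (class and method names are
invented for illustration, not the actual SGD encoder):

```java
// Sketch: hash a categorical interaction term directly into a fixed-size
// vector slot at training time, so the 2-3x larger expanded vector form
// never has to be written out.
public class HashedEncoder {
    private final int numFeatures;

    public HashedEncoder(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    // Encode the interaction of two categorical values by hashing their
    // combination into one of numFeatures slots.
    public void addInteraction(double[] vector, String var1, String var2) {
        int h = (var1 + "\u0000" + var2).hashCode();
        int slot = Math.floorMod(h, numFeatures);
        vector[slot] += 1.0;
    }

    public static void main(String[] args) {
        HashedEncoder enc = new HashedEncoder(1 << 10);
        double[] v = new double[1 << 10];
        enc.addInteraction(v, "state=CA", "browser=firefox");
        double sum = 0;
        for (double x : v) sum += x;
        System.out.println(sum); // prints 1.0
    }
}
```

Collisions are the price of the fixed vector size, but for SGD that is
usually an acceptable trade against the I/O cost of expanded vectors.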

We also very much need to support both command-line and programmatic
composition of these pipelines.

>
> > - model storage
> >
> > It would be lovely if we could instantiate a model from a stored form
> > without even knowing what kind of learning produced the model.  All of the
> > classifiers and clustering algorithms should put out something that can
> be
> > instantiated this way.  I used Gson in the SGD code and found it pretty
> > congenial, but I didn't encode the class of the classifier, nor did I
> > provide a classifier abstract class.  I don't know what k-means or Canopy
> > clustering produce, nor random forests or Naive Bayes, but I am sure that
> > all of them are highly specific to the particular kind of model.
>
> Just to be clear, are you suggesting that, ultimately, the models can be
> used interchangeably?
>

Yes.

And in combination.  It is common for models and clusterers to be used as
feature extractors for other models (or clustering).  Model combination like
this was what won Netflix.
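On the storage side, the minimum that makes this possible is a type tag in
the stored form plus a registry of factories, roughly like this (all class
names invented for illustration, not the current SGD/Gson code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Sketch: store the concrete model type alongside the serialized payload
// so a reader can instantiate the right class without knowing which
// learner produced it.
public class ModelRegistry {
    interface Model {
        String describe();
    }

    // Stand-ins for whatever k-means, SGD, etc. actually persist.
    static class KMeansModel implements Model {
        public String describe() { return "k-means centroids"; }
    }
    static class SgdModel implements Model {
        public String describe() { return "SGD weight vector"; }
    }

    private final Map<String, Supplier<Model>> factories = new HashMap<>();

    void register(String typeTag, Supplier<Model> factory) {
        factories.put(typeTag, factory);
    }

    // In real storage the typeTag would be read out of the serialized
    // form itself (e.g. a "class" field in the JSON).
    Model instantiate(String typeTag) {
        return factories.get(typeTag).get();
    }
}
```

The same tag would let model-specific state be routed to the right
deserializer, whether that's Gson or something else.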

The most common use case, however, is evaluation.  It is important to be
able to throw any model at exactly the same test set and evaluation code.
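Concretely, all that takes is one scoring interface shared by every model;
the evaluation code then never needs to know what it is scoring.  A toy
sketch (interface and class names are made up, not a proposed API):

```java
import java.util.List;

// Sketch: if every classifier exposes the same classify() signature, one
// evaluation routine can score any of them on the same test set.
public class Evaluation {
    interface Classifier {
        int classify(double[] features);
    }

    static class Example {
        final double[] features;
        final int label;
        Example(double[] features, int label) {
            this.features = features;
            this.label = label;
        }
    }

    // Fraction of test examples the model gets right.
    static double accuracy(Classifier model, List<Example> testSet) {
        long correct = testSet.stream()
            .filter(e -> model.classify(e.features) == e.label)
            .count();
        return (double) correct / testSet.size();
    }

    public static void main(String[] args) {
        // A trivial threshold "model" stands in for SGD, Naive Bayes, etc.
        Classifier threshold = f -> f[0] > 0.5 ? 1 : 0;
        List<Example> test = List.of(
            new Example(new double[]{0.9}, 1),
            new Example(new double[]{0.1}, 0),
            new Example(new double[]{0.7}, 0));
        System.out.println(accuracy(threshold, test));
    }
}
```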

An as-yet-unexplored use case (for Mahout) is feature sharding, as in random
forests, but applied to alternative models.

Another use case is semi-supervised learning where you train a model and use
the output of the model against a larger corpus as training data for another
model.  We shouldn't be limited as to which models go where in such an
architecture.
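A rough shape of that loop, written so nothing cares which concrete learner
sits behind either stage (Trainer and Classifier are hypothetical
interfaces, not current Mahout API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the semi-supervised loop: train on labeled data, label a
// larger unlabeled corpus with that model, then use the predictions as
// training data for a second model.
public class SelfTraining {
    interface Classifier {
        int classify(double[] features);
    }

    interface Trainer {
        Classifier train(List<double[]> features, List<Integer> labels);
    }

    static Classifier selfTrain(Trainer trainer,
                                List<double[]> labeledX, List<Integer> labeledY,
                                List<double[]> unlabeledX) {
        // First model: fit on the small labeled set.
        Classifier first = trainer.train(labeledX, labeledY);

        // Pseudo-label the larger corpus with the first model.
        List<Integer> pseudoY = new ArrayList<>();
        for (double[] x : unlabeledX) {
            pseudoY.add(first.classify(x));
        }

        // Second model: fit on the pseudo-labeled corpus.  The two
        // trainers here could just as well be different learners.
        return trainer.train(unlabeledX, pseudoY);
    }
}
```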
