The sklearn-style learner, transformer and predictor features sound good to
me: tried and proven.
Most importantly, IMO we need a strong, established type system, and should
not repeat what I view as a problem in some other offerings. If the type
system is strict and limited in size, then there's much less need for data
adapters, or none at all.
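To sketch what I mean (a minimal sketch; Stage and chain are hypothetical
names, DrmLike[K] is our existing distributed matrix type):

import org.apache.mahout.math.drm.DrmLike

// hypothetical: every stage consumes and produces the one canonical
// distributed matrix type, so any two stages compose directly and no
// adapter layer is needed between them
trait Stage[K] {
  def apply(input: DrmLike[K]): DrmLike[K]
}

// composition is then just function composition over the single type
def chain[K](stages: Seq[Stage[K]]): Stage[K] = new Stage[K] {
  def apply(input: DrmLike[K]): DrmLike[K] =
    stages.foldLeft(input)((data, stage) => stage(data))
}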
So, what we have:
-- double-precision tensor types (but not n-d arrays)

What we don't have:
-- data frames

What we may want to have:
-- formula support, especially for non-linear GLM ("non-linear generalized
linear model" -- does that even make sense?); OK, non-linear regressions
A formula normally acts on data-frame-like data, not on tensor data, although
it produces tensor data. Herein lies a conundrum. I don't see Mahout taking on
data frames; that is just too big. But good formula and "factor" (in the R
sense) support would be nice to have for down-to-earth problems.
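As a tiny illustration of the "factor" half (hypothetical helper, nothing
here exists in Mahout today): a categorical column becomes indicator columns,
which is the tensor-friendly thing a formula ultimately emits.

// hypothetical sketch: expand an R-style factor level into one-hot
// indicator columns
def oneHot(levels: Seq[String], value: String): Array[Double] =
  levels.map(l => if (l == value) 1.0 else 0.0).toArray

// e.g. oneHot(Seq("red", "green", "blue"), "green") yields Array(0.0, 1.0, 0.0)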
Perhaps a tactical solution here is to integrate a foreign engine's data
frames with Mahout-native formula support. But I haven't given it much
thought: although formulas and step-wise non-linear model searches are the
first thing to happen in any analytics work (and somehow they haven't
happened well enough elsewhere), I don't see how this can be done cheaply in
an engine-agnostic way. I still view Mahout as an under-funded project, so
choices of new things should be smart -- small in volume, great in bang.
Data frames are not small in volume, and since I am increasingly turning
away from Spark in my personal endeavors, I won't support just integrating
Spark SQL for this purpose.
A big area that people actually need (IMO), and that hasn't been done well
elsewhere (IMO), is model and model-parameter search. This "ML optimizer"
idea has been around AMPLab for as long as I can remember and is still very
popular, but I don't think there are good offerings that actually solve the
problem in OSS. One of the reasons is that modern OSS is pretty slow for the
compute volume the task requires. If we get some unique improvements into
the framework, we can think of getting into this business; it shouldn't be
that difficult, assuming throughput is not an issue. GPU clusters are
increasingly common, so we can hope we'll get there in the future.
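The core of such a search is small, for what it's worth; a hypothetical
sketch of the exhaustive grid-search flavor (all names made up), where
everything hard hides inside train and score:

// hypothetical: exhaustive search over a parameter grid, returning the
// parameters and model with the lowest validation score
case class Params(lambda: Double, maxIter: Int)

def gridSearch[M](grid: Seq[Params],
                  train: Params => M,
                  score: M => Double): (Params, M) =
  grid.map(p => (p, train(p))).minBy { case (_, m) => score(m) }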
On the algorithm side, I would love to see something with 2-d inputs, CNNs
or the like, for image processing.
On Thu, Jul 21, 2016 at 8:08 AM, Trevor Grant <[email protected]>
wrote:
> I was thinking so too. Most ML frameworks are at least loosely based on the
> sklearn paradigm. For those not familiar, at a very abstract level:
>
> model1 = new Algo  // e.g. K-Means, Random Forest, Neural Net
>
> model1.fit(trainingData)
>
> // then, depending on the goal of the algorithm, you have either (or both):
> preds = model1.predict(testData)    // returns a vector with one prediction
>                                     // per observation in the testing data
>
> // or sometimes
> newVals = model1.transform(testData)  // returns a new dataset-like object;
>   // this makes more sense for things like neural nets, or when you're not
>   // just predicting a single value per observation
>
>
> In addition to the above, pre-processing operations also have a transform
> method, such as:
>
> preprocess1 = new Normalizer
>
> preprocess1.fit(trainingData)  // this phase calculates the mean and
>                                // variance of the training data set
>
> preprocessedTrainingData = preprocess1.transform(trainingData)
> preprocessedTestingData  = preprocess1.transform(testingData)
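>
> A minimal Scala sketch of that fit/transform split (hypothetical,
> standardizing each column to zero mean and unit variance; guarding
> against zero variance is omitted for brevity):
>
> class Normalizer {
>   private var means: Array[Double] = _
>   private var stds: Array[Double] = _
>
>   // fit: learn per-column mean and standard deviation from the
>   // training data only
>   def fit(data: Array[Array[Double]]): this.type = {
>     val n = data.length.toDouble
>     means = data.head.indices.map(j => data.map(_(j)).sum / n).toArray
>     stds = data.head.indices.map { j =>
>       math.sqrt(data.map(r => math.pow(r(j) - means(j), 2)).sum / n)
>     }.toArray
>     this
>   }
>
>   // transform: apply the training statistics to any data set
>   def transform(data: Array[Array[Double]]): Array[Array[Double]] =
>     data.map(_.zipWithIndex.map { case (v, j) => (v - means(j)) / stds(j) })
> }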
>
> I think this is a reasonable approach because A) it makes sense and B) it
> is a standard of sorts across ML libraries (because of A).
>
> We have two high-level bucket types, based on what the output is:
>
> Predictors and Transformers.
>
> Predictors: anything that returns a single value per observation; this
> covers classifiers and regressors.
>
> Transformers: anything that returns a vector per observation:
> - Pre-processing operations
> - Classifiers, in that usually there is a probability vector for each
> observation as to which class it belongs to; the 'predict' method then
> just picks the most likely class
> - Neural nets (though with one small tweak they can be extended to
> regression or classification)
> - Any unsupervised learning application (e.g. clustering)
> - etc.
>
> And so really we have something like:
>
> trait LearningFunction[D] {
>   def fit(trainingData: D): this.type
> }
>
> trait Transformer[D] extends LearningFunction[D] {
>   def transform(data: D): D
> }
>
> // one value per observation
> trait Predictor[D] extends Transformer[D] {
>   def predict(data: D): Seq[Double]
> }
>
>
> This paradigm also lends itself nicely to pipelines...
>
> pipeline1 = new Pipeline
>   .add(transformer1)
>   .add(transformer2)
>   .add(model1)
>
> pipeline1.fit(trainingData)
> pipeline1.predict(testingData)
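>
> A rough sketch of what such a Pipeline might look like (hypothetical,
> building on the Transformer/Predictor traits above):
>
> class Pipeline[D] {
>   private val stages = scala.collection.mutable.Buffer.empty[Transformer[D]]
>
>   def add(stage: Transformer[D]): this.type = { stages += stage; this }
>
>   // fit each stage on the output of the stages before it
>   def fit(data: D): this.type = {
>     stages.foldLeft(data) { (d, s) => s.fit(d); s.transform(d) }
>     this
>   }
>
>   // run data through the pre-processing stages, then the final predictor
>   def predict(data: D): Seq[Double] = {
>     val through = stages.init.foldLeft(data)((d, s) => s.transform(d))
>     stages.last match {
>       case p: Predictor[D @unchecked] => p.predict(through)
>       case _ => sys.error("last pipeline stage must be a Predictor")
>     }
>   }
> }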
>
> I have to read up on recommenders a bit more to figure out how those play
> in, or whether we need another class.
>
> In addition to that, I think we would have an optimizers section that
> allows for the various flavors of SGD, but also admits other types of
> optimizers altogether.
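>
> E.g., something like this (hypothetical sketch, plain SGD as one flavor
> behind a common interface):
>
> // any optimizer maps (current parameters, gradient) to new parameters
> trait Optimizer {
>   def step(params: Array[Double], grad: Array[Double]): Array[Double]
> }
>
> class SGD(learningRate: Double) extends Optimizer {
>   def step(params: Array[Double], grad: Array[Double]): Array[Double] =
>     params.zip(grad).map { case (p, g) => p - learningRate * g }
> }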
>
> Again, just moving the conversation forward a bit here.
>
> Excited to get to work on this
>
> Best,
>
> tg
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things." -Virgil*
>
>
> On Thu, Jul 21, 2016 at 7:13 AM, Sebastian <[email protected]> wrote:
>
> > Hi Andrew,
> >
> > I think this topic is broader than just defining a few traits. A popular
> > way of integrating ML algorithms is via the combination of data frames
> > and pipelines, similar to what scikit-learn and SparkML offer at the
> > moment. Maybe it could make sense to integrate with what they have
> > instead of starting our own effort?
> >
> > Best,
> > Sebastian
> >
> >
> >
> > On 21.07.2016 04:35, Andrew Palumbo wrote:
> >
> >> Hi All,
> >>
> >>
> >> I'd like to draw your attention to MAHOUT-1856:
> >> https://issues.apache.org/jira/browse/MAHOUT-1856
> >>
> >>
> >> This is a discussion that has popped up several times over the last
> >> couple of years. As we move towards building out our algorithm library,
> >> it would be great to nail this down now.
> >>
> >>
> >> Most importantly, so that we can no longer be criticized as "a loose
> >> bag of algorithms," as we sometimes have been in the past.
> >>
> >>
> >> The main point being: it would be good to lay out common traits for
> >> Classification, Clustering, and Optimization algorithms.
> >>
> >>
> >> This is just a start. I created this issue a few months back and
> >> intentionally left off Recommenders, because I was unsure whether there
> >> were common traits across them. By traits, I am referring to both the
> >> literal meaning and, more specifically, actual Scala traits.
> >>
> >>
> >> @pat, @tdunning, @ssc, could you give your thoughts on this?
> >>
> >>
> >> As well, it would be good to add online flavors of different algorithm
> >> classes into the mix.
> >>
> >>
> >> @tdunning could you share some thoughts here?
> >>
> >>
> >> Trevor Grant will be heading up this effort, and it would be great if
> >> we all, as a team, could come up with abstract design plans for each
> >> class of algorithm (as well as determine the current "classes of
> >> algorithms"), since each of us has our own unique blend of
> >> specializations, and could give our thoughts on this.
> >>
> >>
> >> This is really just the opening of the conversation.
> >>
> >>
> >> It would be best to post thoughts on:
> >> https://issues.apache.org/jira/browse/MAHOUT-1856
> >>
> >>
> >> Any feedback is welcome.
> >>
> >>
> >> Thanks,
> >>
> >>
> >> Andy
> >>