[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-02-01 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848930#comment-15848930
 ] 

Andrew Palumbo commented on MAHOUT-1856:


[~rawkintrevo] can this be marked as resolved? Or is there still more to do 
here for 0.13.0?

> Create a framework for new Mahout Clustering, Classification, and 
> Optimization  Algorithms
> --
>
> Key: MAHOUT-1856
> URL: https://issues.apache.org/jira/browse/MAHOUT-1856
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.12.1
>Reporter: Andrew Palumbo
>Assignee: Trevor Grant
>Priority: Critical
> Fix For: 0.13.0
>
>
> To ensure that Mahout does not become "A loose bag of algorithms", create 
> basic traits with functions common to each class of algorithm. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847936#comment-15847936
 ] 

Hudson commented on MAHOUT-1856:


FAILURE: Integrated in Jenkins build Mahout-Quality #3412 (See 
[https://builds.apache.org/job/Mahout-Quality/3412/])
MAHOUT-1856 Add Framework for Models, Fitters, and Tests closes (rawkintrevo: 
rev 9a31923eae3727d9d91bd2c2ed8df12a616a577e)
* (add) spark/src/test/scala/org/apache/mahout/math/algorithms/RegressionTestsSuite.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/tests/AutocorrelationTests.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/OrdinaryLeastSquaresModel.scala
* (add) math-scala/src/test/scala/org/apache/mahout/math/algorithms/RegressionSuiteBase.scala
* (edit) .gitignore
* (add) flink/src/test/scala/org/apache/mahout/flinkbindings/standard/RegressionSuite.scala
* (add) math-scala/src/test/scala/org/apache/mahout/math/algorithms/RegressionTestsSuiteBase.scala
* (add) spark/src/test/scala/org/apache/mahout/math/algorithms/PreprocessorSuite.scala
* (add) math-scala/src/test/scala/org/apache/mahout/math/algorithms/PreprocessorSuiteBase.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/preprocessing/PreprocessorModel.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/SupervisedFitter.scala
* (add) flink/src/test/scala/org/apache/mahout/flinkbindings/standard/PreprocessorSuite.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/Fitter.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/UnsupervisedFitter.scala
* (add) h2o/src/test/scala/org/apache/mahout/math/algorithms/RegressionSuite.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/preprocessing/StandardScaler.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/tests/FittnessTests.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/CochraneOrcuttModel.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/Model.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/RegressorModel.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/UnsupervisedModel.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/LinearRegressorModel.scala
* (add) h2o/src/test/scala/org/apache/mahout/math/algorithms/RegressionTestsSuite.scala
* (add) spark/src/test/scala/org/apache/mahout/math/algorithms/RegressionSuite.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/preprocessing/MeanCenter.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/SupervisedModel.scala
* (add) flink/src/test/scala/org/apache/mahout/flinkbindings/standard/RegressionTestsSuite.scala
* (add) math-scala/src/main/scala/org/apache/mahout/math/algorithms/preprocessing/AsFactor.scala
* (add) h2o/src/test/scala/org/apache/mahout/math/algorithms/PreprocessorSuite.scala




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847923#comment-15847923
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/246




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840810#comment-15840810
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r98131529
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala
 ---
@@ -0,0 +1,49 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.regression.tests._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+import scala.reflect.ClassTag
+
+/**
+  * Abstract of Regressors
+  */
+trait Regressor[K] extends Model {
--- End diff --

In most kits I know (including two I am working on right now myself), the 
pattern is that `fit` returns a model object. scikit-learn seems to be the 
outlier on this.

Also, I think the approach "package A does it like this, and we do/don't like 
it, therefore it is good/bad" is a dogma fallacy. IMO we just need to do what 
makes sense. 

And it makes sense to me to serialize or persist the model alone, not the 
(fitting algorithm + model) combination. Persisting both together will cause 
problems on both the user and implementation ends, IMO.
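
As a rough illustration of persisting only the fitted model, here is a minimal sketch in plain Scala using Java serialization. `LinearModelSketch` and `ModelPersistenceSketch` are hypothetical names for illustration, not part of Mahout or of this pull request:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical fitted model: it holds only the learned coefficients,
// no reference to the fitting algorithm or the training data.
final case class LinearModelSketch(intercept: Double, slope: Double) {
  def predict(x: Double): Double = intercept + slope * x
}

object ModelPersistenceSketch {
  // Serialize the fitted model to bytes (Scala case classes are Serializable).
  def toBytes(model: LinearModelSketch): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(model) finally out.close()
    buffer.toByteArray
  }

  // Restore a model from bytes produced by toBytes.
  def fromBytes(bytes: Array[Byte]): LinearModelSketch = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[LinearModelSketch] finally in.close()
  }
}
```

Because the model carries no fitting state, what gets written is just the coefficients, and the restored object can predict without the fitter being present.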




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837955#comment-15837955
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user rawkintrevo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97809121
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/CochraneOrcutt.scala
 ---
@@ -0,0 +1,89 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.drm.CacheHint
+import org.apache.mahout.math.drm.DrmLike
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+
+class CochraneOrcutt[K](hyperparameters: (Symbol, Any)*) extends LinearRegressor[K] {
+  // https://en.wikipedia.org/wiki/Cochrane%E2%80%93Orcutt_estimation
+
+  var regressor: LinearRegressor[K] = hyperparameters.asInstanceOf[Map[Symbol, LinearRegressor[K]]].getOrElse('regressor, new OrdinaryLeastSquares())
+  var iterations: Int = hyperparameters.asInstanceOf[Map[Symbol, Int]].getOrElse('iterations, 3)
+  var cacheHint: CacheHint.CacheHint = hyperparameters.asInstanceOf[Map[Symbol, CacheHint.CacheHint]].getOrElse('cacheHint, CacheHint.MEMORY_ONLY)
+  // For larger inputs, CacheHint.MEMORY_AND_DISK2 is reccomended.
+
+  var betas: Array[MahoutVector] = _
+
+  var summary = ""
+
+  setHyperparameters(hyperparameters.toMap)
+
+  def setHyperparameters(hyperparameters: Map[Symbol, Any] = Map('foo -> None)): Unit = {
+    regressor = hyperparameters.asInstanceOf[Map[Symbol, LinearRegressor[K]]].getOrElse('regressor, new OrdinaryLeastSquares())
+    iterations = hyperparameters.asInstanceOf[Map[Symbol, Int]].getOrElse('iterations, 3)
+    cacheHint = hyperparameters.asInstanceOf[Map[Symbol, CacheHint.CacheHint]].getOrElse('cacheHint, CacheHint.MEMORY_ONLY)
+  }
+
+  def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K], hyperparameters: (Symbol, Any)*): Unit = {
+
+    var hyperparameters: Option[Map[String,Any]] = None
+    betas = new Array[MahoutVector](iterations)
+    regressor.fit(drmFeatures, drmTarget)
+    betas(0) = regressor.beta
+
+    drmY = drmTarget
+
+    val Y = drmTarget(1 until drmTarget.nrow.toInt, 0 until 1).checkpoint(cacheHint)
+    val Y_lag = drmTarget(0 until drmTarget.nrow.toInt - 1, 0 until 1).checkpoint(cacheHint)
+    val X = drmFeatures(1 until drmFeatures.nrow.toInt, 0 until 1).checkpoint(cacheHint)
+    val X_lag = drmFeatures(0 until drmFeatures.nrow.toInt - 1, 0 until 1).checkpoint(cacheHint)
--- End diff --

missed all of these- but have since updated with `safeToNonNegInt(`




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837948#comment-15837948
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user rawkintrevo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97808310
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/tests/FittnessTests.scala
 ---
@@ -0,0 +1,52 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression.tests
+
+import org.apache.mahout.math.algorithms.regression.Regressor
+import org.apache.mahout.math.algorithms.transformer.MeanCenter
+import org.apache.mahout.math.drm.DrmLike
+import org.apache.mahout.math.function.Functions.SQUARE
+import org.apache.mahout.math.scalabindings.RLikeOps._
+
+import scala.reflect.ClassTag
+
+object FittnessTests {
+
+  // https://en.wikipedia.org/wiki/Coefficient_of_determination
+  def CoefficientOfDetermination[R[K] <: Regressor[K], K](model: R[K],
+                                                          drmTarget: DrmLike[K]): R[K] = {
+    val sumSquareResiduals = model.residuals.assign(SQUARE).sum
+    val mc = new MeanCenter()
+    val totalResiduals = mc.fitTransform(drmTarget)
+    val sumSquareTotal = totalResiduals.assign(SQUARE).sum
+    val r2 = 1 - (sumSquareResiduals / sumSquareTotal)
+    model.testResults += ("r2" -> r2)
+    model.summary += s"\nR^2: ${r2}"
+    model
+  }
+
+  // https://en.wikipedia.org/wiki/Mean_squared_error
+  def MeanSquareError[R[K] <: Regressor[K], K](model: R[K]): R[K] = {
+    val mse = model.residuals.assign(SQUARE).sum / model.residuals.nrow
--- End diff --

will update this to `safeToNonNegInt(`
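
For reference, the two statistics computed in the quoted diff can be sketched on plain Scala collections, without DRMs. This is only an illustration of the formulas (R^2 = 1 - SS_res / SS_tot, and MSE = SS_res / n); `FitnessSketch` and its method names are hypothetical and not part of the pull request:

```scala
object FitnessSketch {
  private def sumOfSquares(values: Seq[Double]): Double =
    values.map(v => v * v).sum

  private def residuals(actual: Seq[Double], predicted: Seq[Double]): Seq[Double] = {
    require(actual.nonEmpty && actual.length == predicted.length,
      "actual and predicted must be non-empty and of equal length")
    actual.zip(predicted).map { case (y, yHat) => y - yHat }
  }

  // Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
  def rSquared(actual: Seq[Double], predicted: Seq[Double]): Double = {
    val ssResidual = sumOfSquares(residuals(actual, predicted))
    val mean = actual.sum / actual.length
    // Mean-centering the target plays the role of MeanCenter in the diff.
    val ssTotal = sumOfSquares(actual.map(_ - mean))
    1.0 - ssResidual / ssTotal
  }

  // Mean squared error: residual sum of squares divided by the number of rows.
  def meanSquaredError(actual: Seq[Double], predicted: Seq[Double]): Double =
    sumOfSquares(residuals(actual, predicted)) / actual.length
}
```

With targets `Seq(1.0, 2.0, 3.0, 4.0)` and predictions `Seq(1.1, 1.9, 3.2, 3.8)`, the residual sum of squares is 0.10 and the total sum of squares is 5.0, so R^2 is approximately 0.98 and the MSE approximately 0.025.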




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837256#comment-15837256
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user rawkintrevo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97715579
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala
 ---
@@ -0,0 +1,49 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.regression.tests._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+import scala.reflect.ClassTag
+
+/**
+  * Abstract of Regressors
+  */
+trait Regressor[K] extends Model {
--- End diff --

I agree that Spark and Flink follow the paradigm you are suggesting; 
however, sklearn doesn't. If we're just going off of what others do, then yes, 
we should probably follow the conventions of the other Scala-based "Big Data" 
packages. However, I can't understand WHY they do it that way. It makes the 
code hard to read and follow, and I assume it is an artifact of all the 
serialization and the way they execute models (having to ship objects around 
for map/reduce phases). That is to say, they do it because _they are forced 
to_ and _at the expense of_ readability. 

In Mahout, most of that is taken care of at the distributed engine level.  

If we start going down the rabbit hole of "do as Spark and Flink do", we may 
find ourselves with [an entire class just for the summary of a linear model](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L620).
I, for one, want to stay as far away from that as possible. I'd like to see 
algorithm code (these and future) be as succinct and tractable as possible so 
that 

1. New contributors aren't intimidated (that is to say, are encouraged to 
commit algorithms)
2. Those algorithms can be easily reviewed and maintained with minimal 
Scala knowledge (requiring deep Scala knowledge limits the pool of willing and 
able contributors who understand the actual math at play)

That isn't to say that, at the end of the day, your proposal is incorrect. You 
usually are correct, and I value and appreciate you taking the time to review.  
I am saying that "i think that's how this pattern goes in most kits." is 
neither necessary nor sufficient IMO, as in some respects I'm explicitly trying 
to avoid the approach of other packages; in this case, refactoring something to 
be more complex with no clear understanding of the benefit. 




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837211#comment-15837211
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user rawkintrevo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97711337
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala
 ---
@@ -0,0 +1,37 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+  * Abstract of Regressors
+  */
+trait Regressor[K] extends Model {
+
+  var residuals: DrmLike[K] = _
+
+  def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Unit
--- End diff --

@andrewpalumbo re `drmFeatures` vs `drmX`, and @dlyubimov: 
We're mixing conventions.
I feel that, for consistency, we should use either `drmFeatures` and 
`drmTarget` (similar to Spark ML) or `drmX` and `drmY` (similar to sklearn and 
R). Leaving it as is for now, but open to debate. I have a slight bias towards 
`drmX` and `drmY`.




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836634#comment-15836634
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97646707
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala
 ---
@@ -0,0 +1,49 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.regression.tests._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+import scala.reflect.ClassTag
+
+/**
+  * Abstract of Regressors
+  */
+trait Regressor[K] extends Model {
+
+  var residuals: DrmLike[K] = _
+
+  var drmY: DrmLike[K] = _
+
+  def fit(drmFeatures: DrmLike[K],
+  drmTarget: DrmLike[K],
+  hyperparameters: Map[String,Any] = Map("" -> None)): Unit
--- End diff --

Just in case it was not quite clear, my suggestion was to use 
`hyperparameters: (Symbol, Any)*`. First, symbols are faster for something that 
is meant to be an id, and second, they have a somewhat more palatable notation, e.g., 
```
val model = fit(X,Y,'k -> 10, 'alpha -> 1e-5)
```
and the signature overall
```
def fit(drmFeatures: DrmLike[K],
        drmTarget: DrmLike[K],
        hyperparameters: (Symbol, Any)*): Model
```
Of course, the implementation can easily get a map, should it need to:
```
  val hmap = hyperparameters.toMap
```
That's actually a Scala pattern I developed and used in a similar situation.
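
A self-contained sketch of that varargs pattern follows, assuming an imaginary algorithm with two hyperparameters. `HyperparameterSketch`, `Settings`, the keys `k` and `alpha`, and the default values are all hypothetical. It uses `Symbol("k")` rather than the `'k` literal only so that it also compiles on newer Scala versions, where symbol literals are deprecated or removed:

```scala
object HyperparameterSketch {
  // Hypothetical settings for an imaginary algorithm.
  final case class Settings(k: Int, alpha: Double)

  // Accepts hyperparameters as (Symbol, Any) pairs and converts them to a Map.
  def parse(hyperparameters: (Symbol, Any)*): Settings = {
    val hmap: Map[Symbol, Any] = hyperparameters.toMap

    // Match on the runtime type of each value instead of casting blindly.
    val k = hmap.get(Symbol("k")) match {
      case Some(i: Int) => i
      case Some(other)  => throw new IllegalArgumentException(s"k must be an Int, got: $other")
      case None         => 5 // assumed default
    }
    val alpha = hmap.get(Symbol("alpha")) match {
      case Some(d: Double) => d
      case Some(other)     => throw new IllegalArgumentException(s"alpha must be a Double, got: $other")
      case None            => 1e-3 // assumed default
    }
    Settings(k, alpha)
  }
}
```

A call then reads `HyperparameterSketch.parse(Symbol("k") -> 10, Symbol("alpha") -> 1e-5)`, and any omitted key falls back to its default.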




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836636#comment-15836636
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97649184
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/tests/AutocorrelationTests.scala
 ---
@@ -0,0 +1,54 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression.tests
+
+import org.apache.mahout.math.algorithms.regression.Regressor
+import org.apache.mahout.math.drm.DrmLike
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.function.Functions.SQUARE
+import org.apache.mahout.math.scalabindings.RLikeOps._
+
+
+object AutocorrelationTests {
+
+  //https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic
+  /*
+  To test for positive autocorrelation at significance α, the test statistic d is compared to lower and upper critical values (dL,α and dU,α):
+  If d < dL,α, there is statistical evidence that the error terms are positively autocorrelated.
+  If d > dU,α, there is no statistical evidence that the error terms are positively autocorrelated.
+  If dL,α < d < dU,α, the test is inconclusive.
+
+  Rule of Thumb:
+   d < 2 : positive auto-correlation
+   d = 2 : no auto-correlation
+   d > 2 : negative auto-correlation
+  */
+  def DurbinWatson[K](model: Regressor[K]): Regressor[K] = {
+    val e: DrmLike[K] = model.residuals(1 until model.residuals.nrow.toInt, 0 until 1)
--- End diff --

`nrow.toInt` is generally dangerous as it does not catch the algorithm 
limitation. The problem with this (and I have actually run into this before) is 
that the algorithm obviously has, in this case, a limitation of about 2 billion 
rows (`Int.MaxValue`), and it should explicitly fail once this limit is 
reached, instead of silently producing nonsense. I think there's a method 
specifically for this purpose in our math-scala package `drm`, 
`safeToNonNegInt`, that would throw IllegalArgument if the conversion loses 
significant bits.

Of course, the best approach is to avoid such limitations in the first 
place, but if unavoidable, please use `safeToNonNegInt`.
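
To make the failure mode concrete, here is a small plain-Scala sketch. `toNonNegIntOrFail` is a hypothetical stand-in written for illustration; it is not Mahout's `safeToNonNegInt`, whose exact implementation and exception type are not shown in this thread:

```scala
object RowCountSketch {
  // Hypothetical stand-in for a checked Long-to-Int conversion:
  // fail loudly instead of silently wrapping around.
  def toNonNegIntOrFail(x: Long): Int = {
    require(x >= 0L && x <= Int.MaxValue.toLong,
      s"value $x does not fit in a non-negative Int")
    x.toInt
  }

  // A row count just past the Int range, as could occur for a very tall matrix.
  val tooManyRows: Long = 3000000000L

  // Plain .toInt keeps only the low 32 bits, so this becomes a negative number.
  val silentlyWrong: Int = tooManyRows.toInt
}
```

Here `silentlyWrong` evaluates to `-1294967296`, which would then flow into a range such as `1 until n` without any error, whereas `toNonNegIntOrFail(tooManyRows)` throws an `IllegalArgumentException`.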




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836635#comment-15836635
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97647259
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala
 ---
@@ -0,0 +1,49 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.regression.tests._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+import scala.reflect.ClassTag
+
+/**
+  * Abstract of Regressors
+  */
+trait Regressor[K] extends Model {
--- End diff --

Does Regressor still extend Model? IMO, Regressor's `fit` should be a factory 
method w.r.t. the model instead.
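
A minimal sketch of what "fit as a factory method" could look like, assuming a fitter/model split. Every name here (`RegressorModel`, `SlopeModel`, `SlopeThroughOrigin`) is hypothetical and not taken from the pull request, and `Seq[Double]` stands in for a DRM so the example runs without Mahout:

```scala
// Hypothetical names throughout; Seq[Double] stands in for a DRM.

/** Immutable result of fitting: holds only what prediction needs. */
trait RegressorModel {
  def predict(x: Seq[Double]): Seq[Double]
}

/** The fitter carries configuration and acts as a factory for models. */
trait Regressor[M <: RegressorModel] {
  def fit(x: Seq[Double], y: Seq[Double]): M
}

final case class SlopeModel(beta: Double) extends RegressorModel {
  def predict(x: Seq[Double]): Seq[Double] = x.map(_ * beta)
}

/** One-feature least squares through the origin: beta = sum(x*y) / sum(x*x). */
class SlopeThroughOrigin extends Regressor[SlopeModel] {
  def fit(x: Seq[Double], y: Seq[Double]): SlopeModel = {
    require(x.nonEmpty && x.length == y.length, "x and y must be non-empty and equally long")
    val sxx = x.map(v => v * v).sum
    require(sxx != 0.0, "x must contain a non-zero value")
    val sxy = x.zip(y).map { case (a, b) => a * b }.sum
    SlopeModel(sxy / sxx)
  }
}
```

With this shape the fitter holds no fitted state, so one fitter instance can produce several independent models.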




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836633#comment-15836633
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user dlyubimov commented on the issue:

https://github.com/apache/mahout/pull/246
  
If it's not too much hassle, please consider using the Unicode characters →, ⇒, ←. 
In IntelliJ this is easily done by adding substitution 'live' 
templates:
![spectacle 
kp8565](https://cloud.githubusercontent.com/assets/523263/22266754/06b4b516-e236-11e6-8375-246440bf41d0.png)





[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836619#comment-15836619
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user rawkintrevo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97647112
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/CochraneOrcutt.scala
 ---
@@ -0,0 +1,83 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.drm.CacheHint
+import org.apache.mahout.math.drm.DrmLike
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+
+class CochraneOrcutt[K](hyperparameters: Map[String, Any] = Map("" -> None)) extends LinearRegressor[K] {
+  // https://en.wikipedia.org/wiki/Cochrane%E2%80%93Orcutt_estimation
+
+  var regressor: LinearRegressor[K] = hyperparameters.asInstanceOf[Map[String, LinearRegressor[K]]].getOrElse("regressor", new OrdinaryLeastSquares())
+  var iterations: Int = hyperparameters.asInstanceOf[Map[String, Int]].getOrElse("iterations", 3)
+  var cacheHint: CacheHint.CacheHint = hyperparameters.asInstanceOf[Map[String, CacheHint.CacheHint]].getOrElse("cacheHint", CacheHint.MEMORY_ONLY)
+  // For larger inputs, CacheHint.MEMORY_AND_DISK2 is recommended.
+
+  var betas: Array[MahoutVector] = _
+
+  var summary = ""
+
+  def fit(drmFeatures: DrmLike[K],
+  drmTarget: DrmLike[K],
+  hyperparameters: Map[String, Any] = Map("" -> None)): Unit = {
+
+var hyperparameters: Option[Map[String,Any]] = None
--- End diff --

there should be a `setHyperparameters` right about here...
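
A rough sketch of such a setter, under the assumption that hyperparameters keep arriving as a `Map[String, Any]`. The class name, the keys and the fields are all hypothetical. Matching on the runtime type rejects a wrongly typed value at the point where it is set, which a cast of the whole map cannot do:

```scala
// Hypothetical sketch: none of these names come from the pull request.
class IterativeFitter {
  var iterations: Int = 3
  var tolerance: Double = 1e-6

  /** Update only the keys that are present; reject values of the wrong type. */
  def setHyperparameters(params: Map[String, Any]): Unit = {
    params.get("iterations").foreach {
      case n: Int => iterations = n
      case other  => throw new IllegalArgumentException(s"iterations must be an Int, got: $other")
    }
    params.get("tolerance").foreach {
      case t: Double => tolerance = t
      case other     => throw new IllegalArgumentException(s"tolerance must be a Double, got: $other")
    }
  }

  def fit(data: Seq[Double], hyperparameters: Map[String, Any] = Map.empty): Unit = {
    // Apply any overrides first, so the fitting logic sees one consistent state.
    setHyperparameters(hyperparameters)
    // ... the fitting logic would go here, reading `iterations` and `tolerance` ...
  }
}
```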




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833768#comment-15833768
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97239580
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala
 ---
@@ -0,0 +1,37 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+  * Abstract of Regressors
+  */
+trait Regressor[K] extends Model {
+
+  var residuals: DrmLike[K] = _
+
+  def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Unit
--- End diff --

Yes, e.g. residuals, a Hessian matrix, etc.: anything that a user who is 
developing their own algorithm might like to see returned from their own 
`fit(..)` method. My point was that there is no reason to limit the return 
value to `Unit`; I'd think `Any` would be a more appropriate return type for 
the base trait. Though maybe this would not work with the pipeline structure 
that you're setting up?




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833763#comment-15833763
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on the issue:

https://github.com/apache/mahout/pull/246
  
@rawkintrevo Great- best of both worlds then!




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833761#comment-15833761
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user rawkintrevo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97238201
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/transformer/StandardScaler.scala
 ---
@@ -0,0 +1,119 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.transformer
+
+import org.apache.mahout.math.drm
+
+import org.apache.mahout.math.scalabindings._
+
+import org.apache.mahout.math.scalabindings.RLikeVectorOps
+import org.apache.mahout.math.{Vector => MahoutVector}
+
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.scalabindings.RLikeVectorOps
+import org.apache.mahout.math.scalabindings.MatrixOps
+
+import org.apache.mahout.math._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.drm.RLikeDrmOps._
+
+
+import org.apache.mahout.math.Matrix
+
+import collection._
+import JavaConversions._
+
+import Math.sqrt
+
+import scala.reflect.{ClassTag,classTag}
+
+/**
+  * Scales columns to mean 0 and unit variance
+  */
+class StandardScaler extends Transformer{
+  var meanVec: MahoutVector = _
+  var variance: MahoutVector = _
+  var stdev: MahoutVector = _
+  var summary = ""
+
+  def fit[K](input: DrmLike[K]) = {
+val mNv = dcolMeanVars(input)
+meanVec = mNv._1
+variance = mNv._2
+stdev = mNv._2.sqrt
+isFit = true
+
+  }
+
+  def transform[K: ClassTag](input: DrmLike[K]): DrmLike[K] = {
+
+if (!isFit) {
+  //throw an error
--- End diff --

Correct and agreed. The sklearn convention is to also have a `fitTransform` 
method (`fit_transform` in sklearn) to do it in one step.

Will add that.
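
A minimal sketch of how a `fitTransform` default could sit on the base trait, assuming the `isFit` flag from the diff above. `ColumnTransformer` and `ColumnMeanCenter` are hypothetical names, and `Seq[Double]` (a single column) stands in for a DRM:

```scala
// Hypothetical sketch; Seq[Double] (one column) stands in for a DRM.
trait ColumnTransformer {
  protected var isFit: Boolean = false

  def fit(input: Seq[Double]): Unit
  def transform(input: Seq[Double]): Seq[Double]

  /** Fit on `input`, then transform the same data in one call. */
  def fitTransform(input: Seq[Double]): Seq[Double] = {
    fit(input)
    transform(input)
  }
}

class ColumnMeanCenter extends ColumnTransformer {
  private var mean: Double = 0.0

  def fit(input: Seq[Double]): Unit = {
    require(input.nonEmpty, "cannot fit on empty input")
    mean = input.sum / input.length
    isFit = true
  }

  def transform(input: Seq[Double]): Seq[Double] = {
    // Fail loudly instead of silently transforming with unset statistics.
    if (!isFit) throw new IllegalStateException("call fit before transform")
    input.map(_ - mean)
  }
}
```

Because `fitTransform` lives on the trait, each concrete transformer gets the one-step call without extra code.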




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833760#comment-15833760
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user rawkintrevo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97238168
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala
 ---
@@ -0,0 +1,37 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+  * Abstract of Regressors
+  */
+trait Regressor[K] extends Model {
+
+  var residuals: DrmLike[K] = _
+
+  def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Unit
--- End diff --

By errors, do you mean code errors or the residuals?






[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833758#comment-15833758
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user rawkintrevo commented on the issue:

https://github.com/apache/mahout/pull/246
  
Thanks for the review @andrewpalumbo !

[sklearn parameters are set when the `Estimator` is 
instantiated](http://scikit-learn.org/stable/tutorial/statistical_inference/settings.html).

MLlib, on the other hand, [passes parameter maps in `fit` as you 
suggest](https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/ml/Estimator.html#fit(org.apache.spark.sql.Dataset,%20org.apache.spark.ml.param.ParamMap)).

Both, however, allow hyperparameters to be updated. In the case you 
refer to, the model would not be re-instantiated; it would look something like this:

```scala
model.param1 = 1
model.fit(X, y) 
model.param2 = 2
```

To your point, I also want to make this as easy as possible for new users, 
so I think it would be best to keep the option of passing a parameter map at 
instantiation and also expose it as an optional parameter of the `fit` method.
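
Roughly, that combination could look like the sketch below. The class name, the `"iterations"` key and the precedence rule (defaults, then constructor arguments, then arguments passed to `fit`) are assumptions made for illustration, not what the pull request implements:

```scala
// Hypothetical sketch; names and keys are illustrative only.
class ConfigurableFitter(initial: Map[String, Any] = Map.empty) {
  private val defaults: Map[String, Any] = Map("iterations" -> 3)

  // Later maps win: defaults < constructor arguments < arguments passed to fit.
  private var current: Map[String, Any] = defaults ++ initial

  def hyperparameters: Map[String, Any] = current

  def fit(data: Seq[Double], overrides: Map[String, Any] = Map.empty): Unit = {
    current = current ++ overrides
    // ... the fitting logic would read its settings from `current` ...
  }
}
```

A new user can then write `new ConfigurableFitter().fit(data)` and ignore the maps entirely, while a parameter search can reuse one instance and pass overrides on each call.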





[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833751#comment-15833751
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on the issue:

https://github.com/apache/mahout/pull/246
  
I want to emphasize, though, that if you're building toward a 
convention (sklearn-like, MLlib, etc.) that I am not familiar with, that may be 
better to follow than my suggestions, which aim to make this framework as easy 
on new users as possible.




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833750#comment-15833750
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on the issue:

https://github.com/apache/mahout/pull/246
  
Trevor, this looks really good to me. I've left some comments, mainly about 
moving hyperparameters from the model to `fit(...)`. I think that this makes 
sense in many ways. E.g., when doing a highly iterative hyperparameter search, 
it would eliminate a good amount of overhead to call:

  ```aModel.fit(..., hyperParameters = Map("hParameter1" -> value, 
"hParameter2" -> ...))``` 

rather than re-constructing the entire class each time. Also, as I 
noted inline, I think that the `fit(...)` method should have the ability to 
return at least a `List[Double]` of errors per row if needed, so I would 
suggest that it return `Any` rather than `Unit` in the base traits (unless the 
convention that you're following is to rely on `predict` for this).




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833744#comment-15833744
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97236994
  
--- Diff: 
math-scala/src/test/scala/org/apache/mahout/math/algorithms/RegressionSuite.scala
 ---
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.math.algorithms
+
+// arrange these proper
+import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.scalabindings.MahoutCollections._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.test.{DistributedMahoutSuite, MahoutSuite}
+import org.scalatest.{FunSuite, Matchers}
+
+trait RegressionSuite extends DistributedMahoutSuite with Matchers {
+  this: FunSuite =>
+
+  test("ordinary least squares") {
+/*
+R Prototype:
+dataM <- matrix( c(2, 2, 10.5, 10, 29.509541,
+  1, 2, 12,   12, 18.042851,
+  1, 1, 12,   13, 22.736446,
+  2, 1, 11,   13, 32.207582,
+  1, 2, 12,   11, 21.871292,
+  2, 1, 16,   8,  36.187559,
+  6, 2, 17,   1,  50.764999,
+  3, 2, 13,   7,  40.400208,
+  3, 3, 13,   4,  45.811716), nrow=9, ncol=5, byrow=TRUE)
+
+
+X = dataM[, c(1,2,3,4)]
+y = dataM[, c(5)]
+
+model <- lm(y ~ X - 1)
+summary(model)
+
+ */
+
+val drmData = drmParallelize(dense(
+  (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
+  (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
+  (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
+  (2, 1, 11,   13, 32.207582),  // Froot Loops
+  (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
+  (2, 1, 16,   8,  36.187559),  // Wheaties Honey Gold
+  (6, 2, 17,   1,  50.764999),  // Cheerios
+  (3, 2, 13,   7,  40.400208),  // Clusters
+  (3, 3, 13,   4,  45.811716)), numPartitions = 2)
+
+drmData.collect(::, 0 until 4)
+
+val drmX = drmData(::, 0 until 4)
+val drmY = drmData(::, 4 until 5)
+
+val model = new OrdinaryLeastSquares[Int]()
+model.fit(drmY, drmX)
+val estimate = model.beta
+val Ranswers = dvec(-1.336265, -13.157702, -4.152654, -5.679908, 163.179329)
+
+val epsilon = 1E-6
+(estimate - Ranswers).sum should be < epsilon
+
+  }
+
--- End diff --

It would be good to have a couple more tests here, with at least one for 
`transform(...)`.




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833743#comment-15833743
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97236955
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/transformer/StandardScaler.scala
 ---
@@ -0,0 +1,119 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.transformer
+
+import org.apache.mahout.math.drm
+
+import org.apache.mahout.math.scalabindings._
+
+import org.apache.mahout.math.scalabindings.RLikeVectorOps
+import org.apache.mahout.math.{Vector => MahoutVector}
+
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.scalabindings.RLikeVectorOps
+import org.apache.mahout.math.scalabindings.MatrixOps
+
+import org.apache.mahout.math._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.drm.RLikeDrmOps._
+
+
+import org.apache.mahout.math.Matrix
+
+import collection._
+import JavaConversions._
+
+import Math.sqrt
+
+import scala.reflect.{ClassTag,classTag}
+
+/**
+  * Scales columns to mean 0 and unit variance
+  */
+class StandardScaler extends Transformer{
+  var meanVec: MahoutVector = _
+  var variance: MahoutVector = _
+  var stdev: MahoutVector = _
+  var summary = ""
+
+  def fit[K](input: DrmLike[K]) = {
+val mNv = dcolMeanVars(input)
+meanVec = mNv._1
+variance = mNv._2
+stdev = mNv._2.sqrt
+isFit = true
+
+  }
+
+  def transform[K: ClassTag](input: DrmLike[K]): DrmLike[K] = {
+
+if (!isFit) {
+  //throw an error
--- End diff --

Also, I think that this could be another argument for moving 
hyperparameters into `fit(...)`. E.g., if for some reason we wanted to 
standardize to N(mean = 0, stdDev = 2), we could still call `StandardScaler` 
and `fit(Map("mu" -> 0, "sigma" -> 2))`:
```
val drmStandardized = StandardScaler(unscaledDrm).fit(Map("mu" -> 0, "sigma" -> 2)).transform()
```
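
To make the arithmetic concrete, here is a small sketch of scaling to a target mean and standard deviation, where each value becomes `mu + sigma * (x - mean) / stdev`. `TargetScaler` and the `"mu"` and `"sigma"` keys are hypothetical, `Seq[Double]` (one column) stands in for a DRM, and the sketch assumes population variance:

```scala
// Hypothetical sketch; Seq[Double] (one column) stands in for a DRM.
class TargetScaler {
  private var mean = 0.0
  private var stdev = 1.0
  private var mu = 0.0     // target mean
  private var sigma = 1.0  // target standard deviation

  def fit(input: Seq[Double], hyperparameters: Map[String, Double] = Map.empty): TargetScaler = {
    require(input.nonEmpty, "cannot fit on empty input")
    mu = hyperparameters.getOrElse("mu", 0.0)
    sigma = hyperparameters.getOrElse("sigma", 1.0)
    mean = input.sum / input.length
    // Population variance; a constant column has zero variance and cannot be scaled.
    val variance = input.map(v => (v - mean) * (v - mean)).sum / input.length
    require(variance > 0.0, "column has zero variance")
    stdev = math.sqrt(variance)
    this
  }

  /** Standardize with the fitted statistics, then shift and stretch to the target. */
  def transform(input: Seq[Double]): Seq[Double] =
    input.map(v => mu + sigma * (v - mean) / stdev)
}
```

Returning `this` from `fit` is what allows the chained `fit(...).transform(...)` style; with no map passed, the defaults give the usual mean 0 and unit variance.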




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833739#comment-15833739
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97236451
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/transformer/StandardScaler.scala
 ---
@@ -0,0 +1,119 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.transformer
+
+import org.apache.mahout.math.drm
+
+import org.apache.mahout.math.scalabindings._
+
+import org.apache.mahout.math.scalabindings.RLikeVectorOps
+import org.apache.mahout.math.{Vector => MahoutVector}
+
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.scalabindings.RLikeVectorOps
+import org.apache.mahout.math.scalabindings.MatrixOps
+
+import org.apache.mahout.math._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.drm.RLikeDrmOps._
+
+
+import org.apache.mahout.math.Matrix
+
+import collection._
+import JavaConversions._
+
+import Math.sqrt
+
+import scala.reflect.{ClassTag,classTag}
+
+/**
+  * Scales columns to mean 0 and unit variance
+  */
+class StandardScaler extends Transformer{
+  var meanVec: MahoutVector = _
+  var variance: MahoutVector = _
+  var stdev: MahoutVector = _
+  var summary = ""
+
+  def fit[K](input: DrmLike[K]) = {
+val mNv = dcolMeanVars(input)
+meanVec = mNv._1
+variance = mNv._2
+stdev = mNv._2.sqrt
+isFit = true
+
+  }
+
+  def transform[K: ClassTag](input: DrmLike[K]): DrmLike[K] = {
+
+if (!isFit) {
+  //throw an error
--- End diff --

I find this a bit confusing (as well as the other mentions of `if (!isFit) 
{ // throw }`). So if I want to standardize a DRM to (mean = 0, std_dev = 1), 
would I need to do something like the following?
```
drmStandardized = StandardScaler(unscaledDrm).fit().transform()
```




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833732#comment-15833732
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97235959
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/transformer/MeanCenter.scala
 ---
@@ -0,0 +1,93 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.transformer
+
+import collection._
+import JavaConversions._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.{Matrix, Vector}
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.scalabindings.MahoutCollections._
+import org.apache.mahout.math.{Vector => MahoutVector}
+
+import scala.reflect.ClassTag
+
+class MeanCenter extends Transformer {
+
+  var summary = ""
+  var colMeansV: MahoutVector = _
+
+  /**
+* Optionally set the centers of each column to some value other than Zero
+* @param centers A vector of length equal to the `input` in the fit method specifying the
+*centers to set each column to.
+*/
+  def setCenter(centers: MahoutVector) = {
+colMeansV = colMeansV - centers
+  }
+
+  /**
+* Centers Columns at zero
+* @param input
+*/
+  def fit[K](input: DrmLike[K]) = {
+colMeansV = input.colMeans
+val colMeansA = colMeansV.toArray
+//summary = (0 until colMeansA.length).map(i => s"Column ${i} mean: ${colMeansA(i)}").mkString(", ")
+  }
+
+  def transform[K: ClassTag](input: DrmLike[K]): DrmLike[K] = {
+
+if (!isFit) {
+  //throw an error
+}
+
+implicit val ctx = input.context
+val bcastV = drmBroadcast(colMeansV)
+
+val output = input.mapBlock(input.ncol) {
+  case (keys, block) =>
+val copy: Matrix = block.cloned
+copy.foreach(row => row -= bcastV.value)
+(keys, copy)
+}
+output
+  }
+
+  def invTransform[K: ClassTag](input: DrmLike[K]): DrmLike[K] = {
+
+if (!isFit) {
+  //throw an error
--- End diff --

Same question here as before: do you want to throw an error here?




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833730#comment-15833730
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97235908
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/transformer/AsFactor.scala
 ---
@@ -0,0 +1,79 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.transformer
+
+import org.apache.mahout.math.{Vector => MahoutVector}
+
+import collection._
+import JavaConversions._
+import org.apache.mahout.math._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import scala.reflect.ClassTag
+
+class AsFactor extends Transformer{
+
+  var factorMap: MahoutVector = _
+  var k: MahoutVector = _
+  var summary = ""
+
+  def transform[K: ClassTag](input: DrmLike[K]): DrmLike[K] ={
+if (!isFit) {
+  //throw an error
--- End diff --

Is this complete? I.e., do you want to throw an error here for this 
release? 




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833728#comment-15833728
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97235841
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/tests/AutocorrelationTests.scala
 ---
@@ -0,0 +1,54 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression.tests
+
+import org.apache.mahout.math.algorithms.regression.Regressor
+import org.apache.mahout.math.drm.DrmLike
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.function.Functions.SQUARE
+import org.apache.mahout.math.scalabindings.RLikeOps._
+
+
+object AutocorrelationTests {
+
+  //https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic
+  /*
+  To test for positive autocorrelation at significance α, the test statistic d is compared to lower and upper critical values (dL,α and dU,α):
+  If d < dL,α, there is statistical evidence that the error terms are positively autocorrelated.
+  If d > dU,α, there is no statistical evidence that the error terms are positively autocorrelated.
+  If dL,α < d < dU,α, the test is inconclusive.
+
+  Rule of Thumb:
+   d < 2 : positive auto-correlation
+   d = 2 : no auto-correlation
+   d > 2 : negative auto-correlation
+  */
+  def DurbinWaton[K](model: Regressor[K]): Regressor[K] = {
--- End diff --

Misspelling: this should be `DurbinWatson`.
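For readers unfamiliar with the statistic described in the comment block of the diff above, here is a minimal sketch of how it can be computed from a sequence of residuals. It works on an in-memory `Seq[Double]` rather than a `Regressor[K]` or a DRM, and `DurbinWatsonSketch` is a hypothetical name, not part of the PR.

```scala
// Hypothetical sketch: Durbin-Watson statistic on an in-memory residual sequence.
object DurbinWatsonSketch {

  // d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2
  def durbinWatson(residuals: Seq[Double]): Double = {
    require(residuals.length >= 2, "need at least two residuals")
    // Squared differences between each residual and its predecessor.
    val numerator = residuals.sliding(2).map { case Seq(prev, cur) =>
      val diff = cur - prev
      diff * diff
    }.sum
    // Sum of squared residuals; all-zero residuals would give NaN.
    val denominator = residuals.map(e => e * e).sum
    numerator / denominator
  }
}
```

Residuals that alternate in sign push d above 2, while residuals that drift slowly push it toward 0, which matches the rule of thumb quoted in the diff.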




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833722#comment-15833722
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97235704
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/OrdinaryLeastSquares.scala
 ---
@@ -0,0 +1,96 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.drm.DrmLike
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+
+import scala.reflect.ClassTag
+
+/**
+  * import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares
+  * val model = new OrdinaryLeastSquares()
+  *
+  * model.calcStandardErrors = true
+  * 
+  */
+class OrdinaryLeastSquares[K](hyperparameters: Map[String, Any] = Map("" -> None)) extends LinearRegressor[K] {
+  // https://en.wikipedia.org/wiki/Ordinary_least_squares
+
+  var calcStandardErrors: Boolean = hyperparameters.asInstanceOf[Map[String, Boolean]].getOrElse("calcStandardErrors", true)
+  var addIntercept: Boolean = hyperparameters.asInstanceOf[Map[String, Boolean]].getOrElse("addIntercept", true)
+
+  var summary = ""
+  def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Unit = {
--- End diff --

Continuing from our discussion on Slack, I would think that `fit` may be a 
more appropriate place for hyperparameters, e.g.:
```
fit(observed_independent: Drm[K], observed_targets: Drm[K],
    hyperparameters: Option[List[Double]]): List[Double]
```
I think this may be a matter of convention, so if you're following a 
convention that I am not familiar with, this may be fine. However, I feel that 
passing them to `fit` may be more robust.




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833721#comment-15833721
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97235509
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala
 ---
@@ -0,0 +1,37 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+  * Abstract of Regressors
+  */
+trait Regressor[K] extends Model {
+
+  var residuals: DrmLike[K] = _
+
+  def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Unit
--- End diff --

Also, `drmFeatures` is somewhat confusing: the Drm is a matrix of 
samples x features, i.e. features are columns. I think something like 
`drmSamples`, `drmObservations`, or even `drmX` may be more straightforward.




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833723#comment-15833723
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r97235398
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala
 ---
@@ -0,0 +1,37 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.{Vector => MahoutVector}
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+  * Abstract of Regressors
+  */
+trait Regressor[K] extends Model {
+
+  var residuals: DrmLike[K] = _
+
+  def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Unit
--- End diff --

I would think that `fit(..)` should be able to return a list of errors per 
sample, so possibly `def fit(drmFeatures: DrmLike[K], drmTarget: DrmLike[K]): Any` 
would be a better signature for the trait.
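To make the idea of a `fit` that hands back per-sample errors concrete, here is a small sketch for a single-feature least-squares fit on in-memory sequences. It returns a small case class rather than `Any`; that substitution, along with the names `FitResidualsSketch` and `LineFit`, is an assumption for illustration and not something proposed in the thread.

```scala
// Hypothetical sketch: a fit that returns per-sample residuals instead of Unit.
object FitResidualsSketch {

  final case class LineFit(intercept: Double, slope: Double, residuals: Seq[Double])

  // Ordinary least squares for a single feature: y = intercept + slope * x.
  def fit(x: Seq[Double], y: Seq[Double]): LineFit = {
    require(x.length == y.length && x.length >= 2,
      "need matching inputs with at least two samples")
    val n = x.length.toDouble
    val meanX = x.sum / n
    val meanY = y.sum / n
    val sxy = x.zip(y).map { case (xi, yi) => (xi - meanX) * (yi - meanY) }.sum
    val sxx = x.map(xi => (xi - meanX) * (xi - meanX)).sum
    val slope = sxy / sxx
    val intercept = meanY - slope * meanX
    // One residual (observed minus predicted) per training sample.
    val residuals = x.zip(y).map { case (xi, yi) => yi - (intercept + slope * xi) }
    LineFit(intercept, slope, residuals)
  }
}
```

A typed result keeps the residuals available to callers (for example for a Durbin-Watson test) without a cast at the call site.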




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830815#comment-15830815
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r96982985
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala
 ---
@@ -0,0 +1,33 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+  * Abstract of Regressors
+  */
+abstract class Regressor extends Model {
+
+  def fit[Int](drmY: DrmLike[Int], drmX: DrmLike[Int]): Unit
--- End diff --

Like perhaps 
`fit[K](X: DrmLike[K], T: DrmLike[K], hyperparameters: (Symbol, Any)*): Model`, 
where the optional list is the hyperparameter list. Then hyperparameterized 
calls could be something like:
```
fit(X, Y, 'alpha -> 10.0, 'lambda -> 1.e-15)
```
etc.
This loses the strong typing of the signature, but the call is not that ugly, 
and it might be OK if specific implementations are cleanly documented.
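A minimal sketch of what the varargs style could look like in practice, using a toy one-feature ridge regression through the origin on in-memory sequences. `HyperparamSketch`, `RidgeModel`, and the `lambda` key are hypothetical names chosen for illustration; nothing here is Mahout API.

```scala
// Hypothetical sketch: hyperparameters passed as a varargs list of (Symbol, Any) pairs.
object HyperparamSketch {

  final case class RidgeModel(slope: Double, lambda: Double)

  // One-feature ridge regression through the origin:
  //   slope = sum(x * y) / (sum(x * x) + lambda)
  def fit(x: Seq[Double], y: Seq[Double], hyperparameters: (Symbol, Any)*): RidgeModel = {
    val params = hyperparameters.toMap
    // Untyped lookup: a wrong value type only surfaces here, at run time.
    val lambda = params.get(Symbol("lambda")) match {
      case Some(d: Double) => d
      case Some(other) =>
        throw new IllegalArgumentException(s"lambda must be a Double, got: $other")
      case None => 0.0
    }
    val sxy = x.zip(y).map { case (xi, yi) => xi * yi }.sum
    val sxx = x.map(xi => xi * xi).sum
    RidgeModel(sxy / (sxx + lambda), lambda)
  }
}
```

The sketch writes `Symbol("lambda")` because the `'lambda` literal syntax is deprecated in Scala 2.13 and removed in Scala 3. The runtime type check on `lambda` shows the cost mentioned above: the compiler can no longer catch a mistyped hyperparameter.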




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830810#comment-15830810
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r96982250
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala
 ---
@@ -0,0 +1,33 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+  * Abstract of Regressors
+  */
+abstract class Regressor extends Model {
+
+  def fit[Int](drmY: DrmLike[Int], drmX: DrmLike[Int]): Unit
--- End diff --

I guess if this is abstract enough, we also need to be able to admit 
hyperparameters, which are of course specific to every fitter. In R this is 
trivial (any call can be given a bag of arbitrary named arguments), but in 
Scala this may need a bit of thought (if the abstraction needs to be that 
high). Otherwise, I guess most Scala kits just create a concrete fit signature 
per implementation.

If the Regressor trait is meant to be common to all possible regression-class 
algorithms, we either need a way to universally pass in the hyperparameters, 
or just not have the fit abstraction in the Regressor trait at all (then what, 
I guess :) ).





[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830778#comment-15830778
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user rawkintrevo commented on the issue:

https://github.com/apache/mahout/pull/246
  
Thanks for the quick review @dlyubimov, will incorporate your suggestions.




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830776#comment-15830776
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user rawkintrevo commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r96979374
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala
 ---
@@ -0,0 +1,33 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+  * Abstract of Regressors
+  */
+abstract class Regressor extends Model {
+
+  def fit[Int](drmY: DrmLike[Int], drmX: DrmLike[Int]): Unit
--- End diff --

+1 on swapping Y and X- I've had to catch myself more than once on that 
already.  I think the original motivation was a tip of the hat to R's `Y ~ x` 
but I agree with you.

Re `[Int]` I realized that later, but haven't gone through to swap them all 
back to `K`.  It is (or was) `K` in some places. 




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830738#comment-15830738
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r96976147
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala
 ---
@@ -0,0 +1,33 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+  * Abstract of Regressors
+  */
+abstract class Regressor extends Model {
+
+  def fit[Int](drmY: DrmLike[Int], drmX: DrmLike[Int]): Unit
--- End diff --

Also: I am not sure a fitter should extend a model. Rather, a fitter should 
return a model, i.e. `fit[K](...): Model`, right? I think that's how this 
pattern goes in most kits. A fitter is just a strategy.

And I'd abstain from using abstract classes in Scala unless a trait 
absolutely cannot do it (and it can in this case). An abstract class points to 
a specific, single and necessary base implementation in the hierarchy, which 
constrains the actual implementations without need.
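The pattern described here, a fitter as a strategy that returns a model with both expressed as traits, could be sketched as follows. All names (`FitterSketch`, `Fitter`, `MeanFitter`) are hypothetical, and the toy fitter works on in-memory sequences rather than DRMs.

```scala
// Hypothetical sketch: the fitter is a strategy that returns a model, both as traits.
object FitterSketch {

  // The fitted artifact: it only knows how to predict.
  trait Model {
    def predict(x: Double): Double
  }

  // The strategy: it only knows how to produce a Model from training data.
  trait Fitter {
    def fit(x: Seq[Double], y: Seq[Double]): Model
  }

  // A trivial fitter: always predicts the mean of the training targets.
  object MeanFitter extends Fitter {
    def fit(x: Seq[Double], y: Seq[Double]): Model = {
      require(y.nonEmpty, "need at least one target value")
      val mean = y.sum / y.length
      new Model {
        def predict(x: Double): Double = mean
      }
    }
  }
}
```

Because `fit` returns the model, there is no mutable `isFit` flag to check: a `Model` value can only exist after fitting has happened.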




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830723#comment-15830723
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r96467982
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/Model.scala ---
@@ -0,0 +1,36 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms
+
+import org.apache.mahout.math.{Vector => MahoutVector, drm}
+import org.apache.mahout.math.drm.DrmLike
+import org.apache.mahout.math.scalabindings._
+
+import scala.reflect.ClassTag
+
+abstract class Model extends Serializable {
+
+  var fitParams = collection.mutable.Map[String, MahoutVector]()
--- End diff --

So the model requires all parameters to be named and to be vectors? Shouldn't 
this be an artifact of more specialized models, such as GLMs? There are plenty 
of ML models that would probably not fit that fairly rigid definition, at least 
not easily or pragmatically.
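
One way to read this suggestion, as a minimal sketch only: the base model leaves the shape of its fitted state open, and only a GLM-like subclass commits to named coefficient vectors. All names below (`FlexModel`, `GlmLikeModel`, `CentroidModel`) are hypothetical and not part of the pull request, and plain `Array[Double]` stands in for a Mahout vector.

```scala
// Hypothetical sketch: the base trait does not dictate what fitted state looks like.
trait FlexModel[P] extends Serializable {
  def fitParams: P
}

// A GLM-like model, where named coefficient vectors are a natural fit.
final class GlmLikeModel(val fitParams: Map[String, Array[Double]])
  extends FlexModel[Map[String, Array[Double]]] {
  def beta: Array[Double] = fitParams("beta")
}

// A model whose fitted state is not a set of named vectors, e.g. a list of centroids.
final class CentroidModel(val fitParams: Seq[Array[Double]])
  extends FlexModel[Seq[Array[Double]]]
```

The type parameter keeps the generic contract small while still letting specialized models expose a `beta`-style lookup.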




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830724#comment-15830724
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r96473124
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala
 ---
@@ -0,0 +1,33 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+  * Abstract of Regressors
+  */
+abstract class Regressor extends Model {
+
+  def fit[Int](drmY: DrmLike[Int], drmX: DrmLike[Int]): Unit
--- End diff --

Also, maybe drmT for the target?




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830726#comment-15830726
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r96468873
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/CochraneOrcutt.scala
 ---
@@ -0,0 +1,68 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.drm.DrmLike
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+
+class CochraneOrcutt extends Regressor {
+  // https://en.wikipedia.org/wiki/Cochrane%E2%80%93Orcutt_estimation
+
+  var regressor : Regressor = new OLS() // type of regression to do- must have a 'beta' fit param
+  var iterations = 3 // Number of iterations to run
+
+  def fit[Int](drmY: DrmLike[Int], drmX: DrmLike[Int]) = {
+
+regressor.fit(drmY, drmX)
+fitParams("beta0") = regressor.fitParams("beta")
+
+val Y = drmY(1 until drmY.nrow.toInt, 0 until 1).checkpoint()
--- End diff --

Consider giving the caller an option to specify a cache hint here, since this 
algorithm seems to rely on plenty of things being held in memory. Right now 
this implies memory only, so if the data doesn't fit, the algorithm will slow 
to a crawl. (In all fairness, in Spark it would slow to a crawl with a 
memory-and-disk spec too, but to put things in perspective, we are probably 
talking about the difference between a crawling snail and a snail skeleton 20 
years after its death.)
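
A rough sketch of what a caller-supplied cache hint could look like. The names here (`HintSketch`, `FakeDrm`, `CachingFitter`) are stand-ins invented for illustration so the snippet runs without Mahout on the classpath; in the real code the hint would presumably be forwarded to `checkpoint(...)` on the DRM, which as far as I recall accepts a cache hint argument.

```scala
// Stand-in for a storage-level hint; these value names are illustrative only.
object HintSketch extends Enumeration {
  val MemoryOnly, MemoryAndDisk, DiskOnly = Value
}

// Hypothetical stand-in for a distributed matrix that records how it was checkpointed.
final case class FakeDrm(rows: Vector[Vector[Double]],
                         hint: Option[HintSketch.Value] = None) {
  def checkpoint(h: HintSketch.Value = HintSketch.MemoryOnly): FakeDrm =
    copy(hint = Some(h))
}

// The fitter takes the hint as a constructor argument instead of hard-coding it.
class CachingFitter(cacheHint: HintSketch.Value = HintSketch.MemoryOnly) {
  // Drops the first row (mirroring the 1 until nrow slice in the diff), then checkpoints.
  def prepare(drmY: FakeDrm): FakeDrm =
    FakeDrm(drmY.rows.drop(1)).checkpoint(cacheHint)
}
```

Defaulting the argument to memory-only would keep the current behaviour for callers who do not care.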




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830725#comment-15830725
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/246#discussion_r96473064
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/Regressor.scala
 ---
@@ -0,0 +1,33 @@
+/**
+  * Licensed to the Apache Software Foundation (ASF) under one
+  * or more contributor license agreements. See the NOTICE file
+  * distributed with this work for additional information
+  * regarding copyright ownership. The ASF licenses this file
+  * to you under the Apache License, Version 2.0 (the
+  * "License"); you may not use this file except in compliance
+  * with the License. You may obtain a copy of the License at
+  *
+  * http://www.apache.org/licenses/LICENSE-2.0
+  *
+  * Unless required by applicable law or agreed to in writing,
+  * software distributed under the License is distributed on an
+  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  * KIND, either express or implied. See the License for the
+  * specific language governing permissions and limitations
+  * under the License.
+  */
+
+package org.apache.mahout.math.algorithms.regression
+
+import org.apache.mahout.math.algorithms.Model
+import org.apache.mahout.math.drm.DrmLike
+
+/**
+  * Abstract of Regressors
+  */
+abstract class Regressor extends Model {
+
+  def fit[Int](drmY: DrmLike[Int], drmX: DrmLike[Int]): Unit
--- End diff --

Perhaps having the predictors as the first parameter and the target as the 
second is a more intuitive signature for most people.
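
For illustration, the reordered signature might look roughly like the following. This is a sketch under assumptions: `KeyedRows` is a made-up stand-in for `DrmLike[K]`, and `MeanRegressor` is a toy, not an algorithm from the pull request.

```scala
// Stand-in for a distributed row matrix keyed by K (the role DrmLike[K] plays in Mahout).
final case class KeyedRows[K](keys: Seq[K], rows: Seq[Array[Double]])

abstract class TargetLastRegressor {
  // Predictors first, target second. K is an ordinary type parameter for the row keys;
  // writing fit[Int] would likewise declare a type parameter that merely happens to be named Int.
  def fit[K](drmX: KeyedRows[K], drmT: KeyedRows[K]): Unit
}

// Toy implementation: stores the mean of the single target column.
class MeanRegressor extends TargetLastRegressor {
  var mean: Double = Double.NaN

  def fit[K](drmX: KeyedRows[K], drmT: KeyedRows[K]): Unit = {
    require(drmX.rows.length == drmT.rows.length, "X and T must have the same number of rows")
    mean = drmT.rows.map(_(0)).sum / drmT.rows.length
  }
}
```

A call then reads `model.fit(drmX, drmT)`, matching the features-then-labels order used by many ML libraries.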




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15824680#comment-15824680
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on the issue:

https://github.com/apache/mahout/pull/246
  
I should note that my above comment was meant as a question to all 
concerned: does this (the algorithms base traits package) belong in math-scala 
or in its own module? Trevor initially had it in its own module; I asked that 
he move it to math-scala, due for the most part to size concerns about the 
entire binary distribution. However, I am now questioning whether that is the 
correct call. I can think of several good reasons that it should have its own 
module (and others that it should be in math-scala). For now, I think that 
math-scala is an OK place, and we can just move it to another module later if 
we want. @smarthi @rawkintrevo @dlyubimov please weigh in here. Not a big deal 
right now, but thinking ahead it may be good to get the location right.




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2017-01-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823453#comment-15823453
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on the issue:

https://github.com/apache/mahout/pull/246
  
@dlyubimov, @mahout-team could you review/provide feedback on this?  
Originally Trevor had a separate module for this, and I asked him to move it 
into math-scala.




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2016-12-21 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15768521#comment-15768521
 ] 

Dmitriy Lyubimov commented on MAHOUT-1856:
--

One thing: we usually squash working branches before merging a PR to master so 
that we preferably have one commit per issue. This is much easier to manage 
(and to hot-fix if needed later).



[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2016-11-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15709182#comment-15709182
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user rawkintrevo commented on the issue:

https://github.com/apache/mahout/pull/246
  
This is WIP, so it doesn't really matter that it's failing at the moment, but 
[this](https://travis-ci.org/apache/mahout/builds/180124934#L176) isn't good: 
the `wget` command couldn't find Maven.




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2016-08-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408731#comment-15408731
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on the issue:

https://github.com/apache/mahout/pull/246
  
Nice work on the slides also! :100: 

Another thought that I had is that we may want to allow in-core matrices as 
parameters. Just throwing it out there for discussion. I can't think of a 
particular use case off the top of my head, but it seems that there should be 
one.

  




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2016-08-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408589#comment-15408589
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


Github user andrewpalumbo commented on the issue:

https://github.com/apache/mahout/pull/246
  
Thanks @rawkintrevo. This is a great start. I'd originally thought that 
we'd put the algos in the math-scala module, but looking at it, I think this 
makes sense.




[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2016-08-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407985#comment-15407985
 ] 

ASF GitHub Bot commented on MAHOUT-1856:


GitHub user rawkintrevo opened a pull request:

https://github.com/apache/mahout/pull/246

[MAHOUT-1856][WIP] Create a framework for new Mahout Clustering, 
Classification, and Optimization Algorithms 

Relevant JIRA: 
[https://issues.apache.org/jira/browse/MAHOUT-1856](https://issues.apache.org/jira/browse/MAHOUT-1856)

Readme.md provides a more comprehensive (yet still incomplete) overview.

Key Points:
Top Level Class: 
Model has one method, fit, plus coefs.

Transformers map a vector input to a vector output (same or different 
length)
Regressors map a vector input to a single output (e.g. a Double)
Classifiers extend Transformers that produce a probability vector, 'selecting' 
the class and returning its label (instead of the entire p-vector)

Pipelines and Ensembles are models as well, except they are composed from 
other models listed above, or from other pipelines and ensembles.
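
As a rough illustration of the hierarchy described above (not the code in the pull request): the names `BaseModel`, `VecTransformer`, `VecRegressor`, `LabelClassifier`, and `TransformerPipeline` are hypothetical, and plain `Array[Double]` stands in for Mahout vectors.

```scala
// Hypothetical names throughout; arrays stand in for Mahout vectors.
trait BaseModel extends Serializable

// Transformers map a vector to a vector (same or different length).
trait VecTransformer extends BaseModel {
  def transform(x: Array[Double]): Array[Double]
}

// Regressors map a vector to a single value.
trait VecRegressor extends BaseModel {
  def predict(x: Array[Double]): Double
}

// A classifier is a transformer producing a probability vector, plus label selection.
trait LabelClassifier extends VecTransformer {
  def labels: IndexedSeq[String]

  // Returns the label with the highest probability instead of the whole p-vector.
  def classify(x: Array[Double]): String = {
    val p = transform(x)
    labels(p.indexOf(p.max))
  }
}

// A pipeline is itself a transformer, composed of other transformers or pipelines.
class TransformerPipeline(stages: Seq[VecTransformer]) extends VecTransformer {
  def transform(x: Array[Double]): Array[Double] =
    stages.foldLeft(x)((v, stage) => stage.transform(v))
}
```

Because `TransformerPipeline` is a `VecTransformer`, pipelines can be nested inside other pipelines without special handling.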

ToDo:
- [ ] All models need a uniform way to expose their tuning parameters -> 
this will be required for an auto-tuning algo.  
- [ ] Pipelines / Ensembles must be able to account for and report the tunable 
parameters of their sub-models
- [ ] Need fitness functions
- [ ] Native method wrappers: underlying engines and third-party packages 
have implementations of many ML models, so let's not reinvent the wheel by 
exposing YET ANOTHER SGD algorithm. Instead we should be able to convert a 
matrix to the expected format of the 'other' library, run the model, get 
results, package them back into a matrix, and pass it on in a pipeline or 
ensemble. (This is especially useful for DeepLearning4J integration.) Also, 
native implementations of some algos on an engine are probably more efficient, 
by leveraging engine-specific tricks (think Flink delta iterators), than 
implementations we would make. 
- [ ] Lots more, open for discussion. 
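
The first two items could be sketched along these lines. This is only one possible shape, with invented names (`Tunable`, `RidgeLikeStage`, `NormalizerStage`, `TunablePipeline`) and `Double`-valued parameters assumed for simplicity.

```scala
// Hypothetical names throughout; this is not part of the pull request.
trait Tunable {
  // Name -> current value of each tuning parameter this model exposes.
  def tunableParams: Map[String, Double]
}

final case class RidgeLikeStage(lambda: Double) extends Tunable {
  def tunableParams: Map[String, Double] = Map("lambda" -> lambda)
}

final case class NormalizerStage(p: Double) extends Tunable {
  def tunableParams: Map[String, Double] = Map("p" -> p)
}

// A pipeline reports its sub-models' parameters, prefixed by stage index to avoid clashes.
final case class TunablePipeline(stages: Seq[Tunable]) extends Tunable {
  def tunableParams: Map[String, Double] =
    stages.zipWithIndex.flatMap { case (stage, i) =>
      stage.tunableParams.map { case (name, value) => s"stage$i.$name" -> value }
    }.toMap
}
```

Since a pipeline is itself `Tunable`, nested pipelines report fully qualified names such as `stage0.stage1.lambda`, which an auto-tuner could enumerate uniformly.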

This is merely a conversation starter on what to do.  

I've included OLS as an example regressor and a normalizer as an example 
transformer, only for illustrative purposes.  I really don't want to pack too 
many algos into this initial commit, just an example / proof of concept so we 
can say, "yeah, this framework makes sense for this kind of model" OR "ooh, we 
probably want to have these features too." 



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rawkintrevo/mahout mahout-1856

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mahout/pull/246.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #246


commit 6c0f6bd322a50341bcc587750146467f9ff3fa0a
Author: rawkintrevo 
Date:   2016-08-01T00:08:16Z

[MAHOUT-1856] ML Algo Framework

commit 1f04cd5436df12ded23b8a1815b93ce73ea2a32a
Author: rawkintrevo 
Date:   2016-08-02T17:22:48Z

Building framework

commit 33b90c9795bbb1ff381a98045b0d5f2b641693a9
Author: rawkintrevo 
Date:   2016-08-02T23:09:30Z

add placeholders for ensemble pipeline and fitness test

commit 83c6068e2aa18a62f6ae8b84169a018f764ab408
Author: rawkintrevo 
Date:   2016-08-03T14:54:32Z

added readme

commit 52e9c3e1df4db1397ab81bf07c0e191cfd229b1a
Author: rawkintrevo 
Date:   2016-08-03T14:58:59Z

fixed readme image

commit 92ceeb9603ff9c4927214b896c4dbcfc63f8c7c4
Author: rawkintrevo 
Date:   2016-08-03T15:04:11Z

fixed readme image

commit c0b0464f45470375d709ef9475d474440411879f
Author: rawkintrevo 
Date:   2016-08-03T15:04:52Z

fixed readme image

commit 6f0228aa7ff349cd8ff5c10a4dafe55ec2037ee4
Author: rawkintrevo 
Date:   2016-08-04T15:36:53Z

removed autogen comments from files

commit 065fb24068e5e98b24f4f53ab8cb312abfb8b9ed
Author: rawkintrevo 
Date:   2016-08-01T00:08:16Z

[MAHOUT-1856] ML Algo Framework

commit 127d5dec29ac8b7d6ad3a12c494d4ccdae24cd31
Author: rawkintrevo 
Date:   2016-08-02T17:22:48Z

Building framework

commit 557af2ee7bec17b176c6def768ea6d3da8495b42
Author: rawkintrevo 
Date:   2016-08-02T23:09:30Z

add placeholders for ensemble pipeline and fitness test

commit bde4c940f3e540ffb2e8eceb87355638ca157f89
Author: rawkintrevo 
Date:   2016-08-03T14:54:32Z

added readme

commit 565a164082b3c00294db2a4bd1a0b001d561d6f9
Author: rawkintrevo 
Date:   2016-08-03T14:58:59Z

fixed readme image

commit 950027c047021c23f44af64b842bcbc1bbd717f9
Author: rawkintrevo 
Date:   2016-08-03T15:04:11Z

fixed readme image

commit 045192146e290d9762f09e4235dd4c2f947891d4
Author: rawkintrevo 
Date:   2016-08-03T15:04:52Z

fixed readme image

commit f65d7a941f666d0a58d56ac642558dd15fb57cd7
Author: