[GitHub] spark pull request: [SPARK-13430][PySpark][ML] Python API for trai...

2016-03-19 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/11621#discussion_r56587642
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -231,6 +232,210 @@ def intercept(self):
 """
 return self._call_java("intercept")
 
+    @property
+    @since("2.0.0")
+    def summary(self):
+        """
+        Gets summary (e.g. residuals, mse, r-squared ) of model on
+        training set. An exception is thrown if
+        `trainingSummary == None`.
+        """
+        java_blrt_summary = self._call_java("summary")
+        return BinaryLogisticRegressionTrainingSummary(java_blrt_summary)
+
+    @property
+    @since("2.0.0")
+    def hasSummary(self):
+        """
+        Indicates whether a training summary exists for this model
+        instance.
+        """
+        return self._call_java("hasSummary")
+
+    """
+    TODO: enable once Scala API is made public
+    def evaluate(self, df):
+        ""
+        Evaluates the model on a testset.
+        @param dataset Test dataset to evaluate model on.
+        ""
+        java_blr_summary = self._call_java("evaluate", df)
+        return BinaryLogisticRegressionSummary(java_blr_summary)
+    """
+
+
+class LogisticRegressionSummary(JavaCallable):
+    """
+    Abstraction for Logistic Regression Results for a given model.
+
+    .. versionadded:: 2.0.0
+    """
+
+    @property
+    @since("2.0.0")
+    def predictions(self):
+        """
+        Dataframe outputted by the model's `transform` method.
--- End diff --

nit: technically, `outputted` is not in the dictionary.





[GitHub] spark pull request: [SPARK-13430][PySpark][ML] Python API for trai...

2016-03-19 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/11621#discussion_r56587676
  
--- Diff: python/pyspark/ml/wrapper.py ---
@@ -223,3 +223,20 @@ def _call_java(self, name, *args):
 sc = SparkContext._active_spark_context
 java_args = [_py2java(sc, arg) for arg in args]
 return _java2py(sc, m(*java_args))
+
+
+class JavaCallable(object):
--- End diff --

this looks good to me, but @mengxr will have a more expert opinion on this 
class.





[GitHub] spark pull request: [SPARK-13430][PySpark][ML] Python API for trai...

2016-03-18 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/11621#discussion_r56587656
  
--- Diff: python/pyspark/ml/regression.py ---
@@ -151,6 +151,209 @@ def intercept(self):
 """
 return self._call_java("intercept")
 
+    @property
+    @since("2.0.0")
+    def summary(self):
+        """
+        Gets summary (e.g. residuals, mse, r-squared ) of model on
+        training set. An exception is thrown if
+        `trainingSummary == None`.
+        """
+        java_lrt_summary = self._call_java("summary")
+        return LinearRegressionTrainingSummary(java_lrt_summary)
+
+    @property
+    @since("2.0.0")
+    def hasSummary(self):
+        """
+        Indicates whether a training summary exists for this model
+        instance.
+        """
+        return self._call_java("hasSummary")
+
+    """
--- End diff --

same thing here





[GitHub] spark pull request: [MINOR] Typo fixes

2016-03-18 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/11802#discussion_r56595822
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala
 ---
@@ -102,8 +102,8 @@ class DStreamCheckpointData[T: ClassTag] (dstream: 
DStream[T])
   }
 
   /**
-   * Restore the checkpoint data. This gets called once when the DStream 
graph
-   * (along with its DStreams) are being restored from a graph checkpoint 
file.
+   * Restore the checkpoint data. This gets called once while the DStream 
graph
--- End diff --

I am not sure whether this comment is accurate; someone from the streaming team 
should check it. cc @tdas 





[GitHub] spark pull request: [MINOR] Typo fixes

2016-03-18 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/11802#discussion_r56595611
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala
 ---
@@ -45,7 +45,7 @@ class DStreamCheckpointData[T: ClassTag] (dstream: 
DStream[T])
   /**
* Updates the checkpoint data of the DStream. This gets called every 
time
* the graph checkpoint is initiated. Default implementation records the
-   * checkpoint files to which the generate RDDs of the DStream has been 
saved.
+   * checkpoint files to which the generated RDDs of the DStream have been 
saved.
--- End diff --

should be `records ... at which ...`
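In other words, the sentence would presumably read: "Default implementation 
records the checkpoint files at which the generated RDDs of the DStream have 
been saved."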





[GitHub] spark pull request: Changes to support KMeans with large feature s...

2016-03-10 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10739#issuecomment-195090771
  
@levin-royl it looks like @hhbyyh has a branch with some code that tackles 
the same issue (see the JIRA discussion for more information); you may want to 
coordinate there.

Also, I suggest you take another look at the pull request section of the 
guidelines: 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark





[GitHub] spark pull request: [SPARK-13600] [MLlib] Use approxQuantile from ...

2016-03-10 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/11553#issuecomment-195054258
  
@oliverpierson I was thinking that `relativeError` should be automatically 
selected (and not exposed as a param). However, I am fine with exposing it for 
the sake of simplicity.





[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

2016-03-09 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/11549#issuecomment-194528019
  
@hhbyyh thanks! I just have some small comments; my main comment is in the 
JIRA ticket and concerns the choice of options 1/2/3.





[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

2016-03-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/11549#discussion_r55597240
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -569,9 +572,46 @@ class GeneralizedLinearRegressionModel private[ml] (
 familyAndLink.fitted(eta)
   }
 
+  private var trainingSummary: Option[GeneralizedLinearRegressionSummary] 
= None
+
+  private[regression] def setSummary(summary: 
GeneralizedLinearRegressionSummary): this.type = {
+this.trainingSummary = Some(summary)
+this
+  }
+
+  /**
+* Gets summary of model on training set. An exception is
+* thrown if `trainingSummary == None`.
+*/
+  @Since("2.0.0")
+  def summary: GeneralizedLinearRegressionSummary = trainingSummary match {
+case Some(summ) => summ
+case None =>
+  throw new SparkException(
+"No training summary available for this 
GeneralizedLinearRegressionModel",
+new NullPointerException())
+  }
+
   @Since("2.0.0")
   override def copy(extra: ParamMap): GeneralizedLinearRegressionModel = {
 copyValues(new GeneralizedLinearRegressionModel(uid, coefficients, 
intercept), extra)
   .setParent(parent)
   }
 }
+
+/**
+ * :: Experimental ::
+ * GeneralizedLinearRegressionModel results evaluated on a dataset.
+ *
+ * @param predictions dataframe outputted by the model's `transform` 
method.
+ * @param predictionCol field in "predictions" which gives the prediction 
of each instance.
+ * @param labelCol field in "predictions" which gives the true label of 
each instance.
+ * @param featuresCol field in "predictions" which gives the features of 
each instance as a vector.
+ */
+@Experimental
+@Since("2.0.0")
+class GeneralizedLinearRegressionSummary private[regression] (
+@Since("2.0.0") @transient val predictions: DataFrame,
+@Since("2.0.0") val predictionCol: String,
--- End diff --

Based on a private discussion with @jkbradley, we should not expose the 
names of the columns in the public API. Users should refer to the original 
model to get access to the columns. In `LinearRegressionSummary`, they are 
passed around because they are required for the metrics. We do not need them 
here.
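Concretely, one possible shape would be the following (a sketch only; assuming 
the column name is still needed internally, it can stay package-private):

```scala
@Experimental
@Since("2.0.0")
class GeneralizedLinearRegressionSummary private[regression] (
    @Since("2.0.0") @transient val predictions: DataFrame,
    // kept for internal use, but no longer part of the public API
    private[regression] val predictionCol: String)
```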





[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

2016-03-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/11549#discussion_r55595546
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -569,9 +572,46 @@ class GeneralizedLinearRegressionModel private[ml] (
 familyAndLink.fitted(eta)
   }
 
+  private var trainingSummary: Option[GeneralizedLinearRegressionSummary] 
= None
+
+  private[regression] def setSummary(summary: 
GeneralizedLinearRegressionSummary): this.type = {
+this.trainingSummary = Some(summary)
+this
+  }
+
+  /**
+* Gets summary of model on training set. An exception is
+* thrown if `trainingSummary == None`.
+*/
+  @Since("2.0.0")
+  def summary: GeneralizedLinearRegressionSummary = trainingSummary match {
--- End diff --

nit: you can also do 
```scala
trainingSummary.getOrElse {
  throw new Exception(...)
}
```
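For example, using the exception message from the diff quoted earlier (a sketch 
only):

```scala
@Since("2.0.0")
def summary: GeneralizedLinearRegressionSummary = trainingSummary.getOrElse {
  // Same behavior as the pattern match, without the explicit case analysis.
  throw new SparkException(
    "No training summary available for this GeneralizedLinearRegressionModel",
    new NullPointerException())
}
```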





[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

2016-03-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/11549#discussion_r55592623
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala 
---
@@ -17,15 +17,41 @@
 
 package org.apache.spark.ml.api.r
 
+import org.apache.spark.SparkException
 import org.apache.spark.ml.{Pipeline, PipelineModel}
 import org.apache.spark.ml.attribute._
 import org.apache.spark.ml.classification.{LogisticRegression, 
LogisticRegressionModel}
 import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
 import org.apache.spark.ml.feature.{RFormula, VectorAssembler}
-import org.apache.spark.ml.regression.{LinearRegression, 
LinearRegressionModel}
+import org.apache.spark.ml.regression._
 import org.apache.spark.sql.DataFrame
 
 private[r] object SparkRWrappers {
+  def fitGLM(
+  value: String,
+  df: DataFrame,
+  family: String,
+  lambda: Double,
+  solver: String): PipelineModel = {
+if (solver.trim != "irls") throw new SparkException("Currently only 
support irls")
+
+val formula = new RFormula().setFormula(value)
+val regex = 
"^\\s*(\\w+)\\s*(\\(\\s*link\\s*=\\s*\"(\\w+)\"\\s*\\))?\\s*$".r
+val estimator = family match {
+  case regex(familyName, group2, linkName) =>
--- End diff --

I am confused: why do you need a regex here? I do not see anything special 
on the other side in R.





[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

2016-03-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/11549#discussion_r55591400
  
--- Diff: R/pkg/R/mllib.R ---
@@ -51,13 +45,12 @@ setClass("PipelineModel", representation(model = 
"jobj"))
 #' summary(model)
 #'}
 setMethod("glm", signature(formula = "formula", family = "ANY", data = 
"DataFrame"),
-  function(formula, family = c("gaussian", "binomial"), data, 
lambda = 0, alpha = 0,
-standardize = TRUE, solver = "auto") {
+  function(formula, family = c("gaussian", "binomial", "poisson", 
"gamma"), data,
+  lambda = 0, solver = "irls") {
--- End diff --

I would keep the solver default at 'auto', so that we can change the solver 
implementation later without a behavior regression. The 'irls' option would 
still be available for users who want it.

As for the other solver options, see my comment in the ticket.





[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...

2016-03-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/11549#discussion_r55563547
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala 
---
@@ -17,15 +17,41 @@
 
 package org.apache.spark.ml.api.r
 
+import org.apache.spark.SparkException
 import org.apache.spark.ml.{Pipeline, PipelineModel}
 import org.apache.spark.ml.attribute._
 import org.apache.spark.ml.classification.{LogisticRegression, 
LogisticRegressionModel}
 import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
 import org.apache.spark.ml.feature.{RFormula, VectorAssembler}
-import org.apache.spark.ml.regression.{LinearRegression, 
LinearRegressionModel}
+import org.apache.spark.ml.regression._
 import org.apache.spark.sql.DataFrame
 
 private[r] object SparkRWrappers {
+  def fitGLM(
+  value: String,
+  df: DataFrame,
+  family: String,
+  lambda: Double,
+  solver: String): PipelineModel = {
+if (solver.trim != "irls") throw new SparkException("Currently only 
support irls")
+
+val formula = new RFormula().setFormula(value)
+val regex = 
"^\\s*(\\w+)\\s*(\\(\\s*link\\s*=\\s*\"(\\w+)\"\\s*\\))?\\s*$".r
--- End diff --

in order to minimize the escaping, you can use Scala's raw strings:
```scala
"""^\s*(\w+...\s*$""".r
```
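Applied to the pattern in this diff, the raw-string form would look like this 
(a sketch; it should match exactly the same inputs as the escaped version):

```scala
// Triple-quoted strings do not interpret backslashes, so only regex-level
// escaping is needed.
val regex = """^\s*(\w+)\s*(\(\s*link\s*=\s*"(\w+)"\s*\))?\s*$""".r
```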





[GitHub] spark pull request: [SPARK-13600] [MLlib] Incorrect number of buck...

2016-03-08 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/11553#issuecomment-193889292
  
If you use 0.0 for the relative error, it will return the exact quantiles. 
However, in that case there is no data compression and the algorithm 
essentially runs a distributed merge sort, whose cost may be prohibitive.

The current implementation is already approximate, so it makes sense to pick a 
relative error > 0; we do not offer any guarantee on the approximation anyway. 
I suggest we automatically pick the relative error as follows:

>  target_error = min(0.1, 1.0 / (alpha * num_buckets))

for some value of alpha (1-10): the run time is proportional to the number of 
buckets, and the precision increases with the number of buckets requested.

If this is considered too complicated, then a fixed value such as 1e-2 or 
1e-3 looks fine to me as well.
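A minimal sketch of that heuristic (the names `alpha` and `numBuckets` are 
illustrative, not part of the existing API):

```scala
// Cap the relative error at 0.1 and tighten it as more buckets are requested;
// alpha trades run time for precision.
def targetError(numBuckets: Int, alpha: Double = 10.0): Double =
  math.min(0.1, 1.0 / (alpha * numBuckets))

targetError(10)      // 0.01
targetError(2, 1.0)  // 0.1 (capped)
```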





[GitHub] spark pull request: [SPARK-11535][ML] handling empty string in Str...

2016-03-07 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/9522#issuecomment-193542005
  
@pravingadakh sorry for the delay. Would you mind resolving the conflicts?





[GitHub] spark pull request: [SPARK-6761][SQL][ML] Fixes to API and documen...

2016-02-23 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/11332#issuecomment-187964822
  
LGTM, thanks @mengxr 





[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...

2016-02-23 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/6042#discussion_r53821901
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/stat/ApproxQuantileSuite.scala
 ---
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.stat
+
+import scala.util.Random
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.execution.stat.StatFunctions.QuantileSummaries
+
+
+class ApproxQuantileSuite extends SparkFunSuite {
+
+  private val r = new Random(1)
+  private val n = 100
+  private val increasing = "increasing" -> (0 until n).map(_.toDouble)
+  private val decreasing = "decreasing" -> (n until 0 by 
-1).map(_.toDouble)
+  private val random = "random" -> Seq.fill(n)(math.ceil(r.nextDouble() * 
1000))
+
+  private def buildSummary(
+  data: Seq[Double],
+  epsi: Double,
+  threshold: Int): QuantileSummaries = {
+var summary = new QuantileSummaries(threshold, epsi)
+data.foreach { x =>
+  summary = summary.insert(x)
+}
+summary.compress()
+  }
+
+  private def checkQuantile(quant: Double, data: Seq[Double], summary: 
QuantileSummaries): Unit = {
+val approx = summary.query(quant)
+// The rank of the approximation.
+val rank = data.count(_ < approx) // has to be <, not <= to be exact
+val lower = math.floor((quant - summary.epsilon) * data.size)
+assert(rank >= lower,
+  s"approx_rank: $rank ! >= $lower, requested quantile = $quant")
+val upper = math.ceil((quant + summary.epsilon) * data.size)
+assert(rank <= upper,
--- End diff --

fixed





[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...

2016-02-23 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/6042#discussion_r53821883
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
@@ -269,3 +291,41 @@ class DataFrameStatSuite extends QueryTest with 
SharedSQLContext {
 assert(0.until(1000).forall(i => filter4.mightContain(i * 3)))
   }
 }
+
+
+class DataFrameStatPerfSuite extends QueryTest with SharedSQLContext with 
Logging {
+
+  // Turn on this test if you want to test the performance of approximate 
quantiles.
+  ignore("describe() should not be slowed down too much by quantiles") {
--- End diff --

changed, thanks





[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...

2016-02-23 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/6042#discussion_r53821857
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala 
---
@@ -27,6 +30,312 @@ import org.apache.spark.unsafe.types.UTF8String
 
 private[sql] object StatFunctions extends Logging {
 
+  import QuantileSummaries.Stats
+
+  /**
+   * Calculates the approximate quantile for the given column.
+   *
+   * If you need to compute multiple quantiles at once, you should use 
[[multipleApproxQuantiles]]
+   *
+   * Note on the target error.
+   *
+   * The result of this algorithm has the following deterministic bound:
+   * if the DataFrame has N elements and if we request the quantile `phi` 
up to error `epsi`,
+   * then the algorithm will return a sample `x` from the DataFrame so 
that the *exact* rank
+   * of `x` close to (phi * N). More precisely:
+   *
+   *   floor((phi - epsi) * N) <= rank(x) <= ceil((phi + epsi) * N)
+   *
+   * Note on the algorithm used.
+   *
+   * This method implements a variation of the Greenwald-Khanna algorithm
+   * (with some speed optimizations). The algorithm was first present in 
the following article:
+   * "Space-efficient Online Computation of Quantile Summaries" by 
Greenwald, Michael
+   * and Khanna, Sanjeev. (http://dl.acm.org/citation.cfm?id=375670)
+   *
+   * The performance optimizations are detailed in the comments of the 
implementation.
+   *
+   * @param df the dataframe to estimate quantiles on
+   * @param col the name of the column
+   * @param quantile the target quantile of interest
+   * @param epsilon the target error. Should be >= 0.
+   * */
+  def approxQuantile(
+  df: DataFrame,
+  col: String,
+  quantile: Double,
+  epsilon: Double = QuantileSummaries.defaultEpsilon): Double = {
+require(quantile >= 0.0 && quantile <= 1.0, "Quantile must be in the 
range of (0.0, 1.0).")
+val Seq(Seq(res)) = multipleApproxQuantiles(df, Seq(col), 
Seq(quantile), epsilon)
+res
+  }
+
+  /**
+   * Runs multiple quantile computations in a single pass, with the same 
target error.
+   *
+   * See [[approxQuantile)]] for more details on the approximation 
guarantees.
+   *
+   * @param df the dataframe
+   * @param cols columns of the dataframe
+   * @param quantiles target quantiles to compute
+   * @param epsilon the precision to achieve
+   * @return for each column, returns the requested approximations
+   */
+  def multipleApproxQuantiles(
+  df: DataFrame,
+  cols: Seq[String],
+  quantiles: Seq[Double],
+  epsilon: Double): Seq[Seq[Double]] = {
+val columns: Seq[Column] = cols.map { colName =>
+  val field = df.schema(colName)
+  require(field.dataType.isInstanceOf[NumericType],
+s"Quantile calculation for column $colName with data type 
${field.dataType}" +
+" is not supported.")
+  Column(Cast(Column(colName).expr, DoubleType))
+}
+val emptySummaries = Array.fill(cols.size)(
+  new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, 
epsilon))
+
+// Note that it works more or less by accident as `rdd.aggregate` is 
not a pure function:
+// this function returns the same array as given in the input (because 
`aggregate` reuses
+// the same argument).
+def apply(summaries: Array[QuantileSummaries], row: Row): 
Array[QuantileSummaries] = {
+  var i = 0
+  while (i < summaries.length) {
+summaries(i) = summaries(i).insert(row.getDouble(i))
+i += 1
+  }
+  summaries
+}
+
+def merge(
+sum1: Array[QuantileSummaries],
+sum2: Array[QuantileSummaries]): Array[QuantileSummaries] = {
+  sum1.zip(sum2).map { case (s1, s2) => 
s1.compress().merge(s2.compress()) }
+}
+val summaries = df.select(columns: 
_*).rdd.aggregate(emptySummaries)(apply, merge)
+
+summaries.map { summary => quantiles.map(summary.query) }
+  }
+
+  /**
+   * Helper class to compute approximate quantile summary.
+   * This implementation is based on the algorithm proposed in the paper:
+   * "Space-efficient Online Computation of Quantile Summaries" by 
Greenwald, Michael
+   * and Khanna, Sanjeev. (http://dl.acm.org/citation.cfm?id=375670)
+   *
+   * In order to optimize for speed, it maintains an internal buffer of 
the last seen samples,
+   * and only inserts them after crossing a certain size threshold. This 
guarantees a near-constant
+   * runtime complexity

[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...

2016-02-23 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/6042#discussion_r53821725
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala 
---
@@ -27,6 +30,312 @@ import org.apache.spark.unsafe.types.UTF8String
 
 private[sql] object StatFunctions extends Logging {
 
+  import QuantileSummaries.Stats
+
+  /**
+   * Calculates the approximate quantile for the given column.
+   *
+   * If you need to compute multiple quantiles at once, you should use 
[[multipleApproxQuantiles]]
+   *
+   * Note on the target error.
+   *
+   * The result of this algorithm has the following deterministic bound:
+   * if the DataFrame has N elements and if we request the quantile `phi` 
up to error `epsi`,
+   * then the algorithm will return a sample `x` from the DataFrame so 
that the *exact* rank
+   * of `x` close to (phi * N). More precisely:
+   *
+   *   floor((phi - epsi) * N) <= rank(x) <= ceil((phi + epsi) * N)
+   *
+   * Note on the algorithm used.
+   *
+   * This method implements a variation of the Greenwald-Khanna algorithm
+   * (with some speed optimizations). The algorithm was first present in 
the following article:
+   * "Space-efficient Online Computation of Quantile Summaries" by 
Greenwald, Michael
+   * and Khanna, Sanjeev. (http://dl.acm.org/citation.cfm?id=375670)
+   *
+   * The performance optimizations are detailed in the comments of the 
implementation.
+   *
+   * @param df the dataframe to estimate quantiles on
+   * @param col the name of the column
+   * @param quantile the target quantile of interest
+   * @param epsilon the target error. Should be >= 0.
+   * */
+  def approxQuantile(
--- End diff --

Ok, removed





[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...

2016-02-23 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/6042#discussion_r53821685
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -37,6 +37,16 @@ import org.apache.spark.util.sketch.{BloomFilter, 
CountMinSketch}
 final class DataFrameStatFunctions private[sql](df: DataFrame) {
 
   /**
+   * Calculate the approximate quantile of numerical column of a DataFrame.
+   * @param col the name of the column
+   * @param quantile the quantile number
+   * @return the approximate quantile
+   */
+  def approxQuantile(col: String, quantile: Double, epsilon: Double): 
Double = {
--- End diff --

Good point, changing this to an Array return





[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...

2016-02-23 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/6042#discussion_r53821749
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala 
---
@@ -27,6 +30,312 @@ import org.apache.spark.unsafe.types.UTF8String
 
 private[sql] object StatFunctions extends Logging {
 
+  import QuantileSummaries.Stats
+
+  /**
+   * Calculates the approximate quantile for the given column.
+   *
+   * If you need to compute multiple quantiles at once, you should use 
[[multipleApproxQuantiles]]
+   *
+   * Note on the target error.
+   *
+   * The result of this algorithm has the following deterministic bound:
+   * if the DataFrame has N elements and if we request the quantile `phi` 
up to error `epsi`,
+   * then the algorithm will return a sample `x` from the DataFrame so 
that the *exact* rank
+   * of `x` close to (phi * N). More precisely:
+   *
+   *   floor((phi - epsi) * N) <= rank(x) <= ceil((phi + epsi) * N)
+   *
+   * Note on the algorithm used.
+   *
+   * This method implements a variation of the Greenwald-Khanna algorithm
+   * (with some speed optimizations). The algorithm was first present in 
the following article:
+   * "Space-efficient Online Computation of Quantile Summaries" by 
Greenwald, Michael
+   * and Khanna, Sanjeev. (http://dl.acm.org/citation.cfm?id=375670)
+   *
+   * The performance optimizations are detailed in the comments of the 
implementation.
+   *
+   * @param df the dataframe to estimate quantiles on
+   * @param col the name of the column
+   * @param quantile the target quantile of interest
+   * @param epsilon the target error. Should be >= 0.
+   * */
+  def approxQuantile(
+  df: DataFrame,
+  col: String,
+  quantile: Double,
+  epsilon: Double = QuantileSummaries.defaultEpsilon): Double = {
+require(quantile >= 0.0 && quantile <= 1.0, "Quantile must be in the 
range of (0.0, 1.0).")
+val Seq(Seq(res)) = multipleApproxQuantiles(df, Seq(col), 
Seq(quantile), epsilon)
+res
+  }
+
+  /**
+   * Runs multiple quantile computations in a single pass, with the same 
target error.
+   *
+   * See [[approxQuantile)]] for more details on the approximation 
guarantees.
+   *
+   * @param df the dataframe
+   * @param cols columns of the dataframe
+   * @param quantiles target quantiles to compute
+   * @param epsilon the precision to achieve
+   * @return for each column, returns the requested approximations
+   */
+  def multipleApproxQuantiles(
+  df: DataFrame,
+  cols: Seq[String],
+  quantiles: Seq[Double],
+  epsilon: Double): Seq[Seq[Double]] = {
+val columns: Seq[Column] = cols.map { colName =>
+  val field = df.schema(colName)
+  require(field.dataType.isInstanceOf[NumericType],
+s"Quantile calculation for column $colName with data type 
${field.dataType}" +
+" is not supported.")
+  Column(Cast(Column(colName).expr, DoubleType))
+}
+val emptySummaries = Array.fill(cols.size)(
+  new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, 
epsilon))
+
+// Note that it works more or less by accident as `rdd.aggregate` is 
not a pure function:
+// this function returns the same array as given in the input (because 
`aggregate` reuses
+// the same argument).
+def apply(summaries: Array[QuantileSummaries], row: Row): 
Array[QuantileSummaries] = {
+  var i = 0
+  while (i < summaries.length) {
+summaries(i) = summaries(i).insert(row.getDouble(i))
+i += 1
+  }
+  summaries
+}
+
+def merge(
+sum1: Array[QuantileSummaries],
+sum2: Array[QuantileSummaries]): Array[QuantileSummaries] = {
+  sum1.zip(sum2).map { case (s1, s2) => 
s1.compress().merge(s2.compress()) }
+}
+val summaries = df.select(columns: 
_*).rdd.aggregate(emptySummaries)(apply, merge)
+
+summaries.map { summary => quantiles.map(summary.query) }
+  }
+
+  /**
+   * Helper class to compute approximate quantile summary.
+   * This implementation is based on the algorithm proposed in the paper:
+   * "Space-efficient Online Computation of Quantile Summaries" by 
Greenwald, Michael
+   * and Khanna, Sanjeev. (http://dl.acm.org/citation.cfm?id=375670)
+   *
+   * In order to optimize for speed, it maintains an internal buffer of 
the last seen samples,
+   * and only inserts them after crossing a certain size threshold. This 
guarantees a near-constant
+   * runtime complexity

[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...

2016-02-23 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/6042#discussion_r53821735
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala 
---
@@ -27,6 +30,312 @@ import org.apache.spark.unsafe.types.UTF8String
 
 private[sql] object StatFunctions extends Logging {
 
+  import QuantileSummaries.Stats
+
+  /**
+   * Calculates the approximate quantile for the given column.
+   *
+   * If you need to compute multiple quantiles at once, you should use 
[[multipleApproxQuantiles]]
+   *
+   * Note on the target error.
+   *
+   * The result of this algorithm has the following deterministic bound:
+   * if the DataFrame has N elements and if we request the quantile `phi` 
up to error `epsi`,
+   * then the algorithm will return a sample `x` from the DataFrame so 
that the *exact* rank
+   * of `x` close to (phi * N). More precisely:
+   *
+   *   floor((phi - epsi) * N) <= rank(x) <= ceil((phi + epsi) * N)
+   *
+   * Note on the algorithm used.
+   *
+   * This method implements a variation of the Greenwald-Khanna algorithm
+   * (with some speed optimizations). The algorithm was first present in 
the following article:
+   * "Space-efficient Online Computation of Quantile Summaries" by 
Greenwald, Michael
+   * and Khanna, Sanjeev. (http://dl.acm.org/citation.cfm?id=375670)
+   *
+   * The performance optimizations are detailed in the comments of the 
implementation.
+   *
+   * @param df the dataframe to estimate quantiles on
+   * @param col the name of the column
+   * @param quantile the target quantile of interest
+   * @param epsilon the target error. Should be >= 0.
+   * */
+  def approxQuantile(
+  df: DataFrame,
+  col: String,
+  quantile: Double,
+  epsilon: Double = QuantileSummaries.defaultEpsilon): Double = {
+require(quantile >= 0.0 && quantile <= 1.0, "Quantile must be in the 
range of (0.0, 1.0).")
+val Seq(Seq(res)) = multipleApproxQuantiles(df, Seq(col), 
Seq(quantile), epsilon)
+res
+  }
+
+  /**
+   * Runs multiple quantile computations in a single pass, with the same 
target error.
+   *
+   * See [[approxQuantile)]] for more details on the approximation 
guarantees.
+   *
+   * @param df the dataframe
+   * @param cols columns of the dataframe
+   * @param quantiles target quantiles to compute
+   * @param epsilon the precision to achieve
+   * @return for each column, returns the requested approximations
+   */
+  def multipleApproxQuantiles(
+  df: DataFrame,
+  cols: Seq[String],
+  quantiles: Seq[Double],
+  epsilon: Double): Seq[Seq[Double]] = {
+val columns: Seq[Column] = cols.map { colName =>
+  val field = df.schema(colName)
+  require(field.dataType.isInstanceOf[NumericType],
+s"Quantile calculation for column $colName with data type 
${field.dataType}" +
+" is not supported.")
+  Column(Cast(Column(colName).expr, DoubleType))
+}
+val emptySummaries = Array.fill(cols.size)(
+  new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, 
epsilon))
+
+// Note that it works more or less by accident as `rdd.aggregate` is 
not a pure function:
+// this function returns the same array as given in the input (because 
`aggregate` reuses
+// the same argument).
+def apply(summaries: Array[QuantileSummaries], row: Row): 
Array[QuantileSummaries] = {
+  var i = 0
+  while (i < summaries.length) {
+summaries(i) = summaries(i).insert(row.getDouble(i))
+i += 1
+  }
+  summaries
+}
+
+def merge(
+sum1: Array[QuantileSummaries],
+sum2: Array[QuantileSummaries]): Array[QuantileSummaries] = {
+  sum1.zip(sum2).map { case (s1, s2) => 
s1.compress().merge(s2.compress()) }
+}
+val summaries = df.select(columns: 
_*).rdd.aggregate(emptySummaries)(apply, merge)
+
+summaries.map { summary => quantiles.map(summary.query) }
+  }
+
+  /**
+   * Helper class to compute approximate quantile summary.
+   * This implementation is based on the algorithm proposed in the paper:
+   * "Space-efficient Online Computation of Quantile Summaries" by 
Greenwald, Michael
+   * and Khanna, Sanjeev. (http://dl.acm.org/citation.cfm?id=375670)
+   *
+   * In order to optimize for speed, it maintains an internal buffer of 
the last seen samples,
+   * and only inserts them after crossing a certain size threshold. This 
guarantees a near-constant
+   * runtime complexity

[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...

2016-02-23 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/6042#discussion_r53821700
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala 
---
@@ -27,6 +30,312 @@ import org.apache.spark.unsafe.types.UTF8String
 
 private[sql] object StatFunctions extends Logging {
 
+  import QuantileSummaries.Stats
+
+  /**
+   * Calculates the approximate quantile for the given column.
+   *
+   * If you need to compute multiple quantiles at once, you should use 
[[multipleApproxQuantiles]]
+   *
+   * Note on the target error.
+   *
+   * The result of this algorithm has the following deterministic bound:
+   * if the DataFrame has N elements and if we request the quantile `phi` 
up to error `epsi`,
+   * then the algorithm will return a sample `x` from the DataFrame so 
that the *exact* rank
+   * of `x` close to (phi * N). More precisely:
+   *
+   *   floor((phi - epsi) * N) <= rank(x) <= ceil((phi + epsi) * N)
+   *
+   * Note on the algorithm used.
+   *
+   * This method implements a variation of the Greenwald-Khanna algorithm
+   * (with some speed optimizations). The algorithm was first present in 
the following article:
+   * "Space-efficient Online Computation of Quantile Summaries" by 
Greenwald, Michael
+   * and Khanna, Sanjeev. (http://dl.acm.org/citation.cfm?id=375670)
--- End diff --

Done in both places





[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...

2016-02-23 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/6042#discussion_r53821676
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -37,6 +37,16 @@ import org.apache.spark.util.sketch.{BloomFilter, 
CountMinSketch}
 final class DataFrameStatFunctions private[sql](df: DataFrame) {
 
   /**
+   * Calculate the approximate quantile of numerical column of a DataFrame.
+   * @param col the name of the column
+   * @param quantile the quantile number
+   * @return the approximate quantile
--- End diff --

Done





[GitHub] spark pull request: [SPARK-6761][SQL][ML] Fixes to API and documen...

2016-02-23 Thread thunterdb
GitHub user thunterdb opened a pull request:

https://github.com/apache/spark/pull/11325

[SPARK-6761][SQL][ML] Fixes to API and documentation of approximate 
quantiles

## What changes were proposed in this pull request?

This PR addresses the remaining comments from @mengxr  in #6042 
 - fixes to the API
 - more tests
 - cleanups and more documentation in the implementation

## How was this patch tested?

This patch passes the comprehensive test suite for the approximate quantiles 
and adds more tests for the public API.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thunterdb/spark spark-6761c

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11325.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11325


commit a77b7e678ae266210bfaf8ca9b88fc3d2dd54d0d
Author: Timothy Hunter 
Date:   2016-02-23T17:32:18Z

SPARK-6761 Additional cleanups to address comments, there is an error 
somewhere

commit dbf87f1da765eed21880cda71a4b5ec62bee3e4d
Author: Timothy Hunter 
Date:   2016-02-23T17:50:13Z

fixed test







[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...

2016-02-23 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/6042#issuecomment-187774708
  
@mengxr thanks for the review, will do in another PR.





[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...

2016-02-11 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/6042#issuecomment-182958229
  
@viirya sorry I missed your email; I will look at your PR today.





[GitHub] spark pull request: [SPARK-13146][SQL] Management API for continuo...

2016-02-08 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/11030#discussion_r52202089
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/ContinuousQuery.scala ---
@@ -17,11 +17,47 @@
 
 package org.apache.spark.sql
 
+import org.apache.spark.annotation.Experimental
+
 /**
+ * :: Experimental ::
  * A handle to a query that is executing continuously in the background as 
new data arrives.
+ * @since 2.0.0
  */
+@Experimental
 trait ContinuousQuery {
 
+  /** Returns the name of the query */
+  def name: String
+
+  def sqlContext: SQLContext
+
+  /** Whether the query is currently active or not */
+  def isActive: Boolean
+
+  /** Returns the [[ContinuousQueryException]] if the query was terminated 
by an exception. */
+  def exception: Option[ContinuousQueryException]
+
+  /** Returns current status of all the sources. */
+  def sourceStatuses: Array[SourceStatus]
+
+  /** Returns current status of the sink. */
+  def sinkStatus: SinkStatus
+
+  /**
+   * Waits for the termination of this query, either by `stop` or by any 
exception.
+   * @throws ContinuousQueryException, if the query terminated by an 
exception.
+   */
+  def awaitTermination(): Unit
+
+  /**
+   * Waits for the termination of this query, either by `stop` or by any 
exception.
+   * Returns whether the query has terminated or not.
+   * @throws ContinuousQueryException, if the query terminated by an 
exception before
+   * `timeoutMs` milliseconds
+   */
+  def awaitTermination(timeoutMs: Long): Boolean
+
   /**
* Stops the execution of this query if it is running.  This method 
blocks until the threads
--- End diff --

Also, can you document whether the methods above are thread-safe? This is 
important for complex client applications.





[GitHub] spark pull request: [SPARK-13146][SQL] Management API for continuo...

2016-02-08 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/11030#discussion_r52201810
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/ContinuousQuery.scala ---
@@ -17,11 +17,47 @@
 
 package org.apache.spark.sql
 
+import org.apache.spark.annotation.Experimental
+
 /**
+ * :: Experimental ::
  * A handle to a query that is executing continuously in the background as 
new data arrives.
+ * @since 2.0.0
  */
+@Experimental
 trait ContinuousQuery {
 
+  /** Returns the name of the query */
+  def name: String
+
+  def sqlContext: SQLContext
+
+  /** Whether the query is currently active or not */
+  def isActive: Boolean
+
+  /** Returns the [[ContinuousQueryException]] if the query was terminated 
by an exception. */
+  def exception: Option[ContinuousQueryException]
+
+  /** Returns current status of all the sources. */
+  def sourceStatuses: Array[SourceStatus]
+
+  /** Returns current status of the sink. */
+  def sinkStatus: SinkStatus
+
+  /**
+   * Waits for the termination of this query, either by `stop` or by any 
exception.
+   * @throws ContinuousQueryException, if the query terminated by an 
exception.
+   */
+  def awaitTermination(): Unit
+
+  /**
+   * Waits for the termination of this query, either by `stop` or by any 
exception.
+   * Returns whether the query has terminated or not.
+   * @throws ContinuousQueryException, if the query terminated by an 
exception before
+   * `timeoutMs` milliseconds
+   */
+  def awaitTermination(timeoutMs: Long): Boolean
+
   /**
* Stops the execution of this query if it is running.  This method 
blocks until the threads
--- End diff --

There is a typo in the doc. Also, could you document the kinds of exceptions 
that are expected to be thrown by `stop()`?
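For instance, the Scaladoc could spell this out with a `@throws` tag (a sketch 
only; the exact exception type and condition are for the author to confirm):

```scala
/**
 * Stops the execution of this query if it is running. This method blocks until
 * the threads performing execution have stopped.
 * @throws ContinuousQueryException if the query fails while being stopped (hypothetical)
 */
def stop(): Unit
```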





[GitHub] spark pull request: [SPARK-11515][ML] QuantileDiscretizer should t...

2016-01-19 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/9535#discussion_r50163562
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -23,8 +23,8 @@ import org.apache.spark.Logging
 import org.apache.spark.annotation.{Experimental, Since}
 import org.apache.spark.ml._
 import org.apache.spark.ml.attribute.NominalAttribute
+import org.apache.spark.ml.param.shared.{HasSeed, HasInputCol, 
HasOutputCol}
--- End diff --

@yu-iskw you need to respect the lexicographic order for the imports
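For this particular line, the lexicographically ordered form would be:

```scala
import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol, HasSeed}
```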





[GitHub] spark pull request: [SPARK-12765] [ML] [CountVectorizer]fix CountV...

2016-01-14 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10720#issuecomment-171778898
  
@sloth2012 this looks good to me, thanks for the fix. 

cc @jkbradley 





[GitHub] spark pull request: [SPARK-12026] [MLlib] ChiSqTest gets slower an...

2016-01-12 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10146#issuecomment-170949938
  
@hhbyyh yes, option 3 sounds good.

A caveat, though, about the numbers you posted: microbenchmarks on the JVM 
are very hard to get right, and a simple loop is not considered good practice 
in general. I recommend you use a framework like JMH; there are some Scala 
wrappers for it.
http://openjdk.java.net/projects/code-tools/jmh/
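As an illustration, a minimal JMH benchmark in Scala could look like this (a 
sketch assuming the sbt-jmh plugin; the benchmarked body is a placeholder, not 
the code from this PR):

```scala
import java.util.concurrent.TimeUnit

import org.openjdk.jmh.annotations._

@State(Scope.Thread)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MICROSECONDS)
class ChiSqBenchmark {
  // Placeholder input; in practice this would be the data exercised by the loop above.
  val xs: Array[Double] = Array.tabulate(1000)(_.toDouble)

  @Benchmark
  def baselineSum(): Double = xs.sum  // replace with the code path under test
}
```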





[GitHub] spark pull request: [SPARK-11938][ML] Expose numFeatures in all ML...

2016-01-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/9936#discussion_r49143733
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -371,6 +378,103 @@ def test_fit_maximize_metric(self):
 self.assertEqual(1.0, bestModelMetric, "Best model has R-squared 
of 1")
 
 
+class RegressorTest(PySparkTestCase):
+
+def setupData(self):
+try:
+self.df
+except AttributeError:
+from pyspark.mllib.linalg import Vectors
+sqlContext = SQLContext(self.sc)
+self.df = sqlContext.createDataFrame([
+(1.0, Vectors.dense(1.0)),
+(0.0, Vectors.sparse(1, [], []))], ["label", "features"])
+
+def test_linear_regression(self):
+self.setupData()
+lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal")
+model = lr.fit(self.df)
+self.assertEquals(1, model.numFeatures)
+
+def test_decision_tree_regressor(self):
+self.setupData()
+dt = DecisionTreeRegressor(maxDepth=2)
+model = dt.fit(self.df)
+self.assertEquals(1, model.numFeatures)
+
+def test_random_forest_regressor(self):
+self.setupData()
+rf = RandomForestRegressor(numTrees=2, maxDepth=2, seed=42)
+model = rf.fit(self.df)
+self.assertEquals(1, model.numFeatures)
+
+def test_gbt_regressor(self):
+self.setupData()
+gbt = GBTRegressor(maxIter=5, maxDepth=2)
+model = gbt.fit(self.df)
+self.assertEquals(1, model.numFeatures)
+
+
+class ClassificationTest(PySparkTestCase):
+
+def setupData(self):
+try:
+self.df
+except AttributeError:
+from pyspark.mllib.linalg import Vectors
+sqlContext = SQLContext(self.sc)
+self.df = sqlContext.createDataFrame([
+(1.0, Vectors.dense(1.0, 0.0)),
+(0.0, Vectors.sparse(2, [1], [1.0]))], ["label", 
"features"])
+
+def test_logistic_regression(self):
+self.setupData()
+lr = LogisticRegression(maxIter=5, regParam=0.01)
+model = lr.fit(self.df)
+self.assertEqual(2, model.numFeatures)
+
+def test_decision_tree_classifier(self):
+from pyspark.ml.feature import StringIndexer
--- End diff --

same thing here





[GitHub] spark pull request: [SPARK-11938][ML] Expose numFeatures in all ML...

2016-01-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/9936#discussion_r49143685
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -371,6 +378,103 @@ def test_fit_maximize_metric(self):
 self.assertEqual(1.0, bestModelMetric, "Best model has R-squared 
of 1")
 
 
+class RegressorTest(PySparkTestCase):
+
+def setupData(self):
+try:
+self.df
+except AttributeError:
+from pyspark.mllib.linalg import Vectors
+sqlContext = SQLContext(self.sc)
+self.df = sqlContext.createDataFrame([
+(1.0, Vectors.dense(1.0)),
+(0.0, Vectors.sparse(1, [], []))], ["label", "features"])
+
+def test_linear_regression(self):
+self.setupData()
+lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal")
+model = lr.fit(self.df)
+self.assertEquals(1, model.numFeatures)
+
+def test_decision_tree_regressor(self):
+self.setupData()
+dt = DecisionTreeRegressor(maxDepth=2)
+model = dt.fit(self.df)
+self.assertEquals(1, model.numFeatures)
+
+def test_random_forest_regressor(self):
+self.setupData()
+rf = RandomForestRegressor(numTrees=2, maxDepth=2, seed=42)
+model = rf.fit(self.df)
+self.assertEquals(1, model.numFeatures)
+
+def test_gbt_regressor(self):
+self.setupData()
+gbt = GBTRegressor(maxIter=5, maxDepth=2)
+model = gbt.fit(self.df)
+self.assertEquals(1, model.numFeatures)
+
+
+class ClassificationTest(PySparkTestCase):
+
+def setupData(self):
+try:
+self.df
+except AttributeError:
+from pyspark.mllib.linalg import Vectors
--- End diff --

Is there any reason for putting the import inside the method body rather than at the top of the file?





[GitHub] spark pull request: [SPARK-11923][ML] Python API for ml.feature.Ch...

2016-01-07 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10186#issuecomment-169840549
  
LGTM cc @jkbradley 





[GitHub] spark pull request: [SPARK-12632][Python][Make Parameter Descripti...

2016-01-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10602#discussion_r49140273
  
--- Diff: python/pyspark/mllib/fpm.py ---
@@ -130,15 +133,21 @@ def train(cls, data, minSupport=0.1, 
maxPatternLength=10, maxLocalProjDBSize=320
 """
 Finds the complete set of frequent sequential patterns in the 
input sequences of itemsets.
 
-:param data: The input data set, each element contains a sequnce 
of itemsets.
-:param minSupport: the minimal support level of the sequential 
pattern, any pattern appears
-more than  (minSupport * size-of-the-dataset) times will be 
output (default: `0.1`)
-:param maxPatternLength: the maximal length of the sequential 
pattern, any pattern appears
-less than maxPatternLength will be output. (default: `10`)
-:param maxLocalProjDBSize: The maximum number of items (including 
delimiters used in
-the internal storage format) allowed in a projected database 
before local
-processing. If a projected database exceeds this size, another
-iteration of distributed prefix growth is run. (default: 
`3200`)
+:param data:
+  The input data set, each element contains a sequnce of itemsets.
+:param minSupport:
+  The minimal support level of the sequential pattern, any pattern 
appears more than
--- End diff --

The lines below have indentation issues.





[GitHub] spark pull request: [SPARK-12632][Python][Make Parameter Descripti...

2016-01-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10602#discussion_r49140295
  
--- Diff: python/pyspark/mllib/recommendation.py ---
@@ -239,6 +239,17 @@ def train(cls, ratings, rank, iterations=5, 
lambda_=0.01, blocks=-1, nonnegative
 product of two lower-rank matrices of a given rank (number of 
features). To solve for these
 features, we run a given number of iterations of ALS. This is done 
using a level of
 parallelism given by `blocks`.
+   
+   :param iterations:
--- End diff --

indentation issues?





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49140159
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,116 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 
0.0]))
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+return x.map(self.predict(x))
--- End diff --

Also, maybe you can add a test for this case in the docstring.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49140117
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,116 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 
0.0]))
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+return x.map(self.predict(x))
--- End diff --

I am not sure I understand this line, shouldn't it be `x.map(self.predict)`?





[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...

2016-01-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10570#discussion_r49139498
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala
 ---
@@ -57,8 +57,8 @@ trait ExpressionEvalHelper extends 
GeneratorDrivenPropertyChecks {
 (result, expected) match {
   case (result: Array[Byte], expected: Array[Byte]) =>
 java.util.Arrays.equals(result, expected)
-  case (result: Double, expected: Spread[Double]) =>
-expected.isWithin(result)
+  case (result: Double, expected: Spread[_]) => // Can't use 
Spread[Double] b/c of erasure
--- End diff --

I see. Sadly, I think this is not going to work here without extra work, 
and then it is not going to do what you want. This version of ScalaTest uses 
`Manifest`s to encode type information, and you would have to define one 
manually in this context:
```scala
implicit val x: Manifest[Int] = ???
stream shouldBe a [ReceiverInputDStream[Int @unchecked]]
```
but then ScalaTest is not aware of the `@unchecked` annotation and still emits 
a warning. Let's just use `_` in the suite file.





[GitHub] spark pull request: [SPARK-12634][Python][MLlib][DOC] Update param...

2016-01-07 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10601#issuecomment-169830444
  
@vijaykiran thanks for the style fixes.

cc @jkbradley 





[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...

2016-01-07 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10570#issuecomment-169828910
  
@srowen I am confused; did you send me this message for this PR? It somehow 
does not show up on GitHub, and I do not see the Jenkins run that failed:
```
@thunterdb Hm, I get this error now:

[error] 
/home/jenkins/workspace/SparkPullRequestBuilder/extras/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala:140:
 No Manifest available for 
org.apache.spark.streaming.kinesis.KinesisBackedBlockRDD[Array[Byte] 
@unchecked].
[error] nonEmptyRDD shouldBe a [KinesisBackedBlockRDD[Array[Byte] 
@unchecked]]
[error]  
Did I misunderstand how to apply it or is it not going to work here?
```





[GitHub] spark pull request: [SPARK-12631] [PYSPARK] [DOC] PySpark clusteri...

2016-01-06 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10610#issuecomment-169506325
  
@BryanCutler it looks great, thanks!
One overall comment about the seeds: it is unclear what `None` means, and it 
behaves differently between `spark.ml` and `spark.mllib`. I suggest you mention 
that, if unspecified, the seed is initialized to the current machine time. 
(It is a fixed value in `spark.ml`.)





[GitHub] spark pull request: [SPARK-12631] [PYSPARK] [DOC] PySpark clusteri...

2016-01-06 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10610#discussion_r49028171
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -774,17 +843,32 @@ def train(cls, rdd, k=10, maxIterations=20, 
docConcentration=-1.0,
   topicConcentration=-1.0, seed=None, checkpointInterval=10, 
optimizer="em"):
 """Train a LDA model.
 
-:param rdd: RDD of data points
-:param k:   Number of clusters you want
-:param maxIterations:   Number of iterations. Default to 20
-:param docConcentration:Concentration parameter (commonly 
named "alpha")
-for the prior placed on documents' distributions over topics 
("theta").
-:param topicConcentration:  Concentration parameter (commonly 
named "beta" or "eta")
-for the prior placed on topics' distributions over terms.
-:param seed:Random Seed
-:param checkpointInterval:  Period (in iterations) between 
checkpoints.
-:param optimizer:   LDAOptimizer used to perform the 
actual calculation.
-Currently "em", "online" are supported. Default to "em".
+:param rdd:
+  Train with a RDD of data points.
+:param k:
+  Number of topics to infer.  I.e., the number of soft cluster 
centers.
--- End diff --

nit: according to Strunk and White, you cannot start a sentence with "I.e."; 
write `to infer, i.e., the...` instead.





[GitHub] spark pull request: [SPARK-9716] [ML] BinaryClassificationEvaluato...

2016-01-06 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10472#issuecomment-169483754
  
cc @jkbradley 





[GitHub] spark pull request: [SPARK-12663] [MLlib] More informative error m...

2016-01-06 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10611#issuecomment-169483568
  
cc @jkbradley 





[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...

2016-01-06 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10570#issuecomment-169449477
  
cc @jkbradley 

@srowen I have some concerns about the number of files touched by this PR; it 
may be hard to merge without an ever-present conflict somewhere. Maybe 
splitting it up would help it go in faster?





[GitHub] spark pull request: [SPARK-11520][ML] RegressionMetrics should sup...

2016-01-06 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/9907#issuecomment-169444237
  
@Lewuathe thanks for your patch. I think it will require more work in 
`RegressionMetrics` to fully implement weighted metrics. We need to make the 
following changes:
- expose `weightSum` in `MultivariateStatisticalSummary` (as a developer API)
- the computations of `SSreg` and `SStot` should take the weights into account
- in `RegressionMetrics`, all references to `summary.count` should be replaced 
by `summary.weightSum`

Given that the default weight is 1.0, this should give the same result as 
before when no explicit weights are supplied.
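To make the intent concrete, here is a rough sketch (not the actual 
implementation; it assumes the total weight is made available somehow) of how 
the count-based normalizations would change:

```scala
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary

// Sketch only: as in the current code, column 0 of `summary` holds the
// observations and column 1 the errors (observation - prediction);
// `weightSum` is the proposed total weight used in place of `summary.count`.
class WeightedRegressionMetricsSketch(
    summary: MultivariateStatisticalSummary,
    weightSum: Double) {

  def meanAbsoluteError: Double = summary.normL1(1) / weightSum

  def meanSquaredError: Double = math.pow(summary.normL2(1), 2) / weightSum

  def rootMeanSquaredError: Double = math.sqrt(meanSquaredError)
}
```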





[GitHub] spark pull request: [SPARK-11520][ML] RegressionMetrics should sup...

2016-01-06 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/9907#discussion_r49002633
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/evaluation/RegressionMetricsSuite.scala
 ---
@@ -109,4 +109,55 @@ class RegressionMetricsSuite extends SparkFunSuite 
with MLlibTestSparkContext {
   "root mean squared error mismatch")
 assert(metrics.r2 ~== 1.0 absTol 1E-5, "r2 score mismatch")
   }
+
+  test("regression metrics with same(1.0) weight samples") {
+val predictionAndObservationWithWeight = sc.parallelize(
+  Seq((2.25, 3.0, 1.0), (-0.25, -0.5, 1.0), (1.75, 2.0, 1.0), (7.75, 
7.0, 1.0)), 2)
+val metrics = new RegressionMetrics(predictionAndObservationWithWeight)
+assert(metrics.explainedVariance ~== 8.79687 absTol 1E-5,
+  "explained variance regression score mismatch")
+assert(metrics.meanAbsoluteError ~== 0.5 absTol 1E-5, "mean absolute 
error mismatch")
+assert(metrics.meanSquaredError ~== 0.3125 absTol 1E-5, "mean squared 
error mismatch")
+assert(metrics.rootMeanSquaredError ~== 0.55901 absTol 1E-5,
+  "root mean squared error mismatch")
+assert(metrics.r2 ~== 0.95717 absTol 1E-5, "r2 score mismatch")
+  }
+
+
+  /**
+* The following values are hand calculated using the formula:
+* 
[[https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights]]
+* preds = c(2.25, -0.25, 1.75, 7.75)
+* obs = c(3.0, -0.5, 2.0, 7.0)
+* weights = c(0.1, 0.2, 0.15, 0.05)
+* count = 4
+*
+* Weighted metrics can be calculated with 
MultivariateStatisticalSummary.
+* (observations, observations - predictions)
+* mean(1.7, 0.05)
+* variance(7.3, 0.3)
+* numNonZeros (0.5, 0.5)
+* max (7.0, 0.75)
+* min (-0.5, -0.75)
+* normL2  (2.0, 0.32596)
+* normL1  (1.05, 0.2)
+*
+* explainedVariance: sum((preds - 1.7)^2) / count = 10.1775
+* meanAbsoluteError: normL1(1) / count = 0.05
+* meanSquaredError: normL2(1)^2 / count = 0.02656
+* rootMeanSquaredError: sqrt(meanSquaredError) = 0.16298
+* r2: 1 - normL2(1)^2 / (variance(0) * (count - 1)) = 0.9951484
--- End diff --

Same thing for this formula.





[GitHub] spark pull request: [SPARK-11520][ML] RegressionMetrics should sup...

2016-01-06 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/9907#discussion_r49002581
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/evaluation/RegressionMetricsSuite.scala
 ---
@@ -109,4 +109,55 @@ class RegressionMetricsSuite extends SparkFunSuite 
with MLlibTestSparkContext {
   "root mean squared error mismatch")
 assert(metrics.r2 ~== 1.0 absTol 1E-5, "r2 score mismatch")
   }
+
+  test("regression metrics with same(1.0) weight samples") {
+val predictionAndObservationWithWeight = sc.parallelize(
+  Seq((2.25, 3.0, 1.0), (-0.25, -0.5, 1.0), (1.75, 2.0, 1.0), (7.75, 
7.0, 1.0)), 2)
+val metrics = new RegressionMetrics(predictionAndObservationWithWeight)
+assert(metrics.explainedVariance ~== 8.79687 absTol 1E-5,
+  "explained variance regression score mismatch")
+assert(metrics.meanAbsoluteError ~== 0.5 absTol 1E-5, "mean absolute 
error mismatch")
+assert(metrics.meanSquaredError ~== 0.3125 absTol 1E-5, "mean squared 
error mismatch")
+assert(metrics.rootMeanSquaredError ~== 0.55901 absTol 1E-5,
+  "root mean squared error mismatch")
+assert(metrics.r2 ~== 0.95717 absTol 1E-5, "r2 score mismatch")
+  }
+
+
+  /**
+* The following values are hand calculated using the formula:
+* 
[[https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights]]
+* preds = c(2.25, -0.25, 1.75, 7.75)
+* obs = c(3.0, -0.5, 2.0, 7.0)
+* weights = c(0.1, 0.2, 0.15, 0.05)
+* count = 4
+*
+* Weighted metrics can be calculated with 
MultivariateStatisticalSummary.
+* (observations, observations - predictions)
+* mean(1.7, 0.05)
+* variance(7.3, 0.3)
+* numNonZeros (0.5, 0.5)
+* max (7.0, 0.75)
+* min (-0.5, -0.75)
+* normL2  (2.0, 0.32596)
+* normL1  (1.05, 0.2)
+*
+* explainedVariance: sum((preds - 1.7)^2) / count = 10.1775
--- End diff --

I am having an issue with these formulas, based on the very Wikipedia page you 
are referencing: they should not be normalized by the count but by the weight 
sum, right?
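In other words, taking the weighted mean squared error as an example, the 
reliability-weighted form from that article would be (my notation):

```
\mathrm{MSE}_w \;=\; \frac{\sum_i w_i\,(y_i - \hat{y}_i)^2}{\sum_i w_i}
\qquad\text{rather than}\qquad
\frac{\sum_i w_i\,(y_i - \hat{y}_i)^2}{n}
```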





[GitHub] spark pull request: [SPARK-11520][ML] RegressionMetrics should sup...

2016-01-06 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/9907#discussion_r49000659
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala 
---
@@ -39,16 +51,17 @@ class RegressionMetrics @Since("1.2.0") (
*  prediction and observation
*/
   private[mllib] def this(predictionAndObservations: DataFrame) =
-this(predictionAndObservations.map(r => (r.getDouble(0), 
r.getDouble(1
+this(predictionAndObservations.map(r => (r.getDouble(0), 
r.getDouble(1), 1.0)))
 
   /**
* Use MultivariateOnlineSummarizer to calculate summary statistics of 
observations and errors.
*/
   private lazy val summary: MultivariateStatisticalSummary = {
-val summary: MultivariateStatisticalSummary = 
predictionAndObservations.map {
-  case (prediction, observation) => Vectors.dense(observation, 
observation - prediction)
+val summary: MultivariateStatisticalSummary = 
predictionAndObservationsWithWeight.map {
+  case (prediction, observation, weight) =>
+(Vectors.dense(observation, observation - prediction), weight)
 }.aggregate(new MultivariateOnlineSummarizer())(
-(summary, v) => summary.add(v),
+(summary, sample) => summary.add(sample._1, sample._2),
--- End diff --

We should rename this `summary` variable; it is used three times for different 
objects.

Also, `sample` is a bit too generic here, and using the `._1`/`._2` accessors 
is not very intuitive. I suggest you unpack the tuple fully: `{ case 
(currentSummary, (vec, weight)) => currentSummary.add(vec, weight) }`
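As a small, self-contained illustration of the style I mean (names are mine, 
and this folds over a plain collection rather than the summarizer):

```scala
// Each element is (error, weight); the nested case unpacks both the
// accumulator and the element, so no ._1/._2 is needed.
val weightedErrors: Seq[(Double, Double)] = Seq((0.75, 0.1), (-0.25, 0.2), (0.25, 0.15))
val (weightedSum, totalWeight) = weightedErrors.foldLeft((0.0, 0.0)) {
  case ((sumAcc, weightAcc), (err, weight)) => (sumAcc + err * weight, weightAcc + weight)
}
println(s"weighted mean error = ${weightedSum / totalWeight}")
```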





[GitHub] spark pull request: [SPARK-12346] [ML] Missing attribute names in ...

2016-01-06 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10323#issuecomment-169425244
  
@ericl this looks great, thanks!





[GitHub] spark pull request: [SPARK-9716] [ML] BinaryClassificationEvaluato...

2016-01-06 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10472#issuecomment-169418377
  
@BenFradet thanks! Just a small comment.





[GitHub] spark pull request: [SPARK-9716] [ML] BinaryClassificationEvaluato...

2016-01-06 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10472#discussion_r48993376
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala 
---
@@ -44,6 +44,23 @@ private[spark] object SchemaUtils {
   }
 
   /**
+* Check whether the given schema contains a column of one of the 
require data types.
+* @param colName  column name
+* @param dataTypes  required column data types
+*/
+  def checkColumnTypes(
+  schema: StructType,
+  colName: String,
+  dataTypes: Seq[DataType],
+  msg: String = ""): Unit = {
+val actualDataType = schema(colName).dataType
+val message = if (msg != null && msg.trim.length > 0) " " + msg else ""
+require(dataTypes.exists(actualDataType.equals(_)),
--- End diff --

You do not need the `_`:  `dataTypes.exists(actualDataType.equals)`





[GitHub] spark pull request: [SPARK-9716] [ML] BinaryClassificationEvaluato...

2016-01-06 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10472#discussion_r48993085
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala
 ---
@@ -79,13 +79,14 @@ class BinaryClassificationEvaluator @Since("1.4.0") 
(@Since("1.4.0") override va
   @Since("1.2.0")
   override def evaluate(dataset: DataFrame): Double = {
 val schema = dataset.schema
-SchemaUtils.checkColumnType(schema, $(rawPredictionCol), new VectorUDT)
+SchemaUtils.checkColumnTypes(schema, $(rawPredictionCol), 
Seq(DoubleType, new VectorUDT))
 SchemaUtils.checkColumnType(schema, $(labelCol), DoubleType)
 
 // TODO: When dataset metadata has been implemented, check 
rawPredictionCol vector length = 2.
 val scoreAndLabels = dataset.select($(rawPredictionCol), $(labelCol))
-  .map { case Row(rawPrediction: Vector, label: Double) =>
-(rawPrediction(1), label)
+  .map {
+case Row(rawPrediction: Vector, label: Double) => 
(rawPrediction(1), label)
+case Row(rawPrediction: Double, label: Double) => (rawPrediction, 
label)
--- End diff --

There is a small loss of performance because a conditional branch is 
introduced here, but I believe the cost of unpacking the row is much higher 
anyway.





[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...

2016-01-06 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10570#issuecomment-169396527
  
@srowen thanks a lot for the cleanup! Just two comments.





[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...

2016-01-06 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10570#discussion_r48983461
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala ---
@@ -206,7 +206,7 @@ abstract class QueryTest extends PlanTest {
 val jsonString = try {
   logicalPlan.toJSON
 } catch {
-  case e =>
+  case e: Throwable =>
--- End diff --

This is more a potential bug in the original code, but why not use 
`NonFatal(e)` here instead?
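Something along these lines is what I have in mind (just a sketch with made-up 
names; the point is that `NonFatal` does not swallow fatal errors such as 
`OutOfMemoryError` the way `case e: Throwable` does):

```scala
import scala.util.control.NonFatal

def tryRender(render: () => String): Option[String] =
  try {
    Some(render())
  } catch {
    case NonFatal(e) =>
      // ordinary failures are reported; fatal JVM errors still propagate
      println(s"rendering failed: ${e.getMessage}")
      None
  }

// usage sketch
tryRender(() => """{"plan": "ok"}""")               // Some({"plan": "ok"})
tryRender(() => throw new RuntimeException("boom")) // prints the message, returns None
```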





[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...

2016-01-06 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10570#discussion_r48983495
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala ---
@@ -231,7 +231,7 @@ abstract class QueryTest extends PlanTest {
 val jsonBackPlan = try {
   TreeNode.fromJSON[LogicalPlan](jsonString, sqlContext.sparkContext)
 } catch {
-  case e =>
+  case e: Throwable =>
--- End diff --

same thing





[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...

2016-01-06 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10570#discussion_r48983231
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala
 ---
@@ -275,8 +275,8 @@ trait ExpressionEvalHelper extends 
GeneratorDrivenPropertyChecks {
 (result, expected) match {
   case (result: Array[Byte], expected: Array[Byte]) =>
 java.util.Arrays.equals(result, expected)
-  case (result: Double, expected: Spread[Double]) =>
-expected.isWithin(result)
+  case (result: Double, expected: Spread[_]) => // can't use 
Spread[Double] b/c of erasure
--- End diff --

same thing here





[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...

2016-01-06 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10570#discussion_r48983200
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala
 ---
@@ -57,8 +57,8 @@ trait ExpressionEvalHelper extends 
GeneratorDrivenPropertyChecks {
 (result, expected) match {
   case (result: Array[Byte], expected: Array[Byte]) =>
 java.util.Arrays.equals(result, expected)
-  case (result: Double, expected: Spread[Double]) =>
-expected.isWithin(result)
+  case (result: Double, expected: Spread[_]) => // Can't use 
Spread[Double] b/c of erasure
--- End diff --

Maybe I am not familiar with the style guidelines on this point, but this 
seems to be a perfect use case for `@unchecked`: 
http://www.scala-lang.org/api/current/index.html#scala.unchecked
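For example (a generic, self-contained sketch; in this helper the pattern 
would presumably read `expected: Spread[Double @unchecked]`):

```scala
// Without @unchecked the compiler warns that List[String] cannot be checked
// at runtime because of erasure; the annotation just silences that warning.
def firstString(value: Any): Option[String] = value match {
  case xs: List[String @unchecked] => xs.headOption
  case _ => None
}

println(firstString(List("a", "b")))  // Some(a)
println(firstString(42))              // None
```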





[GitHub] spark pull request: [SPARK-12663] [MLlib] More informative error m...

2016-01-05 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10611#issuecomment-169190994
  
@robert-dodier thanks for your PR. Can you please fix the style issue?





[GitHub] spark pull request: [SPARK-12663] [MLlib] More informative error m...

2016-01-05 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10611#discussion_r48919311
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
@@ -86,7 +86,7 @@ object MLUtils {
 val indicesLength = indices.length
 while (i < indicesLength) {
   val current = indices(i)
-  require(current > previous, "indices should be one-based and in 
ascending order" )
+  require(current > previous, "indices should be one-based and in 
ascending order; found current=" + current + ", previous=" + previous + "; 
line=\"" + line + "\"")
--- End diff --

This line is too long (see the Spark style guide). Also, I recommend you use 
Scala's string interpolation rather than manually building the string: 
```scala
s"previous=$previous, line='$line' "
```
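Putting it together, the whole `require` could read something like this (a 
sketch with the check pulled into a small helper just to keep the example 
self-contained; the exact wording is up to you):

```scala
def checkAscending(previous: Int, current: Int, line: String): Unit =
  require(current > previous,
    s"indices should be one-based and in ascending order; " +
    s"found current=$current, previous=$previous; line=\"$line\"")

checkAscending(1, 3, "1:2.0 3:4.5")    // passes
// checkAscending(3, 1, "3:4.5 1:2.0") // would throw IllegalArgumentException
```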





[GitHub] spark pull request: [SPARK-12006][ML][PYTHON] Fix GMM failure if i...

2016-01-05 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/9986#discussion_r48880089
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -346,7 +346,7 @@ def train(cls, rdd, k, convergenceTol=1e-3, 
maxIterations=100, seed=None, initia
 if initialModel.k != k:
 raise Exception("Mismatched cluster count, initialModel.k 
= %s, however k = %s"
 % (initialModel.k, k))
-initialModelWeights = initialModel.weights
+initialModelWeights = [w for w in initialModel.weights]
--- End diff --

Instead of doing this, I suggest you explicitly call 
`list(initialModel.weights)`; it is easier to understand.





[GitHub] spark pull request: [SPARK-12026] [MLlib] ChiSqTest gets slower an...

2016-01-04 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10146#issuecomment-168867750
  
@hhbyyh thanks for the fix; I just have one small comment.





[GitHub] spark pull request: [SPARK-12026] [MLlib] ChiSqTest gets slower an...

2016-01-04 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10146#discussion_r48805286
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSqTest.scala ---
@@ -109,7 +109,9 @@ private[stat] object ChiSqTest extends Logging {
   }
   i += 1
   distinctLabels += label
-  features.toArray.view.zipWithIndex.slice(startCol, endCol).map { 
case (feature, col) =>
+  val featureArray = features.toArray
+  (startCol until endCol).map { col =>
+val feature = featureArray(col)
--- End diff --

Ah yes, good catch about the view incrementally building up a copy of the 
feature vectors.

Let's also remove `featureArray`; we can directly use `val feature = 
features(col)`.
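Roughly like this (a standalone sketch; the column bounds and vector are made 
up):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Index the Vector directly instead of materializing a copy via toArray.
def featuresInRange(features: Vector, startCol: Int, endCol: Int): Seq[(Double, Int)] =
  (startCol until endCol).map(col => (features(col), col))

println(featuresInRange(Vectors.dense(1.0, 2.0, 3.0, 4.0), 1, 3))  // Vector((2.0,1), (3.0,2))
```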





[GitHub] spark pull request: [SPARK-12041] [ML] [PySpark] Add columnSimilar...

2016-01-04 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10158#issuecomment-168863404
  
@vectorijk thanks for the PR; it looks good to me. cc @jkbradley 





[GitHub] spark pull request: [SPARK-11815] [ML] [PySpark] PySpark DecisionT...

2016-01-04 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/9807#issuecomment-168862678
  
This looks good to me. cc @jkbradley 





[GitHub] spark pull request: [SPARK-11515][ML] QuantileDiscretizer should t...

2016-01-04 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/9535#issuecomment-168861235
  
@yu-iskw thanks for fixing this issue. It looks good to me, but can you 
please resolve the conflicts?





[GitHub] spark pull request: [SPARK-12450][MLLib] Un-persist broadcasted va...

2016-01-04 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10415#issuecomment-168859502
  
@rnowling thanks for fixing this; I just have some style comments.





[GitHub] spark pull request: [SPARK-12450][MLLib] Un-persist broadcasted va...

2016-01-04 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10415#discussion_r48802864
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -324,6 +326,7 @@ class KMeans private (
 s0
   }
 )
+  bcNewCenters.unpersist(blocking = false)
--- End diff --

small nit: add new lines around this line





[GitHub] spark pull request: [SPARK-12450][MLLib] Un-persist broadcasted va...

2016-01-04 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10415#discussion_r48802820
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -353,6 +356,7 @@ class KMeans private (
 ((r, KMeans.findClosest(bcCenters.value(r), p)._1), 1.0)
   }
 }.reduceByKey(_ + _).collectAsMap()
+bcCenters.unpersist(blocking = false)
--- End diff --

small nit: add new lines around this line





[GitHub] spark pull request: [SPARK-11608][MLLIB][DOC] Added migration guid...

2015-12-16 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10235#issuecomment-165206222
  
This looks good to me.





[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...

2015-12-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10297#discussion_r47727256
  
--- Diff: docs/css/main.css ---
@@ -101,6 +98,26 @@ a:hover code {
   max-width: 914px;
 }
 
+.content {
+  z-index: 1;
+  position: relative;
+  background-color: #FFF;
+  max-width: 914px;
+  line-height: 1.6; /* Inspired by Github's wiki style */
+  background-color: white;
+  padding-left: 15px;
+}
+
+.content-with-sidebar {
+  z-index: 1;
+  position: relative;
+  background-color: #FFF;
--- End diff --

Done





[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...

2015-12-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10297#discussion_r47727246
  
--- Diff: docs/css/main.css ---
@@ -101,6 +98,26 @@ a:hover code {
   max-width: 914px;
 }
 
+.content {
+  z-index: 1;
+  position: relative;
+  background-color: #FFF;
--- End diff --

oops thanks!





[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...

2015-12-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10297#discussion_r47721444
  
--- Diff: docs/css/main.css ---
@@ -171,24 +202,104 @@ a.anchorjs-link:hover { text-decoration: none; }
  * The left navigation bar.
  */
 .left-menu-wrapper {
-  position: absolute;
-  height: 100%;
-
-  width: 256px;
-  margin-top: -20px;
-  padding-top: 20px;
+  margin-left: 0px;
+  margin-right: 0px;
   background-color: #F0F8FC;
+  border-top-width: 0px;
+  border-left-width: 0px;
+  border-bottom-width: 0px;
+  margin-top: 0px;
+  width: 210px;
+  float: left;
+  position: absolute;
 }
 
 .left-menu {
-  position: fixed;
-  max-width: 350px;
-
-  padding-right: 10px;
-  width: 256px;
+  padding: 0px;
+  width: 199px;
 }
 
 .left-menu h3 {
   margin-left: 10px;
   line-height: 30px;
+}
+
+/**
+ * The collapsing button for the navigation bar.
+ */
+.nav-trigger {
+position: fixed;
+clip: rect(0, 0, 0, 0);
+}
+
+.nav-trigger + label:after {
+content: '»';
+}
+
+label {
+  z-index: 10;
+}
+
+label[for="nav-trigger"] {
+position: fixed;
+margin-left: 0px;
+padding-top: 100px;
+padding-left: 5px;
+width: 10px;
+height: 80%;
+cursor: pointer;
+background-size: contain;
+background-color: #D4F0FF;
+border-radius: 10px 0px 0px 10px;
--- End diff --

Removing the border.





[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...

2015-12-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10297#discussion_r47721419
  
--- Diff: docs/css/main.css ---
@@ -171,24 +202,104 @@ a.anchorjs-link:hover { text-decoration: none; }
  * The left navigation bar.
  */
 .left-menu-wrapper {
-  position: absolute;
-  height: 100%;
-
-  width: 256px;
-  margin-top: -20px;
-  padding-top: 20px;
+  margin-left: 0px;
+  margin-right: 0px;
   background-color: #F0F8FC;
+  border-top-width: 0px;
+  border-left-width: 0px;
+  border-bottom-width: 0px;
+  margin-top: 0px;
+  width: 210px;
+  float: left;
+  position: absolute;
 }
 
 .left-menu {
-  position: fixed;
-  max-width: 350px;
-
-  padding-right: 10px;
-  width: 256px;
+  padding: 0px;
+  width: 199px;
 }
 
 .left-menu h3 {
   margin-left: 10px;
   line-height: 30px;
+}
+
+/**
+ * The collapsing button for the navigation bar.
+ */
+.nav-trigger {
+position: fixed;
+clip: rect(0, 0, 0, 0);
+}
+
+.nav-trigger + label:after {
+content: '»';
+}
+
+label {
--- End diff --

Good question. I tried to do this and it failed to take the z-index into 
account despite my best efforts. This is the only way I found to make it work. 
If you think it is too wide, then I can try to add a CSS class to it.
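
For background, `z-index` generally only takes effect on positioned elements (or flex/grid items), which is one common reason it can appear to be ignored. A hypothetical illustration, with class names that are not part of this patch:

    /* Has no stacking effect on its own: z-index is ignored while the
       element keeps the default position: static. */
    .menu-label-static {
      z-index: 10;
    }

    /* Once the element is positioned, the same z-index takes effect. */
    .menu-label-positioned {
      position: fixed;   /* relative or absolute would also work */
      z-index: 10;
    }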





[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...

2015-12-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10297#discussion_r47721302
  
--- Diff: docs/css/main.css ---
@@ -39,13 +39,26 @@
   margin-left: 10px;
 }
 
+/*
 body .container-wrapper {
   position: absolute;
   width: 100%;
   display: flex;
 }
+*/
 
-body #content {
+body .container-wrapper {
+  background-color: #FFF;
+  color: #1D1F22;
+  max-width: 1024px;
+  margin-top: 10px;
+  margin-left: auto;
+  margin-right: auto;
+  border-radius: 15px;
+  position: relative;
+}
+
+body #contentREMOVE {
--- End diff --

Thanks!





[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...

2015-12-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10297#discussion_r47721217
  
--- Diff: docs/css/main.css ---
@@ -171,24 +202,104 @@ a.anchorjs-link:hover { text-decoration: none; }
  * The left navigation bar.
  */
 .left-menu-wrapper {
-  position: absolute;
-  height: 100%;
-
-  width: 256px;
-  margin-top: -20px;
-  padding-top: 20px;
+  margin-left: 0px;
+  margin-right: 0px;
   background-color: #F0F8FC;
+  border-top-width: 0px;
+  border-left-width: 0px;
+  border-bottom-width: 0px;
+  margin-top: 0px;
+  width: 210px;
+  float: left;
+  position: absolute;
 }
 
 .left-menu {
-  position: fixed;
-  max-width: 350px;
-
-  padding-right: 10px;
-  width: 256px;
+  padding: 0px;
+  width: 199px;
 }
 
 .left-menu h3 {
   margin-left: 10px;
   line-height: 30px;
+}
+
+/**
+ * The collapsing button for the navigation bar.
+ */
+.nav-trigger {
+position: fixed;
+clip: rect(0, 0, 0, 0);
+}
+
+.nav-trigger + label:after {
+content: '»';
+}
+
+label {
+  z-index: 10;
+}
+
+label[for="nav-trigger"] {
+position: fixed;
--- End diff --

Fixed the indentation to 2 spaces everywhere





[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...

2015-12-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10297#discussion_r47720994
  
--- Diff: docs/_layouts/global.html ---
@@ -128,19 +128,31 @@
 
 {% if page.url contains "/ml" %}
   {% include nav-left-wrapper-ml.html 
nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %}
+  
+  
+
--- End diff --

Fixed





[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...

2015-12-14 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10297#issuecomment-164556778
  
@jkbradley can you take a look at this fix?





[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...

2015-12-14 Thread thunterdb
GitHub user thunterdb opened a pull request:

https://github.com/apache/spark/pull/10297

[SPARK-12324][MLLIB][DOC] Fixes the sidebar in the ML documentation

This fixes the sidebar, using a pure CSS mechanism to hide it when the 
browser's viewport is too narrow.
Credit goes to the original author @Titan-C (mentioned in the NOTICE).

Note that I am not a CSS expert, so I can only address comments to a certain extent.
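
For readers unfamiliar with the pattern: the collapsing is driven by a hidden checkbox (`.nav-trigger`) and its label, combined with a media query. The following is a simplified sketch of that pattern rather than the exact rules in this patch; in particular the `~` sibling relationship and the 1000px breakpoint are assumptions:

    /* The checkbox is visually hidden; the user toggles it by clicking its label. */
    .nav-trigger {
      position: fixed;
      clip: rect(0, 0, 0, 0);
    }

    /* The label is the visible handle (rendered with a '»' marker). */
    label[for="nav-trigger"] {
      position: fixed;
      cursor: pointer;
    }

    /* Wide viewports: the menu is visible by default and collapses when the box
       is checked. (Assumes the menu wrapper is a later sibling of the checkbox.) */
    .nav-trigger:checked ~ .left-menu-wrapper {
      display: none;
    }

    /* Narrow viewports: the menu is hidden by default; checking the box shows it.
       The later rule wins over the one above at equal specificity. */
    @media (max-width: 1000px) {
      .left-menu-wrapper {
        display: none;
      }
      .nav-trigger:checked ~ .left-menu-wrapper {
        display: block;
      }
    }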

Default view:
https://cloud.githubusercontent.com/assets/7594753/11793597/6d1d6eda-a261-11e5-836b-6eb2054e9054.png

When collapsed manually by the user:
https://cloud.githubusercontent.com/assets/7594753/11793669/c991989e-a261-11e5-8bf6-aecf3bdb6319.png

Disappears when column is too narrow:
https://cloud.githubusercontent.com/assets/7594753/11793607/7754dbcc-a261-11e5-8b15-e0d074b0e47c.png

Can still be opened by the user if necessary:
https://cloud.githubusercontent.com/assets/7594753/11793612/7bf82968-a261-11e5-9cc3-e827a7a6b2b0.png


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thunterdb/spark 12324

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10297.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10297


commit dc8d41b94aa49c31849596ea986368c3bb3c29e3
Author: Timothy Hunter 
Date:   2015-12-14T19:39:54Z

menu

commit 94e82743956bce72f5ca2402be535fced36db24e
Author: Timothy Hunter 
Date:   2015-12-14T20:00:50Z

colors fixed

commit b091b42660ac510ba49ddf0274c779ae1c249b67
Author: Timothy Hunter 
Date:   2015-12-14T20:41:28Z

add the original author to the notice







[GitHub] spark pull request: [SPARK-12212][ML][DOC] Clarifies the differenc...

2015-12-10 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10234#issuecomment-163723975
  
@jkbradley done:
https://cloud.githubusercontent.com/assets/7594753/11725710/e9949f04-9f2f-11e5-8ba5-7f955e8b41fa.png






[GitHub] spark pull request: [SPARK-12212][ML][DOC] Clarifies the differenc...

2015-12-10 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10234#issuecomment-163713115
  
@jkbradley done with changes, let me know what you think.





[GitHub] spark pull request: [SPARK-12212][ML][DOC] Clarifies the differenc...

2015-12-10 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10234#discussion_r47262584
  
--- Diff: docs/ml-classification-regression.md ---
@@ -27,10 +27,10 @@ displayTitle: Classification and regression in spark.ml
 * This will become a table of contents (this text will be scraped).
 {:toc}
 
-In MLlib, we implement popular linear methods such as logistic
+In `spark.ml`, we implement popular linear methods such as logistic
 regression and linear least squares with $L_1$ or $L_2$ regularization.
 Refer to [the linear methods in mllib](mllib-linear-methods.html) for
-details.  In `spark.ml`, we also include Pipelines API for [Elastic
+details about implementation and tuning.  We also include a Dataframe API 
for [Elastic
--- End diff --

Corrected everywhere in the documentation





[GitHub] spark pull request: [SPARK-12212][ML][DOC] Clarifies the differenc...

2015-12-10 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10234#discussion_r47262597
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Feature Extraction and Transformation - MLlib
-displayTitle: MLlib - Feature Extraction 
and Transformation
+title: Feature Extraction and Transformation
--- End diff --

Done everywhere





[GitHub] spark pull request: [SPARK-12212][ML][DOC] Clarifies the differenc...

2015-12-09 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10234#issuecomment-163455399
  
@jkbradley I agree with you. Note that when you browse the docs, the current 
section appears in bold in the side menu (and its submenu gets expanded). This 
is why I did not include it again, but I do not have a strong opinion on this 
point:

https://cloud.githubusercontent.com/assets/7594753/11703974/64e32588-9e98-11e5-995a-247be7e51d77.png






[GitHub] spark pull request: [SPARK-12212][ML][DOC] Clarifies the differenc...

2015-12-09 Thread thunterdb
GitHub user thunterdb opened a pull request:

https://github.com/apache/spark/pull/10234

[SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, 
spark.mllib and mllib in the documentation.

Replaces a number of occurrences of `MLlib` in the documentation that were 
meant to refer to the `spark.mllib` package instead. It should clarify for new 
users the difference between `spark.mllib` (the package) and MLlib (the 
umbrella project for ML in Spark).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thunterdb/spark 12212

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10234.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10234


commit edddc09237e4f519fa54d902fea227f59b5b1aac
Author: Timothy Hunter 
Date:   2015-12-09T23:07:08Z

removed old files

commit 4b328e544dd3bc11ba2b88966023b50a91f919b3
Author: Timothy Hunter 
Date:   2015-12-09T23:07:45Z

removed old files

commit 62dc368d090402f5416961e012cd9e8160e0c585
Author: Timothy Hunter 
Date:   2015-12-10T00:17:16Z

last changes

commit 42bf337daa7145916d907bb8161d95c65bb20ae6
Author: Timothy Hunter 
Date:   2015-12-10T00:27:51Z

missing conversions







[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

2015-12-08 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10207#discussion_r47043656
  
--- Diff: docs/ml-classification-regression.md ---
@@ -0,0 +1,762 @@
+---
+layout: global
+title: Classification and regression - spark.ml
+displayTitle: Classification and regression in spark.ml
+---
+
+
+`\[
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\E}{\mathbb{E}}
+\newcommand{\x}{\mathbf{x}}
+\newcommand{\y}{\mathbf{y}}
+\newcommand{\wv}{\mathbf{w}}
+\newcommand{\av}{\mathbf{\alpha}}
+\newcommand{\bv}{\mathbf{b}}
+\newcommand{\N}{\mathbb{N}}
+\newcommand{\id}{\mathbf{I}}
+\newcommand{\ind}{\mathbf{1}}
+\newcommand{\0}{\mathbf{0}}
+\newcommand{\unit}{\mathbf{e}}
+\newcommand{\one}{\mathbf{1}}
+\newcommand{\zero}{\mathbf{0}}
+\]`
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+In MLlib, we implement popular linear methods such as logistic
+regression and linear least squares with $L_1$ or $L_2$ regularization.
+Refer to [the linear methods in mllib](mllib-linear-methods.html) for
+details.  In `spark.ml`, we also include Pipelines API for [Elastic
+net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
+of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
+and variable selection via the elastic
+net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
+Mathematically, it is defined as a convex combination of the $L_1$ and
+the $L_2$ regularization terms:
+`\[
+\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( 
\frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
+\]`
+By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
+regularization as special cases. For example, if a [linear
+regression](https://en.wikipedia.org/wiki/Linear_regression) model is
+trained with the elastic net parameter $\alpha$ set to $1$, it is
+equivalent to a
+[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
+On the other hand, if $\alpha$ is set to $0$, the trained model reduces
+to a [ridge
+regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
+We implement Pipelines API for both linear regression and logistic
+regression with elastic net regularization.
+
+
+# Classification
+
+## Logistic regression
+
+Logistic regression is a popular method to predict a binary response. It 
is a special case of [Generalized Linear 
models](https://en.wikipedia.org/wiki/Generalized_linear_model) that predicts 
the probability of the outcome.
+For more background and more details about the implementation, refer to 
the documentation of the [logistic regression in 
`spark.mllib`](mllib-linear-methods.html#logistic-regression). 
+
+  > The current implementation of logistic regression in `spark.ml` only 
supports binary classification. Support for multiclass classification will be added 
in the future.
+
+The following example shows how to train a logistic regression model
+with elastic net regularization. `elasticNetParam` corresponds to
+$\alpha$ and `regParam` corresponds to $\lambda$.
+
+
+
+
+{% include_example 
scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala
 %}
+
+
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java
 %}
+
+
+
+{% include_example python/ml/logistic_regression_with_elastic_net.py %}
+
+
+
+
+The `spark.ml` implementation of logistic regression also supports
+extracting a summary of the model over the training set. Note that the
+predictions and metrics which are stored as `Dataframe` in
+`BinaryLogisticRegressionSummary` are annotated `@transient` and hence
+only available on the driver.
+
+
+
+
+

+[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
+provides a summary for a

+[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel).
+Currently, only binary classification is supported and the
+summary must be explicitly cast to

+[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary).
+This will likely change when multiclass classification is supported.
+
+Continuing the earlier example:
+
+{% include_example 
scala/org/apache/spark/examples/ml/LogisticRegressionSummaryExample.scala %}
+
+
+

+[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classificat

[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

2015-12-08 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10207#discussion_r47018195
  
--- Diff: docs/ml-classification-regression.md ---
@@ -0,0 +1,733 @@
+---
+layout: global
+title: Classification and regression - spark.ml
+displayTitle: Classification and regression in spark.ml
+---
+
+
+`\[
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\E}{\mathbb{E}}
+\newcommand{\x}{\mathbf{x}}
+\newcommand{\y}{\mathbf{y}}
+\newcommand{\wv}{\mathbf{w}}
+\newcommand{\av}{\mathbf{\alpha}}
+\newcommand{\bv}{\mathbf{b}}
+\newcommand{\N}{\mathbb{N}}
+\newcommand{\id}{\mathbf{I}}
+\newcommand{\ind}{\mathbf{1}}
+\newcommand{\0}{\mathbf{0}}
+\newcommand{\unit}{\mathbf{e}}
+\newcommand{\one}{\mathbf{1}}
+\newcommand{\zero}{\mathbf{0}}
+\]`
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+In MLlib, we implement popular linear methods such as logistic
+regression and linear least squares with $L_1$ or $L_2$ regularization.
+Refer to [the linear methods in mllib](mllib-linear-methods.html) for
+details.  In `spark.ml`, we also include Pipelines API for [Elastic
+net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
+of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
+and variable selection via the elastic
+net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
+Mathematically, it is defined as a convex combination of the $L_1$ and
+the $L_2$ regularization terms:
+`\[
+\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( 
\frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
+\]`
+By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
+regularization as special cases. For example, if a [linear
+regression](https://en.wikipedia.org/wiki/Linear_regression) model is
+trained with the elastic net parameter $\alpha$ set to $1$, it is
+equivalent to a
+[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
+On the other hand, if $\alpha$ is set to $0$, the trained model reduces
+to a [ridge
+regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
+We implement Pipelines API for both linear regression and logistic
+regression with elastic net regularization.
+
+# Regression
+
+## Linear regression
+
+The interface for working with linear regression models and model
+summaries is similar to the logistic regression case. The following
+example demonstrates training an elastic net regularized linear
+regression model and extracting model summary statistics.
+
+
+
+
+{% include_example 
scala/org/apache/spark/examples/ml/LinearRegressionWithElasticNetExample.scala 
%}
+
+
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java
 %}
+
+
+
+
+{% include_example python/ml/linear_regression_with_elastic_net.py %}
+
+
+
+
+## Survival regression
+
+
+In `spark.ml`, we implement the [Accelerated failure time (AFT)](https://en.wikipedia.org/wiki/Accelerated_failure_time_model)
+model, which is a parametric survival regression model for censored data.
+It describes a model for the log of the survival time, so it is often called a
+log-linear model for survival analysis. Unlike the
+[Proportional hazards](https://en.wikipedia.org/wiki/Proportional_hazards_model) model,
+which is designed for the same purpose, the AFT model is easier to parallelize
+because each instance contributes to the objective function independently.
+
+Given the values of the covariates $x^{'}$, for random lifetime $t_{i}$ of 
+subjects i = 1, ..., n, with possible right-censoring, 
+the likelihood function under the AFT model is given as:
+`\[

+L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0}(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0}(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}}
+\]`
+Where $\delta_{i}$ indicates whether the event has occurred, i.e. whether the 
observation is uncensored or not.
+Using $\epsilon_{i}=\frac{\log{t_{i}}-x^{'}\beta}{\sigma}$, the 
log-likelihood function
+assumes the form:
+`\[

+\iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+\delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}]
+\]`
+Where $S_{0}(\epsilon_{i})$ is the baseline survivor function,
+and $f_{0}(\epsilon_{i})$ is the corresponding density function.
+
+The most commonly used AFT model is based on the Weibull distribution of 
the survival time. 
+The Weibull distribution for lifetime corresponding to extreme value 
distribu

[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

2015-12-08 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10207#discussion_r47018213
  
--- Diff: docs/_data/menu-ml.yaml ---
@@ -1,10 +1,10 @@
-- text: Feature extraction, transformation, and selection
+- text: "Overview: estimators, transformers and pipelines"
+  url: ml-intro.html
+- text: Building and transforming features
--- End diff --

Done





[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

2015-12-08 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10207#discussion_r47017809
  
--- Diff: docs/mllib-guide.md ---
@@ -66,15 +66,18 @@ We list major functionality from both below, with links 
to detailed guides.
 
 # spark.ml: high-level APIs for ML pipelines
 
-**[spark.ml programming guide](ml-guide.html)** provides an overview of 
the Pipelines API and major
-concepts. It also contains sections on using algorithms within the 
Pipelines API, for example:
-
-* [Feature extraction, transformation, and selection](ml-features.html)
+* [Overview: estimators, transformers and pipelines](ml-intro.html)
+* [Building and transforming features](ml-features.html)
+* [Classification and regression](ml-classification-regression.html)
 * [Clustering](ml-clustering.html)
-* [Decision trees for classification and regression](ml-decision-tree.html)
-* [Ensembles](ml-ensembles.html)
-* [Linear methods with elastic net regularization](ml-linear-methods.html)
-* [Multilayer perceptron classifier](ml-ann.html)
+* [Advanced topics](ml-advanced.html)
+
+Some techniques are not available yet in spark.ml, most notably:
+ - clustering
--- End diff --

Oh yes true, of course. I rephrased it.





[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...

2015-12-08 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/10207#issuecomment-163009350
  
@jkbradley no, this PR just moves the text around, with little modification. 
More substantial changes will be done later.





[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...

2015-12-08 Thread thunterdb
GitHub user thunterdb opened a pull request:

https://github.com/apache/spark/pull/10207

[SPARK-8517][MLLIB][DOC] Reorganizes the spark.ml user guide

This PR moves pieces of the spark.ml user guide to reflect suggestions in 
SPARK-8517. It does not introduce new content, as requested.

https://cloud.githubusercontent.com/assets/7594753/11666166/e82b84f2-9d9f-11e5-8904-e215424d8444.png


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thunterdb/spark spark-8517

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10207.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10207


commit af8abb9713f90478a1a0b81fdc83192192354c30
Author: Timothy Hunter 
Date:   2015-12-08T18:23:06Z

moved content

commit 451b7737f553fbc425ce2144fe0930b885874c7f
Author: Timothy Hunter 
Date:   2015-12-08T19:32:22Z

reordered doc







[GitHub] spark pull request: [SPARK-12000] Fixes compilation issues with ja...

2015-11-30 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10048#discussion_r46210267
  
--- Diff: project/SparkBuild.scala ---
@@ -155,13 +155,15 @@ object SparkBuild extends PomBuild {
   if (major.toInt >= 1 && minor.toInt >= 8) Seq("-Xdoclint:all", 
"-Xdoclint:-missing") else Seq.empty
 },
 
+// TODO(thunterdb) how to set these based on the current version of 
the JVM?
 javacJVMVersion := "1.7",
 scalacJVMVersion := "1.7",
 
 javacOptions in Compile ++= Seq(
-  "-encoding", "UTF-8",
-  "-source", javacJVMVersion.value,
-  "-target", javacJVMVersion.value
+  "-encoding", "UTF-8"
+  // TODO(thunterdb) make it conditional for <= "1.7"
--- End diff --

I am not sure how to get the version of Java that is currently running (not the 
target version).





[GitHub] spark pull request: [SPARK-12000] Fixes compilation issues with ja...

2015-11-30 Thread thunterdb
GitHub user thunterdb opened a pull request:

https://github.com/apache/spark/pull/10048

[SPARK-12000] Fixes compilation issues with java8 and sbt.

Currently, trying to publish Spark locally fails when the Java version is >= 
1.8.0:
 - a javadoc option has been removed
 - javadoc is stricter about its input and fails with some malformed 
comments

This PR fixes both issues.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thunterdb/spark 1511-java8

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10048.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10048


commit 97d12f1fa2b494d006d591317364633aef677d75
Author: Timothy Hunter 
Date:   2015-11-26T00:16:58Z

changes

commit 990053924e08153d5bc06e81a9857aeb4cc36a05
Author: Timothy Hunter 
Date:   2015-11-26T00:25:10Z

comments







[GitHub] spark pull request: [SPARK-11835] Adds a sidebar menu to MLlib's d...

2015-11-20 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/9826#issuecomment-158541369
  
@mengxr comment addressed





[GitHub] spark pull request: [SPARK-11835] Adds a sidebar menu to MLlib's d...

2015-11-19 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/9826#issuecomment-158229760
  
@mengxr based on our discussions, here is a slightly updated version:
 - the menu is now fixed
 - the menu disappears under the page (but it looks half covered in the 
middle). Fixing that would require JS, I think (and scikit-learn does not have a 
fixed menu):

![screen shot 2015-11-19 at 3 10 50 pm](https://cloud.githubusercontent.com/assets/7594753/11287386/efb9686c-8ecf-11e5-9829-038f22468ad9.png)







[GitHub] spark pull request: [SPARK-11835] Adds a sidebar menu to MLlib's d...

2015-11-19 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/9826#issuecomment-158220169
  
@andrewor14 this is a different issue: the SIPs show the table of contents 
within one document, which we already have with the `{:toc}` directive. This PR 
adds the per-project organization (linking the various markdown files together). 
This is not something we can infer automatically, because we show different 
levels of nesting for each of the pages. Each project already has this 
description in its overview page, and I am proposing we move it to the 
side bar:
 - http://spark.apache.org/docs/latest/streaming-programming-guide.html
 - http://spark.apache.org/docs/latest/sql-programming-guide.html
 - http://spark.apache.org/docs/latest/graphx-programming-guide.html
 - http://spark.apache.org/docs/latest/sparkr.html






[GitHub] spark pull request: [SPARK-11835] Adds a sidebar menu to MLlib's d...

2015-11-19 Thread thunterdb
Github user thunterdb commented on the pull request:

https://github.com/apache/spark/pull/9826#issuecomment-158169959
  
This is about as much as I can do with my very limited knowledge of CSS, so 
additional fixes can be done in a separate PR.




