[GitHub] spark pull request: [SPARK-13430][PySpark][ML] Python API for trai...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11621#discussion_r56587642 --- Diff: python/pyspark/ml/classification.py --- @@ -231,6 +232,210 @@ def intercept(self): """ return self._call_java("intercept") +@property +@since("2.0.0") +def summary(self): +""" +Gets summary (e.g. residuals, mse, r-squared ) of model on +training set. An exception is thrown if +`trainingSummary == None`. +""" +java_blrt_summary = self._call_java("summary") +return BinaryLogisticRegressionTrainingSummary(java_blrt_summary) + +@property +@since("2.0.0") +def hasSummary(self): +""" +Indicates whether a training summary exists for this model +instance. +""" +return self._call_java("hasSummary") + +""" +TODO: enable once Scala API is made public +def evaluate(self, df): +"" +Evaluates the model on a testset. +@param dataset Test dataset to evaluate model on. +"" +java_blr_summary = self._call_java("evaluate", df) +return BinaryLogisticRegressionSummary(java_blr_summary) +""" + + +class LogisticRegressionSummary(JavaCallable): +""" +Abstraction for Logistic Regression Results for a given model. + +.. versionadded:: 2.0.0 +""" + +@property +@since("2.0.0") +def predictions(self): +""" +Dataframe outputted by the model's `transform` method. --- End diff -- nit: technically, `outputted` is not in the dictionary. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13430][PySpark][ML] Python API for trai...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11621#discussion_r56587676 --- Diff: python/pyspark/ml/wrapper.py --- @@ -223,3 +223,20 @@ def _call_java(self, name, *args): sc = SparkContext._active_spark_context java_args = [_py2java(sc, arg) for arg in args] return _java2py(sc, m(*java_args)) + + +class JavaCallable(object): --- End diff -- this looks good to me, but @mengxr will have a more expert opinion on this class. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13430][PySpark][ML] Python API for trai...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11621#discussion_r56587656 --- Diff: python/pyspark/ml/regression.py --- @@ -151,6 +151,209 @@ def intercept(self): """ return self._call_java("intercept") +@property +@since("2.0.0") +def summary(self): +""" +Gets summary (e.g. residuals, mse, r-squared ) of model on +training set. An exception is thrown if +`trainingSummary == None`. +""" +java_lrt_summary = self._call_java("summary") +return LinearRegressionTrainingSummary(java_lrt_summary) + +@property +@since("2.0.0") +def hasSummary(self): +""" +Indicates whether a training summary exists for this model +instance. +""" +return self._call_java("hasSummary") + +""" --- End diff -- same thing here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR] Typo fixes
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11802#discussion_r56595822 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala --- @@ -102,8 +102,8 @@ class DStreamCheckpointData[T: ClassTag] (dstream: DStream[T]) } /** - * Restore the checkpoint data. This gets called once when the DStream graph - * (along with its DStreams) are being restored from a graph checkpoint file. + * Restore the checkpoint data. This gets called once while the DStream graph --- End diff -- I am not sure if this comment is accurate, someone from the streaming team should check it. cc @tdas --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR] Typo fixes
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11802#discussion_r56595611 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala --- @@ -45,7 +45,7 @@ class DStreamCheckpointData[T: ClassTag] (dstream: DStream[T]) /** * Updates the checkpoint data of the DStream. This gets called every time * the graph checkpoint is initiated. Default implementation records the - * checkpoint files to which the generate RDDs of the DStream has been saved. + * checkpoint files to which the generated RDDs of the DStream have been saved. --- End diff -- should be `records ... at which ...` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Changes to support KMeans with large feature s...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10739#issuecomment-195090771 @levin-royl it looks like @hhbyyh has a branch with some code that tackles the same issue (see the jira discussion for more information), you may want to coordinate there. Also, I suggest you take again a look at the pull request section of the guidelines https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13600] [MLlib] Use approxQuantile from ...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/11553#issuecomment-195054258 @oliverpierson I was thinking that `relativeError` should be automatically selected (and not exposed as a param). However, I am fine with exposing it for the sake of simplicity. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/11549#issuecomment-194528019 @hhbyyh thanks! I just have some small comments; my main comment being in the jira ticket regarding the choice of options 1/2/3. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11549#discussion_r55597240 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -569,9 +572,46 @@ class GeneralizedLinearRegressionModel private[ml] ( familyAndLink.fitted(eta) } + private var trainingSummary: Option[GeneralizedLinearRegressionSummary] = None + + private[regression] def setSummary(summary: GeneralizedLinearRegressionSummary): this.type = { +this.trainingSummary = Some(summary) +this + } + + /** +* Gets summary of model on training set. An exception is +* thrown if `trainingSummary == None`. +*/ + @Since("2.0.0") + def summary: GeneralizedLinearRegressionSummary = trainingSummary match { +case Some(summ) => summ +case None => + throw new SparkException( +"No training summary available for this GeneralizedLinearRegressionModel", +new NullPointerException()) + } + @Since("2.0.0") override def copy(extra: ParamMap): GeneralizedLinearRegressionModel = { copyValues(new GeneralizedLinearRegressionModel(uid, coefficients, intercept), extra) .setParent(parent) } } + +/** + * :: Experimental :: + * GeneralizedLinearRegressionModel results evaluated on a dataset. + * + * @param predictions dataframe outputted by the model's `transform` method. + * @param predictionCol field in "predictions" which gives the prediction of each instance. + * @param labelCol field in "predictions" which gives the true label of each instance. + * @param featuresCol field in "predictions" which gives the features of each instance as a vector. + */ +@Experimental +@Since("2.0.0") +class GeneralizedLinearRegressionSummary private[regression] ( +@Since("2.0.0") @transient val predictions: DataFrame, +@Since("2.0.0") val predictionCol: String, --- End diff -- Based on a private discussion with @jkbradley , we should not expose the name of the columns in the public API. Users should be referring to the original model to get access to the column. In `LinearRegressionSummary`, they are passed around because they are required for metrics. We do not need them here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11549#discussion_r55595546 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -569,9 +572,46 @@ class GeneralizedLinearRegressionModel private[ml] ( familyAndLink.fitted(eta) } + private var trainingSummary: Option[GeneralizedLinearRegressionSummary] = None + + private[regression] def setSummary(summary: GeneralizedLinearRegressionSummary): this.type = { +this.trainingSummary = Some(summary) +this + } + + /** +* Gets summary of model on training set. An exception is +* thrown if `trainingSummary == None`. +*/ + @Since("2.0.0") + def summary: GeneralizedLinearRegressionSummary = trainingSummary match { --- End diff -- nit: you can also do ```scala trainingSummary.getOrElse { throw new Exception(...) } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11549#discussion_r55592623 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala --- @@ -17,15 +17,41 @@ package org.apache.spark.ml.api.r +import org.apache.spark.SparkException import org.apache.spark.ml.{Pipeline, PipelineModel} import org.apache.spark.ml.attribute._ import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel} import org.apache.spark.ml.clustering.{KMeans, KMeansModel} import org.apache.spark.ml.feature.{RFormula, VectorAssembler} -import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel} +import org.apache.spark.ml.regression._ import org.apache.spark.sql.DataFrame private[r] object SparkRWrappers { + def fitGLM( + value: String, + df: DataFrame, + family: String, + lambda: Double, + solver: String): PipelineModel = { +if (solver.trim != "irls") throw new SparkException("Currently only support irls") + +val formula = new RFormula().setFormula(value) +val regex = "^\\s*(\\w+)\\s*(\\(\\s*link\\s*=\\s*\"(\\w+)\"\\s*\\))?\\s*$".r +val estimator = family match { + case regex(familyName, group2, linkName) => --- End diff -- I am confused: why do you need a regex here? I do not see anything special on the other side in R. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11549#discussion_r55591400 --- Diff: R/pkg/R/mllib.R --- @@ -51,13 +45,12 @@ setClass("PipelineModel", representation(model = "jobj")) #' summary(model) #'} setMethod("glm", signature(formula = "formula", family = "ANY", data = "DataFrame"), - function(formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0, -standardize = TRUE, solver = "auto") { + function(formula, family = c("gaussian", "binomial", "poisson", "gamma"), data, + lambda = 0, solver = "irls") { --- End diff -- I would keep the solver to 'auto', so that we can change the implementation of the solver without regression. However, the option 'irls' is available for users who want to use it. As for the other options for the solver, see my comment in the ticket. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11549#discussion_r55563547 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala --- @@ -17,15 +17,41 @@ package org.apache.spark.ml.api.r +import org.apache.spark.SparkException import org.apache.spark.ml.{Pipeline, PipelineModel} import org.apache.spark.ml.attribute._ import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel} import org.apache.spark.ml.clustering.{KMeans, KMeansModel} import org.apache.spark.ml.feature.{RFormula, VectorAssembler} -import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel} +import org.apache.spark.ml.regression._ import org.apache.spark.sql.DataFrame private[r] object SparkRWrappers { + def fitGLM( + value: String, + df: DataFrame, + family: String, + lambda: Double, + solver: String): PipelineModel = { +if (solver.trim != "irls") throw new SparkException("Currently only support irls") + +val formula = new RFormula().setFormula(value) +val regex = "^\\s*(\\w+)\\s*(\\(\\s*link\\s*=\\s*\"(\\w+)\"\\s*\\))?\\s*$".r --- End diff -- in order to minimize the escaping, you can use Scala's raw strings: ```scala """^\s*(\w+...\s*$""".r ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13600] [MLlib] Incorrect number of buck...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/11553#issuecomment-193889292 If you use 0.0 for the relative error, it is going to return the exact quantiles. However in this case, there will be no data compression and the algorithm will essentially run distributed merge-sort. The cost of doing this may be prohibitive. The current implementation is already approximate, so it makes sense to pick a relative error > 0. It looks like we do not offer any guarantee in the approximation already anyway. I suggest we automatically pick the relative error as follows: > target_error = min(0.1, 1.0 / (alpha * num_buckets)) for some value of alpha (1-10): the run time is proportional to the number of buckets, and the precision increases with the number of buckets required. If this is considered too complicated, then a fixed value such as 1e-2 or 1e-3 looks fine to me as well. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11535][ML] handling empty string in Str...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/9522#issuecomment-193542005 @pravingadakh sorry for the delay. Would you mind resolving the conflicts? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6761][SQL][ML] Fixes to API and documen...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/11332#issuecomment-187964822 LGTM, thanks @mengxr --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/6042#discussion_r53821901 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/stat/ApproxQuantileSuite.scala --- @@ -0,0 +1,129 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.stat + +import scala.util.Random + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.execution.stat.StatFunctions.QuantileSummaries + + +class ApproxQuantileSuite extends SparkFunSuite { + + private val r = new Random(1) + private val n = 100 + private val increasing = "increasing" -> (0 until n).map(_.toDouble) + private val decreasing = "decreasing" -> (n until 0 by -1).map(_.toDouble) + private val random = "random" -> Seq.fill(n)(math.ceil(r.nextDouble() * 1000)) + + private def buildSummary( + data: Seq[Double], + epsi: Double, + threshold: Int): QuantileSummaries = { +var summary = new QuantileSummaries(threshold, epsi) +data.foreach { x => + summary = summary.insert(x) +} +summary.compress() + } + + private def checkQuantile(quant: Double, data: Seq[Double], summary: QuantileSummaries): Unit = { +val approx = summary.query(quant) +// The rank of the approximation. +val rank = data.count(_ < approx) // has to be <, not <= to be exact +val lower = math.floor((quant - summary.epsilon) * data.size) +assert(rank >= lower, + s"approx_rank: $rank ! >= $lower, requested quantile = $quant") +val upper = math.ceil((quant + summary.epsilon) * data.size) +assert(rank <= upper, --- End diff -- fixed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/6042#discussion_r53821883 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala --- @@ -269,3 +291,41 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext { assert(0.until(1000).forall(i => filter4.mightContain(i * 3))) } } + + +class DataFrameStatPerfSuite extends QueryTest with SharedSQLContext with Logging { + + // Turn on this test if you want to test the performance of approximate quantiles. + ignore("describe() should not be slowed down too much by quantiles") { --- End diff -- changed, thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/6042#discussion_r53821857 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala --- @@ -27,6 +30,312 @@ import org.apache.spark.unsafe.types.UTF8String private[sql] object StatFunctions extends Logging { + import QuantileSummaries.Stats + + /** + * Calculates the approximate quantile for the given column. + * + * If you need to compute multiple quantiles at once, you should use [[multipleApproxQuantiles]] + * + * Note on the target error. + * + * The result of this algorithm has the following deterministic bound: + * if the DataFrame has N elements and if we request the quantile `phi` up to error `epsi`, + * then the algorithm will return a sample `x` from the DataFrame so that the *exact* rank + * of `x` close to (phi * N). More precisely: + * + * floor((phi - epsi) * N) <= rank(x) <= ceil((phi + epsi) * N) + * + * Note on the algorithm used. + * + * This method implements a variation of the Greenwald-Khanna algorithm + * (with some speed optimizations). The algorithm was first present in the following article: + * "Space-efficient Online Computation of Quantile Summaries" by Greenwald, Michael + * and Khanna, Sanjeev. (http://dl.acm.org/citation.cfm?id=375670) + * + * The performance optimizations are detailed in the comments of the implementation. + * + * @param df the dataframe to estimate quantiles on + * @param col the name of the column + * @param quantile the target quantile of interest + * @param epsilon the target error. Should be >= 0. + * */ + def approxQuantile( + df: DataFrame, + col: String, + quantile: Double, + epsilon: Double = QuantileSummaries.defaultEpsilon): Double = { +require(quantile >= 0.0 && quantile <= 1.0, "Quantile must be in the range of (0.0, 1.0).") +val Seq(Seq(res)) = multipleApproxQuantiles(df, Seq(col), Seq(quantile), epsilon) +res + } + + /** + * Runs multiple quantile computations in a single pass, with the same target error. + * + * See [[approxQuantile)]] for more details on the approximation guarantees. + * + * @param df the dataframe + * @param cols columns of the dataframe + * @param quantiles target quantiles to compute + * @param epsilon the precision to achieve + * @return for each column, returns the requested approximations + */ + def multipleApproxQuantiles( + df: DataFrame, + cols: Seq[String], + quantiles: Seq[Double], + epsilon: Double): Seq[Seq[Double]] = { +val columns: Seq[Column] = cols.map { colName => + val field = df.schema(colName) + require(field.dataType.isInstanceOf[NumericType], +s"Quantile calculation for column $colName with data type ${field.dataType}" + +" is not supported.") + Column(Cast(Column(colName).expr, DoubleType)) +} +val emptySummaries = Array.fill(cols.size)( + new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, epsilon)) + +// Note that it works more or less by accident as `rdd.aggregate` is not a pure function: +// this function returns the same array as given in the input (because `aggregate` reuses +// the same argument). +def apply(summaries: Array[QuantileSummaries], row: Row): Array[QuantileSummaries] = { + var i = 0 + while (i < summaries.length) { +summaries(i) = summaries(i).insert(row.getDouble(i)) +i += 1 + } + summaries +} + +def merge( +sum1: Array[QuantileSummaries], +sum2: Array[QuantileSummaries]): Array[QuantileSummaries] = { + sum1.zip(sum2).map { case (s1, s2) => s1.compress().merge(s2.compress()) } +} +val summaries = df.select(columns: _*).rdd.aggregate(emptySummaries)(apply, merge) + +summaries.map { summary => quantiles.map(summary.query) } + } + + /** + * Helper class to compute approximate quantile summary. + * This implementation is based on the algorithm proposed in the paper: + * "Space-efficient Online Computation of Quantile Summaries" by Greenwald, Michael + * and Khanna, Sanjeev. (http://dl.acm.org/citation.cfm?id=375670) + * + * In order to optimize for speed, it maintains an internal buffer of the last seen samples, + * and only inserts them after crossing a certain size threshold. This guarantees a near-constant + * runtime complexity
[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/6042#discussion_r53821725 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala --- @@ -27,6 +30,312 @@ import org.apache.spark.unsafe.types.UTF8String private[sql] object StatFunctions extends Logging { + import QuantileSummaries.Stats + + /** + * Calculates the approximate quantile for the given column. + * + * If you need to compute multiple quantiles at once, you should use [[multipleApproxQuantiles]] + * + * Note on the target error. + * + * The result of this algorithm has the following deterministic bound: + * if the DataFrame has N elements and if we request the quantile `phi` up to error `epsi`, + * then the algorithm will return a sample `x` from the DataFrame so that the *exact* rank + * of `x` close to (phi * N). More precisely: + * + * floor((phi - epsi) * N) <= rank(x) <= ceil((phi + epsi) * N) + * + * Note on the algorithm used. + * + * This method implements a variation of the Greenwald-Khanna algorithm + * (with some speed optimizations). The algorithm was first present in the following article: + * "Space-efficient Online Computation of Quantile Summaries" by Greenwald, Michael + * and Khanna, Sanjeev. (http://dl.acm.org/citation.cfm?id=375670) + * + * The performance optimizations are detailed in the comments of the implementation. + * + * @param df the dataframe to estimate quantiles on + * @param col the name of the column + * @param quantile the target quantile of interest + * @param epsilon the target error. Should be >= 0. + * */ + def approxQuantile( --- End diff -- Ok, removed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/6042#discussion_r53821685 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala --- @@ -37,6 +37,16 @@ import org.apache.spark.util.sketch.{BloomFilter, CountMinSketch} final class DataFrameStatFunctions private[sql](df: DataFrame) { /** + * Calculate the approximate quantile of numerical column of a DataFrame. + * @param col the name of the column + * @param quantile the quantile number + * @return the approximate quantile + */ + def approxQuantile(col: String, quantile: Double, epsilon: Double): Double = { --- End diff -- Good point, changing this to an Array return --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/6042#discussion_r53821749 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala --- @@ -27,6 +30,312 @@ import org.apache.spark.unsafe.types.UTF8String private[sql] object StatFunctions extends Logging { + import QuantileSummaries.Stats + + /** + * Calculates the approximate quantile for the given column. + * + * If you need to compute multiple quantiles at once, you should use [[multipleApproxQuantiles]] + * + * Note on the target error. + * + * The result of this algorithm has the following deterministic bound: + * if the DataFrame has N elements and if we request the quantile `phi` up to error `epsi`, + * then the algorithm will return a sample `x` from the DataFrame so that the *exact* rank + * of `x` close to (phi * N). More precisely: + * + * floor((phi - epsi) * N) <= rank(x) <= ceil((phi + epsi) * N) + * + * Note on the algorithm used. + * + * This method implements a variation of the Greenwald-Khanna algorithm + * (with some speed optimizations). The algorithm was first present in the following article: + * "Space-efficient Online Computation of Quantile Summaries" by Greenwald, Michael + * and Khanna, Sanjeev. (http://dl.acm.org/citation.cfm?id=375670) + * + * The performance optimizations are detailed in the comments of the implementation. + * + * @param df the dataframe to estimate quantiles on + * @param col the name of the column + * @param quantile the target quantile of interest + * @param epsilon the target error. Should be >= 0. + * */ + def approxQuantile( + df: DataFrame, + col: String, + quantile: Double, + epsilon: Double = QuantileSummaries.defaultEpsilon): Double = { +require(quantile >= 0.0 && quantile <= 1.0, "Quantile must be in the range of (0.0, 1.0).") +val Seq(Seq(res)) = multipleApproxQuantiles(df, Seq(col), Seq(quantile), epsilon) +res + } + + /** + * Runs multiple quantile computations in a single pass, with the same target error. + * + * See [[approxQuantile)]] for more details on the approximation guarantees. + * + * @param df the dataframe + * @param cols columns of the dataframe + * @param quantiles target quantiles to compute + * @param epsilon the precision to achieve + * @return for each column, returns the requested approximations + */ + def multipleApproxQuantiles( + df: DataFrame, + cols: Seq[String], + quantiles: Seq[Double], + epsilon: Double): Seq[Seq[Double]] = { +val columns: Seq[Column] = cols.map { colName => + val field = df.schema(colName) + require(field.dataType.isInstanceOf[NumericType], +s"Quantile calculation for column $colName with data type ${field.dataType}" + +" is not supported.") + Column(Cast(Column(colName).expr, DoubleType)) +} +val emptySummaries = Array.fill(cols.size)( + new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, epsilon)) + +// Note that it works more or less by accident as `rdd.aggregate` is not a pure function: +// this function returns the same array as given in the input (because `aggregate` reuses +// the same argument). +def apply(summaries: Array[QuantileSummaries], row: Row): Array[QuantileSummaries] = { + var i = 0 + while (i < summaries.length) { +summaries(i) = summaries(i).insert(row.getDouble(i)) +i += 1 + } + summaries +} + +def merge( +sum1: Array[QuantileSummaries], +sum2: Array[QuantileSummaries]): Array[QuantileSummaries] = { + sum1.zip(sum2).map { case (s1, s2) => s1.compress().merge(s2.compress()) } +} +val summaries = df.select(columns: _*).rdd.aggregate(emptySummaries)(apply, merge) + +summaries.map { summary => quantiles.map(summary.query) } + } + + /** + * Helper class to compute approximate quantile summary. + * This implementation is based on the algorithm proposed in the paper: + * "Space-efficient Online Computation of Quantile Summaries" by Greenwald, Michael + * and Khanna, Sanjeev. (http://dl.acm.org/citation.cfm?id=375670) + * + * In order to optimize for speed, it maintains an internal buffer of the last seen samples, + * and only inserts them after crossing a certain size threshold. This guarantees a near-constant + * runtime complexity
[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/6042#discussion_r53821735 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala --- @@ -27,6 +30,312 @@ import org.apache.spark.unsafe.types.UTF8String private[sql] object StatFunctions extends Logging { + import QuantileSummaries.Stats + + /** + * Calculates the approximate quantile for the given column. + * + * If you need to compute multiple quantiles at once, you should use [[multipleApproxQuantiles]] + * + * Note on the target error. + * + * The result of this algorithm has the following deterministic bound: + * if the DataFrame has N elements and if we request the quantile `phi` up to error `epsi`, + * then the algorithm will return a sample `x` from the DataFrame so that the *exact* rank + * of `x` close to (phi * N). More precisely: + * + * floor((phi - epsi) * N) <= rank(x) <= ceil((phi + epsi) * N) + * + * Note on the algorithm used. + * + * This method implements a variation of the Greenwald-Khanna algorithm + * (with some speed optimizations). The algorithm was first present in the following article: + * "Space-efficient Online Computation of Quantile Summaries" by Greenwald, Michael + * and Khanna, Sanjeev. (http://dl.acm.org/citation.cfm?id=375670) + * + * The performance optimizations are detailed in the comments of the implementation. + * + * @param df the dataframe to estimate quantiles on + * @param col the name of the column + * @param quantile the target quantile of interest + * @param epsilon the target error. Should be >= 0. + * */ + def approxQuantile( + df: DataFrame, + col: String, + quantile: Double, + epsilon: Double = QuantileSummaries.defaultEpsilon): Double = { +require(quantile >= 0.0 && quantile <= 1.0, "Quantile must be in the range of (0.0, 1.0).") +val Seq(Seq(res)) = multipleApproxQuantiles(df, Seq(col), Seq(quantile), epsilon) +res + } + + /** + * Runs multiple quantile computations in a single pass, with the same target error. + * + * See [[approxQuantile)]] for more details on the approximation guarantees. + * + * @param df the dataframe + * @param cols columns of the dataframe + * @param quantiles target quantiles to compute + * @param epsilon the precision to achieve + * @return for each column, returns the requested approximations + */ + def multipleApproxQuantiles( + df: DataFrame, + cols: Seq[String], + quantiles: Seq[Double], + epsilon: Double): Seq[Seq[Double]] = { +val columns: Seq[Column] = cols.map { colName => + val field = df.schema(colName) + require(field.dataType.isInstanceOf[NumericType], +s"Quantile calculation for column $colName with data type ${field.dataType}" + +" is not supported.") + Column(Cast(Column(colName).expr, DoubleType)) +} +val emptySummaries = Array.fill(cols.size)( + new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, epsilon)) + +// Note that it works more or less by accident as `rdd.aggregate` is not a pure function: +// this function returns the same array as given in the input (because `aggregate` reuses +// the same argument). +def apply(summaries: Array[QuantileSummaries], row: Row): Array[QuantileSummaries] = { + var i = 0 + while (i < summaries.length) { +summaries(i) = summaries(i).insert(row.getDouble(i)) +i += 1 + } + summaries +} + +def merge( +sum1: Array[QuantileSummaries], +sum2: Array[QuantileSummaries]): Array[QuantileSummaries] = { + sum1.zip(sum2).map { case (s1, s2) => s1.compress().merge(s2.compress()) } +} +val summaries = df.select(columns: _*).rdd.aggregate(emptySummaries)(apply, merge) + +summaries.map { summary => quantiles.map(summary.query) } + } + + /** + * Helper class to compute approximate quantile summary. + * This implementation is based on the algorithm proposed in the paper: + * "Space-efficient Online Computation of Quantile Summaries" by Greenwald, Michael + * and Khanna, Sanjeev. (http://dl.acm.org/citation.cfm?id=375670) + * + * In order to optimize for speed, it maintains an internal buffer of the last seen samples, + * and only inserts them after crossing a certain size threshold. This guarantees a near-constant + * runtime complexity
[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/6042#discussion_r53821700 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala --- @@ -27,6 +30,312 @@ import org.apache.spark.unsafe.types.UTF8String private[sql] object StatFunctions extends Logging { + import QuantileSummaries.Stats + + /** + * Calculates the approximate quantile for the given column. + * + * If you need to compute multiple quantiles at once, you should use [[multipleApproxQuantiles]] + * + * Note on the target error. + * + * The result of this algorithm has the following deterministic bound: + * if the DataFrame has N elements and if we request the quantile `phi` up to error `epsi`, + * then the algorithm will return a sample `x` from the DataFrame so that the *exact* rank + * of `x` close to (phi * N). More precisely: + * + * floor((phi - epsi) * N) <= rank(x) <= ceil((phi + epsi) * N) + * + * Note on the algorithm used. + * + * This method implements a variation of the Greenwald-Khanna algorithm + * (with some speed optimizations). The algorithm was first present in the following article: + * "Space-efficient Online Computation of Quantile Summaries" by Greenwald, Michael + * and Khanna, Sanjeev. (http://dl.acm.org/citation.cfm?id=375670) --- End diff -- Done in both places --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/6042#discussion_r53821676 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala --- @@ -37,6 +37,16 @@ import org.apache.spark.util.sketch.{BloomFilter, CountMinSketch} final class DataFrameStatFunctions private[sql](df: DataFrame) { /** + * Calculate the approximate quantile of numerical column of a DataFrame. + * @param col the name of the column + * @param quantile the quantile number + * @return the approximate quantile --- End diff -- Done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6761][SQL][ML] Fixes to API and documen...
GitHub user thunterdb opened a pull request: https://github.com/apache/spark/pull/11325 [SPARK-6761][SQL][ML] Fixes to API and documentation of approximate quantiles ## What changes were proposed in this pull request? This PR addresses the remaining comments from @mengxr in #6042 - fixes to the API - more tests - cleanups and more documentation in the implementation ## How was the this patch tested? This test passes the comprehensive test suite of the approximate quantiles, and adds more tests for the public API. You can merge this pull request into a Git repository by running: $ git pull https://github.com/thunterdb/spark spark-6761c Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11325.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11325 commit a77b7e678ae266210bfaf8ca9b88fc3d2dd54d0d Author: Timothy Hunter Date: 2016-02-23T17:32:18Z SPARK-6761 Additional cleanups to address comments, there is an error somewhere commit dbf87f1da765eed21880cda71a4b5ec62bee3e4d Author: Timothy Hunter Date: 2016-02-23T17:50:13Z fixed test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/6042#issuecomment-187774708 @mengxr thanks for the review, will do in another PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6761][SQL] Approximate quantile for Dat...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/6042#issuecomment-182958229 @viirya sorry I missed your email, I will look at your PR today. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13146][SQL] Management API for continuo...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11030#discussion_r52202089 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ContinuousQuery.scala --- @@ -17,11 +17,47 @@ package org.apache.spark.sql +import org.apache.spark.annotation.Experimental + /** + * :: Experimental :: * A handle to a query that is executing continuously in the background as new data arrives. + * @since 2.0.0 */ +@Experimental trait ContinuousQuery { + /** Returns the name of the query */ + def name: String + + def sqlContext: SQLContext + + /** Whether the query is currently active or not */ + def isActive: Boolean + + /** Returns the [[ContinuousQueryException]] if the query was terminated by an exception. */ + def exception: Option[ContinuousQueryException] + + /** Returns current status of all the sources. */ + def sourceStatuses: Array[SourceStatus] + + /** Returns current status of the sink. */ + def sinkStatus: SinkStatus + + /** + * Waits for the termination of this query, either by `stop` or by any exception. + * @throws ContinuousQueryException, if the query terminated by an exception. + */ + def awaitTermination(): Unit + + /** + * Waits for the termination of this query, either by `stop` or by any exception. + * Returns whether the query has terminated or not. + * @throws ContinuousQueryException, if the query terminated by an exception before + * `timeoutMs` milliseconds + */ + def awaitTermination(timeoutMs: Long): Boolean + /** * Stops the execution of this query if it is running. This method blocks until the threads --- End diff -- Also, can you document if the methods above are thread-safe? This is important for complex client applications. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13146][SQL] Management API for continuo...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/11030#discussion_r52201810 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ContinuousQuery.scala --- @@ -17,11 +17,47 @@ package org.apache.spark.sql +import org.apache.spark.annotation.Experimental + /** + * :: Experimental :: * A handle to a query that is executing continuously in the background as new data arrives. + * @since 2.0.0 */ +@Experimental trait ContinuousQuery { + /** Returns the name of the query */ + def name: String + + def sqlContext: SQLContext + + /** Whether the query is currently active or not */ + def isActive: Boolean + + /** Returns the [[ContinuousQueryException]] if the query was terminated by an exception. */ + def exception: Option[ContinuousQueryException] + + /** Returns current status of all the sources. */ + def sourceStatuses: Array[SourceStatus] + + /** Returns current status of the sink. */ + def sinkStatus: SinkStatus + + /** + * Waits for the termination of this query, either by `stop` or by any exception. + * @throws ContinuousQueryException, if the query terminated by an exception. + */ + def awaitTermination(): Unit + + /** + * Waits for the termination of this query, either by `stop` or by any exception. + * Returns whether the query has terminated or not. + * @throws ContinuousQueryException, if the query terminated by an exception before + * `timeoutMs` milliseconds + */ + def awaitTermination(timeoutMs: Long): Boolean + /** * Stops the execution of this query if it is running. This method blocks until the threads --- End diff -- there is a typo in the doc. Also, could you document the kind of exceptions that are expected to be thrown by `stop()`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11515][ML] QuantileDiscretizer should t...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/9535#discussion_r50163562 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala --- @@ -23,8 +23,8 @@ import org.apache.spark.Logging import org.apache.spark.annotation.{Experimental, Since} import org.apache.spark.ml._ import org.apache.spark.ml.attribute.NominalAttribute +import org.apache.spark.ml.param.shared.{HasSeed, HasInputCol, HasOutputCol} --- End diff -- @yu-iskw you need to respect the lexicographic order for the imports --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12765] [ML] [CountVectorizer]fix CountV...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10720#issuecomment-171778898 @sloth2012 this looks good to me, thanks for the fix. cc @jkbradley --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12026] [MLlib] ChiSqTest gets slower an...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10146#issuecomment-170949938 @hhbyyh yes, option 3 sounds good. A caveat, though, about the numbers you posted: micro benchmarks on the JVM are very hard to get right, and a simple loop is not considered a good practice in general. I recommend you use a framework like JMH. There are some scala wrappers for it. http://openjdk.java.net/projects/code-tools/jmh/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11938][ML] Expose numFeatures in all ML...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/9936#discussion_r49143733 --- Diff: python/pyspark/ml/tests.py --- @@ -371,6 +378,103 @@ def test_fit_maximize_metric(self): self.assertEqual(1.0, bestModelMetric, "Best model has R-squared of 1") +class RegressorTest(PySparkTestCase): + +def setupData(self): +try: +self.df +except AttributeError: +from pyspark.mllib.linalg import Vectors +sqlContext = SQLContext(self.sc) +self.df = sqlContext.createDataFrame([ +(1.0, Vectors.dense(1.0)), +(0.0, Vectors.sparse(1, [], []))], ["label", "features"]) + +def test_linear_regression(self): +self.setupData() +lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal") +model = lr.fit(self.df) +self.assertEquals(1, model.numFeatures) + +def test_decision_tree_regressor(self): +self.setupData() +dt = DecisionTreeRegressor(maxDepth=2) +model = dt.fit(self.df) +self.assertEquals(1, model.numFeatures) + +def test_random_forest_regressor(self): +self.setupData() +rf = RandomForestRegressor(numTrees=2, maxDepth=2, seed=42) +model = rf.fit(self.df) +self.assertEquals(1, model.numFeatures) + +def test_gbt_regressor(self): +self.setupData() +gbt = GBTRegressor(maxIter=5, maxDepth=2) +model = gbt.fit(self.df) +self.assertEquals(1, model.numFeatures) + + +class ClassificationTest(PySparkTestCase): + +def setupData(self): +try: +self.df +except AttributeError: +from pyspark.mllib.linalg import Vectors +sqlContext = SQLContext(self.sc) +self.df = sqlContext.createDataFrame([ +(1.0, Vectors.dense(1.0, 0.0)), +(0.0, Vectors.sparse(2, [1], [1.0]))], ["label", "features"]) + +def test_logistic_regression(self): +self.setupData() +lr = LogisticRegression(maxIter=5, regParam=0.01) +model = lr.fit(self.df) +self.assertEqual(2, model.numFeatures) + +def test_decision_tree_classifier(self): +from pyspark.ml.feature import StringIndexer --- End diff -- same thing here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11938][ML] Expose numFeatures in all ML...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/9936#discussion_r49143685 --- Diff: python/pyspark/ml/tests.py --- @@ -371,6 +378,103 @@ def test_fit_maximize_metric(self): self.assertEqual(1.0, bestModelMetric, "Best model has R-squared of 1") +class RegressorTest(PySparkTestCase): + +def setupData(self): +try: +self.df +except AttributeError: +from pyspark.mllib.linalg import Vectors +sqlContext = SQLContext(self.sc) +self.df = sqlContext.createDataFrame([ +(1.0, Vectors.dense(1.0)), +(0.0, Vectors.sparse(1, [], []))], ["label", "features"]) + +def test_linear_regression(self): +self.setupData() +lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal") +model = lr.fit(self.df) +self.assertEquals(1, model.numFeatures) + +def test_decision_tree_regressor(self): +self.setupData() +dt = DecisionTreeRegressor(maxDepth=2) +model = dt.fit(self.df) +self.assertEquals(1, model.numFeatures) + +def test_random_forest_regressor(self): +self.setupData() +rf = RandomForestRegressor(numTrees=2, maxDepth=2, seed=42) +model = rf.fit(self.df) +self.assertEquals(1, model.numFeatures) + +def test_gbt_regressor(self): +self.setupData() +gbt = GBTRegressor(maxIter=5, maxDepth=2) +model = gbt.fit(self.df) +self.assertEquals(1, model.numFeatures) + + +class ClassificationTest(PySparkTestCase): + +def setupData(self): +try: +self.df +except AttributeError: +from pyspark.mllib.linalg import Vectors --- End diff -- is there any reason for putting the import in the code? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11923][ML] Python API for ml.feature.Ch...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10186#issuecomment-169840549 LGTM cc @jkbradley --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12632][Python][Make Parameter Descripti...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10602#discussion_r49140273 --- Diff: python/pyspark/mllib/fpm.py --- @@ -130,15 +133,21 @@ def train(cls, data, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=320 """ Finds the complete set of frequent sequential patterns in the input sequences of itemsets. -:param data: The input data set, each element contains a sequnce of itemsets. -:param minSupport: the minimal support level of the sequential pattern, any pattern appears -more than (minSupport * size-of-the-dataset) times will be output (default: `0.1`) -:param maxPatternLength: the maximal length of the sequential pattern, any pattern appears -less than maxPatternLength will be output. (default: `10`) -:param maxLocalProjDBSize: The maximum number of items (including delimiters used in -the internal storage format) allowed in a projected database before local -processing. If a projected database exceeds this size, another -iteration of distributed prefix growth is run. (default: `3200`) +:param data: + The input data set, each element contains a sequnce of itemsets. +:param minSupport: + The minimal support level of the sequential pattern, any pattern appears more than --- End diff -- the lines below have indentation issues --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12632][Python][Make Parameter Descripti...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10602#discussion_r49140295 --- Diff: python/pyspark/mllib/recommendation.py --- @@ -239,6 +239,17 @@ def train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, nonnegative product of two lower-rank matrices of a given rank (number of features). To solve for these features, we run a given number of iterations of ALS. This is done using a level of parallelism given by `blocks`. + + :param iterations: --- End diff -- indentation issues? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49140159 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,116 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0])) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model. + +:param x: Either the point to determine the cluster for or an RDD of points to determine +the clusters for. +""" +if isinstance(x, RDD): +return x.map(self.predict(x)) --- End diff -- Also, maybe you can add a test for this case in the docstring. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49140117 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,116 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0])) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model. + +:param x: Either the point to determine the cluster for or an RDD of points to determine +the clusters for. +""" +if isinstance(x, RDD): +return x.map(self.predict(x)) --- End diff -- I am not sure I understand this line, shouldn't it be `x.map(self.predict)`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10570#discussion_r49139498 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala --- @@ -57,8 +57,8 @@ trait ExpressionEvalHelper extends GeneratorDrivenPropertyChecks { (result, expected) match { case (result: Array[Byte], expected: Array[Byte]) => java.util.Arrays.equals(result, expected) - case (result: Double, expected: Spread[Double]) => -expected.isWithin(result) + case (result: Double, expected: Spread[_]) => // Can't use Spread[Double] b/c of erasure --- End diff -- I see. Sadly, I think this is not going to work here without extra work, and then it is not going to do what you want. This version of scalatest uses manifest to encode type information, and you would have to define it manually in this context: ```scala implicit val x: Manifest[Int] = ??? stream shouldBe a [ReceiverInputDStream[Int @unchecked]] ``` but then the scalatest library is not aware of the `unchecked` annotation, and still throws a warning. Let's just have `_` in the suite file. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12634][Python][MLlib][DOC] Update param...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10601#issuecomment-169830444 @vijaykiran thanks for the style fixes. cc @jkbradley --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10570#issuecomment-169828910 @srowen I am confused, did you send me this message for this PR? It somehow does not show up in github, and I do not see the jenkins run that failed: ``` @thunterdb Hm, I get this error now: [error] /home/jenkins/workspace/SparkPullRequestBuilder/extras/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala:140: No Manifest available for org.apache.spark.streaming.kinesis.KinesisBackedBlockRDD[Array[Byte] @unchecked]. [error] nonEmptyRDD shouldBe a [KinesisBackedBlockRDD[Array[Byte] @unchecked]] [error] Did I misunderstand how to apply it or is it not going to work here? ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12631] [PYSPARK] [DOC] PySpark clusteri...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10610#issuecomment-169506325 @BryanCutler it looks great, thanks! One overall comment about the seeds: it is unclear what `None` means, and it has a different behavior between `spark.ml` and `spark.mllib`. I suggest you mention that if unspecified, the seed is initialized to to the current machine time. (It is a fixed value in `spark.ml`) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12631] [PYSPARK] [DOC] PySpark clusteri...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10610#discussion_r49028171 --- Diff: python/pyspark/mllib/clustering.py --- @@ -774,17 +843,32 @@ def train(cls, rdd, k=10, maxIterations=20, docConcentration=-1.0, topicConcentration=-1.0, seed=None, checkpointInterval=10, optimizer="em"): """Train a LDA model. -:param rdd: RDD of data points -:param k: Number of clusters you want -:param maxIterations: Number of iterations. Default to 20 -:param docConcentration:Concentration parameter (commonly named "alpha") -for the prior placed on documents' distributions over topics ("theta"). -:param topicConcentration: Concentration parameter (commonly named "beta" or "eta") -for the prior placed on topics' distributions over terms. -:param seed:Random Seed -:param checkpointInterval: Period (in iterations) between checkpoints. -:param optimizer: LDAOptimizer used to perform the actual calculation. -Currently "em", "online" are supported. Default to "em". +:param rdd: + Train with a RDD of data points. +:param k: + Number of topics to infer. I.e., the number of soft cluster centers. --- End diff -- nit. According to Strunk and White, you cannot start a sentence with I.e.: `to infer, i.e., the...` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9716] [ML] BinaryClassificationEvaluato...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10472#issuecomment-169483754 cc @jkbradley --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12663] [MLlib] More informative error m...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10611#issuecomment-169483568 cc @jkbradley --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10570#issuecomment-169449477 cc @jkbradley @srowen I have some concerns about the number of files being touched by this PR, it may be hard to merge without an ever-present conflict somewhere. Maybe splitting it will go faster? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11520][ML] RegressionMetrics should sup...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/9907#issuecomment-169444237 @Lewuathe thanks for your patch. I think it will require more work in `RegressionMetrics` to fully implement weighted metrics. We need to do the following changes: - expose `weightSum` in `MultivariateStatisticalSummary` (as a developer API) - the computations of `SSreg` and `SStot` should take the weights into account - in `RegressionMetrics`, all references to `summary.count` should be replaced by `summary.weightSum` Given that the default weights are 0, it should give the same result. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11520][ML] RegressionMetrics should sup...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/9907#discussion_r49002633 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/RegressionMetricsSuite.scala --- @@ -109,4 +109,55 @@ class RegressionMetricsSuite extends SparkFunSuite with MLlibTestSparkContext { "root mean squared error mismatch") assert(metrics.r2 ~== 1.0 absTol 1E-5, "r2 score mismatch") } + + test("regression metrics with same(1.0) weight samples") { +val predictionAndObservationWithWeight = sc.parallelize( + Seq((2.25, 3.0, 1.0), (-0.25, -0.5, 1.0), (1.75, 2.0, 1.0), (7.75, 7.0, 1.0)), 2) +val metrics = new RegressionMetrics(predictionAndObservationWithWeight) +assert(metrics.explainedVariance ~== 8.79687 absTol 1E-5, + "explained variance regression score mismatch") +assert(metrics.meanAbsoluteError ~== 0.5 absTol 1E-5, "mean absolute error mismatch") +assert(metrics.meanSquaredError ~== 0.3125 absTol 1E-5, "mean squared error mismatch") +assert(metrics.rootMeanSquaredError ~== 0.55901 absTol 1E-5, + "root mean squared error mismatch") +assert(metrics.r2 ~== 0.95717 absTol 1E-5, "r2 score mismatch") + } + + + /** +* The following values are hand calculated using the formula: +* [[https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights]] +* preds = c(2.25, -0.25, 1.75, 7.75) +* obs = c(3.0, -0.5, 2.0, 7.0) +* weights = c(0.1, 0.2, 0.15, 0.05) +* count = 4 +* +* Weighted metrics can be calculated with MultivariateStatisticalSummary. +* (observations, observations - predictions) +* mean(1.7, 0.05) +* variance(7.3, 0.3) +* numNonZeros (0.5, 0.5) +* max (7.0, 0.75) +* min (-0.5, -0.75) +* normL2 (2.0, 0.32596) +* normL1 (1.05, 0.2) +* +* explainedVariance: sum((preds - 1.7)^2) / count = 10.1775 +* meanAbsoluteError: normL1(1) / count = 0.05 +* meanSquaredError: normL2(1)^2 / count = 0.02656 +* rootMeanSquaredError: sqrt(meanSquaredError) = 0.16298 +* r2: 1 - normL2(1)^2 / (variance(0) * (count - 1)) = 0.9951484 --- End diff -- Same thing for this formula. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11520][ML] RegressionMetrics should sup...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/9907#discussion_r49002581 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/RegressionMetricsSuite.scala --- @@ -109,4 +109,55 @@ class RegressionMetricsSuite extends SparkFunSuite with MLlibTestSparkContext { "root mean squared error mismatch") assert(metrics.r2 ~== 1.0 absTol 1E-5, "r2 score mismatch") } + + test("regression metrics with same(1.0) weight samples") { +val predictionAndObservationWithWeight = sc.parallelize( + Seq((2.25, 3.0, 1.0), (-0.25, -0.5, 1.0), (1.75, 2.0, 1.0), (7.75, 7.0, 1.0)), 2) +val metrics = new RegressionMetrics(predictionAndObservationWithWeight) +assert(metrics.explainedVariance ~== 8.79687 absTol 1E-5, + "explained variance regression score mismatch") +assert(metrics.meanAbsoluteError ~== 0.5 absTol 1E-5, "mean absolute error mismatch") +assert(metrics.meanSquaredError ~== 0.3125 absTol 1E-5, "mean squared error mismatch") +assert(metrics.rootMeanSquaredError ~== 0.55901 absTol 1E-5, + "root mean squared error mismatch") +assert(metrics.r2 ~== 0.95717 absTol 1E-5, "r2 score mismatch") + } + + + /** +* The following values are hand calculated using the formula: +* [[https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights]] +* preds = c(2.25, -0.25, 1.75, 7.75) +* obs = c(3.0, -0.5, 2.0, 7.0) +* weights = c(0.1, 0.2, 0.15, 0.05) +* count = 4 +* +* Weighted metrics can be calculated with MultivariateStatisticalSummary. +* (observations, observations - predictions) +* mean(1.7, 0.05) +* variance(7.3, 0.3) +* numNonZeros (0.5, 0.5) +* max (7.0, 0.75) +* min (-0.5, -0.75) +* normL2 (2.0, 0.32596) +* normL1 (1.05, 0.2) +* +* explainedVariance: sum((preds - 1.7)^2) / count = 10.1775 --- End diff -- I am having an issue with these formulas, based on the very wikipedia page you are referencing. It should not be normalized by the count but by the weight sum, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11520][ML] RegressionMetrics should sup...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/9907#discussion_r49000659 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala --- @@ -39,16 +51,17 @@ class RegressionMetrics @Since("1.2.0") ( * prediction and observation */ private[mllib] def this(predictionAndObservations: DataFrame) = -this(predictionAndObservations.map(r => (r.getDouble(0), r.getDouble(1 +this(predictionAndObservations.map(r => (r.getDouble(0), r.getDouble(1), 1.0))) /** * Use MultivariateOnlineSummarizer to calculate summary statistics of observations and errors. */ private lazy val summary: MultivariateStatisticalSummary = { -val summary: MultivariateStatisticalSummary = predictionAndObservations.map { - case (prediction, observation) => Vectors.dense(observation, observation - prediction) +val summary: MultivariateStatisticalSummary = predictionAndObservationsWithWeight.map { + case (prediction, observation, weight) => +(Vectors.dense(observation, observation - prediction), weight) }.aggregate(new MultivariateOnlineSummarizer())( -(summary, v) => summary.add(v), +(summary, sample) => summary.add(sample._1, sample._2), --- End diff -- We should rename this `summary` variable, it is used 3 times for different objects. Also, `sample` is a bit too generic here, and using the `_xx` methods are not very intuitive. I suggest you unpack the tuple fully: `{ case (currentSummary, (vec, weight)) => currentSummary.add(vec, weight) }` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12346] [ML] Missing attribute names in ...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10323#issuecomment-169425244 @ericl this looks great, thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9716] [ML] BinaryClassificationEvaluato...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10472#issuecomment-169418377 @BenFradet thanks! Just a small comment. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9716] [ML] BinaryClassificationEvaluato...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10472#discussion_r48993376 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala --- @@ -44,6 +44,23 @@ private[spark] object SchemaUtils { } /** +* Check whether the given schema contains a column of one of the require data types. +* @param colName column name +* @param dataTypes required column data types +*/ + def checkColumnTypes( + schema: StructType, + colName: String, + dataTypes: Seq[DataType], + msg: String = ""): Unit = { +val actualDataType = schema(colName).dataType +val message = if (msg != null && msg.trim.length > 0) " " + msg else "" +require(dataTypes.exists(actualDataType.equals(_)), --- End diff -- You do not need the `_`: `dataTypes.exists(actualDataType.equals)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9716] [ML] BinaryClassificationEvaluato...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10472#discussion_r48993085 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala --- @@ -79,13 +79,14 @@ class BinaryClassificationEvaluator @Since("1.4.0") (@Since("1.4.0") override va @Since("1.2.0") override def evaluate(dataset: DataFrame): Double = { val schema = dataset.schema -SchemaUtils.checkColumnType(schema, $(rawPredictionCol), new VectorUDT) +SchemaUtils.checkColumnTypes(schema, $(rawPredictionCol), Seq(DoubleType, new VectorUDT)) SchemaUtils.checkColumnType(schema, $(labelCol), DoubleType) // TODO: When dataset metadata has been implemented, check rawPredictionCol vector length = 2. val scoreAndLabels = dataset.select($(rawPredictionCol), $(labelCol)) - .map { case Row(rawPrediction: Vector, label: Double) => -(rawPrediction(1), label) + .map { +case Row(rawPrediction: Vector, label: Double) => (rawPrediction(1), label) +case Row(rawPrediction: Double, label: Double) => (rawPrediction, label) --- End diff -- there is a small loss of performance because a conditional branch is introduced here, but I believe the cost of unpacking the row is much higher anyway. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10570#issuecomment-169396527 @srowen thanks a lot for the cleanup! Just two comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10570#discussion_r48983461 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala --- @@ -206,7 +206,7 @@ abstract class QueryTest extends PlanTest { val jsonString = try { logicalPlan.toJSON } catch { - case e => + case e: Throwable => --- End diff -- This is more a potential bug in the original code, but why not use `NonFatal(e)` here instead? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10570#discussion_r48983495 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala --- @@ -231,7 +231,7 @@ abstract class QueryTest extends PlanTest { val jsonBackPlan = try { TreeNode.fromJSON[LogicalPlan](jsonString, sqlContext.sparkContext) } catch { - case e => + case e: Throwable => --- End diff -- same thing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10570#discussion_r48983231 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala --- @@ -275,8 +275,8 @@ trait ExpressionEvalHelper extends GeneratorDrivenPropertyChecks { (result, expected) match { case (result: Array[Byte], expected: Array[Byte]) => java.util.Arrays.equals(result, expected) - case (result: Double, expected: Spread[Double]) => -expected.isWithin(result) + case (result: Double, expected: Spread[_]) => // can't use Spread[Double] b/c of erasure --- End diff -- same thing here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12618] [CORE] [STREAMING] [SQL] Clean u...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10570#discussion_r48983200 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala --- @@ -57,8 +57,8 @@ trait ExpressionEvalHelper extends GeneratorDrivenPropertyChecks { (result, expected) match { case (result: Array[Byte], expected: Array[Byte]) => java.util.Arrays.equals(result, expected) - case (result: Double, expected: Spread[Double]) => -expected.isWithin(result) + case (result: Double, expected: Spread[_]) => // Can't use Spread[Double] b/c of erasure --- End diff -- maybe I am not familiar with the style guidelines on this point, but it seems to be a perfect use case for `@unchecked`: http://www.scala-lang.org/api/current/index.html#scala.unchecked --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12663] [MLlib] More informative error m...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10611#issuecomment-169190994 @robert-dodier thanks for your PR. Can you please fix the style issue? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12663] [MLlib] More informative error m...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10611#discussion_r48919311 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala --- @@ -86,7 +86,7 @@ object MLUtils { val indicesLength = indices.length while (i < indicesLength) { val current = indices(i) - require(current > previous, "indices should be one-based and in ascending order" ) + require(current > previous, "indices should be one-based and in ascending order; found current=" + current + ", previous=" + previous + "; line=\"" + line + "\"") --- End diff -- This line is too long (see spark style guide). Also, I recommend you use scala's string interpolation rather than manually building the string: ```scala s"previous=$previous, line='$line' " ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12006][ML][PYTHON] Fix GMM failure if i...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/9986#discussion_r48880089 --- Diff: python/pyspark/mllib/clustering.py --- @@ -346,7 +346,7 @@ def train(cls, rdd, k, convergenceTol=1e-3, maxIterations=100, seed=None, initia if initialModel.k != k: raise Exception("Mismatched cluster count, initialModel.k = %s, however k = %s" % (initialModel.k, k)) -initialModelWeights = initialModel.weights +initialModelWeights = [w for w in initialModel.weights] --- End diff -- Instead of doing this, I suggest you explicitly call `list(initialModel.weights)`, it is easier to understand. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12026] [MLlib] ChiSqTest gets slower an...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10146#issuecomment-168867750 @hhbyyh thanks for the fix; I just have one small comment. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12026] [MLlib] ChiSqTest gets slower an...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10146#discussion_r48805286 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSqTest.scala --- @@ -109,7 +109,9 @@ private[stat] object ChiSqTest extends Logging { } i += 1 distinctLabels += label - features.toArray.view.zipWithIndex.slice(startCol, endCol).map { case (feature, col) => + val featureArray = features.toArray + (startCol until endCol).map { col => +val feature = featureArray(col) --- End diff -- ah yes, good catch about the view that builds incrementally a copy of the feature vectors. Let's also remove the `featureArray`, we can call directly `val feature = features(col)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12041] [ML] [PySpark] Add columnSimilar...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10158#issuecomment-168863404 @vectorijk thanks for the PR, it looks good to me. cc @jkbradley --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11815] [ML] [PySpark] PySpark DecisionT...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/9807#issuecomment-168862678 This looks good to me. cc @jkbradley --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11515][ML] QuantileDiscretizer should t...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/9535#issuecomment-168861235 @yu-iskw thanks for fixing this issue. It looks good to me, but can you please resolve the conflicts? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12450][MLLib] Un-persist broadcasted va...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10415#issuecomment-168859502 @rnowling thanks for fixing this, I just have some style comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12450][MLLib] Un-persist broadcasted va...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10415#discussion_r48802864 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -324,6 +326,7 @@ class KMeans private ( s0 } ) + bcNewCenters.unpersist(blocking = false) --- End diff -- small nit: add new lines around this line --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12450][MLLib] Un-persist broadcasted va...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10415#discussion_r48802820 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -353,6 +356,7 @@ class KMeans private ( ((r, KMeans.findClosest(bcCenters.value(r), p)._1), 1.0) } }.reduceByKey(_ + _).collectAsMap() +bcCenters.unpersist(blocking = false) --- End diff -- small nit: add new lines around this line --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11608][MLLIB][DOC] Added migration guid...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10235#issuecomment-165206222 This looks good to me. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10297#discussion_r47727256 --- Diff: docs/css/main.css --- @@ -101,6 +98,26 @@ a:hover code { max-width: 914px; } +.content { + z-index: 1; + position: relative; + background-color: #FFF; + max-width: 914px; + line-height: 1.6; /* Inspired by Github's wiki style */ + background-color: white; + padding-left: 15px; +} + +.content-with-sidebar { + z-index: 1; + position: relative; + background-color: #FFF; --- End diff -- Done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10297#discussion_r47727246 --- Diff: docs/css/main.css --- @@ -101,6 +98,26 @@ a:hover code { max-width: 914px; } +.content { + z-index: 1; + position: relative; + background-color: #FFF; --- End diff -- oops thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10297#discussion_r47721444 --- Diff: docs/css/main.css --- @@ -171,24 +202,104 @@ a.anchorjs-link:hover { text-decoration: none; } * The left navigation bar. */ .left-menu-wrapper { - position: absolute; - height: 100%; - - width: 256px; - margin-top: -20px; - padding-top: 20px; + margin-left: 0px; + margin-right: 0px; background-color: #F0F8FC; + border-top-width: 0px; + border-left-width: 0px; + border-bottom-width: 0px; + margin-top: 0px; + width: 210px; + float: left; + position: absolute; } .left-menu { - position: fixed; - max-width: 350px; - - padding-right: 10px; - width: 256px; + padding: 0px; + width: 199px; } .left-menu h3 { margin-left: 10px; line-height: 30px; +} + +/** + * The collapsing button for the navigation bar. + */ +.nav-trigger { +position: fixed; +clip: rect(0, 0, 0, 0); +} + +.nav-trigger + label:after { +content: '»'; +} + +label { + z-index: 10; +} + +label[for="nav-trigger"] { +position: fixed; +margin-left: 0px; +padding-top: 100px; +padding-left: 5px; +width: 10px; +height: 80%; +cursor: pointer; +background-size: contain; +background-color: #D4F0FF; +border-radius: 10px 0px 0px 10px; --- End diff -- Removing the border. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10297#discussion_r47721419 --- Diff: docs/css/main.css --- @@ -171,24 +202,104 @@ a.anchorjs-link:hover { text-decoration: none; } * The left navigation bar. */ .left-menu-wrapper { - position: absolute; - height: 100%; - - width: 256px; - margin-top: -20px; - padding-top: 20px; + margin-left: 0px; + margin-right: 0px; background-color: #F0F8FC; + border-top-width: 0px; + border-left-width: 0px; + border-bottom-width: 0px; + margin-top: 0px; + width: 210px; + float: left; + position: absolute; } .left-menu { - position: fixed; - max-width: 350px; - - padding-right: 10px; - width: 256px; + padding: 0px; + width: 199px; } .left-menu h3 { margin-left: 10px; line-height: 30px; +} + +/** + * The collapsing button for the navigation bar. + */ +.nav-trigger { +position: fixed; +clip: rect(0, 0, 0, 0); +} + +.nav-trigger + label:after { +content: '»'; +} + +label { --- End diff -- Good question. I tried to do this and it failed to take the z-index into account despite my best efforts. This is the only way I found to make it work. If you think it is too wide, then I can try to add a CSS class to it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10297#discussion_r47721302 --- Diff: docs/css/main.css --- @@ -39,13 +39,26 @@ margin-left: 10px; } +/* body .container-wrapper { position: absolute; width: 100%; display: flex; } +*/ -body #content { +body .container-wrapper { + background-color: #FFF; + color: #1D1F22; + max-width: 1024px; + margin-top: 10px; + margin-left: auto; + margin-right: auto; + border-radius: 15px; + position: relative; +} + +body #contentREMOVE { --- End diff -- Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10297#discussion_r47721217 --- Diff: docs/css/main.css --- @@ -171,24 +202,104 @@ a.anchorjs-link:hover { text-decoration: none; } * The left navigation bar. */ .left-menu-wrapper { - position: absolute; - height: 100%; - - width: 256px; - margin-top: -20px; - padding-top: 20px; + margin-left: 0px; + margin-right: 0px; background-color: #F0F8FC; + border-top-width: 0px; + border-left-width: 0px; + border-bottom-width: 0px; + margin-top: 0px; + width: 210px; + float: left; + position: absolute; } .left-menu { - position: fixed; - max-width: 350px; - - padding-right: 10px; - width: 256px; + padding: 0px; + width: 199px; } .left-menu h3 { margin-left: 10px; line-height: 30px; +} + +/** + * The collapsing button for the navigation bar. + */ +.nav-trigger { +position: fixed; +clip: rect(0, 0, 0, 0); +} + +.nav-trigger + label:after { +content: '»'; +} + +label { + z-index: 10; +} + +label[for="nav-trigger"] { +position: fixed; --- End diff -- Fixed everywhere to 2 spaces --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10297#discussion_r47720994 --- Diff: docs/_layouts/global.html --- @@ -128,19 +128,31 @@ {% if page.url contains "/ml" %} {% include nav-left-wrapper-ml.html nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %} + + + --- End diff -- Fixed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10297#issuecomment-164556778 @jkbradley can you take a look at this fix? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12324][MLLIB][DOC] Fixes the sidebar in...
GitHub user thunterdb opened a pull request: https://github.com/apache/spark/pull/10297 [SPARK-12324][MLLIB][DOC] Fixes the sidebar in the ML documentation This fixes the sidebar, using a pure CSS mechanism to hide it when the browser's viewport is too narrow. Credit goes to the original author @Titan-C (mentioned in the NOTICE). Note that I am not a CSS expert, so I can only address comments up to some extent. Default view: https://cloud.githubusercontent.com/assets/7594753/11793597/6d1d6eda-a261-11e5-836b-6eb2054e9054.png";> When collapsed manually by the user: https://cloud.githubusercontent.com/assets/7594753/11793669/c991989e-a261-11e5-8bf6-aecf3bdb6319.png";> Disappears when column is too narrow: https://cloud.githubusercontent.com/assets/7594753/11793607/7754dbcc-a261-11e5-8b15-e0d074b0e47c.png";> Can still be opened by the user if necessary: https://cloud.githubusercontent.com/assets/7594753/11793612/7bf82968-a261-11e5-9cc3-e827a7a6b2b0.png";> You can merge this pull request into a Git repository by running: $ git pull https://github.com/thunterdb/spark 12324 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10297.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10297 commit dc8d41b94aa49c31849596ea986368c3bb3c29e3 Author: Timothy Hunter Date: 2015-12-14T19:39:54Z menu commit 94e82743956bce72f5ca2402be535fced36db24e Author: Timothy Hunter Date: 2015-12-14T20:00:50Z colors fixed commit b091b42660ac510ba49ddf0274c779ae1c249b67 Author: Timothy Hunter Date: 2015-12-14T20:41:28Z add the original author to the notice --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12212][ML][DOC] Clarifies the differenc...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10234#issuecomment-163723975 @jkbradley done: https://cloud.githubusercontent.com/assets/7594753/11725710/e9949f04-9f2f-11e5-8ba5-7f955e8b41fa.png";> --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12212][ML][DOC] Clarifies the differenc...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10234#issuecomment-163713115 @jkbradley done with changes, let me know what you think. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12212][ML][DOC] Clarifies the differenc...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10234#discussion_r47262584 --- Diff: docs/ml-classification-regression.md --- @@ -27,10 +27,10 @@ displayTitle: Classification and regression in spark.ml * This will become a table of contents (this text will be scraped). {:toc} -In MLlib, we implement popular linear methods such as logistic +In `spark.ml`, we implement popular linear methods such as logistic regression and linear least squares with $L_1$ or $L_2$ regularization. Refer to [the linear methods in mllib](mllib-linear-methods.html) for -details. In `spark.ml`, we also include Pipelines API for [Elastic +details about implementation and tuning. We also include a Dataframe API for [Elastic --- End diff -- Corrected everywhere in the documentation --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12212][ML][DOC] Clarifies the differenc...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10234#discussion_r47262597 --- Diff: docs/mllib-feature-extraction.md --- @@ -1,7 +1,7 @@ --- layout: global -title: Feature Extraction and Transformation - MLlib -displayTitle: MLlib - Feature Extraction and Transformation +title: Feature Extraction and Transformation --- End diff -- Done everywhere --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12212][ML][DOC] Clarifies the differenc...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10234#issuecomment-163455399 @jkbradley I agree with you. Note that when you use the doc, the current section appears in bold in the side menu (and the submenu gets expandend). This is why I did not include it again, but I do not have a strong opinion on this point: https://cloud.githubusercontent.com/assets/7594753/11703974/64e32588-9e98-11e5-995a-247be7e51d77.png";> --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12212][ML][DOC] Clarifies the differenc...
GitHub user thunterdb opened a pull request: https://github.com/apache/spark/pull/10234 [SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib and mllib in the documentation. Replaces a number of occurences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in spark). You can merge this pull request into a Git repository by running: $ git pull https://github.com/thunterdb/spark 12212 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10234.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10234 commit edddc09237e4f519fa54d902fea227f59b5b1aac Author: Timothy Hunter Date: 2015-12-09T23:07:08Z removed old files commit 4b328e544dd3bc11ba2b88966023b50a91f919b3 Author: Timothy Hunter Date: 2015-12-09T23:07:45Z removed old files commit 62dc368d090402f5416961e012cd9e8160e0c585 Author: Timothy Hunter Date: 2015-12-10T00:17:16Z last changes commit 42bf337daa7145916d907bb8161d95c65bb20ae6 Author: Timothy Hunter Date: 2015-12-10T00:27:51Z missing conversions --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10207#discussion_r47043656 --- Diff: docs/ml-classification-regression.md --- @@ -0,0 +1,762 @@ +--- +layout: global +title: Classification and regression - spark.ml +displayTitle: Classification and regression in spark.ml +--- + + +`\[ +\newcommand{\R}{\mathbb{R}} +\newcommand{\E}{\mathbb{E}} +\newcommand{\x}{\mathbf{x}} +\newcommand{\y}{\mathbf{y}} +\newcommand{\wv}{\mathbf{w}} +\newcommand{\av}{\mathbf{\alpha}} +\newcommand{\bv}{\mathbf{b}} +\newcommand{\N}{\mathbb{N}} +\newcommand{\id}{\mathbf{I}} +\newcommand{\ind}{\mathbf{1}} +\newcommand{\0}{\mathbf{0}} +\newcommand{\unit}{\mathbf{e}} +\newcommand{\one}{\mathbf{1}} +\newcommand{\zero}{\mathbf{0}} +\]` + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +In MLlib, we implement popular linear methods such as logistic +regression and linear least squares with $L_1$ or $L_2$ regularization. +Refer to [the linear methods in mllib](mllib-linear-methods.html) for +details. In `spark.ml`, we also include Pipelines API for [Elastic +net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid +of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization +and variable selection via the elastic +net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf). +Mathematically, it is defined as a convex combination of the $L_1$ and +the $L_2$ regularization terms: +`\[ +\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0 +\]` +By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$ +regularization as special cases. For example, if a [linear +regression](https://en.wikipedia.org/wiki/Linear_regression) model is +trained with the elastic net parameter $\alpha$ set to $1$, it is +equivalent to a +[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model. +On the other hand, if $\alpha$ is set to $0$, the trained model reduces +to a [ridge +regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model. +We implement Pipelines API for both linear regression and logistic +regression with elastic net regularization. + + +# Classification + +## Logistic regression + +Logistic regression is a popular method to predict a binary response. It is a special case of [Generalized Linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) that predicts the probability of the outcome. +For more background and more details about the implementation, refer to the documentation of the [logistic regression in `spark.mllib`](mllib-linear-methods.html#logistic-regression). + + > The current implementation of logistic regression in `spark.ml` only supports binary classes. Support for multiclass regression will be added in the future. + +The following example shows how to train a logistic regression model +with elastic net regularization. `elasticNetParam` corresponds to +$\alpha$ and `regParam` corresponds to $\lambda$. + + + + +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala %} + + + +{% include_example java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java %} + + + +{% include_example python/ml/logistic_regression_with_elastic_net.py %} + + + + +The `spark.ml` implementation of logistic regression also supports +extracting a summary of the model over the training set. Note that the +predictions and metrics which are stored as `Dataframe` in +`BinaryLogisticRegressionSummary` are annotated `@transient` and hence +only available on the driver. + + + + + +[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary) +provides a summary for a +[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel). +Currently, only binary classification is supported and the +summary must be explicitly cast to +[`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary). +This will likely change when multiclass classification is supported. + +Continuing the earlier example: + +{% include_example scala/org/apache/spark/examples/ml/LogisticRegressionSummaryExample.scala %} + + + +[`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classificat
[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10207#discussion_r47018195 --- Diff: docs/ml-classification-regression.md --- @@ -0,0 +1,733 @@ +--- +layout: global +title: Classification and regression - spark.ml +displayTitle: Classification and regression in spark.ml +--- + + +`\[ +\newcommand{\R}{\mathbb{R}} +\newcommand{\E}{\mathbb{E}} +\newcommand{\x}{\mathbf{x}} +\newcommand{\y}{\mathbf{y}} +\newcommand{\wv}{\mathbf{w}} +\newcommand{\av}{\mathbf{\alpha}} +\newcommand{\bv}{\mathbf{b}} +\newcommand{\N}{\mathbb{N}} +\newcommand{\id}{\mathbf{I}} +\newcommand{\ind}{\mathbf{1}} +\newcommand{\0}{\mathbf{0}} +\newcommand{\unit}{\mathbf{e}} +\newcommand{\one}{\mathbf{1}} +\newcommand{\zero}{\mathbf{0}} +\]` + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +In MLlib, we implement popular linear methods such as logistic +regression and linear least squares with $L_1$ or $L_2$ regularization. +Refer to [the linear methods in mllib](mllib-linear-methods.html) for +details. In `spark.ml`, we also include Pipelines API for [Elastic +net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid +of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization +and variable selection via the elastic +net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf). +Mathematically, it is defined as a convex combination of the $L_1$ and +the $L_2$ regularization terms: +`\[ +\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0 +\]` +By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$ +regularization as special cases. For example, if a [linear +regression](https://en.wikipedia.org/wiki/Linear_regression) model is +trained with the elastic net parameter $\alpha$ set to $1$, it is +equivalent to a +[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model. +On the other hand, if $\alpha$ is set to $0$, the trained model reduces +to a [ridge +regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model. +We implement Pipelines API for both linear regression and logistic +regression with elastic net regularization. + +# Regression + +## Linear regression + +The interface for working with linear regression models and model +summaries is similar to the logistic regression case. The following +example demonstrates training an elastic net regularized linear +regression model and extracting model summary statistics. + + + + +{% include_example scala/org/apache/spark/examples/ml/LinearRegressionWithElasticNetExample.scala %} + + + +{% include_example java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java %} + + + + +{% include_example python/ml/linear_regression_with_elastic_net.py %} + + + + +## Survival regression + + +In `spark.ml`, we implement the [Accelerated failure time (AFT)](https://en.wikipedia.org/wiki/Accelerated_failure_time_model) +model which is a parametric survival regression model for censored data. +It describes a model for the log of survival time, so it's often called +log-linear model for survival analysis. Different from +[Proportional hazards](https://en.wikipedia.org/wiki/Proportional_hazards_model) model +designed for the same purpose, the AFT model is more easily to parallelize +because each instance contribute to the objective function independently. + +Given the values of the covariates $x^{'}$, for random lifetime $t_{i}$ of +subjects i = 1, ..., n, with possible right-censoring, +the likelihood function under the AFT model is given as: +`\[ +L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0}(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0}(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}} +\]` +Where $\delta_{i}$ is the indicator of the event has occurred i.e. uncensored or not. +Using $\epsilon_{i}=\frac{\log{t_{i}}-x^{'}\beta}{\sigma}$, the log-likelihood function +assumes the form: +`\[ +\iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+\delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}] +\]` +Where $S_{0}(\epsilon_{i})$ is the baseline survivor function, +and $f_{0}(\epsilon_{i})$ is corresponding density function. + +The most commonly used AFT model is based on the Weibull distribution of the survival time. +The Weibull distribution for lifetime corresponding to extreme value distribu
[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10207#discussion_r47018213 --- Diff: docs/_data/menu-ml.yaml --- @@ -1,10 +1,10 @@ -- text: Feature extraction, transformation, and selection +- text: "Overview: estimators, transformers and pipelines" + url: ml-intro.html +- text: Building and transforming features --- End diff -- Done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10207#discussion_r47017809 --- Diff: docs/mllib-guide.md --- @@ -66,15 +66,18 @@ We list major functionality from both below, with links to detailed guides. # spark.ml: high-level APIs for ML pipelines -**[spark.ml programming guide](ml-guide.html)** provides an overview of the Pipelines API and major -concepts. It also contains sections on using algorithms within the Pipelines API, for example: - -* [Feature extraction, transformation, and selection](ml-features.html) +* [Overview: estimators, transformers and pipelines](ml-intro.html) +* [Building and transforming features](ml-features.html) +* [Classification and regression](ml-classification-regression.html) * [Clustering](ml-clustering.html) -* [Decision trees for classification and regression](ml-decision-tree.html) -* [Ensembles](ml-ensembles.html) -* [Linear methods with elastic net regularization](ml-linear-methods.html) -* [Multilayer perceptron classifier](ml-ann.html) +* [Advanced topics](ml-advanced.html) + +Some techniques are not available yet in spark.ml, most notably: + - clustering --- End diff -- Oh yes true, of course. I rephrased it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8517][ML][DOC] Reorganizes the spark.ml...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/10207#issuecomment-163009350 @jkbradley no this PR just moves the text around, with little modification. More substantital changes will be done later. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8517][MLLIB][DOC] Reorganizes the spark...
GitHub user thunterdb opened a pull request: https://github.com/apache/spark/pull/10207 [SPARK-8517][MLLIB][DOC] Reorganizes the spark.ml user guide This PR moves pieces of the spark.ml user guide to reflect suggestions in SPARK-8517. It does not introduce new content, as requested. https://cloud.githubusercontent.com/assets/7594753/11666166/e82b84f2-9d9f-11e5-8904-e215424d8444.png";> You can merge this pull request into a Git repository by running: $ git pull https://github.com/thunterdb/spark spark-8517 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10207.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10207 commit af8abb9713f90478a1a0b81fdc83192192354c30 Author: Timothy Hunter Date: 2015-12-08T18:23:06Z moved content commit 451b7737f553fbc425ce2144fe0930b885874c7f Author: Timothy Hunter Date: 2015-12-08T19:32:22Z reordered doc --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12000] Fixes compilation issues with ja...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10048#discussion_r46210267 --- Diff: project/SparkBuild.scala --- @@ -155,13 +155,15 @@ object SparkBuild extends PomBuild { if (major.toInt >= 1 && minor.toInt >= 8) Seq("-Xdoclint:all", "-Xdoclint:-missing") else Seq.empty }, +// TODO(thunterdb) how to set these based on the current version of the JVM? javacJVMVersion := "1.7", scalacJVMVersion := "1.7", javacOptions in Compile ++= Seq( - "-encoding", "UTF-8", - "-source", javacJVMVersion.value, - "-target", javacJVMVersion.value + "-encoding", "UTF-8" + // TODO(thunterdb) make it conditional for <= "1.7" --- End diff -- I am not sure how to get the current version of java (not the target version). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12000] Fixes compilation issues with ja...
GitHub user thunterdb opened a pull request: https://github.com/apache/spark/pull/10048 [SPARK-12000] Fixes compilation issues with java8 and sbt. Currently, trying to publish spark locally fails when java version is >= 1.8.0: - a javadoc option has been removed - javadoc is stricter about its input and fails with some malformed comments This PR fixes both issues. You can merge this pull request into a Git repository by running: $ git pull https://github.com/thunterdb/spark 1511-java8 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10048.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10048 commit 97d12f1fa2b494d006d591317364633aef677d75 Author: Timothy Hunter Date: 2015-11-26T00:16:58Z changes commit 990053924e08153d5bc06e81a9857aeb4cc36a05 Author: Timothy Hunter Date: 2015-11-26T00:25:10Z comments --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11835] Adds a sidebar menu to MLlib's d...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/9826#issuecomment-158541369 @mengxr comment addressed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11835] Adds a sidebar menu to MLlib's d...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/9826#issuecomment-158229760 @mengxr based on our discussions, here is a slightly updated version: - menu is fixed - menu disappears under the page (but it looks half covered in the middle). Fixing that would require JS, I think (and sk-learn does not have a fixed menu): ![screen shot 2015-11-19 at 3 10 50 pm](https://cloud.githubusercontent.com/assets/7594753/11287386/efb9686c-8ecf-11e5-9829-038f22468ad9.png) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11835] Adds a sidebar menu to MLlib's d...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/9826#issuecomment-158220169 @andrewor14 this is a different issue: the SIPs show the table of contents within one document, which we already have with the `{:toc}` directive. This PR adds the per-project organization (linking various markdown files together). This is not something we can infer automatically because we show different levels of nesting for each of the pages. Each project already has this description in their overview pages, and I am proposing we move them to the side bar: - http://spark.apache.org/docs/latest/streaming-programming-guide.html - http://spark.apache.org/docs/latest/sql-programming-guide.html - http://spark.apache.org/docs/latest/graphx-programming-guide.html - http://spark.apache.org/docs/latest/sparkr.html --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11835] Adds a sidebar menu to MLlib's d...
Github user thunterdb commented on the pull request: https://github.com/apache/spark/pull/9826#issuecomment-158169959 This is about as much as I can do with my very limited knowledge of CSS, so additional fixes can be done in a separate PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org