[GitHub] spark pull request #15555: [Minor][ML] Refactor clustering summary.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r84998430

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala ---
@@ -356,42 +356,25 @@ object GaussianMixture extends DefaultParamsReadable[GaussianMixture] {
  * :: Experimental ::
  * Summary of GaussianMixture.
  *
- * @param predictions [[DataFrame]] produced by [[GaussianMixtureModel.transform()]]
- * @param predictionCol Name for column of predicted clusters in `predictions`
- * @param probabilityCol Name for column of predicted probability of each cluster in `predictions`
- * @param featuresCol Name for column of features in `predictions`
- * @param k Number of clusters
+ * @param predictions [[DataFrame]] produced by [[GaussianMixtureModel.transform()]].
+ * @param predictionCol Name for column of predicted clusters in `predictions`.
+ * @param probabilityCol Name for column of predicted probability of each cluster
+ *                       in `predictions`.
+ * @param featuresCol Name for column of features in `predictions`.
+ * @param k Number of clusters.
  */
 @Since("2.0.0")
 @Experimental
 class GaussianMixtureSummary private[clustering] (
-    @Since("2.0.0") @transient val predictions: DataFrame,
-    @Since("2.0.0") val predictionCol: String,
-    @Since("2.0.0") val probabilityCol: String,
-    @Since("2.0.0") val featuresCol: String,
-    @Since("2.0.0") val k: Int) extends Serializable {
-
-  /**
-   * Cluster centers of the transformed data.
-   */
-  @Since("2.0.0")
-  @transient lazy val cluster: DataFrame = predictions.select(predictionCol)
+    predictions: DataFrame,
+    predictionCol: String,
+    val probabilityCol: String,
--- End diff --

Could do a `@Since` tag here.
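The suggestion above is to keep a `@Since` tag on the `val` parameters that remain public on the subclass. A minimal sketch of that annotation placement, with the `Since` annotation stubbed locally so the snippet compiles without Spark on the classpath (the real annotation is `org.apache.spark.annotation.Since`):

```scala
import scala.annotation.StaticAnnotation

// Stand-in for org.apache.spark.annotation.Since; only the placement is illustrated.
class Since(version: String) extends StaticAnnotation

// A constructor parameter promoted to a public val can carry the tag directly.
class GaussianMixtureSummarySketch(
    @Since("2.0.0") val probabilityCol: String)

object SinceSketch {
  def main(args: Array[String]): Unit = {
    println(new GaussianMixtureSummarySketch("probability").probabilityCol)  // prints "probability"
  }
}
```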
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r84837836

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -354,21 +354,41 @@ object KMeans extends DefaultParamsReadable[KMeans] {
 @Since("2.0.0")
 @Experimental
 class KMeansSummary private[clustering] (
-    @Since("2.0.0") @transient val predictions: DataFrame,
-    @Since("2.0.0") val predictionCol: String,
-    @Since("2.0.0") val featuresCol: String,
-    @Since("2.0.0") val k: Int) extends Serializable {
+    predictions: DataFrame,
+    predictionCol: String,
+    featuresCol: String,
+    k: Int)
+  extends ClusteringSummary (
+    predictions,
+    predictionCol,
+    featuresCol,
+    k
+  )
+
+/**
+ * :: Experimental ::
+ * Summary of clustering.
+ *
+ * @param predictions [[DataFrame]] produced by model.transform()
+ * @param predictionCol Name for column of predicted clusters in `predictions`
+ * @param featuresCol Name for column of features in `predictions`
+ * @param k Number of clusters
+ */
+@Experimental
--- End diff --

I'm also ambivalent about this. The reason behind my change is that some classes, such as ```KMeansSummary``` and ```GaussianMixtureSummary```, were added in 2.0. If I put ```@Since("2.1.0")``` here, it doesn't look quite right, but I'm not sure whether that's OK. @jkbradley What's your opinion? Thanks.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r84798339

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -354,21 +354,41 @@ object KMeans extends DefaultParamsReadable[KMeans] {
 @Since("2.0.0")
 @Experimental
 class KMeansSummary private[clustering] (
-    @Since("2.0.0") @transient val predictions: DataFrame,
-    @Since("2.0.0") val predictionCol: String,
-    @Since("2.0.0") val featuresCol: String,
-    @Since("2.0.0") val k: Int) extends Serializable {
+    predictions: DataFrame,
+    predictionCol: String,
+    featuresCol: String,
+    k: Int)
+  extends ClusteringSummary (
+    predictions,
+    predictionCol,
+    featuresCol,
+    k
+  )
+
+/**
+ * :: Experimental ::
+ * Summary of clustering.
+ *
+ * @param predictions [[DataFrame]] produced by model.transform()
+ * @param predictionCol Name for column of predicted clusters in `predictions`
+ * @param featuresCol Name for column of features in `predictions`
+ * @param k Number of clusters
+ */
+@Experimental
--- End diff --

I'm not entirely certain of the official policy for the `@Since` tags, but it seems better to me to put `@Since("2.1.0")` here for the class and the methods. It will be correct for some and will at least not be incorrect for others. I'm not positive, though.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r84792326

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -354,21 +354,41 @@ object KMeans extends DefaultParamsReadable[KMeans] {
 @Since("2.0.0")
 @Experimental
 class KMeansSummary private[clustering] (
-    @Since("2.0.0") @transient val predictions: DataFrame,
-    @Since("2.0.0") val predictionCol: String,
-    @Since("2.0.0") val featuresCol: String,
-    @Since("2.0.0") val k: Int) extends Serializable {
+    predictions: DataFrame,
+    predictionCol: String,
+    featuresCol: String,
+    k: Int)
+  extends ClusteringSummary (
+    predictions,
+    predictionCol,
+    featuresCol,
+    k
+  )
+
+/**
+ * :: Experimental ::
+ * Summary of clustering.
+ *
+ * @param predictions [[DataFrame]] produced by model.transform()
--- End diff --

nit: Add periods at the end of each line.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r84792150

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala ---
@@ -297,27 +297,13 @@ object BisectingKMeans extends DefaultParamsReadable[BisectingKMeans] {
 @Since("2.1.0")
 @Experimental
 class BisectingKMeansSummary private[clustering] (
-    @Since("2.1.0") @transient val predictions: DataFrame,
-    @Since("2.1.0") val predictionCol: String,
-    @Since("2.1.0") val featuresCol: String,
-    @Since("2.1.0") val k: Int) extends Serializable {
-
-  /**
-   * Cluster centers of the transformed data.
-   */
-  @Since("2.1.0")
-  @transient lazy val cluster: DataFrame = predictions.select(predictionCol)
-
-  /**
-   * Size of (number of data points in) each cluster.
-   */
-  @Since("2.1.0")
-  lazy val clusterSizes: Array[Long] = {
-    val sizes = Array.fill[Long](k)(0)
-    cluster.groupBy(predictionCol).count().select(predictionCol, "count").collect().foreach {
-      case Row(cluster: Int, count: Long) => sizes(cluster) = count
-    }
-    sizes
-  }
-
-}
+    predictions: DataFrame,
+    predictionCol: String,
+    featuresCol: String,
+    k: Int)
+  extends ClusteringSummary (
--- End diff --

minor: this can fit on one line like: `k: Int) extends ClusteringSummary(..., ..., ...)`
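The `clusterSizes` logic deleted in this diff is exactly the kind of duplicated code the PR hoists into `ClusteringSummary`. As a hedged illustration of what that counting does, here is the same logic over a plain `Seq` of cluster assignments instead of a Spark `DataFrame` (the `DataFrame`-based version in the diff is the real implementation):

```scala
object ClusterSizesSketch {
  // Count how many points fall into each of the k clusters; clusters with no
  // points keep a size of 0, matching the Array.fill initialization in the diff.
  def clusterSizes(assignments: Seq[Int], k: Int): Array[Long] = {
    val sizes = Array.fill[Long](k)(0L)
    assignments.groupBy(identity).foreach {
      case (cluster, members) => sizes(cluster) = members.size.toLong
    }
    sizes
  }

  def main(args: Array[String]): Unit = {
    // One point in cluster 0, two in 1, three in 2; cluster 3 stays empty.
    println(clusterSizes(Seq(0, 1, 1, 2, 2, 2), 4).mkString(","))  // prints "1,2,3,0"
  }
}
```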
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r84798486

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -354,21 +354,41 @@ object KMeans extends DefaultParamsReadable[KMeans] {
 @Since("2.0.0")
 @Experimental
 class KMeansSummary private[clustering] (
-    @Since("2.0.0") @transient val predictions: DataFrame,
-    @Since("2.0.0") val predictionCol: String,
-    @Since("2.0.0") val featuresCol: String,
-    @Since("2.0.0") val k: Int) extends Serializable {
+    predictions: DataFrame,
+    predictionCol: String,
+    featuresCol: String,
+    k: Int)
+  extends ClusteringSummary (
+    predictions,
+    predictionCol,
+    featuresCol,
+    k
+  )
+
+/**
+ * :: Experimental ::
+ * Summary of clustering.
--- End diff --

"Summary of clustering algorithms." ?
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r84204673

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -354,21 +354,41 @@ object KMeans extends DefaultParamsReadable[KMeans] {
 @Since("2.0.0")
 @Experimental
 class KMeansSummary private[clustering] (
-    @Since("2.0.0") @transient val predictions: DataFrame,
-    @Since("2.0.0") val predictionCol: String,
-    @Since("2.0.0") val featuresCol: String,
-    @Since("2.0.0") val k: Int) extends Serializable {
+    predictions: DataFrame,
+    predictionCol: String,
+    featuresCol: String,
+    k: Int)
+  extends ClusteringSummary (
+    predictions,
+    predictionCol,
+    featuresCol,
+    k
+  )
+
+/**
+ * :: Experimental ::
+ * Summary of clustering.
+ *
+ * @param predictions [[DataFrame]] produced by model.transform()
+ * @param predictionCol Name for column of predicted clusters in `predictions`
+ * @param featuresCol Name for column of features in `predictions`
+ * @param k Number of clusters
+ */
+@Experimental
--- End diff --

```ClusteringSummary``` will be extended by summaries that were added in different versions, so I think we should not add a since version here. As for moving it to a new file: ```ClusteringSummary``` is a small class, so I think we can place it here temporarily.
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r84203200

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -354,21 +354,41 @@ object KMeans extends DefaultParamsReadable[KMeans] {
 @Since("2.0.0")
 @Experimental
 class KMeansSummary private[clustering] (
-    @Since("2.0.0") @transient val predictions: DataFrame,
-    @Since("2.0.0") val predictionCol: String,
-    @Since("2.0.0") val featuresCol: String,
-    @Since("2.0.0") val k: Int) extends Serializable {
+    predictions: DataFrame,
+    predictionCol: String,
+    featuresCol: String,
+    k: Int)
+  extends ClusteringSummary (
+    predictions,
+    predictionCol,
+    featuresCol,
+    k
+  )
+
+/**
+ * :: Experimental ::
+ * Summary of clustering.
+ *
+ * @param predictions [[DataFrame]] produced by model.transform()
+ * @param predictionCol Name for column of predicted clusters in `predictions`
+ * @param featuresCol Name for column of features in `predictions`
+ * @param k Number of clusters
+ */
+@Experimental
--- End diff --

What about adding `@Since("2.1.0")` here? And creating a new Scala file named `Clustering.scala` and moving `ClusteringSummary` into it?
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/1

[Minor][ML] Refactor clustering summary.

## What changes were proposed in this pull request?

Abstract ```ClusteringSummary``` from ```KMeansSummary```, ```GaussianMixtureSummary``` and ```BisectingKMeansSummary```, and eliminate duplicated code.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark clustering-summary

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1

commit 31692676e2a6df0651ff35992aa8ab3688c113b8
Author: Yanbo Liang
Date: 2016-10-19T13:46:20Z

    Refactor clustering summary.
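The refactor described above can be sketched as follows. This is a simplified, self-contained illustration rather than the actual Spark code: `DataFrame` is stubbed as a case class so the sketch compiles without Spark on the classpath, and the `@Since`/`@Experimental` annotations discussed in the review are omitted.

```scala
// Stand-in for org.apache.spark.sql.DataFrame: rows of (prediction, feature).
case class DataFrame(rows: Seq[(Int, Double)])

// Shared base class: the fields and derived values that were previously
// duplicated across the per-algorithm summaries live here once.
class ClusteringSummary(
    val predictions: DataFrame,
    val predictionCol: String,
    val featuresCol: String,
    val k: Int) extends Serializable {

  // Example of hoisted shared logic: the size of each cluster.
  lazy val clusterSizes: Array[Long] = {
    val sizes = Array.fill[Long](k)(0L)
    predictions.rows.groupBy(_._1).foreach {
      case (cluster, members) => sizes(cluster) = members.size.toLong
    }
    sizes
  }
}

// Each concrete summary becomes a thin subclass that just forwards its
// constructor arguments, as in the KMeansSummary diff in this thread.
class KMeansSummary(predictions: DataFrame, predictionCol: String, featuresCol: String, k: Int)
  extends ClusteringSummary(predictions, predictionCol, featuresCol, k)

object RefactorSketch {
  def main(args: Array[String]): Unit = {
    val df = DataFrame(Seq((0, 1.0), (1, 2.0), (1, 3.0)))
    val summary = new KMeansSummary(df, "prediction", "features", 2)
    println(summary.clusterSizes.mkString(","))  // prints "1,2"
  }
}
```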