[GitHub] spark issue #22136: [SPARK-25124][ML]VectorSizeHint setSize and getSize don'...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/22136 @huaxingao thank you for your pull request. Can you please add a test to make sure this does not regress? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
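A minimal regression test for the setSize/getSize round trip discussed above might look roughly like the following sketch. The suite name and placement are assumptions; the real test would live alongside the existing ML feature suites (e.g. VectorSizeHintSuite) rather than in a new file.

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.ml.feature.VectorSizeHint

class VectorSizeHintSizeSuite extends SparkFunSuite {
  test("setSize and getSize round trip") {
    val hint = new VectorSizeHint()
    hint.setSize(3)
    // Before the fix, getSize did not return the value passed to setSize.
    assert(hint.getSize === 3)
  }
}
```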
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r150650409 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import scala.language.existentials +import scala.util.Random + +import org.apache.commons.io.FilenameUtils +import org.apache.hadoop.conf.{Configuration, Configured} +import org.apache.hadoop.fs.{Path, PathFilter} +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat + +import org.apache.spark.sql.SparkSession + +private object RecursiveFlag { + /** + * Sets the spark recursive flag and then restores it. + * + * @param value Value to set + * @param spark Existing spark session + * @param f The function to evaluate after setting the flag + * @return Returns the evaluation result T of the function + */ + def withRecursiveFlag[T](value: Boolean, spark: SparkSession)(f: => T): T = { +val flagName = FileInputFormat.INPUT_DIR_RECURSIVE +val hadoopConf = spark.sparkContext.hadoopConfiguration +val old = Option(hadoopConf.get(flagName)) +hadoopConf.set(flagName, value.toString) +try f finally { + old match { +case Some(v) => hadoopConf.set(flagName, v) +case None => hadoopConf.unset(flagName) + } +} + } +} + +/** + * Filter that allows loading a fraction of HDFS files. + */ +private class SamplePathFilter extends Configured with PathFilter { --- End diff -- Ok, if we think that duplicate file names are not an issue, then I would prefer having using the deterministic hashing-based scheme. I am also happy to make this into another PR since this is a pretty small matter. @imatiach-msft a good hashing function is certainly a concern, and we will need to make a trade off between performance and correctness. If we really want to make sure that it works as expected, the best is probably to use a cryptographic hash, like `SHA-256` (which has strong guarantees of the distribution of output values): https://stackoverflow.com/questions/5531455/how-to-hash-some-string-with-sha256-in-java We have `murmur3` in the Spark source code, but it is not cryptographic and does not come with as strong guarantees. For the sake of performance, we may want to use it eventually. 
Once you have the digest, which is a byte array, it can first be converted to a long (by taking 8 bytes from the whole digest): https://stackoverflow.com/questions/4485128/how-do-i-convert-long-to-byte-and-back-in-java and then you can convert this long to a double and compare it to the requested fraction:
```java
Long l = new Long(...);                        // built from the first 8 bytes of the digest
double d = Math.abs(l.doubleValue());          // magnitude of the hash value
boolean shouldKeep = d / Math.pow(2, 63) <= requestedFraction;
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
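Putting the pieces from the message above together, a self-contained Scala sketch of the SHA-256-based decision could look like the following. The helper name and the UTF-8 encoding of the path are assumptions, not the PR's code.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Deterministic sampling decision: hash the path name, take the first 8 bytes
// of the digest as a long, normalize it to [0, 1], and compare to the fraction.
def shouldKeep(pathName: String, requestedFraction: Double): Boolean = {
  val digest = MessageDigest.getInstance("SHA-256")
    .digest(pathName.getBytes(StandardCharsets.UTF_8))
  val asLong = ByteBuffer.wrap(digest, 0, 8).getLong
  // Mask off the sign bit instead of using Math.abs, which overflows for Long.MinValue.
  val normalized = (asLong & Long.MaxValue).toDouble / Long.MaxValue.toDouble
  normalized <= requestedFraction
}
```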
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r150250118 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import scala.language.existentials +import scala.util.Random + +import org.apache.commons.io.FilenameUtils +import org.apache.hadoop.conf.{Configuration, Configured} +import org.apache.hadoop.fs.{Path, PathFilter} +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat + +import org.apache.spark.sql.SparkSession + +private object RecursiveFlag { + /** + * Sets the spark recursive flag and then restores it. + * + * @param value Value to set + * @param spark Existing spark session + * @param f The function to evaluate after setting the flag + * @return Returns the evaluation result T of the function + */ + def withRecursiveFlag[T](value: Boolean, spark: SparkSession)(f: => T): T = { +val flagName = FileInputFormat.INPUT_DIR_RECURSIVE +val hadoopConf = spark.sparkContext.hadoopConfiguration +val old = Option(hadoopConf.get(flagName)) +hadoopConf.set(flagName, value.toString) +try f finally { + old match { +case Some(v) => hadoopConf.set(flagName, v) +case None => hadoopConf.unset(flagName) + } +} + } +} + +/** + * Filter that allows loading a fraction of HDFS files. + */ +private class SamplePathFilter extends Configured with PathFilter { --- End diff -- Actually, I take this comment back, since it would break on some pathological cases such as all the names being the same. When users want some samples, they most probably want a result that is a fraction of the original, whatever it may contain. @jkbradley do you prefer a something that may not be deterministic (using random numbers) or deterministic but not respecting the sampling ratio in pathological cases? The only way to do both that I can think of is deduplicating, which requires a shuffle. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
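A toy illustration of the pathological case mentioned above, using hypothetical file names: when every name is identical, any filter keyed on a hash of the name keeps either everything or nothing, regardless of the requested fraction.

```scala
val names = Seq.fill(100)("part-00000.png")  // hypothetical, all identical
val fraction = 0.1
val kept = names.count { n =>
  math.abs(n.hashCode.toLong).toDouble / Int.MaxValue.toDouble < fraction
}
// kept is either 0 or 100 here, never close to the requested 10.
```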
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r150247999 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import scala.language.existentials +import scala.util.Random + +import org.apache.commons.io.FilenameUtils +import org.apache.hadoop.conf.{Configuration, Configured} +import org.apache.hadoop.fs.{Path, PathFilter} +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat + +import org.apache.spark.sql.SparkSession + +private object RecursiveFlag { + /** + * Sets the spark recursive flag and then restores it. + * + * @param value Value to set + * @param spark Existing spark session + * @param f The function to evaluate after setting the flag + * @return Returns the evaluation result T of the function + */ + def withRecursiveFlag[T](value: Boolean, spark: SparkSession)(f: => T): T = { +val flagName = FileInputFormat.INPUT_DIR_RECURSIVE +val hadoopConf = spark.sparkContext.hadoopConfiguration +val old = Option(hadoopConf.get(flagName)) +hadoopConf.set(flagName, value.toString) +try f finally { + old match { +case Some(v) => hadoopConf.set(flagName, v) +case None => hadoopConf.unset(flagName) + } +} + } +} + +/** + * Filter that allows loading a fraction of HDFS files. + */ +private class SamplePathFilter extends Configured with PathFilter { --- End diff -- Oh, it would be pretty simple to work it out this way (pseudocode). Note that a hash is a random variable between -2^31 and 2^31-1, so: ``` val fraction: Double = ??? // Something between 0 and 1 val pathname: String = ??? val hash = pathname.hashcode() // Could use some other other, more robust methods that return longs. val shouldKeep: Boolean = math.abs(hash) < (math.pow(2, 31) * fraction) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
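A cleaned-up version of the sketch above (still pseudocode in spirit, not the PR's implementation): the Scala method is `hashCode`, and the absolute value is taken on a Long so that `Int.MinValue` does not overflow.

```scala
val fraction: Double = ???  // something between 0 and 1
val pathname: String = ???

// hashCode is a value in [Int.MinValue, Int.MaxValue]; a longer, more robust
// hash could be substituted here.
val hash = pathname.hashCode
val shouldKeep: Boolean =
  math.abs(hash.toLong).toDouble < math.pow(2, 31) * fraction
```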
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r147661505 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,252 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes = Map( --- End diff -- Same thing here (and even more important): `Map[String, Int]` if I am not mistaken. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r147661396 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,252 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields = Array("origin", "height", "width", "nChannels", "mode", "data") --- End diff -- nit: since this is public, put the explicit type: `Array[String]`. Scala can have otherwise some surprising behavior here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
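A sketch of the explicit annotations suggested in this comment and in the `ocvTypes` comment above; the concrete OpenCV constants shown are illustrative only, not the values from the PR.

```scala
val imageFields: Array[String] =
  Array("origin", "height", "width", "nChannels", "mode", "data")

// Explicit Map[String, Int] so the public API does not leak a more specific,
// inferred collection type. Values shown are illustrative.
val ocvTypes: Map[String, Int] = Map(
  "Undefined" -> -1,
  "CV_8U" -> 0,
  "CV_8UC3" -> 16
)
```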
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r147661078 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala --- @@ -0,0 +1,120 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import scala.language.existentials +import scala.util.Random + +import org.apache.commons.io.FilenameUtils +import org.apache.hadoop.conf.{Configuration, Configured} +import org.apache.hadoop.fs.{Path, PathFilter} +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat + +import org.apache.spark.sql.SparkSession + +private object RecursiveFlag { + + /** + * Sets a value of spark recursive flag. + * If value is a None, it unsets the flag. + * + * @param value value to set + * @param spark existing spark session + * @return previous value of this flag + */ + def setRecursiveFlag(value: Option[String], spark: SparkSession): Option[String] = { +val flagName = FileInputFormat.INPUT_DIR_RECURSIVE +val hadoopConf = spark.sparkContext.hadoopConfiguration +val old = Option(hadoopConf.get(flagName)) + +value match { + case Some(v) => hadoopConf.set(flagName, v) + case None => hadoopConf.unset(flagName) +} + +old + } +} + +/** + * Filter that allows loading a fraction of HDFS files. + */ +private class SamplePathFilter extends Configured with PathFilter { + val random = new Random() + + // Ratio of files to be read from disk + var sampleRatio: Double = 1 + + override def setConf(conf: Configuration): Unit = { +if (conf != null) { + sampleRatio = conf.getDouble(SamplePathFilter.ratioParam, 1) +} + } + + override def accept(path: Path): Boolean = { +// Note: checking fileSystem.isDirectory is very slow here, so we use basic rules instead +!SamplePathFilter.isFile(path) || random.nextDouble() < sampleRatio --- End diff -- I would prefer that we do not use a seed and that the result is deterministic, based for example on some hash of the file name, to make it more robust to future code changes. That being said, there is no fundamental issues with the current implementation and other developers may have differing opinions, so the current implementation is fine as far as I am concerned. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/19439 @hhbyyh I recall now the reason for an extra `origin` field, which is to get around the standard issue of many small image files in S3 or other distributed file systems. It is standard to compact many small images into larger zip files, and the original `readImages` implementation could recursively traverse zip files to deal with that. This is a feature that we would like to add again at some point. When you compact multiple images in a single zip file, though, the filename is that of the zip file, so having an extra `origin` field is convenient to name the image correctly. This field is optional and this format is still experimental, so I do not think it is going to be an issue to deprecate this field if it is deemed to be too much trouble. Here is for example a relevant issue that we have in Deep Learning Pipelines, which is very representative of normal scenarios: https://github.com/databricks/spark-deep-learning/issues/67 The current workaround is suboptimal in terms of performance and user experience. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/19439 @hhbyyh regarding the data representation, one could indeed have each of the representations encoded with the proper array information. This brings some additional complexity for complex UDFs though, because they need to select the proper field, and the target implementations in C++ or TensorFlow can already cast the field to the proper type. I suggest we keep byte[] for now and see if there is a need for a more refined representation. For the `origin` field, @dakirsa or @imatiach-msft should have more context. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/19439 @hhbyyh thank you for bringing up these questions. In response to your questions:

> Does the current schema support or plan to support image feature data in Floats[] or Doubles[]?

It does, indirectly: this is what the field types `CV_32FXX` do. You need to do some low-level casting to convert the byte array to an array of numbers, but that should not be a limitation for most applications (a sketch of such a cast follows this message).

> Correct me if I'm wrong, I don't see ocvTypes plays any role in the code. If the field is for future extension. maybe It's better to keep only the supported types. But not all the OpenCV types.

Indeed. These fields are added so that users know what values are allowed for them. A Scala-friendly choice would have been sealed traits or enumerations, but the consensus in Spark has been for a low-level representation. Nothing precludes adding a case class to represent this dataset in the future, with more type-safety information.

> In most scenarios, deep learning applications use rescaled/cropped images (typically 256, 224 or smaller). Maybe add an extra parameter "smallSideSize" to the readImages method, which is more convenient for the users and we can avoid to cache the image of original size (which could be 100 times larger than the scaled image). This can be done in follow up PR.

This is a good point that we have all hit. The issue here is that there is no unique definition of rescaling (what interpolation? crop and scale? scale and crop?) and each library has made different choices. This is certainly something that would be good for a future PR.

> Not sure about the reason to include "origin" info into the image data. Based on my experience, path info serves better as a separate column in the DataFrame. (E.g. prediction)

Yes, this feature has been debated. Some developers have had a compelling need to access information about the origin of the image directly inside the image.

> IMO the parameter "recursive" may not be necessary. Existing wild card matching can provides more functions.

Indeed, this feature may not seem that useful at first glance. For some Hadoop file systems, though, in which images are accessed in a batched manner, it is useful to traverse these batches. This is important for performance reasons. This is why it is marked as experimental for the time being.

> For scala API, ImageSchema should be in a separate file but not to be mixed with image reading.

I do not have a strong opinion about this point; I will let other developers decide. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
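The "low-level casting" mentioned in the first answer above could look roughly like the following sketch for `CV_32F`-typed images. The helper name is hypothetical, and the little-endian byte order is an assumption that must match whatever produced the `data` field.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Reinterpret the image `data` byte array as an array of floats.
def bytesToFloats(data: Array[Byte]): Array[Float] = {
  val buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer()
  val out = new Array[Float](buf.remaining())
  buf.get(out)
  out
}
```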
[GitHub] spark pull request #19156: [SPARK-19634][FOLLOW-UP][ML] Improve interface of...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19156#discussion_r137603986 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -109,31 +108,47 @@ object Summarizer extends Logging { } @Since("2.3.0") - def mean(col: Column): Column = getSingleMetric(col, "mean") + def mean(col: Column, weightCol: Column = lit(1.0)): Column = { --- End diff -- I am not a fan of default parameters; they tend to cause issues with binary compatibility. Unless you have a good reason, you should have two different functions:
```scala
def mean(col: Column): Column = mean(col, lit(1.0))
def mean(col: Column, weightCol: Column): Column = ...
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/18798 Thank you @yanboliang. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/18798 @yanboliang do you feel comfortable merging this PR? I think that all the questions have been addressed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r131971123 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,587 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import java.io._ + +import org.apache.spark.annotation.Since +import org.apache.spark.internal.Logging +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.sql.Column +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{Expression, UnsafeArrayData} +import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Complete, TypedImperativeAggregate} +import org.apache.spark.sql.catalyst.util.ArrayData +import org.apache.spark.sql.functions.lit +import org.apache.spark.sql.types._ + +/** + * A builder object that provides summary statistics about a given column. + * + * Users should not directly create such builders, but instead use one of the methods in + * [[Summarizer]]. + */ +@Since("2.3.0") +abstract class SummaryBuilder { + /** + * Returns an aggregate object that contains the summary of the column with the requested metrics. + * @param featuresCol a column that contains features Vector object. + * @param weightCol a column that contains weight value. + * @return an aggregate column that contains the statistics. The exact content of this + * structure is determined during the creation of the builder. + */ + @Since("2.3.0") + def summary(featuresCol: Column, weightCol: Column): Column + + @Since("2.3.0") + def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0)) +} + +/** + * Tools for vectorized statistics on MLlib Vectors. + * + * The methods in this package provide various statistics for Vectors contained inside DataFrames. + * + * This class lets users pick the statistics they would like to extract for a given column. Here is + * an example in Scala: + * {{{ + * val dataframe = ... // Some dataframe containing a feature column + * val allStats = dataframe.select(Summarizer.metrics("min", "max").summary($"features")) + * val Row(min_, max_) = allStats.first() + * }}} + * + * If one wants to get a single metric, shortcuts are also available: + * {{{ + * val meanDF = dataframe.select(Summarizer.mean($"features")) + * val Row(mean_) = meanDF.first() + * }}} + */ --- End diff -- Put a comment about performance here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r131970836 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,587 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import java.io._ + +import org.apache.spark.annotation.Since +import org.apache.spark.internal.Logging +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.sql.Column +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{Expression, UnsafeArrayData} +import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Complete, TypedImperativeAggregate} +import org.apache.spark.sql.catalyst.util.ArrayData +import org.apache.spark.sql.functions.lit +import org.apache.spark.sql.types._ + +/** + * A builder object that provides summary statistics about a given column. + * + * Users should not directly create such builders, but instead use one of the methods in + * [[Summarizer]]. + */ +@Since("2.3.0") +abstract class SummaryBuilder { + /** + * Returns an aggregate object that contains the summary of the column with the requested metrics. + * @param featuresCol a column that contains features Vector object. + * @param weightCol a column that contains weight value. + * @return an aggregate column that contains the statistics. The exact content of this + * structure is determined during the creation of the builder. + */ + @Since("2.3.0") + def summary(featuresCol: Column, weightCol: Column): Column + + @Since("2.3.0") + def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0)) +} + +/** + * Tools for vectorized statistics on MLlib Vectors. + * + * The methods in this package provide various statistics for Vectors contained inside DataFrames. + * + * This class lets users pick the statistics they would like to extract for a given column. Here is + * an example in Scala: + * {{{ + * val dataframe = ... // Some dataframe containing a feature column + * val allStats = dataframe.select(Summarizer.metrics("min", "max").summary($"features")) + * val Row(min_, max_) = allStats.first() + * }}} + * + * If one wants to get a single metric, shortcuts are also available: + * {{{ + * val meanDF = dataframe.select(Summarizer.mean($"features")) + * val Row(mean_) = meanDF.first() + * }}} + */ +@Since("2.3.0") +object Summarizer extends Logging { + + import SummaryBuilderImpl._ + + /** + * Given a list of metrics, provides a builder that it turns computes metrics from a column. + * + * See the documentation of [[Summarizer]] for an example. 
+ * + * The following metrics are accepted (case sensitive): + * - mean: a vector that contains the coefficient-wise mean. + * - variance: a vector tha contains the coefficient-wise variance. + * - count: the count of all vectors seen. + * - numNonzeros: a vector with the number of non-zeros for each coefficients + * - max: the maximum for each coefficient. + * - min: the minimum for each coefficient. + * - normL2: the Euclidian norm for each coefficient. + * - normL1: the L1 norm of each coefficient (sum of the absolute values). + * @param firstMetric the metric being provided + * @param metrics additional metrics that can be provided. + * @return a builder. + * @throws IllegalArgumentException if one of the metric names is not understood. + */ + @Since("2.3.0") --- End diff -- Let's put a comment about performance to indicate that it is about 3x slower than using the RDD interface. --- If your proje
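The performance note requested above could be as simple as an extra scaladoc line on `metrics`; the wording below is only a suggestion, with the 3x figure taken from this review thread.

```scala
/**
 * @note The DataFrame-based aggregation is currently about 3x slower than the
 *       RDD-based MultivariateOnlineSummarizer, so performance-critical callers
 *       may prefer the RDD interface for now.
 */
```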
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130742319 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,633 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import java.io._ + +import org.apache.spark.annotation.Since +import org.apache.spark.internal.Logging +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.sql.Column +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{Expression, UnsafeArrayData} +import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Complete, TypedImperativeAggregate} +import org.apache.spark.sql.catalyst.util.ArrayData +import org.apache.spark.sql.functions.lit +import org.apache.spark.sql.types._ + +/** + * A builder object that provides summary statistics about a given column. + * + * Users should not directly create such builders, but instead use one of the methods in + * [[Summarizer]]. + */ +@Since("2.2.0") +abstract class SummaryBuilder { + /** + * Returns an aggregate object that contains the summary of the column with the requested metrics. + * @param featuresCol a column that contains features Vector object. + * @param weightCol a column that contains weight value. + * @return an aggregate column that contains the statistics. The exact content of this + * structure is determined during the creation of the builder. + */ + @Since("2.2.0") + def summary(featuresCol: Column, weightCol: Column): Column + + @Since("2.2.0") + def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0)) +} + +/** + * Tools for vectorized statistics on MLlib Vectors. + * + * The methods in this package provide various statistics for Vectors contained inside DataFrames. + * + * This class lets users pick the statistics they would like to extract for a given column. Here is + * an example in Scala: + * {{{ + * val dataframe = ... // Some dataframe containing a feature column + * val allStats = dataframe.select(Summarizer.metrics("min", "max").summary($"features")) + * val Row(min_, max_) = allStats.first() + * }}} + * + * If one wants to get a single metric, shortcuts are also available: + * {{{ + * val meanDF = dataframe.select(Summarizer.mean($"features")) + * val Row(mean_) = meanDF.first() + * }}} + */ +@Since("2.2.0") +object Summarizer extends Logging { + + import SummaryBuilderImpl._ + + /** + * Given a list of metrics, provides a builder that it turns computes metrics from a column. + * + * See the documentation of [[Summarizer]] for an example. 
+ * + * The following metrics are accepted (case sensitive): + * - mean: a vector that contains the coefficient-wise mean. + * - variance: a vector tha contains the coefficient-wise variance. + * - count: the count of all vectors seen. + * - numNonzeros: a vector with the number of non-zeros for each coefficients + * - max: the maximum for each coefficient. + * - min: the minimum for each coefficient. + * - normL2: the Euclidian norm for each coefficient. + * - normL1: the L1 norm of each coefficient (sum of the absolute values). + * @param firstMetric the metric being provided + * @param metrics additional metrics that can be provided. + * @return a builder. + * @throws IllegalArgumentException if one of the metric names is not understood. + */ + @Since("2.2.0") + def metrics(firstMetric: String, metrics: String*): SummaryBuilder = { +val (typedMetrics, computeMetrics) = getRelevantMetrics(Seq(firstMetric
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130741880 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,633 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import java.io._ + +import org.apache.spark.annotation.Since +import org.apache.spark.internal.Logging +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.sql.Column +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{Expression, UnsafeArrayData} +import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Complete, TypedImperativeAggregate} +import org.apache.spark.sql.catalyst.util.ArrayData +import org.apache.spark.sql.functions.lit +import org.apache.spark.sql.types._ + +/** + * A builder object that provides summary statistics about a given column. + * + * Users should not directly create such builders, but instead use one of the methods in + * [[Summarizer]]. + */ +@Since("2.2.0") +abstract class SummaryBuilder { + /** + * Returns an aggregate object that contains the summary of the column with the requested metrics. + * @param featuresCol a column that contains features Vector object. + * @param weightCol a column that contains weight value. + * @return an aggregate column that contains the statistics. The exact content of this + * structure is determined during the creation of the builder. + */ + @Since("2.2.0") + def summary(featuresCol: Column, weightCol: Column): Column + + @Since("2.2.0") + def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0)) +} + +/** + * Tools for vectorized statistics on MLlib Vectors. + * + * The methods in this package provide various statistics for Vectors contained inside DataFrames. + * + * This class lets users pick the statistics they would like to extract for a given column. Here is + * an example in Scala: + * {{{ + * val dataframe = ... // Some dataframe containing a feature column + * val allStats = dataframe.select(Summarizer.metrics("min", "max").summary($"features")) + * val Row(min_, max_) = allStats.first() + * }}} + * + * If one wants to get a single metric, shortcuts are also available: + * {{{ + * val meanDF = dataframe.select(Summarizer.mean($"features")) + * val Row(mean_) = meanDF.first() + * }}} + */ +@Since("2.2.0") +object Summarizer extends Logging { + + import SummaryBuilderImpl._ + + /** + * Given a list of metrics, provides a builder that it turns computes metrics from a column. + * + * See the documentation of [[Summarizer]] for an example. 
+ * + * The following metrics are accepted (case sensitive): + * - mean: a vector that contains the coefficient-wise mean. + * - variance: a vector tha contains the coefficient-wise variance. + * - count: the count of all vectors seen. + * - numNonzeros: a vector with the number of non-zeros for each coefficients + * - max: the maximum for each coefficient. + * - min: the minimum for each coefficient. + * - normL2: the Euclidian norm for each coefficient. + * - normL1: the L1 norm of each coefficient (sum of the absolute values). + * @param firstMetric the metric being provided + * @param metrics additional metrics that can be provided. + * @return a builder. + * @throws IllegalArgumentException if one of the metric names is not understood. + */ + @Since("2.2.0") + def metrics(firstMetric: String, metrics: String*): SummaryBuilder = { +val (typedMetrics, computeMetrics) = getRelevantMetrics(Seq(firstMetric
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130742836 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,633 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import java.io._ + +import org.apache.spark.annotation.Since +import org.apache.spark.internal.Logging +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.sql.Column +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{Expression, UnsafeArrayData} +import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Complete, TypedImperativeAggregate} +import org.apache.spark.sql.catalyst.util.ArrayData +import org.apache.spark.sql.functions.lit +import org.apache.spark.sql.types._ + +/** + * A builder object that provides summary statistics about a given column. + * + * Users should not directly create such builders, but instead use one of the methods in + * [[Summarizer]]. + */ +@Since("2.2.0") +abstract class SummaryBuilder { + /** + * Returns an aggregate object that contains the summary of the column with the requested metrics. + * @param featuresCol a column that contains features Vector object. + * @param weightCol a column that contains weight value. + * @return an aggregate column that contains the statistics. The exact content of this + * structure is determined during the creation of the builder. + */ + @Since("2.2.0") + def summary(featuresCol: Column, weightCol: Column): Column + + @Since("2.2.0") + def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0)) +} + +/** + * Tools for vectorized statistics on MLlib Vectors. + * + * The methods in this package provide various statistics for Vectors contained inside DataFrames. + * + * This class lets users pick the statistics they would like to extract for a given column. Here is + * an example in Scala: + * {{{ + * val dataframe = ... // Some dataframe containing a feature column + * val allStats = dataframe.select(Summarizer.metrics("min", "max").summary($"features")) + * val Row(min_, max_) = allStats.first() + * }}} + * + * If one wants to get a single metric, shortcuts are also available: + * {{{ + * val meanDF = dataframe.select(Summarizer.mean($"features")) + * val Row(mean_) = meanDF.first() + * }}} + */ +@Since("2.2.0") +object Summarizer extends Logging { + + import SummaryBuilderImpl._ + + /** + * Given a list of metrics, provides a builder that it turns computes metrics from a column. + * + * See the documentation of [[Summarizer]] for an example. 
+ * + * The following metrics are accepted (case sensitive): + * - mean: a vector that contains the coefficient-wise mean. + * - variance: a vector tha contains the coefficient-wise variance. + * - count: the count of all vectors seen. + * - numNonzeros: a vector with the number of non-zeros for each coefficients + * - max: the maximum for each coefficient. + * - min: the minimum for each coefficient. + * - normL2: the Euclidian norm for each coefficient. + * - normL1: the L1 norm of each coefficient (sum of the absolute values). + * @param firstMetric the metric being provided + * @param metrics additional metrics that can be provided. + * @return a builder. + * @throws IllegalArgumentException if one of the metric names is not understood. + */ + @Since("2.2.0") + def metrics(firstMetric: String, metrics: String*): SummaryBuilder = { +val (typedMetrics, computeMetrics) = getRelevantMetrics(Seq(firstMetric
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130742759 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,633 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import java.io._ + +import org.apache.spark.annotation.Since +import org.apache.spark.internal.Logging +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.sql.Column +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{Expression, UnsafeArrayData} +import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Complete, TypedImperativeAggregate} +import org.apache.spark.sql.catalyst.util.ArrayData +import org.apache.spark.sql.functions.lit +import org.apache.spark.sql.types._ + +/** + * A builder object that provides summary statistics about a given column. + * + * Users should not directly create such builders, but instead use one of the methods in + * [[Summarizer]]. + */ +@Since("2.2.0") +abstract class SummaryBuilder { + /** + * Returns an aggregate object that contains the summary of the column with the requested metrics. + * @param featuresCol a column that contains features Vector object. + * @param weightCol a column that contains weight value. + * @return an aggregate column that contains the statistics. The exact content of this + * structure is determined during the creation of the builder. + */ + @Since("2.2.0") + def summary(featuresCol: Column, weightCol: Column): Column + + @Since("2.2.0") + def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0)) +} + +/** + * Tools for vectorized statistics on MLlib Vectors. + * + * The methods in this package provide various statistics for Vectors contained inside DataFrames. + * + * This class lets users pick the statistics they would like to extract for a given column. Here is + * an example in Scala: + * {{{ + * val dataframe = ... // Some dataframe containing a feature column + * val allStats = dataframe.select(Summarizer.metrics("min", "max").summary($"features")) + * val Row(min_, max_) = allStats.first() + * }}} + * + * If one wants to get a single metric, shortcuts are also available: + * {{{ + * val meanDF = dataframe.select(Summarizer.mean($"features")) + * val Row(mean_) = meanDF.first() + * }}} + */ +@Since("2.2.0") +object Summarizer extends Logging { + + import SummaryBuilderImpl._ + + /** + * Given a list of metrics, provides a builder that it turns computes metrics from a column. + * + * See the documentation of [[Summarizer]] for an example. 
+ * + * The following metrics are accepted (case sensitive): + * - mean: a vector that contains the coefficient-wise mean. + * - variance: a vector tha contains the coefficient-wise variance. + * - count: the count of all vectors seen. + * - numNonzeros: a vector with the number of non-zeros for each coefficients + * - max: the maximum for each coefficient. + * - min: the minimum for each coefficient. + * - normL2: the Euclidian norm for each coefficient. + * - normL1: the L1 norm of each coefficient (sum of the absolute values). + * @param firstMetric the metric being provided + * @param metrics additional metrics that can be provided. + * @return a builder. + * @throws IllegalArgumentException if one of the metric names is not understood. + */ + @Since("2.2.0") + def metrics(firstMetric: String, metrics: String*): SummaryBuilder = { +val (typedMetrics, computeMetrics) = getRelevantMetrics(Seq(firstMetric
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130742524 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,633 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import java.io._ + +import org.apache.spark.annotation.Since +import org.apache.spark.internal.Logging +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.sql.Column +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{Expression, UnsafeArrayData} +import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Complete, TypedImperativeAggregate} +import org.apache.spark.sql.catalyst.util.ArrayData +import org.apache.spark.sql.functions.lit +import org.apache.spark.sql.types._ + +/** + * A builder object that provides summary statistics about a given column. + * + * Users should not directly create such builders, but instead use one of the methods in + * [[Summarizer]]. + */ +@Since("2.2.0") +abstract class SummaryBuilder { + /** + * Returns an aggregate object that contains the summary of the column with the requested metrics. + * @param featuresCol a column that contains features Vector object. + * @param weightCol a column that contains weight value. + * @return an aggregate column that contains the statistics. The exact content of this + * structure is determined during the creation of the builder. + */ + @Since("2.2.0") + def summary(featuresCol: Column, weightCol: Column): Column + + @Since("2.2.0") + def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0)) +} + +/** + * Tools for vectorized statistics on MLlib Vectors. + * + * The methods in this package provide various statistics for Vectors contained inside DataFrames. + * + * This class lets users pick the statistics they would like to extract for a given column. Here is + * an example in Scala: + * {{{ + * val dataframe = ... // Some dataframe containing a feature column + * val allStats = dataframe.select(Summarizer.metrics("min", "max").summary($"features")) + * val Row(min_, max_) = allStats.first() + * }}} + * + * If one wants to get a single metric, shortcuts are also available: + * {{{ + * val meanDF = dataframe.select(Summarizer.mean($"features")) + * val Row(mean_) = meanDF.first() + * }}} + */ +@Since("2.2.0") +object Summarizer extends Logging { + + import SummaryBuilderImpl._ + + /** + * Given a list of metrics, provides a builder that it turns computes metrics from a column. + * + * See the documentation of [[Summarizer]] for an example. 
+ * + * The following metrics are accepted (case sensitive): + * - mean: a vector that contains the coefficient-wise mean. + * - variance: a vector tha contains the coefficient-wise variance. + * - count: the count of all vectors seen. + * - numNonzeros: a vector with the number of non-zeros for each coefficients + * - max: the maximum for each coefficient. + * - min: the minimum for each coefficient. + * - normL2: the Euclidian norm for each coefficient. + * - normL1: the L1 norm of each coefficient (sum of the absolute values). + * @param firstMetric the metric being provided + * @param metrics additional metrics that can be provided. + * @return a builder. + * @throws IllegalArgumentException if one of the metric names is not understood. + */ + @Since("2.2.0") + def metrics(firstMetric: String, metrics: String*): SummaryBuilder = { +val (typedMetrics, computeMetrics) = getRelevantMetrics(Seq(firstMetric
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130742933 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,633 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import java.io._ + +import org.apache.spark.annotation.Since +import org.apache.spark.internal.Logging +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.sql.Column +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{Expression, UnsafeArrayData} +import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Complete, TypedImperativeAggregate} +import org.apache.spark.sql.catalyst.util.ArrayData +import org.apache.spark.sql.functions.lit +import org.apache.spark.sql.types._ + +/** + * A builder object that provides summary statistics about a given column. + * + * Users should not directly create such builders, but instead use one of the methods in + * [[Summarizer]]. + */ +@Since("2.2.0") +abstract class SummaryBuilder { + /** + * Returns an aggregate object that contains the summary of the column with the requested metrics. + * @param featuresCol a column that contains features Vector object. + * @param weightCol a column that contains weight value. + * @return an aggregate column that contains the statistics. The exact content of this + * structure is determined during the creation of the builder. + */ + @Since("2.2.0") + def summary(featuresCol: Column, weightCol: Column): Column + + @Since("2.2.0") + def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0)) +} + +/** + * Tools for vectorized statistics on MLlib Vectors. + * + * The methods in this package provide various statistics for Vectors contained inside DataFrames. + * + * This class lets users pick the statistics they would like to extract for a given column. Here is + * an example in Scala: + * {{{ + * val dataframe = ... // Some dataframe containing a feature column + * val allStats = dataframe.select(Summarizer.metrics("min", "max").summary($"features")) + * val Row(min_, max_) = allStats.first() + * }}} + * + * If one wants to get a single metric, shortcuts are also available: + * {{{ + * val meanDF = dataframe.select(Summarizer.mean($"features")) + * val Row(mean_) = meanDF.first() + * }}} + */ +@Since("2.2.0") +object Summarizer extends Logging { + + import SummaryBuilderImpl._ + + /** + * Given a list of metrics, provides a builder that it turns computes metrics from a column. + * + * See the documentation of [[Summarizer]] for an example. 
+ * + * The following metrics are accepted (case sensitive): + * - mean: a vector that contains the coefficient-wise mean. + * - variance: a vector that contains the coefficient-wise variance. + * - count: the count of all vectors seen. + * - numNonzeros: a vector with the number of non-zeros for each coefficient. + * - max: the maximum for each coefficient. + * - min: the minimum for each coefficient. + * - normL2: the Euclidean norm for each coefficient. + * - normL1: the L1 norm of each coefficient (sum of the absolute values). + * @param firstMetric the metric being provided + * @param metrics additional metrics that can be provided. + * @return a builder. + * @throws IllegalArgumentException if one of the metric names is not understood. + */ + @Since("2.2.0") + def metrics(firstMetric: String, metrics: String*): SummaryBuilder = { +val (typedMetrics, computeMetrics) = getRelevantMetrics(Seq(firstMetric
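Not part of the diff, but for readers skimming this thread, here is a minimal usage sketch of the builder described in the quoted Javadoc (assuming the API lands as written there; the DataFrame `df` and the `features` column name are purely illustrative):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.Row

// Assumed setup: `df` is a DataFrame with a Vector column named "features".
// Pick the metrics to compute; the result is a single struct column whose
// fields appear in the order they were requested.
val summaryCol = Summarizer.metrics("mean", "variance").summary(df("features")).as("stats")
val Row(Row(mean: Vector, variance: Vector)) = df.select(summaryCol).first()
```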
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130743131 --- Diff: mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala --- @@ -0,0 +1,619 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import org.scalatest.exceptions.TestFailedException + +import org.apache.spark.{SparkException, SparkFunSuite} +import org.apache.spark.ml.linalg.{Vector, Vectors} +import org.apache.spark.ml.util.TestingUtils._ +import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors} +import org.apache.spark.mllib.stat.{MultivariateOnlineSummarizer, Statistics} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema + +class SummarizerSuite extends SparkFunSuite with MLlibTestSparkContext { + + import testImplicits._ + import Summarizer._ + import SummaryBuilderImpl._ + + private case class ExpectedMetrics( + mean: Seq[Double], + variance: Seq[Double], + count: Long, + numNonZeros: Seq[Long], + max: Seq[Double], + min: Seq[Double], + normL2: Seq[Double], + normL1: Seq[Double]) + + // The input is expected to be either a sparse vector, a dense vector or an array of doubles + // (which will be converted to a dense vector) + // The expected is the list of all the known metrics. + // + // The tests take an list of input vectors and a list of all the summary values that + // are expected for this input. They currently test against some fixed subset of the + // metrics, but should be made fuzzy in the future. + + private def testExample(name: String, input: Seq[Any], exp: ExpectedMetrics): Unit = { +def inputVec: Seq[Vector] = input.map { + case x: Array[Double @unchecked] => Vectors.dense(x) + case x: Seq[Double @unchecked] => Vectors.dense(x.toArray) + case x: Vector => x + case x => throw new Exception(x.toString) +} + +val s = { + val s2 = new MultivariateOnlineSummarizer + inputVec.foreach(v => s2.add(OldVectors.fromML(v))) + s2 +} + +// Because the Spark context is reset between tests, we cannot hold a reference onto it. 
+def wrapped() = { + val df = sc.parallelize(inputVec).map(Tuple1.apply).toDF("features") + val c = df.col("features") + (df, c) +} + +registerTest(s"$name - mean only") { + val (df, c) = wrapped() + compare(df.select(metrics("mean").summary(c), mean(c)), Seq(Row(exp.mean), s.mean)) +} + +registerTest(s"$name - mean only (direct)") { + val (df, c) = wrapped() + compare(df.select(mean(c)), Seq(exp.mean)) +} + +registerTest(s"$name - variance only") { + val (df, c) = wrapped() + compare(df.select(metrics("variance").summary(c), variance(c)), +Seq(Row(exp.variance), s.variance)) +} + +registerTest(s"$name - variance only (direct)") { + val (df, c) = wrapped() + compare(df.select(variance(c)), Seq(s.variance)) +} + +registerTest(s"$name - count only") { + val (df, c) = wrapped() + compare(df.select(metrics("count").summary(c), count(c)), +Seq(Row(exp.count), exp.count)) +} + +registerTest(s"$name - count only (direct)") { + val (df, c) = wrapped() + compare(df.select(count(c)), +Seq(exp.count)) +} + +registerTest(s"$name - numNonZeros only") { + val (df, c) = wrapped() + compare(df.select(metrics("numNonZ
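The `compare` helper these tests call is not shown in this hunk; the sketch below is only a guess at its shape (a hypothetical illustration, not the PR's actual implementation), included to make the test structure easier to follow. It assumes the TestingUtils approximate-equality operators quoted in the imports above.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.util.TestingUtils._
import org.apache.spark.sql.DataFrame

// Hypothetical sketch: collect the single result row and compare each field
// against the expected value, using approximate equality for vectors.
private def compare(df: DataFrame, expected: Seq[Any]): Unit = {
  val Array(row) = df.collect()
  val actual = row.toSeq
  assert(actual.size == expected.size, s"expected ${expected.size} fields, got ${actual.size}")
  actual.zip(expected).foreach {
    case (a: Vector, e: Vector) => assert(a ~== e absTol 1e-6)
    case (a, e) => assert(a == e)
  }
}
```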
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130741348 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,633 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import java.io._ + +import org.apache.spark.annotation.Since +import org.apache.spark.internal.Logging +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.sql.Column +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{Expression, UnsafeArrayData} +import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Complete, TypedImperativeAggregate} +import org.apache.spark.sql.catalyst.util.ArrayData +import org.apache.spark.sql.functions.lit +import org.apache.spark.sql.types._ + +/** + * A builder object that provides summary statistics about a given column. + * + * Users should not directly create such builders, but instead use one of the methods in + * [[Summarizer]]. + */ +@Since("2.2.0") --- End diff -- this is not going to be 2.2 anymore --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17419 I am going to close this PR, since this is being taken over by @WeichenXu123 in #18798 . --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17419: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user thunterdb closed the pull request at: https://github.com/apache/spark/pull/17419 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/18798 cc @hvanhovell as well. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/18798 Thank you for the performance numbers @WeichenXu123, I have a couple of comments: - you say that SQL uses adaptive compaction. How bad is that? I assume it adds some overhead. - did you just run each experiment once? I would be interested in error bars on these numbers, as it can take up to 30 seconds for the JVM to warm up and optimize the byte code. You should report the geometric mean or the median time of running these experiments to make sure that you are not skewed by outliers. Some others will probably have some good advice as well. - from the performance numbers, there are 2 different regimes: small vectors, and big vectors (for which even the DataFrame -> RDD conversion is faster than working straight with DataFrames). I would be curious to know the bottlenecks for each case. If we trust these numbers, the overall conclusion is that the SQL interface adds a 2x-3x performance overhead over RDDs for the time being. @cloud-fan @liancheng are there still some low-hanging fruits that could be merged into SQL? This state of affairs is of course far from great, but I am in favor of merging this piece and improving it iteratively with the help of the SQL team, as this code is easy to benchmark and representative of the rest of MLlib once we start to rely more on DataFrames and Catalyst, and less on RDDs. @yanboliang @viirya @kiszk what are your thoughts? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
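To make the warm-up and median suggestion above concrete, here is a minimal, illustrative measurement loop; this is not code from the PR, and the trial counts are arbitrary.

```scala
// Illustrative only: warm up the JIT first, then report the median of several
// timed trials so that a single outlier run does not skew the comparison.
def medianTimeMillis(warmups: Int, trials: Int)(body: => Unit): Double = {
  (0 until warmups).foreach(_ => body)            // let the JVM compile hot paths
  val times = (0 until trials).map { _ =>
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1e6             // elapsed milliseconds
  }.sorted
  times(times.length / 2)                         // median, robust to outliers
}

// Usage (hypothetical): medianTimeMillis(warmups = 3, trials = 10) { runBenchmarkedQuery() }
```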
[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/18798 @WeichenXu123 thanks! Can you post some performance numbers as well? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18281: [SPARK-21027][ML][PYTHON] Added tunable paralleli...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18281#discussion_r121790182 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala --- @@ -325,8 +343,13 @@ final class OneVsRest @Since("1.4.0") ( multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK) } +val iters = Range(0, numClasses).par --- End diff -- @jkbradley thanks for calling this out. Indeed, the code as it stands may cause some non-deterministic issues in complex environments. I put a comment about that in that PR: https://github.com/apache/spark/pull/16774 See how it is done in this file: https://github.com/apache/spark/pull/16774/files#diff-d14539cadce0fba9d8b7d970adaf8b26 It should be a quick change. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
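For context, one common way to bound the threads used by a Scala parallel collection is to give it an explicit task support, as sketched below. This is only an illustration of the idea; the referenced PRs may handle the thread pool differently (for example with Futures on a dedicated execution context), and the parallelism value here is an assumption.

```scala
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool  // java.util.concurrent.ForkJoinPool on Scala 2.12+

// Illustrative sketch: cap the parallelism of the .par collection instead of
// relying on the default global pool, whose sizing can behave unpredictably in
// complex environments (nested parallelism, shared executors, etc.).
val numClasses = 10       // would come from the estimator in the real code
val parallelism = 4       // assumed value for illustration
val iters = Range(0, numClasses).par
iters.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(parallelism))
iters.foreach { classIndex =>
  // train the per-class binary classifier here
}
```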
[GitHub] spark pull request #17419: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17419#discussion_r109063248 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -0,0 +1,746 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import breeze.{linalg => la} +import breeze.linalg.{Vector => BV} +import breeze.numerics + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.internal.Logging +import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors, VectorUDT} +import org.apache.spark.sql.Column +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{Expression, UnsafeArrayData, UnsafeProjection, UnsafeRow} +import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Complete, TypedImperativeAggregate} +import org.apache.spark.sql.types._ + + +/** + * A builder object that provides summary statistics about a given column. + * + * Users should not directly create such builders, but instead use one of the methods in + * [[Summarizer]]. + */ +@Since("2.2.0") +abstract class SummaryBuilder { + /** + * Returns an aggregate object that contains the summary of the column with the requested metrics. + * @param column a column that contains Vector object. + * @return an aggregate column that contains the statistics. The exact content of this + * structure is determined during the creation of the builder. + */ + @Since("2.2.0") + def summary(column: Column): Column +} + +/** + * Tools for vectorized statistics on MLlib Vectors. + * + * The methods in this package provide various statistics for Vectors contained inside DataFrames. + * + * This class lets users pick the statistics they would like to extract for a given column. Here is + * an example in Scala: + * {{{ + * val dataframe = ... // Some dataframe containing a feature column + * val allStats = dataframe.select(Summarizer.metrics("min", "max").summary($"features")) + * val Row(min_, max_) = allStats.first() + * }}} + * + * If one wants to get a single metric, shortcuts are also available: + * {{{ + * val meanDF = dataframe.select(Summarizer.mean($"features")) + * val Row(mean_) = meanDF.first() + * }}} + */ +@Since("2.2.0") +object Summarizer extends Logging { + + import SummaryBuilderImpl._ + + /** + * Given a list of metrics, provides a builder that it turns computes metrics from a column. + * + * See the documentation of [[Summarizer]] for an example. + * + * The following metrics are accepted (case sensitive): + * - mean: a vector that contains the coefficient-wise mean. + * - variance: a vector tha contains the coefficient-wise variance. 
+ * - count: the count of all vectors seen. + * - numNonzeros: a vector with the number of non-zeros for each coefficient. + * - max: the maximum for each coefficient. + * - min: the minimum for each coefficient. + * - normL2: the Euclidean norm for each coefficient. + * - normL1: the L1 norm of each coefficient (sum of the absolute values). + * @param firstMetric the metric being provided + * @param metrics additional metrics that can be provided. + * @return a builder. + * @throws IllegalArgumentException if one of the metric names is not understood. + */ + @Since("2.2.0") + def metrics(firstMetric: String, metrics: String*): SummaryBuilder = { +val (typedMetrics, computeMetrics) = getRelevantMetrics(Seq(firstMetric) ++ metrics) +new SummaryBuilderImpl(typedMetrics, computeMetrics) + } + + def mean(col: Column): Column = ge
[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17419 I looked a bit deeper into the performance aspect. Here are some quick insights: - there was an immediate bottleneck in `VectorUDT`; fixing it already boosts performance by 3x - it is not clear if switching to pure Breeze operations helps, given the overhead for tiny vectors. I will need to do more analysis on larger vectors. - now, most of the time is roughly split between `ObjectAggregationIterator.processInputs` (40%), some codegen'ed expression (20%), and our own `MetricsAggregate.update` (35%) That benchmark focuses on the overhead of Catalyst. I will do another benchmark with dense vectors to see how it fares in practice with more real data. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17419: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17419#discussion_r108743634 --- Diff: mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala --- @@ -335,4 +335,65 @@ class SummarizerSuite extends SparkFunSuite with MLlibTestSparkContext { assert(Buffer.totalCount(summarizer) === 6) } + // TODO: this test should not be committed. It is here to isolate some performance hotspots. + test("perf test") { +val n = 1000 +val rdd1 = sc.parallelize(1 to n).map { idx => + OldVectors.dense(idx.toDouble) +} +val trieouts = 10 --- End diff -- I did not know about that class, thanks. It should stabilize the results. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17419 I have added a small perf test to find the performance bottlenecks. Note that this test works on the worst case (vectors of size 1) from the perspective of overhead. Here are the numbers I currently get. I will profile the code to see if there are some obvious targets for optimization: ``` [min ~ median ~ max], higher is better: RDD = [2482 ~ 46150 ~ 48354] records / milli dataframe (variance only) = [4217 ~ 4557 ~ 4848] records / milli dataframe = [2887 ~ 4420 ~ 4717] records / milli ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17419 @sethah it would have been nice, but I do not think we should merge it this late into the release cycle. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17419: [SPARK-19634][ML][WIP] Multivariate summarizer - ...
GitHub user thunterdb opened a pull request: https://github.com/apache/spark/pull/17419 [SPARK-19634][ML][WIP] Multivariate summarizer - dataframes API ## What changes were proposed in this pull request? This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of `MultivariateOnlineSummarizer`, it also allows the user to select a subset of the metrics. This should resolve some performance issues related to computing unrequested metrics. Furthermore, it uses the BLAS API to the extent possible, so that the given code should be efficient for the dense case. ## How was this patch tested? This patch includes most of the tests of the RDD-based. It compares results against the existing `MultivariateOnlineSummarizer` as well as adding more tests. This patch also includes some documentation for some low-level constructs such as `TypedImperativeAggregate`. ## Performance I have not run tests against the existing implementation. However, this patch uses the recommended low-level SQL APIs, so it should be interesting to compare both implementation in that respect. ## WIP Marked as WIP because some debugging comments are still present in the code. Thanks to @hvanhovell and Cheng Liang for suggestions on SparkSQL. You can merge this pull request into a Git repository by running: $ git pull https://github.com/thunterdb/spark 19634 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17419.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17419 commit f3fa6580bca70f3307d70e938ef8531c928d958b Author: Timothy Hunter Date: 2017-03-03T18:36:02Z work commit 7539835dad863a6b73d88d79983342f9ddb7fb9d Author: Timothy Hunter Date: 2017-03-06T22:38:41Z work on the test suite commit 673943f334b94e5d1ecd8874cb82bbc875d739c6 Author: Timothy Hunter Date: 2017-03-07T00:01:30Z last work commit 202b672afec127f4e0885cf3a58f4dfc97031fc6 Author: Timothy Hunter Date: 2017-03-13T22:48:47Z work on using imperative aggregators commit be019813f241d0ad3559b4d84339f1bb1055cbc4 Author: Timothy Hunter Date: 2017-03-17T21:44:40Z Merge remote-tracking branch 'upstream/master' into 19634 commit a983284cfeddabd017792e3991cf99a7d3ab1e16 Author: Timothy Hunter Date: 2017-03-18T00:14:40Z more work on summarizer commit 647a4fecb17d478c3c8cd68d40f2a9456eb10c66 Author: Timothy Hunter Date: 2017-03-21T17:47:30Z work commit 3c4bef772a3cbc759e43223af658a357c5ca6bc2 Author: Timothy Hunter Date: 2017-03-21T18:54:16Z changes commit 56390ccc456c67b2f7a08c1271fa50408518da0f Author: Timothy Hunter Date: 2017-03-21T18:54:19Z Merge remote-tracking branch 'upstream/master' into 19634 commit c3f236c4422031ae818cb6bbec2415b3f1bf7b70 Author: Timothy Hunter Date: 2017-03-21T19:03:07Z cleanup commit ef955c00275705f14342f3e4ed970a78f0f3c141 Author: Timothy Hunter Date: 2017-03-21T22:42:42Z debugging commit a04f923913ca1118a61d66bd53b8514af62594d7 Author: Timothy Hunter Date: 2017-03-21T23:14:23Z work commit 946d490c8b29e55ec0e6d40785122269063894ad Author: Timothy Hunter Date: 2017-03-22T21:14:29Z Merge remote-tracking branch 'upstream/master' into 19634 commit 201eb7712054967cd5093d3a908f4ebbd73f30a8 Author: Timothy Hunter Date: 2017-03-22T21:19:57Z debug commit f4dec88a49d0a20e1b328617fd721633fd8c201a Author: Timothy Hunter Date: 2017-03-23T18:27:19Z trying to debug serialization issue commit 4af0f47d326ef91d7cf9ccaf6a45ee3f904b191f Author: Timothy 
Hunter Date: 2017-03-23T23:16:10Z better tests commit 9f29030f75089884156bdc4ee634857b3730114d Author: Timothy Hunter Date: 2017-03-24T00:12:28Z changes commit e9877dc2f08d393f079bdf6fbbf1b9b9abaa21da Author: Timothy Hunter Date: 2017-03-24T21:04:32Z debugging commit 3a11d0265ef665a63cd070eeb1ae4ac25bc89908 Author: Timothy Hunter Date: 2017-03-24T22:14:06Z more tests and debugging commit 6d26c17d0bd4ab18d564ee7f37916780211702d5 Author: Timothy Hunter Date: 2017-03-24T23:12:19Z fixed tests commit 35eaeb0d02ae9cc29ae559231fe4858935315477 Author: Timothy Hunter Date: 2017-03-24T23:23:15Z doc --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For addition
[GitHub] spark issue #17108: [SPARK-19636][ML] Feature parity for correlation statist...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17108 Tickets created: - https://issues.apache.org/jira/browse/SPARK-20076 - https://issues.apache.org/jira/browse/SPARK-20077 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r107505718 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/util/TestingUtils.scala --- @@ -32,6 +32,10 @@ object TestingUtils { * the relative tolerance is meaningless, so the exception will be raised to warn users. */ private def RelativeErrorComparison(x: Double, y: Double, eps: Double): Boolean = { +// Special case for NaNs --- End diff -- @jkbradley I do not think this change is going to be controversial, but I want to point out that from now on, matrix/vector checks will not _always_ throw errors when comparing `NaN`: the previous code would throw whenever a NaN was found. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
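A minimal sketch of the behaviour change being flagged, assuming the usual relative-error check; the real TestingUtils implementation differs in its details (for instance it raises an error when the relative tolerance is meaningless for values near zero).

```scala
// Illustrative only: two NaNs now compare as "equal" instead of always failing,
// while a NaN compared against a finite number still fails.
private def relativeErrorComparison(x: Double, y: Double, eps: Double): Boolean = {
  if (x.isNaN && y.isNaN) {
    true                                   // both NaN: treated as matching
  } else if (x.isNaN || y.isNaN) {
    false                                  // only one NaN: never matches
  } else if (x == y) {
    true
  } else {
    math.abs(x - y) < eps * math.min(math.abs(x), math.abs(y))
  }
}
```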
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r107505201 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{SQLDataTypes, Vector} +import org.apache.spark.mllib.linalg.{Vectors => OldVectors} +import org.apache.spark.mllib.stat.{Statistics => OldStatistics} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.types.{StructField, StructType} + +/** + * API for statistical functions in MLlib, compatible with Dataframes and Datasets. + * + * The functions in this package generalize the functions in [[org.apache.spark.sql.Dataset.stat]] + * to spark.ml's Vector types. + */ +@Since("2.2.0") +@Experimental +object Correlations { --- End diff -- sure, I do not know if there is a convention for that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r107505212 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{SQLDataTypes, Vector} +import org.apache.spark.mllib.linalg.{Vectors => OldVectors} +import org.apache.spark.mllib.stat.{Statistics => OldStatistics} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.types.{StructField, StructType} + +/** + * API for statistical functions in MLlib, compatible with Dataframes and Datasets. --- End diff -- done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r107505215 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{SQLDataTypes, Vector} +import org.apache.spark.mllib.linalg.{Vectors => OldVectors} +import org.apache.spark.mllib.stat.{Statistics => OldStatistics} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.types.{StructField, StructType} + +/** + * API for statistical functions in MLlib, compatible with Dataframes and Datasets. --- End diff -- done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r107505185 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{SQLDataTypes, Vector} +import org.apache.spark.mllib.linalg.{Vectors => OldVectors} +import org.apache.spark.mllib.stat.{Statistics => OldStatistics} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.types.{StructField, StructType} + +/** + * API for statistical functions in MLlib, compatible with Dataframes and Datasets. + * + * The functions in this package generalize the functions in [[org.apache.spark.sql.Dataset.stat]] + * to spark.ml's Vector types. + */ +@Since("2.2.0") +@Experimental +object Correlations { + + /** + * Compute the correlation matrix for the input RDD of Vectors using the specified method. + * Methods currently supported: `pearson` (default), `spearman`. + * + * @param dataset A dataset or a dataframe + * @param column The name of the column of vectors for which the correlation coefficient needs + * to be computed. This must be a column of the dataset, and it must contain + * Vector objects. + * @param method String specifying the method to use for computing correlation. + * Supported: `pearson` (default), `spearman` + * @return A dataframe that contains the correlation matrix of the column of vectors. This + * dataframe contains a single row and a single column of name + * '$METHODNAME($COLUMN)'. + * @throws IllegalArgumentException if the column is not a valid column in the dataset, or if + * the content of this column is not of type Vector. + * + * Here is how to access the correlation coefficient: + * {{{ + *val data: Dataset[Vector] = ... + *val Row(coeff: Matrix) = Statistics.corr(data, "value").head + *// coeff now contains the Pearson correlation matrix. + * }}} + * + * @note For Spearman, a rank correlation, we need to create an RDD[Double] for each column + * and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector], + * which is fairly costly. Cache the input RDD before calling corr with `method = "spearman"` to + * avoid recomputing the common lineage. 
+ */ + @Since("2.2.0") + def corr(dataset: Dataset[_], column: String, method: String): DataFrame = { +val rdd = dataset.select(column).rdd.map { + case Row(v: Vector) => OldVectors.fromML(v) +} +val oldM = OldStatistics.corr(rdd, method) +val name = s"$method($column)" +val schema = StructType(Array(StructField(name, SQLDataTypes.MatrixType, nullable = true))) --- End diff -- Good point. It seems that Spark can be quite liberal with the nullability. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
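Concretely, the nullability remark is about the output schema in the quoted diff: the result row always carries a Matrix, so the field could just as well be declared non-nullable. A minimal, illustrative version of that change (not the PR's final code):

```scala
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{StructField, StructType}

// `name` stands for the "$method($column)" string built just above in the quoted diff.
val schema = StructType(Array(StructField(name, SQLDataTypes.MatrixType, nullable = false)))
```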
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r107505180 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/LinalgUtils.scala --- @@ -0,0 +1,54 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.util + +import breeze.linalg.{Matrix => BM} + +import org.apache.spark.internal.Logging + +/** + * Utility test methods for linear algebra. + */ +object LinalgUtils extends Logging { --- End diff -- You are right, I had missed that file. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/16483 It looks good to me. cc @jkbradley or @mengxr for final approval --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106746316 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala --- @@ -322,13 +335,12 @@ object PageRank extends Logging { def personalizedVertexProgram(id: VertexId, attr: (Double, Double), msgSum: Double): (Double, Double) = { val (oldPR, lastDelta) = attr - var teleport = oldPR - val delta = if (src==id) resetProb else 0.0 - teleport = oldPR*delta - - val newPR = teleport + (1.0 - resetProb) * msgSum - val newDelta = if (lastDelta == Double.NegativeInfinity) newPR else newPR - oldPR - (newPR, newDelta) + val newPR = if (lastDelta == Double.NegativeInfinity) { --- End diff -- I agree that the new code is easier to follow in that respect. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/16483 In addition, this introduces an extra reduction step at each iteration. I am fine with that since it is for correctness, but @jkbradley may want to comment as well. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106529377 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala --- @@ -353,9 +365,19 @@ object PageRank extends Logging { vertexProgram(id, attr, msgSum) } -Pregel(pagerankGraph, initialMessage, activeDirection = EdgeDirection.Out)( +val rankGraph = Pregel(pagerankGraph, initialMessage, activeDirection = EdgeDirection.Out)( vp, sendMessage, messageCombiner) .mapVertices((vid, attr) => attr._1) - } // end of deltaPageRank + +// If the graph has sinks (vertices with no outgoing edges) the sum of ranks will not be correct --- End diff -- This is the same code as above, please factor it into a function. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
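To illustrate the factoring being requested, here is a hedged sketch of what a shared helper could look like; the name, signature, and exact normalization are assumptions for illustration, not this PR's actual code.

```scala
import org.apache.spark.graphx.Graph

// Illustrative sketch: rescale the final ranks so that they sum to the number of
// vertices, compensating for the probability mass that leaks out at sink vertices.
def normalizeRankSum(rankGraph: Graph[Double, Double]): Graph[Double, Double] = {
  val rankSum = rankGraph.vertices.values.sum()
  val correction = rankGraph.numVertices / rankSum
  rankGraph.mapVertices((_, rank) => rank * correction)
}
```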
[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106532078 --- Diff: graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala --- @@ -68,26 +69,34 @@ class PageRankSuite extends SparkFunSuite with LocalSparkContext { val nVertices = 100 val starGraph = GraphGenerators.starGraph(sc, nVertices).cache() val resetProb = 0.15 + val tol = 0.0001 + val numIter = 2 val errorTol = 1.0e-5 - val staticRanks1 = starGraph.staticPageRank(numIter = 2, resetProb).vertices - val staticRanks2 = starGraph.staticPageRank(numIter = 3, resetProb).vertices.cache() + val staticRanks = starGraph.staticPageRank(numIter, resetProb).vertices.cache() + val staticRanks2 = starGraph.staticPageRank(numIter + 1, resetProb).vertices - // Static PageRank should only take 3 iterations to converge - val notMatching = staticRanks1.innerZipJoin(staticRanks2) { (vid, pr1, pr2) => + // Static PageRank should only take 2 iterations to converge --- End diff -- Why does it take only two iterations to converge now? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106535595 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala --- @@ -322,13 +335,12 @@ object PageRank extends Logging { def personalizedVertexProgram(id: VertexId, attr: (Double, Double), msgSum: Double): (Double, Double) = { val (oldPR, lastDelta) = attr - var teleport = oldPR - val delta = if (src==id) resetProb else 0.0 - teleport = oldPR*delta - - val newPR = teleport + (1.0 - resetProb) * msgSum - val newDelta = if (lastDelta == Double.NegativeInfinity) newPR else newPR - oldPR - (newPR, newDelta) + val newPR = if (lastDelta == Double.NegativeInfinity) { --- End diff -- My memory of the algorithm is a bit rusty. Why don't you need to check for self-loops here anymore? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106528007 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala --- @@ -162,7 +162,15 @@ object PageRank extends Logging { iteration += 1 } -rankGraph +// If the graph has sinks (vertices with no outgoing edges) the sum of ranks will not be correct --- End diff -- put the name of the ticket as well --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r106309333 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala --- @@ -245,7 +245,7 @@ object ApproximatePercentile { val result = new Array[Double](percentages.length) var i = 0 while (i < percentages.length) { - result(i) = summaries.query(percentages(i)) + result(i) = summaries.query(percentages(i)).get --- End diff -- Yes, this is correct. But you should leave a comment, since it is not obvious. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
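For what such a comment could look like at the quoted line, a suggestion only; the wording is illustrative and the stated invariant should be verified against the surrounding code before committing it.

```scala
// query() returns None only when the summary has seen no values; getPercentiles
// already returns early for an empty buffer, so .get cannot throw here.
result(i) = summaries.query(percentages(i)).get
```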
[GitHub] spark issue #17108: [SPARK-19636][ML] Feature parity for correlation statist...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17108 I moved the code to `Correlations` as suggested. @imatiach-msft, I addressed your comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r106307446 --- Diff: mllib/src/test/scala/org/apache/spark/ml/stat/StatisticsSuite.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import breeze.linalg.{DenseMatrix => BDM, Matrix => BM} + +import org.apache.spark.SparkFunSuite +import org.apache.spark.internal.Logging +import org.apache.spark.ml.linalg.Matrix +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.sql.{DataFrame, Row} + + +class StatisticsSuite extends SparkFunSuite with MLlibTestSparkContext with Logging { + + import StatisticsSuite._ + + val xData = Array(1.0, 0.0, -2.0) + val yData = Array(4.0, 5.0, 3.0) + val zeros = new Array[Double](3) + val data = Seq( +Vectors.dense(1.0, 0.0, 0.0, -2.0), +Vectors.dense(4.0, 5.0, 0.0, 3.0), +Vectors.dense(6.0, 7.0, 0.0, 8.0), +Vectors.dense(9.0, 0.0, 0.0, 1.0) + ) + + private def X = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features") + + private def extract(df: DataFrame): BDM[Double] = { +val Array(Row(mat: Matrix)) = df.collect() +mat.asBreeze.toDenseMatrix + } + + + test("corr(X) default, pearson") { +val defaultMat = Statistics.corr(X, "features") +val pearsonMat = Statistics.corr(X, "features", "pearson") +// scalastyle:off +val expected = BDM( + (1., 0.05564149, Double.NaN, 0.4004714), + (0.05564149, 1., Double.NaN, 0.9135959), + (Double.NaN, Double.NaN, 1., Double.NaN), + (0.40047142, 0.91359586, Double.NaN, 1.000)) +// scalastyle:on + +assert(matrixApproxEqual(extract(defaultMat), expected)) +assert(matrixApproxEqual(extract(pearsonMat), expected)) + } + + test("corr(X) spearman") { +val spearmanMat = Statistics.corr(X, "features", "spearman") +// scalastyle:off +val expected = BDM( + (1.000, 0.1054093, Double.NaN, 0.400), + (0.1054093, 1.000, Double.NaN, 0.9486833), + (Double.NaN, Double.NaN, 1., Double.NaN), + (0.400, 0.9486833, Double.NaN, 1.000)) +// scalastyle:on +assert(matrixApproxEqual(extract(spearmanMat), expected)) + } + +} + + +object StatisticsSuite extends Logging { + + def approxEqual(v1: Double, v2: Double, threshold: Double = 1e-6): Boolean = { --- End diff -- Moved --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r106307517 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Statistics.scala --- @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.Since +import org.apache.spark.ml.linalg.{SQLDataTypes, Vector} +import org.apache.spark.mllib.linalg.{Vectors => OldVectors} +import org.apache.spark.mllib.stat.{Statistics => OldStatistics} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.types.{StructField, StructType} + +/** + * API for statistical functions in MLlib, compatible with Dataframes and Datasets. + * + * The functions in this package generalize the functions in [[org.apache.spark.sql.Dataset.stat]] + * to MLlib's Vector types. + */ +@Since("2.2.0") --- End diff -- Good point, thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r106306502 --- Diff: mllib/src/test/scala/org/apache/spark/ml/stat/StatisticsSuite.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import breeze.linalg.{DenseMatrix => BDM, Matrix => BM} + +import org.apache.spark.SparkFunSuite +import org.apache.spark.internal.Logging +import org.apache.spark.ml.linalg.Matrix +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.sql.{DataFrame, Row} + + +class StatisticsSuite extends SparkFunSuite with MLlibTestSparkContext with Logging { + + import StatisticsSuite._ + + val xData = Array(1.0, 0.0, -2.0) + val yData = Array(4.0, 5.0, 3.0) + val zeros = new Array[Double](3) + val data = Seq( +Vectors.dense(1.0, 0.0, 0.0, -2.0), +Vectors.dense(4.0, 5.0, 0.0, 3.0), +Vectors.dense(6.0, 7.0, 0.0, 8.0), +Vectors.dense(9.0, 0.0, 0.0, 1.0) + ) + + private def X = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features") + + private def extract(df: DataFrame): BDM[Double] = { +val Array(Row(mat: Matrix)) = df.collect() +mat.asBreeze.toDenseMatrix + } + + + test("corr(X) default, pearson") { +val defaultMat = Statistics.corr(X, "features") +val pearsonMat = Statistics.corr(X, "features", "pearson") +// scalastyle:off --- End diff -- The problem is the alignment of the values, which we realize by padding with `0`'s. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r106306385 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Statistics.scala --- @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.Since +import org.apache.spark.ml.linalg.{SQLDataTypes, Vector} +import org.apache.spark.mllib.linalg.{Vectors => OldVectors} +import org.apache.spark.mllib.stat.{Statistics => OldStatistics} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.types.{StructField, StructType} + +/** + * API for statistical functions in MLlib, compatible with Dataframes and Datasets. + * + * The functions in this package generalize the functions in [[org.apache.spark.sql.Dataset.stat]] + * to MLlib's Vector types. + */ +@Since("2.2.0") +object Statistics { + + /** + * Compute the correlation matrix for the input RDD of Vectors using the specified method. + * Methods currently supported: `pearson` (default), `spearman`. + * + * @param dataset a dataset or a dataframe + * @param column the name of the column of vectors for which the correlation coefficient needs + * to be computed. This must be a column of the dataset, and it must contain + * Vector objects. + * @param method String specifying the method to use for computing correlation. + * Supported: `pearson` (default), `spearman` + * @return A dataframe that contains the correlation matrix of the column of vectors. This + * dataframe contains a single row and a single column of name + * '$METHODNAME($COLUMN)'. + * @throws IllegalArgumentException if the column is not a valid column in the dataset, or if + * the content of this column is not of type Vector. + * + * Here is how to access the correlation coefficient: + * {{{ + *val data: Dataset[Vector] = ... + *val Row(coeff: Matrix) = Statistics.corr(data, "value").head + *// coeff now contains the Pearson correlation matrix. + * }}} + * + * @note For Spearman, a rank correlation, we need to create an RDD[Double] for each column + * and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector], + * which is fairly costly. Cache the input RDD before calling corr with `method = "spearman"` to + * avoid recomputing the common lineage. + */ + // TODO: how do we handle missing values? --- End diff -- Good point. I will remove the comment at this point, since this should be decided in JIRA instead of during the implementation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r106306111 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Statistics.scala --- @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.Since +import org.apache.spark.ml.linalg.{SQLDataTypes, Vector} +import org.apache.spark.mllib.linalg.{Vectors => OldVectors} +import org.apache.spark.mllib.stat.{Statistics => OldStatistics} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.types.{StructField, StructType} + +/** + * API for statistical functions in MLlib, compatible with Dataframes and Datasets. + * + * The functions in this package generalize the functions in [[org.apache.spark.sql.Dataset.stat]] + * to MLlib's Vector types. + */ +@Since("2.2.0") +object Statistics { + + /** + * Compute the correlation matrix for the input RDD of Vectors using the specified method. + * Methods currently supported: `pearson` (default), `spearman`. + * + * @param dataset a dataset or a dataframe --- End diff -- oh yes, thank you. I am correcting the other instances of course. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r106305822 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Statistics.scala --- @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.Since +import org.apache.spark.ml.linalg.{SQLDataTypes, Vector} +import org.apache.spark.mllib.linalg.{Vectors => OldVectors} +import org.apache.spark.mllib.stat.{Statistics => OldStatistics} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.types.{StructField, StructType} + +/** + * API for statistical functions in MLlib, compatible with Dataframes and Datasets. + * + * The functions in this package generalize the functions in [[org.apache.spark.sql.Dataset.stat]] + * to MLlib's Vector types. --- End diff -- I will use `spark.ml` which is the most correct terminology. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17110 @jkbradley LGTM, thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17215: [MINOR][ML] Improve MLWriter overwrite error message
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/17215 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104813864 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala --- @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impurity + +import org.apache.spark.annotation.{DeveloperApi, Experimental, Since} + +/** + * :: Experimental :: + * Class for calculating [[https://en.wikipedia.org/wiki/Chi-squared_test chi-squared]] + * during binary classification. + */ +@Since("2.0.0") +@Experimental +object ChiSquared extends Impurity { + private object CSTest extends org.apache.commons.math3.stat.inference.ChiSquareTest() --- End diff -- why not a private `val`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
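To make the question above concrete, a minimal sketch of the `private val` alternative (the helper object and method names here are illustrative, not the code under review; `ChiSquareTest` from commons-math keeps no mutable state, so a single shared instance is enough):

```scala
import org.apache.commons.math3.stat.inference.ChiSquareTest

object ChiSquaredHelper {
  // A plain private val is sufficient: the commons-math tester is stateless,
  // so one shared instance can serve every split evaluation.
  private val csTest = new ChiSquareTest()

  /** p-value for a 2x2 contingency table given as (leftPos, leftNeg, rightPos, rightNeg). */
  def pValue(leftPos: Long, leftNeg: Long, rightPos: Long, rightNeg: Long): Double = {
    val counts = Array(Array(leftPos, leftNeg), Array(rightPos, rightNeg))
    csTest.chiSquareTest(counts)
  }
}
```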
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104813302 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala --- @@ -237,6 +237,41 @@ class DecisionTreeClassifierSuite compareAPIs(rdd, dt, categoricalFeatures = Map.empty[Int, Int], numClasses) } + test("split quality using chi-squared and minimum gain") { +// Generate a data set where the 1st feature is useful and the others are noise +val features = Vector.fill(200) { + Array.fill(3) { scala.util.Random.nextInt(2).toDouble } +} +val labels = features.map { fv => + LabeledPoint(if (fv(0) == 1.0) 1.0 else 0.0, Vectors.dense(fv)) +} +val rdd = sc.parallelize(labels) + +// two-class learning problem +val numClasses = 2 +// all binary features +val catFeatures = Map(Vector.tabulate(features.head.length) { j => (j, 2) } : _*) + +// Chi-squared split quality with a p-value threshold of 0.01 should allow +// only the first feature to be used since the others are uncorrelated noise +val train: DataFrame = TreeTests.setMetadata(rdd, catFeatures, numClasses) +val dt = new DecisionTreeClassifier() + .setImpurity("chisquared") + .setMaxDepth(5) + .setMinInfoGain(0.01) +val treeModel = dt.fit(train) + +// The tree should use exactly one of the 3 features: featue(0) --- End diff -- nit: feature --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104812803 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -50,6 +50,50 @@ trait Impurity extends Serializable { @Since("1.0.0") @DeveloperApi def calculate(count: Double, sum: Double, sumSquares: Double): Double + + /** + * :: DeveloperApi :: + * Compute a test-statistic p-value quality measure from left and right split populations + * @param calcL impurity calculator for the left split population + * @param calcR impurity calculator for the right split population + * @return The p-value for the null hypothesis; that left and right split populations + * represent the same distribution + * @note Unless overridden this method will fail with an exception, for backward compatability + */ + @Since("2.0.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double = --- End diff -- you cannot guarantee compatibility with existing code here, since you would break the bytecode either way. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104812596 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -50,6 +50,50 @@ trait Impurity extends Serializable { @Since("1.0.0") @DeveloperApi def calculate(count: Double, sum: Double, sumSquares: Double): Double + + /** + * :: DeveloperApi :: + * Compute a test-statistic p-value quality measure from left and right split populations + * @param calcL impurity calculator for the left split population + * @param calcR impurity calculator for the right split population + * @return The p-value for the null hypothesis; that left and right split populations + * represent the same distribution + * @note Unless overridden this method will fail with an exception, for backward compatability + */ + @Since("2.0.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double = +throw new UnsupportedOperationException("Impurity.calculate") + + /** + * :: DeveloperApi :: + * Determine if this impurity measure is a test-statistic measure + * @return True if this is a split quality measure based on a test statistic (i.e. returns a + * p-value) or false otherwise. + * @note Unless overridden this method returns false by default, for backward compatability + */ + @Since("2.0.0") + @DeveloperApi + def isTestStatistic: Boolean = false +} + +/** + * :: DeveloperApi :: + * Utility functions for Impurity measures + */ +@Since("2.0.0") +@DeveloperApi --- End diff -- there is no need for this object to be publicly exposed? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104812468 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -50,6 +50,50 @@ trait Impurity extends Serializable { @Since("1.0.0") @DeveloperApi def calculate(count: Double, sum: Double, sumSquares: Double): Double + + /** + * :: DeveloperApi :: + * Compute a test-statistic p-value quality measure from left and right split populations + * @param calcL impurity calculator for the left split population + * @param calcR impurity calculator for the right split population + * @return The p-value for the null hypothesis; that left and right split populations + * represent the same distribution + * @note Unless overridden this method will fail with an exception, for backward compatability + */ + @Since("2.0.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double = --- End diff -- scala: do not add a default implementation, it causes issues with java compatibility --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104812484 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -50,6 +50,50 @@ trait Impurity extends Serializable { @Since("1.0.0") @DeveloperApi def calculate(count: Double, sum: Double, sumSquares: Double): Double + + /** + * :: DeveloperApi :: + * Compute a test-statistic p-value quality measure from left and right split populations + * @param calcL impurity calculator for the left split population + * @param calcR impurity calculator for the right split population + * @return The p-value for the null hypothesis; that left and right split populations + * represent the same distribution + * @note Unless overridden this method will fail with an exception, for backward compatability + */ + @Since("2.0.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double = +throw new UnsupportedOperationException("Impurity.calculate") + + /** + * :: DeveloperApi :: + * Determine if this impurity measure is a test-statistic measure + * @return True if this is a split quality measure based on a test statistic (i.e. returns a + * p-value) or false otherwise. + * @note Unless overridden this method returns false by default, for backward compatability + */ + @Since("2.0.0") + @DeveloperApi + def isTestStatistic: Boolean = false --- End diff -- scala: do not add a default implementation, it causes issues with java compatibility --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
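For context on the Java-compatibility concern in the two comments above: with the Scala version Spark targeted at the time (2.11), a trait method with a body is not surfaced as a default method to Java classes implementing the interface, and adding members to a published trait breaks binary compatibility either way. A toy sketch of the suggested shape, with stand-in types rather than the real `Impurity`/`ImpurityCalculator` classes:

```scala
// Toy stand-in types, not the real spark.mllib classes.
trait SplitQuality extends Serializable {
  // Kept abstract on purpose: every implementation (Scala or Java) must state
  // explicitly whether it is a test-statistic measure and how it is computed.
  def isTestStatistic: Boolean
  def calculate(leftCounts: Array[Double], rightCounts: Array[Double]): Double
}

object GiniLikeQuality extends SplitQuality {
  override def isTestStatistic: Boolean = false
  override def calculate(leftCounts: Array[Double], rightCounts: Array[Double]): Double = {
    // Placeholder computation; a real impurity measure would go here.
    leftCounts.sum + rightCounts.sum
  }
}
```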
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r103596760 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala --- @@ -0,0 +1,25 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.stat + +/** + * --- End diff -- oops, sorry, removing this file. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
GitHub user thunterdb opened a pull request: https://github.com/apache/spark/pull/17108 [SPARK-19636][ML] Feature parity for correlation statistics in MLlib
## What changes were proposed in this pull request?
This patch adds DataFrame-based support for the correlation statistics found in `org.apache.spark.mllib.stat.correlation.Statistics`, following the design doc discussed in the JIRA ticket. The current implementation is a simple wrapper around the `spark.mllib` implementation. Future optimizations can be implemented at a later stage.
## How was this patch tested?
```
build/sbt "testOnly org.apache.spark.ml.stat.StatisticsSuite"
```
You can merge this pull request into a Git repository by running: $ git pull https://github.com/thunterdb/spark 19636 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17108.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17108
commit 42c26bdd5a3ac24bbe3a4101ee252c072373efe6 Author: Timothy Hunter Date: 2017-02-23T23:44:55Z changes
commit d9f6a6c6c2457920fbf1002da5da62ff7d29b46c Author: Timothy Hunter Date: 2017-03-01T00:31:19Z commit
commit 7d4ccfef4e6d3a7b65c3cca149afb250414aea4c Author: Timothy Hunter Date: 2017-03-01T00:31:34Z Cleanup
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15770 Note that any of these formats would cause trouble for a graph with a node of very high centrality (Lady Gaga in the Twitter graph). That being said, I do not have a strong opinion as to which option we pick, in order to move things along. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/16971 @zhengruifeng thanks for looking into this issue. I have one comment above. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102592174 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala --- @@ -78,7 +80,12 @@ object StatFunctions extends Logging { def apply(summaries: Array[QuantileSummaries], row: Row): Array[QuantileSummaries] = { var i = 0 while (i < summaries.length) { -summaries(i) = summaries(i).insert(row.getDouble(i)) +if (!row.isNullAt(i)) { --- End diff -- Thank you for fixing this issue; it was an oversight in my original implementation. The current exception being thrown depends on an implementation detail (calling `sampled.head`). Can you modify the function `def query` below to explicitly throw an exception if `sampled` is empty, and document this behavior in that function? This way, we will not forget it if we decide to change the semantics of that class. As @MLnick was mentioning above, it would be preferable to either return an Option, null or NaN eventually, but this can wait for more consensus. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
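To make that request concrete, a toy stand-in (not `QuantileSummaries` itself; the buffer and the indexing are simplified) showing where the explicit guard and the documented failure mode would live:

```scala
// Toy stand-in for a quantile summary, only to show the explicit empty check
// and the documented exception the comment above asks for.
class ToyQuantileSummary {
  private var sampled: Vector[Double] = Vector.empty

  def insert(x: Double): Unit = { sampled = (sampled :+ x).sorted }

  /**
   * Returns the value at the given quantile of the inserted data.
   *
   * @throws NoSuchElementException if no value has been inserted into the summary.
   */
  def query(quantile: Double): Double = {
    require(quantile >= 0 && quantile <= 1.0, "quantile should be in the range [0.0, 1.0]")
    if (sampled.isEmpty) {
      throw new NoSuchElementException(
        "Cannot query the quantiles of an empty summary: no values have been inserted")
    }
    val idx = math.min(sampled.length - 1, (quantile * sampled.length).toInt)
    sampled(idx)
  }
}
```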
[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r102589719 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala --- @@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) { * Note that values greater than 1 are accepted but give the same result as 1. * @return the approximate quantiles at the given probabilities of each column * - * @note Rows containing any null or NaN values will be removed before calculation. If - * the dataframe is empty or all rows contain null or NaN, null is returned. + * @note null and NaN values will be removed from the numerical column before calculation. If + * the dataframe is empty, or all rows in some column contain null or NaN, null is returned. * * @since 2.2.0 */ def approxQuantile( cols: Array[String], probabilities: Array[Double], relativeError: Double): Array[Array[Double]] = { -// TODO: Update NaN/null handling to keep consistent with the single-column version try { - StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols, + StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*), cols, probabilities, relativeError).map(_.toArray).toArray } catch { case e: NoSuchElementException => null --- End diff -- +1. I tend to think that the result should be `NaN` (following the IEEE convention) or `null` (following scala Option convention). But pending a resolution, I am fine with throwing an exception because it is the most conservative behavior (stopping computations). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15770 @wangmiao1981 yes I had seen the discussions there. I believe that eventually PIC should be moved into graphframes, but we can have a simple API in `spark.ml` for the time being. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15770 You are right, I had forgotten that for this algorithm, the input is the edges, and the output is the label for each of the vertices. This is a tricky algorithm to expose as a transformer, since it does not follow the usual convention that data should only be appended to the dataframe. I suggest we follow the same example as ALS when wrapping the mllib implementation of PIC:
- let's make it an estimator that returns a model: the model contains the labels for each of the points in a dataframe (the current output of transform)
- the model's transform method now takes points with an id, and joins them with the model to append a column of labels. This is the same as ALS.
If we do not follow this pattern, then the model selection algorithms are not going to work. What do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
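A rough sketch of that estimator/model split (class, method, and column names are illustrative, not a proposed public API): `fit` runs the expensive clustering once over the similarity edges, and the resulting model's `transform` is just a join on the id column, as in `ALSModel`:

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

// Illustrative only: the expensive clustering happens once in fit(), and
// transform() appends the cluster labels with a cheap join, like ALSModel.
class PICEstimatorSketch(idCol: String, predictionCol: String) {

  def fit(similarityEdges: Dataset[_]): PICModelSketch = {
    // ... run the spark.mllib PowerIterationClustering on the edges here and
    // collect the per-vertex assignments as a DataFrame (idCol, predictionCol) ...
    val assignments: DataFrame = ???
    new PICModelSketch(assignments, idCol)
  }
}

class PICModelSketch(assignments: DataFrame, idCol: String) {

  /** Appends the cluster label of each row by joining the stored assignments on the id column. */
  def transform(dataset: Dataset[_]): DataFrame =
    dataset.join(assignments, idCol)
}
```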
[GitHub] spark issue #14299: Ensure broadcasted variables are destroyed even in case ...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/14299 @AnthonyTruchet thank you for the PR. This is definitely worth fixing for large deployments. Now, as you noticed, this portion of code does not quite abide by the best engineering practices... Instead of adding an extra layer of nesting, would you mind making the following changes?
```scala
def fit[S <: Iterable[String]](dataset: RDD[S]): Word2VecModel = {
  ...
  val expTable = sc.broadcast(createExpTable())
  val bcVocab = sc.broadcast(vocab)
  val bcVocabHash = sc.broadcast(vocabHash)
  try {
    fit0(expTable, ...)
  } finally {
    ...
  }
}

private final def fit0(...) {
  // Put all the content here.
  // Note that the inner code also includes some broadcasts; you may want to fix these as well if you can.
}
```
I personally agree about resource management and scala-arm. Unfortunately, we try to keep Scala dependencies to a minimum, because they can be very tedious to move from one Scala version to another. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
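For completeness, a self-contained illustration of the try/finally cleanup pattern suggested above, outside of Word2Vec; `destroy()` releases the broadcast state even when the job inside the `try` block fails:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.SparkSession

object BroadcastCleanupExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("broadcast-cleanup").getOrCreate()
    val sc = spark.sparkContext

    val lookup: Broadcast[Map[Int, String]] = sc.broadcast(Map(0 -> "a", 1 -> "b", 2 -> "c"))
    try {
      // Everything that needs the broadcast goes here; if this throws, the
      // finally block below still releases the broadcast.
      val counts = sc.parallelize(1 to 100).map(i => lookup.value(i % 3)).countByValue()
      println(counts)
    } finally {
      lookup.destroy()
    }
    spark.stop()
  }
}
```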
[GitHub] spark issue #16774: [SPARK-19357][ML] Adding parallel model evaluation in ML...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/16774 Thanks for working on this task; this is a much-requested feature. While it will work for simple cases in its current shape, it is going to cause some issues for complex deployments (Apache Toree, Livy, Databricks, etc.) because the thread pool that controls the computations is not managed. The default assumption behind `.par` is a large number of quick tasks. With the current implementation, because the same thread pool is shared across all the parallel collections, users are going to encounter some mysterious freezes in other places while the ML models finish training (I am talking from experience here). While the situation with `.par` has notably improved with Scala 2.10, it is better to:
- create a dedicated thread pool for each `.fit`, which users can replace.
- use futures whose execution context is tied to the thread pool above.
- not use semaphores, but instead rely on the thread limit at the thread pool level to cap the number of concurrent executions.
If you do not do that, users in a shared environment like any of the above will experience some mysterious freezes depending on what other users are doing. Ideally the default resources should be tied to `SparkSession`, but we can start with a default static pool marked as an experimental API. More concretely, the API should look like this, I believe:
```
def setNumParallelEval(num: Int) // Creates an execution context with the given max number of threads
def setExecutionExecutorService(exec: ExecutorService) // Will use the given executor service instead of an executor service shared by all the ML calculations
```
See the doc in: https://twitter.github.io/scala_school/concurrency.html#executor https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html More about parallel collections: http://stackoverflow.com/questions/5424496/scala-parallel-collections-degree-of-parallelism http://stackoverflow.com/questions/14214757/what-is-the-benefit-of-using-futures-over-parallel-collections-in-scala --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
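A sketch of that threading model; every name below (`ParallelFitSketch`, `setExecutionExecutorService`, `fitAll`) is illustrative rather than the actual CrossValidator API. The concurrency cap comes from the size of the dedicated pool, so no semaphore is needed, and nothing runs on the global pool shared with the rest of the JVM:

```scala
import java.util.concurrent.{ExecutorService, Executors}

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

class ParallelFitSketch(defaultParallelism: Int = 2) {

  // Dedicated, caller-replaceable pool; its size caps how many models train at once.
  private var executor: ExecutorService = Executors.newFixedThreadPool(defaultParallelism)

  def setExecutionExecutorService(exec: ExecutorService): this.type = {
    executor = exec
    this
  }

  /** Trains one model per parameter setting on the dedicated pool and waits for all of them. */
  def fitAll[P, M](params: Seq[P])(train: P => M): Seq[M] = {
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(executor)
    val models: Seq[Future[M]] = params.map(p => Future(train(p)))
    Await.result(Future.sequence(models), Duration.Inf)
  }
}
```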
[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r101834675 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala --- @@ -121,6 +121,33 @@ class CrossValidatorSuite } } + test("cross validation with parallel evaluation") { +val lr = new LogisticRegression +val lrParamMaps = new ParamGridBuilder() + .addGrid(lr.regParam, Array(0.001, 1000.0)) + .addGrid(lr.maxIter, Array(0, 3)) + .build() +val eval = new BinaryClassificationEvaluator +val cv = new CrossValidator() + .setEstimator(lr) + .setEstimatorParamMaps(lrParamMaps) + .setEvaluator(eval) + .setNumFolds(2) + .setNumParallelEval(1) +val cvSerialModel = cv.fit(dataset) +cv.setNumParallelEval(2) --- End diff -- This test is not deterministic now. @MLnick, do you know if we have some utilities to retry tests a number of times? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16973: [SPARKR][EXAMPLES] update examples to stop spark session
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/16973 These changes look good to me, but my knowledge of R is very limited. @mengxr should confirm. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16557: [SPARK-18693][ML][MLLIB] ML Evaluators should use weight...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/16557 I agree, let's break this PR up. It will go faster, and some changes may require longer discussions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16776: [SPARK-19436][SQL] Add missing tests for approxQuantile
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/16776 Sorry I missed the conversation here. LGTM. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15770 @wangmiao1981 thanks a lot! I would be very happy to see that PR in Spark 2.2, and I will gladly help you with that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101666251 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/PowerIterationClusteringSuite.scala --- @@ -0,0 +1,153 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import scala.collection.mutable + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.util.DefaultReadWriteTest +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession} + +class PowerIterationClusteringSuite extends SparkFunSuite + with MLlibTestSparkContext with DefaultReadWriteTest { + + @transient var data: Dataset[_] = _ + final val r1 = 1.0 + final val n1 = 10 + final val r2 = 4.0 + final val n2 = 40 + + override def beforeAll(): Unit = { +super.beforeAll() + +data = PowerIterationClusteringSuite.generatePICData(spark, r1, r2, n1, n2) + } + + test("default parameters") { +val pic = new PowerIterationClustering() + +assert(pic.getK === 2) +assert(pic.getMaxIter === 20) +assert(pic.getInitMode === "random") +assert(pic.getFeaturesCol === "features") +assert(pic.getPredictionCol === "prediction") +assert(pic.getIdCol === "id") + } + + test("set parameters") { +val pic = new PowerIterationClustering() + .setK(9) + .setMaxIter(33) + .setInitMode("degree") + .setFeaturesCol("test_feature") + .setPredictionCol("test_prediction") + .setIdCol("test_id") + +assert(pic.getK === 9) +assert(pic.getMaxIter === 33) +assert(pic.getInitMode === "degree") +assert(pic.getFeaturesCol === "test_feature") +assert(pic.getPredictionCol === "test_prediction") +assert(pic.getIdCol === "test_id") + } + + test("parameters validation") { +intercept[IllegalArgumentException] { + new PowerIterationClustering().setK(1) +} +intercept[IllegalArgumentException] { + new PowerIterationClustering().setInitMode("no_such_a_mode") +} + } + + test("power iteration clustering") { --- End diff -- can you also add a test with a dataframe that has some extra data in it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
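A sketch of the kind of test being asked for, assuming the transformer-style API from the diff above and the suite's existing `data`; the extra column name is illustrative:

```scala
test("power iteration clustering with extra columns") {
  import org.apache.spark.sql.functions.lit
  // The extra column must not affect the clustering and should survive in the output.
  val withExtra = data.toDF().withColumn("payload", lit("extra"))
  val pic = new PowerIterationClustering()
    .setK(2)
    .setMaxIter(10)
  val result = pic.transform(withExtra)
  assert(result.columns.contains("payload"))
}
```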
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101665899 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.linalg.{Vector} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.{col} +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType} + +/** + * Common params for PowerIterationClustering + */ +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter + with HasFeaturesCol with HasPredictionCol { + + /** + * The number of clusters to create (k). Must be > 1. Default: 2. + * @group param + */ + @Since("2.2.0") + final val k = new IntParam(this, "k", "The number of clusters to create. " + +"Must be > 1.", ParamValidators.gt(1)) + + /** @group getParam */ + @Since("2.2.0") + def getK: Int = $(k) + + /** + * Param for the initialization algorithm. This can be either "random" to use a random vector + * as vertex properties, or "degree" to use normalized sum similarities. Default: random. + */ + @Since("2.2.0") + final val initMode = new Param[String](this, "initMode", "The initialization algorithm. " + +"Supported options: 'random' and 'degree'.", +(value: String) => validateInitMode(value)) + + private[spark] def validateInitMode(initMode: String): Boolean = { +initMode match { + case "random" => true + case "degree" => true + case _ => false +} + } + + /** @group expertGetParam */ + @Since("2.2.0") + def getInitMode: String = $(initMode) + + /** + * Param for the column name for ids returned by [[PowerIterationClustering.transform()]]. 
+ * Default: "id" + * @group param + */ + val idCol = new Param[String](this, "idCol", "column name for ids.") + + /** @group getParam */ + def getIdCol: String = $(idCol) + + /** + * Validates the input schema + * @param schema input schema + */ + protected def validateSchema(schema: StructType): Unit = { +SchemaUtils.checkColumnType(schema, $(idCol), LongType) +SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType) + } +} + +/** + * :: Experimental :: + * Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by + * [[http://www.icml2010.org/papers/387.pdf Lin and Cohen]]. From the abstract: PIC finds a very + * low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise + * similarity matrix of the data. + * + * Note that we implement [[PowerIterationClustering]] as a transformer. The [[transform]] is an + * expensive operation, because it uses PIC algorithm to cluster the whole input dataset. + * + * @see [[http://en.wikipedia.org/wiki/Spectral_clustering Spectral clustering (Wikipedia)]] + */ +@Since("2.2.0") +@Experimental +class PowerIterationClustering private[clustering] ( +
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101664268 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.linalg.{Vector} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.{col} +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType} + +/** + * Common params for PowerIterationClustering + */ +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter + with HasFeaturesCol with HasPredictionCol { + + /** + * The number of clusters to create (k). Must be > 1. Default: 2. + * @group param + */ + @Since("2.2.0") + final val k = new IntParam(this, "k", "The number of clusters to create. " + +"Must be > 1.", ParamValidators.gt(1)) + + /** @group getParam */ + @Since("2.2.0") + def getK: Int = $(k) + + /** + * Param for the initialization algorithm. This can be either "random" to use a random vector + * as vertex properties, or "degree" to use normalized sum similarities. Default: random. + */ + @Since("2.2.0") + final val initMode = new Param[String](this, "initMode", "The initialization algorithm. " + +"Supported options: 'random' and 'degree'.", +(value: String) => validateInitMode(value)) + + private[spark] def validateInitMode(initMode: String): Boolean = { +initMode match { + case "random" => true + case "degree" => true + case _ => false +} + } + + /** @group expertGetParam */ + @Since("2.2.0") + def getInitMode: String = $(initMode) + + /** + * Param for the column name for ids returned by [[PowerIterationClustering.transform()]]. + * Default: "id" + * @group param + */ + val idCol = new Param[String](this, "idCol", "column name for ids.") --- End diff -- Instead of making an 'id' column, which does not convey much information, we should follow the example of `K-Means` and call it `prediction`. You already include the trait for that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101663790 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.linalg.{Vector} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.{col} +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType} + +/** + * Common params for PowerIterationClustering + */ +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter + with HasFeaturesCol with HasPredictionCol { + + /** + * The number of clusters to create (k). Must be > 1. Default: 2. + * @group param + */ + @Since("2.2.0") + final val k = new IntParam(this, "k", "The number of clusters to create. " + +"Must be > 1.", ParamValidators.gt(1)) + + /** @group getParam */ + @Since("2.2.0") + def getK: Int = $(k) + + /** + * Param for the initialization algorithm. This can be either "random" to use a random vector + * as vertex properties, or "degree" to use normalized sum similarities. Default: random. + */ + @Since("2.2.0") + final val initMode = new Param[String](this, "initMode", "The initialization algorithm. " + +"Supported options: 'random' and 'degree'.", +(value: String) => validateInitMode(value)) + + private[spark] def validateInitMode(initMode: String): Boolean = { +initMode match { + case "random" => true + case "degree" => true + case _ => false +} + } + + /** @group expertGetParam */ + @Since("2.2.0") + def getInitMode: String = $(initMode) + + /** + * Param for the column name for ids returned by [[PowerIterationClustering.transform()]]. + * Default: "id" + * @group param + */ + val idCol = new Param[String](this, "idCol", "column name for ids.") + + /** @group getParam */ + def getIdCol: String = $(idCol) + + /** + * Validates the input schema + * @param schema input schema + */ + protected def validateSchema(schema: StructType): Unit = { --- End diff -- Instead of just validating the schema, we should validate and transform. 
You can follow the example in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L92 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
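[Editor's note] As a concrete illustration of "validate and transform", a minimal sketch modeled on the KMeans pattern linked above might look like the following. The method name `validateAndTransformSchema` and the use of `SchemaUtils.appendColumn` are taken from that KMeans example, not from this PR, so treat this as a suggestion rather than the final code:

```scala
import org.apache.spark.ml.util.SchemaUtils
import org.apache.spark.sql.types.{IntegerType, LongType, StructType}

// Validate the id column and *append* the prediction column, so callers get the
// transformed output schema back instead of a bare Unit.
protected def validateAndTransformSchema(schema: StructType): StructType = {
  SchemaUtils.checkColumnType(schema, $(idCol), LongType)
  SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
}
```

Returning the appended schema lets `transformSchema` in the transformer delegate to this method directly, as KMeans does.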
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101662332 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.linalg.{Vector} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.{col} +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType} + +/** + * Common params for PowerIterationClustering + */ +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter + with HasFeaturesCol with HasPredictionCol { + + /** + * The number of clusters to create (k). Must be > 1. Default: 2. + * @group param + */ + @Since("2.2.0") + final val k = new IntParam(this, "k", "The number of clusters to create. " + +"Must be > 1.", ParamValidators.gt(1)) + + /** @group getParam */ + @Since("2.2.0") + def getK: Int = $(k) + + /** + * Param for the initialization algorithm. This can be either "random" to use a random vector + * as vertex properties, or "degree" to use normalized sum similarities. Default: random. + */ + @Since("2.2.0") + final val initMode = new Param[String](this, "initMode", "The initialization algorithm. " + +"Supported options: 'random' and 'degree'.", +(value: String) => validateInitMode(value)) + + private[spark] def validateInitMode(initMode: String): Boolean = { +initMode match { + case "random" => true + case "degree" => true + case _ => false +} + } + + /** @group expertGetParam */ + @Since("2.2.0") + def getInitMode: String = $(initMode) + + /** + * Param for the column name for ids returned by [[PowerIterationClustering.transform()]]. 
+ * Default: "id" + * @group param + */ + val idCol = new Param[String](this, "idCol", "column name for ids.") + + /** @group getParam */ + def getIdCol: String = $(idCol) + + /** + * Validates the input schema + * @param schema input schema + */ + protected def validateSchema(schema: StructType): Unit = { +SchemaUtils.checkColumnType(schema, $(idCol), LongType) +SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType) + } +} + +/** + * :: Experimental :: + * Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by + * [[http://www.icml2010.org/papers/387.pdf Lin and Cohen]]. From the abstract: PIC finds a very + * low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise + * similarity matrix of the data. + * + * Note that we implement [[PowerIterationClustering]] as a transformer. The [[transform]] is an + * expensive operation, because it uses PIC algorithm to cluster the whole input dataset. + * + * @see [[http://en.wikipedia.org/wiki/Spectral_clustering Spectral clustering (Wikipedia)]] + */ +@Since("2.2.0") +@Experimental +class PowerIterationClustering private[clustering] ( +
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101662298 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.linalg.{Vector} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.{col} +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType} + +/** + * Common params for PowerIterationClustering + */ +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter + with HasFeaturesCol with HasPredictionCol { + + /** + * The number of clusters to create (k). Must be > 1. Default: 2. + * @group param + */ + @Since("2.2.0") + final val k = new IntParam(this, "k", "The number of clusters to create. " + +"Must be > 1.", ParamValidators.gt(1)) + + /** @group getParam */ + @Since("2.2.0") + def getK: Int = $(k) + + /** + * Param for the initialization algorithm. This can be either "random" to use a random vector + * as vertex properties, or "degree" to use normalized sum similarities. Default: random. + */ + @Since("2.2.0") + final val initMode = new Param[String](this, "initMode", "The initialization algorithm. " + +"Supported options: 'random' and 'degree'.", +(value: String) => validateInitMode(value)) + + private[spark] def validateInitMode(initMode: String): Boolean = { --- End diff -- No need with comment above --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101662273 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.linalg.{Vector} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.{col} +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType} + +/** + * Common params for PowerIterationClustering + */ +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter + with HasFeaturesCol with HasPredictionCol { + + /** + * The number of clusters to create (k). Must be > 1. Default: 2. + * @group param + */ + @Since("2.2.0") + final val k = new IntParam(this, "k", "The number of clusters to create. " + +"Must be > 1.", ParamValidators.gt(1)) + + /** @group getParam */ + @Since("2.2.0") + def getK: Int = $(k) + + /** + * Param for the initialization algorithm. This can be either "random" to use a random vector + * as vertex properties, or "degree" to use normalized sum similarities. Default: random. + */ + @Since("2.2.0") + final val initMode = new Param[String](this, "initMode", "The initialization algorithm. " + +"Supported options: 'random' and 'degree'.", +(value: String) => validateInitMode(value)) --- End diff -- You do not need to use write a function as you do below after that, it will allow more user-friendly error messages in the future. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
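[Editor's note] Reading the suggestion above as "use a built-in validator instead of a hand-written function", a minimal sketch could be the following; the choice of `ParamValidators.inArray` is an assumption on our part, since the comment does not name it:

```scala
import org.apache.spark.ml.param.{Param, ParamValidators}

// Built-in validator: rejects anything outside the supported set, so the separate
// validateInitMode helper below is no longer needed, and future error messages can
// list the allowed values automatically.
final val initMode = new Param[String](this, "initMode",
  "The initialization algorithm. Supported options: 'random' and 'degree'.",
  ParamValidators.inArray(Array("random", "degree")))
```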
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101662038 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.linalg.{Vector} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.{col} --- End diff -- same here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r101662018 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.linalg.{Vector} --- End diff -- no need for brackets --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
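[Editor's note] For reference, the two single-class imports flagged here and in the previous comment would simply drop the braces:

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.col
```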
[GitHub] spark issue #15826: [SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup ...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15826 @yanboliang that looks great, thank you. LGTM. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15593#discussion_r87267275 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -489,13 +485,14 @@ class LogisticRegression @Since("1.2.0") ( val initialCoefWithInterceptArray = initialCoefficientsWithIntercept.toArray --- End diff -- can you document on line 465 the layout of the data, to help future developers (all the coefficients, in column major order, maybe followed by a last column that contain the intercept)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
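[Editor's note] A possible shape for the requested documentation, sketched purely from the wording of the comment; the exact layout is an assumption and should be checked against the surrounding code before adopting it:

```scala
// Layout of initialCoefficientsWithIntercept (flattened, column major):
//   columns 0 .. numFeatures-1 : one coefficient per class and feature,
//   optional last column       : the intercepts, present only when fitIntercept is set.
val initialCoefWithInterceptArray = initialCoefficientsWithIntercept.toArray
```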
[GitHub] spark issue #15683: [SPARK-18166][MLlib] Fix Poisson GLM bug due to wrong re...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15683 +1 for trying to get it into 2.1 (modulo tests) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15809: [SPARK-18268][ML][MLLib] ALS fail with better message if...
Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/15809 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15826: [SPARK-14077][ML][FOLLOW-UP] Minor refactor and c...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15826#discussion_r87256910 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala --- @@ -110,21 +110,20 @@ class NaiveBayes @Since("1.5.0") ( @Since("2.1.0") def setWeightCol(value: String): this.type = set(weightCol, value) + override protected def train(dataset: Dataset[_]): NaiveBayesModel = { +trainWithLabelCheck(dataset) + } + /** * ml assumes input labels in range [0, numClasses). But this implementation * is also called by mllib NaiveBayes which allows other kinds of input labels - * such as {-1, +1}. Here we use this parameter to switch between different processing logic. - * It should be removed when we remove mllib NaiveBayes. + * such as {-1, +1}. `positiveLabel` is used to determine whether the label + * should be checked and it should be removed when we remove mllib NaiveBayes. */ - private[spark] var isML: Boolean = true - - private[spark] def setIsML(isML: Boolean): this.type = { -this.isML = isML -this - } - - override protected def train(dataset: Dataset[_]): NaiveBayesModel = { -if (isML) { + private[spark] def trainWithLabelCheck( + dataset: Dataset[_], + positiveLabel: Boolean = true): NaiveBayesModel = { --- End diff -- Since it is an internal method, I suggest to not use a default argument and specify it explicitly above. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
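[Editor's note] A minimal sketch of that suggestion, with the flag passed explicitly at the internal call site instead of defaulted; the training body is elided since it is unchanged:

```scala
override protected def train(dataset: Dataset[_]): NaiveBayesModel = {
  // Pass the flag explicitly rather than relying on a default argument.
  trainWithLabelCheck(dataset, positiveLabel = true)
}

private[spark] def trainWithLabelCheck(
    dataset: Dataset[_],
    positiveLabel: Boolean): NaiveBayesModel = {
  // ... existing training logic, unchanged ...
}
```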
[GitHub] spark pull request #15826: [SPARK-14077][ML][FOLLOW-UP] Minor refactor and c...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15826#discussion_r87256702 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala --- @@ -226,13 +206,33 @@ class NaiveBayes @Since("1.5.0") ( @Since("1.6.0") object NaiveBayes extends DefaultParamsReadable[NaiveBayes] { /** String name for multinomial model type. */ - private[spark] val Multinomial: String = "multinomial" + private[classification] val Multinomial: String = "multinomial" /** String name for Bernoulli model type. */ - private[spark] val Bernoulli: String = "bernoulli" + private[classification] val Bernoulli: String = "bernoulli" /* Set of modelTypes that NaiveBayes supports */ - private[spark] val supportedModelTypes = Set(Multinomial, Bernoulli) + private[classification] val supportedModelTypes = Set(Multinomial, Bernoulli) + + private[NaiveBayes] val requireNonnegativeValues: Vector => Unit = (v: Vector) => { +val values = v match { + case sv: SparseVector => sv.values + case dv: DenseVector => dv.values +} + +require(values.forall(_ >= 0.0), + s"Naive Bayes requires nonnegative feature values but found $v.") + } + + private[NaiveBayes] val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => { --- End diff -- same thing here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15826: [SPARK-14077][ML][FOLLOW-UP] Minor refactor and c...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15826#discussion_r87256589 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala --- @@ -226,13 +206,33 @@ class NaiveBayes @Since("1.5.0") ( @Since("1.6.0") object NaiveBayes extends DefaultParamsReadable[NaiveBayes] { /** String name for multinomial model type. */ - private[spark] val Multinomial: String = "multinomial" + private[classification] val Multinomial: String = "multinomial" /** String name for Bernoulli model type. */ - private[spark] val Bernoulli: String = "bernoulli" + private[classification] val Bernoulli: String = "bernoulli" /* Set of modelTypes that NaiveBayes supports */ - private[spark] val supportedModelTypes = Set(Multinomial, Bernoulli) + private[classification] val supportedModelTypes = Set(Multinomial, Bernoulli) + + private[NaiveBayes] val requireNonnegativeValues: Vector => Unit = (v: Vector) => { --- End diff -- @yanboliang indeed it makes a difference when defining methods inside another method. In the current case, you brought the function as a value of the companion object. You can safely turn it into a method --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
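[Editor's note] Concretely, the function-valued `val` quoted in the diff above can become a plain method on the companion object, with no change for callers; this is a direct rewrite of the snippet under review:

```scala
// Same check as before, but defined as a method rather than a Vector => Unit val.
private[NaiveBayes] def requireNonnegativeValues(v: Vector): Unit = {
  val values = v match {
    case sv: SparseVector => sv.values
    case dv: DenseVector => dv.values
  }
  require(values.forall(_ >= 0.0),
    s"Naive Bayes requires nonnegative feature values but found $v.")
}
```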
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r87116504 --- Diff: docs/ml-features.md --- @@ -1396,3 +1396,149 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py %} + +# Locality Sensitive Hashing +[Locality Sensitive Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) Locality Sensitive Hashing(LSH) is an important class of hashing techniques, which is commonly used in clustering and outlier detection with large datasets. + +The general idea of LSH is to use a family of functions (we call them LSH families) to hash data points into buckets, so that the data points which are close to each other are in the same buckets with high probability, while data points that are far away from each other are very likely in different buckets. A formal definition of LSH family is as follows: + +In a metric space `(M, d)`, an LSH family is a family of functions `h` that satisfy the following properties: +`\[ +\forall p, q \in M,\\ +d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\ +d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1 +\]` +This LSH family is called `(r1, r2, p1, p2)`-sensitive. + +In this section, we call a pair of input features a false positive if the two features are hashed into the same hash bucket but they are far away in distance, and we define false negative as the pair of features when their distance are close but they are not in the same hash bucket. + +## Random Projection for Euclidean Distance +**Note:** Please note that this is different than the [Random Projection for cosine distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection). + +[Random Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions) is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance is defined as follows: +`\[ +d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2} +\]` +Its LSH family projects features onto a random unit vector and divide the projected results to hash buckets: +`\[ +h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor +\]` +where `v` is a normalized random unit vector and `r` is user-defined bucket length. The bucket length can be used to control the average size of hash buckets. A larger bucket length means higher probability for features to be in the same bucket. + +The input features in Euclidean space are represented in vectors. Both sparse and dense vectors are supported. + + + + +Refer to the [RandomProjection Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection) +for more details on the API. + +{% include_example scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %} + + + + +Refer to the [RandomProjection Java docs](api/java/org/apache/spark/ml/feature/RandomProjection.html) +for more details on the API. + +{% include_example java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %} + + + +## MinHash for Jaccard Distance +[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in `spark.ml` for Jaccard distance where input features are sets of natural numbers. 
Jaccard distance of two sets is defined by the cardinality of their intersection and union: +`\[ +d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap \mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|} +\]` +As its LSH family, MinHash applies a random perfect hash function `g` to each elements in the set and take the minimum of all hashed values: +`\[ +h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a)) +\]` + +Input sets for MinHash is represented in vectors which dimension equals the total number of elements in the space. Each dimension of the vectors represents the status of an elements: zero value means the elements is not in the set; non-zero value means the set contains the corresponding elements. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5. + +**Note:** Empty sets cannot be transformed by MinHash, which means any input vector must have at least 1 non-zero indices. + + + + +Refer to the [MinHash Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash) +for more details on the API. + +{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala %} + + + + +Refer to the [MinHash Java docs](api/java/org/apache/spark/ml/feature/MinHash.html) +for more details on the API. + +{% include_example java/org/apache/spark/examples/ml/JavaMinHashExampl
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r87113033 --- Diff: docs/ml-features.md --- @@ -1396,3 +1396,149 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py %} + +# Locality Sensitive Hashing +[Locality Sensitive Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) Locality Sensitive Hashing(LSH) is an important class of hashing techniques, which is commonly used in clustering and outlier detection with large datasets. + +The general idea of LSH is to use a family of functions (we call them LSH families) to hash data points into buckets, so that the data points which are close to each other are in the same buckets with high probability, while data points that are far away from each other are very likely in different buckets. A formal definition of LSH family is as follows: + +In a metric space `(M, d)`, an LSH family is a family of functions `h` that satisfy the following properties: +`\[ +\forall p, q \in M,\\ +d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\ +d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1 --- End diff -- p2 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
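[Editor's note] Spelled out, the correction is that the second probability bound should use `p2`, not `p1`:

`\[
\forall p, q \in M,\\
d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
\]`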
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r87113839 --- Diff: docs/ml-features.md --- @@ -1396,3 +1396,149 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py %} + +# Locality Sensitive Hashing +[Locality Sensitive Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) Locality Sensitive Hashing(LSH) is an important class of hashing techniques, which is commonly used in clustering and outlier detection with large datasets. + +The general idea of LSH is to use a family of functions (we call them LSH families) to hash data points into buckets, so that the data points which are close to each other are in the same buckets with high probability, while data points that are far away from each other are very likely in different buckets. A formal definition of LSH family is as follows: + +In a metric space `(M, d)`, an LSH family is a family of functions `h` that satisfy the following properties: --- End diff -- Actually, I would at least mention that `d` is a distance function. It is important in the context of LSH. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r87113728 --- Diff: docs/ml-features.md --- @@ -1396,3 +1396,149 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py %} + +# Locality Sensitive Hashing +[Locality Sensitive Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) Locality Sensitive Hashing(LSH) is an important class of hashing techniques, which is commonly used in clustering and outlier detection with large datasets. + +The general idea of LSH is to use a family of functions (we call them LSH families) to hash data points into buckets, so that the data points which are close to each other are in the same buckets with high probability, while data points that are far away from each other are very likely in different buckets. A formal definition of LSH family is as follows: + +In a metric space `(M, d)`, an LSH family is a family of functions `h` that satisfy the following properties: --- End diff -- You should either link to the definition of a metric space, or explain what M and d are. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
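[Editor's note] One possible rewording along the lines of the two comments above, offered only as a suggestion and not text from the PR: "In a metric space `(M, d)`, where `M` is a set of points and `d` is a distance function on `M`, an LSH family is a family of functions `h` that satisfy the following properties: ..."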