[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/21097#discussion_r184760717

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
     assert(mostImportantFeature !== mostIF)
   }

+  test("model evaluateEachIteration") {
+    for (lossType <- Seq("logistic")) {
+      val gbt = new GBTClassifier()
+        .setMaxDepth(2)
+        .setMaxIter(2)
+        .setLossType(lossType)
+      val model = gbt.fit(trainData.toDF)
+      val eval1 = model.evaluateEachIteration(validationData.toDF)
+      val eval2 = GradientBoostedTrees.evaluateEachIteration(validationData,
--- End diff --

This is testing the spark.ml implementation against itself. I was about to recommend using the old spark.mllib implementation as a reference. However, the old implementation is not tested at all. Would you be able to test against a standard implementation in R or scikit-learn (following the patterns used elsewhere in MLlib)?
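A minimal sketch of the kind of test being requested, under assumptions: the reference losses would be precomputed offline with a standard implementation (for example scikit-learn's GradientBoostingClassifier and its staged prediction methods) and hard-coded into the test; the numbers below are placeholders, `trainData`/`validationData` come from the existing suite, and the `~==`/`relTol` comparison helpers are assumed to be the ones from MLlib's TestingUtils.

  test("evaluateEachIteration against reference implementation") {
    // Placeholder values -- real reference losses would be generated offline
    // with a standard implementation (e.g. scikit-learn) on the same data.
    val referenceLosses = Array(0.6931, 0.5123)
    val gbt = new GBTClassifier()
      .setMaxDepth(2)
      .setMaxIter(2)
      .setLossType("logistic")
    val model = gbt.fit(trainData.toDF)
    val eval = model.evaluateEachIteration(validationData.toDF)
    assert(eval.length === referenceLosses.length)
    eval.zip(referenceLosses).foreach { case (actual, expected) =>
      assert(actual ~== expected relTol 1e-3)
    }
  }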
[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/21097#discussion_r184763088

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala ---
@@ -269,6 +269,21 @@ class GBTRegressionModel private[ml](
     new OldGBTModel(OldAlgo.Regression, _trees.map(_.toOld), _treeWeights)
   }

+  /**
+   * Method to compute error or loss for every iteration of gradient boosting.
+   *
+   * @param dataset Dataset for validation.
+   */
+  @Since("2.4.0")
+  def evaluateEachIteration(dataset: Dataset[_]): Array[Double] = {
--- End diff --

Do we want to support evaluation on other losses, as in the old API? It might be nice to be able to do so without having to modify the Model's loss Param value.
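A hedged sketch of the overload being suggested — only the shape of the API, not code from this PR. The `extractLabeledPoints` helper, the `convertToOldLoss` conversion, and the internal `evaluateEachIteration` signature are assumptions for illustration.

  @Since("2.4.0")
  def evaluateEachIteration(dataset: Dataset[_], loss: String): Array[Double] = {
    // Evaluate with an explicitly supplied loss ("squared", "absolute", ...)
    // instead of the model's lossType Param, mirroring the old spark.mllib API.
    val data = extractLabeledPoints(dataset)        // assumed helper
    GradientBoostedTrees.evaluateEachIteration(     // assumed internal signature
      data, trees, treeWeights, convertToOldLoss(loss), OldAlgo.Regression)
  }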
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20973#discussion_r183866393

--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@ (new file; the license header, imports, and object-level scaladoc are omitted below)
+   * @return A dataframe that contains columns of sequence and corresponding frequency.
+   */
+  @Since("2.4.0")
+  def findFrequentSequentPatterns(
+      dataset: Dataset[_],
+      sequenceCol: String,
+      minSupport: Double = 0.1,
+      maxPatternLength: Int = 10,
+      maxLocalProjDBSize: Long = 3200L): DataFrame = {
+    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
+
+    val data = dataset.select(sequenceCol)
+    val sequences = data.where(col(sequenceCol).isNotNull).rdd
+      .map(r => r.getAs[Seq[Seq[Any]]](0).map(_.toArray).toArray)
+
+    val mllibPrefixSpan = new mllibPrefixSpan()
+      .setMinSupport(minSupport)
+      .setMaxPatternLength(maxPatternLength)
+      .setMaxLocalProjDBSize(maxLocalProjDBSize)
+    if (handlePersistence) {
+      sequences.persist(StorageLevel.MEMORY_AND_DISK)
+    }
+    val rows = mllibPrefixSpan.run(sequences).freqSequences.map(f => Row(f.sequence, f.freq))
+    val schema = StructType(Seq(
+      StructField("sequence", dataset.schema(sequenceCol).dataType, nullable = false),
+      StructField("freq", LongType, nullable = false)))
--- End diff --

nit: I'd prefer to call the column "frequency"
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20973#discussion_r183864852

--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@ (same new file as above; the line being commented on is quoted below)
+  @Since("2.4.0")
+  def findFrequentSequentPatterns(
--- End diff --

rename: findFrequentSequentPatterns -> findFrequentSequent**ial**Patterns
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20973#discussion_r183865224

--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@ (same new file as above; the line being commented on is quoted below)
+    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
--- End diff --

We don't really need this handlePersistence logic here, since it's handled by the spark.mllib implementation.
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20973#discussion_r183865609

--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@ (same new file as above; the line being commented on is quoted below)
+   * @param sequenceCol the name of the sequence column in dataset
--- End diff --

It'd be nice to document that rows with nulls in this column are ignored.
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20973#discussion_r183864177

--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@ (same new file as above; the lines being commented on are quoted below)
+   * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears
+   *                         less than maxPatternLength will be output (default: `10`).
--- End diff --

Let's fix this phrasing by just saying "the maximal length of the sequential pattern." (The other part does not make sense: "any pattern that appears...") Feel free to fix that in the old API doc too.
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20973#discussion_r183864721

--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@ (same new file as above; the line being commented on is quoted below)
+   * @return A dataframe that contains columns of sequence and corresponding frequency.
--- End diff --

Be very explicit about the output schema, please: for each column, provide the name and DataType.
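A hedged sketch of what such a @return description could look like, using the column names and types visible in the PR (with the second column renamed to "frequency" per the earlier nit); the wording is illustrative, not the PR's final doc.

   * @return A `DataFrame` that contains columns of frequent sequential patterns and their
   *         corresponding frequencies:
   *          - `sequence: Seq[Seq[T]]` (T is the item type of the input sequence column)
   *          - `frequency: Long`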
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20973#discussion_r183865745

--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@ (same new file as above; the line being commented on is quoted below)
+   * @param sequenceCol the name of the sequence column in dataset
--- End diff --

You could add a unit test for that too.
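A minimal sketch of such a test, under assumptions: it uses the renamed method, passes every argument explicitly (per the default-argument comment below), assumes testImplicits are in scope, and the data and thresholds are illustrative only.

  test("PrefixSpan ignores rows with null sequences") {
    val df = Seq(
      Tuple1(Seq(Seq(1, 2), Seq(3))),
      Tuple1(Seq(Seq(1), Seq(3, 2), Seq(1, 2))),
      Tuple1(null)
    ).toDF("sequence")
    // The null row should simply be dropped rather than cause a failure.
    val result = PrefixSpan.findFrequentSequentialPatterns(
      df, "sequence", minSupport = 0.5, maxPatternLength = 5, maxLocalProjDBSize = 32000000L)
    assert(result.count() > 0)
  }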
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20973#discussion_r183863701

--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@ (same new file as above; the line being commented on is quoted below)
+      minSupport: Double = 0.1,
--- End diff --

We never want to use default arguments in Scala APIs since they are not Java-friendly. Let's just state recommended values in the docstrings. We can add defaults when we create an Estimator.
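For context, a hedged sketch of the default-free signature being asked for (not the PR's final code): Scala default arguments do not generate Java overloads, so a Java caller cannot benefit from them anyway; the recommended values move into the scaladoc instead of the parameter list.

  def findFrequentSequentialPatterns(
      dataset: Dataset[_],
      sequenceCol: String,
      minSupport: Double,        // recommended value per the docstring: 0.1
      maxPatternLength: Int,     // recommended value per the docstring: 10
      maxLocalProjDBSize: Long   // recommended value per the docstring: 3200
    ): DataFrame = {
    ??? // body unchanged from the PR; only the default arguments are removed
  }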
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20973#discussion_r183865387

--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@ (same new file as above; the line being commented on is quoted below)
+    val data = dataset.select(sequenceCol)
--- End diff --

Let's check the input schema and throw a clear exception if it's not OK.
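One possible form of that check — a sketch only; the exact validation and message text are assumptions, not part of the PR as quoted:

    val inputType = dataset.schema(sequenceCol).dataType
    require(inputType.isInstanceOf[ArrayType] &&
        inputType.asInstanceOf[ArrayType].elementType.isInstanceOf[ArrayType],
      s"The sequence column $sequenceCol must be of type Seq[Seq[_]] " +
        s"(an ArrayType of ArrayType), but was actually $inputType.")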
spark git commit: [SPARK-23990][ML] Instruments logging improvements - ML regression package
Repository: spark
Updated Branches:
  refs/heads/master 83013752e -> 379bffa05

[SPARK-23990][ML] Instruments logging improvements - ML regression package

## What changes were proposed in this pull request?

Instrumentation logging improvements for the ML regression package. This adds an `OptionalInstrumentation` class, which is used in `WeightedLeastSquares` and `IterativelyReweightedLeastSquares`.

## How was this patch tested?

N/A

Author: WeichenXu

Closes #21078 from WeichenXu123/inst_reg.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/379bffa0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/379bffa0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/379bffa0
Branch: refs/heads/master
Commit: 379bffa0525a4343f8c10e51ed192031922f9874
Parents: 8301375
Author: WeichenXu
Authored: Tue Apr 24 11:02:22 2018 -0700
Committer: Joseph K. Bradley
Committed: Tue Apr 24 11:02:22 2018 -0700

 .../ml/classification/LogisticRegression.scala  |  4 +-
 .../IterativelyReweightedLeastSquares.scala     | 18 --
 .../spark/ml/optim/WeightedLeastSquares.scala   | 32 +
 .../ml/regression/AFTSurvivalRegression.scala   |  2 +-
 .../GeneralizedLinearRegression.scala           | 14 ++--
 .../spark/ml/regression/LinearRegression.scala  | 22 ---
 .../spark/ml/tree/impl/RandomForest.scala       |  2 +
 .../apache/spark/ml/util/Instrumentation.scala  | 68 +++-
 8 files changed, 125 insertions(+), 37 deletions(-)

http://git-wip-us.apache.org/repos/asf/spark/blob/379bffa0/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index e426263..06ca37b 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -500,7 +500,7 @@ class LogisticRegression @Since("1.2.0") (

     if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)

-    val instr = Instrumentation.create(this, instances)
+    val instr = Instrumentation.create(this, dataset)
     instr.logParams(regParam, elasticNetParam, standardization, threshold,
       maxIter, tol, fitIntercept)

@@ -816,7 +816,7 @@ class LogisticRegression @Since("1.2.0") (

         if (state == null) {
           val msg = s"${optimizer.getClass.getName} failed."
-          logError(msg)
+          instr.logError(msg)
           throw new SparkException(msg)
         }

http://git-wip-us.apache.org/repos/asf/spark/blob/379bffa0/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala b/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
index 6961b45..572b8cf 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
@@ -17,9 +17,9 @@

 package org.apache.spark.ml.optim

-import org.apache.spark.internal.Logging
 import org.apache.spark.ml.feature.{Instance, OffsetInstance}
 import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.util.OptionalInstrumentation
 import org.apache.spark.rdd.RDD

 /**
@@ -61,9 +61,12 @@ private[ml] class IterativelyReweightedLeastSquares(
     val fitIntercept: Boolean,
     val regParam: Double,
     val maxIter: Int,
-    val tol: Double) extends Logging with Serializable {
+    val tol: Double) extends Serializable {

-  def fit(instances: RDD[OffsetInstance]): IterativelyReweightedLeastSquaresModel = {
+  def fit(
+      instances: RDD[OffsetInstance],
+      instr: OptionalInstrumentation = OptionalInstrumentation.create(
+        classOf[IterativelyReweightedLeastSquares])): IterativelyReweightedLeastSquaresModel = {

     var converged = false
     var iter = 0
@@ -83,7 +86,8 @@ private[ml] class IterativelyReweightedLeastSquares(

       // Estimate new model
       model = new WeightedLeastSquares(fitIntercept, regParam, elasticNetParam = 0.0,
-        standardizeFeatures = false, standardizeLabel = false).fit(newInstances)
+        standardizeFeatures = false, standardizeLabel = false)
+        .fit(newInstances, instr = instr)

       // Check convergence
       val oldCoefficients = oldModel.coefficients
@@ -96,14 +100,14 @@ private[ml] class IterativelyReweightedLeastSquares(

       if (maxTol < tol) {
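A hedged usage sketch of the pattern the hunks above introduce — not code from this commit. Optimizers now accept an `OptionalInstrumentation`, so callers without a real `Instrumentation` fall back to a class-based logger; `instances` (an `RDD[Instance]`) is assumed to be in scope, and the constructor arguments mirror the call visible in the diff.

  val optInstr = OptionalInstrumentation.create(classOf[WeightedLeastSquares])
  val wls = new WeightedLeastSquares(fitIntercept = true, regParam = 0.0,
    elasticNetParam = 0.0, standardizeFeatures = false, standardizeLabel = false)
  // Log lines emitted during fitting are routed through optInstr rather than
  // a bare Logging mixin.
  val model = wls.fit(instances, instr = optInstr)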
[GitHub] spark issue #21078: [SPARK-23990][ML] Instruments logging improvements - ML ...
Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/21078

LGTM
Merging with master
Thanks @WeichenXu123 !
spark git commit: [SPARK-23455][ML] Default Params in ML should be saved separately in metadata
Repository: spark
Updated Branches:
  refs/heads/master ce7ba2e98 -> 83013752e

[SPARK-23455][ML] Default Params in ML should be saved separately in metadata

## What changes were proposed in this pull request?

We save ML's user-supplied params and default params as one entity in metadata. When loading saved models, we set all the loaded params on the created ML model instances as user-supplied params. This causes some problems; e.g., if we strictly disallow some params to be set at the same time, a default param can fail the param check because it is treated as a user-supplied param after loading.

The loaded default params should not be set as user-supplied params. We should save ML default params separately in metadata.

For backward compatibility, when loading a metadata file written by a previous version of Spark, we shouldn't raise an error if we can't find the default-param field.

## How was this patch tested?

Pass existing tests and added tests.

Author: Liang-Chi Hsieh

Closes #20633 from viirya/save-ml-default-params.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/83013752
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/83013752
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/83013752
Branch: refs/heads/master
Commit: 83013752e3cfcbc3edeef249439ac20b143eeabc
Parents: ce7ba2e
Author: Liang-Chi Hsieh
Authored: Tue Apr 24 10:40:25 2018 -0700
Committer: Joseph K. Bradley
Committed: Tue Apr 24 10:40:25 2018 -0700

 .../classification/DecisionTreeClassifier.scala |   2 +-
 .../spark/ml/classification/GBTClassifier.scala |   4 +-
 .../spark/ml/classification/LinearSVC.scala     |   2 +-
 .../ml/classification/LogisticRegression.scala  |   2 +-
 .../MultilayerPerceptronClassifier.scala        |   2 +-
 .../spark/ml/classification/NaiveBayes.scala    |   2 +-
 .../spark/ml/classification/OneVsRest.scala     |   4 +-
 .../classification/RandomForestClassifier.scala |   4 +-
 .../spark/ml/clustering/BisectingKMeans.scala   |   2 +-
 .../spark/ml/clustering/GaussianMixture.scala   |   2 +-
 .../org/apache/spark/ml/clustering/KMeans.scala |   2 +-
 .../org/apache/spark/ml/clustering/LDA.scala    |   4 +-
 .../feature/BucketedRandomProjectionLSH.scala   |   2 +-
 .../apache/spark/ml/feature/Bucketizer.scala    |  24
 .../apache/spark/ml/feature/ChiSqSelector.scala |   2 +-
 .../spark/ml/feature/CountVectorizer.scala      |   2 +-
 .../scala/org/apache/spark/ml/feature/IDF.scala |   2 +-
 .../org/apache/spark/ml/feature/Imputer.scala   |   2 +-
 .../apache/spark/ml/feature/MaxAbsScaler.scala  |   2 +-
 .../apache/spark/ml/feature/MinHashLSH.scala    |   2 +-
 .../apache/spark/ml/feature/MinMaxScaler.scala  |   2 +-
 .../ml/feature/OneHotEncoderEstimator.scala     |   2 +-
 .../scala/org/apache/spark/ml/feature/PCA.scala |   2 +-
 .../spark/ml/feature/QuantileDiscretizer.scala  |  24
 .../org/apache/spark/ml/feature/RFormula.scala  |   6 +-
 .../spark/ml/feature/StandardScaler.scala       |   2 +-
 .../apache/spark/ml/feature/StringIndexer.scala |   2 +-
 .../apache/spark/ml/feature/VectorIndexer.scala |   2 +-
 .../org/apache/spark/ml/feature/Word2Vec.scala  |   2 +-
 .../org/apache/spark/ml/fpm/FPGrowth.scala      |   2 +-
 .../org/apache/spark/ml/param/params.scala      |  13 +-
 .../apache/spark/ml/recommendation/ALS.scala    |   2 +-
 .../ml/regression/AFTSurvivalRegression.scala   |   2 +-
 .../ml/regression/DecisionTreeRegressor.scala   |   2 +-
 .../spark/ml/regression/GBTRegressor.scala      |   4 +-
 .../GeneralizedLinearRegression.scala           |   2 +-
 .../ml/regression/IsotonicRegression.scala      |   2 +-
 .../spark/ml/regression/LinearRegression.scala  |   2 +-
 .../ml/regression/RandomForestRegressor.scala   |   4 +-
 .../apache/spark/ml/tuning/CrossValidator.scala |   6 +-
 .../spark/ml/tuning/TrainValidationSplit.scala  |   6 +-
 .../org/apache/spark/ml/util/ReadWrite.scala    | 130 ---
 .../spark/ml/util/DefaultReadWriteTest.scala    |  73 ++-
 project/MimaExcludes.scala                      |   6 +
 44 files changed, 223 insertions(+), 147 deletions(-)

http://git-wip-us.apache.org/repos/asf/spark/blob/83013752/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
index 771cd4f..57797d1 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
@@ -279,7 +279,7 @@ object DecisionTreeClassificationMod
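To make the behavioral change concrete, a hedged sketch (not a test from this commit; `path` is assumed to be a temporary directory, and the specific Param used is illustrative): after this change, a value that was only ever a default should still look like a default after a save/load round trip, instead of being reported as user-supplied.

  import org.apache.spark.ml.feature.Bucketizer

  val bucketizer = new Bucketizer().setSplits(Array(0.0, 0.5, 1.0))
  bucketizer.write.overwrite().save(path)

  val loaded = Bucketizer.load(path)
  // handleInvalid was never set explicitly, so after loading it should be
  // reported as a default value, not as a user-supplied one.
  assert(loaded.isDefined(loaded.handleInvalid))
  assert(!loaded.isSet(loaded.handleInvalid))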
spark git commit: [SPARK-23975][ML] Allow Clustering to take Arrays of Double as input features
Repository: spark
Updated Branches:
  refs/heads/master 55c4ca88a -> 2a24c481d

[SPARK-23975][ML] Allow Clustering to take Arrays of Double as input features

## What changes were proposed in this pull request?

- Multiple possible input types are now accepted in validateAndTransformSchema() and computeCost() when checking the column type
- Add an if statement in transform() to support array type as featuresCol
- Add a case statement in fit() when selecting columns from the dataset

These changes will be applied to KMeans first, then to other clustering methods.

## How was this patch tested?

A unit test is added.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Lu WANG

Closes #21081 from ludatabricks/SPARK-23975.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2a24c481
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2a24c481
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2a24c481
Branch: refs/heads/master
Commit: 2a24c481da3f30b510deb62e5cf21c9463cf250c
Parents: 55c4ca8
Author: Lu WANG
Authored: Tue Apr 24 09:25:41 2018 -0700
Committer: Joseph K. Bradley
Committed: Tue Apr 24 09:25:41 2018 -0700

 .../org/apache/spark/ml/clustering/KMeans.scala | 32 +++---
 .../org/apache/spark/ml/util/DatasetUtils.scala | 63
 .../spark/ml/clustering/KMeansSuite.scala       | 38
 3 files changed, 126 insertions(+), 7 deletions(-)

http://git-wip-us.apache.org/repos/asf/spark/blob/2a24c481/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
index 1ad157a..d475c72 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
@@ -33,8 +33,8 @@ import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
 import org.apache.spark.mllib.linalg.VectorImplicits._
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
-import org.apache.spark.sql.functions.{col, udf}
-import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.functions.udf
+import org.apache.spark.sql.types.{ArrayType, DoubleType, FloatType, IntegerType, StructType}
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.util.VersionUtils.majorVersion

@@ -87,12 +87,23 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe
   def getInitSteps: Int = $(initSteps)

   /**
+   * Validates the input schema.
+   * @param schema input schema
+   */
+  private[clustering] def validateSchema(schema: StructType): Unit = {
+    val typeCandidates = List( new VectorUDT,
+      new ArrayType(DoubleType, false),
+      new ArrayType(FloatType, false))
+
+    SchemaUtils.checkColumnTypes(schema, $(featuresCol), typeCandidates)
+  }
+  /**
    * Validates and transforms the input schema.
    * @param schema input schema
    * @return output schema
    */
   protected def validateAndTransformSchema(schema: StructType): StructType = {
-    SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+    validateSchema(schema)
     SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
   }
 }

@@ -125,8 +136,11 @@ class KMeansModel private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     transformSchema(dataset.schema, logging = true)
+
     val predictUDF = udf((vector: Vector) => predict(vector))
-    dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol))))
+
+    dataset.withColumn($(predictionCol),
+      predictUDF(DatasetUtils.columnToVector(dataset, getFeaturesCol)))
   }

   @Since("1.5.0")
@@ -146,8 +160,10 @@ class KMeansModel private[ml] (
   // TODO: Replace the temp fix when we have proper evaluators defined for clustering.
   @Since("2.0.0")
   def computeCost(dataset: Dataset[_]): Double = {
-    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
-    val data: RDD[OldVector] = dataset.select(col($(featuresCol))).rdd.map {
+    validateSchema(dataset.schema)
+
+    val data: RDD[OldVector] = dataset.select(DatasetUtils.columnToVector(dataset, getFeaturesCol))
+      .rdd.map {
       case Row(point: Vector) => OldVectors.fromML(point)
     }
     parentModel.computeCost(data)
@@ -335,7 +351,9 @@ class KMeans @Since("1.5.0") (
     transformSchema(dataset.schema, logging = true)

     val handlePersistence = dataset.storageLevel == StorageLevel.NONE
-    val instances: RDD[OldVector] = dataset.select(col($(feat
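A hedged usage sketch of the new capability (not code from the commit; assumes `spark.implicits._` is in scope, and the data and column name are illustrative):

  import org.apache.spark.ml.clustering.KMeans

  // A features column of ArrayType(DoubleType) instead of VectorUDT.
  val df = Seq(
    Tuple1(Array(0.0, 0.0)),
    Tuple1(Array(0.1, 0.1)),
    Tuple1(Array(9.0, 9.0)),
    Tuple1(Array(9.1, 9.1))
  ).toDF("array_features")

  val kmeans = new KMeans()
    .setK(2)
    .setSeed(1L)
    .setFeaturesCol("array_features")
  val model = kmeans.fit(df)              // accepted after this change
  val withPredictions = model.transform(df)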
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/21081#discussion_r183796459

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -86,13 +86,23 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe
   @Since("1.5.0")
   def getInitSteps: Int = $(initSteps)

+  /**
+   * Validates the input schema.
+   * @param schema input schema
+   */
+  private[clustering] def validateSchema(schema: StructType): Unit = {
+    val typeCandidates = List( new VectorUDT,
+      new ArrayType(DoubleType, false),
+      new ArrayType(FloatType, false))
+    SchemaUtils.checkColumnTypes(schema, $(featuresCol), typeCandidates)
+  }
   /**
--- End diff --

Ping: There needs to be a newline between the "}" of the previous method and the "/**" Scaladoc of the next method. Please start checking for this.
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/21081#discussion_r183797106

--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/DatasetUtils.scala ---
@@ -27,28 +26,38 @@ private[spark] object DatasetUtils { (only the lines around the comment are quoted)
       case fdt: ArrayType =>
         val transferUDF = fdt.elementType match {
           case _: FloatType => udf(f = (vector: Seq[Float]) => {
-            val featureArray = Array.fill[Double](vector.size)(0.0)
-            vector.indices.foreach(idx => featureArray(idx) = vector(idx).toDouble)
-            Vectors.dense(featureArray)
+            val inputArray = Array.fill[Double](vector.size)(0.0)
+            vector.indices.foreach(idx => inputArray(idx) = vector(idx).toDouble)
+            Vectors.dense(inputArray)
           })
           case _: DoubleType => udf((vector: Seq[Double]) => {
             Vectors.dense(vector.toArray)
           })
+          case other =>
--- End diff --

Thanks! I forgot about this since this was generalized.
[GitHub] spark pull request #20319: [SPARK-22884][ML][TESTS] ML test for StructuredSt...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20319#discussion_r183565892

--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala ---
@@ -20,13 +20,14 @@ package org.apache.spark.ml.clustering
 import org.apache.spark.{SparkException, SparkFunSuite}
 import org.apache.spark.ml.linalg.{Vector, Vectors}
 import org.apache.spark.ml.param.ParamMap
-import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils}
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest, MLTestingUtils}
 import org.apache.spark.mllib.clustering.DistanceMeasure
 import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.Dataset

-class BisectingKMeansSuite
-  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+class BisectingKMeansSuite extends MLTest with DefaultReadWriteTest {
+
+  import Encoders._
--- End diff --

`import testImplicits._` instead
[GitHub] spark pull request #20319: [SPARK-22884][ML][TESTS] ML test for StructuredSt...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20319#discussion_r183565968

--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala ---
@@ -65,10 +66,12 @@ class BisectingKMeansSuite
     // Verify fit does not fail on very sparse data
     val model = bkm.fit(sparseDataset)
-    val result = model.transform(sparseDataset)
-    val numClusters = result.select("prediction").distinct().collect().length
-    // Verify we hit the edge case
-    assert(numClusters < k && numClusters > 1)
+
+    testTransformerByGlobalCheckFunc[Vector](sparseDataset.toDF(), model, "prediction") { rows =>
--- End diff --

Use `Tuple1[Vector]` instead of `Vector`
[GitHub] spark pull request #20319: [SPARK-22884][ML][TESTS] ML test for StructuredSt...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20319#discussion_r182180564

--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/Encoders.scala ---
@@ -0,0 +1,25 @@ (new file; the license header is omitted below)
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.ml.linalg.Vector
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+
+private[clustering] object Encoders {
+  implicit val vectorEncoder = ExpressionEncoder[Vector]()
--- End diff --

Thanks for asking; you shouldn't need to do this. I'll comment on BisectingKMeansSuite.scala about using testImplicits instead. You basically just need to import testImplicits._ and use Tuple1 for the type param for testTransformer.
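A hedged sketch combining the two comments above — not code from the PR; `sparseDataset`, `model`, and `k` are assumed to be in scope as in the existing test:

    import testImplicits._

    testTransformerByGlobalCheckFunc[Tuple1[Vector]](sparseDataset.toDF(), model, "prediction") {
      rows =>
        val numClusters = rows.map(_.getAs[Int]("prediction")).distinct.length
        // Verify we hit the edge case
        assert(numClusters < k && numClusters > 1)
    }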
[GitHub] spark pull request #20319: [SPARK-22884][ML][TESTS] ML test for StructuredSt...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20319#discussion_r183566557

--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala ---
@@ -102,17 +105,14 @@ class BisectingKMeansSuite
     val model = bkm.fit(dataset)
     assert(model.clusterCenters.length === k)

-    val transformed = model.transform(dataset)
-    val expectedColumns = Array("features", predictionColName)
-    expectedColumns.foreach { column =>
-      assert(transformed.columns.contains(column))
+    testTransformerByGlobalCheckFunc[Vector](dataset.toDF(), model,
+      "features", predictionColName) { rows =>
+      val clusters = rows.map(_.getAs[Int](predictionColName)).toSet
+      assert(clusters.size === k)
+      assert(clusters === Set(0, 1, 2, 3, 4))
+      assert(model.computeCost(dataset) < 0.1)
--- End diff --

These checks which do not use "rows" should go outside of testTransformerByGlobalCheckFunc.
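A hedged sketch of the restructuring being asked for (the Tuple1 type parameter follows the earlier comments; `dataset`, `model`, `k`, and `predictionColName` are assumed to be in scope): only row-based checks stay inside the callback, while model-level assertions move outside it.

    testTransformerByGlobalCheckFunc[Tuple1[Vector]](dataset.toDF(), model,
      "features", predictionColName) { rows =>
      val clusters = rows.map(_.getAs[Int](predictionColName)).toSet
      assert(clusters.size === k)
      assert(clusters === Set(0, 1, 2, 3, 4))
    }
    // This check does not depend on the transformed rows, so it runs outside.
    assert(model.computeCost(dataset) < 0.1)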
[GitHub] spark issue #20633: [SPARK-23455][ML] Default Params in ML should be saved s...
Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/20633

Sorry for the pause in review.

LGTM
Merging with master

@dbtsai I'm going to merge this since I'm worried it will collect more conflicts, but let's discuss more if needed.

@viirya We'll need to update Python's DefaultParamsReader as well for Spark 2.4 in order to keep it in sync with Scala/Java. R thankfully should not require anything since it only has wrappers. I'll make & link a JIRA. Will you have time to work on that?

Thanks @viirya !
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/21081#discussion_r183555957

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -86,13 +86,23 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe
+  private[clustering] def validateSchema(schema: StructType): Unit = {
+    val typeCandidates = List( new VectorUDT,
+      new ArrayType(DoubleType, false),
+      new ArrayType(FloatType, false))
+    SchemaUtils.checkColumnTypes(schema, $(featuresCol), typeCandidates)
+  }
   /**
--- End diff --

scala style: always put a newline between methods
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/21081#discussion_r183556811

--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/DatasetUtils.scala ---
@@ -0,0 +1,54 @@ (new file; the license header and imports are omitted below)
+private[spark] object DatasetUtils {
+
+  /**
+   * preprocessing the input feature column to Vector
+   * @param dataset DataFrame with columns for features
+   * @param colName column name for features
+   * @return Vector feature column
+   */
--- End diff --

This is a bit unclear. How about: "Cast a column in a Dataset to Vector type." Also, this isn't specific to features, so please clarify that below. Finally, the key thing to document is the list of supported input types, so I'd add that.
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r183558105 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -199,6 +201,47 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(e.getCause.getMessage.contains("Cosine distance is not defined")) } + test("KMean with Array input") { +val featuresColNameD = "array_double_features" +val featuresColNameF = "array_float_features" + +val doubleUDF = udf { (features: Vector) => + val featureArray = Array.fill[Double](features.size)(0.0) + features.foreachActive((idx, value) => featureArray(idx) = value.toFloat) + featureArray +} +val floatUDF = udf { (features: Vector) => + val featureArray = Array.fill[Float](features.size)(0.0f) + features.foreachActive((idx, value) => featureArray(idx) = value.toFloat) + featureArray +} + +val newdatasetD = dataset.withColumn(featuresColNameD, doubleUDF(col("features"))) + .drop("features") +val newdatasetF = dataset.withColumn(featuresColNameF, floatUDF(col("features"))) + .drop("features") + +assert(newdatasetD.schema(featuresColNameD).dataType.equals(new ArrayType(DoubleType, false))) +assert(newdatasetF.schema(featuresColNameF).dataType.equals(new ArrayType(FloatType, false))) + +val kmeansD = new KMeans().setK(k).setFeaturesCol(featuresColNameD).setSeed(1) +val kmeansF = new KMeans().setK(k).setFeaturesCol(featuresColNameF).setSeed(1) +val modelD = kmeansD.fit(newdatasetD) +val modelF = kmeansF.fit(newdatasetF) + +val transformedD = modelD.transform(newdatasetD) +val transformedF = modelF.transform(newdatasetF) + +val predictDifference = transformedD.select("prediction") + .except(transformedF.select("prediction")) + +assert(predictDifference.count() == 0) + +assert(modelD.computeCost(newdatasetD) == modelF.computeCost(newdatasetF) ) + --- End diff -- nit: remove unnecessary newline --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r183557655 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/DatasetUtils.scala --- @@ -0,0 +1,54 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.util + +import org.apache.spark.annotation.Since +import org.apache.spark.ml.linalg.{Vectors, VectorUDT} +import org.apache.spark.sql.{Column, Dataset} +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{ArrayType, DoubleType, FloatType} + + +private[spark] object DatasetUtils { + + /** + * preprocessing the input feature column to Vector + * @param dataset DataFrame with columns for features + * @param colName column name for features + * @return Vector feature column --- End diff -- Add a note that this returned Column does not have Metadata --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r183556424 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/DatasetUtils.scala --- @@ -0,0 +1,54 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.util + +import org.apache.spark.annotation.Since +import org.apache.spark.ml.linalg.{Vectors, VectorUDT} +import org.apache.spark.sql.{Column, Dataset} +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{ArrayType, DoubleType, FloatType} + + +private[spark] object DatasetUtils { + + /** + * preprocessing the input feature column to Vector + * @param dataset DataFrame with columns for features + * @param colName column name for features + * @return Vector feature column + */ + @Since("2.4.0") --- End diff -- Don't add Since annotations to private APIs. They can get Since annotations when they are made public. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r183558056 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -199,6 +201,47 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(e.getCause.getMessage.contains("Cosine distance is not defined")) } + test("KMean with Array input") { +val featuresColNameD = "array_double_features" +val featuresColNameF = "array_float_features" + +val doubleUDF = udf { (features: Vector) => + val featureArray = Array.fill[Double](features.size)(0.0) + features.foreachActive((idx, value) => featureArray(idx) = value.toFloat) + featureArray +} +val floatUDF = udf { (features: Vector) => + val featureArray = Array.fill[Float](features.size)(0.0f) + features.foreachActive((idx, value) => featureArray(idx) = value.toFloat) + featureArray +} + +val newdatasetD = dataset.withColumn(featuresColNameD, doubleUDF(col("features"))) + .drop("features") +val newdatasetF = dataset.withColumn(featuresColNameF, floatUDF(col("features"))) + .drop("features") + +assert(newdatasetD.schema(featuresColNameD).dataType.equals(new ArrayType(DoubleType, false))) +assert(newdatasetF.schema(featuresColNameF).dataType.equals(new ArrayType(FloatType, false))) + +val kmeansD = new KMeans().setK(k).setFeaturesCol(featuresColNameD).setSeed(1) --- End diff -- Also do: `setMaxIter(1)` to make this a little faster. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21078: [SPARK-23990][ML] Instruments logging improvement...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21078#discussion_r183514966 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala --- @@ -157,34 +161,55 @@ private[spark] object Instrumentation { } -private[spark] class OptionalInstrument private ( -val instrument: Option[Instrumentation[_ <: Estimator[_]]], +/** + * A small wrapper that contains an optional `Instrumentation` object. + * Provide some log methods, if the containing `Instrumentation` object is defined, + * will log via it, otherwise will log via common logger. + */ +private[spark] class OptionalInstrumentation private( +val instrumentation: Option[Instrumentation[_ <: Estimator[_]]], val className: String) extends Logging { - def this(instr: Instrumentation[_ <: Estimator[_]]) = this(Some(instr), "") - - def this(clazz: Class[_]) = this(None, clazz.getName.stripSuffix("$")) - - protected override def logName = className + protected override def logName: String = className override def logInfo(msg: => String) { -instrument match { +instrumentation match { case Some(instr) => instr.logInfo(msg) case None => super.logInfo(msg) } } override def logWarning(msg: => String) { -instrument match { +instrumentation match { case Some(instr) => instr.logWarning(msg) case None => super.logWarning(msg) } } override def logError(msg: => String) { -instrument match { +instrumentation match { case Some(instr) => instr.logError(msg) case None => super.logError(msg) } } } + +private[spark] object OptionalInstrumentation { + + /** + * Creates an `OptionalInstrumentation` object from an existing `Instrumentation` object. + */ + def create(instr: Instrumentation[_ <: Estimator[_]]): OptionalInstrumentation = { +new OptionalInstrumentation(Some(instr), + classOf[Instrumentation[_]].getName.stripSuffix("$")) --- End diff -- This is going to log using className = Instrumentation, which is not very informative. It should log using the Estimator class name. It might be easier to separate logName from className. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
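One hedged sketch of a fix, assuming `Instrumentation` exposes the estimator it wraps (if it does not, the estimator's class name could be passed in explicitly): the wrapping `create()` derives the log name from the estimator rather than from the `Instrumentation` class, which keeps the log name meaningful without tying it to `className`.

```scala
private[spark] object OptionalInstrumentation {

  /**
   * Wraps an existing Instrumentation and logs under the estimator's class name
   * rather than under the Instrumentation class itself.
   */
  def create(instr: Instrumentation[_ <: Estimator[_]]): OptionalInstrumentation = {
    // Assumes instr.estimator is accessible; otherwise pass the class in explicitly.
    new OptionalInstrumentation(Some(instr),
      instr.estimator.getClass.getName.stripSuffix("$"))
  }

  /** No Instrumentation available: logs under the given class's name. */
  def create(clazz: Class[_]): OptionalInstrumentation = {
    new OptionalInstrumentation(None, clazz.getName.stripSuffix("$"))
  }
}
```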
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r182925491 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -144,8 +168,23 @@ class KMeansModel private[ml] ( // TODO: Replace the temp fix when we have proper evaluators defined for clustering. @Since("2.0.0") def computeCost(dataset: Dataset[_]): Double = { -SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT) -val data: RDD[OldVector] = dataset.select(col($(featuresCol))).rdd.map { +val typeCandidates = List( new VectorUDT, --- End diff -- You can reuse validateAndTransformSchema here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
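A minimal sketch of the reuse, assuming `validateAndTransformSchema` has already been extended with the array feature types; the conversion itself can go through the `featureToVector` helper from this PR so the accepted type list lives in one place:

```scala
@Since("2.0.0")
def computeCost(dataset: Dataset[_]): Double = {
  // Reuse the shared schema check instead of re-listing the accepted feature types here.
  validateAndTransformSchema(dataset.schema)
  val data: RDD[OldVector] = dataset.select(featureToVector(dataset, col($(featuresCol)))).rdd.map {
    case Row(point: Vector) => OldVectors.fromML(point)
  }
  parentModel.computeCost(data)
}
```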
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r182925299 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -305,15 +344,45 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + @Since("2.4.0") + def featureToVector(dataset: Dataset[_], col: Column): Column = { --- End diff -- Is this a copy of the same method? It should be shared, either in KMeansParams or in a static (object) method. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r182924819 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -120,11 +123,32 @@ class KMeansModel private[ml] ( @Since("2.0.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) + @Since("2.4.0") + def featureToVector(dataset: Dataset[_], col: Column): Column = { --- End diff -- Make this private. In general, we try to keep APIs as private as possible since that allows us more flexibility to make changes in the future. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r182924903 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -120,11 +123,32 @@ class KMeansModel private[ml] ( @Since("2.0.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) + @Since("2.4.0") + def featureToVector(dataset: Dataset[_], col: Column): Column = { +val featuresDataType = dataset.schema(getFeaturesCol).dataType +val transferUDF = featuresDataType match { + case _: VectorUDT => udf((vector: Vector) => vector) --- End diff -- Just return ```col(getFeaturesCol)``` since that will be more efficient. (Calling a UDF requires data serialization overhead.) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
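A hedged sketch of what the method could look like with that change, where only the array cases go through a UDF:

```scala
private def featureToVector(dataset: Dataset[_], col: Column): Column = {
  dataset.schema(getFeaturesCol).dataType match {
    case _: VectorUDT =>
      // Already a Vector column: return it as-is and avoid UDF serialization overhead.
      col
    case ArrayType(DoubleType, _) =>
      udf((features: Seq[Double]) => Vectors.dense(features.toArray)).apply(col)
    case ArrayType(FloatType, _) =>
      udf((features: Seq[Float]) => Vectors.dense(features.map(_.toDouble).toArray)).apply(col)
  }
}
```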
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r182925210 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -120,11 +123,32 @@ class KMeansModel private[ml] ( @Since("2.0.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) + @Since("2.4.0") + def featureToVector(dataset: Dataset[_], col: Column): Column = { --- End diff -- Also, add a Scala docstring saying what this does. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21081 @WeichenXu123 A generic vector class would be interesting, but that would be a big project, way out of scope of this PR. You could bring it up if that person on the dev list sends a SPIP about linear algebra. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21078: [SPARK-23990][ML] Instruments logging improvement...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21078#discussion_r182836091 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -378,18 +378,24 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String val yMean = ySummarizer.mean(0) val rawYStd = math.sqrt(ySummarizer.variance(0)) + +instr.logNamedValue(Instrumentation.loggerTags.numExamples, ySummarizer.count) --- End diff -- It might be worth creating a logNumExamples() method in Instrumentation since I bet we'll add this elsewhere. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
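A minimal sketch of such a helper, assuming `logNamedValue` already accepts a `Long` as in the call above:

```scala
// In Instrumentation:
def logNumExamples(num: Long): Unit = {
  logNamedValue(Instrumentation.loggerTags.numExamples, num)
}
```

Call sites like the one above would then become `instr.logNumExamples(ySummarizer.count)`.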
[GitHub] spark pull request #21078: [SPARK-23990][ML] Instruments logging improvement...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21078#discussion_r182834828 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala --- @@ -150,3 +156,35 @@ private[spark] object Instrumentation { } } + +private[spark] class OptionalInstrument private ( +val instrument: Option[Instrumentation[_ <: Estimator[_]]], +val className: String) extends Logging { + + def this(instr: Instrumentation[_ <: Estimator[_]]) = this(Some(instr), "") --- End diff -- Also, let's set logName in the constructor so that, when we create() this with an Instrumentation instance, we can match the Instrumentation instance's logName: instr.getClass.getName.stripSuffix(...) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21078: [SPARK-23990][ML] Instruments logging improvement...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21078#discussion_r182835834 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -326,7 +326,7 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String Instance(label, weight, features) } -val instr = Instrumentation.create(this, dataset) +val instr = Instrumentation.create(this, instances) --- End diff -- Why this change? Is this to match other things inheriting from Predictor? Let's keep dataset since I hope (maybe in 3.0) we can switch everything to create Instrumentation with Datasets instead of RDDs (since Datasets have more info available). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21078: [SPARK-23990][ML] Instruments logging improvement...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21078#discussion_r182833013 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala --- @@ -150,3 +156,35 @@ private[spark] object Instrumentation { } } + +private[spark] class OptionalInstrument private ( +val instrument: Option[Instrumentation[_ <: Estimator[_]]], +val className: String) extends Logging { + + def this(instr: Instrumentation[_ <: Estimator[_]]) = this(Some(instr), "") + + def this(clazz: Class[_]) = this(None, clazz.getName.stripSuffix("$")) + + protected override def logName = className --- End diff -- Fix IntelliJ style warning: add explicit type annotation --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21078: [SPARK-23990][ML] Instruments logging improvement...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21078#discussion_r182833153 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala --- @@ -150,3 +156,35 @@ private[spark] object Instrumentation { } } + +private[spark] class OptionalInstrument private ( --- End diff -- Stick with existing naming conventions: OptionalInstrument -> OptionalInstrumentation --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21078: [SPARK-23990][ML] Instruments logging improvement...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21078#discussion_r182833420 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala --- @@ -150,3 +156,35 @@ private[spark] object Instrumentation { } } + +private[spark] class OptionalInstrument private ( +val instrument: Option[Instrumentation[_ <: Estimator[_]]], --- End diff -- instrument -> either `instrumentation` or `instr` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21078: [SPARK-23990][ML] Instruments logging improvement...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21078#discussion_r182833731 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala --- @@ -150,3 +156,35 @@ private[spark] object Instrumentation { } } + +private[spark] class OptionalInstrument private ( +val instrument: Option[Instrumentation[_ <: Estimator[_]]], +val className: String) extends Logging { + + def this(instr: Instrumentation[_ <: Estimator[_]]) = this(Some(instr), "") --- End diff -- Stick with conventions: Move these constructors to object methods called `create()` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-24026][ML] Add Power Iteration Clustering to spark.ml
Repository: spark Updated Branches: refs/heads/master 6e19f7683 -> a471880af [SPARK-24026][ML] Add Power Iteration Clustering to spark.ml ## What changes were proposed in this pull request? This PR adds PowerIterationClustering as a Transformer to spark.ml. In the transform method, it calls spark.mllib's PowerIterationClustering.run() method and transforms the return value assignments (the Kmeans output of the pseudo-eigenvector) as a DataFrame (id: LongType, cluster: IntegerType). This PR is copied and modified from https://github.com/apache/spark/pull/15770 The primary author is wangmiao1981 ## How was this patch tested? This PR has 2 types of tests: * Copies of tests from spark.mllib's PIC tests * New tests specific to the spark.ml APIs Author: wm...@hotmail.com Author: wangmiao1981 Author: Joseph K. Bradley Closes #21090 from jkbradley/wangmiao1981-pic. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a471880a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a471880a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a471880a Branch: refs/heads/master Commit: a471880afbeafd4ef54c15a97e72ea7ff784a88d Parents: 6e19f76 Author: wm...@hotmail.com Authored: Thu Apr 19 09:40:20 2018 -0700 Committer: Joseph K. Bradley Committed: Thu Apr 19 09:40:20 2018 -0700 -- .../clustering/PowerIterationClustering.scala | 256 +++ .../PowerIterationClusteringSuite.scala | 238 + 2 files changed, 494 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a471880a/mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala b/mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala new file mode 100644 index 000..2c30a1d --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala @@ -0,0 +1,256 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types._ + +/** + * Common params for PowerIterationClustering + */ +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter + with HasPredictionCol { + + /** + * The number of clusters to create (k). 
Must be > 1. Default: 2. + * @group param + */ + @Since("2.4.0") + final val k = new IntParam(this, "k", "The number of clusters to create. " + +"Must be > 1.", ParamValidators.gt(1)) + + /** @group getParam */ + @Since("2.4.0") + def getK: Int = $(k) + + /** + * Param for the initialization algorithm. This can be either "random" to use a random vector + * as vertex properties, or "degree" to use a normalized sum of similarities with other vertices. + * Default: random. + * @group expertParam + */ + @Since("2.4.0") + final val initMode = { +val allowedParams = ParamValidators.inArray(Array("random", "degree")) +new Param[String](this, "initMode", "The initialization algorithm. This can be either " + + "'random' to use a random vector as vertex properties, or 'degree' to use a normalized sum " + + "of similarities with other vertices. Supported options: 'random' and 'degree
[GitHub] spark issue #21090: [SPARK-15784][ML] Add Power Iteration Clustering to spar...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21090 Thanks for reviewing this and for the LGTM @wangmiao1981 ! I'll merge with master now, with you as the primary author. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21090: [SPARK-15784][ML] Add Power Iteration Clustering ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21090#discussion_r182606716 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,256 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types._ + +/** + * Common params for PowerIterationClustering + */ +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter + with HasPredictionCol { + + /** + * The number of clusters to create (k). Must be > 1. Default: 2. + * @group param + */ + @Since("2.4.0") + final val k = new IntParam(this, "k", "The number of clusters to create. " + +"Must be > 1.", ParamValidators.gt(1)) + + /** @group getParam */ + @Since("2.4.0") + def getK: Int = $(k) + + /** + * Param for the initialization algorithm. This can be either "random" to use a random vector + * as vertex properties, or "degree" to use a normalized sum of similarities with other vertices. + * Default: random. + * @group expertParam + */ + @Since("2.4.0") + final val initMode = { +val allowedParams = ParamValidators.inArray(Array("random", "degree")) +new Param[String](this, "initMode", "The initialization algorithm. This can be either " + + "'random' to use a random vector as vertex properties, or 'degree' to use a normalized sum " + + "of similarities with other vertices. Supported options: 'random' and 'degree'.", + allowedParams) + } + + /** @group expertGetParam */ + @Since("2.4.0") + def getInitMode: String = $(initMode) + + /** + * Param for the name of the input column for vertex IDs. + * Default: "id" + * @group param + */ + @Since("2.4.0") + val idCol = new Param[String](this, "idCol", "Name of the input column for vertex IDs.", +(value: String) => value.nonEmpty) + + setDefault(idCol, "id") + + /** @group getParam */ + @Since("2.4.0") + def getIdCol: String = getOrDefault(idCol) + + /** + * Param for the name of the input column for neighbors in the adjacency list representation. 
+ * Default: "neighbors" + * @group param + */ + @Since("2.4.0") + val neighborsCol = new Param[String](this, "neighborsCol", +"Name of the input column for neighbors in the adjacency list representation.", +(value: String) => value.nonEmpty) + + setDefault(neighborsCol, "neighbors") + + /** @group getParam */ + @Since("2.4.0") + def getNeighborsCol: String = $(neighborsCol) + + /** + * Param for the name of the input column for neighbors in the adjacency list representation. + * Default: "similarities" + * @group param + */ + @Since("2.4.0") + val similaritiesCol = new Param[String](this, "similaritiesCol", +"Name of the input column for neighbors in the adjacency list representation.", +(value: String) => value.nonEmpty) +
[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21081 I hope we can apply it to other algs too. @ludatabricks is doing some refactoring which should make that easier, but we're not going for a completely general approach right away. I don't think we need to worry about sparse FloatType features; users have no way to pass those in. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21078: [SPARK-23990][ML] Instruments logging improvements - ML ...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21078 Thanks for thinking through the optional logging issue! I responded in the JIRA to preserve the design discussion there: https://issues.apache.org/jira/browse/SPARK-23990 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r182269723 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -123,7 +128,21 @@ class KMeansModel private[ml] ( @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { transformSchema(dataset.schema, logging = true) -val predictUDF = udf((vector: Vector) => predict(vector)) +// val predictUDF = udf((vector: Vector) => predict(vector)) +val predictUDF = if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) { + udf((vector: Vector) => predict(vector)) +} +else { + udf((vector: Seq[_]) => { +val featureArray = Array.fill[Double](vector.size)(0.0) --- End diff -- Here's what I meant: ``` val predictUDF = featuresDataType match { case _: VectorUDT => udf((vector: Vector) => predict(vector)) case fdt: ArrayType => fdt.elementType match { case _: FloatType => ??? case _: DoubleType => ??? } } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
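Filling in the `???` placeholders, one way the complete match could look (a sketch only; the Float case converts element-wise to Double before building the dense vector):

```scala
val predictUDF = dataset.schema($(featuresCol)).dataType match {
  case _: VectorUDT =>
    udf((vector: Vector) => predict(vector))
  case fdt: ArrayType => fdt.elementType match {
    case FloatType =>
      udf((features: Seq[Float]) => predict(Vectors.dense(features.map(_.toDouble).toArray)))
    case DoubleType =>
      udf((features: Seq[Double]) => predict(Vectors.dense(features.toArray)))
  }
}
```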
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r182269644 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -123,7 +128,21 @@ class KMeansModel private[ml] ( @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { transformSchema(dataset.schema, logging = true) -val predictUDF = udf((vector: Vector) => predict(vector)) +// val predictUDF = udf((vector: Vector) => predict(vector)) +val predictUDF = if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) { + udf((vector: Vector) => predict(vector)) --- End diff -- Side note: I realized that "predict" will cause the whole model to be serialized and sent to workers. But that's actually OK since we do need to send most of the model data to make predictions and since there's not a clean way to just send the model weights. So I think my previous comment about copying "numClasses" to a local variable was not necessary. Don't bother reverting the change though. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r182216415 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -90,7 +90,12 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe * @return output schema */ protected def validateAndTransformSchema(schema: StructType): StructType = { -SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT) +val typeCandidates = List( new VectorUDT, + new ArrayType(DoubleType, true), --- End diff -- Also, IntelliJ may suggest passing these boolean arguments as named arguments; that'd be nice to fix here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r182215434 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -123,7 +128,21 @@ class KMeansModel private[ml] ( @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { transformSchema(dataset.schema, logging = true) -val predictUDF = udf((vector: Vector) => predict(vector)) +// val predictUDF = udf((vector: Vector) => predict(vector)) +val predictUDF = if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) { + udf((vector: Vector) => predict(vector)) +} --- End diff -- Scala style: } else { --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r182216309 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -90,7 +90,12 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe * @return output schema */ protected def validateAndTransformSchema(schema: StructType): StructType = { -SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT) +val typeCandidates = List( new VectorUDT, + new ArrayType(DoubleType, true), --- End diff -- Thinking about this, let's actually disallow nullable columns. KMeans won't handle nulls properly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r182217722 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -123,7 +128,21 @@ class KMeansModel private[ml] ( @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { transformSchema(dataset.schema, logging = true) -val predictUDF = udf((vector: Vector) => predict(vector)) +// val predictUDF = udf((vector: Vector) => predict(vector)) +val predictUDF = if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) { + udf((vector: Vector) => predict(vector)) +} +else { + udf((vector: Seq[_]) => { +val featureArray = Array.fill[Double](vector.size)(0.0) --- End diff -- You shouldn't have to do the conversion in this convoluted (and less efficient) way. I'd recommend doing a match-case statement on dataset.schema; I think that will be the most succinct. Then you can handle Vector, Seq of Float, and Seq of Double separately, without conversions to strings. Same for the similar cases below. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r182215639 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -123,7 +128,21 @@ class KMeansModel private[ml] ( @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { transformSchema(dataset.schema, logging = true) -val predictUDF = udf((vector: Vector) => predict(vector)) +// val predictUDF = udf((vector: Vector) => predict(vector)) +val predictUDF = if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) { + udf((vector: Vector) => predict(vector)) +} +else { + udf((vector: Seq[_]) => { --- End diff -- scala style: remove unnecessary ```{``` at end of line (IntelliJ should warn you about this) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15770 OK sorry to push @wangmiao1981 ! I just want to make sure this gets in before I no longer have bandwidth for it. If you have the time, would you mind checking the updates I made in the new PR? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21090: [SPARK-15784][ML] Add Power Iteration Clustering to spar...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21090 @wangmiao1981 and @WeichenXu123 would you mind taking a look? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21090: [SPARK-15784][ML] Add Power Iteration Clustering to spar...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21090 **To review this PR**: This was copied from https://github.com/apache/spark/pull/15770 with the following changes: * Addressed comments in original PR (See my review comments there) * Added Param validators for required input columns * Renamed "weights" column to "similarities" * Made algorithm take more types of inputs: Long/Int and Double/Float * Removed test("set parameters") since setters are already tested in the read/write test. If you saw the previous PR, you should be able to review this one based on the last 3 commits, viewable in this diff: https://github.com/jkbradley/spark/compare/5cb8ed6...wangmiao1981-pic --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21090: [SPARK-15784][ML] Add Power Iteration Clustering ...
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/21090 [SPARK-15784][ML] Add Power Iteration Clustering to spark.ml ## What changes were proposed in this pull request? This PR adds PowerIterationClustering as a Transformer to spark.ml. In the transform method, it calls spark.mllib's PowerIterationClustering.run() method and transforms the return value assignments (the Kmeans output of the pseudo-eigenvector) as a DataFrame (id: LongType, cluster: IntegerType). This PR is copied and modified from https://github.com/apache/spark/pull/15770 The primary author is @wangmiao1981 ## How was this patch tested? This PR has 2 types of tests: * Copies of tests from spark.mllib's PIC tests * New tests specific to the spark.ml APIs You can merge this pull request into a Git repository by running: $ git pull https://github.com/jkbradley/spark wangmiao1981-pic Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21090.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21090 commit e4492a64b74b0ccc2da8f13353d37bb9bb0c Author: wm...@hotmail.com Date: 2016-06-13T19:47:42Z add pic framework (model, class etc) commit 70862491e5b86ce4add500a0c96ae5220733b35d Author: wm...@hotmail.com Date: 2016-06-13T23:28:09Z change a comment commit b73d8a78fa69f83c278996feb1b19502ef871c5b Author: wm...@hotmail.com Date: 2016-06-17T17:27:55Z add missing functions fit predict load save etc. commit 022fe523f735c5519f948b175871489f79434fb5 Author: wm...@hotmail.com Date: 2016-06-18T01:12:41Z add unit test flie commit 552cf54fb03f88af023f080e60fa50f1f39060fc Author: wm...@hotmail.com Date: 2016-06-20T17:35:05Z add test cases part 1 commit 0b4954d55b4d344794d3c47366220c67f07d0d43 Author: wm...@hotmail.com Date: 2016-06-20T20:29:54Z add unit test part 2: test fit, parameters etc. 
commit f22b01e06eaaf5951befcebdffc18c8e519183d2 Author: wm...@hotmail.com Date: 2016-06-20T21:22:59Z fix a type issue commit 305b194dae40eaff990c18837c3f2bc8d469e60c Author: wm...@hotmail.com Date: 2016-06-21T20:07:27Z add more unit tests commit 4b32cbf02965c5c1a0c094fa144836dab0dfd543 Author: wm...@hotmail.com Date: 2016-06-21T21:46:25Z delete unused import and add comments commit f6eda88a6c0af416b988a2c37f46c8b08e5e99cf Author: wm...@hotmail.com Date: 2016-10-25T21:28:12Z change version to 2.1.0 commit 45c4b1cd1afa28c775c666b57ecee614ed9a41d0 Author: wm...@hotmail.com Date: 2016-11-03T23:26:01Z change PIC as a Transformer commit e8d7ed37138909d010a812fba7d03ef30a4f6e05 Author: wm...@hotmail.com Date: 2016-11-04T17:28:26Z add LabelCol commit e4e1e055a9b3ab54b83331ac7dc56d6b792dcf7b Author: wm...@hotmail.com Date: 2016-11-04T18:36:09Z change col implementation commit 8384422ec0e7192cc8ce53df02ddb4ae0401fd0b Author: wm...@hotmail.com Date: 2017-02-17T22:20:00Z address some of the comments commit d6a199c48ff940861d80caf275da29d99375ce33 Author: wm...@hotmail.com Date: 2017-02-21T22:37:51Z add additional test with dataset having more data commit b0c3aff4a76ace99c104c2b2c10c9485a028bfd6 Author: wm...@hotmail.com Date: 2017-03-14T23:13:45Z change input data format commit 091225dd2f1c353edc28dc4299034a018a92bc81 Author: wm...@hotmail.com Date: 2017-03-15T22:49:45Z resolve warnings commit 8bb99567556ce29c75d5f395157d0161dff695bc Author: wm...@hotmail.com Date: 2017-03-16T18:33:47Z add neighbor and weight cols commit 8ba82e8392e6d607ab750ed8eb3caaf386e1352a Author: wangmiao1981 Date: 2017-08-15T21:13:55Z address review comments 1 commit 468a94741efe6530c9acfbb1af4f46499550ee1f Author: wangmiao1981 Date: 2017-08-15T21:23:39Z fix style commit ec10f24336ff51354a1657c7ceadb9ada8cd1484 Author: wangmiao1981 Date: 2017-08-15T22:30:28Z remove unused comments commit 5710cfcf2e3596c95f353ce043f7358a030d70a0 Author: wangmiao1981 Date: 2017-08-15T23:43:14Z add Since commit 88654b3055ebd863e3b3c5774abdce28f3cda184 Author: wangmiao1981 Date: 2017-08-17T00:12:12Z fix missing > commit 804adc6fece91e7264f315ee965faa40c5e334c5 Author: wangmiao1981 Date: 2017-08-17T17:26:40Z fix doc commit 4a6dd79a9c37f71ea4378692438f19b3247b7913 Author: wangmiao1981 Date: 2017-10-25T23:16:55Z address review comments commit 5cb8ed6de3865f58719b3b30888b3bc4542905d4 Author: wangmiao1981 Date: 2017-10-30T21:44:24Z fix unit test commit 6abf6023868d944068a26186cde3fbadffd83a74 Author: Joseph K. Bradley Date: 2018-04-03T23:46:40Z cleanups to docs commit d9270876797153d7660843fc621e707b4dff71ca Author: Joseph K. Bradley Date: 2018-04-03T23:52:36Z typo commit d2157489770a79fe443d567bfc0
[GitHub] spark issue #20319: [SPARK-22884][ML][TESTS] ML test for StructuredStreaming...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20319 Reviewing now! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-21741][ML][PYSPARK] Python API for DataFrame-based multivariate summarizer
Repository: spark Updated Branches: refs/heads/master f39e82ce1 -> 1ca3c50fe [SPARK-21741][ML][PYSPARK] Python API for DataFrame-based multivariate summarizer ## What changes were proposed in this pull request? Python API for DataFrame-based multivariate summarizer. ## How was this patch tested? doctest added. Author: WeichenXu Closes #20695 from WeichenXu123/py_summarizer. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1ca3c50f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1ca3c50f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1ca3c50f Branch: refs/heads/master Commit: 1ca3c50fefb34532c78427fa74872db3ecbf7ba2 Parents: f39e82c Author: WeichenXu Authored: Tue Apr 17 10:11:08 2018 -0700 Committer: Joseph K. Bradley Committed: Tue Apr 17 10:11:08 2018 -0700 -- python/pyspark/ml/stat.py | 193 - 1 file changed, 192 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1ca3c50f/python/pyspark/ml/stat.py -- diff --git a/python/pyspark/ml/stat.py b/python/pyspark/ml/stat.py index 93d0f4f..a06ab31 100644 --- a/python/pyspark/ml/stat.py +++ b/python/pyspark/ml/stat.py @@ -19,7 +19,9 @@ import sys from pyspark import since, SparkContext from pyspark.ml.common import _java2py, _py2java -from pyspark.ml.wrapper import _jvm +from pyspark.ml.wrapper import JavaWrapper, _jvm +from pyspark.sql.column import Column, _to_seq +from pyspark.sql.functions import lit class ChiSquareTest(object): @@ -195,6 +197,195 @@ class KolmogorovSmirnovTest(object): _jvm().PythonUtils.toSeq(params))) +class Summarizer(object): +""" +.. note:: Experimental + +Tools for vectorized statistics on MLlib Vectors. +The methods in this package provide various statistics for Vectors contained inside DataFrames. +This class lets users pick the statistics they would like to extract for a given column. + +>>> from pyspark.ml.stat import Summarizer +>>> from pyspark.sql import Row +>>> from pyspark.ml.linalg import Vectors +>>> summarizer = Summarizer.metrics("mean", "count") +>>> df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 1.0, 1.0)), +... Row(weight=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF() +>>> df.select(summarizer.summary(df.features, df.weight)).show(truncate=False) ++---+ +|aggregate_metrics(features, weight)| ++---+ +|[[1.0,1.0,1.0], 1] | ++---+ + +>>> df.select(summarizer.summary(df.features)).show(truncate=False) +++ +|aggregate_metrics(features, 1.0)| +++ +|[[1.0,1.5,2.0], 2] | +++ + +>>> df.select(Summarizer.mean(df.features, df.weight)).show(truncate=False) ++--+ +|mean(features)| ++--+ +|[1.0,1.0,1.0] | ++--+ + +>>> df.select(Summarizer.mean(df.features)).show(truncate=False) ++--+ +|mean(features)| ++--+ +|[1.0,1.5,2.0] | ++--+ + + +.. 
versionadded:: 2.4.0 + +""" +@staticmethod +@since("2.4.0") +def mean(col, weightCol=None): +""" +return a column of mean summary +""" +return Summarizer._get_single_metric(col, weightCol, "mean") + +@staticmethod +@since("2.4.0") +def variance(col, weightCol=None): +""" +return a column of variance summary +""" +return Summarizer._get_single_metric(col, weightCol, "variance") + +@staticmethod +@since("2.4.0") +def count(col, weightCol=None): +""" +return a column of count summary +""" +return Summarizer._get_single_metric(col, weightCol, "count") + +@staticmethod +@since("2.4.0") +def numNonZeros(col, weightCol=None): +""" +return a column of numNonZero summary +""" +return Summarizer._get_single_metric(col, weightCol, "numNonZeros") + +@staticmethod +@since("2.4.0") +def max(col, weightCol=None): +""" +return a column of max summary +""" +return Summarizer._get_single_metric(col, weightCol, "max") + +@staticmethod +@since("2.4.0") +def min(col, weightCol=None): +""" +return a column of min summary +""" +return Summarizer._get_single_metric(col, weightCol, "min") + +@staticmethod +@since("2.4.0") +
[GitHub] spark issue #20695: [SPARK-21741][ML][PySpark] Python API for DataFrame-base...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20695 LGTM. Thanks for the PR! Merging with master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15770 I don't mind; I'll take it. But I'll mark @wangmiao1981 as the main contributor for the PR. Would you mind closing this issue @wangmiao1981 and I'll reopen a new PR under the same JIRA? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r181847061 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -144,7 +156,12 @@ class KMeansModel private[ml] ( // TODO: Replace the temp fix when we have proper evaluators defined for clustering. @Since("2.0.0") def computeCost(dataset: Dataset[_]): Double = { -SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT) +val typeCandidates = List( new VectorUDT, + new ArrayType(DoubleType, true), + new ArrayType(DoubleType, false), + new ArrayType(FloatType, true), + new ArrayType(FloatType, false)) +SchemaUtils.checkColumnTypes(dataset.schema, $(featuresCol), typeCandidates) val data: RDD[OldVector] = dataset.select(col($(featuresCol))).rdd.map { --- End diff -- this won't take non-Vector types though; a unit test would catch this --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r181846789 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -123,8 +128,15 @@ class KMeansModel private[ml] ( @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { transformSchema(dataset.schema, logging = true) -val predictUDF = udf((vector: Vector) => predict(vector)) -dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol +// val predictUDF = udf((vector: Vector) => predict(vector)) +if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) { --- End diff -- tip: This can be more succinct if written as: ``` val predictUDF = if (dataset.schema(...).dataType.equals(...)) { A } else { B } dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol // so this line is only written once ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r181841713 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -194,6 +195,34 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(e.getCause.getMessage.contains("Cosine distance is not defined")) } + test("KMean with Array input") { --- End diff -- This should check transform and computeCost too --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
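For example (a sketch only, reusing the names from the test above), the test could exercise the array-input path end to end:

```scala
val model = kmeans.setMaxIter(1).fit(newdataset)
// transform() should accept the array column and append a prediction column.
val transformed = model.transform(newdataset)
assert(transformed.columns.contains("prediction"))
// computeCost() should accept the array column as well and return a non-negative cost.
assert(model.computeCost(newdataset) >= 0.0)
```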
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r181847695 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -312,6 +329,8 @@ class KMeans @Since("1.5.0") ( val handlePersistence = dataset.storageLevel == StorageLevel.NONE val instances: RDD[OldVector] = dataset.select(col($(featuresCol))).rdd.map { case Row(point: Vector) => OldVectors.fromML(point) + case Row(point: Seq[_]) => + OldVectors.fromML(Vectors.dense(point.asInstanceOf[Seq[Double]].toArray)) --- End diff -- I'm not sure this will work with arrays of FloatType. Make sure to test it --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r181840894 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -194,6 +195,34 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(e.getCause.getMessage.contains("Cosine distance is not defined")) } + test("KMean with Array input") { +val featuresColName = "array_model_features" + +val arrayUDF = udf { (features: Vector) => + features.toArray +} +val newdataset = dataset.withColumn(featuresColName, arrayUDF(col("features")) ) --- End diff -- nit: You could drop the original column as well just to make extra sure that it's not being accidentally used. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r181841557 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -194,6 +195,34 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(e.getCause.getMessage.contains("Cosine distance is not defined")) } + test("KMean with Array input") { +val featuresColName = "array_model_features" + +val arrayUDF = udf { (features: Vector) => + features.toArray +} +val newdataset = dataset.withColumn(featuresColName, arrayUDF(col("features")) ) + +val kmeans = new KMeans() + .setFeaturesCol(featuresColName) + +assert(kmeans.getK === 2) +assert(kmeans.getFeaturesCol === featuresColName) +assert(kmeans.getPredictionCol === "prediction") +assert(kmeans.getMaxIter === 20) +assert(kmeans.getInitMode === MLlibKMeans.K_MEANS_PARALLEL) +assert(kmeans.getInitSteps === 2) +assert(kmeans.getTol === 1e-4) +assert(kmeans.getDistanceMeasure === DistanceMeasure.EUCLIDEAN) +val model = kmeans.setMaxIter(1).fit(newdataset) + +MLTestingUtils.checkCopyAndUids(kmeans, model) --- End diff -- ditto for hasSummary and copying --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r181841503 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -194,6 +195,34 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(e.getCause.getMessage.contains("Cosine distance is not defined")) } + test("KMean with Array input") { +val featuresColName = "array_model_features" + +val arrayUDF = udf { (features: Vector) => + features.toArray +} +val newdataset = dataset.withColumn(featuresColName, arrayUDF(col("features")) ) + +val kmeans = new KMeans() + .setFeaturesCol(featuresColName) + +assert(kmeans.getK === 2) +assert(kmeans.getFeaturesCol === featuresColName) +assert(kmeans.getPredictionCol === "prediction") +assert(kmeans.getMaxIter === 20) +assert(kmeans.getInitMode === MLlibKMeans.K_MEANS_PARALLEL) +assert(kmeans.getInitSteps === 2) +assert(kmeans.getTol === 1e-4) +assert(kmeans.getDistanceMeasure === DistanceMeasure.EUCLIDEAN) +val model = kmeans.setMaxIter(1).fit(newdataset) + +MLTestingUtils.checkCopyAndUids(kmeans, model) --- End diff -- You don't need this test here --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r181847784 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -123,8 +128,15 @@ class KMeansModel private[ml] ( @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { transformSchema(dataset.schema, logging = true) -val predictUDF = udf((vector: Vector) => predict(vector)) -dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol +// val predictUDF = udf((vector: Vector) => predict(vector)) +if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) { + val predictUDF = udf((vector: Vector) => predict(vector)) + dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol +} else { + val predictUDF = udf((vector: Seq[_]) => +predict(Vectors.dense(vector.asInstanceOf[Seq[Double]].toArray))) --- End diff -- This may not work with arrays of FloatType. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
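One possible way to tolerate `array<float>` as well as `array<double>` input — an illustrative sketch inside `KMeansModel.transform`, not necessarily the approach the patch ends up with — is to cast the column to `array<double>` before converting it to a Vector:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{ArrayType, DoubleType}

// The cast handles float -> double element conversion, so the UDF only ever sees Seq[Double].
val predictUDF = udf { (features: Seq[Double]) => predict(Vectors.dense(features.toArray)) }
dataset.withColumn($(predictionCol),
  predictUDF(col($(featuresCol)).cast(ArrayType(DoubleType))))
```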
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21081#discussion_r181840765 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -194,6 +195,34 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(e.getCause.getMessage.contains("Cosine distance is not defined")) } + test("KMean with Array input") { +val featuresColName = "array_model_features" + +val arrayUDF = udf { (features: Vector) => + features.toArray +} +val newdataset = dataset.withColumn(featuresColName, arrayUDF(col("features")) ) + +val kmeans = new KMeans() + .setFeaturesCol(featuresColName) + +assert(kmeans.getK === 2) +assert(kmeans.getFeaturesCol === featuresColName) +assert(kmeans.getPredictionCol === "prediction") --- End diff -- No need to check this or the other Params which are not relevant to this test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-21088][ML] CrossValidator, TrainValidationSplit support collect all models when fitting: Python API
Repository: spark Updated Branches: refs/heads/master 5003736ad -> 04614820e [SPARK-21088][ML] CrossValidator, TrainValidationSplit support collect all models when fitting: Python API ## What changes were proposed in this pull request? Add python API for collecting sub-models during CrossValidator/TrainValidationSplit fitting. ## How was this patch tested? UT added. Author: WeichenXu Closes #19627 from WeichenXu123/expose-model-list-py. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/04614820 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/04614820 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/04614820 Branch: refs/heads/master Commit: 04614820e103feeae91299dc90dba1dd628fd485 Parents: 5003736 Author: WeichenXu Authored: Mon Apr 16 11:31:24 2018 -0500 Committer: Joseph K. Bradley Committed: Mon Apr 16 11:31:24 2018 -0500 -- .../apache/spark/ml/tuning/CrossValidator.scala | 11 ++ .../spark/ml/tuning/TrainValidationSplit.scala | 11 ++ .../pyspark/ml/param/_shared_params_code_gen.py | 5 + python/pyspark/ml/param/shared.py | 24 + python/pyspark/ml/tests.py | 78 ++ python/pyspark/ml/tuning.py | 107 ++- python/pyspark/ml/util.py | 4 + 7 files changed, 211 insertions(+), 29 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/04614820/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala b/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala index a0b507d..c2826dc 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala @@ -270,6 +270,17 @@ class CrossValidatorModel private[ml] ( this } + // A Python-friendly auxiliary method + private[tuning] def setSubModels(subModels: JList[JList[Model[_]]]) +: CrossValidatorModel = { +_subModels = if (subModels != null) { + Some(subModels.asScala.toArray.map(_.asScala.toArray)) +} else { + None +} +this + } + /** * @return submodels represented in two dimension array. The index of outer array is the * fold index, and the index of inner array corresponds to the ordering of http://git-wip-us.apache.org/repos/asf/spark/blob/04614820/mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala b/mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala index 88ff0df..8d1b9a8 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala @@ -262,6 +262,17 @@ class TrainValidationSplitModel private[ml] ( this } + // A Python-friendly auxiliary method + private[tuning] def setSubModels(subModels: JList[Model[_]]) +: TrainValidationSplitModel = { +_subModels = if (subModels != null) { + Some(subModels.asScala.toArray) +} else { + None +} +this + } + /** * @return submodels represented in array. 
The index of array corresponds to the ordering of * estimatorParamMaps http://git-wip-us.apache.org/repos/asf/spark/blob/04614820/python/pyspark/ml/param/_shared_params_code_gen.py -- diff --git a/python/pyspark/ml/param/_shared_params_code_gen.py b/python/pyspark/ml/param/_shared_params_code_gen.py index db951d8..6e9e0a3 100644 --- a/python/pyspark/ml/param/_shared_params_code_gen.py +++ b/python/pyspark/ml/param/_shared_params_code_gen.py @@ -157,6 +157,11 @@ if __name__ == "__main__": "TypeConverters.toInt"), ("parallelism", "the number of threads to use when running parallel algorithms (>= 1).", "1", "TypeConverters.toInt"), +("collectSubModels", "Param for whether to collect a list of sub-models trained during " + + "tuning. If set to false, then only the single best sub-model will be available after " + + "fitting. If set to true, then all sub-models will be available. Warning: For large " + + "models, collecting all sub-models can cause OOMs on the Spark driver.", + "False", "TypeConverters.toBoolean"), ("loss", "the loss function to be optimized.", None, "TypeConverters.toString")] code = [] http://git-wip-us.apache.org/repos/asf/spark/blob/04614820/py
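For context, a hedged sketch of how the equivalent Scala API (which this Python change mirrors) is driven; `lr`, `grid`, `evaluator`, and `train` are hypothetical stand-ins:

```scala
import org.apache.spark.ml.Model
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel}

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEstimatorParamMaps(grid)
  .setEvaluator(evaluator)
  .setParallelism(2)
  .setCollectSubModels(true)   // keep every sub-model trained during tuning, not just the best
val cvModel: CrossValidatorModel = cv.fit(train)
// Outer index = fold, inner index = position in estimatorParamMaps
val subModels: Array[Array[Model[_]]] = cvModel.subModels
```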
[GitHub] spark issue #19627: [SPARK-21088][ML] CrossValidator, TrainValidationSplit s...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/19627 LGTM Merging with master Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #7652: [SPARK-9312] [ML] Added max confidence factor to OneVsRes...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/7652 Hi, sorry I let this PR get stale. This should be resolved now by https://github.com/apache/spark/pull/21044 so would you mind closing this issue @badriub ? Thanks though! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-9312][ML] Add RawPrediction, numClasses, and numFeatures for OneVsRestModel
Repository: spark Updated Branches: refs/heads/master 083cf2235 -> 5003736ad [SPARK-9312][ML] Add RawPrediction, numClasses, and numFeatures for OneVsRestModel add RawPrediction as output column add numClasses and numFeatures to OneVsRestModel ## What changes were proposed in this pull request? - Add two val numClasses and numFeatures in OneVsRestModel so that we can inherit from Classifier in the future - Add rawPrediction output column in transform, the prediction label in calculated by the rawPrediciton like raw2prediction ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Lu WANG Closes #21044 from ludatabricks/SPARK-9312. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5003736a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5003736a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5003736a Branch: refs/heads/master Commit: 5003736ad60c3231bb18264c9561646c08379170 Parents: 083cf22 Author: Lu WANG Authored: Mon Apr 16 11:27:30 2018 -0500 Committer: Joseph K. Bradley Committed: Mon Apr 16 11:27:30 2018 -0500 -- .../spark/ml/classification/OneVsRest.scala | 56 .../ml/classification/OneVsRestSuite.scala | 7 ++- 2 files changed, 51 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5003736a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala index f04fde2..5348d88 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala @@ -32,7 +32,7 @@ import org.apache.spark.SparkContext import org.apache.spark.annotation.Since import org.apache.spark.ml._ import org.apache.spark.ml.attribute._ -import org.apache.spark.ml.linalg.Vector +import org.apache.spark.ml.linalg.{Vector, Vectors} import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params} import org.apache.spark.ml.param.shared.{HasParallelism, HasWeightCol} import org.apache.spark.ml.util._ @@ -55,7 +55,7 @@ private[ml] trait ClassifierTypeTrait { /** * Params for [[OneVsRest]]. 
*/ -private[ml] trait OneVsRestParams extends PredictorParams +private[ml] trait OneVsRestParams extends ClassifierParams with ClassifierTypeTrait with HasWeightCol { /** @@ -138,6 +138,14 @@ final class OneVsRestModel private[ml] ( @Since("1.4.0") val models: Array[_ <: ClassificationModel[_, _]]) extends Model[OneVsRestModel] with OneVsRestParams with MLWritable { + require(models.nonEmpty, "OneVsRestModel requires at least one model for one class") + + @Since("2.4.0") + val numClasses: Int = models.length + + @Since("2.4.0") + val numFeatures: Int = models.head.numFeatures + /** @group setParam */ @Since("2.1.0") def setFeaturesCol(value: String): this.type = set(featuresCol, value) @@ -146,6 +154,10 @@ final class OneVsRestModel private[ml] ( @Since("2.1.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) + /** @group setParam */ + @Since("2.4.0") + def setRawPredictionCol(value: String): this.type = set(rawPredictionCol, value) + @Since("1.4.0") override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema, fitting = false, getClassifier.featuresDataType) @@ -181,6 +193,7 @@ final class OneVsRestModel private[ml] ( val updateUDF = udf { (predictions: Map[Int, Double], prediction: Vector) => predictions + ((index, prediction(1))) } + model.setFeaturesCol($(featuresCol)) val transformedDataset = model.transform(df).select(columns: _*) val updatedDataset = transformedDataset @@ -195,15 +208,34 @@ final class OneVsRestModel private[ml] ( newDataset.unpersist() } -// output the index of the classifier with highest confidence as prediction -val labelUDF = udf { (predictions: Map[Int, Double]) => - predictions.maxBy(_._2)._1.toDouble -} +if (getRawPredictionCol != "") { + val numClass = models.length -// output label and label metadata as prediction -aggregatedDataset - .withColumn($(predictionCol), labelUDF(col(accColName)), labelMetadata) - .drop(accColName) + // output the RawPrediction as vector + val rawPredi
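A hedged usage sketch of what this commit exposes on `OneVsRestModel`; `train` and `test` are hypothetical DataFrames with `features`/`label` columns:

```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

val ovr = new OneVsRest().setClassifier(new LogisticRegression().setMaxIter(10))
val ovrModel = ovr.fit(train)
// New members introduced by this commit
println(s"numClasses=${ovrModel.numClasses}, numFeatures=${ovrModel.numFeatures}")
// transform() now also emits a rawPrediction vector column by default
ovrModel.transform(test).select("prediction", "rawPrediction").show(5)
```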
[GitHub] spark issue #21044: [SPARK-9312][ML] Add RawPrediction, numClasses, and numF...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21044 LGTM Merging with master Thanks!! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20695: [SPARK-21741][ML][PySpark] Python API for DataFra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20695#discussion_r181802586 --- Diff: python/pyspark/ml/stat.py --- @@ -195,6 +197,195 @@ def test(dataset, sampleCol, distName, *params): _jvm().PythonUtils.toSeq(params))) +class Summarizer(object): +""" +.. note:: Experimental + +Tools for vectorized statistics on MLlib Vectors. +The methods in this package provide various statistics for Vectors contained inside DataFrames. +This class lets users pick the statistics they would like to extract for a given column. + +>>> from pyspark.ml.stat import Summarizer +>>> from pyspark.sql import Row +>>> from pyspark.ml.linalg import Vectors +>>> summarizer = Summarizer.metrics("mean", "count") +>>> df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 1.0, 1.0)), +... Row(weight=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF() +>>> df.select(summarizer.summary(df.features, df.weight)).show(truncate=False) ++---+ +|aggregate_metrics(features, weight)| ++---+ +|[[1.0,1.0,1.0], 1] | ++---+ + +>>> df.select(summarizer.summary(df.features)).show(truncate=False) +++ +|aggregate_metrics(features, 1.0)| +++ +|[[1.0,1.5,2.0], 2] | +++ + +>>> df.select(Summarizer.mean(df.features, df.weight)).show(truncate=False) ++--+ +|mean(features)| ++--+ +|[1.0,1.0,1.0] | ++--+ + +>>> df.select(Summarizer.mean(df.features)).show(truncate=False) ++--+ +|mean(features)| ++--+ +|[1.0,1.5,2.0] | ++--+ + + +.. versionadded:: 2.4.0 + +""" +@staticmethod +@since("2.4.0") +def mean(col, weightCol=None): +""" +return a column of mean summary +""" +return Summarizer._get_single_metric(col, weightCol, "mean") + +@staticmethod +@since("2.4.0") +def variance(col, weightCol=None): +""" +return a column of variance summary +""" +return Summarizer._get_single_metric(col, weightCol, "variance") + +@staticmethod +@since("2.4.0") +def count(col, weightCol=None): +""" +return a column of count summary +""" +return Summarizer._get_single_metric(col, weightCol, "count") + +@staticmethod +@since("2.4.0") +def numNonZeros(col, weightCol=None): +""" +return a column of numNonZero summary +""" +return Summarizer._get_single_metric(col, weightCol, "numNonZeros") + +@staticmethod +@since("2.4.0") +def max(col, weightCol=None): +""" +return a column of max summary +""" +return Summarizer._get_single_metric(col, weightCol, "max") + +@staticmethod +@since("2.4.0") +def min(col, weightCol=None): +""" +return a column of min summary +""" +return Summarizer._get_single_metric(col, weightCol, "min") + +@staticmethod +@since("2.4.0") +def normL1(col, weightCol=None): +""" +return a column of normL1 summary +""" +return Summarizer._get_single_metric(col, weightCol, "normL1") + +@staticmethod +@since("2.4.0") +def normL2(col, weightCol=None): +""" +return a column of normL2 summary +""" +return Summarizer._get_single_metric(col, weightCol, "normL2") + +@staticmethod +def _check_param(featuresCol, weightCol): +if weightCol is None: +weightCol = lit(1.0) +if not isinstance(featuresCol, Colum
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20904#discussion_r181731252 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala --- @@ -81,32 +81,37 @@ object KolmogorovSmirnovTest { * Java-friendly version of `test(dataset: DataFrame, sampleCol: String, cdf: Double => Double)` */ @Since("2.4.0") - def test(dataset: DataFrame, sampleCol: String, -cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = { -val f: Double => Double = x => cdf.call(x) -test(dataset, sampleCol, f) + def test( + dataset: Dataset[_], + sampleCol: String, + cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = { +test(dataset, sampleCol, (x: Double) => cdf.call(x)) --- End diff -- That's not a very scalable fix... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21044: [SPARK-9312][ML] Add RawPrediction, numClasses, a...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21044#discussion_r181288716 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala --- @@ -146,6 +152,10 @@ final class OneVsRestModel private[ml] ( @Since("2.1.0") def setPredictionCol(value: String): this.type = set(predictionCol, value) + /** @group setParam */ + @Since("2.4.0") + def setRawPredictionCol(value: String): this.type = set(rawPredictionCol, value) --- End diff -- You'll need to add this to the Estimator too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
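That is, a matching setter on the `OneVsRest` estimator itself — a short sketch mirroring the model-side setter quoted above:

```scala
// On the OneVsRest estimator class (not just OneVsRestModel):
/** @group setParam */
@Since("2.4.0")
def setRawPredictionCol(value: String): this.type = set(rawPredictionCol, value)
```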
[GitHub] spark pull request #21044: [SPARK-9312][ML] Add RawPrediction, numClasses, a...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21044#discussion_r181288736 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala --- @@ -195,15 +206,32 @@ final class OneVsRestModel private[ml] ( newDataset.unpersist() } -// output the index of the classifier with highest confidence as prediction -val labelUDF = udf { (predictions: Map[Int, Double]) => - predictions.maxBy(_._2)._1.toDouble -} +// output the RawPrediction as vector +if (getRawPredictionCol != "") { + val rawPredictionUDF = udf { (predictions: Map[Int, Double]) => +val predArray = Array.fill[Double](numClasses)(0.0) +predictions.foreach { case (idx, value) => predArray(idx) = value } +Vectors.dense(predArray) + } + + // output the index of the classifier with highest confidence as prediction + val labelUDF = udf { (predictions: Vector) => predictions.argmax.toDouble } -// output label and label metadata as prediction -aggregatedDataset - .withColumn($(predictionCol), labelUDF(col(accColName)), labelMetadata) - .drop(accColName) + aggregatedDataset +.withColumn(getRawPredictionCol, rawPredictionUDF(col(accColName))) +.withColumn(getPredictionCol, labelUDF(col(getRawPredictionCol)), labelMetadata) +.drop(accColName) +} +else { + // output the index of the classifier with highest confidence as prediction + val labelUDF = udf { (predictions: Map[Int, Double]) => +predictions.maxBy(_._2)._1.toDouble + } + // output confidence as rwa prediction, label and label metadata as prediction --- End diff -- This comment seems to be in the wrong part of the code. Also there's a typo --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21044: [SPARK-9312][ML] Add RawPrediction, numClasses, a...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21044#discussion_r181288725 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala --- @@ -195,15 +206,32 @@ final class OneVsRestModel private[ml] ( newDataset.unpersist() } -// output the index of the classifier with highest confidence as prediction -val labelUDF = udf { (predictions: Map[Int, Double]) => - predictions.maxBy(_._2)._1.toDouble -} +// output the RawPrediction as vector +if (getRawPredictionCol != "") { + val rawPredictionUDF = udf { (predictions: Map[Int, Double]) => +val predArray = Array.fill[Double](numClasses)(0.0) +predictions.foreach { case (idx, value) => predArray(idx) = value } +Vectors.dense(predArray) + } + + // output the index of the classifier with highest confidence as prediction + val labelUDF = udf { (predictions: Vector) => predictions.argmax.toDouble } -// output label and label metadata as prediction -aggregatedDataset - .withColumn($(predictionCol), labelUDF(col(accColName)), labelMetadata) - .drop(accColName) + aggregatedDataset +.withColumn(getRawPredictionCol, rawPredictionUDF(col(accColName))) +.withColumn(getPredictionCol, labelUDF(col(getRawPredictionCol)), labelMetadata) +.drop(accColName) +} +else { --- End diff -- Scala style: This should go on the previous line: ```} else {``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21044: [SPARK-9312][ML] Add RawPrediction, numClasses, a...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21044#discussion_r181288721 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala --- @@ -195,15 +206,32 @@ final class OneVsRestModel private[ml] ( newDataset.unpersist() } -// output the index of the classifier with highest confidence as prediction -val labelUDF = udf { (predictions: Map[Int, Double]) => - predictions.maxBy(_._2)._1.toDouble -} +// output the RawPrediction as vector +if (getRawPredictionCol != "") { + val rawPredictionUDF = udf { (predictions: Map[Int, Double]) => +val predArray = Array.fill[Double](numClasses)(0.0) --- End diff -- This causes a subtle ContextCleaner bug: `numClasses` refers to a field of the class OneVsRestModel, so when Spark's closure capture serializes this UDF to send to executors, it will end up sending the entire OneVsRestModel object, rather than just the value for numClasses. Make a local copy of the value numClasses within the transform() method to avoid this issue. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
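A minimal sketch of the suggested fix, keeping the names from the quoted diff (`numClassesLocal` is an assumed name for the local copy):

```scala
// Copy the field into a local val so the UDF closure captures an Int,
// not the enclosing OneVsRestModel instance.
val numClassesLocal = numClasses
val rawPredictionUDF = udf { (predictions: Map[Int, Double]) =>
  val predArray = Array.fill[Double](numClassesLocal)(0.0)
  predictions.foreach { case (idx, value) => predArray(idx) = value }
  Vectors.dense(predArray)
}
```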
[GitHub] spark pull request #21044: [SPARK-9312][ML] Add RawPrediction, numClasses, a...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21044#discussion_r181288710 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala --- @@ -138,6 +138,12 @@ final class OneVsRestModel private[ml] ( @Since("1.4.0") val models: Array[_ <: ClassificationModel[_, _]]) extends Model[OneVsRestModel] with OneVsRestParams with MLWritable { --- End diff -- Let's add a require() statement here which checks that models.nonEmpty is true (to throw an exception upon construction, rather than when numFeatures calls models.head below). Just to be safe... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20695: [SPARK-21741][ML][PySpark] Python API for DataFra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20695#discussion_r181263309 --- Diff: python/pyspark/ml/stat.py --- @@ -195,6 +197,185 @@ def test(dataset, sampleCol, distName, *params): _jvm().PythonUtils.toSeq(params))) +class Summarizer(object): +""" +.. note:: Experimental + +Tools for vectorized statistics on MLlib Vectors. +The methods in this package provide various statistics for Vectors contained inside DataFrames. +This class lets users pick the statistics they would like to extract for a given column. + +>>> from pyspark.ml.stat import Summarizer +>>> from pyspark.sql import Row +>>> from pyspark.ml.linalg import Vectors +>>> summarizer = Summarizer.metrics("mean", "count") +>>> df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 1.0, 1.0)), +... Row(weight=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF() +>>> df.select(summarizer.summary(df.features, df.weight)).show(truncate=False) ++---+ +|aggregate_metrics(features, weight)| ++---+ +|[[1.0,1.0,1.0], 1] | ++---+ + +>>> df.select(summarizer.summary(df.features)).show(truncate=False) +++ +|aggregate_metrics(features, 1.0)| +++ +|[[1.0,1.5,2.0], 2] | +++ + +>>> df.select(Summarizer.mean(df.features, df.weight)).show(truncate=False) ++--+ +|mean(features)| ++--+ +|[1.0,1.0,1.0] | ++--+ + +>>> df.select(Summarizer.mean(df.features)).show(truncate=False) ++--+ +|mean(features)| ++--+ +|[1.0,1.5,2.0] | ++--+ + + +.. versionadded:: 2.4.0 + +""" +@staticmethod +@since("2.4.0") +def mean(col, weightCol=None): +""" +return a column of mean summary +""" +return Summarizer._get_single_metric(col, weightCol, "mean") + +@staticmethod +@since("2.4.0") +def variance(col, weightCol=None): +""" +return a column of variance summary +""" +return Summarizer._get_single_metric(col, weightCol, "variance") + +@staticmethod +@since("2.4.0") +def count(col, weightCol=None): +""" +return a column of count summary +""" +return Summarizer._get_single_metric(col, weightCol, "count") + +@staticmethod +@since("2.4.0") +def numNonZeros(col, weightCol=None): +""" +return a column of numNonZero summary +""" +return Summarizer._get_single_metric(col, weightCol, "numNonZeros") + +@staticmethod +@since("2.4.0") +def max(col, weightCol=None): +""" +return a column of max summary +""" +return Summarizer._get_single_metric(col, weightCol, "max") + +@staticmethod +@since("2.4.0") +def min(col, weightCol=None): +""" +return a column of min summary +""" +return Summarizer._get_single_metric(col, weightCol, "min") + +@staticmethod +@since("2.4.0") +def normL1(col, weightCol=None): +""" +return a column of normL1 summary +""" +return Summarizer._get_single_metric(col, weightCol, "normL1") + +@staticmethod +@since("2.4.0") +def normL2(col, weightCol=None): +""" +return a column of normL2 summary +""" +return Summarizer._get_single_metric(col, weightCol, "normL2") + +@staticmethod +def _check_param(featureCol, weightCol): +if weightCol is None: +weightCol = lit(1.0) +if not isinstance(featureCol, Column) or not isins
[GitHub] spark pull request #20695: [SPARK-21741][ML][PySpark] Python API for DataFra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20695#discussion_r181259361 --- Diff: python/pyspark/ml/stat.py --- @@ -195,6 +197,185 @@ def test(dataset, sampleCol, distName, *params): _jvm().PythonUtils.toSeq(params))) +class Summarizer(object): +""" +.. note:: Experimental + +Tools for vectorized statistics on MLlib Vectors. +The methods in this package provide various statistics for Vectors contained inside DataFrames. +This class lets users pick the statistics they would like to extract for a given column. + +>>> from pyspark.ml.stat import Summarizer +>>> from pyspark.sql import Row +>>> from pyspark.ml.linalg import Vectors +>>> summarizer = Summarizer.metrics("mean", "count") +>>> df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 1.0, 1.0)), +... Row(weight=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF() +>>> df.select(summarizer.summary(df.features, df.weight)).show(truncate=False) ++---+ +|aggregate_metrics(features, weight)| ++---+ +|[[1.0,1.0,1.0], 1] | ++---+ + +>>> df.select(summarizer.summary(df.features)).show(truncate=False) +++ +|aggregate_metrics(features, 1.0)| +++ +|[[1.0,1.5,2.0], 2] | +++ + +>>> df.select(Summarizer.mean(df.features, df.weight)).show(truncate=False) ++--+ +|mean(features)| ++--+ +|[1.0,1.0,1.0] | ++--+ + +>>> df.select(Summarizer.mean(df.features)).show(truncate=False) ++--+ +|mean(features)| ++--+ +|[1.0,1.5,2.0] | ++--+ + + +.. versionadded:: 2.4.0 + +""" +@staticmethod +@since("2.4.0") +def mean(col, weightCol=None): +""" +return a column of mean summary +""" +return Summarizer._get_single_metric(col, weightCol, "mean") + +@staticmethod +@since("2.4.0") +def variance(col, weightCol=None): +""" +return a column of variance summary +""" +return Summarizer._get_single_metric(col, weightCol, "variance") + +@staticmethod +@since("2.4.0") +def count(col, weightCol=None): +""" +return a column of count summary +""" +return Summarizer._get_single_metric(col, weightCol, "count") + +@staticmethod +@since("2.4.0") +def numNonZeros(col, weightCol=None): +""" +return a column of numNonZero summary +""" +return Summarizer._get_single_metric(col, weightCol, "numNonZeros") + +@staticmethod +@since("2.4.0") +def max(col, weightCol=None): +""" +return a column of max summary +""" +return Summarizer._get_single_metric(col, weightCol, "max") + +@staticmethod +@since("2.4.0") +def min(col, weightCol=None): +""" +return a column of min summary +""" +return Summarizer._get_single_metric(col, weightCol, "min") + +@staticmethod +@since("2.4.0") +def normL1(col, weightCol=None): +""" +return a column of normL1 summary +""" +return Summarizer._get_single_metric(col, weightCol, "normL1") + +@staticmethod +@since("2.4.0") +def normL2(col, weightCol=None): +""" +return a column of normL2 summary +""" +return Summarizer._get_single_metric(col, weightCol, "normL2") + +@staticmethod +def _check_param(featureCol, weightCol): +if weightCol is None: +weightCol = lit(1.0) +if not isinstance(featureCol, Column) or not isins
[GitHub] spark pull request #20695: [SPARK-21741][ML][PySpark] Python API for DataFra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20695#discussion_r181259181 --- Diff: python/pyspark/ml/stat.py --- @@ -195,6 +197,185 @@ def test(dataset, sampleCol, distName, *params): _jvm().PythonUtils.toSeq(params))) +class Summarizer(object): +""" +.. note:: Experimental + +Tools for vectorized statistics on MLlib Vectors. +The methods in this package provide various statistics for Vectors contained inside DataFrames. +This class lets users pick the statistics they would like to extract for a given column. + +>>> from pyspark.ml.stat import Summarizer +>>> from pyspark.sql import Row +>>> from pyspark.ml.linalg import Vectors +>>> summarizer = Summarizer.metrics("mean", "count") +>>> df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 1.0, 1.0)), +... Row(weight=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF() +>>> df.select(summarizer.summary(df.features, df.weight)).show(truncate=False) ++---+ +|aggregate_metrics(features, weight)| ++---+ +|[[1.0,1.0,1.0], 1] | ++---+ + +>>> df.select(summarizer.summary(df.features)).show(truncate=False) +++ +|aggregate_metrics(features, 1.0)| +++ +|[[1.0,1.5,2.0], 2] | +++ + +>>> df.select(Summarizer.mean(df.features, df.weight)).show(truncate=False) ++--+ +|mean(features)| ++--+ +|[1.0,1.0,1.0] | ++--+ + +>>> df.select(Summarizer.mean(df.features)).show(truncate=False) ++--+ +|mean(features)| ++--+ +|[1.0,1.5,2.0] | ++--+ + + +.. versionadded:: 2.4.0 + +""" +@staticmethod +@since("2.4.0") +def mean(col, weightCol=None): +""" +return a column of mean summary +""" +return Summarizer._get_single_metric(col, weightCol, "mean") + +@staticmethod +@since("2.4.0") +def variance(col, weightCol=None): +""" +return a column of variance summary +""" +return Summarizer._get_single_metric(col, weightCol, "variance") + +@staticmethod +@since("2.4.0") +def count(col, weightCol=None): +""" +return a column of count summary +""" +return Summarizer._get_single_metric(col, weightCol, "count") + +@staticmethod +@since("2.4.0") +def numNonZeros(col, weightCol=None): +""" +return a column of numNonZero summary +""" +return Summarizer._get_single_metric(col, weightCol, "numNonZeros") + +@staticmethod +@since("2.4.0") +def max(col, weightCol=None): +""" +return a column of max summary +""" +return Summarizer._get_single_metric(col, weightCol, "max") + +@staticmethod +@since("2.4.0") +def min(col, weightCol=None): +""" +return a column of min summary +""" +return Summarizer._get_single_metric(col, weightCol, "min") + +@staticmethod +@since("2.4.0") +def normL1(col, weightCol=None): +""" +return a column of normL1 summary +""" +return Summarizer._get_single_metric(col, weightCol, "normL1") + +@staticmethod +@since("2.4.0") +def normL2(col, weightCol=None): +""" +return a column of normL2 summary +""" +return Summarizer._get_single_metric(col, weightCol, "normL2") + +@staticmethod +def _check_param(featureCol, weightCol): +if weightCol is None: +weightCol = lit(1.0) +if not isinstance(featureCol, Column) or not isins
[GitHub] spark pull request #20695: [SPARK-21741][ML][PySpark] Python API for DataFra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20695#discussion_r181259536 --- Diff: python/pyspark/ml/stat.py --- @@ -195,6 +197,185 @@ def test(dataset, sampleCol, distName, *params): _jvm().PythonUtils.toSeq(params))) +class Summarizer(object): +""" +.. note:: Experimental + +Tools for vectorized statistics on MLlib Vectors. +The methods in this package provide various statistics for Vectors contained inside DataFrames. +This class lets users pick the statistics they would like to extract for a given column. + +>>> from pyspark.ml.stat import Summarizer +>>> from pyspark.sql import Row +>>> from pyspark.ml.linalg import Vectors +>>> summarizer = Summarizer.metrics("mean", "count") +>>> df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 1.0, 1.0)), +... Row(weight=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF() +>>> df.select(summarizer.summary(df.features, df.weight)).show(truncate=False) ++---+ +|aggregate_metrics(features, weight)| ++---+ +|[[1.0,1.0,1.0], 1] | ++---+ + +>>> df.select(summarizer.summary(df.features)).show(truncate=False) +++ +|aggregate_metrics(features, 1.0)| +++ +|[[1.0,1.5,2.0], 2] | +++ + +>>> df.select(Summarizer.mean(df.features, df.weight)).show(truncate=False) ++--+ +|mean(features)| ++--+ +|[1.0,1.0,1.0] | ++--+ + +>>> df.select(Summarizer.mean(df.features)).show(truncate=False) ++--+ +|mean(features)| ++--+ +|[1.0,1.5,2.0] | ++--+ + + +.. versionadded:: 2.4.0 + +""" +@staticmethod +@since("2.4.0") +def mean(col, weightCol=None): +""" +return a column of mean summary +""" +return Summarizer._get_single_metric(col, weightCol, "mean") + +@staticmethod +@since("2.4.0") +def variance(col, weightCol=None): +""" +return a column of variance summary +""" +return Summarizer._get_single_metric(col, weightCol, "variance") + +@staticmethod +@since("2.4.0") +def count(col, weightCol=None): +""" +return a column of count summary +""" +return Summarizer._get_single_metric(col, weightCol, "count") + +@staticmethod +@since("2.4.0") +def numNonZeros(col, weightCol=None): +""" +return a column of numNonZero summary +""" +return Summarizer._get_single_metric(col, weightCol, "numNonZeros") + +@staticmethod +@since("2.4.0") +def max(col, weightCol=None): +""" +return a column of max summary +""" +return Summarizer._get_single_metric(col, weightCol, "max") + +@staticmethod +@since("2.4.0") +def min(col, weightCol=None): +""" +return a column of min summary +""" +return Summarizer._get_single_metric(col, weightCol, "min") + +@staticmethod +@since("2.4.0") +def normL1(col, weightCol=None): +""" +return a column of normL1 summary +""" +return Summarizer._get_single_metric(col, weightCol, "normL1") + +@staticmethod +@since("2.4.0") +def normL2(col, weightCol=None): +""" +return a column of normL2 summary +""" +return Summarizer._get_single_metric(col, weightCol, "normL2") + +@staticmethod +def _check_param(featureCol, weightCol): +if weightCol is None: +weightCol = lit(1.0) +if not isinstance(featureCol, Column) or not isins
spark git commit: [SPARK-23751][FOLLOW-UP] fix build for scala-2.12
Repository: spark Updated Branches: refs/heads/master 0b19122d4 -> 0f93b91a7 [SPARK-23751][FOLLOW-UP] fix build for scala-2.12 ## What changes were proposed in this pull request? fix build for scala-2.12 ## How was this patch tested? Manual. Author: WeichenXu Closes #21051 from WeichenXu123/fix_build212. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0f93b91a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0f93b91a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0f93b91a Branch: refs/heads/master Commit: 0f93b91a71444a1a938acfd8ea2191c54fb0187c Parents: 0b19122 Author: WeichenXu Authored: Thu Apr 12 15:47:42 2018 -0600 Committer: Joseph K. Bradley Committed: Thu Apr 12 15:47:42 2018 -0600 -- .../scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0f93b91a/mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala b/mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala index af8ff64..adf8145 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala @@ -85,7 +85,7 @@ object KolmogorovSmirnovTest { dataset: Dataset[_], sampleCol: String, cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = { -test(dataset, sampleCol, (x: Double) => cdf.call(x)) +test(dataset, sampleCol, (x: Double) => cdf.call(x).toDouble) } /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] spark issue #21051: [SPARK-23751][FOLLOW-UP] fix build for scala-2.12
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21051 LGTM Merging with master Thanks @WeichenXu123 ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20904#discussion_r181230965 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala --- @@ -81,32 +81,37 @@ object KolmogorovSmirnovTest { * Java-friendly version of `test(dataset: DataFrame, sampleCol: String, cdf: Double => Double)` */ @Since("2.4.0") - def test(dataset: DataFrame, sampleCol: String, -cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = { -val f: Double => Double = x => cdf.call(x) -test(dataset, sampleCol, f) + def test( + dataset: Dataset[_], + sampleCol: String, + cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = { +test(dataset, sampleCol, (x: Double) => cdf.call(x)) --- End diff -- This is the 2nd time I've merged something which broke the Scala 2.12 build. If we care about Scala 2.12, do you know if there are any efforts to add PR builder tests for 2.12? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: typo rawPredicition changed to rawPrediction
Repository: spark Updated Branches: refs/heads/master 75a183071 -> 9d960de08 typo rawPredicition changed to rawPrediction MultilayerPerceptronClassifier had 4 occurrences ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Author: JBauerKogentix <37910022+jbauerkogen...@users.noreply.github.com> Closes #21030 from JBauerKogentix/patch-1. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9d960de0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9d960de0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9d960de0 Branch: refs/heads/master Commit: 9d960de0814a1128318676cc2e91f447cdf0137f Parents: 75a1830 Author: JBauerKogentix <37910022+jbauerkogen...@users.noreply.github.com> Authored: Wed Apr 11 15:52:13 2018 -0700 Committer: Joseph K. Bradley Committed: Wed Apr 11 15:52:13 2018 -0700 -- python/pyspark/ml/classification.py | 8 1 file changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9d960de0/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index fbbe3d0..ec17653 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -1543,12 +1543,12 @@ class MultilayerPerceptronClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, solver="l-bfgs", initialWeights=None, probabilityCol="probability", - rawPredicitionCol="rawPrediction"): + rawPredictionCol="rawPrediction"): """ __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \ maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, \ solver="l-bfgs", initialWeights=None, probabilityCol="probability", \ - rawPredicitionCol="rawPrediction") + rawPredictionCol="rawPrediction") """ super(MultilayerPerceptronClassifier, self).__init__() self._java_obj = self._new_java_obj( @@ -1562,12 +1562,12 @@ class MultilayerPerceptronClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, def setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, solver="l-bfgs", initialWeights=None, probabilityCol="probability", - rawPredicitionCol="rawPrediction"): + rawPredictionCol="rawPrediction"): """ setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", \ maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, \ solver="l-bfgs", initialWeights=None, probabilityCol="probability", \ - rawPredicitionCol="rawPrediction"): + rawPredictionCol="rawPrediction"): Sets params for MultilayerPerceptronClassifier. """ kwargs = self._input_kwargs - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: typo rawPredicition changed to rawPrediction
Repository: spark Updated Branches: refs/heads/branch-2.3 acfc156df -> 03a4dfd69 typo rawPredicition changed to rawPrediction MultilayerPerceptronClassifier had 4 occurrences ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Author: JBauerKogentix <37910022+jbauerkogen...@users.noreply.github.com> Closes #21030 from JBauerKogentix/patch-1. (cherry picked from commit 9d960de0814a1128318676cc2e91f447cdf0137f) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/03a4dfd6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/03a4dfd6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/03a4dfd6 Branch: refs/heads/branch-2.3 Commit: 03a4dfd6901595fb4622587e3387ec1ac8d77237 Parents: acfc156 Author: JBauerKogentix <37910022+jbauerkogen...@users.noreply.github.com> Authored: Wed Apr 11 15:52:13 2018 -0700 Committer: Joseph K. Bradley Committed: Wed Apr 11 15:52:24 2018 -0700 -- python/pyspark/ml/classification.py | 8 1 file changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/03a4dfd6/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 27ad1e8..55d6030 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -1542,12 +1542,12 @@ class MultilayerPerceptronClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, solver="l-bfgs", initialWeights=None, probabilityCol="probability", - rawPredicitionCol="rawPrediction"): + rawPredictionCol="rawPrediction"): """ __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \ maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, \ solver="l-bfgs", initialWeights=None, probabilityCol="probability", \ - rawPredicitionCol="rawPrediction") + rawPredictionCol="rawPrediction") """ super(MultilayerPerceptronClassifier, self).__init__() self._java_obj = self._new_java_obj( @@ -1561,12 +1561,12 @@ class MultilayerPerceptronClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, def setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, solver="l-bfgs", initialWeights=None, probabilityCol="probability", - rawPredicitionCol="rawPrediction"): + rawPredictionCol="rawPrediction"): """ setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", \ maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, \ solver="l-bfgs", initialWeights=None, probabilityCol="probability", \ - rawPredicitionCol="rawPrediction"): + rawPredictionCol="rawPrediction"): Sets params for MultilayerPerceptronClassifier. """ kwargs = self._input_kwargs - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] spark issue #21030: typo rawPredicition changed to rawPrediction
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21030 Actually, forget the JIRA; I'll just merge it with master and branch-2.3 as is. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21044: [SPARK-9312][ML] Add RawPrediction, numClasses, a...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21044#discussion_r180920806 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala --- @@ -195,14 +205,18 @@ final class OneVsRestModel private[ml] ( newDataset.unpersist() } -// output the index of the classifier with highest confidence as prediction -val labelUDF = udf { (predictions: Map[Int, Double]) => - predictions.maxBy(_._2)._1.toDouble +// output the RawPrediction as vector +val rawPredictionUDF = udf { (predictions: Map[Int, Double]) => + Vectors.sparse(numClasses, predictions.toList ) --- End diff -- Also, let's output a dense Vector since it will almost surely be dense. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21044: [SPARK-9312][ML] Add RawPrediction, numClasses, and numF...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21044 Thanks for the PR! Quick high-level comment: We'll need to have rawPredictionCol be optional. If it's not set or is an empty string, then it should not be added to the output DataFrame. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
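A sketch of the pattern being requested — the same convention used elsewhere in spark.ml, where an output column whose name is set to the empty string is simply not produced. Variable and column names follow the quoted diff; this is an illustration, not the final shape of the patch:

```scala
// Only attach rawPrediction when the user has kept a non-empty column name for it.
val withRaw = if ($(rawPredictionCol).nonEmpty) {
  aggregatedDataset.withColumn($(rawPredictionCol), rawPredictionUDF(col(accColName)))
} else {
  aggregatedDataset
}
withRaw
  .withColumn($(predictionCol), labelUDF(col(accColName)), labelMetadata)
  .drop(accColName)
```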
[GitHub] spark pull request #19627: [SPARK-21088][ML] CrossValidator, TrainValidation...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19627#discussion_r180874011 --- Diff: python/pyspark/ml/tuning.py --- @@ -194,7 +195,8 @@ def _to_java_impl(self): return java_estimator, java_epms, java_evaluator -class CrossValidator(Estimator, ValidatorParams, HasParallelism, MLReadable, MLWritable): +class CrossValidator(Estimator, ValidatorParams, HasParallelism, HasCollectSubModels, --- End diff -- Let's also clarify in the doc for CrossValidatorModel.copy() that it does not copy the extra Params into the subModels. (same for TrainValidationSplitModel) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19627: [SPARK-21088][ML] CrossValidator, TrainValidation...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19627#discussion_r180862523 --- Diff: python/pyspark/ml/tuning.py --- @@ -194,7 +195,8 @@ def _to_java_impl(self): return java_estimator, java_epms, java_evaluator -class CrossValidator(Estimator, ValidatorParams, HasParallelism, MLReadable, MLWritable): +class CrossValidator(Estimator, ValidatorParams, HasParallelism, HasCollectSubModels, --- End diff -- You'll need to update _from_java and _to_java as well to pass collectSubModels around. (Same for TrainValidationSplit) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19627: [SPARK-21088][ML] CrossValidator, TrainValidation...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19627#discussion_r180876558 --- Diff: python/pyspark/ml/tests.py --- @@ -1018,6 +1018,48 @@ def test_parallel_evaluation(self): cvParallelModel = cv.fit(dataset) self.assertEqual(cvSerialModel.avgMetrics, cvParallelModel.avgMetrics) +def test_expose_sub_models(self): --- End diff -- Nice tests. Can you make one addition: Test the copy() method to make sure it copies the submodels. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19627: [SPARK-21088][ML] CrossValidator, TrainValidation...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19627#discussion_r180611695 --- Diff: python/pyspark/ml/param/_shared_params_code_gen.py --- @@ -157,6 +157,8 @@ def get$Name(self): "TypeConverters.toInt"), ("parallelism", "the number of threads to use when running parallel algorithms (>= 1).", "1", "TypeConverters.toInt"), +("collectSubModels", "whether to collect a list of sub-models trained during tuning", --- End diff -- It would be nice to add the full description from Scala. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19627: [SPARK-21088][ML] CrossValidator, TrainValidation...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19627#discussion_r180610868 --- Diff: python/pyspark/ml/tests.py --- @@ -1186,6 +1228,38 @@ def test_parallel_evaluation(self): tvsParallelModel = tvs.fit(dataset) self.assertEqual(tvsSerialModel.validationMetrics, tvsParallelModel.validationMetrics) +def test_expose_sub_models(self): +temp_path = tempfile.mkdtemp() +dataset = self.spark.createDataFrame( +[(Vectors.dense([0.0]), 0.0), + (Vectors.dense([0.4]), 1.0), + (Vectors.dense([0.5]), 0.0), + (Vectors.dense([0.6]), 1.0), + (Vectors.dense([1.0]), 1.0)] * 10, +["features", "label"]) +lr = LogisticRegression() +grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build() +evaluator = BinaryClassificationEvaluator() +tvs = TrainValidationSplit(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator, + collectSubModels=True) +tvsModel = tvs.fit(dataset) +assert len(tvsModel.subModels) == len(grid) --- End diff -- Use self.assertEqual here and elsewhere. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org