[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...

2018-04-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21097#discussion_r184760717
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala
 ---
@@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with 
DefaultReadWriteTest {
 assert(mostImportantFeature !== mostIF)
   }
 
+  test("model evaluateEachIteration") {
+for (lossType <- Seq("logistic")) {
+  val gbt = new GBTClassifier()
+.setMaxDepth(2)
+.setMaxIter(2)
+.setLossType(lossType)
+  val model = gbt.fit(trainData.toDF)
+  val eval1 = model.evaluateEachIteration(validationData.toDF)
+  val eval2 = 
GradientBoostedTrees.evaluateEachIteration(validationData,
--- End diff --

This is testing the spark.ml implementation against itself.  I was about to 
recommend using the old spark.mllib implementation as a reference.   However, 
the old implementation is not tested at all.  Would you be able to test against 
a standard implementation in R or scikit-learn (following the patterns used 
elsewhere in MLlib)?
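A rough sketch of what such a test could look like, following the MLlib convention of 
hard-coding reference values computed offline (e.g. with 
sklearn.ensemble.GradientBoostingClassifier's staged_decision_function). The reference 
numbers below are placeholders, not actual computed values, and the sketch assumes the 
suite's trainData/validationData and the TestingUtils helpers are in scope:

```scala
  import org.apache.spark.ml.util.TestingUtils._

  test("model evaluateEachIteration against reference values") {
    val gbt = new GBTClassifier()
      .setMaxDepth(2)
      .setMaxIter(2)
      .setLossType("logistic")
    val model = gbt.fit(trainData.toDF)
    // Placeholder reference losses; in a real test these would be values
    // computed offline with scikit-learn or R on the same data.
    val expected = Array(0.6931, 0.5210)
    val eval = model.evaluateEachIteration(validationData.toDF)
    eval.zip(expected).foreach { case (actual, reference) =>
      assert(actual ~== reference relTol 1e-3)
    }
  }
```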


---




[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...

2018-04-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21097#discussion_r184763088
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala ---
@@ -269,6 +269,21 @@ class GBTRegressionModel private[ml](
 new OldGBTModel(OldAlgo.Regression, _trees.map(_.toOld), _treeWeights)
   }
 
+  /**
+   * Method to compute error or loss for every iteration of gradient 
boosting.
+   *
+   * @param dataset Dataset for validation.
+   */
+  @Since("2.4.0")
+  def evaluateEachIteration(dataset: Dataset[_]): Array[Double] = {
--- End diff --

Do we want to support evaluation with other losses, as in the old API?  It 
might be nice to be able to do so without having to modify the Model's loss 
Param value.
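Purely to illustrate the suggestion, one hypothetical shape for such an API. The overload, 
its `loss` argument, and the `lossForName` helper are assumptions, not part of this PR; 
`OldGBTModel`/`OldAlgo` are the aliases already used in this file, and 
`OldLabeledPoint`/`OldVectors` stand for the spark.mllib LabeledPoint and Vectors:

```scala
  /**
   * Hypothetical overload: compute the loss for every iteration of gradient
   * boosting with an explicitly supplied loss, leaving the model's lossType
   * Param untouched.
   */
  def evaluateEachIteration(dataset: Dataset[_], loss: String): Array[Double] = {
    // Convert the input rows to old-API labeled points.
    val data = dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map {
      case Row(label: Double, features: Vector) =>
        OldLabeledPoint(label, OldVectors.fromML(features))
    }
    // Delegate to the old spark.mllib model, which already accepts an arbitrary loss.
    new OldGBTModel(OldAlgo.Regression, _trees.map(_.toOld), _treeWeights)
      .evaluateEachIteration(data, lossForName(loss))  // lossForName: assumed helper
  }
```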


---




[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

2018-04-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20973#discussion_r183866393
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{LongType, StructField, StructType}
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: 
Mining Sequential Patterns
+ * Efficiently by Prefix-Projected Pattern Growth
+ * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ *
+ * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ * (Wikipedia)</a>
+ */
+@Since("2.4.0")
+@Experimental
+object PrefixSpan {
+
+  /**
+   * :: Experimental ::
+   * Finds the complete set of frequent sequential patterns in the input 
sequences of itemsets.
+   *
+   * @param dataset A dataset or a dataframe containing a sequence column 
which is
+   *{{{Seq[Seq[_]]}}} type
+   * @param sequenceCol the name of the sequence column in dataset
+   * @param minSupport the minimal support level of the sequential 
pattern, any pattern that
+   *   appears more than (minSupport * 
size-of-the-dataset) times will be output
+   *  (default: `0.1`).
+   * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern that appears
+   * less than maxPatternLength will be output 
(default: `10`).
+   * @param maxLocalProjDBSize The maximum number of items (including 
delimiters used in the
+   *   internal storage format) allowed in a 
projected database before
+   *   local processing. If a projected database 
exceeds this size, another
+   *   iteration of distributed prefix growth is 
run (default: `3200`).
+   * @return A dataframe that contains columns of sequence and 
corresponding frequency.
+   */
+  @Since("2.4.0")
+  def findFrequentSequentPatterns(
+  dataset: Dataset[_],
+  sequenceCol: String,
+  minSupport: Double = 0.1,
+  maxPatternLength: Int = 10,
+  maxLocalProjDBSize: Long = 3200L): DataFrame = {
+val handlePersistence = dataset.storageLevel == StorageLevel.NONE
+
+val data = dataset.select(sequenceCol)
+val sequences = data.where(col(sequenceCol).isNotNull).rdd
+  .map(r => r.getAs[Seq[Seq[Any]]](0).map(_.toArray).toArray)
+
+val mllibPrefixSpan = new mllibPrefixSpan()
+  .setMinSupport(minSupport)
+  .setMaxPatternLength(maxPatternLength)
+  .setMaxLocalProjDBSize(maxLocalProjDBSize)
+if (handlePersistence) {
+  sequences.persist(StorageLevel.MEMORY_AND_DISK)
+}
+val rows = mllibPrefixSpan.run(sequences).freqSequences.map(f => 
Row(f.sequence, f.freq))
+val schema = StructType(Seq(
+  StructField("sequence", dataset.schema(sequenceCol).dataType, 
nullable = false),
+  StructField("freq", LongType, nullable = false)))
--- End diff --

nit: I'd prefer to call the column "frequency"


---




[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

2018-04-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20973#discussion_r183864852
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{LongType, StructField, StructType}
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: 
Mining Sequential Patterns
+ * Efficiently by Prefix-Projected Pattern Growth
+ * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ *
+ * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ * (Wikipedia)</a>
+ */
+@Since("2.4.0")
+@Experimental
+object PrefixSpan {
+
+  /**
+   * :: Experimental ::
+   * Finds the complete set of frequent sequential patterns in the input 
sequences of itemsets.
+   *
+   * @param dataset A dataset or a dataframe containing a sequence column 
which is
+   *{{{Seq[Seq[_]]}}} type
+   * @param sequenceCol the name of the sequence column in dataset
+   * @param minSupport the minimal support level of the sequential 
pattern, any pattern that
+   *   appears more than (minSupport * 
size-of-the-dataset) times will be output
+   *  (default: `0.1`).
+   * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern that appears
+   * less than maxPatternLength will be output 
(default: `10`).
+   * @param maxLocalProjDBSize The maximum number of items (including 
delimiters used in the
+   *   internal storage format) allowed in a 
projected database before
+   *   local processing. If a projected database 
exceeds this size, another
+   *   iteration of distributed prefix growth is 
run (default: `3200`).
+   * @return A dataframe that contains columns of sequence and 
corresponding frequency.
+   */
+  @Since("2.4.0")
+  def findFrequentSequentPatterns(
--- End diff --

rename: findFrequentSequentPatterns -> findFrequentSequent**ial**Patterns


---




[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

2018-04-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20973#discussion_r183865224
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{LongType, StructField, StructType}
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: 
Mining Sequential Patterns
+ * Efficiently by Prefix-Projected Pattern Growth
+ * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ *
+ * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ * (Wikipedia)</a>
+ */
+@Since("2.4.0")
+@Experimental
+object PrefixSpan {
+
+  /**
+   * :: Experimental ::
+   * Finds the complete set of frequent sequential patterns in the input 
sequences of itemsets.
+   *
+   * @param dataset A dataset or a dataframe containing a sequence column 
which is
+   *{{{Seq[Seq[_]]}}} type
+   * @param sequenceCol the name of the sequence column in dataset
+   * @param minSupport the minimal support level of the sequential 
pattern, any pattern that
+   *   appears more than (minSupport * 
size-of-the-dataset) times will be output
+   *  (default: `0.1`).
+   * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern that appears
+   * less than maxPatternLength will be output 
(default: `10`).
+   * @param maxLocalProjDBSize The maximum number of items (including 
delimiters used in the
+   *   internal storage format) allowed in a 
projected database before
+   *   local processing. If a projected database 
exceeds this size, another
+   *   iteration of distributed prefix growth is 
run (default: `3200`).
+   * @return A dataframe that contains columns of sequence and 
corresponding frequency.
+   */
+  @Since("2.4.0")
+  def findFrequentSequentPatterns(
+  dataset: Dataset[_],
+  sequenceCol: String,
+  minSupport: Double = 0.1,
+  maxPatternLength: Int = 10,
+  maxLocalProjDBSize: Long = 3200L): DataFrame = {
+val handlePersistence = dataset.storageLevel == StorageLevel.NONE
--- End diff --

We don't really need this handlePersistence logic here since it's handled 
by the spark.mllib implementation.
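In other words, the body could probably shrink to something like this sketch (assuming, 
per the comment above, that the spark.mllib PrefixSpan.run keeps handling caching of its 
intermediate data internally):

```scala
    val sequences = dataset.select(sequenceCol)
      .where(col(sequenceCol).isNotNull)
      .rdd
      .map(r => r.getAs[Seq[Seq[Any]]](0).map(_.toArray).toArray)

    // No explicit persist/unpersist here: the spark.mllib implementation
    // handles caching internally.
    val freqSequences = new mllibPrefixSpan()
      .setMinSupport(minSupport)
      .setMaxPatternLength(maxPatternLength)
      .setMaxLocalProjDBSize(maxLocalProjDBSize)
      .run(sequences)
      .freqSequences
```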


---




[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

2018-04-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20973#discussion_r183865609
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{LongType, StructField, StructType}
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: 
Mining Sequential Patterns
+ * Efficiently by Prefix-Projected Pattern Growth
+ * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ *
+ * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ * (Wikipedia)</a>
+ */
+@Since("2.4.0")
+@Experimental
+object PrefixSpan {
+
+  /**
+   * :: Experimental ::
+   * Finds the complete set of frequent sequential patterns in the input 
sequences of itemsets.
+   *
+   * @param dataset A dataset or a dataframe containing a sequence column 
which is
+   *{{{Seq[Seq[_]]}}} type
+   * @param sequenceCol the name of the sequence column in dataset
--- End diff --

It'd be nice to document that rows with nulls in this column are ignored.


---




[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

2018-04-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20973#discussion_r183864177
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{LongType, StructField, StructType}
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: 
Mining Sequential Patterns
+ * Efficiently by Prefix-Projected Pattern Growth
+ * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ *
+ * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ * (Wikipedia)</a>
+ */
+@Since("2.4.0")
+@Experimental
+object PrefixSpan {
+
+  /**
+   * :: Experimental ::
+   * Finds the complete set of frequent sequential patterns in the input 
sequences of itemsets.
+   *
+   * @param dataset A dataset or a dataframe containing a sequence column 
which is
+   *{{{Seq[Seq[_]]}}} type
+   * @param sequenceCol the name of the sequence column in dataset
+   * @param minSupport the minimal support level of the sequential 
pattern, any pattern that
+   *   appears more than (minSupport * 
size-of-the-dataset) times will be output
+   *  (default: `0.1`).
+   * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern that appears
--- End diff --

Let's fix this phrasing by just saying "the maximal length of the 
sequential pattern"  (The other part does not make sense: "any pattern that 
appears...")  Feel free to fix that in the old API doc too.


---




[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

2018-04-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20973#discussion_r183864721
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{LongType, StructField, StructType}
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: 
Mining Sequential Patterns
+ * Efficiently by Prefix-Projected Pattern Growth
+ * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ *
+ * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ * (Wikipedia)</a>
+ */
+@Since("2.4.0")
+@Experimental
+object PrefixSpan {
+
+  /**
+   * :: Experimental ::
+   * Finds the complete set of frequent sequential patterns in the input 
sequences of itemsets.
+   *
+   * @param dataset A dataset or a dataframe containing a sequence column 
which is
+   *{{{Seq[Seq[_]]}}} type
+   * @param sequenceCol the name of the sequence column in dataset
+   * @param minSupport the minimal support level of the sequential 
pattern, any pattern that
+   *   appears more than (minSupport * 
size-of-the-dataset) times will be output
+   *  (default: `0.1`).
+   * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern that appears
+   * less than maxPatternLength will be output 
(default: `10`).
+   * @param maxLocalProjDBSize The maximum number of items (including 
delimiters used in the
+   *   internal storage format) allowed in a 
projected database before
+   *   local processing. If a projected database 
exceeds this size, another
+   *   iteration of distributed prefix growth is 
run (default: `3200`).
+   * @return A dataframe that contains columns of sequence and 
corresponding frequency.
--- End diff --

Be very explicit about the output schema please: For each column, provide 
the name and DataType.
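For instance, the @return could be spelled out roughly like this (suggested wording only, 
assuming the column names end up as "sequence" and "freq"):

```scala
   * @return A `DataFrame` with two columns:
   *          - `sequence: ArrayType(ArrayType(T))`, where `T` is the item type of the
   *            input sequence column: the frequent sequential pattern
   *          - `freq: LongType`: the number of times the pattern was found
```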


---




[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

2018-04-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20973#discussion_r183865745
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{LongType, StructField, StructType}
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: 
Mining Sequential Patterns
+ * Efficiently by Prefix-Projected Pattern Growth
+ * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ *
+ * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ * (Wikipedia)</a>
+ */
+@Since("2.4.0")
+@Experimental
+object PrefixSpan {
+
+  /**
+   * :: Experimental ::
+   * Finds the complete set of frequent sequential patterns in the input 
sequences of itemsets.
+   *
+   * @param dataset A dataset or a dataframe containing a sequence column 
which is
+   *{{{Seq[Seq[_]]}}} type
+   * @param sequenceCol the name of the sequence column in dataset
--- End diff --

You could add a unit test for that too.
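A minimal sketch of such a test (assuming the method keeps its current name and parameters, 
and that the suite has testImplicits in scope):

```scala
  test("PrefixSpan ignores rows with null sequences") {
    import testImplicits._
    val df = Seq(
      (1, Seq(Seq(1, 2), Seq(3))),
      (2, null)
    ).toDF("id", "sequence")

    // The null row should simply be skipped rather than cause a failure.
    val result = PrefixSpan.findFrequentSequentPatterns(
      df, "sequence", minSupport = 0.5, maxPatternLength = 5, maxLocalProjDBSize = 3200L)
    assert(result.count() > 0)
  }
```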


---




[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

2018-04-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20973#discussion_r183863701
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{LongType, StructField, StructType}
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: 
Mining Sequential Patterns
+ * Efficiently by Prefix-Projected Pattern Growth
+ * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ *
+ * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ * (Wikipedia)</a>
+ */
+@Since("2.4.0")
+@Experimental
+object PrefixSpan {
+
+  /**
+   * :: Experimental ::
+   * Finds the complete set of frequent sequential patterns in the input 
sequences of itemsets.
+   *
+   * @param dataset A dataset or a dataframe containing a sequence column 
which is
+   *{{{Seq[Seq[_]]}}} type
+   * @param sequenceCol the name of the sequence column in dataset
+   * @param minSupport the minimal support level of the sequential 
pattern, any pattern that
+   *   appears more than (minSupport * 
size-of-the-dataset) times will be output
+   *  (default: `0.1`).
+   * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern that appears
+   * less than maxPatternLength will be output 
(default: `10`).
+   * @param maxLocalProjDBSize The maximum number of items (including 
delimiters used in the
+   *   internal storage format) allowed in a 
projected database before
+   *   local processing. If a projected database 
exceeds this size, another
+   *   iteration of distributed prefix growth is 
run (default: `3200`).
+   * @return A dataframe that contains columns of sequence and 
corresponding frequency.
+   */
+  @Since("2.4.0")
+  def findFrequentSequentPatterns(
+  dataset: Dataset[_],
+  sequenceCol: String,
+  minSupport: Double = 0.1,
--- End diff --

We never want to use default arguments in Scala APIs since they are not 
Java-friendly.  Let's just state recommended values in the docstrings.  We can 
add defaults when we create an Estimator.
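Concretely, the Java-friendly signature would then look something like this sketch (using 
the renamed method suggested earlier), with the recommended values stated only in the docs:

```scala
  @Since("2.4.0")
  def findFrequentSequentialPatterns(
      dataset: Dataset[_],
      sequenceCol: String,
      minSupport: Double,        // recommended value: 0.1
      maxPatternLength: Int,     // recommended value: 10
      maxLocalProjDBSize: Long   // recommended value: 3200
    ): DataFrame
```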


---




[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

2018-04-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20973#discussion_r183865387
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{LongType, StructField, StructType}
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: 
Mining Sequential Patterns
+ * Efficiently by Prefix-Projected Pattern Growth
+ * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ *
+ * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ * (Wikipedia)</a>
+ */
+@Since("2.4.0")
+@Experimental
+object PrefixSpan {
+
+  /**
+   * :: Experimental ::
+   * Finds the complete set of frequent sequential patterns in the input 
sequences of itemsets.
+   *
+   * @param dataset A dataset or a dataframe containing a sequence column 
which is
+   *{{{Seq[Seq[_]]}}} type
+   * @param sequenceCol the name of the sequence column in dataset
+   * @param minSupport the minimal support level of the sequential 
pattern, any pattern that
+   *   appears more than (minSupport * 
size-of-the-dataset) times will be output
+   *  (default: `0.1`).
+   * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern that appears
+   * less than maxPatternLength will be output 
(default: `10`).
+   * @param maxLocalProjDBSize The maximum number of items (including 
delimiters used in the
+   *   internal storage format) allowed in a 
projected database before
+   *   local processing. If a projected database 
exceeds this size, another
+   *   iteration of distributed prefix growth is 
run (default: `3200`).
+   * @return A dataframe that contains columns of sequence and 
corresponding frequency.
+   */
+  @Since("2.4.0")
+  def findFrequentSequentPatterns(
+  dataset: Dataset[_],
+  sequenceCol: String,
+  minSupport: Double = 0.1,
+  maxPatternLength: Int = 10,
+  maxLocalProjDBSize: Long = 3200L): DataFrame = {
+val handlePersistence = dataset.storageLevel == StorageLevel.NONE
+
+val data = dataset.select(sequenceCol)
--- End diff --

Let's check the input schema and throw a clear exception if it's not OK.
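For example, a check along these lines (a sketch; ArrayType is already imported in this 
file, and the exact message wording is up to the author):

```scala
    val inputType = dataset.schema(sequenceCol).dataType
    inputType match {
      case ArrayType(ArrayType(_, _), _) => // OK: a sequence of itemsets
      case _ => throw new IllegalArgumentException(
        s"sequenceCol must be an array of arrays (a sequence of itemsets), " +
        s"but column $sequenceCol has type $inputType.")
    }
```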


---




spark git commit: [SPARK-23990][ML] Instruments logging improvements - ML regression package

2018-04-24 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 83013752e -> 379bffa05


[SPARK-23990][ML] Instruments logging improvements - ML regression package

## What changes were proposed in this pull request?

Instruments logging improvements - ML regression package

I added an `OptionalInstrumentation` class, which is used in `WeightedLeastSquares` and 
`IterativelyReweightedLeastSquares`.

## How was this patch tested?

N/A

Author: WeichenXu 

Closes #21078 from WeichenXu123/inst_reg.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/379bffa0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/379bffa0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/379bffa0

Branch: refs/heads/master
Commit: 379bffa0525a4343f8c10e51ed192031922f9874
Parents: 8301375
Author: WeichenXu 
Authored: Tue Apr 24 11:02:22 2018 -0700
Committer: Joseph K. Bradley 
Committed: Tue Apr 24 11:02:22 2018 -0700

--
 .../ml/classification/LogisticRegression.scala  |  4 +-
 .../IterativelyReweightedLeastSquares.scala | 18 --
 .../spark/ml/optim/WeightedLeastSquares.scala   | 32 +
 .../ml/regression/AFTSurvivalRegression.scala   |  2 +-
 .../GeneralizedLinearRegression.scala   | 14 ++--
 .../spark/ml/regression/LinearRegression.scala  | 22 ---
 .../spark/ml/tree/impl/RandomForest.scala   |  2 +
 .../apache/spark/ml/util/Instrumentation.scala  | 68 +++-
 8 files changed, 125 insertions(+), 37 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/379bffa0/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index e426263..06ca37b 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -500,7 +500,7 @@ class LogisticRegression @Since("1.2.0") (
 
 if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
 
-val instr = Instrumentation.create(this, instances)
+val instr = Instrumentation.create(this, dataset)
 instr.logParams(regParam, elasticNetParam, standardization, threshold,
   maxIter, tol, fitIntercept)
 
@@ -816,7 +816,7 @@ class LogisticRegression @Since("1.2.0") (
 
 if (state == null) {
   val msg = s"${optimizer.getClass.getName} failed."
-  logError(msg)
+  instr.logError(msg)
   throw new SparkException(msg)
 }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/379bffa0/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
index 6961b45..572b8cf 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala
@@ -17,9 +17,9 @@
 
 package org.apache.spark.ml.optim
 
-import org.apache.spark.internal.Logging
 import org.apache.spark.ml.feature.{Instance, OffsetInstance}
 import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.util.OptionalInstrumentation
 import org.apache.spark.rdd.RDD
 
 /**
@@ -61,9 +61,12 @@ private[ml] class IterativelyReweightedLeastSquares(
 val fitIntercept: Boolean,
 val regParam: Double,
 val maxIter: Int,
-val tol: Double) extends Logging with Serializable {
+val tol: Double) extends Serializable {
 
-  def fit(instances: RDD[OffsetInstance]): 
IterativelyReweightedLeastSquaresModel = {
+  def fit(
+  instances: RDD[OffsetInstance],
+  instr: OptionalInstrumentation = OptionalInstrumentation.create(
+classOf[IterativelyReweightedLeastSquares])): 
IterativelyReweightedLeastSquaresModel = {
 
 var converged = false
 var iter = 0
@@ -83,7 +86,8 @@ private[ml] class IterativelyReweightedLeastSquares(
 
   // Estimate new model
   model = new WeightedLeastSquares(fitIntercept, regParam, elasticNetParam 
= 0.0,
-standardizeFeatures = false, standardizeLabel = 
false).fit(newInstances)
+standardizeFeatures = false, standardizeLabel = false)
+.fit(newInstances, instr = instr)
 
   // Check convergence
   val oldCoefficients = oldModel.coefficients
@@ -96,14 +100,14 @@ private[ml] class IterativelyReweightedLeastSquares(
 
   if (maxTol < tol) {
 

[GitHub] spark issue #21078: [SPARK-23990][ML] Instruments logging improvements - ML ...

2018-04-24 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/21078
  
LGTM
Merging with master
Thanks @WeichenXu123 !


---




spark git commit: [SPARK-23455][ML] Default Params in ML should be saved separately in metadata

2018-04-24 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master ce7ba2e98 -> 83013752e


[SPARK-23455][ML] Default Params in ML should be saved separately in metadata

## What changes were proposed in this pull request?

We currently save ML's user-supplied params and default params as one entity in metadata. 
When loading saved models, we set all of the loaded params on the created ML 
model instances as user-supplied params.

This causes problems: for example, if we strictly disallow some params to be set at 
the same time, a default param can fail the param check because it is treated 
as a user-supplied param after loading.

The loaded default params should not be set as user-supplied params, so we should 
save ML default params separately in metadata.

For backward compatibility, when loading a metadata file written by a previous Spark 
version, we shouldn't raise an error if the default param field is missing.

## How was this patch tested?

Passes existing tests; new tests were added.

Author: Liang-Chi Hsieh 

Closes #20633 from viirya/save-ml-default-params.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/83013752
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/83013752
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/83013752

Branch: refs/heads/master
Commit: 83013752e3cfcbc3edeef249439ac20b143eeabc
Parents: ce7ba2e
Author: Liang-Chi Hsieh 
Authored: Tue Apr 24 10:40:25 2018 -0700
Committer: Joseph K. Bradley 
Committed: Tue Apr 24 10:40:25 2018 -0700

--
 .../classification/DecisionTreeClassifier.scala |   2 +-
 .../spark/ml/classification/GBTClassifier.scala |   4 +-
 .../spark/ml/classification/LinearSVC.scala |   2 +-
 .../ml/classification/LogisticRegression.scala  |   2 +-
 .../MultilayerPerceptronClassifier.scala|   2 +-
 .../spark/ml/classification/NaiveBayes.scala|   2 +-
 .../spark/ml/classification/OneVsRest.scala |   4 +-
 .../classification/RandomForestClassifier.scala |   4 +-
 .../spark/ml/clustering/BisectingKMeans.scala   |   2 +-
 .../spark/ml/clustering/GaussianMixture.scala   |   2 +-
 .../org/apache/spark/ml/clustering/KMeans.scala |   2 +-
 .../org/apache/spark/ml/clustering/LDA.scala|   4 +-
 .../feature/BucketedRandomProjectionLSH.scala   |   2 +-
 .../apache/spark/ml/feature/Bucketizer.scala|  24 
 .../apache/spark/ml/feature/ChiSqSelector.scala |   2 +-
 .../spark/ml/feature/CountVectorizer.scala  |   2 +-
 .../scala/org/apache/spark/ml/feature/IDF.scala |   2 +-
 .../org/apache/spark/ml/feature/Imputer.scala   |   2 +-
 .../apache/spark/ml/feature/MaxAbsScaler.scala  |   2 +-
 .../apache/spark/ml/feature/MinHashLSH.scala|   2 +-
 .../apache/spark/ml/feature/MinMaxScaler.scala  |   2 +-
 .../ml/feature/OneHotEncoderEstimator.scala |   2 +-
 .../scala/org/apache/spark/ml/feature/PCA.scala |   2 +-
 .../spark/ml/feature/QuantileDiscretizer.scala  |  24 
 .../org/apache/spark/ml/feature/RFormula.scala  |   6 +-
 .../spark/ml/feature/StandardScaler.scala   |   2 +-
 .../apache/spark/ml/feature/StringIndexer.scala |   2 +-
 .../apache/spark/ml/feature/VectorIndexer.scala |   2 +-
 .../org/apache/spark/ml/feature/Word2Vec.scala  |   2 +-
 .../org/apache/spark/ml/fpm/FPGrowth.scala  |   2 +-
 .../org/apache/spark/ml/param/params.scala  |  13 +-
 .../apache/spark/ml/recommendation/ALS.scala|   2 +-
 .../ml/regression/AFTSurvivalRegression.scala   |   2 +-
 .../ml/regression/DecisionTreeRegressor.scala   |   2 +-
 .../spark/ml/regression/GBTRegressor.scala  |   4 +-
 .../GeneralizedLinearRegression.scala   |   2 +-
 .../ml/regression/IsotonicRegression.scala  |   2 +-
 .../spark/ml/regression/LinearRegression.scala  |   2 +-
 .../ml/regression/RandomForestRegressor.scala   |   4 +-
 .../apache/spark/ml/tuning/CrossValidator.scala |   6 +-
 .../spark/ml/tuning/TrainValidationSplit.scala  |   6 +-
 .../org/apache/spark/ml/util/ReadWrite.scala| 130 ---
 .../spark/ml/util/DefaultReadWriteTest.scala|  73 ++-
 project/MimaExcludes.scala  |   6 +
 44 files changed, 223 insertions(+), 147 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/83013752/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
index 771cd4f..57797d1 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
@@ -279,7 +279,7 @@ object DecisionTreeClassificationMod

spark git commit: [SPARK-23975][ML] Allow Clustering to take Arrays of Double as input features

2018-04-24 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 55c4ca88a -> 2a24c481d


[SPARK-23975][ML] Allow Clustering to take Arrays of Double as input features

## What changes were proposed in this pull request?

- Multiple possible input types are now accepted in validateAndTransformSchema() and 
computeCost() when checking the column type

- Add an if statement in transform() to support array type as featuresCol

- Add a case statement in fit() when selecting columns from the dataset

These changes are applied to KMeans first and will then be extended to other clustering 
methods (a usage sketch follows below).
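A minimal usage sketch of the new behavior (assumes a SparkSession named `spark`; the data 
and column name are illustrative):

```scala
import org.apache.spark.ml.clustering.KMeans
import spark.implicits._

// A DataFrame whose features column is array<double> rather than Vector.
val df = Seq(
  Tuple1(Array(0.0, 0.1)),
  Tuple1(Array(9.0, 9.2)),
  Tuple1(Array(9.1, 9.0))
).toDF("features")

// KMeans now accepts array<double> (and array<float>) feature columns directly.
val model = new KMeans().setK(2).setSeed(1L).fit(df)
model.transform(df).show()
```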

## How was this patch tested?

A unit test is added.


Author: Lu WANG 

Closes #21081 from ludatabricks/SPARK-23975.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2a24c481
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2a24c481
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2a24c481

Branch: refs/heads/master
Commit: 2a24c481da3f30b510deb62e5cf21c9463cf250c
Parents: 55c4ca8
Author: Lu WANG 
Authored: Tue Apr 24 09:25:41 2018 -0700
Committer: Joseph K. Bradley 
Committed: Tue Apr 24 09:25:41 2018 -0700

--
 .../org/apache/spark/ml/clustering/KMeans.scala | 32 +++---
 .../org/apache/spark/ml/util/DatasetUtils.scala | 63 
 .../spark/ml/clustering/KMeansSuite.scala   | 38 
 3 files changed, 126 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2a24c481/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
index 1ad157a..d475c72 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
@@ -33,8 +33,8 @@ import org.apache.spark.mllib.linalg.{Vector => OldVector, 
Vectors => OldVectors
 import org.apache.spark.mllib.linalg.VectorImplicits._
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
-import org.apache.spark.sql.functions.{col, udf}
-import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.functions.udf
+import org.apache.spark.sql.types.{ArrayType, DoubleType, FloatType, 
IntegerType, StructType}
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.util.VersionUtils.majorVersion
 
@@ -87,12 +87,23 @@ private[clustering] trait KMeansParams extends Params with 
HasMaxIter with HasFe
   def getInitSteps: Int = $(initSteps)
 
   /**
+   * Validates the input schema.
+   * @param schema input schema
+   */
+  private[clustering] def validateSchema(schema: StructType): Unit = {
+val typeCandidates = List( new VectorUDT,
+  new ArrayType(DoubleType, false),
+  new ArrayType(FloatType, false))
+
+SchemaUtils.checkColumnTypes(schema, $(featuresCol), typeCandidates)
+  }
+  /**
* Validates and transforms the input schema.
* @param schema input schema
* @return output schema
*/
   protected def validateAndTransformSchema(schema: StructType): StructType = {
-SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+validateSchema(schema)
 SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
   }
 }
@@ -125,8 +136,11 @@ class KMeansModel private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
+
 val predictUDF = udf((vector: Vector) => predict(vector))
-dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol
+
+dataset.withColumn($(predictionCol),
+  predictUDF(DatasetUtils.columnToVector(dataset, getFeaturesCol)))
   }
 
   @Since("1.5.0")
@@ -146,8 +160,10 @@ class KMeansModel private[ml] (
   // TODO: Replace the temp fix when we have proper evaluators defined for 
clustering.
   @Since("2.0.0")
   def computeCost(dataset: Dataset[_]): Double = {
-SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
-val data: RDD[OldVector] = dataset.select(col($(featuresCol))).rdd.map {
+validateSchema(dataset.schema)
+
+val data: RDD[OldVector] = 
dataset.select(DatasetUtils.columnToVector(dataset, getFeaturesCol))
+  .rdd.map {
   case Row(point: Vector) => OldVectors.fromML(point)
 }
 parentModel.computeCost(data)
@@ -335,7 +351,9 @@ class KMeans @Since("1.5.0") (
 transformSchema(dataset.schema, logging = true)
 
 val handlePersistence = dataset.storageLevel == StorageLevel.NONE
-val instances: RDD[OldVector] = 
dataset.select(col($(feat

[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r183796459
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -86,13 +86,23 @@ private[clustering] trait KMeansParams extends Params 
with HasMaxIter with HasFe
   @Since("1.5.0")
   def getInitSteps: Int = $(initSteps)
 
+  /**
+   * Validates the input schema.
+   * @param schema input schema
+   */
+  private[clustering] def validateSchema(schema: StructType): Unit = {
+val typeCandidates = List( new VectorUDT,
+  new ArrayType(DoubleType, false),
+  new ArrayType(FloatType, false))
+SchemaUtils.checkColumnTypes(schema, $(featuresCol), typeCandidates)
+  }
   /**
--- End diff --

Ping: There needs to be a newline between the "}" of the previous method 
and the "/**" Scaladoc of the next method.  Please start checking for this.


---




[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r183797106
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/DatasetUtils.scala 
---
@@ -27,28 +26,38 @@ import org.apache.spark.sql.types.{ArrayType, 
DoubleType, FloatType}
 private[spark] object DatasetUtils {
 
   /**
-   * preprocessing the input feature column to Vector
-   * @param dataset DataFrame with columns for features
-   * @param colName column name for features
-   * @return Vector feature column
+   * Cast a column in a Dataset to Vector type.
+   *
+   * The supported data types of the input column are
+   * - Vector
+   * - float/double type Array.
+   *
+   * Note: The returned column does not have Metadata.
+   *
+   * @param dataset input DataFrame
+   * @param colName column name.
+   * @return Vector column
*/
-  @Since("2.4.0")
   def columnToVector(dataset: Dataset[_], colName: String): Column = {
-val featuresDataType = dataset.schema(colName).dataType
-featuresDataType match {
+val columnDataType = dataset.schema(colName).dataType
+columnDataType match {
   case _: VectorUDT => col(colName)
   case fdt: ArrayType =>
 val transferUDF = fdt.elementType match {
   case _: FloatType => udf(f = (vector: Seq[Float]) => {
-val featureArray = Array.fill[Double](vector.size)(0.0)
-vector.indices.foreach(idx => featureArray(idx) = 
vector(idx).toDouble)
-Vectors.dense(featureArray)
+val inputArray = Array.fill[Double](vector.size)(0.0)
+vector.indices.foreach(idx => inputArray(idx) = 
vector(idx).toDouble)
+Vectors.dense(inputArray)
   })
   case _: DoubleType => udf((vector: Seq[Double]) => {
 Vectors.dense(vector.toArray)
   })
+  case other =>
--- End diff --

Thanks!  I forgot about this since this was generalized.


---




[GitHub] spark pull request #20319: [SPARK-22884][ML][TESTS] ML test for StructuredSt...

2018-04-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20319#discussion_r183565892
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala 
---
@@ -20,13 +20,14 @@ package org.apache.spark.ml.clustering
 import org.apache.spark.{SparkException, SparkFunSuite}
 import org.apache.spark.ml.linalg.{Vector, Vectors}
 import org.apache.spark.ml.param.ParamMap
-import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils}
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest, 
MLTestingUtils}
 import org.apache.spark.mllib.clustering.DistanceMeasure
 import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.Dataset
 
-class BisectingKMeansSuite
-  extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+class BisectingKMeansSuite extends MLTest with DefaultReadWriteTest {
+
+  import Encoders._
--- End diff --

```import testImplicits._``` instead


---




[GitHub] spark pull request #20319: [SPARK-22884][ML][TESTS] ML test for StructuredSt...

2018-04-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20319#discussion_r183565968
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala 
---
@@ -65,10 +66,12 @@ class BisectingKMeansSuite
 
 // Verify fit does not fail on very sparse data
 val model = bkm.fit(sparseDataset)
-val result = model.transform(sparseDataset)
-val numClusters = 
result.select("prediction").distinct().collect().length
-// Verify we hit the edge case
-assert(numClusters < k && numClusters > 1)
+
+testTransformerByGlobalCheckFunc[Vector](sparseDataset.toDF(), model, 
"prediction") { rows =>
--- End diff --

Use ```Tuple1[Vector]``` instead of ```Vector```


---




[GitHub] spark pull request #20319: [SPARK-22884][ML][TESTS] ML test for StructuredSt...

2018-04-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20319#discussion_r182180564
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/Encoders.scala ---
@@ -0,0 +1,25 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.ml.linalg.Vector
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+
+private[clustering] object Encoders {
+  implicit val vectorEncoder = ExpressionEncoder[Vector]()
--- End diff --

Thanks for asking; you shouldn't need to do this.  I'll comment on 
BisectingKMeansSuite.scala
 about using testImplicits instead.  You basically just need to import 
testImplicits._ and use Tuple1 for the type param for testTransformer.
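Putting those two pieces together, the suggested pattern is roughly (a sketch reusing the 
suite's existing model, sparseDataset, and k values):

```scala
  // In BisectingKMeansSuite (extends MLTest): no custom Encoders object needed.
  import testImplicits._

  // Tuple1[Vector] gives testTransformerByGlobalCheckFunc an Encoder for the
  // single-column input, without a hand-rolled ExpressionEncoder.
  testTransformerByGlobalCheckFunc[Tuple1[Vector]](sparseDataset.toDF(), model,
    "prediction") { rows =>
    val numClusters = rows.map(_.getAs[Int]("prediction")).distinct.length
    assert(numClusters < k && numClusters > 1)
  }
```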


---




[GitHub] spark pull request #20319: [SPARK-22884][ML][TESTS] ML test for StructuredSt...

2018-04-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20319#discussion_r183566557
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala 
---
@@ -102,17 +105,14 @@ class BisectingKMeansSuite
 val model = bkm.fit(dataset)
 assert(model.clusterCenters.length === k)
 
-val transformed = model.transform(dataset)
-val expectedColumns = Array("features", predictionColName)
-expectedColumns.foreach { column =>
-  assert(transformed.columns.contains(column))
+testTransformerByGlobalCheckFunc[Vector](dataset.toDF(), model,
+  "features", predictionColName) { rows =>
+  val clusters = rows.map(_.getAs[Int](predictionColName)).toSet
+  assert(clusters.size === k)
+  assert(clusters === Set(0, 1, 2, 3, 4))
+  assert(model.computeCost(dataset) < 0.1)
--- End diff --

These checks which do not use "rows" should go outside of 
testTransformerByGlobalCheckFunc
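That is, roughly (a sketch of the suggested restructuring):

```scala
    testTransformerByGlobalCheckFunc[Tuple1[Vector]](dataset.toDF(), model,
      "features", predictionColName) { rows =>
      // Only the checks that actually look at the transformed rows stay inside.
      val clusters = rows.map(_.getAs[Int](predictionColName)).toSet
      assert(clusters.size === k)
      assert(clusters === Set(0, 1, 2, 3, 4))
    }
    // Checks that do not depend on `rows` move outside the check function.
    assert(model.computeCost(dataset) < 0.1)
```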


---




[GitHub] spark issue #20633: [SPARK-23455][ML] Default Params in ML should be saved s...

2018-04-23 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/20633
  
Sorry for the pause in review.
LGTM
Merging with master

@dbtsai  I'm going to merge this since I'm worried it will collect more 
conflicts, but let's discuss more if needed.

@viirya We'll need to update Python's DefaultParamsReader as well for Spark 
2.4 in order to keep it in sync with Scala/Java.  R thankfully should not 
require anything since it only has wrappers.  I'll make & link a JIRA.  Will 
you have time to work on that?

Thanks @viirya !


---




[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r183555957
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -86,13 +86,23 @@ private[clustering] trait KMeansParams extends Params 
with HasMaxIter with HasFe
   @Since("1.5.0")
   def getInitSteps: Int = $(initSteps)
 
+  /**
+   * Validates the input schema.
+   * @param schema input schema
+   */
+  private[clustering] def validateSchema(schema: StructType): Unit = {
+val typeCandidates = List( new VectorUDT,
+  new ArrayType(DoubleType, false),
+  new ArrayType(FloatType, false))
+SchemaUtils.checkColumnTypes(schema, $(featuresCol), typeCandidates)
+  }
   /**
--- End diff --

scala style: always put newline between methods


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r183556811
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/DatasetUtils.scala 
---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.util
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
+import org.apache.spark.sql.{Column, Dataset}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{ArrayType, DoubleType, FloatType}
+
+
+private[spark] object DatasetUtils {
+
+  /**
+   * preprocessing the input feature column to Vector
--- End diff --

This is a bit unclear.  How about: "Cast a column in a Dataset to a Vector 
type."
Also, this isn't specific to features, so please clarify that below.
Finally, the key thing to document is the list of supported input types, so 
I'd add that.
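Concretely, the doc might read something like this (wording is only a suggestion; the metadata note picks up the separate review comment below):

```
/**
 * Cast a column in a Dataset to Vector type.
 *
 * Supported input column types: `VectorUDT`, `ArrayType(DoubleType)`, `ArrayType(FloatType)`.
 * Note: the returned Column does not carry the input column's metadata.
 *
 * @param dataset input Dataset
 * @param colName name of the column to cast
 * @return Column converted to Vector type
 */
```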


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r183558105
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -199,6 +201,47 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 assert(e.getCause.getMessage.contains("Cosine distance is not 
defined"))
   }
 
+  test("KMean with Array input") {
+val featuresColNameD = "array_double_features"
+val featuresColNameF = "array_float_features"
+
+val doubleUDF = udf { (features: Vector) =>
+  val featureArray = Array.fill[Double](features.size)(0.0)
+  features.foreachActive((idx, value) => featureArray(idx) = 
value.toFloat)
+  featureArray
+}
+val floatUDF = udf { (features: Vector) =>
+  val featureArray = Array.fill[Float](features.size)(0.0f)
+  features.foreachActive((idx, value) => featureArray(idx) = 
value.toFloat)
+  featureArray
+}
+
+val newdatasetD = dataset.withColumn(featuresColNameD, 
doubleUDF(col("features")))
+  .drop("features")
+val newdatasetF = dataset.withColumn(featuresColNameF, 
floatUDF(col("features")))
+  .drop("features")
+
+assert(newdatasetD.schema(featuresColNameD).dataType.equals(new 
ArrayType(DoubleType, false)))
+assert(newdatasetF.schema(featuresColNameF).dataType.equals(new 
ArrayType(FloatType, false)))
+
+val kmeansD = new 
KMeans().setK(k).setFeaturesCol(featuresColNameD).setSeed(1)
+val kmeansF = new 
KMeans().setK(k).setFeaturesCol(featuresColNameF).setSeed(1)
+val modelD = kmeansD.fit(newdatasetD)
+val modelF = kmeansF.fit(newdatasetF)
+
+val transformedD = modelD.transform(newdatasetD)
+val transformedF = modelF.transform(newdatasetF)
+
+val predictDifference = transformedD.select("prediction")
+  .except(transformedF.select("prediction"))
+
+assert(predictDifference.count() == 0)
+
+assert(modelD.computeCost(newdatasetD) == 
modelF.computeCost(newdatasetF) )
+
--- End diff --

nit: remove unnecessary newline


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r183557655
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/DatasetUtils.scala 
---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.util
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
+import org.apache.spark.sql.{Column, Dataset}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{ArrayType, DoubleType, FloatType}
+
+
+private[spark] object DatasetUtils {
+
+  /**
+   * preprocessing the input feature column to Vector
+   * @param dataset DataFrame with columns for features
+   * @param colName column name for features
+   * @return Vector feature column
--- End diff --

Add a note that this returned Column does not have Metadata


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r183556424
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/DatasetUtils.scala 
---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.util
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
+import org.apache.spark.sql.{Column, Dataset}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{ArrayType, DoubleType, FloatType}
+
+
+private[spark] object DatasetUtils {
+
+  /**
+   * preprocessing the input feature column to Vector
+   * @param dataset DataFrame with columns for features
+   * @param colName column name for features
+   * @return Vector feature column
+   */
+  @Since("2.4.0")
--- End diff --

Don't add Since annotations to private APIs.  They can get Since 
annotations when they are made public.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r183558056
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -199,6 +201,47 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 assert(e.getCause.getMessage.contains("Cosine distance is not 
defined"))
   }
 
+  test("KMean with Array input") {
+val featuresColNameD = "array_double_features"
+val featuresColNameF = "array_float_features"
+
+val doubleUDF = udf { (features: Vector) =>
+  val featureArray = Array.fill[Double](features.size)(0.0)
+  features.foreachActive((idx, value) => featureArray(idx) = 
value.toFloat)
+  featureArray
+}
+val floatUDF = udf { (features: Vector) =>
+  val featureArray = Array.fill[Float](features.size)(0.0f)
+  features.foreachActive((idx, value) => featureArray(idx) = 
value.toFloat)
+  featureArray
+}
+
+val newdatasetD = dataset.withColumn(featuresColNameD, 
doubleUDF(col("features")))
+  .drop("features")
+val newdatasetF = dataset.withColumn(featuresColNameF, 
floatUDF(col("features")))
+  .drop("features")
+
+assert(newdatasetD.schema(featuresColNameD).dataType.equals(new 
ArrayType(DoubleType, false)))
+assert(newdatasetF.schema(featuresColNameF).dataType.equals(new 
ArrayType(FloatType, false)))
+
+val kmeansD = new 
KMeans().setK(k).setFeaturesCol(featuresColNameD).setSeed(1)
--- End diff --

Also do: `setMaxIter(1)` to make this a little faster.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21078: [SPARK-23990][ML] Instruments logging improvement...

2018-04-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21078#discussion_r183514966
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala ---
@@ -157,34 +161,55 @@ private[spark] object Instrumentation {
 
 }
 
-private[spark] class OptionalInstrument private (
-val instrument: Option[Instrumentation[_ <: Estimator[_]]],
+/**
+ * A small wrapper that contains an optional `Instrumentation` object.
+ * Provide some log methods, if the containing `Instrumentation` object is 
defined,
+ * will log via it, otherwise will log via common logger.
+ */
+private[spark] class OptionalInstrumentation private(
+val instrumentation: Option[Instrumentation[_ <: Estimator[_]]],
 val className: String) extends Logging {
 
-  def this(instr: Instrumentation[_ <: Estimator[_]]) = this(Some(instr), 
"")
-
-  def this(clazz: Class[_]) = this(None, clazz.getName.stripSuffix("$"))
-
-  protected override def logName = className
+  protected override def logName: String = className
 
   override def logInfo(msg: => String) {
-instrument match {
+instrumentation match {
   case Some(instr) => instr.logInfo(msg)
   case None => super.logInfo(msg)
 }
   }
 
   override def logWarning(msg: => String) {
-instrument match {
+instrumentation match {
   case Some(instr) => instr.logWarning(msg)
   case None => super.logWarning(msg)
 }
   }
 
   override def logError(msg: => String) {
-instrument match {
+instrumentation match {
   case Some(instr) => instr.logError(msg)
   case None => super.logError(msg)
 }
   }
 }
+
+private[spark] object OptionalInstrumentation {
+
+  /**
+   * Creates an `OptionalInstrumentation` object from an existing 
`Instrumentation` object.
+   */
+  def create(instr: Instrumentation[_ <: Estimator[_]]): 
OptionalInstrumentation = {
+new OptionalInstrumentation(Some(instr),
+  classOf[Instrumentation[_]].getName.stripSuffix("$"))
--- End diff --

This is going to log using className = Instrumentation, which is not very 
informative.  It should log using the Estimator class name.  It might be easier 
to separate logName from className.
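One possible shape, as a sketch (it assumes Instrumentation exposes the estimator it was created for; otherwise the class name would need to be passed in explicitly):

```
def create(instr: Instrumentation[_ <: Estimator[_]]): OptionalInstrumentation = {
  // Log under the estimator's class name rather than "Instrumentation",
  // so that messages are attributable to the algorithm that emitted them.
  new OptionalInstrumentation(Some(instr),
    instr.estimator.getClass.getName.stripSuffix("$"))
}
```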


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r182925491
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -144,8 +168,23 @@ class KMeansModel private[ml] (
   // TODO: Replace the temp fix when we have proper evaluators defined for 
clustering.
   @Since("2.0.0")
   def computeCost(dataset: Dataset[_]): Double = {
-SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new 
VectorUDT)
-val data: RDD[OldVector] = dataset.select(col($(featuresCol))).rdd.map 
{
+val typeCandidates = List( new VectorUDT,
--- End diff --

You can reuse validateAndTransformSchema here.
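Roughly (a sketch; it also routes the input through this PR's featureToVector helper so array columns are accepted):

```
@Since("2.0.0")
def computeCost(dataset: Dataset[_]): Double = {
  // Same accepted-types check as fit()/transform(), defined once in KMeansParams.
  validateAndTransformSchema(dataset.schema)
  val data: RDD[OldVector] = dataset
    .select(featureToVector(dataset, col($(featuresCol))))
    .rdd
    .map { case Row(point: Vector) => OldVectors.fromML(point) }
  parentModel.computeCost(data)
}
```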


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r182925299
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -305,15 +344,45 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  @Since("2.4.0")
+  def featureToVector(dataset: Dataset[_], col: Column): Column = {
--- End diff --

Is this a copy of the same method?  It should be shared, either in 
KMeansParams or in a static (object) method.
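For example, a sketch of a single shared version in KMeansParams (imports as in KMeans.scala; the VectorUDT case just returns the column, avoiding the UDF serialization overhead mentioned below):

```
private[clustering] def featureToVector(dataset: Dataset[_], col: Column): Column = {
  dataset.schema($(featuresCol)).dataType match {
    case _: VectorUDT =>
      col  // already a Vector column; no UDF needed
    case ArrayType(DoubleType, _) =>
      udf((xs: Seq[Double]) => Vectors.dense(xs.toArray)).apply(col)
    case ArrayType(FloatType, _) =>
      udf((xs: Seq[Float]) => Vectors.dense(xs.map(_.toDouble).toArray)).apply(col)
  }
}
```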


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r182924819
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -120,11 +123,32 @@ class KMeansModel private[ml] (
   @Since("2.0.0")
   def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
 
+  @Since("2.4.0")
+  def featureToVector(dataset: Dataset[_], col: Column): Column = {
--- End diff --

Make this private.  In general, we try to keep APIs as private as possible 
since that allows us more flexibility to make changes in the future.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r182924903
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -120,11 +123,32 @@ class KMeansModel private[ml] (
   @Since("2.0.0")
   def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
 
+  @Since("2.4.0")
+  def featureToVector(dataset: Dataset[_], col: Column): Column = {
+val featuresDataType = dataset.schema(getFeaturesCol).dataType
+val transferUDF = featuresDataType match {
+  case _: VectorUDT => udf((vector: Vector) => vector)
--- End diff --

Just return ```col(getFeaturesCol)``` since that will be more efficient.  
(Calling a UDF requires data serialization overhead.)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r182925210
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -120,11 +123,32 @@ class KMeansModel private[ml] (
   @Since("2.0.0")
   def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
 
+  @Since("2.4.0")
+  def featureToVector(dataset: Dataset[_], col: Column): Column = {
--- End diff --

Also, add a Scala docstring saying what this does.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

2018-04-19 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/21081
  
@WeichenXu123  A generic vector class would be interesting, but that would 
be a big project, way out of scope of this PR.  You could bring it up if that 
person on the dev list sends a SPIP about linear algebra.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21078: [SPARK-23990][ML] Instruments logging improvement...

2018-04-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21078#discussion_r182836091
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -378,18 +378,24 @@ class LinearRegression @Since("1.3.0") 
(@Since("1.3.0") override val uid: String
 
 val yMean = ySummarizer.mean(0)
 val rawYStd = math.sqrt(ySummarizer.variance(0))
+
+instr.logNamedValue(Instrumentation.loggerTags.numExamples, 
ySummarizer.count)
--- End diff --

It might be worth creating a logNumExamples() method in Instrumentation 
since I bet we'll add this elsewhere.
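e.g., a small sketch:

```
// In Instrumentation:
def logNumExamples(num: Long): Unit = {
  logNamedValue(Instrumentation.loggerTags.numExamples, num)
}

// The call site above then becomes:
instr.logNumExamples(ySummarizer.count)
```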


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21078: [SPARK-23990][ML] Instruments logging improvement...

2018-04-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21078#discussion_r182834828
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala ---
@@ -150,3 +156,35 @@ private[spark] object Instrumentation {
   }
 
 }
+
+private[spark] class OptionalInstrument private (
+val instrument: Option[Instrumentation[_ <: Estimator[_]]],
+val className: String) extends Logging {
+
+  def this(instr: Instrumentation[_ <: Estimator[_]]) = this(Some(instr), 
"")
--- End diff --

Also, let's set logName in the constructor so that, when we create() this 
with an Instrumentation instance, we can match the Instrumentation instance's 
logName: instr.getClass.getName.stripSuffix(...)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21078: [SPARK-23990][ML] Instruments logging improvement...

2018-04-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21078#discussion_r182835834
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -326,7 +326,7 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") 
override val uid: String
 Instance(label, weight, features)
 }
 
-val instr = Instrumentation.create(this, dataset)
+val instr = Instrumentation.create(this, instances)
--- End diff --

Why this change?  Is this to match other things inheriting from Predictor?  
Let's keep dataset since I hope (maybe in 3.0) we can switch everything to 
create Instrumentation with Datasets instead of RDDs (since Datasets have more 
info available).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21078: [SPARK-23990][ML] Instruments logging improvement...

2018-04-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21078#discussion_r182833013
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala ---
@@ -150,3 +156,35 @@ private[spark] object Instrumentation {
   }
 
 }
+
+private[spark] class OptionalInstrument private (
+val instrument: Option[Instrumentation[_ <: Estimator[_]]],
+val className: String) extends Logging {
+
+  def this(instr: Instrumentation[_ <: Estimator[_]]) = this(Some(instr), 
"")
+
+  def this(clazz: Class[_]) = this(None, clazz.getName.stripSuffix("$"))
+
+  protected override def logName = className
--- End diff --

Fix IntelliJ style warning: add explicit type annotation


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21078: [SPARK-23990][ML] Instruments logging improvement...

2018-04-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21078#discussion_r182833153
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala ---
@@ -150,3 +156,35 @@ private[spark] object Instrumentation {
   }
 
 }
+
+private[spark] class OptionalInstrument private (
--- End diff --

Stick with existing naming conventions: OptionalInstrument -> 
OptionalInstrumentation


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21078: [SPARK-23990][ML] Instruments logging improvement...

2018-04-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21078#discussion_r182833420
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala ---
@@ -150,3 +156,35 @@ private[spark] object Instrumentation {
   }
 
 }
+
+private[spark] class OptionalInstrument private (
+val instrument: Option[Instrumentation[_ <: Estimator[_]]],
--- End diff --

instrument -> either `instrumentation` or `instr`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21078: [SPARK-23990][ML] Instruments logging improvement...

2018-04-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21078#discussion_r182833731
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala ---
@@ -150,3 +156,35 @@ private[spark] object Instrumentation {
   }
 
 }
+
+private[spark] class OptionalInstrument private (
+val instrument: Option[Instrumentation[_ <: Estimator[_]]],
+val className: String) extends Logging {
+
+  def this(instr: Instrumentation[_ <: Estimator[_]]) = this(Some(instr), 
"")
--- End diff --

Stick with conventions: Move these constructors to object methods called 
`create()`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



spark git commit: [SPARK-24026][ML] Add Power Iteration Clustering to spark.ml

2018-04-19 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 6e19f7683 -> a471880af


[SPARK-24026][ML] Add Power Iteration Clustering to spark.ml

## What changes were proposed in this pull request?

This PR adds PowerIterationClustering as a Transformer to spark.ml.  In the 
transform method, it calls spark.mllib's PowerIterationClustering.run() method 
and converts the returned assignments (the k-means clustering of the 
pseudo-eigenvector) into a DataFrame (id: LongType, cluster: IntegerType).

This PR is copied and modified from https://github.com/apache/spark/pull/15770  
The primary author is wangmiao1981

## How was this patch tested?

This PR has 2 types of tests:
* Copies of tests from spark.mllib's PIC tests
* New tests specific to the spark.ml APIs

Author: wm...@hotmail.com 
Author: wangmiao1981 
Author: Joseph K. Bradley 

Closes #21090 from jkbradley/wangmiao1981-pic.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a471880a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a471880a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a471880a

Branch: refs/heads/master
Commit: a471880afbeafd4ef54c15a97e72ea7ff784a88d
Parents: 6e19f76
Author: wm...@hotmail.com 
Authored: Thu Apr 19 09:40:20 2018 -0700
Committer: Joseph K. Bradley 
Committed: Thu Apr 19 09:40:20 2018 -0700

--
 .../clustering/PowerIterationClustering.scala   | 256 +++
 .../PowerIterationClusteringSuite.scala | 238 +
 2 files changed, 494 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a471880a/mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
new file mode 100644
index 000..2c30a1d
--- /dev/null
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
@@ -0,0 +1,256 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types._
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params with 
HasMaxIter
+  with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.4.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. " +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.4.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" to 
use a random vector
+   * as vertex properties, or "degree" to use a normalized sum of similarities 
with other vertices.
+   * Default: random.
+   * @group expertParam
+   */
+  @Since("2.4.0")
+  final val initMode = {
+val allowedParams = ParamValidators.inArray(Array("random", "degree"))
+new Param[String](this, "initMode", "The initialization algorithm. This 
can be either " +
+  "'random' to use a random vector as vertex properties, or 'degree' to 
use a normalized sum " +
+  "of similarities with other vertices.  Supported options: 'random' and 
'degree'.",

[GitHub] spark issue #21090: [SPARK-15784][ML] Add Power Iteration Clustering to spar...

2018-04-19 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/21090
  
Thanks for reviewing this and for the LGTM @wangmiao1981 !  I'll merge with 
master now, with you as the primary author.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21090: [SPARK-15784][ML] Add Power Iteration Clustering ...

2018-04-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21090#discussion_r182606716
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,256 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types._
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.4.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.4.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use a normalized sum of 
similarities with other vertices.
+   * Default: random.
+   * @group expertParam
+   */
+  @Since("2.4.0")
+  final val initMode = {
+val allowedParams = ParamValidators.inArray(Array("random", "degree"))
+new Param[String](this, "initMode", "The initialization algorithm. 
This can be either " +
+  "'random' to use a random vector as vertex properties, or 'degree' 
to use a normalized sum " +
+  "of similarities with other vertices.  Supported options: 'random' 
and 'degree'.",
+  allowedParams)
+  }
+
+  /** @group expertGetParam */
+  @Since("2.4.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the name of the input column for vertex IDs.
+   * Default: "id"
+   * @group param
+   */
+  @Since("2.4.0")
+  val idCol = new Param[String](this, "idCol", "Name of the input column 
for vertex IDs.",
+(value: String) => value.nonEmpty)
+
+  setDefault(idCol, "id")
+
+  /** @group getParam */
+  @Since("2.4.0")
+  def getIdCol: String = getOrDefault(idCol)
+
+  /**
+   * Param for the name of the input column for neighbors in the adjacency 
list representation.
+   * Default: "neighbors"
+   * @group param
+   */
+  @Since("2.4.0")
+  val neighborsCol = new Param[String](this, "neighborsCol",
+"Name of the input column for neighbors in the adjacency list 
representation.",
+(value: String) => value.nonEmpty)
+
+  setDefault(neighborsCol, "neighbors")
+
+  /** @group getParam */
+  @Since("2.4.0")
+  def getNeighborsCol: String = $(neighborsCol)
+
+  /**
+   * Param for the name of the input column for neighbors in the adjacency 
list representation.
+   * Default: "similarities"
+   * @group param
+   */
+  @Since("2.4.0")
+  val similaritiesCol = new Param[String](this, "similaritiesCol",
+"Name of the input column for neighbors in the adjacency list 
representation.",
+(value: String) => value.nonEmpty)
+
 

[GitHub] spark issue #21081: [SPARK-23975][ML]Allow Clustering to take Arrays of Doub...

2018-04-18 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/21081
  
I hope we can apply it to other algs too.  @ludatabricks is doing some 
refactoring which should make that easier, but we're not going for a completely 
general approach right away.

I don't think we need to worry about sparse FloatType features; users have 
no way to pass those in.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21078: [SPARK-23990][ML] Instruments logging improvements - ML ...

2018-04-17 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/21078
  
Thanks for thinking through the optional logging issue!  I responded in the 
JIRA to preserve the design discussion there: 
https://issues.apache.org/jira/browse/SPARK-23990


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r182269723
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -123,7 +128,21 @@ class KMeansModel private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
-val predictUDF = udf((vector: Vector) => predict(vector))
+// val predictUDF = udf((vector: Vector) => predict(vector))
+val predictUDF = if 
(dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
+  udf((vector: Vector) => predict(vector))
+}
+else {
+  udf((vector: Seq[_]) => {
+val featureArray = Array.fill[Double](vector.size)(0.0)
--- End diff --

Here's what I meant:
```
val predictUDF = featuresDataType match {
  case _: VectorUDT =>
udf((vector: Vector) => predict(vector))
  case fdt: ArrayType => fdt.elementType match {
case _: FloatType =>
  ???
case _: DoubleType =>
  ???
  }
}
```
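Filled in, the two array branches might look roughly like this (a sketch; the Float case widens to Double before calling predict):

```
val predictUDF = featuresDataType match {
  case _: VectorUDT =>
    udf((vector: Vector) => predict(vector))
  case fdt: ArrayType => fdt.elementType match {
    case _: FloatType =>
      udf((xs: Seq[Float]) => predict(Vectors.dense(xs.map(_.toDouble).toArray)))
    case _: DoubleType =>
      udf((xs: Seq[Double]) => predict(Vectors.dense(xs.toArray)))
  }
}
dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol))))
```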


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r182269644
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -123,7 +128,21 @@ class KMeansModel private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
-val predictUDF = udf((vector: Vector) => predict(vector))
+// val predictUDF = udf((vector: Vector) => predict(vector))
+val predictUDF = if 
(dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
+  udf((vector: Vector) => predict(vector))
--- End diff --

Side note: I realized that "predict" will cause the whole model to be 
serialized and sent to workers.  But that's actually OK since we do need to 
send most of the model data to make predictions and since there's not a clean 
way to just send the model weights.  So I think my previous comment about 
copying "numClasses" to a local variable was not necessary.  Don't bother 
reverting the change though.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r182216415
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -90,7 +90,12 @@ private[clustering] trait KMeansParams extends Params 
with HasMaxIter with HasFe
* @return output schema
*/
   protected def validateAndTransformSchema(schema: StructType): StructType 
= {
-SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+val typeCandidates = List( new VectorUDT,
+  new ArrayType(DoubleType, true),
--- End diff --

Also, IntelliJ may warn you that boolean literal arguments should be passed as 
named arguments; that'd be nice to fix here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r182215434
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -123,7 +128,21 @@ class KMeansModel private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
-val predictUDF = udf((vector: Vector) => predict(vector))
+// val predictUDF = udf((vector: Vector) => predict(vector))
+val predictUDF = if 
(dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
+  udf((vector: Vector) => predict(vector))
+}
--- End diff --

Scala style: } else {


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r182216309
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -90,7 +90,12 @@ private[clustering] trait KMeansParams extends Params 
with HasMaxIter with HasFe
* @return output schema
*/
   protected def validateAndTransformSchema(schema: StructType): StructType 
= {
-SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+val typeCandidates = List( new VectorUDT,
+  new ArrayType(DoubleType, true),
--- End diff --

Thinking about this, let's actually disallow nullable columns.  KMeans 
won't handle nulls properly.
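So, with the named-argument note above applied as well, the candidate list might read:

```
val typeCandidates = List(
  new VectorUDT,
  new ArrayType(DoubleType, containsNull = false),
  new ArrayType(FloatType, containsNull = false))
SchemaUtils.checkColumnTypes(schema, $(featuresCol), typeCandidates)
```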


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r182217722
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -123,7 +128,21 @@ class KMeansModel private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
-val predictUDF = udf((vector: Vector) => predict(vector))
+// val predictUDF = udf((vector: Vector) => predict(vector))
+val predictUDF = if 
(dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
+  udf((vector: Vector) => predict(vector))
+}
+else {
+  udf((vector: Seq[_]) => {
+val featureArray = Array.fill[Double](vector.size)(0.0)
--- End diff --

You shouldn't have to do the conversion in this convoluted (and less 
efficient) way.  I'd recommend doing a match-case statement on dataset.schema; 
I think that will be the most succinct.  Then you can handle Vector, Seq of 
Float, and Seq of Double separately, without conversions to strings.

Same for the similar cases below.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r182215639
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -123,7 +128,21 @@ class KMeansModel private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
-val predictUDF = udf((vector: Vector) => predict(vector))
+// val predictUDF = udf((vector: Vector) => predict(vector))
+val predictUDF = if 
(dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
+  udf((vector: Vector) => predict(vector))
+}
+else {
+  udf((vector: Seq[_]) => {
--- End diff --

scala style: remove unnecessary ```{``` at end of line (IntelliJ should 
warn you about this)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2018-04-17 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/15770
  
OK sorry to push @wangmiao1981 !  I just want to make sure this gets in 
before I no longer have bandwidth for it.  If you have the time, would you mind 
checking the updates I made in the new PR?   Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21090: [SPARK-15784][ML] Add Power Iteration Clustering to spar...

2018-04-17 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/21090
  
@wangmiao1981 and @WeichenXu123 would you mind taking a look?  Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21090: [SPARK-15784][ML] Add Power Iteration Clustering to spar...

2018-04-17 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/21090
  
**To review this PR**: This was copied from 
https://github.com/apache/spark/pull/15770 with the following changes:
* Addressed comments in original PR  (See my review comments there)
* Added Param validators for required input columns
* Renamed “weights” column to “similarities”
* Made algorithm take more types of inputs: Long/Int and Double/Float
* Removed test("set parameters") since setters are already tested in the 
read/write test.

If you saw the previous PR, you should be able to review this one based on 
the last 3 commits, viewable in this diff: 
https://github.com/jkbradley/spark/compare/5cb8ed6...wangmiao1981-pic


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21090: [SPARK-15784][ML] Add Power Iteration Clustering ...

2018-04-17 Thread jkbradley
GitHub user jkbradley opened a pull request:

https://github.com/apache/spark/pull/21090

[SPARK-15784][ML] Add Power Iteration Clustering to spark.ml

## What changes were proposed in this pull request?

This PR adds PowerIterationClustering as a Transformer to spark.ml.  In the 
transform method, it calls spark.mllib's PowerIterationClustering.run() method 
and converts the returned assignments (the k-means clustering of the 
pseudo-eigenvector) into a DataFrame (id: LongType, cluster: IntegerType).

This PR is copied and modified from 
https://github.com/apache/spark/pull/15770  The primary author is @wangmiao1981 

## How was this patch tested?

This PR has 2 types of tests:
* Copies of tests from spark.mllib's PIC tests
* New tests specific to the spark.ml APIs


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jkbradley/spark wangmiao1981-pic

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21090.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21090


commit e4492a64b74b0ccc2da8f13353d37bb9bb0c
Author: wm...@hotmail.com 
Date:   2016-06-13T19:47:42Z

add pic framework (model, class etc)

commit 70862491e5b86ce4add500a0c96ae5220733b35d
Author: wm...@hotmail.com 
Date:   2016-06-13T23:28:09Z

change a comment

commit b73d8a78fa69f83c278996feb1b19502ef871c5b
Author: wm...@hotmail.com 
Date:   2016-06-17T17:27:55Z

add missing functions fit predict load save etc.

commit 022fe523f735c5519f948b175871489f79434fb5
Author: wm...@hotmail.com 
Date:   2016-06-18T01:12:41Z

add unit test flie

commit 552cf54fb03f88af023f080e60fa50f1f39060fc
Author: wm...@hotmail.com 
Date:   2016-06-20T17:35:05Z

add test cases part 1

commit 0b4954d55b4d344794d3c47366220c67f07d0d43
Author: wm...@hotmail.com 
Date:   2016-06-20T20:29:54Z

add unit test part 2: test fit, parameters etc.

commit f22b01e06eaaf5951befcebdffc18c8e519183d2
Author: wm...@hotmail.com 
Date:   2016-06-20T21:22:59Z

fix a type issue

commit 305b194dae40eaff990c18837c3f2bc8d469e60c
Author: wm...@hotmail.com 
Date:   2016-06-21T20:07:27Z

add more unit tests

commit 4b32cbf02965c5c1a0c094fa144836dab0dfd543
Author: wm...@hotmail.com 
Date:   2016-06-21T21:46:25Z

delete unused import and add comments

commit f6eda88a6c0af416b988a2c37f46c8b08e5e99cf
Author: wm...@hotmail.com 
Date:   2016-10-25T21:28:12Z

change version to 2.1.0

commit 45c4b1cd1afa28c775c666b57ecee614ed9a41d0
Author: wm...@hotmail.com 
Date:   2016-11-03T23:26:01Z

change PIC as a Transformer

commit e8d7ed37138909d010a812fba7d03ef30a4f6e05
Author: wm...@hotmail.com 
Date:   2016-11-04T17:28:26Z

add LabelCol

commit e4e1e055a9b3ab54b83331ac7dc56d6b792dcf7b
Author: wm...@hotmail.com 
Date:   2016-11-04T18:36:09Z

change col implementation

commit 8384422ec0e7192cc8ce53df02ddb4ae0401fd0b
Author: wm...@hotmail.com 
Date:   2017-02-17T22:20:00Z

address some of the comments

commit d6a199c48ff940861d80caf275da29d99375ce33
Author: wm...@hotmail.com 
Date:   2017-02-21T22:37:51Z

add additional test with dataset having more data

commit b0c3aff4a76ace99c104c2b2c10c9485a028bfd6
Author: wm...@hotmail.com 
Date:   2017-03-14T23:13:45Z

change input data format

commit 091225dd2f1c353edc28dc4299034a018a92bc81
Author: wm...@hotmail.com 
Date:   2017-03-15T22:49:45Z

resolve warnings

commit 8bb99567556ce29c75d5f395157d0161dff695bc
Author: wm...@hotmail.com 
Date:   2017-03-16T18:33:47Z

add neighbor and weight cols

commit 8ba82e8392e6d607ab750ed8eb3caaf386e1352a
Author: wangmiao1981 
Date:   2017-08-15T21:13:55Z

address review comments 1

commit 468a94741efe6530c9acfbb1af4f46499550ee1f
Author: wangmiao1981 
Date:   2017-08-15T21:23:39Z

fix style

commit ec10f24336ff51354a1657c7ceadb9ada8cd1484
Author: wangmiao1981 
Date:   2017-08-15T22:30:28Z

remove unused comments

commit 5710cfcf2e3596c95f353ce043f7358a030d70a0
Author: wangmiao1981 
Date:   2017-08-15T23:43:14Z

add Since

commit 88654b3055ebd863e3b3c5774abdce28f3cda184
Author: wangmiao1981 
Date:   2017-08-17T00:12:12Z

fix missing >

commit 804adc6fece91e7264f315ee965faa40c5e334c5
Author: wangmiao1981 
Date:   2017-08-17T17:26:40Z

fix doc

commit 4a6dd79a9c37f71ea4378692438f19b3247b7913
Author: wangmiao1981 
Date:   2017-10-25T23:16:55Z

address review comments

commit 5cb8ed6de3865f58719b3b30888b3bc4542905d4
Author: wangmiao1981 
Date:   2017-10-30T21:44:24Z

fix unit test

commit 6abf6023868d944068a26186cde3fbadffd83a74
Author: Joseph K. Bradley 
Date:   2018-04-03T23:46:40Z

cleanups to docs

commit d9270876797153d7660843fc621e707b4dff71ca
Author: Joseph K. Bradley 
Date:   2018-04-03T23:52:36Z

typo

commit d2157489770a79fe443d567bfc0

[GitHub] spark issue #20319: [SPARK-22884][ML][TESTS] ML test for StructuredStreaming...

2018-04-17 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/20319
  
Reviewing now!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



spark git commit: [SPARK-21741][ML][PYSPARK] Python API for DataFrame-based multivariate summarizer

2018-04-17 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master f39e82ce1 -> 1ca3c50fe


[SPARK-21741][ML][PYSPARK] Python API for DataFrame-based multivariate 
summarizer

## What changes were proposed in this pull request?

Python API for DataFrame-based multivariate summarizer.

## How was this patch tested?

doctest added.

Author: WeichenXu 

Closes #20695 from WeichenXu123/py_summarizer.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1ca3c50f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1ca3c50f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1ca3c50f

Branch: refs/heads/master
Commit: 1ca3c50fefb34532c78427fa74872db3ecbf7ba2
Parents: f39e82c
Author: WeichenXu 
Authored: Tue Apr 17 10:11:08 2018 -0700
Committer: Joseph K. Bradley 
Committed: Tue Apr 17 10:11:08 2018 -0700

--
 python/pyspark/ml/stat.py | 193 -
 1 file changed, 192 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1ca3c50f/python/pyspark/ml/stat.py
--
diff --git a/python/pyspark/ml/stat.py b/python/pyspark/ml/stat.py
index 93d0f4f..a06ab31 100644
--- a/python/pyspark/ml/stat.py
+++ b/python/pyspark/ml/stat.py
@@ -19,7 +19,9 @@ import sys
 
 from pyspark import since, SparkContext
 from pyspark.ml.common import _java2py, _py2java
-from pyspark.ml.wrapper import _jvm
+from pyspark.ml.wrapper import JavaWrapper, _jvm
+from pyspark.sql.column import Column, _to_seq
+from pyspark.sql.functions import lit
 
 
 class ChiSquareTest(object):
@@ -195,6 +197,195 @@ class KolmogorovSmirnovTest(object):
  _jvm().PythonUtils.toSeq(params)))
 
 
+class Summarizer(object):
+"""
+.. note:: Experimental
+
+Tools for vectorized statistics on MLlib Vectors.
+The methods in this package provide various statistics for Vectors 
contained inside DataFrames.
+This class lets users pick the statistics they would like to extract for a 
given column.
+
+>>> from pyspark.ml.stat import Summarizer
+>>> from pyspark.sql import Row
+>>> from pyspark.ml.linalg import Vectors
+>>> summarizer = Summarizer.metrics("mean", "count")
+>>> df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 1.0, 1.0)),
+...  Row(weight=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()
+>>> df.select(summarizer.summary(df.features, df.weight)).show(truncate=False)
++-----------------------------------+
+|aggregate_metrics(features, weight)|
++-----------------------------------+
+|[[1.0,1.0,1.0], 1]                 |
++-----------------------------------+
+
+>>> df.select(summarizer.summary(df.features)).show(truncate=False)
++--------------------------------+
+|aggregate_metrics(features, 1.0)|
++--------------------------------+
+|[[1.0,1.5,2.0], 2]              |
++--------------------------------+
+
+>>> df.select(Summarizer.mean(df.features, df.weight)).show(truncate=False)
++--------------+
+|mean(features)|
++--------------+
+|[1.0,1.0,1.0] |
++--------------+
+
+>>> df.select(Summarizer.mean(df.features)).show(truncate=False)
++--------------+
+|mean(features)|
++--------------+
+|[1.0,1.5,2.0] |
++--------------+
+
+
+.. versionadded:: 2.4.0
+
+"""
+@staticmethod
+@since("2.4.0")
+def mean(col, weightCol=None):
+"""
+return a column of mean summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "mean")
+
+@staticmethod
+@since("2.4.0")
+def variance(col, weightCol=None):
+"""
+return a column of variance summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "variance")
+
+@staticmethod
+@since("2.4.0")
+def count(col, weightCol=None):
+"""
+return a column of count summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "count")
+
+@staticmethod
+@since("2.4.0")
+def numNonZeros(col, weightCol=None):
+"""
+return a column of numNonZero summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "numNonZeros")
+
+@staticmethod
+@since("2.4.0")
+def max(col, weightCol=None):
+"""
+return a column of max summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "max")
+
+@staticmethod
+@since("2.4.0")
+def min(col, weightCol=None):
+"""
+return a column of min summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "min")
+
+@staticmethod
+@since("2.4.0")
+  

[GitHub] spark issue #20695: [SPARK-21741][ML][PySpark] Python API for DataFrame-base...

2018-04-17 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/20695
  
LGTM
Thanks for the PR!
Merging with master


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2018-04-16 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/15770
  
I don't mind; I'll take it.  But I'll mark @wangmiao1981 as the main 
contributor for the PR.  Would you mind closing this issue @wangmiao1981  and 
I'll reopen a new PR under the same JIRA?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r181847061
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -144,7 +156,12 @@ class KMeansModel private[ml] (
   // TODO: Replace the temp fix when we have proper evaluators defined for 
clustering.
   @Since("2.0.0")
   def computeCost(dataset: Dataset[_]): Double = {
-SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new 
VectorUDT)
+val typeCandidates = List( new VectorUDT,
+  new ArrayType(DoubleType, true),
+  new ArrayType(DoubleType, false),
+  new ArrayType(FloatType, true),
+  new ArrayType(FloatType, false))
+SchemaUtils.checkColumnTypes(dataset.schema, $(featuresCol), 
typeCandidates)
 val data: RDD[OldVector] = dataset.select(col($(featuresCol))).rdd.map 
{
--- End diff --

this won't take non-Vector types though; a unit test would catch this


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r181846789
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -123,8 +128,15 @@ class KMeansModel private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
-val predictUDF = udf((vector: Vector) => predict(vector))
-dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol))))
+// val predictUDF = udf((vector: Vector) => predict(vector))
+if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
--- End diff --

tip: This can be more succinct if written as:
```
val predictUDF = if (dataset.schema(...).dataType.equals(...)) { A } else { B }
dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol))))  // so this line is only written once
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r181841713
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -194,6 +195,34 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 assert(e.getCause.getMessage.contains("Cosine distance is not 
defined"))
   }
 
+  test("KMean with Array input") {
--- End diff --

This should check transform and computeCost too
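
For example (an illustrative sketch, not code from this PR; `model`, 
`newdataset`, and the k = 2 setup are taken from the test above):
```scala
// Exercise transform() and computeCost() on the array-typed features column.
val predictions = model.transform(newdataset)
assert(predictions.columns.contains("prediction"))
assert(predictions.select("prediction").distinct().count() <= 2)  // at most k clusters
assert(model.computeCost(newdataset) >= 0.0)
```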


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r181847695
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -312,6 +329,8 @@ class KMeans @Since("1.5.0") (
 val handlePersistence = dataset.storageLevel == StorageLevel.NONE
 val instances: RDD[OldVector] = 
dataset.select(col($(featuresCol))).rdd.map {
   case Row(point: Vector) => OldVectors.fromML(point)
+  case Row(point: Seq[_]) =>
+
OldVectors.fromML(Vectors.dense(point.asInstanceOf[Seq[Double]].toArray))
--- End diff --

I'm not sure this will work with arrays of FloatType.  Make sure to test it
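
One way to make this robust (a hypothetical sketch, not the PR's code) is to 
convert element-wise instead of casting the whole Seq, so both array<double> 
and array<float> columns are covered; this would replace the `case` arm in the 
`rdd.map` above:
```scala
case Row(point: Seq[_]) =>
  // Convert each element explicitly; a blanket asInstanceOf[Seq[Double]] may
  // throw a ClassCastException once float elements are unboxed.
  val values = point.map {
    case d: Double => d
    case f: Float => f.toDouble
  }.toArray
  OldVectors.fromML(Vectors.dense(values))
```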


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r181840894
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -194,6 +195,34 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 assert(e.getCause.getMessage.contains("Cosine distance is not 
defined"))
   }
 
+  test("KMean with Array input") {
+val featuresColName = "array_model_features"
+
+val arrayUDF = udf { (features: Vector) =>
+  features.toArray
+}
+val newdataset = dataset.withColumn(featuresColName, 
arrayUDF(col("features")) )
--- End diff --

nit: You could drop the original column as well just to make extra sure 
that it's not being accidentally used.
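
i.e., something along these lines (hypothetical, building on the line above):
```scala
val newdataset = dataset
  .withColumn(featuresColName, arrayUDF(col("features")))
  .drop("features")  // drop the original Vector column so it cannot be used by accident
```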


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r181841557
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -194,6 +195,34 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 assert(e.getCause.getMessage.contains("Cosine distance is not 
defined"))
   }
 
+  test("KMean with Array input") {
+val featuresColName = "array_model_features"
+
+val arrayUDF = udf { (features: Vector) =>
+  features.toArray
+}
+val newdataset = dataset.withColumn(featuresColName, 
arrayUDF(col("features")) )
+
+val kmeans = new KMeans()
+  .setFeaturesCol(featuresColName)
+
+assert(kmeans.getK === 2)
+assert(kmeans.getFeaturesCol === featuresColName)
+assert(kmeans.getPredictionCol === "prediction")
+assert(kmeans.getMaxIter === 20)
+assert(kmeans.getInitMode === MLlibKMeans.K_MEANS_PARALLEL)
+assert(kmeans.getInitSteps === 2)
+assert(kmeans.getTol === 1e-4)
+assert(kmeans.getDistanceMeasure === DistanceMeasure.EUCLIDEAN)
+val model = kmeans.setMaxIter(1).fit(newdataset)
+
+MLTestingUtils.checkCopyAndUids(kmeans, model)
--- End diff --

ditto for hasSummary and copying


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r181841503
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -194,6 +195,34 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 assert(e.getCause.getMessage.contains("Cosine distance is not 
defined"))
   }
 
+  test("KMean with Array input") {
+val featuresColName = "array_model_features"
+
+val arrayUDF = udf { (features: Vector) =>
+  features.toArray
+}
+val newdataset = dataset.withColumn(featuresColName, 
arrayUDF(col("features")) )
+
+val kmeans = new KMeans()
+  .setFeaturesCol(featuresColName)
+
+assert(kmeans.getK === 2)
+assert(kmeans.getFeaturesCol === featuresColName)
+assert(kmeans.getPredictionCol === "prediction")
+assert(kmeans.getMaxIter === 20)
+assert(kmeans.getInitMode === MLlibKMeans.K_MEANS_PARALLEL)
+assert(kmeans.getInitSteps === 2)
+assert(kmeans.getTol === 1e-4)
+assert(kmeans.getDistanceMeasure === DistanceMeasure.EUCLIDEAN)
+val model = kmeans.setMaxIter(1).fit(newdataset)
+
+MLTestingUtils.checkCopyAndUids(kmeans, model)
--- End diff --

You don't need this test here


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r181847784
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -123,8 +128,15 @@ class KMeansModel private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
-val predictUDF = udf((vector: Vector) => predict(vector))
-dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol
+// val predictUDF = udf((vector: Vector) => predict(vector))
+if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
+  val predictUDF = udf((vector: Vector) => predict(vector))
+  dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol
+} else {
+  val predictUDF = udf((vector: Seq[_]) =>
+predict(Vectors.dense(vector.asInstanceOf[Seq[Double]].toArray)))
--- End diff --

This may not work with arrays of FloatType.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...

2018-04-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21081#discussion_r181840765
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -194,6 +195,34 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 assert(e.getCause.getMessage.contains("Cosine distance is not 
defined"))
   }
 
+  test("KMean with Array input") {
+val featuresColName = "array_model_features"
+
+val arrayUDF = udf { (features: Vector) =>
+  features.toArray
+}
+val newdataset = dataset.withColumn(featuresColName, 
arrayUDF(col("features")) )
+
+val kmeans = new KMeans()
+  .setFeaturesCol(featuresColName)
+
+assert(kmeans.getK === 2)
+assert(kmeans.getFeaturesCol === featuresColName)
+assert(kmeans.getPredictionCol === "prediction")
--- End diff --

No need to check this or the other Params which are not relevant to this 
test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



spark git commit: [SPARK-21088][ML] CrossValidator, TrainValidationSplit support collect all models when fitting: Python API

2018-04-16 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 5003736ad -> 04614820e


[SPARK-21088][ML] CrossValidator, TrainValidationSplit support collect all 
models when fitting: Python API

## What changes were proposed in this pull request?

Add python API for collecting sub-models during 
CrossValidator/TrainValidationSplit fitting.

## How was this patch tested?

UT added.

Author: WeichenXu 

Closes #19627 from WeichenXu123/expose-model-list-py.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/04614820
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/04614820
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/04614820

Branch: refs/heads/master
Commit: 04614820e103feeae91299dc90dba1dd628fd485
Parents: 5003736
Author: WeichenXu 
Authored: Mon Apr 16 11:31:24 2018 -0500
Committer: Joseph K. Bradley 
Committed: Mon Apr 16 11:31:24 2018 -0500

--
 .../apache/spark/ml/tuning/CrossValidator.scala |  11 ++
 .../spark/ml/tuning/TrainValidationSplit.scala  |  11 ++
 .../pyspark/ml/param/_shared_params_code_gen.py |   5 +
 python/pyspark/ml/param/shared.py   |  24 +
 python/pyspark/ml/tests.py  |  78 ++
 python/pyspark/ml/tuning.py | 107 ++-
 python/pyspark/ml/util.py   |   4 +
 7 files changed, 211 insertions(+), 29 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/04614820/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala 
b/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala
index a0b507d..c2826dc 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala
@@ -270,6 +270,17 @@ class CrossValidatorModel private[ml] (
 this
   }
 
+  // A Python-friendly auxiliary method
+  private[tuning] def setSubModels(subModels: JList[JList[Model[_]]])
+: CrossValidatorModel = {
+_subModels = if (subModels != null) {
+  Some(subModels.asScala.toArray.map(_.asScala.toArray))
+} else {
+  None
+}
+this
+  }
+
   /**
* @return submodels represented in two dimension array. The index of outer 
array is the
* fold index, and the index of inner array corresponds to the 
ordering of

http://git-wip-us.apache.org/repos/asf/spark/blob/04614820/mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala 
b/mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala
index 88ff0df..8d1b9a8 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala
@@ -262,6 +262,17 @@ class TrainValidationSplitModel private[ml] (
 this
   }
 
+  // A Python-friendly auxiliary method
+  private[tuning] def setSubModels(subModels: JList[Model[_]])
+: TrainValidationSplitModel = {
+_subModels = if (subModels != null) {
+  Some(subModels.asScala.toArray)
+} else {
+  None
+}
+this
+  }
+
   /**
* @return submodels represented in array. The index of array corresponds to 
the ordering of
* estimatorParamMaps

http://git-wip-us.apache.org/repos/asf/spark/blob/04614820/python/pyspark/ml/param/_shared_params_code_gen.py
--
diff --git a/python/pyspark/ml/param/_shared_params_code_gen.py 
b/python/pyspark/ml/param/_shared_params_code_gen.py
index db951d8..6e9e0a3 100644
--- a/python/pyspark/ml/param/_shared_params_code_gen.py
+++ b/python/pyspark/ml/param/_shared_params_code_gen.py
@@ -157,6 +157,11 @@ if __name__ == "__main__":
  "TypeConverters.toInt"),
 ("parallelism", "the number of threads to use when running parallel 
algorithms (>= 1).",
  "1", "TypeConverters.toInt"),
+("collectSubModels", "Param for whether to collect a list of 
sub-models trained during " +
+ "tuning. If set to false, then only the single best sub-model will be 
available after " +
+ "fitting. If set to true, then all sub-models will be available. 
Warning: For large " +
+ "models, collecting all sub-models can cause OOMs on the Spark 
driver.",
+ "False", "TypeConverters.toBoolean"),
 ("loss", "the loss function to be optimized.", None, 
"TypeConverters.toString")]
 
 code = []

http://git-wip-us.apache.org/repos/asf/spark/blob/04614820/py

[GitHub] spark issue #19627: [SPARK-21088][ML] CrossValidator, TrainValidationSplit s...

2018-04-16 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19627
  
LGTM
Merging with master
Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #7652: [SPARK-9312] [ML] Added max confidence factor to OneVsRes...

2018-04-16 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/7652
  
Hi, sorry I let this PR get stale.  This should be resolved now by 
https://github.com/apache/spark/pull/21044, so would you mind closing this 
issue, @badriub?  Thanks though!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



spark git commit: [SPARK-9312][ML] Add RawPrediction, numClasses, and numFeatures for OneVsRestModel

2018-04-16 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 083cf2235 -> 5003736ad


[SPARK-9312][ML] Add RawPrediction, numClasses, and numFeatures for 
OneVsRestModel

add RawPrediction as output column
add numClasses and numFeatures to OneVsRestModel

## What changes were proposed in this pull request?

- Add two vals, numClasses and numFeatures, to OneVsRestModel so that we can 
inherit from Classifier in the future

- Add a rawPrediction output column in transform; the prediction label is 
calculated from the rawPrediction, as in raw2prediction

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, 
manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)
Please review http://spark.apache.org/contributing.html before opening a pull 
request.

Author: Lu WANG 

Closes #21044 from ludatabricks/SPARK-9312.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5003736a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5003736a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5003736a

Branch: refs/heads/master
Commit: 5003736ad60c3231bb18264c9561646c08379170
Parents: 083cf22
Author: Lu WANG 
Authored: Mon Apr 16 11:27:30 2018 -0500
Committer: Joseph K. Bradley 
Committed: Mon Apr 16 11:27:30 2018 -0500

--
 .../spark/ml/classification/OneVsRest.scala | 56 
 .../ml/classification/OneVsRestSuite.scala  |  7 ++-
 2 files changed, 51 insertions(+), 12 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5003736a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
index f04fde2..5348d88 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala
@@ -32,7 +32,7 @@ import org.apache.spark.SparkContext
 import org.apache.spark.annotation.Since
 import org.apache.spark.ml._
 import org.apache.spark.ml.attribute._
-import org.apache.spark.ml.linalg.Vector
+import org.apache.spark.ml.linalg.{Vector, Vectors}
 import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params}
 import org.apache.spark.ml.param.shared.{HasParallelism, HasWeightCol}
 import org.apache.spark.ml.util._
@@ -55,7 +55,7 @@ private[ml] trait ClassifierTypeTrait {
 /**
  * Params for [[OneVsRest]].
  */
-private[ml] trait OneVsRestParams extends PredictorParams
+private[ml] trait OneVsRestParams extends ClassifierParams
   with ClassifierTypeTrait with HasWeightCol {
 
   /**
@@ -138,6 +138,14 @@ final class OneVsRestModel private[ml] (
 @Since("1.4.0") val models: Array[_ <: ClassificationModel[_, _]])
   extends Model[OneVsRestModel] with OneVsRestParams with MLWritable {
 
+  require(models.nonEmpty, "OneVsRestModel requires at least one model for one 
class")
+
+  @Since("2.4.0")
+  val numClasses: Int = models.length
+
+  @Since("2.4.0")
+  val numFeatures: Int = models.head.numFeatures
+
   /** @group setParam */
   @Since("2.1.0")
   def setFeaturesCol(value: String): this.type = set(featuresCol, value)
@@ -146,6 +154,10 @@ final class OneVsRestModel private[ml] (
   @Since("2.1.0")
   def setPredictionCol(value: String): this.type = set(predictionCol, value)
 
+  /** @group setParam */
+  @Since("2.4.0")
+  def setRawPredictionCol(value: String): this.type = set(rawPredictionCol, 
value)
+
   @Since("1.4.0")
   override def transformSchema(schema: StructType): StructType = {
 validateAndTransformSchema(schema, fitting = false, 
getClassifier.featuresDataType)
@@ -181,6 +193,7 @@ final class OneVsRestModel private[ml] (
 val updateUDF = udf { (predictions: Map[Int, Double], prediction: 
Vector) =>
   predictions + ((index, prediction(1)))
 }
+
 model.setFeaturesCol($(featuresCol))
 val transformedDataset = model.transform(df).select(columns: _*)
 val updatedDataset = transformedDataset
@@ -195,15 +208,34 @@ final class OneVsRestModel private[ml] (
   newDataset.unpersist()
 }
 
-// output the index of the classifier with highest confidence as prediction
-val labelUDF = udf { (predictions: Map[Int, Double]) =>
-  predictions.maxBy(_._2)._1.toDouble
-}
+if (getRawPredictionCol != "") {
+  val numClass = models.length
 
-// output label and label metadata as prediction
-aggregatedDataset
-  .withColumn($(predictionCol), labelUDF(col(accColName)), labelMetadata)
-  .drop(accColName)
+  // output the RawPrediction as vector
+  val rawPredi

[GitHub] spark issue #21044: [SPARK-9312][ML] Add RawPrediction, numClasses, and numF...

2018-04-16 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/21044
  
LGTM
Merging with master
Thanks!!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20695: [SPARK-21741][ML][PySpark] Python API for DataFra...

2018-04-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20695#discussion_r181802586
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -195,6 +197,195 @@ def test(dataset, sampleCol, distName, *params):
  
_jvm().PythonUtils.toSeq(params)))
 
 
+class Summarizer(object):
+"""
+.. note:: Experimental
+
+Tools for vectorized statistics on MLlib Vectors.
+The methods in this package provide various statistics for Vectors 
contained inside DataFrames.
+This class lets users pick the statistics they would like to extract 
for a given column.
+
+>>> from pyspark.ml.stat import Summarizer
+>>> from pyspark.sql import Row
+>>> from pyspark.ml.linalg import Vectors
+>>> summarizer = Summarizer.metrics("mean", "count")
+>>> df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 
1.0, 1.0)),
+...  Row(weight=0.0, features=Vectors.dense(1.0, 
2.0, 3.0))]).toDF()
+>>> df.select(summarizer.summary(df.features, 
df.weight)).show(truncate=False)
++---+
+|aggregate_metrics(features, weight)|
++---+
+|[[1.0,1.0,1.0], 1] |
++---+
+
+>>> df.select(summarizer.summary(df.features)).show(truncate=False)
+++
+|aggregate_metrics(features, 1.0)|
+++
+|[[1.0,1.5,2.0], 2]  |
+++
+
+>>> df.select(Summarizer.mean(df.features, 
df.weight)).show(truncate=False)
++--+
+|mean(features)|
++--+
+|[1.0,1.0,1.0] |
++--+
+
+>>> df.select(Summarizer.mean(df.features)).show(truncate=False)
++--+
+|mean(features)|
++--+
+|[1.0,1.5,2.0] |
++--+
+
+
+.. versionadded:: 2.4.0
+
+"""
+@staticmethod
+@since("2.4.0")
+def mean(col, weightCol=None):
+"""
+return a column of mean summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "mean")
+
+@staticmethod
+@since("2.4.0")
+def variance(col, weightCol=None):
+"""
+return a column of variance summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "variance")
+
+@staticmethod
+@since("2.4.0")
+def count(col, weightCol=None):
+"""
+return a column of count summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "count")
+
+@staticmethod
+@since("2.4.0")
+def numNonZeros(col, weightCol=None):
+"""
+return a column of numNonZero summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "numNonZeros")
+
+@staticmethod
+@since("2.4.0")
+def max(col, weightCol=None):
+"""
+return a column of max summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "max")
+
+@staticmethod
+@since("2.4.0")
+def min(col, weightCol=None):
+"""
+return a column of min summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "min")
+
+@staticmethod
+@since("2.4.0")
+def normL1(col, weightCol=None):
+"""
+return a column of normL1 summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "normL1")
+
+@staticmethod
+@since("2.4.0")
+def normL2(col, weightCol=None):
+"""
+return a column of normL2 summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "normL2")
+
+@staticmethod
+def _check_param(featuresCol, weightCol):
+if weightCol is None:
+weightCol = lit(1.0)
+if not isinstance(featuresCol, Colum

[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r181731252
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -81,32 +81,37 @@ object KolmogorovSmirnovTest {
* Java-friendly version of `test(dataset: DataFrame, sampleCol: String, 
cdf: Double => Double)`
*/
   @Since("2.4.0")
-  def test(dataset: DataFrame, sampleCol: String,
-cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
-val f: Double => Double = x => cdf.call(x)
-test(dataset, sampleCol, f)
+  def test(
+  dataset: Dataset[_],
+  sampleCol: String,
+  cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
+test(dataset, sampleCol, (x: Double) => cdf.call(x))
--- End diff --

That's not a very scalable fix...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21044: [SPARK-9312][ML] Add RawPrediction, numClasses, a...

2018-04-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21044#discussion_r181288716
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -146,6 +152,10 @@ final class OneVsRestModel private[ml] (
   @Since("2.1.0")
   def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
 
+  /** @group setParam */
+  @Since("2.4.0")
+  def setRawPredictionCol(value: String): this.type = 
set(rawPredictionCol, value)
--- End diff --

You'll need to add this to the Estimator too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21044: [SPARK-9312][ML] Add RawPrediction, numClasses, a...

2018-04-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21044#discussion_r181288736
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -195,15 +206,32 @@ final class OneVsRestModel private[ml] (
   newDataset.unpersist()
 }
 
-// output the index of the classifier with highest confidence as 
prediction
-val labelUDF = udf { (predictions: Map[Int, Double]) =>
-  predictions.maxBy(_._2)._1.toDouble
-}
+// output the RawPrediction as vector
+if (getRawPredictionCol != "") {
+  val rawPredictionUDF = udf { (predictions: Map[Int, Double]) =>
+val predArray = Array.fill[Double](numClasses)(0.0)
+predictions.foreach { case (idx, value) => predArray(idx) = value }
+Vectors.dense(predArray)
+  }
+
+  // output the index of the classifier with highest confidence as 
prediction
+  val labelUDF = udf { (predictions: Vector) => 
predictions.argmax.toDouble }
 
-// output label and label metadata as prediction
-aggregatedDataset
-  .withColumn($(predictionCol), labelUDF(col(accColName)), 
labelMetadata)
-  .drop(accColName)
+  aggregatedDataset
+.withColumn(getRawPredictionCol, rawPredictionUDF(col(accColName)))
+.withColumn(getPredictionCol, labelUDF(col(getRawPredictionCol)), 
labelMetadata)
+.drop(accColName)
+}
+else {
+  // output the index of the classifier with highest confidence as 
prediction
+  val labelUDF = udf { (predictions: Map[Int, Double]) =>
+predictions.maxBy(_._2)._1.toDouble
+  }
+  // output confidence as rwa prediction, label and label metadata as 
prediction
--- End diff --

This comment seems to be in the wrong part of the code.  Also, there's a typo 
("rwa prediction").


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21044: [SPARK-9312][ML] Add RawPrediction, numClasses, a...

2018-04-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21044#discussion_r181288725
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -195,15 +206,32 @@ final class OneVsRestModel private[ml] (
   newDataset.unpersist()
 }
 
-// output the index of the classifier with highest confidence as 
prediction
-val labelUDF = udf { (predictions: Map[Int, Double]) =>
-  predictions.maxBy(_._2)._1.toDouble
-}
+// output the RawPrediction as vector
+if (getRawPredictionCol != "") {
+  val rawPredictionUDF = udf { (predictions: Map[Int, Double]) =>
+val predArray = Array.fill[Double](numClasses)(0.0)
+predictions.foreach { case (idx, value) => predArray(idx) = value }
+Vectors.dense(predArray)
+  }
+
+  // output the index of the classifier with highest confidence as 
prediction
+  val labelUDF = udf { (predictions: Vector) => 
predictions.argmax.toDouble }
 
-// output label and label metadata as prediction
-aggregatedDataset
-  .withColumn($(predictionCol), labelUDF(col(accColName)), 
labelMetadata)
-  .drop(accColName)
+  aggregatedDataset
+.withColumn(getRawPredictionCol, rawPredictionUDF(col(accColName)))
+.withColumn(getPredictionCol, labelUDF(col(getRawPredictionCol)), 
labelMetadata)
+.drop(accColName)
+}
+else {
--- End diff --

Scala style: This should go on the previous line: ```} else {```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21044: [SPARK-9312][ML] Add RawPrediction, numClasses, a...

2018-04-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21044#discussion_r181288721
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -195,15 +206,32 @@ final class OneVsRestModel private[ml] (
   newDataset.unpersist()
 }
 
-// output the index of the classifier with highest confidence as 
prediction
-val labelUDF = udf { (predictions: Map[Int, Double]) =>
-  predictions.maxBy(_._2)._1.toDouble
-}
+// output the RawPrediction as vector
+if (getRawPredictionCol != "") {
+  val rawPredictionUDF = udf { (predictions: Map[Int, Double]) =>
+val predArray = Array.fill[Double](numClasses)(0.0)
--- End diff --

This causes a subtle ContextCleaner bug: `numClasses` refers to a field of 
the class OneVsRestModel, so when Spark's closure capture serializes this UDF 
to send to executors, it will end up sending the entire OneVsRestModel object, 
rather than just the value for numClasses.  Make a local copy of the value 
numClasses within the transform() method to avoid this issue.
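
Concretely, a minimal sketch of that local-copy pattern (same names as the 
diff above, inside transform()):
```scala
// Copy the field into a local val so the UDF closure captures only an Int,
// not the whole OneVsRestModel instance.
val numClassesLocal = numClasses
val rawPredictionUDF = udf { (predictions: Map[Int, Double]) =>
  val predArray = Array.fill[Double](numClassesLocal)(0.0)
  predictions.foreach { case (idx, value) => predArray(idx) = value }
  Vectors.dense(predArray)
}
```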


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21044: [SPARK-9312][ML] Add RawPrediction, numClasses, a...

2018-04-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21044#discussion_r181288710
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -138,6 +138,12 @@ final class OneVsRestModel private[ml] (
 @Since("1.4.0") val models: Array[_ <: ClassificationModel[_, _]])
   extends Model[OneVsRestModel] with OneVsRestParams with MLWritable {
 
--- End diff --

Let's add a require() statement here which checks that models.nonEmpty is 
true (to throw an exception upon construction, rather than when numFeatures 
calls models.head below).  Just to be safe...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20695: [SPARK-21741][ML][PySpark] Python API for DataFra...

2018-04-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20695#discussion_r181263309
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -195,6 +197,185 @@ def test(dataset, sampleCol, distName, *params):
  
_jvm().PythonUtils.toSeq(params)))
 
 
+class Summarizer(object):
+"""
+.. note:: Experimental
+
+Tools for vectorized statistics on MLlib Vectors.
+The methods in this package provide various statistics for Vectors 
contained inside DataFrames.
+This class lets users pick the statistics they would like to extract 
for a given column.
+
+>>> from pyspark.ml.stat import Summarizer
+>>> from pyspark.sql import Row
+>>> from pyspark.ml.linalg import Vectors
+>>> summarizer = Summarizer.metrics("mean", "count")
+>>> df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 
1.0, 1.0)),
+...  Row(weight=0.0, features=Vectors.dense(1.0, 
2.0, 3.0))]).toDF()
+>>> df.select(summarizer.summary(df.features, 
df.weight)).show(truncate=False)
++---+
+|aggregate_metrics(features, weight)|
++---+
+|[[1.0,1.0,1.0], 1] |
++---+
+
+>>> df.select(summarizer.summary(df.features)).show(truncate=False)
+++
+|aggregate_metrics(features, 1.0)|
+++
+|[[1.0,1.5,2.0], 2]  |
+++
+
+>>> df.select(Summarizer.mean(df.features, 
df.weight)).show(truncate=False)
++--+
+|mean(features)|
++--+
+|[1.0,1.0,1.0] |
++--+
+
+>>> df.select(Summarizer.mean(df.features)).show(truncate=False)
++--+
+|mean(features)|
++--+
+|[1.0,1.5,2.0] |
++--+
+
+
+.. versionadded:: 2.4.0
+
+"""
+@staticmethod
+@since("2.4.0")
+def mean(col, weightCol=None):
+"""
+return a column of mean summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "mean")
+
+@staticmethod
+@since("2.4.0")
+def variance(col, weightCol=None):
+"""
+return a column of variance summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "variance")
+
+@staticmethod
+@since("2.4.0")
+def count(col, weightCol=None):
+"""
+return a column of count summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "count")
+
+@staticmethod
+@since("2.4.0")
+def numNonZeros(col, weightCol=None):
+"""
+return a column of numNonZero summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "numNonZeros")
+
+@staticmethod
+@since("2.4.0")
+def max(col, weightCol=None):
+"""
+return a column of max summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "max")
+
+@staticmethod
+@since("2.4.0")
+def min(col, weightCol=None):
+"""
+return a column of min summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "min")
+
+@staticmethod
+@since("2.4.0")
+def normL1(col, weightCol=None):
+"""
+return a column of normL1 summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "normL1")
+
+@staticmethod
+@since("2.4.0")
+def normL2(col, weightCol=None):
+"""
+return a column of normL2 summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "normL2")
+
+@staticmethod
+def _check_param(featureCol, weightCol):
+if weightCol is None:
+weightCol = lit(1.0)
+if not isinstance(featureCol, Column) or not isins

[GitHub] spark pull request #20695: [SPARK-21741][ML][PySpark] Python API for DataFra...

2018-04-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20695#discussion_r181259361
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -195,6 +197,185 @@ def test(dataset, sampleCol, distName, *params):
  
_jvm().PythonUtils.toSeq(params)))
 
 
+class Summarizer(object):
+"""
+.. note:: Experimental
+
+Tools for vectorized statistics on MLlib Vectors.
+The methods in this package provide various statistics for Vectors 
contained inside DataFrames.
+This class lets users pick the statistics they would like to extract 
for a given column.
+
+>>> from pyspark.ml.stat import Summarizer
+>>> from pyspark.sql import Row
+>>> from pyspark.ml.linalg import Vectors
+>>> summarizer = Summarizer.metrics("mean", "count")
+>>> df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 
1.0, 1.0)),
+...  Row(weight=0.0, features=Vectors.dense(1.0, 
2.0, 3.0))]).toDF()
+>>> df.select(summarizer.summary(df.features, 
df.weight)).show(truncate=False)
++---+
+|aggregate_metrics(features, weight)|
++---+
+|[[1.0,1.0,1.0], 1] |
++---+
+
+>>> df.select(summarizer.summary(df.features)).show(truncate=False)
+++
+|aggregate_metrics(features, 1.0)|
+++
+|[[1.0,1.5,2.0], 2]  |
+++
+
+>>> df.select(Summarizer.mean(df.features, 
df.weight)).show(truncate=False)
++--+
+|mean(features)|
++--+
+|[1.0,1.0,1.0] |
++--+
+
+>>> df.select(Summarizer.mean(df.features)).show(truncate=False)
++--+
+|mean(features)|
++--+
+|[1.0,1.5,2.0] |
++--+
+
+
+.. versionadded:: 2.4.0
+
+"""
+@staticmethod
+@since("2.4.0")
+def mean(col, weightCol=None):
+"""
+return a column of mean summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "mean")
+
+@staticmethod
+@since("2.4.0")
+def variance(col, weightCol=None):
+"""
+return a column of variance summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "variance")
+
+@staticmethod
+@since("2.4.0")
+def count(col, weightCol=None):
+"""
+return a column of count summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "count")
+
+@staticmethod
+@since("2.4.0")
+def numNonZeros(col, weightCol=None):
+"""
+return a column of numNonZero summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "numNonZeros")
+
+@staticmethod
+@since("2.4.0")
+def max(col, weightCol=None):
+"""
+return a column of max summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "max")
+
+@staticmethod
+@since("2.4.0")
+def min(col, weightCol=None):
+"""
+return a column of min summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "min")
+
+@staticmethod
+@since("2.4.0")
+def normL1(col, weightCol=None):
+"""
+return a column of normL1 summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "normL1")
+
+@staticmethod
+@since("2.4.0")
+def normL2(col, weightCol=None):
+"""
+return a column of normL2 summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "normL2")
+
+@staticmethod
+def _check_param(featureCol, weightCol):
+if weightCol is None:
+weightCol = lit(1.0)
+if not isinstance(featureCol, Column) or not isins

[GitHub] spark pull request #20695: [SPARK-21741][ML][PySpark] Python API for DataFra...

2018-04-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20695#discussion_r181259181
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -195,6 +197,185 @@ def test(dataset, sampleCol, distName, *params):
  
_jvm().PythonUtils.toSeq(params)))
 
 
+class Summarizer(object):
+"""
+.. note:: Experimental
+
+Tools for vectorized statistics on MLlib Vectors.
+The methods in this package provide various statistics for Vectors 
contained inside DataFrames.
+This class lets users pick the statistics they would like to extract 
for a given column.
+
+>>> from pyspark.ml.stat import Summarizer
+>>> from pyspark.sql import Row
+>>> from pyspark.ml.linalg import Vectors
+>>> summarizer = Summarizer.metrics("mean", "count")
+>>> df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 
1.0, 1.0)),
+...  Row(weight=0.0, features=Vectors.dense(1.0, 
2.0, 3.0))]).toDF()
+>>> df.select(summarizer.summary(df.features, 
df.weight)).show(truncate=False)
++---+
+|aggregate_metrics(features, weight)|
++---+
+|[[1.0,1.0,1.0], 1] |
++---+
+
+>>> df.select(summarizer.summary(df.features)).show(truncate=False)
+++
+|aggregate_metrics(features, 1.0)|
+++
+|[[1.0,1.5,2.0], 2]  |
+++
+
+>>> df.select(Summarizer.mean(df.features, 
df.weight)).show(truncate=False)
++--+
+|mean(features)|
++--+
+|[1.0,1.0,1.0] |
++--+
+
+>>> df.select(Summarizer.mean(df.features)).show(truncate=False)
++--+
+|mean(features)|
++--+
+|[1.0,1.5,2.0] |
++--+
+
+
+.. versionadded:: 2.4.0
+
+"""
+@staticmethod
+@since("2.4.0")
+def mean(col, weightCol=None):
+"""
+return a column of mean summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "mean")
+
+@staticmethod
+@since("2.4.0")
+def variance(col, weightCol=None):
+"""
+return a column of variance summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "variance")
+
+@staticmethod
+@since("2.4.0")
+def count(col, weightCol=None):
+"""
+return a column of count summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "count")
+
+@staticmethod
+@since("2.4.0")
+def numNonZeros(col, weightCol=None):
+"""
+return a column of numNonZero summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "numNonZeros")
+
+@staticmethod
+@since("2.4.0")
+def max(col, weightCol=None):
+"""
+return a column of max summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "max")
+
+@staticmethod
+@since("2.4.0")
+def min(col, weightCol=None):
+"""
+return a column of min summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "min")
+
+@staticmethod
+@since("2.4.0")
+def normL1(col, weightCol=None):
+"""
+return a column of normL1 summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "normL1")
+
+@staticmethod
+@since("2.4.0")
+def normL2(col, weightCol=None):
+"""
+return a column of normL2 summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "normL2")
+
+@staticmethod
+def _check_param(featureCol, weightCol):
+if weightCol is None:
+weightCol = lit(1.0)
+if not isinstance(featureCol, Column) or not isins

[GitHub] spark pull request #20695: [SPARK-21741][ML][PySpark] Python API for DataFra...

2018-04-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20695#discussion_r181259536
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -195,6 +197,185 @@ def test(dataset, sampleCol, distName, *params):
  
_jvm().PythonUtils.toSeq(params)))
 
 
+class Summarizer(object):
+"""
+.. note:: Experimental
+
+Tools for vectorized statistics on MLlib Vectors.
+The methods in this package provide various statistics for Vectors 
contained inside DataFrames.
+This class lets users pick the statistics they would like to extract 
for a given column.
+
+>>> from pyspark.ml.stat import Summarizer
+>>> from pyspark.sql import Row
+>>> from pyspark.ml.linalg import Vectors
+>>> summarizer = Summarizer.metrics("mean", "count")
+>>> df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 
1.0, 1.0)),
+...  Row(weight=0.0, features=Vectors.dense(1.0, 
2.0, 3.0))]).toDF()
+>>> df.select(summarizer.summary(df.features, 
df.weight)).show(truncate=False)
++---+
+|aggregate_metrics(features, weight)|
++---+
+|[[1.0,1.0,1.0], 1] |
++---+
+
+>>> df.select(summarizer.summary(df.features)).show(truncate=False)
+++
+|aggregate_metrics(features, 1.0)|
+++
+|[[1.0,1.5,2.0], 2]  |
+++
+
+>>> df.select(Summarizer.mean(df.features, 
df.weight)).show(truncate=False)
++--+
+|mean(features)|
++--+
+|[1.0,1.0,1.0] |
++--+
+
+>>> df.select(Summarizer.mean(df.features)).show(truncate=False)
++--+
+|mean(features)|
++--+
+|[1.0,1.5,2.0] |
++--+
+
+
+.. versionadded:: 2.4.0
+
+"""
+@staticmethod
+@since("2.4.0")
+def mean(col, weightCol=None):
+"""
+return a column of mean summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "mean")
+
+@staticmethod
+@since("2.4.0")
+def variance(col, weightCol=None):
+"""
+return a column of variance summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "variance")
+
+@staticmethod
+@since("2.4.0")
+def count(col, weightCol=None):
+"""
+return a column of count summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "count")
+
+@staticmethod
+@since("2.4.0")
+def numNonZeros(col, weightCol=None):
+"""
+return a column of numNonZero summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "numNonZeros")
+
+@staticmethod
+@since("2.4.0")
+def max(col, weightCol=None):
+"""
+return a column of max summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "max")
+
+@staticmethod
+@since("2.4.0")
+def min(col, weightCol=None):
+"""
+return a column of min summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "min")
+
+@staticmethod
+@since("2.4.0")
+def normL1(col, weightCol=None):
+"""
+return a column of normL1 summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "normL1")
+
+@staticmethod
+@since("2.4.0")
+def normL2(col, weightCol=None):
+"""
+return a column of normL2 summary
+"""
+return Summarizer._get_single_metric(col, weightCol, "normL2")
+
+@staticmethod
+def _check_param(featureCol, weightCol):
+if weightCol is None:
+weightCol = lit(1.0)
+if not isinstance(featureCol, Column) or not isins

spark git commit: [SPARK-23751][FOLLOW-UP] fix build for scala-2.12

2018-04-12 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 0b19122d4 -> 0f93b91a7


[SPARK-23751][FOLLOW-UP] fix build for scala-2.12

## What changes were proposed in this pull request?

fix build for scala-2.12

## How was this patch tested?

Manual.

Author: WeichenXu 

Closes #21051 from WeichenXu123/fix_build212.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0f93b91a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0f93b91a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0f93b91a

Branch: refs/heads/master
Commit: 0f93b91a71444a1a938acfd8ea2191c54fb0187c
Parents: 0b19122
Author: WeichenXu 
Authored: Thu Apr 12 15:47:42 2018 -0600
Committer: Joseph K. Bradley 
Committed: Thu Apr 12 15:47:42 2018 -0600

--
 .../scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0f93b91a/mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala 
b/mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala
index af8ff64..adf8145 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala
@@ -85,7 +85,7 @@ object KolmogorovSmirnovTest {
   dataset: Dataset[_],
   sampleCol: String,
   cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
-test(dataset, sampleCol, (x: Double) => cdf.call(x))
+test(dataset, sampleCol, (x: Double) => cdf.call(x).toDouble)
   }
 
   /**


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] spark issue #21051: [SPARK-23751][FOLLOW-UP] fix build for scala-2.12

2018-04-12 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/21051
  
LGTM
Merging with master
Thanks @WeichenXu123 !


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r181230965
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -81,32 +81,37 @@ object KolmogorovSmirnovTest {
* Java-friendly version of `test(dataset: DataFrame, sampleCol: String, 
cdf: Double => Double)`
*/
   @Since("2.4.0")
-  def test(dataset: DataFrame, sampleCol: String,
-cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
-val f: Double => Double = x => cdf.call(x)
-test(dataset, sampleCol, f)
+  def test(
+  dataset: Dataset[_],
+  sampleCol: String,
+  cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
+test(dataset, sampleCol, (x: Double) => cdf.call(x))
--- End diff --

This is the 2nd time I've merged something which broke the Scala 2.12 
build.  If we care about Scala 2.12, do you know if there are any efforts to 
add PR builder tests for 2.12?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



spark git commit: typo rawPredicition changed to rawPrediction

2018-04-11 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 75a183071 -> 9d960de08


typo rawPredicition changed to rawPrediction

MultilayerPerceptronClassifier had 4 occurrences

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, 
manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a pull 
request.

Author: JBauerKogentix <37910022+jbauerkogen...@users.noreply.github.com>

Closes #21030 from JBauerKogentix/patch-1.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9d960de0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9d960de0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9d960de0

Branch: refs/heads/master
Commit: 9d960de0814a1128318676cc2e91f447cdf0137f
Parents: 75a1830
Author: JBauerKogentix <37910022+jbauerkogen...@users.noreply.github.com>
Authored: Wed Apr 11 15:52:13 2018 -0700
Committer: Joseph K. Bradley 
Committed: Wed Apr 11 15:52:13 2018 -0700

--
 python/pyspark/ml/classification.py | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9d960de0/python/pyspark/ml/classification.py
--
diff --git a/python/pyspark/ml/classification.py 
b/python/pyspark/ml/classification.py
index fbbe3d0..ec17653 100644
--- a/python/pyspark/ml/classification.py
+++ b/python/pyspark/ml/classification.py
@@ -1543,12 +1543,12 @@ class MultilayerPerceptronClassifier(JavaEstimator, 
HasFeaturesCol, HasLabelCol,
 def __init__(self, featuresCol="features", labelCol="label", 
predictionCol="prediction",
  maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, 
stepSize=0.03,
  solver="l-bfgs", initialWeights=None, 
probabilityCol="probability",
- rawPredicitionCol="rawPrediction"):
+ rawPredictionCol="rawPrediction"):
 """
 __init__(self, featuresCol="features", labelCol="label", 
predictionCol="prediction", \
  maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, 
stepSize=0.03, \
  solver="l-bfgs", initialWeights=None, 
probabilityCol="probability", \
- rawPredicitionCol="rawPrediction")
+ rawPredictionCol="rawPrediction")
 """
 super(MultilayerPerceptronClassifier, self).__init__()
 self._java_obj = self._new_java_obj(
@@ -1562,12 +1562,12 @@ class MultilayerPerceptronClassifier(JavaEstimator, 
HasFeaturesCol, HasLabelCol,
 def setParams(self, featuresCol="features", labelCol="label", 
predictionCol="prediction",
   maxIter=100, tol=1e-6, seed=None, layers=None, 
blockSize=128, stepSize=0.03,
   solver="l-bfgs", initialWeights=None, 
probabilityCol="probability",
-  rawPredicitionCol="rawPrediction"):
+  rawPredictionCol="rawPrediction"):
 """
 setParams(self, featuresCol="features", labelCol="label", 
predictionCol="prediction", \
   maxIter=100, tol=1e-6, seed=None, layers=None, 
blockSize=128, stepSize=0.03, \
   solver="l-bfgs", initialWeights=None, 
probabilityCol="probability", \
-  rawPredicitionCol="rawPrediction"):
+  rawPredictionCol="rawPrediction"):
 Sets params for MultilayerPerceptronClassifier.
 """
 kwargs = self._input_kwargs


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: typo rawPredicition changed to rawPrediction

2018-04-11 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 acfc156df -> 03a4dfd69


typo rawPredicition changed to rawPrediction

MultilayerPerceptronClassifier had 4 occurrences

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, 
manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a pull 
request.

Author: JBauerKogentix <37910022+jbauerkogen...@users.noreply.github.com>

Closes #21030 from JBauerKogentix/patch-1.

(cherry picked from commit 9d960de0814a1128318676cc2e91f447cdf0137f)
Signed-off-by: Joseph K. Bradley 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/03a4dfd6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/03a4dfd6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/03a4dfd6

Branch: refs/heads/branch-2.3
Commit: 03a4dfd6901595fb4622587e3387ec1ac8d77237
Parents: acfc156
Author: JBauerKogentix <37910022+jbauerkogen...@users.noreply.github.com>
Authored: Wed Apr 11 15:52:13 2018 -0700
Committer: Joseph K. Bradley 
Committed: Wed Apr 11 15:52:24 2018 -0700

--
 python/pyspark/ml/classification.py | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/03a4dfd6/python/pyspark/ml/classification.py
--
diff --git a/python/pyspark/ml/classification.py 
b/python/pyspark/ml/classification.py
index 27ad1e8..55d6030 100644
--- a/python/pyspark/ml/classification.py
+++ b/python/pyspark/ml/classification.py
@@ -1542,12 +1542,12 @@ class MultilayerPerceptronClassifier(JavaEstimator, 
HasFeaturesCol, HasLabelCol,
 def __init__(self, featuresCol="features", labelCol="label", 
predictionCol="prediction",
  maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, 
stepSize=0.03,
  solver="l-bfgs", initialWeights=None, 
probabilityCol="probability",
- rawPredicitionCol="rawPrediction"):
+ rawPredictionCol="rawPrediction"):
 """
 __init__(self, featuresCol="features", labelCol="label", 
predictionCol="prediction", \
  maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, 
stepSize=0.03, \
  solver="l-bfgs", initialWeights=None, 
probabilityCol="probability", \
- rawPredicitionCol="rawPrediction")
+ rawPredictionCol="rawPrediction")
 """
 super(MultilayerPerceptronClassifier, self).__init__()
 self._java_obj = self._new_java_obj(
@@ -1561,12 +1561,12 @@ class MultilayerPerceptronClassifier(JavaEstimator, 
HasFeaturesCol, HasLabelCol,
 def setParams(self, featuresCol="features", labelCol="label", 
predictionCol="prediction",
   maxIter=100, tol=1e-6, seed=None, layers=None, 
blockSize=128, stepSize=0.03,
   solver="l-bfgs", initialWeights=None, 
probabilityCol="probability",
-  rawPredicitionCol="rawPrediction"):
+  rawPredictionCol="rawPrediction"):
 """
 setParams(self, featuresCol="features", labelCol="label", 
predictionCol="prediction", \
   maxIter=100, tol=1e-6, seed=None, layers=None, 
blockSize=128, stepSize=0.03, \
   solver="l-bfgs", initialWeights=None, 
probabilityCol="probability", \
-  rawPredicitionCol="rawPrediction"):
+  rawPredictionCol="rawPrediction"):
 Sets params for MultilayerPerceptronClassifier.
 """
 kwargs = self._input_kwargs


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] spark issue #21030: typo rawPredicition changed to rawPrediction

2018-04-11 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/21030
  
Actually, forget the JIRA; I'll just merge it with master and branch-2.3 as 
is.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21044: [SPARK-9312][ML] Add RawPrediction, numClasses, a...

2018-04-11 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21044#discussion_r180920806
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -195,14 +205,18 @@ final class OneVsRestModel private[ml] (
   newDataset.unpersist()
 }
 
-// output the index of the classifier with highest confidence as 
prediction
-val labelUDF = udf { (predictions: Map[Int, Double]) =>
-  predictions.maxBy(_._2)._1.toDouble
+// output the RawPrediction as vector
+val rawPredictionUDF = udf { (predictions: Map[Int, Double]) =>
+  Vectors.sparse(numClasses, predictions.toList )
--- End diff --

Also, let's output a dense Vector since it will almost surely be dense.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21044: [SPARK-9312][ML] Add RawPrediction, numClasses, and numF...

2018-04-11 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/21044
  
Thanks for the PR!  Quick high-level comment: We'll need to have 
rawPredictionCol be optional.  If it's not set or is an empty string, then it 
should not be added to the output DataFrame.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19627: [SPARK-21088][ML] CrossValidator, TrainValidation...

2018-04-11 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19627#discussion_r180874011
  
--- Diff: python/pyspark/ml/tuning.py ---
@@ -194,7 +195,8 @@ def _to_java_impl(self):
 return java_estimator, java_epms, java_evaluator
 
 
-class CrossValidator(Estimator, ValidatorParams, HasParallelism, 
MLReadable, MLWritable):
+class CrossValidator(Estimator, ValidatorParams, HasParallelism, 
HasCollectSubModels,
--- End diff --

Let's also clarify in the doc for CrossValidatorModel.copy() that it does 
not copy the extra Params into the subModels.  (same for 
TrainValidationSplitModel)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19627: [SPARK-21088][ML] CrossValidator, TrainValidation...

2018-04-11 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19627#discussion_r180862523
  
--- Diff: python/pyspark/ml/tuning.py ---
@@ -194,7 +195,8 @@ def _to_java_impl(self):
 return java_estimator, java_epms, java_evaluator
 
 
-class CrossValidator(Estimator, ValidatorParams, HasParallelism, 
MLReadable, MLWritable):
+class CrossValidator(Estimator, ValidatorParams, HasParallelism, 
HasCollectSubModels,
--- End diff --

You'll need to update _from_java and _to_java as well to pass 
collectSubModels around.  (Same for TrainValidationSplit)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19627: [SPARK-21088][ML] CrossValidator, TrainValidation...

2018-04-11 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19627#discussion_r180876558
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1018,6 +1018,48 @@ def test_parallel_evaluation(self):
 cvParallelModel = cv.fit(dataset)
 self.assertEqual(cvSerialModel.avgMetrics, 
cvParallelModel.avgMetrics)
 
+def test_expose_sub_models(self):
--- End diff --

Nice tests.  Can you make one addition: Test the copy() method to make sure 
it copies the submodels.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19627: [SPARK-21088][ML] CrossValidator, TrainValidation...

2018-04-11 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19627#discussion_r180611695
  
--- Diff: python/pyspark/ml/param/_shared_params_code_gen.py ---
@@ -157,6 +157,8 @@ def get$Name(self):
  "TypeConverters.toInt"),
 ("parallelism", "the number of threads to use when running 
parallel algorithms (>= 1).",
  "1", "TypeConverters.toInt"),
+("collectSubModels", "whether to collect a list of sub-models 
trained during tuning",
--- End diff --

It would be nice to add the full description from Scala.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19627: [SPARK-21088][ML] CrossValidator, TrainValidation...

2018-04-11 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19627#discussion_r180610868
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1186,6 +1228,38 @@ def test_parallel_evaluation(self):
 tvsParallelModel = tvs.fit(dataset)
 self.assertEqual(tvsSerialModel.validationMetrics, 
tvsParallelModel.validationMetrics)
 
+def test_expose_sub_models(self):
+temp_path = tempfile.mkdtemp()
+dataset = self.spark.createDataFrame(
+[(Vectors.dense([0.0]), 0.0),
+ (Vectors.dense([0.4]), 1.0),
+ (Vectors.dense([0.5]), 0.0),
+ (Vectors.dense([0.6]), 1.0),
+ (Vectors.dense([1.0]), 1.0)] * 10,
+["features", "label"])
+lr = LogisticRegression()
+grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
+evaluator = BinaryClassificationEvaluator()
+tvs = TrainValidationSplit(estimator=lr, estimatorParamMaps=grid, 
evaluator=evaluator,
+   collectSubModels=True)
+tvsModel = tvs.fit(dataset)
+assert len(tvsModel.subModels) == len(grid)
--- End diff --

Use self.assertEqual here and elsewhere.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


