[GitHub] spark issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensi...

2018-11-27 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/23144 Using an optional `normalize` function argument may be OK; I will give it a try. --- - To unsubscribe, e-mail: reviews-unsubscr

[GitHub] spark issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensi...

2018-11-27 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/23144 @srowen To adopt an optional `normalize` function argument, we may need to create a new class `StringParam` and add the argument into it. But this will be a breaking change, since existing
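
The idea of an optional `normalize` function argument can be sketched roughly as below. This is only an illustration under stated assumptions: the `StringParam` here is a standalone, hypothetical class (not Spark's `Param` hierarchy and not the code in the PR), and the parameter name `solver` and its allowed values are made up for the example.

```scala
import java.util.Locale

object StringParamSketch {
  // Hypothetical stand-in for the proposed class; it does not extend Spark's Param.
  // `normalize` defaults to a no-op, so matching stays case-sensitive unless requested.
  final case class StringParam(
      name: String,
      allowed: Set[String],
      normalize: String => String = (s: String) => s) {

    // Normalize the allowed values once, then compare normalized input against them.
    private val normalizedAllowed: Set[String] = allowed.map(normalize)

    def isValid(value: String): Boolean = normalizedAllowed.contains(normalize(value))
  }

  // A locale-independent lower-casing normalizer.
  val lower: String => String = _.toLowerCase(Locale.ROOT)
}
```

With `StringParamSketch.lower` passed as `normalize`, a value such as `"AUTO"` is accepted when `"auto"` is allowed; without it, the same value is rejected.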

[GitHub] spark issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensi...

2018-11-26 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/23144 I am not sure about `$$` or `%%`; we can replace them with other names. I want to resolve the confusion around case-insensitivity, and wonder whether a new flag can do this. If we want

[GitHub] spark pull request #23122: [MINOR][ML] add missing params to Instr

2018-11-26 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/23122#discussion_r236537309 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -671,7 +671,7 @@ class ALS(@Since("1.4.0") override val u

[GitHub] spark pull request #23144: [SPARK-26172][ML][WIP] Unify String Params' case-...

2018-11-26 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/23144 [SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML ## What changes were proposed in this pull request? 1, methods `lowerCaseInArray` and `upperCaseInArray` are created

[GitHub] spark pull request #22991: [SPARK-25989][ML] OneVsRestModel handle empty out...

2018-11-25 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/22991#discussion_r236110139 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala --- @@ -219,14 +225,20 @@ final class OneVsRestModel private[ml

[GitHub] spark issue #22991: [SPARK-25989][ML] OneVsRestModel handle empty outputCols...

2018-11-23 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22991 friendly ping @srowen @jkbradley @MLnick

[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

2018-11-23 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/23100#discussion_r235886910 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala --- @@ -17,126 +17,512 @@ package

[GitHub] spark pull request #23123: [SPARK-26153][ML] GBT & RandomForest avoid unnece...

2018-11-22 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/23123 [SPARK-26153][ML] GBT & RandomForest avoid unnecessary `first` job to compute `numFeatures` ## What changes were proposed in this pull request? use base models' `numFeature` ins

[GitHub] spark pull request #23122: [MINOR][ML] add missing params to Instr

2018-11-22 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/23122 [MINOR][ML] add missing params to Instr ## What changes were proposed in this pull request? add following param to instr: GBTC: validationTol GBTR: validationTol

[GitHub] spark issue #22974: [SPARK-22450][WIP][Core][MLLib][FollowUp] Safely registe...

2018-11-15 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22974 retest this please

[GitHub] spark issue #22974: [SPARK-22450][WIP][Core][MLLib][FollowUp] Safely registe...

2018-11-13 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22974 @srowen Yes, this is the problem. I have to register `Param*` before any prediction model, but there are too many anonymous classes in `ParamValidators` and other places, and I have not found

[GitHub] spark issue #22974: [SPARK-22450][WIP][Core][MLLib][FollowUp] Safely registe...

2018-11-13 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22974 retest this please

[GitHub] spark issue #22974: [SPARK-22450][WIP][Core][MLLib][FollowUp] Safely registe...

2018-11-12 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22974 retest this please

[GitHub] spark issue #22087: [SPARK-25097][ML] Support prediction on single instance ...

2018-11-11 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22087 I also expose GMM's predictProbability. Could you please make a final pass? @srowen @felixcheung

[GitHub] spark issue #22974: [SPARK-22450][Core][MLLib][FollowUp] Safely register Mul...

2018-11-11 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22974 @srowen I have some spare time, and will work on it.

[GitHub] spark pull request #19927: [SPARK-22737][ML][WIP] OVR transform optimization

2018-11-09 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/19927

[GitHub] spark issue #22974: [SPARK-22450][Core][MLLib][FollowUp] Safely register Mul...

2018-11-09 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22974 retest this please

[GitHub] spark issue #22975: [SPARK-20156][SQL][ML][FOLLOW-UP] Java String toLowerCas...

2018-11-09 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22975 retest this please

[GitHub] spark pull request #22991: [SPARK-25989][ML] OneVsRestModel handle empty out...

2018-11-09 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/22991 [SPARK-25989][ML] OneVsRestModel handle empty outputCols incorrectly ## What changes were proposed in this pull request? ignore empty output columns ## How was this patch tested

[GitHub] spark issue #22975: [SPARK-20156][SQL][ML][FOLLOW-UP] Java String toLowerCas...

2018-11-08 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22975 retest this please

[GitHub] spark issue #22974: [SPARK-22450][Core][MLLib][FollowUp] Safely register Mul...

2018-11-08 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22974 Not all public serializable classes need to be registered. Only those that need ser-deser should be registered; one important group should be transformers and prediction models

[GitHub] spark issue #22974: [SPARK-22450][Core][MLLib][FollowUp] Safely register Mul...

2018-11-08 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22974 I am not sure, but maybe all serializable classes need to be registered. Since `MultivariateGaussian` is a public class, I think we need to add it. I also wonder whether a test

[GitHub] spark issue #22974: [SPARK-22450][Core][MLLib][FollowUp] Safely register Mul...

2018-11-08 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22974 Do you mean the failure in this PR? It was caused by a non-registered field `BDM[Double]`. `MultivariateGaussian` is used in GMM, so Kryo registration should help performance. As to mllib

[GitHub] spark issue #22974: [SPARK-22450][Core][MLLib][FollowUp] Safely register Mul...

2018-11-08 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22974 @srowen The existing kryo-register test suite needs to import spark-core: ``` import org.apache.spark.SparkConf import org.apache.spark.serializer.KryoSerializer val conf = new

[GitHub] spark issue #22975: [SPARK-20156][SQL][ML][FOLLOW-UP] Java String toLowerCas...

2018-11-08 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22975 @srowen Yes, we should keep user input data and column names. Thanks for your explanation!

[GitHub] spark pull request #22975: [SPARK-20156][SQL][ML][FOLLOW-UP] Java String toL...

2018-11-08 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/22975 [SPARK-20156][SQL][ML][FOLLOW-UP] Java String toLowerCase with Locale.ROOT ## What changes were proposed in this pull request? Add `Locale.ROOT` to all internal calls to String
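
A minimal sketch of why `Locale.ROOT` matters for `toLowerCase`, assuming only the JDK (no Spark code involved; the object and method names are illustrative). Under Turkish casing rules the JVM lower-cases `I` to a dotless `ı`, so a string normalized with the default locale may not match its expected ASCII form.

```scala
import java.util.Locale

object LocaleRootDemo {
  // Lower-case independently of the JVM's default locale.
  def normalize(s: String): String = s.toLowerCase(Locale.ROOT)

  // Lower-case using an explicit locale, to show the locale-sensitive behaviour.
  def lowerIn(locale: Locale, s: String): String = s.toLowerCase(locale)

  def main(args: Array[String]): Unit = {
    val turkish = Locale.forLanguageTag("tr")
    // With Turkish rules the 'I' becomes U+0131 (dotless i), not 'i'.
    println(lowerIn(turkish, "TITLE"))
    // With Locale.ROOT the result is the plain ASCII form.
    println(normalize("TITLE"))
  }
}
```

Passing the locale explicitly makes the result the same on every machine, which is the property internal keyword and option matching needs.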

[GitHub] spark pull request #22974: [SPARK-22450][Core][MLLib][FollowUp] Safely regis...

2018-11-08 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/22974 [SPARK-22450][Core][MLLib][FollowUp] Safely register MultivariateGaussian ## What changes were proposed in this pull request? register following classes in Kryo

[GitHub] spark issue #22971: [SPARK-25970][ML] Add Instrumentation to PrefixSpan

2018-11-08 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22971 retest this please

[GitHub] spark pull request #22971: [SPARK-25970][ML] Add Instrumentation to PrefixSp...

2018-11-07 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/22971 [SPARK-25970][ML] Add Instrumentation to PrefixSpan ## What changes were proposed in this pull request? Add Instrumentation to PrefixSpan ## How was this patch tested

[GitHub] spark issue #22087: [SPARK-25097][ML] Support prediction on single instance ...

2018-11-07 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22087 Sounds good to design a universal prediction model as a super-class. BTW, I think we can also create a new class `ProbabilisticPredictionModel` (as a subclass of `PredictionModel`), so

[GitHub] spark issue #19927: [SPARK-22737][ML][WIP] OVR transform optimization

2018-10-31 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19927 @srowen What do you think about this? The current OVR model's transform is too slow. Thanks.

[GitHub] spark issue #22087: [SPARK-25097][ML] Support prediction on single instance ...

2018-10-31 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22087 @imatiach-msft Updated according to your comments! Thanks for reviewing!

[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

2018-08-15 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/21561#discussion_r210468639 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- @@ -246,6 +245,16 @@ class BisectingKMeans private

[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

2018-08-15 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/21561#discussion_r210467653 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- @@ -246,6 +245,16 @@ class BisectingKMeans private

[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

2018-08-14 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/21561#discussion_r210158840 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -299,7 +299,7 @@ class KMeans private ( val bcCenters

[GitHub] spark issue #22087: [SPARK-25097][ML] Support prediction on single instance ...

2018-08-13 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/22087 @felixcheung Test suites are added. Thanks for reviewing!

[GitHub] spark pull request #22087: [SPARK-25097][Support prediction on single instan...

2018-08-13 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/22087 [SPARK-25097][Support prediction on single instance in KMeans/BiKMeans/GMM] Support prediction on single instance in KMeans/BiKMeans/GMM ## What changes were proposed in this pull request

[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

2018-08-13 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/21561#discussion_r209498032 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- @@ -151,13 +152,9 @@ class BisectingKMeans private

[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

2018-08-12 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/21561#discussion_r209496789 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala --- @@ -157,11 +157,15 @@ class NaiveBayes @Since("

[GitHub] spark pull request #19084: [SPARK-20711][ML]MultivariateOnlineSummarizer/Sum...

2018-08-01 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/19084

[GitHub] spark issue #21563: [SPARK-24557][ML] ClusteringEvaluator support array inpu...

2018-07-31 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/21563 @mengxr I notice that you opened a ticket for supporting integer-type labels in ClusteringEvaluator; would you like to shepherd this PR too

[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence

2018-07-31 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/19186

[GitHub] spark pull request #20918: [SPARK-23805][ML][WIP] Features alg support vecto...

2018-07-31 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/20918

[GitHub] spark pull request #18589: [SPARK-16872][ML] Add Gaussian NB

2018-07-31 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/18589

[GitHub] spark pull request #18389: [SPARK-14174][ML] Add minibatch kmeans

2018-07-31 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/18389

[GitHub] spark issue #21563: [SPARK-24557][ML] ClusteringEvaluator support array inpu...

2018-07-31 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/21563 retest this please

[GitHub] spark issue #20028: [SPARK-19053][ML]Supporting multiple evaluation metrics ...

2018-07-26 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/20028 LGTM, except for the since annotations.

[GitHub] spark issue #21788: [SPARK-24609][ML][DOC] PySpark/SparkR doc doesn't explai...

2018-07-26 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/21788 retest this please

[GitHub] spark issue #21563: [SPARK-24557][ML] ClusteringEvaluator support array inpu...

2018-07-26 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/21563 @mgaido91 I am sorry to make a force push to update my git username in this PR. Since I found that my current PRs are not linked to my account and it is troublesome to track them

[GitHub] spark issue #21788: [SPARK-24609][ML][DOC] PySpark/SparkR doc doesn't explai...

2018-07-26 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/21788 retest this please

[GitHub] spark issue #21788: [SPARK-24609][ML][DOC] PySpark/SparkR doc doesn't explai...

2018-07-26 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/21788 @felixcheung I had to force-push it to change the git username. I will look into what happened

[GitHub] spark issue #21792: [SPARK-23231][ML][DOC] Add doc for string indexer orderi...

2018-07-17 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/21792 @srowen I think we need to update the docs 1, Current doc in `StringIndexer` is somewhat misleading: "The indices are in `[0, numLabels)`, ordered by label frequencies, so the most fre

[GitHub] spark pull request #21792: [SPARK-23231][ML][DOC] Add doc for string indexer...

2018-07-17 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/21792 [SPARK-23231][ML][DOC] Add doc for string indexer ordering to user guide (also to RFormula guide) ## What changes were proposed in this pull request? add doc for string indexer ordering

[GitHub] spark pull request #21788: [SPARK-24609][ML][DOC] PySpark/SparkR doc doesn't...

2018-07-17 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/21788 [SPARK-24609][ML][DOC] PySpark/SparkR doc doesn't explain RandomForestClassifier.featureSubsetStrategy well ## What changes were proposed in this pull request? update doc

[GitHub] spark issue #21562: [Trivial][ML] GMM unpersist RDD after training

2018-07-12 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/21562 @felixcheung Would you mind making a final pass? Thanks!

[GitHub] spark pull request #21563: [SPARK-24557][ML] ClusteringEvaluator support arr...

2018-06-22 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/21563#discussion_r197600500 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -107,15 +106,18 @@ class ClusteringEvaluator @Since

[GitHub] spark pull request #21563: [SPARK-24557][ML] ClusteringEvaluator support arr...

2018-06-14 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/21563#discussion_r195618344 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -107,15 +106,18 @@ class ClusteringEvaluator @Since

[GitHub] spark issue #16171: [SPARK-18739][ML][PYSPARK] Classification and regression...

2018-06-14 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/16171 It is out of date, and I will close it

[GitHub] spark pull request #16171: [SPARK-18739][ML][PYSPARK] Classification and reg...

2018-06-14 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/16171

[GitHub] spark pull request #16763: [SPARK-19422][ML][WIP] Cache input data in algori...

2018-06-14 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/16763

[GitHub] spark issue #16763: [SPARK-19422][ML][WIP] Cache input data in algorithms

2018-06-14 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/16763 This pr is out of date. I will close it.

[GitHub] spark issue #19084: [SPARK-20711][ML]MultivariateOnlineSummarizer/Summarizer...

2018-06-14 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19084 @srowen Could you please give a final review? Thanks

[GitHub] spark issue #19927: [SPARK-22737][ML][WIP] OVR transform optimization

2018-06-14 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19927 @mengxr @holdenk What do you think about this? Thanks.

[GitHub] spark pull request #21563: [SPARK-24557][ML] ClusteringEvaluator support arr...

2018-06-14 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/21563 [SPARK-24557][ML] ClusteringEvaluator support array input ## What changes were proposed in this pull request? ClusteringEvaluator support array input ## How was this patch tested

[GitHub] spark pull request #21562: [Trivial][ML] GMM unpersist RDD after training

2018-06-14 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/21562 [Trivial][ML] GMM unpersist RDD after training ## What changes were proposed in this pull request? unpersist `instances` after training ## How was this patch tested? existing

[GitHub] spark pull request #18154: [SPARK-20932][ML]CountVectorizer support handle p...

2018-06-14 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/18154

[GitHub] spark issue #18154: [SPARK-20932][ML]CountVectorizer support handle persiste...

2018-06-14 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18154 This PR is out of date. I will close it.

[GitHub] spark pull request #20164: [SPARK-22971][ML] OneVsRestModel should use tempo...

2018-06-14 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/20164

[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...

2018-06-14 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/20164 This pr is out of date. So I will close it.

[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

2018-06-14 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/21561 [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/NB ## What changes were proposed in this pull request? logNumExamples in KMeans/BiKM/GMM/AFT/NB ## How was this patch

[GitHub] spark issue #19927: [SPARK-22737][ML][WIP] OVR transform optimization

2018-04-12 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19927 @MLnick @jkbradley What are your thoughts?

[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...

2018-04-12 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/19381#discussion_r180997645 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala --- @@ -192,12 +192,12 @@ abstract class ClassificationModel

[GitHub] spark issue #20956: [SPARK-23841][ML] NodeIdCache should unpersist the last ...

2018-04-09 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/20956 @srowen Could you please help review this? Thanks in advance

[GitHub] spark pull request #20956: [SPARK-23841][ML] NodeIdCache should unpersist th...

2018-04-09 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/20956#discussion_r180064831 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/NodeIdCache.scala --- @@ -166,9 +166,13 @@ private[spark] class NodeIdCache

[GitHub] spark pull request #20956: [SPARK-23841][ML] NodeIdCache should unpersist th...

2018-04-09 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/20956#discussion_r180063562 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/NodeIdCache.scala --- @@ -95,7 +95,7 @@ private[spark] class NodeIdCache

[GitHub] spark pull request #20956: [SPARK-23841][ML] NodeIdCache should unpersist th...

2018-04-01 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/20956 [SPARK-23841][ML] NodeIdCache should unpersist the last cached nodeIdsForInstances ## What changes were proposed in this pull request? unpersist the last cached nodeIdsForInstances

[GitHub] spark pull request #20918: [SPARK-23805][ML][WIP] Features alg support vecto...

2018-03-28 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/20918 [SPARK-23805][ML][WIP] Features alg support vector-size validation and Inference ## What changes were proposed in this pull request? support vector-size validation and Inference

[GitHub] spark pull request #20539: [SPARK-22700][ML] Bucketizer.transform incorrectl...

2018-03-14 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/20539

[GitHub] spark pull request #20518: [SPARK-22119][FOLLOWUP][ML] Use spherical KMeans ...

2018-02-10 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/20518#discussion_r167417459 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -745,4 +763,27 @@ private[spark] class CosineDistanceMeasure

[GitHub] spark issue #20539: [SPARK-22700][ML] Bucketizer.transform incorrectly drops...

2018-02-07 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/20539 ping @jkbradley

[GitHub] spark pull request #20539: [SPARK-22700][ML] Bucketizer.transform incorrectl...

2018-02-07 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/20539 [SPARK-22700][ML] Bucketizer.transform incorrectly drops row containing NaN - for branch-2.2 ## What changes were proposed in this pull request? for branch-2.2 only drops the rows

[GitHub] spark pull request #20518: [SPARK-22119][FOLLOWUP][ML] Use spherical KMeans ...

2018-02-07 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/20518#discussion_r166813909 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -745,4 +763,27 @@ private[spark] class CosineDistanceMeasure

[GitHub] spark issue #19340: [SPARK-22119][ML] Add cosine distance to KMeans

2018-02-05 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19340 @mgaido91 agree that it is better to normalize centers
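
The point about normalizing centers can be illustrated with a small sketch of a spherical k-means style center update. It assumes plain `Array[Double]` vectors and hypothetical helper names; it is not the code from the PR or from Spark's `KMeans`. Because cosine distance ignores vector norm, each point is scaled to unit length before averaging, and the resulting center is renormalized.

```scala
object CosineCenterSketch {
  def norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

  // Scale a vector to unit length; the zero vector has no direction, so reject it.
  def normalize(v: Array[Double]): Array[Double] = {
    val n = norm(v)
    require(n > 0.0, "cosine distance is undefined for the zero vector")
    v.map(_ / n)
  }

  // Cosine distance = 1 - cosine similarity; it depends only on direction.
  def cosineDistance(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    1.0 - dot / (norm(a) * norm(b))
  }

  // Center update: sum the unit-normalized points, then renormalize the sum.
  // If the normalized points cancel out exactly, the sum is zero and this fails.
  def updateCenter(points: Seq[Array[Double]]): Array[Double] = {
    require(points.nonEmpty, "cannot update a center from an empty cluster")
    val sum = points.map(normalize).reduce { (a, b) =>
      a.zip(b).map { case (x, y) => x + y }
    }
    normalize(sum)
  }
}
```

Without the per-point normalization, a single long vector would pull the averaged center toward its own direction even though its length carries no weight under cosine distance.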

[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...

2018-02-04 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/20164 @WeichenXu123 Yes, my concern is that it is confusing if the transform failure is caused by a column conflict with an ‘invisible’ column. @srowen Agree that it is not perfect if we

[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...

2018-01-31 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/20164 @srowen Unlike the base model (such as LoR), OVR and OVRModel do not have the param `rawPredictionCol`. So if the input dataframe contains a column which has the same name as base

[GitHub] spark issue #19340: [SPARK-22119][ML] Add cosine distance to KMeans

2018-01-30 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19340 The update of the centers should be viewed as the **M-step** of the EM algorithm, in which some objective is optimized. Since cosine similarity does not take the vector norm into account: 1

[GitHub] spark issue #19340: [SPARK-22119][ML] Add cosine distance to KMeans

2018-01-30 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19340 @mgaido91 @srowen I have the same concern as @Kevin-Ferret and @viirya. I don't find the normalization of vectors before training, and the update of the centers seems incorrect

[GitHub] spark issue #19892: [SPARK-22797][PySpark] Bucketizer support multi-column

2018-01-16 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19892 @MLnick Thanks for your review and suggestions. I have updated this PR

[GitHub] spark pull request #20275: [SPARK-23085][ML] API parity for mllib.linalg.Vec...

2018-01-15 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/20275 [SPARK-23085][ML] API parity for mllib.linalg.Vectors.sparse ## What changes were proposed in this pull request? `ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])` support zero

[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...

2018-01-15 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/20164 retest this please

[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...

2018-01-15 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/20164 retest this please

[GitHub] spark pull request #20164: [SPARK-22971][ML] OneVsRestModel should use tempo...

2018-01-05 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/20164 [SPARK-22971][ML] OneVsRestModel should use temporary RawPredictionCol ## What changes were proposed in this pull request? use temporary RawPredictionCol in `OneVsRestModel#transform
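The idea of a temporary column can be sketched as follows. This is an illustration only, not the change made in the pull request: `temp_column_name` is a hypothetical helper, and a plain list of strings stands in for a dataframe's column names. It picks a name that cannot collide with any column the caller already has.

```python
import uuid


def temp_column_name(existing_columns, prefix="rawPrediction"):
    """Return a column name that is not present in existing_columns.

    A random suffix makes a collision very unlikely; the loop guards
    against the remaining chance of one.
    """
    existing = set(existing_columns)
    while True:
        candidate = f"{prefix}_{uuid.uuid4().hex}"
        if candidate not in existing:
            return candidate


columns = ["features", "label", "rawPrediction"]
name = temp_column_name(columns)
```

A transform written this way would add the temporary column, use it internally, and drop it before returning, so an input column that happens to be named `rawPrediction` is left untouched.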

[GitHub] spark issue #20113: [SPARK-22905][ML][FollowUp] Fix GaussianMixtureModel sav...

2017-12-29 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/20113 @WeichenXu123 I used this command to list all implementations of model.save, and the others look OK. `find mllib/src/main/scala -name '*.scala' | xargs -i bash -c 'egrep -in "repartiti

[GitHub] spark issue #19892: [SPARK-22797][PySpark] Bucketizer support multi-column

2017-12-28 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19892 ping @MLnick ?

[GitHub] spark pull request #20113: [SPARK-22905][ML][FollowUp] Fix GaussianMixtureMo...

2017-12-28 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/20113 [SPARK-22905][ML][FollowUp] Fix GaussianMixtureModel save ## What changes were proposed in this pull request? make sure model data is stored in order. @WeichenXu123

[GitHub] spark pull request #20030: [SPARK-10496][CORE] Efficient RDD cumulative sum

2017-12-27 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/20030

[GitHub] spark pull request #20030: [SPARK-10496][CORE] Efficient RDD cumulative sum

2017-12-20 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/20030 [SPARK-10496][CORE] Efficient RDD cumulative sum ## What changes were proposed in this pull request? Implement an efficient RDD cumulative sum ## How was this patch tested? existing
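One common way to compute a cumulative sum over partitioned data is a two-pass scheme; whether the pull request uses this approach is an assumption, since the description above does not say. In the sketch below a list of lists stands in for an RDD's partitions, and `cumulative_sum` is a hypothetical name.

```python
from itertools import accumulate


def cumulative_sum(partitions):
    """Two-pass cumulative sum over a list of partitions.

    Pass 1 computes one total per partition (cheap to collect).
    Pass 2 scans each partition independently, starting from the sum
    of all partitions that precede it.
    """
    totals = [sum(part) for part in partitions]
    # Offset of partition i is the sum of the totals of partitions 0..i-1.
    offsets = [0] + list(accumulate(totals))[:-1]
    return [
        [offset + running for running in accumulate(part)]
        for part, offset in zip(partitions, offsets)
    ]


result = cumulative_sum([[1, 2], [3], [], [4, 5]])
```

Only the per-partition totals need to be gathered in one place, so the second pass can run on every partition in parallel.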

[GitHub] spark issue #19950: [SPARK-22450][Core][MLLib][FollowUp] safely register cla...

2017-12-19 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19950 @WeichenXu123 I am not very sure, but it seems that `Kryo` will automatically serialize/deserialize the `Tuple2[A, B]` type if both `A` and `B` have been registered: ``` scala> imp

[GitHub] spark issue #20017: [SPARK-22832][ML] BisectingKMeans unpersist unused datas...

2017-12-19 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/20017 ping @srowen
