[jira] [Commented] (SPARK-20711) MultivariateOnlineSummarizer incorrect min/max for identical NaN feature

2017-05-12 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007739#comment-16007739
 ] 

Nick Pentreath commented on SPARK-20711:


Shouldn't the stats for any column that contains at least one {{NaN}} value be 
{{NaN}}?

> MultivariateOnlineSummarizer incorrect min/max for identical NaN feature
> 
>
> Key: SPARK-20711
> URL: https://issues.apache.org/jira/browse/SPARK-20711
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> {code}
> scala> val summarizer = new MultivariateOnlineSummarizer()
> summarizer: org.apache.spark.mllib.stat.MultivariateOnlineSummarizer = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, -10.0))
> res20: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, 2.0))
> res21: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.min
> res22: org.apache.spark.mllib.linalg.Vector = [1.7976931348623157E308,-10.0]
> scala> summarizer.max
> res23: org.apache.spark.mllib.linalg.Vector = [-1.7976931348623157E308,2.0]
> {code}
> For a feature containing only {{Double.NaN}} values, the returned max is 
> {{Double.MinValue}} and the min is {{Double.MaxValue}}.
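A minimal sketch (not the actual {{MultivariateOnlineSummarizer}} code) of a NaN-aware min/max update, assuming the semantics suggested in the comment above, i.e. any {{NaN}} observation makes both statistics {{NaN}} for that feature:

{code}
// Sketch only: illustrates one possible NaN-propagating min/max update,
// not the real MultivariateOnlineSummarizer implementation.
object NaNAwareMinMax {
  def update(currMin: Double, currMax: Double, value: Double): (Double, Double) = {
    if (value.isNaN || currMin.isNaN || currMax.isNaN) {
      (Double.NaN, Double.NaN)  // a single NaN poisons both statistics
    } else {
      (math.min(currMin, value), math.max(currMax, value))
    }
  }

  def main(args: Array[String]): Unit = {
    // Start from the usual sentinels, then feed only NaN values.
    val (min, max) = Seq(Double.NaN, Double.NaN)
      .foldLeft((Double.MaxValue, Double.MinValue)) { case ((mn, mx), v) => update(mn, mx, v) }
    println(s"min=$min, max=$max")  // min=NaN, max=NaN instead of the sentinel values
  }
}
{code}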



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8402) Add DP means clustering to MLlib

2017-05-11 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath closed SPARK-8402.
-
Resolution: Won't Fix

> Add DP means clustering to MLlib
> 
>
> Key: SPARK-8402
> URL: https://issues.apache.org/jira/browse/SPARK-8402
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Meethu Mathew
>Assignee: Meethu Mathew
>  Labels: features
>
> At present, all the clustering algorithms in MLlib require the number of 
> clusters to be specified in advance. 
> The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model 
> that allows for flexible clustering of data without having to specify the 
> number of clusters a priori. 
> DP means is a non-parametric clustering algorithm that uses a scale parameter 
> 'lambda' to control the creation of new clusters ["Revisiting k-means: New 
> Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan].
> We have followed the distributed implementation of DP means which has been 
> proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" 
> by Xinghao Pan, Evan R. Sparks, Andre Wibisono.
> A benchmark comparison between k-means and DP-means, based on Normalized 
> Mutual Information between ground-truth clusters and algorithm outputs, is 
> provided in the following table. It can be seen from the table that 
> DP-means reported a higher NMI on 5 of 8 data sets in comparison to 
> k-means [Source: Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms 
> via Bayesian nonparametrics (2011) Arxiv:.0352. (Table 1)]
> || Dataset || DP-means || k-means ||
> | Wine  | .41  | .43 |
> | Iris  | .75  | .76 |
> | Pima  | .02  | .03 |
> | Soybean   | .72  | .66 |
> | Car   | .07  | .05 |
> | Balance Scale | .17  | .11 |
> | Breast Cancer | .04  | .03 |
> | Vehicle   | .18  | .18 |
> Experiment on our Spark cluster setup:
> An initial benchmark study was performed on a 3-node Spark cluster on Mesos, 
> where each node had 8 cores and 64 GB RAM, and the Spark version used was 
> 1.5 (git branch).
> Tests were done using a mixture of 10 Gaussians with a varying number of 
> features and instances. The results from the benchmark study are provided 
> below. The reported stats are averages over 5 runs. 
> || Instances || Dimensions || Clusters obtained || DP-means time || DP-means iterations to converge || k-means (k=10) time || k-means iterations to converge ||
> | 10 million | 10 | 10 | 43.6s | 2 | 52.2s | 2 |
> | 1 million | 100 | 10 | 39.8s | 2 | 43.39s | 2 |
> | 0.1 million | 1000 | 10 | 37.3s | 2 | 41.64s | 2 |
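For illustration only, a local (single-machine) sketch of the DP-means rule described above — a point whose squared distance to every existing centre exceeds lambda opens a new cluster. This is not the distributed implementation from the MLbase paper:

{code}
// Local sketch of DP-means (Kulis & Jordan) for illustration; not the distributed version.
object DpMeansSketch {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def mean(points: Seq[Point]): Point = {
    val sums = points.reduce((p, q) => p.zip(q).map { case (x, y) => x + y })
    sums.map(_ / points.length)
  }

  def dpMeans(data: Seq[Point], lambda: Double, iterations: Int = 10): Seq[Point] = {
    var centers = Seq(mean(data))  // start with the global mean as a single cluster
    for (_ <- 0 until iterations) {
      // Assignment: open a new cluster when the nearest centre is farther than lambda.
      val assigned = data.map { p =>
        val (dist, idx) = centers.map(c => sqDist(p, c)).zipWithIndex.minBy(_._1)
        if (dist > lambda) { centers = centers :+ p; (centers.size - 1, p) } else (idx, p)
      }
      // Update: recompute each centre as the mean of its assigned points.
      centers = assigned.groupBy(_._1).toSeq.sortBy(_._1).map { case (_, pts) => mean(pts.map(_._2)) }
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(Array(0.0, 0.0), Array(0.1, 0.2), Array(5.0, 5.0), Array(5.1, 4.9))
    println(dpMeans(data, lambda = 4.0).map(_.mkString("(", ", ", ")")).mkString(" "))
  }
}
{code}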



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8402) Add DP means clustering to MLlib

2017-05-11 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006060#comment-16006060
 ] 

Nick Pentreath commented on SPARK-8402:
---

I'm afraid I would say there is not sufficient demand or committer bandwidth 
for this feature at present.

> Add DP means clustering to MLlib
> 
>
> Key: SPARK-8402
> URL: https://issues.apache.org/jira/browse/SPARK-8402
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Meethu Mathew
>Assignee: Meethu Mathew
>  Labels: features
>
> At present, all the clustering algorithms in MLlib require the number of 
> clusters to be specified in advance. 
> The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model 
> that allows for flexible clustering of data without having to specify the 
> number of clusters a priori. 
> DP means is a non-parametric clustering algorithm that uses a scale parameter 
> 'lambda' to control the creation of new clusters ["Revisiting k-means: New 
> Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan].
> We have followed the distributed implementation of DP means which has been 
> proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" 
> by Xinghao Pan, Evan R. Sparks, Andre Wibisono.
> A benchmark comparison between k-means and DP-means, based on Normalized 
> Mutual Information between ground-truth clusters and algorithm outputs, is 
> provided in the following table. It can be seen from the table that 
> DP-means reported a higher NMI on 5 of 8 data sets in comparison to 
> k-means [Source: Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms 
> via Bayesian nonparametrics (2011) Arxiv:.0352. (Table 1)]
> || Dataset || DP-means || k-means ||
> | Wine  | .41  | .43 |
> | Iris  | .75  | .76 |
> | Pima  | .02  | .03 |
> | Soybean   | .72  | .66 |
> | Car   | .07  | .05 |
> | Balance Scale | .17  | .11 |
> | Breast Cancer | .04  | .03 |
> | Vehicle   | .18  | .18 |
> Experiment on our Spark cluster setup:
> An initial benchmark study was performed on a 3-node Spark cluster on Mesos, 
> where each node had 8 cores and 64 GB RAM, and the Spark version used was 
> 1.5 (git branch).
> Tests were done using a mixture of 10 Gaussians with a varying number of 
> features and instances. The results from the benchmark study are provided 
> below. The reported stats are averages over 5 runs. 
> || Instances || Dimensions || Clusters obtained || DP-means time || DP-means iterations to converge || k-means (k=10) time || k-means iterations to converge ||
> | 10 million | 10 | 10 | 43.6s | 2 | 52.2s | 2 |
> | 1 million | 100 | 10 | 39.8s | 2 | 43.39s | 2 |
> | 0.1 million | 1000 | 10 | 37.3s | 2 | 41.64s | 2 |



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11669) Python interface to SparkR GLM module

2017-05-11 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006056#comment-16006056
 ] 

Nick Pentreath commented on SPARK-11669:


I think this can be closed as it's done in {{GeneralizedLinearRegression}} 
across the various language APIs.
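For reference, a minimal sketch of the DataFrame-based API this refers to (Scala shown; the Python and R APIs mirror it). The toy data and column names below are illustrative only:

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("glm-sketch").getOrCreate()

// Toy data; column names are made up for the example.
val df = spark.createDataFrame(Seq(
  (1.0, 0.5, 1.2), (0.0, 1.5, 0.3), (2.0, 2.0, 2.2), (1.5, 1.0, 1.8)
)).toDF("y", "x1", "x2")

// RFormula provides the R-style model specification the original request asked for.
val formula = new RFormula().setFormula("y ~ x1 + x2")
val glm = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")
  .setMaxIter(10)
  .setRegParam(0.3)

val model = new Pipeline().setStages(Array(formula, glm)).fit(df)
model.transform(df).select("y", "prediction").show()
{code}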

> Python interface to SparkR GLM module
> -
>
> Key: SPARK-11669
> URL: https://issues.apache.org/jira/browse/SPARK-11669
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR
>Affects Versions: 1.5.0, 1.5.1
> Environment: LINUX
> MAC
> WINDOWS
>Reporter: Shubhanshu Mishra
>Priority: Minor
>  Labels: GLM, pyspark, sparkR, statistics
>
> There should be a Python interface to the SparkR GLM module. Currently the 
> only Python library which produces R-style GLM results is statsmodels. 
> Inspiration for the API can be taken from the following page: 
> http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/formulas.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20503) ML 2.2 QA: API: Python API coverage

2017-05-11 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006038#comment-16006038
 ] 

Nick Pentreath commented on SPARK-20503:


If SPARK-20602 and/or SPARK-20348 are completed, the Python API must be updated 
accordingly.

> ML 2.2 QA: API: Python API coverage
> ---
>
> Key: SPARK-20503
> URL: https://issues.apache.org/jira/browse/SPARK-20503
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20679) Let ML ALS recommend for a subset of users/items

2017-05-11 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-20679:
---
Summary: Let ML ALS recommend for a subset of users/items  (was: Let ALS 
recommend for a subset of users/items)

> Let ML ALS recommend for a subset of users/items
> 
>
> Key: SPARK-20679
> URL: https://issues.apache.org/jira/browse/SPARK-20679
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>    Reporter: Nick Pentreath
>
> SPARK-10802 is for {{mllib}}'s {{MatrixFactorizationModel}} to recommend for 
> a subset of user or item factors.
> Since {{mllib}} is in maintenance mode and {{ml}}'s {{ALSModel}} now supports 
> the {{recommendForAllX}} methods, this ticket tracks adding this 
> functionality to {{ALSModel}}.
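As a rough sketch of the intent (not the eventual API added by this ticket): with the existing 2.2 {{recommendForAllUsers}} one can already approximate subset recommendations by joining against the users of interest. The toy data below is illustrative only:

{code}
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("als-subset-sketch").getOrCreate()
import spark.implicits._

// Toy ratings for illustration only.
val ratings = Seq((0, 0, 4.0f), (0, 1, 2.0f), (1, 1, 3.0f), (2, 0, 5.0f))
  .toDF("user", "item", "rating")

val model = new ALS()
  .setUserCol("user").setItemCol("item").setRatingCol("rating")
  .setRank(4).setMaxIter(5)
  .fit(ratings)

// Existing API: top-k recommendations for every user.
val allRecs = model.recommendForAllUsers(2)

// Join-based work-around until a dedicated subset method lands.
val usersOfInterest = Seq(0, 2).toDF("user")
allRecs.join(usersOfInterest, "user").show(false)
{code}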



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20679) Let ALS recommend for a subset of users/items

2017-05-09 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002360#comment-16002360
 ] 

Nick Pentreath commented on SPARK-20679:


I'm working on this.

> Let ALS recommend for a subset of users/items
> -
>
> Key: SPARK-20679
> URL: https://issues.apache.org/jira/browse/SPARK-20679
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>
> SPARK-10802 is for {{mllib}}'s {{MatrixFactorizationModel}} to recommend for 
> a subset of user or item factors.
> Since {{mllib}} is in maintenance mode and {{ml}}'s {{ALSModel}} now supports 
> the {{recommendForAllX}} methods, this ticket tracks adding this 
> functionality to {{ALSModel}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10802) Let ALS recommend for subset of data

2017-05-09 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002343#comment-16002343
 ] 

Nick Pentreath commented on SPARK-10802:


Hey folks - since the {{ALSModel}} in the ML API now supports "recommend-all" 
methods, this functionality will be implemented there (see SPARK-20679). Unless 
there are major objections, I advocate closing this one as "Won't Fix" once 
SPARK-20679 is done, since the MLlib API is in maintenance mode.

> Let ALS recommend for subset of data
> 
>
> Key: SPARK-10802
> URL: https://issues.apache.org/jira/browse/SPARK-10802
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> Currently MatrixFactorizationModel allows getting recommendations for
> - a single user 
> - a single product 
> - all users
> - all products
> Recommendations for all users/products do a Cartesian join internally.
> It would be useful in some cases to get recommendations for a subset of 
> users/products by providing an RDD with which MatrixFactorizationModel could 
> do an intersection before doing the Cartesian join. This would make it much 
> faster in situations where recommendations are needed only for a subset of 
> users/products, and where the subset is still too large to make it feasible to 
> recommend one-by-one.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20679) Let ALS recommend for a subset of users/items

2017-05-09 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-20679:
--

 Summary: Let ALS recommend for a subset of users/items
 Key: SPARK-20679
 URL: https://issues.apache.org/jira/browse/SPARK-20679
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Nick Pentreath


SPARK-10802 is for {{mllib}}'s {{MatrixFactorizationModel}} to recommend for a 
subset of user or item factors.

Since {{mllib}} is in maintenance mode and {{ml}}'s {{ALSModel}} now supports 
the {{recommendForAllX}} methods, this ticket tracks adding this functionality 
to {{ALSModel}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10408) Autoencoder

2017-05-09 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002326#comment-16002326
 ] 

Nick Pentreath edited comment on SPARK-10408 at 5/9/17 9:00 AM:


What is the status here? I think it's fairly safe to say that it's unlikely 
that Spark will support much in the way of deep learning within the core 
itself. 

That said there may still be an argument for adding the {{MLPRegressor}} and an 
autoencoder - but I'm concerned we lack the review and maintenance bandwidth 
currently.


was (Author: mlnick):
What is the status here? I think it's fairly safe to say that it's unlikely 
that Spark will support much in the way of deep learning itself. 

That said there may still be an argument for adding the {{MLPRegressor}} and an 
autoencoder - but I'm concerned we lack the review and maintenance bandwidth 
currently.

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1) Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3) Denoising autoencoder 
> 4) Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10408) Autoencoder

2017-05-09 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002326#comment-16002326
 ] 

Nick Pentreath commented on SPARK-10408:


What is the status here? I think it's fairly safe to say that it's unlikely 
that Spark will support much in the way of deep learning itself. 

That said there may still be an argument for adding the {{MLPRegressor}} and an 
autoencoder - but I'm concerned we lack the review and maintenance bandwidth 
currently.

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1) Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3) Denoising autoencoder 
> 4) Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2017-05-09 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002310#comment-16002310
 ] 

Nick Pentreath commented on SPARK-6323:
---

I think it is safe to say this will not be feasible to incorporate into Spark 
itself in the short- to medium-term.

> Large rank matrix factorization with Nonlinear loss and constraints
> ---
>
> Key: SPARK-6323
> URL: https://issues.apache.org/jira/browse/SPARK-6323
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Debasish Das
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Currently ml.recommendation.ALS is optimized for gram matrix generation which 
> scales to modest ranks. The problems that we can solve are in the normal 
> equation/quadratic form: 0.5x'Hx + c'x + g(z)
> g(z) can be one of the constraints from Breeze proximal library:
> https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
> In this PR we will re-use the ml.recommendation.ALS design and come up with 
> ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent 
> changes, it's straightforward to do it now!
> ALM will be capable of solving the following problems: min f ( x ) + g ( z )
> 1. Loss function f ( x ) can be LeastSquareLoss and LoglikelihoodLoss. Most 
> likely we will re-use the Gradient interfaces already defined and implement 
> LoglikelihoodLoss
> 2. The constraints g ( z ) supported are the same as above, except that we don't 
> support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we 
> don't need that for ML applications
> 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which 
> in turn uses projection based solver (SPG) or proximal solvers (ADMM) based 
> on convergence speed.
> https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala
> 4. The factors will be SparseVector so that we keep shuffle size in check. 
> For example we will run with 10K ranks but we will force factors to be 
> 100-sparse.
> This is closely related to Sparse LDA 
> https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
> are not using graph representation here.
> As we do scaling experiments, we will understand which flow is more suited as 
> ratings get denser (my understanding is that since we already scaled ALS to 2 
> billion ratings and we will keep sparsity in check, the same 2 billion flow 
> will scale to 10K ranks as well)...
> This JIRA is intended to extend the capabilities of ml recommendation to 
> generalized loss function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20587) Improve performance of ML ALS recommendForAll

2017-05-09 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-20587.

   Resolution: Fixed
Fix Version/s: 2.2.1

Issue resolved by pull request 17845
[https://github.com/apache/spark/pull/17845]

> Improve performance of ML ALS recommendForAll
> -
>
> Key: SPARK-20587
> URL: https://issues.apache.org/jira/browse/SPARK-20587
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
> Fix For: 2.2.1
>
>
> SPARK-11968 relates to excessive GC pressure from using the "blocked BLAS 3" 
> approach for generating top-k recommendations in 
> {{mllib.recommendation.MatrixFactorizationModel}}.
> The solution there is still based on blocking factors, but efficiently 
> computes the top-k elements *per block* first (using 
> {{BoundedPriorityQueue}}) and then computes the global top-k elements.
> This improves performance and GC pressure substantially for {{mllib}}'s ALS 
> model. The same approach is also a lot more efficient than the current 
> "crossJoin and score per-row" used in {{ml}}'s {{DataFrame}}-based method. 
> This adapts the solution in SPARK-11968 for {{DataFrame}}.
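A plain-Scala sketch of the bounded-heap idea described above (the actual Spark code uses its internal {{BoundedPriorityQueue}} together with blocked BLAS-3 scoring; this only illustrates the per-block top-k step):

{code}
import scala.collection.mutable

// Keep only the k highest-scoring candidates while streaming through scored blocks.
def topK[T](blocks: Iterator[Seq[(T, Double)]], k: Int): Seq[(T, Double)] = {
  // Min-heap by score: the weakest of the current top-k sits at the head and is evicted first.
  val heap = mutable.PriorityQueue.empty[(T, Double)](Ordering.by[(T, Double), Double](_._2).reverse)
  for (block <- blocks; candidate <- block) {
    if (heap.size < k) heap.enqueue(candidate)
    else if (candidate._2 > heap.head._2) { heap.dequeue(); heap.enqueue(candidate) }
  }
  val out = Seq.newBuilder[(T, Double)]
  while (heap.nonEmpty) out += heap.dequeue()
  out.result().reverse  // highest score first
}

// Example: two "blocks" of (itemId, score) pairs.
val blocks = Iterator(Seq((1, 0.9), (2, 0.1)), Seq((3, 0.8), (4, 0.5)))
println(topK(blocks, 2))  // (1, 0.9) and (3, 0.8)
{code}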



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-05-09 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-11968.

   Resolution: Fixed
Fix Version/s: 2.2.1

Issue resolved by pull request 17742
[https://github.com/apache/spark/pull/17742]

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
> Fix For: 2.2.1
>
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-05-09 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-11968:
--

Assignee: Peng Meng  (was: Nick Pentreath)

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Peng Meng
> Fix For: 2.2.1
>
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20677) Clean up ALS recommend all improvement code.

2017-05-09 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-20677:
--

Assignee: Nick Pentreath

> Clean up ALS recommend all improvement code. 
> -
>
> Key: SPARK-20677
> URL: https://issues.apache.org/jira/browse/SPARK-20677
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>Priority: Minor
>
> SPARK-11968 and SPARK-20587 added performance improvements to the "recommend 
> all" methods of {{ALS}}. This JIRA tracks a few clean ups as follow up:
> # Force use F2jBLAS {{dot}} instead of the hand-written version
> # (Potentially) clean up iterator code for priority queue
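A minimal sketch of item 1 — calling the f2j {{dot}} kernel directly rather than a hand-rolled loop (ALS factors are single-precision, hence {{sdot}}; the array contents are illustrative):

{code}
import com.github.fommil.netlib.F2jBLAS

val blas = new F2jBLAS
val x = Array(1.0f, 2.0f, 3.0f)
val y = Array(4.0f, 5.0f, 6.0f)
// BLAS level-1 single-precision dot product.
val dot = blas.sdot(x.length, x, 1, y, 1)
println(dot)  // 32.0
{code}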



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20677) Clean up ALS recommend all improvement code.

2017-05-09 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-20677:
---
Description: 
SPARK-11968 and SPARK-20587 added performance improvements to the "recommend 
all" methods of {{ALS}}. This JIRA tracks a few clean ups as follow up:

# Force use F2jBLAS {{dot}} instead of the hand-written version
# (Potentially) clean up iterator code for priority queue

  was:
SPARK-11968 and SPARK-20587 added performance improvements to the "recommend 
all" methods of {{ALS}}. This JIRA tracks a few clean ups as follow up:

# Force use F2jBLAS {{dot}} instead of the hand-written version


> Clean up ALS recommend all improvement code. 
> -
>
> Key: SPARK-20677
> URL: https://issues.apache.org/jira/browse/SPARK-20677
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> SPARK-11968 and SPARK-20587 added performance improvements to the "recommend 
> all" methods of {{ALS}}. This JIRA tracks a few clean ups as follow up:
> # Force use F2jBLAS {{dot}} instead of the hand-written version
> # (Potentially) clean up iterator code for priority queue



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20677) Clean up ALS recommend all improvement code.

2017-05-09 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-20677:
---
Description: 
SPARK-11968 and SPARK-20587 added performance improvements to the "recommend 
all" methods of {{ALS}}. This JIRA tracks a few clean ups as follow up:

# Force use F2jBLAS {{dot}} instead of the hand-written version

  was:
SPARK-11968 and SPARK-20587 added performance improvements to the "recommend 
all" methods of {{ALS}}. This JIRA tracks a few clean ups as follow up:

# Force use F2jBLAS {{dot}} instead of the hand-written version
# Clean up increment variables


> Clean up ALS recommend all improvement code. 
> -
>
> Key: SPARK-20677
> URL: https://issues.apache.org/jira/browse/SPARK-20677
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> SPARK-11968 and SPARK-20587 added performance improvements to the "recommend 
> all" methods of {{ALS}}. This JIRA tracks a few clean ups as follow up:
> # Force use F2jBLAS {{dot}} instead of the hand-written version



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20677) Clean up ALS recommend all improvement code.

2017-05-09 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-20677:
--

 Summary: Clean up ALS recommend all improvement code. 
 Key: SPARK-20677
 URL: https://issues.apache.org/jira/browse/SPARK-20677
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.2.0
Reporter: Nick Pentreath
Priority: Minor


SPARK-11968 and SPARK-20587 added performance improvements to the "recommend 
all" methods of {{ALS}}. This JIRA tracks a few clean ups as follow up:

# Force use F2jBLAS {{dot}} instead of the hand-written version
# Clean up increment variables



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20596) Improve ALS recommend all test cases

2017-05-08 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-20596:
---
Fix Version/s: (was: 2.2.0)
   2.2.1

> Improve ALS recommend all test cases
> 
>
> Key: SPARK-20596
> URL: https://issues.apache.org/jira/browse/SPARK-20596
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.2.0
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>Priority: Minor
> Fix For: 2.2.1
>
>
> Existing test cases for `recommendForAllX` methods in SPARK-19535 test {{k}} 
> < num items and {{k}} = num items. Technically we should also test that {{k}} 
> > num items returns the same results as {{k}} = num items.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20596) Improve ALS recommend all test cases

2017-05-08 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-20596.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17860
[https://github.com/apache/spark/pull/17860]

> Improve ALS recommend all test cases
> 
>
> Key: SPARK-20596
> URL: https://issues.apache.org/jira/browse/SPARK-20596
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.2.0
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>Priority: Minor
> Fix For: 2.2.0
>
>
> Existing test cases for `recommendForAllX` methods in SPARK-19535 test {{k}} 
> < num items and {{k}} = num items. Technically we should also test that {{k}} 
> > num items returns the same results as {{k}} = num items.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20503) ML 2.2 QA: API: Python API coverage

2017-05-04 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996676#comment-15996676
 ] 

Nick Pentreath commented on SPARK-20503:


cc [~holdenk] [~bryanc] [~zero323]? I can take it if others can't.

> ML 2.2 QA: API: Python API coverage
> ---
>
> Key: SPARK-20503
> URL: https://issues.apache.org/jira/browse/SPARK-20503
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20501) ML, Graph 2.2 QA: API: New Scala APIs, docs

2017-05-04 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996675#comment-15996675
 ] 

Nick Pentreath commented on SPARK-20501:


Things that would need to be checked include:

# {{Imputer}}
# {{FPGrowth}} in {{ml}}
# {{ALS}} {{recommendForX}}
# {{coldStartStrategy}} for {{ALS}}
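For item 4, a minimal sketch of the new 2.2 parameter as it appears in the Scala API (the column names are illustrative):

{code}
import org.apache.spark.ml.recommendation.ALS

// "drop" removes rows with NaN predictions for users/items unseen at training time,
// so downstream evaluators are not skewed by cold-start entries.
val als = new ALS()
  .setUserCol("user").setItemCol("item").setRatingCol("rating")
  .setColdStartStrategy("drop")
{code}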

> ML, Graph 2.2 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-20501
> URL: https://issues.apache.org/jira/browse/SPARK-20501
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20499) Spark MLlib, GraphX 2.2 QA umbrella

2017-05-04 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-20499:
---
Description: 
This JIRA lists tasks for the next Spark release's QA period for MLlib and 
GraphX.   *SparkR is separate: [SPARK-20508].*

The list below gives an overview of what is involved, and the corresponding 
JIRA issues are linked below that.

h2. API

* Check binary API compatibility for Scala/Java
* Audit new public APIs (from the generated html doc)
** Scala
** Java compatibility
** Python coverage
* Check Experimental, DeveloperApi tags

h2. Algorithms and performance

* Performance tests
* Major new feature transformers: Imputer, FPGrowth in {{ml}}

h2. Documentation and example code

* For new algorithms, create JIRAs for updating the user guide sections & 
examples
* Update Programming Guide
* Update website


  was:
This JIRA lists tasks for the next Spark release's QA period for MLlib and 
GraphX.   *SparkR is separate: [SPARK-20508].*

The list below gives an overview of what is involved, and the corresponding 
JIRA issues are linked below that.

h2. API

* Check binary API compatibility for Scala/Java
* Audit new public APIs (from the generated html doc)
** Scala
** Java compatibility
** Python coverage
* Check Experimental, DeveloperApi tags

h2. Algorithms and performance

* Performance tests
* Major new algorithms: MinHash, RandomProjection

h2. Documentation and example code

* For new algorithms, create JIRAs for updating the user guide sections & 
examples
* Update Programming Guide
* Update website



> Spark MLlib, GraphX 2.2 QA umbrella
> ---
>
> Key: SPARK-20499
> URL: https://issues.apache.org/jira/browse/SPARK-20499
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX.   *SparkR is separate: [SPARK-20508].*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Check binary API compatibility for Scala/Java
> * Audit new public APIs (from the generated html doc)
> ** Scala
> ** Java compatibility
> ** Python coverage
> * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
> * Performance tests
> * Major new feature transformers: Imputer, FPGrowth in {{ml}}
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20596) Improve ALS recommend all test cases

2017-05-04 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-20596:
---
Component/s: Tests

> Improve ALS recommend all test cases
> 
>
> Key: SPARK-20596
> URL: https://issues.apache.org/jira/browse/SPARK-20596
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.2.0
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>Priority: Minor
>
> Existing test cases for `recommendForAllX` methods in SPARK-19535 test {{k}} 
> < num items and {{k}} = num items. Technically we should also test that {{k}} 
> > num items returns the same results as {{k}} = num items.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20596) Improve ALS recommend all test cases

2017-05-04 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-20596:
--

 Summary: Improve ALS recommend all test cases
 Key: SPARK-20596
 URL: https://issues.apache.org/jira/browse/SPARK-20596
 Project: Spark
  Issue Type: Test
  Components: ML
Affects Versions: 2.1.0
Reporter: Nick Pentreath
Assignee: Nick Pentreath
Priority: Minor


Existing test cases for `recommendForAllX` methods in SPARK-19535 test {{k}} < 
num items and {{k}} = num items. Technically we should also test that {{k}} > 
num items returns the same results as {{k}} = num items.
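A hedged sketch of that extra assertion (not the actual suite code; {{model}} is assumed to be an {{ALSModel}} fitted in the surrounding test on data with 10 items):

{code}
// Sketch only: `model` comes from the test fixture (an assumption here).
val numItems = 10
val exact = model.recommendForAllUsers(numItems).orderBy("user").collect()
val above = model.recommendForAllUsers(numItems + 5).orderBy("user").collect()
// Asking for more recommendations than there are items should change nothing.
assert(exact.sameElements(above))
{code}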



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20596) Improve ALS recommend all test cases

2017-05-04 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-20596:
---
Affects Version/s: (was: 2.1.0)
   2.2.0

> Improve ALS recommend all test cases
> 
>
> Key: SPARK-20596
> URL: https://issues.apache.org/jira/browse/SPARK-20596
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Affects Versions: 2.2.0
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>Priority: Minor
>
> Existing test cases for `recommendForAllX` methods in SPARK-19535 test {{k}} 
> < num items and {{k}} = num items. Technically we should also test that {{k}} 
> > num items returns the same results as {{k}} = num items.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20587) Improve performance of ML ALS recommendForAll

2017-05-03 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-20587:
--

 Summary: Improve performance of ML ALS recommendForAll
 Key: SPARK-20587
 URL: https://issues.apache.org/jira/browse/SPARK-20587
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Nick Pentreath
Assignee: Nick Pentreath


SPARK-11968 relates to excessive GC pressure from using the "blocked BLAS 3" 
approach for generating top-k recommendations in 
{{mllib.recommendation.MatrixFactorizationModel}}.

The solution there is still based on blocking factors, but efficiently computes 
the top-k elements *per block* first (using {{BoundedPriorityQueue}}) and then 
computes the global top-k elements.

This improves performance and GC pressure substantially for {{mllib}}'s ALS 
model. The same approach is also a lot more efficient than the current 
"crossJoin and score per-row" used in {{ml}}'s {{DataFrame}}-based method. This 
adapts the solution in SPARK-11968 for {{DataFrame}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6227) PCA and SVD for PySpark

2017-05-03 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-6227.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17621
[https://github.com/apache/spark/pull/17621]

> PCA and SVD for PySpark
> ---
>
> Key: SPARK-6227
> URL: https://issues.apache.org/jira/browse/SPARK-6227
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.2.1
>Reporter: Julien Amelot
>Assignee: Manoj Kumar
> Fix For: 2.2.0
>
>
> The Dimensionality Reduction techniques are not available via Python (Scala + 
> Java only).
> * Principal component analysis (PCA)
> * Singular value decomposition (SVD)
> Doc:
> http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-02 Thread Nick Pentreath
I won't +1 just given that it seems certain there will be another RC and
there are the outstanding ML QA blocker issues.

But clean build and test for JVM and Python tests LGTM on CentOS Linux
7.2.1511, OpenJDK 1.8.0_111

On Mon, 1 May 2017 at 22:42 Frank Austin Nothaft 
wrote:

> Hi Ryan,
>
> IMO, the problem is that the Spark Avro version conflicts with the Parquet
> Avro version. As discussed upthread, I don’t think there’s a way to
> *reliably* make sure that Avro 1.8 is on the classpath first while using
> spark-submit. Relocating avro in our project wouldn’t solve the problem,
> because the MethodNotFoundError is thrown from the internals of the
> ParquetAvroOutputFormat, not from code in our project.
>
> Regards,
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466 <(202)%20340-0466>
>
> On May 1, 2017, at 12:33 PM, Ryan Blue  wrote:
>
> Michael, I think that the problem is with your classpath.
>
> Spark has a dependency to 1.7.7, which can't be changed. Your project is
> what pulls in parquet-avro and transitively Avro 1.8. Spark has no runtime
> dependency on Avro 1.8. It is understandably annoying that using the same
> version of Parquet for your parquet-avro dependency is what causes your
> project to depend on Avro 1.8, but Spark's dependencies aren't a problem
> because its Parquet dependency doesn't bring in Avro.
>
> There are a few ways around this:
> 1. Make sure Avro 1.8 is found in the classpath first
> 2. Shade Avro 1.8 in your project (assuming Avro classes aren't shared)
> 3. Use parquet-avro 1.8.1 in your project, which I think should work with
> 1.8.2 and avoid the Avro change
>
> The work-around in Spark is for tests, which do use parquet-avro. We can
> look at a Parquet 1.8.3 that avoids this issue, but I think this is
> reasonable for the 2.2.0 release.
>
> rb
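For readers hitting the same conflict, a hedged sketch of option 2 above — relocating Avro inside a shaded application jar — using sbt-assembly shade rules in build.sbt. The renamed package prefix is illustrative, and as Frank notes upthread this only helps when the Avro classes are not shared with (or called from inside) third-party code such as ParquetAvroOutputFormat:

{code}
// build.sbt (sbt-assembly): relocate the application's Avro 1.8 classes so they
// cannot clash with the Avro 1.7.7 that Spark ships on the runtime classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.avro.**" -> "shaded.avro.@1").inAll
)
{code}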
>
> On Mon, May 1, 2017 at 12:08 PM, Michael Heuer  wrote:
>
>> Please excuse me if I'm misunderstanding -- the problem is not with our
>> library or our classpath.
>>
>> There is a conflict within Spark itself, in that Parquet 1.8.2 expects to
>> find Avro 1.8.0 on the runtime classpath and sees 1.7.7 instead.  Spark
>> already has to work around this for unit tests to pass.
>>
>>
>>
>> On Mon, May 1, 2017 at 2:00 PM, Ryan Blue  wrote:
>>
>>> Thanks for the extra context, Frank. I agree that it sounds like your
>>> problem comes from the conflict between your Jars and what comes with
>>> Spark. Its the same concern that makes everyone shudder when anything has a
>>> public dependency on Jackson. :)
>>>
>>> What we usually do to get around situations like this is to relocate the
>>> problem library inside the shaded Jar. That way, Spark uses its version of
>>> Avro and your classes use a different version of Avro. This works if you
>>> don't need to share classes between the two. Would that work for your
>>> situation?
>>>
>>> rb
>>>
>>> On Mon, May 1, 2017 at 11:55 AM, Koert Kuipers 
>>> wrote:
>>>
 sounds like you are running into the fact that you cannot really put
 your classes before spark's on classpath? spark's switches to support this
 never really worked for me either.

 inability to control the classpath + inconsistent jars => trouble ?

 On Mon, May 1, 2017 at 2:36 PM, Frank Austin Nothaft <
 fnoth...@berkeley.edu> wrote:

> Hi Ryan,
>
> We do set Avro to 1.8 in our downstream project. We also set Spark as
> a provided dependency, and build an überjar. We run via spark-submit, 
> which
> builds the classpath with our überjar and all of the Spark deps. This 
> leads
> to avro 1.7.1 getting picked off of the classpath at runtime, which causes
> the no such method exception to occur.
>
> Regards,
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466 <(202)%20340-0466>
>
> On May 1, 2017, at 11:31 AM, Ryan Blue  wrote:
>
> Frank,
>
> The issue you're running into is caused by using parquet-avro with
> Avro 1.7. Can't your downstream project set the Avro dependency to 1.8?
> Spark can't update Avro because it is a breaking change that would force
> users to rebuilt specific Avro classes in some cases. But you should be
> free to use Avro 1.8 to avoid the problem.
>
> On Mon, May 1, 2017 at 11:08 AM, Frank Austin Nothaft <
> fnoth...@berkeley.edu> wrote:
>
>> Hi Ryan et al,
>>
>> The issue we’ve seen using a build of the Spark 2.2.0 branch from a
>> downstream project is that parquet-avro uses one of the new Avro 1.8.0
>> methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a
>> dependency. My colleague Michael (who posted earlier on this thread)
>> documented this in SPARK-19697. I know that
>> Spark has unit tests that check this compatibility issue, but it looks 
>> like
>>

[jira] [Created] (SPARK-20553) Update ALS examples for ML to illustrate recommend all

2017-05-02 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-20553:
--

 Summary: Update ALS examples for ML to illustrate recommend all
 Key: SPARK-20553
 URL: https://issues.apache.org/jira/browse/SPARK-20553
 Project: Spark
  Issue Type: Documentation
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: Nick Pentreath
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20300) Python API for ALSModel.recommendForAllUsers,Items

2017-05-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-20300.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17622
[https://github.com/apache/spark/pull/17622]

> Python API for ALSModel.recommendForAllUsers,Items
> --
>
> Key: SPARK-20300
> URL: https://issues.apache.org/jira/browse/SPARK-20300
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
> Fix For: 2.2.0
>
>
> Python API for ALSModel methods recommendForAllUsers, recommendForAllItems



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User

2017-05-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992497#comment-15992497
 ] 

Nick Pentreath commented on SPARK-20443:


Interesting - though it appears to me that {{2048}} is the best setting for 
both data sizes. At the least I think we should adjust the default.

> The blockSize of MLLIB ALS should be setting  by the User
> -
>
> Key: SPARK-20443
> URL: https://issues.apache.org/jira/browse/SPARK-20443
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The blockSize of MLlib ALS is very important for ALS performance. 
> In our test, when the blockSize is 128, performance is about 4X better compared 
> with a blockSize of 4096 (the default value).
> The following are our test results: 
> BlockSize (recommendationForAll time):
> 128 (124s), 256 (160s), 512 (184s), 1024 (244s), 2048 (332s), 4096 (488s), 8192 (OOM)
> The test environment:
> 3 workers: each worker has 10 cores, 30 GB memory, and 1 executor.
> The data: 480,000 users and 17,000 items



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20551) ImportError adding custom class from jar in pyspark

2017-05-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992491#comment-15992491
 ] 

Nick Pentreath commented on SPARK-20551:


Yes, I agree that it appears you're trying to import Java or Scala classes in 
Python, which won't work. I suggest you post a question to the Spark user list 
asking for help: u...@spark.apache.org. Please indicate what you are trying to 
do and provide some example code (it appears that you're trying to read a 
custom Hadoop {{InputFormat}} in PySpark?).

> ImportError adding custom class from jar in pyspark
> ---
>
> Key: SPARK-20551
> URL: https://issues.apache.org/jira/browse/SPARK-20551
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell
>Affects Versions: 2.1.0
>Reporter: Sergio Monteiro
>
> the following imports are failing in PySpark, even when I set the --jars or 
> --driver-class-path:
> import net.ripe.hadoop.pcap.io.PcapInputFormat
> import net.ripe.hadoop.pcap.io.CombinePcapInputFormat
> import net.ripe.hadoop.pcap.packet.Packet
> Using Python version 2.7.12 (default, Nov 19 2016 06:48:10)
> SparkSession available as 'spark'.
> >>> import net.ripe.hadoop.pcap.io.PcapInputFormat
> Traceback (most recent call last):
>   File "", line 1, in 
> ImportError: No module named net.ripe.hadoop.pcap.io.PcapInputFormat
> >>> import net.ripe.hadoop.pcap.io.CombinePcapInputFormat
> Traceback (most recent call last):
>   File "", line 1, in 
> ImportError: No module named net.ripe.hadoop.pcap.io.CombinePcapInputFormat
> >>> import net.ripe.hadoop.pcap.packet.Packet
> Traceback (most recent call last):
>   File "", line 1, in 
> ImportError: No module named net.ripe.hadoop.pcap.packet.Packet
> >>>
> The same works great in spark-shell. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20551) ImportError adding custom class from jar in pyspark

2017-05-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath closed SPARK-20551.
--
Resolution: Not A Problem

> ImportError adding custom class from jar in pyspark
> ---
>
> Key: SPARK-20551
> URL: https://issues.apache.org/jira/browse/SPARK-20551
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell
>Affects Versions: 2.1.0
>Reporter: Sergio Monteiro
>
> the following imports are failing in PySpark, even when I set the --jars or 
> --driver-class-path:
> import net.ripe.hadoop.pcap.io.PcapInputFormat
> import net.ripe.hadoop.pcap.io.CombinePcapInputFormat
> import net.ripe.hadoop.pcap.packet.Packet
> Using Python version 2.7.12 (default, Nov 19 2016 06:48:10)
> SparkSession available as 'spark'.
> >>> import net.ripe.hadoop.pcap.io.PcapInputFormat
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: No module named net.ripe.hadoop.pcap.io.PcapInputFormat
> >>> import net.ripe.hadoop.pcap.io.CombinePcapInputFormat
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: No module named net.ripe.hadoop.pcap.io.CombinePcapInputFormat
> >>> import net.ripe.hadoop.pcap.packet.Packet
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: No module named net.ripe.hadoop.pcap.packet.Packet
> >>>
> The same works great in spark-shell. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User

2017-05-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992475#comment-15992475
 ] 

Nick Pentreath commented on SPARK-20443:


Were these tests against existing master? Because SPARK-11968 / [PR 
#17742|https://github.com/apache/spark/pull/17742] should make block size less 
relevant - we should of course re-test this once that PR is merged in, to see 
if it's worth exposing the parameter.

> The blockSize of MLLIB ALS should be setting  by the User
> -
>
> Key: SPARK-20443
> URL: https://issues.apache.org/jira/browse/SPARK-20443
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The blockSize of MLLIB ALS is very important for ALS performance. 
> In our test, with a blockSize of 128 the performance is about 4X better than 
> with the default blockSize of 4096.
> The following are our test results: 
> BlockSize (recommendForAll time):
> 128(124s), 256(160s), 512(184s), 1024(244s), 2048(332s), 4096(488s), 8192(OOM)
> The Test Environment:
> 3 workers, each with 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20300) Python API for ALSModel.recommendForAllUsers,Items

2017-04-30 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-20300:
--

Assignee: Nick Pentreath

> Python API for ALSModel.recommendForAllUsers,Items
> --
>
> Key: SPARK-20300
> URL: https://issues.apache.org/jira/browse/SPARK-20300
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
>
> Python API for ALSModel methods recommendForAllUsers, recommendForAllItems



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20469) Add a method to display DataFrame schema in PipelineStage

2017-04-27 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987219#comment-15987219
 ] 

Nick Pentreath commented on SPARK-20469:


Pipeline stages themselves have no concept of schema. They can output the 
result of {{transformSchema}} or {{transform}}, but this requires an input DataFrame.
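
For example (a minimal sketch, assuming a DataFrame {{df}} with a string column 
{{text}}):

{code}
import org.apache.spark.ml.feature.Tokenizer

val tok = new Tokenizer().setInputCol("text").setOutputCol("words")

// the output schema can only be described relative to an input schema
println(tok.transformSchema(df.schema).treeString)

// or materialize it against the actual data
tok.transform(df).printSchema()
{code}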

> Add a method to display DataFrame schema in PipelineStage
> -
>
> Key: SPARK-20469
> URL: https://issues.apache.org/jira/browse/SPARK-20469
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 1.6.3, 2.0.2, 2.1.0
>Reporter: darion yaphet
>Priority: Minor
>
> Sometimes we apply Transformers and Estimators in a pipeline. If the PipelineStage 
> could display its schema, it would be a big help for understanding and checking the dataset.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984533#comment-15984533
 ] 

Nick Pentreath commented on SPARK-11968:


Thanks - in the meantime I will take a look at the PR.

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User

2017-04-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983050#comment-15983050
 ] 

Nick Pentreath commented on SPARK-20443:


Your PR for SPARK-20446 / SPARK-11968 should largely remove the need to adjust 
the block size - do you agree?

> The blockSize of MLLIB ALS should be setting  by the User
> -
>
> Key: SPARK-20443
> URL: https://issues.apache.org/jira/browse/SPARK-20443
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The blockSize of MLLIB ALS is very important for ALS performance. 
> In our test, with a blockSize of 128 the performance is about 4X better than 
> with the default blockSize of 4096.
> The following are our test results: 
> BlockSize (recommendForAll time):
> 128(124s), 256(160s), 512(184s), 1024(244s), 2048(332s), 4096(488s), 8192(OOM)
> The Test Environment:
> 3 workers, each with 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983048#comment-15983048
 ] 

Nick Pentreath commented on SPARK-11968:


[~peng.m...@intel.com] would you mind posting your comments here about the 
solution from SPARK-20446 as well as the experiment timings? You can rename 
your PR to include this JIRA (SPARK-11968) in the title instead, in order to 
link it.

Also please include the timings here of the {{ml DataFrame}} version for 
comparison. Your approach should also be much faster than the current {{ml}} 
SparkSQL approach, I think.

I just did some quick tests using MovieLens {{latest}} data (~24 million 
ratings, ~260,000 users, ~39,000 items) and found the following (note these are 
very rough timings):

Using default block sizes:

Current {{ml}} master: 262 sec
My approach: 58 sec
Your PR: 35 sec

You're correct that there is still +/- 20-25% GC time overhead using the BLAS 3 
+ sorting approach. Potentially it could be slightly improved through some form 
of pre-allocation, but even then it does look like any benefit of BLAS 3 is 
smaller than the GC cost.

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath closed SPARK-20446.
--
Resolution: Duplicate

> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method. 
> Each task uses the following code to keep its temp results:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (default value, no method to set)
> so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a lot of memory, and it 
> causes serious GC problems and frequent OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend the 
> topK (about 10 or 20) products for each user; then we only need 4k * topK 
> * (4 + 4 + 8) memory to save the temp results.
> I have written a solution for this method, with the following test results. 
> The Test Environment:
> 3 workers, each with 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize: 1024 2048 4096 8192
> Old method: 245s 332s 488s OOM
> This solution: 121s 118s 117s 120s
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983026#comment-15983026
 ] 

Nick Pentreath commented on SPARK-11968:


Note, there is a solution proposed in SPARK-20446. I've redirected the 
discussion to this original JIRA.

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983018#comment-15983018
 ] 

Nick Pentreath commented on SPARK-20446:


By the way when I say it is a duplicate I mean for the JIRA ticket. I agree 
that PR9980 was not the correct solution - JIRA tickets can have multiple PRs 
linked to them.

I'd prefer to close this ticket and move the discussion to SPARK-11968 (also 
there are watchers on that ticket that may be interested in the outcome). 



> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method. 
> Each task uses the following code to keep its temp results:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (default value, no method to set)
> so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a lot of memory, and it 
> causes serious GC problems and frequent OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend the 
> topK (about 10 or 20) products for each user; then we only need 4k * topK 
> * (4 + 4 + 8) memory to save the temp results.
> I have written a solution for this method, with the following test results. 
> The Test Environment:
> 3 workers, each with 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize: 1024 2048 4096 8192
> Old method: 245s 332s 488s OOM
> This solution: 121s 118s 117s 120s
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-04-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982695#comment-15982695
 ] 

Nick Pentreath commented on SPARK-13857:


I'm going to close this as superseded by SPARK-19535. 

However, the discussion here should still serve as a reference for making 
{{ALS.transform}} able to support ranking metrics for cross-validation in 
SPARK-14409.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-04-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath closed SPARK-13857.
--
Resolution: Duplicate

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: pyspark vector

2017-04-25 Thread Nick Pentreath
Well the 3 in this case is the size of the sparse vector. This equates to
the number of features, which for CountVectorizer (I assume that's what
you're using) is also vocab size (number of unique terms).
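
For illustration, a minimal Scala sketch of building that same vector by hand
(the PySpark equivalent lives in pyspark.ml.linalg.Vectors):

import org.apache.spark.ml.linalg.Vectors

// size 3, non-zero indices [0, 1, 2], values [1.0, 1.0, 1.0]
val v = Vectors.sparse(3, Array(0, 1, 2), Array(1.0, 1.0, 1.0))
println(v)       // (3,[0,1,2],[1.0,1.0,1.0])
println(v.size)  // 3 - the vocab size / number of features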

On Tue, 25 Apr 2017 at 04:06 Peyman Mohajerian  wrote:

> setVocabSize
>
>
> On Mon, Apr 24, 2017 at 5:36 PM, Zeming Yu  wrote:
>
>> Hi all,
>>
>> Beginner question:
>>
>> what does the 3 mean in the (3,[0,1,2],[1.0,1.0,1.0])?
>>
>> https://spark.apache.org/docs/2.1.0/ml-features.html
>>
>>  id | texts                          | vector
>> ----|--------------------------------|--------------------------
>>  0  | Array("a", "b", "c")           | (3,[0,1,2],[1.0,1.0,1.0])
>>  1  | Array("a", "b", "b", "c", "a") | (3,[0,1,2],[2.0,2.0,1.0])
>>
>>
>


[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-24 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981648#comment-15981648
 ] 

Nick Pentreath commented on SPARK-20446:


By "compare to DataFrame implementation" I mean the current "recommendAll" 
methods in master/branch-2.2 for {{ALSModel}}. Did you compare against that? If 
so what was the result?

Reason being, conceptually that is doing something fairly similar (computing 
the vector dot products rather than matrix-matrix multiply, followed by a 
priority queue aggregator for top-k). The idea was that this SparkSQL approach 
would be more efficient. In practice I didn't find this to be the case for 
large data sizes, when comparing to my approach with BLAS 3 (though granted yes 
there is potential for more GC pressure).

Also, there is really no point in doing the "blockify" operation in this case 
right? As you're not using BLAS 3 anyway, so blocking is unnecessary and the 
block size param is irrelevant.
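
For reference, a rough standalone sketch (illustrative names only, not the actual 
{{ALSModel}} or {{MatrixFactorizationModel}} code) of the "dot products + top-k 
priority queue" idea, run locally over two factor arrays:

{code}
import scala.collection.mutable

// Keep only the k best-scoring items per user, so the full m x n score
// matrix is never materialized.
def recommendTopK(
    userFactors: Array[(Int, Array[Float])],
    itemFactors: Array[(Int, Array[Float])],
    k: Int): Map[Int, Array[(Int, Float)]] = {
  userFactors.map { case (userId, uf) =>
    // min-heap on score: the head is the weakest of the current top k
    val ord = Ordering.by[(Int, Float), Float](_._2).reverse
    val queue = mutable.PriorityQueue.empty[(Int, Float)](ord)
    itemFactors.foreach { case (itemId, itf) =>
      var score = 0.0f
      var i = 0
      while (i < uf.length) { score += uf(i) * itf(i); i += 1 }
      if (queue.size < k) {
        queue.enqueue((itemId, score))
      } else if (score > queue.head._2) {
        queue.dequeue()
        queue.enqueue((itemId, score))
      }
    }
    userId -> queue.dequeueAll.reverse.toArray
  }.toMap
}
{code}

Only k entries per user are ever retained, rather than the full m x n score block.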

> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method. 
> Each task uses the following code to keep its temp results:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (default value, no method to set)
> so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a lot of memory, and it 
> causes serious GC problems and frequent OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend the 
> topK (about 10 or 20) products for each user; then we only need 4k * topK 
> * (4 + 4 + 8) memory to save the temp results.
> I have written a solution for this method, with the following test results. 
> The Test Environment:
> 3 workers, each with 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize: 1024 2048 4096 8192
> Old method: 245s 332s 488s OOM
> This solution: 121s 118s 117s 120s
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-24 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981134#comment-15981134
 ] 

Nick Pentreath commented on SPARK-20446:


It would also be good to compare to the new {{DataFrame}} version in 
{{ml.recommendation.ALSModel}}. I found it faster than the old blocked version 
for smaller data. But I think lowering the block size as you did in SPARK-20443 
could change that result.

> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method. 
> Each task uses the following code to keep its temp results:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (default value, no method to set)
> so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a lot of memory, and it 
> causes serious GC problems and frequent OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend the 
> topK (about 10 or 20) products for each user; then we only need 4k * topK 
> * (4 + 4 + 8) memory to save the temp results.
> I have written a solution for this method, with the following test results. 
> The Test Environment:
> 3 workers, each with 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize: 1024 2048 4096 8192
> Old method: 245s 332s 488s OOM
> This solution: 121s 118s 117s 120s
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-24 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981129#comment-15981129
 ] 

Nick Pentreath commented on SPARK-20446:


Anyway I'd like to compare the approaches and see which is most efficient. 

> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method. 
> Each task uses the following code to keep its temp results:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (default value, no method to set)
> so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a lot of memory, and it 
> causes serious GC problems and frequent OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend the 
> topK (about 10 or 20) products for each user; then we only need 4k * topK 
> * (4 + 4 + 8) memory to save the temp results.
> I have written a solution for this method, with the following test results. 
> The Test Environment:
> 3 workers, each with 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize: 1024 2048 4096 8192
> Old method: 245s 332s 488s OOM
> This solution: 121s 118s 117s 120s
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-24 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981122#comment-15981122
 ] 

Nick Pentreath commented on SPARK-20446:


The GC would come from the temp result array in the BLAS3 case. The new result 
array per group {{new Array[(Int, (Int, Double))](m * n)}} is the same. I think 
the temp result array could be pre-allocated per partition to eliminate the GC 
issue for that part of the computation. That was my next efficiency change to 
look into for this.

It could be that the combo of the above with the priority queue could be best?

> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method. 
> Each task uses the following code to keep its temp results:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (default value, no method to set)
> so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a lot of memory, and it 
> causes serious GC problems and frequent OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend the 
> topK (about 10 or 20) products for each user; then we only need 4k * topK 
> * (4 + 4 + 8) memory to save the temp results.
> I have written a solution for this method, with the following test results. 
> The Test Environment:
> 3 workers, each with 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize: 1024 2048 4096 8192
> Old method: 245s 332s 488s OOM
> This solution: 121s 118s 117s 120s
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-24 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981066#comment-15981066
 ] 

Nick Pentreath commented on SPARK-20446:


This is really a duplicate of https://issues.apache.org/jira/browse/SPARK-11968

> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method. 
> Each task uses the following code to keep its temp results:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (default value, no method to set)
> so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a lot of memory, and it 
> causes serious GC problems and frequent OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend the 
> topK (about 10 or 20) products for each user; then we only need 4k * topK 
> * (4 + 4 + 8) memory to save the temp results.
> I have written a solution for this method, with the following test results. 
> The Test Environment:
> 3 workers, each with 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize: 1024 2048 4096 8192
> Old method: 245s 332s 488s OOM
> This solution: 121s 118s 117s 120s
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15978157#comment-15978157
 ] 

Nick Pentreath commented on SPARK-20392:


cc [~viirya] 

> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
> Attachments: blockbuster.csv, 
> giant_query_plan_for_fitting_pipeline.txt
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here are a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_bucketizer_f3ec5dae079b
> 093_bucketizer_809fab77eee1
> 094_bucketizer_6925831511e6
> 095_bucketizer_c5d853b95707
> 096_bucketizer_e677659ca253
> 097_bucketizer_396e35548c72
> 098_bucketizer_78a6410d7a84
> 099_bucketizer_e3ae6e54bca1
> 100_bucketizer_9fed5923fe8a
> 101_bucketizer_8925ba4c3ee2
> 102_bucketizer_95750b

[jira] [Resolved] (SPARK-20097) Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR

2017-04-11 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-20097.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17431
[https://github.com/apache/spark/pull/17431]

> Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and 
> GLR
> ---
>
> Key: SPARK-20097
> URL: https://issues.apache.org/jira/browse/SPARK-20097
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Benjamin Fradet
>Priority: Trivial
> Fix For: 2.2.0
>
>
> - numInstances is public in lr and regression private in glr
> - degreesOfFreedom is private in lr and public in glr



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: How to convert Spark MLlib vector to ML Vector?

2017-04-09 Thread Nick Pentreath
Why not use the RandomForest from Spark ML?
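
For example, a rough sketch that stays in spark.ml end to end (reusing the
trainingDF and column names from the code quoted below):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.util.MLUtils

// RDD[LabeledPoint].toDF gives mllib vectors; convert the "features" column first
val mlTrainingDF = MLUtils.convertVectorColumnsToML(trainingDF, "features")

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(100)

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("pcaFeatures")
  .setNumTrees(10)
  .setMaxDepth(20)

val model = new Pipeline().setStages(Array(pca, rf)).fit(mlTrainingDF)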

On Sun, 9 Apr 2017 at 16:01, Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:

> I have already posted this question to StackOverflow.
> However, I'm not getting any response there. I'm trying to use
> RandomForest algorithm for the classification after applying the PCA
> technique since the dataset is pretty high-dimensional. Here's my source
> code:
>
> import org.apache.spark.mllib.util.MLUtils
> import org.apache.spark.mllib.tree.RandomForest
> import org.apache.spark.mllib.tree.model.RandomForestModel
> import org.apache.spark.mllib.regression.LabeledPoint
> import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
> import org.apache.spark.sql._
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.SparkSession
>
> import org.apache.spark.ml.feature.PCA
> import org.apache.spark.rdd.RDD
>
> object PCAExample {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder
>   .master("local[*]")
>   .config("spark.sql.warehouse.dir", "E:/Exp/")
>   .appName(s"OneVsRestExample")
>   .getOrCreate()
>
> val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")
>
> val splits = dataset.randomSplit(Array(0.7, 0.3), seed = 12345L)
> val (trainingData, testData) = (splits(0), splits(1))
>
> val sqlContext = new SQLContext(spark.sparkContext)
> import sqlContext.implicits._
> val trainingDF = trainingData.toDF("label", "features")
>
> val pca = new PCA()
>   .setInputCol("features")
>   .setOutputCol("pcaFeatures")
>   .setK(100)
>   .fit(trainingDF)
>
> val pcaTrainingData = pca.transform(trainingDF)
> //pcaTrainingData.show()
>
> val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(
>   row.getAs[Double]("label"),
>   row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")))
>
> //val labeled = pca.transform(trainingDF).rdd.map(row => 
> LabeledPoint(row.getAs[Double]("label"),
> //  
> Vector.fromML(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features"
>
> val numClasses = 10
> val categoricalFeaturesInfo = Map[Int, Int]()
> val numTrees = 10 // Use more in practice.
> val featureSubsetStrategy = "auto" // Let the algorithm choose.
> val impurity = "gini"
> val maxDepth = 20
> val maxBins = 32
>
> val model = RandomForest.trainClassifier(labeled, numClasses, 
> categoricalFeaturesInfo,
>   numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
>   }
> }
>
> However, I'm getting the following error:
>
> *Exception in thread "main" java.lang.IllegalArgumentException:
> requirement failed: Column features must be of type
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually
> org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.*
>
> What am I doing wrong in my code?  Actually, I'm getting the above
> exception in this line:
>
> val pca = new PCA()
>   .setInputCol("features")
>   .setOutputCol("pcaFeatures")
>   .setK(100)
>   .fit(trainingDF) /// GETTING EXCEPTION HERE
>
> Please, someone, help me to solve the problem.
>
>
>
>
>
> Kind regards,
> *Md. Rezaul Karim*
>


Re: Spark 2.1 ml library scalability

2017-04-07 Thread Nick Pentreath
It's true that CrossValidator is not parallel currently - see
https://issues.apache.org/jira/browse/SPARK-19357 and feel free to help
review.

On Fri, 7 Apr 2017 at 14:18 Aseem Bansal  wrote:

>
>- Limited the data to 100,000 records.
>- 6 categorical feature which go through imputation, string indexing,
>one hot encoding. The maximum classes for the feature is 100. As data is
>imputated it becomes dense.
>- 1 numerical feature.
>- Training Logistic Regression through CrossValidation with grid to
>optimize its regularization parameter over the values 0.0001, 0.001, 0.005,
>0.01, 0.05, 0.1
>- Using spark's launcher api to launch it on a yarn cluster in Amazon
>AWS.
>
> I was thinking that as CrossValidator is finding the best parameters it
> should be able to run them independently. That sounds like something which
> could be ran in parallel.
>
>
> On Fri, Apr 7, 2017 at 5:20 PM, Nick Pentreath 
> wrote:
>
> What is the size of training data (number examples, number features)?
> Dense or sparse features? How many classes?
>
> What commands are you using to submit your job via spark-submit?
>
> On Fri, 7 Apr 2017 at 13:12 Aseem Bansal  wrote:
>
> When using spark ml's LogisticRegression, RandomForest, CrossValidator
> etc. do we need to give any consideration while coding in making it scale
> with more CPUs or does it scale automatically?
>
> I am reading some data from S3, using a pipeline to train a model. I am
> running the job on a spark cluster with 36 cores and 60GB RAM and I cannot
> see much usage. It is running but I was expecting spark to use all RAM
> available and make it faster. So that's why I was thinking whether we need
> to take something particular in consideration or wrong expectations?
>
>
>


Re: Spark 2.1 ml library scalability

2017-04-07 Thread Nick Pentreath
What is the size of training data (number examples, number features)? Dense
or sparse features? How many classes?

What commands are you using to submit your job via spark-submit?

On Fri, 7 Apr 2017 at 13:12 Aseem Bansal  wrote:

> When using spark ml's LogisticRegression, RandomForest, CrossValidator
> etc. do we need to give any consideration while coding in making it scale
> with more CPUs or does it scale automatically?
>
> I am reading some data from S3, using a pipeline to train a model. I am
> running the job on a spark cluster with 36 cores and 60GB RAM and I cannot
> see much usage. It is running but I was expecting spark to use all RAM
> available and make it faster. So that's why I was thinking whether we need
> to take something particular in consideration or wrong expectations?
>


[jira] [Commented] (SPARK-4038) Outlier Detection Algorithm for MLlib

2017-04-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960537#comment-15960537
 ] 

Nick Pentreath commented on SPARK-4038:
---

I don't think there can be a reasonable expectation of committer bandwidth for 
this right now or in the foreseeable future. I think it would be best to close 
this as "Wont Fix". If someone wants to do it as a Spark package and put a link 
here that would be welcome too.

cc [~srowen] [~josephkb] ?

> Outlier Detection Algorithm for MLlib
> -
>
> Key: SPARK-4038
> URL: https://issues.apache.org/jira/browse/SPARK-4038
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Ashutosh Trivedi
>Priority: Minor
>
> The aim of this JIRA is to discuss about which parallel outlier detection 
> algorithms can be included in MLlib. 
> The one which I am familiar with is Attribute Value Frequency (AVF). It 
> scales linearly with the number of data points and attributes, and relies on 
> a single data scan. It is not distance based and well suited for categorical 
> data. In original paper  a parallel version is also given, which is not 
> complected to implement.  I am working on the implementation and soon submit 
> the initial code for review.
> Here is the Link for the paper
> http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382
> As pointed out by Xiangrui in discussion 
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html
> There are other algorithms also. Lets discuss about which will be more 
> general and easily paralleled.
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17716) Hidden Markov Model (HMM)

2017-04-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960534#comment-15960534
 ] 

Nick Pentreath commented on SPARK-17716:


I don't think we can expect sufficient committer bandwidth to accommodate this. 
I'd tend to put this in the same camp as "deep learning on Spark", where it 
might be best implemented as a Spark package. And perhaps RNNs are becoming 
more widely used in the ML space too?

cc [~josephkb] [~srowen]?

> Hidden Markov Model (HMM)
> -
>
> Key: SPARK-17716
> URL: https://issues.apache.org/jira/browse/SPARK-17716
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Runxin Li
>
> Had an offline chat with [~Lil'Rex], who implemented HMM on Spark at 
> https://github.com/apache/spark/compare/master...lilrex:sequence. I asked him 
> to list popular HMM applications, describe public API (params, input/output 
> schemas), compare its API with existing HMM implementations.
> h1. Hidden Markov Model (HMM) Design Doc
> h2. Overview
> h3. Introduction to HMM
> Hidden Markov Model is a type of statistical Machine Learning model that 
> assumes a sequence of observations is generated by a Markov process with 
> hidden states. There are 3 (or 2, depending on the implementation) main 
> components of the model:
> * *Transition Probability*: describes the probability distribution of 
> transitions from each state to other states (including self) in the Markov 
> process
> * *Emission Probability*: describes the probability distribution for an 
> observation associated with hidden states
> * *Initial/Start Probability* (optional): represents the prior probability of 
> each state at the beginning of the observation sequence
> _Note: some implementations merge the Initial Probability into Transition 
> Probability by adding an arbitrary Start state before the first observation 
> point._
> h3. HMM Models and Algorithms
> Given a limited number of states, most HMM models have the same form of 
> Transition Probability: a matrix, where each element _(i, j)_ represents the 
> probability of transition from state _i_ to state _j_. The Initial 
> Probability usually take the simple form of a probabilistic vector.
> The Emission Probability, on the other hand, can be represented in many 
> different ways, depending on different nature of observations, i.e. 
> continuous vs. discrete, or different model assumptions, e.g. single Gaussian 
> vs. Gaussian Mixtures.
> There are three main problems associated with HMM models, and their canonical 
> algorithms:
> # *Evaluation*: What’s the probability of a given observation sequence, based 
> on the model? It’s usually done by either *Forward* or *Backward* algorithms
> # *Decoding*: What’s the most likely state sequence, given the observation 
> sequence and the model? It’s usually done by *Viterbi* decoding
> # *Learning*: How to train the parameters of the model based on the 
> observation sequences? *Baum-Welch* (Forward-Backward) is usually used as 
> part of the *EM* algorithm in unsupervised training
> h3. Popular Applications of HMM
> * Speech Recognition
> * Part-of-speech Tagging
> * Named Entity Recognition
> * Machine Translation
> * Gene Prediction
> h2. Alternate Libraries
> [Mahout|https://mahout.apache.org/users/classification/hidden-markov-models.html]
> * Assume emission probability to be an m-by-n matrix
> * Use Baum-Welch algorithm for training and Viterbi algorithm for prediction
> * API (command line)
> ** training
> {{$ echo "0 1 2 2 2 1 1 0 0 3 3 3 2 1 2 1 1 1 1 2 2 2 0 0 0 0 0 0 2 2 2 0 0 0 
> 0 0 0 2 2 2 3 3 3 3 3 3 2 3 2 3 2 3 2 1 3 0 0 0 1 0 1 0 2 1 2 1 2 1 2 3 3 3 3 
> 2 2 3 2 1 1 0" > hmm-input}}
> {{$ export MAHOUT_LOCAL=true}}
> {{$ $MAHOUT_HOME/bin/mahout baumwelch -i hmm-input -o hmm-model -nh 3 -no 4 
> -e .0001 -m 1000}}
> ** prediction
> {{$ $MAHOUT_HOME/bin/mahout hmmpredict -m hmm-model -o hmm-predictions -l 10}}
> {{$ cat hmm-predictions}}
> [Mallet|http://mallet.cs.umass.edu/api/cc/mallet/fst/HMM.html]
> * Treat HMM as a Finite State Transducer (FST)
> * Theoretically can go beyond first-order Markov assumption by setting an 
> arbitrary order
> * Limited to text data, i.e. discrete observation sequence with Multinomial 
> emission model assumption
> * Supervised training only
> * API:
> ** Training:
> {{HMM hmm = new HMM(pipe, null);}}
> {{hmm.addStatesForLabelsConnectedAsIn(trainingInstances);}}
> {{HMMTrainerByLikelihood trainer = new HMMTrainerByLikelih

[jira] [Commented] (SPARK-3903) Create general data loading method for LabeledPoints

2017-04-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960531#comment-15960531
 ] 

Nick Pentreath commented on SPARK-3903:
---

I think given the move to DataFrames, and that we can load {{libsvm}} in DF, 
this can be closed?
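
For reference, something along these lines (path and numFeatures are illustrative):

{code}
val df = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")
{code}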

> Create general data loading method for LabeledPoints
> 
>
> Key: SPARK-3903
> URL: https://issues.apache.org/jira/browse/SPARK-3903
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Proposal: Provide a more general data loading function for LabeledPoints.
> * load multiple data files (e.g., train + test), and ensure they have the 
> same number of features (determined based on a scan of the data)
> * use same function for multiple input formats
> Proposed function format (in MLUtils), with default parameters:
> {code}
> def loadLabeledPointsFiles(
> sc: SparkContext,
> paths: Seq[String],
> numFeatures = -1,
> vectorFormat = "auto",
> numPartitions = sc.defaultMinPartitions): Seq[RDD[LabeledPoint]]
> {code}
> About the parameters:
> * paths: list of paths to data files or folders with data files
> * vectorFormat options: dense/sparse/auto
> * numFeatures, numPartitions: same behavior as loadLibSVMFile
> Return value: Order of RDDs follows the order of the paths.
> Note: This is named differently from loadLabeledPoints for 2 reasons:
> * different argument order (following loadLibSVMFile)
> * different return type



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7674) R-like stats for ML models

2017-04-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960530#comment-15960530
 ] 

Nick Pentreath commented on SPARK-7674:
---

Is this JIRA still open? Can it be resolved? Or are there further outstanding 
issues. I think most or all of the model summaries have been done?

> R-like stats for ML models
> --
>
> Key: SPARK-7674
> URL: https://issues.apache.org/jira/browse/SPARK-7674
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for supporting ML model summaries and statistics, 
> following the example of R's summary() and plot() functions.
> [Design 
> doc|https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing]
> From the design doc:
> {quote}
> R and its well-established packages provide extensive functionality for 
> inspecting a model and its results.  This inspection is critical to 
> interpreting, debugging and improving models.
> R is arguably a gold standard for a statistics/ML library, so this doc 
> largely attempts to imitate it.  The challenge we face is supporting similar 
> functionality, but on big (distributed) data.  Data size makes both efficient 
> computation and meaningful displays/summaries difficult.
> R model and result summaries generally take 2 forms:
> * summary(model): Display text with information about the model and results 
> on data
> * plot(model): Display plots about the model and results
> We aim to provide both of these types of information.  Visualization for the 
> plottable results will not be supported in MLlib itself, but we can provide 
> results in a form which can be plotted easily with other tools.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12210) Small example that shows how to integrate spark.mllib with spark.ml

2017-04-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960523#comment-15960523
 ] 

Nick Pentreath commented on SPARK-12210:


Is this required any more? I'd say we are close enough to feature parity now 
that we can close this one.

> Small example that shows how to integrate spark.mllib with spark.ml
> ---
>
> Key: SPARK-12210
> URL: https://issues.apache.org/jira/browse/SPARK-12210
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> Since we are missing a number of algorithms in {{spark.ml}} such as 
> clustering or LDA, we should have a small example that shows the recommended 
> way to go back and forth between {{spark.ml}} and {{spark.mllib}}. It is 
> mostly putting together existing pieces, but I feel it is important for new 
> users to see how the interaction plays out in practice.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20076) Python interface for ml.stats.Correlation

2017-04-07 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-20076:
--

Assignee: Liang-Chi Hsieh

> Python interface for ml.stats.Correlation
> -
>
> Key: SPARK-20076
> URL: https://issues.apache.org/jira/browse/SPARK-20076
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.0
>
>
> The (Pearson) statistics have been exposed with a Dataframe interface as part 
> of SPARK-19636 in the Scala interface. We should now make these available in 
> Python.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20076) Python interface for ml.stats.Correlation

2017-04-07 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-20076.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17494
[https://github.com/apache/spark/pull/17494]

> Python interface for ml.stats.Correlation
> -
>
> Key: SPARK-20076
> URL: https://issues.apache.org/jira/browse/SPARK-20076
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
> Fix For: 2.2.0
>
>
> The (Pearson) statistics have been exposed with a Dataframe interface as part 
> of SPARK-19636 in the Scala interface. We should now make these available in 
> Python.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19979) [MLLIB] Multiple Estimators/Pipelines In CrossValidator

2017-04-06 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958492#comment-15958492
 ] 

Nick Pentreath commented on SPARK-19979:


I think we could add a note to the user guide. However I do agree that it's not 
super intuitive or user-friendly from an API perspective and we should think 
about adding a better API for this, as it is definitely a very common use case.

> [MLLIB] Multiple Estimators/Pipelines In CrossValidator
> ---
>
> Key: SPARK-19979
> URL: https://issues.apache.org/jira/browse/SPARK-19979
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: David Leifker
>
> Update CrossValidator and TrainValidationSplit to be able to accept multiple 
> pipelines and grid parameters for testing different algorithms and/or being 
> able to better control tuning combinations. Maintains backwards compatible 
> API and reads legacy serialized objects.
> The same could be done using an external iterative approach. Build different 
> pipelines, throwing each into a CrossValidator, and then taking the best 
> model from each of those CrossValidators. Then finally picking the best from 
> those. This is the initial approach I explored. It resulted in a lot of 
> boiler plate code that felt like it shouldn't need to exist if the api simply 
> allowed for arrays of estimators and their parameters.
> A couple advantages to this implementation to consider come from keeping the 
> functional interface to the CrossValidator.
> 1. The caching of the folds is better utilized. An external iterative 
> approach creates a new set of k folds for each CrossValidator fit and the 
> folds are discarded after each CrossValidator run. In this implementation a 
> single set of k folds is created and cached for all of the pipelines.
> 2. A potential advantage of using this implementation is for future 
> parallelization of the pipelines within the CrossValdiator. It is of course 
> possible to handle the parallelization outside of the CrossValidator here 
> too, however I believe there is already work in progress to parallelize the 
> grid parameters and that could be extended to multiple pipelines.
> Both of those behind-the-scenes optimizations are possible because of 
> providing the CrossValidator with the data and the complete set of 
> pipelines/estimators to evaluate up front allowing one to abstract away the 
> implementation.
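
For illustration, a rough sketch (hypothetical names, binary classification 
assumed) of the external iterative approach described above - one CrossValidator 
per pipeline, then picking the best of the best - which is exactly the boilerplate 
the proposal aims to remove:

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  import org.apache.spark.ml.param.ParamMap
  import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel}
  import org.apache.spark.sql.DataFrame

  // pipelines and paramGrids are assumed to be built elsewhere, one grid per pipeline
  def bestOfBest(
      pipelines: Seq[Pipeline],
      paramGrids: Seq[Array[ParamMap]],
      train: DataFrame): CrossValidatorModel = {
    val evaluator = new BinaryClassificationEvaluator()
    val fitted = pipelines.zip(paramGrids).map { case (pipeline, grid) =>
      val cv = new CrossValidator()
        .setEstimator(pipeline)
        .setEstimatorParamMaps(grid)
        .setEvaluator(evaluator)
        .setNumFolds(3)
      // each fit() creates and caches its own set of k folds, then discards them
      val model = cv.fit(train)
      (model, model.avgMetrics.max)
    }
    // finally pick the best CrossValidatorModel across all pipelines
    fitted.maxBy(_._2)._1
  }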



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19953) RandomForest Models should use the UID of Estimator when fit

2017-04-06 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-19953:
--

Assignee: Bryan Cutler

> RandomForest Models should use the UID of Estimator when fit
> 
>
> Key: SPARK-19953
> URL: https://issues.apache.org/jira/browse/SPARK-19953
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently, RandomForestClassificationModel and RandomForestRegressionModel 
> use the alternate constructor which creates a new random UID instead of using 
> the parent estimator's UID.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19953) RandomForest Models should use the UID of Estimator when fit

2017-04-06 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19953.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17296
[https://github.com/apache/spark/pull/17296]

> RandomForest Models should use the UID of Estimator when fit
> 
>
> Key: SPARK-19953
> URL: https://issues.apache.org/jira/browse/SPARK-19953
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Bryan Cutler
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently, RandomForestClassificationModel and RandomForestRegressionModel 
> use the alternate constructor which creates a new random UID instead of using 
> the parent estimator's UID.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan

2017-04-04 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954904#comment-15954904
 ] 

Nick Pentreath commented on SPARK-20203:


I see there is a comment in the code that says: {{// TODO: support unbounded 
pattern length when maxPatternLength = 0}}. But the same thing can essentially 
be achieved by setting the pattern length to {{Int.MaxValue}}, as Sean has 
previously said, so I don't really think this is a valid work item (in fact 
that comment should probably be removed).

Is an unbounded default really better (or worse) from an API / user facing 
perspective? There are arguments either way but to be honest I see nothing 
compelling enough to warrant a change here.
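
For reference, the workaround is a one-liner with the existing mllib API (a 
minimal sketch):

  import org.apache.spark.mllib.fpm.PrefixSpan

  // effectively "unbounded": bound the pattern length by Int.MaxValue instead of the default 10
  val prefixSpan = new PrefixSpan()
    .setMinSupport(0.1)
    .setMaxPatternLength(Int.MaxValue)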

> Change default maxPatternLength value to Int.MaxValue in PrefixSpan
> ---
>
> Key: SPARK-20203
> URL: https://issues.apache.org/jira/browse/SPARK-20203
> Project: Spark
>  Issue Type: Wish
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Cyril de Vogelaere
>Priority: Trivial
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> I think changing the default value to Int.MaxValue would be more 
> user-friendly, at least for new users.
> Personally, when I run an algorithm, I expect it to find all solutions by 
> default, and a limited number of them when I set the parameters to do so.
> The current implementation limits the length of solution patterns to 10,
> thus preventing all solutions from being found when running on slightly 
> larger datasets.
> I feel like that should be changed, but since this would change the default 
> behavior of PrefixSpan, I think asking for the community's opinion should 
> come first. So, what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20047) Constrained Logistic Regression

2017-04-03 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953551#comment-15953551
 ] 

Nick Pentreath commented on SPARK-20047:


Is this really targeted for 2.2.0? 

> Constrained Logistic Regression
> ---
>
> Key: SPARK-20047
> URL: https://issues.apache.org/jira/browse/SPARK-20047
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: DB Tsai
>Assignee: Yanbo Liang
>
> For certain applications, such as stacked regressions, it is important to put 
> non-negative constraints on the regression coefficients. Also, if the ranges 
> of coefficients are known, it makes sense to constrain the coefficient search 
> space.
> Fitting generalized constrained regression models subject to Cβ ≤ b, where C ∈ 
> R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors that place a 
> set of m linear constraints on the coefficients, is very challenging, as 
> discussed widely in the literature. 
> However, for box constraints on the coefficients, the optimization is well 
> solved. For gradient descent, one can use projected gradient descent in the 
> primal by zeroing the negative weights at each step. For LBFGS, an extended 
> version of it, LBFGS-B, can handle large-scale box optimization efficiently. 
> Unfortunately, for OWLQN, there is no good efficient way to do optimization 
> with box constraints.
> As a result, in this work, we only implement constrained LR with box 
> constraints and without L1 regularization. 
> Note that since we standardize the data in the training phase, the 
> coefficients seen in the optimization subroutine are in the scaled space; as 
> a result, we need to convert the box constraints into the scaled space.
> Users will be able to set the lower / upper bounds of each coefficient and 
> of the intercepts.
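
For illustration, a rough sketch of what the user-facing API could look like for a 
3-feature binary problem (setter names assumed from the proposal, not necessarily 
the final merged API):

  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.linalg.{Matrices, Vectors}

  // assumed API: bounds are specified in the original feature space; the
  // implementation converts them into the standardized space internally
  val lr = new LogisticRegression()
    .setRegParam(0.1)
    .setElasticNetParam(0.0)  // box constraints only work without L1 regularization
    .setLowerBoundsOnCoefficients(Matrices.dense(1, 3, Array(0.0, 0.0, 0.0)))  // non-negative coefficients
    .setLowerBoundsOnIntercepts(Vectors.dense(0.0))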



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19969) Doc and examples for Imputer

2017-04-03 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-19969:
--

Assignee: yuhao yang

> Doc and examples for Imputer
> 
>
> Key: SPARK-19969
> URL: https://issues.apache.org/jira/browse/SPARK-19969
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.2.0
>    Reporter: Nick Pentreath
>Assignee: yuhao yang
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19969) Doc and examples for Imputer

2017-04-03 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19969.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17324
[https://github.com/apache/spark/pull/17324]

> Doc and examples for Imputer
> 
>
> Key: SPARK-19969
> URL: https://issues.apache.org/jira/browse/SPARK-19969
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.2.0
>    Reporter: Nick Pentreath
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19985) Some ML Models error when copy or do not set parent

2017-04-03 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-19985:
--

Assignee: Bryan Cutler

> Some ML Models error when copy or do not set parent
> ---
>
> Key: SPARK-19985
> URL: https://issues.apache.org/jira/browse/SPARK-19985
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.2.0
>
>
> Some ML Models fail when copied due to not having a default constructor and 
> implementing {{copy}} with {{defaultCopy}}.  Other cases do not properly set 
> the parent when the model is copied.  These models were missing the normal 
> check that tests for these in the test suites.
> Models with issues are:
> * RFormulaModel
> * MultilayerPerceptronClassificationModel
> * BucketedRandomProjectionLSHModel
> * MinHashLSH



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19985) Some ML Models error when copy or do not set parent

2017-04-03 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19985.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17326
[https://github.com/apache/spark/pull/17326]

> Some ML Models error when copy or do not set parent
> ---
>
> Key: SPARK-19985
> URL: https://issues.apache.org/jira/browse/SPARK-19985
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Bryan Cutler
> Fix For: 2.2.0
>
>
> Some ML Models fail when copied due to not having a default constructor and 
> implementing {{copy}} with {{defaultCopy}}.  Other cases do not properly set 
> the parent when the model is copied.  These models were missing the normal 
> check that tests for these in the test suites.
> Models with issues are:
> * RFormulaModel
> * MultilayerPerceptronClassificationModel
> * BucketedRandomProjectionLSHModel
> * MinHashLSH



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Collaborative filtering steps in spark

2017-03-29 Thread Nick Pentreath
No, it does a random initialization. It does use a slightly different
approach from purely normal random draws - it chooses non-negative draws, which
gives very slightly better results empirically.

In practice I'm not sure if the average rating approach will make a big
difference (it's been a long while since I read the paper!)

Sean put the absolute value init stuff in originally if I recall so may
have more context.

Though in fact looking at the code now, I see the comment still says that,
but I'm not convinced the code actually does it:

/**
 * Initializes factors randomly given the in-link blocks.
 *
 * @param inBlocks in-link blocks
 * @param rank rank
 * @return initialized factor blocks
 */
private def initialize[ID](
    inBlocks: RDD[(Int, InBlock[ID])],
    rank: Int,
    seed: Long): RDD[(Int, FactorBlock)] = {
  // Choose a unit vector uniformly at random from the unit sphere, but from the
  // "first quadrant" where all elements are nonnegative. This can be done by choosing
  // elements distributed as Normal(0,1) and taking the absolute value, and then normalizing.
  // This appears to create factorizations that have a slightly better reconstruction
  // (<1%) compared picking elements uniformly at random in [0,1].
  inBlocks.map { case (srcBlockId, inBlock) =>
    val random = new XORShiftRandom(byteswap64(seed ^ srcBlockId))
    val factors = Array.fill(inBlock.srcIds.length) {
      val factor = Array.fill(rank)(random.nextGaussian().toFloat)
      val nrm = blas.snrm2(rank, factor, 1)
      blas.sscal(rank, 1.0f / nrm, factor, 1)
      factor
    }
    (srcBlockId, factors)
  }
}


factor is ~ N(0, 1) and then scaled by the L2 norm, but it looks to me like the
absolute value is never taken before scaling, which is what the comment
indicates should happen...
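
For illustration only, a minimal sketch (not the actual Spark source, reusing the
names from the snippet above) of what the comment describes - taking the absolute
value of each Gaussian draw before normalizing:

  val factor = Array.fill(rank)(math.abs(random.nextGaussian()).toFloat)
  val nrm = blas.snrm2(rank, factor, 1)
  blas.sscal(rank, 1.0f / nrm, factor, 1)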


On Mon, 27 Mar 2017 at 00:55 chris snow  wrote:

> In the paper “Large-Scale Parallel Collaborative Filtering for the
> Netflix Prize”, the following steps are described for ALS:
>
> Step 1 Initialize matrix M by assigning the average rating for that
> movie as the first row, and
> small random numbers for the remaining entries.
> Step 2 Fix M, Solve U by minimizing the objective function (the sum of
> squared errors);
> Step 3 Fix U, solve M by minimizing the objective function similarly;
> Step 4 Repeat Steps 2 and 3 until a stopping criterion is satisfied.
>
> Does spark take the average rating for the movie as the first row?
> I've looked through the source code, but I can't see the average
> rating being calculated for the movie.
>
> Many thanks,
>
> Chris
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


[jira] [Commented] (SPARK-14174) Accelerate KMeans via Mini-Batch EM

2017-03-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947114#comment-15947114
 ] 

Nick Pentreath commented on SPARK-14174:


The actual fix in the PR is pretty small - essentially just adding an 
{{rdd.sample}} call (similar to the old {{mllib}} gradient descent impl). So if 
we can see some good speed improvements on a relatively large class of input 
datasets, this seems like an easy win. From the performance tests above it 
seems like there's a significant win even for low-dimensional vectors. For 
higher dimensions the improvement may be as large or perhaps larger.

[~podongfeng] it may be best to add a few different cases to the performance 
tests to illustrate the behavior in each setting (and if the speedup doesn't hold 
for certain cases, we should document that):

# small dimension, dense
# high dimension, dense
# small dimension, sparse
# high dimension, sparse

[~rnowling] do you have time to check out the PR here? It seems similar in 
spirit to what you had done and just uses the built-in RDD sampling (which I 
think [~derrickburns] mentioned in SPARK-2308).
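
For illustration, the core of the change is roughly the following (a sketch with
hypothetical names, not the exact PR code): draw a fresh random sample of the input
each iteration and run the usual assignment/update step only on that mini-batch.

  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  def miniBatches(data: RDD[Vector], miniBatchFraction: Double,
                  maxIterations: Int, seed: Long): Seq[RDD[Vector]] = {
    (0 until maxIterations).map { iter =>
      // vary the seed per iteration so each mini-batch is a different random subset
      data.sample(withReplacement = false, miniBatchFraction, seed + iter)
    }
  }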

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly sampled in each training iteration. These mini-batches 
> drastically reduce the amount of computation required to converge to a local 
> solution. In contrast to other algorithms that reduce the convergence time of 
> k-means, mini-batch k-means produces results that are generally only slightly 
> worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really 
> significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> The above three training runs all reached the max number of iterations (10).
> We can see that the WSSSEs are almost the same, while their speed differs 
> significantly:
> {code}
> KMeans                              2876 sec
> MiniBatch KMeans (fraction=0.1)      263 sec
> MiniBatch KMeans (fraction=0.01)      90 sec
> {code}
> With an appropriate fraction, the bigger the dataset, the higher the speedup.
> The data used above have 8,100,000 samples, 784 features. It can be 
> downloaded here 
> (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
> Comparison of the K-Means and MiniBatchKMeans on sklearn : 
> http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15040) PySpark impl for ml.feature.Imputer

2017-03-24 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-15040:
--

Assignee: Nick Pentreath

> PySpark impl for ml.feature.Imputer
> ---
>
> Key: SPARK-15040
> URL: https://issues.apache.org/jira/browse/SPARK-15040
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>    Assignee: Nick Pentreath
>Priority: Minor
> Fix For: 2.2.0
>
>
> PySpark impl for ml.feature.Imputer.
> This needs to wait until the PR for SPARK-13568 gets merged.
> https://github.com/apache/spark/pull/11601



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15040) PySpark impl for ml.feature.Imputer

2017-03-24 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15040.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17316
[https://github.com/apache/spark/pull/17316]

> PySpark impl for ml.feature.Imputer
> ---
>
> Key: SPARK-15040
> URL: https://issues.apache.org/jira/browse/SPARK-15040
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
> Fix For: 2.2.0
>
>
> PySpark impl for ml.feature.Imputer.
> This needs to wait until the PR for SPARK-13568 gets merged.
> https://github.com/apache/spark/pull/11601



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Collaborative Filtering - scaling of the regularization parameter

2017-03-23 Thread Nick Pentreath
I usually advocate a JIRA even for small stuff but for doc only change like
this it's ok to submit a PR directly with [MINOR] in title.


On Thu, 23 Mar 2017 at 06:55, chris snow  wrote:

> Thanks Nick.  If this will help other users, I'll create a JIRA and
> send a patch.
>
> On 23 March 2017 at 13:49, Nick Pentreath 
> wrote:
> > Yup, that is true and a reasonable clarification of the doc.
> >
> > On Thu, 23 Mar 2017 at 00:03 chris snow  wrote:
> >>
> >> The documentation for collaborative filtering is as follows:
> >>
> >> ===
> >> Scaling of the regularization parameter
> >>
> >> Since v1.1, we scale the regularization parameter lambda in solving
> >> each least squares problem by the number of ratings the user generated
> >> in updating user factors, or the number of ratings the product
> >> received in updating product factors.
> >> ===
> >>
> >> I find this description confusing, probably because I lack a detailed
> >> understanding of ALS.   The wording suggests that the number of ratings
> >> changes ("generated", "received") while solving the least squares.
> >>
> >> This is how I think I should be interpreting the description:
> >>
> >> ===
> >> Since v1.1, we scale the regularization parameter lambda when solving
> >> each least squares problem.  When updating the user factors, we scale
> >> the regularization parameter by the total number of ratings from the
> >> user.  Similarly, when updating the product factors, we scale the
> >> regularization parameter by the total number of ratings for the
> >> product.
> >> ===
> >>
> >> Have I understood this correctly?
> >>
> >> -
> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >>
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Collaborative Filtering - scaling of the regularization parameter

2017-03-23 Thread Nick Pentreath
Yup, that is true and a reasonable clarification of the doc.
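
For concreteness, the scaling being discussed (the ALS-WR scheme) affects each
per-user / per-item least-squares solve roughly as follows - a sketch with
hypothetical names, not the actual ALS internals:

  // the ridge term added to one user's normal equations is lambda scaled by that
  // user's rating count n_u (and likewise for items):
  //   (Y^T Y + regParam * n_u * I) x_u = Y^T r_u
  def scaledRegularization(regParam: Double, numRatings: Int): Double =
    regParam * numRatings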

On Thu, 23 Mar 2017 at 00:03 chris snow  wrote:

> The documentation for collaborative filtering is as follows:
>
> ===
> Scaling of the regularization parameter
>
> Since v1.1, we scale the regularization parameter lambda in solving
> each least squares problem by the number of ratings the user generated
> in updating user factors, or the number of ratings the product
> received in updating product factors.
> ===
>
> I find this description confusing, probably because I lack a detailed
> understanding of ALS.   The wording suggests that the number of ratings
> changes ("generated", "received") while solving the least squares.
>
> This is how I think I should be interpreting the description:
>
> ===
> Since v1.1, we scale the regularization parameter lambda when solving
> each least squares problem.  When updating the user factors, we scale
> the regularization parameter by the total number of ratings from the
> user.  Similarly, when updating the product factors, we scale the
> regularization parameter by the total number of ratings for the
> product.
> ===
>
> Have I understood this correctly?
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


[jira] [Updated] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-22 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-20043:
---
Labels: starter  (was: )

> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>  Labels: starter
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> Paramgrid to test "gini" and "entropy" impurity. CrossValidatorModel are not 
> able to load the saved model, when impurity are written not in lowercase. I 
> obtain an error from Spark "impurity Gini (Entropy) not recognized.
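
As a workaround until this is fixed, the grid can be built with the lowercase
values the loader expects - a minimal sketch, assuming a DecisionTreeClassifier
stage named dt:

  import org.apache.spark.ml.classification.DecisionTreeClassifier
  import org.apache.spark.ml.tuning.ParamGridBuilder

  val dt = new DecisionTreeClassifier()
  // use lowercase impurity values so the persisted CrossValidatorModel can be re-loaded
  val paramGrid = new ParamGridBuilder()
    .addGrid(dt.impurity, Array("gini", "entropy"))
    .build()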



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Outstanding Spark 2.1.1 issues

2017-03-21 Thread Nick Pentreath
As for SPARK-19759, I
don't think that needs to be targeted for 2.1.1, so we don't need to worry
about it.

On Tue, 21 Mar 2017 at 13:49 Holden Karau  wrote:

> I agree with Michael, I think we've got some outstanding issues but none
> of them seem like regression from 2.1 so we should be good to start the RC
> process.
>
> On Tue, Mar 21, 2017 at 1:41 PM, Michael Armbrust 
> wrote:
>
> Please speak up if I'm wrong, but none of these seem like critical
> regressions from 2.1.  As such I'll start the RC process later today.
>
> On Mon, Mar 20, 2017 at 9:52 PM, Holden Karau 
> wrote:
>
> I'm not super sure it should be a blocker for 2.1.1 -- is it a regression?
> Maybe we can get TDs input on it?
>
> On Mon, Mar 20, 2017 at 8:48 PM Nan Zhu  wrote:
>
> I think https://issues.apache.org/jira/browse/SPARK-19280 should be a
> blocker
>
> Best,
>
> Nan
>
> On Mon, Mar 20, 2017 at 8:18 PM, Felix Cheung 
> wrote:
>
> I've been scrubbing R and think we are tracking 2 issues
>
> https://issues.apache.org/jira/browse/SPARK-19237
>
> https://issues.apache.org/jira/browse/SPARK-19925
>
>
>
>
> --
> *From:* holden.ka...@gmail.com  on behalf of
> Holden Karau 
> *Sent:* Monday, March 20, 2017 3:12:35 PM
> *To:* dev@spark.apache.org
> *Subject:* Outstanding Spark 2.1.1 issues
>
> Hi Spark Developers!
>
> As we start working on the Spark 2.1.1 release I've been looking at our
> outstanding issues still targeted for it. I've tried to break it down by
> component so that people in charge of each component can take a quick look
> and see if any of these things can/should be re-targeted to 2.2 or 2.1.2 &
> the overall list is pretty short (only 9 items - 5 if we only look at
> explicitly tagged) :)
>
> If you're working on something for Spark 2.1.1 and it doesn't show up in
> this list please speak up now :) We have a lot of issues (including "in
> progress") that are listed as impacting 2.1.0, but they aren't targeted for
> 2.1.1 - if there is something you are working on there which should be
> targeted for 2.1.1 please let us know so it doesn't slip through the cracks.
>
> The query string I used for looking at the 2.1.1 open issues is:
>
> ((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion = 2.1.1
> OR cf[12310320] = "2.1.1") AND project = spark AND resolution = Unresolved
> ORDER BY priority DESC
>
> None of the open issues appear to be a regression from 2.1.0, but those
> seem more likely to show up during the RC process (thanks in advance to
> everyone testing their workloads :)) & generally none of them seem to be
>
> (Note: the cfs are for Target Version/s field)
>
> Critical Issues:
>  SQL:
>   SPARK-19690  - Join
> a streaming DataFrame with a batch DataFrame may not work - PR
> https://github.com/apache/spark/pull/17052 (review in progress by
> zsxwing, currently failing Jenkins)*
>
> Major Issues:
>  SQL:
>   SPARK-19035  - rand()
> function in case when cause failed - no outstanding PR (consensus on JIRA
> seems to be leaning towards it being a real issue but not necessarily
> everyone agrees just yet - maybe we should slip this?)*
>  Deploy:
>   SPARK-19522  - 
> --executor-memory
> flag doesn't work in local-cluster mode -
> https://github.com/apache/spark/pull/16975 (review in progress by vanzin,
> but PR currently stalled waiting on response) *
>  Core:
>   SPARK-20025  - Driver
> fail over will not work, if SPARK_LOCAL* env is set. -
> https://github.com/apache/spark/pull/17357 (waiting on review) *
>  PySpark:
>  SPARK-19955  - Update
> run-tests to support conda [ Part of Dropping 2.6 support -- which we
> shouldn't do in a minor release -- but also fixes pip installability tests
> to run in Jenkins ]-  PR failing Jenkins (I need to poke this some more,
> but seems like 2.7 support works but some other issues. Maybe slip to 2.2?)
>
> Minor issues:
>  Tests:
>   SPARK-19612  - Tests
> failing with timeout - No PR per-se but it seems unrelated to the 2.1.1
> release. It's not targetted for 2.1.1 but listed as affecting 2.1.1 - I'd
> consider explicitly targeting this for 2.2?
>  PySpark:
>   SPARK-19570  - Allow
> to disable hive in pyspark shell -
> https://github.com/apache/spark/pull/16906 PR exists but its difficult to
> add automated tests for this (although if SPARK-19955
>  gets in would make
> testing this easier) - no reviewers yet. Possible re-target?*
>  Structured Streaming:
>   SPARK-19613  - Flaky
> test: StateStoreRDDSuite.versioning and immutability - I

[jira] [Commented] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-21 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934905#comment-15934905
 ] 

Nick Pentreath commented on SPARK-20043:


I just noticed the error message you put above says "Entorpy" - is that a 
spelling mistake in the JIRA description or in your code?

> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> Paramgrid to test "gini" and "entropy" impurity. CrossValidatorModel are not 
> able to load the saved model, when impurity are written not in lowercase. I 
> obtain an error from Spark "impurity Gini (Entorpy) not recognized.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-21 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-20043:
---
Docs Text:   (was: I saved a CrossValidatorModel with a decision tree and a 
random forest. I use Paramgrid to test "gini" and "entropy" impurity. 
CrossValidatorModel are not able to load the saved model, when impurity are 
written not in lowercase. I obtain an error from Spark "impurity Gini (Entorpy) 
not recognized.)

> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> Paramgrid to test "gini" and "entropy" impurity. CrossValidatorModel are not 
> able to load the saved model, when impurity are written not in lowercase. I 
> obtain an error from Spark "impurity Gini (Entorpy) not recognized.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-21 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-20043:
---
Description: 
I saved a CrossValidatorModel with a decision tree and a random forest. I use 
Paramgrid to test "gini" and "entropy" impurity. CrossValidatorModel are not 
able to load the saved model, when impurity are written not in lowercase. I 
obtain an error from Spark "impurity Gini (Entorpy) not recognized.


> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> Paramgrid to test "gini" and "entropy" impurity. CrossValidatorModel are not 
> able to load the saved model, when impurity are written not in lowercase. I 
> obtain an error from Spark "impurity Gini (Entorpy) not recognized.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Contributing to Spark

2017-03-19 Thread Nick Pentreath
If you have experience and interest in Python then PySpark is a good area
to look into.

Yes, adding things like tests & documentation is a good starting point.
Start out relatively small and go from there. Adding new wrappers to Python
for ML is useful for slightly larger tasks.




On Mon, 20 Mar 2017 at 02:39, Sam Elamin  wrote:

> Hi All,
>
> I would like to start contributing to Spark if possible, its an amazing
> technology and I would love to get involved
>
>
> The contributing page  states
> this "consult the list of starter tasks in JIRA, or ask the
> user@spark.apache.org mailing list."
>
>
> Can anyone guide me on where is best to start? What are these "starter
> tasks"?
>
> I was thinking adding tests would be a good place to begin when dealing
> with any new code base, perhaps to PySpark since Scala seems a bit more
> stable
>
>
> Also - if at all possible - I would really appreciate if any of the
> contributors or PMC members would be willing to mentor or guide me in this.
> Any help would be greatly appreciated!
>
>
> Regards
> Sam
>
>
>


[jira] [Commented] (SPARK-19969) Doc and examples for Imputer

2017-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15928854#comment-15928854
 ] 

Nick Pentreath commented on SPARK-19969:


Ok - I can help on it but probably only some time next week.

> Doc and examples for Imputer
> 
>
> Key: SPARK-19969
> URL: https://issues.apache.org/jira/browse/SPARK-19969
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.2.0
>    Reporter: Nick Pentreath
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19979) [MLLIB] Multiple Estimators/Pipelines In CrossValidator

2017-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15928545#comment-15928545
 ] 

Nick Pentreath commented on SPARK-19979:


I wonder if this fits in as a sort of sub-task of SPARK-19071?

cc [~bryanc] as it relates to your work on SPARK-19357.

> [MLLIB] Multiple Estimators/Pipelines In CrossValidator
> ---
>
> Key: SPARK-19979
> URL: https://issues.apache.org/jira/browse/SPARK-19979
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: David Leifker
>
> Update CrossValidator and TrainValidationSplit to be able to accept multiple 
> pipelines and grid parameters for testing different algorithms and/or being 
> able to better control tuning combinations. Maintains backwards compatible 
> API and reads legacy serialized objects.
> The same could be done using an external iterative approach. Build different 
> pipelines, throwing each into a CrossValidator, and then taking the best 
> model from each of those CrossValidators. Then finally picking the best from 
> those. This is the initial approach I explored. It resulted in a lot of 
> boilerplate code that felt like it shouldn't need to exist if the API simply 
> allowed for arrays of estimators and their parameters.
> A couple of advantages of this implementation come from keeping the 
> functional interface to the CrossValidator.
> 1. The caching of the folds is better utilized. An external iterative 
> approach creates a new set of k folds for each CrossValidator fit and the 
> folds are discarded after each CrossValidator run. In this implementation a 
> single set of k folds is created and cached for all of the pipelines.
> 2. A potential advantage of using this implementation is for future 
> parallelization of the pipelines within the CrossValidator. It is of course 
> possible to handle the parallelization outside of the CrossValidator here 
> too, however I believe there is already work in progress to parallelize the 
> grid parameters and that could be extended to multiple pipelines.
> Both of those behind-the-scenes optimizations are possible because of 
> providing the CrossValidator with the data and the complete set of 
> pipelines/estimators to evaluate up front allowing one to abstract away the 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19969) Doc and examples for Imputer

2017-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15928537#comment-15928537
 ] 

Nick Pentreath commented on SPARK-19969:


No, I haven't done the doc or examples - I seem to recall you had already done 
some work on that?

> Doc and examples for Imputer
> 
>
> Key: SPARK-19969
> URL: https://issues.apache.org/jira/browse/SPARK-19969
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15040) PySpark impl for ml.feature.Imputer

2017-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15928180#comment-15928180
 ] 

Nick Pentreath commented on SPARK-15040:


Sorry, I did not see your comment - I opened a 
[PR|https://github.com/apache/spark/pull/17316] already.

> PySpark impl for ml.feature.Imputer
> ---
>
> Key: SPARK-15040
> URL: https://issues.apache.org/jira/browse/SPARK-15040
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> PySpark impl for ml.feature.Imputer.
> This need to wait until PR for SPARK-13568 gets merged.
> https://github.com/apache/spark/pull/11601



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19899) FPGrowth input column naming

2017-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15928193#comment-15928193
 ] 

Nick Pentreath commented on SPARK-19899:


+1 on {{itemsCol}} - feel free to send a PR :)

> FPGrowth input column naming
> 
>
> Key: SPARK-19899
> URL: https://issues.apache.org/jira/browse/SPARK-19899
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> The current implementation extends {{HasFeaturesCol}}. Personally I find that 
> rather unfortunate. Up to this point we have used consistent conventions - if we 
> mix in {{HasFeaturesCol}}, the {{featuresCol}} should be {{VectorUDT}}. 
> Using the same {{Param}} for an array (and possibly for an 
> array of arrays once {{PrefixSpan}} is ported to {{ml}}) will be 
> confusing for users.
> I would like to suggest adding a new {{trait}} (let's say 
> {{HasTransactionsCol}}) to clearly indicate that the input type differs from 
> the other {{Estimators}}.
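
For illustration, the suggested trait could look roughly like the existing
shared-param traits - a hypothetical sketch only (the column name ultimately
favoured in the comments above is itemsCol):

  import org.apache.spark.ml.param.{Param, Params}

  // in Spark this would live alongside the other shared params (and be package-private)
  trait HasItemsCol extends Params {
    final val itemsCol: Param[String] =
      new Param[String](this, "itemsCol", "column name for the transaction item sets")
    final def getItemsCol: String = $(itemsCol)
  }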



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13568) Create feature transformer to impute missing values

2017-03-16 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-13568:
--

Assignee: yuhao yang

> Create feature transformer to impute missing values
> ---
>
> Key: SPARK-13568
> URL: https://issues.apache.org/jira/browse/SPARK-13568
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>    Reporter: Nick Pentreath
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.2.0
>
>
> It is quite common to encounter missing values in data sets. It would be 
> useful to implement a {{Transformer}} that can impute missing data points, 
> similar to e.g. {{Imputer}} in 
> [scikit-learn|http://scikit-learn.org/dev/modules/preprocessing.html#imputation-of-missing-values].
> Initially, options for imputation could include {{mean}}, {{median}} and 
> {{most frequent}}, but we could add various other approaches. Where possible 
> existing DataFrame code can be used (e.g. for approximate quantiles etc).
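
For reference, a minimal usage sketch of the resulting transformer (column names
are illustrative; run in spark-shell or import spark.implicits._ for toDF):

  import org.apache.spark.ml.feature.Imputer

  val df = Seq(
    (1.0, Double.NaN),
    (2.0, 3.0),
    (Double.NaN, 5.0)
  ).toDF("a", "b")

  // replace missing values (NaN by default) with the per-column median
  val imputer = new Imputer()
    .setInputCols(Array("a", "b"))
    .setOutputCols(Array("a_imputed", "b_imputed"))
    .setStrategy("median")

  val imputed = imputer.fit(df).transform(df)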



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13568) Create feature transformer to impute missing values

2017-03-16 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-13568.

   Resolution: Fixed
Fix Version/s: 2.2.0

> Create feature transformer to impute missing values
> ---
>
> Key: SPARK-13568
> URL: https://issues.apache.org/jira/browse/SPARK-13568
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>    Reporter: Nick Pentreath
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.2.0
>
>
> It is quite common to encounter missing values in data sets. It would be 
> useful to implement a {{Transformer}} that can impute missing data points, 
> similar to e.g. {{Imputer}} in 
> [scikit-learn|http://scikit-learn.org/dev/modules/preprocessing.html#imputation-of-missing-values].
> Initially, options for imputation could include {{mean}}, {{median}} and 
> {{most frequent}}, but we could add various other approaches. Where possible 
> existing DataFrame code can be used (e.g. for approximate quantiles etc).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19969) Doc and examples for Imputer

2017-03-16 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-19969:
--

 Summary: Doc and examples for Imputer
 Key: SPARK-19969
 URL: https://issues.apache.org/jira/browse/SPARK-19969
 Project: Spark
  Issue Type: Documentation
  Components: ML
Affects Versions: 2.2.0
Reporter: Nick Pentreath






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Should we consider a Spark 2.1.1 release?

2017-03-16 Thread Nick Pentreath
Spark 1.5.1 had 87 issues with that fix version, 1 month after 1.5.0.

Spark 1.6.1 had 123 issues, 2 months after 1.6.0.

2.0.1 was larger (317 issues) at 3 months after 2.0.0 - which makes sense given
how large a release it was.

We are at 185 for 2.1.1, 3 months after 2.1.0 (and not released yet, so it
could slip further) - so not totally unusual as the release interval has
certainly increased, but in fairness probably a bit later than usual. I'd
say it definitely makes sense to cut the RC!



On Thu, 16 Mar 2017 at 02:06 Michael Armbrust 
wrote:

> Hey Holden,
>
> Thanks for bringing this up!  I think we usually cut patch releases when
> there are enough fixes to justify it.  Sometimes just a few weeks after the
> release.  I guess if we are at 3 months Spark 2.1.0 was a pretty good
> release :)
>
> That said, it is probably time. I was about to start thinking about 2.2 as
> well (we are a little past the posted code-freeze deadline), so I'm happy
> to push the buttons etc (this is a very good description
>  if you are curious). I
> would love help watching JIRA, posting the burn down on issues and
> shepherding in any critical patches.  Feel free to ping me off-line if you
> like to coordinate.
>
> Unless there are any objections, how about we aim for an RC of 2.1.1 on
> Monday and I'll also plan to cut branch-2.2 then?  (I'll send a separate
> email on this as well).
>
> Michael
>
> On Mon, Mar 13, 2017 at 1:40 PM, Holden Karau 
> wrote:
>
> I'd be happy to do the work of coordinating a 2.1.1 release if that's a
> thing a committer can do (I think the release coordinator for the most
> recent Arrow release was a committer and the final publish step took a PMC
> member to upload but other than that I don't remember any issues).
>
> On Mon, Mar 13, 2017 at 1:05 PM Sean Owen  wrote:
>
> It seems reasonable to me, in that other x.y.1 releases have followed ~2
> months after the x.y.0 release and it's been about 3 months since 2.1.0.
>
> Related: creating releases is tough work, so I feel kind of bad voting for
> someone else to do that much work. Would it make sense to deputize another
> release manager to help get out just the maintenance releases? this may in
> turn mean maintenance branches last longer. Experienced hands can continue
> to manage new minor and major releases as they require more coordination.
>
> I know most of the release process is written down; I know it's also still
> going to be work to make it 100% documented. Eventually it'll be necessary
> to make sure it's entirely codified anyway.
>
> Not pushing for it myself, just noting I had heard this brought up in side
> conversations before.
>
>
> On Mon, Mar 13, 2017 at 7:07 PM Holden Karau  wrote:
>
> Hi Spark Devs,
>
> Spark 2.1 has been out since end of December
> 
> and we've got quite a few fixes merged for 2.1.1
> 
> .
>
> On the Python side one of the things I'd like to see us get out into a
> patch release is a packaging fix (now merged) before we upload to PyPI &
> Conda, and we also have the normal batch of fixes like toLocalIterator for
> large DataFrames in PySpark.
>
> I've chatted with Felix & Shivaram who seem to think the R side is looking
> close to in good shape for a 2.1.1 release to submit to CRAN (if I've
> miss-spoken my apologies). The two outstanding issues that are being
> tracked for R are SPARK-18817, SPARK-19237.
>
> Looking at the other components quickly it seems like structured streaming
> could also benefit from a patch release.
>
> What do others think - are there any issues people are actively targeting
> for 2.1.1? Is this too early to be considering a patch release?
>
> Cheers,
>
> Holden
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>
>
>


[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame

2017-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927601#comment-15927601
 ] 

Nick Pentreath commented on SPARK-19962:


You may also want to take a look at 
https://issues.apache.org/jira/browse/SPARK-13969. I plan to work on the 
{{FeatureHasher}} (will be for Spark 2.3 now). I do think a {{DictVectorizer}} 
would be useful and is conceptually the same as my hasher but maintaining an 
exact feature mapping.

> add DictVectorizor for DataFrame
> 
>
> Key: SPARK-19962
> URL: https://issues.apache.org/jira/browse/SPARK-19962
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yu peng
>  Labels: features
>
> It would be really useful to have something like 
> sklearn.feature_extraction.DictVectorizer.
> Since our features live in a JSON / data-frame-like format and 
> classifiers/regressors only take vector input, there is a gap between them.
> Something like: 
> ```
> df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', 
> hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us',  
> hobbies=['sing']), ])
> df.show()
> |age|gender|country|hobbies|
> |1|male|cn|[sing, dance]|
> |3|female|us|[sing]|
> import DictVectorizor
> vec = DictVectorizor()
> matrix = vec.fit_transform(df)
> matrix.show()
> |features|
> |[1, 0, 1, 0, 1, 1, 1]|
> |[3, 1, 0, 1, 0, 1, 1]|
> vec.show()
> |feature_name| feature_dimension|
> |age|0|
> |gender=female|1|
> |gender=male|2|
> |country=us|3|
> |country=cn|4|
> |hobbies=sing|5|
> |hobbies=dance|6|
> ```



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19957) Inconsist KMeans initialization mode behavior between ML and MLlib

2017-03-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925723#comment-15925723
 ] 

Nick Pentreath commented on SPARK-19957:


See https://issues.apache.org/jira/browse/SPARK-16832 

> Inconsist KMeans initialization mode behavior between ML and MLlib
> --
>
> Key: SPARK-19957
> URL: https://issues.apache.org/jira/browse/SPARK-19957
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yuhao yang
>Priority: Minor
>
> When users set the initialization mode to "random", KMeans in ML and MLlib 
> have inconsistent behavior across multiple runs:
> MLlib will basically use a new Random for each run.
> ML KMeans, however, will use the default random seed, which is 
> {code}this.getClass.getName.hashCode.toLong{code}, and keep using the same 
> seed across multiple fits.
> I would expect the "random" initialization mode to be literally random. 
> There are different solutions with different scopes of impact. Adjusting the 
> HasSeed trait may have a broader impact (but is maybe worth discussing). We can 
> always just set a random default seed in KMeans. 
> Appreciate your feedback.
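
As a workaround in the meantime, the seed can be randomized explicitly for each
fit, so that ml KMeans with "random" init behaves like the mllib version - a
minimal sketch:

  import org.apache.spark.ml.clustering.KMeans

  // an explicit, different seed per fit restores run-to-run randomness
  val kmeans = new KMeans()
    .setK(10)
    .setInitMode("random")
    .setSeed(scala.util.Random.nextLong())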



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-09 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15902649#comment-15902649
 ] 

Nick Pentreath commented on SPARK-14409:


[~josephkb] in reference to your [PR 
comment|https://github.com/apache/spark/pull/17090#issuecomment-284827573]:

Really the input schema for evaluation is fairly simple - a set of ground truth 
ids and a (sorted) set of predicted ids, for each query (/user). The exact 
format (arrays like for {{mllib}} version, "exploded" version proposed in this 
JIRA) is not relevant in itself. Rather, the format selected is actually 
dictated by the {{Pipeline}} API - specifically, a model's prediction output 
schema from {{transform}} must be compatible with the evaluator's input schema 
for {{evaluate}}.

The schema proposed above is - I believe - the only one that is compatible with 
both "linear model" style things such as `LogisticRegression` for ad CTR 
prediction and learning-to-rank settings, as well as recommendation tasks.

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.
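
For reference, the existing RDD-based API that such an evaluator would mirror - a
minimal sketch with toy data (sc is the SparkContext, as in the shell):

  import org.apache.spark.mllib.evaluation.RankingMetrics

  // one entry per user/query: (predicted ids in ranked order, ground-truth relevant ids)
  val predictionAndLabels = sc.parallelize(Seq(
    (Array(81191, 93040, 318), Array(318, 3424, 7139)),
    (Array(1, 2, 3, 4, 5), Array(2, 5, 9))
  ))
  val metrics = new RankingMetrics(predictionAndLabels)
  println(metrics.meanAveragePrecision)
  println(metrics.precisionAt(5))
  println(metrics.ndcgAt(5))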



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-09 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15902639#comment-15902639
 ] 

Nick Pentreath edited comment on SPARK-14409 at 3/9/17 8:05 AM:


I commented on the [PR for 
SPARK-19535|https://github.com/apache/spark/pull/17090#issuecomment-284648012] 
and am copying that comment here for future reference as it contains further 
detail of the discussion:

=

Sorry if my other comments here and on JIRA were unclear. But the proposed 
schema for input to RankingEvaluator is:

*Schema 1*

{noformat}
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   230|    318|   5.0| 4.2403245|
|   230|   3424|   4.0|      null|
|   230|  81191|  null|  4.317455|
+------+-------+------+----------+
{noformat}
You will notice that rating and prediction columns can be null. This is by 
design. There are three cases shown above:

# 1st row indicates a (user-item) pair that occurs in both the ground-truth set 
and the top-k predictions;
# 2nd row indicates a (user-item) pair that occurs in the ground-truth set, but 
not in the top-k predictions;
# 3rd row indicates a (user-item) pair that occurs in the top-k predictions, 
but not in the ground-truth set.

Note for reference, the input to the current mllib RankingMetrics is:

*Schema 2*

{noformat}
RDD[(true labels array, predicted labels array)],
i.e.
RDD of ([318, 3424, 7139,...], [81191, 93040, 31...])
{noformat}

(So actually neither of the above schemas are easily compatible with the return 
schema here - but I think it is not really necessary to match the 
mllib.RankingMetrics format)

*ALS cross-validation*

My proposal for fitting ALS into cross-validation is that ALSModel.transform 
will output a DF of Schema 1 - only when the parameters k and recommendFor are 
appropriately set, and the input DF contains both user and item columns. In 
practice, this scenario will occur during cross-validation only.

So what I am saying is that ALS itself (not the evaluator) must know how to 
return the correct DataFrame output from transform such that it can be used in 
a cross-validation as input to the RankingEvaluator.

Concretely:

{code}
val als = new ALS().setRecommendFor("user").setK(10)
val validator = new TrainValidationSplit()
  .setEvaluator(new RankingEvaluator().setK(10))
  .setEstimator(als)
  .setEstimatorParamMaps(...)
val bestModel = validator.fit(ratings)
{code}

So while it is complex under the hood - to users it's simply a case of setting 
2 params and the rest is as normal.

Now, we have the best model selected by cross-validation. We can make 
recommendations using these convenience methods (I think it will need a cast):

{code}
val recommendations = 
bestModel.asInstanceOf[ALSModel].recommendItemsforUsers(10)
{code}

Alternatively, the transform version looks like this:

{code}
val usersDF = ...
+------+
|userId|
+------+
|     1|
|     2|
|     3|
+------+
val recommendations = bestModel.transform(usersDF)
{code}

So the questions:

* should we support the above transform-based recommendations? Or only support 
it for cross-validation purposes as a special case?
* if we do, what should the output schema of the above transform version look 
like? It must certainly match the output of recommendX methods.

The options are:

*(1) The schema in this PR:*
*Pros*: as you mention above - also more "compact"
*Cons*: doesn't match up so closely with the transform "cross-validation" 
schema above

*(2) The schema below. It is basically an "exploded" version of option (1)*

{noformat}
+------+-------+----------+
|userId|movieId|prediction|
+------+-------+----------+
|     1|      1|       4.3|
|     1|      5|       3.2|
|     1|      9|       2.1|
+------+-------+----------+
{noformat}

*Pros*: matches more closely with the cross-validation / evaluator input 
format. Perhaps slightly more "dataframe-like".
*Cons*: less compact; lose ordering?; may require more munging to save to 
external data stores etc.
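
(Again just a sketch, assuming the option (2) output sits in a DataFrame {{exploded}} 
with the column names above: it can be collapsed back into the compact option (1) 
form, which also shows that the ranking order has to be reconstructed from the 
scores:)

{code}
import org.apache.spark.sql.functions._

// Collapse the exploded (option 2) schema into a compact per-user list,
// re-deriving the ranking order from the prediction score.
val compact = exploded.groupBy("userId")
  .agg(sort_array(collect_list(struct(col("prediction"), col("movieId"))), asc = false)
    .as("recommendations"))
{code}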

Anyway sorry for hijacking this PR discussion - but as I think you can see, the 
evaluator / ALS transform interplay is a bit subtle and requires some thought 
to get the right approach.


was (Author: mlnick):
I commented on the [PR for 
SPARK-19535|https://github.com/apache/spark/pull/17090#issuecomment-284648012] 
and am copying that comment here for future reference as it contains further 
detail of the discussion:

=
{noformat}
Sorry if my other comments here and on JIRA were unclear. But the proposed 
schema for input to RankingEvaluator is:

Schema 1

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   230|    318|   5.0| 4.2403245|
|   230|   3424|   4.0|      null|
|   230|  81191|  null|  4.317455|
+--+-

[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-09 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15902639#comment-15902639
 ] 

Nick Pentreath commented on SPARK-14409:


I commented on the [PR for 
SPARK-19535|https://github.com/apache/spark/pull/17090#issuecomment-284648012] 
and am copying that comment here for future reference as it contains further 
detail of the discussion:

=
{noformat}
Sorry if my other comments here and on JIRA were unclear. But the proposed 
schema for input to RankingEvaluator is:

Schema 1

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   230|    318|   5.0| 4.2403245|
|   230|   3424|   4.0|      null|
|   230|  81191|  null|  4.317455|
+------+-------+------+----------+
You will notice that rating and prediction columns can be null. This is by 
design. There are three cases shown above:

1st row indicates a (user-item) pair that occurs in both the ground-truth set 
and the top-k predictions;
2nd row indicates a (user-item) pair that occurs in the ground-truth set, but 
not in the top-k predictions;
3rd row indicates a (user-item) pair that occurs in the top-k predictions, but 
not in the ground-truth set.
Note for reference, the input to the current mllib RankingMetrics is:

Schema 2

RDD[(true labels array, predicted labels array)],
i.e.
RDD of ([318, 3424, 7139,...], [81191, 93040, 31...])
(So actually neither of the above schemas is easily compatible with the return 
schema here - but I think it is not really necessary to match the 
mllib.RankingMetrics format.)

ALS cross-validation

My proposal for fitting ALS into cross-validation is that ALSModel.transform 
will output a DF of Schema 1 - but only when the parameters k and recommendFor 
are appropriately set and the input DF contains both user and item columns. In 
practice, this scenario will occur only during cross-validation.

So what I am saying is that ALS itself (not the evaluator) must know how to 
return the correct DataFrame output from transform such that it can be used in 
a cross-validation as input to the RankingEvaluator.

Concretely:

val als = new ALS().setRecommendFor("user").setK(10)
val validator = new TrainValidationSplit()
  .setEvaluator(new RankingEvaluator().setK(10))
  .setEstimator(als)
  .setEstimatorParamMaps(...)
val bestModel = validator.fit(ratings)
So while it is complex under the hood - to users it's simply a case of setting 
2 params and the rest is as normal.

Now, we have the best model selected by cross-validation. We can make 
recommendations using these convenience methods (I think it will need a cast):

val recommendations = 
bestModel.asInstanceOf[ALSModel].recommendItemsforUsers(10)
Alternatively, the transform version looks like this:

val usersDF = ...
+------+
|userId|
+------+
|     1|
|     2|
|     3|
+------+
val recommendations = bestModel.transform(usersDF)
So the questions:

should we support the above transform-based recommendations? Or only support it 
for cross-validation purposes as a special case?
if we do, what should the output schema of the above transform version look 
like? It must certainly match the output of recommendX methods.
The options are:

(1) The schema in this PR:
Pros: as you mention above - also more "compact"
Cons: doesn't match up so closely with the transform "cross-validation" schema 
above

(2) The schema below. It is basically an "exploded" version of option (1)

+------+-------+----------+
|userId|movieId|prediction|
+------+-------+----------+
|     1|      1|       4.3|
|     1|      5|       3.2|
|     1|      9|       2.1|
+------+-------+----------+
Pros: matches more closely with the cross-validation / evaluator input format. 
Perhaps slightly more "dataframe-like".
Cons: less compact; lose ordering?; may require more munging to save to 
external data stores etc.

Anyway sorry for hijacking this PR discussion - but as I think you can see, the 
evaluator / ALS transform interplay is a bit subtle and requires some thought 
to get the right approach.
{noformat}

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>      Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-1385

[jira] [Commented] (SPARK-13969) Extend input format that feature hashing can handle

2017-03-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900825#comment-15900825
 ] 

Nick Pentreath commented on SPARK-13969:


I think {{HashingTF}} and {{FeatureHasher}} are different things - similar to 
HashingVectorizer and FeatureHasher in scikit-learn.

{{HashingTF}} (HashingVectorizer) transforms a Seq of terms (typically 
{{String}}) into a term frequency vector. Yes technically it can operate on Seq 
of any type (well actually only strings and numbers, see [murmur3Hash 
function|https://github.com/apache/spark/blob/60022bfd65e4637efc0eb5f4cc0112289c783147/mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala#L151]).
 It could certainly operate on multiple columns - that would hash all the 
columns of sentences into one term frequency vector, so it seems like it would 
probably be less used in practice (though Vowpal Wabbit supports a form of this 
with its namespaces).

What {{HashingTF}} does not support is arbitrary categorical or numeric 
columns. It is possible to support categorical "one-hot" style encoding using 
what I have come to call the "stringify hack" - transforming a set of 
categorical columns into a Seq for input to HashingTF.

So taking say two categorical columns {{city}} and {{state}}, for example:
{code}
+--------+-----+-------------------------+
|city    |state|stringified              |
+--------+-----+-------------------------+
|Boston  |MA   |[city=Boston, state=MA]  |
|New York|NY   |[city=New York, state=NY]|
+--------+-----+-------------------------+
{code}

This works but is pretty ugly, doesn't fit nicely into a pipeline, and can't 
support numeric columns.
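
(For illustration, a minimal sketch of that hack, assuming a DataFrame {{df}} with the 
{{city}} and {{state}} columns from the example above:)

{code}
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.functions._

// "Stringify hack": build a Seq[String] column that HashingTF can consume.
val stringified = df.withColumn("stringified",
  array(concat(lit("city="), col("city")), concat(lit("state="), col("state"))))

val tf = new HashingTF()
  .setInputCol("stringified")
  .setOutputCol("features")
  .setNumFeatures(1 << 18)

val hashed = tf.transform(stringified)
{code}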

The {{FeatureHasher}} I propose acts like that in scikit-learn - it can handle 
multiple numeric and/or categorical columns in one pass. I go into some detail 
about all of this in my [Spark Summit East 2017 
talk|https://www.slideshare.net/SparkSummit/feature-hashing-for-scalable-machine-learning-spark-summit-east-talk-by-nick-pentreath].
 The rough draft of it used for the talk is 
[here|https://github.com/MLnick/spark/blob/FeatureHasher/mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala].

Another nice thing about the {{FeatureHasher}} is it opens up possibilities for 
doing things like namespaces in Vowpal Wabbit and it would be interesting to 
see if we could mimic their internal feature crossing, and so on.
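
(Purely for illustration - a hypothetical usage sketch based on the rough draft linked 
above; the API and column names are not final:)

{code}
// Hypothetical API, based on the draft FeatureHasher above (not final).
val hasher = new FeatureHasher()
  .setInputCols("city", "state", "numRooms", "price")
  .setOutputCol("features")
  .setNumFeatures(1 << 18)

// Numeric columns would contribute their value at the hashed index;
// categorical (string) columns would be hashed as "column=value" one-hot features.
val features = hasher.transform(df)
{code}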

> Extend input format that feature hashing can handle
> ---
>
> Key: SPARK-13969
> URL: https://issues.apache.org/jira/browse/SPARK-13969
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>
> Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in 
> scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of 
> strings and computes term frequencies.
> The use cases for feature hashing extend to arbitrary feature values (binary, 
> count or real-valued). For example, scikit-learn's {{FeatureHasher}} can 
> accept a sequence of (feature_name, value) pairs (e.g. a map, list). In this 
> way, feature hashing can operate as both "one-hot encoder" and "vector 
> assembler" at the same time.
> Investigate adding a more generic feature hasher (that in turn can be used by 
> {{HashingTF}}).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


