[jira] [Commented] (SPARK-23112) ML, Graph 2.3 QA: Programming guide update and migration guide

2018-01-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335821#comment-16335821
 ] 

Nick Pentreath commented on SPARK-23112:


{{OneHotEncoder}} is the only deprecation I can see - but let me know if I 
missed anything.
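For the migration guide entry, a minimal before/after sketch (illustrative column names; assumes Spark 2.3, where {{OneHotEncoderEstimator}} is the replacement for the deprecated transformer):

{code}
import org.apache.spark.ml.feature.{OneHotEncoder, OneHotEncoderEstimator}

// Deprecated in 2.3: a plain Transformer operating on a single column
val oldEncoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
// val encoded = oldEncoder.transform(df)

// Replacement: an Estimator that supports multiple columns and is fit first
val newEncoder = new OneHotEncoderEstimator()
  .setInputCols(Array("categoryIndex"))
  .setOutputCols(Array("categoryVec"))
// val encoded = newEncoder.fit(df).transform(df)
{code}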

> ML, Graph 2.3 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-23112
> URL: https://issues.apache.org/jira/browse/SPARK-23112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23105) Spark MLlib, GraphX 2.3 QA umbrella

2018-01-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335526#comment-16335526
 ] 

Nick Pentreath commented on SPARK-23105:


Some of the ML QA sub-tasks are marked {{Blocker}} - SPARK-23106, 
SPARK-23108, SPARK-23109, SPARK-23110 - but they are not targeted 
for {{2.3.0}}. Surely they should be?

> Spark MLlib, GraphX 2.3 QA umbrella
> ---
>
> Key: SPARK-23105
> URL: https://issues.apache.org/jira/browse/SPARK-23105
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. *SparkR is separate: SPARK-23114.*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13964) Feature hashing improvements

2018-01-22 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334599#comment-16334599
 ] 

Nick Pentreath commented on SPARK-13964:


Yes, that's certainly something I'd like to see added to the {{FeatureHasher}}.

> Feature hashing improvements
> 
>
> Key: SPARK-13964
> URL: https://issues.apache.org/jira/browse/SPARK-13964
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>
> Investigate improvements to Spark ML feature hashing (see e.g. 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Has there been any explanation on the performance degradation between spark.ml and Mllib?

2018-01-21 Thread Nick Pentreath
At least one of their comparisons is flawed.

The Spark ML version of linear regression (*note* they use linear
regression and not logistic regression - it is not clear why) uses L-BFGS as
the solver, not SGD (as MLlib uses). Hence it is typically going to be
slower. However, it should in most cases converge to a better solution.
MLlib doesn't offer an L-BFGS version for linear regression, but it does
for logistic regression.

In my view a more sensible comparison would be logistic regression with
L-BFGS in ML vs MLlib. These should be close to identical, since the
MLlib version now actually wraps the ML version.
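
As a rough illustration, a like-for-like check in spark-shell could look like
the following (a sketch only - it assumes the sample LibSVM dataset shipped
with Spark and the usual spark/sc shell variables):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

// RDD-based API (spark.mllib), L-BFGS solver
val rdd = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").cache()
val mllibModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(rdd)

// DataFrame-based API (spark.ml), also L-BFGS-based under the hood
val df = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt").cache()
val mlModel = new LogisticRegression().setMaxIter(100).fit(df)

Timing the run/fit calls on the same cached data, and comparing accuracy or
AUC on a held-out set, is the comparison I'd want to see.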

They also don't show any results for algorithm performance (accuracy, AUC,
etc.). The better comparison to make is the run time to achieve the same AUC
(for example). SGD may be fast, but it may result in a significantly poorer
solution relative to, say, L-BFGS.

Note that the "withSGD" algorithms are deprecated in MLlib partly to move
users to ML, but also partly because their accuracy is relatively poor and
the amount of tuning required (e.g. learning rates) is high.

They say:

The time difference between Spark MLlib and Spark ML can be explained by
internally transforming the dataset from DataFrame to RDD in order to use
the same implementation of the algorithm present in MLlib.

but this is not true for the LR example.

For the feature selection example, it is probably mostly due to the
conversion, but even then the difference seems larger than what I would
expect. It would be worth investigating their implementation to see if
there are other potential underlying causes.


On Sun, 21 Jan 2018 at 23:49 Stephen Boesch  wrote:

> While MLlib performed favorably vs Flink, it *also* performed favorably vs
> spark.ml ... and by an *order of magnitude*.  The following is one of the
> tables - it is for Logistic Regression.  At that time spark.ml did not yet
> support SVM.
>
> From:
> https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0020-2
>
>
>
> Table 3: LR learning time in seconds
>
> Dataset        Spark MLlib   Spark ML   Flink
> ECBDL14-10           3           26       181
> ECBDL14-30           5           63       815
> ECBDL14-50           6          173      1314
> ECBDL14-75           8          260      1878
> ECBDL14-100         12          415      2566
>
> The DataFrame-based API (spark.ml) is even slower vs the RDD-based one (mllib)
> than had been anticipated - yet the latter has been shut down for several
> versions of Spark already.  What is the thought process behind that
> decision: *performance matters!* Is there visibility into a meaningful
> narrowing of that gap?
>


[jira] [Commented] (SPARK-23154) Document backwards compatibility guarantees for ML persistence

2018-01-19 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16332252#comment-16332252
 ] 

Nick Pentreath commented on SPARK-23154:


SGTM

> Document backwards compatibility guarantees for ML persistence
> --
>
> Key: SPARK-23154
> URL: https://issues.apache.org/jira/browse/SPARK-23154
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> We have (as far as I know) maintained backwards compatibility for ML 
> persistence, but this is not documented anywhere.  I'd like us to document it 
> (for spark.ml, not for spark.mllib).
> I'd recommend something like:
> {quote}
> In general, MLlib maintains backwards compatibility for ML persistence.  
> I.e., if you save an ML model or Pipeline in one version of Spark, then you 
> should be able to load it back and use it in a future version of Spark.  
> However, there are rare exceptions, described below.
> Model persistence: Is a model or Pipeline saved using Apache Spark ML 
> persistence in Spark version X loadable by Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Yes; these are backwards compatible.
> * Note about the format: There are no guarantees for a stable persistence 
> format, but model loading itself is designed to be backwards compatible.
> Model behavior: Does a model or Pipeline in Spark version X behave 
> identically in Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Identical behavior, except for bug fixes.
> For both model persistence and model behavior, any breaking changes across a 
> minor version or patch version are reported in the Spark version release 
> notes. If a breakage is not reported in release notes, then it should be 
> treated as a bug to be fixed.
> {quote}
> How does this sound?
> Note: We unfortunately don't have tests for backwards compatibility (which 
> has technical hurdles and can be discussed in [SPARK-15573]).  However, we 
> have made efforts to maintain it during PR review and Spark release QA, and 
> most users expect it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [ML] Allow CrossValidation ParamGrid on SVMWithSGD

2018-01-19 Thread Nick Pentreath
SVMWithSGD sits in the older "mllib" package and is not directly compatible
with the DataFrame API. I suppose one could write an ML-API wrapper around
it.

However, there is LinearSVC in Spark 2.2.x:
http://spark.apache.org/docs/latest/ml-classification-regression.html#linear-support-vector-machine

I would say you should use that instead.
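
Something along these lines should work (a rough sketch - it assumes a
DataFrame "training" with the usual "label" and "features" columns):

import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val svc = new LinearSVC()

// LinearSVC exposes regParam, maxIter etc. as Params, so they work with ParamGridBuilder
val paramGrid = new ParamGridBuilder()
  .addGrid(svc.regParam, Array(0.1, 0.01, 0.001))
  .addGrid(svc.maxIter, Array(50, 100))
  .build()

val cv = new CrossValidator()
  .setEstimator(svc)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// val cvModel = cv.fit(training)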

On Fri, 19 Jan 2018 at 13:59 Tomasz Dudek 
wrote:

> Hello,
>
> is there any way to use CrossValidation's ParamGrid with SVMWithSGD?
>
> usually, when e.g. using RandomForest you can specify a lot of parameters,
> to automatise the param grid search (when used with CrossValidation)
>
> val algorithm = new RandomForestClassifier()
> val paramGrid = { new ParamGridBuilder()
>   .addGrid(algorithm.impurity, Array("gini", "entropy"))
>   .addGrid(algorithm.maxDepth, Array(3, 5, 10))
>   .addGrid(algorithm.numTrees, Array(2, 3, 5, 15, 50))
>   .addGrid(algorithm.minInfoGain, Array(0.01, 0.001))
>   .addGrid(algorithm.minInstancesPerNode, Array(10, 50, 500))
>   .build()
> }
>
> with SVMWithSGD however, the parameters are inside GradientDescent. You
> can explicitly tune the params, either by using SVMWithSGD's constructor or
> by calling setters here:
>
> val algorithm = new SVMWithSGD()
> algorithm.optimizer.setMiniBatchFraction(256)
>   .setNumIterations(200)
>   .setRegParam(0.01)
>
> those two ways however restrict me from using ParamGridBuilder correctly.
>
> There are no such things as algorithm.optimizer.numIterations or
> algorithm.optimizer.regParam, only setters(and ParamGrid requires Params,
> not setters)
>
> I could of course create each SVM model manually, create one huge Pipeline
> with each model saving its result to different column and then manually
> decide which performed the best. It requires a lot of coding and so far
> CrossValidation's ParamGrid did that job for me instead.
>
> Am I missing something? Is it WIP or is there any hack to do that?
>
> Yours,
> Tomasz
>


[jira] [Assigned] (SPARK-23048) Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator

2018-01-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23048:
--

Assignee: Liang-Chi Hsieh

> Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator 
> ---
>
> Key: SPARK-23048
> URL: https://issues.apache.org/jira/browse/SPARK-23048
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.0
>
>
> Since we're deprecating OneHotEncoder, we should update the docs to reference 
> its replacement, OneHotEncoderEstimator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23048) Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator

2018-01-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23048.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20257
[https://github.com/apache/spark/pull/20257]

> Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator 
> ---
>
> Key: SPARK-23048
> URL: https://issues.apache.org/jira/browse/SPARK-23048
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.0
>
>
> Since we're deprecating OneHotEncoder, we should update the docs to reference 
> its replacement, OneHotEncoderEstimator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23127) Update FeatureHasher user guide for catCols parameter

2018-01-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-23127.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20293
[https://github.com/apache/spark/pull/20293]

> Update FeatureHasher user guide for catCols parameter
> -
>
> Key: SPARK-23127
> URL: https://issues.apache.org/jira/browse/SPARK-23127
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>Priority: Major
> Fix For: 2.3.0
>
>
> SPARK-22801 added the {{categoricalCols}} parameter and updated the Scala and 
> Python doc, but did not update the user guide entry discussing feature 
> handling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23127) Update FeatureHasher user guide for catCols parameter

2018-01-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-23127:
--

Assignee: Nick Pentreath

> Update FeatureHasher user guide for catCols parameter
> -
>
> Key: SPARK-23127
> URL: https://issues.apache.org/jira/browse/SPARK-23127
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>Priority: Major
>
> SPARK-22801 added the {{categoricalCols}} parameter and updated the Scala and 
> Python doc, but did not update the user guide entry discussing feature 
> handling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23127) Update FeatureHasher user guide for catCols parameter

2018-01-17 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23127:
---
Description: SPARK-22801 added the {{categoricalCols}} parameter and 
updated the Scala and Python doc, but did not update the user guide entry 
discussing feature handling.

> Update FeatureHasher user guide for catCols parameter
> -
>
> Key: SPARK-23127
> URL: https://issues.apache.org/jira/browse/SPARK-23127
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22801 added the {{categoricalCols}} parameter and updated the Scala and 
> Python doc, but did not update the user guide entry discussing feature 
> handling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23127) Update FeatureHasher user guide for catCols parameter

2018-01-17 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-23127:
--

 Summary: Update FeatureHasher user guide for catCols parameter
 Key: SPARK-23127
 URL: https://issues.apache.org/jira/browse/SPARK-23127
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 2.3.0
Reporter: Nick Pentreath






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23060) RDD's apply function

2018-01-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326866#comment-16326866
 ] 

Nick Pentreath commented on SPARK-23060:


I agree - I don't see a compelling enough case for adding this to the public 
API.

> RDD's apply function
> 
>
> Key: SPARK-23060
> URL: https://issues.apache.org/jira/browse/SPARK-23060
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Gianmarco Donetti
>Priority: Minor
>  Labels: features, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> New function for RDDs -> apply
> >>> def foo(rdd):
> ... return rdd.map(lambda x: x.split('|')).filter(lambda x: x[0] 
> == 'ERROR')
> >>> rdd = sc.parallelize(['ERROR|10', 'ERROR|12', 'WARNING|10', 
> 'INFO|2'])
> >>> result = rdd.apply(foo)
> >>> result.collect()
> [('ERROR', '10'), ('ERROR', '12')]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21108) convert LinearSVC to aggregator framework

2018-01-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-21108.

Resolution: Fixed

> convert LinearSVC to aggregator framework
> -
>
> Key: SPARK-21108
> URL: https://issues.apache.org/jira/browse/SPARK-21108
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2018-01-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-21856:
--

Assignee: Chunsheng Ji

> Update Python API for MultilayerPerceptronClassifierModel
> -
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Chunsheng Ji
>Priority: Minor
>
> SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so the 
> Python API also needs updating.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2018-01-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-21856:
--

Assignee: (was: Weichen Xu)

> Update Python API for MultilayerPerceptronClassifierModel
> -
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Minor
>
> SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so the 
> Python API also needs updating.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2018-01-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-21856:
--

Assignee: Weichen Xu

> Update Python API for MultilayerPerceptronClassifierModel
> -
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
>
> SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so the 
> Python API also needs updating.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2018-01-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-21856.

Resolution: Fixed

> Update Python API for MultilayerPerceptronClassifierModel
> -
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
>
> SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so the 
> Python API also needs updating.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22943) OneHotEncoder supports manual specification of categorySizes

2018-01-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326151#comment-16326151
 ] 

Nick Pentreath commented on SPARK-22943:


Does the new estimator & model version of OHE solve this underlying issue? 

> OneHotEncoder supports manual specification of categorySizes
> 
>
> Key: SPARK-22943
> URL: https://issues.apache.org/jira/browse/SPARK-22943
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> OHE should support configurable categorySizes, like n_values in 
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html,
> which allows consistent and foreseeable conversion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22993) checkpointInterval param doc should be clearer

2018-01-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-22993.

Resolution: Fixed

> checkpointInterval param doc should be clearer
> --
>
> Key: SPARK-22993
> URL: https://issues.apache.org/jira/browse/SPARK-22993
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Trivial
>
> Several algorithms use the shared parameter {{HasCheckpointInterval}} (ALS, 
> LDA, GBT), each of which silently ignores the parameter when the checkpoint 
> directory is not set on the Spark context. This should be documented in the 
> param doc.
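
For context, a minimal sketch of the behaviour being documented (path and
params are illustrative; assumes {{sc}} is the active SparkContext):

{code}
import org.apache.spark.ml.recommendation.ALS

// checkpointInterval only takes effect once a checkpoint directory has been set;
// without this line it is silently ignored.
sc.setCheckpointDir("/tmp/spark-checkpoints")

val als = new ALS()
  .setMaxIter(20)
  .setCheckpointInterval(10)  // checkpoint intermediate RDDs every 10 iterations
{code}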



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22993) checkpointInterval param doc should be clearer

2018-01-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-22993:
--

Assignee: Seth Hendrickson

> checkpointInterval param doc should be clearer
> --
>
> Key: SPARK-22993
> URL: https://issues.apache.org/jira/browse/SPARK-22993
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Trivial
>
> Several algorithms use the shared parameter {{HasCheckpointInterval}} (ALS, 
> LDA, GBT), each of which silently ignores the parameter when the checkpoint 
> directory is not set on the Spark context. This should be documented in the 
> param doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22871) Add GBT+LR Algorithm in MLlib

2017-12-31 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307210#comment-16307210
 ] 

Nick Pentreath commented on SPARK-22871:


Tree-based feature transformation is covered in SPARK-13677, so I think this 
duplicates that ticket. I also think it is best to leave the functionality 
separate rather than create a new estimator in Spark - i.e. we could add the 
leaf-based feature transformation to the tree models and leave it up to the 
user to combine that with LR etc. I think this separation of concerns and 
modularity is better.

Finally, as [~srowen] mentions in SPARK-22867, I think this particular model is 
best kept as a separate Spark package.

> Add GBT+LR Algorithm in MLlib
> -
>
> Key: SPARK-22871
> URL: https://issues.apache.org/jira/browse/SPARK-22871
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Fangzhou Yang
>
> GBTLRClassifier is a hybrid model of Gradient Boosting Trees and Logistic 
> Regression. 
> It is quite practical and popular in many data mining competitions. In this 
> hybrid model, input features are transformed by means of boosted decision 
> trees. The output of each individual tree is treated as a categorical input 
> feature to a sparse linear classifier. Boosted decision trees prove to be very 
> powerful feature transforms.
> Model details about GBTLR can be found in the following paper:
> https://dl.acm.org/citation.cfm?id=2648589 (Practical Lessons from 
> Predicting Clicks on Ads at Facebook)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22801) Allow FeatureHasher to specify numeric columns to treat as categorical

2017-12-31 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-22801.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19991
[https://github.com/apache/spark/pull/19991]

> Allow FeatureHasher to specify numeric columns to treat as categorical
> --
>
> Key: SPARK-22801
> URL: https://issues.apache.org/jira/browse/SPARK-22801
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
> Fix For: 2.3.0
>
>
> {{FeatureHasher}} added in SPARK-13964 always treats numeric type columns as 
> numbers and never as categorical features. It is quite common to have 
> categorical features represented as numbers or codes (often say {{Int}}) in 
> data sources. 
> In order to hash these features as categorical, users must first explicitly 
> convert them to strings which is cumbersome. 
> Add a new param {{categoricalCols}} which specifies the numeric columns that 
> should be treated as categorical features.
> *Note* while the reverse case is certainly possible (i.e. numeric features 
> that are encoded as strings and a user would like to treat them as numeric), 
> this is probably less likely and this case won't be supported at this time. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22397) Add multiple column support to QuantileDiscretizer

2017-12-31 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-22397:
--

Assignee: Huaxin Gao

> Add multiple column support to QuantileDiscretizer
> --
>
> Key: SPARK-22397
> URL: https://issues.apache.org/jira/browse/SPARK-22397
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>    Reporter: Nick Pentreath
>Assignee: Huaxin Gao
> Fix For: 2.3.0
>
>
> Once SPARK-20542 adds multi column support to {{Bucketizer}}, we  can add 
> multi column support to the {{QuantileDiscretizer}} too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22397) Add multiple column support to QuantileDiscretizer

2017-12-31 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-22397.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19715
[https://github.com/apache/spark/pull/19715]

> Add multiple column support to QuantileDiscretizer
> --
>
> Key: SPARK-22397
> URL: https://issues.apache.org/jira/browse/SPARK-22397
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>    Reporter: Nick Pentreath
> Fix For: 2.3.0
>
>
> Once SPARK-20542 adds multi column support to {{Bucketizer}}, we  can add 
> multi column support to the {{QuantileDiscretizer}} too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2017-12-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-22799:
---
Description: See the related discussion: 
https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22801) Allow FeatureHasher to specify numeric columns to treat as categorical

2017-12-15 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-22801:
--

 Summary: Allow FeatureHasher to specify numeric columns to treat 
as categorical
 Key: SPARK-22801
 URL: https://issues.apache.org/jira/browse/SPARK-22801
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Nick Pentreath
Assignee: Nick Pentreath


{{FeatureHasher}} added in SPARK-13964 always treats numeric type columns as 
numbers and never as categorical features. It is quite common to have 
categorical features represented as numbers or codes (often say {{Int}}) in 
data sources. 

In order to hash these features as categorical, users must first explicitly 
convert them to strings which is cumbersome. 

Add a new param {{categoricalCols}} which specifies the numeric columns that 
should be treated as categorical features.

*Note* while the reverse case is certainly possible (i.e. numeric features that 
are encoded as strings and a user would like to treat them as numeric), this is 
probably less likely and this case won't be supported at this time. 
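
A minimal usage sketch of the proposed param (column names are illustrative):

{code}
import org.apache.spark.ml.feature.FeatureHasher

// "zipCode" is stored as an Int but is really a categorical code
val hasher = new FeatureHasher()
  .setInputCols("zipCode", "deviceType", "clicks")
  .setCategoricalCols(Array("zipCode"))
  .setOutputCol("features")

// val hashed = hasher.transform(df)
{code}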



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-12-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292320#comment-16292320
 ] 

Nick Pentreath edited comment on SPARK-8418 at 12/15/17 10:40 AM:
--

Created SPARK-22796, SPARK-22797 and SPARK-22798 to track PySpark support for 
{{QuantileDiscretizer}}, {{Bucketizer}} and {{StringIndexer}}, respectively.

The in-progress PR for QD changed to throwing exception as per above 
discussion. I created SPARK-22799 to track that for {{Bucketizer}}


was (Author: mlnick):
Created SPARK-22796, SPARK-22797 and SPARK-22798 to track PySpark support for 
{{QuantileDiscretizer}}, {{Bucketizer}} and {{StringIndexer}}, respectively.

The in-progress PR for QD changed to throwing exception as per above 
discussion. I created SPARK-22799 to track that.

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2017-12-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-22799:
---
Issue Type: Improvement  (was: New Feature)

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-12-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292320#comment-16292320
 ] 

Nick Pentreath commented on SPARK-8418:
---

Created SPARK-22796, SPARK-22797 and SPARK-22798 to track PySpark support for 
{{QuantileDiscretizer}}, {{Bucketizer}} and {{StringIndexer}}, respectively.

The in-progress PR for QD changed to throwing exception as per above 
discussion. I created SPARK-22799 to track that.

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2017-12-15 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-22799:
--

 Summary: Bucketizer should throw exception if single- and 
multi-column params are both set
 Key: SPARK-22799
 URL: https://issues.apache.org/jira/browse/SPARK-22799
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.3.0
Reporter: Nick Pentreath






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22798) Add multiple column support to PySpark StringIndexer

2017-12-15 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-22798:
--

 Summary: Add multiple column support to PySpark StringIndexer
 Key: SPARK-22798
 URL: https://issues.apache.org/jira/browse/SPARK-22798
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Nick Pentreath






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22797) Add multiple column support to PySpark Bucketizer

2017-12-15 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-22797:
--

 Summary: Add multiple column support to PySpark Bucketizer
 Key: SPARK-22797
 URL: https://issues.apache.org/jira/browse/SPARK-22797
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Nick Pentreath






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22796) Add multiple column support to PySpark QuantileDiscretizer

2017-12-15 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-22796:
--

 Summary: Add multiple column support to PySpark QuantileDiscretizer
 Key: SPARK-22796
 URL: https://issues.apache.org/jira/browse/SPARK-22796
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Nick Pentreath






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19357) Parallel Model Evaluation for ML Tuning: Scala

2017-12-13 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288993#comment-16288993
 ] 

Nick Pentreath commented on SPARK-19357:


I've thought about this and taken a look at the proposed solution in 
SPARK-22126 (PR: https://github.com/apache/spark/pull/19350; see my 
[comment|https://github.com/apache/spark/pull/19350/files#r156599955]). I don't 
think the PR solves the problem of a pipeline with stages that have 
model-specific optimizations. In addition, the API presented there seems a bit 
convoluted and makes it quite tricky to implement a model-specific optimization 
for a given estimator. I don't see the benefit of "pushing" the parallel 
implementation down to {{Estimator}}.

Overall, if we cannot support model-specific optimization for CV in the short 
term, that seems OK to me, since we don't have any actual implementations and 
the benefit of parallel CV as it stands far outweighs that cost. We can make a 
note in the user guide or API docs if necessary.

If we can figure out a clean API to support both, all the better, but until we 
actually have a significant model-specific optimization implementation it seems 
like overkill. I do think Bryan's concept seems cleaner and simpler to 
implement for specific estimators, so perhaps [~bryanc] is able to work up a 
WIP PR to illustrate how it would work?


> Parallel Model Evaluation for ML Tuning: Scala
> --
>
> Key: SPARK-19357
> URL: https://issues.apache.org/jira/browse/SPARK-19357
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
> Attachments: parallelism-verification-test.pdf
>
>
> This is a first step of the parent task of Optimizations for ML Pipeline 
> Tuning to perform model evaluation in parallel.  A simple approach is to 
> naively evaluate with a possible parameter to control the level of 
> parallelism.  There are some concerns with this:
> * excessive caching of datasets
> * what to set as the default value for level of parallelism.  1 will evaluate 
> all models in serial, as is done currently. Higher values could lead to 
> excessive caching.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22700) Bucketizer.transform incorrectly drops row containing NaN

2017-12-12 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-22700.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19894
[https://github.com/apache/spark/pull/19894]

> Bucketizer.transform incorrectly drops row containing NaN
> -
>
> Key: SPARK-22700
> URL: https://issues.apache.org/jira/browse/SPARK-22700
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: zhengruifeng
> Fix For: 2.3.0
>
>
> {code}
> import org.apache.spark.ml.feature._
> val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, 
> Double.NaN))).toDF("a", "b")
> val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity)
> val bucketizer: Bucketizer = new 
> Bucketizer().setInputCol("a").setOutputCol("aa").setSplits(splits)
> bucketizer.setHandleInvalid("skip")
> scala> df.show
> +---+---+
> |  a|  b|
> +---+---+
> |2.3|3.0|
> |NaN|3.0|
> |6.7|NaN|
> +---+---+
> scala> bucketizer.transform(df).show
> +---+---+---+
> |  a|  b| aa|
> +---+---+---+
> |2.3|3.0|0.0|
> +---+---+---+
> {code}
> When {{handleInvalid}} is set to {{skip}}, the last input row is incorrectly 
> dropped, even though column 'b' is not an input column



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22700) Bucketizer.transform incorrectly drops row containing NaN

2017-12-12 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-22700:
--

Assignee: zhengruifeng

> Bucketizer.transform incorrectly drops row containing NaN
> -
>
> Key: SPARK-22700
> URL: https://issues.apache.org/jira/browse/SPARK-22700
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
> Fix For: 2.3.0
>
>
> {code}
> import org.apache.spark.ml.feature._
> val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, 
> Double.NaN))).toDF("a", "b")
> val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity)
> val bucketizer: Bucketizer = new 
> Bucketizer().setInputCol("a").setOutputCol("aa").setSplits(splits)
> bucketizer.setHandleInvalid("skip")
> scala> df.show
> +---+---+
> |  a|  b|
> +---+---+
> |2.3|3.0|
> |NaN|3.0|
> |6.7|NaN|
> +---+---+
> scala> bucketizer.transform(df).show
> +---+---+---+
> |  a|  b| aa|
> +---+---+---+
> |2.3|3.0|0.0|
> +---+---+---+
> {code}
> When {{handleInvalid}} is set to {{skip}}, the last input row is incorrectly 
> dropped, even though column 'b' is not an input column



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22690) Imputer inherit HasOutputCols

2017-12-07 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-22690.

Resolution: Fixed

> Imputer inherit HasOutputCols
> -
>
> Key: SPARK-22690
> URL: https://issues.apache.org/jira/browse/SPARK-22690
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
>
> trait {{HasOutputCols}} was add in Spark-20542, {{Imputer}} should also 
> inherit it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22690) Imputer inherit HasOutputCols

2017-12-07 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-22690:
---
Fix Version/s: 2.3.0

> Imputer inherit HasOutputCols
> -
>
> Key: SPARK-22690
> URL: https://issues.apache.org/jira/browse/SPARK-22690
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
> Fix For: 2.3.0
>
>
> trait {{HasOutputCols}} was add in Spark-20542, {{Imputer}} should also 
> inherit it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22690) Imputer inherit HasOutputCols

2017-12-07 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-22690:
--

Assignee: zhengruifeng

> Imputer inherit HasOutputCols
> -
>
> Key: SPARK-22690
> URL: https://issues.apache.org/jira/browse/SPARK-22690
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
>
> trait {{HasOutputCols}} was add in Spark-20542, {{Imputer}} should also 
> inherit it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-12-01 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275426#comment-16275426
 ] 

Nick Pentreath commented on SPARK-8418:
---

*1 I’m OK with throwing an exception. We can update the previous and
in-progress PRs accordingly.

*2 Where modifying an existing API, obviously we need to keep both.

But I prefer only inputCols for new components. We can provide a convenience
method to set a single (or a few) input columns - I did that for
FeatureHasher.

Like setInputCol(col: String, others: String*) - but the param set is
inputCols under the hood (see the sketch after these points).

Java must still use setInputCols, as the above only works for Scala I think.

We can also deprecate the single-column variants for 3.0 if we like?

*3 Yes, we must thoroughly test this before the 2.3 release. I think it should
be fine as it’s just adding a few new parameters, which is nothing out of
the ordinary.

*4 I will create JIRAs for the Python APIs - ideally we’d like them for 2.3.
Fortunately it should be pretty trivial to complete.
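
The sketch mentioned under *2, as a standalone illustration (not Spark's
actual source - names are just for the example):

class SketchWithInputCols {
  // the real param: always an array of column names under the hood
  private var inputCols: Array[String] = Array.empty

  def setInputCols(value: Array[String]): this.type = { inputCols = value; this }

  // Scala-only convenience overload, e.g. setInputCols("a", "b", "c");
  // Java callers would pass an explicit Array to the method above
  def setInputCols(first: String, others: String*): this.type =
    setInputCols((first +: others).toArray)
}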
On Sat, 2 Dec 2017 at 00:00, Joseph K. Bradley (JIRA) 



> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: CrossValidation distribution - is it in the roadmap?

2017-11-29 Thread Nick Pentreath
Hi Tomasz

Parallel evaluation for CrossValidation and TrainValidationSplit was added
for Spark 2.3 in https://issues.apache.org/jira/browse/SPARK-19357
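
With that change it's just an extra param on the tuning estimator - a rough
sketch (assuming a DataFrame "training" with "label" and "features" columns):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
  .setParallelism(4)  // new in 2.3: evaluate up to 4 models at a time

// val cvModel = cv.fit(training)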


On Wed, 29 Nov 2017 at 16:31 Tomasz Dudek 
wrote:

> Hey,
>
> is there a way to make the following code:
>
> val paramGrid = new ParamGridBuilder().//omitted for brevity - lets say we
> have hundreds of param combinations here
>
> val cv = new
> CrossValidator().setNumFolds(3).setEstimator(pipeline).setEstimatorParamMaps(paramGrid)
>
> automatically distribute itself over all the executors? What I mean is
> to simultaneously compute a few (or hundreds of) ML models, instead of
> using all the computation power on just one model at a time.
>
> If not, is such behavior on Spark's roadmap?
>
> ...if not, do you think a person without prior Spark development
> experience (me) could do it? I've been using SparkML daily at work for a
> few months now. How much time would it take, approximately?
>
> Yours,
> Tomasz
>
>
>


Re: does "Deep Learning Pipelines" scale out linearly?

2017-11-22 Thread Nick Pentreath
For that package specifically, it’s best to see if they have a mailing list
and, if not, perhaps ask on GitHub issues.

Having said that, perhaps the folks involved in that package will reply here
too.

On Wed, 22 Nov 2017 at 20:03, Andy Davidson 
wrote:

> I am starting a new deep learning project currently we do all of our work
> on a single machine using a combination of Keras and Tensor flow.
> https://databricks.github.io/spark-deep-learning/site/index.html looks
> very promising. Any idea how performance is likely to improve as I add
> machines to my my cluster?
>
> Kind regards
>
> Andy
>
>
> P.s. Is user@spark.apache.org the best place to ask questions about this
> package?
>
>
>


[jira] [Assigned] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter

2017-11-10 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-20199:
--

Assignee: pralabhkumar

> GradientBoostedTreesModel doesn't have  featureSubsetStrategy parameter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>Assignee: pralabhkumar
> Fix For: 2.3.0
>
>
> Spark GradientBoostedTreesModel doesn't have featureSubsetStrategy. It uses 
> random forest internally, which has featureSubsetStrategy hardcoded to "all". 
> It should be provided by the user to allow randomness at the feature level.
> This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai: 
> gbmParams._col_sample_rate
> Please provide the parameter.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter

2017-11-10 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-20199.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18118
[https://github.com/apache/spark/pull/18118]

> GradientBoostedTreesModel doesn't have  featureSubsetStrategy parameter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
> Fix For: 2.3.0
>
>
> Spark GradientBoostedTreesModel doesn't have featureSubsetStrategy. It uses 
> random forest internally, which has featureSubsetStrategy hardcoded to "all". 
> It should be provided by the user to allow randomness at the feature level.
> This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai: 
> gbmParams._col_sample_rate
> Please provide the parameter.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Timeline for Spark 2.3

2017-11-09 Thread Nick Pentreath
+1 I think that’s practical

On Fri, 10 Nov 2017 at 03:13, Erik Erlandson  wrote:

> +1 on extending the deadline. It will significantly improve the logistics
> for upstreaming the Kubernetes back-end.  Also agreed, on the general
> realities of reduced bandwidth over the Nov-Dec holiday season.
> Erik
>
> On Thu, Nov 9, 2017 at 6:03 PM, Matei Zaharia 
> wrote:
>
>> I’m also +1 on extending this to get Kubernetes and other features in.
>>
>> Matei
>>
>> > On Nov 9, 2017, at 4:04 PM, Anirudh Ramanathan
>>  wrote:
>> >
>> > This would help the community on the Kubernetes effort quite a bit -
>> giving us additional time for reviews and testing for the 2.3 release.
>> >
>> > On Thu, Nov 9, 2017 at 3:56 PM, Justin Miller <
>> justin.mil...@protectwise.com> wrote:
>> > That sounds fine to me. I’m hoping that this ticket can make it into
>> Spark 2.3: https://issues.apache.org/jira/browse/SPARK-18016
>> >
>> > It’s causing some pretty considerable problems when we alter the
>> columns to be nullable, but we are OK for now without that.
>> >
>> > Best,
>> > Justin
>> >
>> >> On Nov 9, 2017, at 4:54 PM, Michael Armbrust 
>> wrote:
>> >>
>> >> According to the timeline posted on the website, we are nearing branch
>> cut for Spark 2.3.  I'd like to propose pushing this out towards mid to
>> late December for a couple of reasons and would like to hear what people
>> think.
>> >>
>> >> 1. I've done release management during the Thanksgiving / Christmas
>> time before and in my experience, we don't actually get a lot of testing
>> during this time due to vacations and other commitments. I think beginning
>> the RC process in early January would give us the best coverage in the
>> shortest amount of time.
>> >> 2. There are several large initiatives in progress that given a little
>> more time would leave us with a much more exciting 2.3 release.
>> Specifically, the work on the history server, Kubernetes and continuous
>> processing.
>> >> 3. Given the actual release date of Spark 2.2, I think we'll still get
>> Spark 2.3 out roughly 6 months after.
>> >>
>> >> Thoughts?
>> >>
>> >> Michael
>> >
>> >
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


[jira] [Resolved] (SPARK-20542) Add an API into Bucketizer that can bin a lot of columns all at once

2017-11-09 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-20542.

   Resolution: Fixed
Fix Version/s: 2.3.0

> Add an API into Bucketizer that can bin a lot of columns all at once
> 
>
> Key: SPARK-20542
> URL: https://issues.apache.org/jira/browse/SPARK-20542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> Currently, ML's Bucketizer can only bin a single column of continuous features. If a 
> dataset has thousands of continuous columns that need to be binned, we end up 
> with thousands of ML stages. This is very inefficient in terms of query planning 
> and execution.
> We should have a type of bucketizer that can bin many columns all at 
> once. It would need to accept a list of arrays of split points corresponding 
> to the columns to bin, but it might make things more efficient by replacing 
> thousands of stages with just one.
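
As a rough sketch of the multi-column usage this enables (the multi-column params
{{setInputCols}}/{{setOutputCols}}/{{setSplitsArray}} are assumed here; {{df}} is a
pre-existing DataFrame with "age" and "income" columns):

{code:java}
// Sketch of binning many columns in a single stage; the multi-column param names are assumptions.
import org.apache.spark.ml.feature.Bucketizer

val bucketizer = new Bucketizer()
  .setInputCols(Array("age", "income"))
  .setOutputCols(Array("ageBucket", "incomeBucket"))
  .setSplitsArray(Array(
    Array(Double.NegativeInfinity, 18.0, 35.0, 65.0, Double.PositiveInfinity),
    Array(Double.NegativeInfinity, 0.0, 50000.0, Double.PositiveInfinity)))

val binned = bucketizer.transform(df)  // one stage instead of one per column
{code}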



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-10-31 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226448#comment-16226448
 ] 

Nick Pentreath commented on SPARK-13030:


I just think it makes sense for OHE to be an Estimator (as it is in sklearn). 
It really should have been from the beginning. The fact that it is not is 
actually a bug, IMO.

The proposal to have a size param could fix the issue, but it is a bit of a 
band-aid fix. It requires the user to specify the size (number of categories) 
manually. That doesn't really feel like the right workflow to me; the OHE 
should be able to figure that out itself. So it adds one more "speed bump", 
albeit a small one, to using the component in a pipeline.

It is possible to use a sort of "hack" for {{fit}}, i.e. set the param during the 
first transform call if it is not set already. But that just argues 
for the fact that it should be an {{Estimator/Model}} pair. Sure, we could wait 
until {{3.0}}, but if the work is already done I don't see a compelling reason 
not to do it now.
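
For illustration, a rough sketch of what the Estimator/Model usage could look like
(the class and method names here follow the proposal above and are assumptions, not a
committed API; {{trainingDF}}/{{testDF}} are pre-existing DataFrames):

{code:java}
// Sketch only: OneHotEncoderEstimator / setInputCols / setOutputCols are assumed names.
import org.apache.spark.ml.feature.OneHotEncoderEstimator

val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("categoryIndex"))
  .setOutputCols(Array("categoryVec"))

val encoderModel = encoder.fit(trainingDF)    // learns the number of categories here
val encoded = encoderModel.transform(testDF)  // reuses the size learned at fit time
{code}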

> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when the number of categories 
> differs between the training dataset and the test dataset.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: StringIndexer on several columns in a DataFrame with Scala

2017-10-30 Thread Nick Pentreath
For now, you must follow this approach of constructing a pipeline
consisting of a StringIndexer for each categorical column. See
https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA to
allow multiple columns for StringIndexer, which is being worked on
currently.

The reason you're seeing an NPE is:

var indexers: Array[StringIndexer] = null

and then you're trying to append an element to something that is null.

Try this instead:

var indexers: Array[StringIndexer] = Array()


But even better is a more functional approach:

val indexers = featureCol.map { colName =>
  new StringIndexer().setInputCol(colName).setOutputCol(colName + "_indexed")
}
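
Putting it together, a minimal end-to-end sketch (assuming a trainingDF DataFrame
and treating every column except "date" as categorical):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

// One StringIndexer per categorical column, all executed in a single Pipeline.
val featureCols = trainingDF.columns.filter(_ != "date")
val indexers = featureCols.map { colName =>
  new StringIndexer().setInputCol(colName).setOutputCol(colName + "_indexed")
}
val pipeline = new Pipeline().setStages(indexers)
val indexedDF = pipeline.fit(trainingDF).transform(trainingDF)
indexedDF.show()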


On Fri, 27 Oct 2017 at 22:29 Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:

> Hi All,
>
> There are several categorical columns in my dataset as follows:
> [image: grafik.png]
>
> How can I transform the values in each categorical column into numeric values
> using StringIndexer so that the resulting DataFrame can be fed into
> VectorAssembler to generate a feature vector?
>
> A naive approach would be to use a separate StringIndexer for each categorical
> column, but that sounds tedious, I know.
> A possible workaround in
> PySpark is to combine several StringIndexers in a list and use a Pipeline
> to execute them all, as follows:
>
> from pyspark.ml import Pipelinefrom pyspark.ml.feature import StringIndexer
> indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df) 
> for column in list(set(df.columns)-set(['date'])) ]
> pipeline = Pipeline(stages=indexers)
> df_r = pipeline.fit(df).transform(df)
> df_r.show()
>
> How I can do the same in Scala? I tried the following:
>
> val featureCol = trainingDF.columns
> var indexers: Array[StringIndexer] = null
>
> for (colName <- featureCol) {
>   val index = new StringIndexer()
> .setInputCol(colName)
> .setOutputCol(colName + "_indexed")
> //.fit(trainDF)
>   indexers = indexers :+ index
> }
>
>  val pipeline = new Pipeline()
> .setStages(indexers)
> val newDF = pipeline.fit(trainingDF).transform(trainingDF)
> newDF.show()
>
> However, I am experiencing NullPointerException at
>
> for (colName <- featureCol)
>
> I am sure I am doing something wrong. Any suggestions?
>
>
>
> Regards,
> _
> *Md. Rezaul Karim*, BSc, MSc
> Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> 
>


[jira] [Updated] (SPARK-22397) Add multiple column support to QuantileDiscretizer

2017-10-30 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-22397:
---
Description: Once SPARK-20542 adds multi column support to {{Bucketizer}}, 
we  can add multi column support to the {{QuantileDiscretizer}} too.

> Add multiple column support to QuantileDiscretizer
> --
>
> Key: SPARK-22397
> URL: https://issues.apache.org/jira/browse/SPARK-22397
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>    Reporter: Nick Pentreath
>
> Once SPARK-20542 adds multi column support to {{Bucketizer}}, we  can add 
> multi column support to the {{QuantileDiscretizer}} too.
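
As a rough sketch of what the multi-column usage could look like once added (param
names below mirror the multi-column Bucketizer and are assumptions; {{df}} is a
pre-existing DataFrame):

{code:java}
// Sketch only: the multi-column params (setInputCols / setOutputCols /
// setNumBucketsArray) are assumed to mirror the multi-column Bucketizer API.
import org.apache.spark.ml.feature.QuantileDiscretizer

val discretizer = new QuantileDiscretizer()
  .setInputCols(Array("age", "income"))
  .setOutputCols(Array("ageBucket", "incomeBucket"))
  .setNumBucketsArray(Array(4, 10))

val model = discretizer.fit(df)   // computes quantile-based splits per column
val binned = model.transform(df)  // df assumed to contain "age" and "income"
{code}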



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22397) Add multiple column support to QuantileDiscretizer

2017-10-30 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224459#comment-16224459
 ] 

Nick Pentreath commented on SPARK-22397:


[~huaxing] is working on this and will submit a PR shortly.

> Add multiple column support to QuantileDiscretizer
> --
>
> Key: SPARK-22397
> URL: https://issues.apache.org/jira/browse/SPARK-22397
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>    Reporter: Nick Pentreath
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22397) Add multiple column support to QuantileDiscretizer

2017-10-30 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-22397:
--

 Summary: Add multiple column support to QuantileDiscretizer
 Key: SPARK-22397
 URL: https://issues.apache.org/jira/browse/SPARK-22397
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Nick Pentreath






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-10-30 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224454#comment-16224454
 ] 

Nick Pentreath commented on SPARK-8418:
---

Adding SPARK-13030, since the new version of {{OneHotEncoder}} will also 
support transforming multiple columns.

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

2017-10-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219006#comment-16219006
 ] 

Nick Pentreath commented on SPARK-22346:


SPARK-19141 mentions another option which may work as an interim measure until 
we can make a more drastic or breaking change: a param option to skip creating 
metadata in {{transform}}. As you mention, it's not really necessary to have 
metadata at prediction time.

This happens to also be a quick-fix for SPARK-19141, since the metadata could 
be skipped even during training, when it would otherwise cause memory issues 
with large feature spaces.
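
Purely as a hypothetical usage sketch of such an option (the {{keepAttributes}} setter
below is an imagined name for the "skip metadata" param discussed above and does not
exist; {{streamingDF}} is a pre-existing streaming DataFrame):

{code:java}
// Hypothetical sketch: setKeepAttributes is an imagined param, not an existing API.
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "vecCol"))
  .setOutputCol("features")
  // .setKeepAttributes(false)  // hypothetical: skip building AttributeGroup metadata in transform

val out = assembler.transform(streamingDF)  // the step that currently fails for streaming input
{code}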

> Update VectorAssembler to work with StreamingDataframes
> ---
>
> Key: SPARK-22346
> URL: https://issues.apache.org/jira/browse/SPARK-22346
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> The issue
> In batch mode, VectorAssembler can take multiple columns of VectorType and 
> assemble and output a new column of VectorType containing the concatenated 
> vectors. In streaming mode, this transformation can fail because 
> VectorAssembler does not have enough information to produce metadata 
> (AttributeGroup) for the new column. Because VectorAssembler is such a 
> ubiquitous part of mllib pipelines, this issue effectively means spark 
> structured streaming does not support prediction using mllib pipelines.
> I've created this ticket so we can discuss ways to potentially improve 
> VectorAssembler. Please let me know if there are any issues I have not 
> considered or potential fixes I haven't outlined. I'm happy to submit a patch 
> once I know which strategy is the best approach.
> Potential fixes
> 1) Replace VectorAssembler with an estimator/model pair like was recently 
> done with OneHotEncoder, 
> [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The 
> Estimator can "learn" the size of the inputs vectors during training and save 
> it to use during prediction.
> Pros:
> * Possibly simplest of the potential fixes
> Cons:
> * We'll need to deprecate current VectorAssembler
> 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty 
> major change, but it could be done in stages. We could first ensure that 
> metadata is not used during prediction and allow the VectorAssembler to drop 
> metadata for streaming dataframes. Going forward, it would be important to 
> not use any metadata on Vector columns for any prediction tasks.
> Pros:
> * Potentially, easy short term fix for VectorAssembler
> * Current Attributes implementation is also causing other issues, eg 
> [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].
> Cons:
> * To fully remove ML Attributes would be a major refactor of MLlib and would 
> most likely require breaking changes.
> * A partial removal of ML attributes (eg: ensure ML attributes are not used 
> during transform, only during fit) might be tricky. This would require 
> testing or other enforcement mechanism to prevent regressions.
> 3) Require Vector columns to have fixed length vectors. Most mllib 
> transformers that produce vectors already include the size of the vector in 
> the column metadata. This change would be to deprecate APIs that allow 
> creating a vector column of unknown length and replace those APIs with 
> equivalents that enforce a fixed size.
> Pros:
> * We already treat vectors as fixed size, for example VectorAssembler assumes 
> the input & output cols are fixed-size vectors and creates metadata 
> accordingly. In the spirit of explicit is better than implicit, we would be 
> codifying something we already assume.
> * This could potentially enable performance optimizations that are only 
> possible if the Vector size of a column is fixed & known.
> Cons:
> * This would require breaking changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22331) Strength consistency for supporting string params: case-insensitive or not

2017-10-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214747#comment-16214747
 ] 

Nick Pentreath commented on SPARK-22331:


I can't think of any examples offhand where case sensitivity is required, so it 
would make sense to have all params case insensitive if possible. But of course 
it should not break current behavior or break save/load of existing models.

> Strength consistency for supporting string params: case-insensitive or not
> --
>
> Key: SPARK-22331
> URL: https://issues.apache.org/jira/browse/SPARK-22331
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> Some String params in ML are still case-sensitive, as they are checked by 
> ParamValidators.inArray.
> For consistency in user experience, there should be some general guideline on 
> whether String params in Spark MLlib are case-insensitive or not. 
> I'm leaning towards making all String params case-insensitive where possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-10-17 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207133#comment-16207133
 ] 

Nick Pentreath commented on SPARK-22289:


I think option (2) is the more general fix here.

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20542) Add an API into Bucketizer that can bin a lot of columns all at once

2017-10-11 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-20542:
--

Assignee: Liang-Chi Hsieh

> Add an API into Bucketizer that can bin a lot of columns all at once
> 
>
> Key: SPARK-20542
> URL: https://issues.apache.org/jira/browse/SPARK-20542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>
> Currently, ML's Bucketizer can only bin a single column of continuous features. If a 
> dataset has thousands of continuous columns that need to be binned, we end up 
> with thousands of ML stages. This is very inefficient in terms of query planning 
> and execution.
> We should have a type of bucketizer that can bin many columns all at 
> once. It would need to accept a list of arrays of split points corresponding 
> to the columns to bin, but it might make things more efficient by replacing 
> thousands of stages with just one.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10802) Let ALS recommend for subset of data

2017-10-09 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16196673#comment-16196673
 ] 

Nick Pentreath commented on SPARK-10802:


SPARK-20679 has been completed for the new ML API. I've closed this as we won't 
be doing it in the RDD API as mentioned above.

> Let ALS recommend for subset of data
> 
>
> Key: SPARK-10802
> URL: https://issues.apache.org/jira/browse/SPARK-10802
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> Currently MatrixFactorizationModel allows getting recommendations for:
> - a single user 
> - a single product 
> - all users
> - all products
> Recommendations for all users/products do a cartesian join internally.
> It would be useful in some cases to get recommendations for a subset of 
> users/products by providing an RDD with which MatrixFactorizationModel could 
> do an intersection before doing the cartesian join. This would make it much 
> faster in situations where recommendations are needed only for a subset of 
> users/products, and where the subset is still too large to make it feasible to 
> recommend one-by-one.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10802) Let ALS recommend for subset of data

2017-10-09 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-10802.

Resolution: Won't Fix

> Let ALS recommend for subset of data
> 
>
> Key: SPARK-10802
> URL: https://issues.apache.org/jira/browse/SPARK-10802
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> Currently MatrixFactorizationModel allows getting recommendations for:
> - a single user 
> - a single product 
> - all users
> - all products
> Recommendations for all users/products do a cartesian join internally.
> It would be useful in some cases to get recommendations for a subset of 
> users/products by providing an RDD with which MatrixFactorizationModel could 
> do an intersection before doing the cartesian join. This would make it much 
> faster in situations where recommendations are needed only for a subset of 
> users/products, and where the subset is still too large to make it feasible to 
> recommend one-by-one.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20679) Let ML ALS recommend for a subset of users/items

2017-10-09 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-20679:
--

Assignee: Nick Pentreath

> Let ML ALS recommend for a subset of users/items
> 
>
> Key: SPARK-20679
> URL: https://issues.apache.org/jira/browse/SPARK-20679
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
> Fix For: 2.3.0
>
>
> SPARK-10802 is for {{mllib}}'s {{MatrixFactorizationModel}} to recommend for 
> a subset of user or item factors.
> Since {{mllib}} is in maintenance mode and {{ml}}'s {{ALSModel}} now supports 
> the {{recommendForAllX}} methods, this ticket tracks adding this 
> functionality to {{ALSModel}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20679) Let ML ALS recommend for a subset of users/items

2017-10-09 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-20679.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18748
[https://github.com/apache/spark/pull/18748]

> Let ML ALS recommend for a subset of users/items
> 
>
> Key: SPARK-20679
> URL: https://issues.apache.org/jira/browse/SPARK-20679
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>    Reporter: Nick Pentreath
> Fix For: 2.3.0
>
>
> SPARK-10802 is for {{mllib}}'s {{MatrixFactorizationModel}} to recommend for 
> a subset of user or item factors.
> Since {{mllib}} is in maintenance mode and {{ml}}'s {{ALSModel}} now supports 
> the {{recommendForAllX}} methods, this ticket tracks adding this 
> functionality to {{ALSModel}}.
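
For illustration, a minimal sketch of the subset-recommendation usage this adds
(assuming the methods follow the existing {{recommendForAllUsers}}/{{recommendForAllItems}}
naming, i.e. {{recommendForUserSubset}}; {{ratings}} is a pre-existing DataFrame):

{code:java}
// Sketch only: recommendForUserSubset is assumed to mirror recommendForAllUsers.
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
val model = als.fit(ratings)

// Recommend the top 10 items for just a subset of users, instead of all users.
val someUsers = ratings.select("userId").distinct().limit(100)
val topItems = model.recommendForUserSubset(someUsers, 10)
{code}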



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22115) Add operator for linalg Matrix and Vector

2017-10-08 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16196288#comment-16196288
 ] 

Nick Pentreath commented on SPARK-22115:


Best keep it private for now.

There's been a lot of discussion around the issue of Spark providing a linear 
algebra lib and the consensus is generally that it's a huge amount of overhead 
for Spark to maintain a full-blown linear algebra lib. 

https://issues.apache.org/jira/browse/SPARK-6442 and 
https://issues.apache.org/jira/browse/SPARK-16365 

> Add operator for linalg Matrix and Vector
> -
>
> Key: SPARK-22115
> URL: https://issues.apache.org/jira/browse/SPARK-22115
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Peng Meng
>
> For example, there is a lot of code in LDA like this:
> {code:java}
> phiNorm := expElogbetad * expElogthetad +:+ 1e-100
> {code}
> expElogbetad is a breeze Matrix and expElogthetad is a breeze Vector. 
> This code will call a BLAS GEMV and then loop over the result (for the :+ 1e-100).
> Actually, this can be done with only GEMV, because the standard interface of 
> gemv is: 
> gemv(alpha, A, x, beta, y) // y := alpha*A*x + beta*y
> We can provide some operators (e.g. element-wise product (:*), element-wise 
> sum (:+)) for Spark linalg Matrix and Vector, and replace breeze Matrix and 
> Vector with Spark linalg Matrix and Vector. 
> Then for all cases of the form y = alpha*A*x + beta*y, we can call GEMM or GEMV 
> directly. 
> There is no need to call GEMM or GEMV and then loop over the result (for the add) as in the 
> current implementation. 
> I can help to do it if we plan to add this feature.  
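
To make the fused-call idea concrete, a small sketch against the raw BLAS interface
(assuming the netlib-java BLAS that Spark and Breeze delegate to is on the classpath):

{code:java}
// Pre-fill y with 1e-100 and use beta = 1.0, so y := 1.0*A*x + 1.0*y folds the
// "+ 1e-100" into the single GEMV call, with no separate loop over the result.
import com.github.fommil.netlib.BLAS

val blas = BLAS.getInstance()
val m = 3
val n = 2
val a = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)  // column-major m x n matrix
val x = Array(1.0, 1.0)
val y = Array.fill(m)(1e-100)                // plays the role of the "+ 1e-100" term

blas.dgemv("N", m, n, 1.0, a, m, x, 1, 1.0, y, 1)  // y := alpha*A*x + beta*y
{code}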



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-06 Thread Nick Pentreath
Ah yes - I recall that it was fixed. Forgot it was for 2.3.0

My +1 vote stands.

On Fri, 6 Oct 2017 at 15:15 Hyukjin Kwon  wrote:

> Hi Nick,
>
> I believe that R test failure is due to SPARK-21093 - at least the error
> message looks the same - and that is fixed from 2.3.0 onwards. This was not
> backported because I and the reviewers were worried, as that fix touched a very
> core part of SparkR (it was even reverted once after a very close look by some
> reviewers).
>
> I asked Michael to note this as a known issue in
> https://spark.apache.org/releases/spark-release-2-2-0.html#known-issues
> before for this reason.
> I believe it should be fine, and we should probably note it if possible. I
> believe this should not be a regression anyway since, if I understood
> correctly, it has been there from the very beginning.
>
> Thanks.
>
>
>
>
> 2017-10-06 21:20 GMT+09:00 Nick Pentreath :
>
>> Checked sigs & hashes.
>>
>> Tested on RHEL
>> build/mvn -Phadoop-2.7 -Phive -Pyarn test passed
>> Python tests passed
>>
>> I ran R tests and am getting some failures:
>> https://gist.github.com/MLnick/ddf4d531d5125208771beee0cc9c697e (I seem
>> to recall similar issues on a previous release but I thought it was fixed).
>>
>> I re-ran R tests on an Ubuntu box to double check and they passed there.
>>
>> So I'd still +1 the release
>>
>> Perhaps someone can take a look at the R failures on RHEL just in case
>> though.
>>
>>
>> On Fri, 6 Oct 2017 at 05:58 vaquar khan  wrote:
>>
>>> +1 (non binding ) tested on Ubuntu ,all test case  are passed.
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Thu, Oct 5, 2017 at 10:46 PM, Hyukjin Kwon 
>>> wrote:
>>>
>>>> +1 too.
>>>>
>>>>
>>>> On 6 Oct 2017 10:49 am, "Reynold Xin"  wrote:
>>>>
>>>> +1
>>>>
>>>>
>>>> On Mon, Oct 2, 2017 at 11:24 PM, Holden Karau 
>>>> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.1.2. The vote is open until Saturday October 7th at 9:00
>>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>>
>>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v2.1.2-rc4
>>>>> <https://github.com/apache/spark/tree/v2.1.2-rc4> (
>>>>> 2abaea9e40fce81cd4626498e0f5c28a70917499)
>>>>>
>>>>> List of JIRA tickets resolved in this release can be found with this
>>>>> filter.
>>>>> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.2>
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
>>>>>
>>>>> Release artifacts are signed with a key from:
>>>>> https://people.apache.org/~holden/holdens_keys.asc
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1252
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
>>>>>
>>>>>
>>>>> *FAQ*
>>>>>
>>>>> *How can I help test this release?*
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>> the current RC and see if anything important breaks, in the
>>>>> Java/Scala you can add the staging repository to your projects resolvers
>>>>> and test with the RC (make sure to clean up the artifact cache
>>>>> before/after so you don't end up building with a out of date RC going
>>>>> forward).
>>>>>
>>>>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>>>>
&

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-06 Thread Nick Pentreath
Checked sigs & hashes.

Tested on RHEL
build/mvn -Phadoop-2.7 -Phive -Pyarn test passed
Python tests passed

I ran R tests and am getting some failures:
https://gist.github.com/MLnick/ddf4d531d5125208771beee0cc9c697e (I seem to
recall similar issues on a previous release but I thought it was fixed).

I re-ran R tests on an Ubuntu box to double check and they passed there.

So I'd still +1 the release

Perhaps someone can take a look at the R failures on RHEL just in case
though.


On Fri, 6 Oct 2017 at 05:58 vaquar khan  wrote:

> +1 (non binding ) tested on Ubuntu ,all test case  are passed.
>
> Regards,
> Vaquar khan
>
> On Thu, Oct 5, 2017 at 10:46 PM, Hyukjin Kwon  wrote:
>
>> +1 too.
>>
>>
>> On 6 Oct 2017 10:49 am, "Reynold Xin"  wrote:
>>
>> +1
>>
>>
>> On Mon, Oct 2, 2017 at 11:24 PM, Holden Karau 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.2. The vote is open until Saturday October 7th at 9:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.2-rc4
>>>  (
>>> 2abaea9e40fce81cd4626498e0f5c28a70917499)
>>>
>>> List of JIRA tickets resolved in this release can be found with this
>>> filter.
>>> 
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
>>>
>>> Release artifacts are signed with a key from:
>>> https://people.apache.org/~holden/holdens_keys.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1252
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test with
>>> the RC (make sure to clean up the artifact cache before/after so you
>>> don't end up building with a out of date RC going forward).
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.3.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1. That being said
>>> if there is something which is a regression from 2.1.1 that has not
>>> been correctly targeted please ping a committer to help target the issue
>>> (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>>> 
>>> )
>>>
>>> *What are the unresolved* issues targeted for 2.1.2
>>> 
>>> ?
>>>
>>> At this time there are no open unresolved issues.
>>>
>>> *Is there anything different about this release?*
>>>
>>> This is the first release in awhile not built on the AMPLAB Jenkins.
>>> This is good because it means future releases can more easily be built and
>>> signed securely (and I've been updating the documentation in
>>> https://github.com/apache/spark-website/pull/66 as I progress), however
>>> the chances of a mistake are higher with any change like this. If there
>>> something you normally take for granted as correct when checking a release,
>>> please double check this time :)
>>>
>>> *Should I be committing code to branch-2.1?*
>>>
>>> Thanks for asking! Please treat this stage in the RC process as "code
>>> freeze" so bug fixes only. If you're uncertain if something should be back
>>> ported please reach out. If you do commit to branch-2.1 please tag your
>>> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the
>>> 2.1.3 fixed into 2.1.2 as appropriate.
>>>
>>> *What happened to RC3?*
>>>
>>> Some R+zinc interactions kept it from getting out the door.
>>> --
>>> Twitter

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-04 Thread Nick Pentreath
Ah right! Was using a new cloud instance and didn't realize I was logged in
as root! thanks

On Tue, 3 Oct 2017 at 21:13 Marcelo Vanzin  wrote:

> Maybe you're running as root (or the admin account on your OS)?
>
> On Tue, Oct 3, 2017 at 12:12 PM, Nick Pentreath
>  wrote:
> > Hmm I'm consistently getting this error in core tests:
> >
> > - SPARK-3697: ignore directories that cannot be read. *** FAILED ***
> >   2 was not equal to 1 (FsHistoryProviderSuite.scala:146)
> >
> >
> > Anyone else? Any insight? Perhaps it's my set up.
> >
> >>>
> >>>
> >>> On Tue, Oct 3, 2017 at 7:24 AM Holden Karau 
> wrote:
> >>>>
> >>>> Please vote on releasing the following candidate as Apache Spark
> version
> >>>> 2.1.2. The vote is open until Saturday October 7th at 9:00 PST and
> passes if
> >>>> a majority of at least 3 +1 PMC votes are cast.
> >>>>
> >>>> [ ] +1 Release this package as Apache Spark 2.1.2
> >>>> [ ] -1 Do not release this package because ...
> >>>>
> >>>>
> >>>> To learn more about Apache Spark, please see
> https://spark.apache.org/
> >>>>
> >>>> The tag to be voted on is v2.1.2-rc4
> >>>> (2abaea9e40fce81cd4626498e0f5c28a70917499)
> >>>>
> >>>> List of JIRA tickets resolved in this release can be found with this
> >>>> filter.
> >>>>
> >>>> The release files, including signatures, digests, etc. can be found
> at:
> >>>> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
> >>>>
> >>>> Release artifacts are signed with a key from:
> >>>> https://people.apache.org/~holden/holdens_keys.asc
> >>>>
> >>>> The staging repository for this release can be found at:
> >>>>
> https://repository.apache.org/content/repositories/orgapachespark-1252
> >>>>
> >>>> The documentation corresponding to this release can be found at:
> >>>> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
> >>>>
> >>>>
> >>>> FAQ
> >>>>
> >>>> How can I help test this release?
> >>>>
> >>>> If you are a Spark user, you can help us test this release by taking
> an
> >>>> existing Spark workload and running on this release candidate, then
> >>>> reporting any regressions.
> >>>>
> >>>> If you're working in PySpark you can set up a virtual env and install
> >>>> the current RC and see if anything important breaks, in the
> Java/Scala you
> >>>> can add the staging repository to your projects resolvers and test
> with the
> >>>> RC (make sure to clean up the artifact cache before/after so you
> don't end
> >>>> up building with a out of date RC going forward).
> >>>>
> >>>> What should happen to JIRA tickets still targeting 2.1.2?
> >>>>
> >>>> Committers should look at those and triage. Extremely important bug
> >>>> fixes, documentation, and API tweaks that impact compatibility should
> be
> >>>> worked on immediately. Everything else please retarget to 2.1.3.
> >>>>
> >>>> But my bug isn't fixed!??!
> >>>>
> >>>> In order to make timely releases, we will typically not hold the
> release
> >>>> unless the bug in question is a regression from 2.1.1. That being
> said if
> >>>> there is something which is a regression from 2.1.1 that has not been
> >>>> correctly targeted please ping a committer to help target the issue
> (you can
> >>>> see the open issues listed as impacting Spark 2.1.1 & 2.1.2)
> >>>>
> >>>> What are the unresolved issues targeted for 2.1.2?
> >>>>
> >>>> At this time there are no open unresolved issues.
> >>>>
> >>>> Is there anything different about this release?
> >>>>
> >>>> This is the first release in awhile not built on the AMPLAB Jenkins.
> >>>> This is good because it means future releases can more easily be
> built and
> >>>> signed securely (and I've been updating the documentation in
> >>>> https://github.com/apache/spark-website/pull/66 as I progress),
> however the
> >>>> chances of a mistake are higher with any change like this. If there
> >>>> something you normally take for granted as correct when checking a
> release,
> >>>> please double check this time :)
> >>>>
> >>>> Should I be committing code to branch-2.1?
> >>>>
> >>>> Thanks for asking! Please treat this stage in the RC process as "code
> >>>> freeze" so bug fixes only. If you're uncertain if something should be
> back
> >>>> ported please reach out. If you do commit to branch-2.1 please tag
> your JIRA
> >>>> issue fix version for 2.1.3 and if we cut another RC I'll move the
> 2.1.3
> >>>> fixed into 2.1.2 as appropriate.
> >>>>
> >>>> What happened to RC3?
> >>>>
> >>>> Some R+zinc interactions kept it from getting out the door.
> >>>> --
> >>>> Twitter: https://twitter.com/holdenkarau
> >>
> >>
> >
>
>
>
> --
> Marcelo
>


Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-03 Thread Nick Pentreath
Hmm I'm consistently getting this error in core tests:

- SPARK-3697: ignore directories that cannot be read. *** FAILED ***
  2 was not equal to 1 (FsHistoryProviderSuite.scala:146)


Anyone else? Any insight? Perhaps it's my set up.


>>
>> On Tue, Oct 3, 2017 at 7:24 AM Holden Karau  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.2. The vote is open until Saturday October 7th at 9:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.2-rc4
>>>  (
>>> 2abaea9e40fce81cd4626498e0f5c28a70917499)
>>>
>>> List of JIRA tickets resolved in this release can be found with this
>>> filter.
>>> 
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
>>>
>>> Release artifacts are signed with a key from:
>>> https://people.apache.org/~holden/holdens_keys.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1252
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test with
>>> the RC (make sure to clean up the artifact cache before/after so you
>>> don't end up building with a out of date RC going forward).
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.3.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1. That being said
>>> if there is something which is a regression from 2.1.1 that has not
>>> been correctly targeted please ping a committer to help target the issue
>>> (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>>> 
>>> )
>>>
>>> *What are the unresolved* issues targeted for 2.1.2
>>> 
>>> ?
>>>
>>> At this time there are no open unresolved issues.
>>>
>>> *Is there anything different about this release?*
>>>
>>> This is the first release in awhile not built on the AMPLAB Jenkins.
>>> This is good because it means future releases can more easily be built and
>>> signed securely (and I've been updating the documentation in
>>> https://github.com/apache/spark-website/pull/66 as I progress), however
>>> the chances of a mistake are higher with any change like this. If there
>>> something you normally take for granted as correct when checking a release,
>>> please double check this time :)
>>>
>>> *Should I be committing code to branch-2.1?*
>>>
>>> Thanks for asking! Please treat this stage in the RC process as "code
>>> freeze" so bug fixes only. If you're uncertain if something should be back
>>> ported please reach out. If you do commit to branch-2.1 please tag your
>>> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the
>>> 2.1.3 fixed into 2.1.2 as appropriate.
>>>
>>> *What happened to RC3?*
>>>
>>> Some R+zinc interactions kept it from getting out the door.
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>


[jira] [Commented] (SPARK-22115) Add operator for linalg Matrix and Vector

2017-10-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187740#comment-16187740
 ] 

Nick Pentreath commented on SPARK-22115:


Do we plan to make this private? Or are you suggesting to expose a new public 
API here?

> Add operator for linalg Matrix and Vector
> -
>
> Key: SPARK-22115
> URL: https://issues.apache.org/jira/browse/SPARK-22115
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Peng Meng
>
> For example, there is a lot of code in LDA like this:
> {code:java}
> phiNorm := expElogbetad * expElogthetad +:+ 1e-100
> {code}
> expElogbetad is a breeze Matrix and expElogthetad is a breeze Vector. 
> This code will call a BLAS GEMV and then loop over the result (for the :+ 1e-100).
> Actually, this can be done with only GEMV, because the standard interface of 
> gemv is: 
> gemv(alpha, A, x, beta, y) // y := alpha*A*x + beta*y
> We can provide some operators (e.g. element-wise product (:*), element-wise 
> sum (:+)) for Spark linalg Matrix and Vector, and replace breeze Matrix and 
> Vector with Spark linalg Matrix and Vector. 
> Then for all cases of the form y = alpha*A*x + beta*y, we can call GEMM or GEMV 
> directly. 
> There is no need to call GEMM or GEMV and then loop over the result (for the add) as in the 
> current implementation. 
> I can help to do it if we plan to add this feature.  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Should Flume integration be behind a profile?

2017-10-02 Thread Nick Pentreath
I'd agree with #1 or #2. Deprecation now seems fine.

Perhaps this should be raised on the user list also?

And perhaps it makes sense to look at moving the Flume support into Apache
Bahir if there is interest (I've cc'ed Bahir dev list here)? That way the
current state of the connector could keep going for those users who may
need it.

As for examples, for the Kinesis connector the examples now live in the
subproject (see e.g. KinesisWordCountASL under external/kinesis-asl). So we
don't have to completely remove the examples, just move them (this may not
solve the doc issue but at least the examples are still there for anyone
who needs them).

On Mon, 2 Oct 2017 at 06:36 Mridul Muralidharan  wrote:

> I agree, proposal 1 sounds better among the options.
>
> Regards,
> Mridul
>
>
> On Sun, Oct 1, 2017 at 3:50 PM, Reynold Xin  wrote:
> > Probably should do 1, and then it is an easier transition in 3.0.
> >
> > On Sun, Oct 1, 2017 at 1:28 AM Sean Owen  wrote:
> >>
> >> I tried and failed to do this in
> >> https://issues.apache.org/jira/browse/SPARK-22142 because it became
> clear
> >> that the Flume examples would have to be removed to make this work, too.
> >> (Well, you can imagine other solutions with extra source dirs or
> modules for
> >> flume examples enabled by a profile, but that doesn't help the docs and
> is
> >> nontrivial complexity for little gain.)
> >>
> >> It kind of suggests Flume support should be deprecated if it's put
> behind
> >> a profile. Like with Kafka 0.8. (This is why I'm raising it again to the
> >> whole list.)
> >>
> >> Any preferences among:
> >> 1. Put Flume behind a profile, remove examples, deprecate
> >> 2. Put Flume behind a profile, remove examples, but don't deprecate
> >> 3. Punt until Spark 3.0, when this integration would probably be removed
> >> entirely (?)
> >>
> >> On Tue, Sep 26, 2017 at 10:36 AM Sean Owen  wrote:
> >>>
> >>> Not a big deal, but I'm wondering whether Flume integration should at
> >>> least be opt-in and behind a profile? it still sees some use (at least
> on
> >>> our end) but not applicable to the majority of users. Most other
> third-party
> >>> framework integrations are behind a profile, like YARN, Mesos, Kinesis,
> >>> Kafka 0.8, Docker. Just soliciting comments, not arguing for it.
> >>>
> >>> (Well, actually it annoys me that the Flume integration always fails to
> >>> compile in IntelliJ unless you generate the sources manually)
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Welcoming Tejas Patil as a Spark committer

2017-09-30 Thread Nick Pentreath
Congratulations!



>>
>> Matei Zaharia wrote
>> > Hi all,
>> >
>> > The Spark PMC recently added Tejas Patil as a committer on the
>> > project. Tejas has been contributing across several areas of Spark for
>> > a while, focusing especially on scalability issues and SQL. Please
>> > join me in welcoming Tejas!
>> >
>> > Matei
>> >
>> > -
>> > To unsubscribe e-mail:
>>
>> > dev-unsubscribe@.apache
>>
>>
>>
>>
>>
>> -
>> Liang-Chi Hsieh | @viirya
>> Spark Technology Center
>> http://www.spark.tc/
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: How to run MLlib's word2vec in CBOW mode?

2017-09-28 Thread Nick Pentreath
MLlib currently doesn't support CBOW - there is an open PR for it (see
https://issues.apache.org/jira/browse/SPARK-20372).

On Thu, 28 Sep 2017 at 09:56 pun  wrote:

> Hello,
> My understanding is that word2vec can be run in two modes:
>
>- continuous bag-of-words (CBOW) (order of words does not matter)
>- continuous skip-gram (order of words matters)
>
> I would like to run the *CBOW* implementation from Spark's MLlib, but it
> is not clear to me from the documentation and their example how to do it.
> This is the example listed on their page. From:
> https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html#example
>
> import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
>
> val input = sc.textFile("data/mllib/sample_lda_data.txt").map(line => 
> line.split(" ").toSeq)
>
> val word2vec = new Word2Vec()
>
> val model = word2vec.fit(input)
>
> val synonyms = model.findSynonyms("1", 5)
>
> for((synonym, cosineSimilarity) <- synonyms) {
>   println(s"$synonym $cosineSimilarity")
> }
>
> *My questions:*
>
>- Which of the two modes does this example use?
>- Do you know how I can run the model in the CBOW mode?
>
> Thanks in advance!
> --
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-09-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16180073#comment-16180073
 ] 

Nick Pentreath commented on SPARK-13030:


Yes, it definitely needs to support multiple columns. [~viirya] or I may also have 
bandwidth for this.

> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when the number of categories 
> differs between the training dataset and the test dataset.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-09-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16177939#comment-16177939
 ] 

Nick Pentreath commented on SPARK-13030:


It's ugly, but we can introduce a new class {{OneHotEncoderEstimator extends 
Estimator[_]}} whose fit method returns a {{OneHotEncoderModel}}. We then 
deprecate {{OneHotEncoder}}, and in {{3.0}} we rename {{OneHotEncoderEstimator 
-> OneHotEncoder}} and remove the old {{OneHotEncoder}}?

Alternatively, we can just create a new name such as {{CategoryEncoder}} or 
{{CategoricalEncoder}} and deprecate the old {{OneHotEncoder}}.
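A rough sketch of the estimator/model split being proposed (illustrative only - 
names, params and method bodies are not a final API):

{code}
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// fit() scans the input column(s) to learn the category sizes once, so
// transform() no longer depends on per-dataset column metadata.
class OneHotEncoderEstimator(override val uid: String)
    extends Estimator[OneHotEncoderModel] {
  def this() = this(Identifiable.randomUID("oneHotEncoderEst"))
  override def fit(dataset: Dataset[_]): OneHotEncoderModel = ???
  override def transformSchema(schema: StructType): StructType = ???
  override def copy(extra: ParamMap): OneHotEncoderEstimator = ???
}

// The fitted model carries the learned category sizes, so it encodes test
// data consistently even when a category is absent from the test dataset.
class OneHotEncoderModel(
    override val uid: String,
    val categorySizes: Array[Int]) extends Model[OneHotEncoderModel] {
  override def transform(dataset: Dataset[_]): DataFrame = ???
  override def transformSchema(schema: StructType): StructType = ???
  override def copy(extra: ParamMap): OneHotEncoderModel = ???
}
{code}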

> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when number of categories is 
> different between training dataset and test dataset.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22061) Add pipeline model of SVM

2017-09-21 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-22061.

Resolution: Won't Fix

> Add pipeline model of SVM
> -
>
> Key: SPARK-22061
> URL: https://issues.apache.org/jira/browse/SPARK-22061
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Jiaming Shu
>
> add pipeline implementation of SVM



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22061) Add pipeline model of SVM

2017-09-21 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16175124#comment-16175124
 ] 

Nick Pentreath commented on SPARK-22061:


Agreed, this already exists. I closed this issue.

> Add pipeline model of SVM
> -
>
> Key: SPARK-22061
> URL: https://issues.apache.org/jira/browse/SPARK-22061
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Jiaming Shu
>
> add pipeline implementation of SVM



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21958) Attempting to save large Word2Vec model hangs driver in constant GC.

2017-09-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-21958:
--

Assignee: Travis Hegner

> Attempting to save large Word2Vec model hangs driver in constant GC.
> 
>
> Key: SPARK-21958
> URL: https://issues.apache.org/jira/browse/SPARK-21958
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
> Environment: Running spark on yarn, hadoop 2.7.2 provided by the 
> cluster
>Reporter: Travis Hegner
>Assignee: Travis Hegner
>  Labels: easyfix, patch, performance
> Fix For: 2.3.0
>
>
> In the new version of Word2Vec, the model saving was modified to estimate an 
> appropriate number of partitions based on the kryo buffer size. This is a 
> great improvement, but there is a caveat for very large models.
> The {{(word, vector)}} tuple goes through a transformation to a local case 
> class of {{Data(word, vector)}}... I can only assume this is for the kryo 
> serialization process. The new version of the code iterates over the entire 
> vocabulary to do this transformation (the old version wrapped the entire 
> datum) in the driver's heap, only to have the result then distributed to the 
> cluster to be written into its parquet files.
> With extremely large vocabularies (~2 million docs, with uni-grams, bi-grams, 
> and tri-grams), that local driver transformation is causing the driver to 
> hang indefinitely in GC as I can only assume that it's generating millions of 
> short lived objects which can't be evicted fast enough.
> Perhaps I'm overlooking something, but it seems to me that since the result 
> is distributed over the cluster to be saved _after_ the transformation 
> anyway, we may as well distribute it _first_, allowing the cluster resources 
> to do the transformation more efficiently, and then write the parquet file 
> from there.
> I have a patch implemented, and am in the process of testing it at scale. I 
> will open a pull request when I feel that the patch is successfully resolving 
> the issue, and after making sure that it passes unit tests.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21958) Attempting to save large Word2Vec model hangs driver in constant GC.

2017-09-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-21958.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19191
[https://github.com/apache/spark/pull/19191]

> Attempting to save large Word2Vec model hangs driver in constant GC.
> 
>
> Key: SPARK-21958
> URL: https://issues.apache.org/jira/browse/SPARK-21958
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
> Environment: Running spark on yarn, hadoop 2.7.2 provided by the 
> cluster
>Reporter: Travis Hegner
>  Labels: easyfix, patch, performance
> Fix For: 2.3.0
>
>
> In the new version of Word2Vec, the model saving was modified to estimate an 
> appropriate number of partitions based on the kryo buffer size. This is a 
> great improvement, but there is a caveat for very large models.
> The {{(word, vector)}} tuple goes through a transformation to a local case 
> class of {{Data(word, vector)}}... I can only assume this is for the kryo 
> serialization process. The new version of the code iterates over the entire 
> vocabulary to do this transformation (the old version wrapped the entire 
> datum) in the driver's heap, only to have the result then distributed to the 
> cluster to be written into its parquet files.
> With extremely large vocabularies (~2 million docs, with uni-grams, bi-grams, 
> and tri-grams), that local driver transformation is causing the driver to 
> hang indefinitely in GC as I can only assume that it's generating millions of 
> short lived objects which can't be evicted fast enough.
> Perhaps I'm overlooking something, but it seems to me that since the result 
> is distributed over the cluster to be saved _after_ the transformation 
> anyway, we may as well distribute it _first_, allowing the cluster resources 
> to do the transformation more efficiently, and then write the parquet file 
> from there.
> I have a patch implemented, and am in the process of testing it at scale. I 
> will open a pull request when I feel that the patch is successfully resolving 
> the issue, and after making sure that it passes unit tests.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe

2017-09-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167806#comment-16167806
 ] 

Nick Pentreath commented on SPARK-22021:


Why a JavaScript function? I think this is not a good fit to go into Spark ML 
core. You can easily have this as an external library or Spark package.

We are looking at potentially a transformer for generic Scala functions in 
SPARK-20271

> Add a feature transformation to accept a function and apply it on all rows of 
> dataframe
> ---
>
> Key: SPARK-22021
> URL: https://issues.apache.org/jira/browse/SPARK-22021
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Hosur Narahari
>
> We often generate derived features in an ML pipeline by applying a 
> mathematical or other operation to columns of a dataframe - for example, 
> summing a few columns into a new column, or computing the length of a text 
> message field. We currently don't have an efficient way to handle such 
> scenarios in an ML pipeline.
> By providing a transformer that accepts a function and applies it to the 
> specified columns to generate a numerical output column, the user gains the 
> flexibility to derive features using any domain-specific logic.
> Example:
> val function = "function(a,b) { return a+b;}"
> val transformer = new GenFuncTransformer().setInputCols(Array("v1", 
> "v2")).setOutputCol("result").setFunction(function)
> val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2")
> val result = transformer.transform(df)
> result.show
> v1   v2  result
> 1.0 2.0 3.0
> 3.0 4.0 7.0
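For what it's worth, a simple derived column like the one in the example above 
can already be expressed inside a pipeline with the existing {{SQLTransformer}} 
(a sketch, not the proposed API; assumes {{spark.implicits._}} is in scope, as 
in spark-shell):

{code}
import org.apache.spark.ml.feature.SQLTransformer

val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2")

// __THIS__ is SQLTransformer's placeholder for the input DataFrame
val sqlTrans = new SQLTransformer()
  .setStatement("SELECT *, v1 + v2 AS result FROM __THIS__")

sqlTrans.transform(df).show()
// +---+---+------+
// | v1| v2|result|
// +---+---+------+
// |1.0|2.0|   3.0|
// |3.0|4.0|   7.0|
// +---+---+------+
{code}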



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21958) Attempting to save large Word2Vec model hangs driver in constant GC.

2017-09-11 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160862#comment-16160862
 ] 

Nick Pentreath commented on SPARK-21958:


Seems like your proposal could improve things - but yeah let's see what your 
testing results are.

> Attempting to save large Word2Vec model hangs driver in constant GC.
> 
>
> Key: SPARK-21958
> URL: https://issues.apache.org/jira/browse/SPARK-21958
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
> Environment: Running spark on yarn, hadoop 2.7.2 provided by the 
> cluster
>Reporter: Travis Hegner
>  Labels: easyfix, patch, performance
>
> In the new version of Word2Vec, the model saving was modified to estimate an 
> appropriate number of partitions based on the kryo buffer size. This is a 
> great improvement, but there is a caveat for very large models.
> The {{(word, vector)}} tuple goes through a transformation to a local case 
> class of {{Data(word, vector)}}... I can only assume this is for the kryo 
> serialization process. The new version of the code iterates over the entire 
> vocabulary to do this transformation (the old version wrapped the entire 
> datum) in the driver's heap, only to have the result then distributed to the 
> cluster to be written into its parquet files.
> With extremely large vocabularies (~2 million docs, with uni-grams, bi-grams, 
> and tri-grams), that local driver transformation is causing the driver to 
> hang indefinitely in GC as I can only assume that it's generating millions of 
> short lived objects which can't be evicted fast enough.
> Perhaps I'm overlooking something, but it seems to me that since the result 
> is distributed over the cluster to be saved _after_ the transformation 
> anyway, we may as well distribute it _first_, allowing the cluster resources 
> to do the transformation more efficiently, and then write the parquet file 
> from there.
> I have a patch implemented, and am in the process of testing it at scale. I 
> will open a pull request when I feel that the patch is successfully resolving 
> the issue, and after making sure that it passes unit tests.
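A rough sketch of the "distribute first, then transform" idea (hypothetical 
names - {{Data}}, {{wordVectors}} and {{numPartitions}} stand in for the 
private fields used by the actual save path, not the real implementation):

{code}
import org.apache.spark.sql.SparkSession

case class Data(word: String, vector: Array[Float])

def saveWordVectors(
    spark: SparkSession,
    wordVectors: Map[String, Array[Float]],
    numPartitions: Int,
    path: String): Unit = {
  import spark.implicits._
  // Parallelize the raw (word, vector) pairs first, so the conversion to the
  // serializable Data case class happens on the executors instead of building
  // millions of short-lived objects in the driver heap before distributing.
  spark.sparkContext
    .parallelize(wordVectors.toSeq, numPartitions)
    .map { case (word, vector) => Data(word, vector) }
    .toDF()
    .write.parquet(path)
}
{code}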



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19357) Parallel Model Evaluation for ML Tuning: Scala

2017-09-06 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-19357:
--

Assignee: Bryan Cutler

> Parallel Model Evaluation for ML Tuning: Scala
> --
>
> Key: SPARK-19357
> URL: https://issues.apache.org/jira/browse/SPARK-19357
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
>
> This is a first step of the parent task of Optimizations for ML Pipeline 
> Tuning to perform model evaluation in parallel.  A simple approach is to 
> naively evaluate with a possible parameter to control the level of 
> parallelism.  There are some concerns with this:
> * excessive caching of datasets
> * what to set as the default value for level of parallelism.  1 will evaluate 
> all models in serial, as is done currently. Higher values could lead to 
> excessive caching.
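A sketch of the resulting user-facing API (Spark 2.3.0+):

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEstimatorParamMaps(grid)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setNumFolds(3)
  .setParallelism(4)  // 1 (the default) keeps the previous serial behaviour
{code}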



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19357) Parallel Model Evaluation for ML Tuning: Scala

2017-09-06 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19357.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 16774
[https://github.com/apache/spark/pull/16774]

> Parallel Model Evaluation for ML Tuning: Scala
> --
>
> Key: SPARK-19357
> URL: https://issues.apache.org/jira/browse/SPARK-19357
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Bryan Cutler
> Fix For: 2.3.0
>
>
> This is a first step of the parent task of Optimizations for ML Pipeline 
> Tuning to perform model evaluation in parallel.  A simple approach is to 
> naively evaluate with a possible parameter to control the level of 
> parallelism.  There are some concerns with this:
> * excessive caching of datasets
> * what to set as the default value for level of parallelism.  1 will evaluate 
> all models in serial, as is done currently. Higher values could lead to 
> excessive caching.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21926) Some transformers in spark.ml.feature fail when trying to transform streaming dataframes

2017-09-05 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154901#comment-16154901
 ] 

Nick Pentreath edited comment on SPARK-21926 at 9/6/17 6:54 AM:


For #2, (a) is definitely the correct solution. By the way that issue impacts 
any prediction-time situation, not just streaming.


was (Author: mlnick):
For #2, (a) is definitely the correct solution.

> Some transformers in spark.ml.feature fail when trying to transform streaming 
> dataframes
> ---
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata.
> Possible fixes:
> a) Re-design vectorUDT metadata to support missing metadata for some 
> elements. (This might be a good thing to do anyways SPARK-19141)
> b) drop metadata in streaming context.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> b) Allow user to set the cardinality of OneHotEncoder.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21926) Some transformers in spark.ml.feature fail when trying to transform streaming dataframes

2017-09-05 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154901#comment-16154901
 ] 

Nick Pentreath commented on SPARK-21926:


For #2, (a) is definitely the correct solution.

> Some transformers in spark.ml.feature fail when trying to transform streaming 
> dataframes
> ---
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata.
> Possible fixes:
> a) Re-design vectorUDT metadata to support missing metadata for some 
> elements. (This might be a good thing to do anyways SPARK-19141)
> b) drop metadata in streaming context.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> b) Allow user to set the cardinality of OneHotEncoder.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15790) Audit @Since annotations in ML

2017-09-05 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15790.

Resolution: Fixed

> Audit @Since annotations in ML
> --
>
> Key: SPARK-15790
> URL: https://issues.apache.org/jira/browse/SPARK-15790
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>
> Many classes & methods in ML are missing {{@Since}} annotations. Audit what's 
> missing and add annotations to public API constructors, vals and methods.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: isCached

2017-09-01 Thread Nick Pentreath
No, unfortunately not - as I recall, storageLevel accesses some private
methods to get the result.

On Fri, 1 Sep 2017 at 17:55, Nathan Kronenfeld
 wrote:

> Ah, in 2.1.0.
>
> I'm in 2.0.1 at the moment... is there any way that works that far back?
>
> On Fri, Sep 1, 2017 at 11:46 AM, Nick Pentreath 
> wrote:
>
>> Dataset does have storageLevel. So you can use isCached = (storageLevel
>> != StorageLevel.NONE) as a test.
>>
>> Arguably isCached could be added to dataset too, shouldn't be a
>> controversial change.
>>
>> On Fri, 1 Sep 2017 at 17:31, Nathan Kronenfeld
>>  wrote:
>>
>>> I'm currently porting some of our code from RDDs to Datasets.
>>>
>>> With RDDs it's pretty easy to figure out if they are cached or not.
>>>
>>> I notice that the catalog has a function for determining this on
>>> Datasets too, but it's private[sql].  Is there any reason for it not to be
>>> public?  Is there any way at the moment to determine if a dataset is cached
>>> or not?
>>>
>>> Thanks in advance
>>>-Nathan Kronenfeld
>>>
>>
>


Re: isCached

2017-09-01 Thread Nick Pentreath
Dataset does have storageLevel. So you can use isCached = (storageLevel !=
StorageLevel.NONE) as a test.

Arguably isCached could be added to dataset too, shouldn't be a
controversial change.
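For example (Spark 2.1+, where Dataset.storageLevel exists):

import org.apache.spark.storage.StorageLevel

val df = spark.range(100).toDF("id")
df.cache()
// storageLevel reflects the persistence level requested for the Dataset
val isCached = df.storageLevel != StorageLevel.NONE  // true after cache()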

On Fri, 1 Sep 2017 at 17:31, Nathan Kronenfeld
 wrote:

> I'm currently porting some of our code from RDDs to Datasets.
>
> With RDDs it's pretty easy to figure out if they are cached or not.
>
> I notice that the catalog has a function for determining this on Datasets
> too, but it's private[sql].  Is there any reason for it not to be public?
> Is there any way at the moment to determine if a dataset is cached or not?
>
> Thanks in advance
>-Nathan Kronenfeld
>


Re: Updates on migration guides

2017-08-30 Thread Nick Pentreath
MLlib has tried quite hard to ensure the migration guide is up to date for
each release. I think generally we catch all breaking and most major
behavior changes.

On Wed, 30 Aug 2017 at 17:02, Dongjoon Hyun  wrote:

> +1
>
> On Wed, Aug 30, 2017 at 7:54 AM, Xiao Li  wrote:
>
>> Hi, Devs,
>>
>> Many questions from the open source community are actually caused by the
>> behavior changes we made in each release. So far, the migration guides
>> (e.g.,
>> https://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide)
>> were not being properly updated. In the last few releases, multiple
>> behavior changes are not documented in migration guides and even release
>> notes. I propose to do the document updates in the same PRs that introduce
>> the behavior changes. If the contributors can't make it, the committers who
>> merge the PRs need to do it instead. We also can create a dedicated page
>> for migration guides of all the components. Hopefully, this can assist the
>> migration efforts.
>>
>> Thanks,
>>
>> Xiao Li
>>
>
>


[jira] [Resolved] (SPARK-21469) Add doc and example for FeatureHasher

2017-08-30 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-21469.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19024
[https://github.com/apache/spark/pull/19024]

> Add doc and example for FeatureHasher
> -
>
> Key: SPARK-21469
> URL: https://issues.apache.org/jira/browse/SPARK-21469
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
> Fix For: 2.3.0
>
>
> Add examples and user guide section for {{FeatureHasher}} in SPARK-13969
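For reference, the kind of minimal usage example the guide section should cover 
(a sketch; assumes a SparkSession named {{spark}}):

{code}
import org.apache.spark.ml.feature.FeatureHasher

// Hash a mix of numeric, boolean and string columns into one feature vector
val dataset = spark.createDataFrame(Seq(
  (2.2, true, "1", "foo"),
  (3.3, false, "2", "bar")
)).toDF("real", "bool", "stringNum", "string")

val hasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum", "string")
  .setOutputCol("features")

hasher.transform(dataset).select("features").show(false)
{code}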



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21469) Add doc and example for FeatureHasher

2017-08-30 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-21469:
--

Assignee: Bryan Cutler

> Add doc and example for FeatureHasher
> -
>
> Key: SPARK-21469
> URL: https://issues.apache.org/jira/browse/SPARK-21469
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
>
> Add examples and user guide section for {{FeatureHasher}} in SPARK-13969



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting

2017-08-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138007#comment-16138007
 ] 

Nick Pentreath commented on SPARK-21086:


Ok - I commented on the PR.

Agree it makes sense to proceed with the simple version first of keeping all 
models (configurable with a param, false by default). Later we can look at 
Yuhao's file-based version for large models.
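A sketch of what that might look like to the user, assuming the param ends up 
being named {{collectSubModels}} (hypothetical at this point, false by default):

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.01, 0.1)).build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEstimatorParamMaps(grid)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setCollectSubModels(true)  // keep every fitted model, not just the best one

// `training` is an assumed labelled DataFrame with "label"/"features" columns
val cvModel = cv.fit(training)
// All fitted models, indexed by fold and then by param-map position
val subModels = cvModel.subModels
{code}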

> CrossValidator, TrainValidationSplit should preserve all models after fitting
> -
>
> Key: SPARK-21086
> URL: https://issues.apache.org/jira/browse/SPARK-21086
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> I've heard multiple requests for having CrossValidatorModel and 
> TrainValidationSplitModel preserve the full list of fitted models.  This 
> sounds very valuable.
> One decision should be made before we do this: Should we save and load the 
> models in ML persistence?  That could blow up the size of a saved Pipeline if 
> the models are large.
> * I suggest *not* saving the models by default but allowing saving if 
> specified.  We could specify whether to save the model as an extra Param for 
> CrossValidatorModelWriter, but we would have to make sure to expose 
> CrossValidatorModelWriter as a public API and modify the return type of 
> CrossValidatorModel.write to be CrossValidatorModelWriter (but this will not 
> be a breaking change).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21799) KMeans performance regression (5-6x slowdown) in Spark 2.2

2017-08-22 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136630#comment-16136630
 ] 

Nick Pentreath commented on SPARK-21799:


Refer to SPARK-18608 and SPARK-19422. There is some work on it. Haven't looked 
at it for a while but I recall that it was a little more complex than initially 
expected. 

> KMeans performance regression (5-6x slowdown) in Spark 2.2
> --
>
> Key: SPARK-21799
> URL: https://issues.apache.org/jira/browse/SPARK-21799
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> I've been running KMeans performance tests using 
> [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have 
> noticed a regression (slowdowns of 5-6x) when running tests on large datasets 
> in Spark 2.2 vs 2.1.
> The test params are:
> * Cluster: 510 GB RAM, 16 workers
> * Data: 100 examples, 1 features
> After talking to [~josephkb], the issue seems related to the changes in 
> [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
> [this PR|https://github.com/apache/spark/pull/16295].
> It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so 
> `handlePersistence` is true even when KMeans is run on a cached DataFrame. 
> This unnecessarily causes another copy of the input dataset to be persisted.
> As of Spark 2.1 ([JIRA 
> link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.storageLevel` 
> returns the correct result after calling `df.cache()`, so I'd suggest 
> replacing instances of `df.rdd.getStorageLevel` with df.storageLevel` in 
> MLlib algorithms (the same pattern shows up in LogisticRegression, 
> LinearRegression, and others). I've verified this behavior in [this 
> notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html]
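A sketch of the suggested pattern (Spark 2.1+), checking the Dataset's own 
storage level so an already-cached DataFrame is not persisted a second time 
inside the algorithm:

{code}
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Returns true if this call persisted the data, so the caller knows to unpersist later
def persistIfNeeded(dataset: Dataset[_]): Boolean = {
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  if (handlePersistence) {
    dataset.persist(StorageLevel.MEMORY_AND_DISK)
  }
  handlePersistence
}
{code}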



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21468) FeatureHasher Python API

2017-08-21 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-21468:
--

Assignee: Nick Pentreath

> FeatureHasher Python API
> 
>
> Key: SPARK-21468
> URL: https://issues.apache.org/jira/browse/SPARK-21468
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
> Fix For: 2.3.0
>
>
> Add Python API for FeatureHasher in SPARK-13969



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21468) FeatureHasher Python API

2017-08-21 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-21468.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18970
[https://github.com/apache/spark/pull/18970]

> FeatureHasher Python API
> 
>
> Key: SPARK-21468
> URL: https://issues.apache.org/jira/browse/SPARK-21468
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>    Reporter: Nick Pentreath
> Fix For: 2.3.0
>
>
> Add Python API for FeatureHasher in SPARK-13969



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4981) Add a streaming singular value decomposition

2017-08-18 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16132118#comment-16132118
 ] 

Nick Pentreath commented on SPARK-4981:
---

Hey folks, as interesting as this would be, I think it's fairly clear that it 
won't be moving ahead any time soon (and furthermore any 
ML-on-Structured-Streaming is not imminent). Shall we close this off?

> Add a streaming singular value decomposition
> 
>
> Key: SPARK-4981
> URL: https://issues.apache.org/jira/browse/SPARK-4981
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams, MLlib
>Reporter: Jeremy Freeman
>
> This is for tracking WIP on a streaming singular value decomposition 
> implementation. This will likely be more complex than the existing streaming 
> algorithms (k-means, regression), but should be possible using the family of 
> sequential update rules outlined in this paper:
> "Fast low-rank modifications of the thin singular value decomposition"
> by Matthew Brand
> http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21742) BisectingKMeans generate different models with/without caching

2017-08-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128512#comment-16128512
 ] 

Nick Pentreath commented on SPARK-21742:


Isn't the solution to set a fixed seed for the randomly generated dataset where 
necessary?

> BisectingKMeans generate different models with/without caching
> --
>
> Key: SPARK-21742
> URL: https://issues.apache.org/jira/browse/SPARK-21742
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>
> I found that {{BisectingKMeans}} will generate different models if the input 
> is cached or not.
> Using the same dataset in {{BisectingKMeansSuite}}, we can found that if we 
> cache the input, then the number of centers will change from 2 to 3.
> So it looks like a potential bug.
> {code}
> import org.apache.spark.ml.param.ParamMap
> import org.apache.spark.sql.Dataset
> import org.apache.spark.ml.clustering._
> import org.apache.spark.ml.linalg._
> import scala.util.Random
> case class TestRow(features: org.apache.spark.ml.linalg.Vector)
> val rows = 10
> val dim = 1000
> val seed = 42
> val random = new Random(seed)
> val nnz = random.nextInt(dim)
> val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim, 
> random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, 
> Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v))
> val sparseDataset = spark.createDataFrame(rdd)
> val k = 5
> val bkm = new 
> BisectingKMeans().setK(k).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123)
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> res0: Int = 2
> sparseDataset.persist()
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> res2: Int = 3
> {code}
> [~imatiach] 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13969) Extend input format that feature hashing can handle

2017-08-16 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-13969.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18513
[https://github.com/apache/spark/pull/18513]

> Extend input format that feature hashing can handle
> ---
>
> Key: SPARK-13969
> URL: https://issues.apache.org/jira/browse/SPARK-13969
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>    Reporter: Nick Pentreath
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in 
> scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of 
> strings and computes term frequencies.
> The use cases for feature hashing extend to arbitrary feature values (binary, 
> count or real-valued). For example, scikit-learn's {{FeatureHasher}} can 
> accept a sequence of (feature_name, value) pairs (e.g. a map, list). In this 
> way, feature hashing can operate as both "one-hot encoder" and "vector 
> assembler" at the same time.
> Investigate adding a more generic feature hasher (that in turn can be used by 
> {{HashingTF}}).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13969) Extend input format that feature hashing can handle

2017-08-16 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-13969:
--

Assignee: Nick Pentreath

> Extend input format that feature hashing can handle
> ---
>
> Key: SPARK-13969
> URL: https://issues.apache.org/jira/browse/SPARK-13969
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in 
> scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of 
> strings and computes term frequencies.
> The use cases for feature hashing extend to arbitrary feature values (binary, 
> count or real-valued). For example, scikit-learn's {{FeatureHasher}} can 
> accept a sequence of (feature_name, value) pairs (e.g. a map, list). In this 
> way, feature hashing can operate as both "one-hot encoder" and "vector 
> assembler" at the same time.
> Investigate adding a more generic feature hasher (that in turn can be used by 
> {{HashingTF}}).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21723) Can't write LibSVM - key not found: numFeatures

2017-08-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126939#comment-16126939
 ] 

Nick Pentreath commented on SPARK-21723:


Yes, we should definitely be able to write LibSVM format regardless of whether 
the original data was read from that format, or whether we have ML metadata 
attached to the dataframe. We should be able to inspect the vectors to get the 
size in the absence of the metadata.



> Can't write LibSVM - key not found: numFeatures
> ---
>
> Key: SPARK-21723
> URL: https://issues.apache.org/jira/browse/SPARK-21723
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Jan Vršovský
>
> Writing a dataset to LibSVM format raises an exception
> {{java.util.NoSuchElementException: key not found: numFeatures}}
> Happens only when the dataset was NOT read from a LibSVM format before 
> (because otherwise numFeatures is in its metadata). Steps to reproduce:
> {{import org.apache.spark.ml.linalg.Vectors
> val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0,
>   (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)
> val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
> dfTemp.coalesce(1).write.format("libsvm").save("...filename...")}}
> PR with a fix and unit test is ready.
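A sketch of the fallback described above - when the features column carries no 
ML attribute metadata, derive {{numFeatures}} from a vector itself (using the 
{{dfTemp}} from the reproduction):

{code}
import org.apache.spark.ml.linalg.Vector

// Fall back to the size of an actual vector when metadata is absent
val numFeatures = dfTemp.select("features").head().getAs[Vector](0).size
{code}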



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting

2017-08-03 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112623#comment-16112623
 ] 

Nick Pentreath commented on SPARK-21086:


I just want to understand _why_ folks want to keep all the models. Is it 
actually the models (and model data) they want, or simply an easier, "official 
API" way to link the param permutations with the cross-validation scores to see 
which param combinations produce which scores? (In which case, 
https://issues.apache.org/jira/browse/SPARK-18704 is actually the solution.)

> CrossValidator, TrainValidationSplit should preserve all models after fitting
> -
>
> Key: SPARK-21086
> URL: https://issues.apache.org/jira/browse/SPARK-21086
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> I've heard multiple requests for having CrossValidatorModel and 
> TrainValidationSplitModel preserve the full list of fitted models.  This 
> sounds very valuable.
> One decision should be made before we do this: Should we save and load the 
> models in ML persistence?  That could blow up the size of a saved Pipeline if 
> the models are large.
> * I suggest *not* saving the models by default but allowing saving if 
> specified.  We could specify whether to save the model as an extra Param for 
> CrossValidatorModelWriter, but we would have to make sure to expose 
> CrossValidatorModelWriter as a public API and modify the return type of 
> CrossValidatorModel.write to be CrossValidatorModelWriter (but this will not 
> be a breaking change).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


