[jira] [Commented] (SPARK-19848) Regex Support in StopWordsRemover

2017-03-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899250#comment-15899250
 ] 

Nick Pentreath commented on SPARK-19848:


Perhaps the ML pipeline components mentioned 
[here|https://lucidworks.com/2016/04/13/spark-solr-lucenetextanalyzer/] may be 
of use?
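
In the meantime a regex-based filter can be approximated with a plain UDF applied 
after tokenization - a rough sketch (the column names and patterns below are 
illustrative only, not an existing API):

{code}
import org.apache.spark.sql.functions.{col, udf}

// Illustrative: drop 1-2 letter tokens and anything starting with "abc".
val patterns = Seq("^[a-zA-Z]{1,2}$", "^abc").map(_.r)

val dropByRegex = udf { (tokens: Seq[String]) =>
  tokens.filterNot(t => patterns.exists(_.findFirstIn(t).isDefined))
}

// `tokenized` is assumed to be the output of a Tokenizer with a "tokens" column.
val filtered = tokenized.withColumn("filtered", dropByRegex(col("tokens")))
{code}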

> Regex Support in StopWordsRemover
> -
>
> Key: SPARK-19848
> URL: https://issues.apache.org/jira/browse/SPARK-19848
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mohd Suaib Danish
>Priority: Minor
>
> Can we have a regex feature in StopWordsRemover in addition to the provided 
> list of stop words?
> Use cases can be the following:
> 1. Remove all single- or double-letter words - [a-zA-Z]{1,2}
> 2. Remove anything starting with abc - ^abc






[jira] [Comment Edited] (SPARK-19848) Regex Support in StopWordsRemover

2017-03-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899250#comment-15899250
 ] 

Nick Pentreath edited comment on SPARK-19848 at 3/7/17 11:06 AM:
-

This seems more like a "token filter" in a sense. Perhaps the ML pipeline 
components mentioned 
[here|https://lucidworks.com/2016/04/13/spark-solr-lucenetextanalyzer/] may be 
of use?


was (Author: mlnick):
Perhaps the ML pipeline components mentioned 
[here|https://lucidworks.com/2016/04/13/spark-solr-lucenetextanalyzer/] may be 
of use?

> Regex Support in StopWordsRemover
> -
>
> Key: SPARK-19848
> URL: https://issues.apache.org/jira/browse/SPARK-19848
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mohd Suaib Danish
>Priority: Minor
>
> Can we have a regex feature in StopWordsRemover in addition to the provided 
> list of stop words?
> Use cases can be the following:
> 1. Remove all single- or double-letter words - [a-zA-Z]{1,2}
> 2. Remove anything starting with abc - ^abc






Re: Check if dataframe is empty

2017-03-07 Thread Nick Pentreath
I believe take on an empty dataset will return an empty Array rather than
throw an exception.

df.take(1).isEmpty should work
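
For illustration, a minimal helper along those lines (just a sketch - the
DataFrame import is the only dependency):

import org.apache.spark.sql.DataFrame

// take(1) returns an empty Array for an empty DataFrame, so at most one row
// is ever brought back to the driver.
def isEmpty(df: DataFrame): Boolean = df.take(1).isEmpty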

On Tue, 7 Mar 2017 at 07:42, Deepak Sharma  wrote:

> If the df is empty, the .take would return
> java.util.NoSuchElementException.
> This can be done as below:
> df.rdd.isEmpty
>
>
> On Tue, Mar 7, 2017 at 9:33 AM,  wrote:
>
> Dataframe.take(1) is faster.
>
>
>
> *From:* ashaita...@nz.imshealth.com [mailto:ashaita...@nz.imshealth.com]
> *Sent:* Tuesday, March 07, 2017 9:22 AM
> *To:* user@spark.apache.org
> *Subject:* Check if dataframe is empty
>
>
>
> Hello!
>
>
>
> I am pretty sure that I am asking something which has been already asked
> lots of times. However, I cannot find the question in the mailing list
> archive.
>
>
>
> The question is – I need to check whether dataframe is empty or not. I
> receive a dataframe from a 3rd-party library and this dataframe can be
> potentially empty, but can also be really huge – millions of rows. Thus, I
> want to avoid doing some logic in case the dataframe is empty. How can I
> efficiently check it?
>
>
>
> Right now I am doing it in the following way:
>
>
>
> private def isEmpty(df: Option[DataFrame]): Boolean = {
>   df.isEmpty || (df.isDefined && df.get.limit(1).rdd.isEmpty())
> }
>
>
>
> But the performance is really slow for big dataframes. I would be grateful
> for any suggestions.
>
>
>
> Thank you in advance.
>
>
>
>
> Best regards,
>
>
>
> Artem
>
>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-06 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898855#comment-15898855
 ] 

Nick Pentreath commented on SPARK-14409:


[~josephkb] the proposed input schema above encompasses that - the {{labelCol}} 
is the true relevance score (rating, confidence, etc), while the 
{{predictionCol}} is the predicted relevance (rating, confidence, etc). Note we 
can name these columns something more specific ({{labelCol}} and 
{{predictionCol}} are really just re-used from the other evaluators). 

This also allows "weighted" forms of ranking metric later (e.g. some metrics 
can incorporate the true relevance score into the computation which serves as a 
form of weighting of the metric) - the metrics we currently have don't do that. 
So for now the true relevance can serve as a filter - for example, when 
computing the ranking metric for recommendation, we *don't* want to include 
negative ratings in the "ground truth set of relevant documents" - hence the 
{{goodThreshold}} param above (I would rather call it something like 
{{relevanceThreshold}} myself).
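
Concretely, that filtering step for the ground-truth set might look like the 
following (illustrative only - {{predictions}} is the transformed DataFrame and 
{{relevanceThreshold}} a hypothetical param value):

{code}
import org.apache.spark.sql.functions.col

// Keep only documents whose true relevance clears the threshold when building
// the ground-truth set; the predicted side is left untouched.
val relevanceThreshold = 3.0
val groundTruth = predictions
  .filter(col("rating").isNotNull && col("rating") >= relevanceThreshold)
{code}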

*Note* that there are 2 formats I detail in my comment above - the first is 
the actual schema of the {{DataFrame}} used as input to the 
{{RankingEvaluator}} - this must therefore be the schema of the DF output by 
{{model.transform}} (whether that is ALS for recommendation, a logistic 
regression for ad prediction, or whatever).

The second format mentioned is simply illustrating the *intermediate internal 
transformation* that the evaluator will do in the {{evaluate}} method. You can 
see a rough draft of it in Danilo's PR 
[here|https://github.com/apache/spark/pull/16618/files#diff-0345c4cb1878d3bb0d84297202fdc95fR93].

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Comment Edited] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-06 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896933#comment-15896933
 ] 

Nick Pentreath edited comment on SPARK-14409 at 3/6/17 9:07 AM:


I've thought about this a lot over the past few days, and I think the approach 
should be in line with that suggested by [~roberto.mirizzi] & [~danilo.ascione].

*Goal*

Provide a DataFrame-based ranking evaluator that is general enough to handle 
common scenarios such as recommendations (ALS), search ranking, ad click 
prediction using ranking metrics (e.g. recent Kaggle competitions for 
illustration: [Outbrain Ad Clicks using 
MAP|https://www.kaggle.com/c/outbrain-click-prediction#evaluation], [Expedia 
Hotel Search Ranking using 
NDCG|https://www.kaggle.com/c/expedia-personalized-sort#evaluation]).

*RankingEvaluator input format*

{{evaluate}} would take a {{DataFrame}} with columns:

* {{queryCol}} - the column containing "query id" (e.g. "query" for cases such 
as search ranking; "user" for recommendations; "impression" for ad click 
prediction/ranking, etc);
* {{documentCol}} - the column containing "document id" (e.g. "document" in 
search, "item" in recommendation, "ad" in ad ranking, etc);
* {{labelCol}} (or maybe {{relevanceCol}} to be more precise) - the column 
containing the true relevance score for a query-document pair (e.g. in 
recommendations this would be the "rating"). This column will only be used for 
filtering out "irrelevant" documents from the ground-truth set (see Param 
{{goodThreshold}} mentioned 
[above|https://issues.apache.org/jira/browse/SPARK-14409?focusedCommentId=15826901&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15826901]);
* {{predictionCol}} - the column containing the predicted relevance score for a 
query-document pair. The predicted ids will be ordered by this column for 
computing ranking metrics (for which order matters in predictions but generally 
not for ground-truth which is treated as a set).

The reasoning is that this format is flexible & generic enough to encompass the 
diverse use cases mentioned above *as part of a Spark ML Pipeline*, with 
*cross-validation support*.

Here is an illustrative example from recommendations as a special case:

{code}
+--+---+--+--+
|userId|movieId|rating|prediction|
+--+---+--+--+
|   230|318|   5.0| 4.2403245|
|   230|   3424|   4.0|  null|
|   230|  81191|  null|  4.317455|
+--+---+--+--+
{code}

You will notice that {{rating}} and {{prediction}} columns can be {{null}}. 
This is by design. There are three cases shown above:

# 1st row indicates a query-document (user-item) pair that occurs in *both* the 
ground-truth set and the top-k predictions;
# 2nd row indicates a user-item pair that occurs in the ground-truth set, but 
*not* in the top-k predictions;
# 3rd row indicates a user-item pair that *does not* occur in the ground-truth 
set, but *does* occur in the top-k predictions;

*Note* that while technically the input allows both these columns to be 
{{null}} in practice that won't occur since a query-document pair must occur in 
at least one of the ground-truth set or predictions. If it does occur for some 
reason it can be ignored.

*Evaluator approach*

The evaluator will perform a window function over {{queryCol}} and order by 
{{predictionCol}} within each query. Then, {{collect_list}} can be used to 
arrive at the following intermediate format:

{code}
+--+++
|userId| true_labels|predicted_labels|
+--+++
|   230|[318, 3424, 7139,...|[81191, 93040, 31...|
+--+++
{code}
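
As a rough illustration of that internal step (assuming a DataFrame 
{{predictions}} with the columns from the example above - a sketch, not the 
proposed implementation):

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Window per query (user), ordered by predicted relevance; the unbounded frame
// lets collect_list gather the whole partition in that order.
val byPrediction = Window
  .partitionBy("userId")
  .orderBy(col("prediction").desc)
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val intermediate = predictions
  .withColumn("predicted_labels",
    collect_list(when(col("prediction").isNotNull, col("movieId"))).over(byPrediction))
  // ground truth is treated as a set, so its ordering is irrelevant; a relevance
  // threshold filter on "rating" could be applied here as discussed above
  .withColumn("true_labels",
    collect_list(when(col("rating").isNotNull, col("movieId"))).over(byPrediction))
  .select("userId", "true_labels", "predicted_labels")
  .distinct()
{code}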

*Relationship to RankingMetrics*

Technically the intermediate format above is the same format as used for 
{{RankingMetrics}}, and perhaps we could simply wrap the {{mllib}} version. 
*Note* however that the {{mllib}} class is parameterized by the type of 
"document": {code}RankingMetrics[T]{code}

I believe for the generic case we must support both {{NumericType}} and 
{{StringType}} for id columns (rather than restricting to {{Int}} as in Danilo 
& Roberto's versions above). So either:
# the evaluator must be similarly parameterized; or
# we will need to re-write the ranking metrics computations as UDFs as follows: 
{code} udf { (predicted: Seq[Any], actual: Seq[Any]) => ... {code} 

I strongly prefer option #2 as it is more flexible and in keeping with the 
DataFrame style of Spark ML components (as a side note, this will give us a 
chance to review the implementations & naming of metrics, since there are some 
issues with a few of the metrics).
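
For illustration, a precision-at-k computation in that UDF style could look like 
this (purely a sketch - the {{k}} value and names are not part of any proposal):

{code}
import org.apache.spark.sql.functions.udf

// Operates on the intermediate format above; Seq[Any] keeps the id type open
// (numeric or string).
val k = 10
val precisionAtK = udf { (predicted: Seq[Any], actual: Seq[Any]) =>
  if (actual.isEmpty) 0.0
  else {
    val groundTruth = actual.toSet
    predicted.take(k).count(groundTruth.contains).toDouble / k
  }
}
{code}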


That 

[jira] [Comment Edited] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-06 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896933#comment-15896933
 ] 

Nick Pentreath edited comment on SPARK-14409 at 3/6/17 9:06 AM:


I've thought about this a lot over the past few days, and I think the approach 
should be in line with that suggested by [~roberto.mirizzi] & [~danilo.ascione].

*Goal*

Provide a DataFrame-based ranking evaluator that is general enough to handle 
common scenarios such as recommendations (ALS), search ranking, ad click 
prediction using ranking metrics (e.g. recent Kaggle competitions for 
illustration: [Outbrain Ad Clicks using 
MAP|https://www.kaggle.com/c/outbrain-click-prediction#evaluation], [Expedia 
Hotel Search Ranking using 
NDCG|https://www.kaggle.com/c/expedia-personalized-sort#evaluation]).

*RankingEvaluator input format*

{{evaluate}} would take a {{DataFrame}} with columns:

* {{queryCol}} - the column containing "query id" (e.g. "query" for cases such 
as search ranking; "user" for recommendations; "impression" for ad click 
prediction/ranking, etc);
* {{documentCol}} - the column containing "document id" (e.g. "document" in 
search, "item" in recommendation, "ad" in ad ranking, etc);
* {{labelCol}} (or maybe {{relevanceCol}} to be more precise) - the column 
containing the true relevance score for a query-document pair (e.g. in 
recommendations this would be the "rating"). This column will only be used for 
filtering out "irrelevant" documents from the ground-truth set (see Param 
{{goodThreshold}} mentioned 
[above|https://issues.apache.org/jira/browse/SPARK-14409?focusedCommentId=15826901&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15826901]);
* {{predictionCol}} - the column containing the predicted relevance score for a 
query-document pair. The predicted ids will be ordered by this column for 
computing ranking metrics (for which order matters in predictions but generally 
not for ground-truth which is treated as a set).

The reasoning is that this format is flexible & generic enough to encompass the 
diverse use cases mentioned above.

Here is an illustrative example from recommendations as a special case:

{code}
+--+---+--+--+
|userId|movieId|rating|prediction|
+--+---+--+--+
|   230|318|   5.0| 4.2403245|
|   230|   3424|   4.0|  null|
|   230|  81191|  null|  4.317455|
+--+---+--+--+
{code}

You will notice that {{rating}} and {{prediction}} columns can be {{null}}. 
This is by design. There are three cases shown above:

# 1st row indicates a query-document (user-item) pair that occurs in *both* the 
ground-truth set and the top-k predictions;
# 2nd row indicates a user-item pair that occurs in the ground-truth set, but 
*not* in the top-k predictions;
# 3rd row indicates a user-item pair that *does not* occur in the ground-truth 
set, but *does* occur in the top-k predictions;

*Note* that while technically the input allows both these columns to be 
{{null}} in practice that won't occur since a query-document pair must occur in 
at least one of the ground-truth set or predictions. If it does occur for some 
reason it can be ignored.

*Evaluator approach*

The evaluator will perform a window function over {{queryCol}} and order by 
{{predictionCol}} within each query. Then, {{collect_list}} can be used to 
arrive at the following intermediate format:

{code}
+--+++
|userId| true_labels|predicted_labels|
+--+++
|   230|[318, 3424, 7139,...|[81191, 93040, 31...|
+--+++
{code}

*Relationship to RankingMetrics*

Technically the intermediate format above is the same format as used for 
{{RankingMetrics}}, and perhaps we could simply wrap the {{mllib}} version. 
*Note* however that the {{mllib}} class is parameterized by the type of 
"document": {code}RankingMetrics[T]{code}

I believe for the generic case we must support both {{NumericType}} and 
{{StringType}} for id columns (rather than restricting to {{Int}} as in Danilo 
& Roberto's versions above). So either:
# the evaluator must be similarly parameterized; or
# we will need to re-write the ranking metrics computations as UDFs as follows: 
{code} udf { (predicted: Seq[Any], actual: Seq[Any]) => ... {code} 

I strongly prefer option #2 as it is more flexible and in keeping with the 
DataFrame style of Spark ML components (as a side note, this will give us a 
chance to review the implementations & naming of metrics, since there are some 
issues with a few of the metrics).


That is my proposal (sorry Yong, this is quite different now from the wor

[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-06 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896933#comment-15896933
 ] 

Nick Pentreath commented on SPARK-14409:


I've thought about this a lot over the past few days, and I think the approach 
should be in line with that suggested by [~roberto.mirizzi] & [~danilo.ascione].

*Goal*

Provide a DataFrame-based ranking evaluator that is general enough to handle 
common scenarios such as recommendations (ALS), search ranking, ad click 
prediction using ranking metrics (e.g. recent Kaggle competitions for 
illustration: [Outbrain Ad Clicks using 
MAP|https://www.kaggle.com/c/outbrain-click-prediction#evaluation], [Expedia 
Hotel Search Ranking using 
NDCG|https://www.kaggle.com/c/expedia-personalized-sort#evaluation]).

*RankingEvaluator input format*

{{evaluate}} would take a {{DataFrame}} with columns:

* {{queryCol}} - the column containing "query id" (e.g. "query" for cases such 
as search ranking; "user" for recommendations; "impression" for ad click 
prediction/ranking, etc);
* {{documentCol}} - the column containing "document id" (e.g. "document" in 
search, "item" in recommendation, "ad" in ad ranking, etc);
* {{labelCol}} (or maybe {{relevanceCol}} to be more precise) - the column 
containing the true relevance score for a query-document pair (e.g. in 
recommendations this would be the "rating"). This column will only be used for 
filtering out "irrelevant" documents from the ground-truth set (see Param 
{{goodThreshold}} mentioned 
[above|https://issues.apache.org/jira/browse/SPARK-14409?focusedCommentId=15826901&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15826901]);
* {{predictionCol}} - the column containing the predicted relevance score for a 
query-document pair. The predicted ids will be ordered by this column for 
computing ranking metrics (for which order matters in predictions but generally 
not for ground-truth which is treated as a set).

The reasoning is that this format is flexible & generic enough to encompass the 
diverse use cases mentioned above.

Here is an illustrative example from recommendations as a special case:

{code}
+--+---+--+--+
|userId|movieId|rating|prediction|
+--+---+--+--+
|   230|318|   5.0| 4.2403245|
|   230|   3424|   4.0|  null|
|   230|  81191|  null|  4.317455|
+--+---+--+--+
{code}

You will notice that {{rating}} and {{prediction}} columns can be {{null}}. 
This is by design. There are three cases shown above:

# 1st row indicates a query-document (user-item) pair that occurs in *both* the 
ground-truth set and the top-k predictions;
# 2nd row indicates a user-item pair that occurs in the ground-truth set, but 
*not* in the top-k predictions;
# 3rd row indicates a user-item pair that *does not* occur in the ground-truth 
set, but *does* occur in the top-k predictions;

*Note* that while technically the input allows both these columns to be 
{{null}} in practice that won't occur since a query-document pair must occur in 
at least one of the ground-truth set or predictions. If it does occur for some 
reason it can be ignored.

*Evaluator approach*

The evaluator will perform a window function over {{queryCol}} and order by 
{{predictionCol}} within each query. Then, {{collect_list}} can be used to 
arrive at the following intermediate format:

{code}
+--+++
|userId| true_labels|predicted_labels|
+--+++
|   230|[318, 3424, 7139,...|[81191, 93040, 31...|
+--+++
{code}

*Relationship to RankingMetrics*

Technically the intermediate format above is the same format as used for 
{{RankingMetrics}}, and perhaps we could simply wrap the {{mllib}} version. 
*Note* however that the {{mllib}} class is parameterized by the type of 
"document": {code}RankingMetrics[T]{code}

I believe for the generic case we must support both {{NumericType}} and 
{{StringType}} for id columns (rather than restricting to {{Int}} as in Danilo 
& Roberto's versions above). So either:
# the evaluator must be similarly parameterized; or
# we will need to re-write the ranking metrics computations as UDFs as follows: 
{code} udf { (predicted: Seq[Any], actual: Seq[Any]) => ... {code} 

I strongly prefer option #2 as it is more flexible and in keeping with the 
DataFrame style of Spark ML components (as a side note, this will give us a 
chance to review the implementations & naming of metrics, since there are some 
issues with a few of the metrics).


That is my proposal (sorry Yong, this is quite different now from the work 
you'v

[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?

2017-03-04 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895629#comment-15895629
 ] 

Nick Pentreath commented on SPARK-7146:
---

Personally I support making these DeveloperApi - they are going to be used by developers 
of extensions & custom pipeline components, and I think we can make a best effort 
not to break things but don't need to absolutely guarantee that.

> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Proposal: Make most of the Param traits in sharedParams.scala public.  Mark 
> them as DeveloperApi.
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> h3. UPDATED proposal
> * Some Params are clearly safe to make public.  We will do so.
> * Some Params could be made public but may require caveats in the trait doc.
> * Some Params have turned out not to be shared in practice.  We can move 
> those Params to the classes which use them.
> *Public shared params*:
> * I/O column params
> ** HasFeaturesCol
> ** HasInputCol
> ** HasInputCols
> ** HasLabelCol
> ** HasOutputCol
> ** HasPredictionCol
> ** HasProbabilityCol
> ** HasRawPredictionCol
> ** HasVarianceCol
> ** HasWeightCol
> * Algorithm settings
> ** HasCheckpointInterval
> ** HasElasticNetParam
> ** HasFitIntercept
> ** HasMaxIter
> ** HasRegParam
> ** HasSeed
> ** HasStandardization (less common)
> ** HasStepSize
> ** HasTol
> *Questionable params*:
> * HasHandleInvalid (only used in StringIndexer, but might be more widely used 
> later on)
> * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but 
> same meaning as Optimizer in LDA)
> *Params to be removed from sharedParams*:
> * HasThreshold (only used in LogisticRegression)
> * HasThresholds (only used in ProbabilisticClassifier)






Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-03-04 Thread Nick Pentreath
Also, note https://issues.apache.org/jira/browse/SPARK-7146 is linked from
SPARK-19498 specifically to discuss opening up sharedParams traits.


On Fri, 3 Mar 2017 at 23:17 Shouheng Yi 
wrote:

> Hi Spark dev list,
>
>
>
> Thank you guys so much for all your inputs. We really appreciated those
> suggestions. After some discussions in the team, we decided to stay under
> apache’s namespace for now, and attach some comments to explain what we did
> and why we did this.
>
>
>
> As the Spark dev list kindly pointed out, this is an existing issue that
> was documented in the JIRA ticket [Spark-19498] [0]. We can follow the JIRA
> ticket to see if there are any new suggested practices that should be
> adopted in the future and make corresponding fixes.
>
>
>
> Best,
>
> Shouheng
>
>
>
> [0] https://issues.apache.org/jira/browse/SPARK-19498
>
>
>
> *From:* Tim Hunter [mailto:timhun...@databricks.com
> ]
> *Sent:* Friday, February 24, 2017 9:08 AM
> *To:* Joseph Bradley 
> *Cc:* Steve Loughran ; Shouheng Yi <
> sho...@microsoft.com.invalid>; Apache Spark Dev ;
> Markus Weimer ; Rogan Carr ;
> Pei Jiang ; Miruna Oprescu 
> *Subject:* Re: [Spark Namespace]: Expanding Spark ML under Different
> Namespace?
>
>
>
> Regarding logging, Graphframes makes a simple wrapper this way:
>
>
>
>
> https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/Logging.scala
> 
>
>
>
> Regarding the UDTs, they have been hidden to be reworked for Datasets, the
> reasons being detailed here [1]. Can you describe your use case in more
> details? You may be better off copy/pasting the UDT code outside of Spark,
> depending on your use case.
>
>
>
> [1] https://issues.apache.org/jira/browse/SPARK-14155
> 
>
>
>
> On Thu, Feb 23, 2017 at 3:42 PM, Joseph Bradley 
> wrote:
>
> +1 for Nick's comment about discussing APIs which need to be made public
> in https://issues.apache.org/jira/browse/SPARK-19498
> 
> !
>
>
>
> On Thu, Feb 23, 2017 at 2:36 AM, Steve Loughran 
> wrote:
>
>
>
> On 22 Feb 2017, at 20:51, Shouheng Yi 
> wrote:
>
>
>
> Hi Spark developers,
>
>
>
> Currently my team at Microsoft is extending Spark’s machine learning
> functionalities to include new learners and transformers. We would like
> users to use these within spark pipelines so that they can mix and match
> with existing Spark learners/transformers, and overall have a native spark
> experience. We cannot accomplish this using a non-“org.apache” namespace
> with the current implementation, and we don’t want to release code inside
> the apache namespace because it’s confusing and there could be naming
> rights issues.
>
>
>
> This isn't actually something the ASF has a strong stance against; it's more left to
> projects themselves. After all: the source is licensed by the ASF, and the
> license doesn't say you can't.
>
>
>
> Indeed, there's a bit of org.apache.hive in the Spark codebase where the
> hive team kept stuff package private. Though that's really a sign that
> things could be improved there.
>
>
>
> Where it is problematic is that stack traces end up blaming the wrong group;
> nobody likes getting a bug report which doesn't actually exist in your
> codebase, not least because you have to waste time to even work it out.
>
>
>
> You also have to expect absolutely no stability guarantees, so you'd
> better set your nightly build to work against trunk
>
>
>
> Apache Bahir does put some stuff into org.apache.spark.stream, but they've
> sort of inherited that right when they picked up the code from Spark. New
> stuff is going into org.apache.bahir
>
>
>
>
>
> We need to extend several classes from spark which happen to have
> “private[spark].” For example, one of our classes extends VectorUDT[0], which
> is declared as private[spark] class VectorUDT. This
> unfortunately puts us in a strange scenario that forces us to work under the
> namespace org.apache.spark.
>
>
>
> To be specific, currently the private classes/traits we need to use to
> create new S

[jira] [Commented] (SPARK-19339) StatFunctions.multipleApproxQuantiles can give NoSuchElementException: next on empty iterator

2017-03-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893858#comment-15893858
 ] 

Nick Pentreath commented on SPARK-19339:


This should be addressed by SPARK-19573 - empty (or all null) columns will 
return empty Array rather than throw exception.

> StatFunctions.multipleApproxQuantiles can give NoSuchElementException: next 
> on empty iterator
> -
>
> Key: SPARK-19339
> URL: https://issues.apache.org/jira/browse/SPARK-19339
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Barry Becker
>Priority: Minor
>
> This problem is easy to reproduce by running 
> StatFunctions.multipleApproxQuantiles on an empty dataset, but I think it can 
> occur in other cases, like if the column is all null or all one value.
> I have unit tests that can hit it in several different cases.
> The fix that I have introduced locally is to return
> {code}
>  if (sampled.length == 0) 0 else sampled.last.value
> {code}
> instead of 
> {code}
> sampled.last.value
> {code}
> at the end of QuantileSummaries.query.
> Below is the exception:
> {code}
> next on empty iterator
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at 
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
>   at scala.collection.IterableLike$class.head(IterableLike.scala:107)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
>   at 
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
>   at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
>   at 
> scala.collection.TraversableLike$class.last(TraversableLike.scala:459)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186)
>   at 
> scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132)
>   at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(SparkPercentileCalculator.scala:91)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(SparkPercentileCalculator.scala:91)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(SparkPercentileCalculator.scala:91)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1.apply(SparkPercentileCalculator.scala:91)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1.apply(SparkPercentileCalculator.scala:91)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator.multipleApproxQuantiles(SparkPercentileCalculator.scala:91)
>   at 
> com.mineset.spark.statistics.model.ContinuousMinesetStats.quartiles$lzycompute(ContinuousMinesetStats.scala:274)
>   at 
> com.mineset.spark.statistics.model.ContinuousMinesetStats.quartiles(ContinuousMinesetStats.scala:272)
>   at 
> com.mineset.spark.statistics.model.MinesetStats.com$mineset$spark$statistics$model$MinesetStats$$serializeContinuousFeature$1(MinesetStats.sca

[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-03-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893821#comment-15893821
 ] 

Nick Pentreath commented on SPARK-19714:


If you feel that handling values outside the bucket ranges as "invalid" is 
reasonable - specifically including them in the special "invalid" bucket - then 
we can discuss if and how that could be implemented.

I agree it's quite a large departure, but we could support it with a further 
param value such as "keepAll" which keeps both {{NaN}} and values outside of 
range in the special bucket.

I don't see a compelling reason that this is a bug, so if you want to motivate 
for a change then propose an approach. 

I do think we should update the doc for {{handleInvalid}} - [~wojtek-szymanski] 
feel free to open a PR for that.

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid buckets. However it fails
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?






[jira] [Commented] (SPARK-19747) Consolidate code in ML aggregators

2017-03-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893811#comment-15893811
 ] 

Nick Pentreath commented on SPARK-19747:


Also agree we should be able to extract out the penalty / regularization term. 
I know Seth's done that in the WIP PR for L2 - but L1 is interesting because 
currently it is more tightly baked into the Breeze optimizers...

> Consolidate code in ML aggregators
> --
>
> Key: SPARK-19747
> URL: https://issues.apache.org/jira/browse/SPARK-19747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Many algorithms in Spark ML are posed as optimization of a differentiable 
> loss function over a parameter vector. We implement these by having a loss 
> function accumulate the gradient using an Aggregator class which has methods 
> that amount to a {{seqOp}} and {{combOp}}. So, pretty much every algorithm 
> that obeys this form implements a cost function class and an aggregator 
> class, which are completely separate from one another but share probably 80% 
> of the same code. 
> I think it is important to clean things like this up, and if we can do it 
> properly it will make the code much more maintainable, readable, and bug 
> free. It will also help reduce the overhead of future implementations.
> The design is of course open for discussion, but I think we should aim to:
> 1. Have all aggregators share parent classes, so that they only need to 
> implement the {{add}} function. This is really the only difference in the 
> current aggregators.
> 2. Have a single, generic cost function that is parameterized by the 
> aggregator type. This reduces the many places we implement cost functions and 
> greatly reduces the amount of duplicated code.
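
A very rough sketch of the shared-parent idea described above (names and 
signatures are purely illustrative, not the actual Spark classes):

{code}
import org.apache.spark.ml.linalg.Vector

// Illustrative only: the base class owns the merge (combOp) and bookkeeping,
// so concrete aggregators implement just the per-instance `add` (seqOp).
abstract class LossAggregator extends Serializable {
  protected var weightSum: Double = 0.0
  protected var lossSum: Double = 0.0

  /** The only piece each algorithm would need to implement. */
  def add(label: Double, weight: Double, features: Vector): this.type

  /** Shared combOp logic. */
  def merge(other: LossAggregator): this.type = {
    weightSum += other.weightSum
    lossSum += other.lossSum
    this
  }

  def loss: Double = if (weightSum > 0) lossSum / weightSum else 0.0
}
{code}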






[jira] [Commented] (SPARK-19747) Consolidate code in ML aggregators

2017-03-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893810#comment-15893810
 ] 

Nick Pentreath commented on SPARK-19747:


[~yuhaoyan] for {{SGDClassifier}} it would be interesting to look at a Vowpal 
Wabbit-style normalized adaptive gradient descent approach, which does the 
normalization / standardization on the fly during training.

> Consolidate code in ML aggregators
> --
>
> Key: SPARK-19747
> URL: https://issues.apache.org/jira/browse/SPARK-19747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Many algorithms in Spark ML are posed as optimization of a differentiable 
> loss function over a parameter vector. We implement these by having a loss 
> function accumulate the gradient using an Aggregator class which has methods 
> that amount to a {{seqOp}} and {{combOp}}. So, pretty much every algorithm 
> that obeys this form implements a cost function class and an aggregator 
> class, which are completely separate from one another but share probably 80% 
> of the same code. 
> I think it is important to clean things like this up, and if we can do it 
> properly it will make the code much more maintainable, readable, and bug 
> free. It will also help reduce the overhead of future implementations.
> The design is of course open for discussion, but I think we should aim to:
> 1. Have all aggregators share parent classes, so that they only need to 
> implement the {{add}} function. This is really the only difference in the 
> current aggregators.
> 2. Have a single, generic cost function that is parameterized by the 
> aggregator type. This reduces the many places we implement cost functions and 
> greatly reduces the amount of duplicated code.






[jira] [Resolved] (SPARK-19345) Add doc for "coldStartStrategy" usage in ALS

2017-03-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19345.

   Resolution: Fixed
Fix Version/s: 2.2.0

> Add doc for "coldStartStrategy" usage in ALS
> 
>
> Key: SPARK-19345
> URL: https://issues.apache.org/jira/browse/SPARK-19345
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
> Fix For: 2.2.0
>
>
> SPARK-14489 adds the ability to skip {{NaN}} predictions during 
> {{ALS.transform}}. This can be useful in production scenarios but is 
> particularly useful when trying to use the cross-validation classes with ALS, 
> since in many cases the test set will have users/items that are not in the 
> training set, leading to evaluation metrics that are all {{NaN}} and making 
> cross-validation unusable.
> Add an explanation for the {{coldStartStrategy}} param to the ALS 
> documentation, and add example code to illustrate the usage.
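
For reference, the documented usage looks roughly like this ({{training}} and 
{{test}} are placeholder DataFrames with userId/movieId/rating columns):

{code}
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setColdStartStrategy("drop")  // drop NaN-prediction rows for unseen users/items

val model = als.fit(training)
// Evaluators (e.g. RegressionEvaluator) no longer see NaN predictions.
val predictions = model.transform(test)
{code}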






[jira] [Updated] (SPARK-19345) Add doc for "coldStartStrategy" usage in ALS

2017-03-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-19345:
---
Priority: Minor  (was: Major)

> Add doc for "coldStartStrategy" usage in ALS
> 
>
> Key: SPARK-19345
> URL: https://issues.apache.org/jira/browse/SPARK-19345
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>Priority: Minor
> Fix For: 2.2.0
>
>
> SPARK-14489 adds the ability to skip {{NaN}} predictions during 
> {{ALS.transform}}. This can be useful in production scenarios but is 
> particularly useful when trying to use the cross-validation classes with ALS, 
> since in many cases the test set will have users/items that are not in the 
> training set, leading to evaluation metrics that are all {{NaN}} and making 
> cross-validation unusable.
> Add an explanation for the {{coldStartStrategy}} param to the ALS 
> documentation, and add example code to illustrate the usage.






[jira] [Updated] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol

2017-03-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-19704:
---
Fix Version/s: 2.2.0

> AFTSurvivalRegression should support numeric censorCol
> --
>
> Key: SPARK-19704
> URL: https://issues.apache.org/jira/browse/SPARK-19704
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.2.0
>
>
> AFTSurvivalRegression should support numeric censorCol






[jira] [Assigned] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol

2017-03-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-19704:
--

Assignee: zhengruifeng

> AFTSurvivalRegression should support numeric censorCol
> --
>
> Key: SPARK-19704
> URL: https://issues.apache.org/jira/browse/SPARK-19704
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> AFTSurvivalRegression should support numeric censorCol






[jira] [Resolved] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol

2017-03-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19704.

Resolution: Fixed

> AFTSurvivalRegression should support numeric censorCol
> --
>
> Key: SPARK-19704
> URL: https://issues.apache.org/jira/browse/SPARK-19704
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> AFTSurvivalRegression should support numeric censorCol






[jira] [Assigned] (SPARK-19733) ALS performs unnecessary casting on item and user ids

2017-03-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-19733:
--

Assignee: Vasilis Vryniotis

> ALS performs unnecessary casting on item and user ids
> -
>
> Key: SPARK-19733
> URL: https://issues.apache.org/jira/browse/SPARK-19733
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Vasilis Vryniotis
>Assignee: Vasilis Vryniotis
> Fix For: 2.2.0
>
>
> The ALS is performing unnecessary casting to the user and item ids (to 
> double). I believe this is because the protected checkedCast() method 
> requires a double input. This can be avoided by refactoring the code of 
> the checkedCast method.
> Issue resolved by pull-request 17059:
> https://github.com/apache/spark/pull/17059






[jira] [Resolved] (SPARK-19733) ALS performs unnecessary casting on item and user ids

2017-03-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19733.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17059
[https://github.com/apache/spark/pull/17059]

> ALS performs unnecessary casting on item and user ids
> -
>
> Key: SPARK-19733
> URL: https://issues.apache.org/jira/browse/SPARK-19733
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Vasilis Vryniotis
> Fix For: 2.2.0
>
>
> The ALS is performing unnecessary casting to the user and item ids (to 
> double). I believe this is because the protected checkedCast() method 
> requires a double input. This can be avoided by refactoring the code of 
> the checkedCast method.
> Issue resolved by pull-request 17059:
> https://github.com/apache/spark/pull/17059






[jira] [Resolved] (SPARK-19787) Different default regParam values in ALS

2017-03-01 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19787.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17121
[https://github.com/apache/spark/pull/17121]

> Different default regParam values in ALS
> 
>
> Key: SPARK-19787
> URL: https://issues.apache.org/jira/browse/SPARK-19787
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Vasilis Vryniotis
>Priority: Trivial
> Fix For: 2.2.0
>
>
> In the ALS method the default values of regParam do not match within the same 
> file (lines 
> [224|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224]
>  and 
> [714|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714]).
>  In one place we set it to 1.0 and in the other to 0.1.
> We can change the one in the train() method to 0.1. The method is marked as 
> DeveloperApi so it should not affect users. Based on what I saw, whenever 
> we use that particular method we provide all parameters, so the default does 
> not matter. The only exception is the unit tests in ALSSuite, but the change does 
> not break them.
> This change was discussed on a separate PR and [~mlnick] 
> [suggested|https://github.com/apache/spark/pull/17059#issuecomment-28572] 
> to create a separate PR & ticket.






[jira] [Assigned] (SPARK-19345) Add doc for "coldStartStrategy" usage in ALS

2017-02-28 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-19345:
--

Assignee: Nick Pentreath

> Add doc for "coldStartStrategy" usage in ALS
> 
>
> Key: SPARK-19345
> URL: https://issues.apache.org/jira/browse/SPARK-19345
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> SPARK-14489 adds the ability to skip {{NaN}} predictions during 
> {{ALS.transform}}. This can be useful in production scenarios but is 
> particularly useful when trying to use the cross-validation classes with ALS, 
> since in many cases the test set will have users/items that are not in the 
> training set, leading to evaluation metrics that are all {{NaN}} and making 
> cross-validation unusable.
> Add an explanation for the {{coldStartStrategy}} param to the ALS 
> documentation, and add example code to illustrate the usage.






[jira] [Resolved] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2017-02-28 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-14489.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 12896
[https://github.com/apache/spark/pull/12896]

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
> Environment: AWS EMR
>Reporter: Boris Clémençon 
>Assignee: Nick Pentreath
>  Labels: patch
> Fix For: 2.2.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user of the validation set is missing in the training set, 
> hence generating a few NaN estimates from the transform method and NaN 
> RegressionEvaluator metrics too. 
> Suggestion to fix the bug: remove the NaN values while computing the RMSE or 
> other metrics (i.e., removing users or items in the validation set that are missing 
> from the training set). Log a warning when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, 
> schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
> // TODO: duplicate evaluator to take extra params from input
> val metric = eval.evaluate(models(i).transform(validationDataset, 
> epm(i)))
> logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
> metrics(i) += metric
> i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}






[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-02-27 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885636#comment-15885636
 ] 

Nick Pentreath commented on SPARK-11968:


While working on performance testing for ALS parity I've got a possible 
solution for this.

Will update once I have some more concrete numbers.

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]






[jira] [Reopened] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-02-27 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reopened SPARK-11968:

  Assignee: Nick Pentreath

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]






[jira] [Commented] (SPARK-19141) VectorAssembler metadata causing memory issues

2017-02-27 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885625#comment-15885625
 ] 

Nick Pentreath commented on SPARK-19141:


Hi there - I've also run into issues with larger-scale feature dimensions 
involving {{VectorAssembler}}. I suspected it was due to ML attributes but 
hadn't had the time to fully investigate. Thanks for digging into it!

I'll take a deeper look. The ideal would be some better, more efficient version 
of attributes. Perhaps it needs an overhaul.

> VectorAssembler metadata causing memory issues
> --
>
> Key: SPARK-19141
> URL: https://issues.apache.org/jira/browse/SPARK-19141
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
> Environment: Windows 10, Ubuntu 16.04.1, Scala 2.11.8, Spark 1.6.0, 
> 2.0.0, 2.1.0
>Reporter: Antonia Oprescu
>
> VectorAssembler produces unnecessary metadata that overflows the Java heap in 
> the case of sparse vectors. In the example below, the logical length of the 
> vector is 10^6, but the number of non-zero values is only 2.
> The problem arises when the vector assembler creates metadata (ML attributes) 
> for each of the 10^6 slots, even if this metadata didn't exist upstream (i.e. 
> HashingTF doesn't produce metadata per slot). Here is a chunk of metadata it 
> produces:
> {noformat}
> {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"HashedFeat_0"},{"idx":1,"name":"HashedFeat_1"},{"idx":2,"name":"HashedFeat_2"},{"idx":3,"name":"HashedFeat_3"},{"idx":4,"name":"HashedFeat_4"},{"idx":5,"name":"HashedFeat_5"},{"idx":6,"name":"HashedFeat_6"},{"idx":7,"name":"HashedFeat_7"},{"idx":8,"name":"HashedFeat_8"},{"idx":9,"name":"HashedFeat_9"},...,{"idx":100,"name":"Feat01"}]},"num_attrs":101}}
> {noformat}
> In this lightweight example, the feature size limit seems to be 1,000,000 
> when run locally, but this scales poorly with more complicated routines. With 
> a larger dataset and a learner (say LogisticRegression), it maxes out 
> anywhere between 10k and 100k hash size even on a decent sized cluster.
> I did some digging, and it seems that the only metadata necessary for 
> downstream learners is the one indicating categorical columns. Thus, I 
> thought of the following possible solutions:
> 1. Compact representation of ml attributes metadata (but this seems to be a 
> bigger change)
> 2. Removal of non-categorical tags from the metadata created by the 
> VectorAssembler
> 3. An option on the existent VectorAssembler to skip unnecessary ml 
> attributes or create another transformer altogether
> I would be happy to take a stab at any of these solutions, but I need some 
> direction from the Spark community.
> {code:title=VABug.scala |borderStyle=solid}
> import org.apache.spark.SparkConf
> import org.apache.spark.ml.feature.{HashingTF, VectorAssembler}
> import org.apache.spark.sql.SparkSession
> object VARepro {
>   case class Record(Label: Double, Feat01: Double, Feat02: Array[String])
>   def main(args: Array[String]) {
> val conf = new SparkConf()
>   .setAppName("Vector assembler bug")
>   .setMaster("local[*]")
> val spark = SparkSession.builder.config(conf).getOrCreate()
> import spark.implicits._
> val df = Seq(Record(1.0, 2.0, Array("4daf")), Record(0.0, 3.0, 
> Array("a9ee"))).toDS()
> val numFeatures = 1000
> val hashingScheme = new 
> HashingTF().setInputCol("Feat02").setOutputCol("HashedFeat").setNumFeatures(numFeatures)
> val hashedData = hashingScheme.transform(df)
> val vectorAssembler = new 
> VectorAssembler().setInputCols(Array("HashedFeat","Feat01")).setOutputCol("Features")
> val processedData = vectorAssembler.transform(hashedData).select("Label", 
> "Features")
> processedData.show()
>   }
> }
> {code}
> *Stacktrace from the example above:*
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit 
> exceeded
>   at 
> org.apache.spark.ml.attribute.NumericAttribute.copy(attributes.scala:272)
>   at 
> org.apache.spark.ml.

[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-27 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885315#comment-15885315
 ] 

Nick Pentreath commented on SPARK-19714:


I also agree that the naming of {{splits}} could be better, but for now we're 
stuck with it. We could deprecate it and have a new param, but to me the param 
doc is pretty clear and unambiguous about what it actually does. So that option 
seems more confusing to users than it's worth.

Of course {{QuantileDiscretizer}} is different but the result is exactly a 
{{Bucketizer}} - the discretizer computes what the actual values of the splits 
should be. My point is that if you want to include values outside of the splits 
(bucket boundaries) you need to be explicit and put Inf/-Inf in {{splits}}. 
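
For example, with the splits from the report below made explicitly open-ended, values below 5.0 fall into the first bucket and values above 500.0 into the last one, instead of triggering the out-of-bounds error:

{code}
import org.apache.spark.ml.feature.Bucketizer

val splits = Array(Double.NegativeInfinity, 5.0, 10.0, 250.0, 500.0, Double.PositiveInfinity)
val bucketizer = new Bucketizer()
  .setSplits(splits)
  .setInputCol("id")
  .setOutputCol("bucket")
{code}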

If you believe that the "invalid" handling should also include values outside 
of the split range that can be discussed. Do you propose to include all values 
outside the range in the special bucket (as is done for {{NaN}} now)?

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid buckets. However it fails
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19747) Consolidate code in ML aggregators

2017-02-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885298#comment-15885298
 ] 

Nick Pentreath commented on SPARK-19747:


Big +1 for this! I agree we really should be able to make all the concrete 
implementations simply specify the specific aggregation part - effectively the 
loss.

The general approach sounds good to me.
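
To make the idea concrete, here is a rough sketch of what the shared pieces might look like (names and signatures are purely illustrative, not a proposed API): concrete aggregators only implement {{add}}, while merging and the driver-side cost function are shared and generic in the aggregator type.

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Illustrative only: shared base for differentiable-loss aggregators.
trait DiffLossAggregator[Agg <: DiffLossAggregator[Agg]] extends Serializable {
  self: Agg =>
  var lossSum = 0.0
  var weightSum = 0.0
  val gradientSum: Array[Double]
  // the only algorithm-specific piece: accumulate one (label, weight, features) point
  def add(label: Double, weight: Double, features: Array[Double]): Agg
  def merge(other: Agg): Agg = {
    lossSum += other.lossSum
    weightSum += other.weightSum
    var i = 0
    while (i < gradientSum.length) { gradientSum(i) += other.gradientSum(i); i += 1 }
    self
  }
  def loss: Double = lossSum / weightSum
  def gradient: Array[Double] = gradientSum.map(_ / weightSum)
}

// One generic cost function parameterized by the aggregator type.
class GenericLossFunction[Agg <: DiffLossAggregator[Agg] : ClassTag](
    data: RDD[(Double, Double, Array[Double])],
    newAgg: () => Agg) extends Serializable {
  def calculate(): (Double, Array[Double]) = {
    val agg = data.treeAggregate(newAgg())(
      (a, p) => a.add(p._1, p._2, p._3),
      (a, b) => a.merge(b))
    (agg.loss, agg.gradient)
  }
}
{code}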

> Consolidate code in ML aggregators
> --
>
> Key: SPARK-19747
> URL: https://issues.apache.org/jira/browse/SPARK-19747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Many algorithms in Spark ML are posed as optimization of a differentiable 
> loss function over a parameter vector. We implement these by having a loss 
> function accumulate the gradient using an Aggregator class which has methods 
> that amount to a {{seqOp}} and {{combOp}}. So, pretty much every algorithm 
> that obeys this form implements a cost function class and an aggregator 
> class, which are completely separate from one another but share probably 80% 
> of the same code. 
> I think it is important to clean things like this up, and if we can do it 
> properly it will make the code much more maintainable, readable, and bug 
> free. It will also help reduce the overhead of future implementations.
> The design is of course open for discussion, but I think we should aim to:
> 1. Have all aggregators share parent classes, so that they only need to 
> implement the {{add}} function. This is really the only difference in the 
> current aggregators.
> 2. Have a single, generic cost function that is parameterized by the 
> aggregator type. This reduces the many places we implement cost functions and 
> greatly reduces the amount of duplicated code.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-24 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882224#comment-15882224
 ] 

Nick Pentreath commented on SPARK-19714:


Another alternative is that we do expand the "invalid" handling to include 
anything that falls outside of the splits provided. For {{QuantileDiscretizer}} 
it would have no effect but it would provide further flexibility to users.

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid buckets. However it fails
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-24 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882216#comment-15882216
 ] 

Nick Pentreath edited comment on SPARK-19714 at 2/24/17 8:35 AM:
-

I agree that the parameter naming is perhaps misleading. At least the doc 
should be updated because "invalid" here actually means {{NaN}} or {{null}}. 

However {{Bucketizer}} is doing what you tell it to as the splits are specified 
by you. Note that if you used {{QuantileDiscretizer}} to construct the 
{{Bucketizer}} then it adds {{+/- Infinity}} as the lower/upper bounds of the 
splits. So you can do the same if you want anything below the lower bound or 
above the lower bound to be "valid". You will then have 2 more buckets.


was (Author: mlnick):
I agree that the parameter naming is perhaps misleading. At least the doc 
should be updated because "invalid" here actually means {{NaN}} or {{null}}. 

However {{Bucketizer}} is doing what you tell it to as the splits are specified 
by you. Note that if you used {{QuantileDiscretizer}} to construct the 
{{Bucketizer}} then it adds {{+/- Infinity}} as the lower/upper bounds of the 
splits. So you can do the same if you want anything below the lower bound to be 
included in the first bucket, and above the upper bound to be included in the 
last bucket.

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid buckets. However it fails
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-24 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882216#comment-15882216
 ] 

Nick Pentreath commented on SPARK-19714:


I agree that the parameter naming is perhaps misleading. At least the doc 
should be updated because "invalid" here actually means {{NaN}} or {{null}}. 

However {{Bucketizer}} is doing what you tell it to as the splits are specified 
by you. Note that if you used {{QuantileDiscretizer}} to construct the 
{{Bucketizer}} then it adds {{+/- Infinity}} as the lower/upper bounds of the 
splits. So you can do the same if you want anything below the lower bound to be 
included in the first bucket, and above the upper bound to be included in the 
last bucket.

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid buckets. However it fails
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Feedback on MLlib roadmap process proposal

2017-02-24 Thread Nick Pentreath
FYI I've started going through a few of the top Watched JIRAs and tried to
identify those that are obviously stale and can probably be closed, to try
to clean things up a bit.

On Thu, 23 Feb 2017 at 21:38 Tim Hunter  wrote:

> As Sean wrote very nicely above, the changes made to Spark are decided in
> an organic fashion based on the interests and motivations of the committers
> and contributors. The case of deep learning is a good example. There is a
> lot of interest, and the core algorithms could be implemented without too
> much problem in a few thousands of lines of scala code. However, the
> performance of such a simple implementation would be one to two order of
> magnitude slower than what would get from the popular frameworks out there.
>
> At this point, there are probably more man-hours invested in TensorFlow
> (as an example) than in MLlib, so I think we need to be realistic about
> what we can expect to achieve inside Spark. Unlike BLAS for linear algebra,
> there is no agreed-upon interface for deep learning, and each of the XOnSpark
> flavors explores a slightly different design. It will be interesting to see
> what works well in practice. In the meantime, though, there are plenty of
> things that we could do to help developers of other libraries to have a
> great experience with Spark. Matei alluded to that in his Spark Summit
> keynote when he mentioned better integration with low-level libraries.
>
> Tim
>
>
> On Thu, Feb 23, 2017 at 5:32 AM, Nick Pentreath 
> wrote:
>
> Sorry for being late to the discussion. I think Joseph, Sean and others
> have covered the issues well.
>
> Overall I like the proposed cleaned up roadmap & process (thanks Joseph!).
> As for the actual critical roadmap items mentioned on SPARK-18813, I think
> it makes sense and will comment a bit further on that JIRA.
>
> I would like to encourage votes & watching for issues to give a sense of
> what the community wants (I guess Vote is more explicit yet passive, while
> actually Watching an issue is more informative as it may indicate a real
> use case dependent on the issue?!).
>
> I think if used well this is valuable information for contributors. Of
> course not everything on that list can get done. But if I look through the
> top votes or watch list, while not all of those are likely to go in, a
> great many of the issues are fairly non-contentious in terms of being good
> additions to the project.
>
> Things like these are good examples IMO (I just sample a few of them, not
> exhaustive):
> - sample weights for RF / DT
> - multi-model and/or parallel model selection
> - make sharedParams public?
> - multi-column support for various transformers
> - incremental model training
> - tree algorithm enhancements
>
> Now, whether these can be prioritised in terms of bandwidth available to
> reviewers and committers is a totally different thing. But as Sean mentions
> there is some process there for trying to find the balance of the issue
> being a "good thing to add", a shepherd with bandwidth & interest in the
> issue to review, and the maintenance burden imposed.
>
> Let's take Deep Learning / NN for example. Here's a good example of
> something that has a lot of votes/watchers and as Sean mentions it is
> something that "everyone wants someone else to implement". In this case,
> much of the interest may in fact be "stale" - 2 years ago it would have
> been very interesting to have a strong DL impl in Spark. Now, because there
> are a plethora of very good DL libraries out there, how many of those Votes
> would be "deleted"? Granted few are well integrated with Spark but that can
> and is changing (DL4J, BigDL, the "XonSpark" flavours etc).
>
> So this is something that I dare say will not be in Spark any time in the
> foreseeable future or perhaps ever given the current status. Perhaps it's
> worth seriously thinking about just closing these kind of issues?
>
>
>
> On Fri, 27 Jan 2017 at 05:53 Joseph Bradley  wrote:
>
> Sean has given a great explanation.  A few more comments:
>
> Roadmap: I have been creating roadmap JIRAs, but the goal really is to
> have all committers working on MLlib help to set that roadmap, based on
> either their knowledge of current maintenance/internal needs of the project
> or the feedback given from the rest of the community.
> @Committers - I see people actively shepherding PRs for MLlib, but I don't
> see many major initiatives linked to the roadmap.  If there are ones large
> enough to merit adding to the roadmap, please do.
>
> In general, there are many process improvements we could make.  A few in
> my mind are:
> * Visibility: Let the com

[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap

2017-02-24 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882206#comment-15882206
 ] 

Nick Pentreath commented on SPARK-18813:


FYI I've started going through a few of the top Watched JIRAs and tried to 
identify those that are obviously stale and can probably be closed, to try to 
clean things up a bit.

> MLlib 2.2 Roadmap
> -
>
> Key: SPARK-18813
> URL: https://issues.apache.org/jira/browse/SPARK-18813
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.*
> The roadmap process described below is significantly updated since the 2.1 
> roadmap [SPARK-15581].  Please refer to [SPARK-15581] for more discussion on 
> the basis for this proposal, and comment in this JIRA if you have suggestions 
> for improvements.
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> * Vote for & watch issues which are important to you.
> ** MLlib, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC]
> ** SparkR, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC]
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.  _These 
> meanings have been updated in this proposal for the 2.2 process._
> || Category | Target Version | Priority | Shepherd | Put on roadmap? | In 
> next release? ||
> | [1 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0]
>  | next release | Blocker | *must* | *must* | *must* |
> | [2 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0]
>  | next release | Critical | *must* | yes, unless small | *best effort* |
> | [3 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0]
>  | next release | Major | *must* | optional | *best effort* |
> | [4 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0]
>  | next release | Minor | optional | no | maybe |
> | [5 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20i

[jira] [Closed] (SPARK-10041) Proposal of Parameter Server Interface for Spark

2017-02-24 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath closed SPARK-10041.
--
Resolution: Won't Fix

> Proposal of Parameter Server Interface for Spark
> 
>
> Key: SPARK-10041
> URL: https://issues.apache.org/jira/browse/SPARK-10041
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Yi Liu
> Attachments: Proposal of Parameter Server Interface for Spark - v1.pdf
>
>
> Many large-scale machine learning algorithms (logistic regression, LDA, 
> neural network, etc.) have been built on top of Apache Spark. As discussed in 
> SPARK-4590, a Parameter Server (PS) architecture can greatly improve the 
> scalability and efficiency for these large-scale machine learning. There are 
> some previous discussions on possible Parameter Server implementations inside 
> Spark (e.g., SPARK-6932). However, at this stage we believe it is more 
> important for the community to first define the proper interface of Parameter 
> Server, which can be decoupled from the actual PS implementations; 
> consequently, it is possible to support different implementations of 
> Parameter Servers in Spark later. The attached document contains our initial 
> proposal of Parameter Server interface for ML algorithms on Spark, including 
> data model, supported operations, epoch support and possible Spark 
> integrations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10041) Proposal of Parameter Server Interface for Spark

2017-02-24 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath closed SPARK-10041.
--
Resolution: Won't Fix

> Proposal of Parameter Server Interface for Spark
> 
>
> Key: SPARK-10041
> URL: https://issues.apache.org/jira/browse/SPARK-10041
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Yi Liu
> Attachments: Proposal of Parameter Server Interface for Spark - v1.pdf
>
>
> Many large-scale machine learning algorithms (logistic regression, LDA, 
> neural network, etc.) have been built on top of Apache Spark. As discussed in 
> SPARK-4590, a Parameter Server (PS) architecture can greatly improve the 
> scalability and efficiency for these large-scale machine learning. There are 
> some previous discussions on possible Parameter Server implementations inside 
> Spark (e.g., SPARK-6932). However, at this stage we believe it is more 
> important for the community to first define the proper interface of Parameter 
> Server, which can be decoupled from the actual PS implementations; 
> consequently, it is possible to support different implementations of 
> Parameter Servers in Spark later. The attached document contains our initial 
> proposal of Parameter Server interface for ML algorithms on Spark, including 
> data model, supported operations, epoch support and possible Spark 
> integrations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-10041) Proposal of Parameter Server Interface for Spark

2017-02-24 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reopened SPARK-10041:


> Proposal of Parameter Server Interface for Spark
> 
>
> Key: SPARK-10041
> URL: https://issues.apache.org/jira/browse/SPARK-10041
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Yi Liu
> Attachments: Proposal of Parameter Server Interface for Spark - v1.pdf
>
>
> Many large-scale machine learning algorithms (logistic regression, LDA, 
> neural network, etc.) have been built on top of Apache Spark. As discussed in 
> SPARK-4590, a Parameter Server (PS) architecture can greatly improve the 
> scalability and efficiency for these large-scale machine learning. There are 
> some previous discussions on possible Parameter Server implementations inside 
> Spark (e.g., SPARK-6932). However, at this stage we believe it is more 
> important for the community to first define the proper interface of Parameter 
> Server, which can be decoupled from the actual PS implementations; 
> consequently, it is possible to support different implementations of 
> Parameter Servers in Spark later. The attached document contains our initial 
> proposal of Parameter Server interface for ML algorithms on Spark, including 
> data model, supported operations, epoch support and possible Spark 
> integrations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10041) Proposal of Parameter Server Interface for Spark

2017-02-24 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882198#comment-15882198
 ] 

Nick Pentreath commented on SPARK-10041:


I think it is safe to say this is not going to be part of MLlib core. I'll 
close this.

> Proposal of Parameter Server Interface for Spark
> 
>
> Key: SPARK-10041
> URL: https://issues.apache.org/jira/browse/SPARK-10041
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Yi Liu
> Attachments: Proposal of Parameter Server Interface for Spark - v1.pdf
>
>
> Many large-scale machine learning algorithms (logistic regression, LDA, 
> neural network, etc.) have been built on top of Apache Spark. As discussed in 
> SPARK-4590, a Parameter Server (PS) architecture can greatly improve the 
> scalability and efficiency for these large-scale machine learning. There are 
> some previous discussions on possible Parameter Server implementations inside 
> Spark (e.g., SPARK-6932). However, at this stage we believe it is more 
> important for the community to first define the proper interface of Parameter 
> Server, which can be decoupled from the actual PS implementations; 
> consequently, it is possible to support different implementations of 
> Parameter Servers in Spark later. The attached document contains our initial 
> proposal of Parameter Server interface for ML algorithms on Spark, including 
> data model, supported operations, epoch support and possible Spark 
> integrations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2017-02-24 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882187#comment-15882187
 ] 

Nick Pentreath commented on SPARK-2336:
---

I think it's safe to say that this now lives in a Spark package (which seems 
reasonably actively maintained, which is great), so if anyone wants this, that is 
where to look. I further think it's safe to say this is not going to be 
prioritised for MLlib, so shall we close this ticket?

> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6567) Large linear model parallelism via a join and reduceByKey

2017-02-24 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882182#comment-15882182
 ] 

Nick Pentreath commented on SPARK-6567:
---

This JIRA has been around for a while without any movement. It seems that the 
"vector-free" versions of algorithms such as L-BFGS (see 
https://spark-summit.org/east-2017/events/scaling-apache-spark-mllib-to-billions-of-parameters/)
 will generally be more efficient.

Shall we close this (unless there are major objections)?
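
For reference, the join-and-reduceByKey idea from the description below can be sketched roughly as follows (a toy, self-contained illustration with made-up data, not the proposed implementation): the model lives in an RDD keyed by feature index, the data is in coordinate form keyed the same way, and the per-row dot products come out of a join followed by a reduceByKey.

{code}
import org.apache.spark.sql.SparkSession

object JoinDotProducts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("join-dot-products").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // model: featureIndex -> weight (could be partitioned across the cluster)
    val model = sc.parallelize(Seq(0 -> 0.5, 1 -> -1.0, 2 -> 2.0))

    // data in coordinate form: featureIndex -> (rowId, value)
    val data = sc.parallelize(Seq(
      (0, (10L, 1.0)), (2, (10L, 3.0)),   // row 10: x = [1, 0, 3]
      (1, (11L, 2.0)), (2, (11L, 1.0))))  // row 11: x = [0, 2, 1]

    // join each non-zero entry with its weight, then sum per row
    val dots = data.join(model)
      .map { case (_, ((rowId, value), weight)) => (rowId, value * weight) }
      .reduceByKey(_ + _)

    dots.collect().foreach(println)  // row 10 -> 6.5, row 11 -> 0.0
    spark.stop()
  }
}
{code}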

> Large linear model parallelism via a join and reduceByKey
> -
>
> Key: SPARK-6567
> URL: https://issues.apache.org/jira/browse/SPARK-6567
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Reza Zadeh
> Attachments: model-parallelism.pptx
>
>
> To train a linear model, each training point in the training set needs its 
> dot product computed against the model, per iteration. If the model is large 
> (too large to fit in memory on a single machine) then SPARK-4590 proposes 
> using parameter server.
> There is an easier way to achieve this without parameter servers. In 
> particular, if the data is held as a BlockMatrix and the model as an RDD, 
> then each block can be joined with the relevant part of the model, followed 
> by a reduceByKey to compute the dot products.
> This obviates the need for a parameter server, at least for linear models. 
> However, it's unclear how it compares performance-wise to parameter servers.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3434) Distributed block matrix

2017-02-24 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882179#comment-15882179
 ] 

Nick Pentreath commented on SPARK-3434:
---

This JIRA only has SPARK-3976 open. There was an old PR for it by [~brkyvz] 
here: https://github.com/apache/spark/pull/4286 (which was abandoned).

Unless SPARK-3976 is a priority and someone wants to revive it, shall we close 
this JIRA since the rest of the tickets are resolved?

> Distributed block matrix
> 
>
> Key: SPARK-3434
> URL: https://issues.apache.org/jira/browse/SPARK-3434
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Shivaram Venkataraman
>
> This JIRA is for discussing distributed matrices stored in block 
> sub-matrices. The main challenge is the partitioning scheme to allow adding 
> linear algebra operations in the future, e.g.:
> 1. matrix multiplication
> 2. matrix factorization (QR, LU, ...)
> Let's discuss the partitioning and storage and how they fit into the above 
> use cases.
> Questions:
> 1. Should it be backed by a single RDD that contains all of the sub-matrices 
> or many RDDs with each contains only one sub-matrix?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882174#comment-15882174
 ] 

Nick Pentreath commented on SPARK-14409:


The other option is to work with [~danilo.ascione]'s PR here: 
https://github.com/apache/spark/pull/16618, if Yong does not have time to update his.

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>    Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882163#comment-15882163
 ] 

Nick Pentreath commented on SPARK-14409:


[~roberto.mirizzi] the {{goodThreshold}} param seems pretty reasonable in this 
context to exclude irrelevant items. I think it can be a good {{expertParam}} 
addition.

Ok, I think that a first pass at this should just aim to replicate what we have 
exposed in {{mllib}} and wrap {{RankingMetrics}}. Initially we can look at: (a) 
supporting numeric columns and doing the windowing & {{collect_list}} approach 
to feed into {{RankingMetrics}}; (b) support Array columns and feed directly 
into {{RankingMetrics}} or (c) support both.

[~yongtang] already did a PR here: https://github.com/apache/spark/pull/12461. 
It is fairly complete and also includes MRR. [~yongtang] are you able to work 
on reviving that PR? If so, [~roberto.mirizzi] [~danilo.ascione], are you able 
to help review that PR?
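
To make option (a) concrete, here is a rough sketch of wrapping {{RankingMetrics}} with a window over the predictions (column names and types are assumptions for illustration - integer user/item ids, a double prediction and a numeric relevance label - and this is not the proposed evaluator API):

{code}
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

def rankingMetrics(df: DataFrame, k: Int): RankingMetrics[Int] = {
  // top-k predicted items per user, in descending order of prediction
  val byUser = Window.partitionBy("user").orderBy(col("prediction").desc)
  val predicted = df
    .withColumn("rank", row_number().over(byUser))
    .where(col("rank") <= k)
    .select("user", "item", "rank")
    .rdd
    .map(r => (r.getInt(0), (r.getInt(1), r.getInt(2))))
    .groupByKey()
    .mapValues(_.toArray.sortBy(_._2).map(_._1))

  // ground truth: the relevant (positively labelled) items per user
  val actual = df
    .where(col("label") > 0)
    .select("user", "item")
    .rdd
    .map(r => (r.getInt(0), r.getInt(1)))
    .groupByKey()
    .mapValues(_.toArray)

  new RankingMetrics[Int](predicted.join(actual).values)
}
{code}

The resulting metrics object then exposes the usual top-k measures (precision at k, MAP, NDCG) for use in evaluation.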

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>    Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14084) Parallel training jobs in model selection

2017-02-23 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-14084.

  Resolution: Duplicate
Target Version/s:   (was: )

> Parallel training jobs in model selection
> -
>
> Key: SPARK-14084
> URL: https://issues.apache.org/jira/browse/SPARK-14084
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> In CrossValidator and TrainValidationSplit, we run training jobs one by one. 
> If users have a big cluster, they might see speed-ups if we parallelize the 
> job submission on the driver. The trade-off is that we might need to make 
> multiple copies of the training data, which could be expensive. It is worth 
> testing and figure out the best way to implement it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14084) Parallel training jobs in model selection

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882123#comment-15882123
 ] 

Nick Pentreath commented on SPARK-14084:


I guess we could have put SPARK-19071 into this ticket (sorry about that) - but 
since SPARK-19071 also covers a longer-term plan for further optimizing 
parallel CV, I'm going to close this as Superseded By. If watchers are still 
interested, please watch SPARK-19071. Thanks!

> Parallel training jobs in model selection
> -
>
> Key: SPARK-14084
> URL: https://issues.apache.org/jira/browse/SPARK-14084
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> In CrossValidator and TrainValidationSplit, we run training jobs one by one. 
> If users have a big cluster, they might see speed-ups if we parallelize the 
> job submission on the driver. The trade-off is that we might need to make 
> multiple copies of the training data, which could be expensive. It is worth 
> testing and figure out the best way to implement it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882113#comment-15882113
 ] 

Nick Pentreath edited comment on SPARK-3246 at 2/24/17 7:15 AM:


Since {{mllib}} is in maintenance mode and {{LinearSVC}} was added in 
SPARK-14709 (and supports {{weightCol}}, I am going to close this as Wont Fix


was (Author: mlnick):
Since {{mllib}} is in maintenance mode and {{LinearSVC}} was added in 
SPARK-14709, I am going to close this as Wont Fix

> Support weighted SVMWithSGD for classification of unbalanced dataset
> 
>
> Key: SPARK-3246
> URL: https://issues.apache.org/jira/browse/SPARK-3246
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 0.9.0, 1.0.2
>Reporter: mahesh bhole
>
> Please support weighted SVMWithSGD for binary classification of unbalanced 
> datasets. Though other options like undersampling or oversampling can be 
> used, it would be good if we could have a way to assign weights to the minority 
> class. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882113#comment-15882113
 ] 

Nick Pentreath edited comment on SPARK-3246 at 2/24/17 7:16 AM:


Since {{mllib}} is in maintenance mode and {{LinearSVC}} was added in 
SPARK-14709 (and supports {{weightCol}}), I am going to close this as Wont Fix


was (Author: mlnick):
Since {{mllib}} is in maintenance mode and {{LinearSVC}} was added in 
SPARK-14709 (and supports {{weightCol}}, I am going to close this as Wont Fix

> Support weighted SVMWithSGD for classification of unbalanced dataset
> 
>
> Key: SPARK-3246
> URL: https://issues.apache.org/jira/browse/SPARK-3246
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 0.9.0, 1.0.2
>Reporter: mahesh bhole
>
> Please support weighted SVMWithSGD for binary classification of unbalanced 
> datasets. Though other options like undersampling or oversampling can be 
> used, it would be good if we could have a way to assign weights to the minority 
> class. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset

2017-02-23 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath closed SPARK-3246.
-
Resolution: Won't Fix

> Support weighted SVMWithSGD for classification of unbalanced dataset
> 
>
> Key: SPARK-3246
> URL: https://issues.apache.org/jira/browse/SPARK-3246
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 0.9.0, 1.0.2
>Reporter: mahesh bhole
>
> Please support weighted SVMWithSGD for binary classification of unbalanced 
> datasets. Though other options like undersampling or oversampling can be 
> used, it would be good if we could have a way to assign weights to the minority 
> class. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882113#comment-15882113
 ] 

Nick Pentreath commented on SPARK-3246:
---

Since {{mllib}} is in maintenance mode and {{LinearSVC}} was added in 
SPARK-14709, I am going to close this as Wont Fix

> Support weighted SVMWithSGD for classification of unbalanced dataset
> 
>
> Key: SPARK-3246
> URL: https://issues.apache.org/jira/browse/SPARK-3246
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 0.9.0, 1.0.2
>Reporter: mahesh bhole
>
> Please support weighted SVMWithSGD for binary classification of unbalanced 
> datasets. Though other options like undersampling or oversampling can be 
> used, it would be good if we could have a way to assign weights to the minority 
> class. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880628#comment-15880628
 ] 

Nick Pentreath commented on SPARK-19634:


Ah I see it was discussed in the design doc - I will go through it in more detail 
and check the comments.

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880623#comment-15880623
 ] 

Nick Pentreath commented on SPARK-19634:


Thanks [~timhunter].

In terms of performance, we expect to gain from (a) not computing unnecessary 
metrics or values (saving mainly in memory usage for the intermediate arrays 
created, potentially some computation saving); and (b) using UDAF.

Do we expect a large gain from using UDAF? I'm not totally up to date on the 
current state of UDAF integration with Tungsten data, but my last impression 
was that (a) UDAFs didn't really offer such gains unless they're internal (like 
HyperLogLog) and (b) array storage & SerDe in Tungsten was still a bit patchy. 
Has this changed?

Of course in terms of API it is beneficial and we should do it anyway, under the 
assumption that performance is at least the same as the current implementation. 
I just want to understand the expected performance gains, since the implicit 
assumption is always "DataFrame operations will be so much faster", but in 
practice this is not always the case for more complex data types and situations, 
or for things that switch to RDDs under the hood anyway, such as the linear 
model cases...

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Feedback on MLlib roadmap process proposal

2017-02-23 Thread Nick Pentreath
Sorry for being late to the discussion. I think Joseph, Sean and others
have covered the issues well.

Overall I like the proposed cleaned up roadmap & process (thanks Joseph!).
As for the actual critical roadmap items mentioned on SPARK-18813, I think
it makes sense and will comment a bit further on that JIRA.

I would like to encourage votes & watching for issues to give a sense of
what the community wants (I guess Vote is more explicit yet passive, while
actually Watching an issue is more informative as it may indicate a real
use case dependent on the issue?!).

I think if used well this is valuable information for contributors. Of
course not everything on that list can get done. But if I look through the
top votes or watch list, while not all of those are likely to go in, a
great many of the issues are fairly non-contentious in terms of being good
additions to the project.

Things like these are good examples IMO (I just sample a few of them, not
exhaustive):
- sample weights for RF / DT
- multi-model and/or parallel model selection
- make sharedParams public?
- multi-column support for various transformers
- incremental model training
- tree algorithm enhancements

Now, whether these can be prioritised in terms of bandwidth available to
reviewers and committers is a totally different thing. But as Sean mentions
there is some process there for trying to find the balance of the issue
being a "good thing to add", a shepherd with bandwidth & interest in the
issue to review, and the maintenance burden imposed.

Let's take Deep Learning / NN for example. Here's a good example of
something that has a lot of votes/watchers and as Sean mentions it is
something that "everyone wants someone else to implement". In this case,
much of the interest may in fact be "stale" - 2 years ago it would have
been very interesting to have a strong DL impl in Spark. Now, because there
are a plethora of very good DL libraries out there, how many of those Votes
would be "deleted"? Granted few are well integrated with Spark but that can
and is changing (DL4J, BigDL, the "XonSpark" flavours etc).

So this is something that I dare say will not be in Spark any time in the
foreseeable future or perhaps ever given the current status. Perhaps it's
worth seriously thinking about just closing these kind of issues?



On Fri, 27 Jan 2017 at 05:53 Joseph Bradley  wrote:

> Sean has given a great explanation.  A few more comments:
>
> Roadmap: I have been creating roadmap JIRAs, but the goal really is to
> have all committers working on MLlib help to set that roadmap, based on
> either their knowledge of current maintenance/internal needs of the project
> or the feedback given from the rest of the community.
> @Committers - I see people actively shepherding PRs for MLlib, but I don't
> see many major initiatives linked to the roadmap.  If there are ones large
> enough to merit adding to the roadmap, please do.
>
> In general, there are many process improvements we could make.  A few in
> my mind are:
> * Visibility: Let the community know what committers are focusing on.
> This was the primary purpose of the "MLlib roadmap proposal."
> * Community initiatives: This is currently very organic.  Some of the
> organic process could be improved, such as encouraging Votes/Watchers
> (though I agree with Sean about these being one-sided metrics).  Cody's SIP
> work is a great step towards adding more clarity and structure for major
> initiatives.
> * JIRA hygiene: Always a challenge, and always requires some manual
> prodding.  But it's great to push for efforts on this.
>
>
> On Wed, Jan 25, 2017 at 3:59 AM, Sean Owen  wrote:
>
> On Wed, Jan 25, 2017 at 6:01 AM Ilya Matiach  wrote:
>
> My confusion was that the ML 2.2 roadmap critical features (
> https://issues.apache.org/jira/browse/SPARK-18813) did not line up with
> the top ML/MLLIB JIRAs by Votes or Watchers.
>
> Your explanation that they do not have 

[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880387#comment-15880387
 ] 

Nick Pentreath commented on SPARK-18813:


Thanks for this Joseph and everyone for the comments & discussion (I also 
responded briefly on @dev list).

From my perspective, I think the highest priorities are:

# *ML / MLlib Parity* - this is sort of a weight around our necks currently, 
and once it is finally done it can free up bandwidth and mind space for how to 
take Spark ML forward. Until it is complete, I feel that almost everything else 
that is not a critical bug fix or missing feature will not (and probably should 
not) receive much attention;
# *Python / R / Scala parity* - I think this is obvious
# *Persistence completion & backward compat* - again obvious

In terms of the "new features & improvements" stuff, I tend to agree with some 
of the items mentioned by Yanbo:

# *Improve scalability & usability of feature transformers* - also fixing any 
big issues here (e.g. https://issues.apache.org/jira/browse/SPARK-8418 and OHE 
issue such as https://issues.apache.org/jira/browse/SPARK-13030).
# *Set initial model* - useful, and a lot of the initial work in the PR has been 
done, so there is actually not that much blocking this going in
# *GBT improvements* - I think it is safe to say that GBT/GBM is a core 
algorithm and we should aim to be able to provide comparable functionality to 
scikit-learn, XGBoost & LightGBM in terms of scalability and accuracy. Even if 
this means re-writing much of the implementation! 

FYI, I don't really believe the local model stuff is high priority - not 
because it is not critically important and often requested, but because it is 
such a complex thing, and Spark ML is so tightly integrated with Spark SQL 
runtime currently, that I think we'd need to come up with a pretty detailed 
plan here for it to make its way anywhere near the medium-term roadmap... Added 
to this the community is moving ahead with their own solutions fairly rapidly, 
so that perhaps it will not be as important in the near future (e.g. MLeap and 
other serving frameworks, standards such as PMML & PFA, etc).

Apart from a few JIRAs / PRs that were around before that I need to 
re-shepherd, I'll be trying to focus on the parity stuff for the remainder of the 
{{2.2}} dev cycle - for my own outstanding ALS PR and others. I'll work through 
the major items and either shepherd or help shepherd them over the line.

> MLlib 2.2 Roadmap
> -
>
> Key: SPARK-18813
> URL: https://issues.apache.org/jira/browse/SPARK-18813
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.*
> The roadmap process described below is significantly updated since the 2.1 
> roadmap [SPARK-15581].  Please refer to [SPARK-15581] for more discussion on 
> the basis for this proposal, and comment in this JIRA if you have suggestions 
> for improvements.
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> * Vote for & watch issues which are important to you.
> ** MLlib, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC]
> ** SparkR, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC]

Re: Implementation of RNN/LSTM in Spark

2017-02-23 Thread Nick Pentreath
The short answer is that there is none, and it is highly unlikely to be inside Spark
MLlib any time in the near future.

The best bets are to look at other DL libraries - for JVM there is
Deeplearning4J and BigDL (there are others but these seem to be the most
comprehensive I have come across) - that run on Spark. Also there are
various flavours of TensorFlow / Caffe on Spark. And of course the libs
such as Torch, Keras, Tensorflow, MXNet, Caffe etc. Some of them have Java
or Scala APIs and some form of Spark integration out there in the community
(in varying states of development).

Integrations with Spark are a bit patchy currently but include the
"XOnSpark" flavours mentioned above and TensorFrames (again, there may be
others).

On Thu, 23 Feb 2017 at 14:23 n1kt0  wrote:

> Hi,
> can anyone tell me what the current status about RNNs in Spark is?
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Implementation-of-RNN-LSTM-in-Spark-tp14866p21060.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880324#comment-15880324
 ] 

Nick Pentreath commented on SPARK-14409:


[~roberto.mirizzi] If using the current {{ALS.transform}} output as input to 
the {{RankingEvaluator}}, as envisaged here, the model will predict a score for 
each {{user-item}} pair in the evaluation set. For each user, the ground truth 
is exactly this distinct set of items. By definition the top-k items ranked by 
predicted score will be in the ground truth set, since {{ALS}} is only scoring 
{{user-item}} pairs *that already exist in the evaluation set*. So how is it 
possible *not* to get a perfect score, since all top-k recommended items will 
be "relevant"?

Unless you are cutting off the ground truth set at {{k}} too - in which case 
that does not sound like a correct computation to me.

By contrast, if {{ALS.transform}} were to output a set of top-k items for each user, 
where the items are scored from *the entire set of possible candidate items*, 
then computing the ranking metric of that top-k set against the actual ground 
truth for each user is correct.
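
A toy illustration of the point (the item ids and scores are made up, not project code), using the existing {{mllib}} {{RankingMetrics}}:

{code}
import org.apache.spark.mllib.evaluation.RankingMetrics

// ground truth for a user = the distinct items in their evaluation set.
// If we only rank those same pairs, every "recommended" item is relevant by construction:
val fromEvalPairs = sc.parallelize(Seq((Array(5, 9, 1), Array(1, 5, 9))))
new RankingMetrics(fromEvalPairs).precisionAt(3)   // 1.0, regardless of model quality

// Ranking over the full candidate item set can surface non-relevant items, so the metric is informative:
val fromFullCatalog = sc.parallelize(Seq((Array(5, 42, 7), Array(1, 5, 9))))
new RankingMetrics(fromFullCatalog).precisionAt(3) // ~0.33
{code}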

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880312#comment-15880312
 ] 

Nick Pentreath commented on SPARK-14409:


[~danilo.ascione] Yes, your solution is generic assuming the input 
{{DataFrame}} is {{| user | item | predicted_score | actual_score |}}, and that 
any of {{predicted_score}} or {{actual_score}} could be missing.
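
For illustration, a minimal sketch of that input shape (the column names follow the discussion above; nothing here is a fixed API):

{code}
import spark.implicits._

val input = Seq(
  (1, 10, Some(4.5), Some(5.0)),  // predicted and relevant
  (1, 11, Some(3.2), None),       // predicted but not in the ground truth
  (1, 12, None, Some(4.0))        // relevant but never recommended
).toDF("user", "item", "predicted_score", "actual_score")
{code}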

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>    Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-02-23 Thread Nick Pentreath
Currently your only option is to write (or copy) your own implementations.

Logging is definitely intended to be internal use only, and it's best to
use your own logging lib - Typesafe scalalogging is a common option that
I've used.

As for the VectorUDT, for now that is private. There are no plans to open
it up as yet. It should not be too difficult to have your own UDT
implementation. What type of extensions are you trying to do with the UDT?

Likewise the shared params are for now private. It is a bit annoying to
have to re-create them, but most of them are pretty simple so it's not a
huge overhead.
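
For example, a minimal sketch of re-creating such a shared param in your own
namespace (the trait name here is illustrative, not an agreed API):

import org.apache.spark.ml.param.{Param, Params}

// stand-in for the private org.apache.spark.ml.param.shared.HasInputCol
trait HasInputColCompat extends Params {
  final val inputCol: Param[String] =
    new Param[String](this, "inputCol", "input column name")
  final def getInputCol: String = $(inputCol)
}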

Perhaps you can add your thoughts & comments to
https://issues.apache.org/jira/browse/SPARK-19498 in terms of extending
Spark ML. Ultimately I support making it easier to extend. But we do have
to balance that with exposing new public APIs and classes that impose
backward compat guarantees.

Perhaps now is a good time to think about some of the common shared params
for example.

Thanks
Nick


On Wed, 22 Feb 2017 at 22:51 Shouheng Yi 
wrote:

Hi Spark developers,



Currently my team at Microsoft is extending Spark’s machine learning
functionalities to include new learners and transformers. We would like
users to use these within spark pipelines so that they can mix and match
with existing Spark learners/transformers, and overall have a native spark
experience. We cannot accomplish this using a non-“org.apache” namespace
with the current implementation, and we don’t want to release code inside
the apache namespace because it’s confusing and there could be naming
rights issues.



We need to extend several classes from spark which happen to have
“private[spark].” For example, one of our classes extends VectorUDT [0], which
is declared as private[spark] class VectorUDT. This
unfortunately puts us in a strange scenario that forces us to work under the
namespace org.apache.spark.



To be specific, currently the private classes/traits we need to use to
create new Spark learners & Transformers are HasInputCol, VectorUDT and
Logging. We will expand this list as we develop more.



Is there a way to avoid this namespace issue? What do other
people/companies do in this scenario? Thank you for your help!



[0]:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala



Best,

Shouheng


[jira] [Commented] (SPARK-19668) Multiple NGram sizes

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880131#comment-15880131
 ] 

Nick Pentreath commented on SPARK-19668:


The simplest approach would be to keep the existing param and make it the max of the 
range (since that keeps the existing semantics), and add a new param to 
specify the start of the range.

However we need to be careful about backward compat for save/load. If an older 
model is loaded, then the two params should be set to the same value. I think 
it would be sufficient to check if the new param is missing and if so set it to 
the value of the existing param. Make sense?
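
A rough sketch of what that could look like (the {{minN}} name and the reader logic are hypothetical, purely to illustrate the compat check):

{code}
import org.apache.spark.ml.feature.NGram

// existing `n` keeps its meaning as the top of the range; a new, hypothetical
// `minN` param would give the bottom of the range
val ngram = new NGram().setInputCol("tokens").setOutputCol("ngrams").setN(3)
// ngram.setMinN(1)   // would emit 1-, 2- and 3-grams

// load-compat (pseudo-logic for the model reader): if the saved metadata has no
// "minN", set it to the value of `n`, so models saved before this change keep
// behaving exactly as they did
{code}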

> Multiple NGram sizes
> 
>
> Key: SPARK-19668
> URL: https://issues.apache.org/jira/browse/SPARK-19668
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Jacek KK
>Priority: Minor
>  Labels: beginner, easyfix, newbie
>
> It would be nice to have the possibility of specifying the range (or maybe a 
> list of) sizes of ngrams, like it is done in sklearn, as in 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer
> This shouldn't be difficult to add, the code is very straightforward, and I 
> can implement it. The only issue is with the NGram API - should it just 
> accept a number/tuple/list?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-22 Thread Nick Pentreath
And to be clear, are you doing a self-join for approx similarity? Or
joining to another dataset?
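
For reference, a minimal sketch of the two cases against the 2.1 API (dfA/dfB,
the threshold and numHashTables below are placeholders, not recommendations):

import org.apache.spark.ml.feature.MinHashLSH

val model = new MinHashLSH()
  .setInputCol("features")
  .setOutputCol("hashes")
  .setNumHashTables(2)
  .fit(dfA)

// self-join: candidate pairs within the same dataset under the distance threshold
val selfJoined  = model.approxSimilarityJoin(dfA, dfA, 0.6)
// join against a second dataset
val crossJoined = model.approxSimilarityJoin(dfA, dfB, 0.6)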



On Thu, 23 Feb 2017 at 02:01, nguyen duc Tuan  wrote:

> Hi Seth,
> Here's the parameters that I used in my experiments.
> - Number of executors: 16
> - Executor's memories: vary from 1G -> 2G -> 3G
> - Number of cores per executor: 1-> 2
> - Driver's memory:  1G -> 2G -> 3G
> - The similar threshold: 0.6
> MinHash:
> - number of hash tables: 2
> SignedRandomProjection:
> - Number of hash tables: 2
>
> 2017-02-23 0:13 GMT+07:00 Seth Hendrickson :
>
> I'm looking into this a bit further, thanks for bringing it up! Right now
> the LSH implementation only uses OR-amplification. The practical
> consequence of this is that it will select too many candidates when doing
> approximate near neighbor search and approximate similarity join. When we
> add AND-amplification I think it will become significantly more usable. In
> the meantime, I will also investigate scalability issues.
>
> Can you please provide every parameter you used? It will be very helpful
> :) For instance, the similarity threshold, the number of hash tables, the
> bucket width, etc...
>
> Thanks!
>
> On Mon, Feb 13, 2017 at 3:21 PM, Nick Pentreath 
> wrote:
>
> The original Uber authors provided this performance test result:
> https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro
>
> This was for MinHash only though, so it's not clear about what the
> scalability is for the other metric types.
>
> The SignRandomProjectionLSH is not yet in Spark master (see
> https://issues.apache.org/jira/browse/SPARK-18082). It could be there are
> some implementation details that would make a difference here.
>
> By the way, what is the join threshold you use in approx join?
>
> Could you perhaps create a JIRA ticket with the details in order to track
> this?
>
>
> On Sun, 12 Feb 2017 at 22:52 nguyen duc Tuan  wrote:
>
> After all, I switched back to LSH implementation that I used before (
> https://github.com/karlhigley/spark-neighbors ). I can run on my dataset
> now. If someone has any suggestion, please tell me.
> Thanks.
>
> 2017-02-12 9:25 GMT+07:00 nguyen duc Tuan :
>
> Hi Timur,
> 1) Our data is transformed to dataset of Vector already.
> 2) If I use RandomSignProjectLSH, the job dies after I call
> approximateSimilarJoin. I tried to use MinHash instead, but the job is still
> slow. I don't think the problem is related to the GC. The time for GC is
> small compared with the time for computation. Here are some screenshots of my
> job.
> Thanks
>
> 2017-02-12 8:01 GMT+07:00 Timur Shenkao :
>
> Hello,
>
> 1) Are you sure that your data is "clean"?  No unexpected missing values?
> No strings in unusual encoding? No additional or missing columns ?
> 2) How long does your job run? What about garbage collector parameters?
> Have you checked what happens with jconsole / jvisualvm ?
>
> Sincerely yours, Timur
>
> On Sat, Feb 11, 2017 at 12:52 AM, nguyen duc Tuan 
> wrote:
>
> Hi Nick,
> Because we use *RandomSignProjectionLSH*, there is only one parameter for
> LSH is the number of hashes. I try with small number of hashes (2) but the
> error is still happens. And it happens when I call similarity join. After
> transformation, the size of  dataset is about 4G.
>
> 2017-02-11 3:07 GMT+07:00 Nick Pentreath :
>
> What other params are you using for the lsh transformer?
>
> Are the issues occurring during transform or during the similarity join?
>
>
> On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan 
> wrote:
>
> hi Das,
> In general, I will apply them to larger datasets, so I want to use LSH,
> which is more scalable than the approaches you suggested. Have you
> tried LSH in Spark 2.1.0 before ? If yes, how do you set the
> parameters/configuration to make it work ?
> Thanks.
>
> 2017-02-10 19:21 GMT+07:00 Debasish Das :
>
> If it is 7m rows and 700k features (or say 1m features) brute force row
> similarity will run fine as well...check out spark-4823...you can compare
> quality with approximate variant...
> On Feb 9, 2017 2:55 AM, "nguyen duc Tuan"  wrote:
>
> Hi everyone,
> Since spark 2.1.0 introduces LSH (
> http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing),
> we want to use LSH to find approximate nearest neighbors. Basically, we
> have a dataset with about 7M rows. We want to use cosine distance to measure
> the similarity between items, so we use *RandomSignProjectionLSH* (
> https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db) instead
> of *BucketedRandomProjectionLSH*. I try to tune

[jira] [Resolved] (SPARK-19679) Destroy broadcasted object without blocking

2017-02-22 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19679.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17016
[https://github.com/apache/spark/pull/17016]

> Destroy broadcasted object without blocking
> ---
>
> Key: SPARK-19679
> URL: https://issues.apache.org/jira/browse/SPARK-19679
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Trivial
> Fix For: 2.2.0
>
>
> Destroy broadcasted object without blocking in ML



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19679) Destroy broadcasted object without blocking

2017-02-22 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-19679:
--

Assignee: zhengruifeng

> Destroy broadcasted object without blocking
> ---
>
> Key: SPARK-19679
> URL: https://issues.apache.org/jira/browse/SPARK-19679
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
> Fix For: 2.2.0
>
>
> Destroy broadcasted object without blocking in ML



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19694) Add missing 'setTopicDistributionCol' for LDAModel

2017-02-22 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-19694:
--

Assignee: zhengruifeng

> Add missing 'setTopicDistributionCol' for LDAModel
> --
>
> Key: SPARK-19694
> URL: https://issues.apache.org/jira/browse/SPARK-19694
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
> Fix For: 2.2.0
>
>
> {{LDAModel}} can not set the output column now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19694) Add missing 'setTopicDistributionCol' for LDAModel

2017-02-22 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19694.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17021
[https://github.com/apache/spark/pull/17021]

> Add missing 'setTopicDistributionCol' for LDAModel
> --
>
> Key: SPARK-19694
> URL: https://issues.apache.org/jira/browse/SPARK-19694
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Trivial
> Fix For: 2.2.0
>
>
> {{LDAModel}} can not set the output column now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH

2017-02-21 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15875552#comment-15875552
 ] 

Nick Pentreath edited comment on SPARK-18454 at 2/21/17 8:00 AM:
-

Can you also comment on 
http://mail-archives.apache.org/mod_mbox/spark-user/201702.mbox/%3CCANxMKZU0iVd9Ff4TrWjtdk%3DkEyXAeoXGLEgmVW5vbE5puobE6Q%40mail.gmail.com%3E?
 It would be good to understand why we're seeing poor performance vs an 
alternative impl in Spark packages, and whether we can take some idea from that 
on how to improve performance.

Though it's true it does not support similarity join. Still we should 
investigate.


was (Author: mlnick):
Can you also comment on 
http://mail-archives.apache.org/mod_mbox/spark-user/201702.mbox/%3CCANxMKZU0iVd9Ff4TrWjtdk%3DkEyXAeoXGLEgmVW5vbE5puobE6Q%40mail.gmail.com%3E?
 It would be good to understand why we're seeing poor performance vs an 
alternative impl in Spark packages, and whether we can take some idea from that 
on how to improve performance.

> Changes to improve Nearest Neighbor Search for LSH
> --
>
> Key: SPARK-18454
> URL: https://issues.apache.org/jira/browse/SPARK-18454
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yun Ni
>
> We all agree to do the following improvement to Multi-Probe NN Search:
> (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
> full sort on the whole dataset
> Currently we are still discussing the following:
> (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
> (2) What are the issues and how we should change the current Nearest Neighbor 
> implementation



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH

2017-02-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15875552#comment-15875552
 ] 

Nick Pentreath commented on SPARK-18454:


Can you also comment on 
http://mail-archives.apache.org/mod_mbox/spark-user/201702.mbox/%3CCANxMKZU0iVd9Ff4TrWjtdk%3DkEyXAeoXGLEgmVW5vbE5puobE6Q%40mail.gmail.com%3E?
 It would be good to understand why we're seeing poor performance vs an 
alternative impl in Spark packages, and whether we can take some idea from that 
on how to improve performance.

> Changes to improve Nearest Neighbor Search for LSH
> --
>
> Key: SPARK-18454
> URL: https://issues.apache.org/jira/browse/SPARK-18454
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yun Ni
>
> We all agree to do the following improvement to Multi-Probe NN Search:
> (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
> full sort on the whole dataset
> Currently we are still discussing the following:
> (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
> (2) What are the issues and how we should change the current Nearest Neighbor 
> implementation



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2017-02-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15875465#comment-15875465
 ] 

Nick Pentreath commented on SPARK-18608:


[~podongfeng] [~yuhaoyan] I'm not aware of anyone working on this now, either 
of you want to take it?

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{DataSet}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.
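
For reference, a minimal sketch of the migrated check described above (assuming the usual handlePersistence pattern; {{dataset}} is the algorithm's input):

{code}
import org.apache.spark.storage.StorageLevel

// check the Dataset's own storage level (available since SPARK-16063)
// rather than dataset.rdd.getStorageLevel, which is always NONE
val handlePersistence = dataset.storageLevel == StorageLevel.NONE
if (handlePersistence) dataset.persist(StorageLevel.MEMORY_AND_DISK)
// ... iterative fitting over `dataset` ...
if (handlePersistence) dataset.unpersist()
{code}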



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19668) Multiple NGram sizes

2017-02-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15875437#comment-15875437
 ] 

Nick Pentreath commented on SPARK-19668:


I'd say a range is feasible. The current API doesn't quite fit, but we could 
add a further parameter that specifies the max of the range.

> Multiple NGram sizes
> 
>
> Key: SPARK-19668
> URL: https://issues.apache.org/jira/browse/SPARK-19668
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Jacek KK
>Priority: Minor
>  Labels: beginner, easyfix, newbie
>
> It would be nice to have the possibility of specifying the range (or maybe a 
> list of) sizes of ngrams, like it is done in sklearn, as in 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer
> This shouldn't be difficult to add, the code is very straightforward, and I 
> can implement it. The only issue is with the NGram API - should it just 
> accept a number/tuple/list?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19573) Make NaN/null handling consistent in approxQuantile

2017-02-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15874321#comment-15874321
 ] 

Nick Pentreath commented on SPARK-19573:


cc [~timhunter] - can you take a look at the discussion (here: 
https://github.com/apache/spark/pull/16776#discussion_r101155454)? What is your 
view? thanks

> Make NaN/null handling consistent in approxQuantile
> ---
>
> Key: SPARK-19573
> URL: https://issues.apache.org/jira/browse/SPARK-19573
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>
> As discussed in https://github.com/apache/spark/pull/16776, this jira is used 
> to track the following issue:
> Multi-column version of approxQuantile drop the rows containing *any* 
> NaN/null, the results are not consistent with outputs of the single-version.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866755#comment-15866755
 ] 

Nick Pentreath commented on SPARK-19208:


Ah right I see - yes rewrite rules would be a good ultimate goal.

One question I have - if we do:

{{code}}
val summary = df.select(VectorSummarizer.metrics("min", 
"max").summary("features"))
{{code}}

How will we return a DF with cols {{min}} and {{max}}? Since it seems multiple 
return cols are not supported by UDAF?

Or do we have to live with the struct return type for now until we could do the 
rewrite version?
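
For what it's worth, if we do live with the struct return for now, pulling the vectors back out is at least straightforward (sketch only; {{VectorSummarizer}} is the API under discussion, not something that exists yet):

{code}
// assuming the aggregate returns a single struct column aliased "summary"
// with nested vector fields "min" and "max"
val summarized = df.select(
  VectorSummarizer.metrics("min", "max").summary("features").as("summary"))

val minMax = summarized.select($"summary.min".as("min"), $"summary.max".as("max"))
{code}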

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866755#comment-15866755
 ] 

Nick Pentreath edited comment on SPARK-19208 at 2/14/17 9:42 PM:
-

Ah right I see - yes rewrite rules would be a good ultimate goal.

One question I have - if we do:

{code}
val summary = df.select(VectorSummarizer.metrics("min", 
"max").summary("features"))
{code}

How will we return a DF with cols {{min}} and {{max}}? Since it seems multiple 
return cols are not supported by UDAF?

Or do we have to live with the struct return type for now until we could do the 
rewrite version?


was (Author: mlnick):
Ah right I see - yes rewrite rules would be a good ultimate goal.

One question I have - if we do:

{{code}}
val summary = df.select(VectorSummarizer.metrics("min", 
"max").summary("features"))
{{code}}

How will we return a DF with cols {{min}} and {{max}}? Since it seems multiple 
return cols are not supported by UDAF?

Or do we have to live with the struct return type for now until we could do the 
rewrite version?

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866675#comment-15866675
 ] 

Nick Pentreath commented on SPARK-19208:


When I said "estimator-like", I didn't mean it should necessarily be an actual 
{{Estimator}} (I agree it is not really intended to fit into transformers & 
pipelines), but rather mimic the API, i.e. that the summarizer is "fitted" on a 
dataset to return a summary.

I just wasn't too keen on the idea of returning a struct as it just feels sort 
of clunky relative to returning a df with vector columns {{"mean", "min", 
"max"}} etc.

Supporting SS and {{groupBy}} seems like an important goal, so something like 
[~timhunter]'s suggestion looks like it will work nicely.

For doing it via catalyst rules, that would be first prize to automatically 
re-use the intermediate results for multiple end-result computations, and only 
compute what is necessary for the required end-results. But, is that even 
supported for UDTs currently? I'm not an expert, but my understanding is that it 
is not supported yet.

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14503) spark.ml Scala API for FPGrowth

2017-02-13 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864872#comment-15864872
 ] 

Nick Pentreath commented on SPARK-14503:


Seems {{PrefixSpan}} even takes different input: {{Array[Array[T]]}} vs 
FPGrowth: {{Array[T]}}. So it may be tricky to unify.

However we do have the case where e.g. {{QuantileDiscretizer}} returns a 
{{Bucketizer}} as {{Model}} from {{fit}}. In that case {{Bucketizer}} can be 
instantiated directly and independently, but it could in theory be the case 
that some other estimator returns a {{Bucketizer}} as its model.

So we could perhaps think about both {{FPGrowth}} and {{PrefixSpan}} returning 
an {{AssociationRuleModel}} from {{fit}}. It could work if the input can be 
generalized to {{Seq[T]}} where for {{FPGrowth}} it would be {{Seq[Item]}} and 
for {{PrefixSpan}} it would be {{Seq[Seq[Item]]}}. The output of {{transform}} 
for the model would be the predicted items as above. It would expose 
{{getFreqItems}} and {{getAssociationRules}} both returning a {{DataFrame}}.

Is there something in the nature of {{PrefixSpan}} vs {{FPGrowth}} that makes 
this too difficult? (I'll have to go read the papers when I get some time!)

But having said that it could be pretty complex to try to support this. If so, 
unless there's a compelling argument I'd go for [~josephkb]'s suggestion above, 
and hide the association rule class for now (can expose later as needed). Then 
{{PrefixSpan}} will be totally independent and return its own 
{{PrefixSpanModel}} (that may also expose a {{transform}} method that has 
similar semantics but different internals).
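
To make the shape of that concrete, a rough sketch of the DataFrame-based API being discussed (class, column and param names are illustrative only, not a settled design):

{code}
// FPGrowth: one transaction per row, items in an array column
val fp = new FPGrowth().setItemsCol("items").setMinSupport(0.3).setMinConfidence(0.8)
val model = fp.fit(transactions)   // transactions: DataFrame with an "items" array column

model.freqItemsets                 // DataFrame of (items, freq)
model.associationRules             // DataFrame of (antecedent, consequent, confidence)
model.transform(transactions)      // append predicted items per row

// PrefixSpan would instead take a column of type array<array<item>>,
// which is what makes a fully shared model class tricky
{code}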

> spark.ml Scala API for FPGrowth
> ---
>
> Key: SPARK-14503
> URL: https://issues.apache.org/jira/browse/SPARK-14503
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-13 Thread Nick Pentreath
The original Uber authors provided this performance test result:
https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro

This was for MinHash only though, so it's not clear about what the
scalability is for the other metric types.

The SignRandomProjectionLSH is not yet in Spark master (see
https://issues.apache.org/jira/browse/SPARK-18082). It could be there are
some implementation details that would make a difference here.

By the way, what is the join threshold you use in approx join?

Could you perhaps create a JIRA ticket with the details in order to track
this?


On Sun, 12 Feb 2017 at 22:52 nguyen duc Tuan  wrote:

> After all, I switched back to LSH implementation that I used before (
> https://github.com/karlhigley/spark-neighbors ). I can run on my dataset
> now. If someone has any suggestion, please tell me.
> Thanks.
>
> 2017-02-12 9:25 GMT+07:00 nguyen duc Tuan :
>
> Hi Timur,
> 1) Our data is transformed to dataset of Vector already.
> 2) If I use RandomSignProjectLSH, the job dies after I call
> approximateSimilarJoin. I tried to use MinHash instead, but the job is still
> slow. I don't think the problem is related to the GC. The time for GC is
> small compared with the time for computation. Here are some screenshots of my
> job.
> Thanks
>
> 2017-02-12 8:01 GMT+07:00 Timur Shenkao :
>
> Hello,
>
> 1) Are you sure that your data is "clean"?  No unexpected missing values?
> No strings in unusual encoding? No additional or missing columns ?
> 2) How long does your job run? What about garbage collector parameters?
> Have you checked what happens with jconsole / jvisualvm ?
>
> Sincerely yours, Timur
>
> On Sat, Feb 11, 2017 at 12:52 AM, nguyen duc Tuan 
> wrote:
>
> Hi Nick,
> Because we use *RandomSignProjectionLSH*, there is only one parameter for
> LSH is the number of hashes. I try with small number of hashes (2) but the
> error is still happens. And it happens when I call similarity join. After
> transformation, the size of  dataset is about 4G.
>
> 2017-02-11 3:07 GMT+07:00 Nick Pentreath :
>
> What other params are you using for the lsh transformer?
>
> Are the issues occurring during transform or during the similarity join?
>
>
> On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan 
> wrote:
>
> hi Das,
> In general, I will apply them to larger datasets, so I want to use LSH,
> which is more scalable than the approaches you suggested. Have you
> tried LSH in Spark 2.1.0 before ? If yes, how do you set the
> parameters/configuration to make it work ?
> Thanks.
>
> 2017-02-10 19:21 GMT+07:00 Debasish Das :
>
> If it is 7m rows and 700k features (or say 1m features) brute force row
> similarity will run fine as well...check out spark-4823...you can compare
> quality with approximate variant...
> On Feb 9, 2017 2:55 AM, "nguyen duc Tuan"  wrote:
>
> Hi everyone,
> Since spark 2.1.0 introduces LSH (
> http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing),
> we want to use LSH to find approximate nearest neighbors. Basically, we
> have a dataset with about 7M rows. We want to use cosine distance to measure
> the similarity between items, so we use *RandomSignProjectionLSH* (
> https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db) instead
> of *BucketedRandomProjectionLSH*. I try to tune some configurations such
> as serialization, memory fraction, executor memory (~6G), number of
> executors ( ~20), memory overhead ..., but nothing works. I often get error
> "java.lang.OutOfMemoryError: Java heap space" while running. I know that
> this implementation was done by an engineer at Uber, but I don't know the right
> configurations to run the algorithm at scale. Do they need very big
> memory to run it?
>
> Any help would be appreciated.
> Thanks
>
>
>
>
>
>
>


Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread Nick Pentreath
What other params are you using for the lsh transformer?

Are the issues occurring during transform or during the similarity join?


On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan  wrote:

> hi Das,
> In general, I will apply them to larger datasets, so I want to use LSH,
> which is more scalable than the approaches you suggested. Have you
> tried LSH in Spark 2.1.0 before ? If yes, how do you set the
> parameters/configuration to make it work ?
> Thanks.
>
> 2017-02-10 19:21 GMT+07:00 Debasish Das :
>
> If it is 7m rows and 700k features (or say 1m features) brute force row
> similarity will run fine as well...check out spark-4823...you can compare
> quality with approximate variant...
> On Feb 9, 2017 2:55 AM, "nguyen duc Tuan"  wrote:
>
> Hi everyone,
> Since spark 2.1.0 introduces LSH (
> http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing),
> we want to use LSH to find approximate nearest neighbors. Basically, we
> have a dataset with about 7M rows. We want to use cosine distance to measure
> the similarity between items, so we use *RandomSignProjectionLSH* (
> https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db) instead
> of *BucketedRandomProjectionLSH*. I try to tune some configurations such
> as serialization, memory fraction, executor memory (~6G), number of
> executors ( ~20), memory overhead ..., but nothing works. I often get error
> "java.lang.OutOfMemoryError: Java heap space" while running. I know that
> this implementation was done by an engineer at Uber, but I don't know the right
> configurations to run the algorithm at scale. Do they need very big
> memory to run it?
>
> Any help would be appreciated.
> Thanks
>
>
>


Re: Google Summer of Code 2017 is coming

2017-02-05 Thread Nick Pentreath
I think Sean raises valid points - that the result is highly dependent on
the particular student, project and mentor involved, and that the actual
required time investment is very significant.

Having said that, it's not all bad certainly. Scikit-learn started as a
GSoC project 10 years ago!

Actually they have a pretty good model for accepting students - typically
the student must demonstrate significant prior knowledge and ability with
the project sufficient to complete the work.

The challenge I think Spark has is that folks are already strapped for capacity,
so finding mentors with time will be tricky. But if there are mentors and
the right project / student fit can be found, I think it's a good idea.


On Sat, 4 Feb 2017 at 01:22 Jacek Laskowski  wrote:

> Thanks Sean. You've again been very helpful to put the right tone to
> the matters. I stand corrected and have no interest in GSoC anymore.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Feb 3, 2017 at 11:38 PM, Sean Owen  wrote:
> > I have a contrarian opinion on GSoC from experience many years ago in
> > Mahout. Of 3 students I interacted with, 2 didn't come close to
> completing
> > the work they signed up for. I think it's mostly that students are hungry
> > for the resumé line item, and don't understand the amount of work they're
> > proposing, and ultimately have little incentive to complete their
> proposal.
> > The stipend is small.
> >
> > I can appreciate the goal of GSoC but it makes more sense for projects
> that
> > don't get as much attention, and Spark gets plenty. I would not expect
> > students to be able to be net contributors to a project like Spark. The
> time
> > they consume in hand-holding will exceed the time it would take for
> someone
> > experienced to just do the work. I would caution anyone from agreeing to
> > this for Spark unless they are willing to devote 5-10 hours per week for
> the
> > summer to helping someone learn.
> >
> > My net experience with GSoC is negative, mostly on account of the
> > participants.
> >
> > On Fri, Feb 3, 2017 at 9:56 PM Holden Karau 
> wrote:
> >>
> >> As someone who did GSoC back in University I think this could be a good
> >> idea if there is enough interest from the PMC & I'd be willing the help
> >> mentor if that is a bottleneck.
> >>
> >> On Fri, Feb 3, 2017 at 12:42 PM, Jacek Laskowski 
> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Is this something Spark considering? Would be nice to mark issues as
> >>> GSoC in JIRA and solicit feedback. What do you think?
> >>>
> >>> Pozdrawiam,
> >>> Jacek Laskowski
> >>> 
> >>> https://medium.com/@jaceklaskowski/
> >>> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> >>> Follow me at https://twitter.com/jaceklaskowski
> >>>
> >>>
> >>>
> >>> -- Forwarded message --
> >>> From: Ulrich Stärk 
> >>> Date: Fri, Feb 3, 2017 at 8:50 PM
> >>> Subject: Google Summer of Code 2017 is coming
> >>> To: ment...@community.apache.org
> >>>
> >>>
> >>> Hello PMCs (incubator Mentors, please forward this email to your
> >>> podlings),
> >>>
> >>> Google Summer of Code [1] is a program sponsored by Google allowing
> >>> students to spend their summer
> >>> working on open source software. Students will receive stipends for
> >>> developing open source software
> >>> full-time for three months. Projects will provide mentoring and
> >>> project ideas, and in return have
> >>> the chance to get new code developed and - most importantly - to
> >>> identify and bring in new committers.
> >>>
> >>> The ASF will apply as a participating organization meaning individual
> >>> projects don't have to apply
> >>> separately.
> >>>
> >>> If you want to participate with your project we ask you to do the
> >>> following things as soon as
> >>> possible but by no later than 2017-02-09:
> >>>
> >>> 1. understand what it means to be a mentor [2].
> >>>
> >>> 2. record your project ideas.
> >>>
> >>> Just create issues in JIRA, label them with gsoc2017, and they will
> >>> show up at [3]. Please be as
> >>> specific as possible when describing your idea. Include the
> >>> programming language, the tools and
> >>> skills required, but try not to scare potential students away. They
> >>> are supposed to learn what's
> >>> required before the program starts.
> >>>
> >>> Use labels, e.g. for the programming language (java, c, c++, erlang,
> >>> python, brainfuck, ...) or
> >>> technology area (cloud, xml, web, foo, bar, ...) and record them at
> [5].
> >>>
> >>> Please use the COMDEV JIRA project for recording your ideas if your
> >>> project doesn't use JIRA (e.g.
> >>> httpd, ooo). Contact d...@community.apache.org if you need assistance.
> >>>
> >>> [4] contains some additional information (will be updated for 2017
> >>> shortly).
> >>>
> >>> 3. subscribe to ment...@community.apache.org; restricted

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-01 Thread Nick Pentreath
Hi Maciej

If you're seeing a regression from 1.6 -> 2.0 *both using DataFrames* then
that seems to point to some other underlying issue as the root cause.

Even though adding checkpointing should help, we should understand why it's
different between 1.6 and 2.0?


On Thu, 2 Feb 2017 at 08:22 Liang-Chi Hsieh  wrote:

>
> Hi Maciej,
>
> FYI, the PR is at https://github.com/apache/spark/pull/16775.
>
>
> Liang-Chi Hsieh wrote
> > Hi Maciej,
> >
> > Basically the fitting algorithm in Pipeline is an iterative operation.
> > Running an iterative algorithm on a Dataset builds up RDD lineages and query
> > plans that grow fast. Without cache and checkpoint, it gets slower as
> > the iteration number increases.
> >
> > I think that is why a Pipeline with long stages takes much
> > longer to finish. As it is not uncommon to have long stages
> > in a Pipeline, we should improve this. I will submit a PR for this.
> > zero323 wrote
> >> Hi everyone,
> >>
> >> While experimenting with ML pipelines I experience a significant
> >> performance regression when switching from 1.6.x to 2.x.
> >>
> >> import org.apache.spark.ml.{Pipeline, PipelineStage}
> >> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer,
> >> VectorAssembler}
> >>
> >> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3,
> >> "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
> >> val indexers = df.columns.tail.map(c => new StringIndexer()
> >>   .setInputCol(c)
> >>   .setOutputCol(s"${c}_indexed")
> >>   .setHandleInvalid("skip"))
> >>
> >> val encoders = indexers.map(indexer => new OneHotEncoder()
> >>   .setInputCol(indexer.getOutputCol)
> >>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
> >>   .setDropLast(true))
> >>
> >> val assembler = new
> >> VectorAssembler().setInputCols(encoders.map(_.getOutputCol))
> >> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
> >>
> >> new Pipeline().setStages(stages).fit(df).transform(df).show
> >>
> >> Task execution time is comparable and executors are most of the time
> >> idle so it looks like it is a problem with the optimizer. Is it a known
> >> issue? Are there any changes I've missed, that could lead to this
> >> behavior?
> >>
> >> --
> >> Best,
> >> Maciej
> >>
> >>
> >> -
> >> To unsubscribe e-mail:
>
> >> dev-unsubscribe@.apache
>
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tp20803p20822.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[jira] [Commented] (SPARK-19422) Cache input data in algorithms

2017-02-01 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848401#comment-15848401
 ] 

Nick Pentreath commented on SPARK-19422:


Please see SPARK-18608 - the fix you propose in the PR doesn't work, in fact it 
runs into the same problem of double caching.

> Cache input data in algorithms
> --
>
> Key: SPARK-19422
> URL: https://issues.apache.org/jira/browse/SPARK-19422
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>
> Now some algorithms cache the input dataset if it is not already cached 
> ({{StorageLevel.NONE}}):
> {{FeedForwardTrainer}}, {{LogisticRegression}}, {{OneVsRest}}, {{KMeans}}, 
> {{AFTSurvivalRegression}}, {{IsotonicRegression}}, {{LinearRegression}} with 
> non-WLS solver
> It may be reasonable to cache the input for others:
> {{DecisionTreeClassifier}}, {{GBTClassifier}}, {{RandomForestClassifier}}, 
> {{LinearSVC}}
> {{BisectingKMeans}}, {{GaussianMixture}}, {{LDA}}
> {{DecisionTreeRegressor}}, {{GBTRegressor}}, {{GeneralizedLinearRegression}} 
> with IRLS solver, {{RandomForestRegressor}}
> {{NaiveBayes}} is not included since it only make one pass on the data.
> {{MultilayerPerceptronClassifier}} is not included since the data is cached 
> in {{FeedForwardTrainer.train}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-01 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848108#comment-15848108
 ] 

Nick Pentreath edited comment on SPARK-19208 at 2/1/17 8:09 AM:


Another option would be an "Estimator" like API, where the UDAF is purely 
internal to the API and not exposed to users, e.g.

{code}
val summarizer = new VectorSummarizer().setMetrics("min", 
"max").setInputCol("features").setWeightCol("weight")
val summary = summarizer.fit(df)

(or summarizer.evaluate, summarizer.summarize, etc?)

// this would need to throw exceptions (or perhaps return empty vectors) if the 
metric was not set
val min: Vector = summary.getMin

// OR DataFrame-based result:

val min: Vector = summary.select("min").as[Vector]
{code}

Agree it is important (and the point of this issue) to (a) only compute 
required metrics; and (b) not duplicate computation for efficiency.


was (Author: mlnick):
Another option would be an "Estimator" like API, where the UDAF is purely 
internal to the API and not exposed to users, e.g.

{code}
val summarizer = new VectorSummarizer().setMetrics("min", 
"max").setInputCol("features").setWeightCol("weight")
val summary = summarizer.fit(df)

(or summarizer.evaluate, summarizer.summarize, etc?)

// this would need to throw exceptions (or perhaps return empty vectors) if the 
metric was not set
val min: Vector = summary.getMin

// OR DataFrame-based result:

val min: Vector = summary.select("min").as[Vector]
{code}

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-01 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848108#comment-15848108
 ] 

Nick Pentreath commented on SPARK-19208:


Another option would be an "Estimator" like API, where the UDAF is purely 
internal to the API and not exposed to users, e.g.

{code}
val summarizer = new VectorSummarizer().setMetrics("min", 
"max").setInputCol("features").setWeightCol("weight")
val summary = summarizer.fit(df)

(or summarizer.evaluate, summarizer.summarize, etc?)

// this would need to throw exceptions (or perhaps return empty vectors) if the 
metric was not set
val min: Vector = summary.getMin

// OR DataFrame-based result:

val min: Vector = summary.select("min").as[Vector]
{code}

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-01-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15834336#comment-15834336
 ] 

Nick Pentreath commented on SPARK-18392:


[~yunn] I was wondering if you will be working on the additions to LSH: (i) 
BitSampling & SignRandomProjection and (ii) the AND-amplification. If so, what 
is the timeframe? Shall we look at some of it to target Spark 2.2?

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]
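
(Illustrative aside, using the terminology quoted above: a candidate pair under 
{{H}} is one where at least one of the L compound hash values matches exactly - 
OR over the tables, AND within a table. A non-Spark sketch, not any proposed API:)

{code}
import org.apache.spark.ml.linalg.Vector

// ha and hb are hash table values H(a), H(b): Array of L compound hash vectors,
// each of length K. Collision = exists one table where all K values agree.
def collides(ha: Array[Vector], hb: Array[Vector]): Boolean =
  ha.zip(hb).exists { case (ga, gb) =>        // OR-amplification over L tables
    ga.toArray.sameElements(gb.toArray)       // AND-amplification over K functions
  }
{code}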



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18704) CrossValidator should preserve more tuning statistics

2017-01-23 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-18704:
---
Shepherd: Nick Pentreath

> CrossValidator should preserve more tuning statistics
> -
>
> Key: SPARK-18704
> URL: https://issues.apache.org/jira/browse/SPARK-18704
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Currently CrossValidator will train (k-fold * paramMaps) different models 
> during the training process, yet it only passes the average metrics to 
> CrossValidatorModel. As a result, important information such as the variance 
> across folds for the same paramMap cannot be retrieved, and users cannot be 
> sure whether the chosen k is appropriate. Since the CrossValidator is 
> relatively expensive, we probably want to get the most from the tuning process.
> Just want to see if this sounds good. In my opinion, this can be done either 
> by passing a metrics matrix to the CrossValidatorModel, or by introducing a 
> CrossValidatorSummary. I would vote for introducing a TuningSummary class, 
> which can also be used by TrainValidationSplit. In the summary we can present 
> better statistics for the tuning process, for example as a DataFrame:
> +---------------+------------+--------+-----------------+
> |elasticNetParam|fitIntercept|regParam|metrics          |
> +---------------+------------+--------+-----------------+
> |0.0            |true        |0.1     |9.747795248932505|
> |0.0            |true        |0.01    |9.751942357398603|
> |0.0            |false       |0.1     |9.71727627087487 |
> |0.0            |false       |0.01    |9.721149803723822|
> |0.5            |true        |0.1     |9.719358515436005|
> |0.5            |true        |0.01    |9.748121645368501|
> |0.5            |false       |0.1     |9.687771328829479|
> |0.5            |false       |0.01    |9.717304811419261|
> |1.0            |true        |0.1     |9.696769467196487|
> |1.0            |true        |0.01    |9.744325276259957|
> |1.0            |false       |0.1     |9.665822167122172|
> |1.0            |false       |0.01    |9.713484065511892|
> +---------------+------------+--------+-----------------+
> Using the DataFrame, users can better understand the effect of different 
> parameters.
> Another thing we should improve is to include the paramMaps in the 
> CrossValidatorModel (or TrainValidationSplitModel) to allow meaningful 
> serialization. Keeping only the metrics without ParamMaps does not really 
> help model reuse.
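
For reference, something close to the table above can already be assembled by 
hand from a fitted model; a rough sketch, assuming a fitted {{cvModel}} and a 
hypothetical estimator {{lr}} whose params form the grid:

{code}
// Sketch only: tabulate avgMetrics against the param grid by hand.
// `cvModel` and `lr` are assumed, and column names mirror the table above.
val rows = cvModel.getEstimatorParamMaps.zip(cvModel.avgMetrics).map {
  case (pm, metric) =>
    (pm(lr.elasticNetParam), pm(lr.fitIntercept), pm(lr.regParam), metric)
}
val summaryDF = spark.createDataFrame(rows.toSeq)
  .toDF("elasticNetParam", "fitIntercept", "regParam", "metrics")
summaryDF.show(truncate = false)
{code}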



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19071) Optimizations for ML Pipeline Tuning

2017-01-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15834232#comment-15834232
 ] 

Nick Pentreath commented on SPARK-19071:


Thanks [~bryanc] for the design and working on the PoCs. I believe many Spark 
use cases will involve fairly large amounts of data, but not necessarily huge 
models, making parallel model evaluation very useful. Agree it makes sense to 
start with a fairly naive approach, try to make it transparent to users, and if 
necessary expose a simple param to control parallelism. Selecting a default 
will indeed be tricky, but perhaps it's best to set it to {{1}} and let users 
decide.

We will need to consider the caching element. Currently there are existing 
issues with individual models deciding when to cache (see SPARK-18608). I think 
we need to also think about how this fits in with that issue, and try to come 
up with a more unified approach on what gets cached by which component, and how 
that is decided. It may require adding an argument {{handlePersistence}} to 
{{fit}} (probably private). Default will be {{true}} but for pipelines it could 
be {{false}} to allow the pipeline or cross-validator to make the decisions. 

Let's focus on step 1 for the Spark 2.2 timeframe. Could we create some linked 
JIRAs to represent the tasks (you can link them as being contained by this 
umbrella JIRA). I should have cycles to shepherd this.
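
As a rough illustration of what the parallelism knob could look like (purely a 
sketch - {{est}}, {{eval}}, {{paramMaps}}, {{train}} and {{validation}} are 
hypothetical, and this ignores the caching question above):

{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

val parallelism = 4  // the user-facing knob; default could be 1
implicit val ec = ExecutionContext.fromExecutorService(
  Executors.newFixedThreadPool(parallelism))

// Fit and evaluate each param map concurrently against shared training data.
val metricFutures = paramMaps.map { pm =>
  Future {
    val model = est.fit(train, pm)
    eval.evaluate(model.transform(validation, pm))
  }
}
val metrics = metricFutures.map(Await.result(_, Duration.Inf))
{code}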

> Optimizations for ML Pipeline Tuning
> 
>
> Key: SPARK-19071
> URL: https://issues.apache.org/jira/browse/SPARK-19071
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Bryan Cutler
>
> This is a parent task to plan the addition of optimizations in ML tuning for 
> parallel model evaluation and more efficiency with pipelines.  They will 
> benefit Crossvalidator and TrainValidationSplit when performing a parameter 
> grid search.  The proposal can be broken into 3 steps in order of simplicity:
> 1. Add ability to evaluate models in parallel.
> 2. Optimize param grid for pipelines, as described in SPARK-5844
> 3. Add parallel model evaluation to the optimized pipelines from step 2
> See the linked design document for details on the proposed implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-01-17 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826186#comment-15826186
 ] 

Nick Pentreath commented on SPARK-14409:


Yes to be more clear, I would expect that the {{k}} param would be specified as 
in Danilo's version, for example. I do like the use of windowing to achieve the 
sort within each user.

This approach would also not work well with purely implicit data (unweighted). 
If everything is relevant in the ground truth then the model would score 
perfectly each time. It sort of works for the explicit rating case or the 
implicit case with "preference weights" since the ground truth then has an 
inherent ordering. 

Still I think the evaluator must be able to deal with the case of generating 
recommendations from the full item set. This means that the "label" and 
"prediction" columns could contains nulls.
e.g. where an item exists in the ground truth but is not recommended (hence no 
score), the "prediction" column would be null. While if an item is recommended 
but is not in ground truth, the "label" column would be null. See my comments 
in SPARK-13857 for details.
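
A minimal sketch of the null behaviour, assuming hypothetical {{recs}} (top-k 
recommendations) and {{truth}} (ground truth) DataFrames keyed by user and item:

{code}
// Outer join of top-k recommendations with the ground truth.
// Recommended but not in ground truth  -> "label" is null.
// In ground truth but not recommended  -> "prediction" is null.
val joined = recs.select("user", "item", "prediction")
  .join(truth.select("user", "item", "label"), Seq("user", "item"), "outer")
{code}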

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>      Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-17 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826096#comment-15826096
 ] 

Nick Pentreath commented on SPARK-19208:


If we're going to look at performance optimization here, does it make sense to 
see if the aggregations can be done via native Spark SQL aggregations?

Alternatively, since less code duplication is the better option, amending the 
summarizer to allow computing only a selected subset of metrics would be my 
vote. However, that can complicate the public API of colStats and the 
summarizer, so we should look at options to keep the changes private.
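
As a sketch of the "native Spark SQL aggregations" idea (illustrative only - it 
assumes a DataFrame {{df}} with a vector column {{features}}, and converting to 
arrays densifies sparse vectors, so this is not a drop-in replacement):

{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions._

// Explode each vector into (index, value) rows and let SQL compute only min/max.
val toArray = udf { v: Vector => v.toArray }
val minMax = df
  .select(posexplode(toArray(col("features"))).as(Seq("idx", "value")))
  .groupBy("idx")
  .agg(min("value").as("min"), max("value").as("max"))
{code}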

> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-01-17 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826078#comment-15826078
 ] 

Nick Pentreath commented on SPARK-14409:


[~danilo.ascione] [~roberto.mirizzi] thanks for the code examples. Both seem 
reasonable and I like the DataFrame-based solutions here. The ideal solution 
would likely take a few elements from each design.

One aspect that concerns me is how are you generating recommendations from ALS? 
It appears that you will be using the current output of {{ALS.transform}}. So 
you're computing a ranking metric in a scenario where you only recommend the 
subset of user-item combinations that occur in the evaluation data set. So it 
is sort of like a "re-ranking" evaluation metric in a sense. I'd expect the 
ranking metric here to quite dramatically overestimate true performance, since 
in the real world you would generate recommendations from the complete set of 
available items.

cc [~srowen] thoughts?

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: ML PIC

2017-01-16 Thread Nick Pentreath
The JIRA for this is here: https://issues.apache.org/jira/browse/SPARK-15784

There is a PR open already for it, which still needs to be reviewed.



On Wed, 21 Dec 2016 at 18:01 Robert Hamilton 
wrote:

> Thank you Nick that is good to know.
>
> Would this have some opportunity for newbs (like me) to volunteer some
> time?
>
> Sent from my iPhone
>
> On Dec 21, 2016, at 9:08 AM, Nick Pentreath 
> wrote:
>
> It is part of the general feature parity roadmap. I can't recall offhand
> any blocker reasons; it's just resources.
> On Wed, 21 Dec 2016 at 17:05, Robert Hamilton <
> robert_b_hamil...@icloud.com> wrote:
>
> Hi all.  Is it on the roadmap to have an
> Spark.ml.clustering.PowerIterationClustering? Are there technical reasons
> that there is currently only an .mllib version?
>
>
> Sent from my iPhone
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2017-01-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823701#comment-15823701
 ] 

Nick Pentreath commented on SPARK-19217:


I don't understand why 
bq. You can't save these DataFrames to storage without converting the vector 
columns to array columns

Which storage exactly? Because using standard DF writers (such as parquet) 
works:

{code}
df = spark.createDataFrame([(1, Vectors.dense(1, 2)), (2, Vectors.dense(3, 4)), 
(3, Vectors.dense(5, 6))], ["id", "vector"])
df.write.parquet("/tmp/spark/vecs")
df2 = spark.read.parquet("/tmp/spark/vecs/")
{code}
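
(For completeness, the UDF workaround mentioned in the description is a 
one-liner; a hedged Scala sketch against the same {{df}} as above:)

{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Convert the vector column to an array column via a UDF.
val vecToArray = udf { v: Vector => v.toArray }
val dfWithArray = df.withColumn("vector_arr", vecToArray(col("vector")))
{code}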


> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage without converting the vector columns 
> to array columns, and there doesn't appear to an easy way to make that 
> conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Nick Pentreath
Yup - it's because almost all model data in Spark ML (model coefficients)
is "small" - i.e. non-distributed.

If you look at ALS you'll see there is no repartitioning, since the factor
DataFrames can be large.
On Fri, 13 Jan 2017 at 19:42, Sean Owen  wrote:

> You're referring to code that serializes models, which are quite small.
> For example a PCA model consists of a few principal component vector. It's
> a Dataset of just one element being saved here. It's re-using the code path
> normally used to save big data sets, to output 1 file with 1 thing as
> Parquet.
>
> On Fri, Jan 13, 2017 at 5:29 PM Asher Krim  wrote:
>
> But why is that beneficial? The data is supposedly quite large,
> distributing it across many partitions/files would seem to make sense.
>
> On Fri, Jan 13, 2017 at 12:25 PM, Sean Owen  wrote:
>
> That is usually so the result comes out in one file, not partitioned over
> n files.
>
> On Fri, Jan 13, 2017 at 5:23 PM Asher Krim  wrote:
>
> Hi,
>
> I'm curious why it's common for data to be repartitioned to 1 partition
> when saving ml models:
>
> sqlContext.createDataFrame(Seq(data)).repartition(1
> ).write.parquet(dataPath)
>
> This shows up in most ml models I've seen (Word2Vec
> ,
> PCA
> ,
> LDA
> ).
> Am I missing some benefit of repartitioning like this?
>
> Thanks,
> --
> Asher Krim
> Senior Software Engineer
>
>
>
>
> --
> Asher Krim
> Senior Software Engineer
>
>


[jira] [Issue Comment Deleted] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-01-12 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-13857:
---
Comment: was deleted

(was: My view is in practice brute-force is never going to be efficient enough 
for any non-trivial size problem. So I'd definitely like to incorporate the ANN 
stuff into the top-k recommendation here. Once SignRandomProjection is in I'll 
take a deeper look at item-item / user-user sim for DF-API.

We could also add a form of LSH appropriate for dot product space for user-item 
recs.)

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-01-12 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15820747#comment-15820747
 ] 

Nick Pentreath commented on SPARK-13857:


My view is in practice brute-force is never going to be efficient enough for 
any non-trivial size problem. So I'd definitely like to incorporate the ANN 
stuff into the top-k recommendation here. Once SignRandomProjection is in I'll 
take a deeper look at item-item / user-user sim for DF-API.

We could also add a form of LSH appropriate for dot product space for user-item 
recs.
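
For context, the brute-force baseline here means scoring every user-item pair 
and keeping the top k per user; a rough sketch (hypothetical {{userFactors}} / 
{{itemFactors}} DataFrames with {{(id, features)}} as produced by an ALS model):

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Score every (user, item) pair, then keep the top k per user.
// This is O(users * items) work, which is why ANN/LSH is attractive.
val dot = udf { (u: Seq[Float], i: Seq[Float]) =>
  u.iterator.zip(i.iterator).map { case (a, b) => a * b }.sum
}
val k = 10
val items = itemFactors
  .withColumnRenamed("id", "itemId")
  .withColumnRenamed("features", "itemFeatures")
val topK = userFactors.crossJoin(items)
  .withColumn("score", dot(col("features"), col("itemFeatures")))
  .withColumn("rank",
    row_number().over(Window.partitionBy("id").orderBy(desc("score"))))
  .filter(col("rank") <= k)
{code}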

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-01-12 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15820746#comment-15820746
 ] 

Nick Pentreath commented on SPARK-13857:


My view is in practice brute-force is never going to be efficient enough for 
any non-trivial size problem. So I'd definitely like to incorporate the ANN 
stuff into the top-k recommendation here. Once SignRandomProjection is in I'll 
take a deeper look at item-item / user-user sim for DF-API.

We could also add a form of LSH appropriate for dot product space for user-item 
recs.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: ML PIC

2016-12-21 Thread Nick Pentreath
It is part of the general feature parity roadmap. I can't recall offhand
any blocker reasons; it's just resources.
On Wed, 21 Dec 2016 at 17:05, Robert Hamilton 
wrote:

> Hi all.  Is it on the roadmap to have an
> Spark.ml.clustering.PowerIterationClustering? Are there technical reasons
> that there is currently only an .mllib version?
>
>
> Sent from my iPhone
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Issue in using DenseVector in RowMatrix, error could be due to ml and mllib package changes

2016-12-08 Thread Nick Pentreath
Yes, most likely it's because HashingTF returns ml vectors while you need
mllib vectors for RowMatrix.

I'd recommend using the vector conversion utils (I think in
mllib.linalg.Vectors, but I'm on mobile right now so can't recall exactly).
There are utility methods for converting single vectors as well as vector
rows of a DataFrame. Check the mllib user guide for 2.0 for details.
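
From memory (please double-check against the docs), it would look roughly like
this for your example below:

import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.util.MLUtils

// Convert the ml vector column back to mllib vectors, then build the RowMatrix.
// (For a single vector there is also org.apache.spark.mllib.linalg.Vectors.fromML.)
val converted = MLUtils.convertVectorColumnsFromML(rescaledData, "features")
val rows = converted.select("features").rdd
  .map(_.getAs[org.apache.spark.mllib.linalg.Vector]("features"))
val mat = new RowMatrix(rows)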
On Fri, 9 Dec 2016 at 04:42, satyajit vegesna 
wrote:

> Hi All,
>
> PFB code.
>
>
> import org.apache.spark.ml.feature.{HashingTF, IDF}
> import org.apache.spark.ml.linalg.SparseVector
> import org.apache.spark.mllib.linalg.distributed.RowMatrix
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.{SparkConf, SparkContext}
>
> /**
>   * Created by satyajit on 12/7/16.
>   */
> object DIMSUMusingtf extends App {
>
>   val conf = new SparkConf()
> .setMaster("local[1]")
> .setAppName("testColsim")
>   val sc = new SparkContext(conf)
>   val spark = SparkSession
> .builder
> .appName("testColSim").getOrCreate()
>
>   import org.apache.spark.ml.feature.Tokenizer
>
>   val sentenceData = spark.createDataFrame(Seq(
> (0, "Hi I heard about Spark"),
> (0, "I wish Java could use case classes"),
> (1, "Logistic regression models are neat")
>   )).toDF("label", "sentence")
>
>   val tokenizer = new 
> Tokenizer().setInputCol("sentence").setOutputCol("words")
>
>   val wordsData = tokenizer.transform(sentenceData)
>
>
>   val hashingTF = new HashingTF()
> .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
>
>   val featurizedData = hashingTF.transform(wordsData)
>
>
>   val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
>   val idfModel = idf.fit(featurizedData)
>   val rescaledData = idfModel.transform(featurizedData)
>   rescaledData.show()
>   rescaledData.select("features", "label").take(3).foreach(println)
>   val check = rescaledData.select("features")
>
>   val row = check.rdd.map(row => row.getAs[SparseVector]("features"))
>
>   val mat = new RowMatrix(row) //i am basically trying to use Dense.vector as 
> a direct input to
>
> rowMatrix, but i get an error that RowMatrix Cannot resolve constructor
>
>   row.foreach(println)
> }
>
> Any help would be appreciated.
>
> Regards,
> Satyajit.
>
>
>
>


Re: 2.1.0-rc2 cut; committers please set fix version for branch-2.1 to 2.1.1 instead

2016-12-07 Thread Nick Pentreath
I went ahead and re-marked all the existing 2.1.1 fix version JIRAs (that
had gone into branch-2.1 since RC1 but before RC2) for Spark ML to 2.1.0

On Thu, 8 Dec 2016 at 09:20 Reynold Xin  wrote:

> Thanks.
>


[jira] [Commented] (SPARK-18633) Add multiclass logistic regression summary python example and document

2016-12-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15731424#comment-15731424
 ] 

Nick Pentreath commented on SPARK-18633:


Went ahead and re-marked the fix version to {{2.1.0}} since RC2 has been cut.

> Add multiclass logistic regression summary python example and document
> --
>
> Key: SPARK-18633
> URL: https://issues.apache.org/jira/browse/SPARK-18633
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Miao Wang
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Logistic Regression summary is added in Python API. We need to add example 
> and document for summary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-12-07 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-18081:
---
Fix Version/s: (was: 2.2.0)

> Locality Sensitive Hashing (LSH) User Guide
> ---
>
> Key: SPARK-18081
> URL: https://issues.apache.org/jira/browse/SPARK-18081
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18633) Add multiclass logistic regression summary python example and document

2016-12-07 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-18633:
---
Fix Version/s: (was: 2.1.1)
   (was: 2.2.0)
   2.1.0

> Add multiclass logistic regression summary python example and document
> --
>
> Key: SPARK-18633
> URL: https://issues.apache.org/jira/browse/SPARK-18633
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Miao Wang
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Logistic Regression summary is added in Python API. We need to add example 
> and document for summary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-12-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15731419#comment-15731419
 ] 

Nick Pentreath commented on SPARK-18081:


Went ahead and re-marked fix version to {{2.1.0}} since RC2 has been cut.

> Locality Sensitive Hashing (LSH) User Guide
> ---
>
> Key: SPARK-18081
> URL: https://issues.apache.org/jira/browse/SPARK-18081
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
> Fix For: 2.1.0, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15819) Add KMeanSummary in KMeans of PySpark

2016-12-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15731421#comment-15731421
 ] 

Nick Pentreath commented on SPARK-15819:


Went ahead and re-marked fix version to {{2.1.0}} since RC2 has been cut.

> Add KMeanSummary in KMeans of PySpark
> -
>
> Key: SPARK-15819
> URL: https://issues.apache.org/jira/browse/SPARK-15819
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 2.1.0
>
>
> There's no corresponding python api for KMeansSummary, it would be nice to 
> have it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18274) Memory leak in PySpark StringIndexer

2016-12-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15731418#comment-15731418
 ] 

Nick Pentreath commented on SPARK-18274:


Went ahead and re-marked fix version to {{2.1.0}} since RC2 has been cut.

> Memory leak in PySpark StringIndexer
> 
>
> Key: SPARK-18274
> URL: https://issues.apache.org/jira/browse/SPARK-18274
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.5.2, 1.6.3, 2.0.1, 2.0.2, 2.1.0
>Reporter: Jonas Amrich
>Assignee: Sandeep Singh
>Priority: Critical
> Fix For: 2.0.3, 2.1.0, 2.2.0
>
>
> StringIndexerModel won't get collected by GC in Java even when deleted in 
> Python. It can be reproduced by this code, which fails after couple of 
> iterations (around 7 if you set driver memory to 600MB): 
> {code}
> import random, string
> from pyspark.ml.feature import StringIndexer
> l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) 
> for _ in range(int(7e5))]  # 70 random strings of 10 characters
> df = spark.createDataFrame(l, ['string'])
> for i in range(50):
> indexer = StringIndexer(inputCol='string', outputCol='index')
> indexer.fit(df)
> {code}
> Explicit call to Python GC fixes the issue - following code runs fine:
> {code}
> for i in range(50):
> indexer = StringIndexer(inputCol='string', outputCol='index')
> indexer.fit(df)
> gc.collect()
> {code}
> The issue is similar to SPARK-6194 and can be probably fixed by calling jvm 
> detach in model's destructor. This is implemented in 
> pyspark.mlib.common.JavaModelWrapper but missing in 
> pyspark.ml.wrapper.JavaWrapper. Other models in ml package may also be 
> affected by this memory leak. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


