Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-15 Thread Sameer Agarwal
In addition to the issues mentioned above, Wenchen and Xiao have flagged
two other regressions (https://issues.apache.org/jira/browse/SPARK-23316
and https://issues.apache.org/jira/browse/SPARK-23388) that were merged
after RC3 was cut.

Due to these, this vote fails. I'll follow up with an RC4 in a day (this
will probably also give us enough time to resolve
https://issues.apache.org/jira/browse/SPARK-23381 and
https://issues.apache.org/jira/browse/SPARK-23410).


On 15 February 2018 at 17:22, mrkm4ntr  wrote:

> I agree that this is not a blocker for RC3; it was not appropriate to raise
> it as a -1 vote on RC3.
> There is no problem as long as the fix makes it into the 2.3.0 release.
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-15 Thread mrkm4ntr
I agree that this is not a blocker for RC3; it was not appropriate to raise
it as a -1 vote on RC3.
There is no problem as long as the fix makes it into the 2.3.0 release.







Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-15 Thread Ryan Blue
I agree that SPARK-23413 should be considered a blocker. It isn't
unreasonable to run a history server that is used for several versions of
Spark.

On Thu, Feb 15, 2018 at 7:49 AM, Sean Owen  wrote:

> SPARK-23381 is probably not a blocker IMHO; it's a nice-to-have to make
> some returned values match an external implementation, for code that hasn't
> been published yet.
>
> However I think it's OK to add to the 2.3.0 release if there's going to be
> another RC.
>
>
> On Wed, Feb 14, 2018 at 10:49 PM Holden Karau 
> wrote:
>
>> So it's currently tagged as minor and under consideration for 2.4.0. Do
>> you think this priority is incorrect? This doesn't seem like a regression
>> or a correctness issue, so normally we wouldn't hold the release. Of
>> course you're free to vote how you choose; I'm just providing some
>> additional context around how we tend to do releases.
>>
>>
>> On Feb 14, 2018 11:03 PM, "mrkm4ntr"  wrote:
>>
>> I'm -1 because of this issue.
>> I want to fix the hashing implementation in FeatureHasher before
>> FeatureHasher is released in 2.3.0.
>>
>> https://issues.apache.org/jira/browse/SPARK-23381
>> https://github.com/apache/spark/pull/20568
>>
>> I will fix it soon.
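[For readers following along: the hashing trick that FeatureHasher implements can be sketched roughly as below. This toy Python version uses CRC32 as a deterministic stand-in for the MurmurHash3 that Spark actually uses, so bucket assignments will not match Spark's; the function name and shape are illustrative only, not Spark's API. It also shows why the choice of hash function matters for compatibility: changing it changes every bucket assignment.]

```python
import zlib

def hash_features(features, num_buckets=16):
    """Map (feature_name, value) pairs into a fixed-size vector by
    hashing each name to a bucket index (the 'hashing trick').

    CRC32 stands in for the MurmurHash3 Spark uses; swapping the hash
    function moves every feature to a different bucket, which is why
    the hash implementation must be fixed before a release.
    """
    vec = [0.0] * num_buckets
    for name, value in features:
        idx = zlib.crc32(name.encode("utf-8")) % num_buckets
        vec[idx] += value  # collisions simply add up in the same bucket
    return vec
```

Note that collisions are expected and harmless for learning, but only if every producer and consumer agrees on the same hash function and bucket count.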
>>
>>
>>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [MLlib] Gaussian Process regression in MLlib

2018-02-15 Thread Valeriy Avanesov
Hi all,

I've created a new JIRA.

https://issues.apache.org/jira/browse/SPARK-23437

All concerned are welcome to discuss.

Best,
Valeriy.

On Sat, Feb 3, 2018 at 9:24 PM, Valeriy Avanesov  wrote:

> Hi,
>
> no, I don't think we should actually compute the n \times n matrix, let
> alone invert it. However, variational inference is only one of the many
> sparse GP approaches; another option could be the Bayesian Committee.
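[As a rough illustration of the Bayesian Committee idea mentioned above, here is a minimal NumPy sketch: fit an exact GP on each data partition and combine the per-partition posteriors by summing precisions, correcting for the prior being counted once per committee member. All names are hypothetical; this is not MLlib code, just a toy version under an RBF kernel with unit prior variance.]

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    # squared-exponential kernel; prior variance at any point is 1.0
    d = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d / ls**2)

def gp_predict(Xtr, ytr, Xte, noise=1e-2):
    # exact GP posterior mean/variance on one data partition
    K = rbf(Xtr, Xtr) + noise * np.eye(len(Xtr))
    Ks = rbf(Xtr, Xte)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, ytr))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(Xte, Xte)) - (v**2).sum(0)
    return mean, var

def bcm_predict(partitions, Xte, prior_var=1.0):
    # Bayesian Committee Machine: sum member precisions, then subtract
    # the prior precision counted (M - 1) extra times
    M = len(partitions)
    prec = np.zeros(len(Xte))
    wmean = np.zeros(len(Xte))
    for Xtr, ytr in partitions:
        m, v = gp_predict(Xtr, ytr, Xte)
        prec += 1.0 / v
        wmean += m / v
    prec -= (M - 1) / prior_var
    return wmean / prec, 1.0 / prec
```

Each member only ever factorizes its own partition's kernel, so the cubic cost applies per partition rather than to the full dataset.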
>
> Best,
>
> Valeriy.
>
>
>
> On 02/02/2018 09:43 PM, Simon Dirmeier wrote:
>
>> Hey,
>>
>> I have wanted to see this for a long time, too. :) If you plan on
>> implementing it, I could contribute.
>> However, I am not too familiar with variational inference for GPs, which
>> is what you would need, I guess.
>> Or do you think it is feasible to compute the full kernel for the GP?
>>
>> Cheers,
>> S
>>
>>
>>
>> Am 01.02.18 um 20:01 schrieb Valeriy Avanesov:
>>
>>> Hi all,
>>>
>>> it came as a surprise to me that there is no implementation of Gaussian
>>> Process regression in Spark MLlib. The approach is widely known and
>>> widely employed, and its sparse versions are scalable. Is there a good
>>> reason for that? Has it been discussed before?
>>>
>>> If there is a need for this approach to be part of MLlib, I am eager to
>>> contribute.
>>>
>>> Best,
>>>
>>> Valeriy.
>>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>


Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-15 Thread Sean Owen
SPARK-23381 is probably not a blocker IMHO; it's a nice-to-have to make
some returned values match an external implementation, for code that hasn't
been published yet.

However I think it's OK to add to the 2.3.0 release if there's going to be
another RC.

On Wed, Feb 14, 2018 at 10:49 PM Holden Karau 
wrote:

> So it's currently tagged as minor and under consideration for 2.4.0. Do
> you think this priority is incorrect? This doesn't seem like a regression
> or a correctness issue, so normally we wouldn't hold the release. Of
> course you're free to vote how you choose; I'm just providing some
> additional context around how we tend to do releases.
>
>
> On Feb 14, 2018 11:03 PM, "mrkm4ntr"  wrote:
>
> I'm -1 because of this issue.
> I want to fix the hashing implementation in FeatureHasher before
> FeatureHasher is released in 2.3.0.
>
> https://issues.apache.org/jira/browse/SPARK-23381
> https://github.com/apache/spark/pull/20568
>
> I will fix it soon.
>
>
>
>
>
>


Re: I Want to Help with MLlib Migration

2018-02-15 Thread Yacine Mazari
Thanks for the reply @srowen.

>>I don't think you can move or alter the class APIs.
Agreed. That's not my intention at all.

>>There also isn't much value in copying the code. Maybe there are
opportunities for moving some internal code.
There will probably be some copying and moving of internal code, but that is
not the main purpose.
The goal is to have these algorithms implemented using the Dataset API.
Currently, these classes/algorithms are implemented as wrappers around the
old (mllib) RDD-based classes, which will eventually be deprecated (and
deleted).

>>But in general I think all this has to wait. 
Do you have any schedule or plan in mind? If deprecation is targeted for
3.0, then we roughly have 1.5 years.
On the other hand, the current situation prevents us from making
improvements to the existing classes. For example, I'd like to add
maxDocFreq to ml.feature.IDF to make it similar to scikit-learn, but that's
hard to do because it's just a wrapper around mllib.feature.IDF.
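[As a sketch of what a maxDocFreq option could look like (mirroring scikit-learn's max_df): this is plain Python, not the ml.feature.IDF API, and the name max_doc_freq is hypothetical. It uses the smoothed IDF formula log((m + 1) / (d(t) + 1)) that Spark's IDF documents.]

```python
import math
from collections import Counter

def idf_weights(docs, max_doc_freq=None):
    """Compute smoothed IDF weights per term over a corpus of token
    lists, optionally dropping terms that appear in more than
    max_doc_freq documents (analogous to scikit-learn's max_df).
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    return {
        term: math.log((n + 1) / (count + 1))
        for term, count in df.items()
        if max_doc_freq is None or count <= max_doc_freq
    }
```

A term present in every document gets weight log((n + 1) / (n + 1)) = 0, and a maxDocFreq cutoff removes such near-stopwords from the vocabulary entirely.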


Thank you for the discussion.
Yacine.






Re: I Want to Help with MLlib Migration

2018-02-15 Thread Sean Owen
I don't think you can move or alter the class APIs. There also isn't much
value in copying the code. Maybe there are opportunities for moving some
internal code. But in general I think all this has to wait.

On Thu, Feb 15, 2018, 5:17 AM Yacine Mazari  wrote:

> Hi,
>
> I see that many classes under "org.apache.spark.ml" are still referring to
> the "org.apache.spark.mllib" implementation.
>
> While there is still time until the deprecation deadline in version 3.0,
> having these dependencies makes it impossible or difficult to make
> improvements to these classes.
>
> I see that IDF and HashingTF are being migrated in SPARK-22531 and
> SPARK-21748.
>
> 1) Should I go ahead and start with one of the following: BisectingKMeans,
> KMeans, LDA, PCA, Word2Vec, FPGrowth? I don't see any ongoing work with
> them.
>
> 2) I suggest creating an umbrella ticket to track this migration. (I
> haven't
> seen any)
>
> What do you think?
>
> Best Regards,
> Yacine.
>
>
>
>
>


Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-15 Thread Marcelo Vanzin
Since it seems there are other issues to fix, I raised SPARK-23413 to
blocker status to avoid having to change the disk format of history
data in a minor release.

On Wed, Feb 14, 2018 at 11:06 PM, Nick Pentreath
 wrote:
> -1 for me as we elevated https://issues.apache.org/jira/browse/SPARK-23377
> to a Blocker. It should be fixed before release.
>
> On Thu, 15 Feb 2018 at 07:25 Holden Karau  wrote:
>>
>> If this is a blocker in your view, then the vote thread is an important
>> place to mention it. I'm not entirely sure of all the places these
>> methods are used, so I'll defer to srowen and folks, but for the
>> ML-related implications, in the past we've allowed people to set the
>> hashing function when we've introduced changes.
>>
>> On Feb 15, 2018 2:08 PM, "mrkm4ntr"  wrote:
>>>
>>> I was advised in the GitHub discussion to post here. I am not sure what
>>> to do about the problem of the discussion being dispersed across two
>>> places.
>>>
>>>
>>>
>>>
>



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



I Want to Help with MLlib Migration

2018-02-15 Thread Yacine Mazari
Hi,

I see that many classes under "org.apache.spark.ml" are still referring to
the "org.apache.spark.mllib" implementation.

While there is still time until the deprecation deadline in version 3.0,
having these dependencies makes it impossible or difficult to make
improvements to these classes.

I see that IDF and HashingTF are being migrated in SPARK-22531 and
SPARK-21748.

1) Should I go ahead and start with one of the following: BisectingKMeans,
KMeans, LDA, PCA, Word2Vec, FPGrowth? I don't see any ongoing work with
them.

2) I suggest creating an umbrella ticket to track this migration. (I haven't
seen any)

What do you think?

Best Regards,
Yacine.


