Re: [VOTE] Spark 2.3.0 (RC3)
In addition to the issues mentioned above, Wenchen and Xiao have flagged two other regressions (https://issues.apache.org/jira/browse/SPARK-23316 and https://issues.apache.org/jira/browse/SPARK-23388) that were merged after RC3 was cut. Due to these, this vote fails. I'll follow up with an RC4 in a day (this will probably also give us enough time to resolve https://issues.apache.org/jira/browse/SPARK-23381 and https://issues.apache.org/jira/browse/SPARK-23410).

On 15 February 2018 at 17:22, mrkm4ntr wrote:
> I agree that this is not a blocker against RC3. It was not appropriate to
> raise it as part of the RC3 vote. There is no problem if it is in time for
> release 2.3.0.
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: [VOTE] Spark 2.3.0 (RC3)
I agree that this is not a blocker against RC3. It was not appropriate to raise it as part of the RC3 vote. There is no problem if it is in time for release 2.3.0.

--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
Re: [VOTE] Spark 2.3.0 (RC3)
I agree that SPARK-23413 should be considered a blocker. It isn't unreasonable to run a history server that is used for several versions of Spark.

On Thu, Feb 15, 2018 at 7:49 AM, Sean Owen wrote:
> SPARK-23381 is probably not a blocker IMHO; it's a nice-to-have to make
> some returned values match an external implementation, for code that hasn't
> been published yet.
>
> However, I think it's OK to add to the 2.3.0 release if there's going to be
> another RC.
>
> On Wed, Feb 14, 2018 at 10:49 PM Holden Karau wrote:
>> So it's currently tagged as minor and under consideration for 2.4.0. Do
>> you think this priority is incorrect? This doesn't seem like a regression
>> or a correctness issue, so normally we wouldn't hold the release. Of course
>> you're free to vote how you choose; just providing some additional context
>> around how we tend to do releases.
>>
>> On Feb 14, 2018 11:03 PM, "mrkm4ntr" wrote:
>>
>> I'm -1 because of this issue.
>> I want to fix the hashing implementation in FeatureHasher before
>> FeatureHasher is released in 2.3.0.
>>
>> https://issues.apache.org/jira/browse/SPARK-23381
>> https://github.com/apache/spark/pull/20568
>>
>> I will fix it soon.
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

--
Ryan Blue
Software Engineer
Netflix
Re: [MLlib] Gaussian Process regression in MLlib
Hi all,

I've created a new JIRA: https://issues.apache.org/jira/browse/SPARK-23437
All concerned are welcome to discuss.

Best,
Valeriy.

On Sat, Feb 3, 2018 at 9:24 PM, Valeriy Avanesov wrote:
> Hi,
>
> no, I don't think we should actually compute the n \times n matrix, let
> alone invert it. However, variational inference is only one of the many
> sparse GP approaches. Another option could be the Bayesian Committee.
>
> Best,
>
> Valeriy.
>
> On 02/02/2018 09:43 PM, Simon Dirmeier wrote:
>> Hey,
>>
>> I wanted to see that for a long time, too. :) If you'd plan on
>> implementing this, I could contribute.
>> However, I am not too familiar with variational inference for GPs,
>> which is what you would need, I guess.
>> Or do you think it is feasible to compute the full kernel for the GP?
>>
>> Cheers,
>> S
>>
>> Am 01.02.18 um 20:01 schrieb Valeriy Avanesov:
>>> Hi all,
>>>
>>> it came to my surprise that there is no implementation of Gaussian
>>> Process regression in Spark MLlib. The approach is widely known,
>>> employed, and scalable (in its sparse versions). Is there a good reason
>>> for that? Has it been discussed before?
>>>
>>> If there is a need for this approach to be part of MLlib, I am eager to
>>> contribute.
>>>
>>> Best,
>>>
>>> Valeriy.
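[Editor's note] For readers following the scalability point in the thread above: exact GP regression builds the full n x n kernel matrix and solves a linear system against it, which costs O(n^2) memory and O(n^3) time — the bottleneck that motivates sparse approximations. A minimal pure-Python sketch of the exact predictive mean (a toy illustration, not a proposed MLlib API; the RBF kernel and the Gauss-Jordan solve stand in for the usual Cholesky factorization):

```python
import math

def rbf(x1, x2, length_scale=1.0):
    """Squared-exponential (RBF) kernel on scalars."""
    return math.exp(-0.5 * (x1 - x2) ** 2 / length_scale ** 2)

def gp_predict(xs, ys, x_star, noise=1e-6):
    """Exact GP regression predictive mean at x_star."""
    n = len(xs)
    # Build the n x n kernel matrix -- this is the scalability bottleneck
    # the thread discusses: O(n^2) memory, O(n^3) time for the solve below.
    K = [[rbf(xs[i], xs[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # Solve K * alpha = y by Gauss-Jordan elimination with partial pivoting
    # (a stand-in for the Cholesky solve a real implementation would use).
    A = [K[i][:] + [ys[i]] for i in range(n)]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[p] = A[p], A[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    alpha = [A[i][n] / A[i][i] for i in range(n)]
    # Predictive mean: k_star^T alpha
    return sum(rbf(x_star, xs[i]) * alpha[i] for i in range(n))
```

With a near-zero noise term, the predictive mean interpolates the training targets; sparse GP variants (variational inducing points, Bayesian Committee, etc.) avoid ever materializing K in full.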
Re: [VOTE] Spark 2.3.0 (RC3)
SPARK-23381 is probably not a blocker IMHO; it's a nice-to-have to make some returned values match an external implementation, for code that hasn't been published yet.

However, I think it's OK to add to the 2.3.0 release if there's going to be another RC.

On Wed, Feb 14, 2018 at 10:49 PM Holden Karau wrote:
> So it's currently tagged as minor and under consideration for 2.4.0. Do
> you think this priority is incorrect? This doesn't seem like a regression
> or a correctness issue, so normally we wouldn't hold the release. Of course
> you're free to vote how you choose; just providing some additional context
> around how we tend to do releases.
>
> On Feb 14, 2018 11:03 PM, "mrkm4ntr" wrote:
>
> I'm -1 because of this issue.
> I want to fix the hashing implementation in FeatureHasher before
> FeatureHasher is released in 2.3.0.
>
> https://issues.apache.org/jira/browse/SPARK-23381
> https://github.com/apache/spark/pull/20568
>
> I will fix it soon.
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
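[Editor's note] For context on why the hash implementation matters here: FeatureHasher maps feature names to vector indices with a hash function (the "hashing trick"), so the exact hash (Spark uses Murmur3) determines which bucket each feature lands in and whether results reproduce against other libraries. A minimal illustrative sketch in plain Python — it uses MD5 for a deterministic stdlib hash, so bucket assignments will NOT match Spark's:

```python
import hashlib

def hash_features(features, num_buckets=16):
    """Hashing trick: map named numeric features into a fixed-size vector.

    Illustrative only -- Spark's FeatureHasher uses Murmur3, not MD5, so
    the indices produced here differ from Spark's.
    """
    vec = [0.0] * num_buckets
    for name, value in features.items():
        # Deterministic 32-bit hash of the feature name.
        h = int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:4],
                           "little")
        idx = h % num_buckets
        vec[idx] += value  # colliding features accumulate in one bucket
    return vec

vec = hash_features({"age": 33.0, "clicks": 2.0}, num_buckets=8)
```

Because the mapping is defined entirely by the hash function, changing that function after release changes every model's feature space — which is why the thread treats fixing it before 2.3.0 as urgent.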
Re: I Want to Help with MLlib Migration
Thanks for the reply @srowen.

>> I don't think you can move or alter the class APIs.
Agreed. That's not my intention at all.

>> There also isn't much value in copying the code. Maybe there are opportunities for moving some internal code.
There will probably be some copying and moving of internal code, but this is not the main purpose. The goal is to have these algorithms implemented using the Dataset API. Currently, the implementation of these classes/algorithms uses RDDs by wrapping the old (mllib) classes, which will eventually be deprecated (and deleted).

>> But in general I think all this has to wait.
Do you have any schedule or plan in mind? If deprecation is targeted for 3.0, then we have roughly 1.5 years. On the other hand, the current situation prevents us from making improvements to the existing classes. For example, I'd like to add maxDocFreq to ml.feature.IDF to make it similar to scikit-learn, but that's hard to do because it's just a wrapper around mllib.feature.IDF.

Thank you for the discussion.
Yacine.

--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
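[Editor's note] To make the maxDocFreq idea above concrete: scikit-learn's TfidfVectorizer accepts a max_df cutoff that discards terms appearing in too many documents. A rough sketch of a document-frequency band for IDF — note that max_doc_freq is the hypothetical parameter being proposed, not an existing Spark option (Spark's ml.feature.IDF exposes only minDocFreq); the smoothing follows Spark's log((n+1)/(df+1)) formula:

```python
import math

def idf_weights(docs, min_doc_freq=0, max_doc_freq=None):
    """Smoothed IDF per term, restricted to a document-frequency band.

    `max_doc_freq` is the hypothetical knob discussed in the thread;
    it drops overly common terms the way scikit-learn's max_df does.
    """
    n = len(docs)
    if max_doc_freq is None:
        max_doc_freq = n  # no upper cutoff by default
    # Document frequency: number of docs each term appears in.
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    # Spark-style smoothed IDF, keeping only terms inside the band.
    return {
        t: math.log((n + 1) / (c + 1))
        for t, c in df.items()
        if min_doc_freq <= c <= max_doc_freq
    }
```

A term occurring in every document gets weight log((n+1)/(n+1)) = 0 even without the cutoff; the max_doc_freq band removes such near-stopwords from the vocabulary entirely.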
Re: I Want to Help with MLlib Migration
I don't think you can move or alter the class APIs. There also isn't much value in copying the code. Maybe there are opportunities for moving some internal code. But in general I think all this has to wait.

On Thu, Feb 15, 2018, 5:17 AM Yacine Mazari wrote:
> Hi,
>
> I see that many classes under "org.apache.spark.ml" are still referring to
> the "org.apache.spark.mllib" implementation.
>
> While there is still time until the deprecation deadline in version 3.0,
> having these dependencies makes it impossible or difficult to make
> improvements to these classes.
>
> I see that IDF and HashingTF are being migrated in SPARK-22531 and
> SPARK-21748.
>
> 1) Should I go ahead and start with one of the following: BisectingKMeans,
> KMeans, LDA, PCA, Word2Vec, FPGrowth? I don't see any ongoing work on them.
>
> 2) I suggest creating an umbrella ticket to track this migration. (I
> haven't seen any.)
>
> What do you think?
>
> Best Regards,
> Yacine.
Re: [VOTE] Spark 2.3.0 (RC3)
Since it seems there are other issues to fix, I raised SPARK-23413 to blocker status to avoid having to change the disk format of history data in a minor release.

On Wed, Feb 14, 2018 at 11:06 PM, Nick Pentreath wrote:
> -1 for me, as we elevated https://issues.apache.org/jira/browse/SPARK-23377
> to a Blocker. It should be fixed before release.
>
> On Thu, 15 Feb 2018 at 07:25 Holden Karau wrote:
>> If this is a blocker in your view, then the vote thread is an important
>> place to mention it. I'm not super sure of all the places these methods
>> are used, so I'll defer to srowen and folks, but for the ML-related
>> implications: in the past we've allowed people to set the hashing function
>> when we've introduced changes.
>>
>> On Feb 15, 2018 2:08 PM, "mrkm4ntr" wrote:
>>> I was advised to post here in the discussion at GitHub. I do not know
>>> what to do about the problem of the discussion being dispersed in two
>>> places.
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

--
Marcelo
I Want to Help with MLlib Migration
Hi,

I see that many classes under "org.apache.spark.ml" are still referring to the "org.apache.spark.mllib" implementation.

While there is still time until the deprecation deadline in version 3.0, having these dependencies makes it impossible or difficult to make improvements to these classes.

I see that IDF and HashingTF are being migrated in SPARK-22531 and SPARK-21748.

1) Should I go ahead and start with one of the following: BisectingKMeans, KMeans, LDA, PCA, Word2Vec, FPGrowth? I don't see any ongoing work on them.

2) I suggest creating an umbrella ticket to track this migration. (I haven't seen any.)

What do you think?

Best Regards,
Yacine.

--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/