Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-01 Thread Shivaram Venkataraman
Thanks Holden -- it would be great to also get 2.4.7 started

Thanks
Shivaram

On Tue, Jun 30, 2020 at 10:31 PM Holden Karau  wrote:
>
> I can take care of 2.4.7 unless someone else wants to do it.
>
> On Tue, Jun 30, 2020 at 8:29 PM Jason Moore  
> wrote:
>>
>> Hi all,
>>
>>
>>
>> Could I get some input on the severity of this one that I found yesterday?  
>> If that’s a correctness issue, should it block this patch?  Let me know 
>> under the ticket if there’s more info that I can provide to help.
>>
>>
>>
>> https://issues.apache.org/jira/browse/SPARK-32136
>>
>>
>>
>> Thanks,
>>
>> Jason.
>>
>>
>>
>> From: Jungtaek Lim 
>> Date: Wednesday, 1 July 2020 at 10:20 am
>> To: Shivaram Venkataraman 
>> Cc: Prashant Sharma , 郑瑞峰 , 
>> Gengliang Wang , gurwls223 
>> , Dongjoon Hyun , Jules Damji 
>> , Holden Karau , Reynold Xin 
>> , Yuanjian Li , 
>> "dev@spark.apache.org" , Takeshi Yamamuro 
>> 
>> Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
>>
>>
>>
>> SPARK-32130 [1] looks to be a performance regression introduced in Spark 
>> 3.0.0, which would be ideal to look into before releasing another bugfix version.
>>
>>
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-32130
>>
>>
>>
>> On Wed, Jul 1, 2020 at 7:05 AM Shivaram Venkataraman 
>>  wrote:
>>
>> Hi all
>>
>>
>>
>> I just wanted to ping this thread to see if all the outstanding blockers for 
>> 3.0.1 have been fixed. If so, it would be great if we can get the release 
>> going. The CRAN team sent us a note that the version of SparkR available on 
>> CRAN for the current R version (4.0.2) is broken, and hence we need to update 
>> the package soon -- it will be great to do it with 3.0.1.
>>
>>
>>
>> Thanks
>>
>> Shivaram
>>
>>
>>
>> On Wed, Jun 24, 2020 at 8:31 PM Prashant Sharma  wrote:
>>
>> +1 for 3.0.1 release.
>>
>> I too can help out as release manager.
>>
>>
>>
>> On Thu, Jun 25, 2020 at 4:58 AM 郑瑞峰  wrote:
>>
>> I volunteer to be a release manager of 3.0.1, if nobody is working on this.
>>
>>
>>
>>
>>
>> -- Original Message --
>>
>> From: "Gengliang Wang";
>>
>> Sent: Wednesday, June 24, 2020, 4:15 PM
>>
>> To: "Hyukjin Kwon";
>>
>> Cc: "Dongjoon Hyun";"Jungtaek 
>> Lim";"Jules 
>> Damji";"Holden Karau";"Reynold 
>> Xin";"Shivaram 
>> Venkataraman";"Yuanjian 
>> Li";"Spark dev list";"Takeshi 
>> Yamamuro";
>>
>> Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
>>
>>
>>
>> +1, the issues mentioned are really serious.
>>
>>
>>
>> On Tue, Jun 23, 2020 at 7:56 PM Hyukjin Kwon  wrote:
>>
>> +1.
>>
>> Just as a note,
>> - SPARK-31918 is fixed now, and there's no blocker.
>> - When we build SparkR, we should use the latest R version, at least 4.0.0+.
>>
>>
>>
>> On Wed, Jun 24, 2020 at 11:20 AM, Dongjoon Hyun wrote:
>>
>> +1
>>
>>
>>
>> Bests,
>>
>> Dongjoon.
>>
>>
>>
>> On Tue, Jun 23, 2020 at 1:19 PM Jungtaek Lim  
>> wrote:
>>
>> +1 on a 3.0.1 soon.
>>
>>
>>
>> Probably it would be nice if some Scala experts can take a look at 
>> https://issues.apache.org/jira/browse/SPARK-32051 and include the fix into 
>> 3.0.1 if possible.
>>
>> It looks like APIs designed to work with Scala 2.11 & Java introduce ambiguity 
>> in Scala 2.12 & Java.
>>
>>
>>
>> On Wed, Jun 24, 2020 at 4:52 AM Jules Damji  wrote:
>>
>> +1 (non-binding)
>>
>>
>>
>> Sent from my iPhone
>>
>> Pardon the dumb thumb typos :)
>>
>>
>>
>> On Jun 23, 2020, at 11:36 AM, Holden Karau  wrote:
>>
>> +1 on a patch release soon
>>
>>
>>
>> On Tue, Jun 23, 2020 at 10:47 AM Reynold Xin  wrote:
>>
>>
>> +1 on doing a new patch release soon. I saw some of these issues when 
>> preparing the 3.0 release, and some of them are very serious.
>>
>>
>>
>>
>>
>> On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman 
>>  wrote:
>>
>> +1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1 release soon.
>>
>> Shivaram
>>
>> On Tue, Jun 23, 2020 at 3:43 AM Takeshi Yamamuro  
>> wrote:
>>
>> Thanks for the heads-up, Yuanjian!
>>
>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
>>
>> wow, the updates are so quick. Anyway, +1 for the release.
>>
>> Bests,
>> Takeshi
>>
>> On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li  wrote:
>>
>> Hi dev-list,
>>
>> I’m writing this to raise the discussion about Spark 3.0.1 feasibility since 
>> 4 blocker issues were found after Spark 3.0.0:
>>
>> [SPARK-31990] The broken state store compatibility will cause a correctness 
>> issue when a streaming query with `dropDuplicate` uses a checkpoint written 
>> by an old Spark version.
>>
>> [SPARK-32038] The regression bug in handling NaN values in COUNT(DISTINCT)
>>
>> [SPARK-31918][WIP] CRAN requires SparkR to work with the latest R 4.0. This 
>> makes the 3.0 release unavailable on CRAN, as it only supports R [3.5, 4.0)
>>
>> [SPARK-31967] Downgrade vis.js to fix Jobs UI loading time regression
>>
>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0. I think 
>> it would be great if we have Spark 3.0.1 to deliver the critical fixes.
>>
>> Any comments are welcome.

Re: Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-07-01 Thread Steve Loughran
https://issues.apache.org/jira/browse/MAPREDUCE-7282

"MR v2 commit algorithm is dangerous, should be deprecated and not the
default"

someone do a PR to change the default & if it doesn't break too much I'll
merge it



On Mon, 29 Jun 2020 at 13:20, Steve Loughran  wrote:

> v2 does a file-by-file copy to the dest dir in task commit; v1 promotes
> task attempt dirs to the job attempt dir by directory rename; job commit lists
> those and moves the contents
>
> if the worker fails during task commit, the next task attempt has to
> replace every file, so it had better use the same filenames.
>
> The really scary issue is a network partition: if the first worker went
> off-line long enough for a second attempt to commit (if speculation is
> enabled that may not be very long at all, as one could already be waiting), then
> if the first worker comes back online it may continue with its commit and
> partially overwrite some but not all of the output.
>
> That task commit is not atomic, even though Spark requires it to be. It is
> worse on Amazon S3 because rename is O(data), so the window for failure is a
> lot longer.
>
> The S3A committers don't commit their work until job commit; while that is
> non-atomic (nor is MR v1, BTW), its time is |files|/(min(|threads|,
> max-http-pool-size))
>
> The EMR Spark committer does actually commit its work in task commit, so it
> is also vulnerable. I wish they copied more of our ASF-licensed code :). Or
> some of IBM's Stocator work.
>
>
> Presumably their algorithm is:
>
> before the task reports ready-to-commit: upload files from the local task
> attempt staging dir to the dest dir, without completing the uploads. You could
> actually do this with a scanning thread uploading as you go along.
> task commit: POST to complete all the uploads
> job commit: touch _SUCCESS
>
> This scales better (no need to load & commit uploads in job commit), does
> not require any consistent cluster FS, and is faster.
>
> But again: the failure semantics of task commit aren't what Spark expects.
>
> Bonus fun: Google GCS dir rename is file-by-file, so non-atomic; v1 task
> commit does expect an atomic dir rename. So you may as well use v2.
>
> They could add a committer which didn't do that rename, just write a
> manifest file to the job attempt dir pointing to the successful task
> attempt; commit that with their atomic file rename. The committer plugin
> point in MR lets you declare a committer factory for each FS, so it could
> be done without any further changes to spark.
>
> On Thu, 25 Jun 2020 at 22:38, Waleed Fateem 
> wrote:
>
>> I was trying to make my email short and concise, but the rationale for
>> setting that to 1 by default is that it's safer. With algorithm version
>> 2 you run the risk of having bad data in cases where tasks fail, or even
>> duplicate data if a task fails and succeeds on a reattempt (I don't know if
>> this is true for all OutputCommitters that extend the FileOutputCommitter
>> or not).
>>
>> Imran and Marcelo also discussed this here:
>>
>> https://issues.apache.org/jira/browse/SPARK-20107?focusedCommentId=15945177&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15945177
>>
>> I also did discuss this a bit with Steve Loughran, and his opinion was
>> that v2 should just be deprecated altogether. I believe he was going to
>> bring that up with the Hadoop developers.
>>
>>
>> On Thu, Jun 25, 2020 at 3:56 PM Sean Owen  wrote:
>>
>>> I think this is a Hadoop property that is just passed through? If the
>>> default is different in Hadoop 3 we could mention that in the docs. I
>>> don't know if we want to always set it to 1 as a Spark default, even
>>> in Hadoop 3, right?
>>>
>>> On Thu, Jun 25, 2020 at 2:43 PM Waleed Fateem 
>>> wrote:
>>> >
>>> > Hello!
>>> >
>>> > I noticed that in the documentation starting with 2.2.0 it states that
>>> the parameter spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version
>>> is 1 by default:
>>> > https://issues.apache.org/jira/browse/SPARK-20107
>>> >
>>> > I don't actually see this being set anywhere explicitly in the Spark
>>> code, so the documentation isn't entirely accurate if you run in an
>>> environment that has MAPREDUCE-6406 implemented (starting with Hadoop 3.0).
>>> >
>>> > The default version was explicitly set to 2 in the FileOutputCommitter
>>> class, so any output committer that inherits from this class
>>> (ParquetOutputCommitter for example) would use v2 in a Hadoop 3.0
>>> environment and v1 in the older Hadoop environments.
>>> >
>>> > Would it make sense for us to consider setting v1 as the default in
>>> code in case the configuration was not set by a user?
>>> >
>>> > Regards,
>>> >
>>> > Waleed
>>>
>>
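The v1/v2 difference described in this thread can be modelled with a toy in-memory sketch (illustrative only; the real algorithms live in Hadoop's FileOutputCommitter, and dicts stand in for directories here): v1 stages a task's output with a single directory rename and publishes nothing until job commit, while v2 copies files into the destination one by one at task commit, so a mid-commit failure leaves partial output visible.

```python
# Toy model of FileOutputCommitter v1 vs v2 task commit semantics.
# Dicts stand in for directories; this is NOT Hadoop code.

def v1_task_commit(task_files, job_attempt_dir):
    # v1: promote the whole task attempt dir with one rename;
    # nothing appears in the destination yet.
    job_attempt_dir["task_0"] = dict(task_files)

def v1_job_commit(job_attempt_dir, dest):
    # Job commit lists committed tasks and moves their contents.
    for files in job_attempt_dir.values():
        dest.update(files)

def v2_task_commit(task_files, dest, fail_after=None):
    # v2: file-by-file copy straight into the destination.
    # Crashing mid-loop leaves some files visible, others missing.
    for i, (name, data) in enumerate(task_files.items()):
        if fail_after is not None and i >= fail_after:
            return  # simulated worker failure mid-commit
        dest[name] = data

task_files = {"part-0": "a", "part-1": "b"}

# v1: destination stays empty until job commit.
job_dir, dest_v1 = {}, {}
v1_task_commit(task_files, job_dir)
assert dest_v1 == {}
v1_job_commit(job_dir, dest_v1)
assert dest_v1 == task_files

# v2 failing after one file: partial output is already visible.
dest_v2 = {}
v2_task_commit(task_files, dest_v2, fail_after=1)
assert dest_v2 == {"part-0": "a"}
```

In practice the algorithm is selected via the conf key discussed in this thread, e.g. passing `--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1` to spark-submit.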


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-01 Thread Jungtaek Lim
https://issues.apache.org/jira/browse/SPARK-32148 was reported yesterday,
and if the report is valid it looks to be a blocker. I'll try to take a
look sooner.

On Thu, Jul 2, 2020 at 12:48 AM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Thanks Holden -- it would be great to also get 2.4.7 started
>
> Thanks
> Shivaram
>

Re: Spark 3 pod template for the driver

2020-07-01 Thread Edward Mitchell
Okay, I see what's going on here.

Looks like the way that Spark is coded, the driver container image
(specified by --conf
spark.kubernetes.driver.container.image) and executor container image
(specified by --conf
spark.kubernetes.executor.container.image) are required.

If they're not specified, it'll fall back to --conf
spark.kubernetes.container.image.

The way the "pod template" feature was coded is such that even if it's
specified in the YAML, those conf properties take priority and override the
value set in the YAML file.

So basically what I'm saying is that although you have them in the YAML
file, you still need to specify them.

If, like you said, the goal is to not specify those in the spark-submit,
you'll likely need to file an improvement ticket in JIRA.
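The fallback behaviour above can be sketched as a tiny resolution function (the conf keys are the real Spark ones; the function itself is hypothetical, for illustration only):

```python
def resolve_image(conf, role):
    """Pick the container image for 'driver' or 'executor':
    a role-specific conf key wins over the generic one, and the
    pod-template value is never consulted here."""
    specific = conf.get(f"spark.kubernetes.{role}.container.image")
    generic = conf.get("spark.kubernetes.container.image")
    image = specific or generic
    if image is None:
        raise ValueError(f"Must specify the {role} container image")
    return image

# The generic key covers both roles...
conf = {"spark.kubernetes.container.image": "spark:3.0.0"}
assert resolve_image(conf, "driver") == "spark:3.0.0"

# ...but a role-specific key takes priority when present.
conf["spark.kubernetes.driver.container.image"] = "spark-driver:3.0.0"
assert resolve_image(conf, "driver") == "spark-driver:3.0.0"
assert resolve_image(conf, "executor") == "spark:3.0.0"
```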

On Tue, Jun 30, 2020 at 5:26 AM Michel Sumbul 
wrote:

> Hi Edeesis,
>
> The goal is to not have these settings in the spark-submit command. If I
> specify the same things in a pod template for the executor, I still get the
> message:
> "Exception in thread "main" org.apache.spark.SparkException: Must specify
> the driver container image"
>
> It doesn't even try to start an executor container, as the driver is not
> started yet.
> Any idea?
>
> Thanks,
> Michel
>
> On Tue, 30 Jun 2020 at 00:06, edeesis  wrote:
>
>> If I could muster a guess, you still need to specify the executor image.
>> As is, this will only specify the driver image.
>>
>> You can specify it as --conf spark.kubernetes.container.image or --conf
>> spark.kubernetes.executor.container.image
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


[DISCUSS] Drop Python 2, 3.4 and 3.5

2020-07-01 Thread Hyukjin Kwon
Hi all,

I would like to discuss dropping deprecated Python versions 2, 3.4 and 3.5
at https://github.com/apache/spark/pull/28957. I assume people support it
in general
but I am writing this to make sure everybody is happy.

Fokko made a very good investigation on it, see
https://github.com/apache/spark/pull/28957#issuecomment-652022449.
Assuming from the statistics, I think we're pretty safe to drop them.
Also note that dropping Python 2 was actually declared at
https://python3statement.org/

Roughly speaking, there are several main advantages to dropping them:
  1. It removes a bunch of hacks we added (around 700 lines) in PySpark.
  2. PyPy2 has a critical bug that causes a flaky test,
https://issues.apache.org/jira/browse/SPARK-28358 given my testing and
investigation.
  3. Users can use Python type hints with Pandas UDFs without thinking
about the Python version.
  4. Users can leverage the latest cloudpickle,
https://github.com/apache/spark/pull/28950. With Python 3.8+ it can also
leverage C pickle.
  5. ...

So it benefits both users and dev. WDYT guys?
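On point 3: Spark 3.0's pandas UDFs infer the UDF variant from the Python type hints on the function, which is only practical once Python 2 (which has no function annotations) is gone. A rough pure-Python sketch of hint-based inference (illustrative only, not the actual PySpark implementation):

```python
from typing import get_type_hints

def infer_signature(func):
    # Read the annotations the user put on their UDF (Python 3 only).
    hints = get_type_hints(func)
    ret = hints.pop("return", None)
    if ret is None:
        raise TypeError("UDF must annotate its return type")
    args = ", ".join(t.__name__ for t in hints.values())
    return f"({args}) -> {ret.__name__}"

def plus_one(v: int) -> int:
    return v + 1

assert infer_signature(plus_one) == "(int) -> int"
```

PySpark does something analogous to decide which pandas UDF variant a function is, instead of requiring the user to pass an explicit UDF type.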


Re: [VOTE] Decommissioning SPIP

2020-07-01 Thread Imran Rashid
+1

I think this is going to be a really important feature for Spark and I'm
glad to see Holden focusing on it.

On Wed, Jul 1, 2020 at 8:38 PM Mridul Muralidharan  wrote:

> +1
>
> Thanks,
> Mridul
>
> On Wed, Jul 1, 2020 at 6:36 PM Hyukjin Kwon  wrote:
>
>> +1
>>
>> On Thu, Jul 2, 2020 at 10:08 AM, Marcelo Vanzin wrote:
>>
>>> I reviewed the docs and PRs from way before an SPIP was explicitly
>>> asked, so I'm comfortable with giving a +1 even if I haven't really
>>> fully read the new document.


Re: [DISCUSS] Drop Python 2, 3.4 and 3.5

2020-07-01 Thread Hyukjin Kwon
Yeah, sure. It will be dropped at Spark 3.1 onwards. I don't think we
should make such changes in maintenance releases

On Thu, Jul 2, 2020 at 11:13 AM, Holden Karau wrote:

> To be clear the plan is to drop them in Spark 3.1 onwards, yes?
>


Re: [VOTE] Decommissioning SPIP

2020-07-01 Thread Marcelo Vanzin
I reviewed the docs and PRs from way before an SPIP was explicitly
asked, so I'm comfortable with giving a +1 even if I haven't really
fully read the new document.




-- 
Marcelo Vanzin
van...@gmail.com
"Life's too short to drink cheap beer"




Re: [DISCUSS] Drop Python 2, 3.4 and 3.5

2020-07-01 Thread Holden Karau
I’m ok with us dropping Python 2, 3.4, and 3.5 in Spark 3.1 forward. It
will be exciting to get to use more recent Python features. The most recent
Ubuntu LTS ships with 3.7, and while the previous LTS ships with 3.5, if
folks really can’t upgrade there’s conda.

Is there anyone with a large Python 3.5 fleet who can’t use conda?

On Wed, Jul 1, 2020 at 7:15 PM Hyukjin Kwon  wrote:

> Yeah, sure. It will be dropped at Spark 3.1 onwards. I don't think we
> should make such changes in maintenance releases
>
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


[VOTE] Decommissioning SPIP

2020-07-01 Thread Holden Karau
Hi Spark Devs,

I think discussion has settled on the SPIP doc at
https://docs.google.com/document/d/1EOei24ZpVvR7_w0BwBjOnrWRy4k-qTdIlx60FsHZSHA/edit?usp=sharing
,
design doc at
https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE/edit,
or JIRA https://issues.apache.org/jira/browse/SPARK-20624, and I've
received a request to put the SPIP up for a VOTE quickly. The discussion
thread on the mailing list is at
http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-SPIP-Graceful-Decommissioning-td29650.html
.

Normally this vote would be open for 72 hours; however, since it's a long
weekend in the US where many of the PMC members are, this vote will not
close before July 6th at noon Pacific time.

The SPIP procedures are documented at:
https://spark.apache.org/improvement-proposals.html. The ASF's voting guide
is at https://www.apache.org/foundation/voting.html.

Please vote before July 6th at noon:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

I will start the voting off with a +1 from myself.

Cheers,

Holden


Re: [DISCUSS] Drop Python 2, 3.4 and 3.5

2020-07-01 Thread Holden Karau
To be clear the plan is to drop them in Spark 3.1 onwards, yes?

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Fwd: Announcing ApacheCon @Home 2020

2020-07-01 Thread Felix Cheung

-- Forwarded message -

We are pleased to announce that ApacheCon @Home will be held online,
September 29 through October 1.

More event details are available at https://apachecon.com/acah2020 but
there are a few things that I want to highlight for you, the members.

Yes, the CFP has been reopened. It will be open until the morning of
July 13th. With no restrictions on space/time at the venue, we can
accept talks from a much wider pool of speakers, so we look forward to
hearing from those of you who may have been reluctant, or unwilling, to
travel to the US.
Yes, you can add your project to the event, whether that’s one talk, or
an entire track - we have the room now. Those of you who are PMC members
will be receiving information about how to get your projects represented
at the event.
Attendance is free, as has been the trend in these events in our
industry. We do, however, offer donation options for attendees who feel
that our content is worth paying for.
Sponsorship opportunities are available immediately at
https://www.apachecon.com/acna2020/sponsors.html

If you would like to volunteer to help, we ask that you join the
plann...@apachecon.com mailing list and discuss 
it there, rather than
here, so that we do not have a split discussion, while we’re trying to
coordinate all of the things we have to get done in this very short time
window.

Rich Bowen,
VP Conferences, The Apache Software Foundation




Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-01 Thread Xiao Li
+1 on releasing both 3.0.1 and 2.4.7

Great! Three committers volunteered to be release managers: Ruifeng,
Prashant, and Holden. Holden just helped release Spark 2.4.6. This time,
maybe Ruifeng and Prashant can be the release managers of 3.0.1 and 2.4.7,
respectively.

Xiao

On Wed, Jul 1, 2020 at 2:24 PM Jungtaek Lim 
wrote:

> https://issues.apache.org/jira/browse/SPARK-32148 was reported yesterday,
> and if the report is valid it looks to be a blocker. I'll try to take a
> look sooner.
>
> On Thu, Jul 2, 2020 at 12:48 AM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> Thanks Holden -- it would be great to also get 2.4.7 started
>>
>> Thanks
>> Shivaram
>>
>> On Tue, Jun 30, 2020 at 10:31 PM Holden Karau 
>> wrote:
>> >
>> > I can take care of 2.4.7 unless someone else wants to do it.
>> >
>> > On Tue, Jun 30, 2020 at 8:29 PM Jason Moore <
>> jason.mo...@quantium.com.au> wrote:
>> >>
>> >> Hi all,
>> >>
>> >>
>> >>
>> >> Could I get some input on the severity of this one that I found
>> yesterday?  If that’s a correctness issue, should it block this patch?  Let
>> me know under the ticket if there’s more info that I can provide to help.
>> >>
>> >>
>> >>
>> >> https://issues.apache.org/jira/browse/SPARK-32136
>> >>
>> >>
>> >>
>> >> Thanks,
>> >>
>> >> Jason.
>> >>
>> >>
>> >>
>> >> From: Jungtaek Lim 
>> >> Date: Wednesday, 1 July 2020 at 10:20 am
>> >> To: Shivaram Venkataraman 
>> >> Cc: Prashant Sharma , 郑瑞峰 ,
>> Gengliang Wang , gurwls223 <
>> gurwls...@gmail.com>, Dongjoon Hyun , Jules
>> Damji , Holden Karau ,
>> Reynold Xin , Yuanjian Li ,
>> "dev@spark.apache.org" , Takeshi Yamamuro <
>> linguin@gmail.com>
>> >> Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
>> >>
>> >>
>> >>
>> >> SPARK-32130 [1] looks to be a performance regression introduced in
>> Spark 3.0.0, which is ideal to look into before releasing another bugfix
>> version.
>> >>
>> >>
>> >>
>> >> 1. https://issues.apache.org/jira/browse/SPARK-32130
>> >>
>> >>
>> >>
>> >> On Wed, Jul 1, 2020 at 7:05 AM Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>> >>
>> >> Hi all
>> >>
>> >>
>> >>
>> >> I just wanted to ping this thread to see if all the outstanding
>> blockers for 3.0.1 have been fixed. If so, it would be great if we can get
>> the release going. The CRAN team sent us a note that the version of SparkR
>> available on CRAN for the current R version (4.0.2) is broken, and hence we
>> need to update the package soon -- it will be great to do it with 3.0.1.
>> >>
>> >>
>> >>
>> >> Thanks
>> >>
>> >> Shivaram
>> >>
>> >>
>> >>
>> >> On Wed, Jun 24, 2020 at 8:31 PM Prashant Sharma 
>> wrote:
>> >>
>> >> +1 for 3.0.1 release.
>> >>
>> >> I too can help out as release manager.
>> >>
>> >>
>> >>
>> >> On Thu, Jun 25, 2020 at 4:58 AM 郑瑞峰  wrote:
>> >>
>> >> I volunteer to be a release manager of 3.0.1, if nobody is working on
>> this.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> -- 原始邮件 --
>> >>
>> >> 发件人: "Gengliang Wang";
>> >>
>> >> 发送时间: 2020年6月24日(星期三) 下午4:15
>> >>
>> >> 收件人: "Hyukjin Kwon";
>> >>
>> >> 抄送: "Dongjoon Hyun";"Jungtaek Lim"<
>> kabhwan.opensou...@gmail.com>;"Jules Damji";"Holden
>> Karau";"Reynold Xin";"Shivaram
>> Venkataraman";"Yuanjian Li"<
>> xyliyuanj...@gmail.com>;"Spark dev list";"Takeshi
>> Yamamuro";
>> >>
>> >> 主题: Re: [DISCUSS] Apache Spark 3.0.1 Release
>> >>
>> >>
>> >>
>> >> +1, the issues mentioned are really serious.
>> >>
>> >>
>> >>
>> >> On Tue, Jun 23, 2020 at 7:56 PM Hyukjin Kwon 
>> wrote:
>> >>
>> >> +1.
>> >>
>> >> Just as a note:
>> >> - SPARK-31918 is fixed now, and there's no blocker.
>> >> - When we build SparkR, we should use the latest R version, at least 4.0.0+.
>> >>
>> >>
>> >>
>> >> 2020년 6월 24일 (수) 오전 11:20, Dongjoon Hyun 님이
>> 작성:
>> >>
>> >> +1
>> >>
>> >>
>> >>
>> >> Bests,
>> >>
>> >> Dongjoon.
>> >>
>> >>
>> >>
>> >> On Tue, Jun 23, 2020 at 1:19 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >>
>> >> +1 on a 3.0.1 soon.
>> >>
>> >>
>> >>
>> >> Probably it would be nice if some Scala experts can take a look at
>> https://issues.apache.org/jira/browse/SPARK-32051 and include the fix
>> into 3.0.1 if possible.
>> >>
>> >> Looks like APIs designed to work with Scala 2.11 & Java bring
>> ambiguity in Scala 2.12 & Java.
>> >>
>> >>
>> >>
>> >> On Wed, Jun 24, 2020 at 4:52 AM Jules Damji 
>> wrote:
>> >>
>> >> +1 (non-binding)
>> >>
>> >>
>> >>
>> >> Sent from my iPhone
>> >>
>> >> Pardon the dumb thumb typos :)
>> >>
>> >>
>> >>
>> >> On Jun 23, 2020, at 11:36 AM, Holden Karau 
>> wrote:
>> >>
>> >> +1 on a patch release soon
>> >>
>> >>
>> >>
>> >> On Tue, Jun 23, 2020 at 10:47 AM Reynold Xin 
>> wrote:
>> >>
>> >>
>> >> +1 on doing a new patch release soon. I saw some of these issues when
>> preparing the 3.0 release, and some of them are very serious.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:

Re: m2 cache issues in Jenkins?

2020-07-01 Thread Hyukjin Kwon
Nope. Do we have an existing ticket? I think we can reopen it if there is one.

2020년 7월 2일 (목) 오후 1:43, Holden Karau 님이 작성:

> Huh, interesting that it’s the same worker. Have you filed a ticket with
> Shane?
>
> On Wed, Jul 1, 2020 at 8:50 PM Hyukjin Kwon  wrote:
>
>> Hm .. seems this is happening again in amp-jenkins-worker-04 ;(.
>>
>> 2020년 6월 25일 (목) 오전 3:15, shane knapp ☠ 님이 작성:
>>
>>> done:
>>> -bash-4.1$ cd .m2
>>> -bash-4.1$ ls
>>> repository
>>> -bash-4.1$ time rm -rf *
>>>
>>> real17m4.607s
>>> user0m0.950s
>>> sys 0m18.816s
>>> -bash-4.1$
>>>
>>> On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠ 
>>> wrote:
>>>
 ok, i've taken that worker offline and once the job running on it
 finishes, i'll wipe the cache.

 in the future, please file a JIRA and assign it to me so i don't have
 to track my work through emails to the dev@ list.  ;)

 thanks!

 shane

 On Wed, Jun 24, 2020 at 10:48 AM Holden Karau 
 wrote:

> The most recent one I noticed was
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console
>  which
> was run on  amp-jenkins-worker-04.
>
> On Wed, Jun 24, 2020 at 10:44 AM shane knapp ☠ 
> wrote:
>
>> for those weird failures, it's super helpful to provide which workers
>> are showing these issues.  :)
>>
>> i'd rather not wipe all of the m2 caches on all of the workers, as
>> we'll then potentially get blacklisted again if we download too many
>> packages from apache.org.
>>
>> On Tue, Jun 23, 2020 at 5:58 PM Holden Karau 
>> wrote:
>>
>>> Hi Folks,
>>>
>>> I've been seeing some weird failures on Jenkins and it looks like it
>>> might be from the m2 cache. Would it be OK to clean it out? Or is it
>>> important?
>>>
>>> Cheers,
>>>
>>> Holden
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


 --
 Shane Knapp
 Computer Guy / Voice of Reason
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu

>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: m2 cache issues in Jenkins?

2020-07-01 Thread Holden Karau
We don't; I didn't file one originally, but Shane reminded me to file one in
the future.

On Wed, Jul 1, 2020 at 9:44 PM Hyukjin Kwon  wrote:

> Nope. Do we have an existing ticket? I think we can reopen it if there is one.
>
> 2020년 7월 2일 (목) 오후 1:43, Holden Karau 님이 작성:
>
>> Huh, interesting that it’s the same worker. Have you filed a ticket with
>> Shane?
>>
>> On Wed, Jul 1, 2020 at 8:50 PM Hyukjin Kwon  wrote:
>>
>>> Hm .. seems this is happening again in amp-jenkins-worker-04 ;(.
>>>
>>> 2020년 6월 25일 (목) 오전 3:15, shane knapp ☠ 님이 작성:
>>>
 done:
 -bash-4.1$ cd .m2
 -bash-4.1$ ls
 repository
 -bash-4.1$ time rm -rf *

 real17m4.607s
 user0m0.950s
 sys 0m18.816s
 -bash-4.1$

 On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠ 
 wrote:

> ok, i've taken that worker offline and once the job running on it
> finishes, i'll wipe the cache.
>
> in the future, please file a JIRA and assign it to me so i don't have
> to track my work through emails to the dev@ list.  ;)
>
> thanks!
>
> shane
>
> On Wed, Jun 24, 2020 at 10:48 AM Holden Karau 
> wrote:
>
>> The most recent one I noticed was
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console
>>  which
>> was run on  amp-jenkins-worker-04.
>>
>> On Wed, Jun 24, 2020 at 10:44 AM shane knapp ☠ 
>> wrote:
>>
>>> for those weird failures, it's super helpful to provide which
>>> workers are showing these issues.  :)
>>>
>>> i'd rather not wipe all of the m2 caches on all of the workers, as
>>> we'll then potentially get blacklisted again if we download too many
>>> packages from apache.org.
>>>
>>> On Tue, Jun 23, 2020 at 5:58 PM Holden Karau 
>>> wrote:
>>>
 Hi Folks,

 I've been seeing some weird failures on Jenkins and it looks like it
 might be from the m2 cache. Would it be OK to clean it out? Or is it
 important?

 Cheers,

 Holden

 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


 --
 Shane Knapp
 Computer Guy / Voice of Reason
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu

>>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] Decommissioning SPIP

2020-07-01 Thread Hyukjin Kwon
+1

2020년 7월 2일 (목) 오전 10:08, Marcelo Vanzin 님이 작성:

> I reviewed the docs and PRs from way before an SPIP was explicitly
> asked, so I'm comfortable with giving a +1 even if I haven't really
> fully read the new document.
>
> On Wed, Jul 1, 2020 at 6:05 PM Holden Karau  wrote:
> >
> > Hi Spark Devs,
> >
> > I think discussion has settled on the SPIP doc at
> https://docs.google.com/document/d/1EOei24ZpVvR7_w0BwBjOnrWRy4k-qTdIlx60FsHZSHA/edit?usp=sharing
> , design doc at
> https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE/edit,
> or JIRA https://issues.apache.org/jira/browse/SPARK-20624, and I've
> received a request to put the SPIP up for a VOTE quickly. The discussion
> thread on the mailing list is at
> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-SPIP-Graceful-Decommissioning-td29650.html
> .
> >
> > Normally this vote would be open for 72 hours, however since it's a long
> weekend in the US where many of the PMC members are, this vote will not
> close before July 6th at noon pacific time.
> >
> > The SPIP procedures are documented at:
> https://spark.apache.org/improvement-proposals.html. The ASF's voting
> guide is at https://www.apache.org/foundation/voting.html.
> >
> > Please vote before July 6th at noon:
> >
> > [ ] +1: Accept the proposal as an official SPIP
> > [ ] +0
> > [ ] -1: I don't think this is a good idea because ...
> >
> > I will start the voting off with a +1 from myself.
> >
> > Cheers,
> >
> > Holden
>
>
>
> --
> Marcelo Vanzin
> van...@gmail.com
> "Life's too short to drink cheap beer"
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Decommissioning SPIP

2020-07-01 Thread Mridul Muralidharan
+1

Thanks,
Mridul

On Wed, Jul 1, 2020 at 6:36 PM Hyukjin Kwon  wrote:

> +1
>
> 2020년 7월 2일 (목) 오전 10:08, Marcelo Vanzin 님이 작성:
>
>> I reviewed the docs and PRs from way before an SPIP was explicitly
>> asked, so I'm comfortable with giving a +1 even if I haven't really
>> fully read the new document.
>>
>> On Wed, Jul 1, 2020 at 6:05 PM Holden Karau  wrote:
>> >
>> > Hi Spark Devs,
>> >
>> > I think discussion has settled on the SPIP doc at
>> https://docs.google.com/document/d/1EOei24ZpVvR7_w0BwBjOnrWRy4k-qTdIlx60FsHZSHA/edit?usp=sharing
>> , design doc at
>> https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE/edit,
>> or JIRA https://issues.apache.org/jira/browse/SPARK-20624, and I've
>> received a request to put the SPIP up for a VOTE quickly. The discussion
>> thread on the mailing list is at
>> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-SPIP-Graceful-Decommissioning-td29650.html
>> .
>> >
>> > Normally this vote would be open for 72 hours, however since it's a
>> long weekend in the US where many of the PMC members are, this vote will
>> not close before July 6th at noon pacific time.
>> >
>> > The SPIP procedures are documented at:
>> https://spark.apache.org/improvement-proposals.html. The ASF's voting
>> guide is at https://www.apache.org/foundation/voting.html.
>> >
>> > Please vote before July 6th at noon:
>> >
>> > [ ] +1: Accept the proposal as an official SPIP
>> > [ ] +0
>> > [ ] -1: I don't think this is a good idea because ...
>> >
>> > I will start the voting off with a +1 from myself.
>> >
>> > Cheers,
>> >
>> > Holden
>>
>>
>>
>> --
>> Marcelo Vanzin
>> van...@gmail.com
>> "Life's too short to drink cheap beer"
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: m2 cache issues in Jenkins?

2020-07-01 Thread Holden Karau
Huh, interesting that it’s the same worker. Have you filed a ticket with Shane?

On Wed, Jul 1, 2020 at 8:50 PM Hyukjin Kwon  wrote:

> Hm .. seems this is happening again in amp-jenkins-worker-04 ;(.
>
> 2020년 6월 25일 (목) 오전 3:15, shane knapp ☠ 님이 작성:
>
>> done:
>> -bash-4.1$ cd .m2
>> -bash-4.1$ ls
>> repository
>> -bash-4.1$ time rm -rf *
>>
>> real17m4.607s
>> user0m0.950s
>> sys 0m18.816s
>> -bash-4.1$
>>
>> On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠ 
>> wrote:
>>
>>> ok, i've taken that worker offline and once the job running on it
>>> finishes, i'll wipe the cache.
>>>
>>> in the future, please file a JIRA and assign it to me so i don't have to
>>> track my work through emails to the dev@ list.  ;)
>>>
>>> thanks!
>>>
>>> shane
>>>
>>> On Wed, Jun 24, 2020 at 10:48 AM Holden Karau 
>>> wrote:
>>>
 The most recent one I noticed was
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console
  which
 was run on  amp-jenkins-worker-04.

 On Wed, Jun 24, 2020 at 10:44 AM shane knapp ☠ 
 wrote:

> for those weird failures, it's super helpful to provide which workers
> are showing these issues.  :)
>
> i'd rather not wipe all of the m2 caches on all of the workers, as
> we'll then potentially get blacklisted again if we download too many
> packages from apache.org.
>
> On Tue, Jun 23, 2020 at 5:58 PM Holden Karau 
> wrote:
>
>> Hi Folks,
>>
>> I've been seeing some weird failures on Jenkins and it looks like it
>> might be from the m2 cache. Would it be OK to clean it out? Or is it
>> important?
>>
>> Cheers,
>>
>> Holden
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-01 Thread Holden Karau
I’m happy to have Prashant do 2.4.7 :)

On Wed, Jul 1, 2020 at 9:40 PM Xiao Li  wrote:

> +1 on releasing both 3.0.1 and 2.4.7
>
> Great! Three committers volunteer to be a release manager. Ruifeng,
> Prashant and Holden. Holden just helped release Spark 2.4.6. This time,
> maybe, Ruifeng and Prashant can be the release manager of 3.0.1 and 2.4.7
> respectively.
>
> Xiao
>
> On Wed, Jul 1, 2020 at 2:24 PM Jungtaek Lim 
> wrote:
>
>> https://issues.apache.org/jira/browse/SPARK-32148 was reported
>> yesterday, and if the report is valid, it looks to be a blocker. I'll try to
>> take a look soon.
>>
>> On Thu, Jul 2, 2020 at 12:48 AM Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> Thanks Holden -- it would be great to also get 2.4.7 started
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Tue, Jun 30, 2020 at 10:31 PM Holden Karau 
>>> wrote:
>>> >
>>> > I can take care of 2.4.7 unless someone else wants to do it.
>>> >
>>> > On Tue, Jun 30, 2020 at 8:29 PM Jason Moore <
>>> jason.mo...@quantium.com.au> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >>
>>> >>
>>> >> Could I get some input on the severity of this one that I found
>>> yesterday?  If that’s a correctness issue, should it block this patch?  Let
>>> me know under the ticket if there’s more info that I can provide to help.
>>> >>
>>> >>
>>> >>
>>> >> https://issues.apache.org/jira/browse/SPARK-32136
>>> >>
>>> >>
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Jason.
>>> >>
>>> >>
>>> >>
>>> >> From: Jungtaek Lim 
>>> >> Date: Wednesday, 1 July 2020 at 10:20 am
>>> >> To: Shivaram Venkataraman 
>>> >> Cc: Prashant Sharma , 郑瑞峰 ,
>>> Gengliang Wang , gurwls223 <
>>> gurwls...@gmail.com>, Dongjoon Hyun , Jules
>>> Damji , Holden Karau ,
>>> Reynold Xin , Yuanjian Li ,
>>> "dev@spark.apache.org" , Takeshi Yamamuro <
>>> linguin@gmail.com>
>>> >> Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
>>> >>
>>> >>
>>> >>
>>> >> SPARK-32130 [1] looks to be a performance regression introduced in
>>> Spark 3.0.0, which is ideal to look into before releasing another bugfix
>>> version.
>>> >>
>>> >>
>>> >>
>>> >> 1. https://issues.apache.org/jira/browse/SPARK-32130
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Jul 1, 2020 at 7:05 AM Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>> >>
>>> >> Hi all
>>> >>
>>> >>
>>> >>
>>> >> I just wanted to ping this thread to see if all the outstanding
>>> blockers for 3.0.1 have been fixed. If so, it would be great if we can get
>>> the release going. The CRAN team sent us a note that the version of SparkR
>>> available on CRAN for the current R version (4.0.2) is broken, and hence we
>>> need to update the package soon -- it will be great to do it with 3.0.1.
>>> >>
>>> >>
>>> >>
>>> >> Thanks
>>> >>
>>> >> Shivaram
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Jun 24, 2020 at 8:31 PM Prashant Sharma 
>>> wrote:
>>> >>
>>> >> +1 for 3.0.1 release.
>>> >>
>>> >> I too can help out as release manager.
>>> >>
>>> >>
>>> >>
>>> >> On Thu, Jun 25, 2020 at 4:58 AM 郑瑞峰  wrote:
>>> >>
>>> >> I volunteer to be a release manager of 3.0.1, if nobody is working on
>>> this.
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> -- 原始邮件 --
>>> >>
>>> >> 发件人: "Gengliang Wang";
>>> >>
>>> >> 发送时间: 2020年6月24日(星期三) 下午4:15
>>> >>
>>> >> 收件人: "Hyukjin Kwon";
>>> >>
>>> >> 抄送: "Dongjoon Hyun";"Jungtaek Lim"<
>>> kabhwan.opensou...@gmail.com>;"Jules Damji";"Holden
>>> Karau";"Reynold Xin";"Shivaram
>>> Venkataraman";"Yuanjian Li"<
>>> xyliyuanj...@gmail.com>;"Spark dev list";"Takeshi
>>> Yamamuro";
>>> >>
>>> >> 主题: Re: [DISCUSS] Apache Spark 3.0.1 Release
>>> >>
>>> >>
>>> >>
>>> >> +1, the issues mentioned are really serious.
>>> >>
>>> >>
>>> >>
>>> >> On Tue, Jun 23, 2020 at 7:56 PM Hyukjin Kwon 
>>> wrote:
>>> >>
>>> >> +1.
>>> >>
>>> >> Just as a note:
>>> >> - SPARK-31918 is fixed now, and there's no blocker.
>>> >> - When we build SparkR, we should use the latest R version, at least 4.0.0+.
>>> >>
>>> >>
>>> >>
>>> >> 2020년 6월 24일 (수) 오전 11:20, Dongjoon Hyun 님이
>>> 작성:
>>> >>
>>> >> +1
>>> >>
>>> >>
>>> >>
>>> >> Bests,
>>> >>
>>> >> Dongjoon.
>>> >>
>>> >>
>>> >>
>>> >> On Tue, Jun 23, 2020 at 1:19 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>> >>
>>> >> +1 on a 3.0.1 soon.
>>> >>
>>> >>
>>> >>
>>> >> Probably it would be nice if some Scala experts can take a look at
>>> https://issues.apache.org/jira/browse/SPARK-32051 and include the fix
>>> into 3.0.1 if possible.
>>> >>
>>> >> Looks like APIs designed to work with Scala 2.11 & Java bring
>>> ambiguity in Scala 2.12 & Java.
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Jun 24, 2020 at 4:52 AM Jules Damji 
>>> wrote:
>>> >>
>>> >> +1 (non-binding)
>>> >>
>>> >>
>>> >>
>>> >> Sent from my iPhone
>>> >>
>>> >> Pardon the dumb thumb typos :)
>>> >>
>>> >>
>>> >>
>>> >> On Jun 23, 2020, at 11:36 AM, Holden Karau 
>>> wrote:
>>> >>
>>> >> +1 on a patch release soon
>>> >>
>>> >>
>>> >>
>>> >> On Tue, Jun 23, 2020 at 10:47 AM Reynold Xin 
>>> wrote:
>>> >>

Re: [VOTE] Decommissioning SPIP

2020-07-01 Thread Stephen Boesch
+1 Thx for seeing this through

On Wed, 1 Jul 2020 at 20:03, Imran Rashid 
wrote:

> +1
>
> I think this is going to be a really important feature for Spark and I'm
> glad to see Holden focusing on it.
>
> On Wed, Jul 1, 2020 at 8:38 PM Mridul Muralidharan 
> wrote:
>
>> +1
>>
>> Thanks,
>> Mridul
>>
>> On Wed, Jul 1, 2020 at 6:36 PM Hyukjin Kwon  wrote:
>>
>>> +1
>>>
>>> 2020년 7월 2일 (목) 오전 10:08, Marcelo Vanzin 님이 작성:
>>>
 I reviewed the docs and PRs from way before an SPIP was explicitly
 asked, so I'm comfortable with giving a +1 even if I haven't really
 fully read the new document.

 On Wed, Jul 1, 2020 at 6:05 PM Holden Karau 
 wrote:
 >
 > Hi Spark Devs,
 >
 > I think discussion has settled on the SPIP doc at
 https://docs.google.com/document/d/1EOei24ZpVvR7_w0BwBjOnrWRy4k-qTdIlx60FsHZSHA/edit?usp=sharing
 , design doc at
 https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE/edit,
 or JIRA https://issues.apache.org/jira/browse/SPARK-20624, and I've
 received a request to put the SPIP up for a VOTE quickly. The discussion
 thread on the mailing list is at
 http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-SPIP-Graceful-Decommissioning-td29650.html
 .
 >
 > Normally this vote would be open for 72 hours, however since it's a
 long weekend in the US where many of the PMC members are, this vote will
 not close before July 6th at noon pacific time.
 >
 > The SPIP procedures are documented at:
 https://spark.apache.org/improvement-proposals.html. The ASF's voting
 guide is at https://www.apache.org/foundation/voting.html.
 >
 > Please vote before July 6th at noon:
 >
 > [ ] +1: Accept the proposal as an official SPIP
 > [ ] +0
 > [ ] -1: I don't think this is a good idea because ...
 >
 > I will start the voting off with a +1 from myself.
 >
 > Cheers,
 >
 > Holden



 --
 Marcelo Vanzin
 van...@gmail.com
 "Life's too short to drink cheap beer"

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Re: m2 cache issues in Jenkins?

2020-07-01 Thread Hyukjin Kwon
Hm .. seems this is happening again in amp-jenkins-worker-04 ;(.

2020년 6월 25일 (목) 오전 3:15, shane knapp ☠ 님이 작성:

> done:
> -bash-4.1$ cd .m2
> -bash-4.1$ ls
> repository
> -bash-4.1$ time rm -rf *
>
> real17m4.607s
> user0m0.950s
> sys 0m18.816s
> -bash-4.1$
>
> On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠ 
> wrote:
>
>> ok, i've taken that worker offline and once the job running on it
>> finishes, i'll wipe the cache.
>>
>> in the future, please file a JIRA and assign it to me so i don't have to
>> track my work through emails to the dev@ list.  ;)
>>
>> thanks!
>>
>> shane
>>
>> On Wed, Jun 24, 2020 at 10:48 AM Holden Karau 
>> wrote:
>>
>>> The most recent one I noticed was
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console
>>>  which
>>> was run on  amp-jenkins-worker-04.
>>>
>>> On Wed, Jun 24, 2020 at 10:44 AM shane knapp ☠ 
>>> wrote:
>>>
 for those weird failures, it's super helpful to provide which workers
 are showing these issues.  :)

 i'd rather not wipe all of the m2 caches on all of the workers, as
 we'll then potentially get blacklisted again if we download too many
 packages from apache.org.

 On Tue, Jun 23, 2020 at 5:58 PM Holden Karau 
 wrote:

> Hi Folks,
>
> I've been seeing some weird failures on Jenkins and it looks like it
> might be from the m2 cache. Would it be OK to clean it out? Or is it
> important?
>
> Cheers,
>
> Holden
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


 --
 Shane Knapp
 Computer Guy / Voice of Reason
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu

>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: m2 cache issues in Jenkins?

2020-07-01 Thread Hyukjin Kwon
Ah, okay. Actually there already is one:
https://issues.apache.org/jira/browse/SPARK-31693. I am reopening it.

2020년 7월 2일 (목) 오후 2:06, Holden Karau 님이 작성:

> We don't I didn't file one originally, but Shane reminded me to in the
> future.
>
> On Wed, Jul 1, 2020 at 9:44 PM Hyukjin Kwon  wrote:
>
>> Nope. Do we have an existing ticket? I think we can reopen it if there is one.
>>
>> 2020년 7월 2일 (목) 오후 1:43, Holden Karau 님이 작성:
>>
>>> Huh, interesting that it’s the same worker. Have you filed a ticket with
>>> Shane?
>>>
>>> On Wed, Jul 1, 2020 at 8:50 PM Hyukjin Kwon  wrote:
>>>
 Hm .. seems this is happening again in amp-jenkins-worker-04 ;(.

 2020년 6월 25일 (목) 오전 3:15, shane knapp ☠ 님이 작성:

> done:
> -bash-4.1$ cd .m2
> -bash-4.1$ ls
> repository
> -bash-4.1$ time rm -rf *
>
> real17m4.607s
> user0m0.950s
> sys 0m18.816s
> -bash-4.1$
>
> On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠ 
> wrote:
>
>> ok, i've taken that worker offline and once the job running on it
>> finishes, i'll wipe the cache.
>>
>> in the future, please file a JIRA and assign it to me so i don't have
>> to track my work through emails to the dev@ list.  ;)
>>
>> thanks!
>>
>> shane
>>
>> On Wed, Jun 24, 2020 at 10:48 AM Holden Karau 
>> wrote:
>>
>>> The most recent one I noticed was
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console
>>>  which
>>> was run on  amp-jenkins-worker-04.
>>>
>>> On Wed, Jun 24, 2020 at 10:44 AM shane knapp ☠ 
>>> wrote:
>>>
 for those weird failures, it's super helpful to provide which
 workers are showing these issues.  :)

 i'd rather not wipe all of the m2 caches on all of the workers, as
 we'll then potentially get blacklisted again if we download too many
 packages from apache.org.

 On Tue, Jun 23, 2020 at 5:58 PM Holden Karau 
 wrote:

> Hi Folks,
>
> I've been seeing some weird failures on Jenkins and it looks like it
> might be from the m2 cache. Would it be OK to clean it out? Or is it
> important?
>
> Cheers,
>
> Holden
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


 --
 Shane Knapp
 Computer Guy / Voice of Reason
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu

>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
 --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [DISCUSS] Drop Python 2, 3.4 and 3.5

2020-07-01 Thread Yuanjian Li
+1, especially Python 2

Holden Karau  于2020年7月2日周四 上午10:20写道:

> I’m ok with us dropping Python 2, 3.4, and 3.5 in Spark 3.1 forward. It
> will be exciting to get to use more recent Python features. The most recent
> Ubuntu LTS ships with 3.7, and while the previous LTS ships with 3.5, if
> folks really can’t upgrade there’s conda.
>
> Is there anyone with a large Python 3.5 fleet who can’t use conda?
>
> On Wed, Jul 1, 2020 at 7:15 PM Hyukjin Kwon  wrote:
>
>> Yeah, sure. It will be dropped from Spark 3.1 onwards. I don't think we
>> should make such changes in maintenance releases.
>>
>> 2020년 7월 2일 (목) 오전 11:13, Holden Karau 님이 작성:
>>
>>> To be clear the plan is to drop them in Spark 3.1 onwards, yes?
>>>
>>> On Wed, Jul 1, 2020 at 7:11 PM Hyukjin Kwon  wrote:
>>>
 Hi all,

 I would like to discuss dropping deprecated Python versions 2, 3.4 and
 3.5 at https://github.com/apache/spark/pull/28957. I assume people
 support it in general
 but I am writing this to make sure everybody is happy.

 Fokko made a very good investigation on it, see
 https://github.com/apache/spark/pull/28957#issuecomment-652022449.
 Assuming from the statistics, I think we're pretty safe to drop them.
 Also note that dropping Python 2 was actually declared at
 https://python3statement.org/

 Roughly speaking, there are many main advantages by dropping them:
  1. It removes a bunch of hacks we added (around 700 lines) in PySpark.
   2. PyPy2 has a critical bug that causes a flaky test,
 https://issues.apache.org/jira/browse/SPARK-28358 given my testing and
 investigation.
   3. Users can use Python type hints with Pandas UDFs without thinking
 about Python version
   4. Users can leverage one latest cloudpickle,
 https://github.com/apache/spark/pull/28950. With Python 3.8+ it can
 also leverage C pickle.
   5. ...

 So it benefits both users and devs. WDYT?


 --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
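[Editor's note: item 3 in the list above, inferring the Pandas UDF variant from Python type hints, can be illustrated without Spark at all. The sketch below mimics the inference idea in pure Python; `infer_udf_variant` and the variant names are hypothetical and are not Spark's actual implementation. Note it relies on `typing.get_origin`, which only exists in Python 3.8+, itself an argument for dropping the old versions.]

```python
# Pure-Python sketch of inferring a UDF "variant" from type hints alone,
# the way Spark 3.0+ avoids an explicit functionType argument.
# infer_udf_variant and the variant names are hypothetical illustrations.
import collections.abc
import typing
from typing import Iterator, get_type_hints

def infer_udf_variant(func) -> str:
    """Classify a UDF as 'iterator' or 'series' style from its annotations."""
    hints = get_type_hints(func)
    hints.pop("return", None)
    first = next(iter(hints.values()), None)
    # typing.get_origin requires Python 3.8+; Iterator[...] unwraps to
    # collections.abc.Iterator there.
    if typing.get_origin(first) is collections.abc.Iterator:
        return "iterator"
    return "series"

# With hints on the function itself, no extra argument is needed:
def double(col: list) -> list:          # batch-at-a-time style
    return [2 * x for x in col]

def stream(batches: Iterator[list]) -> Iterator[list]:  # iterator style
    for b in batches:
        yield [2 * x for x in b]
```

For example, `infer_udf_variant(double)` yields `"series"` while `infer_udf_variant(stream)` yields `"iterator"`, purely from the annotations.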


Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-07-01 Thread Gabor Somogyi
Hi Dongjoon,

I would add JDBC Kerberos support w/ keytab:
https://issues.apache.org/jira/browse/SPARK-12312

BR,
G


On Mon, Jun 29, 2020 at 6:07 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> After a short celebration of Apache Spark 3.0, I'd like to ask you the
> community opinion on Apache Spark 3.1 feature expectations.
>
> First of all, Apache Spark 3.1 is scheduled for December 2020.
> - https://spark.apache.org/versioning-policy.html
>
> I'm expecting the following items:
>
> 1. Support Scala 2.13
> 2. Use Apache Hadoop 3.2 by default for better cloud support
> 3. Declaring Kubernetes Scheduler GA
> In my perspective, the last main missing piece was Dynamic allocation
> and
> - Dynamic allocation with shuffle tracking is already shipped at 3.0.
> - Dynamic allocation with worker decommission/data migration is
> targeting 3.1. (Thanks, Holden)
> 4. DSv2 Stabilization
>
> I'm aware of some more features which are on the way currently, but I love
> to hear the opinions from the main developers and more over the main users
> who need those features.
>
> Thank you in advance. Welcome for any comments.
>
> Bests,
> Dongjoon.
>
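[Editor's note: for readers following the dynamic-allocation point in item 3 above, the shuffle-tracking mode that shipped in 3.0 is enabled through configuration along these lines. This is a sketch; the key names are taken from the Spark 3.0 configuration reference and should be verified against your version.]

```properties
# Enable dynamic allocation without an external shuffle service by
# tracking shuffle data on executors (Spark 3.0+).
spark.dynamicAllocation.enabled                  true
spark.dynamicAllocation.shuffleTracking.enabled  true
```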