Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Felix Cheung
+1

From: Denny Lee 
Sent: Monday, April 1, 2024 10:06:14 AM
To: Hussein Awala 
Cc: Chao Sun ; Hyukjin Kwon ; Mridul 
Muralidharan ; dev 
Subject: Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

+1 (non-binding)


On Mon, Apr 1, 2024 at 9:24 AM Hussein Awala <huss...@awala.fr> wrote:
+1 (non-binding). To add to the difference it will make: it will also simplify 
package maintenance and make it easy to release a bug fix or new feature without 
needing to wait for a PySpark release.

On Mon, Apr 1, 2024 at 4:56 PM Chao Sun <sunc...@apache.org> wrote:
+1

On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
Oh, I didn't send the discussion thread out as it's pretty simple and non-invasive, 
and the discussion was sort of done as part of the initial Spark Connect 
discussion ...

On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan <mri...@gmail.com> wrote:

Can you point me to the SPIP’s discussion thread please ?
I was not able to find it, but I was on vacation, and so might have missed this 
…


Regards,
Mridul

On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee wrote:
+1

On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon <gurwls...@apache.org> wrote:
Hi all,

I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark Connect)

JIRA
Prototype
SPIP doc

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks.


Re: Why are hash functions seeded with 42?

2022-09-30 Thread Felix Cheung
+1 to doc, seed argument would be great if possible

From: Sean Owen 
Sent: Monday, September 26, 2022 5:26:26 PM
To: Nicholas Gustafson 
Cc: dev 
Subject: Re: Why are hash functions seeded with 42?

Oh yeah, I get why we love to pick 42 for random things. I'm guessing it was a 
bit of an oversight here, as the 'seed' is directly the initial state and 0 makes 
much more sense.

On Mon, Sep 26, 2022, 7:24 PM Nicholas Gustafson <njgustaf...@gmail.com> wrote:
I don’t know the reason, however would offer a hunch that perhaps it’s a nod to 
Douglas Adams (author of The Hitchhiker’s Guide to the Galaxy).

https://news.mit.edu/2019/answer-life-universe-and-everything-sum-three-cubes-mathematics-0910

On Sep 26, 2022, at 16:59, Sean Owen <sro...@gmail.com> wrote:


OK, it came to my attention today that hash functions in Spark, like xxhash64, 
actually always seed with 42: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L655

This is an issue if you want the hash of some value in Spark to match the hash 
you compute with xxhash64 somewhere else, since, AFAICT, most other 
implementations start with seed=0.

I'm guessing there wasn't a great reason for this; it just seemed like 42 was a 
nice default seed. And we can't change it now without maybe subtly changing 
program behaviors. And I'm guessing it's messy to let the function take a 
seed argument now, especially in SQL.

So I'm left with: I guess we should document that? I can do it if so.
And just a cautionary tale, I guess, for hash function users.
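For concreteness, here is a minimal PySpark sketch of the mismatch (assuming 
PySpark 3.0+, which exposes pyspark.sql.functions.xxhash64, and the third-party 
`xxhash` package; that Spark hashes the UTF-8 bytes of a string column is an 
assumption, so this only compares a single string value):

import xxhash  # third-party package, assumed installed
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

value = "hello"
# Spark's xxhash64 hard-codes seed=42 and returns a signed 64-bit long.
spark_hash = spark.range(1).select(F.xxhash64(F.lit(value)).alias("h")).first()["h"]

# Most other implementations default to seed=0, so the digests differ.
seed0 = xxhash.xxh64_intdigest(value.encode("utf-8"))            # seed=0 (default)
seed42 = xxhash.xxh64_intdigest(value.encode("utf-8"), seed=42)  # explicit seed=42

def to_signed(u):
    # xxhash returns an unsigned 64-bit int; Spark returns a signed long.
    return u - (1 << 64) if u >= (1 << 63) else u

print(spark_hash, to_signed(seed0), to_signed(seed42))
# Expectation (for a plain string column, hashing its UTF-8 bytes): only the
# seed=42 digest should line up with Spark's result.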


Fwd: CRAN submission SparkR 3.2.0

2021-10-20 Thread Felix Cheung
-- Forwarded message -
From: Gregor Seyer 
Date: Wed, Oct 20, 2021 at 4:42 AM
Subject: Re: CRAN submission SparkR 3.2.0
To: Felix Cheung , CRAN <
cran-submissi...@r-project.org>


Thanks,

Please add \value to .Rd files regarding exported methods and explain
the functions results in the documentation. Please write about the
structure of the output (class) and also what the output means. (If a
function does not return a value, please document that too, e.g.
\value{No return value, called for side effects} or similar)
Missing Rd-tags in up to 102 .Rd files, e.g.:
  attach.Rd: \value
  avg.Rd: \value
  between.Rd: \value
  cache.Rd: \value
  cancelJobGroup.Rd: \value
  cast.Rd: \value
  ...

You have examples for unexported functions.
array_transform() in:
   hashCode.Rd
  Please either omit these examples or export the functions.

Warning: Unexecutable code in man/sparkR.session.Rd:
   sparkR.session(spark.master = "yarn", spark.submit.deployMode =
"client",:
Warning: Unexecutable code in man/write.stream.Rd:
   partitionBy:

Please do not modify the .GlobalEnv. This is not allowed by the CRAN
policies. e.g.: inst/profile/shell.R

Please do not modify the global environment (e.g. by using <<-) in your
functions. This is not allowed by the CRAN policies.  e.g.: R/utils.R


Additionally:
Have the issues why your package was archived been fixed?
Please explain this in the submission comments.


Please fix and resubmit.

Best,
Gregor Seyer

Am 19.10.21 um 19:48 schrieb CRAN submission:
> [This was generated from CRAN.R-project.org/submit.html]
>
> The following package was uploaded to CRAN:
> ===
>
> Package Information:
> Package: SparkR
> Version: 3.2.0
> Title: R Front End for 'Apache Spark'
> Author(s): Shivaram Venkataraman [aut], Xiangrui Meng [aut], Felix Cheung
>[aut, cre], The Apache Software Foundation [aut, cph]
> Maintainer: Felix Cheung 
> Depends: R (>= 3.5), methods
> Suggests: knitr, rmarkdown, markdown, testthat, e1071, survival, arrow
>(>= 1.0.0)
> Description: Provides an R Front end for 'Apache Spark'
><https://spark.apache.org>.
> License: Apache License (== 2.0)
>
>
> The maintainer confirms that he or she
> has read and agrees to the CRAN policies.
>
> =
>
> Original content of DESCRIPTION file:
>
> Package: SparkR
> Type: Package
> Version: 3.2.0
> Title: R Front End for 'Apache Spark'
> Description: Provides an R Front end for 'Apache Spark' <
https://spark.apache.org>.
> Authors@R: c(person("Shivaram", "Venkataraman", role = "aut",
>  email = "shiva...@cs.berkeley.edu"),
>   person("Xiangrui", "Meng", role = "aut",
>  email = "m...@databricks.com"),
>   person("Felix", "Cheung", role = c("aut", "cre"),
>  email = "felixche...@apache.org"),
>   person(family = "The Apache Software Foundation", role =
c("aut", "cph")))
> License: Apache License (== 2.0)
> URL: https://www.apache.org https://spark.apache.org
> BugReports: https://spark.apache.org/contributing.html
> SystemRequirements: Java (>= 8, < 12)
> Depends: R (>= 3.5), methods
> Suggests: knitr, rmarkdown, markdown, testthat, e1071, survival, arrow
>  (>= 1.0.0)
> Collate: 'schema.R' 'generics.R' 'jobj.R' 'column.R' 'group.R' 'RDD.R'
>  'pairRDD.R' 'DataFrame.R' 'SQLContext.R' 'WindowSpec.R'
>  'backend.R' 'broadcast.R' 'catalog.R' 'client.R' 'context.R'
>  'deserialize.R' 'functions.R' 'install.R' 'jvm.R'
>  'mllib_classification.R' 'mllib_clustering.R' 'mllib_fpm.R'
>  'mllib_recommendation.R' 'mllib_regression.R' 'mllib_stat.R'
>  'mllib_tree.R' 'mllib_utils.R' 'serialize.R' 'sparkR.R'
>  'stats.R' 'streaming.R' 'types.R' 'utils.R' 'window.R'
> RoxygenNote: 7.1.1
> VignetteBuilder: knitr
> NeedsCompilation: no
> Encoding: UTF-8
> Packaged: 2021-10-06 13:15:21 UTC; spark-rm
> Author: Shivaram Venkataraman [aut],
>Xiangrui Meng [aut],
>Felix Cheung [aut, cre],
>The Apache Software Foundation [aut, cph]
> Maintainer: Felix Cheung 
>


Re: CRAN package SparkR

2021-08-31 Thread Felix Cheung
I think a few lines to add the prompt might be enough. This checks for
interactive()

https://github.com/apache/spark/blob/c6a2021fec5bab9069fbfba33f75d4415ea76e99/R/pkg/R/sparkR.R#L658


On Tue, Aug 31, 2021 at 5:55 PM Hyukjin Kwon  wrote:

> Oh, I missed this. Yes, can we simply get the user's confirmation when we
> install.spark?
> IIRC, the auto installation is only triggered by the interactive shell, so
> getting the user's confirmation should be fine.
>
> On Fri, Jun 18, 2021 at 2:54 AM, Felix Cheung wrote:
>
>> Any suggestion or comment on this? They are going to remove the package
>> by 6-28.
>>
>> Seems to me if we have a switch to opt in to install (and not on by default),
>> or prompt the user in an interactive session, that should be good as user
>> confirmation.
>>
>>
>>
>> On Sun, Jun 13, 2021 at 11:25 PM Felix Cheung 
>> wrote:
>>
>>> It looks like they would not allow caching the Spark
>>> Distribution.
>>>
>>> I’m not sure what can be done about this.
>>>
>>> If I recall, the package should remove this during test. Or maybe make
>>> spark.install() optional (hence getting user confirmation)?
>>>
>>>
>>> -- Forwarded message -
>>> Date: Sun, Jun 13, 2021 at 10:19 PM
>>> Subject: CRAN package SparkR
>>> To: Felix Cheung 
>>> CC: 
>>>
>>>
>>> Dear maintainer,
>>>
>>> Checking this apparently creates the default directory as per
>>>
>>> #' @param localDir a local directory where Spark is installed. The directory contains
>>> #'                 version-specific folders of Spark packages. Default is path to
>>> #'                 the cache directory:
>>> #'                 \itemize{
>>> #'                   \item Mac OS X: \file{~/Library/Caches/spark}
>>> #'                   \item Unix: \env{$XDG_CACHE_HOME} if defined, otherwise \file{~/.cache/spark}
>>> #'                   \item Windows: \file{\%LOCALAPPDATA\%\\Apache\\Spark\\Cache}.
>>> #'                 }
>>>
>>> However, the CRAN Policy says
>>>
>>>   - Packages should not write in the user’s home filespace (including
>>> clipboards), nor anywhere else on the file system apart from the R
>>> session’s temporary directory (or during installation in the
>>> location pointed to by TMPDIR: and such usage should be cleaned
>>> up). Installing into the system’s R installation (e.g., scripts to
>>> its bin directory) is not allowed.
>>>
>>> Limited exceptions may be allowed in interactive sessions if the
>>> package obtains confirmation from the user.
>>>
>>> For R version 4.0 or later (hence a version dependency is required
>>> or only conditional use is possible), packages may store
>>> user-specific data, configuration and cache files in their
>>> respective user directories obtained from tools::R_user_dir(),
>>> provided that by default sizes are kept as small as possible and the
>>> contents are actively managed (including removing outdated
>>> material).
>>>
>>> Can you pls fix as necessary?
>>>
>>> Please fix before 2021-06-28 to safely retain your package on CRAN.
>>>
>>> Best
>>> -k
>>>
>>


Re: CRAN package SparkR

2021-06-17 Thread Felix Cheung
Any suggestion or comment on this? They are going to remove the package by
6-28.

Seems to me if we have a switch to opt in to install (and not on by default),
or prompt the user in an interactive session, that should be good as user
confirmation.



On Sun, Jun 13, 2021 at 11:25 PM Felix Cheung 
wrote:

> It looks like they would not allow caching the Spark
> Distribution.
>
> I’m not sure what can be done about this.
>
> If I recall, the package should remove this during test. Or maybe make
> spark.install() optional (hence getting user confirmation)?
>
>
> -- Forwarded message -
> Date: Sun, Jun 13, 2021 at 10:19 PM
> Subject: CRAN package SparkR
> To: Felix Cheung 
> CC: 
>
>
> Dear maintainer,
>
> Checking this apparently creates the default directory as per
>
> #' @param localDir a local directory where Spark is installed. The directory contains
> #'                 version-specific folders of Spark packages. Default is path to
> #'                 the cache directory:
> #'                 \itemize{
> #'                   \item Mac OS X: \file{~/Library/Caches/spark}
> #'                   \item Unix: \env{$XDG_CACHE_HOME} if defined, otherwise \file{~/.cache/spark}
> #'                   \item Windows: \file{\%LOCALAPPDATA\%\\Apache\\Spark\\Cache}.
> #'                 }
>
> However, the CRAN Policy says
>
>   - Packages should not write in the user’s home filespace (including
> clipboards), nor anywhere else on the file system apart from the R
> session’s temporary directory (or during installation in the
> location pointed to by TMPDIR: and such usage should be cleaned
> up). Installing into the system’s R installation (e.g., scripts to
> its bin directory) is not allowed.
>
> Limited exceptions may be allowed in interactive sessions if the
> package obtains confirmation from the user.
>
> For R version 4.0 or later (hence a version dependency is required
> or only conditional use is possible), packages may store
> user-specific data, configuration and cache files in their
> respective user directories obtained from tools::R_user_dir(),
> provided that by default sizes are kept as small as possible and the
> contents are actively managed (including removing outdated
> material).
>
> Can you pls fix as necessary?
>
> Please fix before 2021-06-28 to safely retain your package on CRAN.
>
> Best
> -k
>


Fwd: CRAN package SparkR

2021-06-14 Thread Felix Cheung
It looks like they would not allow caching the Spark
Distribution.

I’m not sure what can be done about this.

If I recall, the package should remove this during test. Or maybe make
spark.install() optional (hence getting user confirmation)?


-- Forwarded message -
Date: Sun, Jun 13, 2021 at 10:19 PM
Subject: CRAN package SparkR
To: Felix Cheung 
CC: 


Dear maintainer,

Checking this apparently creates the default directory as per

#' @param localDir a local directory where Spark is installed. The directory contains
#'                 version-specific folders of Spark packages. Default is path to
#'                 the cache directory:
#'                 \itemize{
#'                   \item Mac OS X: \file{~/Library/Caches/spark}
#'                   \item Unix: \env{$XDG_CACHE_HOME} if defined, otherwise \file{~/.cache/spark}
#'                   \item Windows: \file{\%LOCALAPPDATA\%\\Apache\\Spark\\Cache}.
#'                 }

However, the CRAN Policy says

  - Packages should not write in the user’s home filespace (including
clipboards), nor anywhere else on the file system apart from the R
session’s temporary directory (or during installation in the
location pointed to by TMPDIR: and such usage should be cleaned
up). Installing into the system’s R installation (e.g., scripts to
its bin directory) is not allowed.

Limited exceptions may be allowed in interactive sessions if the
package obtains confirmation from the user.

For R version 4.0 or later (hence a version dependency is required
or only conditional use is possible), packages may store
user-specific data, configuration and cache files in their
respective user directories obtained from tools::R_user_dir(),
provided that by default sizes are kept as small as possible and the
contents are actively managed (including removing outdated
material).

Can you pls fix as necessary?

Please fix before 2021-06-28 to safely retain your package on CRAN.

Best
-k


Re: Welcoming six new Apache Spark committers

2021-03-26 Thread Felix Cheung
Welcome!


From: Driesprong, Fokko 
Sent: Friday, March 26, 2021 1:25:33 PM
To: Matei Zaharia 
Cc: Spark Dev List 
Subject: Re: Welcoming six new Apache Spark committers

Well deserved all! Welcome!

On Fri, 26 Mar 2021 at 21:21, Matei Zaharia <matei.zaha...@gmail.com> wrote:
Hi all,

The Spark PMC recently voted to add several new committers. Please join me in 
welcoming them to their new role! Our new committers are:

- Maciej Szymkiewicz (contributor to PySpark)
- Max Gekk (contributor to Spark SQL)
- Kent Yao (contributor to Spark SQL)
- Attila Zsolt Piros (contributor to decommissioning and Spark on Kubernetes)
- Yi Wu (contributor to Spark Core and SQL)
- Gabor Somogyi (contributor to Streaming and security)

All six of them contributed to Spark 3.1 and we’re very excited to have them 
join as committers.

Matei and the Spark PMC
-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-05 Thread Felix Cheung
Congrats and thanks!


From: Hyukjin Kwon 
Sent: Wednesday, March 3, 2021 4:09:23 PM
To: Dongjoon Hyun 
Cc: Gabor Somogyi ; Jungtaek Lim 
; angers zhu ; Wenchen Fan 
; Kent Yao ; Takeshi Yamamuro 
; dev ; user @spark 

Subject: Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

Thank you so much, guys... it indeed took a long time and it was pretty tough 
this time :-).
It was all possible because of your support. I sincerely appreciate it.

On Thu, Mar 4, 2021 at 2:26 AM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
It took a long time. Thank you, Hyukjin and all!

Bests,
Dongjoon.

On Wed, Mar 3, 2021 at 3:23 AM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
Good to hear and great work Hyukjin! 

On Wed, 3 Mar 2021, 11:15 Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
Thanks Hyukjin for driving the huge release, and thanks everyone for 
contributing the release!

On Wed, Mar 3, 2021 at 6:54 PM angers zhu <angers@gmail.com> wrote:
Great work, Hyukjin !

Bests,
Angers

On Wed, Mar 3, 2021 at 5:02 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
Great work and congrats!

On Wed, Mar 3, 2021 at 3:51 PM Kent Yao <yaooq...@qq.com> wrote:
Congrats, all!

Bests,
Kent Yao
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.
a spark enthusiast
kyuubi: a unified multi-tenant JDBC interface for large-scale data processing 
and analytics, built on top of Apache Spark.
spark-authorizer: a Spark SQL extension which provides SQL Standard Authorization 
for Apache Spark.
spark-postgres: a library for reading data from and transferring data to 
Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
spark-func-extras: a library that brings excellent and useful functions from 
various modern database management systems to Apache Spark.



On 03/3/2021 15:11, Takeshi Yamamuro wrote:
Great work and Congrats, all!

Bests,
Takeshi

On Wed, Mar 3, 2021 at 2:18 PM Mridul Muralidharan <mri...@gmail.com> wrote:

Thanks Hyukjin and congratulations everyone on the release !

Regards,
Mridul

On Tue, Mar 2, 2021 at 8:54 PM Yuming Wang <wgy...@gmail.com> wrote:
Great work, Hyukjin!

On Wed, Mar 3, 2021 at 9:50 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
We are excited to announce Spark 3.1.1 today.

Apache Spark 3.1.1 is the second release of the 3.x line. This release adds
Python type annotations and Python dependency management support as part of 
Project Zen.
Other major updates include improved ANSI SQL compliance support, history 
server support
in structured streaming, the general availability (GA) of Kubernetes and node 
decommissioning
in Kubernetes and Standalone. In addition, this release continues to focus on 
usability, stability,
and polish while resolving around 1500 tickets.

We'd like to thank our contributors and users for their contributions and early 
feedback to
this release. This release would not have been possible without you.

To download Spark 3.1.1, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-1-1.html



--
---
Takeshi Yamamuro


Re: Recovering SparkR on CRAN?

2020-12-30 Thread Felix Cheung
We could just submit the latest release with the fix again. I would not
recommend waiting; oftentimes there are external changes that are not
caught, and a fix will need to go through a release vote.

What is the latest release with your fix? 3.0.1? I can put it in, but will
need to make sure we can get hold of Shivaram.


On Tue, Dec 29, 2020 at 11:05 PM Hyukjin Kwon  wrote:

> Let me try in this release - I will have to ask some questions to both of
> you. I will email you guys offline or private mailing list.
> If I happen to be stuck for a difficult reason, I think we can consider
> dropping it as Dongjoon initially pointed out.
>
> On Wed, Dec 30, 2020 at 1:59 PM, Felix Cheung wrote:
>
>> Ah, I don’t recall actually - maybe it was just missed?
>>
>> The last message I had, was in June when it was broken by R 4.0.1, which
>> was fixed.
>>
>>
>> On Tue, Dec 29, 2020 at 7:21 PM Hyukjin Kwon  wrote:
>>
>>> BTW, I remember I fixed all standing issues at
>>> https://issues.apache.org/jira/browse/SPARK-31918 and
>>> https://issues.apache.org/jira/browse/SPARK-32073.
>>> I wonder why other releases were not uploaded yet. Do you guys know any
>>> context or if there is a standing issue on this, @Felix Cheung
>>>  or @Shivaram Venkataraman
>>> ?
>>>
>>> On Wed, Dec 23, 2020 at 11:21 AM, Mridul Muralidharan wrote:
>>>
>>>>
>>>> I agree, is there something we can do to ensure CRAN publish goes
>>>> through consistently and predictably ?
>>>> If possible, it would be good to continue supporting it.
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>> On Tue, Dec 22, 2020 at 7:48 PM Felix Cheung 
>>>> wrote:
>>>>
>>>>> Ok - it took many years to get it first published, so it was hard to
>>>>> get there.
>>>>>
>>>>>
>>>>> On Tue, Dec 22, 2020 at 5:45 PM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> Adding @Shivaram Venkataraman  and @Felix
>>>>>> Cheung  FYI
>>>>>>
>>>>>> On Wed, Dec 23, 2020 at 9:22 AM, Michael Heuer wrote:
>>>>>>
>>>>>>> Anecdotally, as a project downstream of Spark, we've been prevented
>>>>>>> from pushing to CRAN because of this
>>>>>>>
>>>>>>> https://github.com/bigdatagenomics/adam/issues/1851
>>>>>>>
>>>>>>> We've given up and marked as WontFix.
>>>>>>>
>>>>>>>michael
>>>>>>>
>>>>>>>
>>>>>>> On Dec 22, 2020, at 5:14 PM, Dongjoon Hyun 
>>>>>>> wrote:
>>>>>>>
>>>>>>> Given the current circumstance, I'm thinking of dropping it
>>>>>>> officially from the community release scope.
>>>>>>>
>>>>>>> It's because
>>>>>>>
>>>>>>> - It turns out that our CRAN check is insufficient to guarantee the
>>>>>>> availability of SparkR on CRAN.
>>>>>>>   Apache Spark 3.1.0 may not be available on CRAN, either.
>>>>>>>
>>>>>>> - In daily CIs, CRAN check has been broken frequently due to both
>>>>>>> our side and CRAN side issues. Currently, branch-2.4 is broken.
>>>>>>>
>>>>>>> - It also has a side-effect to cause some delays on the official
>>>>>>> release announcement after RC passes because each release manager takes 
>>>>>>> a
>>>>>>> look at it if he/she can recover it at that release.
>>>>>>>
>>>>>>> If we are unable to support SparkR on CRAN in a sustainable way,
>>>>>>> what about dropping it official instead?
>>>>>>>
>>>>>>> Then, it will alleviate burdens on release managers and improves
>>>>>>> daily CIs' stability by removing the CRAN check.
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Dec 21, 2020 at 7:09 AM Dongjoon Hyun <
>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, All.
>>>>>>>>
>>>>>>>> The last `SparkR` package of Apache Spark in CRAN is `2.4.6`.
>>>>>>>>
>>>>>>>>
>>>>>>>> https://cran-archive.r-project.org/web/checks/2020/2020-07-10_check_results_SparkR.html
>>>>>>>>
>>>>>>>> The latest three Apache Spark distributions (2.4.7/3.0.0/3.0.1) are
>>>>>>>> not published to CRAN and the lack of SparkR on CRAN has been 
>>>>>>>> considered a
>>>>>>>> non-release blocker.
>>>>>>>>
>>>>>>>> I'm wondering if we are aiming to recover it in Apache Spark 3.1.0.
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>
>>>>>>>


Re: Recovering SparkR on CRAN?

2020-12-29 Thread Felix Cheung
Ah, I don’t recall actually - maybe it was just missed?

The last message I had, was in June when it was broken by R 4.0.1, which
was fixed.


On Tue, Dec 29, 2020 at 7:21 PM Hyukjin Kwon  wrote:

> BTW, I remember I fixed all standing issues at
> https://issues.apache.org/jira/browse/SPARK-31918 and
> https://issues.apache.org/jira/browse/SPARK-32073.
> I wonder why other releases were not uploaded yet. Do you guys know any
> context or if there is a standing issue on this, @Felix Cheung
>  or @Shivaram Venkataraman
> ?
>
> On Wed, Dec 23, 2020 at 11:21 AM, Mridul Muralidharan wrote:
>
>>
>> I agree, is there something we can do to ensure CRAN publish goes through
>> consistently and predictably ?
>> If possible, it would be good to continue supporting it.
>>
>> Regards,
>> Mridul
>>
>> On Tue, Dec 22, 2020 at 7:48 PM Felix Cheung 
>> wrote:
>>
>>> Ok - it took many years to get it first published, so it was hard to get
>>> there.
>>>
>>>
>>> On Tue, Dec 22, 2020 at 5:45 PM Hyukjin Kwon 
>>> wrote:
>>>
>>>> Adding @Shivaram Venkataraman  and @Felix
>>>> Cheung  FYI
>>>>
>>>> On Wed, Dec 23, 2020 at 9:22 AM, Michael Heuer wrote:
>>>>
>>>>> Anecdotally, as a project downstream of Spark, we've been prevented
>>>>> from pushing to CRAN because of this
>>>>>
>>>>> https://github.com/bigdatagenomics/adam/issues/1851
>>>>>
>>>>> We've given up and marked as WontFix.
>>>>>
>>>>>michael
>>>>>
>>>>>
>>>>> On Dec 22, 2020, at 5:14 PM, Dongjoon Hyun 
>>>>> wrote:
>>>>>
>>>>> Given the current circumstance, I'm thinking of dropping it officially
>>>>> from the community release scope.
>>>>>
>>>>> It's because
>>>>>
>>>>> - It turns out that our CRAN check is insufficient to guarantee the
>>>>> availability of SparkR on CRAN.
>>>>>   Apache Spark 3.1.0 may not be available on CRAN, either.
>>>>>
>>>>> - In daily CIs, CRAN check has been broken frequently due to both our
>>>>> side and CRAN side issues. Currently, branch-2.4 is broken.
>>>>>
>>>>> - It also has a side-effect to cause some delays on the official
>>>>> release announcement after RC passes because each release manager takes a
>>>>> look at it if he/she can recover it at that release.
>>>>>
>>>>> If we are unable to support SparkR on CRAN in a sustainable way, what
>>>>> about dropping it official instead?
>>>>>
>>>>> Then, it will alleviate burdens on release managers and improves daily
>>>>> CIs' stability by removing the CRAN check.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Mon, Dec 21, 2020 at 7:09 AM Dongjoon Hyun 
>>>>> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> The last `SparkR` package of Apache Spark in CRAN is `2.4.6`.
>>>>>>
>>>>>>
>>>>>> https://cran-archive.r-project.org/web/checks/2020/2020-07-10_check_results_SparkR.html
>>>>>>
>>>>>> The latest three Apache Spark distributions (2.4.7/3.0.0/3.0.1) are
>>>>>> not published to CRAN and the lack of SparkR on CRAN has been considered 
>>>>>> a
>>>>>> non-release blocker.
>>>>>>
>>>>>> I'm wondering if we are aiming to recover it in Apache Spark 3.1.0.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>
>>>>>


Re: Recovering SparkR on CRAN?

2020-12-22 Thread Felix Cheung
Ok - it took many years to get it first published, so it was hard to get
there.


On Tue, Dec 22, 2020 at 5:45 PM Hyukjin Kwon  wrote:

> Adding @Shivaram Venkataraman  and @Felix
> Cheung  FYI
>
> On Wed, Dec 23, 2020 at 9:22 AM, Michael Heuer wrote:
>
>> Anecdotally, as a project downstream of Spark, we've been prevented from
>> pushing to CRAN because of this
>>
>> https://github.com/bigdatagenomics/adam/issues/1851
>>
>> We've given up and marked as WontFix.
>>
>>michael
>>
>>
>> On Dec 22, 2020, at 5:14 PM, Dongjoon Hyun 
>> wrote:
>>
>> Given the current circumstance, I'm thinking of dropping it officially
>> from the community release scope.
>>
>> It's because
>>
>> - It turns out that our CRAN check is insufficient to guarantee the
>> availability of SparkR on CRAN.
>>   Apache Spark 3.1.0 may not be available on CRAN, either.
>>
>> - In daily CIs, CRAN check has been broken frequently due to both our
>> side and CRAN side issues. Currently, branch-2.4 is broken.
>>
>> - It also has a side-effect to cause some delays on the official release
>> announcement after RC passes because each release manager takes a look at
>> it if he/she can recover it at that release.
>>
>> If we are unable to support SparkR on CRAN in a sustainable way, what
>> about dropping it official instead?
>>
>> Then, it will alleviate burdens on release managers and improves daily
>> CIs' stability by removing the CRAN check.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Dec 21, 2020 at 7:09 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> The last `SparkR` package of Apache Spark in CRAN is `2.4.6`.
>>>
>>>
>>> https://cran-archive.r-project.org/web/checks/2020/2020-07-10_check_results_SparkR.html
>>>
>>> The latest three Apache Spark distributions (2.4.7/3.0.0/3.0.1) are not
>>> published to CRAN and the lack of SparkR on CRAN has been considered a
>>> non-release blocker.
>>>
>>> I'm wondering if we are aiming to recover it in Apache Spark 3.1.0.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>
>>


Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Felix Cheung
So IMO maintaining it outside in a separate repo is going to be harder. That was 
why I asked.




From: Maciej Szymkiewicz 
Sent: Tuesday, August 4, 2020 12:59 PM
To: Sean Owen
Cc: Felix Cheung; Hyukjin Kwon; Driesprong, Fokko; Holden Karau; Spark Dev List
Subject: Re: [PySpark] Revisiting PySpark type annotations


On 8/4/20 9:35 PM, Sean Owen wrote:
> Yes, but the general argument you make here is: if you tie this
> project to the main project, it will _have_ to be maintained by
> everyone. That's good, but also exactly I think the downside we want
> to avoid at this stage (I thought?) I understand for some
> undertakings, it's just not feasible to start outside the main
> project, but is there no proof of concept even possible before taking
> this step -- which more or less implies it's going to be owned and
> merged and have to be maintained in the main project.


I think we have a bit different understanding here ‒ I believe we have
reached a conclusion that maintaining annotations within the project is
OK, we only differ when it comes to specific form it should take.

As of POC ‒ we have stubs, which have been maintained over three years
now and cover versions between 2.3 (though these are fairly limited) to,
with some lag, current master.  There is some evidence there are used in
the wild
(https://github.com/zero323/pyspark-stubs/network/dependents?package_id=UGFja2FnZS02MzU1MTc4Mg%3D%3D),
there are a few contributors
(https://github.com/zero323/pyspark-stubs/graphs/contributors) and at
least some use cases (https://stackoverflow.com/q/40163106/). So,
subjectively speaking, it seems we're already beyond POC.

--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC




Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Felix Cheung
What would be the reason for separate git repo?


From: Hyukjin Kwon 
Sent: Monday, August 3, 2020 1:58:55 AM
To: Maciej Szymkiewicz 
Cc: Driesprong, Fokko ; Holden Karau 
; Spark Dev List 
Subject: Re: [PySpark] Revisiting PySpark type annotations

Okay, seems like we can create a separate repo as apache/spark? e.g.) 
https://issues.apache.org/jira/browse/INFRA-20470
We can also think about porting the files as are.
I will try to have a short sync with the author Maciej, and share what we 
discussed offline.


On Wed, Jul 22, 2020 at 10:43 PM, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote:


On Wednesday, July 22, 2020, Driesprong, Fokko wrote:
That's probably one-time overhead so it is not a big issue. In my opinion, a 
bigger one is the possible complexity. Annotations tend to introduce a lot of 
cyclic dependencies in the Spark codebase. This can be addressed, but it doesn't 
look great.

This is not true (anymore). With Python 3.6 you can add string annotations -> 
'DenseVector', and in the future with Python 3.7 this is fixed by having 
postponed evaluation: https://www.python.org/dev/peps/pep-0563/

As far as I recall, the linked PEP addresses back-references, not cyclic dependencies, 
which weren't a big issue in the first place.

What I mean is actual cyclic stuff - for example, pyspark.context depends on 
pyspark.rdd and the other way around. These dependencies are not explicit at the 
moment.
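As a rough single-module sketch (hypothetical module and class names, not 
PySpark's actual code) of how PEP 563 postponed annotations plus 
typing.TYPE_CHECKING keep such a cycle out of the runtime import graph:

from __future__ import annotations   # PEP 563: annotations are not evaluated at runtime
from typing import TYPE_CHECKING, List

if TYPE_CHECKING:
    # Seen only by type checkers (e.g. mypy); never executed at runtime,
    # so it cannot create an import cycle with this module.
    from hypothetical_rdd_module import RDD

class ContextLike:
    """Stand-in for something like pyspark.context.SparkContext."""

    def parallelize(self, data: List[int]) -> RDD:
        # At runtime the annotation above is just a string; the class is only
        # needed when the method is actually called, so the import is deferred.
        from hypothetical_rdd_module import RDD
        return RDD(data, self)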


Merging stubs into the project structure, on the other hand, has almost no overhead.

This feels awkward to me, this is like having the docstring in a separate file. 
In my opinion you want to have the signatures and the functions together for 
transparency and maintainability.


I guess that's a matter of preference. From a maintainability perspective it is 
actually much easier to have separate objects.

For example, there are different types of objects that are required for 
meaningful checking which don't really exist in real code (protocols, aliases, 
code-generated signatures for complex overloads), as well as some monkey-patched 
entities.

Additionally it is often easier to see inconsistencies when typing is separate.

However, I am not implying that this should be a persistent state.

In general I see two non breaking paths here.

 - Merge pyspark-stubs as a separate subproject within the main Spark repo, keep it 
in sync there with a common CI pipeline, and transfer ownership of the PyPI package 
to the ASF.
- Move stubs directly into python/pyspark and then apply individual stubs to 
modules of choice.

Of course, the first proposal could be an initial step for the latter one.


I think DBT is a very nice project where they use annotations very well: 
https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/core/dbt/graph/queue.py

Also, they left out the types in the docstring, since they are available in the 
annotations itself.


In practice, the biggest advantage is actually support for completion, not type 
checking (which works in simple cases).

Agreed.

Would you be interested in writing up the Outreachy proposal for work on this?

I would be, and also happy to mentor. But, I think we first need to agree as a 
Spark community if we want to add the annotations to the code, and in which 
extend.




At some point (in general when things are heavy in generics, which is the case 
here), annotations become somewhat painful to write.

That's true, but that might also be a pointer that it is time to refactor the 
function/code :)

That might be the case, but it is more often a matter of capturing useful properties 
combined with the requirement to keep things in sync with Scala counterparts.


For now, I tend to think adding type hints to the code makes it difficult to 
backport or revert, and more difficult to discuss typing only, especially 
considering typing is arguably premature yet.

This feels a bit weird to me, since you want to keep this in sync right? Do you 
provide different stubs for different versions of Python? I had to look up the 
literals: https://www.python.org/dev/peps/pep-0586/

I think it is more about portability between Spark versions


Cheers, Fokko

On Wed, 22 Jul 2020 at 09:40, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote:

On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
> For now, I tend to think adding type hints to the code makes it
> difficult to backport or revert, and
> more difficult to discuss typing only, especially considering
> typing is arguably premature yet.

About being premature ‒ since the typing ecosystem evolves much faster than
Spark, it might be preferable to keep annotations as a separate project
(preferably under the ASF / Spark umbrella). It allows for faster iterations
and supporting new features (for example, Literals proved to be very
useful) without waiting for the next Spark release.
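For instance, a tiny illustration (a made-up function, not a real PySpark 
signature) of why Literal is useful in stubs:

from typing import Literal  # Python 3.8+, or typing_extensions on older versions

JoinHow = Literal["inner", "left", "right", "outer"]

def join_frames(how: JoinHow = "inner") -> str:
    # A stand-in for something like DataFrame.join(..., how=...)
    return f"joining with how={how}"

join_frames("left")    # fine
join_frames("lefft")   # runs, but a type checker such as mypy flags the typo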

--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: 

Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-22 Thread Felix Cheung
+1


From: Holden Karau 
Sent: Wednesday, July 22, 2020 10:49:49 AM
To: Steve Loughran 
Cc: dev 
Subject: Re: Exposing Spark parallelized directory listing & non-locality 
listing in core

Wonderful. To be clear the patch is more to start the discussion about how we 
want to do it and less what I think is the right way.

On Wed, Jul 22, 2020 at 10:47 AM Steve Loughran <ste...@cloudera.com> wrote:


On Wed, 22 Jul 2020 at 00:51, Holden Karau <hol...@pigscanfly.ca> wrote:
Hi Folks,

In Spark SQL there is the ability to have Spark do its partition 
discovery/file listing in parallel on the worker nodes and also avoid locality 
lookups. I'd like to expose this in core, but given the Hadoop APIs it's a bit 
more complicated to do right. I

That's ultimately fixable, if we can sort out what's good from the app side and 
reconcile that with 'what is not pathologically bad across both HDFS and object 
stores".

Bad: globStatus, anything which returns an array rather than a remote iterator, 
encourages treewalk
Good: deep recursive listings, remote iterator results for: incremental/async 
fetch of next page of listing, soon: option for iterator, if cast to 
IOStatisticsSource, actually serve up stats on IO performance during the 
listing. (e.g. #of list calls, mean time to get a list response back., store 
throttle events)

Also look at LocatedFileStatus to see how it parallelises its work. its not 
perfect because wildcards are supported, which means globStatus gets used

happy to talk about this some more, and I'll review the patch

-steve

made a quick POC and two potential different paths we could do for 
implementation and wanted to see if anyone had thoughts - 
https://github.com/apache/spark/pull/29179.

Cheers,

Holden

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

YouTube Live Streams: https://www.youtube.com/user/holdenkarau


--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Welcoming some new Apache Spark committers

2020-07-15 Thread Felix Cheung
Welcome!


From: Nick Pentreath 
Sent: Tuesday, July 14, 2020 10:21:17 PM
To: dev 
Cc: Dilip Biswal ; Jungtaek Lim 
; huaxin gao 
Subject: Re: Welcoming some new Apache Spark committers

Congratulations and welcome as Apache Spark committers!

On Wed, 15 Jul 2020 at 06:59, Prashant Sharma <scrapco...@gmail.com> wrote:
Congratulations all ! It's great to have such committed folks as committers. :)

On Wed, Jul 15, 2020 at 9:24 AM Yi Wu <yi...@databricks.com> wrote:
Congrats!!

On Wed, Jul 15, 2020 at 8:02 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
Congrats!

On Wed, Jul 15, 2020 at 7:56 AM, Takeshi Yamamuro <linguin@gmail.com> wrote:
Congrats, all!

On Wed, Jul 15, 2020 at 5:15 AM Takuya UESHIN <ues...@happy-camper.st> wrote:
Congrats and welcome!

On Tue, Jul 14, 2020 at 1:07 PM Bryan Cutler <cutl...@gmail.com> wrote:
Congratulations and welcome!

On Tue, Jul 14, 2020 at 12:36 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:
Welcome, Huaxin, Jungtaek, and Dilip!

Congratulations!

On Tue, Jul 14, 2020 at 10:37 AM Matei Zaharia <matei.zaha...@gmail.com> wrote:
Hi all,

The Spark PMC recently voted to add several new committers. Please join me in 
welcoming them to their new roles! The new committers are:

- Huaxin Gao
- Jungtaek Lim
- Dilip Biswal

All three of them contributed to Spark 3.0 and we’re excited to have them join 
the project.

Matei and the Spark PMC
-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org



--
Takuya UESHIN



--
---
Takeshi Yamamuro


Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-07-05 Thread Felix Cheung
I think pluggable storage in shuffle is essential for k8s GA


From: Holden Karau 
Sent: Monday, June 29, 2020 9:33 AM
To: Maxim Gekk
Cc: Dongjoon Hyun; dev
Subject: Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Should we also consider the shuffle service refactoring to support pluggable 
storage engines as targeting the 3.1 release?

On Mon, Jun 29, 2020 at 9:31 AM Maxim Gekk <maxim.g...@databricks.com> wrote:
Hi Dongjoon,

I would add:
- Filters pushdown to JSON (https://github.com/apache/spark/pull/27366)
- Filters pushdown to other datasources like Avro
- Support nested attributes of filters pushed down to JSON

Maxim Gekk

Software Engineer

Databricks, Inc.


On Mon, Jun 29, 2020 at 7:07 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Hi, All.

After a short celebration of Apache Spark 3.0, I'd like to ask you the 
community opinion on Apache Spark 3.1 feature expectations.

First of all, Apache Spark 3.1 is scheduled for December 2020.
- https://spark.apache.org/versioning-policy.html

I'm expecting the following items:

1. Support Scala 2.13
2. Use Apache Hadoop 3.2 by default for better cloud support
3. Declaring Kubernetes Scheduler GA
In my perspective, the last main missing piece was Dynamic allocation and
- Dynamic allocation with shuffle tracking is already shipped at 3.0.
- Dynamic allocation with worker decommission/data migration is targeting 
3.1. (Thanks, Holden)
4. DSv2 Stabilization

I'm aware of some more features which are on the way currently, but I love to 
hear the opinions from the main developers and more over the main users who 
need those features.

Thank you in advance. Welcome for any comments.

Bests,
Dongjoon.


--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Fwd: Announcing ApacheCon @Home 2020

2020-07-01 Thread Felix Cheung

-- Forwarded message -

We are pleased to announce that ApacheCon @Home will be held online,
September 29 through October 1.

More event details are available at https://apachecon.com/acah2020 but
there’s a few things that I want to highlight for you, the members.

Yes, the CFP has been reopened. It will be open until the morning of
July 13th. With no restrictions on space/time at the venue, we can
accept talks from a much wider pool of speakers, so we look forward to
hearing from those of you who may have been reluctant, or unwilling, to
travel to the US.
Yes, you can add your project to the event, whether that’s one talk, or
an entire track - we have the room now. Those of you who are PMC members
will be receiving information about how to get your projects represented
at the event.
Attendance is free, as has been the trend in these events in our
industry. We do, however, offer donation options for attendees who feel
that our content is worth paying for.
Sponsorship opportunities are available immediately at
https://www.apachecon.com/acna2020/sponsors.html

If you would like to volunteer to help, we ask that you join the
plann...@apachecon.com mailing list and discuss 
it there, rather than
here, so that we do not have a split discussion, while we’re trying to
coordinate all of the things we have to get done in this very short time
window.

Rich Bowen,
VP Conferences, The Apache Software Foundation




Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Felix Cheung
Congrats


From: Jungtaek Lim 
Sent: Thursday, June 18, 2020 8:18:54 PM
To: Hyukjin Kwon 
Cc: Mridul Muralidharan ; Reynold Xin ; 
dev ; user 
Subject: Re: [ANNOUNCE] Apache Spark 3.0.0

Great, thanks all for your efforts on the huge step forward!

On Fri, Jun 19, 2020 at 12:13 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
Yay!

On Fri, Jun 19, 2020 at 4:46 AM, Mridul Muralidharan <mri...@gmail.com> wrote:
Great job everyone ! Congratulations :-)

Regards,
Mridul

On Thu, Jun 18, 2020 at 10:21 AM Reynold Xin <r...@databricks.com> wrote:

Hi all,

Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many of 
the innovations from Spark 2.x, bringing new ideas as well as continuing 
long-term projects that have been in development. This release resolves more 
than 3400 tickets.

We'd like to thank our contributors and users for their contributions and early 
feedback to this release. This release would not have been possible without you.

To download Spark 3.0.0, head over to the download page: 
http://spark.apache.org/downloads.html

To view the release notes: 
https://spark.apache.org/releases/spark-release-3-0-0.html





Re: More publicly documenting the options under spark.sql.*

2020-01-16 Thread Felix Cheung
I think it’s a good idea


From: Hyukjin Kwon 
Sent: Wednesday, January 15, 2020 5:49:12 AM
To: dev 
Cc: Sean Owen ; Nicholas Chammas 
Subject: Re: More publicly documenting the options under spark.sql.*

Resending to the dev list for archive purpose:

I think automatically creating a configuration page isn't a bad idea because I 
think we deprecate and remove configurations which are not created via 
.internal() in SQLConf anyway.

I already tried this automatic generation from the code for SQL built-in 
functions, and I'm pretty sure we can do a similar thing for configurations as 
well.

We could perhaps mimic what hadoop does 
https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/core-default.xml
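As a rough sketch of the idea (the Markdown formatting below is just an 
illustration, not a proposed mechanism), the existing "SET -v" command already 
exposes the non-internal SQL configs with their descriptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "SET -v" returns key / value / meaning for SQL configs not marked .internal().
rows = spark.sql("SET -v").collect()

with open("sql-configuration.md", "w") as out:
    out.write("| Property | Default | Meaning |\n|---|---|---|\n")
    for row in rows:
        out.write("| `%s` | %s | %s |\n" % (row["key"], row["value"], row["meaning"]))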

On Wed, 15 Jan 2020, 22:46 Hyukjin Kwon <gurwls...@gmail.com> wrote:
I think automatically creating a configuration page isn't a bad idea because I 
think we deprecate and remove configurations which are not created via 
.internal() in SQLConf anyway.

I already tried this automatic generation from the codes at SQL built-in 
functions and I'm pretty sure we can do the similar thing for configurations as 
well.

We could perhaps mimic what hadoop does 
https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/core-default.xml

On Wed, 15 Jan 2020, 10:46 Sean Owen <sro...@gmail.com> wrote:
Some of it is intentionally undocumented, as far as I know, as an
experimental option that may change, or legacy, or safety valve flag.
Certainly anything that's marked an internal conf. (That does raise
the question of who it's for, if you have to read source to find it.)

I don't know if we need to overhaul the conf system, but there may
indeed be some confs that could legitimately be documented. I don't
know which.

On Tue, Jan 14, 2020 at 7:32 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
> I filed SPARK-30510 thinking that we had forgotten to document an option, but 
> it turns out that there's a whole bunch of stuff under SQLConf.scala that has 
> no public documentation under http://spark.apache.org/docs.
>
> Would it be appropriate to somehow automatically generate a documentation 
> page from SQLConf.scala, as Hyukjin suggested on that ticket?
>
> Another thought that comes to mind is moving the config definitions out of 
> Scala and into a data format like YAML or JSON, and then sourcing that both 
> for SQLConf as well as for whatever documentation page we want to generate. 
> What do you think of that idea?
>
> Nick
>

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org



Re: Enabling fully disaggregated shuffle on Spark

2019-11-20 Thread Felix Cheung
Great!

Due to number of constraints I won’t be sending link directly here but please r 
me and I will add you.



From: Ben Sidhom 
Sent: Wednesday, November 20, 2019 9:10:01 AM
To: John Zhuge 
Cc: bo yang ; Amogh Margoor ; Ryan Blue 
; Ben Sidhom ; Spark Dev List 
; Christopher Crosbie ; Griselda 
Cuevas ; Holden Karau ; Mayank Ahuja 
; Kalyan Sivakumar ; alfo...@fb.com 
; Felix Cheung ; Matt Cheah 
; Yifei Huang (PD) 
Subject: Re: Enabling fully disaggregated shuffle on Spark

That sounds great!

On Wed, Nov 20, 2019 at 9:02 AM John Zhuge <jzh...@apache.org> wrote:
That will be great. Please send us the invite.

On Wed, Nov 20, 2019 at 8:56 AM bo yang <bobyan...@gmail.com> wrote:
Cool, thanks Ryan, John, Amogh for the reply! Great to see you interested! 
Felix will have a Spark Scalability & Reliability Sync meeting on Dec 4 1pm 
PST. We could discuss more details there. Do you want to join?

On Tue, Nov 19, 2019 at 4:23 PM Amogh Margoor <amo...@qubole.com> wrote:
We at Qubole are also looking at disaggregating shuffle on Spark. Would love to 
collaborate and share learnings.

Regards,
Amogh

On Tue, Nov 19, 2019 at 4:09 PM John Zhuge <jzh...@apache.org> wrote:
Great work, Bo! Would love to hear the details.


On Tue, Nov 19, 2019 at 4:05 PM Ryan Blue  wrote:
I'm interested in remote shuffle services as well. I'd love to hear about what 
you're using in production!

rb

On Tue, Nov 19, 2019 at 2:43 PM bo yang <bobyan...@gmail.com> wrote:
Hi Ben,

Thanks for the writing up! This is Bo from Uber. I am in Felix's team in 
Seattle, and working on disaggregated shuffle (we called it remote shuffle 
service, RSS, internally). We have put RSS into production for a while, and 
learned a lot during the work (tried quite a few techniques to improve the 
remote shuffle performance). We could share our learning with the community, 
and also would like to hear feedback/suggestions on how to further improve 
remote shuffle performance. We could chat more details if you or other people 
are interested.

Best,
Bo

On Fri, Nov 15, 2019 at 4:10 PM Ben Sidhom  wrote:

I would like to start a conversation about extending the Spark shuffle manager 
surface to support fully disaggregated shuffle implementations. This is closely 
related to the work in 
SPARK-25299<https://issues.apache.org/jira/browse/SPARK-25299>, which is 
focused on refactoring the shuffle manager API (and in particular, 
SortShuffleManager) to use a pluggable storage backend. The motivation for that 
SPIP is further enabling Spark on Kubernetes.


The motivation for this proposal is enabling full externalized (disaggregated) 
shuffle service implementations. (Facebook’s Cosco 
shuffle<https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service>
 is one example of such a disaggregated shuffle service.) These changes allow 
the bulk of the shuffle to run in a remote service such that minimal state 
resides in executors and local disk spill is minimized. The net effect is 
increased job stability and performance improvements in certain scenarios. 
These changes should work well with or are complementary to SPARK-25299. Some 
or all points may be merged into that issue as appropriate.


Below is a description of each component of this proposal. These changes can 
ideally be introduced incrementally. I would like to gather feedback and gauge 
interest from others in the community to collaborate on this. There are likely 
more points that would be useful to disaggregated shuffle services. We can 
outline a more concrete plan after gathering enough input. A working session 
could help us kick off this joint effort; maybe something in the mid-January to 
mid-February timeframe (depending on interest and availability). I’m happy to 
host at our Sunnyvale, CA offices.


Proposal
Scheduling and re-executing tasks

Allow coordination between the service and the Spark DAG scheduler as to 
whether a given block/partition needs to be recomputed when a task fails or 
when shuffle block data cannot be read. Having such coordination is important, 
e.g., for suppressing recomputation after aborted executors or for forcing late 
recomputation if the service internally acts as a cache. One catchall solution 
is to have the shuffle manager provide an indication of whether shuffle data is 
external to executors (or nodes). Another option: allow the shuffle manager 
(likely on the driver) to be queried for the existence of shuffle data for a 
given executor ID (or perhaps map task, reduce task, etc). Note that this is at 
the level of data the scheduler is aware of (i.e., map/reduce partitions) 
rather than block IDs, which are internal details for some shuffle managers.
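As a purely illustrative pseudo-interface (hypothetical names in Python; the real 
ShuffleManager API is Scala, and none of these methods exist today), the 
coordination point described above could look roughly like:

from abc import ABC, abstractmethod

class DisaggregatedShuffleCoordination(ABC):
    """Hypothetical hooks the DAG scheduler could consult before recomputation."""

    @abstractmethod
    def shuffle_data_is_external(self) -> bool:
        """Catch-all: is shuffle output stored outside executors/nodes?"""

    @abstractmethod
    def has_shuffle_output(self, shuffle_id: int, map_index: int) -> bool:
        """Is output for this map partition still readable (e.g. not evicted)?"""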

ShuffleManager API

Add a heartbeat (keep-alive) mechanism to RDD shuffle output so that the 
service knows that data is still active. This is one way to enable 
time-/j

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Felix Cheung
Just to add - hive 1.2 fork is definitely not more stable. We know of a few 
critical bug fixes that we cherry picked into a fork of that fork to maintain 
ourselves.



From: Dongjoon Hyun 
Sent: Wednesday, November 20, 2019 11:07:47 AM
To: Sean Owen 
Cc: dev 
Subject: Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

Thanks. That will be a giant step forward, Sean!

> I'd prefer making it the default in the POM for 3.0.

Bests,
Dongjoon.

On Wed, Nov 20, 2019 at 11:02 AM Sean Owen <sro...@gmail.com> wrote:
Yeah 'stable' is ambiguous. It's old and buggy, but at least it's the
same old and buggy that's been there a while. "stable" in that sense
I'm sure there is a lot more delta between Hive 1 and 2 in terms of
bug fixes that are important; the question isn't just 1.x releases.

What I don't know is how much affects Spark, as it's a Hive client
mostly. Clearly some do.

I'd prefer making it the default in the POM for 3.0. Mostly on the
grounds that its effects are on deployed clusters, not apps. And
deployers can still choose a binary distro with 1.x or make the choice
they want. Those that don't care should probably be nudged to 2.x.
Spark 3.x is already full of behavior changes and 'unstable', so I
think this is minor relative to the overall risk question.

On Wed, Nov 20, 2019 at 12:53 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
> Hi, All.
>
> I'm sending this email because it's important to discuss this topic narrowly
> and make a clear conclusion.
>
> `The forked Hive 1.2.1 is stable`? It sounds like a myth we created
> by ignoring the existing bugs. If you want to say the forked Hive 1.2.1 is
> stabler than XXX, please give us the evidence. Then, we can fix it.
> Otherwise, let's stop making `The forked Hive 1.2.1` invincible.
>
> Historically, the following forked Hive 1.2.1 has never been stable.
> It's just frozen. Since the forked Hive is out of our control, we ignored 
> bugs.
> That's all. The reality is a way far from the stable status.
>
> https://mvnrepository.com/artifact/org.spark-project.hive/
> 
> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark
>  (2015 August)
> 
> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2
>  (2016 April)
>
> First, let's begin Hive itself by comparing with Apache Hive 1.2.2 and 1.2.3,
>
> Apache Hive 1.2.2 has 50 bug fixes.
> Apache Hive 1.2.3 has 9 bug fixes.
>
> I will not cover all of them, but Apache Hive community also backports
> important patches like Apache Spark community.
>
> Second, let's move to SPARK issues because we aren't exposed to all Hive 
> issues.
>
> SPARK-19109 ORC metadata section can sometimes exceed protobuf message 
> size limit
> SPARK-22267 Spark SQL incorrectly reads ORC file when column order is 
> different
>
> These were reported since Apache Spark 1.6.x because the forked Hive doesn't 
> have
> a proper upstream patch like HIVE-11592 (fixed at Apache Hive 1.3.0).
>
> Since we couldn't update the frozen forked Hive, we added Apache ORC 
> dependency
> at SPARK-20682 (2.3.0), added a switching configuration at SPARK-20728 
> (2.3.0),
> turned on `spark.sql.hive.convertMetastoreOrc` by default at SPARK-22279 
> (2.4.0).
> However, if you turn off the switch and start to use the forked hive,
> you will be exposed to the buggy forked Hive 1.2.1 again.
>
> Third, let's talk about the new features like Hadoop 3 and JDK11.
> No one believe that the ancient forked Hive 1.2.1 will work with this.
> I saw that the following issue is mentioned as an evidence of Hive 2.3.6 bug.
>
> SPARK-29245 ClassCastException during creating HiveMetaStoreClient
>
> Yes. I know that issue because I reported it and verified HIVE-21508.
> It's fixed already and will be released as Apache Hive 2.3.7.
>
> Can we imagine something like this in the forked Hive 1.2.1?
> 'No'. There is no future on it. It's frozen.
>
> From now, I want to claim that the forked Hive 1.2.1 is the unstable one.
> I welcome all your positive and negative opinions.
> Please share your concerns and problems and fix them together.
> Apache Spark is an open source project we shared.
>
> Bests,
> Dongjoon.
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-18 Thread Felix Cheung
1000% with Steve, the org.spark-project hive 1.2 will need a solution. It is 
old and rather buggy; and it’s been *years*.

I think we should decouple the Hive change from everything else if people are 
concerned?


From: Steve Loughran 
Sent: Sunday, November 17, 2019 9:22:09 AM
To: Cheng Lian 
Cc: Sean Owen ; Wenchen Fan ; Dongjoon 
Hyun ; dev ; Yuming Wang 

Subject: Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Can I take this moment to remind everyone that the version of Hive which Spark 
has historically bundled (the org.spark-project one) is an orphan project put 
together to deal with Hive's shading issues and a source of unhappiness in the 
Hive project. Whatever gets shipped should do its best to avoid including that 
file.

Postponing a switch to hadoop 3.x after spark 3.0 is probably the safest move 
from a risk minimisation perspective. If something has broken then it is you 
can start with the assumption that it is in the o.a.s packages without having 
to debug o.a.hadoop and o.a.hive first. There is a cost: if there are problems 
with the hadoop / hive dependencies those teams will inevitably ignore filed 
bug reports for the same reason spark team will probably because 1.6-related 
JIRAs as WONTFIX. WONTFIX responses for the Hadoop 2.x line include any 
compatibility issues with Java 9+. Do bear that in mind. It's not been tested, 
it has dependencies on artifacts we know are incompatible, and as far as the 
Hadoop project is concerned: people should move to branch 3 if they want to run 
on a modern version of Java

It would be really really good if the published spark maven artefacts (a) 
included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop 3.x. 
That way people doing things with their own projects will get up-to-date 
dependencies and don't get WONTFIX responses themselves.

-Steve

PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last ever" 
branch-2 release and then declare its predecessors EOL; 2.10 will be the 
transition release.

On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian 
mailto:lian.cs@gmail.com>> wrote:
Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I thought 
the original proposal was to replace Hive 1.2 with Hive 2.3, which seemed 
risky, and therefore we only introduced Hive 2.3 under the hadoop-3.2 profile 
without removing Hive 1.2. But maybe I'm totally wrong here...

Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that Hadoop 
2 + Hive 2 + JDK 11 looks promising. My major motivation is not about demand, 
but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11 upgrade together 
looks too risky.

On Sat, Nov 16, 2019 at 4:03 AM Sean Owen 
mailto:sro...@gmail.com>> wrote:
I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
than introduce yet another build combination. Does Hadoop 2 + Hive 2
work and is there demand for it?

On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan 
mailto:cloud0...@gmail.com>> wrote:
>
> Do we have a limitation on the number of pre-built distributions? Seems this 
> time we need
> 1. hadoop 2.7 + hive 1.2
> 2. hadoop 2.7 + hive 2.3
> 3. hadoop 3 + hive 2.3
>
> AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so don't 
> need to add JDK version to the combination.
>
> On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun 
> mailto:dongjoon.h...@gmail.com>> wrote:
>>
>> Thank you for suggestion.
>>
>> Having `hive-2.3` profile sounds good to me because it's orthogonal to 
>> Hadoop 3.
>> IIRC, originally, it was proposed in that way, but we put it under 
>> `hadoop-3.2` to avoid adding new profiles at that time.
>>
>> And, I'm wondering if you are considering additional pre-built distribution 
>> and Jenkins jobs.
>>
>> Bests,
>> Dongjoon.
>>


Re: Adding JIRA ID as the prefix for the test case name

2019-11-14 Thread Felix Cheung
this is about test description and not test file name right?

if yes I don’t see a problem.


From: Hyukjin Kwon 
Sent: Thursday, November 14, 2019 6:03:02 PM
To: Shixiong(Ryan) Zhu 
Cc: dev ; Felix Cheung ; 
Shivaram Venkataraman 
Subject: Re: Adding JIRA ID as the prefix for the test case name

Yeah, sounds good to have it.

In the case of R, it seems not quite common to write down the JIRA ID [1], but it 
looks like some tests do have the prefix in their names.
In the case of Python and Java, it seems we write a JIRA ID in the comment right 
under the test method from time to time [2][3].

Given this pattern, I would like to suggest using the same format, but:

1. For Python and Java, write a single comment that starts with JIRA ID and 
short description, e.g. (SPARK-X: test blah blah)
2. For R, use JIRA ID as a prefix for its test name.

[1] git grep -r "SPARK-" -- '*test*.R'
[2] git grep -r "SPARK-" -- '*Suite.java'
[3] git grep -r "SPARK-" -- '*test*.py'

Does that make sense? Adding Felix and Shivaram too.


On Fri, Nov 15, 2019 at 3:13 AM, Shixiong(Ryan) Zhu <shixi...@databricks.com> wrote:
Should we also add a guideline for non-Scala tests? Other languages (Java, 
Python, R) don't support using a string as a test name.

Best Regards,

Ryan


On Thu, Nov 14, 2019 at 4:04 AM Hyukjin Kwon 
mailto:gurwls...@gmail.com>> wrote:
I opened a PR - https://github.com/apache/spark-website/pull/231

On Wed, Nov 13, 2019 at 10:43 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> In general a test should be self descriptive and I don't think we should be 
> adding JIRA ticket references wholesale. Any action that the reader has to 
> take to understand why a test was introduced is one too many. However in some 
> cases the thing we are trying to test is very subtle and in that case a 
> reference to a JIRA ticket might be useful, I do still feel that this should 
> be a backstop and that properly documenting your tests is a much better way 
> of dealing with this.

Yeah, the test should be self-descriptive. I don't think adding a JIRA prefix 
harms this point. Probably I should add this sentence in the guidelines as well.
Adding a JIRA prefix just adds one extra hint to track down details. I think 
it's fine to stick to this practice and make it simpler and clearer to follow.

> 1. what if multiple JIRA IDs relating to the same test? we just take the very 
> first JIRA ID?
Ideally one JIRA should describe one issue and one PR should fix one JIRA with 
a dedicated test.
Yeah, I think I would take the very first JIRA ID.

> 2. are we going to have a full scan of all existing tests and attach a JIRA 
> ID to it?
Yeah, let's not do this.

> It's a nice-to-have, not super essential, just because ...
It's been asked multiple times and each committer seems to have a different 
understanding of this.
It's not a biggie, but I wanted to make it clear and conclude this.

> I'd add this only when a test specifically targets a certain issue.
Yes, I am not sure about this one. From what I heard, people add the JIRA in 
the cases below:

- Whenever the JIRA type is a bug
- When a PR adds a couple of tests
- Only when a test specifically targets a certain issue.
- ...

Which one do we prefer and simpler to follow?

Or I can combine them as below (I'm going to reword this when I actually document it):
1. In general, we should add a JIRA ID as a prefix of a test name when a PR aims to 
fix a specific issue.
In practice, this usually happens when the JIRA type is a bug or when a PR adds a 
couple of tests.
2. Use the "SPARK-: test name" format

If we have no objection with ^, let me go with this.
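
For illustration only, here is what format 2 could look like in a Scala suite. This
is a made-up sketch: the suite name, JIRA number, and assertion are hypothetical, and
it assumes Spark's own test helpers (QueryTest, SharedSparkSession) are on the test
classpath.

import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

class ExampleSuite extends QueryTest with SharedSparkSession {
  // The prefix ties the test back to the JIRA that motivated it.
  test("SPARK-12345: repartition should not drop rows for an empty input") {
    assert(spark.range(0).repartition(4).count() === 0)
  }
}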

On Wed, Nov 13, 2019 at 8:14 AM, Sean Owen <sro...@gmail.com> wrote:
Let's suggest "SPARK-12345:" but not go back and change a bunch of test cases.
I'd add this only when a test specifically targets a certain issue.
It's a nice-to-have, not super essential, just because in the rare
case you need to understand why a test asserts something, you can go
back and find what added it in the git history without much trouble.

On Mon, Nov 11, 2019 at 10:46 AM Hyukjin Kwon 
mailto:gurwls...@gmail.com>> wrote:
>
> Hi all,
>
> Maybe it's not a big deal, but it has brought some confusion into the Spark dev 
> community from time to time. I think it's time to discuss when and in which format 
> to add a JIRA ID as a prefix for the test case name in Scala test cases.
>
> Currently we have many test case names with prefixes as below:
>
> test("SPARK-X blah blah")
> test("SPARK-X: blah blah")
> test("SPARK-X - blah blah")
> test("[SPARK-X] blah blah")
> …
>
> It is a good practice to have the JIRA ID in general because, for instance,
> it makes us put less efforts to track commit histories (or even when the files
> are totally moved), or to track related 

Re: [VOTE] [SPARK-27495] SPIP: Support Stage level resource configuration and scheduling

2019-09-11 Thread Felix Cheung
+1


From: Thomas graves 
Sent: Wednesday, September 4, 2019 7:24:26 AM
To: dev 
Subject: [VOTE] [SPARK-27495] SPIP: Support Stage level resource configuration 
and scheduling

Hey everyone,

I'd like to call for a vote on SPARK-27495 SPIP: Support Stage level
resource configuration and scheduling

This is for supporting stage level resource configuration and
scheduling. The basic idea is to allow the user to specify executor
and task resource requirements for each stage, so they can control the
resources required at a finer grain. One good example here
is doing some ETL to preprocess your data in one stage and then feed
that data into an ML algorithm (like tensorflow) that would run as a
separate stage.  The ETL could need totally different resource
requirements for the executors/tasks than the ML stage does.
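
For a rough illustration, here is a sketch of what such an ETL-then-ML pipeline
could look like with the ResourceProfile-style API that eventually landed in
Spark 3.x. The class and method names are as I recall them from that later
implementation, so treat them as assumptions here; rawRdd, parseRecord and
trainOnPartition are placeholders.

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// ETL stage: runs with the application's default executor/task resources.
val preprocessed = rawRdd.map(parseRecord)

// ML stage: ask for GPUs (and bigger executors) only where they are needed.
val execReqs = new ExecutorResourceRequests().cores(8).memory("16g").resource("gpu", 2)
val taskReqs = new TaskResourceRequests().cpus(1).resource("gpu", 1)
val gpuProfile = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build

// Stages computed from this RDD request the GPU profile instead of the default.
val model = preprocessed.withResources(gpuProfile).mapPartitions(trainOnPartition)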

The text for the SPIP is in the jira description:

https://issues.apache.org/jira/browse/SPARK-27495

I split the API and Design parts into a google doc that is linked to
from the jira.

This vote is open until next Fri (Sept 13th).

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

I'll start with my +1

Thanks,
Tom

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-08 Thread Felix Cheung
I’d prefer strict mode and fail fast (analysis check)

Also I like what Alastair suggested about standard clarification.

I think we can re-visit this proposal and restart the vote


From: Ryan Blue 
Sent: Friday, September 6, 2019 5:28 PM
To: Alastair Green
Cc: Reynold Xin; Wenchen Fan; Spark dev list; Gengliang Wang
Subject: Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table 
insertion by default


We discussed this thread quite a bit in the DSv2 sync up and Russell brought up 
a really good point about this.

The ANSI rule used here specifies how to store a specific value, V, so this is 
a runtime rule — an earlier case covers when V is NULL, so it is definitely 
referring to a specific value. The rule requires that if the type doesn’t match 
or if the value cannot be truncated, an exception is thrown for “numeric value 
out of range”.

That runtime error guarantees that even though the cast is introduced at 
analysis time, unexpected NULL values aren’t inserted into a table in place of 
data values that are out of range. Unexpected NULL values are the problem that 
was concerning to many of us in the discussion thread, but it turns out that 
real ANSI behavior doesn’t have the problem. (In the sync, we validated this by 
checking Postgres and MySQL behavior, too.)

In Spark, the runtime check is a separate configuration property from this one, 
but in order to actually implement ANSI semantics, both need to be set. So I 
think it makes sense to change both defaults to be ANSI. The analysis check 
alone does not implement the ANSI standard.
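
To make that concrete, a small sketch of the two knobs, using the configuration
names that eventually shipped in Spark 3.x (assumed here, not quoted from this
thread):

// Analysis-time rule: reject e.g. string -> int store assignment outright.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
// Runtime rule: throw instead of silently writing NULL for out-of-range values.
spark.conf.set("spark.sql.ansi.enabled", "true")

spark.sql("CREATE TABLE t (i INT) USING parquet")
spark.sql("INSERT INTO t VALUES ('abc')")      // fails analysis: string cannot be stored in an INT column
spark.sql("INSERT INTO t SELECT 12345678901")  // passes analysis (BIGINT -> INT), fails at runtime on overflow

With only the first setting, the overflowing insert would still write a NULL,
which is exactly the gap described above.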

In the sync, we also agreed that it makes sense to be able to turn off the 
runtime check in order to avoid job failures. Another, safer way to avoid job 
failures is to require an explicit cast, i.e., strict mode.

I think that we should amend this proposal to change the default for both the 
runtime check and the analysis check to ANSI.

As this stands now, I vote -1. But I would support this if the vote were to set 
both runtime and analysis checks to ANSI mode.

rb

On Fri, Sep 6, 2019 at 3:12 AM Alastair Green 
 wrote:
Makes sense.

While the ISO SQL standard automatically becomes an American national  (ANSI) 
standard, changes are only made to the International (ISO/IEC) Standard, which 
is the authoritative specification.

These rules are specified in SQL/Foundation (ISO/IEC SQL Part 2), section 9.2.

Could we rename the proposed default to “ISO/IEC (ANSI)”?

— Alastair

On Thu, Sep 5, 2019 at 17:17, Reynold Xin 
mailto:r...@databricks.com>> wrote:

Having three modes is a lot. Why not just use ansi mode as default, and legacy 
for backward compatibility? Then over time there's only the ANSI mode, which is 
standard compliant and easy to understand. We also don't need to invent a 
standard just for Spark.


On Thu, Sep 05, 2019 at 12:27 AM, Wenchen Fan 
mailto:cloud0...@gmail.com>> wrote:
+1

To be honest I don't like the legacy policy. It's too loose and easy for users 
to make mistakes, especially when Spark returns null if a function hit errors 
like overflow.

The strict policy is not good either. It's too strict and stops valid use cases 
like writing timestamp values to a date type column. Users do expect truncation 
to happen without adding a cast manually in this case. It's also weird to use a 
Spark-specific policy that no other database is using.

The ANSI policy is better. It stops invalid use cases like writing string 
values to an int type column, while keeping valid use cases like timestamp -> 
date.

I think it's no doubt that we should use ANSI policy instead of legacy policy 
for v1 tables. Except for backward compatibility, ANSI policy is literally 
better than the legacy policy.

The v2 table is arguable here. Although the ANSI policy is better than strict 
policy to me, this is just the store assignment policy, which only partially 
controls the table insertion behavior. With Spark's "return null on error" 
behavior, the table insertion is more likely to insert invalid null values with 
the ANSI policy compared to the strict policy.

I think we should use ANSI policy by default for both v1 and v2 tables, because
1. End-users don't care how the table is implemented. Spark should provide 
consistent table insertion behavior between v1 and v2 tables.
2. Data Source V2 is unstable in Spark 2.x so there is no backward 
compatibility issue. That said, the baseline to judge which policy is better 
should be the table insertion behavior in Spark 2.x, which is the legacy policy 
+ "return null on error". ANSI policy is better than the baseline.
3. We expect more and more users to migrate their data sources to the V2 API. 
The strict policy can be a stopper as it's a too big breaking change, which may 
break many existing queries.

Thanks,
Wenchen


On Wed, Sep 4, 2019 at 1:59 PM Gengliang Wang 
mailto:gengliang.w...@databricks.com>> wrote:

Hi everyone,

I'd like to call for a vote on 

Re: maven 3.6.1 removed from apache maven repo

2019-09-03 Thread Felix Cheung
(Hmm, what is spark-...@apache.org?)


From: Sean Owen 
Sent: Tuesday, September 3, 2019 11:58:30 AM
To: Xiao Li 
Cc: Tom Graves ; spark-...@apache.org 

Subject: Re: maven 3.6.1 removed from apache maven repo

It's because build/mvn only queries ASF mirrors, and they remove non-current 
releases from mirrors regularly (we do the same).
This may help avoid this in the future: 
https://github.com/apache/spark/pull/25667

On Tue, Sep 3, 2019 at 1:41 PM Xiao Li 
mailto:lix...@databricks.com>> wrote:
Hi, Tom,

To unblock the build, I merged the upgrade to master. 
https://github.com/apache/spark/pull/25665

Thanks!

Xiao


On Tue, Sep 3, 2019 at 10:58 AM Tom Graves  wrote:
It looks like maven 3.6.1 was removed from the repo - see SPARK-28960.  It 
looks like they pushed 3.6.2, but I don't see any release notes for 3.6.2 on the 
Maven page.

Seems like we had this happen before, can't remember if it was maven or 
something else, anyone remember or know if they are about to release 3.6.2?

Tom




Re: Design review of SPARK-28594

2019-09-01 Thread Felix Cheung
I did review it and solving this problem makes sense. I will comment in the 
JIRA.


From: Jungtaek Lim 
Sent: Sunday, August 25, 2019 3:34:22 PM
To: dev 
Subject: Design review of SPARK-28594

Hi devs,

I have been working on designing SPARK-28594 [1] (though I've started with this 
via different requests) and the design doc is now available [2].

Let me describe SPARK-28594 briefly - the single, ever-growing event log file per 
application has been a major issue for streaming applications: the event log just 
grows while the application is running, and lots of issues occur from there. The 
only viable workaround has been disabling the event log, which is not easily 
acceptable. Maybe stopping the application and rerunning it would be another 
approach, but it sounds really odd to stop an application because of its event 
log. SPARK-28594 enables a way to roll the event log files, compacting old event 
log files without losing the ability to replay the whole log.
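
For reference, the kind of configuration this proposal implies looks roughly like
the sketch below. The names are the ones the feature eventually shipped with in
Spark 3.x, so they are an assumption relative to this design doc rather than
something decided here.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  // Roll the event log into bounded files instead of one ever-growing file.
  .set("spark.eventLog.rolling.enabled", "true")
  .set("spark.eventLog.rolling.maxFileSize", "128m")

On the history server side, a retention knob (e.g.
spark.history.fs.eventLog.rolling.maxFilesToRetain) controls how many rolled
files are kept before older ones are compacted.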

While I'll break the issue down into subtasks and start from the easier ones, in 
parallel I'd like to ask for a review of the design to get a better idea and find 
possible defects in the design.

Please note that the doc is intended to describe the detailed changes (closer 
to the implementation details) and is not a kind of SPIP, because I don't feel 
the SPIP process is needed for this improvement - the change would not be huge 
and the proposal works orthogonally to the current feature. Please let me know 
if that's not the case and the SPIP process is necessary.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-28594
2. 
https://docs.google.com/document/d/12bdCC4nA58uveRxpeo8k7kGOI2NRTXmXyBOweSi4YcY/edit?usp=sharing



Re: [VOTE] Release Apache Spark 2.4.4 (RC3)

2019-08-30 Thread Felix Cheung
+1

Run tests, R tests, r-hub Debian, Ubuntu, mac, Windows


From: Hyukjin Kwon 
Sent: Wednesday, August 28, 2019 9:14 PM
To: Takeshi Yamamuro
Cc: dev; Dongjoon Hyun
Subject: Re: [VOTE] Release Apache Spark 2.4.4 (RC3)

+1 (from the last blocker PR)

On Thu, Aug 29, 2019 at 8:20 AM, Takeshi Yamamuro <linguin@gmail.com> wrote:
I checked the tests passed again on the same env.
It looks ok.


On Thu, Aug 29, 2019 at 6:15 AM Marcelo Vanzin  
wrote:
+1

On Tue, Aug 27, 2019 at 4:06 PM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.4.
>
> The vote is open until August 30th 5PM PST and passes if a majority +1 PMC 
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.4
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.4-rc3 (commit 
> 7955b3962ac46b89564e0613db7bea98a1478bf2):
> https://github.com/apache/spark/tree/v2.4.4-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1332/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc3-docs/
>
> The list of bug fixes going into 2.4.4 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12345466
>
> This release is using the release script of the tag v2.4.4-rc3.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.4?
> ===
>
> The current list of open tickets targeted at 2.4.4 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.4.4
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.



--
Marcelo

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org



--
---
Takeshi Yamamuro


Re: JDK11 Support in Apache Spark

2019-08-24 Thread Felix Cheung
That’s great!


From: ☼ R Nair 
Sent: Saturday, August 24, 2019 10:57:31 AM
To: Dongjoon Hyun 
Cc: dev@spark.apache.org ; user @spark/'user 
@spark'/spark users/user@spark 
Subject: Re: JDK11 Support in Apache Spark

Finally!!! Congrats

On Sat, Aug 24, 2019, 11:11 AM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
Hi, All.

Thanks to your many many contributions,
Apache Spark master branch starts to pass on JDK11 as of today.
(with `hadoop-3.2` profile: Apache Hadoop 3.2 and Hive 2.3.6)


https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/
(JDK11 is used for building and testing.)

We already verified all UTs (including PySpark/SparkR) before.

Please feel free to use JDK11 in order to build/test/run `master` branch and
share your experience including any issues. It will help Apache Spark 3.0.0 
release.

For the follow-ups, please follow 
https://issues.apache.org/jira/browse/SPARK-24417 .
The next step is `how to support JDK8/JDK11 together in a single artifact`.

Bests,
Dongjoon.


Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-17 Thread Felix Cheung
+1

Glad to see the progress in this space - it’s been more than a year since the 
original discussion and effort started.


From: Yinan Li 
Sent: Monday, June 17, 2019 7:14:42 PM
To: rb...@netflix.com
Cc: Dongjoon Hyun; Saisai Shao; Imran Rashid; Ilan Filonenko; bo yang; Matt 
Cheah; Spark Dev List; Yifei Huang (PD); Vinoo Ganesh; Imran Rashid
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

+1 (non-binding)

On Mon, Jun 17, 2019 at 1:58 PM Ryan Blue  wrote:
+1 (non-binding)

On Sun, Jun 16, 2019 at 11:11 PM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
+1

Bests,
Dongjoon.


On Sun, Jun 16, 2019 at 9:41 PM Saisai Shao 
mailto:sai.sai.s...@gmail.com>> wrote:
+1 (binding)

Thanks
Saisai

On Sat, Jun 15, 2019 at 3:46 AM, Imran Rashid <im...@therashids.com> wrote:
+1 (binding)

I think this is a really important feature for spark.

First, there is already a lot of interest in alternative shuffle storage in the 
community - from dynamic allocation in Kubernetes to even just improving stability 
in standard on-premise use of Spark.  However, they're often stuck doing this in 
forks of Spark, and in ways that are not maintainable (because they copy-paste 
many spark internals) or are incorrect (for not correctly handling speculative 
execution & stage retries).

Second, I think the specific proposal is good for finding the right balance 
between flexibility and too much complexity, to allow incremental improvements. 
 A lot of work has been put into this already to try to figure out which pieces 
are essential to make alternative shuffle storage implementations feasible.

Of course, that means it doesn't include everything imaginable; some things 
still aren't supported, and some will still choose to use the older 
ShuffleManager api to give total control over all of shuffle.  But we know 
there are a reasonable set of things which can be implemented behind the api as 
the first step, and it can continue to evolve.

On Fri, Jun 14, 2019 at 12:13 PM Ilan Filonenko 
mailto:i...@cornell.edu>> wrote:
+1 (non-binding). This API is versatile and flexible enough to handle 
Bloomberg's internal use-cases. The ability for us to vary implementation 
strategies is quite appealing. It is also worth to note the minimal changes to 
Spark core in order to make it work. This is a very much needed addition within 
the Spark shuffle story.

On Fri, Jun 14, 2019 at 9:59 AM bo yang 
mailto:bobyan...@gmail.com>> wrote:
+1 This is great work, allowing plugin of different sort shuffle write/read 
implementation! Also great to see it retain the current Spark configuration 
(spark.shuffle.manager=org.apache.spark.shuffle.YourShuffleManagerImpl).


On Thu, Jun 13, 2019 at 2:58 PM Matt Cheah 
mailto:mch...@palantir.com>> wrote:
Hi everyone,

I would like to call a vote for the SPIP for 
SPARK-25299, which proposes 
to introduce a pluggable storage API for temporary shuffle data.

You may find the SPIP document 
here.

The discussion thread for the SPIP was conducted 
here.

Please vote on whether or not this proposal is agreeable to you.

Thanks!

-Matt Cheah


--
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Felix Cheung
How about pyArrow?


From: Holden Karau 
Sent: Friday, June 14, 2019 11:06:15 AM
To: Felix Cheung
Cc: Bryan Cutler; Dongjoon Hyun; Hyukjin Kwon; dev; shane knapp
Subject: Re: [DISCUSS] Increasing minimum supported version of Pandas

Are there other Python dependencies we should consider upgrading at the same 
time?

On Fri, Jun 14, 2019 at 7:45 PM Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
So to be clear, min version check is 0.23
Jenkins test is 0.24

I’m ok with this. I hope someone will test 0.23 on releases though before we 
sign off?
We should maybe add this to the release instruction notes?


From: shane knapp mailto:skn...@berkeley.edu>>
Sent: Friday, June 14, 2019 10:23:56 AM
To: Bryan Cutler
Cc: Dongjoon Hyun; Holden Karau; Hyukjin Kwon; dev
Subject: Re: [DISCUSS] Increasing minimum supported version of Pandas

excellent.  i shall not touch anything.  :)

On Fri, Jun 14, 2019 at 10:22 AM Bryan Cutler 
mailto:cutl...@gmail.com>> wrote:
Shane, I think 0.24.2 is probably more common right now, so if we were to pick 
one to test against, I still think it should be that one. Our Pandas usage in 
PySpark is pretty conservative, so it's pretty unlikely that we will add 
something that would break 0.23.X.

On Fri, Jun 14, 2019 at 10:10 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
ah, ok...  should we downgrade the testing env on jenkins then?  any specific 
version?

shane, who is loathe (and i mean LOATHE) to touch python envs ;)

On Fri, Jun 14, 2019 at 10:08 AM Bryan Cutler 
mailto:cutl...@gmail.com>> wrote:
I should have stated this earlier, but when the user does something that 
requires Pandas, the minimum version is checked against what was imported and 
will raise an exception if it is a lower version. So I'm concerned that using 
0.24.2 might be a little too new for users running older clusters. To give some 
release dates, 0.23.2 was released about a year ago, 0.24.0 in January and 
0.24.2 in March.
I think given that we’re switching to requiring Python 3, and are also a bit of a 
way from cutting a release, 0.24 could be OK as a minimum version requirement.


On Fri, Jun 14, 2019 at 9:27 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
just to everyone knows, our python 3.6 testing infra is currently on 0.24.2...

On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
+1

Thank you for this effort, Bryan!

Bests,
Dongjoon.

On Fri, Jun 14, 2019 at 4:24 AM Holden Karau 
mailto:hol...@pigscanfly.ca>> wrote:
I’m +1 for upgrading, although since this is probably the last easy chance 
we’ll have to bump version numbers easily I’d suggest 0.24.2


On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon 
mailto:gurwls...@gmail.com>> wrote:
I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow and pandas 
combinations. Spark 3 should be good time to increase.

On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler <cutl...@gmail.com> wrote:
Hi All,

We would like to discuss increasing the minimum supported version of Pandas in 
Spark, which is currently 0.19.2.

Pandas 0.19.2 was released nearly 3 years ago and there are some workarounds in 
PySpark that could be removed if such an old version is not required. This will 
help to keep code clean and reduce maintenance effort.

The change is targeted for Spark 3.0.0 release, see 
https://issues.apache.org/jira/browse/SPARK-28041. The current thought is to 
bump the version to 0.23.2, but we would like to discuss before making a 
change. Does anyone else have thoughts on this?

Regards,
Bryan
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 
<https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 
<https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Felix Cheung
So to be clear, min version check is 0.23
Jenkins test is 0.24

I’m ok with this. I hope someone will test 0.23 on releases though before we 
sign off?


From: shane knapp 
Sent: Friday, June 14, 2019 10:23:56 AM
To: Bryan Cutler
Cc: Dongjoon Hyun; Holden Karau; Hyukjin Kwon; dev
Subject: Re: [DISCUSS] Increasing minimum supported version of Pandas

excellent.  i shall not touch anything.  :)

On Fri, Jun 14, 2019 at 10:22 AM Bryan Cutler 
mailto:cutl...@gmail.com>> wrote:
Shane, I think 0.24.2 is probably more common right now, so if we were to pick 
one to test against, I still think it should be that one. Our Pandas usage in 
PySpark is pretty conservative, so it's pretty unlikely that we will add 
something that would break 0.23.X.

On Fri, Jun 14, 2019 at 10:10 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
ah, ok...  should we downgrade the testing env on jenkins then?  any specific 
version?

shane, who is loathe (and i mean LOATHE) to touch python envs ;)

On Fri, Jun 14, 2019 at 10:08 AM Bryan Cutler 
mailto:cutl...@gmail.com>> wrote:
I should have stated this earlier, but when the user does something that 
requires Pandas, the minimum version is checked against what was imported and 
will raise an exception if it is a lower version. So I'm concerned that using 
0.24.2 might be a little too new for users running older clusters. To give some 
release dates, 0.23.2 was released about a year ago, 0.24.0 in January and 
0.24.2 in March.

On Fri, Jun 14, 2019 at 9:27 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
just to everyone knows, our python 3.6 testing infra is currently on 0.24.2...

On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
+1

Thank you for this effort, Bryan!

Bests,
Dongjoon.

On Fri, Jun 14, 2019 at 4:24 AM Holden Karau 
mailto:hol...@pigscanfly.ca>> wrote:
I’m +1 for upgrading, although since this is probably the last easy chance 
we’ll have to bump version numbers easily I’d suggest 0.24.2


On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon 
mailto:gurwls...@gmail.com>> wrote:
I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow and pandas 
combinations. Spark 3 should be good time to increase.

On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler <cutl...@gmail.com> wrote:
Hi All,

We would like to discuss increasing the minimum supported version of Pandas in 
Spark, which is currently 0.19.2.

Pandas 0.19.2 was released nearly 3 years ago and there are some workarounds in 
PySpark that could be removed if such an old version is not required. This will 
help to keep code clean and reduce maintenance effort.

The change is targeted for Spark 3.0.0 release, see 
https://issues.apache.org/jira/browse/SPARK-28041. The current thought is to 
bump the version to 0.23.2, but we would like to discuss before making a 
change. Does anyone else have thoughts on this?

Regards,
Bryan
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

YouTube Live Streams: https://www.youtube.com/user/holdenkarau


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Felix Cheung
Very subtle but someone might take

“We will drop Python 2 support in a future release in 2020”

to mean any / the first release in 2020, whereas the next statement indicates 
patch releases are not included in the above. It might help to reorder the items 
or clarify the wording.



From: shane knapp 
Sent: Friday, May 31, 2019 7:38:10 PM
To: Denny Lee
Cc: Holden Karau; Bryan Cutler; Erik Erlandson; Felix Cheung; Mark Hamstra; 
Matei Zaharia; Reynold Xin; Sean Owen; Wenchen Fen; Xiangrui Meng; dev; user
Subject: Re: Should python-2 be supported in Spark 3.0?

+1000  ;)

On Sat, Jun 1, 2019 at 6:53 AM Denny Lee 
mailto:denny.g@gmail.com>> wrote:
+1

On Fri, May 31, 2019 at 17:58 Holden Karau 
mailto:hol...@pigscanfly.ca>> wrote:
+1

On Fri, May 31, 2019 at 5:41 PM Bryan Cutler 
mailto:cutl...@gmail.com>> wrote:
+1 and the draft sounds good

On Thu, May 30, 2019, 11:32 AM Xiangrui Meng 
mailto:men...@gmail.com>> wrote:
Here is the draft announcement:

===
Plan for dropping Python 2 support

As many of you already know, the Python core development team and many widely used 
Python packages like Pandas and NumPy will drop Python 2 support in or before 
2020/01/01. Apache Spark has supported both Python 2 and 3 since Spark 1.4 
release in 2015. However, maintaining Python 2/3 compatibility is an increasing 
burden and it essentially limits the use of Python 3 features in Spark. Given 
the end of life (EOL) of Python 2 is coming, we plan to eventually drop Python 
2 support as well. The current plan is as follows:

* In the next major release in 2019, we will deprecate Python 2 support. 
PySpark users will see a deprecation warning if Python 2 is used. We will 
publish a migration guide for PySpark users to migrate to Python 3.
* We will drop Python 2 support in a future release in 2020, after Python 2 EOL 
on 2020/01/01. PySpark users will see an error if Python 2 is used.
* For releases that support Python 2, e.g., Spark 2.4, their patch releases 
will continue supporting Python 2. However, after Python 2 EOL, we might not 
take patches that are specific to Python 2.
===

Sean helped make a pass. If it looks good, I'm going to upload it to Spark 
website and announce it here. Let me know if you think we should do a VOTE 
instead.

On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng 
mailto:men...@gmail.com>> wrote:
I created https://issues.apache.org/jira/browse/SPARK-27884 to track the work.

On Thu, May 30, 2019 at 2:18 AM Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
We don’t usually reference a future release on website

> Spark website and state that Python 2 is deprecated in Spark 3.0

I suspect people will then ask when is Spark 3.0 coming out then. Might need to 
provide some clarity on that.

We can say the "next major release in 2019" instead of Spark 3.0. Spark 3.0 
timeline certainly requires a new thread to discuss.




From: Reynold Xin mailto:r...@databricks.com>>
Sent: Thursday, May 30, 2019 12:59:14 AM
To: shane knapp
Cc: Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen Fen; 
Xiangrui Meng; dev; user
Subject: Re: Should python-2 be supported in Spark 3.0?

+1 on Xiangrui’s plan.

On Thu, May 30, 2019 at 7:55 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
I don't have a good sense of the overhead of continuing to support
Python 2; is it large enough to consider dropping it in Spark 3.0?

from the build/test side, it will actually be pretty easy to continue support 
for python2.7 for spark 2.x as the feature sets won't be expanding.

that being said, i will be cracking a bottle of champagne when i can delete all 
of the ansible and anaconda configs for python2.x.  :)

On the development side, in a future release that drops Python 2 support we can 
remove code that maintains python 2/3 compatibility and start using python 3 
only features, which is also quite exciting.


shane
--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 
<https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Felix Cheung
We don’t usually reference a future release on website

> Spark website and state that Python 2 is deprecated in Spark 3.0

I suspect people will then ask when is Spark 3.0 coming out then. Might need to 
provide some clarity on that.



From: Reynold Xin 
Sent: Thursday, May 30, 2019 12:59:14 AM
To: shane knapp
Cc: Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen Fen; 
Xiangrui Meng; dev; user
Subject: Re: Should python-2 be supported in Spark 3.0?

+1 on Xiangrui’s plan.

On Thu, May 30, 2019 at 7:55 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
I don't have a good sense of the overhead of continuing to support
Python 2; is it large enough to consider dropping it in Spark 3.0?

from the build/test side, it will actually be pretty easy to continue support 
for python2.7 for spark 2.x as the feature sets won't be expanding.

that being said, i will be cracking a bottle of champagne when i can delete all 
of the ansible and anaconda configs for python2.x.  :)

shane
--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-27 Thread Felix Cheung
+1

I’d prefer to see more of the end goal and how that could be achieved (such as 
ETL or SPARK-24579). However given the rounds and months of discussions we have 
come down to just the public API.

If the community thinks a new set of public API is maintainable, I don’t see 
any problem with that.


From: Tom Graves 
Sent: Sunday, May 26, 2019 8:22:59 AM
To: hol...@pigscanfly.ca; Reynold Xin
Cc: Bobby Evans; DB Tsai; Dongjoon Hyun; Imran Rashid; Jason Lowe; Matei 
Zaharia; Thomas graves; Xiangrui Meng; Xiangrui Meng; dev
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
Processing Support

More feedback would be great, this has been open a long time though, let's 
extend til Wednesday the 29th and see where we are at.

Tom



Sent from Yahoo Mail on 
Android

On Sat, May 25, 2019 at 6:28 PM, Holden Karau
 wrote:
Same I meant to catch up after kubecon but had some unexpected travels.

On Sat, May 25, 2019 at 10:56 PM Reynold Xin 
mailto:r...@databricks.com>> wrote:
Can we push this to June 1st? I have been meaning to read it but unfortunately 
keeps traveling...

On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
+1

Thanks,
Dongjoon.

On Fri, May 24, 2019 at 17:03 DB Tsai  wrote:
+1 on exposing the APIs for columnar processing support.

I understand that the scope of this SPIP doesn't cover AI / ML
use-cases. But I saw a good performance gain when I converted data
from rows to columns to leverage on SIMD architectures in a POC ML
application.

With the exposed columnar processing support, I can imagine that the
heavy lifting parts of ML applications (such as computing the
objective functions) can be written as columnar expressions that
leverage on SIMD architectures to get a good speedup.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Wed, May 15, 2019 at 2:59 PM Bobby Evans 
mailto:reva...@gmail.com>> wrote:
>
> It would allow for the columnar processing to be extended through the 
> shuffle.  So if I were doing say an FPGA accelerated extension it could 
> replace the ShuffleExchangeExec with one that can take a ColumnarBatch as 
> input instead of a Row. The extended version of the ShuffleExchangeExec could 
> then do the partitioning on the incoming batch and instead of producing a 
> ShuffleRowRDD for the exchange they could produce something like a 
> ShuffleBatchRDD that would let the serializing and deserializing happen in a 
> column based format for a faster exchange, assuming that columnar processing 
> is also happening after the exchange. This is just like providing a columnar 
> version of any other catalyst operator, except in this case it is a bit more 
> complex of an operator.
>
> On Wed, May 15, 2019 at 12:15 PM Imran Rashid  
> wrote:
>>
>> sorry I am late to the discussion here -- the jira mentions using this 
>> extensions for dealing with shuffles, can you explain that part?  I don't 
>> see how you would use this to change shuffle behavior at all.
>>
>> On Tue, May 14, 2019 at 10:59 AM Thomas graves 
>> mailto:tgra...@apache.org>> wrote:
>>>
>>> Thanks for replying, I'll extend the vote til May 26th to allow your
>>> and other people feedback who haven't had time to look at it.
>>>
>>> Tom
>>>
>>> On Mon, May 13, 2019 at 4:43 PM Holden Karau 
>>> mailto:hol...@pigscanfly.ca>> wrote:
>>> >
>>> > I’d like to ask this vote period to be extended, I’m interested but I 
>>> > don’t have the cycles to review it in detail and make an informed vote 
>>> > until the 25th.
>>> >
>>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng 
>>> > mailto:m...@databricks.com>> wrote:
>>> >>
>>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't 
>>> >> feel strongly about it. I would still suggest doing the following:
>>> >>
>>> >> 1. Link the POC mentioned in Q4. So people can verify the POC result.
>>> >> 2. List public APIs we plan to expose in Appendix A. I did a quick 
>>> >> check. Beside ColumnarBatch and ColumnarVector, we also need to make the 
>>> >> following public. People who are familiar with SQL internals should help 
>>> >> assess the risk.
>>> >> * ColumnarArray
>>> >> * ColumnarMap
>>> >> * unsafe.types.CalendarInterval
>>> >> * ColumnarRow
>>> >> * UTF8String
>>> >> * ArrayData
>>> >> * ...
>>> >> 3. I still feel using Pandas UDF as the mid-term success doesn't match 
>>> >> the purpose of this SPIP. It does make some code cleaner. But I guess 
>>> >> for ETL use cases, it won't bring much value.
>>> >>
>>> > --
>>> > Twitter: https://twitter.com/holdenkarau
>>> > Books (Learning Spark, High Performance Spark, etc.): 
>>> > https://amzn.to/2MaRAG9
>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>

Re: Static partitioning in partitionBy()

2019-05-07 Thread Felix Cheung
You could

df.filter(col("c") === "c1").write.partitionBy("c").save

It could hit some data skew problems, but it might work for you.




From: Burak Yavuz 
Sent: Tuesday, May 7, 2019 9:35:10 AM
To: Shubham Chaurasia
Cc: dev; u...@spark.apache.org
Subject: Re: Static partitioning in partitionBy()

It depends on the data source. Delta Lake (https://delta.io) allows you to do 
it with the .option("replaceWhere", "c = c1"). With other file formats, you can 
write directly into the partition directory (tablePath/c=c1), but you lose 
atomicity.
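
For concreteness, a rough sketch of the Delta variant Burak describes (the path
and partition value are made up, and it assumes the Delta Lake package is on the
classpath):

import org.apache.spark.sql.functions.col

df.filter(col("c") === "c1")
  .write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "c = 'c1'")
  .save("/path/to/table")

Only the data matching the predicate is replaced, and the swap is atomic; if df
contains rows that don't satisfy the predicate, the write fails rather than
silently spilling into other partitions.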

On Tue, May 7, 2019, 6:36 AM Shubham Chaurasia 
mailto:shubh.chaura...@gmail.com>> wrote:
Hi All,

Is there a way I can provide static partitions in partitionBy()?

Like:
df.write.mode("overwrite").format("MyDataSource").partitionBy("c=c1").save

Above code gives following error as it tries to find column `c=c1` in df.

org.apache.spark.sql.AnalysisException: Partition column `c=c1` not found in 
schema struct;

Thanks,
Shubham


Re: [VOTE] Release Apache Spark 2.4.3

2019-05-05 Thread Felix Cheung
I ran basic tests on R, r-hub etc. LGTM.

+1 (limited - I didn’t get to run other usual tests)


From: Sean Owen 
Sent: Wednesday, May 1, 2019 2:21 PM
To: Xiao Li
Cc: dev@spark.apache.org
Subject: Re: [VOTE] Release Apache Spark 2.4.3

+1 from me. There is little change from 2.4.2 anyway, except for the
important change to the build script that should build pyspark with
Scala 2.11 jars. I verified that the package contains the _2.11 Spark
jars, but have a look!

I'm still getting this weird error from the Kafka module when testing,
but it's a long-standing weird known issue:

[error] 
/home/ubuntu/spark-2.4.3/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumerSuite.scala:85:
Symbol 'term org.eclipse' is missing from the classpath.
[error] This symbol is required by 'method
org.apache.spark.metrics.MetricsSystem.getServletHandlers'.
[error] Make sure that term eclipse is in your classpath and check for
conflicting dependencies with `-Ylog-classpath`.
[error] A full rebuild may help if 'MetricsSystem.class' was compiled
against an incompatible version of org.
[error] testUtils.sendMessages(topic, data.toArray)

Killing zinc and rebuilding didn't help.
But this isn't happening in Jenkins for example, so it should be env-specific.

On Wed, May 1, 2019 at 9:39 AM Xiao Li  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.3.
>
> The vote is open until May 5th PST and passes if a majority +1 PMC votes are 
> cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.3
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.3-rc1 (commit 
> c3e32bf06c35ba2580d46150923abfa795b4446a):
> https://github.com/apache/spark/tree/v2.4.3-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.3-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1324/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.3-rc1-docs/
>
> The list of bug fixes going into 2.4.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12345410
>
> The release is using the release script of the branch 2.4.3-rc1 with the 
> following commit 
> https://github.com/apache/spark/commit/e417168ed012190db66a21e626b2b8d2332d6c01
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.3?
> ===
>
> The current list of open tickets targeted at 2.4.3 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.4.3
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.2

2019-05-01 Thread Felix Cheung
Just my 2c

If there is a known security issue, we should fix it rather than wait for a black 
hat (or worse) to find out whether it actually affects Spark.

I don’t think any of us want to see Spark in the news for this reason.

From: Sean Owen 
Sent: Tuesday, April 30, 2019 1:52:53 PM
To: Reynold Xin
Cc: Jungtaek Lim; Dongjoon Hyun; Wenchen Fan; Michael Heuer; Terry Kim; dev; 
Xiao Li
Subject: Re: [VOTE] Release Apache Spark 2.4.2

FWIW I'm OK with this even though I proposed the backport PR for discussion. It 
really is a tough call, balancing the potential but as-yet unclear security 
benefit vs minor but real Jackson deserialization behavior change.

Because we have a pressing need for a 2.4.3 release (really a 2.4.2.1 almost) I 
think it's reasonable to defer a final call on this in 2.4.x and revert for 
now. Leaving it in 2.4.3 makes it quite permanent.

A little more color on the discussion:
- I don't think https://github.com/apache/spark/pull/22071 mitigates the 
theoretical problem here; I would guess the attack vector is deserializing a 
malicious JSON file. This is unproven either way
- The behavior change we know is basically what you see in the revert PR: 
entries like "'foo': null" aren't written by Jackson by default in 2.7+. You 
can make them so but it needs a code tweak in any app that inherits Spark's 
Jackson
- This is not related to Scala version

This is for a discussion about re-including in 2.4.4:
- Does anyone know that the Jackson issues really _could_ affect Spark
- Does anyone have concrete examples of why the behavior change is a bigger 
deal, or not as big a deal, as anticipated?

On Tue, Apr 30, 2019 at 1:34 AM Reynold Xin 
mailto:r...@databricks.com>> wrote:

Echoing both of you ... it's a bit risky to bump dependency versions in a patch 
release, especially for a super common library. (I wish we shaded Jackson).

Maybe the CVE is a sufficient reason to bump the dependency, ignoring the 
potential behavior changes that might happen, but I'd like to see a bit more 
discussions there and have 2.4.3 focusing on fixing the Scala version issue 
first.



On Mon, Apr 29, 2019 at 11:17 PM, Jungtaek Lim 
mailto:kabh...@gmail.com>> wrote:
Ah! Sorry Xiao I should check the fix version of issue (it's 2.4.3/3.0.0).

Then looks much better to revert and avoid dependency conflict in bugfix 
release. Jackson is one of known things making non-backward changes to 
non-major version, so I agree it's the thing to be careful, or shade/relocate 
and forget about it.

On Tue, Apr 30, 2019 at 3:04 PM Xiao Li 
mailto:lix...@databricks.com>> wrote:
Jungtaek,

Thanks for your inputs! Sorry for the confusion. Let me make it clear.

  *   All the previous 2.4.x [including 2.4.2] releases are using Jackson 
2.6.7.1.
  *   In the master branch, the Jackson is already upgraded to 2.9.8.
  *   Here, I just try to revert Jackson upgrade in the upcoming 2.4.3 release.

Cheers,

Xiao

On Mon, Apr 29, 2019 at 10:53 PM Jungtaek Lim 
mailto:kabh...@gmail.com>> wrote:
Just to be clear, does upgrading jackson to 2.9.8 be coupled with Scala 
version? And could you summarize one of actual broken case due to upgrade if 
you observe anything? Providing actual case would help us to weigh the impact.

Btw, my 2 cents, personally I would rather avoid upgrading dependencies in 
bugfix release unless it resolves major bugs, so reverting it from only 
branch-2.4 sounds good to me. (I still think jackson upgrade is necessary in 
master branch, avoiding lots of CVEs we will waste huge amount of time to 
identify the impact. And other libs will start making couple with jackson 2.9.x 
which conflict Spark's jackson dependency.)

If there will be a consensus regarding reverting that, we may also need to 
announce Spark 2.4.2 is discouraged to be used, otherwise end users will suffer 
from jackson version back and forth.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Apr 30, 2019 at 2:30 PM Xiao Li 
mailto:lix...@databricks.com>> wrote:
Before cutting 2.4.3, I just submitted a PR 
https://github.com/apache/spark/pull/24493 for reverting the commit 
https://github.com/apache/spark/commit/6f394a20bf49f67b4d6329a1c25171c8024a2fae.

In general, we need to be very cautious about the Jackson upgrade in the patch 
releases, especially when this upgrade could break the existing behaviors of 
the external packages or data sources, and generate different results after the 
upgrade. The external packages and data sources need to change their source 
code to keep the original behaviors. The upgrade requires more discussions 
before releasing it, I think.

In the previous PR https://github.com/apache/spark/pull/22071, we turned off 
`spark.master.rest.enabled` by default and 
added the following claim in our security doc:
The Rest Submission Server and the MesosClusterDispatcher do not support 
authentication.  You should ensure that all network access to the 

Re: [VOTE] Release Apache Spark 2.4.2

2019-04-21 Thread Felix Cheung
+1

R tests, package tests on r-hub. Manually check commits under R, doc etc



From: Sean Owen 
Sent: Saturday, April 20, 2019 11:27 AM
To: Wenchen Fan
Cc: Spark dev list
Subject: Re: [VOTE] Release Apache Spark 2.4.2

+1 from me too.

It seems like there is support for merging the Jackson change into
2.4.x (and, I think, a few more minor dependency updates) but this
doesn't have to go into 2.4.2. That said, if there is another RC for
any reason, I think we could include it. Otherwise can wait for 2.4.3.

On Thu, Apr 18, 2019 at 9:51 PM Wenchen Fan  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.2.
>
> The vote is open until April 23 PST and passes if a majority +1 PMC votes are 
> cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.2-rc1 (commit 
> a44880ba74caab7a987128cb09c4bee41617770a):
> https://github.com/apache/spark/tree/v2.4.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1322/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-docs/
>
> The list of bug fixes going into 2.4.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12344996
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.2?
> ===
>
> The current list of open tickets targeted at 2.4.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.4.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark 2.4.2

2019-04-18 Thread Felix Cheung
Re shading - same argument I’ve made earlier today in a PR...

(Context- in many cases Spark has light or indirect dependencies but bringing 
them into the process breaks users code easily)



From: Michael Heuer 
Sent: Thursday, April 18, 2019 6:41 AM
To: Reynold Xin
Cc: Sean Owen; Michael Armbrust; Ryan Blue; Spark Dev List; Wenchen Fan; Xiao Li
Subject: Re: Spark 2.4.2

+100


On Apr 18, 2019, at 1:48 AM, Reynold Xin 
mailto:r...@databricks.com>> wrote:

We should have shaded all Spark’s dependencies :(

On Wed, Apr 17, 2019 at 11:47 PM Sean Owen 
mailto:sro...@gmail.com>> wrote:
For users that would inherit Jackson and use it directly, or whose
dependencies do. Spark itself (with modifications) should be OK with
the change.
It's risky and normally wouldn't backport, except that I've heard a
few times about concerns about CVEs affecting Databind, so wondering
who else out there might have an opinion. I'm not pushing for it
necessarily.

On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin 
mailto:r...@databricks.com>> wrote:
>
> For Jackson - are you worrying about JSON parsing for users or internal Spark 
> functionality breaking?
>
> On Wed, Apr 17, 2019 at 6:02 PM Sean Owen 
> mailto:sro...@gmail.com>> wrote:
>>
>> There's only one other item on my radar, which is considering updating
>> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come up
>> a few times now that there are a number of CVEs open for 2.6.7. Cons:
>> not clear they affect Spark, and Jackson 2.6->2.9 does change Jackson
>> behavior non-trivially. That said back-porting the update PR to 2.4
>> worked out OK locally. Any strong opinions on this one?
>>
>> On Wed, Apr 17, 2019 at 7:49 PM Wenchen Fan 
>> mailto:cloud0...@gmail.com>> wrote:
>> >
>> > I volunteer to be the release manager for 2.4.2, as I was also going to 
>> > propose 2.4.2 because of the reverting of SPARK-25250. Is there any other 
>> > ongoing bug fixes we want to include in 2.4.2? If no I'd like to start the 
>> > release process today (CST).
>> >
>> > Thanks,
>> > Wenchen
>> >
>> > On Thu, Apr 18, 2019 at 3:44 AM Sean Owen 
>> > mailto:sro...@gmail.com>> wrote:
>> >>
>> >> I think the 'only backport bug fixes to branches' principle remains 
>> >> sound. But what's a bug fix? Something that changes behavior to match 
>> >> what is explicitly supposed to happen, or implicitly supposed to happen 
>> >> -- implied by what other similar things do, by reasonable user 
>> >> expectations, or simply how it worked previously.
>> >>
>> >> Is this a bug fix? I guess the criteria that matches is that behavior 
>> >> doesn't match reasonable user expectations? I don't know enough to have a 
>> >> strong opinion. I also don't think there is currently an objection to 
>> >> backporting it, whatever it's called.
>> >>
>> >>
>> >> Is the question whether this needs a new release? There's no harm in 
>> >> another point release, other than needing a volunteer release manager. 
>> >> One could say, wait a bit longer to see what more info comes in about 
>> >> 2.4.1. But given that 2.4.1 took like 2 months, it's reasonable to move 
>> >> towards a release cycle again. I don't see objection to that either (?)
>> >>
>> >>
>> >> The meta question remains: is a 'bug fix' definition even agreed, and 
>> >> being consistently applied? There aren't correct answers, only best 
>> >> guesses from each person's own experience, judgment and priorities. These 
>> >> can differ even when applied in good faith.
>> >>
>> >> Sometimes the variance of opinion comes because people have different 
>> >> info that needs to be surfaced. Here, maybe it's best to share what about 
>> >> that offline conversation was convincing, for example.
>> >>
>> >> I'd say it's also important to separate what one would prefer from what 
>> >> one can't live with(out). Assuming one trusts the intent and experience 
>> >> of the handful of others with an opinion, I'd defer to someone who wants 
>> >> X and will own it, even if I'm moderately against it. Otherwise we'd get 
>> >> little done.
>> >>
>> >> In that light, it seems like both of the PRs at issue here are not 
>> >> _wrong_ to backport. This is a good pair that highlights why, when there 
>> >> isn't a clear reason to do / not do something (e.g. obvious errors, 
>> >> breaking public APIs) we give benefit-of-the-doubt in order to get it 
>> >> later.
>> >>
>> >>
>> >> On Wed, Apr 17, 2019 at 12:09 PM Ryan Blue 
>> >> mailto:rb...@netflix.com.invalid>> wrote:
>> >>>
>> >>> Sorry, I should be more clear about what I'm trying to say here.
>> >>>
>> >>> In the past, Xiao has taken the opposite stance. A good example is PR 
>> >>> #21060 that was a very similar situation: behavior didn't match what was 
>> >>> expected and there was low risk. There was a long argument and the patch 
>> >>> didn't make it into 2.3 (to my knowledge).
>> >>>
>> >>> What we call these low-risk behavior fixes doesn't matter. I called it a 
>> >>> bug on 

Re: Dataset schema incompatibility bug when reading column partitioned data

2019-04-13 Thread Felix Cheung
I kinda agree it is confusing when a parameter is not used...


From: Ryan Blue 
Sent: Thursday, April 11, 2019 11:07:25 AM
To: Bruce Robbins
Cc: Dávid Szakállas; Spark Dev List
Subject: Re: Dataset schema incompatibility bug when reading column partitioned 
data


I think the confusion is that the schema passed to spark.read is not a 
projection schema. I don’t think it is even used in this case because the 
Parquet dataset has its own schema. You’re getting the schema of the table. I 
think the correct behavior is to reject a user-specified schema in this case.

On Thu, Apr 11, 2019 at 11:04 AM Bruce Robbins 
mailto:bersprock...@gmail.com>> wrote:
I see a Jira:

https://issues.apache.org/jira/browse/SPARK-21021

On Thu, Apr 11, 2019 at 9:08 AM Dávid Szakállas 
mailto:david.szakal...@gmail.com>> wrote:
+dev for more visibility. Is this a known issue? Is there a plan for a fix?

Thanks,
David

Begin forwarded message:

From: Dávid Szakállas 
mailto:david.szakal...@gmail.com>>
Subject: Dataset schema incompatibility bug when reading column partitioned data
Date: 2019. March 29. 14:15:27 CET
To: u...@spark.apache.org

We observed the following bug on Spark 2.4.0:


scala> spark.createDataset(Seq((1,2))).write.partitionBy("_1").parquet("foo.parquet")

scala> import org.apache.spark.sql.types._

scala> val schema = StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType)))

scala> spark.read.schema(schema).parquet("foo.parquet").as[(Int, Int)].show
+---+---+
| _2| _1|
+---+---+
|  2|  1|
+---+---+

That is, when reading column-partitioned Parquet files, the explicitly specified 
schema is not adhered to; instead, the partitioning columns are appended to the end 
of the column list. This is a quite severe issue, as some operations, such as 
union, fail if columns are in a different order in two datasets. Thus we have 
to work around the issue with a select:

val columnNames = schema.fields.map(_.name)
ds.select(columnNames.head, columnNames.tail: _*)
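
To make that workaround reusable across datasets, the same two lines can be wrapped 
in a small helper. The sketch below is only illustrative (the alignTo name and the 
standalone imports are not part of Spark or of this thread):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Reorder a DataFrame's columns to match the field order of a target schema,
// so that union / as[T] see the columns in the expected positions.
def alignTo(df: DataFrame, schema: StructType): DataFrame = {
  val names = schema.fields.map(_.name)
  df.select(names.head, names.tail: _*)
}

// e.g. alignTo(spark.read.schema(schema).parquet("foo.parquet"), schema).as[(Int, Int)]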


Thanks,
David Szakallas
Data Engineer | Whitepages, Inc.



--
Ryan Blue
Software Engineer
Netflix


ApacheCon NA 2019 Call For Proposal and help promoting Spark project

2019-04-13 Thread Felix Cheung
Hi Spark community!

As you know, ApacheCon NA 2019 is coming this Sept and its CFP is now open!
This is an important milestone as we celebrate 20 years of ASF. We have tracks 
like Big Data and Machine Learning among many others. Please submit your 
talks/thoughts/challenges/learnings here:
https://www.apachecon.com/acna19/cfp.html

Second, as a community I think it'd be great if we had a post on the 
http://spark.apache.org/ website to promote this event as well. We already have a 
logo link up, and perhaps we could add a post to talk about:
what the Spark project is, what you might learn, then a few suggestions of talk 
topics, why speak at ApacheCon, etc. This will then be linked from the 
ApacheCon official website. Any volunteer from the community?

Third, Twitter. I’m not sure who has access to the ApacheSpark Twitter account 
but it’d be great to promote this. Use the hashtags #ApacheCon and #ACNA19. 
Mention @Apachecon. Please use
https://www.apachecon.com/acna19/cfp.html to promote the CFP, and
https://www.apachecon.com/acna19 to promote the event as a whole.



Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-29 Thread Felix Cheung
I don’t take it as Sept 2019 is end of life for python 3.5 tho. It’s just 
saying the next release.

In any case I think in the next release it will be great to get more Python 3.x 
release test coverage.




From: shane knapp 
Sent: Friday, March 29, 2019 4:46 PM
To: Bryan Cutler
Cc: Felix Cheung; Hyukjin Kwon; dev
Subject: Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

i'm not opposed to 3.6 at all.

On Fri, Mar 29, 2019 at 4:16 PM Bryan Cutler 
mailto:cutl...@gmail.com>> wrote:
PyArrow dropping Python 3.4 was mainly due to support going away at Conda-Forge 
and other dependencies also dropping it.  I think we better upgrade Jenkins 
Python while we are at it.  Are you all against jumping to Python 3.6 so we are 
not in the same boat in September?

On Thu, Mar 28, 2019 at 7:58 PM Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
3.4 is end of life but 3.5 is not. From your link

we expect to release Python 3.5.8 around September 2019.




From: shane knapp mailto:skn...@berkeley.edu>>
Sent: Thursday, March 28, 2019 7:54 PM
To: Hyukjin Kwon
Cc: Bryan Cutler; dev; Felix Cheung
Subject: Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

looks like the same for 3.5...   https://www.python.org/dev/peps/pep-0478/

let's pick a python version and start testing.

On Thu, Mar 28, 2019 at 7:52 PM shane knapp 
mailto:skn...@berkeley.edu>> wrote:

If there was, it looks inevitable to upgrade Jenkins's Python from 3.4 to 3.5.

this is inevitable.  3.4's final release was 10 days ago 
(https://www.python.org/dev/peps/pep-0429/) so we're basically EOL.


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [k8s][jenkins] spark dev tool docs now have k8s+minikube instructions!

2019-03-29 Thread Felix Cheung
Definitely the part on the PR. Thanks!



From: shane knapp 
Sent: Thursday, March 28, 2019 11:19 AM
To: dev; Stavros Kontopoulos
Subject: [k8s][jenkins] spark dev tool docs now have k8s+minikube instructions!

https://spark.apache.org/developer-tools.html

search for "Testing K8S".

this is pretty much how i build and test PRs locally...  the commands there are 
lifted straight from the k8s integration test jenkins build, so they might 
require a little tweaking to better suit your laptop/server.

k8s is great (except when it's not), and it's really quite easy to get set up 
(except when it's not).  stackoverflow is your friend, and the minikube slack 
was really useful.

some of this is a little hacky (running the mount process in the background, 
for example), but there's a lot of development on minikube right now...  the 
k8s project understands the importance of minikube and has dedicated 
engineering resources involved.

and finally, if you have a suggestion for the docs, open a PR!  they are always 
welcome!

shane

ps- and a special thanks to @Stavros 
Kontopoulos and the PR from hell for 
throwing me in the deep end of k8s.  :)
--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [VOTE] Release Apache Spark 2.4.1 (RC9)

2019-03-29 Thread Felix Cheung
+1

build source
R tests
R package CRAN check locally, r-hub



From: d_t...@apple.com on behalf of DB Tsai 
Sent: Wednesday, March 27, 2019 11:31 AM
To: dev
Subject: [VOTE] Release Apache Spark 2.4.1 (RC9)

Please vote on releasing the following candidate as Apache Spark version 2.4.1.

The vote is open until March 30 PST and passes if a majority +1 PMC votes are 
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.1-rc9 (commit 
58301018003931454e93d8a309c7149cf84c279e):
https://github.com/apache/spark/tree/v2.4.1-rc9

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc9-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1319/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc9-docs/

The list of bug fixes going into 2.4.1 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/2.4.1

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.4.1?
===

The current list of open tickets targeted at 2.4.1 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 2.4.1

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC9)

2019-03-29 Thread Felix Cheung
(I think the .invalid is added by the list server)

Personally I’d rather everyone just +1 or -1, and not label votes as binding or non-binding. 
It’s really the responsibility of the RM to confirm if a vote is binding. 
Mistakes have been made otherwise.



From: Marcelo Vanzin 
Sent: Thursday, March 28, 2019 3:56 PM
To: dev
Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC9)

(Anybody knows what's the deal with all the .invalid e-mail addresses?)

Anyway. ASF has voting rules, and some things like releases follow
specific rules:
https://www.apache.org/foundation/voting.html#ReleaseVotes

So, for releases, ultimately, the only votes that "count" towards the
final tally are PMC votes. But everyone is welcome to vote, especially
if they have a reason to -1 a release. PMC members can use that to
guide how they vote, or the RM can use that to drop the RC
unilaterally if he agrees with the reason.


On Thu, Mar 28, 2019 at 3:47 PM Jonatan Jäderberg
 wrote:
>
> +1 (user vote)
>
> btw what to call a vote that is not pmc or committer?
> Some people use "non-binding”, but nobody says “my vote is binding”, and if 
> some vote is important to me, I still need to look up the who’s-who of the 
> project to be able to tally the votes.
> I like `user vote` for someone who has their say but is not speaking with any 
> authority (i.e., not pmc/committer). wdyt?
>
> Also, let’s get this release out the door!
>
> cheers,
> Jonatan
>
> On 28 Mar 2019, at 21:31, DB Tsai  wrote:
>
> +1 from myself
>
> On Thu, Mar 28, 2019 at 3:14 AM Mihaly Toth  
> wrote:
>>
>> +1 (non-binding)
>>
>> Thanks, Misi
>>
>> Sean Owen  ezt írta (időpont: 2019. márc. 28., Cs, 0:19):
>>>
>>> +1 from me - same as last time.
>>>
>>> On Wed, Mar 27, 2019 at 1:31 PM DB Tsai  wrote:
>>> >
>>> > Please vote on releasing the following candidate as Apache Spark version 
>>> > 2.4.1.
>>> >
>>> > The vote is open until March 30 PST and passes if a majority +1 PMC votes 
>>> > are cast, with
>>> > a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 2.4.1
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>> >
>>> > The tag to be voted on is v2.4.1-rc9 (commit 
>>> > 58301018003931454e93d8a309c7149cf84c279e):
>>> > https://github.com/apache/spark/tree/v2.4.1-rc9
>>> >
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc9-bin/
>>> >
>>> > Signatures used for Spark RCs can be found in this file:
>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >
>>> > The staging repository for this release can be found at:
>>> > https://repository.apache.org/content/repositories/orgapachespark-1319/
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc9-docs/
>>> >
>>> > The list of bug fixes going into 2.4.1 can be found at the following URL:
>>> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>>> >
>>> > FAQ
>>> >
>>> > =
>>> > How can I help test this release?
>>> > =
>>> >
>>> > If you are a Spark user, you can help us test this release by taking
>>> > an existing Spark workload and running on this release candidate, then
>>> > reporting any regressions.
>>> >
>>> > If you're working in PySpark you can set up a virtual env and install
>>> > the current RC and see if anything important breaks, in the Java/Scala
>>> > you can add the staging repository to your project's resolvers and test
>>> > with the RC (make sure to clean up the artifact cache before/after so
>>> > you don't end up building with an out-of-date RC going forward).
>>> >
>>> > ===
>>> > What should happen to JIRA tickets still targeting 2.4.1?
>>> > ===
>>> >
>>> > The current list of open tickets targeted at 2.4.1 can be found at:
>>> > https://issues.apache.org/jira/projects/SPARK and search for "Target 
>>> > Version/s" = 2.4.1
>>> >
>>> > Committers should look at those and triage. Extremely important bug
>>> > fixes, documentation, and API tweaks that impact compatibility should
>>> > be worked on immediately. Everything else please retarget to an
>>> > appropriate release.
>>> >
>>> > ==
>>> > But my bug isn't fixed?
>>> > ==
>>> >
>>> > In order to make timely releases, we will typically not hold the
>>> > release unless the bug in question is a regression from the previous
>>> > release. That being said, if there is something which is a regression
>>> > that has not been correctly targeted please ping me or a committer to
>>> > help target the issue.
>>> >
>>> >
>>> > DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, 
>>> > Inc
>>> >
>>> >
>>> > 

Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-28 Thread Felix Cheung
3.4 is end of life but 3.5 is not. From your link

we expect to release Python 3.5.8 around September 2019.




From: shane knapp 
Sent: Thursday, March 28, 2019 7:54 PM
To: Hyukjin Kwon
Cc: Bryan Cutler; dev; Felix Cheung
Subject: Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

looks like the same for 3.5...   https://www.python.org/dev/peps/pep-0478/

let's pick a python version and start testing.

On Thu, Mar 28, 2019 at 7:52 PM shane knapp 
mailto:skn...@berkeley.edu>> wrote:

If there was, it looks inevitable to upgrade Jenkins's Python from 3.4 to 3.5.

this is inevitable.  3.4's final release was 10 days ago 
(https://www.python.org/dev/peps/pep-0429/) so we're basically EOL.


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-28 Thread Felix Cheung
That’s not necessarily bad. I don’t know if we have plans to ever release any 
new 2.2.x or 2.3.x at this point, and we can message this change in the "supported 
version" of Python for any new 2.4 release.

Besides, we could still support Python 3.4; it’s just more complicated to test 
manually without Jenkins coverage.



From: shane knapp 
Sent: Tuesday, March 26, 2019 12:11 PM
To: Bryan Cutler
Cc: dev
Subject: Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

i'm pretty certain that i've got a solid python 3.5 conda environment ready to 
be deployed, but this isn't a minor change to the build system and there might 
be some bugs to iron out.

another problem is that the current python 3.4 environment is hard-coded into 
both the build scripts on jenkins (all over the place) and the codebase 
(thankfully in only one spot):  export PATH=/home/anaconda/envs/py3k/bin:$PATH

this means that every branch (master, 2.x, etc) will test against whatever 
version of python lives in that conda environment.  if we upgrade to 3.5, all 
branches will test against this version.  changing the build and test infra to 
support testing against 2.7, 3.4 or 3.5 based on branch is definitely 
non-trivial...

thoughts?




On Tue, Mar 26, 2019 at 11:39 AM Bryan Cutler 
mailto:cutl...@gmail.com>> wrote:
Thanks Hyukjin.  The plan is to get this done for 3.0 only.  Here is a link to 
the JIRA https://issues.apache.org/jira/browse/SPARK-27276.  Shane is also 
correct in that newer versions of pyarrow have stopped support for Python 3.4, 
so we should probably have Jenkins test against 2.7 and 3.5.

On Mon, Mar 25, 2019 at 9:44 PM Reynold Xin 
mailto:r...@databricks.com>> wrote:

+1 on doing this in 3.0.


On Mon, Mar 25, 2019 at 9:31 PM, Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
I’m +1 if 3.0



From: Sean Owen mailto:sro...@gmail.com>>
Sent: Monday, March 25, 2019 6:48 PM
To: Hyukjin Kwon
Cc: dev; Bryan Cutler; Takuya UESHIN; shane knapp
Subject: Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

I don't know a lot about Arrow here, but seems reasonable. Is this for
Spark 3.0 or for 2.x? Certainly, requiring the latest for Spark 3
seems right.

On Mon, Mar 25, 2019 at 8:17 PM Hyukjin Kwon 
mailto:gurwls...@gmail.com>> wrote:
>
> Hi all,
>
> We really need to upgrade the minimal version soon. It's actually slowing 
> down PySpark dev, for instance, by the overhead of sometimes needing to test 
> against the full matrix of Arrow and Pandas versions. Also, it currently 
> requires adding some weird hacks or ugly code. Some bugs exist in lower 
> versions, and some features are not supported in older PyArrow, for instance.
>
> Per Bryan's recommendation (an Apache Arrow + Spark committer, FWIW) and my 
> opinion as well, we had better increase the minimal version to 0.12.x. 
> (Also, note that Pandas <> Arrow is an experimental feature).
>
> So, I and Bryan will proceed this roughly in few days if there isn't 
> objections assuming we're fine with increasing it to 0.12.x. Please let me 
> know if there are some concerns.
>
> For clarification, this requires some jobs in Jenkins to upgrade the minimal 
> version of PyArrow (I cc'ed Shane as well).
>
> PS: I roughly heard that Shane's busy for some work stuff .. but it's kind of 
> important in my perspective.
>

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org<mailto:dev-unsubscr...@spark.apache.org>



--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-25 Thread Felix Cheung
I’m +1 if 3.0



From: Sean Owen 
Sent: Monday, March 25, 2019 6:48 PM
To: Hyukjin Kwon
Cc: dev; Bryan Cutler; Takuya UESHIN; shane knapp
Subject: Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

I don't know a lot about Arrow here, but seems reasonable. Is this for
Spark 3.0 or for 2.x? Certainly, requiring the latest for Spark 3
seems right.

On Mon, Mar 25, 2019 at 8:17 PM Hyukjin Kwon  wrote:
>
> Hi all,
>
> We really need to upgrade the minimal version soon. It's actually slowing 
> down PySpark dev, for instance, by the overhead of sometimes needing to test 
> against the full matrix of Arrow and Pandas versions. Also, it currently 
> requires adding some weird hacks or ugly code. Some bugs exist in lower 
> versions, and some features are not supported in older PyArrow, for instance.
>
> Per Bryan's recommendation (an Apache Arrow + Spark committer, FWIW) and my 
> opinion as well, we had better increase the minimal version to 0.12.x. 
> (Also, note that Pandas <> Arrow is an experimental feature).
>
> So, I and Bryan will proceed this roughly in few days if there isn't 
> objections assuming we're fine with increasing it to 0.12.x. Please let me 
> know if there are some concerns.
>
> For clarification, this requires some jobs in Jenkins to upgrade the minimal 
> version of PyArrow (I cc'ed Shane as well).
>
> PS: I roughly heard that Shane's busy for some work stuff .. but it's kind of 
> important in my perspective.
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-20 Thread Felix Cheung
Reposting for shane here

[SPARK-27178]
https://github.com/apache/spark/commit/342e91fdfa4e6ce5cc3a0da085d1fe723184021b

Is problematic too and it’s not in the rc8 cut

https://github.com/apache/spark/commits/branch-2.4

(Personally I don’t want to delay 2.4.1 either..)


From: Sean Owen 
Sent: Wednesday, March 20, 2019 11:18 AM
To: DB Tsai
Cc: dev
Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

+1 for this RC. The tag is correct, licenses and sigs check out, tests
of the source with most profiles enabled works for me.

On Tue, Mar 19, 2019 at 5:28 PM DB Tsai  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.1.
>
> The vote is open until March 23 PST and passes if a majority +1 PMC votes are 
> cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.1-rc8 (commit 
> 746b3ddee6f7ad3464e326228ea226f5b1f39a41):
> https://github.com/apache/spark/tree/v2.4.1-rc8
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc8-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1318/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc8-docs/
>
> The list of bug fixes going into 2.4.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.1?
> ===
>
> The current list of open tickets targeted at 2.4.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.4.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-07 Thread Felix Cheung
There is SPARK-26604 we are looking into


From: Saisai Shao 
Sent: Wednesday, March 6, 2019 6:05 PM
To: shane knapp
Cc: Stavros Kontopoulos; Sean Owen; DB Tsai; Spark dev list; d_t...@apple.com
Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

Do we have other block/critical issues for Spark 2.4.1 or waiting something to 
be fixed? I roughly searched the JIRA, seems there's no block/critical issues 
marked for 2.4.1.

Thanks
Saisai

shane knapp mailto:skn...@berkeley.edu>> wrote on Thu, Mar 7, 2019 at 4:57 AM:
i'll be popping in to the sig-big-data meeting on the 20th to talk about stuff 
like this.

On Wed, Mar 6, 2019 at 12:40 PM Stavros Kontopoulos 
mailto:stavros.kontopou...@lightbend.com>> 
wrote:
Yes, it's a tough decision, and as we discussed today 
(https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA)
"Kubernetes support window is 9 months, Spark is two years". So we may end up 
with old client versions on branches still supported like 2.4.x in the future.
That gives us no choice but to upgrade, if we want to be on the safe side. We 
have tested 3.0.0 with 1.11 internally and it works, but I don't know what it 
means to run with old clients.


On Wed, Mar 6, 2019 at 7:54 PM Sean Owen 
mailto:sro...@gmail.com>> wrote:
If the old client is basically unusable with the versions of K8S
people mostly use now, and the new client still works with older
versions, I could see including this in 2.4.1.

Looking at https://github.com/fabric8io/kubernetes-client#compatibility-matrix
it seems like the 4.1.1 client is needed for 1.10 and above. However
it no longer supports 1.7 and below.
We have 3.0.x, and versions through 4.0.x of the client support the
same K8S versions, so no real middle ground here.

1.7.0 came out June 2017, it seems. 1.10 was March 2018. Minor release
branches are maintained for 9 months per
https://kubernetes.io/docs/setup/version-skew-policy/

Spark 2.4.0 came in Nov 2018. I suppose we could say it should have
used the newer client from the start as at that point (?) 1.7 and
earlier were already at least 7 months past EOL.
If we update the client in 2.4.1, versions of K8S as recently
'supported' as a year ago won't work anymore. I'm guessing there are
still 1.7 users out there? That wasn't that long ago but if the
project and users generally move fast, maybe not.

Normally I'd say, that's what the next minor release of Spark is for;
update if you want later infra. But there is no Spark 2.5.
I presume downstream distros could modify the dependency easily (?) if
needed and maybe already do. It wouldn't necessarily help end users.

Does the 3.0.x client not work at all with 1.10+, or is it just unsupported?
If it 'basically works but no guarantees' I'd favor not updating. If
it doesn't work at all, hm. That's tough. I think I'd favor updating
the client but think it's a tough call both ways.



On Wed, Mar 6, 2019 at 11:14 AM Stavros Kontopoulos
mailto:stavros.kontopou...@lightbend.com>> 
wrote:
>
> Yes, Shane Knapp has done the work for that already, and the tests pass. I 
> am working on a PR now; I could submit it for the 2.4 branch.
> I understand that this is a major dependency update, but the problem I see is 
> that the client version is so old that I don't think it makes
> much sense for current users who are on k8s 1.10, 1.11, 
> etc. (https://github.com/fabric8io/kubernetes-client#compatibility-matrix; 
> 3.0.0 does not even exist in there).
> I don't know what it means to use that old version with current k8s clusters 
> in terms of bugs etc.




--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-03 Thread Felix Cheung
Once again, I’d have to agree with Sean.

Let’s table the meaning of SPIP for another time, say. I think a few of us are 
trying to understand what “accelerator resource aware” means. As far as I 
know, no one is discussing the API here. But on the Google doc, the JIRA, email, and 
off-list, I have seen questions, questions that are greatly concerning, like 
“oh, the scheduler is allocating a GPU, but how does it affect memory?”, and many more, 
and so I think finer “high level” goals should be defined.





From: Sean Owen 
Sent: Sunday, March 3, 2019 5:24 PM
To: Xiangrui Meng
Cc: Felix Cheung; Xingbo Jiang; Yinan Li; dev; Weichen Xu; Marco Gaido
Subject: Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

I think treating SPIPs as this high-level takes away much of the point
of VOTEing on them. I'm not sure that's even what Reynold is
suggesting elsewhere; we're nowhere near discussing APIs here, just
what 'accelerator aware' even generally means. If the scope isn't
specified, what are we trying to bind with a formal VOTE? The worst I
can say is that this doesn't mean much, so the outcome of the vote
doesn't matter. The general ideas seems fine to me and I support
_something_ like this.

I think the subtext concern is that SPIPs become a way to request
cover to make a bunch of decisions separately, later. This is, to some
extent, how it has to work. A small number of interested parties need
to decide the details coherently, not design the whole thing by
committee, with occasional check-ins for feedback. There's a balance
between that, and using the SPIP as a license to go finish a design
and proclaim it later. That's not anyone's bad-faith intention, just
the risk of deferring so much.

Mesos support is not a big deal by itself but a fine illustration of
the point. That seems like a fine question of scope now, even if the
'how' or some of the 'what' can be decided later. I raised an eyebrow
here at the reply that this was already judged out-of-scope: how much
are we on the same page about this being a point to consider feedback?

If one wants to VOTE on more details, then this vote just doesn't
matter much. Is a future step to VOTE on some more detailed design
doc? Then that's what I call a "SPIP" and it's practically just
semantics.


On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng  wrote:
>
> Hi Felix,
>
> Just to clarify, we are voting on the SPIP, not the companion scoping doc. 
> What is proposed and what we are voting on is to make Spark 
> accelerator-aware. The companion scoping doc and the design sketch are to 
> help demonstrate that what features could be implemented based on the use 
> cases and dev resources the co-authors are aware of. The exact scoping and 
> design would require more community involvement, by no means we are 
> finalizing it in this vote thread.
>
> I think copying the goals and non-goals from the companion scoping doc to the 
> SPIP caused the confusion. As mentioned in the SPIP, we proposed to make two 
> major changes at high level:
>
> At cluster manager level, we update or upgrade cluster managers to include 
> GPU support. Then we expose user interfaces for Spark to request GPUs from 
> them.
> Within Spark, we update its scheduler to understand available GPUs allocated 
> to executors, user task requests, and assign GPUs to tasks properly.
>
> We should keep our vote discussion at this level. It doesn't exclude 
> Mesos/Windows/TPU/FPGA, nor it commits to support YARN/K8s. Through the 
> initial scoping work, we found that we certainly need domain experts to 
> discuss the support of each cluster manager and each accelerator type. But 
> adding more details on Mesos or FPGA doesn't change the SPIP at high level. 
> So we concluded the initial scoping, shared the docs, and started this vote.


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-03 Thread Felix Cheung
Great points Sean.

Here’s what I’d like to suggest to move forward.
Split the SPIP.

If we want to propose upfront homogeneous allocation (aka spark.task.gpus), 
this should be an SPIP on its own. For instance, I really agree with Sean (like 
I did in the discuss thread) that we can’t simply declare Mesos a non-goal. We have 
enough maintenance issues as it is. And IIRC there was a PR proposed for K8S; 
I’d like to see that discussion brought here as well.

IMO upfront allocation is less useful; specifically, it is too expensive for large 
jobs.

If we want per-stage resource requests, this should be a full SPIP with a lot more 
details to be hashed out. Our work with Horovod brings a few specific and 
critical requirements on how this should work with distributed DL, and I would 
like to see those addressed.

In any case I’d like to see more consensus before moving forward, until then 
I’m going to -1 this.




From: Sean Owen 
Sent: Sunday, March 3, 2019 8:15 AM
To: Felix Cheung
Cc: Xingbo Jiang; Yinan Li; dev; Weichen Xu; Marco Gaido
Subject: Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

I'm for this in general, at least a +0. I do think this has to have a
story for what to do with the existing Mesos GPU support, which sounds
entirely like the spark.task.gpus config here. Maybe it's just a
synonym? that kind of thing.

Requesting different types of GPUs might be a bridge too far, but,
that's a P2 detail that can be hashed out later. (For example, if a
v100 is available and k80 was requested, do you use it or fail? is the
right level of resource control GPU RAM and cores?)

The per-stage resource requirements sounds like the biggest change;
you can even change CPU cores requested per pandas UDF? and what about
memory then? We'll see how that shakes out. That's the only thing I'm
kind of unsure about in this proposal.

On Sat, Mar 2, 2019 at 9:35 PM Felix Cheung  wrote:
>
> I’m very hesitant with this.
>
> I don’t want to vote -1, because I personally think it’s important to do, but 
> I’d like to see more discussion points addressed and not voting completely on 
> the spirit of it.
>
> First, SPIP doesn’t match the format of SPIP proposed and agreed on. (Maybe 
> this is a minor point and perhaps we should also vote to update the SPIP 
> format)
>
> Second, there are multiple pdf/google doc and JIRA. And I think for example 
> the design sketch is not covering the same points as the updated SPIP doc? It 
> would help to make them align before moving forward.
>
> Third, the proposal touches on some fairly core and sensitive components, 
> like the scheduler, and I think more discussions are necessary. We have a few 
> comments there and in the JIRA.
>
>
>
> 
> From: Marco Gaido 
> Sent: Saturday, March 2, 2019 4:18 AM
> To: Weichen Xu
> Cc: Yinan Li; Tom Graves; dev; Xingbo Jiang
> Subject: Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling
>
> +1, a critical feature for AI/DL!
>
>> On Sat, Mar 2, 2019 at 05:14, Weichen Xu 
>>  wrote:
>>
>> +1, nice feature!
>>
>> On Sat, Mar 2, 2019 at 6:11 AM Yinan Li  wrote:
>>>
>>> +1
>>>
>>> On Fri, Mar 1, 2019 at 12:37 PM Tom Graves  
>>> wrote:
>>>>
>>>> +1 for the SPIP.
>>>>
>>>> Tom
>>>>
>>>> On Friday, March 1, 2019, 8:14:43 AM CST, Xingbo Jiang 
>>>>  wrote:
>>>>
>>>>
>>>> Hi all,
>>>>
>>>> I want to call for a vote of SPARK-24615. It improves Spark by making it 
>>>> aware of GPUs exposed by cluster managers, and hence Spark can match GPU 
>>>> resources with user task requests properly. The proposal and production 
>>>> doc was made available on dev@ to collect input. Your can also find a 
>>>> design sketch at SPARK-27005.
>>>>
>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>
>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>> +0: Don't really care.
>>>> -1: I don't think this is a good idea because of the following technical 
>>>> reasons.
>>>>
>>>> Thank you!
>>>>
>>>> Xingbo


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-02 Thread Felix Cheung
I’m very hesitant with this.

I don’t want to vote -1, because I personally think it’s important to do, but 
I’d like to see more discussion points addressed and not voting completely on 
the spirit of it.

First, SPIP doesn’t match the format of SPIP proposed and agreed on. (Maybe 
this is a minor point and perhaps we should also vote to update the SPIP format)

Second, there are multiple pdf/google doc and JIRA. And I think for example the 
design sketch is not covering the same points as the updated SPIP doc? It would 
help to make them align before moving forward.

Third, the proposal touches on some fairly core and sensitive components, like 
the scheduler, and I think more discussions are necessary. We have a few 
comments there and in the JIRA.




From: Marco Gaido 
Sent: Saturday, March 2, 2019 4:18 AM
To: Weichen Xu
Cc: Yinan Li; Tom Graves; dev; Xingbo Jiang
Subject: Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

+1, a critical feature for AI/DL!

On Sat, Mar 2, 2019 at 05:14, Weichen Xu 
mailto:weichen...@databricks.com>> wrote:
+1, nice feature!

On Sat, Mar 2, 2019 at 6:11 AM Yinan Li 
mailto:liyinan...@gmail.com>> wrote:
+1

On Fri, Mar 1, 2019 at 12:37 PM Tom Graves  wrote:
+1 for the SPIP.

Tom

On Friday, March 1, 2019, 8:14:43 AM CST, Xingbo Jiang 
mailto:jiangxb1...@gmail.com>> wrote:


Hi all,

I want to call for a vote of 
SPARK-24615. It improves 
Spark by making it aware of GPUs exposed by cluster managers, and hence Spark 
can match GPU resources with user task requests properly. The 
proposal
 and production 
doc
 was made available on dev@ to collect input. Your can also find a design 
sketch at SPARK-27005.

The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical 
reasons.

Thank you!

Xingbo


Re: SPIP: Accelerator-aware Scheduling

2019-03-02 Thread Felix Cheung
+1 on mesos - what Sean says


From: Andrew Melo 
Sent: Friday, March 1, 2019 9:19 AM
To: Xingbo Jiang
Cc: Sean Owen; Xiangrui Meng; dev
Subject: Re: SPIP: Accelerator-aware Scheduling

Hi,

On Fri, Mar 1, 2019 at 9:48 AM Xingbo Jiang  wrote:
>
> Hi Sean,
>
> To support GPU scheduling on YARN clusters, we have to update the Hadoop 
> version to 3.1.2+. However, if we decide not to upgrade Hadoop to that 
> version or beyond for Spark 3.0, then we just have to disable GPU scheduling 
> on YARN or fall back; users will still be able to use that feature with 
> Standalone or Kubernetes clusters.
>
> We didn't include Mesos support in the current SPIP because we didn't receive 
> use cases that require GPU scheduling on Mesos clusters; however, we can still 
> add Mesos support in the future if we observe valid use cases.

First time caller, long time listener. We have GPUs in our Mesos-based
Spark cluster, and it would be nice to use them with Spark-based
GPU-enabled frameworks (our use case is deep learning applications).

Cheers
Andrew

>
> Thanks!
>
> Xingbo
>
>> Sean Owen  wrote on Fri, Mar 1, 2019 at 10:39 PM:
>>
>> Two late breaking questions:
>>
>> This basically requires Hadoop 3.1 for YARN support?
>> Mesos support is listed as a non goal but it already has support for 
>> requesting GPUs in Spark. That would be 'harmonized' with this 
>> implementation even if it's not extended?
>>
>> On Fri, Mar 1, 2019, 7:48 AM Xingbo Jiang  wrote:
>>>
>>> I think we are aligned on the commitment, I'll start a vote thread for this 
>>> shortly.
>>>
>>>> Xiangrui Meng  wrote on Wed, Feb 27, 2019 at 6:47 AM:

 In case there are issues visiting Google doc, I attached PDF files to the 
 JIRA.

 On Tue, Feb 26, 2019 at 7:41 AM Xingbo Jiang  wrote:
>
> Hi all,
>
> I want to send a revised SPIP on implementing Accelerator (GPU)-aware 
> Scheduling. It improves Spark by making it aware of GPUs exposed by 
> cluster managers, and hence Spark can match GPU resources with user task 
> requests properly. If you have scenarios that need to run 
> workloads (DL/ML/signal processing, etc.) on a Spark cluster with GPU nodes, 
> please help review and check how it fits into your use cases. Your 
> feedback would be greatly appreciated!
>
> # Links to SPIP and Product doc:
>
> * Jira issue for the SPIP: 
> https://issues.apache.org/jira/browse/SPARK-24615
> * Google Doc: 
> https://docs.google.com/document/d/1C4J_BPOcSCJc58HL7JfHtIzHrjU0rLRdQM3y7ejil64/edit?usp=sharing
> * Product Doc: 
> https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit?usp=sharing
>
> Thank you!
>
> Xingbo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

2019-02-24 Thread Felix Cheung
I hear three topics in this thread

1. I don’t think we should remove string. Column and string can both be “type 
safe”. And I would agree we don’t *need* to break API compatibility here.

2. Gaps in the Python API. Extending on #1, we should definitely be consistent and 
add string as a param where it is missing.

3. Scala API for string - hard to say, but it makes sense if only for 
consistency. Though I can also see the argument for Column-only in Scala. String 
might be more natural in Python and much less significant in Scala because of the 
$”foo” notation.

(My 2 c)



From: Sean Owen 
Sent: Sunday, February 24, 2019 6:59 AM
To: André Mello
Cc: dev
Subject: Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

I just commented on the PR -- I personally don't think it's worth
removing support for, say, max("foo") over max(col("foo")) or
max($"foo") in Scala. We can make breaking changes in Spark 3 but this
seems like it would unnecessarily break a lot of code. The string arg
is more concise in Python and I can't think of cases where it's
particularly ambiguous or confusing; on the contrary it's more natural
coming from SQL.

What we do have are inconsistencies and errors in support of string vs
Column as fixed in the PR. I was surprised to see that
df.select(abs("col")) throws an error while df.select(sqrt("col"))
doesn't. I think that's easy to fix on the Python side. Really I think
the question is: do we need to add methods like "def abs(String)" and
more in Scala? that would remain inconsistent even if the Pyspark side
is fixed.
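
To make the Scala-side inconsistency concrete, here is a minimal sketch (assuming
the 2.4 behavior described above; the tiny example DataFrame is just for
illustration, not part of the PR under discussion):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{abs, col, sqrt}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// A one-column DataFrame to call functions on.
val df = Seq(-1.0, 4.0).toDF("x")

df.select(sqrt("x")).show()      // fine: sqrt has a String (column name) overload
// df.select(abs("x")).show()    // does not compile: abs only accepts a Column
df.select(abs(col("x"))).show()  // passing an explicit Column works for every function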

On Sun, Feb 24, 2019 at 8:54 AM André Mello  wrote:
>
> # Context
>
> This comes from [SPARK-26979], which became PR #23879 and then PR
> #23882. The following reflects all the findings made so far.
>
> # Description
>
> Currently, in the Scala API, some SQL functions have two overloads,
> one taking a string that names the column to be operated on, the other
> taking a proper Column object. This allows for two patterns of calling
> these functions, which is a source of inconsistency and generates
> confusion for new users, since it is hard to predict which functions
> will take a column name or not.
>
> The PySpark API partially solves this problem by internally converting
> the argument to a Column object prior to passing it through to the
> underlying JVM implementation. This allows for a consistent use of
> name literals across the API, except for a few violations:
>
> - lower()
> - upper()
> - abs()
> - bitwiseNOT()
> - ltrim()
> - rtrim()
> - trim()
> - ascii()
> - base64()
> - unbase64()
>
> These violations happen because for a subset of the SQL functions,
> PySpark uses a functional mechanism (`_create_function`) to directly
> call the underlying JVM equivalent by name, thus skipping the
> conversion step. In most cases the column name pattern still works
> because the Scala API has its own support for string arguments, but
> the aforementioned functions are also exceptions there.
>
> My proposal was to solve this problem by adding the string support
> where it was missing in the PySpark API. Since this is a purely
> additive change, it doesn't break past code. Additionally, I find the
> API sugar to be a positive feature, since code like `max("foo")` is
> more concise and readable than `max(col("foo"))`. It adheres to the
> DRY philosophy and is consistent with Python's preference for
> readability over type protection.
>
> However, upon submission of the PR, a discussion was started about
> whether it wouldn't be better to entirely deprecate string support
> instead - in particular with major release 3.0 in mind. The reasoning,
> as I understood it, was that this approach is more explicit and type
> safe, which is preferred in Java/Scala, plus it reduces the API
> surface area - and the Python API should be consistent with the others
> as well.
>
> Upon request by @HyukjinKwon I'm submitting this matter for discussion
> by this mailing list.
>
> # Summary
>
> There is a problem with inconsistency in the Scala/Python SQL API,
> where sometimes you can use a column name string as a proxy, and
> sometimes you have to use a proper Column object. To solve it there
> are two approaches - to remove the string support entirely, or to add
> it where it is missing. Which approach is best?
>
> Hope this is clear.
>
> -- André.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-21 Thread Felix Cheung
I merged the fix to 2.4.



From: Felix Cheung 
Sent: Wednesday, February 20, 2019 9:34 PM
To: DB Tsai; Spark dev list
Cc: Cesar Delgado
Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

Could you hold for a bit - I have one more fix to get in



From: d_t...@apple.com on behalf of DB Tsai 
Sent: Wednesday, February 20, 2019 12:25 PM
To: Spark dev list
Cc: Cesar Delgado
Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

Okay. Let's fail rc2, and I'll prepare rc3 with SPARK-26859.

DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc

> On Feb 20, 2019, at 12:11 PM, Marcelo Vanzin  
> wrote:
>
> Just wanted to point out that
> https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC,
> and is marked as a correctness bug. (The fix is in the 2.4 branch,
> just not in rc2.)
>
> On Wed, Feb 20, 2019 at 12:07 PM DB Tsai  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 2.4.1.
>>
>> The vote is open until Feb 24 PST and passes if a majority +1 PMC votes are 
>> cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.4.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.4.1-rc2 (commit 
>> 229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
>> https://github.com/apache/spark/tree/v2.4.1-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1299/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/
>>
>> The list of bug fixes going into 2.4.1 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.4.1?
>> ===
>>
>> The current list of open tickets targeted at 2.4.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> Version/s" = 2.4.1
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>> DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-20 Thread Felix Cheung
Could you hold for a bit - I have one more fix to get in



From: d_t...@apple.com on behalf of DB Tsai 
Sent: Wednesday, February 20, 2019 12:25 PM
To: Spark dev list
Cc: Cesar Delgado
Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

Okay. Let's fail rc2, and I'll prepare rc3 with SPARK-26859.

DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc

> On Feb 20, 2019, at 12:11 PM, Marcelo Vanzin  
> wrote:
>
> Just wanted to point out that
> https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC,
> and is marked as a correctness bug. (The fix is in the 2.4 branch,
> just not in rc2.)
>
> On Wed, Feb 20, 2019 at 12:07 PM DB Tsai  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 2.4.1.
>>
>> The vote is open until Feb 24 PST and passes if a majority +1 PMC votes are 
>> cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.4.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.4.1-rc2 (commit 
>> 229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
>> https://github.com/apache/spark/tree/v2.4.1-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1299/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/
>>
>> The list of bug fixes going into 2.4.1 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.4.1?
>> ===
>>
>> The current list of open tickets targeted at 2.4.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> Version/s" = 2.4.1
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>> DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-19 Thread Felix Cheung
+1



From: Ryan Blue 
Sent: Tuesday, February 19, 2019 9:34 AM
To: Jamison Bennett
Cc: dev
Subject: Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

+1

On Tue, Feb 19, 2019 at 8:41 AM Jamison Bennett 
 wrote:
+1 (non-binding)


Jamison Bennett

Cloudera Software Engineer

jamison.benn...@cloudera.com

515 Congress Ave, Suite 1212   |   Austin, TX   |   78701


On Tue, Feb 19, 2019 at 10:33 AM Maryann Xue 
mailto:maryann@databricks.com>> wrote:
+1

On Mon, Feb 18, 2019 at 10:46 PM John Zhuge 
mailto:jzh...@apache.org>> wrote:
+1

On Mon, Feb 18, 2019 at 8:43 PM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
+1

Dongjoon.

On 2019/02/19 04:12:23, Wenchen Fan 
mailto:cloud0...@gmail.com>> wrote:
> +1
>
> On Tue, Feb 19, 2019 at 10:50 AM Ryan Blue 
> wrote:
>
> > Hi everyone,
> >
> > It looks like there is consensus on the proposal, so I'd like to start a
> > vote thread on the SPIP for identifiers in multi-catalog Spark.
> >
> > The doc is available here:
> > https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing
> >
> > Please vote in the next 3 days.
> >
> > [ ] +1: Accept the proposal as an official SPIP
> > [ ] +0
> > [ ] -1: I don't think this is a good idea because ...
> >
> >
> > Thanks!
> >
> > rb
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org



--
John Zhuge


--
Ryan Blue
Software Engineer
Netflix


Re: Missing SparkR in CRAN

2019-02-19 Thread Felix Cheung
We are waiting for update from CRAN. Please hold on.



From: Takeshi Yamamuro 
Sent: Tuesday, February 19, 2019 2:53 PM
To: dev
Subject: Re: Missing SparkR in CRAN

Hi, guys

It seems SparkR is still not found in CRAN; is there any problem
with resubmitting it?


On Fri, Jan 25, 2019 at 1:41 AM Felix Cheung 
mailto:felixche...@apache.org>> wrote:
Yes, it was discussed on dev@. We are waiting for 2.3.3 to be released before resubmitting.


On Thu, Jan 24, 2019 at 5:33 AM Hyukjin Kwon 
mailto:gurwls...@gmail.com>> wrote:
Hi all,

I happened to find SparkR is missing in CRAN. See 
https://cran.r-project.org/web/packages/SparkR/index.html

I remember seeing some threads about this on the spark-dev mailing list a long
time ago, IIRC. Is a fix in progress somewhere, or is it something I
misunderstood?


--
---
Takeshi Yamamuro


Re: Vectorized R gapply[Collect]() implementation

2019-02-10 Thread Felix Cheung
This is super awesome!



From: Shivaram Venkataraman 
Sent: Saturday, February 9, 2019 8:33 AM
To: Hyukjin Kwon
Cc: dev; Felix Cheung; Bryan Cutler; Liang-Chi Hsieh; Shivaram Venkataraman
Subject: Re: Vectorized R gapply[Collect]() implementation

Those speedups look awesome! Great work Hyukjin!

Thanks
Shivaram

On Sat, Feb 9, 2019 at 7:41 AM Hyukjin Kwon  wrote:
>
> Guys, as a continuation of the Arrow optimization for R DataFrame to Spark 
> DataFrame,
>
> I am trying to make a vectorized gapply[Collect] implementation as an 
> experiment, like vectorized Pandas UDFs.
>
> It brought an 820%+ performance improvement. See 
> https://github.com/apache/spark/pull/23746
>
> Please come and take a look if you're interested in R APIs :D. I have already 
> cc'ed some people I know but please come, review and discuss for both Spark 
> side and Arrow side.
>
> This Arrow optimization job is being done under 
> https://issues.apache.org/jira/browse/SPARK-26759 . Please feel free to take 
> one if anyone of you is interested in it.
>
> Thanks.


Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-10 Thread Felix Cheung
+1
See note

Tested build from source and running tests.
Also tested SparkR basic - ran more tests in RC1 and checked there was no 
change in R since. So I’m ok with that.

Note:
1. Opened https://issues.apache.org/jira/browse/SPARK-26855 on the 
SparkSubmitSuite failure - (thanks to Sean’s tip) I don’t think it’s blocker.

2. Ran into a failure in HiveExternalCatalogVersionsSuite, but it passed on the 2nd 
run. (How reliable is archive.apache.org? It failed for
me before.)
WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to 
download Spark 2.3.2 from 
https://archive.apache.org/dist/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz:
 Socket closed
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
Exception encountered when invoking run on a nested suite - Unable to download 
Spark 2.3.2 (HiveExternalCatalogVersionsSuite.scala:97)

3. There are a fair bit of changes in Python and SQL - someone should test that

4. Last time the k8s integration tests were broken; they aren’t built by default. 
Could someone test with -Pkubernetes -Pkubernetes-integration-tests?

SPARK-26482 broke the integration tests



From: John Zhuge 
Sent: Saturday, February 9, 2019 6:25 PM
To: Felix Cheung
Cc: Takeshi Yamamuro; Spark dev list
Subject: Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

Not me. I am running zulu8, maven, and hadoop-2.7.

On Sat, Feb 9, 2019 at 5:42 PM Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
One test in SparkSubmitSuite is consistently failing for me. Anyone seeing that?



From: Takeshi Yamamuro mailto:linguin@gmail.com>>
Sent: Saturday, February 9, 2019 5:25 AM
To: Spark dev list
Subject: Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

Sorry, but I forgot to check `-Pdocker-integration-tests` for the JDBC 
integration tests.
I ran these tests and checked that they passed.

On Sat, Feb 9, 2019 at 5:26 PM Herman van Hovell 
mailto:her...@databricks.com>> wrote:
I count 2 binding votes :)...

Op vr 8 feb. 2019 om 22:36 schreef Felix Cheung 
mailto:felixcheun...@hotmail.com>>
Nope, still only 1 binding vote ;)



From: Mark Hamstra mailto:m...@clearstorydata.com>>
Sent: Friday, February 8, 2019 7:30 PM
To: Marcelo Vanzin
Cc: Takeshi Yamamuro; Spark dev list
Subject: Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

There are 2. C'mon Marcelo, you can make it 3!

On Fri, Feb 8, 2019 at 5:03 PM Marcelo Vanzin  
wrote:
Hi Takeshi,

Since we only really have one +1 binding vote, do you want to extend
this vote a bit?

I've been stuck on a few things but plan to test this (setting things
up now), but it probably won't happen before the deadline.

On Tue, Feb 5, 2019 at 5:07 PM Takeshi Yamamuro 
mailto:linguin@gmail.com>> wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.3.3.
>
> The vote is open until February 8 6:00PM (PST) and passes if a majority +1 
> PMC votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.3.3
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.3.3-rc2 (commit 
> 66fd9c34bf406a4b5f86605d06c9607752bd637a):
> https://github.com/apache/spark/tree/v2.3.3-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1298/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc2-docs/
>
> The list of bug fixes going into 2.3.3 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12343759
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.3?
> ==

Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-09 Thread Felix Cheung
One test in SparkSubmitSuite is consistently failing for me. Anyone seeing that?



From: Takeshi Yamamuro 
Sent: Saturday, February 9, 2019 5:25 AM
To: Spark dev list
Subject: Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

Sorry, but I forgot to check `-Pdocker-integration-tests` for the JDBC 
integration tests.
I ran these tests and checked that they passed.

On Sat, Feb 9, 2019 at 5:26 PM Herman van Hovell 
mailto:her...@databricks.com>> wrote:
I count 2 binding votes :)...

Op vr 8 feb. 2019 om 22:36 schreef Felix Cheung 
mailto:felixcheun...@hotmail.com>>
Nope, still only 1 binding vote ;)



From: Mark Hamstra mailto:m...@clearstorydata.com>>
Sent: Friday, February 8, 2019 7:30 PM
To: Marcelo Vanzin
Cc: Takeshi Yamamuro; Spark dev list
Subject: Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

There are 2. C'mon Marcelo, you can make it 3!

On Fri, Feb 8, 2019 at 5:03 PM Marcelo Vanzin  
wrote:
Hi Takeshi,

Since we only really have one +1 binding vote, do you want to extend
this vote a bit?

I've been stuck on a few things but plan to test this (setting things
up now), but it probably won't happen before the deadline.

On Tue, Feb 5, 2019 at 5:07 PM Takeshi Yamamuro 
mailto:linguin@gmail.com>> wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.3.3.
>
> The vote is open until February 8 6:00PM (PST) and passes if a majority +1 
> PMC votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.3.3
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.3.3-rc2 (commit 
> 66fd9c34bf406a4b5f86605d06c9607752bd637a):
> https://github.com/apache/spark/tree/v2.3.3-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1298/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc2-docs/
>
> The list of bug fixes going into 2.3.3 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12343759
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.3?
> ===
>
> The current list of open tickets targeted at 2.3.3 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.3.3
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> P.S.
> I checked all the tests passed in the Amazon Linux 2 AMI;
> $ java -version
> openjdk version "1.8.0_191"
> OpenJDK Runtime Environment (build 1.8.0_191-b12)
> OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
> $ ./build/mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Psparkr 
> test
>
> --
> ---
> Takeshi Yamamuro



--
Marcelo

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org<mailto:dev-unsubscr...@spark.apache.org>



--
---
Takeshi Yamamuro


Re: [DISCUSS] Change default executor log URLs for YARN

2019-02-08 Thread Felix Cheung
For this case I’d agree with Ryan. I haven’t followed this thread and the 
details of the change since it’s way too much for me to consume “in my free 
time” (which is 0 nowadays), but I’m pretty sure the existing behavior works for 
us, and very likely we don’t want it to change because of some proxy magic we do 
behind the scenes.

I’d also agree config flag is not always the best way but in this case the 
existing established behavior doesn’t seem broken...

I could be wrong though.



From: Ryan Blue 
Sent: Friday, February 8, 2019 4:39 PM
To: Sean Owen
Cc: Jungtaek Lim; dev
Subject: Re: [DISCUSS] Change default executor log URLs for YARN

I'm not sure that many people need this, so it is hard to make a decision. I'm 
reluctant to change the current behavior if the result is a new papercut to 99% 
of users and a win for 1%. The suggested change will work for 100% of users, so 
if we don't want a flag then we should go with that. But I would certainly want 
to turn it off in our environment because it doesn't provide any value for us 
and would annoy our users.

On Fri, Feb 8, 2019 at 4:18 PM Sean Owen 
mailto:sro...@gmail.com>> wrote:
Is a flag needed? You know me, I think flags are often failures of
design, or disagreement punted to the user. I can understand retaining
old behavior under a flag where the behavior change could be
problematic for some users or facilitate migration, but this is just a
change to some UI links, no? The underlying links don't change.
On Fri, Feb 8, 2019 at 5:41 PM Ryan Blue 
mailto:rb...@netflix.com>> wrote:
>
> I suggest using the current behavior as the default and add a flag to 
> implement the behavior you're suggesting: to link to the logs path in YARN 
> instead of directly to stderr and stdout.
>
> On Fri, Feb 8, 2019 at 3:33 PM Jungtaek Lim 
> mailto:kabh...@gmail.com>> wrote:
>>
>> Ryan,
>>
>> actually I'm not clear about your suggestion. For me three possible options 
>> here:
>>
>> 1. If we want to let users be able to completely rewrite log urls, that's 
>> SPARK-26792. For SHS we already addressed it.
>> 2. We could let users turning on/off flag option to just get one url or 
>> default two stdout/stderr urls.
>> 3. We could let users enumerate file names they want to link, and create log 
>> links for each file.
>>
>> Which one do you suggest?
>


--
Ryan Blue
Software Engineer
Netflix


Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-08 Thread Felix Cheung
Nope, still only 1 binding vote ;)



From: Mark Hamstra 
Sent: Friday, February 8, 2019 7:30 PM
To: Marcelo Vanzin
Cc: Takeshi Yamamuro; Spark dev list
Subject: Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

There are 2. C'mon Marcelo, you can make it 3!

On Fri, Feb 8, 2019 at 5:03 PM Marcelo Vanzin  
wrote:
Hi Takeshi,

Since we only really have one +1 binding vote, do you want to extend
this vote a bit?

I've been stuck on a few things but plan to test this (setting things
up now), but it probably won't happen before the deadline.

On Tue, Feb 5, 2019 at 5:07 PM Takeshi Yamamuro 
mailto:linguin@gmail.com>> wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.3.3.
>
> The vote is open until February 8 6:00PM (PST) and passes if a majority +1 
> PMC votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.3.3
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.3.3-rc2 (commit 
> 66fd9c34bf406a4b5f86605d06c9607752bd637a):
> https://github.com/apache/spark/tree/v2.3.3-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1298/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc2-docs/
>
> The list of bug fixes going into 2.3.3 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12343759
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.3?
> ===
>
> The current list of open tickets targeted at 2.3.3 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.3.3
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> P.S.
> I checked all the tests passed in the Amazon Linux 2 AMI;
> $ java -version
> openjdk version "1.8.0_191"
> OpenJDK Runtime Environment (build 1.8.0_191-b12)
> OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
> $ ./build/mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Psparkr 
> test
>
> --
> ---
> Takeshi Yamamuro



--
Marcelo

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-04 Thread Felix Cheung
Likely need a shim (which we should have anyway) because of namespace/import 
changes.

I’m huge +1 on this.



From: Hyukjin Kwon 
Sent: Monday, February 4, 2019 12:27 PM
To: Xiao Li
Cc: Sean Owen; Felix Cheung; Ryan Blue; Marcelo Vanzin; Yuming Wang; dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

I should check the details and feasibility myself, but to me it sounds fine 
if it doesn't need significant extra effort.

On Tue, 5 Feb 2019, 4:15 am Xiao Li 
mailto:gatorsm...@gmail.com> wrote:
Yes. When our support/integration with Hive 2.x becomes stable, we can do it in 
Hadoop 2.x profile too, if needed. The whole proposal is to minimize the risk 
and ensure the release stability and quality.

Hyukjin Kwon mailto:gurwls...@gmail.com>> wrote on Monday, February 4, 2019 at 12:01 PM:
Xiao, to check if I understood correctly, do you mean the below?

1. Use our fork with Hadoop 2.x profile for now, and use Hive 2.x with Hadoop 
3.x profile.
2. Make another newer version of thrift server by Hive 2.x(?) in Spark side.
3. Target the transition to Hive 2.x completely and slowly later in the future.



On Tue, Feb 5, 2019 at 1:16 AM, Xiao Li 
mailto:gatorsm...@gmail.com>> wrote:
To reduce the impact and risk of upgrading Hive execution JARs, we can just 
upgrade the built-in Hive to 2.x when using the profile of Hadoop 3.x. The 
support of Hadoop 3 will still be experimental in our next release. That means 
the impact and risk are very minimal for most users who are still using the Hadoop 
2.x profile.

The code changes in Spark thrift server are massive. It is risky and hard to 
review. The original code of our Spark thrift server is from Hive-service 
1.2.1. To reduce the risk of the upgrade, we can inline the new version. In the 
future, we can completely get rid of the thrift server and build our own 
high-performance JDBC server.

Does this proposal sound good to you?

In the last two weeks, Yuming was trying this proposal. Now, he is on vacation. 
In China, today is already the lunar New Year. I would not expect him to reply 
to this email in the next 7 days.

Cheers,

Xiao



Sean Owen mailto:sro...@gmail.com>> wrote on Monday, February 4, 2019 at 7:56 AM:
I was unclear from this thread what the objection to these PRs is:

https://github.com/apache/spark/pull/23552
https://github.com/apache/spark/pull/23553

Would we like to specifically discuss whether to merge these or not? I
hear support for it, concerns about continuing to support Hive too,
but I wasn't clear whether those concerns specifically argue against
these PRs.


On Fri, Feb 1, 2019 at 2:03 PM Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
>
> What’s the update and next step on this?
>
> We have real users getting blocked by this issue.
>
>
> 
> From: Xiao Li mailto:gatorsm...@gmail.com>>
> Sent: Wednesday, January 16, 2019 9:37 AM
> To: Ryan Blue
> Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming Wang; dev
> Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>
> Thanks for your feedbacks!
>
> Working with Yuming to reduce the risk of stability and quality. Will keep 
> you posted when the proposal is ready.
>
> Cheers,
>
> Xiao
>
> Ryan Blue mailto:rb...@netflix.com>> wrote on Wednesday, January 16, 2019 at 9:27 AM:
>>
>> +1 for what Marcelo and Hyukjin said.
>>
>> In particular, I agree that we can't expect Hive to release a version that 
>> is now more than 3 years old just to solve a problem for Spark. Maybe that 
>> would have been a reasonable ask instead of publishing a fork years ago, but 
>> I think this is now Spark's problem.
>>
>> On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin 
>> mailto:van...@cloudera.com>> wrote:
>>>
>>> +1 to that. HIVE-16391 by itself means we're giving up things like
>>> Hadoop 3, and we're also putting the burden on the Hive folks to fix a
>>> problem that we created.
>>>
>>> The current PR is basically a Spark-side fix for that bug. It does
>>> mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
>>> it's really the right path to take here.
>>>
>>> On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon 
>>> mailto:gurwls...@gmail.com>> wrote:
>>> >
>>> > Resolving HIVE-16391 means Hive releasing a 1.2.x that contains the fixes 
>>> > from our Hive fork (correct me if I am mistaken).
>>> >
>>> > To be honest, and as a personal opinion, that basically 
>>> > asks Hive to take care of Spark's dependency.
>>> > Hive looks to be going ahead with 3.1.x, and no one would use a newer release 
>>> > of 1.2.x. In practice, Spark doesn't make 1.6.x releases anymore, for 
>>> > instance,
>>> >

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-01 Thread Felix Cheung
What’s the update and next step on this?

We have real users getting blocked by this issue.



From: Xiao Li 
Sent: Wednesday, January 16, 2019 9:37 AM
To: Ryan Blue
Cc: Marcelo Vanzin; Hyukjin Kwon; Sean Owen; Felix Cheung; Yuming Wang; dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

Thanks for your feedbacks!

Working with Yuming to reduce the risk of stability and quality. Will keep you 
posted when the proposal is ready.

Cheers,

Xiao

Ryan Blue mailto:rb...@netflix.com>> wrote on Wednesday, January 16, 2019 at 9:27 AM:
+1 for what Marcelo and Hyukjin said.

In particular, I agree that we can't expect Hive to release a version that is 
now more than 3 years old just to solve a problem for Spark. Maybe that would 
have been a reasonable ask instead of publishing a fork years ago, but I think 
this is now Spark's problem.

On Tue, Jan 15, 2019 at 9:02 PM Marcelo Vanzin 
mailto:van...@cloudera.com>> wrote:
+1 to that. HIVE-16391 by itself means we're giving up things like
Hadoop 3, and we're also putting the burden on the Hive folks to fix a
problem that we created.

The current PR is basically a Spark-side fix for that bug. It does
mean also upgrading Hive (which gives us Hadoop 3, yay!), but I think
it's really the right path to take here.

On Tue, Jan 15, 2019 at 6:32 PM Hyukjin Kwon 
mailto:gurwls...@gmail.com>> wrote:
>
> Resolving HIVE-16391 means Hive releasing a 1.2.x that contains the fixes from 
> our Hive fork (correct me if I am mistaken).
>
> To be honest, and as a personal opinion, that basically asks 
> Hive to take care of Spark's dependency.
> Hive looks to be going ahead with 3.1.x, and no one would use a newer release of 
> 1.2.x. In practice, Spark doesn't make 1.6.x releases anymore, for instance.
>
> Frankly, my impression was that it's, honestly, our mistake to fix. Since the 
> Spark community is big enough, I was thinking we should try to fix it by 
> ourselves first.
> I am not saying upgrading is the only way to get through this, but I think we 
> should at least try first, and see what's next.
>
> It does, yes, sound riskier to upgrade it on our side, but I think it's 
> worth checking and trying to see if it's possible.
> I think upgrading the dependency is a more standard approach than using the 
> fork or letting the Hive side release another 1.2.x.
>
> If we fail to upgrade it for critical or inevitable reasons somehow, yes, we 
> could find an alternative, but that basically means
> we're going to stay on 1.2.x for, at least, a long time (say .. until Spark 
> 4.0.0?).
>
> I know it has somehow become sensitive, but to be honest with 
> myself, I think we should give it a try.
>


--
Marcelo


--
Ryan Blue
Software Engineer
Netflix


Re: Missing SparkR in CRAN

2019-01-24 Thread Felix Cheung
Yes, it was discussed on dev@. We are waiting for 2.3.3 to be released before
resubmitting.


On Thu, Jan 24, 2019 at 5:33 AM Hyukjin Kwon  wrote:

> Hi all,
>
> I happened to find SparkR is missing in CRAN. See
> https://cran.r-project.org/web/packages/SparkR/index.html
>
> I remember seeing some threads about this on the spark-dev mailing list a long
> time ago, IIRC. Is a fix in progress somewhere, or is it something I
> misunderstood?
>


Re: Make proactive check for closure serializability optional?

2019-01-21 Thread Felix Cheung
Agreed on the pros/cons, especially since the driver could be the data science notebook.
Is it worthwhile making it configurable?



From: Sean Owen 
Sent: Monday, January 21, 2019 10:42 AM
To: Reynold Xin
Cc: dev
Subject: Re: Make proactive check for closure serializability optional?

None except the bug / PR I linked to, which is really just a bug in
the RowMatrix implementation; a 2GB closure isn't reasonable.
I doubt it's much overhead in the common case, because closures are
small and this extra check happens once per execution of the closure.

I can also imagine middle-ground cases where people are dragging along
largeish 10MB closures (like, a model or some data) and this could add
non-trivial memory pressure on the driver. They should be broadcasting
those things, sure.

Given just that I'd leave it alone, but was wondering if anyone had
ever had the same thought or more arguments that it should be
disable-able. In 'production' one would imagine all the closures do
serialize correctly and so this is just a bit overhead that could be
skipped.
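
For concreteness, here is a minimal, hypothetical Scala sketch of where the check
surfaces (the NotSerializableHelper class, the app name, and the local master are
made up for illustration):

class NotSerializableHelper {        // deliberately does not extend Serializable
  val offset = 1
}

object ClosureCheckSketch {
  import org.apache.spark.{SparkConf, SparkContext, SparkException}

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("closure-check-sketch"))
    val helper = new NotSerializableHelper

    try {
      // The closure captures `helper`, which cannot be serialized, so the
      // proactive check (SparkContext.clean / ClosureCleaner) throws
      // "Task not serializable" here, at transformation time ...
      val rdd = sc.parallelize(1 to 10).map(_ + helper.offset)
      // ... rather than only failing later, when an action runs the job.
      rdd.count()
    } catch {
      case e: SparkException => println(s"Failed eagerly: ${e.getMessage}")
    } finally {
      sc.stop()
    }
  }
}

If the proactive check were disabled, as discussed above, the same failure would
only show up at rdd.count().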

On Mon, Jan 21, 2019 at 12:17 PM Reynold Xin  wrote:
>
> Did you actually observe a perf issue?
>
> On Mon, Jan 21, 2019 at 10:04 AM Sean Owen  wrote:
>>
>> The ClosureCleaner proactively checks that closures passed to
>> transformations like RDD.map() are serializable, before they're
>> executed. It does this by just serializing it with the JavaSerializer.
>>
>> That's a nice feature, although there's overhead in always trying to
>> serialize the closure ahead of time, especially if the closure is
>> large. It shouldn't be large, usually. But I noticed it when coming up
>> with this fix: https://github.com/apache/spark/pull/23600
>>
>> It made me wonder, should this be optional, or even not the default?
>> Closures that don't serialize still fail, just later when an action is
>> invoked. I don't feel strongly about it, just checking if anyone had
>> pondered this before.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.3.3 (RC1)

2019-01-20 Thread Felix Cheung
+1

My focus is on R (sorry, I couldn’t cross-validate what Sean is seeing)

tested:
reviewed doc
R package test
win-builder, r-hub
Tarball/package signature




From: Takeshi Yamamuro 
Sent: Thursday, January 17, 2019 6:49 PM
To: Spark dev list
Subject: [VOTE] Release Apache Spark 2.3.3 (RC1)

Please vote on releasing the following candidate as Apache Spark version 2.3.3.

The vote is open until January 20 8:00PM (PST) and passes if a majority +1 PMC 
votes are cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.3.3
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.3.3-rc1 (commit 
b5ea9330e3072e99841270b10dc1d2248127064b):
https://github.com/apache/spark/tree/v2.3.3-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1297

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc1-docs/

The list of bug fixes going into 2.3.3 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12343759

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.3.3?
===

The current list of open tickets targeted at 2.3.3 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 2.3.3

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.

--
---
Takeshi Yamamuro


Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-20 Thread Felix Cheung
+1. I like Ryan's last mail. Thank you for putting it so clearly (it should be a 
spec/SPIP!)

I agree with and understand the need for a 3-part id. However, I don’t think we should 
assume that it must be, or can only be, 3 parts long. Once the 
catalog is identified (i.e. the first part), the catalog should be responsible 
for resolving the namespace or schema, etc. I also agree a path is a good idea to add 
to support the file-based variant. Should the separator be optional (perhaps in *space) 
to keep this extensible (it might not always be ‘.’)?

Also, this whole scheme will need to play nicely with column identifiers as well.



From: Ryan Blue 
Sent: Thursday, January 17, 2019 11:38 AM
To: Spark Dev List
Subject: Re: [DISCUSS] Identifiers with multi-catalog support

Any discussion on how Spark should manage identifiers when multiple catalogs 
are supported?

I know this is an area where a lot of people are interested in making progress, 
and it is a blocker for both multi-catalog support and CTAS in DSv2.

On Sun, Jan 13, 2019 at 2:22 PM Ryan Blue 
mailto:rb...@netflix.com>> wrote:

I think that the solution to this problem is to mix the two approaches by 
supporting 3 identifier parts: catalog, namespace, and name, where namespace 
can be an n-part identifier:

type Namespace = Seq[String]
case class CatalogIdentifier(space: Namespace, name: String)


This allows catalogs to work with the hierarchy of the external store, but the 
catalog API only requires a few discovery methods to list namespaces and to 
list each type of object in a namespace.

def listNamespaces(): Seq[Namespace]
def listNamespaces(space: Namespace, prefix: String): Seq[Namespace]
def listTables(space: Namespace): Seq[CatalogIdentifier]
def listViews(space: Namespace): Seq[CatalogIdentifier]
def listFunctions(space: Namespace): Seq[CatalogIdentifier]


The methods to list tables, views, or functions, would only return identifiers 
for the type queried, not namespaces or the other objects.

The SQL parser would be updated so that identifiers are parsed to 
UnresolvedIdentifier(parts: Seq[String]), and resolution would work like this 
pseudo-code:

def resolveIdentifier(ident: UnresolvedIdentifier): (CatalogPlugin, CatalogIdentifier) = {
  val maybeCatalog = sparkSession.catalog(ident.parts.head)
  ident.parts match {
    case Seq(catalogName, *space, name) if maybeCatalog.isDefined =>
      (maybeCatalog.get, CatalogIdentifier(space, name))
    case Seq(*space, name) =>
      (sparkSession.defaultCatalog, CatalogIdentifier(space, name))
  }
}


I think this is a good approach because it allows Spark users to reference or 
discovery any name in the hierarchy of an external store, it uses a few 
well-defined methods for discovery, and makes name hierarchy a user concern.

  *   SHOW (DATABASES|SCHEMAS|NAMESPACES) would return the result of 
listNamespaces()
  *   SHOW NAMESPACES LIKE a.b% would return the result of 
listNamespaces(Seq("a"), "b")
  *   USE a.b would set the current namespace to Seq("a", "b")
  *   SHOW TABLES would return the result of listTables(currentNamespace)

Also, I think that we could generalize this a little more to support path-based 
tables by adding a path to CatalogIdentifier, either as a namespace or as a 
separate optional string. Then, the identifier passed to a catalog would work 
for either a path-based table or a catalog table, without needing a path-based 
catalog API.

Thoughts?
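
As a rough, runnable Scala sketch of the resolution rule above (the catalog
names, the TestCatalogPlugin stub, and the lookup map below are made up for
illustration and are not part of the proposal itself):

object IdentifierResolutionSketch {
  type Namespace = Seq[String]
  case class CatalogIdentifier(space: Namespace, name: String)

  // Hypothetical stand-ins for the real plugin interface discussed above.
  trait CatalogPlugin { def name: String }
  case class TestCatalogPlugin(name: String) extends CatalogPlugin

  // Made-up session state: one configured catalog plus a default catalog.
  val catalogs = Map("prod" -> TestCatalogPlugin("prod"))
  val defaultCatalog: CatalogPlugin = TestCatalogPlugin("default")

  // Assumes a non-empty identifier.
  def resolve(parts: Seq[String]): (CatalogPlugin, CatalogIdentifier) =
    catalogs.get(parts.head) match {
      // First part names a configured catalog: the rest is namespace + name.
      case Some(catalog) if parts.length > 1 =>
        (catalog, CatalogIdentifier(parts.slice(1, parts.length - 1), parts.last))
      // Otherwise the whole identifier resolves in the default catalog.
      case _ =>
        (defaultCatalog, CatalogIdentifier(parts.dropRight(1), parts.last))
    }

  def main(args: Array[String]): Unit = {
    println(resolve(Seq("prod", "db", "events")))  // routed to the "prod" catalog
    println(resolve(Seq("db", "events")))          // falls back to the default catalog
    println(resolve(Seq("events")))                // default catalog, empty namespace
  }
}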

On Sun, Jan 13, 2019 at 1:38 PM Ryan Blue 
mailto:rb...@netflix.com>> wrote:

In the DSv2 sync up, we tried to discuss the Table metadata proposal but were 
side-tracked on its use of TableIdentifier. There were good points about how 
Spark should identify tables, views, functions, etc, and I want to start a 
discussion here.

Identifiers are orthogonal to the TableCatalog proposal that can be updated to 
use whatever identifier class we choose. That proposal is concerned with what 
information should be passed to define a table, and how to pass that 
information.

The main question for this discussion is: how should Spark identify tables, 
views, and functions when it supports multiple catalogs?

There are two main approaches:

  1.  Use a 3-part identifier, catalog.database.table
  2.  Use an identifier with an arbitrary number of parts

Option 1: use 3-part identifiers

The argument for option #1 is that it is simple. If an external data store has 
additional logical hierarchy layers, then that hierarchy would be mapped to 
multiple catalogs in Spark. Spark can support show tables and show databases 
without much trouble. This is the approach used by Presto, so there is some 
precedent for it.

The drawback is that mapping a more complex hierarchy into Spark requires more 
configuration. If an external DB has a 3-level hierarchy — say, for example, 
schema.database.table — then option #1 requires users to configure a catalog 
for each top-level structure, each schema. When a new schema is added, it is 
not 

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Felix Cheung
One common case we have is a custom input format.

In any case, even when the Hive metastore is protocol compatible, we should still 
upgrade or replace the Hive jar from a fork, as Sean says, from an ASF release-process 
standpoint. Unless there is a plan for removing Hive integration (all 
of it) from the Spark core project.



From: Xiao Li 
Sent: Tuesday, January 15, 2019 10:03 AM
To: Felix Cheung
Cc: rb...@netflix.com; Yuming Wang; dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

Let me take my words back. To read/write a table, Spark users do not use the 
Hive execution JARs, unless they explicitly create the Hive serde tables. 
Actually, I want to understand the motivation and use cases: why do your usage 
scenarios need to create Hive serde tables instead of our Spark native tables?

BTW, we are still using Hive metastore as our metadata store. This does not 
require the Hive execution JAR upgrade, based on my understanding. Users can 
upgrade it to the newer version of Hive metastore.
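
As a rough illustration of the native vs. Hive serde distinction above (a hedged
sketch: the table names are made up, and it assumes a SparkSession built with
enableHiveSupport() and a working metastore):

import org.apache.spark.sql.SparkSession

object TableKindsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("table-kinds-sketch")
      .enableHiveSupport()   // required for creating Hive serde tables
      .getOrCreate()

    // A Spark native (data source) table, declared with USING.
    spark.sql("CREATE TABLE IF NOT EXISTS native_events (id BIGINT, name STRING) USING parquet")

    // A Hive serde table, declared with STORED AS; this is the kind of table
    // referred to above as going through the Hive code path.
    spark.sql("CREATE TABLE IF NOT EXISTS hive_events (id BIGINT, name STRING) STORED AS PARQUET")

    // DESCRIBE FORMATTED shows the provider/serde details for each table.
    spark.sql("DESCRIBE FORMATTED hive_events").show(truncate = false)

    spark.stop()
  }
}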

Felix Cheung mailto:felixcheun...@hotmail.com>> wrote on Tuesday, January 15, 2019 at 9:56 AM:
And we are super 100% dependent on Hive...



From: Ryan Blue 
Sent: Tuesday, January 15, 2019 9:53 AM
To: Xiao Li
Cc: Yuming Wang; dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

How do we know that most Spark users are not using Hive? I wouldn't be 
surprised either way, but I do want to make sure we aren't making decisions 
based on any one person's (or one company's) experience about what "most" Spark 
users do.

On Tue, Jan 15, 2019 at 9:44 AM Xiao Li 
mailto:gatorsm...@gmail.com>> wrote:
Hi, Yuming,

Thank you for your contributions! The community aims at reducing the dependence 
on Hive. Currently, most Spark users are not using Hive. The changes look 
risky to me.

To support Hadoop 3.x, we just need to resolve this JIRA: 
https://issues.apache.org/jira/browse/HIVE-16391

Cheers,

Xiao

Yuming Wang mailto:wgy...@gmail.com>> wrote on Tuesday, January 15, 2019 at 8:41 AM:
Dear Spark Developers and Users,

Hyukjin and I plan to upgrade the built-in Hive 
from 1.2.1-spark2<https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2> 
to 2.3.4<https://github.com/apache/hive/releases/tag/rel%2Frelease-2.3.4> to 
solve some critical issues, such as supporting Hadoop 3.x and solving some ORC and 
Parquet issues. This is the list:
Hive issues:
[SPARK-26332<https://issues.apache.org/jira/browse/SPARK-26332>][HIVE-10790] 
Spark sql write orc table on viewFS throws exception
[SPARK-25193<https://issues.apache.org/jira/browse/SPARK-25193>][HIVE-12505] 
insert overwrite doesn't throw exception when drop old data fails
[SPARK-26437<https://issues.apache.org/jira/browse/SPARK-26437>][HIVE-13083] 
Decimal data becomes bigint to query, unable to query
[SPARK-25919<https://issues.apache.org/jira/browse/SPARK-25919>][HIVE-11771] 
Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
table is Partitioned
[SPARK-12014<https://issues.apache.org/jira/browse/SPARK-12014>][HIVE-11100] 
Spark SQL query containing semicolon is broken in Beeline

Spark issues:
[SPARK-23534<https://issues.apache.org/jira/browse/SPARK-23534>] Spark run on 
Hadoop 3.0.0
[SPARK-20202<https://issues.apache.org/jira/browse/SPARK-20202>] Remove 
references to org.spark-project.hive
[SPARK-18673<https://issues.apache.org/jira/browse/SPARK-18673>] Dataframes 
doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[SPARK-24766<https://issues.apache.org/jira/browse/SPARK-24766>] 
CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column 
stats in parquet


Since the code for the hive-thriftserver module has changed too much for this 
upgrade, I split it into two PRs for easy review.
The first PR<https://github.com/apache/spark/pull/23552> does not contain the 
changes to hive-thriftserver. Please ignore the failed test in 
hive-thriftserver.
The second PR<https://github.com/apache/spark/pull/23553> contains the complete changes.

I have created a Spark distribution for Apache Hadoop 2.7; you can download 
it via Google 
Drive<https://drive.google.com/open?id=1cq2I8hUTs9F4JkFyvRfdOJ5BlxV0ujgt> 
or Baidu Pan<https://pan.baidu.com/s/1b090Ctuyf1CDYS7c0puBqQ>.
Please help review and test. Thanks.


--
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Felix Cheung
And we are super 100% dependent on Hive...



From: Ryan Blue 
Sent: Tuesday, January 15, 2019 9:53 AM
To: Xiao Li
Cc: Yuming Wang; dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

How do we know that most Spark users are not using Hive? I wouldn't be 
surprised either way, but I do want to make sure we aren't making decisions 
based on any one person's (or one company's) experience about what "most" Spark 
users do.

On Tue, Jan 15, 2019 at 9:44 AM Xiao Li 
mailto:gatorsm...@gmail.com>> wrote:
Hi, Yuming,

Thank you for your contributions! The community aims at reducing the dependence 
on Hive. Currently, most Spark users are not using Hive. The changes look 
risky to me.

To support Hadoop 3.x, we just need to resolve this JIRA: 
https://issues.apache.org/jira/browse/HIVE-16391

Cheers,

Xiao

Yuming Wang mailto:wgy...@gmail.com>> wrote on Tuesday, January 15, 2019 at 8:41 AM:
Dear Spark Developers and Users,

Hyukjin and I plan to upgrade the built-in Hive 
from 1.2.1-spark2 
to 2.3.4 to 
solve some critical issues, such as supporting Hadoop 3.x and solving some ORC and 
Parquet issues. This is the list:
Hive issues:
[SPARK-26332][HIVE-10790] 
Spark sql write orc table on viewFS throws exception
[SPARK-25193][HIVE-12505] 
insert overwrite doesn't throw exception when drop old data fails
[SPARK-26437][HIVE-13083] 
Decimal data becomes bigint to query, unable to query
[SPARK-25919][HIVE-11771] 
Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
table is Partitioned
[SPARK-12014][HIVE-11100] 
Spark SQL query containing semicolon is broken in Beeline

Spark issues:
[SPARK-23534] Spark run on 
Hadoop 3.0.0
[SPARK-20202] Remove 
references to org.spark-project.hive
[SPARK-18673] Dataframes 
doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[SPARK-24766] 
CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column 
stats in parquet


Since the code for the hive-thriftserver module has changed too much for this 
upgrade, I split it into two PRs for easy review.
The first PR does not contain the 
changes to hive-thriftserver. Please ignore the failed test in 
hive-thriftserver.
The second PR contains the complete changes.

I have created a Spark distribution for Apache Hadoop 2.7; you can download 
it via Google 
Drive or 
Baidu Pan.
Please help review and test. Thanks.


--
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Felix Cheung
Resolving https://issues.apache.org/jira/browse/HIVE-16391 means keeping Spark 
on Hive 1.2?

I’m not sure that is reducing the dependency on Hive - Hive is still there, and it’s 
a very old Hive. IMO the risk increases the longer we keep this up. (And 
it’s been years.)

Looking at the two PRs, they don’t seem very drastic to me, except for the thrift 
server. Is there another, better approach to the thrift server?



From: Xiao Li 
Sent: Tuesday, January 15, 2019 9:44 AM
To: Yuming Wang
Cc: dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

Hi, Yuming,

Thank you for your contributions! The community aims at reducing the dependence 
on Hive. Currently, most Spark users are not using Hive. The changes look 
risky to me.

To support Hadoop 3.x, we just need to resolve this JIRA: 
https://issues.apache.org/jira/browse/HIVE-16391

Cheers,

Xiao

Yuming Wang mailto:wgy...@gmail.com>> wrote on Tuesday, January 15, 2019 at 8:41 AM:
Dear Spark Developers and Users,

Hyukjin and I plan to upgrade the built-in Hive from 
1.2.1-spark2 to 
2.3.4 to solve 
some critical issues, such as supporting Hadoop 3.x and solving some ORC and Parquet 
issues. This is the list:
Hive issues:
[SPARK-26332][HIVE-10790] 
Spark sql write orc table on viewFS throws exception
[SPARK-25193][HIVE-12505] 
insert overwrite doesn't throw exception when drop old data fails
[SPARK-26437][HIVE-13083] 
Decimal data becomes bigint to query, unable to query
[SPARK-25919][HIVE-11771] 
Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
table is Partitioned
[SPARK-12014][HIVE-11100] 
Spark SQL query containing semicolon is broken in Beeline

Spark issues:
[SPARK-23534] Spark run on 
Hadoop 3.0.0
[SPARK-20202] Remove 
references to org.spark-project.hive
[SPARK-18673] Dataframes 
doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[SPARK-24766] 
CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column 
stats in parquet


Since the code for the hive-thriftserver module has changed too much for this 
upgrade, I split it into two PRs for easy review.
The first PR does not contain the 
changes to hive-thriftserver. Please ignore the failed test in 
hive-thriftserver.
The second PR contains the complete changes.

I have created a Spark distribution for Apache Hadoop 2.7; you can download 
it via Google 
Drive or 
Baidu Pan.
Please help review and test. Thanks.


Re: Clean out https://dist.apache.org/repos/dist/dev/spark/ ?

2019-01-13 Thread Felix Cheung
Eh, yeah, like the one with signing, I think the doc build is mostly useful a) 
right before we do a release or during RC resets, or b) when someone makes a huge 
change to the docs and wants to check.

Not sure we need this nightly?



From: Sean Owen 
Sent: Sunday, January 13, 2019 5:45 AM
To: Felix Cheung
Cc: Dongjoon Hyun; dev
Subject: Re: Clean out https://dist.apache.org/repos/dist/dev/spark/ ?

Will do. Er, maybe add Shane here too -- should we disable this docs
job? are these docs used, and is there much value in nightly snapshots
of the whole site?

On Sat, Jan 12, 2019 at 9:04 PM Felix Cheung  wrote:
>
> These get “published” by doc nightly build from riselab Jenkins...
>
>
> 
> From: Dongjoon Hyun 
> Sent: Saturday, January 12, 2019 4:32 PM
> To: Sean Owen
> Cc: dev
> Subject: Re: Clean out https://dist.apache.org/repos/dist/dev/spark/ ?
>
> +1 for removing old docs there.
> It seems that we need to upgrade our build script to maintain only one 
> published snapshot doc.
>
> Bests,
> Dongjoon.
>
> On Sat, Jan 12, 2019 at 2:18 PM Sean Owen  wrote:
>>
>> I'm not sure it matters a whole lot, but we are encouraged to keep
>> dist.apache.org free of old files. I see tons of old -docs snapshot
>> builds at https://dist.apache.org/repos/dist/dev/spark/ -- can I just
>> remove anything not so current?
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>


Re: Clean out https://dist.apache.org/repos/dist/dev/spark/ ?

2019-01-12 Thread Felix Cheung
These get “published” by doc nightly build from riselab Jenkins...



From: Dongjoon Hyun 
Sent: Saturday, January 12, 2019 4:32 PM
To: Sean Owen
Cc: dev
Subject: Re: Clean out https://dist.apache.org/repos/dist/dev/spark/ ?

+1 for removing old docs there.
It seems that we need to upgrade our build script to maintain only one 
published snapshot doc.

Bests,
Dongjoon.

On Sat, Jan 12, 2019 at 2:18 PM Sean Owen 
mailto:sro...@gmail.com>> wrote:
I'm not sure it matters a whole lot, but we are encouraged to keep
dist.apache.org free of old files. I see tons of old 
-docs snapshot
builds at https://dist.apache.org/repos/dist/dev/spark/ -- can I just
remove anything not so current?

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org



Re: Spark Packaging Jenkins

2019-01-06 Thread Felix Cheung
Awesome Shane!



From: shane knapp 
Sent: Sunday, January 6, 2019 11:38 AM
To: Felix Cheung
Cc: Dongjoon Hyun; Wenchen Fan; dev
Subject: Re: Spark Packaging Jenkins

noted.  i like the idea of building (but not signing) the release and will 
update the job(s) this week.

On Sun, Jan 6, 2019 at 11:22 AM Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
https://spark.apache.org/release-process.html

Look for do-release-docker.sh script



From: Felix Cheung mailto:felixcheun...@hotmail.com>>
Sent: Sunday, January 6, 2019 11:17 AM
To: Dongjoon Hyun; Wenchen Fan
Cc: dev; shane knapp
Subject: Re: Spark Packaging Jenkins

The release process doc should have been updated on this - as mentioned we do 
not use Jenkins for release signing (take this offline if further discussion is 
needed)

The release build on Jenkins can still be useful for pre-validating the release 
build process (without actually signing it)



From: Dongjoon Hyun mailto:dongjoon.h...@gmail.com>>
Sent: Saturday, January 5, 2019 9:46 PM
To: Wenchen Fan
Cc: dev; shane knapp
Subject: Re: Spark Packaging Jenkins

Thank you, Wenchen.

I see. I'll update the doc and proceed to the next step manually as you advise. 
And it seems that we can stop the outdated Jenkins jobs, too.

Bests,
Dongjoon.

On Sat, Jan 5, 2019 at 20:15 Wenchen Fan 
mailto:cloud0...@gmail.com>> wrote:
IIRC there was a change to the release process: we stopped using the shared gpg 
key on Jenkins and use the personal key of the release manager instead. I'm not sure 
Jenkins can help with testing the package anymore.

BTW release manager needs to run the packaging script by himself. If there is a 
problem, the release manager will find it out sooner or later.



On Sun, Jan 6, 2019 at 6:34 AM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
Hi, All.

It turns out that `gpg signing` is the next hurdle in Spark Packaging Jenkins.
Since 2.4.0 release, is there something changed in our Jenkins machine?

  gpg: skipped 
"/home/jenkins/workspace/spark-master-package/spark-utils/new-release-scripts/jenkins/jenkins-credentials-JEtz0nyn/gpg.tmp":
 No secret key
  gpg: signing failed: No secret key

Bests,
Dongjoon.


On Fri, Jan 4, 2019 at 11:52 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
https://issues.apache.org/jira/browse/SPARK-26537

On Fri, Jan 4, 2019 at 11:31 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
this may push in to early next week...  these builds were set up before my 
time, and i'm currently unraveling how they all work before pushing a commit to 
fix stuff.

nothing like some code archaeology to make my friday more exciting!  :)

shane

On Fri, Jan 4, 2019 at 11:08 AM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
Thank you, Shane!

Bests,
Dongjoon.

On Fri, Jan 4, 2019 at 10:50 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
yeah, i'll get on that today.  thanks for the heads up.

On Fri, Jan 4, 2019 at 10:46 AM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
Hi, All

As a part of release process, we need to check Packaging/Compile/Test Jenkins 
status.

http://spark.apache.org/release-process.html

1. Spark Packaging: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/
2. Spark QA Compile: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/
3. Spark QA Test: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/

Currently, (2) and (3) are working because it uses GitHub 
(https://github.com/apache/spark.git).
But, (1) seems to be broken because it's looking for old 
repo(https://git-wip-us.apache.org/repos/asf/spark.git/info/refs) instead of 
new GitBox.

Can we fix this in this week?

Bests,
Dongjoon.



--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Spark Packaging Jenkins

2019-01-06 Thread Felix Cheung
https://spark.apache.org/release-process.html

Look for do-release-docker.sh script



From: Felix Cheung 
Sent: Sunday, January 6, 2019 11:17 AM
To: Dongjoon Hyun; Wenchen Fan
Cc: dev; shane knapp
Subject: Re: Spark Packaging Jenkins

The release process doc should have been updated on this - as mentioned we do 
not use Jenkins for release signing (take this offline if further discussion is 
needed)

The release build on Jenkins can still be useful for pre-validating the release 
build process (without actually signing it)



From: Dongjoon Hyun 
Sent: Saturday, January 5, 2019 9:46 PM
To: Wenchen Fan
Cc: dev; shane knapp
Subject: Re: Spark Packaging Jenkins

Thank you, Wenchen.

I see. I'll update the doc and proceed to the next step manually as you advise. 
And it seems that we can stop the outdated Jenkins jobs, too.

Bests,
Dongjoon.

On Sat, Jan 5, 2019 at 20:15 Wenchen Fan 
mailto:cloud0...@gmail.com>> wrote:
IIRC there was a change to the release process: we stopped using the shared gpg 
key on Jenkins and use the personal key of the release manager instead. I'm not sure 
Jenkins can help with testing the package anymore.

BTW release manager needs to run the packaging script by himself. If there is a 
problem, the release manager will find it out sooner or later.



On Sun, Jan 6, 2019 at 6:34 AM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
Hi, All.

It turns out that `gpg signing` is the next hurdle in Spark Packaging Jenkins.
Since 2.4.0 release, is there something changed in our Jenkins machine?

  gpg: skipped 
"/home/jenkins/workspace/spark-master-package/spark-utils/new-release-scripts/jenkins/jenkins-credentials-JEtz0nyn/gpg.tmp":
 No secret key
  gpg: signing failed: No secret key

Bests,
Dongjoon.


On Fri, Jan 4, 2019 at 11:52 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
https://issues.apache.org/jira/browse/SPARK-26537

On Fri, Jan 4, 2019 at 11:31 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
this may push in to early next week...  these builds were set up before my 
time, and i'm currently unraveling how they all work before pushing a commit to 
fix stuff.

nothing like some code archaeology to make my friday more exciting!  :)

shane

On Fri, Jan 4, 2019 at 11:08 AM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
Thank you, Shane!

Bests,
Dongjoon.

On Fri, Jan 4, 2019 at 10:50 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
yeah, i'll get on that today.  thanks for the heads up.

On Fri, Jan 4, 2019 at 10:46 AM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
Hi, All

As a part of release process, we need to check Packaging/Compile/Test Jenkins 
status.

http://spark.apache.org/release-process.html

1. Spark Packaging: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/
2. Spark QA Compile: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/
3. Spark QA Test: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/

Currently, (2) and (3) are working because it uses GitHub 
(https://github.com/apache/spark.git).
But, (1) seems to be broken because it's looking for old 
repo(https://git-wip-us.apache.org/repos/asf/spark.git/info/refs) instead of 
new GitBox.

Can we fix this in this week?

Bests,
Dongjoon.



--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Spark Packaging Jenkins

2019-01-06 Thread Felix Cheung
The release process doc should have been updated on this - as mentioned we do 
not use Jenkins for release signing (take this offline if further discussion is 
needed)

The release build on Jenkins can still be useful for pre-validating the release 
build process (without actually signing it)



From: Dongjoon Hyun 
Sent: Saturday, January 5, 2019 9:46 PM
To: Wenchen Fan
Cc: dev; shane knapp
Subject: Re: Spark Packaging Jenkins

Thank you, Wenchen.

I see. I'll update the doc and proceed to the next step manually as you advise. 
And it seems that we can stop the outdated Jenkins jobs, too.

Bests,
Dongjoon.

On Sat, Jan 5, 2019 at 20:15 Wenchen Fan 
mailto:cloud0...@gmail.com>> wrote:
IIRC there was a change to the release process: we stopped using the shared gpg 
key on Jenkins and use the personal key of the release manager instead. I'm not sure 
Jenkins can help with testing the package anymore.

BTW release manager needs to run the packaging script by himself. If there is a 
problem, the release manager will find it out sooner or later.



On Sun, Jan 6, 2019 at 6:34 AM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
Hi, All.

It turns out that `gpg signing` is the next hurdle in Spark Packaging Jenkins.
Since the 2.4.0 release, has something changed on our Jenkins machine?

  gpg: skipped 
"/home/jenkins/workspace/spark-master-package/spark-utils/new-release-scripts/jenkins/jenkins-credentials-JEtz0nyn/gpg.tmp":
 No secret key
  gpg: signing failed: No secret key

Bests,
Dongjoon.


On Fri, Jan 4, 2019 at 11:52 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
https://issues.apache.org/jira/browse/SPARK-26537

On Fri, Jan 4, 2019 at 11:31 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
this may push into early next week...  these builds were set up before my 
time, and i'm currently unraveling how they all work before pushing a commit to 
fix stuff.

nothing like some code archaeology to make my friday more exciting!  :)

shane

On Fri, Jan 4, 2019 at 11:08 AM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
Thank you, Shane!

Bests,
Dongjoon.

On Fri, Jan 4, 2019 at 10:50 AM shane knapp 
mailto:skn...@berkeley.edu>> wrote:
yeah, i'll get on that today.  thanks for the heads up.

On Fri, Jan 4, 2019 at 10:46 AM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
Hi, All

As a part of release process, we need to check Packaging/Compile/Test Jenkins 
status.

http://spark.apache.org/release-process.html

1. Spark Packaging: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/
2. Spark QA Compile: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/
3. Spark QA Test: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/

Currently, (2) and (3) are working because it uses GitHub 
(https://github.com/apache/spark.git).
But, (1) seems to be broken because it's looking for old 
repo(https://git-wip-us.apache.org/repos/asf/spark.git/info/refs) instead of 
new GitBox.

Can we fix this in this week?

Bests,
Dongjoon.



--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Apache Spark 2.2.3 ?

2019-01-02 Thread Felix Cheung
+1 on 2.2.3 of course



From: Dongjoon Hyun 
Sent: Wednesday, January 2, 2019 12:21 PM
To: Saisai Shao
Cc: Xiao Li; Felix Cheung; Sean Owen; dev
Subject: Re: Apache Spark 2.2.3 ?

Thank you for the swift feedback, and Happy New Year. :)
For the 2.2.3 release next week, I see two positive opinions (including mine)
and don't see any direct objections.

Apache Spark has a mature, resourceful, and fast-growing community.
One of the important characteristics of a mature community is
the predictable behavior that users are able to depend on.
For instance, we have a nice tradition of cutting the branch as a sign of feature 
freeze.
The *final* release of a branch is not only good for the end users, but also a 
good sign of the EOL of the branch for all.

As a junior committer of the community, I want to contribute by delivering the 
final 2.2.3 release to the community and finalizing `branch-2.2`.

* For Apache Spark JIRA, I checked that there are no ongoing issues targeting 
`2.2.3`.
* For commits, I reviewed the newly landed commits after `2.2.2` tag and 
updated a few missing JIRA issues accordingly.
* Apparently, we can release 2.2.3 next week.

BTW, I'm +1 for the next 2.3/2.4 releases and have been expecting them before 
Spark+AI Summit (April) because that is what we usually did.
Please send another email to the `dev` mailing list because it's worth receiving 
more attention and requests.

Bests,
Dongjoon.


On Tue, Jan 1, 2019 at 9:35 PM Saisai Shao 
mailto:sai.sai.s...@gmail.com>> wrote:
Agreed to have a new branch-2.3 release, as we already accumulated several 
fixes.

Thanks
Saisai

Xiao Li mailto:lix...@databricks.com>> wrote on Wednesday, January 2, 2019 at 
1:32 PM:
Based on the commit history, 
https://gitbox.apache.org/repos/asf?p=spark.git;a=shortlog;h=refs/heads/branch-2.3
 contains more critical fixes. Maybe the priority is higher?

On Tue, Jan 1, 2019 at 9:22 PM Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
Speaking of, it’s been 3 months since 2.3.2... (Sept 2018)

And 2 months since 2.4.0 (Nov 2018) - does the community feel 2.4 branch is 
stabilizing?



From: Sean Owen mailto:sro...@gmail.com>>
Sent: Tuesday, January 1, 2019 8:30 PM
To: Dongjoon Hyun
Cc: dev
Subject: Re: Apache Spark 2.2.3 ?

I agree with that logic, and if you're volunteering to do the legwork,
I don't see a reason not to cut a final 2.2 release.

On Tue, Jan 1, 2019 at 9:19 PM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
>
> Hi, All.
>
> Apache Spark community has a policy maintaining the feature branch for 18 
> months. I think it's time for the 2.2.3 release since 2.2.0 is released on 
> July 2017.
>
> http://spark.apache.org/versioning-policy.html
>
> After 2.2.2 (July 2018), `branch-2.2` has 40 patches (including security 
> patches).
>
> https://gitbox.apache.org/repos/asf?p=spark.git;a=shortlog;h=refs/heads/branch-2.2
>
> If it's okay and there is no further plan on `branch-2.2`, I want to 
> volunteer to prepare the first RC (early next week?).
>
> Please let me know your opinions about this.
>
> Bests,
> Dongjoon.

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org<mailto:dev-unsubscr...@spark.apache.org>





Re: Apache Spark 2.2.3 ?

2019-01-01 Thread Felix Cheung
Speaking of, it’s been 3 months since 2.3.2... (Sept 2018)

And 2 months since 2.4.0 (Nov 2018) - does the community feel 2.4 branch is 
stabilizing?



From: Sean Owen 
Sent: Tuesday, January 1, 2019 8:30 PM
To: Dongjoon Hyun
Cc: dev
Subject: Re: Apache Spark 2.2.3 ?

I agree with that logic, and if you're volunteering to do the legwork,
I don't see a reason not to cut a final 2.2 release.

On Tue, Jan 1, 2019 at 9:19 PM Dongjoon Hyun  wrote:
>
> Hi, All.
>
> Apache Spark community has a policy maintaining the feature branch for 18 
> months. I think it's time for the 2.2.3 release since 2.2.0 is released on 
> July 2017.
>
> http://spark.apache.org/versioning-policy.html
>
> After 2.2.2 (July 2018), `branch-2.2` has 40 patches (including security 
> patches).
>
> https://gitbox.apache.org/repos/asf?p=spark.git;a=shortlog;h=refs/heads/branch-2.2
>
> If it's okay and there is no further plan on `branch-2.2`, I want to 
> volunteer to prepare the first RC (early next week?).
>
> Please let me know your opinions about this.
>
> Bests,
> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

2018-11-11 Thread Felix Cheung
I opened a PR on the vignettes fix to skip eval.
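Roughly, the idea is to turn chunk evaluation off when no suitable Java is found, so the
vignette still renders as static text. A minimal sketch of that kind of guard (illustrative
only -- the exact condition in the PR may differ, and hasJava8 below is a hypothetical helper):

  # early setup chunk in sparkr-vignettes.Rmd (illustrative sketch, not the PR code)
  library(knitr)
  # hasJava8 is a hypothetical helper that inspects `java -version` output
  hasJava8 <- function() {
    out <- tryCatch(system2("java", "-version", stdout = TRUE, stderr = TRUE),
                    error = function(e) character(0))
    any(grepl('version "(1\\.8|8)', out))
  }
  # skip evaluation of all subsequent chunks if the environment is not suitable
  opts_chunk$set(eval = hasJava8())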



From: Shivaram Venkataraman 
Sent: Wednesday, November 7, 2018 7:26 AM
To: Felix Cheung
Cc: Sean Owen; Shivaram Venkataraman; Wenchen Fan; Matei Zaharia; dev
Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

Agree with the points Felix made.

One thing is that it looks like the only problem is vignettes and the
tests are being skipped as designed. If you see
https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/Windows/00check.log
and 
https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/Debian/00check.log,
the tests run in 1s.
On Tue, Nov 6, 2018 at 1:29 PM Felix Cheung  wrote:
>
> I’d rather not mess with 2.4.0 at this point. Being on CRAN is nice but users can 
> also install from an Apache mirror.
>
> Also I had attempted and failed to get vignettes not to build; it was non 
> trivial and I couldn’t get it to work. But I have an idea.
>
> As for tests, I don’t know exactly why they are not skipped. Need to investigate, 
> but worst case test_package can run with 0 tests.
>
>
>
> 
> From: Sean Owen 
> Sent: Tuesday, November 6, 2018 10:51 AM
> To: Shivaram Venkataraman
> Cc: Felix Cheung; Wenchen Fan; Matei Zaharia; dev
> Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
>
> I think the second option, to skip the tests, is best right now, if
> the alternative is to have no SparkR release at all!
> Can we monkey-patch the 2.4.0 release for SparkR in this way, bless it
> from the PMC, and release that? It's drastic but so is not being able
> to release, I think.
> Right? or is CRAN not actually an important distribution path for
> SparkR in particular?
>
> On Tue, Nov 6, 2018 at 12:49 PM Shivaram Venkataraman
>  wrote:
> >
> > Right - I think we should move on with 2.4.0.
> >
> > In terms of what can be done to avoid this error there are two strategies
> > - Felix had this other thread about JDK 11 that should at least let
> > Spark run on the CRAN instance. In general this strategy isn't
> > foolproof because the JDK version and other dependencies on that
> > machine keep changing over time and we dont have much control over it.
> > Worse we also dont have much control
> > - The other solution is to not run code to build the vignettes
> > document and just have static code blocks there that have been
> > pre-evaluated / pre-populated. We can open a JIRA to discuss the
> > pros/cons of this ?
> >
> > Thanks
> > Shivaram
> >
> > On Tue, Nov 6, 2018 at 10:57 AM Felix Cheung  
> > wrote:
> > >
> > > We have not been able to publish to CRAN for quite some time (since 2.3.0 
> > > was archived - the cause is Java 11)
> > >
> > > I think it’s ok to announce the release of 2.4.0
> > >
> > >
> > > 
> > > From: Wenchen Fan 
> > > Sent: Tuesday, November 6, 2018 8:51 AM
> > > To: Felix Cheung
> > > Cc: Matei Zaharia; Sean Owen; Spark dev list; Shivaram Venkataraman
> > > Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> > >
> > > Do you mean we should have a 2.4.0 release without CRAN and then do a 
> > > 2.4.1 immediately?
> > >
> > > On Wed, Nov 7, 2018 at 12:34 AM Felix Cheung  
> > > wrote:
> > >>
> > >> Shivaram and I were discussing.
> > >> Actually we worked with them before. Another possible approach is to 
> > >> remove the vignettes eval and all test from the source package... in the 
> > >> next release.
> > >>
> > >>
> > >> 
> > >> From: Matei Zaharia 
> > >> Sent: Tuesday, November 6, 2018 12:07 AM
> > >> To: Felix Cheung
> > >> Cc: Sean Owen; dev; Shivaram Venkataraman
> > >> Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> > >>
> > >> Maybe it’s worth contacting the CRAN maintainers to ask for help? 
> > >> Perhaps we aren’t disabling it correctly, or perhaps they can ignore 
> > >> this specific failure. +Shivaram who might have some ideas.
> > >>
> > >> Matei
> > >>
> > >> > On Nov 5, 2018, at 9:09 PM, Felix Cheung  
> > >> > wrote:
> > >> >
> > >> > I don’t know what the cause is yet.
> > >> >
> > >> > The test should be skipped because of this check
> > >> > https://github.com/apache/spark/blob/branch

Re: [discuss] SparkR CRAN feasibility check server problem

2018-11-10 Thread Felix Cheung
It’s a great point about min R version. From what I see, mostly because of 
fixes and packages support, most users of R are fairly up to date? So perhaps 
3.4 as min version is reasonable esp. for Spark 3.
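For illustration, bumping the minimum would mostly mean declaring R (>= 3.4.0) in the package
DESCRIPTION, plus perhaps a runtime guard along these lines (a sketch, not SparkR’s actual check):

  # illustrative runtime guard for a minimum R version
  minRVersion <- "3.4.0"
  if (getRversion() < minRVersion) {
    stop("SparkR requires R ", minRVersion, " or higher; found ", getRversion())
  }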

Are we getting traction with CRAN sysadmin? It seems like this has been broken 
a few times.



From: Liang-Chi Hsieh 
Sent: Saturday, November 10, 2018 2:32 AM
To: dev@spark.apache.org
Subject: Re: [discuss] SparkR CRAN feasibility check server problem


Yeah, thanks Hyukjin Kwon for bringing this up for discussion.

I don't know how widely higher versions of R are used across the R community. If
R version 3.1.x is not very commonly used, I think we can discuss upgrading the
minimum R version in the next Spark version.

If we end up not upgrading, we can ask the CRAN sysadmin to fix it on the
service side automatically so that malformed R package info is prevented. Then
we don't need to fix it manually every time.



Hyukjin Kwon wrote
>> Can upgrading R fix the issue? Is this perhaps not necessarily
> malformed but some new format for new versions perhaps?
> That's my guess. I am not totally sure about it tho.
>
>> Anyway we should consider upgrading R version if that fixes the problem.
> Yea, we should. If we do, it should be more than R 3.4. Maybe it's
> good
> time to start to talk about the minimum R version. 3.1.x is too old. It was
> released 4.5 years ago.
> R 3.4.0 was released 1.5 years ago. Considering the timing for Spark 3.0,
> deprecating lower versions, bumping up R to 3.4 might be a reasonable
> option.
>
> Adding Shane as well.
>
> If we end up not upgrading it, I will forward this email to the CRAN
> sysadmin to discuss further anyway.
>
>
>
> On Fri, Nov 2, 2018 at 12:51 PM, Felix Cheung  felixcheung@  wrote:
>
>> Thanks for bringing this up, and much appreciated that you keep on top of this
>> at all times.
>>
>> Can upgrading R fix the issue? Is this perhaps not necessarily
>> malformed but some new format for new versions perhaps? Anyway we should
>> consider upgrading the R version if that fixes the problem.
>>
>> As an option we could also disable the repo check in Jenkins but I can
>> see
>> that could also be problematic.
>>
>>
>> On Thu, Nov 1, 2018 at 7:35 PM Hyukjin Kwon  gurwls223@  wrote:
>>
>>> Hi all,
>>>
>>> I want to raise the CRAN failure issue because it started to block Spark
>>> PRs from time to time. Since the number
>>> of PRs grows hugely in the Spark community, this is critical to not block
>>> other PRs.
>>>
>>> There has been a problem at CRAN (See
>>> https://github.com/apache/spark/pull/20005 for analysis).
>>> To cut it short, the root cause is malformed package info from
>>> https://cran.r-project.org/src/contrib/PACKAGES
>>> from the server side, and this had to be fixed by requesting the CRAN
>>> sysadmin's help.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-24152 <- newly open. I am
>>> pretty sure it's the same issue
>>> https://issues.apache.org/jira/browse/SPARK-25923 <- reopen/resolved 2
>>> times
>>> https://issues.apache.org/jira/browse/SPARK-22812
>>>
>>> This happened 5 times over roughly 10 months, blocking
>>> almost all PRs in Apache Spark.
>>> Historically, it once blocked all PRs for a few days, and the whole Spark
>>> community had to stop working.
>>>
>>> I assume this has not been a super big issue so far for other
>>> projects or other people because apparently
>>> higher versions of R have some logic to handle these malformed documents
>>> (at least I verified R 3.4.0 works fine).
>>>
>>> For our side, Jenkins has low R version (R 3.1.1 if that's not updated
>>> from what I have seen before),
>>> which is unable to parse the malformed server's response.
>>>
>>> So, I want to talk about how we are going to handle this. Possible
>>> solutions are:
>>>
>>> 1. We should start a talk with CRAN sysadmin to permanently prevent this
>>> issue
>>> 2. We upgrade R to 3.4.0 in Jenkins (however we will not be able to test
>>> low R versions)
>>> 3. ...
>>>
>>> If we are fine with it, I would like to suggest forwarding this email to the CRAN
>>> sysadmin to discuss this further.
>>>
>>> Adding Liang-Chi Felix and Shivaram who I already talked about this few
>>> times before.
>>>
>>> Thanks all.
>>>
>>>
>>>
>>>





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: DataSourceV2 capability API

2018-11-09 Thread Felix Cheung
One question is where will the list of capability strings be defined?



From: Ryan Blue 
Sent: Thursday, November 8, 2018 2:09 PM
To: Reynold Xin
Cc: Spark Dev List
Subject: Re: DataSourceV2 capability API


Yes, we currently use traits that have methods. Something like “supports 
reading missing columns” doesn’t need to deliver methods. The other example is 
where we don’t have an object to test for a trait 
(scan.isInstanceOf[SupportsBatch]) until we have a Scan with pushdown done. 
That could be expensive so we can use a capability to fail faster.

On Thu, Nov 8, 2018 at 1:54 PM Reynold Xin 
mailto:r...@databricks.com>> wrote:
This is currently accomplished by having traits that data sources can extend, 
as well as runtime exceptions right? It's hard to argue one way vs another 
without knowing how things will evolve (e.g. how many different capabilities 
there will be).


On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue  wrote:

Hi everyone,

I’d like to propose an addition to DataSourceV2 tables, a capability API. This 
API would allow Spark to query a table to determine whether it supports a 
capability or not:

val table = catalog.load(identifier)
val supportsContinuous = table.isSupported("continuous-streaming")


There are a couple of use cases for this. First, we want to be able to fail 
fast when a user tries to stream a table that doesn’t support it. The design of 
our read implementation doesn’t necessarily support this. If we want to share 
the same “scan” across streaming and batch, then we need to “branch” in the API 
after that point, but that is at odds with failing fast. We could use 
capabilities to fail fast and not worry about that concern in the read design.

I also want to use capabilities to change the behavior of some validation 
rules. The rule that validates appends, for example, doesn’t allow a write that 
is missing an optional column. That’s because the current v1 sources don’t 
support reading when columns are missing. But Iceberg does support reading a 
missing column as nulls, so that users can add a column to a table without 
breaking a scheduled job that populates the table. To fix this problem, I would 
use a table capability, like read-missing-columns-as-null.

Any comments on this approach?

rb

--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix


Re: Arrow optimization in conversion from R DataFrame to Spark DataFrame

2018-11-09 Thread Felix Cheung
Very cool!



From: Hyukjin Kwon 
Sent: Thursday, November 8, 2018 10:29 AM
To: dev
Subject: Arrow optimization in conversion from R DataFrame to Spark DataFrame

Hi all,

I am trying to introduce R Arrow optimization by reusing PySpark Arrow 
optimization.

It boosts R DataFrame > Spark DataFrame up to roughly 900% ~ 1200% faster.
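Roughly, the user-facing path looks like the sketch below. I assume here that the R side is
gated by the same spark.sql.execution.arrow.enabled flag as PySpark; treat the flag name and
the numbers as illustrative, not final:

  # illustrative usage sketch only
  library(SparkR)
  sparkR.session(sparkConfig = list(spark.sql.execution.arrow.enabled = "true"))

  # a reasonably large local R data.frame
  df <- data.frame(id = seq_len(1e6), value = rnorm(1e6))

  # the conversion that the Arrow path speeds up
  sdf <- createDataFrame(df)
  head(sdf)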

Looks working fine so far; however, I would appreciate if you guys have some 
time to take a look (https://github.com/apache/spark/pull/22954) so that we can 
directly go ahead as soon as R API of Arrow is released.

More importantly, I want more people who are into the Arrow R API side but 
also interested in the Spark side. I have already cc'ed some people I know, but 
please come, review and discuss both the Spark side and the Arrow side.

Thanks.



Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

2018-11-08 Thread Felix Cheung
They were discussed on dev@ in Mar 2018, for example.

Several attempts were made in 2.3.0, 2.3.1, 2.3.2, 2.4.0.
It’s not just tests, the last one is with vignettes.

The current doc about RStudio actually assumes you have the full Spark 
distribution (i.e. from the download page and an Apache mirror) and set SPARK_HOME 
etc, which is not hard to do, and the doc also says it is the same for the R shell, 
R scripts or other R IDEs, with the exact same steps.




From: Matei Zaharia 
Sent: Wednesday, November 7, 2018 10:32 PM
To: Wenchen Fan
Cc: Shivaram Venkataraman; Felix Cheung; Sean Owen; Spark dev list
Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

I didn’t realize the same thing was broken in 2.3.0, but we should probably 
have made this a blocker for future releases, if it’s just a matter of removing 
things from the test script. We should also make the docs at 
https://spark.apache.org/docs/latest/sparkr.html clear about how we want people 
to run SparkR. They don’t seem to say to use any specific mirror or anything 
(in fact they only talk about how to import SparkR in RStudio and in our 
bin/sparkR, not in a normal R shell). I’m pretty sure it’s OK to update the 
docs website for 2.4.0 after the release to fix this if we want.

Matei

> On Nov 7, 2018, at 6:24 PM, Wenchen Fan  wrote:
>
> Do we need to create a JIRA ticket for it and list it as a known issue in 
> 2.4.0 release notes?
>
> On Wed, Nov 7, 2018 at 11:26 PM Shivaram Venkataraman 
>  wrote:
> Agree with the points Felix made.
>
> One thing is that it looks like the only problem is vignettes and the
> tests are being skipped as designed. If you see
> https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/Windows/00check.log
> and 
> https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/Debian/00check.log,
> the tests run in 1s.
> On Tue, Nov 6, 2018 at 1:29 PM Felix Cheung  wrote:
> >
> > I’d rather not mess with 2.4.0 at this point. Being on CRAN is nice but users can 
> > also install from an Apache mirror.
> >
> > Also I had attempted and failed to get vignettes not to build; it was non 
> > trivial and I couldn’t get it to work. But I have an idea.
> >
> > As for tests, I don’t know exactly why they are not skipped. Need to investigate, 
> > but worst case test_package can run with 0 tests.
> >
> >
> >
> > ________
> > From: Sean Owen 
> > Sent: Tuesday, November 6, 2018 10:51 AM
> > To: Shivaram Venkataraman
> > Cc: Felix Cheung; Wenchen Fan; Matei Zaharia; dev
> > Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> >
> > I think the second option, to skip the tests, is best right now, if
> > the alternative is to have no SparkR release at all!
> > Can we monkey-patch the 2.4.0 release for SparkR in this way, bless it
> > from the PMC, and release that? It's drastic but so is not being able
> > to release, I think.
> > Right? or is CRAN not actually an important distribution path for
> > SparkR in particular?
> >
> > On Tue, Nov 6, 2018 at 12:49 PM Shivaram Venkataraman
> >  wrote:
> > >
> > > Right - I think we should move on with 2.4.0.
> > >
> > > In terms of what can be done to avoid this error there are two strategies
> > > - Felix had this other thread about JDK 11 that should at least let
> > > Spark run on the CRAN instance. In general this strategy isn't
> > > foolproof because the JDK version and other dependencies on that
> > > machine keep changing over time and we dont have much control over it.
> > > Worse we also dont have much control
> > > - The other solution is to not run code to build the vignettes
> > > document and just have static code blocks there that have been
> > > pre-evaluated / pre-populated. We can open a JIRA to discuss the
> > > pros/cons of this ?
> > >
> > > Thanks
> > > Shivaram
> > >
> > > On Tue, Nov 6, 2018 at 10:57 AM Felix Cheung  
> > > wrote:
> > > >
> > > > We have not been able to publish to CRAN for quite some time (since 
> > > > 2.3.0 was archived - the cause is Java 11)
> > > >
> > > > I think it’s ok to announce the release of 2.4.0
> > > >
> > > >
> > > > 
> > > > From: Wenchen Fan 
> > > > Sent: Tuesday, November 6, 2018 8:51 AM
> > > > To: Felix Cheung
> > > > Cc: Matei Zaharia; Sean Owen; Spark dev list; Shivaram Venkataraman
> > > > Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

Re: Test and support only LTS JDK release?

2018-11-06 Thread Felix Cheung
Is there a list of LTS releases that I can reference?



From: Ryan Blue 
Sent: Tuesday, November 6, 2018 1:28 PM
To: sn...@snazy.de
Cc: Spark Dev List; cdelg...@apple.com
Subject: Re: Test and support only LTS JDK release?

+1 for supporting LTS releases.

On Tue, Nov 6, 2018 at 11:48 AM Robert Stupp 
mailto:sn...@snazy.de>> wrote:

+1 on supporting LTS releases.

VM distributors (RedHat, Azul - to name two) want to provide patches to LTS 
versions (i.e. into http://hg.openjdk.java.net/jdk-updates/jdk11u/). How that 
will play out in reality ... I don't know. Whether Oracle will contribute to 
that repo for 8 after it's EOL and 11 after the 6 month cycle ... we will see. 
Most Linux distributions promised(?) long-term support for Java 11 in their LTS 
releases (e.g. Ubuntu 18.04). I am not sure what that exactly means ... whether 
they will actively provide patches to OpenJDK or whether they just build from 
source.

But considering that, I think it's definitely worth to at least keep an eye on 
Java 12 and 13 - even if those are just EA. Java 12 for example does already 
forbid some "dirty tricks" that are still possible in Java 11.


On 11/6/18 8:32 PM, DB Tsai wrote:
OpenJDK will follow Oracle's release cycle, 
https://openjdk.java.net/projects/jdk/, a strict six months model. I'm not 
familiar with other non-Oracle VMs and Redhat support.

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

On Nov 6, 2018, at 11:26 AM, Reynold Xin 
mailto:r...@databricks.com>> wrote:

What does OpenJDK do, and other non-Oracle VMs? I know there were a lot of 
discussions from Redhat etc. about supporting it.


On Tue, Nov 6, 2018 at 11:24 AM DB Tsai 
mailto:d_t...@apple.com>> wrote:
Given Oracle's new 6-month release model, I feel the only realistic option is 
to only test and support LTS JDKs such as JDK 11 and future LTS releases. I would 
like to have a discussion on this in the Spark community.

Thanks,

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc



--
Robert Stupp
@snazy


--
Ryan Blue
Software Engineer
Netflix


Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-06 Thread Felix Cheung
So to clarify, only Scala 2.12 is supported in Spark 3?



From: Ryan Blue 
Sent: Tuesday, November 6, 2018 1:24 PM
To: d_t...@apple.com
Cc: Sean Owen; Spark Dev List; cdelg...@apple.com
Subject: Re: Make Scala 2.12 as default Scala version in Spark 3.0

+1 to Scala 2.12 as the default in Spark 3.0.

On Tue, Nov 6, 2018 at 11:50 AM DB Tsai 
mailto:d_t...@apple.com>> wrote:
+1 on dropping Scala 2.11 in Spark 3.0 to simplify the build.

As Scala 2.11 will not support Java 11 unless we make a significant investment, 
if we decide not to drop Scala 2.11 in Spark 3.0, what we can do is have only the 
Scala 2.12 build support Java 11 while the Scala 2.11 build supports Java 8. But I agree 
with Sean that this can make the dependencies really complicated; hence I support 
dropping Scala 2.11 in Spark 3.0 directly.

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

On Nov 6, 2018, at 11:38 AM, Sean Owen 
mailto:sro...@gmail.com>> wrote:

I think we should make Scala 2.12 the default in Spark 3.0. I would
also prefer to drop Scala 2.11 support in 3.0. In theory, not dropping
2.11 support means we'd support Scala 2.11 for years, the lifetime
of Spark 3.x. In practice, we could drop 2.11 support in a 3.1.0 or
3.2.0 release, kind of like what happened with 2.10 in 2.x.

Java (9-)11 support also complicates this. I think getting it to work
will need some significant dependency updates, and I worry not all
will be available for 2.11 or will present some knotty problems. We'll
find out soon if that forces the issue.

Also note that Scala 2.13 is pretty close to release, and we'll want
to support it soon after release, perhaps sooner than the long delay
before 2.12 was supported (because it was hard!). It will probably be
out well before Spark 3.0. Cross-compiling for 3 Scala versions sounds
like too much. 3.0 could support 2.11 and 2.12, and 3.1 support 2.12
and 2.13, or something. But if 2.13 support is otherwise attainable at
the release of Spark 3.0, I wonder if that too argues for dropping
2.11 support.

Finally I'll say that Spark itself isn't dropping 2.11 support for a
while, no matter what; it still exists in the 2.4.x branch of course.
People who can't update off Scala 2.11 can stay on Spark 2.x, note.

Sean


On Tue, Nov 6, 2018 at 1:13 PM DB Tsai 
mailto:d_t...@apple.com>> wrote:

We made Scala 2.11 the default Scala version in Spark 2.0. Now, the next Spark 
version will be 3.0, so it's a great time to discuss whether we should make Scala 2.12 
the default Scala version in Spark 3.0.

Scala 2.11 is EOL, and it came out 4.5 years ago; as a result, it's unlikely to 
support JDK 11 in Scala 2.11 unless we're willing to sponsor the needed work, 
per the discussion in the Scala community: 
https://github.com/scala/scala-dev/issues/559#issuecomment-436160166

We have initial support of Scala 2.12 in Spark 2.4. If we decide to make Scala 
2.12 as default for Spark 3.0 now, we will have ample time to work on bugs and 
issues that we may run into.

What do you think?

Thanks,

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc


-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org




--
Ryan Blue
Software Engineer
Netflix


Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

2018-11-06 Thread Felix Cheung
I’d rather not mess with 2.4.0 at this point. Being on CRAN is nice but users can 
also install from an Apache mirror.

Also I had attempted and failed to get vignettes not to build; it was non 
trivial and I couldn’t get it to work. But I have an idea.

As for tests, I don’t know exactly why they are not skipped. Need to investigate but 
worst case test_package can run with 0 tests.




From: Sean Owen 
Sent: Tuesday, November 6, 2018 10:51 AM
To: Shivaram Venkataraman
Cc: Felix Cheung; Wenchen Fan; Matei Zaharia; dev
Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

I think the second option, to skip the tests, is best right now, if
the alternative is to have no SparkR release at all!
Can we monkey-patch the 2.4.0 release for SparkR in this way, bless it
from the PMC, and release that? It's drastic but so is not being able
to release, I think.
Right? or is CRAN not actually an important distribution path for
SparkR in particular?

On Tue, Nov 6, 2018 at 12:49 PM Shivaram Venkataraman
 wrote:
>
> Right - I think we should move on with 2.4.0.
>
> In terms of what can be done to avoid this error there are two strategies
> - Felix had this other thread about JDK 11 that should at least let
> Spark run on the CRAN instance. In general this strategy isn't
> foolproof because the JDK version and other dependencies on that
> machine keep changing over time and we dont have much control over it.
> Worse we also dont have much control
> - The other solution is to not run code to build the vignettes
> document and just have static code blocks there that have been
> pre-evaluated / pre-populated. We can open a JIRA to discuss the
> pros/cons of this ?
>
> Thanks
> Shivaram
>
> On Tue, Nov 6, 2018 at 10:57 AM Felix Cheung  
> wrote:
> >
> > We have not been able to publish to CRAN for quite some time (since 2.3.0 
> > was archived - the cause is Java 11)
> >
> > I think it’s ok to announce the release of 2.4.0
> >
> >
> > ________
> > From: Wenchen Fan 
> > Sent: Tuesday, November 6, 2018 8:51 AM
> > To: Felix Cheung
> > Cc: Matei Zaharia; Sean Owen; Spark dev list; Shivaram Venkataraman
> > Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> >
> > Do you mean we should have a 2.4.0 release without CRAN and then do a 2.4.1 
> > immediately?
> >
> > On Wed, Nov 7, 2018 at 12:34 AM Felix Cheung  
> > wrote:
> >>
> >> Shivaram and I were discussing.
> >> Actually we worked with them before. Another possible approach is to 
> >> remove the vignettes eval and all test from the source package... in the 
> >> next release.
> >>
> >>
> >> 
> >> From: Matei Zaharia 
> >> Sent: Tuesday, November 6, 2018 12:07 AM
> >> To: Felix Cheung
> >> Cc: Sean Owen; dev; Shivaram Venkataraman
> >> Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> >>
> >> Maybe it’s worth contacting the CRAN maintainers to ask for help? Perhaps 
> >> we aren’t disabling it correctly, or perhaps they can ignore this specific 
> >> failure. +Shivaram who might have some ideas.
> >>
> >> Matei
> >>
> >> > On Nov 5, 2018, at 9:09 PM, Felix Cheung  
> >> > wrote:
> >> >
> >> > I don’t know what the cause is yet.
> >> >
> >> > The test should be skipped because of this check
> >> > https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L21
> >> >
> >> > And this
> >> > https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L57
> >> >
> >> > But it ran:
> >> > callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper", 
> >> > "fit", formula,
> >> >
> >> > The earlier release was archived because of Java 11+ too so this 
> >> > unfortunately isn’t new.
> >> >
> >> >
> >> > From: Sean Owen 
> >> > Sent: Monday, November 5, 2018 7:22 PM
> >> > To: Felix Cheung
> >> > Cc: dev
> >> > Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> >> >
> >> > What can we do to get the release through? is there any way to
> >> > circumvent these tests or otherwise hack it? or does it need a
> >> > maintenance release?
> >> > On Mon, Nov 5, 2018 at 8:53 PM Felix Cheung  
> >> > wrote:
> >> > >
> >> > > FYI. SparkR submis

Re: Java 11 support

2018-11-06 Thread Felix Cheung
+1 for Spark 3, definitely
Thanks for the updates



From: Sean Owen 
Sent: Tuesday, November 6, 2018 9:11 AM
To: Felix Cheung
Cc: dev
Subject: Re: Java 11 support

I think that Java 9 support basically gets Java 10, 11 support. But
the jump from 8 to 9 is unfortunately more breaking than usual because
of the total revamping of the internal JDK classes. I think it will be
mostly a matter of dependencies needing updates to work. I agree this
is probably pretty important for Spark 3. Here's the ticket I know of:
https://issues.apache.org/jira/browse/SPARK-24417 . DB is already
working on some of it, I see.
On Tue, Nov 6, 2018 at 10:59 AM Felix Cheung  wrote:
>
> Speaking of, can we work to support Java 11?
> That will fix all the problems below.
>
>
>
> ____
> From: Felix Cheung 
> Sent: Tuesday, November 6, 2018 8:57 AM
> To: Wenchen Fan
> Cc: Matei Zaharia; Sean Owen; Spark dev list; Shivaram Venkataraman
> Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
>
> We have not been able to publish to CRAN for quite some time (since 2.3.0 was 
> archived - the cause is Java 11)
>
> I think it’s ok to announce the release of 2.4.0
>
>
> 
> From: Wenchen Fan 
> Sent: Tuesday, November 6, 2018 8:51 AM
> To: Felix Cheung
> Cc: Matei Zaharia; Sean Owen; Spark dev list; Shivaram Venkataraman
> Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
>
> Do you mean we should have a 2.4.0 release without CRAN and then do a 2.4.1 
> immediately?
>
> On Wed, Nov 7, 2018 at 12:34 AM Felix Cheung  
> wrote:
>>
>> Shivaram and I were discussing.
>> Actually we worked with them before. Another possible approach is to remove 
>> the vignettes eval and all test from the source package... in the next 
>> release.
>>
>>
>> 
>> From: Matei Zaharia 
>> Sent: Tuesday, November 6, 2018 12:07 AM
>> To: Felix Cheung
>> Cc: Sean Owen; dev; Shivaram Venkataraman
>> Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
>>
>> Maybe it’s worth contacting the CRAN maintainers to ask for help? Perhaps we 
>> aren’t disabling it correctly, or perhaps they can ignore this specific 
>> failure. +Shivaram who might have some ideas.
>>
>> Matei
>>
>> > On Nov 5, 2018, at 9:09 PM, Felix Cheung  wrote:
>> >
>> > I don’t know what the cause is yet.
>> >
>> > The test should be skipped because of this check
>> > https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L21
>> >
>> > And this
>> > https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L57
>> >
>> > But it ran:
>> > callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper", 
>> > "fit", formula,
>> >
>> > The earlier release was archived because of Java 11+ too so this 
>> > unfortunately isn’t new.
>> >
>> >
>> > From: Sean Owen 
>> > Sent: Monday, November 5, 2018 7:22 PM
>> > To: Felix Cheung
>> > Cc: dev
>> > Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
>> >
>> > What can we do to get the release through? is there any way to
>> > circumvent these tests or otherwise hack it? or does it need a
>> > maintenance release?
>> > On Mon, Nov 5, 2018 at 8:53 PM Felix Cheung  
>> > wrote:
>> > >
>> > > FYI. SparkR submission failed. It seems to detect Java 11 correctly with 
>> > > vignettes but not skipping tests as would be expected.
>> > >
>> > > Error: processing vignette ¡¥sparkr-vignettes.Rmd¡Š failed with 
>> > > diagnostics:
>> > > Java version 8 is required for this package; found version: 11.0.1
>> > > Execution halted
>> > >
>> > > * checking PDF version of manual ... OK
>> > > * DONE
>> > > Status: 1 WARNING, 1 NOTE
>> > >
>> > > Current CRAN status: ERROR: 1, OK: 1
>> > > See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>
>> > >
>> > > Version: 2.3.0
>> > > Check: tests, Result: ERROR
>> > > Running ¡¥run-all.R¡Š [8s/35s]
>> > > Running the tests in ¡¥tests/run-all.R¡Š failed.
>> > > Last 13 lines of output:
>> > > 4: 
>> > > callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper", 
>> > > "

Java 11 support

2018-11-06 Thread Felix Cheung
Speaking of, can we work to support Java 11?
That will fix all the problems below.




From: Felix Cheung 
Sent: Tuesday, November 6, 2018 8:57 AM
To: Wenchen Fan
Cc: Matei Zaharia; Sean Owen; Spark dev list; Shivaram Venkataraman
Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

We have not been able to publish to CRAN for quite some time (since 2.3.0 was 
archived - the cause is Java 11)

I think it’s ok to announce the release of 2.4.0



From: Wenchen Fan 
Sent: Tuesday, November 6, 2018 8:51 AM
To: Felix Cheung
Cc: Matei Zaharia; Sean Owen; Spark dev list; Shivaram Venkataraman
Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

Do you mean we should have a 2.4.0 release without CRAN and then do a 2.4.1 
immediately?

On Wed, Nov 7, 2018 at 12:34 AM Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
Shivaram and I were discussing.
Actually we worked with them before. Another possible approach is to remove the 
vignettes eval and all test from the source package... in the next release.



From: Matei Zaharia mailto:matei.zaha...@gmail.com>>
Sent: Tuesday, November 6, 2018 12:07 AM
To: Felix Cheung
Cc: Sean Owen; dev; Shivaram Venkataraman
Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

Maybe it’s worth contacting the CRAN maintainers to ask for help? Perhaps we 
aren’t disabling it correctly, or perhaps they can ignore this specific 
failure. +Shivaram who might have some ideas.

Matei

> On Nov 5, 2018, at 9:09 PM, Felix Cheung 
> mailto:felixcheun...@hotmail.com>> wrote:
>
> I don’t know what the cause is yet.
>
> The test should be skipped because of this check
> https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L21
>
> And this
> https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L57
>
> But it ran:
> callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper", 
> "fit", formula,
>
> The earlier release was archived because of Java 11+ too so this unfortunately 
> isn’t new.
>
>
> From: Sean Owen mailto:sro...@gmail.com>>
> Sent: Monday, November 5, 2018 7:22 PM
> To: Felix Cheung
> Cc: dev
> Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
>
> What can we do to get the release through? is there any way to
> circumvent these tests or otherwise hack it? or does it need a
> maintenance release?
> On Mon, Nov 5, 2018 at 8:53 PM Felix Cheung 
> mailto:felixcheun...@hotmail.com>> wrote:
> >
> > FYI. SparkR submission failed. It seems to detect Java 11 correctly with 
> > vignettes but not skipping tests as would be expected.
> >
> > Error: processing vignette ‘sparkr-vignettes.Rmd’ failed with diagnostics:
> > Java version 8 is required for this package; found version: 11.0.1
> > Execution halted
> >
> > * checking PDF version of manual ... OK
> > * DONE
> > Status: 1 WARNING, 1 NOTE
> >
> > Current CRAN status: ERROR: 1, OK: 1
> > See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>
> >
> > Version: 2.3.0
> > Check: tests, Result: ERROR
> > Running ‘run-all.R’ [8s/35s]
> > Running the tests in ‘tests/run-all.R’ failed.
> > Last 13 lines of output:
> > 4: callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper", 
> > "fit", formula,
> > data@sdf, tolower(family$family), family$link, tol, as.integer(maxIter), 
> > weightCol,
> > regParam, as.double(var.power), as.double(link.power), 
> > stringIndexerOrderType,
> > offsetCol)
> > 5: invokeJava(isStatic = TRUE, className, methodName, ...)
> > 6: handleErrors(returnStatus, conn)
> > 7: stop(readString(conn))
> >
> > ══ testthat results 
> > ═══
> > OK: 0 SKIPPED: 0 FAILED: 2
> > 1. Error: create DataFrame from list or data.frame (@test_basic.R#26)
> > 2. Error: spark.glm and predict (@test_basic.R#58)
> >
> >
> >
> > -- Forwarded message -
> > Date: Mon, Nov 5, 2018, 10:12
> > Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> >
> > Dear maintainer,
> >
> > package SparkR_2.4.0.tar.gz does not pass the incoming checks 
> > automatically, please see the following pre-tests:
> > Windows: 
> > <https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/Windows/00check.log>
> > Status: 1 NOTE
> > Debian: 
> > <https://win-builder.r

Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

2018-11-06 Thread Felix Cheung
We have not been able to publish to CRAN for quite some time (since 2.3.0 was 
archived - the cause is Java 11)

I think it’s ok to announce the release of 2.4.0



From: Wenchen Fan 
Sent: Tuesday, November 6, 2018 8:51 AM
To: Felix Cheung
Cc: Matei Zaharia; Sean Owen; Spark dev list; Shivaram Venkataraman
Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

Do you mean we should have a 2.4.0 release without CRAN and then do a 2.4.1 
immediately?

On Wed, Nov 7, 2018 at 12:34 AM Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
Shivaram and I were discussing.
Actually we worked with them before. Another possible approach is to remove the 
vignettes eval and all test from the source package... in the next release.



From: Matei Zaharia mailto:matei.zaha...@gmail.com>>
Sent: Tuesday, November 6, 2018 12:07 AM
To: Felix Cheung
Cc: Sean Owen; dev; Shivaram Venkataraman
Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

Maybe it’s worth contacting the CRAN maintainers to ask for help? Perhaps we 
aren’t disabling it correctly, or perhaps they can ignore this specific 
failure. +Shivaram who might have some ideas.

Matei

> On Nov 5, 2018, at 9:09 PM, Felix Cheung 
> mailto:felixcheun...@hotmail.com>> wrote:
>
> I don’t know what the cause is yet.
>
> The test should be skipped because of this check
> https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L21
>
> And this
> https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L57
>
> But it ran:
> callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper", 
> "fit", formula,
>
> The earlier release was archived because of Java 11+ too so this unfortunately 
> isn’t new.
>
>
> From: Sean Owen mailto:sro...@gmail.com>>
> Sent: Monday, November 5, 2018 7:22 PM
> To: Felix Cheung
> Cc: dev
> Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
>
> What can we do to get the release through? is there any way to
> circumvent these tests or otherwise hack it? or does it need a
> maintenance release?
> On Mon, Nov 5, 2018 at 8:53 PM Felix Cheung 
> mailto:felixcheun...@hotmail.com>> wrote:
> >
> > FYI. SparkR submission failed. It seems to detect Java 11 correctly with 
> > vignettes but not skipping tests as would be expected.
> >
> > Error: processing vignette ‘sparkr-vignettes.Rmd’ failed with diagnostics:
> > Java version 8 is required for this package; found version: 11.0.1
> > Execution halted
> >
> > * checking PDF version of manual ... OK
> > * DONE
> > Status: 1 WARNING, 1 NOTE
> >
> > Current CRAN status: ERROR: 1, OK: 1
> > See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>
> >
> > Version: 2.3.0
> > Check: tests, Result: ERROR
> > Running ‘run-all.R’ [8s/35s]
> > Running the tests in ‘tests/run-all.R’ failed.
> > Last 13 lines of output:
> > 4: callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper", 
> > "fit", formula,
> > data@sdf, tolower(family$family), family$link, tol, as.integer(maxIter), 
> > weightCol,
> > regParam, as.double(var.power), as.double(link.power), 
> > stringIndexerOrderType,
> > offsetCol)
> > 5: invokeJava(isStatic = TRUE, className, methodName, ...)
> > 6: handleErrors(returnStatus, conn)
> > 7: stop(readString(conn))
> >
> > ══ testthat results 
> > ═══
> > OK: 0 SKIPPED: 0 FAILED: 2
> > 1. Error: create DataFrame from list or data.frame (@test_basic.R#26)
> > 2. Error: spark.glm and predict (@test_basic.R#58)
> >
> >
> >
> > -- Forwarded message -
> > Date: Mon, Nov 5, 2018, 10:12
> > Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> >
> > Dear maintainer,
> >
> > package SparkR_2.4.0.tar.gz does not pass the incoming checks 
> > automatically, please see the following pre-tests:
> > Windows: 
> > <https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/Windows/00check.log>
> > Status: 1 NOTE
> > Debian: 
> > <https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/Debian/00check.log>
> > Status: 1 WARNING, 1 NOTE
> >
> > Last released version's CRAN status: ERROR: 1, OK: 1
> > See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>
> >
> > CRAN Web: <https://cran

Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

2018-11-06 Thread Felix Cheung
Shivaram and I were discussing.
Actually we worked with them before. Another possible approach is to remove the 
vignettes eval and all test from the source package... in the next release.



From: Matei Zaharia 
Sent: Tuesday, November 6, 2018 12:07 AM
To: Felix Cheung
Cc: Sean Owen; dev; Shivaram Venkataraman
Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

Maybe it’s worth contacting the CRAN maintainers to ask for help? Perhaps we 
aren’t disabling it correctly, or perhaps they can ignore this specific 
failure. +Shivaram who might have some ideas.

Matei

> On Nov 5, 2018, at 9:09 PM, Felix Cheung  wrote:
>
> I don’t know what the cause is yet.
>
> The test should be skipped because of this check
> https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L21
>
> And this
> https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L57
>
> But it ran:
> callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper", 
> "fit", formula,
>
> The earlier release was archived because of Java 11+ too so this unfortunately 
> isn’t new.
>
>
> From: Sean Owen 
> Sent: Monday, November 5, 2018 7:22 PM
> To: Felix Cheung
> Cc: dev
> Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
>
> What can we do to get the release through? is there any way to
> circumvent these tests or otherwise hack it? or does it need a
> maintenance release?
> On Mon, Nov 5, 2018 at 8:53 PM Felix Cheung  wrote:
> >
> > FYI. SparkR submission failed. It seems to detect Java 11 correctly with 
> > vignettes but not skipping tests as would be expected.
> >
> > Error: processing vignette ‘sparkr-vignettes.Rmd’ failed with diagnostics:
> > Java version 8 is required for this package; found version: 11.0.1
> > Execution halted
> >
> > * checking PDF version of manual ... OK
> > * DONE
> > Status: 1 WARNING, 1 NOTE
> >
> > Current CRAN status: ERROR: 1, OK: 1
> > See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>
> >
> > Version: 2.3.0
> > Check: tests, Result: ERROR
> > Running ‘run-all.R’ [8s/35s]
> > Running the tests in ‘tests/run-all.R’ failed.
> > Last 13 lines of output:
> > 4: callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper", 
> > "fit", formula,
> > data@sdf, tolower(family$family), family$link, tol, as.integer(maxIter), 
> > weightCol,
> > regParam, as.double(var.power), as.double(link.power), 
> > stringIndexerOrderType,
> > offsetCol)
> > 5: invokeJava(isStatic = TRUE, className, methodName, ...)
> > 6: handleErrors(returnStatus, conn)
> > 7: stop(readString(conn))
> >
> > ══ testthat results 
> > ═══
> > OK: 0 SKIPPED: 0 FAILED: 2
> > 1. Error: create DataFrame from list or data.frame (@test_basic.R#26)
> > 2. Error: spark.glm and predict (@test_basic.R#58)
> >
> >
> >
> > -- Forwarded message -
> > Date: Mon, Nov 5, 2018, 10:12
> > Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
> >
> > Dear maintainer,
> >
> > package SparkR_2.4.0.tar.gz does not pass the incoming checks 
> > automatically, please see the following pre-tests:
> > Windows: 
> > <https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/Windows/00check.log>
> > Status: 1 NOTE
> > Debian: 
> > <https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/Debian/00check.log>
> > Status: 1 WARNING, 1 NOTE
> >
> > Last released version's CRAN status: ERROR: 1, OK: 1
> > See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>
> >
> > CRAN Web: <https://cran.r-project.org/package=SparkR>
> >
> > Please fix all problems and resubmit a fixed version via the webform.
> > If you are not sure how to fix the problems shown, please ask for help on 
> > the R-package-devel mailing list:
> > <https://stat.ethz.ch/mailman/listinfo/r-package-devel>
> > If you are fairly certain the rejection is a false positive, please 
> > reply-all to this message and explain.
> >
> > More details are given in the directory:
> > <https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/>
> > The files will be removed after roughly 7 days.
> >
> > No strong reverse dependencies to be checked.
> >
> 

Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

2018-11-05 Thread Felix Cheung
I don’t know what the cause is yet.

The test should be skipped because of this check
https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L21

And this
https://github.com/apache/spark/blob/branch-2.4/R/pkg/inst/tests/testthat/test_basic.R#L57

But it ran:
callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper", "fit", 
formula,
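For reference, the guard those two lines rely on is roughly of this shape -- an illustrative
sketch, not the literal test_basic.R code:

  # illustrative sketch of a Java-version skip guard in a testthat test
  library(testthat)

  detectedJavaMajor <- function() {
    out <- tryCatch(system2("java", "-version", stdout = TRUE, stderr = TRUE),
                    error = function(e) character(0))
    ver <- regmatches(out, regexpr('version "[^"]+"', out))
    if (length(ver) == 0) return(NA_integer_)
    parts <- strsplit(gsub('version "|"', "", ver[1]), "[._]")[[1]]
    major <- suppressWarnings(as.integer(parts[1]))
    # "1.8.0_x" style strings report major version in the second field
    if (!is.na(major) && major == 1) major <- suppressWarnings(as.integer(parts[2]))
    major
  }

  test_that("spark.glm and predict", {
    major <- detectedJavaMajor()
    if (is.na(major) || major != 8) {
      skip("Java 8 is required; skipping on this environment")
    }
    # ... actual assertions would go here ...
  })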

The earlier release was archived because of Java 11+ too so this unfortunately 
isn’t new.



From: Sean Owen 
Sent: Monday, November 5, 2018 7:22 PM
To: Felix Cheung
Cc: dev
Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

What can we do to get the release through? is there any way to
circumvent these tests or otherwise hack it? or does it need a
maintenance release?
On Mon, Nov 5, 2018 at 8:53 PM Felix Cheung  wrote:
>
> FYI. SparkR submission failed. It seems to detect Java 11 correctly with 
> vignettes but not skipping tests as would be expected.
>
> Error: processing vignette ‘sparkr-vignettes.Rmd’ failed with diagnostics:
> Java version 8 is required for this package; found version: 11.0.1
> Execution halted
>
> * checking PDF version of manual ... OK
> * DONE
> Status: 1 WARNING, 1 NOTE
>
> Current CRAN status: ERROR: 1, OK: 1
> See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>
>
> Version: 2.3.0
> Check: tests, Result: ERROR
> Running ‘run-all.R’ [8s/35s]
> Running the tests in ‘tests/run-all.R’ failed.
> Last 13 lines of output:
> 4: callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper", 
> "fit", formula,
> data@sdf, tolower(family$family), family$link, tol, as.integer(maxIter), 
> weightCol,
> regParam, as.double(var.power), as.double(link.power), stringIndexerOrderType,
> offsetCol)
> 5: invokeJava(isStatic = TRUE, className, methodName, ...)
> 6: handleErrors(returnStatus, conn)
> 7: stop(readString(conn))
>
> ══ testthat results 
> ═══
> OK: 0 SKIPPED: 0 FAILED: 2
> 1. Error: create DataFrame from list or data.frame (@test_basic.R#26)
> 2. Error: spark.glm and predict (@test_basic.R#58)
>
>
>
> -- Forwarded message -
> Date: Mon, Nov 5, 2018, 10:12
> Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0
>
> Dear maintainer,
>
> package SparkR_2.4.0.tar.gz does not pass the incoming checks automatically, 
> please see the following pre-tests:
> Windows: 
> <https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/Windows/00check.log>
> Status: 1 NOTE
> Debian: 
> <https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/Debian/00check.log>
> Status: 1 WARNING, 1 NOTE
>
> Last released version's CRAN status: ERROR: 1, OK: 1
> See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>
>
> CRAN Web: <https://cran.r-project.org/package=SparkR>
>
> Please fix all problems and resubmit a fixed version via the webform.
> If you are not sure how to fix the problems shown, please ask for help on the 
> R-package-devel mailing list:
> <https://stat.ethz.ch/mailman/listinfo/r-package-devel>
> If you are fairly certain the rejection is a false positive, please reply-all 
> to this message and explain.
>
> More details are given in the directory:
> <https://win-builder.r-project.org/incoming_pretest/SparkR_2.4.0_20181105_165757/>
> The files will be removed after roughly 7 days.
>
> No strong reverse dependencies to be checked.
>
> Best regards,
> CRAN teams' auto-check service
> Flavor: r-devel-linux-x86_64-debian-gcc, r-devel-windows-ix86+x86_64
> Check: CRAN incoming feasibility, Result: NOTE
> Maintainer: 'Shivaram Venkataraman '
>
> New submission
>
> Package was archived on CRAN
>
> Possibly mis-spelled words in DESCRIPTION:
> Frontend (4:10, 5:28)
>
> CRAN repository db overrides:
> X-CRAN-Comment: Archived on 2018-05-01 as check problems were not
> corrected despite reminders.
>
> Flavor: r-devel-linux-x86_64-debian-gcc
> Check: re-building of vignette outputs, Result: WARNING
> Error in re-building vignettes:
> ...
>
> Attaching package: 'SparkR'
>
> The following objects are masked from 'package:stats':
>
> cov, filter, lag, na.omit, predict, sd, var, window
>
> The following objects are masked from 'package:base':
>
> as.data.frame, colnames, colnames<-, drop, endsWith,
> intersect, rank, rbind, sample, startsWith, subset, summary,
> transform, union
>
> trying URL 
> 'http://mirror.klaus-uwe.me/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz'
> Content type 'application/octet-stream' length 227893062 bytes (217.3 MB)
> ==
> downloaded 217.3 MB
>
> Quitting from lines 65-67 (sparkr-vignettes.Rmd)
> Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
> Java version 8 is required for this package; found version: 11.0.1
> Execution halted


Fwd: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

2018-11-05 Thread Felix Cheung
FYI. SparkR submission failed. It seems to detect Java 11 correctly with 
vignettes but not skipping tests as would be expected.

Error: processing vignette ‘sparkr-vignettes.Rmd’ failed with diagnostics:
Java version 8 is required for this package; found version: 11.0.1
Execution halted

* checking PDF version of manual ... OK
* DONE
Status: 1 WARNING, 1 NOTE

Current CRAN status: ERROR: 1, OK: 1
See: 

Version: 2.3.0
Check: tests, Result: ERROR
Running ‘run-all.R’ [8s/35s]
  Running the tests in ‘tests/run-all.R’ failed.
  Last 13 lines of output:
4: callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper", 
"fit", formula,
   data@sdf, tolower(family$family), family$link, tol, 
as.integer(maxIter), weightCol,
   regParam, as.double(var.power), as.double(link.power), 
stringIndexerOrderType,
   offsetCol)
5: invokeJava(isStatic = TRUE, className, methodName, ...)
6: handleErrors(returnStatus, conn)
7: stop(readString(conn))

══ testthat results 
═══
OK: 0 SKIPPED: 0 FAILED: 2
1. Error: create DataFrame from list or data.frame (@test_basic.R#26)
2. Error: spark.glm and predict (@test_basic.R#58)



-- Forwarded message -
Date: Mon, Nov 5, 2018, 10:12
Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

Dear maintainer,

package SparkR_2.4.0.tar.gz does not pass the incoming checks automatically, 
please see the following pre-tests:
Windows: 

Status: 1 NOTE
Debian: 

Status: 1 WARNING, 1 NOTE

Last released version's CRAN status: ERROR: 1, OK: 1
See: 

CRAN Web: 

Please fix all problems and resubmit a fixed version via the webform.
If you are not sure how to fix the problems shown, please ask for help on the 
R-package-devel mailing list:

If you are fairly certain the rejection is a false positive, please reply-all 
to this message and explain.

More details are given in the directory:

The files will be removed after roughly 7 days.

No strong reverse dependencies to be checked.

Best regards,
CRAN teams' auto-check service
Flavor: r-devel-linux-x86_64-debian-gcc, r-devel-windows-ix86+x86_64
Check: CRAN incoming feasibility, Result: NOTE
  Maintainer: 'Shivaram Venkataraman 
mailto:shiva...@cs.berkeley.edu>>'

  New submission

  Package was archived on CRAN

  Possibly mis-spelled words in DESCRIPTION:
Frontend (4:10, 5:28)

  CRAN repository db overrides:
X-CRAN-Comment: Archived on 2018-05-01 as check problems were not
  corrected despite reminders.

Flavor: r-devel-linux-x86_64-debian-gcc
Check: re-building of vignette outputs, Result: WARNING
  Error in re-building vignettes:
...

  Attaching package: 'SparkR'

  The following objects are masked from 'package:stats':

  cov, filter, lag, na.omit, predict, sd, var, window

  The following objects are masked from 'package:base':

  as.data.frame, colnames, colnames<-, drop, endsWith,
  intersect, rank, rbind, sample, startsWith, subset, summary,
  transform, union

  trying URL 
'http://mirror.klaus-uwe.me/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz'
  Content type 'application/octet-stream' length 227893062 bytes (217.3 MB)
  ==
  downloaded 217.3 MB

  Quitting from lines 65-67 (sparkr-vignettes.Rmd)
  Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
  Java version 8 is required for this package; found version: 11.0.1
  Execution halted
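
For context, a minimal sketch of the kind of Java-version guard the tests would need in
order to be skipped cleanly. This assumes testthat; java_major_version() and
skip_unless_java8() are hypothetical helper names used for illustration only, not part
of the SparkR package.

# Hedged sketch: skip tests when a supported JDK is not found on PATH.
library(testthat)

java_major_version <- function() {
  # `java -version` prints to stderr, e.g. 'openjdk version "11.0.1" 2018-10-16'
  out <- tryCatch(
    system2("java", "-version", stdout = TRUE, stderr = TRUE),
    error = function(e) character(0)
  )
  if (length(out) == 0) return(NA_integer_)
  ver <- sub('.*version "([^"]+)".*', "\\1", out[1])
  parts <- strsplit(ver, "[._]")[[1]]
  # Java 8 reports "1.8.0_x"; Java 9+ reports "11.0.1", "17.0.2", ...
  if (parts[1] == "1") as.integer(parts[2]) else as.integer(parts[1])
}

skip_unless_java8 <- function() {
  major <- java_major_version()
  if (is.na(major) || major != 8) {
    skip(paste("Java 8 is required for these tests; found:", major))
  }
}

test_that("basic test guarded by Java version", {
  skip_unless_java8()
  expect_true(TRUE)  # real test body would go here
})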


Re: [discuss] SparkR CRAN feasibility check server problem

2018-11-01 Thread Felix Cheung
Thanks for bringing this up, and much appreciated for keeping on top of this at
all times.

Can upgrading R fix the issue? Is this perhaps not necessarily a malformed file,
but some new format used by newer versions? Anyway, we should consider
upgrading the R version if that fixes the problem.

As an option we could also disable the repo check in Jenkins, but I can see
that could also be problematic.


On Thu, Nov 1, 2018 at 7:35 PM Hyukjin Kwon  wrote:

> Hi all,
>
> I want to raise the CRAN failure issue because it has started to block Spark
> PRs from time to time. Since the number
> of PRs has grown hugely in the Spark community, it is critical not to block
> other PRs.
>
> There has been a problem at CRAN (see
> https://github.com/apache/spark/pull/20005 for the analysis).
> To cut it short, the root cause is malformed package info served from
> https://cran.r-project.org/src/contrib/PACKAGES
> on the server side, and this had to be fixed by requesting help from the CRAN
> sysadmin.
>
> https://issues.apache.org/jira/browse/SPARK-24152 <- newly opened. I am
> pretty sure it's the same issue
> https://issues.apache.org/jira/browse/SPARK-25923 <- reopened/resolved 2
> times
> https://issues.apache.org/jira/browse/SPARK-22812
>
> This has happened 5 times over roughly 10 months, blocking almost
> all PRs in Apache Spark.
> At one point it blocked all PRs for a few days, and the whole Spark
> community had to stop working.
>
> I assume this has not been a big issue so far for other projects
> or other people because, apparently,
> higher versions of R have some logic to handle these malformed documents (at
> least I verified that R 3.4.0 works fine).
>
> On our side, Jenkins has an old R version (R 3.1.1, if that has not been
> updated from what I have seen before),
> which is unable to parse the server's malformed response.
>
> So, I want to talk about how we are going to handle this. Possible
> solutions are:
>
> 1. We should start a discussion with the CRAN sysadmin to permanently prevent
> this issue
> 2. We upgrade R to 3.4.0 in Jenkins (however, we will no longer be able to test
> lower R versions)
> 3. ...
>
> If we are fine with that, I would like to suggest forwarding this email to the
> CRAN sysadmin to discuss this further.
>
> Adding Liang-Chi, Felix, and Shivaram, with whom I have already talked about
> this a few times before.
>
> Thanks all.
>
>
>
>
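
As an illustration of the repo check being discussed, a minimal sketch (not the actual
Jenkins job) of how the CRAN PACKAGES index could be probed up front, so a malformed
index fails fast with a clear message instead of surfacing as an opaque CRAN
incoming-feasibility failure. probe_cran_index() is a hypothetical helper, not part of
the Spark build.

# Hedged sketch: download the CRAN PACKAGES index and verify it parses
# before running the CRAN incoming-feasibility check.
probe_cran_index <- function(repo = "https://cran.r-project.org") {
  url <- paste0(repo, "/src/contrib/PACKAGES")
  dest <- tempfile("PACKAGES-")
  utils::download.file(url, dest, quiet = TRUE)
  parsed <- tryCatch(
    read.dcf(dest, fields = c("Package", "Version", "Depends")),
    error = function(e) e
  )
  if (inherits(parsed, "error")) {
    stop("CRAN PACKAGES index appears malformed: ", conditionMessage(parsed))
  }
  invisible(nrow(parsed))
}

# Usage: run before R CMD check --as-cran and retry (or skip the repo check)
# if this fails.
# n <- probe_cran_index()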


Re: [VOTE] SPARK 2.4.0 (RC5)

2018-10-31 Thread Felix Cheung
+1
Checked R doc and all R API changes



From: Denny Lee 
Sent: Wednesday, October 31, 2018 9:13 PM
To: Chitral Verma
Cc: Wenchen Fan; dev@spark.apache.org
Subject: Re: [VOTE] SPARK 2.4.0 (RC5)

+1

On Wed, Oct 31, 2018 at 12:54 PM Chitral Verma 
mailto:chitralve...@gmail.com>> wrote:
+1

On Wed, 31 Oct 2018 at 11:56, Reynold Xin 
mailto:r...@databricks.com>> wrote:
+1

Look forward to the release!



On Mon, Oct 29, 2018 at 3:22 AM Wenchen Fan 
mailto:cloud0...@gmail.com>> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.4.0.

The vote is open until November 1 PST and passes if a majority +1 PMC votes are 
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.0-rc5 (commit 
0a4c03f7d084f1d2aa48673b99f3b9496893ce8d):
https://github.com/apache/spark/tree/v2.4.0-rc5

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc5-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1291

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc5-docs/

The list of bug fixes going into 2.4.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12342385

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.4.0?
===

The current list of open tickets targeted at 2.4.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 2.4.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: DataSourceV2 hangouts sync

2018-10-25 Thread Felix Cheung
Yes please!



From: Ryan Blue 
Sent: Thursday, October 25, 2018 1:10 PM
To: Spark Dev List
Subject: DataSourceV2 hangouts sync

Hi everyone,

There's been some great discussion for DataSourceV2 in the last few months, but 
it has been difficult to resolve some of the discussions and I don't think that 
we have a very clear roadmap for getting the work done.

To coordinate better as a community, I'd like to start a regular sync-up over 
google hangouts. We use this in the Parquet community to have more effective 
community discussions about thorny technical issues and to get aligned on an 
overall roadmap. It is really helpful in that community and I think it would 
help us get DSv2 done more quickly.

Here's how it works: people join the hangout, we go around the list to gather 
topics, have about an hour-long discussion, and then send a summary of the 
discussion to the dev list for anyone that couldn't participate. That way we 
can move topics along, but we keep the broader community in the loop as well 
for further discussion on the mailing list.

I'll volunteer to set up the sync and send invites to anyone that wants to 
attend. If you're interested, please reply with the email address you'd like to 
put on the invite list (if there's a way to do this without specific invites, 
let me know). Also for the first sync, please note what times would work for 
you so we can try to account for people in different time zones.

For the first one, I was thinking some day next week (time TBD by those 
interested) and starting off with a general roadmap discussion before diving 
into specific technical topics.

Thanks,

rb

--
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Felix Cheung
I’m in favor of it. If you check the PR, it’s a few isolated script changes and
otherwise all test-only changes. It should have low impact on the release but give
much better integration test coverage.



From: Erik Erlandson 
Sent: Tuesday, October 16, 2018 8:20 AM
To: dev
Subject: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

I'd like to propose including integration testing for Kerberos on the Spark 2.4 
release:
https://github.com/apache/spark/pull/22608

Arguments in favor:
1) it improves testing coverage on a feature important for integrating with 
HDFS deployments
2) its intersection with existing code is small - it consists primarily of new 
testing code, with a bit of refactoring into 'main' and 'test' sub-trees. These 
new tests appear stable.
3) Spark 2.4 is still in RC, with outstanding correctness issues.

The argument 'against' that I'm aware of would be the relatively large size of 
the PR. I believe this is considered above, but am soliciting community 
feedback before committing.
Cheers,
Erik



Re: [DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-07 Thread Felix Cheung
Having jars and libraries accessible only locally at the driver is fairly
limited, isn't it? Don't you want the same on all executors?




From: Yinan Li 
Sent: Friday, October 5, 2018 11:25 AM
To: Stavros Kontopoulos
Cc: rve...@dotnetrdf.org; dev
Subject: Re: [DISCUSS][K8S] Local dependencies with Kubernetes

> Just to be clear: in client mode things work right? (Although I'm not
really familiar with how client mode works in k8s - never tried it.)

If the driver runs on the submission client machine, yes, it should just work. 
If the driver runs in a pod, however, it faces the same problem as in cluster 
mode.

Yinan

On Fri, Oct 5, 2018 at 11:06 AM Stavros Kontopoulos 
mailto:stavros.kontopou...@lightbend.com>> 
wrote:
@Marcelo is correct. Mesos does not have something similar. Only YARN does, due 
to its distributed cache.
I have described most of the above in the JIRA; there are also some other 
options.

Best,
Stavros

On Fri, Oct 5, 2018 at 8:28 PM, Marcelo Vanzin 
mailto:van...@cloudera.com.invalid>> wrote:
On Fri, Oct 5, 2018 at 7:54 AM Rob Vesse 
mailto:rve...@dotnetrdf.org>> wrote:
> Ideally this would all just be handled automatically for users in the way 
> that all other resource managers do

I think you're giving other resource managers too much credit. In
cluster mode, only YARN really distributes local dependencies, because
YARN has that feature (its distributed cache) and Spark just uses it.

Standalone doesn't do it (see SPARK-4160) and I don't remember seeing
anything similar on the Mesos side.

There are things that could be done; e.g. if you have HDFS you could
do a restricted version of what YARN does (upload files to HDFS, and
change the "spark.jars" and "spark.files" URLs to point to HDFS
instead). Or you could turn the submission client into a file server
that the cluster-mode driver downloads files from - although that
requires connectivity from the driver back to the client.

Neither is great, but better than not having that feature.

Just to be clear: in client mode things work right? (Although I'm not
really familiar with how client mode works in k8s - never tried it.)

--
Marcelo

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org





