Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-20 Thread Jungtaek Lim
OK got it. Thanks for clarifying.

I can help check and modify the versions, but I'm not sure about the case where
both versions are specified, like "2.4.0/3.0.0". Would removing 3.0.0 work in
this case?

On Fri, Sep 21, 2018 at 2:29 PM, Wenchen Fan wrote:

> There is an issue in the merge script: when resolving a ticket, the
> default fixed version is 3.0.0. I guess someone forgot to type the fixed
> version, which led to this mistake.
>
> On Fri, Sep 21, 2018 at 1:15 PM Jungtaek Lim  wrote:
>
>> Ah, these issues were resolved before branch-2.4 was cut, like SPARK-24441
>>
>>
>> https://github.com/apache/spark/blob/v2.4.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala
>>
>> SPARK-24441 is included in Spark 2.4.0 RC1 but is set to 3.0.0. I heard
>> there's a step in which issue versions are aligned with the new release when
>> a branch/RC is cut, but that doesn't look like it happened for some issues.
>>
>> On Fri, Sep 21, 2018 at 2:10 PM, Holden Karau wrote:
>>
>>> So normally during the release process, if it's in branch-2.4 but not
>>> part of the current RC, we set the resolved version to 2.4.1, and then if
>>> we roll a new RC we switch the 2.4.1 issues to 2.4.0.
>>>
>>> On Thu, Sep 20, 2018 at 9:55 PM Jungtaek Lim  wrote:
>>>
 I also noticed there are some fixed issues that are included in
 branch-2.4 but whose versions are still 3.0.0. Would we want to update the
 versions to 2.4.0? If we are not planning to run any automation to
 correct them, I'm happy to fix them.

 On Thu, Sep 20, 2018 at 9:22 PM, Weichen Xu wrote:

> We need to merge this.
> https://github.com/apache/spark/pull/22492
> Otherwise MLeap cannot build against Spark 2.4.0
> Thanks!
>
> On Wed, Sep 19, 2018 at 1:16 PM Yinan Li  wrote:
>
>> FYI: SPARK-23200 has been resolved.
>>
>> On Tue, Sep 18, 2018 at 8:49 AM Felix Cheung <
>> felixcheun...@hotmail.com> wrote:
>>
>>> If we could work on this quickly - it might get on to future RCs.
>>>
>>>
>>>
>>> --
>>> *From:* Stavros Kontopoulos 
>>> *Sent:* Monday, September 17, 2018 2:35 PM
>>> *To:* Yinan Li
>>> *Cc:* Xiao Li; eerla...@redhat.com; van...@cloudera.com.invalid;
>>> Sean Owen; Wenchen Fan; dev
>>> *Subject:* Re: [VOTE] SPARK 2.4.0 (RC1)
>>>
>>> Hi Xiao,
>>>
>>> I just tested it; it seems OK. There are some open questions about which
>>> properties we should keep when restoring the config, but otherwise it
>>> looks OK to me.
>>> The reason this should go into 2.4 is that streaming on k8s is
>>> something people want to try on day one (or at least it is cool to try), and
>>> since 2.4 comes with heavily refactored k8s support,
>>> it would be disappointing not to have it in... IMHO.
>>>
>>> Best,
>>> Stavros
>>>
>>> On Mon, Sep 17, 2018 at 11:13 PM, Yinan Li 
>>> wrote:
>>>
 We can merge the PR and get SPARK-23200 resolved if the whole point
 is to make streaming on k8s work first. But given that this is not a
 blocker for 2.4, I think we can take a bit more time here and get it 
 right.
 With that being said, I would expect it to be resolved soon.

 On Mon, Sep 17, 2018 at 11:47 AM Xiao Li 
 wrote:

> Hi, Erik and Stavros,
>
> This bug fix SPARK-23200 is not a blocker of the 2.4 release. It
> sounds important for the Streaming on K8S. Could the K8S oriented
> committers speed up the reviews?
>
> Thanks,
>
> Xiao
>
> Erik Erlandson wrote on Mon, Sep 17, 2018 at 11:04 AM:
>
>>
>> I have no binding vote but I second Stavros’ recommendation for
>> spark-23200
>>
>> Per parallel threads on Py2 support I would also like to propose
>> deprecating Py2 starting with this 2.4 release
>>
>> On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
>>  wrote:
>>
>>> You can log in to https://repository.apache.org and see what's
>>> wrong.
>>> Just find that staging repo and look at the messages. In your
>>> case it
>>> seems related to your signature.
>>>
>>> failureMessageNo public key: Key with id: () was not able to
>>> be
>>> located on http://gpg-keyserver.de/. Upload your public key and
>>> try
>>> the operation again.
>>> On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan <
>>> cloud0...@gmail.com> wrote:
>>> >
>>> > I confirmed that
>>> https://repository.apache.org/content/repositories/orgapachespark-1285
>>> is not accessible. I did it via 
>>> ./dev/create-release/do-release-docker.sh
>>> -d /my/work/dir -s publish , not sure what's going wrong. I didn't 
>>> see any
>>> error 

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-20 Thread Wenchen Fan
There is an issue in the merge script: when resolving a ticket, the default
fixed version is 3.0.0. I guess someone forgot to type the fixed version,
which led to this mistake.

On Fri, Sep 21, 2018 at 1:15 PM Jungtaek Lim  wrote:

> Ah, these issues were resolved before branch-2.4 was cut, like SPARK-24441
>
>
> https://github.com/apache/spark/blob/v2.4.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala
>
> SPARK-24441 is included in Spark 2.4.0 RC1 but is set to 3.0.0. I heard
> there's a step in which issue versions are aligned with the new release when
> a branch/RC is cut, but that doesn't look like it happened for some issues.
>
> On Fri, Sep 21, 2018 at 2:10 PM, Holden Karau wrote:
>
>> So normally during the release process, if it's in branch-2.4 but not part
>> of the current RC, we set the resolved version to 2.4.1, and then if we roll
>> a new RC we switch the 2.4.1 issues to 2.4.0.
>>
>> On Thu, Sep 20, 2018 at 9:55 PM Jungtaek Lim  wrote:
>>
>>> I also noticed there are some fixed issues that are included in
>>> branch-2.4 but whose versions are still 3.0.0. Would we want to update the
>>> versions to 2.4.0? If we are not planning to run any automation to
>>> correct them, I'm happy to fix them.
>>>
>>> On Thu, Sep 20, 2018 at 9:22 PM, Weichen Xu wrote:
>>>
 We need to merge this.
 https://github.com/apache/spark/pull/22492
 Otherwise MLeap cannot build against Spark 2.4.0
 Thanks!

 On Wed, Sep 19, 2018 at 1:16 PM Yinan Li  wrote:

> FYI: SPARK-23200 has been resolved.
>
> On Tue, Sep 18, 2018 at 8:49 AM Felix Cheung <
> felixcheun...@hotmail.com> wrote:
>
>> If we could work on this quickly - it might get on to future RCs.
>>
>>
>>
>> --
>> *From:* Stavros Kontopoulos 
>> *Sent:* Monday, September 17, 2018 2:35 PM
>> *To:* Yinan Li
>> *Cc:* Xiao Li; eerla...@redhat.com; van...@cloudera.com.invalid;
>> Sean Owen; Wenchen Fan; dev
>> *Subject:* Re: [VOTE] SPARK 2.4.0 (RC1)
>>
>> Hi Xiao,
>>
>> I just tested it; it seems OK. There are some open questions about which
>> properties we should keep when restoring the config, but otherwise it
>> looks OK to me.
>> The reason this should go into 2.4 is that streaming on k8s is
>> something people want to try on day one (or at least it is cool to try), and
>> since 2.4 comes with heavily refactored k8s support,
>> it would be disappointing not to have it in... IMHO.
>>
>> Best,
>> Stavros
>>
>> On Mon, Sep 17, 2018 at 11:13 PM, Yinan Li 
>> wrote:
>>
>>> We can merge the PR and get SPARK-23200 resolved if the whole point
>>> is to make streaming on k8s work first. But given that this is not a
>>> blocker for 2.4, I think we can take a bit more time here and get it 
>>> right.
>>> With that being said, I would expect it to be resolved soon.
>>>
>>> On Mon, Sep 17, 2018 at 11:47 AM Xiao Li 
>>> wrote:
>>>
 Hi, Erik and Stavros,

 This bug fix SPARK-23200 is not a blocker of the 2.4 release. It
 sounds important for the Streaming on K8S. Could the K8S oriented
 committers speed up the reviews?

 Thanks,

 Xiao

 Erik Erlandson wrote on Mon, Sep 17, 2018 at 11:04 AM:

>
> I have no binding vote but I second Stavros’ recommendation for
> spark-23200
>
> Per parallel threads on Py2 support I would also like to propose
> deprecating Py2 starting with this 2.4 release
>
> On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
>  wrote:
>
>> You can log in to https://repository.apache.org and see what's
>> wrong.
>> Just find that staging repo and look at the messages. In your
>> case it
>> seems related to your signature.
>>
>> failureMessageNo public key: Key with id: () was not able to
>> be
>> located on http://gpg-keyserver.de/. Upload your public key and
>> try
>> the operation again.
>> On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan 
>> wrote:
>> >
>> > I confirmed that
>> https://repository.apache.org/content/repositories/orgapachespark-1285
>> is not accessible. I did it via 
>> ./dev/create-release/do-release-docker.sh
>> -d /my/work/dir -s publish , not sure what's going wrong. I didn't 
>> see any
>> error message during it.
>> >
>> > Any insights are appreciated! So that I can fix it in the next
>> RC. Thanks!
>> >
>> > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen 
>> wrote:
>> >>
>> >> I think one build is enough, but haven't thought it through.
>> The
>> >> Hadoop 2.6/2.7 builds are already nearly redundant. 

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-20 Thread Jungtaek Lim
Ah, these issues were resolved before branch-2.4 was cut, like SPARK-24441

https://github.com/apache/spark/blob/v2.4.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala

SPARK-24441 is included in Spark 2.4.0 RC1 but is set to 3.0.0. I heard
there's a step in which issue versions are aligned with the new release when
a branch/RC is cut, but that doesn't look like it happened for some issues.

On Fri, Sep 21, 2018 at 2:10 PM, Holden Karau wrote:

> So normally during the release process, if it's in branch-2.4 but not part
> of the current RC, we set the resolved version to 2.4.1, and then if we roll
> a new RC we switch the 2.4.1 issues to 2.4.0.
>
> On Thu, Sep 20, 2018 at 9:55 PM Jungtaek Lim  wrote:
>
>> I also noticed there are some fixed issues that are included in
>> branch-2.4 but whose versions are still 3.0.0. Would we want to update the
>> versions to 2.4.0? If we are not planning to run any automation to
>> correct them, I'm happy to fix them.
>>
>> On Thu, Sep 20, 2018 at 9:22 PM, Weichen Xu wrote:
>>
>>> We need to merge this.
>>> https://github.com/apache/spark/pull/22492
>>> Otherwise MLeap cannot build against Spark 2.4.0
>>> Thanks!
>>>
>>> On Wed, Sep 19, 2018 at 1:16 PM Yinan Li  wrote:
>>>
 FYI: SPARK-23200 has been resolved.

 On Tue, Sep 18, 2018 at 8:49 AM Felix Cheung 
 wrote:

> If we could work on this quickly - it might get on to future RCs.
>
>
>
> --
> *From:* Stavros Kontopoulos 
> *Sent:* Monday, September 17, 2018 2:35 PM
> *To:* Yinan Li
> *Cc:* Xiao Li; eerla...@redhat.com; van...@cloudera.com.invalid; Sean
> Owen; Wenchen Fan; dev
> *Subject:* Re: [VOTE] SPARK 2.4.0 (RC1)
>
> Hi Xiao,
>
> I just tested it; it seems OK. There are some open questions about which
> properties we should keep when restoring the config, but otherwise it looks
> OK to me.
> The reason this should go into 2.4 is that streaming on k8s is something
> people want to try on day one (or at least it is cool to try), and since 2.4
> comes with heavily refactored k8s support,
> it would be disappointing not to have it in... IMHO.
>
> Best,
> Stavros
>
> On Mon, Sep 17, 2018 at 11:13 PM, Yinan Li 
> wrote:
>
>> We can merge the PR and get SPARK-23200 resolved if the whole point
>> is to make streaming on k8s work first. But given that this is not a
>> blocker for 2.4, I think we can take a bit more time here and get it 
>> right.
>> With that being said, I would expect it to be resolved soon.
>>
>> On Mon, Sep 17, 2018 at 11:47 AM Xiao Li 
>> wrote:
>>
>>> Hi, Erik and Stavros,
>>>
>>> This bug fix SPARK-23200 is not a blocker of the 2.4 release. It
>>> sounds important for the Streaming on K8S. Could the K8S oriented
>>> committers speed up the reviews?
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>> Erik Erlandson wrote on Mon, Sep 17, 2018 at 11:04 AM:
>>>

 I have no binding vote but I second Stavros’ recommendation for
 spark-23200

 Per parallel threads on Py2 support I would also like to propose
 deprecating Py2 starting with this 2.4 release

 On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
  wrote:

> You can log in to https://repository.apache.org and see what's
> wrong.
> Just find that staging repo and look at the messages. In your case
> it
> seems related to your signature.
>
> failureMessageNo public key: Key with id: () was not able to be
> located on http://gpg-keyserver.de/. Upload your public key and
> try
> the operation again.
> On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan 
> wrote:
> >
> > I confirmed that
> https://repository.apache.org/content/repositories/orgapachespark-1285
> is not accessible. I did it via 
> ./dev/create-release/do-release-docker.sh
> -d /my/work/dir -s publish , not sure what's going wrong. I didn't 
> see any
> error message during it.
> >
> > Any insights are appreciated! So that I can fix it in the next
> RC. Thanks!
> >
> > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen 
> wrote:
> >>
> >> I think one build is enough, but haven't thought it through. The
> >> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is
> probably
> >> best advertised as a 'beta'. So maybe publish a no-hadoop build
> of it?
> >> Really, whatever's the easy thing to do.
> >> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan <
> cloud0...@gmail.com> wrote:
> >> >
> >> > Ah I missed the Scala 2.12 build. Do you mean we should
> publish a Scala 2.12 build this time? Currently for 

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-20 Thread Holden Karau
So normally during the release process, if it's in branch-2.4 but not part
of the current RC, we set the resolved version to 2.4.1, and then if we roll
a new RC we switch the 2.4.1 issues to 2.4.0.

On Thu, Sep 20, 2018 at 9:55 PM Jungtaek Lim  wrote:

> I also noticed there are some fixed issues that are included in branch-2.4
> but whose versions are still 3.0.0. Would we want to update the versions to
> 2.4.0? If we are not planning to run any automation to correct them, I'm
> happy to fix them.
>
> On Thu, Sep 20, 2018 at 9:22 PM, Weichen Xu wrote:
>
>> We need to merge this.
>> https://github.com/apache/spark/pull/22492
>> Otherwise MLeap cannot build against Spark 2.4.0
>> Thanks!
>>
>> On Wed, Sep 19, 2018 at 1:16 PM Yinan Li  wrote:
>>
>>> FYI: SPARK-23200 has been resolved.
>>>
>>> On Tue, Sep 18, 2018 at 8:49 AM Felix Cheung 
>>> wrote:
>>>
 If we could work on this quickly - it might get on to future RCs.



 --
 *From:* Stavros Kontopoulos 
 *Sent:* Monday, September 17, 2018 2:35 PM
 *To:* Yinan Li
 *Cc:* Xiao Li; eerla...@redhat.com; van...@cloudera.com.invalid; Sean
 Owen; Wenchen Fan; dev
 *Subject:* Re: [VOTE] SPARK 2.4.0 (RC1)

 Hi Xiao,

 I just tested it; it seems OK. There are some open questions about which
 properties we should keep when restoring the config, but otherwise it looks
 OK to me.
 The reason this should go into 2.4 is that streaming on k8s is something
 people want to try on day one (or at least it is cool to try), and since 2.4
 comes with heavily refactored k8s support,
 it would be disappointing not to have it in... IMHO.

 Best,
 Stavros

 On Mon, Sep 17, 2018 at 11:13 PM, Yinan Li 
 wrote:

> We can merge the PR and get SPARK-23200 resolved if the whole point is
> to make streaming on k8s work first. But given that this is not a blocker
> for 2.4, I think we can take a bit more time here and get it right. With
> that being said, I would expect it to be resolved soon.
>
> On Mon, Sep 17, 2018 at 11:47 AM Xiao Li  wrote:
>
>> Hi, Erik and Stavros,
>>
>> This bug fix SPARK-23200 is not a blocker of the 2.4 release. It
>> sounds important for the Streaming on K8S. Could the K8S oriented
>> committers speed up the reviews?
>>
>> Thanks,
>>
>> Xiao
>>
>> Erik Erlandson wrote on Mon, Sep 17, 2018 at 11:04 AM:
>>
>>>
>>> I have no binding vote but I second Stavros’ recommendation for
>>> spark-23200
>>>
>>> Per parallel threads on Py2 support I would also like to propose
>>> deprecating Py2 starting with this 2.4 release
>>>
>>> On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
>>>  wrote:
>>>
 You can log in to https://repository.apache.org and see what's
 wrong.
 Just find that staging repo and look at the messages. In your case
 it
 seems related to your signature.

 failureMessageNo public key: Key with id: () was not able to be
 located on http://gpg-keyserver.de/. Upload your public key and try
 the operation again.
 On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan 
 wrote:
 >
 > I confirmed that
 https://repository.apache.org/content/repositories/orgapachespark-1285
 is not accessible. I did it via 
 ./dev/create-release/do-release-docker.sh
 -d /my/work/dir -s publish , not sure what's going wrong. I didn't see 
 any
 error message during it.
 >
 > Any insights are appreciated! So that I can fix it in the next
 RC. Thanks!
 >
 > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen 
 wrote:
 >>
 >> I think one build is enough, but haven't thought it through. The
 >> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is
 probably
 >> best advertised as a 'beta'. So maybe publish a no-hadoop build
 of it?
 >> Really, whatever's the easy thing to do.
 >> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan <
 cloud0...@gmail.com> wrote:
 >> >
 >> > Ah I missed the Scala 2.12 build. Do you mean we should
 publish a Scala 2.12 build this time? Currently for Scala 2.11 we have 3
 builds: with hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do 
 the
 same thing for Scala 2.12?
 >> >
 >> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen 
 wrote:
 >> >>
 >> >> A few preliminary notes:
 >> >>
 >> >> Wenchen for some weird reason when I hit your key in gpg
 --import, it
 >> >> asks for a passphrase. When I skip it, it's fine, gpg can
 still verify
 >> >> the signature. No issue there really.
 >> >>
 >> >> The staging repo gives a 404:
 >> >>
 

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-20 Thread Jungtaek Lim
I also noticed there are some fixed issues that are included in branch-2.4
but whose versions are still 3.0.0. Would we want to update the versions to
2.4.0? If we are not planning to run any automation to correct them, I'm
happy to fix them.
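
For illustration, here is a hedged sketch of what such automation could look
like using the standard JIRA REST API. The JQL filter, credentials, and
version names below are placeholder assumptions, not the actual Spark
merge-script tooling:

import requests

JIRA = "https://issues.apache.org/jira/rest/api/2"
AUTH = ("user", "password")  # placeholder credentials

# Find resolved SPARK issues whose fix version was left at the 3.0.0 default.
# (In practice you would also cross-check each issue against the git log of
# branch-2.4 before changing anything.)
jql = "project = SPARK AND fixVersion = 3.0.0 AND resolution = Fixed"
issues = requests.get(JIRA + "/search",
                      params={"jql": jql, "fields": "fixVersions"},
                      auth=AUTH).json()["issues"]

for issue in issues:
    # Swap 3.0.0 for 2.4.0, keeping any other fix versions intact.
    versions = [{"name": "2.4.0"} if v["name"] == "3.0.0" else {"name": v["name"]}
                for v in issue["fields"]["fixVersions"]]
    requests.put(JIRA + "/issue/" + issue["key"],
                 json={"fields": {"fixVersions": versions}}, auth=AUTH)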

On Thu, Sep 20, 2018 at 9:22 PM, Weichen Xu wrote:

> We need to merge this.
> https://github.com/apache/spark/pull/22492
> Otherwise MLeap cannot build against Spark 2.4.0
> Thanks!
>
> On Wed, Sep 19, 2018 at 1:16 PM Yinan Li  wrote:
>
>> FYI: SPARK-23200 has been resolved.
>>
>> On Tue, Sep 18, 2018 at 8:49 AM Felix Cheung 
>> wrote:
>>
>>> If we could work on this quickly - it might get on to future RCs.
>>>
>>>
>>>
>>> --
>>> *From:* Stavros Kontopoulos 
>>> *Sent:* Monday, September 17, 2018 2:35 PM
>>> *To:* Yinan Li
>>> *Cc:* Xiao Li; eerla...@redhat.com; van...@cloudera.com.invalid; Sean
>>> Owen; Wenchen Fan; dev
>>> *Subject:* Re: [VOTE] SPARK 2.4.0 (RC1)
>>>
>>> Hi Xiao,
>>>
>>> I just tested it; it seems OK. There are some open questions about which
>>> properties we should keep when restoring the config, but otherwise it looks
>>> OK to me.
>>> The reason this should go into 2.4 is that streaming on k8s is something
>>> people want to try on day one (or at least it is cool to try), and since 2.4
>>> comes with heavily refactored k8s support,
>>> it would be disappointing not to have it in... IMHO.
>>>
>>> Best,
>>> Stavros
>>>
>>> On Mon, Sep 17, 2018 at 11:13 PM, Yinan Li  wrote:
>>>
 We can merge the PR and get SPARK-23200 resolved if the whole point is
 to make streaming on k8s work first. But given that this is not a blocker
 for 2.4, I think we can take a bit more time here and get it right. With
 that being said, I would expect it to be resolved soon.

 On Mon, Sep 17, 2018 at 11:47 AM Xiao Li  wrote:

> Hi, Erik and Stavros,
>
> This bug fix SPARK-23200 is not a blocker of the 2.4 release. It
> sounds important for the Streaming on K8S. Could the K8S oriented
> committers speed up the reviews?
>
> Thanks,
>
> Xiao
>
> Erik Erlandson wrote on Mon, Sep 17, 2018 at 11:04 AM:
>
>>
>> I have no binding vote but I second Stavros’ recommendation for
>> spark-23200
>>
>> Per parallel threads on Py2 support I would also like to propose
>> deprecating Py2 starting with this 2.4 release
>>
>> On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
>>  wrote:
>>
>>> You can log in to https://repository.apache.org and see what's
>>> wrong.
>>> Just find that staging repo and look at the messages. In your case it
>>> seems related to your signature.
>>>
>>> failureMessageNo public key: Key with id: () was not able to be
>>> located on http://gpg-keyserver.de/. Upload your public key and try
>>> the operation again.
>>> On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan 
>>> wrote:
>>> >
>>> > I confirmed that
>>> https://repository.apache.org/content/repositories/orgapachespark-1285
>>> is not accessible. I did it via 
>>> ./dev/create-release/do-release-docker.sh
>>> -d /my/work/dir -s publish , not sure what's going wrong. I didn't see 
>>> any
>>> error message during it.
>>> >
>>> > Any insights are appreciated! So that I can fix it in the next RC.
>>> Thanks!
>>> >
>>> > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen 
>>> wrote:
>>> >>
>>> >> I think one build is enough, but haven't thought it through. The
>>> >> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is
>>> probably
>>> >> best advertised as a 'beta'. So maybe publish a no-hadoop build
>>> of it?
>>> >> Really, whatever's the easy thing to do.
>>> >> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
>>> wrote:
>>> >> >
>>> >> > Ah I missed the Scala 2.12 build. Do you mean we should publish
>>> a Scala 2.12 build this time? Currently for Scala 2.11 we have 3 builds: 
>>> with
>>> hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing 
>>> for
>>> Scala 2.12?
>>> >> >
>>> >> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen 
>>> wrote:
>>> >> >>
>>> >> >> A few preliminary notes:
>>> >> >>
>>> >> >> Wenchen for some weird reason when I hit your key in gpg
>>> --import, it
>>> >> >> asks for a passphrase. When I skip it, it's fine, gpg can
>>> still verify
>>> >> >> the signature. No issue there really.
>>> >> >>
>>> >> >> The staging repo gives a 404:
>>> >> >>
>>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>>> >> >> 404 - Repository "orgapachespark-1285 (staging: open)"
>>> >> >> [id=orgapachespark-1285] exists but is not exposed.
>>> >> >>
>>> >> >> The (revamped) licenses are OK, though there are some minor
>>> glitches
>>> >> >> in the final release tarballs (my fault) : there's an extra
>>> directory,
>>> >> >> 

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-20 Thread Ryan Blue
Changing my vote to +1 with this fixed.

Here's what was going on -- and thanks to Owen O'Malley for debugging:

The problem was that Iceberg contained a fix for a JVM bug for timestamps
before the Unix epoch, where the timestamp was off by 1s. Owen moved this
code into ORC as well, and using the new version of Spark pulled in the
newer version of ORC. That meant that the values were "fixed" twice and
were wrong.

Updating the Iceberg code to rely on the fix in the version of ORC that
Spark includes fixes the problem.
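
To illustrate the general class of bug involved (the specifics here are an
assumption - this is the classic pre-epoch rounding pitfall, sketched in
Python rather than the actual ORC/Iceberg code):

# Converting milliseconds to seconds by truncating toward zero is off by
# one second for timestamps before the epoch; flooring is correct.
millis = -1001                   # 1.001 seconds before the Unix epoch
trunc_secs = int(millis / 1000)  # -1: truncation toward zero (the buggy path)
floor_secs = millis // 1000      # -2: flooring (the correct path)
print(trunc_secs, floor_secs)

# A one-second correction for pre-epoch values fixes the truncating path
# once; if two layers (e.g. Iceberg and ORC) each apply it, the result
# ends up off by one second in the other direction.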

On Thu, Sep 20, 2018 at 2:38 PM Dongjoon Hyun 
wrote:

> Hi, Ryan.
>
> Could you share the result on 2.3.1, since this is a 2.3.2 RC? That would be
> helpful to narrow down the scope.
>
> Bests,
> Dongjoon.
>
> On Thu, Sep 20, 2018 at 11:56 Ryan Blue  wrote:
>
>> -0
>>
>> My DataSourceV2 implementation for Iceberg is failing ORC tests when I
>> run with the 2.3.2 RC that pass when I run with 2.3.0. I'm tracking down
>> the cause and will report back, but I'm -0 on the release because there may
>> be a behavior change.
>>
>> On Thu, Sep 20, 2018 at 10:37 AM Denny Lee  wrote:
>>
>>> +1
>>>
>>> On Thu, Sep 20, 2018 at 9:55 AM Xiao Li  wrote:
>>>
 +1


> John Zhuge wrote on Wed, Sep 19, 2018 at 1:17 PM:

> +1 (non-binding)
>
> Built on Ubuntu 16.04 with Maven flags: -Phadoop-2.7 -Pmesos -Pyarn
> -Phive-thriftserver -Psparkr -Pkinesis-asl -Phadoop-provided
>
> java version "1.8.0_181"
> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>
>
> On Wed, Sep 19, 2018 at 2:31 AM Takeshi Yamamuro <
> linguin@gmail.com> wrote:
>
>> +1
>>
>> I also checked `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
>> -Phive-thriftserver` on the OpenJDK below / macOS 10.12.6
>>
>> $ java -version
>> java version "1.8.0_181"
>> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>>
>>
>> On Wed, Sep 19, 2018 at 10:45 AM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>>
>>> +1.
>>>
>>> I tested with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
>>> -Phive-thriftserver` on OpenJDK(1.8.0_181)/CentOS 7.5.
>>>
>>> I hit the following test case failure once during testing, but it's
>>> not persistent.
>>>
>>> KafkaContinuousSourceSuite
>>> ...
>>> subscribing topic by name from earliest offsets (failOnDataLoss:
>>> false) *** FAILED ***
>>>
>>> Thank you, Saisai.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Mon, Sep 17, 2018 at 6:48 PM Saisai Shao 
>>> wrote:
>>>
 +1 from my own side.

 Thanks
 Saisai

 Wenchen Fan wrote on Tue, Sep 18, 2018 at 9:34 AM:

> +1. All the blocker issues are resolved in 2.3.2 AFAIK.
>
> On Tue, Sep 18, 2018 at 9:23 AM Sean Owen 
> wrote:
>
>> +1 . Licenses and sigs check out as in previous 2.3.x releases. A
>> build from source with most profiles passed for me.
>> On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao <
>> sai.sai.s...@gmail.com> wrote:
>> >
>> > Please vote on releasing the following candidate as Apache
>> Spark version 2.3.2.
>> >
>> > The vote is open until September 21 PST and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.3.2
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> >
>> > The tag to be voted on is v2.3.2-rc6 (commit
>> 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
>> > https://github.com/apache/spark/tree/v2.3.2-rc6
>> >
>> > The release files, including signatures, digests, etc. can be
>> found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> >
>> https://repository.apache.org/content/repositories/orgapachespark-1286/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/
>> >
>> > The list of bug fixes going into 2.3.2 can be found at the
>> following URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12343289
>> >
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > 

2.4.0 Blockers, Critical, etc

2018-09-20 Thread Sean Owen
Because we're into 2.4 release candidates, I thought I'd look at
what's still open and targeted at 2.4.0. I presume the Blockers are
the usual umbrellas that don't themselves block anything, but, just
to confirm: there is nothing left to do there?

I think that's mostly a question for Joseph and Weichen.

As ever, anyone who knows these items are a) done or b) not going to
be in 2.4, go ahead and update them.


Blocker:

SPARK-25321 ML, Graph 2.4 QA: API: New Scala APIs, docs
SPARK-25324 ML 2.4 QA: API: Java compatibility, docs
SPARK-25323 ML 2.4 QA: API: Python API coverage
SPARK-25320 ML, Graph 2.4 QA: API: Binary incompatible changes

Critical:

SPARK-25319 Spark MLlib, GraphX 2.4 QA umbrella
SPARK-25378 ArrayData.toArray(StringType) assume UTF8String in 2.4
SPARK-25327 Update MLlib, GraphX websites for 2.4
SPARK-25325 ML, Graph 2.4 QA: Update user guide for new features & APIs
SPARK-25326 ML, Graph 2.4 QA: Programming guide update and migration guide

Other:

SPARK-25346 Document Spark builtin data sources
SPARK-25347 Document image data source in doc site
SPARK-12978 Skip unnecessary final group-by when input data already
clustered with group-by keys
SPARK-20184 performance regression for complex/long sql when enable
whole stage codegen
SPARK-16196 Optimize in-memory scan performance using ColumnarBatches
SPARK-15693 Write schema definition out for file-based data sources to
avoid schema inference
SPARK-23597 Audit Spark SQL code base for non-interpreted expressions
SPARK-25179 Document the features that require Pyarrow 0.10
SPARK-25110 make sure Flume streaming connector works with Spark 2.4
SPARK-21318 The exception message thrown by `lookupFunction` is ambiguous.
SPARK-24464 Unit tests for MLlib's Instrumentation
SPARK-23197 Flaky test: spark.streaming.ReceiverSuite."receiver_life_cycle"
SPARK-22809 pyspark is sensitive to imports with dots
SPARK-22739 Additional Expression Support for Objects
SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
list of structures
SPARK-21030 extend hint syntax to support any expression for Python and R
SPARK-22386 Data Source V2 improvements
SPARK-15117 Generate code that get a value in each compressed column
from CachedBatch when DataFrame.cache() is called




Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-20 Thread Dongjoon Hyun
Hi, Ryan.

Could you share the result on 2.3.1, since this is a 2.3.2 RC? That would be
helpful to narrow down the scope.

Bests,
Dongjoon.

On Thu, Sep 20, 2018 at 11:56 Ryan Blue  wrote:

> -0
>
> My DataSourceV2 implementation for Iceberg is failing ORC tests when I run
> with the 2.3.2 RC that pass when I run with 2.3.0. I'm tracking down the
> cause and will report back, but I'm -0 on the release because there may be
> a behavior change.
>
> On Thu, Sep 20, 2018 at 10:37 AM Denny Lee  wrote:
>
>> +1
>>
>> On Thu, Sep 20, 2018 at 9:55 AM Xiao Li  wrote:
>>
>>> +1
>>>
>>>
>>> John Zhuge wrote on Wed, Sep 19, 2018 at 1:17 PM:
>>>
 +1 (non-binding)

 Built on Ubuntu 16.04 with Maven flags: -Phadoop-2.7 -Pmesos -Pyarn
 -Phive-thriftserver -Psparkr -Pkinesis-asl -Phadoop-provided

 java version "1.8.0_181"
 Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)


 On Wed, Sep 19, 2018 at 2:31 AM Takeshi Yamamuro 
 wrote:

> +1
>
> I also checked `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
> -Phive-thriftserver` on the OpenJDK below / macOS 10.12.6
>
> $ java -version
> java version "1.8.0_181"
> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>
>
> On Wed, Sep 19, 2018 at 10:45 AM Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>> +1.
>>
>> I tested with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
>> -Phive-thriftserver` on OpenJDK(1.8.0_181)/CentOS 7.5.
>>
>> I hit the following test case failure once during testing, but it's
>> not persistent.
>>
>> KafkaContinuousSourceSuite
>> ...
>> subscribing topic by name from earliest offsets (failOnDataLoss:
>> false) *** FAILED ***
>>
>> Thank you, Saisai.
>>
>> Bests,
>> Dongjoon.
>>
>> On Mon, Sep 17, 2018 at 6:48 PM Saisai Shao 
>> wrote:
>>
>>> +1 from my own side.
>>>
>>> Thanks
>>> Saisai
>>>
>>> Wenchen Fan wrote on Tue, Sep 18, 2018 at 9:34 AM:
>>>
 +1. All the blocker issues are resolved in 2.3.2 AFAIK.

 On Tue, Sep 18, 2018 at 9:23 AM Sean Owen 
 wrote:

> +1 . Licenses and sigs check out as in previous 2.3.x releases. A
> build from source with most profiles passed for me.
> On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao <
> sai.sai.s...@gmail.com> wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark
> version 2.3.2.
> >
> > The vote is open until September 21 PST and passes if a majority
> +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.3.2
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> http://spark.apache.org/
> >
> > The tag to be voted on is v2.3.2-rc6 (commit
> 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
> > https://github.com/apache/spark/tree/v2.3.2-rc6
> >
> > The release files, including signatures, digests, etc. can be
> found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> >
> https://repository.apache.org/content/repositories/orgapachespark-1286/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/
> >
> > The list of bug fixes going into 2.3.2 can be found at the
> following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12343289
> >
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by
> taking
> > an existing Spark workload and running on this release
> candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and
> install
> > the current RC and see if anything important breaks, in the
> Java/Scala
> > you can add the staging repository to your projects resolvers
> and test
> > with the RC (make sure to clean up the artifact cache
> before/after so
> > you don't end up building with an out-of-date RC going forward).
> >
> > 

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-20 Thread Ryan Blue
-0

My DataSourceV2 implementation for Iceberg is failing ORC tests when I run
with the 2.3.2 RC that pass when I run with 2.3.0. I'm tracking down the
cause and will report back, but I'm -0 on the release because there may be
a behavior change.

On Thu, Sep 20, 2018 at 10:37 AM Denny Lee  wrote:

> +1
>
> On Thu, Sep 20, 2018 at 9:55 AM Xiao Li  wrote:
>
>> +1
>>
>>
>> John Zhuge wrote on Wed, Sep 19, 2018 at 1:17 PM:
>>
>>> +1 (non-binding)
>>>
>>> Built on Ubuntu 16.04 with Maven flags: -Phadoop-2.7 -Pmesos -Pyarn
>>> -Phive-thriftserver -Psparkr -Pkinesis-asl -Phadoop-provided
>>>
>>> java version "1.8.0_181"
>>> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>>> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>>>
>>>
>>> On Wed, Sep 19, 2018 at 2:31 AM Takeshi Yamamuro 
>>> wrote:
>>>
 +1

 I also checked `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
 -Phive-thriftserver` on the OpenJDK below / macOS 10.12.6

 $ java -version
 java version "1.8.0_181"
 Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)


 On Wed, Sep 19, 2018 at 10:45 AM Dongjoon Hyun 
 wrote:

> +1.
>
> I tested with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
> -Phive-thriftserver` on OpenJDK(1.8.0_181)/CentOS 7.5.
>
> I hit the following test case failure once during testing, but it's
> not persistent.
>
> KafkaContinuousSourceSuite
> ...
> subscribing topic by name from earliest offsets (failOnDataLoss:
> false) *** FAILED ***
>
> Thank you, Saisai.
>
> Bests,
> Dongjoon.
>
> On Mon, Sep 17, 2018 at 6:48 PM Saisai Shao 
> wrote:
>
>> +1 from my own side.
>>
>> Thanks
>> Saisai
>>
>> Wenchen Fan wrote on Tue, Sep 18, 2018 at 9:34 AM:
>>
>>> +1. All the blocker issues are resolved in 2.3.2 AFAIK.
>>>
>>> On Tue, Sep 18, 2018 at 9:23 AM Sean Owen  wrote:
>>>
 +1 . Licenses and sigs check out as in previous 2.3.x releases. A
 build from source with most profiles passed for me.
 On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao 
 wrote:
 >
 > Please vote on releasing the following candidate as Apache Spark
 version 2.3.2.
 >
 > The vote is open until September 21 PST and passes if a majority
 +1 PMC votes are cast, with a minimum of 3 +1 votes.
 >
 > [ ] +1 Release this package as Apache Spark 2.3.2
 > [ ] -1 Do not release this package because ...
 >
 > To learn more about Apache Spark, please see
 http://spark.apache.org/
 >
 > The tag to be voted on is v2.3.2-rc6 (commit
 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
 > https://github.com/apache/spark/tree/v2.3.2-rc6
 >
 > The release files, including signatures, digests, etc. can be
 found at:
 > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
 >
 > Signatures used for Spark RCs can be found in this file:
 > https://dist.apache.org/repos/dist/dev/spark/KEYS
 >
 > The staging repository for this release can be found at:
 >
 https://repository.apache.org/content/repositories/orgapachespark-1286/
 >
 > The documentation corresponding to this release can be found at:
 > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/
 >
 > The list of bug fixes going into 2.3.2 can be found at the
 following URL:
 > https://issues.apache.org/jira/projects/SPARK/versions/12343289
 >
 >
 > FAQ
 >
 > =
 > How can I help test this release?
 > =
 >
 > If you are a Spark user, you can help us test this release by
 taking
 > an existing Spark workload and running on this release candidate,
 then
 > reporting any regressions.
 >
 > If you're working in PySpark you can set up a virtual env and
 install
 > the current RC and see if anything important breaks, in the
 Java/Scala
 > you can add the staging repository to your projects resolvers and
 test
 > with the RC (make sure to clean up the artifact cache
 before/after so
 > you don't end up building with an out-of-date RC going forward).
 >
 > ===
 > What should happen to JIRA tickets still targeting 2.3.2?
 > ===
 >
 > The current list of open tickets targeted at 2.3.2 can be found
 at:
 > https://issues.apache.org/jira/projects/SPARK and search for
 "Target Version/s" = 2.3.2
 >
 > Committers should 

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-20 Thread Denny Lee
+1

On Thu, Sep 20, 2018 at 9:55 AM Xiao Li  wrote:

> +1
>
>
> John Zhuge wrote on Wed, Sep 19, 2018 at 1:17 PM:
>
>> +1 (non-binding)
>>
>> Built on Ubuntu 16.04 with Maven flags: -Phadoop-2.7 -Pmesos -Pyarn
>> -Phive-thriftserver -Psparkr -Pkinesis-asl -Phadoop-provided
>>
>> java version "1.8.0_181"
>> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>>
>>
>> On Wed, Sep 19, 2018 at 2:31 AM Takeshi Yamamuro 
>> wrote:
>>
>>> +1
>>>
>>> I also checked `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
>>> -Phive-thriftserver` on the OpenJDK below / macOS 10.12.6
>>>
>>> $ java -version
>>> java version "1.8.0_181"
>>> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>>> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>>>
>>>
>>> On Wed, Sep 19, 2018 at 10:45 AM Dongjoon Hyun 
>>> wrote:
>>>
 +1.

 I tested with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
 -Phive-thriftserver` on OpenJDK(1.8.0_181)/CentOS 7.5.

 I hit the following test case failure once during testing, but it's not
 persistent.

 KafkaContinuousSourceSuite
 ...
 subscribing topic by name from earliest offsets (failOnDataLoss:
 false) *** FAILED ***

 Thank you, Saisai.

 Bests,
 Dongjoon.

 On Mon, Sep 17, 2018 at 6:48 PM Saisai Shao 
 wrote:

> +1 from my own side.
>
> Thanks
> Saisai
>
> Wenchen Fan wrote on Tue, Sep 18, 2018 at 9:34 AM:
>
>> +1. All the blocker issues are resolved in 2.3.2 AFAIK.
>>
>> On Tue, Sep 18, 2018 at 9:23 AM Sean Owen  wrote:
>>
>>> +1 . Licenses and sigs check out as in previous 2.3.x releases. A
>>> build from source with most profiles passed for me.
>>> On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao 
>>> wrote:
>>> >
>>> > Please vote on releasing the following candidate as Apache Spark
>>> version 2.3.2.
>>> >
>>> > The vote is open until September 21 PST and passes if a majority
>>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 2.3.2
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>> >
>>> > The tag to be voted on is v2.3.2-rc6 (commit
>>> 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
>>> > https://github.com/apache/spark/tree/v2.3.2-rc6
>>> >
>>> > The release files, including signatures, digests, etc. can be
>>> found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
>>> >
>>> > Signatures used for Spark RCs can be found in this file:
>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >
>>> > The staging repository for this release can be found at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1286/
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/
>>> >
>>> > The list of bug fixes going into 2.3.2 can be found at the
>>> following URL:
>>> > https://issues.apache.org/jira/projects/SPARK/versions/12343289
>>> >
>>> >
>>> > FAQ
>>> >
>>> > =
>>> > How can I help test this release?
>>> > =
>>> >
>>> > If you are a Spark user, you can help us test this release by
>>> taking
>>> > an existing Spark workload and running on this release candidate,
>>> then
>>> > reporting any regressions.
>>> >
>>> > If you're working in PySpark you can set up a virtual env and
>>> install
>>> > the current RC and see if anything important breaks, in the
>>> Java/Scala
>>> > you can add the staging repository to your projects resolvers and
>>> test
>>> > with the RC (make sure to clean up the artifact cache before/after
>>> so
>>> > you don't end up building with an out-of-date RC going forward).
>>> >
>>> > ===
>>> > What should happen to JIRA tickets still targeting 2.3.2?
>>> > ===
>>> >
>>> > The current list of open tickets targeted at 2.3.2 can be found at:
>>> > https://issues.apache.org/jira/projects/SPARK and search for
>>> "Target Version/s" = 2.3.2
>>> >
>>> > Committers should look at those and triage. Extremely important bug
>>> > fixes, documentation, and API tweaks that impact compatibility
>>> should
>>> > be worked on immediately. Everything else please retarget to an
>>> > appropriate release.
>>> >
>>> > ==
>>> > But my bug isn't fixed?
>>> > ==
>>> >
>>> > In order to make timely releases, we will typically not hold the
>>> > release 

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-20 Thread Xiao Li
+1


John Zhuge wrote on Wed, Sep 19, 2018 at 1:17 PM:

> +1 (non-binding)
>
> Built on Ubuntu 16.04 with Maven flags: -Phadoop-2.7 -Pmesos -Pyarn
> -Phive-thriftserver -Psparkr -Pkinesis-asl -Phadoop-provided
>
> java version "1.8.0_181"
> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>
>
> On Wed, Sep 19, 2018 at 2:31 AM Takeshi Yamamuro 
> wrote:
>
>> +1
>>
>> I also checked `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
>> -Phive-thriftserver` on the OpenJDK below / macOS 10.12.6
>>
>> $ java -version
>> java version "1.8.0_181"
>> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>>
>>
>> On Wed, Sep 19, 2018 at 10:45 AM Dongjoon Hyun 
>> wrote:
>>
>>> +1.
>>>
>>> I tested with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
>>> -Phive-thriftserver` on OpenJDK(1.8.0_181)/CentOS 7.5.
>>>
>>> I hit the following test case failure once during testing, but it's not
>>> persistent.
>>>
>>> KafkaContinuousSourceSuite
>>> ...
>>> subscribing topic by name from earliest offsets (failOnDataLoss:
>>> false) *** FAILED ***
>>>
>>> Thank you, Saisai.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Mon, Sep 17, 2018 at 6:48 PM Saisai Shao 
>>> wrote:
>>>
 +1 from my own side.

 Thanks
 Saisai

 Wenchen Fan wrote on Tue, Sep 18, 2018 at 9:34 AM:

> +1. All the blocker issues are resolved in 2.3.2 AFAIK.
>
> On Tue, Sep 18, 2018 at 9:23 AM Sean Owen  wrote:
>
>> +1 . Licenses and sigs check out as in previous 2.3.x releases. A
>> build from source with most profiles passed for me.
>> On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao 
>> wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 2.3.2.
>> >
>> > The vote is open until September 21 PST and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.3.2
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> >
>> > The tag to be voted on is v2.3.2-rc6 (commit
>> 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
>> > https://github.com/apache/spark/tree/v2.3.2-rc6
>> >
>> > The release files, including signatures, digests, etc. can be found
>> at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> >
>> https://repository.apache.org/content/repositories/orgapachespark-1286/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/
>> >
>> > The list of bug fixes going into 2.3.2 can be found at the
>> following URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12343289
>> >
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate,
>> then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and
>> install
>> > the current RC and see if anything important breaks, in the
>> Java/Scala
>> > you can add the staging repository to your projects resolvers and
>> test
>> > with the RC (make sure to clean up the artifact cache before/after
>> so
>> > you don't end up building with an out-of-date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 2.3.2?
>> > ===
>> >
>> > The current list of open tickets targeted at 2.3.2 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for
>> "Target Version/s" = 2.3.2
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility
>> should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a
>> regression
>> > that has not been correctly targeted please ping me 

unsubscribe

2018-09-20 Thread Praveen Srivastava
unsubscribe




Praveen Srivastava
praveen.s.srivast...@oracle.com


unsubscribe

2018-09-20 Thread Ryan Adams
unsubscribe

Ryan Adams
radams...@gmail.com


Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-20 Thread Thakrar, Jayesh
Here’s what can be done in PostgreSQL

You can create a partitioned table with a partition clause, e.g.
CREATE TABLE measurement (.) PARTITION BY RANGE (logdate)

You can create a partitioned table by creating tables as partitions of a 
partitioned table, e.g.
CREATE TABLE measurement_y2006m02 PARTITION OF measurement FOR VALUES FROM 
('2006-02-01') TO ('2006-03-01')

Each “partition” is like a table and can be managed just like a table.

And ofcourse you can have nested partitioning.

As for partition management, you can attach/detach partitions by converting a 
regular table into a table partition and a table partition into a regular table 
using the ALTER TABLE statement

ALTER TABLE measurement ATTACH/DETACH PARTITION

There are similar options in Oracle.
In Oracle, converting a table into a partition and vice-versa is referred to as 
 “partition exchange”.
However unlike Postgres, table partitions are not treated as regular tables.


As for the relevance of partition management in the Spark API, here are some
thoughts (see also the sketch after this list):

Reading data from a table supporting predicate pushdown:
- Without an explicit partition specification, we would need to rely on
  partition pruning to select the appropriate partitions.
- However, if we can provide a mechanism to specify the partition(s), that
  would be great – and it would need to be translated into appropriate SQL
  clauses under the covers.

Writing data to a table supporting partitions:
- I think there is no current way to support the above Postgres/Oracle ways
  of creating partitioned tables or doing table exchanges intelligently, so
  probably options or some appropriate interfaces would be required.
- The above ALTER TABLE equivalent work can be done as part of the commit
  (provided an appropriate interface is supported).
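
For reference, a minimal PySpark sketch of the Hive-style partition DDL that
Spark already exposes, and that a catalog-level API would generalize (the
table name and partition values are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# A partitioned datasource table; partition columns come from the schema.
spark.sql("""
    CREATE TABLE IF NOT EXISTS measurement (
        city_id INT, peaktemp INT, logdate STRING)
    USING parquet
    PARTITIONED BY (logdate)
""")

# Explicit partition management, loosely analogous to ATTACH/DETACH above.
spark.sql("ALTER TABLE measurement ADD IF NOT EXISTS PARTITION (logdate = '2006-02-01')")
spark.sql("SHOW PARTITIONS measurement").show()
spark.sql("ALTER TABLE measurement DROP IF EXISTS PARTITION (logdate = '2006-02-01')")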

Here are Dale’s comments earlier from the thread
“So if we are not hiding them from the user, we need to allow users to
manipulate them. Either by representing them generically in the API,
allowing pass-through commands to manipulate them, or by some other means.”

I think we need to mull over this and also look beyond RDBMSes – say, S3 for 
applicability.

In essence, I think partitions matter because they allow partition pruning
(= less resource intensive) during read, and allow partition setup and
appropriate targeting during write.


From: Ryan Blue 
Reply-To: "rb...@netflix.com" 
Date: Wednesday, September 19, 2018 at 4:35 PM
To: "Thakrar, Jayesh" 
Cc: "tigerqu...@outlook.com" , Spark Dev List 

Subject: Re: [Discuss] Datasource v2 support for manipulating partitions

What does partition management look like in those systems and what are the 
options we would standardize in an API?

On Wed, Sep 19, 2018 at 2:16 PM Thakrar, Jayesh
<jthak...@conversantmedia.com> wrote:
I think a partition management feature would be very useful in RDBMSes that
support it – e.g. Oracle, PostgreSQL, and DB2.
In some cases adding partitions can be explicit and may be done outside of
data loads.
But in other cases, it may need to be done implicitly when supported
by the platform.
Similar to the static/dynamic partition loading in Hive and Oracle.

So in short, I agree that partition management should be an optional interface.

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rb...@netflix.com" <rb...@netflix.com>
Date: Wednesday, September 19, 2018 at 2:58 PM
To: "Thakrar, Jayesh" <jthak...@conversantmedia.com>
Cc: "tigerqu...@outlook.com" <tigerqu...@outlook.com>, Spark Dev List
<dev@spark.apache.org>
Subject: Re: [Discuss] Datasource v2 support for manipulating partitions

I'm open to exploring the idea of adding partition management as a catalog API. 
The approach we're taking is to have an interface for each concern a catalog 
might implement, like TableCatalog (proposed in SPARK-24252), but also 
FunctionCatalog for stored functions and possibly PartitionedTableCatalog for 
explicitly partitioned tables.

That could definitely be used to implement ALTER TABLE ADD/DROP PARTITION for 
Hive tables, although I'm not sure that we would want to continue exposing 
partitions for simple tables. I know that this is important for storage systems 
like Kudu, but I think it is needlessly difficult and annoying for simple 
tables that are partitioned by a regular transformation like Hive tables. 
That's why Iceberg hides partitioning outside of table configuration. That also 
avoids problems where SELECT DISTINCT queries are wrong because a partition 
exists but has no data.

How useful is this outside of Kudu? Is it something that we should provide an 
API for, or is it specific enough to Kudu that Spark shouldn't include it in 
the API for all sources?

rb


On Tue, Sep 18, 2018 at 7:38 AM Thakrar, Jayesh
<jthak...@conversantmedia.com> wrote:
Totally agree with you Dale, that there are situations for efficiency, 
performance and better 

Checkpointing clarifications

2018-09-20 Thread Alessandro Liparoti
Good morning,

I have a large-scale job that breaks for certain input sizes, so I am
playing with checkpointing to split the DAG and find the problematic
point. I have some questions about checkpointing (a small sketch follows
the list):

   1. What is the utility of non-eager checkpointing?
   2. How is checkpointing different from manually writing a DataFrame (or
   RDD) to HDFS? Manually writing allows re-reading the stored DataFrame,
   while with checkpointing I don't see a simple way of re-reading it in a
   future job.
   3. I read that checkpointing is different from persisting because the
   lineage is not stored, but I don't understand why persisting keeps the
   lineage. The point of persisting is that the next computation will start
   from the persisted data (either mem or mem+disk), so what is the advantage
   of having the lineage available? Am I missing some basic understanding of
   these two apparently different operations?
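
For concreteness, a minimal PySpark sketch contrasting the three operations
(the paths and sizes are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

df = spark.range(1000000)

# persist: materializes on first action but KEEPS the lineage, so a lost
# partition can be recomputed from the original plan.
cached = df.persist()

# checkpoint(eager=True): writes to the checkpoint dir immediately and
# TRUNCATES the lineage; the plan now reads from the checkpoint files.
# With eager=False the write is deferred until the first materialization.
checkpointed = df.checkpoint(eager=True)

# Manual write: the same materialization, but at a path and format you
# control, so a future job can read it back explicitly.
df.write.mode("overwrite").parquet("/tmp/df-snapshot")
reread = spark.read.parquet("/tmp/df-snapshot")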

Thanks,
*Alessandro Liparoti*


Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-20 Thread Weichen Xu
We need to merge this.
https://github.com/apache/spark/pull/22492
Otherwise MLeap cannot build against Spark 2.4.0
Thanks!

On Wed, Sep 19, 2018 at 1:16 PM Yinan Li  wrote:

> FYI: SPARK-23200 has been resolved.
>
> On Tue, Sep 18, 2018 at 8:49 AM Felix Cheung 
> wrote:
>
>> If we could work on this quickly - it might get on to future RCs.
>>
>>
>>
>> --
>> *From:* Stavros Kontopoulos 
>> *Sent:* Monday, September 17, 2018 2:35 PM
>> *To:* Yinan Li
>> *Cc:* Xiao Li; eerla...@redhat.com; van...@cloudera.com.invalid; Sean
>> Owen; Wenchen Fan; dev
>> *Subject:* Re: [VOTE] SPARK 2.4.0 (RC1)
>>
>> Hi Xiao,
>>
>> I just tested it; it seems OK. There are some open questions about which
>> properties we should keep when restoring the config, but otherwise it looks
>> OK to me.
>> The reason this should go into 2.4 is that streaming on k8s is something
>> people want to try on day one (or at least it is cool to try), and since 2.4
>> comes with heavily refactored k8s support,
>> it would be disappointing not to have it in... IMHO.
>>
>> Best,
>> Stavros
>>
>> On Mon, Sep 17, 2018 at 11:13 PM, Yinan Li  wrote:
>>
>>> We can merge the PR and get SPARK-23200 resolved if the whole point is
>>> to make streaming on k8s work first. But given that this is not a blocker
>>> for 2.4, I think we can take a bit more time here and get it right. With
>>> that being said, I would expect it to be resolved soon.
>>>
>>> On Mon, Sep 17, 2018 at 11:47 AM Xiao Li  wrote:
>>>
 Hi, Erik and Stavros,

 This bug fix SPARK-23200 is not a blocker of the 2.4 release. It sounds
 important for the Streaming on K8S. Could the K8S oriented committers speed
 up the reviews?

 Thanks,

 Xiao

 Erik Erlandson wrote on Mon, Sep 17, 2018 at 11:04 AM:

>
> I have no binding vote but I second Stavros’ recommendation for
> spark-23200
>
> Per parallel threads on Py2 support I would also like to propose
> deprecating Py2 starting with this 2.4 release
>
> On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
>  wrote:
>
>> You can log in to https://repository.apache.org and see what's wrong.
>> Just find that staging repo and look at the messages. In your case it
>> seems related to your signature.
>>
>> failureMessageNo public key: Key with id: () was not able to be
>> located on http://gpg-keyserver.de/. Upload your public key and try
>> the operation again.
>> On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan 
>> wrote:
>> >
>> > I confirmed that
>> https://repository.apache.org/content/repositories/orgapachespark-1285
>> is not accessible. I did it via ./dev/create-release/do-release-docker.sh
>> -d /my/work/dir -s publish , not sure what's going wrong. I didn't see 
>> any
>> error message during it.
>> >
>> > Any insights are appreciated! So that I can fix it in the next RC.
>> Thanks!
>> >
>> > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen 
>> wrote:
>> >>
>> >> I think one build is enough, but haven't thought it through. The
>> >> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is
>> probably
>> >> best advertised as a 'beta'. So maybe publish a no-hadoop build of
>> it?
>> >> Really, whatever's the easy thing to do.
>> >> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
>> wrote:
>> >> >
>> >> > Ah I missed the Scala 2.12 build. Do you mean we should publish
>> a Scala 2.12 build this time? Currently for Scala 2.11 we have 3 builds: 
>> with
>> hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing 
>> for
>> Scala 2.12?
>> >> >
>> >> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen 
>> wrote:
>> >> >>
>> >> >> A few preliminary notes:
>> >> >>
>> >> >> Wenchen for some weird reason when I hit your key in gpg
>> --import, it
>> >> >> asks for a passphrase. When I skip it, it's fine, gpg can still
>> verify
>> >> >> the signature. No issue there really.
>> >> >>
>> >> >> The staging repo gives a 404:
>> >> >>
>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>> >> >> 404 - Repository "orgapachespark-1285 (staging: open)"
>> >> >> [id=orgapachespark-1285] exists but is not exposed.
>> >> >>
>> >> >> The (revamped) licenses are OK, though there are some minor
>> glitches
>> >> >> in the final release tarballs (my fault) : there's an extra
>> directory,
>> >> >> and the source release has both binary and source licenses.
>> I'll fix
>> >> >> that. Not strictly necessary to reject the release over those.
>> >> >>
>> >> >> Last, when I check the staging repo I'll get my answer, but,
>> were you
>> >> >> able to build 2.12 artifacts as well?
>> >> >>
>> >> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan <
>> cloud0...@gmail.com> wrote:
>> >> >> >

Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle Data

2018-09-20 Thread Felix Cheung
Hi
+baibing3
+huangtao6

Came across your presentation on Alluxio - including shuffling - would you be 
interested in this?



From: Matt Cheah 
Sent: Tuesday, September 4, 2018 2:54 PM
To: Yuanjian Li
Cc: Spark dev list
Subject: Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for 
Persisting Shuffle Data

Yuanjian, Thanks for sharing your progress! I was wondering if there was any 
prototype code that we could read to get an idea of what the implementation 
looks like? We can evaluate the design together and also benchmark workloads 
from across the community – that is, we can collect more data from more Spark 
users.

Your experience would be greatly appreciated in the discussion.

-Matt Cheah

From: Yuanjian Li 
Date: Friday, August 31, 2018 at 8:29 PM
To: Matt Cheah 
Cc: Spark dev list 
Subject: Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for 
Persisting Shuffle Data

Hi Matt,
 Thanks for the great document and proposal; I want to +1 for reliable
shuffle data and give some feedback.
 I think a reliable shuffle service based on DFS is necessary for Spark,
especially when running Spark jobs in an unstable environment. For example,
when mixed-deploying Spark with online services, Spark executors can be killed
at any time. The current stage retry strategy can make the job many times
slower than a normal job.
 Actually we (Baidu Inc.) solved this problem with a stable shuffle service
over Hadoop, and we are now connecting Spark to this shuffle service. The POC
work is expected to be done in October. We'll post more benchmarks and
detailed work at that time. I'm still reading your discussion document and am
happy to give more feedback in the doc.

Thanks,
Yuanjian Li

Matt Cheah <mch...@palantir.com> wrote on Sat, Sep 1, 2018 at 8:42 AM:
Hi everyone,

I filed SPARK-25299 to promote discussion on how we can improve the shuffle
operation in Spark.
The basic premise is to discuss the ways we can leverage distributed storage to 
improve the reliability and isolation of Spark’s shuffle architecture.

A few designs and a full problem statement are outlined in this architecture
discussion document.

This is a complex problem and it would be great to get feedback from the 
community about the right direction to take this work in. Note that we have not 
yet committed to a specific implementation and architecture – there’s a lot 
that needs to be discussed for this improvement, so we hope to get as much 
input as possible before moving forward with a design.

Please feel free to leave comments and suggestions on the JIRA ticket or on the 
discussion document.

Thank you!

-Matt Cheah


Re: [DISCUSS] PySpark Window UDF

2018-09-20 Thread Felix Cheung
Definitely!
numba numbers are amazing


From: Wes McKinney 
Sent: Saturday, September 8, 2018 7:46 AM
To: Li Jin
Cc: dev@spark.apache.org
Subject: Re: [DISCUSS] PySpark Window UDF

hi Li,

These results are very cool. I'm excited to see you continuing to push
this effort forward.

- Wes
On Wed, Sep 5, 2018 at 5:52 PM Li Jin  wrote:
>
> Hello again!
>
> I recently implemented a proof-of-concept implementation of the proposal
> above. I think the results are pretty exciting, so I want to share my
> findings with the community. I have implemented two variants of the pandas
> window UDF - one that takes a pandas.Series as input and one that takes a
> numpy array as input. I benchmarked with a rolling mean on 1M doubles and
> here are some results:
>
> Spark SQL window function: 20s
> Pandas variant: ~60s
> Numpy variant: 10s
> Numpy variant with numba: 4s
>
> You can see the benchmark code here:
> https://gist.github.com/icexelloss/845beb3d0d6bfc3d51b3c7419edf0dcb
>
> I think the results are quite exciting because:
> (1) numpy variant even outperforms the Spark SQL window function
> (2) numpy variant with numba has the best performance as well as the 
> flexibility to allow users to write window functions in pure python
>
> The Pandas variant is not bad either (1.5x faster than existing UDF with 
> collect_list) but the numpy variant definitely has much better performance.
>
> So far all Pandas UDFs interact with Pandas data structures rather than numpy
> data structures, but the window UDF result might be a good reason to open up
> numpy variants of Pandas UDFs. What do people think? I'd love to hear the
> community's feedback.
>
>
> Links:
> You can reproduce benchmark with numpy variant by using the branch:
> https://github.com/icexelloss/spark/tree/window-udf-numpy
>
> PR link:
> https://github.com/apache/spark/pull/22305
>
> On Wed, May 16, 2018 at 3:34 PM Li Jin  wrote:
>>
>> Hi All,
>>
>> I have been looking into leveraging the Arrow and Pandas UDF work we have
>> done so far for Window UDF in PySpark. I have done some investigation and
>> believe there is a way to do PySpark window UDFs efficiently.
>>
>> The basic idea is that instead of passing each window to Python separately,
>> we can pass a "batch of windows" as an Arrow Batch of rows + begin/end
>> indices for each window (indices are computed on the Java side), and then
>> roll over the begin/end indices in Python and apply the UDF.
>>
>> I have written my investigation in more details here:
>> https://docs.google.com/document/d/14EjeY5z4-NC27-SmIP9CsMPCANeTcvxN44a7SIJtZPc/edit#
>>
>> I think this is a pretty promising approach and I hope to get some feedback
>> from the community about it. Let's discuss! :)
>>
>> Li
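
For context, a hedged sketch of the built-in Spark SQL baseline that the
benchmark above compares against (the column names and window bounds are
illustrative assumptions):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000).select(
    F.col("id").alias("t"), (F.rand() * 100).alias("v"))

# Built-in rolling mean over the previous 99 rows plus the current row; the
# proposed pandas/numpy window UDFs would let users write the aggregation
# body in Python instead of being limited to built-in functions.
w = Window.orderBy("t").rowsBetween(-99, 0)
result = df.withColumn("rolling_mean", F.avg("v").over(w))
result.show(5)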
