Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-15 Thread Ryan Blue
-1 (with my Apache member hat on, non-binding)

I'll continue discussion in the other thread, but I don't think we should
share signing keys.

On Fri, Sep 15, 2017 at 5:14 PM, Holden Karau  wrote:

> Indeed it's limited to people with login permissions on the Jenkins host
> (and perhaps further limited, I'm not certain). Shane probably knows more
> about the ACLs, so I'll ask him in the other thread for specifics.
>
> This is maybe branching a bit from the question of the current RC though,
> so I'd suggest we continue this discussion on the thread Sean Owen made.
>
> On Fri, Sep 15, 2017 at 4:04 PM Ryan Blue  wrote:
>
>> I'm not familiar with the release procedure, can you send a link to this
>> Jenkins job? Can anyone run this job, or is it limited to committers?
>>
>> rb
>>
>> On Fri, Sep 15, 2017 at 12:28 PM, Holden Karau 
>> wrote:
>>
>>> That's a good question, I built the release candidate however the
>>> Jenkins scripts don't take a parameter for configuring who signs them
>>> rather it always signs them with Patrick's key. You can see this from
>>> previous releases which were managed by other folks but still signed by
>>> Patrick.
>>>
>>> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:
>>>
 The signature is valid, but why was the release signed with Patrick
 Wendell's private key? Did Patrick build the release candidate?

 rb

 On Fri, Sep 15, 2017 at 6:36 AM, Denny Lee 
 wrote:

> +1 (non-binding)
>
> On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung <
> felixcheun...@hotmail.com> wrote:
>
>> +1 tested SparkR package on Windows, r-hub, Ubuntu.
>>
>> _
>> From: Sean Owen 
>> Sent: Thursday, September 14, 2017 3:12 PM
>> Subject: Re: [VOTE] Spark 2.1.2 (RC1)
>> To: Holden Karau , 
>>
>>
>>
>> +1
>> Very nice. The sigs and hashes look fine, it builds fine for me on
>> Debian Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and passes
>> tests.
>>
>> Yes as you say, no outstanding issues except for this which doesn't
>> look critical, as it's not a regression.
>>
>> SPARK-21985 PySpark PairDeserializer is broken for double-zipped RDDs
>>
>>
>> On Thu, Sep 14, 2017 at 7:47 PM Holden Karau 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.2. The vote is open until Friday September 22nd at
>>> 18:00 PST and passes if a majority of at least 3 +1 PMC votes are
>>> cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see
>>> https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.2-rc1
>>> (6f470323a0363656999dd36cb33f528afe627c12)
>>>
>>> List of JIRA tickets resolved in this release can be found with
>>> this filter.
>>> 
>>>
>>> The release files, including signatures, digests, etc. can be found
>>> at:
>>> https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1248/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://people.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running it on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark, you can set up a virtual env and
>>> install the current RC and see if anything important breaks. In
>>> Java/Scala, you can add the staging repository to your project's resolvers
>>> and test with the RC (make sure to clean up the artifact cache before/after
>>> so you don't end up building with an out-of-date RC going forward).
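A minimal sbt sketch of the Java/Scala testing step described above, assuming an sbt build; the staging URL is the orgapachespark-1248 repository listed earlier in this message, and spark-sql is only an example module:

// Add the RC staging repository so an existing project builds against the candidate.
resolvers += "Apache Spark 2.1.2 RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1248/"

// Depend on the RC version of whichever Spark modules the project already uses.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.2" % "provided"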
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 

Re: Signing releases with pwendell or release manager's key?

2017-09-15 Thread Holden Karau
Oh yes, and to keep people informed, I've been updating a PR for the
release documentation as I go, writing down some of this unwritten
knowledge: https://github.com/apache/spark-website/pull/66


On Fri, Sep 15, 2017 at 5:12 PM Holden Karau  wrote:

> Also continuing the discussion from the vote threads, Shane probably has
> the best idea on the ACLs for Jenkins so I've CC'd him as well.
>
>
> On Fri, Sep 15, 2017 at 5:09 PM Holden Karau  wrote:
>
>> Changing the release jobs, beyond the available parameters, right now
>> depends on Josh Rosen, as there are some scripts which generate the jobs
>> which aren't public. I've done temporary fixes in the past with the Python
>> packaging but my understanding is that in the medium term it requires
>> access to the scripts.
>>
>> So +CC Josh.
>>
>> On Fri, Sep 15, 2017 at 4:38 PM Ryan Blue  wrote:
>>
>>> I think this needs to be fixed. It's true that there are barriers to
>>> publication, but the signature is what we use to authenticate Apache
>>> releases.
>>>
>>> If Patrick's key is available on Jenkins for any Spark committer to use,
>>> then the chance of a compromise are much higher than for a normal RM key.
>>>
>>> rb
>>>
>>> On Fri, Sep 15, 2017 at 12:34 PM, Sean Owen  wrote:
>>>
 Yeah I had meant to ask about that in the past. While I presume Patrick
 consents to this and all that, it does mean that anyone with access to said
 Jenkins scripts can create a signed Spark release, regardless of who they
 are.

 I haven't thought through whether that's a theoretical issue we can
 ignore or something we need to fix up. For example you can't get a release
 on the ASF mirrors without more authentication.

 How hard would it be to make the script take in a key? it sort of looks
 like the script already takes GPG_KEY, but don't know how to modify the
 jobs. I suppose it would be ideal, in any event, for the actual release
 manager to sign.

 On Fri, Sep 15, 2017 at 8:28 PM Holden Karau 
 wrote:

> That's a good question, I built the release candidate however the
> Jenkins scripts don't take a parameter for configuring who signs them
> rather it always signs them with Patrick's key. You can see this from
> previous releases which were managed by other folks but still signed by
> Patrick.
>
> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:
>
>> The signature is valid, but why was the release signed with Patrick
>> Wendell's private key? Did Patrick build the release candidate?
>>
>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
> --
> Twitter: https://twitter.com/holdenkarau
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-15 Thread Holden Karau
Indeed, it's limited to people with login permissions on the Jenkins host
(and perhaps further limited, I'm not certain). Shane probably knows more
about the ACLs, so I'll ask him in the other thread for specifics.

This is perhaps branching a bit from the question of the current RC, though,
so I'd suggest we continue this discussion on the thread Sean Owen made.

On Fri, Sep 15, 2017 at 4:04 PM Ryan Blue  wrote:

> I'm not familiar with the release procedure, can you send a link to this
> Jenkins job? Can anyone run this job, or is it limited to committers?
>
> rb
>
> On Fri, Sep 15, 2017 at 12:28 PM, Holden Karau 
> wrote:
>
>> That's a good question, I built the release candidate however the Jenkins
>> scripts don't take a parameter for configuring who signs them rather it
>> always signs them with Patrick's key. You can see this from previous
>> releases which were managed by other folks but still signed by Patrick.
>>
>> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:
>>
>>> The signature is valid, but why was the release signed with Patrick
>>> Wendell's private key? Did Patrick build the release candidate?
>>>
>>> rb
>>>
>>> On Fri, Sep 15, 2017 at 6:36 AM, Denny Lee 
>>> wrote:
>>>
 +1 (non-binding)

 On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung <
 felixcheun...@hotmail.com> wrote:

> +1 tested SparkR package on Windows, r-hub, Ubuntu.
>
> _
> From: Sean Owen 
> Sent: Thursday, September 14, 2017 3:12 PM
> Subject: Re: [VOTE] Spark 2.1.2 (RC1)
> To: Holden Karau , 
>
>
>
> +1
> Very nice. The sigs and hashes look fine, it builds fine for me on
> Debian Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and passes
> tests.
>
> Yes as you say, no outstanding issues except for this which doesn't
> look critical, as it's not a regression.
>
> SPARK-21985 PySpark PairDeserializer is broken for double-zipped RDDs
>
>
> On Thu, Sep 14, 2017 at 7:47 PM Holden Karau 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark
>> version 2.1.2. The vote is open until Friday September 22nd at 18:00
>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see
>> https://spark.apache.org/
>>
>> The tag to be voted on is v2.1.2-rc1
>>  (
>> 6f470323a0363656999dd36cb33f528afe627c12)
>>
>> List of JIRA tickets resolved in this release can be found with this
>> filter.
>> 
>>
>> The release files, including signatures, digests, etc. can be found
>> at:
>> https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>>
>> https://repository.apache.org/content/repositories/orgapachespark-1248/
>>
>> The documentation corresponding to this release can be found at:
>>
>> https://people.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala 
>> you
>> can add the staging repository to your projects resolvers and test with 
>> the
>> RC (make sure to clean up the artifact cache before/after so you don't 
>> end
>> up building with a out of date RC going forward).
>>
>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.1.3.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from 2.1.1. That being
>> said if there is something which is a regression form 2.1.1 that has not
>> been correctly targeted please ping a committer to help 

Re: Signing releases with pwendell or release manager's key?

2017-09-15 Thread Holden Karau
Also continuing the discussion from the vote threads, Shane probably has
the best idea on the ACLs for Jenkins so I've CC'd him as well.


On Fri, Sep 15, 2017 at 5:09 PM Holden Karau  wrote:

> Changing the release jobs, beyond the available parameters, right now
> depends on Josh Rosen, as there are some scripts which generate the jobs
> which aren't public. I've done temporary fixes in the past with the Python
> packaging but my understanding is that in the medium term it requires
> access to the scripts.
>
> So +CC Josh.
>
> On Fri, Sep 15, 2017 at 4:38 PM Ryan Blue  wrote:
>
>> I think this needs to be fixed. It's true that there are barriers to
>> publication, but the signature is what we use to authenticate Apache
>> releases.
>>
>> If Patrick's key is available on Jenkins for any Spark committer to use,
>> then the chance of a compromise are much higher than for a normal RM key.
>>
>> rb
>>
>> On Fri, Sep 15, 2017 at 12:34 PM, Sean Owen  wrote:
>>
>>> Yeah I had meant to ask about that in the past. While I presume Patrick
>>> consents to this and all that, it does mean that anyone with access to said
>>> Jenkins scripts can create a signed Spark release, regardless of who they
>>> are.
>>>
>>> I haven't thought through whether that's a theoretical issue we can
>>> ignore or something we need to fix up. For example you can't get a release
>>> on the ASF mirrors without more authentication.
>>>
>>> How hard would it be to make the script take in a key? it sort of looks
>>> like the script already takes GPG_KEY, but don't know how to modify the
>>> jobs. I suppose it would be ideal, in any event, for the actual release
>>> manager to sign.
>>>
>>> On Fri, Sep 15, 2017 at 8:28 PM Holden Karau 
>>> wrote:
>>>
 That's a good question, I built the release candidate however the
 Jenkins scripts don't take a parameter for configuring who signs them
 rather it always signs them with Patrick's key. You can see this from
 previous releases which were managed by other folks but still signed by
 Patrick.

 On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:

> The signature is valid, but why was the release signed with Patrick
> Wendell's private key? Did Patrick build the release candidate?
>

>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
> --
> Twitter: https://twitter.com/holdenkarau
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: Signing releases with pwendell or release manager's key?

2017-09-15 Thread Holden Karau
Changing the release jobs, beyond the available parameters, right now
depends on Josh Rosen, as there are some scripts which generate the jobs
that aren't public. I've done temporary fixes in the past with the Python
packaging, but my understanding is that in the medium term it requires
access to the scripts.

So +CC Josh.

On Fri, Sep 15, 2017 at 4:38 PM Ryan Blue  wrote:

> I think this needs to be fixed. It's true that there are barriers to
> publication, but the signature is what we use to authenticate Apache
> releases.
>
> If Patrick's key is available on Jenkins for any Spark committer to use,
> then the chance of a compromise are much higher than for a normal RM key.
>
> rb
>
> On Fri, Sep 15, 2017 at 12:34 PM, Sean Owen  wrote:
>
>> Yeah I had meant to ask about that in the past. While I presume Patrick
>> consents to this and all that, it does mean that anyone with access to said
>> Jenkins scripts can create a signed Spark release, regardless of who they
>> are.
>>
>> I haven't thought through whether that's a theoretical issue we can
>> ignore or something we need to fix up. For example you can't get a release
>> on the ASF mirrors without more authentication.
>>
>> How hard would it be to make the script take in a key? it sort of looks
>> like the script already takes GPG_KEY, but don't know how to modify the
>> jobs. I suppose it would be ideal, in any event, for the actual release
>> manager to sign.
>>
>> On Fri, Sep 15, 2017 at 8:28 PM Holden Karau 
>> wrote:
>>
>>> That's a good question, I built the release candidate however the
>>> Jenkins scripts don't take a parameter for configuring who signs them
>>> rather it always signs them with Patrick's key. You can see this from
>>> previous releases which were managed by other folks but still signed by
>>> Patrick.
>>>
>>> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:
>>>
 The signature is valid, but why was the release signed with Patrick
 Wendell's private key? Did Patrick build the release candidate?

>>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: Signing releases with pwendell or release manager's key?

2017-09-15 Thread Ryan Blue
I think this needs to be fixed. It's true that there are barriers to
publication, but the signature is what we use to authenticate Apache
releases.

If Patrick's key is available on Jenkins for any Spark committer to use,
then the chances of a compromise are much higher than for a normal RM key.

rb

On Fri, Sep 15, 2017 at 12:34 PM, Sean Owen  wrote:

> Yeah I had meant to ask about that in the past. While I presume Patrick
> consents to this and all that, it does mean that anyone with access to said
> Jenkins scripts can create a signed Spark release, regardless of who they
> are.
>
> I haven't thought through whether that's a theoretical issue we can ignore
> or something we need to fix up. For example you can't get a release on the
> ASF mirrors without more authentication.
>
> How hard would it be to make the script take in a key? it sort of looks
> like the script already takes GPG_KEY, but don't know how to modify the
> jobs. I suppose it would be ideal, in any event, for the actual release
> manager to sign.
>
> On Fri, Sep 15, 2017 at 8:28 PM Holden Karau  wrote:
>
>> That's a good question, I built the release candidate however the Jenkins
>> scripts don't take a parameter for configuring who signs them rather it
>> always signs them with Patrick's key. You can see this from previous
>> releases which were managed by other folks but still signed by Patrick.
>>
>> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:
>>
>>> The signature is valid, but why was the release signed with Patrick
>>> Wendell's private key? Did Patrick build the release candidate?
>>>
>>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-15 Thread Ryan Blue
I'm not familiar with the release procedure; can you send a link to this
Jenkins job? Can anyone run this job, or is it limited to committers?

rb

On Fri, Sep 15, 2017 at 12:28 PM, Holden Karau  wrote:

> That's a good question, I built the release candidate however the Jenkins
> scripts don't take a parameter for configuring who signs them rather it
> always signs them with Patrick's key. You can see this from previous
> releases which were managed by other folks but still signed by Patrick.
>
> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:
>
>> The signature is valid, but why was the release signed with Patrick
>> Wendell's private key? Did Patrick build the release candidate?
>>
>> rb
>>
>> On Fri, Sep 15, 2017 at 6:36 AM, Denny Lee  wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung 
>>> wrote:
>>>
 +1 tested SparkR package on Windows, r-hub, Ubuntu.

 _
 From: Sean Owen 
 Sent: Thursday, September 14, 2017 3:12 PM
 Subject: Re: [VOTE] Spark 2.1.2 (RC1)
 To: Holden Karau , 



 +1
 Very nice. The sigs and hashes look fine, it builds fine for me on
 Debian Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and passes
 tests.

 Yes as you say, no outstanding issues except for this which doesn't
 look critical, as it's not a regression.

 SPARK-21985 PySpark PairDeserializer is broken for double-zipped RDDs


 On Thu, Sep 14, 2017 at 7:47 PM Holden Karau 
 wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 2.1.2. The vote is open until Friday September 22nd at 18:00
> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.2
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.1.2-rc1
>  (6f470323a036365
> 6999dd36cb33f528afe627c12)
>
> List of JIRA tickets resolved in this release can be found with this
> filter.
> 
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapache
> spark-1248/
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~pwendell/spark-releases/spark-2.1
> .2-rc1-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala you
> can add the staging repository to your projects resolvers and test with 
> the
> RC (make sure to clean up the artifact cache before/after so you don't end
> up building with a out of date RC going forward).
>
> *What should happen to JIRA tickets still targeting 2.1.2?*
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should be
> worked on immediately. Everything else please retarget to 2.1.3.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from 2.1.1. That being
> said if there is something which is a regression form 2.1.1 that has not
> been correctly targeted please ping a committer to help target the issue
> (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
> 
> )
>
> *What are the unresolved* issues targeted for 2.1.2
> 
> ?
>
> At the 

Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-15 Thread Felix Cheung
Yes ;)


From: Xiao Li 
Sent: Friday, September 15, 2017 2:22:03 PM
To: Holden Karau
Cc: Ryan Blue; Denny Lee; Felix Cheung; Sean Owen; dev@spark.apache.org
Subject: Re: [VOTE] Spark 2.1.2 (RC1)

Sorry, this release candidate is 2.1.2. The issue is in 2.2.1.

2017-09-15 14:21 GMT-07:00 Xiao Li:
-1

See the discussion in https://github.com/apache/spark/pull/19074

Xiao



2017-09-15 12:28 GMT-07:00 Holden Karau:
That's a good question, I built the release candidate however the Jenkins 
scripts don't take a parameter for configuring who signs them rather it always 
signs them with Patrick's key. You can see this from previous releases which 
were managed by other folks but still signed by Patrick.

On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue wrote:
The signature is valid, but why was the release signed with Patrick Wendell's 
private key? Did Patrick build the release candidate?

rb

On Fri, Sep 15, 2017 at 6:36 AM, Denny Lee wrote:
+1 (non-binding)

On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung wrote:
+1 tested SparkR package on Windows, r-hub, Ubuntu.

_
From: Sean Owen
Sent: Thursday, September 14, 2017 3:12 PM
Subject: Re: [VOTE] Spark 2.1.2 (RC1)
To: Holden Karau



+1
Very nice. The sigs and hashes look fine, it builds fine for me on Debian 
Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and passes tests.

Yes as you say, no outstanding issues except for this which doesn't look 
critical, as it's not a regression.

SPARK-21985 PySpark PairDeserializer is broken for double-zipped RDDs


On Thu, Sep 14, 2017 at 7:47 PM Holden Karau wrote:
Please vote on releasing the following candidate as Apache Spark version 2.1.2. 
The vote is open until Friday September 22nd at 18:00 PST and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.2
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.1.2-rc1 (6f470323a0363656999dd36cb33f528afe627c12)

List of JIRA tickets resolved in this release can be found with this 
filter.

The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1248/

The documentation corresponding to this release can be found at:
https://people.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

If you're working in PySpark you can set up a virtual env and install the 
current RC and see if anything important breaks, in the Java/Scala you can add 
the staging repository to your projects resolvers and test with the RC (make 
sure to clean up the artifact cache before/after so you don't end up building 
with a out of date RC going forward).

What should happen to JIRA tickets still targeting 2.1.2?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.1.3.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.1.1. That being said if there is 
something which is a regression form 2.1.1 that has not been correctly targeted 
please ping a committer to help target the issue (you can see the open issues 
listed as impacting Spark 2.1.1 & 
2.1.2)

What are the unresolved issues targeted for 
2.1.2?

At 

Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-15 Thread Xiao Li
Sorry, this release candidate is 2.1.2. The issue is in 2.2.1.

2017-09-15 14:21 GMT-07:00 Xiao Li :

> -1
>
> See the discussion in https://github.com/apache/spark/pull/19074
>
> Xiao
>
>
>
> 2017-09-15 12:28 GMT-07:00 Holden Karau :
>
>> That's a good question, I built the release candidate however the Jenkins
>> scripts don't take a parameter for configuring who signs them rather it
>> always signs them with Patrick's key. You can see this from previous
>> releases which were managed by other folks but still signed by Patrick.
>>
>> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:
>>
>>> The signature is valid, but why was the release signed with Patrick
>>> Wendell's private key? Did Patrick build the release candidate?
>>>
>>> rb
>>>
>>> On Fri, Sep 15, 2017 at 6:36 AM, Denny Lee 
>>> wrote:
>>>
 +1 (non-binding)

 On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung <
 felixcheun...@hotmail.com> wrote:

> +1 tested SparkR package on Windows, r-hub, Ubuntu.
>
> _
> From: Sean Owen 
> Sent: Thursday, September 14, 2017 3:12 PM
> Subject: Re: [VOTE] Spark 2.1.2 (RC1)
> To: Holden Karau , 
>
>
>
> +1
> Very nice. The sigs and hashes look fine, it builds fine for me on
> Debian Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and passes
> tests.
>
> Yes as you say, no outstanding issues except for this which doesn't
> look critical, as it's not a regression.
>
> SPARK-21985 PySpark PairDeserializer is broken for double-zipped RDDs
>
>
> On Thu, Sep 14, 2017 at 7:47 PM Holden Karau 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark
>> version 2.1.2. The vote is open until Friday September 22nd at 18:00
>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see
>> https://spark.apache.org/
>>
>> The tag to be voted on is v2.1.2-rc1
>>  (6f470323a036365
>> 6999dd36cb33f528afe627c12)
>>
>> List of JIRA tickets resolved in this release can be found with this
>> filter.
>> 
>>
>> The release files, including signatures, digests, etc. can be found
>> at:
>> https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapache
>> spark-1248/
>>
>> The documentation corresponding to this release can be found at:
>> https://people.apache.org/~pwendell/spark-releases/spark-2.1
>> .2-rc1-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala 
>> you
>> can add the staging repository to your projects resolvers and test with 
>> the
>> RC (make sure to clean up the artifact cache before/after so you don't 
>> end
>> up building with a out of date RC going forward).
>>
>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.1.3.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from 2.1.1. That being
>> said if there is something which is a regression form 2.1.1 that has not
>> been correctly targeted please ping a committer to help target the issue
>> (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>> 
>> )
>>
>> *What are the unresolved* issues targeted for 2.1.2
>> 

Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-15 Thread Xiao Li
-1

See the discussion in https://github.com/apache/spark/pull/19074

Xiao



2017-09-15 12:28 GMT-07:00 Holden Karau :

> That's a good question, I built the release candidate however the Jenkins
> scripts don't take a parameter for configuring who signs them rather it
> always signs them with Patrick's key. You can see this from previous
> releases which were managed by other folks but still signed by Patrick.
>
> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:
>
>> The signature is valid, but why was the release signed with Patrick
>> Wendell's private key? Did Patrick build the release candidate?
>>
>> rb
>>
>> On Fri, Sep 15, 2017 at 6:36 AM, Denny Lee  wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung 
>>> wrote:
>>>
 +1 tested SparkR package on Windows, r-hub, Ubuntu.

 _
 From: Sean Owen 
 Sent: Thursday, September 14, 2017 3:12 PM
 Subject: Re: [VOTE] Spark 2.1.2 (RC1)
 To: Holden Karau , 



 +1
 Very nice. The sigs and hashes look fine, it builds fine for me on
 Debian Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and passes
 tests.

 Yes as you say, no outstanding issues except for this which doesn't
 look critical, as it's not a regression.

 SPARK-21985 PySpark PairDeserializer is broken for double-zipped RDDs


 On Thu, Sep 14, 2017 at 7:47 PM Holden Karau 
 wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 2.1.2. The vote is open until Friday September 22nd at 18:00
> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.2
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.1.2-rc1
>  (6f470323a036365
> 6999dd36cb33f528afe627c12)
>
> List of JIRA tickets resolved in this release can be found with this
> filter.
> 
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapache
> spark-1248/
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~pwendell/spark-releases/spark-2.1
> .2-rc1-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala you
> can add the staging repository to your projects resolvers and test with 
> the
> RC (make sure to clean up the artifact cache before/after so you don't end
> up building with a out of date RC going forward).
>
> *What should happen to JIRA tickets still targeting 2.1.2?*
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should be
> worked on immediately. Everything else please retarget to 2.1.3.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from 2.1.1. That being
> said if there is something which is a regression form 2.1.1 that has not
> been correctly targeted please ping a committer to help target the issue
> (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
> 
> )
>
> *What are the unresolved* issues targeted for 2.1.2
> 
> ?
>
> At the time of the writing, there is one in progress major issue
> SPARK-21985 

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-15 Thread Shixiong(Ryan) Zhu
Can we just create those tables once locally using official Spark versions
and commit them? Then the unit tests can just read these files and don't
need to download Spark.

On Thu, Sep 14, 2017 at 8:13 AM, Sean Owen  wrote:

> I think the download could use the Apache mirror, yeah. I don't know if
> there's a reason that it must though. What's good enough for releases is
> good enough for this purpose. People might not like the big download in the
> tests; if it really came up as an issue, we could find ways to cache it
> better locally. I brought it up more as a question than a problem to solve.
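As a rough illustration of the point about preferring a mirror, the download URL could be derived from either source; the archive.apache.org layout below is an assumption, not what HiveExternalCatalogVersionsSuite currently does:

// Sketch only: choose between the Apache archive and the CloudFront bucket for the
// released-Spark tarball that the compatibility test downloads.
def sparkTarballUrl(version: String, useApacheArchive: Boolean): String = {
  val file = s"spark-$version-bin-hadoop2.7.tgz"
  if (useApacheArchive) s"https://archive.apache.org/dist/spark/spark-$version/$file"
  else s"https://d3kbcqa49mib13.cloudfront.net/$file"
}

println(sparkTarballUrl("2.1.1", useApacheArchive = true))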
>
> On Thu, Sep 14, 2017 at 5:02 PM Mark Hamstra 
> wrote:
>
>> The problem is that it's not really an "official" download link, but
>> rather just a supplemental convenience. While that may be ok when
>> distributing artifacts, it's more of a problem when actually building and
>> testing artifacts. In the latter case, the download should really only be
>> from an Apache mirror.
>>
>> On Thu, Sep 14, 2017 at 1:20 AM, Wenchen Fan  wrote:
>>
>>> That test case is trying to test the backward compatibility of
>>> `HiveExternalCatalog`. It downloads official Spark releases and creates
>>> tables with them, and then reads these tables via the current Spark.
>>>
>>> About the download link, I just picked it from the Spark website, and
>>> this link is the default one when you choose "direct download". Do we have
>>> a better choice?
>>>
>>> On Thu, Sep 14, 2017 at 3:05 AM, Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>>
 Mark, I agree with your point on the risks of using Cloudfront while
 building Spark. I was only trying to provide background on when we
 started using Cloudfront.

 Personally, I don't have enough context about the test case in
 question (e.g., why are we downloading Spark in a test case?).

 Thanks
 Shivaram

 On Wed, Sep 13, 2017 at 11:50 AM, Mark Hamstra 
 wrote:
 > Yeah, but that discussion and use case is a bit different --
 providing a
 > different route to download the final released and approved artifacts
 that
 > were built using only acceptable artifacts and sources vs. building
 and
 > checking prior to release using something that is not from an Apache
 mirror.
 > This new use case puts us in the position of approving spark
 artifacts that
 > weren't built entirely from canonical resources located in presumably
 secure
 > and monitored repositories. Incorporating something that is not
 completely
 > trusted or approved into the process of building something that we
 are then
 > going to approve as trusted is different from the prior use of
 cloudfront.
 >
 > On Wed, Sep 13, 2017 at 10:26 AM, Shivaram Venkataraman
 >  wrote:
 >>
 >> The bucket comes from Cloudfront, a CDN that's part of AWS. There was a
 >> bunch of discussion about this back in 2013:
 >>
 >> https://lists.apache.org/thread.html/9a72ff7ce913dd85a6b112b1b2de536dcda74b28b050f70646aba0ac@1380147885@%3Cdev.spark.apache.org%3E
 >>
 >> Shivaram
 >>
 >> On Wed, Sep 13, 2017 at 9:30 AM, Sean Owen 
 wrote:
 >> > Not a big deal, but Mark noticed that this test now downloads Spark
 >> > artifacts from the same 'direct download' link available on the
 >> > downloads
 >> > page:
 >> >
 >> >
 >> > https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala#L53
 >> >
 >> > https://d3kbcqa49mib13.cloudfront.net/spark-$version-bin-hadoop2.7.tgz
 >> >
 >> > I don't know of any particular problem with this, which is a
 parallel
 >> > download option in addition to the Apache mirrors. It's also the
 >> > default.
 >> >
 >> > Does anyone know what this bucket is and if there's a strong
 reason we
 >> > can't
 >> > just use mirrors?
 >>
 >> 
 -
 >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 >>
 >

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>>
>>


Signing releases with pwendell or release manager's key?

2017-09-15 Thread Sean Owen
Yeah I had meant to ask about that in the past. While I presume Patrick
consents to this and all that, it does mean that anyone with access to said
Jenkins scripts can create a signed Spark release, regardless of who they
are.

I haven't thought through whether that's a theoretical issue we can ignore
or something we need to fix up. For example you can't get a release on the
ASF mirrors without more authentication.

How hard would it be to make the script take in a key? It sort of looks
like the script already takes GPG_KEY, but I don't know how to modify the
jobs. I suppose it would be ideal, in any event, for the actual release
manager to sign.
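The Jenkins release scripts themselves are not public, so the following is only a sketch of the idea above: have the job sign with whatever key it is given. GPG_KEY is the variable name mentioned above; the artifact name and the rest of the invocation are assumptions.

import scala.sys.process._

// Sign one artifact with the key passed to the job, instead of a hard-coded key.
// Assumes gpg is on the PATH and the release manager's secret key is in the keyring.
def signArtifact(artifact: String, gpgKey: String): Int =
  Seq("gpg", "--armor", "--detach-sign", "--local-user", gpgKey,
      "--output", s"$artifact.asc", artifact).!

val key = sys.env.getOrElse("GPG_KEY", sys.error("GPG_KEY is not set"))
signArtifact("spark-2.1.2-bin-hadoop2.7.tgz", key)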

On Fri, Sep 15, 2017 at 8:28 PM Holden Karau  wrote:

> That's a good question, I built the release candidate however the Jenkins
> scripts don't take a parameter for configuring who signs them rather it
> always signs them with Patrick's key. You can see this from previous
> releases which were managed by other folks but still signed by Patrick.
>
> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:
>
>> The signature is valid, but why was the release signed with Patrick
>> Wendell's private key? Did Patrick build the release candidate?
>>
>


Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-15 Thread Holden Karau
That's a good question. I built the release candidate; however, the Jenkins
scripts don't take a parameter for configuring who signs them. Rather, they
always sign with Patrick's key. You can see this from previous releases,
which were managed by other folks but still signed by Patrick.

On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  wrote:

> The signature is valid, but why was the release signed with Patrick
> Wendell's private key? Did Patrick build the release candidate?
>
> rb
>
> On Fri, Sep 15, 2017 at 6:36 AM, Denny Lee  wrote:
>
>> +1 (non-binding)
>>
>> On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung 
>> wrote:
>>
>>> +1 tested SparkR package on Windows, r-hub, Ubuntu.
>>>
>>> _
>>> From: Sean Owen 
>>> Sent: Thursday, September 14, 2017 3:12 PM
>>> Subject: Re: [VOTE] Spark 2.1.2 (RC1)
>>> To: Holden Karau , 
>>>
>>>
>>>
>>> +1
>>> Very nice. The sigs and hashes look fine, it builds fine for me on
>>> Debian Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and passes
>>> tests.
>>>
>>> Yes as you say, no outstanding issues except for this which doesn't look
>>> critical, as it's not a regression.
>>>
>>> SPARK-21985 PySpark PairDeserializer is broken for double-zipped RDDs
>>>
>>>
>>> On Thu, Sep 14, 2017 at 7:47 PM Holden Karau 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.1.2. The vote is open until Friday September 22nd at 18:00
 PST and passes if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 2.1.2
 [ ] -1 Do not release this package because ...


 To learn more about Apache Spark, please see https://spark.apache.org/

 The tag to be voted on is v2.1.2-rc1
  (6f470323a036365
 6999dd36cb33f528afe627c12)

 List of JIRA tickets resolved in this release can be found with this
 filter.
 

 The release files, including signatures, digests, etc. can be found at:
 https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1248/

 The documentation corresponding to this release can be found at:
 https://people.apache.org/~pwendell/spark-releases/spark-2.
 1.2-rc1-docs/


 *FAQ*

 *How can I help test this release?*

 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload and running on this release candidate, then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install
 the current RC and see if anything important breaks, in the Java/Scala you
 can add the staging repository to your projects resolvers and test with the
 RC (make sure to clean up the artifact cache before/after so you don't end
 up building with a out of date RC going forward).

 *What should happen to JIRA tickets still targeting 2.1.2?*

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should be
 worked on immediately. Everything else please retarget to 2.1.3.

 *But my bug isn't fixed!??!*

 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from 2.1.1. That being
 said if there is something which is a regression form 2.1.1 that has not
 been correctly targeted please ping a committer to help target the issue
 (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
 
 )

 *What are the unresolved* issues targeted for 2.1.2
 
 ?

 At the time of the writing, there is one in progress major issue
 SPARK-21985 , I
 believe Andrew Ray & HyukjinKwon are looking into this one.

 --
 Twitter: https://twitter.com/holdenkarau

>>>
>>>
>>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
Twitter: 

Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-15 Thread Ryan Blue
The signature is valid, but why was the release signed with Patrick
Wendell's private key? Did Patrick build the release candidate?

rb

On Fri, Sep 15, 2017 at 6:36 AM, Denny Lee  wrote:

> +1 (non-binding)
>
> On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung 
> wrote:
>
>> +1 tested SparkR package on Windows, r-hub, Ubuntu.
>>
>> _
>> From: Sean Owen 
>> Sent: Thursday, September 14, 2017 3:12 PM
>> Subject: Re: [VOTE] Spark 2.1.2 (RC1)
>> To: Holden Karau , 
>>
>>
>>
>> +1
>> Very nice. The sigs and hashes look fine, it builds fine for me on Debian
>> Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and passes tests.
>>
>> Yes as you say, no outstanding issues except for this which doesn't look
>> critical, as it's not a regression.
>>
>> SPARK-21985 PySpark PairDeserializer is broken for double-zipped RDDs
>>
>>
>> On Thu, Sep 14, 2017 at 7:47 PM Holden Karau 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.2. The vote is open until Friday September 22nd at 18:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.2-rc1
>>>  (6f470323a036365
>>> 6999dd36cb33f528afe627c12)
>>>
>>> List of JIRA tickets resolved in this release can be found with this
>>> filter.
>>> 
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1248/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://people.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala you
>>> can add the staging repository to your projects resolvers and test with the
>>> RC (make sure to clean up the artifact cache before/after so you don't end
>>> up building with a out of date RC going forward).
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.3.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1. That being said if
>>> there is something which is a regression form 2.1.1 that has not been
>>> correctly targeted please ping a committer to help target the issue (you
>>> can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>>> 
>>> )
>>>
>>> *What are the unresolved* issues targeted for 2.1.2
>>> 
>>> ?
>>>
>>> At the time of the writing, there is one in progress major issue
>>> SPARK-21985 , I
>>> believe Andrew Ray & HyukjinKwon are looking into this one.
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>


-- 
Ryan Blue
Software Engineer
Netflix


Re: CHAR implementation?

2017-09-15 Thread Dongjoon Hyun
Thank you, Ryan!

Yes, right. If we turn off `spark.sql.hive.convertMetastoreParquet`, Spark
pads with spaces.
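A small sketch of that toggle, assuming a Hive-enabled SparkSession named spark and the t3 Parquet table from the example quoted below; with the conversion off, reads go through the Hive serde and the CHAR(3) value comes back padded:

// Hive serde path for Parquet tables: CHAR padding applied on read.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.sql("SELECT a, length(a) FROM t3").show()  // length 3

// Spark's native Parquet reader: no padding today.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
spark.sql("SELECT a, length(a) FROM t3").show()  // length 2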

For ORC CHAR, it's the same: ORC only handles truncation on write.
The padding is handled on the Hive side in `HiveCharWritable`, via
`HiveBaseChar.java`, on read.
Spark's ORCFileFormat uses HiveCharWritable, so the space is padded whether
`spark.sql.hive.convertMetastoreOrc` is false or true. I was able to test this
in the following PR; previously, it was blocked for another reason.

https://github.com/apache/spark/pull/19235

It seems we may choose between:
- adding the padding logic inside Spark's Parquet reader (sketched roughly below), or
- ignoring it for performance/backward-compatibility reasons.
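For the first option, the read-side padding being discussed amounts to the same thing Hive does on read: pad with spaces up to the declared length. A minimal sketch of the semantics, not Spark's actual reader code:

// Pad a value read from Parquet out to CHAR(n) length; truncation stays a write-side concern.
def padToCharLength(value: String, n: Int): String =
  if (value.length >= n) value else value + " " * (n - value.length)

println(padToCharLength("a ", 3).length)  // 3, matching the t1/t2 results quoted below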

The Data Source V2 read path was merged into master today, and
we will change the code base anyway in Spark 2.3.

Bests,
Dongjoon


On Fri, Sep 15, 2017 at 10:05 AM, Ryan Blue  wrote:

> My guess is that this is because Parquet doesn't have a CHAR type. That
> should be applied to strings by Spark for Parquet.
>
> The reason from Parquet's perspective not to support CHAR is that we have
> no expectation that it is a portable type. Non-SQL writers aren't going to
> pad values with spaces, and it is a terrible idea for Parquet to silently
> alter or truncate data to fit the CHAR type. There's also no performance
> gain from CHAR because multi-byte UTF8 characters prevent us from using a
> fixed-length binary field. The conclusion we came to is that CHAR is a
> SQL-only type and has to be enforced by SQL engines: Spark should pad or
> truncate values, and expect Parquet to faithfully hand back exactly what
> was stored.
>
> If Spark doesn't have logic for this, then it is probably relying on the
> Hive serde to pad the first case. I'm not sure what ORC does, maybe it has
> a native CHAR type.
>
> rb
>
> On Thu, Sep 14, 2017 at 5:31 PM, Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Currently, Spark shows different behavior when we use CHAR types.
>>
>> spark-sql> CREATE TABLE t1(a CHAR(3));
>> spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>> spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>
>> spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>> spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>> spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>
>> spark-sql> SELECT a, length(a) FROM t1;
>> a   3
>> spark-sql> SELECT a, length(a) FROM t2;
>> a   3
>> spark-sql> SELECT a, length(a) FROM t3;
>> a 2
>>
>> The reason I'm asking here is that this has long been the default behavior
>> of Spark's `STORED AS PARQUET`. (Spark 1.6.3, too.)
>>
>> For me, `CREATE TABLE t1(a CHAR(3))` shows the correct behavior in Spark, but
>> Parquet has been the de facto standard in Spark as well. (I'm not comparing this
>> with other DBMSs.)
>>
>> I'm wondering which way we need to go or want to go in Spark?
>>
>> Bests,
>> Dongjoon.
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: CHAR implementation?

2017-09-15 Thread Ryan Blue
My guess is that this is because Parquet doesn't have a CHAR type. That
should be applied to strings by Spark for Parquet.

The reason from Parquet's perspective not to support CHAR is that we have
no expectation that it is a portable type. Non-SQL writers aren't going to
pad values with spaces, and it is a terrible idea for Parquet to silently
alter or truncate data to fit the CHAR type. There's also no performance
gain from CHAR because multi-byte UTF8 characters prevent us from using a
fixed-length binary field. The conclusion we came to is that CHAR is a
SQL-only type and has to be enforced by SQL engines: Spark should pad or
truncate values, and expect Parquet to faithfully hand back exactly what
was stored.

If Spark doesn't have logic for this, then it is probably relying on the
Hive serde to pad the first case. I'm not sure what ORC does, maybe it has
a native CHAR type.

rb

On Thu, Sep 14, 2017 at 5:31 PM, Dongjoon Hyun 
wrote:

> Hi, All.
>
> Currently, Spark shows different behavior when we use CHAR types.
>
> spark-sql> CREATE TABLE t1(a CHAR(3));
> spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
> spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>
> spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
> spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
> spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>
> spark-sql> SELECT a, length(a) FROM t1;
> a   3
> spark-sql> SELECT a, length(a) FROM t2;
> a   3
> spark-sql> SELECT a, length(a) FROM t3;
> a 2
>
> The reason I'm asking here is that this has long been the default behavior
> of Spark's `STORED AS PARQUET`. (Spark 1.6.3, too.)
>
> For me, `CREATE TABLE t1(a CHAR(3))` shows the correct behavior in Spark, but
> Parquet has been the de facto standard in Spark as well. (I'm not comparing this
> with other DBMSs.)
>
> I'm wondering which way we need to go or want to go in Spark?
>
> Bests,
> Dongjoon.
>



-- 
Ryan Blue
Software Engineer
Netflix


Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-15 Thread Denny Lee
+1 (non-binding)

On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung 
wrote:

> +1 tested SparkR package on Windows, r-hub, Ubuntu.
>
> _
> From: Sean Owen 
> Sent: Thursday, September 14, 2017 3:12 PM
> Subject: Re: [VOTE] Spark 2.1.2 (RC1)
> To: Holden Karau , 
>
>
>
> +1
> Very nice. The sigs and hashes look fine, it builds fine for me on Debian
> Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and passes tests.
>
> Yes as you say, no outstanding issues except for this which doesn't look
> critical, as it's not a regression.
>
> SPARK-21985 PySpark PairDeserializer is broken for double-zipped RDDs
>
>
> On Thu, Sep 14, 2017 at 7:47 PM Holden Karau  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.2. The vote is open until Friday September 22nd at 18:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v2.1.2-rc1
>>  (
>> 6f470323a0363656999dd36cb33f528afe627c12)
>>
>> List of JIRA tickets resolved in this release can be found with this
>> filter.
>> 
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1248/
>>
>> The documentation corresponding to this release can be found at:
>> https://people.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install the
>> current RC and see if anything important breaks, in the Java/Scala you can
>> add the staging repository to your projects resolvers and test with the RC
>> (make sure to clean up the artifact cache before/after so you don't end up
>> building with a out of date RC going forward).
>>
>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.1.3.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1. That being said if
>> there is something which is a regression form 2.1.1 that has not been
>> correctly targeted please ping a committer to help target the issue (you
>> can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>> 
>> )
>>
>> *What are the unresolved* issues targeted for 2.1.2
>> 
>> ?
>>
>> At the time of the writing, there is one in progress major issue
>> SPARK-21985 , I
>> believe Andrew Ray & HyukjinKwon are looking into this one.
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>


[Spark Core] Custom Catalog. Integration between Apache Ignite and Apache Spark

2017-09-15 Thread Nikolay Izhikov

Hello, guys.

I’m a contributor to the Apache Ignite project, which is self-described as an
in-memory computing platform.


It has Data Grid features: a distributed, transactional key-value store
[1], distributed SQL support [2], etc. [3]


Currently, I’m working on the integration between Ignite and Spark [4];
I want to add Spark DataFrame API support for Ignite.

Since Ignite is a distributed store, it would be useful to create an
implementation of the Catalog API [5] for Apache Ignite.


I see two ways to implement this feature:

1. Spark can provide an API for any custom catalog implementation. As
far as I can see, there is a ticket for it [6]. It is closed with
resolution “Later”. Is now a suitable time to continue working on that
ticket? How can I help with it?


2. I can provide an implementation of the Catalog and other required
APIs in the form of a pull request to Spark, as was done for Hive
[7]. Would such a pull request be acceptable?


Which way is more convenient for the Spark community?
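For context on option 1: the only catalog switch Spark exposes today is spark.sql.catalogImplementation, which accepts just "in-memory" or "hive"; a pluggable value such as "ignite" is exactly what SPARK-17767 [6] would need to allow. A small illustrative snippet, where the "ignite" value is hypothetical:

import org.apache.spark.sql.SparkSession

// Today only "in-memory" and "hive" are valid here; an Ignite catalog would need
// either the pluggable-catalog API from [6] or a built-in implementation as in [7].
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("catalog-config-sketch")
  .config("spark.sql.catalogImplementation", "in-memory")  // "ignite" is the hypothetical goal
  .getOrCreate()

println(spark.conf.get("spark.sql.catalogImplementation"))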

[1] https://ignite.apache.org/features/datagrid.html
[2] https://ignite.apache.org/features/sql.html
[3] https://ignite.apache.org/features.html
[4] https://issues.apache.org/jira/browse/IGNITE-3084
[5] 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala

[6] https://issues.apache.org/jira/browse/SPARK-17767
[7] 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala





A little Scala 2.12 help

2017-09-15 Thread Sean Owen
I'm working on updating to Scala 2.12 and have hit a compile error that I'm
struggling to design a fix for (one that doesn't modify the API
significantly). If you run "./dev/change-scala-version.sh 2.12" and
compile, you'll see errors like...

[error]
/Users/srowen/Documents/Cloudera/spark/core/src/test/scala/org/apache/spark/FileSuite.scala:100:
could not find implicit value for parameter kcf: () =>
org.apache.spark.WritableConverter[org.apache.hadoop.io.IntWritable]
[error] Error occurred in an application involving default arguments.
[error] val output = sc.sequenceFile[IntWritable, Text](outputDir)

Clearly implicit resolution changed a little bit in 2.12 somehow. I
actually don't recall seeing this error before, so it might be somehow related
to 2.12.3, but I'm not sure.

As you can see, the implicits that have always existed, been imported, and
should apply here don't seem to be found.
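For anyone reproducing this, a minimal sketch of the failing call and an explicit-argument workaround; the context setup and path are illustrative, and the commented-out line is the one that breaks under 2.12:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("seqfile-2.12"))
val outputDir = "/tmp/seqfile-test"  // stand-in for the temp dir FileSuite writes to

// Does not compile under 2.12: the implicit () => WritableConverter[IntWritable] is not found.
// val output = sc.sequenceFile[IntWritable, Text](outputDir)

// Passing the key/value classes explicitly avoids that implicit search entirely.
val output = sc.sequenceFile(outputDir, classOf[IntWritable], classOf[Text])
println(output.count())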

If anyone is a Scala expert and could glance at this, you might help save
me a lot of puzzling.


Re: Easy way to get offset metadata with Spark Streaming API

2017-09-15 Thread Dmitry Naumenko
Nice, thanks again Michael for helping out.

Dmitry

2017-09-14 21:37 GMT+03:00 Michael Armbrust :

> Yep, that is correct.  You can also use the query ID which is a GUID that
> is stored in the checkpoint and preserved across restarts if you want to
> distinguish the batches from different streams.
>
> sqlContext.sparkContext.getLocalProperty(StreamExecution.QUERY_ID_KEY)
>
> This was added recently, though.
>
> On Thu, Sep 14, 2017 at 3:40 AM, Dmitry Naumenko 
> wrote:
>
>> Ok. So since I can get repeated batch ids, I guess I can just store the
>> last committed batch id in my storage (in the same transaction with the
>> data) and initialize the custom sink with right batch id when application
>> re-starts. After this just ignore batch if current batchId <=
>> latestBatchId.
>>
>> Dmitry
>>
>>
>> 2017-09-13 22:12 GMT+03:00 Michael Armbrust :
>>
>>> I think the right way to look at this is the batchId is just a proxy for
>>> offsets that is agnostic to what type of source you are reading from (or
>>> how many sources their are).  We might call into a custom sink with the
>>> same batchId more than once, but it will always contain the same data
>>> (there is no race condition, since this is stored in a write-ahead log).
>>> As long as you check/commit the batch id in the same transaction as the
>>> data you will get exactly once.
>>>
>>> On Wed, Sep 13, 2017 at 1:25 AM, Dmitry Naumenko 
>>> wrote:
>>>
 Thanks, I see.

 However, I guess reading from the checkpoint directory might be less
 efficient compared to just preserving offsets in the Dataset.

 I have one more question about operation idempotence (hoping it helps
 others get a clear picture).

 If I read offsets on restart from the RDBMS and manually specify starting
 offsets on the Kafka source, is it still possible that in case of a failure I
 get a situation where a duplicate batch id goes to a custom Sink?

 Previously, with DStreams, you would just read offsets from storage on start
 and write them into the DB in one transaction with the data, and that was
 enough for "exactly-once". Please correct me if I made a mistake here. So
 will the same strategy work with Structured Streaming?

 I guess, that in case of Structured Streaming, Spark will commit batch
 offset to a checkpoint directory and there can be a race condition where
 you can commit your data with offsets into DB, but Spark will fail to
 commit the batch id, and some kind of automatic retry happen. If this is
 true, is it possible to disable this automatic re-try, so I can still use
 unified API for batch/streaming with my own re-try logic (which is
 basically, just ignore intermediate data, re-read from Kafka and re-try
 processing and load)?

 Dmitry


 2017-09-12 22:43 GMT+03:00 Michael Armbrust :

> In the checkpoint directory there is a file /offsets/$batchId that
> holds the offsets serialized as JSON.  I would not consider this a public
> stable API though.
>
> Really the only important thing to get exactly once is that you must
> ensure whatever operation you are doing downstream is idempotent with
> respect to the batchId.  For example, if you are writing to an RDBMS you
> could have a table that records the batch ID and update that in the same
> transaction as you append the results of the batch.  Before trying to
> append you should check that batch ID and make sure you have not already
> committed.
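A minimal sketch of the idempotent-sink pattern described above, using plain JDBC; the table and column names (batch_commits, events) are made up for illustration, and a real sink would write per partition rather than collect to the driver:

import java.sql.DriverManager
import org.apache.spark.sql.DataFrame

// Commit the batch's rows and its batchId in one transaction; skip batches already recorded.
def writeBatchExactlyOnce(df: DataFrame, batchId: Long, jdbcUrl: String): Unit = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    conn.setAutoCommit(false)
    val check = conn.prepareStatement("SELECT 1 FROM batch_commits WHERE batch_id = ?")
    check.setLong(1, batchId)
    if (!check.executeQuery().next()) {
      val insert = conn.prepareStatement("INSERT INTO events(key, value) VALUES (?, ?)")
      df.collect().foreach { row =>          // collect() keeps the sketch simple
        insert.setString(1, row.getString(0))
        insert.setString(2, row.getString(1))
        insert.executeUpdate()
      }
      val mark = conn.prepareStatement("INSERT INTO batch_commits(batch_id) VALUES (?)")
      mark.setLong(1, batchId)
      mark.executeUpdate()
    }
    conn.commit()  // data and batch id become visible atomically
  } finally {
    conn.close()
  }
}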
>
> On Tue, Sep 12, 2017 at 11:48 AM, Dmitry Naumenko <
> dm.naume...@gmail.com> wrote:
>
>> Thanks for response, Michael
>>
>> >  You should still be able to get exactly once processing by using
>> the batchId that is passed to the Sink.
>>
>> Could you explain this in more detail, please? Is there some kind of
>> offset manager API that works as get-offset by batch id lookup table?
>>
>> Dmitry
>>
>> 2017-09-12 20:29 GMT+03:00 Michael Armbrust :
>>
>>> I think that we are going to have to change the Sink API as part of
>>> SPARK-20928 ,
>>> which is why I linked these tickets together.  I'm still targeting an
>>> initial version for Spark 2.3 which should happen sometime towards the 
>>> end
>>> of the year.
>>>
>>> There are some misconceptions in that stack overflow answer that I
>>> can correct.  Until we improve the Source API, You should still be able 
>>> to
>>> get exactly once processing by using the batchId that is passed to
>>> the Sink. We guarantee that the offsets present at any given batch
>>> ID will be the same