Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-03 Thread Michael Armbrust
I'm going to -1 this given the number of small bug fixes that have gone
into the release branch.  I'll follow with another RC shortly.


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-02 Thread Nick Pentreath
I won't +1, given that it seems certain there will be another RC and there
are still outstanding ML QA blocker issues.

But a clean build and the JVM and Python test suites LGTM on CentOS Linux
7.2.1511, OpenJDK 1.8.0_111.


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Frank Austin Nothaft
Hi Ryan,

IMO, the problem is that the Spark Avro version conflicts with the Parquet
Avro version. As discussed upthread, I don’t think there’s a way to reliably
make sure that Avro 1.8 is on the classpath first while using spark-submit.
Relocating Avro in our project wouldn’t solve the problem, because the
NoSuchMethodError is thrown from the internals of ParquetAvroOutputFormat,
not from code in our project.

Regards,

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Ryan Blue
Michael, I think that the problem is with your classpath.

Spark has a dependency on Avro 1.7.7, which can't be changed. Your project is
what pulls in parquet-avro and transitively Avro 1.8. Spark has no runtime
dependency on Avro 1.8. It is understandably annoying that using the same
version of Parquet for your parquet-avro dependency is what causes your
project to depend on Avro 1.8, but Spark's dependencies aren't a problem
because its Parquet dependency doesn't bring in Avro.

There are a few ways around this:
1. Make sure Avro 1.8 is found in the classpath first
2. Shade Avro 1.8 in your project (assuming Avro classes aren't shared; see
the sketch below)
3. Use parquet-avro 1.8.1 in your project, which I think should work with
1.8.2 and avoid the Avro change
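
A minimal sbt sketch of options 2 and 3 above (illustrative only; it assumes
an sbt-assembly build, the coordinates are the standard Maven ones, and names
like "shadedavro" are hypothetical):

// build.sbt fragment -- a sketch, not a verified configuration.
// Option 3: pin parquet-avro to 1.8.1 so it stays compatible with Avro 1.7.7.
libraryDependencies += "org.apache.parquet" % "parquet-avro" % "1.8.1"

// Option 2: keep parquet-avro 1.8.2 plus Avro 1.8 and relocate Avro inside
// the assembly jar (requires the sbt-assembly plugin), so the application's
// Avro cannot collide with Spark's copy. Only safe if no Avro classes cross
// the application/Spark boundary.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.avro.**" -> "shadedavro.@1").inAll
)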

The work-around in Spark is for tests, which do use parquet-avro. We can
look at a Parquet 1.8.3 that avoids this issue, but I think this is
reasonable for the 2.2.0 release.

rb


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Michael Heuer
Please excuse me if I'm misunderstanding -- the problem is not with our
library or our classpath.

There is a conflict within Spark itself, in that Parquet 1.8.2 expects to
find Avro 1.8.0 on the runtime classpath and sees 1.7.7 instead.  Spark
already has to work around this for unit tests to pass.
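
(A quick way to see which Avro copy actually wins at runtime; a diagnostic
sketch using plain JDK reflection, nothing Spark-specific:)

// Paste into spark-shell: prints the jar that provides Avro's Schema class,
// e.g. .../avro-1.7.7.jar when Spark's copy shadows the application's.
// getCodeSource can be null for bootstrap classes, but not for a jar-loaded
// library like Avro.
val avroJar = classOf[org.apache.avro.Schema]
  .getProtectionDomain.getCodeSource.getLocation
println(avroJar)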



On Mon, May 1, 2017 at 2:00 PM, Ryan Blue  wrote:

> Thanks for the extra context, Frank. I agree that it sounds like your
> problem comes from the conflict between your jars and what comes with
> Spark. It's the same concern that makes everyone shudder when anything has a
> public dependency on Jackson. :)
>
> What we usually do to get around situations like this is to relocate the
> problem library inside the shaded Jar. That way, Spark uses its version of
> Avro and your classes use a different version of Avro. This works if you
> don't need to share classes between the two. Would that work for your
> situation?
>
> rb

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Frank Austin Nothaft
Hi Ryan!

I think relocating the Avro dependency inside Spark would make a lot of
sense. Otherwise, we'd need Spark to move to Avro 1.8.0, or Parquet to cut a
new 1.8.3 release that either reverts to Avro 1.7.7 or eliminates the code
that is binary incompatible between Avro 1.7.7 and 1.8.0.

Regards,

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Koert Kuipers
sounds like you are running into the fact that you cannot really put your
classes before spark's on the classpath? spark's switches to support this
never really worked for me either.

inability to control the classpath + inconsistent jars => trouble?
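
(For reference, a sketch of the switches Koert means, Spark's experimental
user-classpath-first settings; the app name is hypothetical and the caveats
are noted in the comments:)

import org.apache.spark.sql.SparkSession

// Spark's experimental "user classpath first" switches. Note that
// spark.driver.userClassPathFirst only takes effect when supplied at launch
// (spark-submit --conf spark.driver.userClassPathFirst=true); setting it
// here is too late for the driver's own class loader.
val spark = SparkSession.builder()
  .appName("user-classpath-first")
  .config("spark.executor.userClassPathFirst", "true")
  .getOrCreate()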


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Frank Austin Nothaft
Hi Ryan,

We do set Avro to 1.8 in our downstream project. We also set Spark as a
provided dependency, and build an überjar. We run via spark-submit, which
builds the classpath with our überjar and all of the Spark deps. This leads
to Avro 1.7.7 getting picked up off the classpath at runtime, which causes
the NoSuchMethodError to occur.

Regards,

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466
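
(The setup described above, as a minimal sbt sketch; the coordinates are the
standard Maven ones, while the exact versions and layout are illustrative:)

// build.sbt fragment (sketch). Spark is "provided", so it is excluded from
// the assembly jar and supplied by spark-submit's own classpath at runtime --
// which is how Spark's Avro 1.7.7 ends up shadowing the Avro 1.8 requested
// here.
libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-sql"    % "2.2.0" % "provided",
  "org.apache.parquet" %  "parquet-avro" % "1.8.2",
  "org.apache.avro"    %  "avro"         % "1.8.0"
)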

> On May 1, 2017, at 11:31 AM, Ryan Blue  wrote:
> 
> Frank,
> 
> The issue you're running into is caused by using parquet-avro with Avro 1.7. 
> Can't your downstream project set the Avro dependency to 1.8? Spark can't 
> update Avro because it is a breaking change that would force users to rebuild
> specific Avro classes in some cases. But you should be free to use Avro 1.8 
> to avoid the problem.
> 
> On Mon, May 1, 2017 at 11:08 AM, Frank Austin Nothaft wrote:
> Hi Ryan et al,
> 
> The issue we’ve seen using a build of the Spark 2.2.0 branch from a 
> downstream project is that parquet-avro uses one of the new Avro 1.8.0 
> methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a 
> dependency. My colleague Michael (who posted earlier on this thread)
> documented this in SPARK-19697. I know that Spark has unit tests that check
> this compatibility issue, but it looks like there was a recent change that
> sets a test scope dependency on Avro 1.8.0, which masks this issue in the
> unit tests. With this error, you can’t use ParquetAvroOutputFormat from an
> application running on Spark 2.2.0.
> 
> Regards,
> 
> Frank Austin Nothaft
> fnoth...@berkeley.edu 
> fnoth...@eecs.berkeley.edu 
> 202-340-0466 
> 
>> On May 1, 2017, at 10:02 AM, Ryan Blue wrote:
>> 
>> I agree with Sean. Spark only pulls in parquet-avro for tests. For 
>> execution, it implements the record materialization APIs in Parquet to go 
>> directly to Spark SQL rows. This doesn't actually leak an Avro 1.8 
>> dependency into Spark as far as I can tell.
>> 
>> rb

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Sean Owen
See discussion at https://github.com/apache/spark/pull/17163 -- I think the
issue is that fixing this trades one problem for a slightly bigger one.



Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Michael Heuer
Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but does
not bump the dependency version for avro (currently at 1.7.7).  Though
perhaps not clear from the issue I reported [0], this means that Spark is
internally inconsistent, in that a call through parquet (which depends on
avro 1.8.0 [1]) may throw errors at runtime when it hits avro 1.7.7 on the
classpath.  Avro 1.8.0 is not binary compatible with 1.7.7.

[0] - https://issues.apache.org/jira/browse/SPARK-19697
[1] -
https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/pom.xml#L96
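
(Two quick probes of Michael's point, as a spark-shell sketch; the assumption
flagged in the comment, that org.apache.avro.LogicalType first shipped in
Avro 1.8.0, is mine, not from this thread:)

// Probe 1: read the Avro version from the jar manifest (may be absent).
val version = Option(classOf[org.apache.avro.Schema].getPackage)
  .flatMap(p => Option(p.getImplementationVersion))
println(version.getOrElse("unknown")) // e.g. 1.7.7 on stock Spark 2.2.0-rc1

// Probe 2: check whether an Avro 1.8-only class is visible at all.
val hasLogicalType =
  try { Class.forName("org.apache.avro.LogicalType"); true }
  catch { case _: ClassNotFoundException => false }
println(s"Avro 1.8+ classes visible: $hasLogicalType")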



Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-30 Thread Sean Owen
I have one more issue that, if it needs to be fixed, needs to be fixed for
2.2.0.

I'm fixing build warnings for the release and noticed that checkstyle
actually complains there are some Java methods named in TitleCase, like
`ProcessingTimeTimeout`:

https://github.com/apache/spark/pull/17803/files#r113934080

Easy enough to fix, and it's right; that's not conventional. However, I
wonder if it was done on purpose to match a class name?

I think this is one for @tdas



Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-29 Thread Herman van Hövell tot Westerflier
Maciej, this is definitely a bug. I have opened
https://github.com/apache/spark/pull/17810 to fix this. I don't think this
should be a blocker for the release of 2.2; if there is another RC, we will
include it.
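
(For context, a minimal sketch of what explode_outer and posexplode_outer are
supposed to do; the failing case itself is detailed in the JIRA, not
reproduced here:)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode_outer

val spark = SparkSession.builder()
  .master("local[*]").appName("explode-outer-demo").getOrCreate()
import spark.implicits._

// explode_outer keeps rows whose array is empty or null, emitting a null
// element, where plain explode would drop them.
val df = Seq((1, Seq("a", "b")), (2, Seq.empty[String])).toDF("id", "xs")
df.select($"id", explode_outer($"xs")).show()
// expected rows: (1,a), (1,b), (2,null)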

>


-- 
Herman van Hövell
Software Engineer
Databricks Inc.
hvanhov...@databricks.com
+31 6 420 590 27
databricks.com



Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-29 Thread Maciej Szymkiewicz
I am not sure if it is relevant, but explode_outer and posexplode_outer
seem to be broken: SPARK-20534.


On 04/28/2017 12:49 AM, Sean Owen wrote:
> By the way the RC looks good. Sigs and license are OK, tests pass with
> -Phive -Pyarn -Phadoop-2.7. +1 from me.


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-29 Thread Hyukjin Kwon
SPARK-20364 describes a bug, but I am unclear whether we should call it a
regression that blocks a release.

It is something working incorrectly (in some cases, in terms of output), but
this case does not appear to have worked in past releases either.

The current master produces a wrong result in some cases when there are dots
in column names for Parquet, but those cases did not work in past releases
either; they threw exceptions.

So this looks to me like a bug we should definitely fix, but not a
regression.

In more detail, I tested these cases as below:


Spark 1.6.3

val path = "/tmp/foo"
Seq(Tuple1(Some(1)), Tuple1(None)).toDF("col.dots").write.parquet(path)
sqlContext.read.parquet(path).where("`col.dots` IS NOT NULL").show()

java.lang.IllegalArgumentException: Column [col, dots] was not found in schema!
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
...

sqlContext.read.parquet(path).where("`col.dots` IS NULL").show()

java.lang.IllegalArgumentException: Column [col, dots] was not found in schema!
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
...


Spark 2.0.2

val path = "/tmp/foo"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()

java.lang.IllegalArgumentException: Column [col, dots] was not found in schema!
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
...

spark.read.parquet(path).where("`col.dots` IS NULL").show()

java.lang.IllegalArgumentException: Column [col, dots] was not found in schema!
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
...


Spark 2.1.0

val path = "/tmp/foo"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()

java.lang.IllegalArgumentException: Column [col, dots] was not found in schema!
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
...

spark.read.parquet(path).where("`col.dots` IS NULL").show()

java.lang.IllegalArgumentException: Column [col, dots] was not found in schema!
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
...


Spark 2.1.1 RC4

val path = "/tmp/foo"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()

java.lang.IllegalArgumentException: Column [col, dots] was not found in schema!
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
...

spark.read.parquet(path).where("`col.dots` IS NULL").show()

java.lang.IllegalArgumentException: Column [col, dots] was not found in schema!
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
...


Current master

val path = "/tmp/foo"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()

+--------+
|col.dots|
+--------+
+--------+

spark.read.parquet(path).where("`col.dots` IS NULL").show()

+--------+
|col.dots|
+--------+
|    null|
+--------+



Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-28 Thread Koert Kuipers
we have been testing the 2.2.0 snapshots in the last few weeks for in-house
unit tests, integration tests and real workloads, and we are very happy with
it. the only issue i had so far (some encoders not being serializable
anymore) has already been dealt with by wenchen.

On Thu, Apr 27, 2017 at 6:49 PM, Sean Owen  wrote:

> By the way the RC looks good. Sigs and license are OK, tests pass with
> -Phive -Pyarn -Phadoop-2.7. +1 from me.


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-28 Thread Kazuaki Ishizaki
+1 (non-binding)

I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for 
core have passed.

$ java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 
1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
$ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 
package install
$ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core
...
Run completed in 15 minutes, 45 seconds.
Total number of tests run: 1937
Suites: completed 205, aborted 0
Tests: succeeded 1937, failed 0, canceled 4, ignored 8, pending 0
All tests passed.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 17:26 min
[INFO] Finished at: 2017-04-29T02:23:08+09:00
[INFO] Final Memory: 53M/491M
[INFO] ------------------------------------------------------------------------

Kazuaki Ishizaki,



From:   Michael Armbrust <mich...@databricks.com>
To: "dev@spark.apache.org" <dev@spark.apache.org>
Date:   2017/04/28 03:32
Subject:        [VOTE] Apache Spark 2.2.0 (RC1)



Please vote on releasing the following candidate as Apache Spark version 
2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and passes 
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.0-rc1 (
8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1235/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then 
reporting any regressions.

What should happen to JIRA tickets still targeting 2.2.0?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked 
on immediately. Everything else please retarget to 2.3.0 or 2.2.1.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release 
unless the bug in question is a regression from 2.1.1.




Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-28 Thread Koert Kuipers
this is column names containing dots that do not target fields inside
structs? so not a.b as in field b inside struct a, but somehow a field
called a.b? i didn't even know it is supported at all. it's something i would
never try because it sounds like a bad idea to go there...
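
For reference, a minimal sketch of the distinction in question (the column
names and data here are made up for illustration): an unquoted dot is parsed
as struct field access, while backticks select a top-level column whose name
literally contains a dot.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// one struct column "a" (fields _1, _2) and one column literally named "a.b"
val df = Seq((1, (10, 20))).toDF("a.b", "a")

df.select(col("a._2")).show()   // unquoted dot: field _2 inside struct "a"
df.select(col("`a.b`")).show()  // backticks: the top-level column named "a.b"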

On Fri, Apr 28, 2017 at 12:17 PM, Andrew Ash  wrote:

> -1 due to regression from 2.1.1
>
> In 2.2.0-rc1 we bumped the Parquet version from 1.8.1 to 1.8.2 in commit
> 26a4cba3ff. Parquet 1.8.2 includes a backport from 1.9.0: PARQUET-389, in
> commit 2282c22c.
>
> This backport caused a regression in Spark, where filtering on columns
> containing dots in the column name pushes the filter down into Parquet
> where Parquet incorrectly handles the predicate.  Spark pushes the String
> "col.dots" as the column name, but Parquet interprets this as
> "struct.field" where the predicate is on a field of a struct.  The ultimate
> result is that the predicate always returns zero results, causing a data
> correctness issue.
>
> This issue is filed in Spark as SPARK-20364 and has a fix up at PR #17680.
>
> I nominate SPARK-20364 as a release blocker due to the data correctness
> regression.
>
> Thanks!
> Andrew
>
> On Thu, Apr 27, 2017 at 6:49 PM, Sean Owen  wrote:
>
>> By the way the RC looks good. Sigs and license are OK, tests pass with
>> -Phive -Pyarn -Phadoop-2.7. +1 from me.
>>
>> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST
>>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.2.0-rc1 (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
>>>
>>> List of JIRA tickets resolved can be found with this filter.
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1235/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1.
>>>
>>
>


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-28 Thread Andrew Ash
-1 due to regression from 2.1.1

In 2.2.0-rc1 we bumped the Parquet version from 1.8.1 to 1.8.2 in commit
26a4cba3ff. Parquet 1.8.2 includes a backport from 1.9.0: PARQUET-389, in
commit 2282c22c.

This backport caused a regression in Spark, where filtering on columns
containing dots in the column name pushes the filter down into Parquet
where Parquet incorrectly handles the predicate.  Spark pushes the String
"col.dots" as the column name, but Parquet interprets this as
"struct.field" where the predicate is on a field of a struct.  The ultimate
result is that the predicate always returns zero results, causing a data
correctness issue.
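
A minimal sketch of the failure mode (the path and column name here are
illustrative, not taken from the report): write a Parquet file with a dotted
column name, then filter on that column with backticks; with the mishandled
pushdown the count comes back 0 even though a matching row exists.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val path = "/tmp/spark-20364-repro"  // illustrative path
Seq((1, "a"), (2, "b")).toDF("col.dots", "v").write.mode("overwrite").parquet(path)

// Backticks name the literal column "col.dots"; the predicate pushed down to
// Parquet is misread as a filter on field "dots" of a struct "col".
val n = spark.read.parquet(path).filter($"`col.dots`" === 1).count()
println(s"matched $n rows")  // expect 1; the regression returns 0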

This issue is filed in Spark as SPARK-20364 and has a fix up at PR #17680.

I nominate SPARK-20364 as a release blocker due to the data correctness
regression.

Thanks!
Andrew

On Thu, Apr 27, 2017 at 6:49 PM, Sean Owen  wrote:

> By the way the RC looks good. Sigs and license are OK, tests pass with
> -Phive -Pyarn -Phadoop-2.7. +1 from me.
>
> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.2.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.2.0-rc1 (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
>>
>> List of JIRA tickets resolved can be found with this filter.
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1235/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1.
>>
>


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Sean Owen
By the way the RC looks good. Sigs and license are OK, tests pass with
-Phive -Pyarn -Phadoop-2.7. +1 from me.

On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc1 (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
>
> List of JIRA tickets resolved can be found with this filter.
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1235/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1.
>


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Joseph Bradley
That's very fair.

For my part, I should have been faster to make these JIRAs and get critical
dev community QA started when the branch was cut last week.

On Thu, Apr 27, 2017 at 2:59 PM, Sean Owen  wrote:

> That makes sense, but we have an RC, not just a branch. I think we've
> followed the pattern in http://spark.apache.org/versioning-policy.html in
> the past. This generally comes before an RC, right, because until
> everything that Must Happen before a release has happened, someone's saying
> the RC can't possibly pass. I get it, in practice, this is an "RC0" that
> can't pass (unless somehow these issues result in zero changes) and there's
> value in that anyway. Just want to see if we're on the same page about
> process, maybe even just say this is how we manage releases, with "RCs"
> starting before QA ends.
>
> On Thu, Apr 27, 2017 at 10:36 PM Joseph Bradley 
> wrote:
>
>> This is the same thing as ever for MLlib: Once a branch has been cut, we
>> stop merging features.  Now that features are not being merged, we can
>> begin QA.  I strongly prefer to track QA work in JIRA and to have those
>> items targeted for 2.2.  I also believe that certain QA tasks should be
>> blockers; e.g., if we have not checked for binary or Java compatibility
>> issues in new APIs, then I am not comfortable signing off on a release.  I
>> agree with Michael that these don't block testing on a release; the point
>> of these issues is to do testing.
>>
>> I'll close the roadmap JIRA though.
>>
>> On Thu, Apr 27, 2017 at 1:49 PM, Michael Armbrust > > wrote:
>>
>>> All of those look like QA or documentation, which I don't think needs to
>>> block testing on an RC (and in fact probably needs an RC to test?).
>>> Joseph, please correct me if I'm wrong.  It is unlikely this first RC is
>>> going to pass, but I wanted to get the ball rolling on testing 2.2.
>>>
>>> On Thu, Apr 27, 2017 at 1:45 PM, Sean Owen  wrote:
>>>
 These are still blockers for 2.2:

 SPARK-20501 ML, Graph 2.2 QA: API: New Scala APIs, docs
 SPARK-20504 ML 2.2 QA: API: Java compatibility, docs
 SPARK-20503 ML 2.2 QA: API: Python API coverage
 SPARK-20502 ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final,
 sealed audit
 SPARK-20500 ML, Graph 2.2 QA: API: Binary incompatible changes
 SPARK-18813 MLlib 2.2 Roadmap

 Joseph you opened most of these just now. Is this an "RC0" we know
 won't pass? or, wouldn't we normally cut an RC after those things are 
 ready?

 On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust <
 mich...@databricks.com> wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00
> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc1 (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
>
> List of JIRA tickets resolved can be found with this filter.
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1235/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should be
> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from 2.1.1.
>

>>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Sean Owen
That makes sense, but we have an RC, not just a branch. I think we've
followed the pattern in http://spark.apache.org/versioning-policy.html in
the past. This generally comes before an RC, right, because until
everything that Must Happen before a release has happened, someone's saying
the RC can't possibly pass. I get it, in practice, this is an "RC0" that
can't pass (unless somehow these issues result in zero changes) and there's
value in that anyway. Just want to see if we're on the same page about
process, maybe even just say this is how we manage releases, with "RCs"
starting before QA ends.

On Thu, Apr 27, 2017 at 10:36 PM Joseph Bradley 
wrote:

> This is the same thing as ever for MLlib: Once a branch has been cut, we
> stop merging features.  Now that features are not being merged, we can
> begin QA.  I strongly prefer to track QA work in JIRA and to have those
> items targeted for 2.2.  I also believe that certain QA tasks should be
> blockers; e.g., if we have not checked for binary or Java compatibility
> issues in new APIs, then I am not comfortable signing off on a release.  I
> agree with Michael that these don't block testing on a release; the point
> of these issues is to do testing.
>
> I'll close the roadmap JIRA though.
>
> On Thu, Apr 27, 2017 at 1:49 PM, Michael Armbrust 
> wrote:
>
>> All of those look like QA or documentation, which I don't think needs to
>> block testing on an RC (and in fact probably needs an RC to test?).
>> Joseph, please correct me if I'm wrong.  It is unlikely this first RC is
>> going to pass, but I wanted to get the ball rolling on testing 2.2.
>>
>> On Thu, Apr 27, 2017 at 1:45 PM, Sean Owen  wrote:
>>
>>> These are still blockers for 2.2:
>>>
>>> SPARK-20501 ML, Graph 2.2 QA: API: New Scala APIs, docs
>>> SPARK-20504 ML 2.2 QA: API: Java compatibility, docs
>>> SPARK-20503 ML 2.2 QA: API: Python API coverage
>>> SPARK-20502 ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final,
>>> sealed audit
>>> SPARK-20500 ML, Graph 2.2 QA: API: Binary incompatible changes
>>> SPARK-18813 MLlib 2.2 Roadmap
>>>
>>> Joseph you opened most of these just now. Is this an "RC0" we know won't
>>> pass? or, wouldn't we normally cut an RC after those things are ready?
>>>
>>> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST
 and passes if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 2.2.0
 [ ] -1 Do not release this package because ...


 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v2.2.0-rc1 (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)

 List of JIRA tickets resolved can be found with this filter.

 The release files, including signatures, digests, etc. can be found at:
 http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1235/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/


 *FAQ*

 *How can I help test this release?*

 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload and running on this release candidate, then
 reporting any regressions.

 *What should happen to JIRA tickets still targeting 2.2.0?*

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should be
 worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.

 *But my bug isn't fixed!??!*

 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from 2.1.1.

>>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
>


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Joseph Bradley
This is the same thing as ever for MLlib: Once a branch has been cut, we
stop merging features.  Now that features are not being merged, we can
begin QA.  I strongly prefer to track QA work in JIRA and to have those
items targeted for 2.2.  I also believe that certain QA tasks should be
blockers; e.g., if we have not checked for binary or Java compatibility
issues in new APIs, then I am not comfortable signing off on a release.  I
agree with Michael that these don't block testing on a release; the point
of these issues is to do testing.

I'll close the roadmap JIRA though.

On Thu, Apr 27, 2017 at 1:49 PM, Michael Armbrust 
wrote:

> All of those look like QA or documentation, which I don't think needs to
> block testing on an RC (and in fact probably needs an RC to test?).
> Joseph, please correct me if I'm wrong.  It is unlikely this first RC is
> going to pass, but I wanted to get the ball rolling on testing 2.2.
>
> On Thu, Apr 27, 2017 at 1:45 PM, Sean Owen  wrote:
>
>> These are still blockers for 2.2:
>>
>> SPARK-20501 ML, Graph 2.2 QA: API: New Scala APIs, docs
>> SPARK-20504 ML 2.2 QA: API: Java compatibility, docs
>> SPARK-20503 ML 2.2 QA: API: Python API coverage
>> SPARK-20502 ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final,
>> sealed audit
>> SPARK-20500 ML, Graph 2.2 QA: API: Binary incompatible changes
>> SPARK-18813 MLlib 2.2 Roadmap
>>
>> Joseph you opened most of these just now. Is this an "RC0" we know won't
>> pass? or, wouldn't we normally cut an RC after those things are ready?
>>
>> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST
>>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.2.0-rc1 (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
>>>
>>> List of JIRA tickets resolved can be found with this filter.
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1235/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1.
>>>
>>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.



Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Michael Armbrust
All of those look like QA or documentation, which I don't think needs to
block testing on an RC (and in fact probably needs an RC to test?).
Joseph, please correct me if I'm wrong.  It is unlikely this first RC is
going to pass, but I wanted to get the ball rolling on testing 2.2.

On Thu, Apr 27, 2017 at 1:45 PM, Sean Owen  wrote:

> These are still blockers for 2.2:
>
> SPARK-20501 ML, Graph 2.2 QA: API: New Scala APIs, docs
> SPARK-20504 ML 2.2 QA: API: Java compatibility, docs
> SPARK-20503 ML 2.2 QA: API: Python API coverage
> SPARK-20502 ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final,
> sealed audit
> SPARK-20500 ML, Graph 2.2 QA: API: Binary incompatible changes
> SPARK-18813 MLlib 2.2 Roadmap
>
> Joseph you opened most of these just now. Is this an "RC0" we know won't
> pass? or, wouldn't we normally cut an RC after those things are ready?
>
> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.2.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.2.0-rc1 (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
>>
>> List of JIRA tickets resolved can be found with this filter.
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1235/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1.
>>
>


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Sean Owen
These are still blockers for 2.2:

SPARK-20501 ML, Graph 2.2 QA: API: New Scala APIs, docs
SPARK-20504 ML 2.2 QA: API: Java compatibility, docs
SPARK-20503 ML 2.2 QA: API: Python API coverage
SPARK-20502 ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final,
sealed audit
SPARK-20500 ML, Graph 2.2 QA: API: Binary incompatible changes
SPARK-18813 MLlib 2.2 Roadmap

Joseph you opened most of these just now. Is this an "RC0" we know won't
pass? or, wouldn't we normally cut an RC after those things are ready?

On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc1 (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
>
> List of JIRA tickets resolved can be found with this filter.
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1235/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1.
>


[VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.0-rc1 (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1235/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.
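
For an sbt-based workload, a minimal build.sbt sketch for compiling against
the staged RC artifacts (the resolver URL comes from this thread; the module
list and Scala version are assumptions to adjust for your own job):

scalaVersion := "2.11.8"  // Spark 2.2.x targets Scala 2.11

resolvers += "Apache Spark 2.2.0 RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1235/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"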

*What should happen to JIRA tickets still targeting 2.2.0?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.1.