Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-25 Thread Wenchen Fan
Personally, I don't think it matters. Users can build arbitrary
expressions/plans themselves with the internal API, and we never guarantee the
result in that case.

Removing these functions from the function registry is a small patch and
easy to review, and to me it's better than a 1000+ LOC patch that removes
the whole thing.

Again, I don't have a strong opinion here. I'm OK with removing the entire thing
if a PR is ready and well reviewed.
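
(Illustration only -- a plain-Scala sketch, not Spark's actual FunctionRegistry
API, of why the registry-only removal is a small patch: only the name-to-builder
entries go away, while the expression implementations stay in the jar and remain
reachable through the internal API, which is exactly what the recovery snippet
later in this thread relies on. All names below are placeholders.)

type Builder = Seq[Int] => Int                        // stand-in for an expression builder

val registry = scala.collection.mutable.Map[String, Builder](
  "map_filter"              -> (args => args.head),   // placeholder builders
  "some_other_new_function" -> (args => args.last)
)

// "Remove from the registry": a handful of deletions; nothing else changes.
registry -= "map_filter"
registry -= "some_other_new_function"

// The builder implementations still exist, so anyone using the internal API
// can re-register them, as the createOrReplaceTempFunction snippet below shows.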

On Thu, Oct 25, 2018 at 11:00 PM Dongjoon Hyun 
wrote:

> Thank you for the decision, All.
>
> As of now, to unblock this, it seems that we are trying to remove them
> from the function registry.
>
> https://github.com/apache/spark/pull/22821
>
> One problem here is that users can simply recover those functions like
> this:
>
> scala> 
> spark.sessionState.functionRegistry.createOrReplaceTempFunction("map_filter", 
> x => org.apache.spark.sql.catalyst.expressions.MapFilter(x(0),x(1)))
>
>
> Technically, the PR looks like a compromise: it unblocks the release while
> still letting some users use that feature fully.
>
> At first glance, I thought this was a workaround that ignores the context of
> the discussion. But it sounds like one of the practical ways forward for Apache Spark.
> (We had Spark 2.0 Tech. Preview before.)
>
> I want to finalize the decision on the `map_filter` (and the three related
> functions) issue. Are we good to go with
> https://github.com/apache/spark/pull/22821?
>
> Bests,
> Dongjoon.
>
> PS. There is also a PR that removes them completely:
>https://github.com/cloud-fan/spark/pull/11
>
>
> On Wed, Oct 24, 2018 at 10:14 PM Xiao Li  wrote:
>
>> @Dongjoon Hyun   Thanks! This is a blocking
>> ticket. It returns a wrong result due to our undefined behavior. I agree we
>> should revert the newly added map-oriented functions. In the 3.0 release, we
>> need to define the behavior of duplicate keys in the data type MAP and fix
>> all the related issues that are confusing to our end users.
>>
>> Thanks,
>>
>> Xiao
>>
>> On Wed, Oct 24, 2018 at 9:54 PM Wenchen Fan  wrote:
>>
>>> Ah, now I see the problem. `map_filter` has a very weird semantic that is
>>> neither "earlier entry wins" nor "later entry wins".
>>>
>>> I've opened https://github.com/apache/spark/pull/22821 to remove these
>>> newly added map-related functions from FunctionRegistry (for 2.4.0), so
>>> that they are invisible to end users and the weird behavior of the Spark
>>> map type with duplicated keys is not escalated. We should fix it ASAP in
>>> the master branch.
>>>
>>> If others are OK with it, I'll start a new RC after that PR is merged.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Thu, Oct 25, 2018 at 10:32 AM Dongjoon Hyun 
>>> wrote:
>>>
 For the first question, it's the `bin/spark-sql` result. I didn't check
 STS, but it will return the same result as `bin/spark-sql`.

 > I think map_filter is implemented correctly. map(1,2,1,3) is
 actually map(1,2) according to the "earlier entry wins" semantic. I
 don't think this will change in 2.4.1.

 For the second one, `map_filter` issue is not about `earlier entry
 wins` stuff. Please see the following example.

 spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT
 map_concat(map(1,2), map(1,3)) m);
 {1:3} {1:2}

 spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT
 map_concat(map(1,2), map(1,3)) m);
 {1:3} {1:3}

 spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT
 map_concat(map(1,2), map(1,3)) m);
 {1:3} {}

 In other words, in terms of the output, `map_filter` behaves like a filter
 pushed down to the map's raw entries, while users assumed that `map_filter`
 works on top of the final value of `m`.

 This is a function semantic issue.


 On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan 
 wrote:

> > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> > {1:3}
>
> Are you running in the thrift-server? Then maybe this is caused by the
> bug in `Dataset.collect` as I mentioned above.
>
> I think map_filter is implemented correctly. map(1,2,1,3) is actually
> map(1,2) according to the "earlier entry wins" semantic. I don't
> think this will change in 2.4.1.
>
> On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for the follow-ups.
>>
>> Then, Spark 2.4.1 will return `{1:2}` differently from the followings
>> (including Spark/Scala) in the end?
>>
>> I hoped to fix the `map_filter`, but now Spark looks inconsistent in
>> many ways.
>>
>> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
>> +---------------+
>> |map(1, 2, 1, 3)|
>> +---------------+
>> |    Map(1 -> 3)|
>> +---------------+
>>
>>
>> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>> {1:3}
>>
>>
>> hive> select map(1,2,1,3);  // Hive 1.2.2
>> OK
>> {1:3}
>>
>>
>> presto> SELECT 

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-25 Thread Dongjoon Hyun
Thank you for the decision, All.

As of now, to unblock this, it seems that we are trying to remove them from
the function registry.

https://github.com/apache/spark/pull/22821

One problem here is that users can simply recover those functions like this:

scala> 
spark.sessionState.functionRegistry.createOrReplaceTempFunction("map_filter",
x => org.apache.spark.sql.catalyst.expressions.MapFilter(x(0),x(1)))


Technically, the PR looks like a compromise: it unblocks the release while
still letting some users use that feature fully.

At first glance, I thought this was a workaround that ignores the context of
the discussion. But it sounds like one of the practical ways forward for Apache Spark.
(We had Spark 2.0 Tech. Preview before.)

I want to finalize the decision on the `map_filter` (and the three related
functions) issue. Are we good to go with
https://github.com/apache/spark/pull/22821?

Bests,
Dongjoon.

PS. There is also a PR that removes them completely:
   https://github.com/cloud-fan/spark/pull/11


On Wed, Oct 24, 2018 at 10:14 PM Xiao Li  wrote:

> @Dongjoon Hyun   Thanks! This is a blocking
> ticket. It returns a wrong result due to our undefined behavior. I agree we
> should revert the newly added map-oriented functions. In the 3.0 release, we
> need to define the behavior of duplicate keys in the data type MAP and fix
> all the related issues that are confusing to our end users.
>
> Thanks,
>
> Xiao
>
> On Wed, Oct 24, 2018 at 9:54 PM Wenchen Fan  wrote:
>
>> Ah, now I see the problem. `map_filter` has a very weird semantic that is
>> neither "earlier entry wins" nor "later entry wins".
>>
>> I've opened https://github.com/apache/spark/pull/22821 to remove these
>> newly added map-related functions from FunctionRegistry (for 2.4.0), so that
>> they are invisible to end users and the weird behavior of the Spark map type
>> with duplicated keys is not escalated. We should fix it ASAP in the master
>> branch.
>>
>> If others are OK with it, I'll start a new RC after that PR is merged.
>>
>> Thanks,
>> Wenchen
>>
>> On Thu, Oct 25, 2018 at 10:32 AM Dongjoon Hyun 
>> wrote:
>>
>>> For the first question, it's the `bin/spark-sql` result. I didn't check STS,
>>> but it will return the same result as `bin/spark-sql`.
>>>
>>> > I think map_filter is implemented correctly. map(1,2,1,3) is actually
>>> map(1,2) according to the "earlier entry wins" semantic. I don't think
>>> this will change in 2.4.1.
>>>
>>> For the second one, `map_filter` issue is not about `earlier entry wins`
>>> stuff. Please see the following example.
>>>
>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT
>>> map_concat(map(1,2), map(1,3)) m);
>>> {1:3} {1:2}
>>>
>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT
>>> map_concat(map(1,2), map(1,3)) m);
>>> {1:3} {1:3}
>>>
>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT
>>> map_concat(map(1,2), map(1,3)) m);
>>> {1:3} {}
>>>
>>> In other words, in terms of the output, `map_filter` behaves like a filter
>>> pushed down to the map's raw entries, while users assumed that `map_filter`
>>> works on top of the final value of `m`.
>>>
>>> This is a function semantic issue.
>>>
>>>
>>> On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan  wrote:
>>>
 > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
 > {1:3}

 Are you running in the thrift-server? Then maybe this is caused by the
 bug in `Dataset.collect` as I mentioned above.

 I think map_filter is implemented correctly. map(1,2,1,3) is actually
 map(1,2) according to the "earlier entry wins" semantic. I don't think
 this will change in 2.4.1.

 On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun 
 wrote:

> Thank you for the follow-ups.
>
> Then, Spark 2.4.1 will return `{1:2}` differently from the followings
> (including Spark/Scala) in the end?
>
> I hoped to fix the `map_filter`, but now Spark looks inconsistent in
> many ways.
>
> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
> +---+
> |map(1, 2, 1, 3)|
> +---+
> |Map(1 -> 3)|
> +---+
>
>
> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> {1:3}
>
>
> hive> select map(1,2,1,3);  // Hive 1.2.2
> OK
> {1:3}
>
>
> presto> SELECT map_concat(map(array[1],array[2]),
> map(array[1],array[3])); // Presto 0.212
>  _col0
> ---
>  {1=3}
>
>
> Bests,
> Dongjoon.
>
>
> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan 
> wrote:
>
>> Hi Dongjoon,
>>
>> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>>
>> The problem is not about the function `map_filter`, but about how the
>> map type values are created in Spark, when there are duplicated keys.
>>
>> In programming languages like Java/Scala, when creating map, the
>> later entry wins. e.g. in scala
>> scala> Map(1 -> 2, 1 -> 3)
>> 

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Xiao Li
@Dongjoon Hyun   Thanks! This is a blocking
ticket. It returns a wrong result due to our undefined behavior. I agree we
should revert the newly added map-oriented functions. In the 3.0 release, we
need to define the behavior of duplicate keys in the data type MAP and fix
all the related issues that are confusing to our end users.

Thanks,

Xiao

On Wed, Oct 24, 2018 at 9:54 PM Wenchen Fan  wrote:

> Ah, now I see the problem. `map_filter` has a very weird semantic that is
> neither "earlier entry wins" nor "later entry wins".
>
> I've opened https://github.com/apache/spark/pull/22821 to remove these
> newly added map-related functions from FunctionRegistry (for 2.4.0), so that
> they are invisible to end users and the weird behavior of the Spark map type
> with duplicated keys is not escalated. We should fix it ASAP in the master
> branch.
>
> If others are OK with it, I'll start a new RC after that PR is merged.
>
> Thanks,
> Wenchen
>
> On Thu, Oct 25, 2018 at 10:32 AM Dongjoon Hyun 
> wrote:
>
>> For the first question, it's the `bin/spark-sql` result. I didn't check STS,
>> but it will return the same result as `bin/spark-sql`.
>>
>> > I think map_filter is implemented correctly. map(1,2,1,3) is actually
>> map(1,2) according to the "earlier entry wins" semantic. I don't think
>> this will change in 2.4.1.
>>
>> For the second one, `map_filter` issue is not about `earlier entry wins`
>> stuff. Please see the following example.
>>
>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT
>> map_concat(map(1,2), map(1,3)) m);
>> {1:3} {1:2}
>>
>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT
>> map_concat(map(1,2), map(1,3)) m);
>> {1:3} {1:3}
>>
>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT
>> map_concat(map(1,2), map(1,3)) m);
>> {1:3} {}
>>
>> In other words, in terms of the output, `map_filter` behaves like a filter
>> pushed down to the map's raw entries, while users assumed that `map_filter`
>> works on top of the final value of `m`.
>>
>> This is a function semantic issue.
>>
>>
>> On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan  wrote:
>>
>>> > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>>> > {1:3}
>>>
>>> Are you running in the thrift-server? Then maybe this is caused by the
>>> bug in `Dateset.collect` as I mentioned above.
>>>
>>> I think map_filter is implemented correctly. map(1,2,1,3) is actually
>>> map(1,2) according to the "earlier entry wins" semantic. I don't think
>>> this will change in 2.4.1.
>>>
>>> On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun 
>>> wrote:
>>>
 Thank you for the follow-ups.

 Then, Spark 2.4.1 will return `{1:2}` differently from the followings
 (including Spark/Scala) in the end?

 I hoped to fix the `map_filter`, but now Spark looks inconsistent in
 many ways.

 scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
 +---+
 |map(1, 2, 1, 3)|
 +---+
 |Map(1 -> 3)|
 +---+


 spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
 {1:3}


 hive> select map(1,2,1,3);  // Hive 1.2.2
 OK
 {1:3}


 presto> SELECT map_concat(map(array[1],array[2]),
 map(array[1],array[3])); // Presto 0.212
  _col0
 ---
  {1=3}


 Bests,
 Dongjoon.


 On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan 
 wrote:

> Hi Dongjoon,
>
> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>
> The problem is not about the function `map_filter`, but about how the
> map type values are created in Spark, when there are duplicated keys.
>
> In programming languages like Java/Scala, when creating map, the later
> entry wins. e.g. in scala
> scala> Map(1 -> 2, 1 -> 3)
> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>
> scala> Map(1 -> 2, 1 -> 3).get(1)
> res1: Option[Int] = Some(3)
>
> However, in Spark, the earlier entry wins
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
>
> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2)
> .
>
> But there are several bugs in Spark
>
> scala> sql("SELECT map(1,2,1,3)").show
> ++
> | map(1, 2, 1, 3)|
> ++
> |[1 -> 2, 1 -> 3]|
> ++
> The displayed string of map values has a bug and we should deduplicate
> the entries. This is tracked by SPARK-25824.
>
>
> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
> res11: org.apache.spark.sql.DataFrame = []
>
> scala> sql("select * from t").show
> ++
> | map|
> ++
> |[1 -> 3]|
> ++
> The Hive map value converter has a bug; we should respect the "earlier
> entry wins" semantic. No ticket yet.

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Wenchen Fan
Ah, now I see the problem. `map_filter` has a very weird semantic that is
neither "earlier entry wins" nor "later entry wins".

I've opened https://github.com/apache/spark/pull/22821 to remove these
newly added map-related functions from FunctionRegistry (for 2.4.0), so that
they are invisible to end users and the weird behavior of the Spark map type
with duplicated keys is not escalated. We should fix it ASAP in the master
branch.

If others are OK with it, I'll start a new RC after that PR is merged.

Thanks,
Wenchen

On Thu, Oct 25, 2018 at 10:32 AM Dongjoon Hyun 
wrote:

> For the first question, it's the `bin/spark-sql` result. I didn't check STS,
> but it will return the same result as `bin/spark-sql`.
>
> > I think map_filter is implemented correctly. map(1,2,1,3) is actually
> map(1,2) according to the "earlier entry wins" semantic. I don't think
> this will change in 2.4.1.
>
> For the second one, `map_filter` issue is not about `earlier entry wins`
> stuff. Please see the following example.
>
> spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT
> map_concat(map(1,2), map(1,3)) m);
> {1:3} {1:2}
>
> spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT
> map_concat(map(1,2), map(1,3)) m);
> {1:3} {1:3}
>
> spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT
> map_concat(map(1,2), map(1,3)) m);
> {1:3} {}
>
> In other words, in terms of the output, `map_filter` behaves like a filter
> pushed down to the map's raw entries, while users assumed that `map_filter`
> works on top of the final value of `m`.
>
> This is a function semantic issue.
>
>
> On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan  wrote:
>
>> > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>> > {1:3}
>>
>> Are you running in the thrift-server? Then maybe this is caused by the
>> bug in `Dataset.collect` as I mentioned above.
>>
>> I think map_filter is implemented correctly. map(1,2,1,3) is actually
>> map(1,2) according to the "earlier entry wins" semantic. I don't think
>> this will change in 2.4.1.
>>
>> On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for the follow-ups.
>>>
>>> Then, Spark 2.4.1 will return `{1:2}` differently from the followings
>>> (including Spark/Scala) in the end?
>>>
>>> I hoped to fix the `map_filter`, but now Spark looks inconsistent in
>>> many ways.
>>>
>>> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
>>> +---+
>>> |map(1, 2, 1, 3)|
>>> +---+
>>> |Map(1 -> 3)|
>>> +---+
>>>
>>>
>>> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>>> {1:3}
>>>
>>>
>>> hive> select map(1,2,1,3);  // Hive 1.2.2
>>> OK
>>> {1:3}
>>>
>>>
>>> presto> SELECT map_concat(map(array[1],array[2]),
>>> map(array[1],array[3])); // Presto 0.212
>>>  _col0
>>> ---
>>>  {1=3}
>>>
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan  wrote:
>>>
 Hi Dongjoon,

 Thanks for reporting it! This is indeed a bug that needs to be fixed.

 The problem is not about the function `map_filter`, but about how the
 map type values are created in Spark, when there are duplicated keys.

 In programming languages like Java/Scala, when creating map, the later
 entry wins. e.g. in scala
 scala> Map(1 -> 2, 1 -> 3)
 res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)

 scala> Map(1 -> 2, 1 -> 3).get(1)
 res1: Option[Int] = Some(3)

 However, in Spark, the earlier entry wins
 scala> sql("SELECT map(1,2,1,3)[1]").show
 +--+
 |map(1, 2, 1, 3)[1]|
 +--+
 | 2|
 +--+

 So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).

 But there are several bugs in Spark

 scala> sql("SELECT map(1,2,1,3)").show
 ++
 | map(1, 2, 1, 3)|
 ++
 |[1 -> 2, 1 -> 3]|
 ++
 The displayed string of map values has a bug and we should deduplicate
 the entries. This is tracked by SPARK-25824.


 scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
 res11: org.apache.spark.sql.DataFrame = []

 scala> sql("select * from t").show
 ++
 | map|
 ++
 |[1 -> 3]|
 ++
 The Hive map value converter has a bug; we should respect the "earlier
 entry wins" semantic. No ticket yet.


 scala> sql("select map(1,2,1,3)").collect
 res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
 Same bug happens at `collect`. No ticket yet.

 I'll create tickets and list all of them as known issues in 2.4.0.

 It's arguable if the "earlier entry wins" semantic is reasonable.
 Fixing it is a behavior change and we can only apply it to master branch.

 Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's
 just a symptom of the hive map value converter bug. I think it's a

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Dongjoon Hyun
For the first question, it's the `bin/spark-sql` result. I didn't check STS,
but it will return the same result as `bin/spark-sql`.

> I think map_filter is implemented correctly. map(1,2,1,3) is actually
map(1,2) according to the "earlier entry wins" semantic. I don't think this
will change in 2.4.1.

For the second one, `map_filter` issue is not about `earlier entry wins`
stuff. Please see the following example.

spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT
map_concat(map(1,2), map(1,3)) m);
{1:3} {1:2}

spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT
map_concat(map(1,2), map(1,3)) m);
{1:3} {1:3}

spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT
map_concat(map(1,2), map(1,3)) m);
{1:3} {}

In other words, in terms of the output, `map_filter` behaves like a filter
pushed down to the map's raw entries, while users assumed that `map_filter`
works on top of the final value of `m`.

This is a function semantic issue.
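
(Illustration only -- the discrepancy above can be reproduced with plain Scala
collections; this is a sketch of the two interpretations, not Spark's
implementation, using the entries from the map_concat example.)

val entries = Seq(1 -> 2, 1 -> 3)   // raw entries of map_concat(map(1,2), map(1,3))

// What users expect: filter the materialized value of `m`. Taking the displayed
// value {1:3} (later entry kept), filtering for v = 2 leaves nothing:
val expected = entries.toMap.filter { case (_, v) => v == 2 }   // Map() -- empty, i.e. {}

// What 2.4.0 RC4 effectively does: filter the raw entries first, then build the map:
val actual = entries.filter { case (_, v) => v == 2 }.toMap     // Map(1 -> 2), i.e. {1:2}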


On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan  wrote:

> > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> > {1:3}
>
> Are you running in the thrift-server? Then maybe this is caused by the bug
> in `Dataset.collect` as I mentioned above.
>
> I think map_filter is implemented correctly. map(1,2,1,3) is actually
> map(1,2) according to the "earlier entry wins" semantic. I don't think
> this will change in 2.4.1.
>
> On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for the follow-ups.
>>
>> Then, Spark 2.4.1 will return `{1:2}` differently from the followings
>> (including Spark/Scala) in the end?
>>
>> I hoped to fix the `map_filter`, but now Spark looks inconsistent in many
>> ways.
>>
>> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
>> +---+
>> |map(1, 2, 1, 3)|
>> +---+
>> |Map(1 -> 3)|
>> +---+
>>
>>
>> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>> {1:3}
>>
>>
>> hive> select map(1,2,1,3);  // Hive 1.2.2
>> OK
>> {1:3}
>>
>>
>> presto> SELECT map_concat(map(array[1],array[2]),
>> map(array[1],array[3])); // Presto 0.212
>>  _col0
>> ---
>>  {1=3}
>>
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan  wrote:
>>
>>> Hi Dongjoon,
>>>
>>> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>>>
>>> The problem is not about the function `map_filter`, but about how the
>>> map type values are created in Spark, when there are duplicated keys.
>>>
>>> In programming languages like Java/Scala, when creating map, the later
>>> entry wins. e.g. in scala
>>> scala> Map(1 -> 2, 1 -> 3)
>>> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>>>
>>> scala> Map(1 -> 2, 1 -> 3).get(1)
>>> res1: Option[Int] = Some(3)
>>>
>>> However, in Spark, the earlier entry wins
>>> scala> sql("SELECT map(1,2,1,3)[1]").show
>>> +--+
>>> |map(1, 2, 1, 3)[1]|
>>> +--+
>>> | 2|
>>> +--+
>>>
>>> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>>>
>>> But there are several bugs in Spark
>>>
>>> scala> sql("SELECT map(1,2,1,3)").show
>>> ++
>>> | map(1, 2, 1, 3)|
>>> ++
>>> |[1 -> 2, 1 -> 3]|
>>> ++
>>> The displayed string of map values has a bug and we should deduplicate
>>> the entries. This is tracked by SPARK-25824.
>>>
>>>
>>> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
>>> res11: org.apache.spark.sql.DataFrame = []
>>>
>>> scala> sql("select * from t").show
>>> ++
>>> | map|
>>> ++
>>> |[1 -> 3]|
>>> ++
>>> The Hive map value converter has a bug; we should respect the "earlier
>>> entry wins" semantic. No ticket yet.
>>>
>>>
>>> scala> sql("select map(1,2,1,3)").collect
>>> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>>> Same bug happens at `collect`. No ticket yet.
>>>
>>> I'll create tickets and list all of them as known issues in 2.4.0.
>>>
>>> It's arguable if the "earlier entry wins" semantic is reasonable. Fixing
>>> it is a behavior change and we can only apply it to master branch.
>>>
>>> Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's
>>> just a symptom of the hive map value converter bug. I think it's a
>>> non-blocker.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 -0 due to the following issue. From Spark 2.4.0, users may get an
 incorrect result when they use the new `map_filter` with `map_concat`
 functions.

 https://issues.apache.org/jira/browse/SPARK-25823

 SPARK-25823 is only aiming to fix the data correctness issue from
 `map_filter`.

 PMC members are able to lower the priority. Always, I respect PMC's
 decision.

 I'm sending this email to draw more attention to this bug and to give
 some warning on the new feature's limitation to the community.

 Bests,
 Dongjoon.


 On Mon, 

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Wenchen Fan
> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> {1:3}

Are you running in the thrift-server? Then maybe this is caused by the bug
in `Dataset.collect` as I mentioned above.

I think map_filter is implemented correctly. map(1,2,1,3) is actually
map(1,2) according to the "earlier entry wins" semantic. I don't think this
will change in 2.4.1.

On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun 
wrote:

> Thank you for the follow-ups.
>
> Then, Spark 2.4.1 will return `{1:2}` differently from the followings
> (including Spark/Scala) in the end?
>
> I hoped to fix the `map_filter`, but now Spark looks inconsistent in many
> ways.
>
> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
> +---+
> |map(1, 2, 1, 3)|
> +---+
> |Map(1 -> 3)|
> +---+
>
>
> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> {1:3}
>
>
> hive> select map(1,2,1,3);  // Hive 1.2.2
> OK
> {1:3}
>
>
> presto> SELECT map_concat(map(array[1],array[2]), map(array[1],array[3]));
> // Presto 0.212
>  _col0
> ---
>  {1=3}
>
>
> Bests,
> Dongjoon.
>
>
> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan  wrote:
>
>> Hi Dongjoon,
>>
>> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>>
>> The problem is not about the function `map_filter`, but about how the map
>> type values are created in Spark, when there are duplicated keys.
>>
>> In programming languages like Java/Scala, when creating map, the later
>> entry wins. e.g. in scala
>> scala> Map(1 -> 2, 1 -> 3)
>> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>>
>> scala> Map(1 -> 2, 1 -> 3).get(1)
>> res1: Option[Int] = Some(3)
>>
>> However, in Spark, the earlier entry wins
>> scala> sql("SELECT map(1,2,1,3)[1]").show
>> +--+
>> |map(1, 2, 1, 3)[1]|
>> +--+
>> | 2|
>> +--+
>>
>> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>>
>> But there are several bugs in Spark
>>
>> scala> sql("SELECT map(1,2,1,3)").show
>> ++
>> | map(1, 2, 1, 3)|
>> ++
>> |[1 -> 2, 1 -> 3]|
>> ++
>> The displayed string of map values has a bug and we should deduplicate
>> the entries. This is tracked by SPARK-25824.
>>
>>
>> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
>> res11: org.apache.spark.sql.DataFrame = []
>>
>> scala> sql("select * from t").show
>> ++
>> | map|
>> ++
>> |[1 -> 3]|
>> ++
>> The Hive map value converter has a bug; we should respect the "earlier
>> entry wins" semantic. No ticket yet.
>>
>>
>> scala> sql("select map(1,2,1,3)").collect
>> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>> Same bug happens at `collect`. No ticket yet.
>>
>> I'll create tickets and list all of them as known issues in 2.4.0.
>>
>> It's arguable if the "earlier entry wins" semantic is reasonable. Fixing
>> it is a behavior change and we can only apply it to master branch.
>>
>> Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's
>> just a symptom of the hive map value converter bug. I think it's a
>> non-blocker.
>>
>> Thanks,
>> Wenchen
>>
>> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> -0 due to the following issue. From Spark 2.4.0, users may get an
>>> incorrect result when they use the new `map_filter` with `map_concat` functions.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-25823
>>>
>>> SPARK-25823 is only aiming to fix the data correctness issue from
>>> `map_filter`.
>>>
>>> PMC members are able to lower the priority. Always, I respect PMC's
>>> decision.
>>>
>>> I'm sending this email to draw more attention to this bug and to give
>>> some warning on the new feature's limitation to the community.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.4.0.

 The vote is open until October 26 PST and passes if a majority +1 PMC
 votes are cast, with
 a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 2.4.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v2.4.0-rc4 (commit
 e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
 https://github.com/apache/spark/tree/v2.4.0-rc4

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1290

 The documentation corresponding to this release can be found at:
 

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Dongjoon Hyun
Thank you for the follow-ups.

Then, in the end, Spark 2.4.1 will return `{1:2}`, differently from all of the
following (including Spark/Scala)?

I hoped to fix the `map_filter`, but now Spark looks inconsistent in many
ways.

scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
+---------------+
|map(1, 2, 1, 3)|
+---------------+
|    Map(1 -> 3)|
+---------------+


spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
{1:3}


hive> select map(1,2,1,3);  // Hive 1.2.2
OK
{1:3}


presto> SELECT map_concat(map(array[1],array[2]), map(array[1],array[3]));
// Presto 0.212
 _col0
---
 {1=3}


Bests,
Dongjoon.


On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan  wrote:

> Hi Dongjoon,
>
> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>
> The problem is not about the function `map_filter`, but about how the map
> type values are created in Spark, when there are duplicated keys.
>
> In programming languages like Java/Scala, when creating map, the later
> entry wins. e.g. in scala
> scala> Map(1 -> 2, 1 -> 3)
> res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
>
> scala> Map(1 -> 2, 1 -> 3).get(1)
> res1: Option[Int] = Some(3)
>
> However, in Spark, the earlier entry wins
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
>
> So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
>
> But there are several bugs in Spark
>
> scala> sql("SELECT map(1,2,1,3)").show
> ++
> | map(1, 2, 1, 3)|
> ++
> |[1 -> 2, 1 -> 3]|
> ++
> The displayed string of map values has a bug and we should deduplicate the
> entries. This is tracked by SPARK-25824.
>
>
> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
> res11: org.apache.spark.sql.DataFrame = []
>
> scala> sql("select * from t").show
> ++
> | map|
> ++
> |[1 -> 3]|
> ++
> The Hive map value converter has a bug; we should respect the "earlier entry
> wins" semantic. No ticket yet.
>
>
> scala> sql("select map(1,2,1,3)").collect
> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> Same bug happens at `collect`. No ticket yet.
>
> I'll create tickets and list all of them as known issues in 2.4.0.
>
> It's arguable if the "earlier entry wins" semantic is reasonable. Fixing
> it is a behavior change and we can only apply it to master branch.
>
> Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's
> just a symptom of the hive map value converter bug. I think it's a
> non-blocker.
>
> Thanks,
> Wenchen
>
> On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> -0 due to the following issue. From Spark 2.4.0, users may get an
>> incorrect result when they use the new `map_filter` with `map_concat` functions.
>>
>> https://issues.apache.org/jira/browse/SPARK-25823
>>
>> SPARK-25823 is only aiming to fix the data correctness issue from
>> `map_filter`.
>>
>> PMC members are able to lower the priority. Always, I respect PMC's
>> decision.
>>
>> I'm sending this email to draw more attention to this bug and to give
>> some warning on the new feature's limitation to the community.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.4.0.
>>>
>>> The vote is open until October 26 PST and passes if a majority +1 PMC
>>> votes are cast, with
>>> a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.4.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.4.0-rc4 (commit
>>> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>>
>>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important 

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Wenchen Fan
Hi Dongjoon,

Thanks for reporting it! This is indeed a bug that needs to be fixed.

The problem is not about the function `map_filter`, but about how the map
type values are created in Spark, when there are duplicated keys.

In programming languages like Java/Scala, when creating a map, the later
entry wins, e.g. in Scala:
scala> Map(1 -> 2, 1 -> 3)
res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)

scala> Map(1 -> 2, 1 -> 3).get(1)
res1: Option[Int] = Some(3)

However, in Spark, the earlier entry wins
scala> sql("SELECT map(1,2,1,3)[1]").show
+------------------+
|map(1, 2, 1, 3)[1]|
+------------------+
|                 2|
+------------------+

So for Spark users, Map(1 -> 2, 1 -> 3) should be equal to Map(1 -> 2).
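
(Illustration only -- a plain-Scala sketch of the two deduplication rules being
contrasted here, not Spark's code.)

val entries = Seq(1 -> 2, 1 -> 3)

// "Later entry wins" -- what Scala's Map construction does:
val laterWins = entries.toMap                       // Map(1 -> 3)

// "Earlier entry wins" -- the semantic Spark's map lookup follows:
val earlierWins = entries.foldLeft(Map.empty[Int, Int]) {
  case (acc, (k, v)) => if (acc.contains(k)) acc else acc + (k -> v)
}                                                   // Map(1 -> 2)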

But there are several bugs in Spark

scala> sql("SELECT map(1,2,1,3)").show
+----------------+
| map(1, 2, 1, 3)|
+----------------+
|[1 -> 2, 1 -> 3]|
+----------------+
The displayed string of map values has a bug and we should deduplicate the
entries. This is tracked by SPARK-25824.


scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
res11: org.apache.spark.sql.DataFrame = []

scala> sql("select * from t").show
+--------+
|     map|
+--------+
|[1 -> 3]|
+--------+
The Hive map value converter has a bug; we should respect the "earlier entry
wins" semantic. No ticket yet.


scala> sql("select map(1,2,1,3)").collect
res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
Same bug happens at `collect`. No ticket yet.

I'll create tickets and list all of them as known issues in 2.4.0.

It's arguable whether the "earlier entry wins" semantic is reasonable. Fixing it
is a behavior change, so we can only apply it to the master branch.

Going back to https://issues.apache.org/jira/browse/SPARK-25823, it's just
a symptom of the Hive map value converter bug. I think it's a non-blocker.

Thanks,
Wenchen

On Thu, Oct 25, 2018 at 5:31 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> -0 due to the following issue. From Spark 2.4.0, users may get an
> incorrect result when they use the new `map_filter` with `map_concat` functions.
>
> https://issues.apache.org/jira/browse/SPARK-25823
>
> SPARK-25823 is only aiming to fix the data correctness issue from
> `map_filter`.
>
> PMC members are able to lower the priority. Always, I respect PMC's
> decision.
>
> I'm sending this email to draw more attention to this bug and to give some
> warning on the new feature's limitation to the community.
>
> Bests,
> Dongjoon.
>
>
> On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.4.0.
>>
>> The vote is open until October 26 PST and passes if a majority +1 PMC
>> votes are cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.4.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.4.0-rc4 (commit
>> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
>> https://github.com/apache/spark/tree/v2.4.0-rc4
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1290
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>>
>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env, install
>> the current RC, and see if anything important breaks; in Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.4.0?
>> ===
>>
>> The current list of open tickets targeted at 2.4.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.4.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't 

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Dongjoon Hyun
Hi, All.

-0 due to the following issue. From Spark 2.4.0, users may get an incorrect
result when they use the new `map_filter` with `map_concat` functions.

https://issues.apache.org/jira/browse/SPARK-25823

SPARK-25823 is only aiming to fix the data correctness issue from
`map_filter`.

PMC members are able to lower the priority; as always, I respect the PMC's
decision.

I'm sending this email to draw more attention to this bug and to warn the
community about the new feature's limitations.

Bests,
Dongjoon.


On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until October 26 PST and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc4 (commit
> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
> https://github.com/apache/spark/tree/v2.4.0-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1290
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env, install
> the current RC, and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Ryan Blue
+1 (non-binding)

The Iceberg implementation of DataSourceV2 is passing all tests after
updating to the 2.4 API, although I've had to disable ORC support because
BufferHolder is no longer public.

One oddity is that the DSv2 API for batch sources now includes an epoch ID,
which I think will be removed in the refactor before 2.5 or 3.0 and wasn't
part of the 2.3 release. That's strange, but it's minor.

rb

On Tue, Oct 23, 2018 at 5:10 PM Sean Owen  wrote:

> Hm, so you're trying to build a source release from a binary release?
> I don't think that needs to work, nor do I expect it to, for reasons
> like this. The two just contain fairly different things.
>
> On Tue, Oct 23, 2018 at 7:04 PM Dongjoon Hyun 
> wrote:
> >
> > Ur, Wenchen.
> >
> > Source distribution seems to fail by default.
> >
> >
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz
> >
> > $ dev/make-distribution.sh -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
> -Phive-thriftserver
> > ...
> > + cp /spark-2.4.0/LICENSE-binary /spark-2.4.0/dist/LICENSE
> > cp: /spark-2.4.0/LICENSE-binary: No such file or directory
> >
> >
> > The root cause seems to be the following fix.
> >
> >
> https://github.com/apache/spark/pull/22436/files#diff-01ca42240614718522afde4d4885b40dR175
> >
> > Although Apache Spark provides the binary distributions, it would be
> great if this succeeds out of the box.
> >
> > Bests,
> > Dongjoon.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Sean Owen
Hm, so you're trying to build a source release from a binary release?
I don't think that needs to work, nor do I expect it to, for reasons
like this. The two just contain fairly different things.

On Tue, Oct 23, 2018 at 7:04 PM Dongjoon Hyun  wrote:
>
> Ur, Wenchen.
>
> Source distribution seems to fail by default.
>
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz
>
> $ dev/make-distribution.sh -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive 
> -Phive-thriftserver
> ...
> + cp /spark-2.4.0/LICENSE-binary /spark-2.4.0/dist/LICENSE
> cp: /spark-2.4.0/LICENSE-binary: No such file or directory
>
>
> The root cause seems to be the following fix.
>
> https://github.com/apache/spark/pull/22436/files#diff-01ca42240614718522afde4d4885b40dR175
>
> Although Apache Spark provides the binary distributions, it would be great if 
> this succeeds out of the box.
>
> Bests,
> Dongjoon.
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Dongjoon Hyun
Ur, Wenchen.

Source distribution seems to fail by default.

https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz

$ dev/make-distribution.sh -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
-Phive-thriftserver
...
+ cp /spark-2.4.0/LICENSE-binary /spark-2.4.0/dist/LICENSE
cp: /spark-2.4.0/LICENSE-binary: No such file or directory


The root cause seems to be the following fix.

https://github.com/apache/spark/pull/22436/files#diff-01ca42240614718522afde4d4885b40dR175

Although Apache Spark provides the binary distributions, it would be great
if this worked out of the box.

Bests,
Dongjoon.


On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until October 26 PST and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc4 (commit
> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
> https://github.com/apache/spark/tree/v2.4.0-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1290
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env, install
> the current RC, and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Sean Owen
To be clear I'm currently +1 on this release, with much commentary.

OK, the explanation for kubernetes tests makes sense. Yes I think we need
to propagate the scala-2.12 build profile to make it work. Go for it, if
you have a lead on what the change is.
This doesn't block the release as it's an issue for tests, and only affects
2.12. However if we had a clean fix for this and there were another RC, I'd
include it.

Dongjoon has a good point about the spark-kubernetes-integration-tests
artifact. That doesn't sound like it should be published in this way,
though, of course, we publish the test artifacts from every module already.
This is only a bit odd in being a non-test artifact meant for testing. But
it's special testing! So I also don't think that needs to block a release.

This happens because the integration tests module is enabled with the
'kubernetes' profile too, and also this output is copied into the release
tarball at kubernetes/integration-tests/tests. Do we need that in a binary
release?

If these integration tests are meant to be run ad hoc, manually, not part
of a normal test cycle, then I think we can just not enable it with
-Pkubernetes. If it is meant to run every time, then it sounds like we need
a little extra work shown in recent PRs to make that easier, but then, this
test code should just be the 'test' artifact parts of the kubernetes
module, no?


On Tue, Oct 23, 2018 at 1:46 PM Dongjoon Hyun 
wrote:

> BTW, for that integration suite, I saw the related artifacts in the RC4
> staging directory.
>
> Does Spark 2.4.0 need to start to release these 
> `spark-kubernetes-integration-tests`
> artifacts?
>
>-
>
> https://repository.apache.org/content/repositories/orgapachespark-1290/org/apache/spark/spark-kubernetes-integration-tests_2.11/
>-
>
> https://repository.apache.org/content/repositories/orgapachespark-1290/org/apache/spark/spark-kubernetes-integration-tests_2.12/
>
> Historically, Spark released `spark-docker-integration-tests` at Spark
> 1.6.x era and stopped since Spark 2.0.0.
>
>-
>
> http://central.maven.org/maven2/org/apache/spark/spark-docker-integration-tests_2.10/
>-
>
> http://central.maven.org/maven2/org/apache/spark/spark-docker-integration-tests_2.11/
>
>
> Bests,
> Dongjoon.
>
> On Tue, Oct 23, 2018 at 11:43 AM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
>
>> Sean,
>>
>> OK, makes sense; I'm using a cloned repo. I built with the Scala 2.12 profile
>> using the related tag v2.4.0-rc4:
>>
>> ./dev/change-scala-version.sh 2.12
>> ./dev/make-distribution.sh  --name test --r --tgz -Pscala-2.12 -Psparkr
>> -Phadoop-2.7 -Pkubernetes -Phive
>> I pushed the images to Docker Hub (see previous email) since I didn't use the
>> minikube daemon (the default behavior).
>>
>> Then I ran the tests successfully against minikube:
>>
>> TGZ_PATH=$(pwd)/spark-2.4.0-bin-test.gz
>> cd resource-managers/kubernetes/integration-tests
>>
>> ./dev/dev-run-integration-tests.sh --spark-tgz $TGZ_PATH
>> --service-account default --namespace default
>> --image-tag k8s-scala-12 --image-repo skonto
>>
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Stavros Kontopoulos
+1 (non-binding). Run k8s tests with Scala 2.12. Also included the
RTestsSuite (mentioned by Ilan) although not part of the 2.4 rc tag:

[INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
spark-kubernetes-integration-tests_2.12 ---
Discovery starting.
Discovery completed in 239 milliseconds.
Run starting. Expected test count is: 15
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Run SparkR on simple dataframe.R example
Run completed in 6 minutes, 32 seconds.
Total number of tests run: 15
Suites: completed 2, aborted 0
Tests: succeeded 15, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO]

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM 2.4.0 . SUCCESS [
4.480 s]
[INFO] Spark Project Tags . SUCCESS [
3.898 s]
[INFO] Spark Project Local DB . SUCCESS [
2.773 s]
[INFO] Spark Project Networking ... SUCCESS [
5.063 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [
2.651 s]
[INFO] Spark Project Unsafe ... SUCCESS [
2.662 s]
[INFO] Spark Project Launcher . SUCCESS [
5.103 s]
[INFO] Spark Project Core . SUCCESS [
25.703 s]
[INFO] Spark Project Kubernetes Integration Tests 2.4.0 ... SUCCESS [06:51
min]
[INFO]

[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time: 07:44 min
[INFO] Finished at: 2018-10-23T19:09:41Z
[INFO]


Stavros

On Tue, Oct 23, 2018 at 9:46 PM, Dongjoon Hyun 
wrote:

> BTW, for that integration suite, I saw the related artifacts in the RC4
> staging directory.
>
> Does Spark 2.4.0 need to start to release these `spark-kubernetes
> -integration-tests` artifacts?
>
>- https://repository.apache.org/content/repositories/
>orgapachespark-1290/org/apache/spark/spark-kubernetes-
>integration-tests_2.11/
>
> 
>- https://repository.apache.org/content/repositories/
>orgapachespark-1290/org/apache/spark/spark-kubernetes-
>integration-tests_2.12/
>
> 
>
> Historically, Spark released `spark-docker-integration-tests` at Spark
> 1.6.x era and stopped since Spark 2.0.0.
>
>- http://central.maven.org/maven2/org/apache/spark/spark-
>docker-integration-tests_2.10/
>- http://central.maven.org/maven2/org/apache/spark/spark-
>docker-integration-tests_2.11/
>
>
> Bests,
> Dongjoon.
>
> On Tue, Oct 23, 2018 at 11:43 AM Stavros Kontopoulos  lightbend.com> wrote:
>
>> Sean,
>>
>> OK, makes sense; I'm using a cloned repo. I built with the Scala 2.12 profile
>> using the related tag v2.4.0-rc4:
>>
>> ./dev/change-scala-version.sh 2.12
>> ./dev/make-distribution.sh  --name test --r --tgz -Pscala-2.12 -Psparkr
>> -Phadoop-2.7 -Pkubernetes -Phive
>> I pushed the images to Docker Hub (see previous email) since I didn't use the
>> minikube daemon (the default behavior).
>>
>> Then I ran the tests successfully against minikube:
>>
>> TGZ_PATH=$(pwd)/spark-2.4.0-bin-test.gz
>> cd resource-managers/kubernetes/integration-tests
>>
>> ./dev/dev-run-integration-tests.sh --spark-tgz $TGZ_PATH
>> --service-account default --namespace default --image-tag k8s-scala-12 
>> --image-repo
>> skonto
>>
>>
>> [INFO]
>> [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
>> spark-kubernetes-integration-tests_2.12 ---
>> Discovery starting.
>> Discovery completed in 229 milliseconds.
>> Run starting. Expected test count is: 14
>> KubernetesSuite:
>> - Run SparkPi with no resources
>> - Run SparkPi with a very long application name.
>> - Use SparkLauncher.NO_RESOURCE
>> - Run SparkPi with a master URL without a scheme.
>> - Run SparkPi with an argument.
>> - Run SparkPi with custom labels, annotations, and environment variables.
>> - Run extraJVMOptions check on driver
>> - Run SparkRemoteFileTest using a remote data file
>> - Run SparkPi with env and mount 

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Dongjoon Hyun
BTW, for that integration suite, I saw the related artifacts in the RC4
staging directory.

Does Spark 2.4.0 need to start releasing these
`spark-kubernetes-integration-tests` artifacts?

   -
   
https://repository.apache.org/content/repositories/orgapachespark-1290/org/apache/spark/spark-kubernetes-integration-tests_2.11/
   -
   
https://repository.apache.org/content/repositories/orgapachespark-1290/org/apache/spark/spark-kubernetes-integration-tests_2.12/

Historically, Spark released `spark-docker-integration-tests` in the Spark
1.6.x era and stopped doing so as of Spark 2.0.0.

   -
   
http://central.maven.org/maven2/org/apache/spark/spark-docker-integration-tests_2.10/
   -
   
http://central.maven.org/maven2/org/apache/spark/spark-docker-integration-tests_2.11/


Bests,
Dongjoon.

On Tue, Oct 23, 2018 at 11:43 AM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Sean,
>
> OK, makes sense; I'm using a cloned repo. I built with the Scala 2.12 profile
> using the related tag v2.4.0-rc4:
>
> ./dev/change-scala-version.sh 2.12
> ./dev/make-distribution.sh  --name test --r --tgz -Pscala-2.12 -Psparkr
> -Phadoop-2.7 -Pkubernetes -Phive
> I pushed the images to Docker Hub (see previous email) since I didn't use the
> minikube daemon (the default behavior).
>
> Then I ran the tests successfully against minikube:
>
> TGZ_PATH=$(pwd)/spark-2.4.0-bin-test.gz
> cd resource-managers/kubernetes/integration-tests
>
> ./dev/dev-run-integration-tests.sh --spark-tgz $TGZ_PATH --service-account
> default --namespace default --image-tag k8s-scala-12 --image-repo skonto
>
>
> [INFO]
> [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
> spark-kubernetes-integration-tests_2.12 ---
> Discovery starting.
> Discovery completed in 229 milliseconds.
> Run starting. Expected test count is: 14
> KubernetesSuite:
> - Run SparkPi with no resources
> - Run SparkPi with a very long application name.
> - Use SparkLauncher.NO_RESOURCE
> - Run SparkPi with a master URL without a scheme.
> - Run SparkPi with an argument.
> - Run SparkPi with custom labels, annotations, and environment variables.
> - Run extraJVMOptions check on driver
> - Run SparkRemoteFileTest using a remote data file
> - Run SparkPi with env and mount secrets.
> - Run PySpark on simple pi.py example
> - Run PySpark with Python2 to test a pyfiles example
> - Run PySpark with Python3 to test a pyfiles example
> - Run PySpark with memory customization
> - Run in client mode.
> Run completed in 5 minutes, 24 seconds.
> Total number of tests run: 14
> Suites: completed 2, aborted 0
> Tests: succeeded 14, failed 0, canceled 0, ignored 0, pending 0
> All tests passed.
> [INFO]
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM 2.4.0 . SUCCESS [
> 4.491 s]
> [INFO] Spark Project Tags . SUCCESS [
> 3.833 s]
> [INFO] Spark Project Local DB . SUCCESS [
> 2.680 s]
> [INFO] Spark Project Networking ... SUCCESS [
> 4.817 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
> 2.541 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
> 2.795 s]
> [INFO] Spark Project Launcher . SUCCESS [
> 5.593 s]
> [INFO] Spark Project Core . SUCCESS [
> 25.160 s]
> [INFO] Spark Project Kubernetes Integration Tests 2.4.0 ... SUCCESS [05:30
> min]
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 06:23 min
> [INFO] Finished at: 2018-10-23T18:39:11Z
> [INFO]
> 
>
>
> but had to modify this line
> 
>  and
> added -Pscala-2.12 , otherwise it fails (these tests inherit from the
> parent pom but the profile is not propagated to the mvn command that
> launches the tests, I can create a PR to fix that).
>
>
> On Tue, Oct 23, 2018 at 7:44 PM, Hyukjin Kwon  wrote:
>
>> https://github.com/apache/spark/pull/22514 sounds like a regression that
>> affects Hive CTAS in write path (by not replacing them into Spark internal
>> datasources; therefore performance regression).
>> but yea I suspect if we should block the release by this.
>>
>> https://github.com/apache/spark/pull/22144 is just being discussed if I
>> am not mistaken.
>>
>> Thanks.
>>
>> On Wed, Oct 24, 2018 at 12:27 AM, Xiao Li wrote:
>>
>>> https://github.com/apache/spark/pull/22144 is also not a blocker of
>>> Spark 2.4 release, as discussed in the PR.
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>> On Tue, Oct 23, 2018 at 9:20 AM, Xiao Li wrote:
>>>
 Thanks for reporting this. https://github.com/apache/spark/pull/22514
 is 

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Ilan Filonenko
+1 (non-binding) in reference to all k8s tests for 2.11 (including SparkR
Tests with R version being 3.4.1)

[INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
spark-kubernetes-integration-tests_2.11 ---
Discovery starting.
Discovery completed in 202 milliseconds.
Run starting. Expected test count is: 15
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run SparkR on simple dataframe.R example
- Run in client mode.
Run completed in 6 minutes, 47 seconds.
Total number of tests run: 15
Suites: completed 2, aborted 0
Tests: succeeded 15, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

Sean, in reference to your issues, the comment you linked is correct in
that you would need to build a Kubernetes distribution, i.e.:
dev/make-distribution.sh --pip --r --tgz -Psparkr -Phadoop-2.7 -Pkubernetes
then set up minikube, i.e.:
minikube start --insecure-registry=localhost:5000 --cpus 6 --memory 6000
and then run the appropriate tests, i.e.:
dev/dev-run-integration-tests.sh --spark-tgz .../spark-2.4.0-bin-2.7.3.tgz

The newest PR that you linked lets us point to a local Kubernetes cluster
deployed via docker-for-mac as opposed to minikube, which gives us another way
to test, but it does not change the testing workflow AFAICT.

On Tue, Oct 23, 2018 at 9:14 AM Sean Owen  wrote:

> (I should add, I only observed this with the Scala 2.12 build. It all
> seemed to work with 2.11. Therefore I'm not too worried about it. I
> don't think it's a Scala version issue, but perhaps something looking
> for a spark 2.11 tarball and not finding it. See
> https://github.com/apache/spark/pull/22805#issuecomment-432304622 for
> a change that might address this kind of thing.)
>
> On Tue, Oct 23, 2018 at 11:05 AM Sean Owen  wrote:
> >
> > Yeah, that's maybe the issue here. This is a source release, not a git
> checkout, and it still needs to work in this context.
> >
> > I just added -Pkubernetes to my build and didn't do anything else. I
> think the ideal is that a "mvn -P... -P... install" to work from a source
> release; that's a good expectation and consistent with docs.
> >
> > Maybe these tests simply don't need to run with the normal suite of
> tests, and can be considered tests run manually by developers running these
> scripts? Basically, KubernetesSuite shouldn't run in a normal mvn install?
> >
> > I don't think this has to block the release even if so, just trying to
> get to the bottom of it.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Stavros Kontopoulos
Sean,

OK, makes sense; I'm using a cloned repo. I built with the Scala 2.12 profile
using the related tag v2.4.0-rc4:

./dev/change-scala-version.sh 2.12
./dev/make-distribution.sh  --name test --r --tgz -Pscala-2.12 -Psparkr
-Phadoop-2.7 -Pkubernetes -Phive
I pushed the images to Docker Hub (see my previous email) since I didn't use
the minikube Docker daemon (the default behavior).
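
For anyone reproducing this, the push step looks roughly like the following;
the repo and tag are just the values used in this thread, and any Docker Hub
account would do:

# Build the Spark images from the unpacked distribution, then push them.
./bin/docker-image-tool.sh -r skonto -t k8s-scala-12 build
./bin/docker-image-tool.sh -r skonto -t k8s-scala-12 push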

Then I ran the tests successfully against minikube:

TGZ_PATH=$(pwd)/spark-2.4.0-bin-test.gz
cd resource-managers/kubernetes/integration-tests

./dev/dev-run-integration-tests.sh --spark-tgz $TGZ_PATH --service-account
default --namespace default --image-tag k8s-scala-12 --image-repo skonto


[INFO]
[INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
spark-kubernetes-integration-tests_2.12 ---
Discovery starting.
Discovery completed in 229 milliseconds.
Run starting. Expected test count is: 14
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
Run completed in 5 minutes, 24 seconds.
Total number of tests run: 14
Suites: completed 2, aborted 0
Tests: succeeded 14, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO]

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM 2.4.0 . SUCCESS [
4.491 s]
[INFO] Spark Project Tags . SUCCESS [
3.833 s]
[INFO] Spark Project Local DB . SUCCESS [
2.680 s]
[INFO] Spark Project Networking ... SUCCESS [
4.817 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [
2.541 s]
[INFO] Spark Project Unsafe ... SUCCESS [
2.795 s]
[INFO] Spark Project Launcher . SUCCESS [
5.593 s]
[INFO] Spark Project Core . SUCCESS [
25.160 s]
[INFO] Spark Project Kubernetes Integration Tests 2.4.0 ... SUCCESS [05:30
min]
[INFO]

[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time: 06:23 min
[INFO] Finished at: 2018-10-23T18:39:11Z
[INFO]



but I had to modify this line
and add -Pscala-2.12, otherwise it fails (these tests inherit from the
parent pom, but the profile is not propagated to the mvn command that
launches the tests; I can create a PR to fix that).
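
Roughly, the fix would amount to forwarding the Scala profile to the mvn
invocation inside dev-run-integration-tests.sh. The snippet below is only a
sketch of that idea, not the script's actual contents, and the property name
is an assumption:

# Sketch: pass the Scala profile (and the distro tarball) explicitly to mvn.
cd resource-managers/kubernetes/integration-tests
mvn -Pscala-2.12 -Dspark.kubernetes.test.sparkTgz="$TGZ_PATH" integration-test   # property name assumed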


On Tue, Oct 23, 2018 at 7:44 PM, Hyukjin Kwon  wrote:

> https://github.com/apache/spark/pull/22514 sounds like a regression that
> affects Hive CTAS in write path (by not replacing them into Spark internal
> datasources; therefore performance regression).
> but yea I suspect if we should block the release by this.
>
> https://github.com/apache/spark/pull/22144 is just being discussed if I
> am not mistaken.
>
> Thanks.
>
> On Wed, Oct 24, 2018 at 12:27 AM, Xiao Li wrote:
>
>> https://github.com/apache/spark/pull/22144 is also not a blocker of
>> Spark 2.4 release, as discussed in the PR.
>>
>> Thanks,
>>
>> Xiao
>>
>> On Tue, Oct 23, 2018 at 9:20 AM, Xiao Li wrote:
>>
>>> Thanks for reporting this. https://github.com/apache/spark/pull/22514
>>> is not a blocker. We can fix it in the next minor release, if we are unable
>>> to make it in this release.
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>> On Tue, Oct 23, 2018 at 9:14 AM, Sean Owen wrote:
>>>
 (I should add, I only observed this with the Scala 2.12 build. It all
 seemed to work with 2.11. Therefore I'm not too worried about it. I
 don't think it's a Scala version issue, but perhaps something looking
 for a spark 2.11 tarball and not finding it. See
 https://github.com/apache/spark/pull/22805#issuecomment-432304622 for
 a change that might address this kind of thing.)

 On Tue, Oct 23, 2018 at 11:05 AM Sean Owen  wrote:
 >
 > Yeah, that's maybe the issue here. This is a source release, not a
 git checkout, and it still needs to work in this context.
 >
 > I just added -Pkubernetes to my build and didn't do anything else. I
 think the ideal is that a "mvn -P... -P... install" to work from a source
 release; that's a good expectation and consistent with docs.
 >
 > Maybe these tests simply don't need to run 

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Hyukjin Kwon
https://github.com/apache/spark/pull/22514 sounds like a regression that
affects Hive CTAS in the write path (such queries are no longer converted to
Spark's internal data sources, hence the performance regression),
but I doubt we should block the release on this.
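
For concreteness, the affected pattern is a CTAS into a Hive serde table,
something along these lines (table names are made up; this just illustrates
the class of query discussed in the PR):

# Illustrative only: a Hive CTAS that goes through the Hive serde write path
# instead of Spark's native data source writer.
./bin/spark-sql -e "CREATE TABLE t_copy STORED AS PARQUET AS SELECT * FROM src_table"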

https://github.com/apache/spark/pull/22144 is just being discussed if I am
not mistaken.

Thanks.

On Wed, Oct 24, 2018 at 12:27 AM, Xiao Li wrote:

> https://github.com/apache/spark/pull/22144 is also not a blocker of Spark
> 2.4 release, as discussed in the PR.
>
> Thanks,
>
> Xiao
>
> On Tue, Oct 23, 2018 at 9:20 AM, Xiao Li wrote:
>
>> Thanks for reporting this. https://github.com/apache/spark/pull/22514 is
>> not a blocker. We can fix it in the next minor release, if we are unable to
>> make it in this release.
>>
>> Thanks,
>>
>> Xiao
>>
>> On Tue, Oct 23, 2018 at 9:14 AM, Sean Owen wrote:
>>
>>> (I should add, I only observed this with the Scala 2.12 build. It all
>>> seemed to work with 2.11. Therefore I'm not too worried about it. I
>>> don't think it's a Scala version issue, but perhaps something looking
>>> for a spark 2.11 tarball and not finding it. See
>>> https://github.com/apache/spark/pull/22805#issuecomment-432304622 for
>>> a change that might address this kind of thing.)
>>>
>>> On Tue, Oct 23, 2018 at 11:05 AM Sean Owen  wrote:
>>> >
>>> > Yeah, that's maybe the issue here. This is a source release, not a git
>>> checkout, and it still needs to work in this context.
>>> >
>>> > I just added -Pkubernetes to my build and didn't do anything else. I
>>> think the ideal is that a "mvn -P... -P... install" to work from a source
>>> release; that's a good expectation and consistent with docs.
>>> >
>>> > Maybe these tests simply don't need to run with the normal suite of
>>> tests, and can be considered tests run manually by developers running these
>>> scripts? Basically, KubernetesSuite shouldn't run in a normal mvn install?
>>> >
>>> > I don't think this has to block the release even if so, just trying to
>>> get to the bottom of it.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Xiao Li
https://github.com/apache/spark/pull/22144 is also not a blocker of Spark
2.4 release, as discussed in the PR.

Thanks,

Xiao

On Tue, Oct 23, 2018 at 9:20 AM, Xiao Li wrote:

> Thanks for reporting this. https://github.com/apache/spark/pull/22514 is
> not a blocker. We can fix it in the next minor release, if we are unable to
> make it in this release.
>
> Thanks,
>
> Xiao
>
> On Tue, Oct 23, 2018 at 9:14 AM, Sean Owen wrote:
>
>> (I should add, I only observed this with the Scala 2.12 build. It all
>> seemed to work with 2.11. Therefore I'm not too worried about it. I
>> don't think it's a Scala version issue, but perhaps something looking
>> for a spark 2.11 tarball and not finding it. See
>> https://github.com/apache/spark/pull/22805#issuecomment-432304622 for
>> a change that might address this kind of thing.)
>>
>> On Tue, Oct 23, 2018 at 11:05 AM Sean Owen  wrote:
>> >
>> > Yeah, that's maybe the issue here. This is a source release, not a git
>> checkout, and it still needs to work in this context.
>> >
>> > I just added -Pkubernetes to my build and didn't do anything else. I
>> think the ideal is that a "mvn -P... -P... install" to work from a source
>> release; that's a good expectation and consistent with docs.
>> >
>> > Maybe these tests simply don't need to run with the normal suite of
>> tests, and can be considered tests run manually by developers running these
>> scripts? Basically, KubernetesSuite shouldn't run in a normal mvn install?
>> >
>> > I don't think this has to block the release even if so, just trying to
>> get to the bottom of it.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Xiao Li
Thanks for reporting this. https://github.com/apache/spark/pull/22514 is
not a blocker. We can fix it in the next minor release, if we are unable to
make it in this release.

Thanks,

Xiao

On Tue, Oct 23, 2018 at 9:14 AM, Sean Owen wrote:

> (I should add, I only observed this with the Scala 2.12 build. It all
> seemed to work with 2.11. Therefore I'm not too worried about it. I
> don't think it's a Scala version issue, but perhaps something looking
> for a spark 2.11 tarball and not finding it. See
> https://github.com/apache/spark/pull/22805#issuecomment-432304622 for
> a change that might address this kind of thing.)
>
> On Tue, Oct 23, 2018 at 11:05 AM Sean Owen  wrote:
> >
> > Yeah, that's maybe the issue here. This is a source release, not a git
> checkout, and it still needs to work in this context.
> >
> > I just added -Pkubernetes to my build and didn't do anything else. I
> think the ideal is that a "mvn -P... -P... install" to work from a source
> release; that's a good expectation and consistent with docs.
> >
> > Maybe these tests simply don't need to run with the normal suite of
> tests, and can be considered tests run manually by developers running these
> scripts? Basically, KubernetesSuite shouldn't run in a normal mvn install?
> >
> > I don't think this has to block the release even if so, just trying to
> get to the bottom of it.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Sean Owen
(I should add, I only observed this with the Scala 2.12 build. It all
seemed to work with 2.11. Therefore I'm not too worried about it. I
don't think it's a Scala version issue, but perhaps something looking
for a spark 2.11 tarball and not finding it. See
https://github.com/apache/spark/pull/22805#issuecomment-432304622 for
a change that might address this kind of thing.)

On Tue, Oct 23, 2018 at 11:05 AM Sean Owen  wrote:
>
> Yeah, that's maybe the issue here. This is a source release, not a git 
> checkout, and it still needs to work in this context.
>
> I just added -Pkubernetes to my build and didn't do anything else. I think 
> the ideal is that a "mvn -P... -P... install" to work from a source release; 
> that's a good expectation and consistent with docs.
>
> Maybe these tests simply don't need to run with the normal suite of tests, 
> and can be considered tests run manually by developers running these scripts? 
> Basically, KubernetesSuite shouldn't run in a normal mvn install?
>
> I don't think this has to block the release even if so, just trying to get to 
> the bottom of it.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Wenchen Fan
I read through the contributing guide; it only mentions that data
correctness and data loss issues should be marked as blockers. AFAIK we
also mark regressions of the current release as blockers, but not regressions
of the previous releases.

SPARK-24935 is indeed a bug, and is a regression from Spark 2.2.0. We
should definitely fix it, but it doesn't seem like a blocker. BTW, the root
cause of SPARK-24935 is unknown (at least I can't tell from the PR), so
fixing it might take a while.

On Tue, Oct 23, 2018 at 11:58 PM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Sean,
>
> I will try it against 2.12 shortly.
>
> You're saying someone would have to first build a k8s distro from source
>> too?
>
>
> Ok I missed the error one line above, before the distro error there is
> another one:
>
> fatal: not a git repository (or any of the parent directories): .git
>
>
> So that seems to come from here
> .
> It seems that the test root is not set up correctly. It should be the top
> git dir from which you built Spark.
>
> Now regarding the distro thing. dev-run-integration-tests.sh should run
> from within the cloned project after the distro is built. The distro is
> required
> 
> , it should fail otherwise.
>
> Integration tests run the setup-integration-test-env.sh script. 
> dev-run-integration-tests.sh
> calls mvn
> 
>  which
> in turn executes that setup script
> 
> .
>
> How do you run the tests?
>
> Stavros
>
> On Tue, Oct 23, 2018 at 3:01 PM, Sean Owen  wrote:
>
>> No, because the docs are built into the release too and released to
>> the site too from the released artifact.
>> As a practical matter, I think these docs are not critical for
>> release, and can follow in a maintenance release. I'd retarget to
>> 2.4.1 or untarget.
>> I do know at times a release's docs have been edited after the fact,
>> but that's bad form. We'd not go change a class in the release after
>> it was released and call it the same release.
>>
>> I'd still like some confirmation that someone can build and pass tests
>> with -Pkubernetes, maybe? It actually all passed with the 2.11 build.
>> I don't think it's a 2.12 incompatibility, but rather than the K8S
>> tests maybe don't quite work with the 2.12 build artifact naming. Or
>> else something to do with my env.
>>
>> On Mon, Oct 22, 2018 at 9:08 PM Wenchen Fan  wrote:
>> >
>> > Regarding the doc tickets, I vaguely remember that we can merge doc PRs
>> after release and publish doc to spark website later. Can anyone confirm?
>> >
>> > On Tue, Oct 23, 2018 at 8:30 AM Sean Owen  wrote:
>> >>
>> >> This is what I got from a straightforward build of the source distro
>> >> here ... really, ideally, it builds as-is from source. You're saying
>> >> someone would have to first build a k8s distro from source too?
>> >> It's not a 'must' that this be automatic but nothing else fails out of
>> the box.
>> >> I feel like I might be misunderstanding the setup here.
>> >> On Mon, Oct 22, 2018 at 7:25 PM Stavros Kontopoulos
>> >>  wrote:
>>
>
>
>
> --
> Stavros Kontopoulos
>
> *Senior Software Engineer*
> *Lightbend, Inc.*
>
> *p:  +30 6977967274 <%2B1%20650%20678%200020>*
> *e: stavros.kontopou...@lightbend.com* 
>
>
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Sean Owen
Yeah, that's maybe the issue here. This is a source release, not a git
checkout, and it still needs to work in this context.

I just added -Pkubernetes to my build and didn't do anything else. I think
the ideal is for a "mvn -P... -P... install" to work from a source
release; that's a good expectation and consistent with the docs.

Maybe these tests simply don't need to run with the normal suite of tests,
and can be considered tests run manually by developers running these
scripts? Basically, KubernetesSuite shouldn't run in a normal mvn install?

I don't think this has to block the release even if so, just trying to get
to the bottom of it.
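
If KubernetesSuite really shouldn't run during a plain install, one stopgap
(my assumption, not an agreed approach) would be to leave the integration-test
module out of the reactor:

# Workaround sketch: build with -Pkubernetes but exclude the integration-test
# module so KubernetesSuite never runs in a normal install (Maven 3.2.1+).
./build/mvn -Pkubernetes -pl '!resource-managers/kubernetes/integration-tests' install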


On Tue, Oct 23, 2018 at 10:58 AM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Ok I missed the error one line above, before the distro error there is
> another one:
>
> fatal: not a git repository (or any of the parent directories): .git
>
>
> So that seems to come from here
> .
> It seems that the test root is not set up correctly. It should be the top
> git dir from which you built Spark.
>
> Now regarding the distro thing. dev-run-integration-tests.sh should run
> from within the cloned project after the distro is built. The distro is
> required
> 
> , it should fail otherwise.
>
> Integration tests run the setup-integration-test-env.sh script. 
> dev-run-integration-tests.sh
> calls mvn
> 
>  which
> in turn executes that setup script
> 
> .
>
> How do you run the tests?
>
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Hyukjin Kwon
I am searching and checking some PRs or JIRAs that state regression. Let me
leave a link - it might be good to double check
https://github.com/apache/spark/pull/22514 as well.

On Tue, Oct 23, 2018 at 11:58 PM, Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Sean,
>
> I will try it against 2.12 shortly.
>
> You're saying someone would have to first build a k8s distro from source
>> too?
>
>
> Ok I missed the error one line above, before the distro error there is
> another one:
>
> fatal: not a git repository (or any of the parent directories): .git
>
>
> So that seems to come from here
> .
> It seems that the test root is not set up correctly. It should be the top
> git dir from which you built Spark.
>
> Now regarding the distro thing. dev-run-integration-tests.sh should run
> from within the cloned project after the distro is built. The distro is
> required
> 
> , it should fail otherwise.
>
> Integration tests run the setup-integration-test-env.sh script. 
> dev-run-integration-tests.sh
> calls mvn
> 
>  which
> in turn executes that setup script
> 
> .
>
> How do you run the tests?
>
> Stavros
>
> On Tue, Oct 23, 2018 at 3:01 PM, Sean Owen  wrote:
>
>> No, because the docs are built into the release too and released to
>> the site too from the released artifact.
>> As a practical matter, I think these docs are not critical for
>> release, and can follow in a maintenance release. I'd retarget to
>> 2.4.1 or untarget.
>> I do know at times a release's docs have been edited after the fact,
>> but that's bad form. We'd not go change a class in the release after
>> it was released and call it the same release.
>>
>> I'd still like some confirmation that someone can build and pass tests
>> with -Pkubernetes, maybe? It actually all passed with the 2.11 build.
>> I don't think it's a 2.12 incompatibility, but rather than the K8S
>> tests maybe don't quite work with the 2.12 build artifact naming. Or
>> else something to do with my env.
>>
>> On Mon, Oct 22, 2018 at 9:08 PM Wenchen Fan  wrote:
>> >
>> > Regarding the doc tickets, I vaguely remember that we can merge doc PRs
>> after release and publish doc to spark website later. Can anyone confirm?
>> >
>> > On Tue, Oct 23, 2018 at 8:30 AM Sean Owen  wrote:
>> >>
>> >> This is what I got from a straightforward build of the source distro
>> >> here ... really, ideally, it builds as-is from source. You're saying
>> >> someone would have to first build a k8s distro from source too?
>> >> It's not a 'must' that this be automatic but nothing else fails out of
>> the box.
>> >> I feel like I might be misunderstanding the setup here.
>> >> On Mon, Oct 22, 2018 at 7:25 PM Stavros Kontopoulos
>> >>  wrote:
>>
>
>
>
> --
> Stavros Kontopoulos
>
> *Senior Software Engineer*
> *Lightbend, Inc.*
>
> *p:  +30 6977967274 <%2B1%20650%20678%200020>*
> *e: stavros.kontopou...@lightbend.com* 
>
>
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Stavros Kontopoulos
Sean,

I will try it against 2.12 shortly.

> You're saying someone would have to first build a k8s distro from source
> too?


OK, I missed the error one line above; before the distro error there is
another one:

fatal: not a git repository (or any of the parent directories): .git


So that seems to come from here.
It seems that the test root is not set up correctly. It should be the top
git dir from which you built Spark.
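
A quick sanity check along those lines (illustrative commands; the checkout
path is a placeholder):

# Run from the git checkout that produced the distro, not from an unpacked
# source tarball; the setup script relies on git to locate the repo root.
cd /path/to/spark-checkout
git rev-parse --show-toplevel   # should print the checkout root, not "fatal: not a git repository"
cd resource-managers/kubernetes/integration-tests
./dev/dev-run-integration-tests.sh --spark-tgz "$TGZ_PATH" --image-repo skonto --image-tag k8s-scala-12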

Now, regarding the distro: dev-run-integration-tests.sh should be run
from within the cloned project after the distro is built. The distro is
required; it should fail otherwise.

Integration tests run the setup-integration-test-env.sh script:
dev-run-integration-tests.sh calls mvn, which in turn executes that setup
script.

How do you run the tests?

Stavros

On Tue, Oct 23, 2018 at 3:01 PM, Sean Owen  wrote:

> No, because the docs are built into the release too and released to
> the site too from the released artifact.
> As a practical matter, I think these docs are not critical for
> release, and can follow in a maintenance release. I'd retarget to
> 2.4.1 or untarget.
> I do know at times a release's docs have been edited after the fact,
> but that's bad form. We'd not go change a class in the release after
> it was released and call it the same release.
>
> I'd still like some confirmation that someone can build and pass tests
> with -Pkubernetes, maybe? It actually all passed with the 2.11 build.
> I don't think it's a 2.12 incompatibility, but rather than the K8S
> tests maybe don't quite work with the 2.12 build artifact naming. Or
> else something to do with my env.
>
> On Mon, Oct 22, 2018 at 9:08 PM Wenchen Fan  wrote:
> >
> > Regarding the doc tickets, I vaguely remember that we can merge doc PRs
> after release and publish doc to spark website later. Can anyone confirm?
> >
> > On Tue, Oct 23, 2018 at 8:30 AM Sean Owen  wrote:
> >>
> >> This is what I got from a straightforward build of the source distro
> >> here ... really, ideally, it builds as-is from source. You're saying
> >> someone would have to first build a k8s distro from source too?
> >> It's not a 'must' that this be automatic but nothing else fails out of
> the box.
> >> I feel like I might be misunderstanding the setup here.
> >> On Mon, Oct 22, 2018 at 7:25 PM Stavros Kontopoulos
> >>  wrote:
>



-- 
Stavros Kontopoulos

Senior Software Engineer
Lightbend, Inc.

p: +30 6977967274
e: stavros.kontopou...@lightbend.com


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Sean Owen
No, because the docs are built into the release too and released to
the site too from the released artifact.
As a practical matter, I think these docs are not critical for
release, and can follow in a maintenance release. I'd retarget to
2.4.1 or untarget.
I do know at times a release's docs have been edited after the fact,
but that's bad form. We'd not go change a class in the release after
it was released and call it the same release.

I'd still like some confirmation that someone can build and pass tests
with -Pkubernetes, maybe? It actually all passed with the 2.11 build.
I don't think it's a 2.12 incompatibility, but rather that the K8S
tests maybe don't quite work with the 2.12 build artifact naming, or
else something to do with my env.

On Mon, Oct 22, 2018 at 9:08 PM Wenchen Fan  wrote:
>
> Regarding the doc tickets, I vaguely remember that we can merge doc PRs after 
> release and publish doc to spark website later. Can anyone confirm?
>
> On Tue, Oct 23, 2018 at 8:30 AM Sean Owen  wrote:
>>
>> This is what I got from a straightforward build of the source distro
>> here ... really, ideally, it builds as-is from source. You're saying
>> someone would have to first build a k8s distro from source too?
>> It's not a 'must' that this be automatic but nothing else fails out of the 
>> box.
>> I feel like I might be misunderstanding the setup here.
>> On Mon, Oct 22, 2018 at 7:25 PM Stavros Kontopoulos
>>  wrote:

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Hyukjin Kwon
I am sorry for raising this late. Out of curiosity, does anyone know why we
don't treat SPARK-24935 (https://github.com/apache/spark/pull/22144) as a
blocker?

It looks like it broke API compatibility and an actual use case of an external
library (https://github.com/DataSketches/sketches-hive).
Also, it looks like sufficient discussion was made on its diagnosis (
https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ
).


On Tue, Oct 23, 2018 at 12:03 PM, Darcy Shen wrote:

>
>
> +1
>
>
>  On Tue, 23 Oct 2018 01:42:06 +0800 Wenchen Fan
> wrote 
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until October 26 PST and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc4 (commit
> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
> https://github.com/apache/spark/tree/v2.4.0-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1290
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-22 Thread Darcy Shen




+1

On Tue, 23 Oct 2018 01:42:06 +0800, Wenchen Fan wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until October 26 PST and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc4 (commit
> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
> https://github.com/apache/spark/tree/v2.4.0-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1290
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.








Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-22 Thread Aron.tao
+1



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-22 Thread Imran Rashid
+1
No blockers and our internal tests are all passing.

(I did file https://issues.apache.org/jira/browse/SPARK-25805, but this is
just a minor issue with a flaky test)

On Mon, Oct 22, 2018 at 12:42 PM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until October 26 PST and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc4 (commit
> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
> https://github.com/apache/spark/tree/v2.4.0-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1290
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
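
As a minimal illustration of the PySpark check suggested above (the tarball
name and location are assumptions based on the v2.4.0-rc4-bin/ directory):

# Rough sketch of a PySpark smoke test against the RC in a fresh virtual env.
python -m venv rc4-env && source rc4-env/bin/activate
pip install ./pyspark-2.4.0.tar.gz   # assumed filename from the RC bin directory
python -c "import pyspark; print(pyspark.__version__)"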


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-22 Thread Wenchen Fan
Regarding the doc tickets, I vaguely remember that we can merge doc PRs
after the release and publish the docs to the Spark website later. Can anyone confirm?

On Tue, Oct 23, 2018 at 8:30 AM Sean Owen  wrote:

> This is what I got from a straightforward build of the source distro
> here ... really, ideally, it builds as-is from source. You're saying
> someone would have to first build a k8s distro from source too?
> It's not a 'must' that this be automatic but nothing else fails out of the
> box.
> I feel like I might be misunderstanding the setup here.
> On Mon, Oct 22, 2018 at 7:25 PM Stavros Kontopoulos
>  wrote:
> >
> >
> >>
> >> tar (child): Error is not recoverable: exiting now
> >> tar: Child returned status 2
> >> tar: Error is not recoverable: exiting now
> >> scripts/setup-integration-test-env.sh: line 85:
> >>
> /home/srowen/spark-2.4.0/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/bin/docker-image-tool.sh:
> >
> >
> > It seems you are missing the distro file... here is how I run it locally:
> >
> > DOCKER_USERNAME=...
> > SPARK_K8S_IMAGE_TAG=...
> >
> > ./dev/make-distribution.sh --name test --tgz -Phadoop-2.7 -Pkubernetes
> -Phive
> > tar -zxvf spark-2.4.0-SNAPSHOT-bin-test.tgz
> > cd spark-2.4.0-SNAPSHOT-bin-test
> > ./bin/docker-image-tool.sh -r $DOCKER_USERNAME -t $SPARK_K8S_IMAGE_TAG
> build
> > cd ..
> > TGZ_PATH=$(pwd)/spark-2.4.0-SNAPSHOT-bin-test.tgz
> > cd resource-managers/kubernetes/integration-tests
> > ./dev/dev-run-integration-tests.sh --image-tag $SPARK_K8S_IMAGE_TAG
> --spark-tgz $TGZ_PATH --image-repo $DOCKER_USERNAME
> >
> > Stavros
> >
> > On Tue, Oct 23, 2018 at 1:54 AM, Sean Owen  wrote:
> >>
> >> Provisionally looking good to me, but I had a few questions.
> >>
> >> We have these open for 2.4, but I presume they aren't actually going
> >> to be in 2.4 and should be untargeted:
> >>
> >> SPARK-25507 Update documents for the new features in 2.4 release
> >> SPARK-25179 Document the features that require Pyarrow 0.10
> >> SPARK-25783 Spark shell fails because of jline incompatibility
> >> SPARK-25347 Document image data source in doc site
> >> SPARK-25584 Document libsvm data source in doc site
> >> SPARK-25346 Document Spark builtin data sources
> >> SPARK-24464 Unit tests for MLlib's Instrumentation
> >> SPARK-23197 Flaky test:
> spark.streaming.ReceiverSuite."receiver_life_cycle"
> >> SPARK-22809 pyspark is sensitive to imports with dots
> >> SPARK-21030 extend hint syntax to support any expression for Python and
> R
> >>
> >> Comments in several of the doc issues suggest they are needed for 2.4
> >> though. How essential?
> >>
> >> (Brief digression: SPARK-21030 is an example of a pattern I see
> >> sometimes. Parent Epic A is targeted for version X. Children B and C
> >> are not. Epic A's description is basically "do X and Y". Is the parent
> >> helping? And now that Y is done, is there a point in tracking X with
> >> two JIRAs? can I just close the Epic?)
> >>
> >> I am not sure I've tried running K8S in my test runs before, but I get
> >> this on my Linux machine:
> >>
> >> [INFO] --- exec-maven-plugin:1.4.0:exec (setup-integration-test-env) @
> >> spark-kubernetes-integration-tests_2.12 ---
> >> fatal: not a git repository (or any of the parent directories): .git
> >> tar (child): --strip-components=1: Cannot open: No such file or
> directory
> >> tar (child): Error is not recoverable: exiting now
> >> tar: Child returned status 2
> >> tar: Error is not recoverable: exiting now
> >> scripts/setup-integration-test-env.sh: line 85:
> >>
> /home/srowen/spark-2.4.0/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/bin/docker-image-tool.sh:
> >> No such file or directory
> >> /home/srowen/spark-2.4.0/resource-managers/kubernetes/integration-tests
> >> [INFO]
> >> [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
> >> spark-kubernetes-integration-tests_2.12 ---
> >> Discovery starting.
> >> Discovery completed in 289 milliseconds.
> >> Run starting. Expected test count is: 14
> >> KubernetesSuite:
> >> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite *** ABORTED
> ***
> >>   java.lang.NullPointerException:
> >>   at
> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:92)
> >>   at
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
> >>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
> >>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
> >>   at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.org
> $scalatest$BeforeAndAfter$$super$run(KubernetesSuite.scala:39)
> >>   at org.scalatest.BeforeAndAfter.run(BeforeAndAfter.scala:258)
> >>   at org.scalatest.BeforeAndAfter.run$(BeforeAndAfter.scala:256)
> >>   at
> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.run(KubernetesSuite.scala:39)
> >>   at org.scalatest.Suite.callExecuteOnSuite$1(Suite.scala:1210)
> >>   at 

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-22 Thread Sean Owen
This is what I got from a straightforward build of the source distro
here ... really, ideally, it builds as-is from source. You're saying
someone would have to first build a k8s distro from source too?
It's not a 'must' that this be automatic but nothing else fails out of the box.
I feel like I might be misunderstanding the setup here.
On Mon, Oct 22, 2018 at 7:25 PM Stavros Kontopoulos
 wrote:
>
>
>>
>> tar (child): Error is not recoverable: exiting now
>> tar: Child returned status 2
>> tar: Error is not recoverable: exiting now
>> scripts/setup-integration-test-env.sh: line 85:
>> /home/srowen/spark-2.4.0/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/bin/docker-image-tool.sh:
>
>
> It seems you are missing the distro file... here is how I run it locally:
>
> DOCKER_USERNAME=...
> SPARK_K8S_IMAGE_TAG=...
>
> ./dev/make-distribution.sh --name test --tgz -Phadoop-2.7 -Pkubernetes -Phive
> tar -zxvf spark-2.4.0-SNAPSHOT-bin-test.tgz
> cd spark-2.4.0-SNAPSHOT-bin-test
> ./bin/docker-image-tool.sh -r $DOCKER_USERNAME -t $SPARK_K8S_IMAGE_TAG build
> cd ..
> TGZ_PATH=$(pwd)/spark-2.4.0-SNAPSHOT-bin-test.tgz
> cd resource-managers/kubernetes/integration-tests
> ./dev/dev-run-integration-tests.sh --image-tag $SPARK_K8S_IMAGE_TAG 
> --spark-tgz $TGZ_PATH --image-repo $DOCKER_USERNAME
>
> Stavros
>
> On Tue, Oct 23, 2018 at 1:54 AM, Sean Owen  wrote:
>>
>> Provisionally looking good to me, but I had a few questions.
>>
>> We have these open for 2.4, but I presume they aren't actually going
>> to be in 2.4 and should be untargeted:
>>
>> SPARK-25507 Update documents for the new features in 2.4 release
>> SPARK-25179 Document the features that require Pyarrow 0.10
>> SPARK-25783 Spark shell fails because of jline incompatibility
>> SPARK-25347 Document image data source in doc site
>> SPARK-25584 Document libsvm data source in doc site
>> SPARK-25346 Document Spark builtin data sources
>> SPARK-24464 Unit tests for MLlib's Instrumentation
>> SPARK-23197 Flaky test: spark.streaming.ReceiverSuite."receiver_life_cycle"
>> SPARK-22809 pyspark is sensitive to imports with dots
>> SPARK-21030 extend hint syntax to support any expression for Python and R
>>
>> Comments in several of the doc issues suggest they are needed for 2.4
>> though. How essential?
>>
>> (Brief digression: SPARK-21030 is an example of a pattern I see
>> sometimes. Parent Epic A is targeted for version X. Children B and C
>> are not. Epic A's description is basically "do X and Y". Is the parent
>> helping? And now that Y is done, is there a point in tracking X with
>> two JIRAs? can I just close the Epic?)
>>
>> I am not sure I've tried running K8S in my test runs before, but I get
>> this on my Linux machine:
>>
>> [INFO] --- exec-maven-plugin:1.4.0:exec (setup-integration-test-env) @
>> spark-kubernetes-integration-tests_2.12 ---
>> fatal: not a git repository (or any of the parent directories): .git
>> tar (child): --strip-components=1: Cannot open: No such file or directory
>> tar (child): Error is not recoverable: exiting now
>> tar: Child returned status 2
>> tar: Error is not recoverable: exiting now
>> scripts/setup-integration-test-env.sh: line 85:
>> /home/srowen/spark-2.4.0/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/bin/docker-image-tool.sh:
>> No such file or directory
>> /home/srowen/spark-2.4.0/resource-managers/kubernetes/integration-tests
>> [INFO]
>> [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
>> spark-kubernetes-integration-tests_2.12 ---
>> Discovery starting.
>> Discovery completed in 289 milliseconds.
>> Run starting. Expected test count is: 14
>> KubernetesSuite:
>> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite *** ABORTED ***
>>   java.lang.NullPointerException:
>>   at 
>> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:92)
>>   at 
>> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
>>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>>   at 
>> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.org$scalatest$BeforeAndAfter$$super$run(KubernetesSuite.scala:39)
>>   at org.scalatest.BeforeAndAfter.run(BeforeAndAfter.scala:258)
>>   at org.scalatest.BeforeAndAfter.run$(BeforeAndAfter.scala:256)
>>   at 
>> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.run(KubernetesSuite.scala:39)
>>   at org.scalatest.Suite.callExecuteOnSuite$1(Suite.scala:1210)
>>   at org.scalatest.Suite.$anonfun$runNestedSuites$1(Suite.scala:1257)
>>   ...
>>
>> Clearly it's expecting something about the env that isn't true, but I
>> don't know if it's a problem with those expectations versus what is in
>> the source release, or, just something to do with my env. This is with
>> Scala 2.12.
>>
>>
>>
>> On Mon, Oct 22, 2018 at 12:42 PM Wenchen Fan  wrote:
>> >
>> > Please vote on 

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-22 Thread Stavros Kontopoulos
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> scripts/setup-integration-test-env.sh: line 85:
> /home/srowen/spark-2.4.0/resource-managers/kubernetes/
> integration-tests/target/spark-dist-unpacked/bin/docker-image-tool.sh:


It seems you are missing the distro file... here is how I run it locally:

DOCKER_USERNAME=...
SPARK_K8S_IMAGE_TAG=...

./dev/make-distribution.sh --name test --tgz -Phadoop-2.7 -Pkubernetes
-Phive
tar -zxvf spark-2.4.0-SNAPSHOT-bin-test.tgz
cd spark-2.4.0-SNAPSHOT-bin-test
./bin/docker-image-tool.sh -r $DOCKER_USERNAME -t $SPARK_K8S_IMAGE_TAG build
cd ..
TGZ_PATH=$(pwd)/spark-2.4.0-SNAPSHOT-bin-test.tgz
cd resource-managers/kubernetes/integration-tests
./dev/dev-run-integration-tests.sh --image-tag $SPARK_K8S_IMAGE_TAG
--spark-tgz $TGZ_PATH --image-repo $DOCKER_USERNAME

Stavros

On Tue, Oct 23, 2018 at 1:54 AM, Sean Owen  wrote:

> Provisionally looking good to me, but I had a few questions.
>
> We have these open for 2.4, but I presume they aren't actually going
> to be in 2.4 and should be untargeted:
>
> SPARK-25507 Update documents for the new features in 2.4 release
> SPARK-25179 Document the features that require Pyarrow 0.10
> SPARK-25783 Spark shell fails because of jline incompatibility
> SPARK-25347 Document image data source in doc site
> SPARK-25584 Document libsvm data source in doc site
> SPARK-25346 Document Spark builtin data sources
> SPARK-24464 Unit tests for MLlib's Instrumentation
> SPARK-23197 Flaky test: spark.streaming.ReceiverSuite.
> "receiver_life_cycle"
> SPARK-22809 pyspark is sensitive to imports with dots
> SPARK-21030 extend hint syntax to support any expression for Python and R
>
> Comments in several of the doc issues suggest they are needed for 2.4
> though. How essential?
>
> (Brief digression: SPARK-21030 is an example of a pattern I see
> sometimes. Parent Epic A is targeted for version X. Children B and C
> are not. Epic A's description is basically "do X and Y". Is the parent
> helping? And now that Y is done, is there a point in tracking X with
> two JIRAs? can I just close the Epic?)
>
> I am not sure I've tried running K8S in my test runs before, but I get
> this on my Linux machine:
>
> [INFO] --- exec-maven-plugin:1.4.0:exec (setup-integration-test-env) @
> spark-kubernetes-integration-tests_2.12 ---
> fatal: not a git repository (or any of the parent directories): .git
> tar (child): --strip-components=1: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> scripts/setup-integration-test-env.sh: line 85:
> /home/srowen/spark-2.4.0/resource-managers/kubernetes/
> integration-tests/target/spark-dist-unpacked/bin/docker-image-tool.sh:
> No such file or directory
> /home/srowen/spark-2.4.0/resource-managers/kubernetes/integration-tests
> [INFO]
> [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
> spark-kubernetes-integration-tests_2.12 ---
> Discovery starting.
> Discovery completed in 289 milliseconds.
> Run starting. Expected test count is: 14
> KubernetesSuite:
> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite *** ABORTED
> ***
>   java.lang.NullPointerException:
>   at org.apache.spark.deploy.k8s.integrationtest.
> KubernetesSuite.beforeAll(KubernetesSuite.scala:92)
>   at org.scalatest.BeforeAndAfterAll.liftedTree1$
> 1(BeforeAndAfterAll.scala:212)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.org
> $scalatest$BeforeAndAfter$$super$run(KubernetesSuite.scala:39)
>   at org.scalatest.BeforeAndAfter.run(BeforeAndAfter.scala:258)
>   at org.scalatest.BeforeAndAfter.run$(BeforeAndAfter.scala:256)
>   at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.run(
> KubernetesSuite.scala:39)
>   at org.scalatest.Suite.callExecuteOnSuite$1(Suite.scala:1210)
>   at org.scalatest.Suite.$anonfun$runNestedSuites$1(Suite.scala:1257)
>   ...
>
> Clearly it's expecting something about the env that isn't true, but I
> don't know if it's a problem with those expectations versus what is in
> the source release, or, just something to do with my env. This is with
> Scala 2.12.
>
>
>
> On Mon, Oct 22, 2018 at 12:42 PM Wenchen Fan  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
> >
> > The vote is open until October 26 PST and passes if a majority +1 PMC
> votes are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.4.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.4.0-rc4 (commit
> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
> > 

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-22 Thread Sean Owen
Provisionally looking good to me, but I had a few questions.

We have these open for 2.4, but I presume they aren't actually going
to be in 2.4 and should be untargeted:

SPARK-25507 Update documents for the new features in 2.4 release
SPARK-25179 Document the features that require Pyarrow 0.10
SPARK-25783 Spark shell fails because of jline incompatibility
SPARK-25347 Document image data source in doc site
SPARK-25584 Document libsvm data source in doc site
SPARK-25346 Document Spark builtin data sources
SPARK-24464 Unit tests for MLlib's Instrumentation
SPARK-23197 Flaky test: spark.streaming.ReceiverSuite."receiver_life_cycle"
SPARK-22809 pyspark is sensitive to imports with dots
SPARK-21030 extend hint syntax to support any expression for Python and R

Comments in several of the doc issues suggest they are needed for 2.4
though. How essential?

(Brief digression: SPARK-21030 is an example of a pattern I see
sometimes. Parent Epic A is targeted for version X. Children B and C
are not. Epic A's description is basically "do X and Y". Is the parent
helping? And now that Y is done, is there a point in tracking X with
two JIRAs? can I just close the Epic?)

I am not sure I've tried running K8S in my test runs before, but I get
this on my Linux machine:

[INFO] --- exec-maven-plugin:1.4.0:exec (setup-integration-test-env) @
spark-kubernetes-integration-tests_2.12 ---
fatal: not a git repository (or any of the parent directories): .git
tar (child): --strip-components=1: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
scripts/setup-integration-test-env.sh: line 85:
/home/srowen/spark-2.4.0/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/bin/docker-image-tool.sh:
No such file or directory
/home/srowen/spark-2.4.0/resource-managers/kubernetes/integration-tests
[INFO]
[INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
spark-kubernetes-integration-tests_2.12 ---
Discovery starting.
Discovery completed in 289 milliseconds.
Run starting. Expected test count is: 14
KubernetesSuite:
org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite *** ABORTED ***
  java.lang.NullPointerException:
  at 
org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:92)
  at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
  at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
  at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
  at 
org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.org$scalatest$BeforeAndAfter$$super$run(KubernetesSuite.scala:39)
  at org.scalatest.BeforeAndAfter.run(BeforeAndAfter.scala:258)
  at org.scalatest.BeforeAndAfter.run$(BeforeAndAfter.scala:256)
  at 
org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.run(KubernetesSuite.scala:39)
  at org.scalatest.Suite.callExecuteOnSuite$1(Suite.scala:1210)
  at org.scalatest.Suite.$anonfun$runNestedSuites$1(Suite.scala:1257)
  ...

Clearly it's expecting something about the env that isn't true, but I
don't know if it's a problem with those expectations versus what is in
the source release, or, just something to do with my env. This is with
Scala 2.12.



On Mon, Oct 22, 2018 at 12:42 PM Wenchen Fan  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.0.
>
> The vote is open until October 26 PST and passes if a majority +1 PMC votes 
> are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc4 (commit 
> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
> https://github.com/apache/spark/tree/v2.4.0-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1290
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can 

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-22 Thread Wenchen Fan
Since GitHub and Jenkins are in a chaotic state, I didn't wait for a green
Jenkins QA job for the RC4 commit. We should fail this RC if the Jenkins QA
run for it turns out to be broken (which is very unlikely).

I'm adding my own +1, all known blockers are resolved.

On Tue, Oct 23, 2018 at 1:42 AM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until October 26 PST and passes if a majority of the PMC
> votes cast are +1, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc4 (commit
> e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
> https://github.com/apache/spark/tree/v2.4.0-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1290
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env, install
> the current RC, and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before and after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Please retarget everything else to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is a regression that has not been
> correctly targeted, please ping me or a committer to help target the
> issue.
>


[VOTE] SPARK 2.4.0 (RC4)

2018-10-22 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version
2.4.0.

The vote is open until October 26 PST and passes if a majority of the PMC
votes cast are +1, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.0-rc4 (commit
e69e2bfa486d8d3b9d203b96ca9c0f37c2b6cabe):
https://github.com/apache/spark/tree/v2.4.0-rc4

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1290

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/

The list of bug fixes going into 2.4.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12342385

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env, install
the current RC, and see if anything important breaks; in Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before and after so
you don't end up building with an out-of-date RC going forward).
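
For the Java/Scala path, one minimal way to do this with sbt is sketched
below (an illustration only, assuming an sbt build; Maven works the same way
with a <repository> entry). The resolver URL is the staging repository listed
above, and the staged artifacts are published under the plain 2.4.0 version:

  // build.sbt (sketch): pull the RC artifacts from the staging repository
  resolvers += "Apache Spark 2.4.0 RC4 staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1290/"

  // Build your existing code against the RC; "provided" matches the usual
  // setup where the cluster supplies Spark at runtime.
  libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"

After testing, clear org.apache.spark from your local ivy/maven caches so
later builds don't silently keep resolving against the RC.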

===
What should happen to JIRA tickets still targeting 2.4.0?
===

The current list of open tickets targeted at 2.4.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.4.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Please retarget everything else to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is a regression that has not been
correctly targeted, please ping me or a committer to help target the
issue.