[RESULT][VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-21 Thread Vinoo Ganesh
// Fixing Subject

Results of the voting:

Binding +1s: 5 (Tom Graves,  Dongjoon Hyun, Felix Cheung, Saisai Shao, Imran 
Rashid)

Non-Binding +1s: 8

-1 from PMC members: 0

Per PMC / SPIP Voting Rules 
(https://spark.apache.org/improvement-proposals.html), given that the vote has 
been open for >72 hours and 3 +1 binding votes have been received, this SPIP 
passes.

Thanks everyone.
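
The pass condition stated in this result can be written as a small check. This is only a sketch: the `VoteTally` type and `spip_passes` function are hypothetical names, "3 +1 binding votes" is read as "at least 3", the rule that any PMC -1 blocks the vote is an assumption, and `hours_open` is an illustrative value rather than a measured one.

```python
from dataclasses import dataclass


@dataclass
class VoteTally:
    binding_plus_ones: int      # +1 votes from PMC members
    non_binding_plus_ones: int  # +1 votes from everyone else (informational only)
    pmc_minus_ones: int         # -1 votes from PMC members
    hours_open: float           # how long the vote has been open


def spip_passes(tally: VoteTally, min_binding: int = 3, min_hours: float = 72) -> bool:
    """Return True when the pass condition described in this message holds."""
    return (
        tally.hours_open > min_hours
        and tally.binding_plus_ones >= min_binding
        and tally.pmc_minus_ones == 0  # assumption: any PMC -1 blocks the vote
    )


# Counts are taken from the result above; hours_open is an illustrative value.
result = VoteTally(binding_plus_ones=5, non_binding_plus_ones=8,
                   pmc_minus_ones=0, hours_open=100.0)
print(spip_passes(result))
```

Non-binding votes are carried in the tally but deliberately do not affect the outcome.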


From: Vinoo Ganesh
Date: Friday, June 21, 2019 at 13:44
To: Tom Graves, dhruve ashar, John Zhuge, "Guo, Chenzhao"
Cc: Felix Cheung, Yinan Li, "rb...@netflix.com", Dongjoon Hyun, Saisai Shao, 
Imran Rashid, Ilan Filonenko, bo yang, Matt Cheah, Spark Dev List, 
"Yifei Huang (PD)", Imran Rashid
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

Results of the voting:

Binding +1s: 5 (Tom Graves,  Dongjoon Hyun, Felix Cheung, Saisai Shao, Imran 
Rashid)

Non-Binding +1s: 8

-1 from PMC members: 0

Per PMC / SPIP Voting Rules 
(https://spark.apache.org/improvement-proposals.html), given that the vote has 
been open for >72 hours and 3 +1 binding votes have been received, this SPIP 
passes.

Thanks everyone.

From: Tom Graves
Date: Friday, June 21, 2019 at 13:02
To: dhruve ashar, John Zhuge, "Guo, Chenzhao"
Cc: Vinoo Ganesh, Felix Cheung, Yinan Li, "rb...@netflix.com", Dongjoon Hyun, 
Saisai Shao, Imran Rashid, Ilan Filonenko, bo yang, Matt Cheah, Spark Dev List, 
"Yifei Huang (PD)", Imran Rashid
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

+1 (binding)

I haven't looked at the low level api, but like the idea and approach to get it 
started.

Tom

On Tuesday, June 18, 2019, 10:40:34 PM CDT, Guo, Chenzhao wrote:



Cool : )



+1 (non-binding)



Chenzhao



From: dhruve ashar [mailto:dhruveas...@gmail.com]
Sent: Wednesday, June 19, 2019 2:58 AM
To: John Zhuge
Cc: Vinoo Ganesh; Felix Cheung; Yinan Li; rb...@netflix.com; Dongjoon Hyun; 
Saisai Shao; Imran Rashid; Ilan Filonenko; bo yang; Matt Cheah; Spark Dev List; 
Yifei Huang (PD); Imran Rashid
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API



+1 (non-binding)



On Tue, Jun 18, 2019 at 12:12 PM John Zhuge <john.zh...@gmail.com> wrote:

+1 (non-binding)  Great work!



On Tue, Jun 18, 2019 at 6:22 AM Vinoo Ganesh <vgan...@palantir.com> wrote:

+1 (non-binding).



Thanks for pushing this forward, Matt and Yifei.



From: Felix Cheung <felixcheun...@hotmail.com>
Date: Tuesday, June 18, 2019 at 00:01
To: Yinan Li <liyinan...@gmail.com>, "rb...@netflix.com" <rb...@netflix.com>
Cc: Dongjoon Hyun <dongjoon.h...@gmail.com>, Saisai Shao <sai.sai.s...@gmail.com>, 
Imran Rashid <im...@therashids.com>, Ilan Filonenko <i...@cornell.edu>, 
bo yang <bobyan...@gmail.com>, Matt Cheah <mch...@palantir.com>, 
Spark Dev List <dev@spark.apache.org>, "Yifei Huang (PD)" <yif...@palantir.com>, 
Vinoo Ganesh <vgan...@palantir.com>, Imran Rashid <iras...@cloudera.com>
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API



+1



Glad to see the progress in this space - it’s been more than a year since the 
original discussion and effort started.





From: Yinan Li <liyinan...@gmail.com>
Sent: Monday, June 17, 2019 7:14:42 PM
To: rb...@netflix.com
Cc: Dongjoon Hyun; Saisai Shao; Imran Rashid; Ilan Filonenko; bo yang; Matt 
Cheah; Spark Dev List; Yifei Huang (PD); Vinoo Ganesh; Imran Rashid
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API



+1 (non-binding)



On Mon, Jun 17, 2019 at 1:58 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

+1 (non-binding)



On Sun, Jun 16, 2019 at 11:11 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

+1



Bests,

Dongjoon.





On Sun, Jun 16, 2019 at 9:41 PM Saisai Shao <sai.sai.s...@gmail.com> wrote:

+1 (binding)



Thanks

Saisai



On Sat, Jun 15, 2019 at 3:46 AM Imran Rashid <im...@therashids.com> wrote:

+1 (binding)

I think this is a really important feature for spark.

First, there is already a lot of interest in alternative shuffle storage in the 
community, from dynamic allocation in Kubernetes to simply improving stability 
in standard on-premise use of 

Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-21 Thread Vinoo Ganesh
Results of the voting:

Binding +1s: 5 (Tom Graves,  Dongjoon Hyun, Felix Cheung, Saisai Shao, Imran 
Rashid)

Non-Binding +1s: 8

-1 from PMC members: 0

Per PMC / SPIP Voting Rules 
(https://spark.apache.org/improvement-proposals.html), given that the vote has 
been open for >72 hours and 3 +1 binding votes have been received, this SPIP 
passes.

Thanks everyone.

From: Tom Graves
Date: Friday, June 21, 2019 at 13:02
To: dhruve ashar, John Zhuge, "Guo, Chenzhao"
Cc: Vinoo Ganesh, Felix Cheung, Yinan Li, "rb...@netflix.com", Dongjoon Hyun, 
Saisai Shao, Imran Rashid, Ilan Filonenko, bo yang, Matt Cheah, Spark Dev List, 
"Yifei Huang (PD)", Imran Rashid
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

+1 (binding)

I haven't looked at the low level api, but like the idea and approach to get it 
started.

Tom

On Tuesday, June 18, 2019, 10:40:34 PM CDT, Guo, Chenzhao wrote:



Cool : )



+1 (non-binding)



Chenzhao



From: dhruve ashar [mailto:dhruveas...@gmail.com]
Sent: Wednesday, June 19, 2019 2:58 AM
To: John Zhuge
Cc: Vinoo Ganesh; Felix Cheung; Yinan Li; rb...@netflix.com; Dongjoon Hyun; 
Saisai Shao; Imran Rashid; Ilan Filonenko; bo yang; Matt Cheah; Spark Dev List; 
Yifei Huang (PD); Imran Rashid
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API



+1 (non-binding)



On Tue, Jun 18, 2019 at 12:12 PM John Zhuge <john.zh...@gmail.com> wrote:

+1 (non-binding)  Great work!



On Tue, Jun 18, 2019 at 6:22 AM Vinoo Ganesh <vgan...@palantir.com> wrote:

+1 (non-binding).



Thanks for pushing this forward, Matt and Yifei.



From: Felix Cheung <felixcheun...@hotmail.com>
Date: Tuesday, June 18, 2019 at 00:01
To: Yinan Li <liyinan...@gmail.com>, "rb...@netflix.com" <rb...@netflix.com>
Cc: Dongjoon Hyun <dongjoon.h...@gmail.com>, Saisai Shao <sai.sai.s...@gmail.com>, 
Imran Rashid <im...@therashids.com>, Ilan Filonenko <i...@cornell.edu>, 
bo yang <bobyan...@gmail.com>, Matt Cheah <mch...@palantir.com>, 
Spark Dev List <dev@spark.apache.org>, "Yifei Huang (PD)" <yif...@palantir.com>, 
Vinoo Ganesh <vgan...@palantir.com>, Imran Rashid <iras...@cloudera.com>
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API



+1



Glad to see the progress in this space - it’s been more than a year since the 
original discussion and effort started.





From: Yinan Li <liyinan...@gmail.com>
Sent: Monday, June 17, 2019 7:14:42 PM
To: rb...@netflix.com
Cc: Dongjoon Hyun; Saisai Shao; Imran Rashid; Ilan Filonenko; bo yang; Matt 
Cheah; Spark Dev List; Yifei Huang (PD); Vinoo Ganesh; Imran Rashid
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API



+1 (non-binding)



On Mon, Jun 17, 2019 at 1:58 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

+1 (non-binding)



On Sun, Jun 16, 2019 at 11:11 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

+1



Bests,

Dongjoon.





On Sun, Jun 16, 2019 at 9:41 PM Saisai Shao <sai.sai.s...@gmail.com> wrote:

+1 (binding)



Thanks

Saisai



On Sat, Jun 15, 2019 at 3:46 AM Imran Rashid <im...@therashids.com> wrote:

+1 (binding)

I think this is a really important feature for spark.

First, there is already a lot of interest in alternative shuffle storage in the 
community, from dynamic allocation in Kubernetes to simply improving stability 
in standard on-premise use of Spark.  However, people are often stuck doing 
this in forks of Spark, in ways that are not maintainable (because they 
copy-paste many Spark internals) or are incorrect (because they do not 
correctly handle speculative execution and stage retries).

Second, I think the specific proposal is good for finding the right balance 
between flexibility and too much complexity, to allow incremental improvements. 
 A lot of work has been put into this already to try to figure out which pieces 
are essential to make alternative shuffle storage implementations feasible.

Of course, that means it doesn't include everything imaginable; some things 
still aren't supported, and some users will still choose the older 
ShuffleManager API to get total control over all of shuffle.  But we know 
there is a reasonable set of things that can be implemented behind the API as 
a first step, and it can continue to evolve.



On Fri, Jun 14, 2019 at 12:13 PM Ilan Filonenko <i...@cornell.edu> wrote:

+1 (non-binding). This API is versatile and flexible enough to handle 
Bloomberg's internal use-cases. The ability for us to vary implementation 
strategies is quite appealing. It is also worth noting how few changes to 
Spark core are needed to make it work. This is a much-needed addition to 
the Spark shuffle story.



On Fri, Jun 14, 2019 at 9:59 AM bo yang 

Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-21 Thread Tom Graves
 +1 (binding)
I haven't looked at the low level api, but like the idea and approach to get it 
started.
Tom
On Tuesday, June 18, 2019, 10:40:34 PM CDT, Guo, Chenzhao wrote:
 
Cool : )
 
  
 
+1 (non-binding)
 
  
 
Chenzhao
 
  
 
From: dhruve ashar [mailto:dhruveas...@gmail.com]
Sent: Wednesday, June 19, 2019 2:58 AM
To: John Zhuge
Cc: Vinoo Ganesh; Felix Cheung; Yinan Li; rb...@netflix.com; Dongjoon Hyun; 
Saisai Shao; Imran Rashid; Ilan Filonenko; bo yang; Matt Cheah; Spark Dev List; 
Yifei Huang (PD); Imran Rashid
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API
 
  
 
+1 (non-binding)
 
  
 
On Tue, Jun 18, 2019 at 12:12 PM John Zhuge  wrote:
 

+1 (non-binding)  Great work!
 
  
 
On Tue, Jun 18, 2019 at 6:22 AM Vinoo Ganesh  wrote:
 

+1 (non-binding).
 
 
 
Thanks for pushing this forward, Matt and Yifei.
 
 
 
From: Felix Cheung
Date: Tuesday, June 18, 2019 at 00:01
To: Yinan Li, "rb...@netflix.com"
Cc: Dongjoon Hyun, Saisai Shao, Imran Rashid, Ilan Filonenko, bo yang, 
Matt Cheah, Spark Dev List, "Yifei Huang (PD)", Vinoo Ganesh, Imran Rashid
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API
 
 
 
+1
 
 
 
Glad to see the progress in this space - it’s been more than a year since the 
original discussion and effort started.
 
 
 
From: Yinan Li 
Sent: Monday, June 17, 2019 7:14:42 PM
To: rb...@netflix.com
Cc: Dongjoon Hyun; Saisai Shao; Imran Rashid; Ilan Filonenko; bo yang; Matt 
Cheah; Spark Dev List; Yifei Huang (PD); Vinoo Ganesh; Imran Rashid
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API 
 
 
 
+1 (non-binding) 
 
 
 
On Mon, Jun 17, 2019 at 1:58 PM Ryan Blue  wrote:
 

+1 (non-binding)
 
 
 
On Sun, Jun 16, 2019 at 11:11 PM Dongjoon Hyun  wrote:
 

+1
 
 
 
Bests,
 
Dongjoon.
 
 
 
 
 
On Sun, Jun 16, 2019 at 9:41 PM Saisai Shao  wrote:
 

+1 (binding)
 
 
 
Thanks
 
Saisai
 
 
 
On Sat, Jun 15, 2019 at 3:46 AM Imran Rashid wrote:
 

+1 (binding)

I think this is a really important feature for spark.

First, there is already a lot of interest in alternative shuffle storage in the 
community, from dynamic allocation in Kubernetes to simply improving stability 
in standard on-premise use of Spark.  However, people are often stuck doing 
this in forks of Spark, in ways that are not maintainable (because they 
copy-paste many Spark internals) or are incorrect (because they do not 
correctly handle speculative execution and stage retries).

Second, I think the specific proposal is good for finding the right balance 
between flexibility and too much complexity, to allow incremental improvements. 
 A lot of work has been put into this already to try to figure out which pieces 
are essential to make alternative shuffle storage implementations feasible.

Of course, that means it doesn't include everything imaginable; some things 
still aren't supported, and some users will still choose the older 
ShuffleManager API to get total control over all of shuffle.  But we know 
there is a reasonable set of things that can be implemented behind the API as 
a first step, and it can continue to evolve.
 
 
 
On Fri, Jun 14, 2019 at 12:13 PM Ilan Filonenko  wrote:
 

+1 (non-binding). This API is versatile and flexible enough to handle 
Bloomberg's internal use-cases. The ability for us to vary implementation 
strategies is quite appealing. It is also worth noting how few changes to 
Spark core are needed to make it work. This is a much-needed addition to 
the Spark shuffle story.
 
 
 
On Fri, Jun 14, 2019 at 9:59 AM bo yang  wrote:
 

+1 This is great work, allowing plugin of different sort shuffle write/read 
implementation! Also great to see it retain the current Spark configuration 

Re: DSv1 removal

2019-06-21 Thread Gabor Somogyi
Hi Ryan,

Thanks for the explanation! This sheds light on some areas but also triggered
some questions.

My main conclusion on the Kafka connector side is to keep v1 as the
default, give users some time to migrate to v2, and delete v1 later when
it's stable (which makes sense from my perspective).

The interesting part is that the Kafka microbatch source already uses v2 as
the default, and I don't fully understand how that fits into this.
Please see this test:
https://github.com/apache/spark/blob/54da3bbfb2c936827897c52ed6e5f0f428b98e9f/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala#L1084
Since https://issues.apache.org/jira/browse/SPARK-23362 was merged into 2.4, it
shouldn't be breaking (I assume the batch part should be similar).

We can continue the discussion about the Kafka batch v1/v2 default on
https://github.com/apache/spark/pull/24738 to avoid spamming everybody.

Please send me an invite to the sync meeting. I'm not sure exactly when it
happens, but I presume it's at night from a CET timezone perspective.
I'll try to organize my time to participate...

BR,
G


On Thu, Jun 20, 2019 at 8:24 PM Ryan Blue  wrote:

> Hi Gabor,
>
> First, a little context... one of the goals of DSv2 is to standardize the
> behavior of SQL operations in Spark. For example, running CTAS when a table
> exists will fail, not take some action depending on what the source
> chooses, like drop & CTAS, inserting, or failing.
>
> Unfortunately, this means that DSv1 can't be easily replaced because it
> has behavior differences between sources. In addition, we're not really
> sure how DSv1 works in all cases -- it really depends on what seemed
> reasonable to authors at the time. For example, we don't have a good
> understanding of how file-based tables behave (those not backed by a
> Metastore). There are also changes that we know are breaking and are okay
> with, like only inserting safe casts when writing with v2.
>
> Because of this, we can't just replace v1 with v2 transparently, so the
> plan is to allow deployments to migrate to v2 in stages. Here's the plan:
> 1. Use v1 by default so all existing queries work as they do today for
> identifiers like `db.table`
> 2. Allow users to add additional v2 catalogs that will be used when
> identifiers specifically start with one, like `test_catalog.db.table`
> 3. Add a v2 catalog that delegates to the session catalog, so that v2
> read/write implementations can be used, but are stored just like v1 tables
> in the session catalog
> 4. Add a setting to use a v2 catalog as the default. Setting this would
> use a v2 catalog for all identifiers without a catalog, like `db.table`
> 5. Add a way for a v2 catalog to return a table that gets converted to v1.
> This is what `CatalogTableAsV2` does in #24768.
>
> PR #24768 implements the
> rest of these changes. Specifically, we initially used the default catalog
> for v2 sources, but that causes namespace problems, so we need the v2
> session catalog (point #3) as the default when there is no default v2
> catalog.
>
> I hope that answers your question. If not, I'm happy to answer follow-ups
> and we can add this as a topic in the next v2 sync on Wednesday. I'm also
> planning on talking about metadata columns or function push-down from the
> Kafka v2 PR at that sync, so you may want to attend.
>
> rb
>
>
> On Thu, Jun 20, 2019 at 4:45 AM Gabor Somogyi 
> wrote:
>
>> Hi All,
>>
>>   I've taken a look at the code and docs to find out when DSv1 sources
>> have to be removed (once a DSv2 replacement is implemented). After some
>> digging I've found that some DSv1 sources have already been removed, but in
>> other cases v1 and v2 still exist in parallel.
>>
>> Can somebody please tell me what's the overall plan in this area?
>>
>> BR,
>> G
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
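
The staged lookup in steps 1, 2 and 4 of the quoted plan can be sketched as a toy resolver. This is not Spark's actual resolution code: `CatalogResolver`, `SESSION_CATALOG`, and the field names are hypothetical and only illustrate the order in which an identifier would be matched, assuming a registered catalog name always wins over the default.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

SESSION_CATALOG = "session"  # hypothetical name for the built-in session catalog


@dataclass
class CatalogResolver:
    """Toy model of the staged lookup; not Spark's actual resolver."""
    v2_catalogs: Set[str] = field(default_factory=set)  # step 2: registered v2 catalogs
    default_v2_catalog: Optional[str] = None            # step 4: optional default setting

    def resolve(self, identifier: str) -> Tuple[str, List[str]]:
        parts = identifier.split(".")
        # Step 2: an identifier that starts with a registered catalog name uses it.
        if len(parts) > 1 and parts[0] in self.v2_catalogs:
            return parts[0], parts[1:]
        # Step 4: otherwise use the configured default v2 catalog, if any.
        if self.default_v2_catalog is not None:
            return self.default_v2_catalog, parts
        # Step 1: fall back to the session catalog so existing queries keep working.
        return SESSION_CATALOG, parts


resolver = CatalogResolver(v2_catalogs={"test_catalog"})
print(resolver.resolve("db.table"))
print(resolver.resolve("test_catalog.db.table"))
```

With no default configured, `db.table` stays on the session catalog, while `test_catalog.db.table` is routed to the v2 catalog; setting `default_v2_catalog` changes only the first case.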