Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle Data

2018-09-20 Thread Felix Cheung
Hi
+baibing3
+huangtao6

Came across your presentation on Alluxio - including shuffling - would you be 
interested in this?



From: Matt Cheah 
Sent: Tuesday, September 4, 2018 2:54 PM
To: Yuanjian Li
Cc: Spark dev list
Subject: Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for 
Persisting Shuffle Data

Yuanjian, Thanks for sharing your progress! I was wondering if there was any 
prototype code that we could read to get an idea of what the implementation 
looks like? We can evaluate the design together and also benchmark workloads 
from across the community – that is, we can collect more data from more Spark 
users.

The experience would be greatly appreciated in the discussion.

-Matt Cheah

From: Yuanjian Li 
Date: Friday, August 31, 2018 at 8:29 PM
To: Matt Cheah 
Cc: Spark dev list 
Subject: Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for 
Persisting Shuffle Data

Hi Matt,
 Thanks for the great document and proposal. I want to +1 reliable shuffle 
data and give some feedback.
 I think a reliable shuffle service based on a DFS is necessary for Spark, 
especially when running Spark jobs in an unstable environment. For example, 
when Spark is co-deployed with online services, Spark executors can be killed 
at any time. The current stage retry strategy then makes the job many times 
slower than a normal job.
 Actually, we (Baidu Inc.) solved this problem with a stable shuffle service 
over Hadoop, and we are now integrating Spark with this shuffle service. The 
POC work is expected to be done in October. We'll post more benchmarks and 
detailed work at that time. I'm still reading your discussion document and am 
happy to give more feedback in the doc.

Thanks,
Yuanjian Li

Matt Cheah <mch...@palantir.com> wrote on Saturday, September 1, 2018 at 8:42 AM:
Hi everyone,

I filed SPARK-25299 
[issues.apache.org]<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_SPARK-2D25299=DwMFaQ=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oOnmz8=hzwIMNQ9E99EMYGuqHI0kXhVbvX3nU3OSDadUnJxjAs=aWBmhsrm7S7YT8YUwf0fphAsQ-piBw9ENlRn2ojrs9U=QmUpw5K6D-6ot7Kel1_RhXKdr7Rk_fXgqoaeIZN-kes=>
 to promote discussion on how we can improve the shuffle operation in Spark. 
The basic premise is to discuss the ways we can leverage distributed storage to 
improve the reliability and isolation of Spark’s shuffle architecture.

A few designs and a full problem statement are outlined in this architecture 
discussion document 
[docs.google.com]<https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.google.com_document_d_1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-2DrVHSM_edit-23heading-3Dh.btqugnmt2h40=DwMFaQ=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oOnmz8=hzwIMNQ9E99EMYGuqHI0kXhVbvX3nU3OSDadUnJxjAs=aWBmhsrm7S7YT8YUwf0fphAsQ-piBw9ENlRn2ojrs9U=d60j5-gfmUL6SeNwkEdWAR8IYOQd3UXHJ20XwUtteew=>.

This is a complex problem and it would be great to get feedback from the 
community about the right direction to take this work in. Note that we have not 
yet committed to a specific implementation and architecture – there’s a lot 
that needs to be discussed for this improvement, so we hope to get as much 
input as possible before moving forward with a design.

Please feel free to leave comments and suggestions on the JIRA ticket or on the 
discussion document.

Thank you!

-Matt Cheah


Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle Data

2018-09-04 Thread Matt Cheah
Yuanjian, Thanks for sharing your progress! I was wondering if there was any 
prototype code that we could read to get an idea of what the implementation 
looks like? We can evaluate the design together and also benchmark workloads 
from across the community – that is, we can collect more data from more Spark 
users.

The experience would be greatly appreciated in the discussion.

-Matt Cheah

From: Yuanjian Li 
Date: Friday, August 31, 2018 at 8:29 PM
To: Matt Cheah 
Cc: Spark dev list 
Subject: Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for 
Persisting Shuffle Data

 

Hi Matt,

 Thanks for the great document and proposal. I want to +1 reliable shuffle 
data and give some feedback.

 I think a reliable shuffle service based on a DFS is necessary for Spark, 
especially when running Spark jobs in an unstable environment. For example, 
when Spark is co-deployed with online services, Spark executors can be killed 
at any time. The current stage retry strategy then makes the job many times 
slower than a normal job.

 Actually, we (Baidu Inc.) solved this problem with a stable shuffle service 
over Hadoop, and we are now integrating Spark with this shuffle service. The 
POC work is expected to be done in October. We'll post more benchmarks and 
detailed work at that time. I'm still reading your discussion document and am 
happy to give more feedback in the doc.

Thanks,
Yuanjian Li

 

Matt Cheah wrote on Saturday, September 1, 2018 at 8:42 AM:

Hi everyone,

I filed SPARK-25299 [issues.apache.org] to promote discussion on how we can 
improve the shuffle operation in Spark. The basic premise is to discuss the 
ways we can leverage distributed storage to improve the reliability and 
isolation of Spark’s shuffle architecture.

A few designs and a full problem statement are outlined in this architecture 
discussion document [docs.google.com].

This is a complex problem and it would be great to get feedback from the 
community about the right direction to take this work in. Note that we have not 
yet committed to a specific implementation and architecture – there’s a lot 
that needs to be discussed for this improvement, so we hope to get as much 
input as possible before moving forward with a design.

Please feel free to leave comments and suggestions on the JIRA ticket or on the 
discussion document.

Thank you!

-Matt Cheah





Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle Data

2018-08-31 Thread Yuanjian Li
Hi Matt,
 Thanks for the great document and proposal. I want to +1 reliable shuffle
data and give some feedback.
 I think a reliable shuffle service based on a DFS is necessary for Spark,
especially when running Spark jobs in an unstable environment. For example,
when Spark is co-deployed with online services, Spark executors can be killed
at any time. The current stage retry strategy then makes the job many times
slower than a normal job.
 Actually, we (Baidu Inc.) solved this problem with a stable shuffle service
over Hadoop, and we are now integrating Spark with this shuffle service. The
POC work is expected to be done in October. We'll post more benchmarks and
detailed work at that time. I'm still reading your discussion document and am
happy to give more feedback in the doc.

Thanks,
Yuanjian Li

Matt Cheah wrote on Saturday, September 1, 2018 at 8:42 AM:

> Hi everyone,
>
>
>
> I filed SPARK-25299 
> to promote discussion on how we can improve the shuffle operation in Spark.
> The basic premise is to discuss the ways we can leverage distributed
> storage to improve the reliability and isolation of Spark’s shuffle
> architecture.
>
>
>
> A few designs and a full problem statement are outlined in this architecture
> discussion document.
>
>
>
> This is a complex problem and it would be great to get feedback from the
> community about the right direction to take this work in. Note that we have
> not yet committed to a specific implementation and architecture – there’s a
> lot that needs to be discussed for this improvement, so we hope to get as
> much input as possible before moving forward with a design.
>
>
>
> Please feel free to leave comments and suggestions on the JIRA ticket or
> on the discussion document.
>
>
>
> Thank you!
>
>
>
> -Matt Cheah
>


[Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle Data

2018-08-31 Thread Matt Cheah
Hi everyone,

I filed SPARK-25299 to promote discussion on how we can improve the shuffle 
operation in Spark. The basic premise is to discuss the ways we can leverage 
distributed storage to improve the reliability and isolation of Spark’s shuffle 
architecture.

A few designs and a full problem statement are outlined in this architecture 
discussion document.

This is a complex problem and it would be great to get feedback from the 
community about the right direction to take this work in. Note that we have not 
yet committed to a specific implementation and architecture – there’s a lot 
that needs to be discussed for this improvement, so we hope to get as much 
input as possible before moving forward with a design.

Please feel free to leave comments and suggestions on the JIRA ticket or on the 
discussion document.

Thank you!

-Matt Cheah


