Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-07-05 Thread Dongjoon Hyun
Thank you for sharing your opinions, Jacky, Maxim, Holden, Jungtaek, Yi,
Tom, Gabor, Felix.

Per the discussion above, I also want to include both `New Features` and
`Improvements` together.

When I checked the item status as of today, it looked like the following.
In short, I explicitly removed K8s GA and DSv2 Stabilization from the
ON-TRACK list based on the concerns raised. For those items, we can try to
build a consensus for Apache Spark 3.2 (June 2021) or later.

ON-TRACK
1. Support Scala 2.13 (SPARK-25075)
2. Use Apache Hadoop 3.2 by default for better cloud support (SPARK-32058)
3. Stage Level Scheduling (SPARK-27495)
4. Support more filter pushdown (CSV is already shipped by SPARK-30323 in
3.0)
- Support filters pushdown to JSON (SPARK-30648 in 3.1)
- Support filters pushdown to Avro (SPARK-XXX in 3.1)
- Support nested attributes of filters pushed down to JSON
5. Support JDBC Kerberos w/ keytab (SPARK-12312)
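To make item 4 concrete, here is a toy illustration of what filter pushdown buys (this is not Spark's actual DSv2 API, just a sketch of the idea): a source that receives the predicate can prune rows at scan time, so fewer rows are ever materialized for the engine.

```python
# Toy sketch of filter pushdown (illustrative only, not Spark's DSv2 API).

ROWS = [
    {"name": "a", "age": 10},
    {"name": "b", "age": 25},
    {"name": "c", "age": 40},
]

def scan_without_pushdown(rows, predicate):
    # The engine reads every row first, then filters afterwards:
    # all rows are materialized regardless of the predicate.
    materialized = list(rows)  # full scan
    return [r for r in materialized if predicate(r)], len(materialized)

def scan_with_pushdown(rows, predicate):
    # The source applies the predicate while reading: only matching
    # rows are materialized and handed to the engine.
    out = [r for r in rows if predicate(r)]
    return out, len(out)

pred = lambda r: r["age"] > 20
no_push, rows_read_full = scan_without_pushdown(ROWS, pred)
pushed, rows_read_pushed = scan_with_pushdown(ROWS, pred)
assert no_push == pushed              # same answer either way
assert rows_read_pushed < rows_read_full  # pushdown materializes fewer rows
```

The JSON/Avro work tracked above extends this same idea to those file formats, including (per Maxim's list) predicates over nested attributes.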

NICE TO HAVE OR DEFERRED TO APACHE SPARK 3.2
1. Declaring Kubernetes Scheduler GA
- Should we also consider the shuffle service refactoring to support
pluggable storage engines as targeting the 3.1 release? (Holden)
- I think pluggable storage in shuffle is essential for k8s GA (Felix)
- Use remote storage for persisting shuffle data (SPARK-25299)
2. DSv2 Stabilization? (the following, and more)
- SPARK-31357 Catalog API for view metadata
- SPARK-31694 Add SupportsPartitions Catalog APIs on DataSourceV2
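The "pluggable storage engines" idea raised under item 1 can be sketched roughly as follows. This is an assumption about the shape of such an interface, not Spark's actual design: the point is that if the shuffle manager only talks to an abstract storage interface, a remote backend can keep shuffle data alive across executor loss, which is what matters for dynamic scaling on K8s.

```python
# Toy sketch of a pluggable shuffle-storage interface (an assumption of
# what "pluggable storage engines" could look like, not Spark's design).

from abc import ABC, abstractmethod

class ShuffleStorage(ABC):
    @abstractmethod
    def write_block(self, block_id: str, data: bytes) -> None: ...
    @abstractmethod
    def read_block(self, block_id: str) -> bytes: ...

class LocalDiskStorage(ShuffleStorage):
    # Today's default: blocks live with the executor, so they are
    # lost when the executor (or K8s pod) goes away.
    def __init__(self):
        self._blocks = {}
    def write_block(self, block_id, data):
        self._blocks[block_id] = data
    def read_block(self, block_id):
        return self._blocks[block_id]

class RemoteStorage(ShuffleStorage):
    # With a remote backend (e.g. object storage), shuffle data
    # survives executor loss.
    def __init__(self, shared_store: dict):
        self._store = shared_store
    def write_block(self, block_id, data):
        self._store[block_id] = data
    def read_block(self, block_id):
        return self._store[block_id]

# The caller depends only on the interface, so backends can be swapped.
shared = {}
for storage in (LocalDiskStorage(), RemoteStorage(shared)):
    storage.write_block("shuffle_0_0", b"payload")
    assert storage.read_block("shuffle_0_0") == b"payload"
assert shared["shuffle_0_0"] == b"payload"  # remote copy outlives the writer
```

SPARK-25299 (referenced above) is the umbrella for making something along these lines real in Spark.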

As we know, we work willingly and voluntarily. If something lands on the
`master` branch before the feature freeze (November), it will be a part of
Apache Spark 3.1, of course.

Thanks,
Dongjoon.



Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-07-05 Thread Felix Cheung
I think pluggable storage in shuffle is essential for k8s GA




Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-07-01 Thread Gabor Somogyi
Hi Dongjoon,

I would add JDBC Kerberos support w/ keytab:
https://issues.apache.org/jira/browse/SPARK-12312

BR,
G




Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-30 Thread Tom Graves
Stage Level Scheduling - https://issues.apache.org/jira/browse/SPARK-27495

Tom

Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-29 Thread wuyi
This could be a sub-task of
https://issues.apache.org/jira/browse/SPARK-25299 (Use remote storage for
persisting shuffle data)?

It would be good if we could get the whole of SPARK-25299 into Spark 3.1.







--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-29 Thread Jungtaek Lim
Does this count only "new features" (probably major ones), or also
"improvements"? I'm aware of a couple of improvements that should ideally
be included in the next release, but if this counts only major new
features, then I don't feel they should be listed.



Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-29 Thread Holden Karau
Should we also consider the shuffle service refactoring to support
pluggable storage engines as targeting the 3.1 release?


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-29 Thread Maxim Gekk
Hi Dongjoon,

I would add:
- Filters pushdown to JSON (https://github.com/apache/spark/pull/27366)
- Filters pushdown to other datasources like Avro
- Support nested attributes of filters pushed down to JSON

Maxim Gekk

Software Engineer

Databricks, Inc.




Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-29 Thread JackyLee
Thank you for putting this forward.
Can we include support for the view and partition catalogs in version 3.1?
AFAIK, these are great features in DSv2 and Catalog. With these, we can
work well with warehouses such as Delta or Hive.

https://github.com/apache/spark/pull/28147
https://github.com/apache/spark/pull/28617

Thanks.
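For intuition, here is a toy sketch of what view metadata in a catalog covers. The names and shape are purely illustrative; the actual SPARK-31357/SPARK-31694 proposals define Java interfaces with their own signatures.

```python
# Toy sketch of a view-metadata catalog (illustrative only; not the
# proposed SPARK-31357 API).

class ViewCatalog:
    def __init__(self):
        self._views = {}

    def create_view(self, name: str, sql_text: str, schema: list):
        # A view stores its defining query text plus its output schema.
        self._views[name] = {"sql": sql_text, "schema": schema}

    def load_view(self, name: str):
        return self._views[name]

    def drop_view(self, name: str) -> bool:
        # Returns True only if the view existed.
        return self._views.pop(name, None) is not None

cat = ViewCatalog()
cat.create_view("active_users", "SELECT * FROM users WHERE active",
                ["id", "name"])
assert cat.load_view("active_users")["schema"] == ["id", "name"]
assert cat.drop_view("active_users")
assert not cat.drop_view("active_users")
```

A partition-aware catalog (SPARK-31694) extends the same pattern with create/drop/list operations over table partitions instead of views.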






Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-29 Thread Dongjoon Hyun
Hi, All.

After a short celebration of Apache Spark 3.0, I'd like to ask you the
community opinion on Apache Spark 3.1 feature expectations.

First of all, Apache Spark 3.1 is scheduled for December 2020.
- https://spark.apache.org/versioning-policy.html

I'm expecting the following items:

1. Support Scala 2.13
2. Use Apache Hadoop 3.2 by default for better cloud support
3. Declaring Kubernetes Scheduler GA
From my perspective, the last main missing piece is Dynamic allocation:
- Dynamic allocation with shuffle tracking already shipped in 3.0.
- Dynamic allocation with worker decommission/data migration is
targeting 3.1. (Thanks, Holden)
4. DSv2 Stabilization
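For reference, the shuffle-tracking flavor of dynamic allocation that shipped in 3.0 is enabled with settings along these lines (a sketch; verify the names and defaults against the configuration docs for your release):

```properties
spark.dynamicAllocation.enabled                  true
spark.dynamicAllocation.shuffleTracking.enabled  true
spark.dynamicAllocation.minExecutors             1
spark.dynamicAllocation.maxExecutors             10
```

Shuffle tracking lets executors holding shuffle data be kept alive without an external shuffle service, which is what makes this usable on Kubernetes.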

I'm aware of some more features currently on the way, but I'd love to hear
the opinions of the main developers and, moreover, of the main users who
need those features.

Thank you in advance. Any comments are welcome.

Bests,
Dongjoon.