Thank you for sharing your opinions, Jacky, Maxim, Holden, Jungtaek, Yi,
Tom, Gabor, Felix.
Per the above discussion, I also want to include both `New Features` and
`Improvements` together.
When I checked the item status as of today, it looked like the following.
In short, I explicitly removed K8s GA and DSv2 Stabilization from the
ON-TRACK list according to the concerns raised. For those items, we can try
to build a consensus for Apache Spark 3.2 (June 2021) or later.
ON-TRACK
1. Support Scala 2.13 (SPARK-25075)
2. Use Apache Hadoop 3.2 by default for better cloud support (SPARK-32058)
3. Stage Level Scheduling (SPARK-27495)
4. Support more filter pushdown (CSV pushdown already shipped via
SPARK-30323 in 3.0)
- Support filter pushdown to JSON (SPARK-30648 in 3.1)
- Support filter pushdown to Avro (SPARK-XXX in 3.1)
- Support nested attributes in filters pushed down to JSON
5. Support JDBC Kerberos w/ keytab (SPARK-12312)
NICE TO HAVE OR DEFERRED TO APACHE SPARK 3.2
1. Declaring Kubernetes Scheduler GA
- Should we also consider the shuffle service refactoring to support
pluggable storage engines as targeting the 3.1 release? (Holden)
- I think pluggable storage in shuffle is essential for k8s GA (Felix)
- Use remote storage for persisting shuffle data (SPARK-25299)
2. DSv2 Stabilization? (the following and more)
- SPARK-31357 Catalog API for view metadata
- SPARK-31694 Add SupportsPartitions Catalog APIs on DataSourceV2
As we know, we work willingly and voluntarily. If something lands on the
`master` branch before the feature freeze (November), it will of course be
part of Apache Spark 3.1.
Thanks,
Dongjoon.
On Sun, Jul 5, 2020 at 12:21 PM Felix Cheung <[email protected]>
wrote:
> I think pluggable storage in shuffle is essential for k8s GA
>
> ------------------------------
> *From:* Holden Karau <[email protected]>
> *Sent:* Monday, June 29, 2020 9:33 AM
> *To:* Maxim Gekk
> *Cc:* Dongjoon Hyun; dev
> *Subject:* Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)
>
> Should we also consider the shuffle service refactoring to support
> pluggable storage engines as targeting the 3.1 release?
>
> On Mon, Jun 29, 2020 at 9:31 AM Maxim Gekk <[email protected]>
> wrote:
>
>> Hi Dongjoon,
>>
>> I would add:
> >> - Filter pushdown to JSON (https://github.com/apache/spark/pull/27366)
> >> - Filter pushdown to other datasources like Avro
> >> - Support nested attributes in filters pushed down to JSON
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>>
>> On Mon, Jun 29, 2020 at 7:07 PM Dongjoon Hyun <[email protected]>
>> wrote:
>>
>>> Hi, All.
>>>
> >>> After a short celebration of Apache Spark 3.0, I'd like to ask for the
> >>> community's opinion on Apache Spark 3.1 feature expectations.
>>>
>>> First of all, Apache Spark 3.1 is scheduled for December 2020.
>>> - https://spark.apache.org/versioning-policy.html
>>>
>>> I'm expecting the following items:
>>>
>>> 1. Support Scala 2.13
>>> 2. Use Apache Hadoop 3.2 by default for better cloud support
>>> 3. Declaring Kubernetes Scheduler GA
> >>> From my perspective, the last main missing piece is dynamic
> >>> allocation:
> >>> - Dynamic allocation with shuffle tracking already shipped in 3.0.
> >>> - Dynamic allocation with worker decommission/data migration is
> >>> targeting 3.1. (Thanks, Holden)
>>> 4. DSv2 Stabilization
>>>
> >>> I'm aware of some more features currently on the way, but I'd love to
> >>> hear opinions from the main developers and, moreover, from the main
> >>> users who need those features.
>>>
> >>> Thank you in advance. Any comments are welcome.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>