> New pushdown API for DataSourceV2

One correction: I want to revisit the pushdown API to make sure it works for dynamic partition pruning and can be extended to support limit/aggregate/... pushdown in the future. It should be a small API update instead of a new API.
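The email doesn't spell out the API shape, but one common way to make a pushdown surface extensible is optional mix-in capability interfaces, so new kinds of pushdown (limit, aggregate, ...) can be added later without breaking existing sources. The sketch below is purely illustrative — the interface and class names are hypothetical, not Spark's actual DataSourceV2 API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical capability interface: a source that can evaluate some
// filters itself. Names and signatures are illustrative only.
interface SupportsPushDownFilters {
    // Returns the filters the source could NOT push down;
    // the engine must re-apply these after the scan.
    List<String> pushFilters(List<String> filters);
}

// A new kind of pushdown arrives as another optional interface,
// leaving sources that don't implement it untouched.
interface SupportsPushDownLimit {
    // Returns true if the source will honor the limit itself.
    boolean pushLimit(int limit);
}

class ExampleScanBuilder implements SupportsPushDownFilters, SupportsPushDownLimit {
    @Override
    public List<String> pushFilters(List<String> filters) {
        List<String> residual = new ArrayList<>();
        for (String f : filters) {
            // Pretend only partition filters are pushable; everything
            // else is returned as residual for the engine to evaluate.
            if (!f.startsWith("partition=")) {
                residual.add(f);
            }
        }
        return residual;
    }

    @Override
    public boolean pushLimit(int limit) {
        return limit >= 0;
    }
}

public class PushdownSketch {
    public static void main(String[] args) {
        ExampleScanBuilder b = new ExampleScanBuilder();
        System.out.println(b.pushFilters(List.of("partition=2019", "value > 5")));
        System.out.println(b.pushLimit(10));
    }
}
```

The engine can probe each capability with an `instanceof` check, which is why adding a capability stays a small API update rather than a new API.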
On Fri, Sep 20, 2019 at 3:46 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

> Hi all,
>
> Let's start a new thread to discuss the ongoing features for the Spark 3.0
> preview release.
>
> Below is the feature list for the Spark 3.0 preview release. The list is
> collected from previous discussions on the dev list.
>
> - Follow-up of the shuffle+repartition correctness issue: support rolling
>   back shuffle stages (https://issues.apache.org/jira/browse/SPARK-25341)
> - Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>   (https://issues.apache.org/jira/browse/SPARK-23710)
> - JDK 11 support (https://issues.apache.org/jira/browse/SPARK-28684)
> - Scala 2.13 support (https://issues.apache.org/jira/browse/SPARK-25075)
> - DataSourceV2 features
>   - Enable file source v2 writers
>     (https://issues.apache.org/jira/browse/SPARK-27589)
>   - CREATE TABLE USING with DataSourceV2
>   - New pushdown API for DataSourceV2
>   - Support DELETE/UPDATE/MERGE operations in DataSourceV2
>     (https://issues.apache.org/jira/browse/SPARK-28303)
> - Correctness issue: stream-stream joins - left outer join gives
>   inconsistent output (https://issues.apache.org/jira/browse/SPARK-26154)
> - Revisiting Python / pandas UDFs
>   (https://issues.apache.org/jira/browse/SPARK-28264)
> - Spark Graph (https://issues.apache.org/jira/browse/SPARK-25994)
>
> Features that are nice to have:
>
> - Use remote storage for persisting shuffle data
>   (https://issues.apache.org/jira/browse/SPARK-25299)
> - Spark + Hadoop + Parquet + Avro compatibility problems
>   (https://issues.apache.org/jira/browse/SPARK-25588)
> - Introduce a new option to the Kafka source: specify timestamps for the
>   start and end offsets (https://issues.apache.org/jira/browse/SPARK-26848)
> - Delete files after processing in Structured Streaming
>   (https://issues.apache.org/jira/browse/SPARK-20568)
>
> Here, I am proposing to cut the branch on October 15th. If the features
> are targeting the 3.0 preview release, please prioritize the work and
> finish it before that date. Note that Oct. 15th is not the code freeze for
> Spark 3.0; the community will keep working on features for the upcoming
> Spark 3.0 release even if they are not included in the preview release.
> The goal of the preview release is to collect more feedback from the
> community on the new 3.0 features and behavior changes.
>
> Thanks!