Thank you for linking that, Dongjoon! In that list I found SPARK-44518<https://issues.apache.org/jira/browse/SPARK-44518>, which proposes turning Spark’s Hive integration into a data source.

Thinking out loud: the big gaps between the built-in v1 and v2 data sources are that v2 lacks support for bucketing and partitioning. The v1 data sources support those because they’re interleaved with Spark’s Hive integration. Separating out that Hive integration, or making it more data source-ish, would bring us close to supporting bucketing and partitioning in v2, and then to defaulting to v2. (Just my understanding – curious whether I’m thinking about this correctly.)
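For anyone following along: as far as I understand it, "defaulting to v1" currently comes down to the `spark.sql.sources.useV1SourceList` setting, which names the built-in formats that fall back to their v1 implementation. Below is a toy Python model of that lookup, not Spark's actual code. The default value is what I recall from Spark 3.x and should be checked against your version, and the function names `parse_source_list` and `resolves_to_v1` are made up for illustration.

```python
# Toy model (NOT Spark source code) of how a v1 fallback list can decide
# whether a built-in format resolves to its v1 or v2 implementation.

# Assumption: default of spark.sql.sources.useV1SourceList as recalled
# from Spark 3.x; verify against the Spark version you run.
DEFAULT_USE_V1_SOURCE_LIST = "avro,csv,json,kafka,orc,parquet,text"


def parse_source_list(value):
    """Split a comma-separated list into lower-cased, trimmed format names."""
    return {name.strip().lower() for name in value.split(",") if name.strip()}


def resolves_to_v1(fmt, use_v1_source_list=DEFAULT_USE_V1_SOURCE_LIST):
    """Return True if `fmt` appears in the v1 fallback list (hypothetical helper)."""
    return fmt.strip().lower() in parse_source_list(use_v1_source_list)


if __name__ == "__main__":
    # With the assumed default, every built-in file format falls back to v1.
    for fmt in ("parquet", "orc", "csv"):
        print(fmt, "->", "v1" if resolves_to_v1(fmt) else "v2")

    # Dropping parquet from the list opts it into v2 in this model.
    without_parquet = "avro,csv,json,kafka,orc,text"
    print("parquet ->", "v1" if resolves_to_v1("parquet", without_parquet) else "v2")
```

In a real deployment the corresponding knob would be passed as a Spark conf (for example `--conf spark.sql.sources.useV1SourceList=...`). Changing the default for Spark 4 would presumably mean shrinking that list, which is where the bucketing and partitioning gaps become blockers.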
Anyway, thank you for the pointer.

From: Dongjoon Hyun <dongjoon.h...@gmail.com>
Date: Friday, 15 September 2023 at 05:36
To: Will Raschkowski <wraschkow...@palantir.com.invalid>
Cc: dev@spark.apache.org <dev@spark.apache.org>
Subject: Re: Plans for built-in v2 data sources in Spark 4

Hi, Will.

According to the following JIRA, as of now, there is no plan or on-going discussion to switch it.

https://issues.apache.org/jira/browse/SPARK-44111 (Prepare Apache Spark 4.0.0)

Thanks,
Dongjoon.

On Wed, Sep 13, 2023 at 9:02 AM Will Raschkowski <wraschkow...@palantir.com.invalid> wrote:

Hey everyone,

I was wondering what the plans are for Spark's built-in v2 file data sources in Spark 4.

Concretely, is the plan for Spark 4 to continue defaulting to the built-in v1 data sources? And if yes, what are the blockers for defaulting to v2? I see, just as example, that writing Hive-partitions is not supported in v2. Are there other blockers or outstanding discussions?

Regards,
Will