cloud-fan commented on issue #25822: [SPARK-29127][SQL] Support partitioning and bucketing through DataFrameWriter.save for V2 Tables
URL: https://github.com/apache/spark/pull/25822#issuecomment-532758371

This PR is related to #25651 but targets a different use case: `DataFrameWriter.save` with a save mode. Both PRs need to update/extend `TableProvider`, so it's better to think through our requirements for `TableProvider` together.

In #25651, what we need is:
1. Report schema/partitioning (usually by inference), e.g. for `CREATE TABLE t USING format`. Spark needs to ask the `TableProvider` to report the schema/partitioning of a table first, then store it in the metastore.
2. Read/write data with a given schema/partitioning. It's too expensive to do schema/partitioning inference every time, so Spark gets the schema/partitioning from the metastore and passes it to the `TableProvider` when reading/writing data.

This use case is very similar to Hive's [EXTERNAL TABLE](https://cwiki.apache.org/confluence/display/Hive/Managed+vs.+External+Tables): the table metadata is stored in Spark's metastore, and the table data is stored outside of Spark (i.e. external data). So in this case, `TableProvider` only needs to expose the external data as tables, and we don't need to ask `TableProvider` to create/drop/... tables.

However, people may ask about Hive MANAGED TABLE. What's the corresponding concept in Spark? In Hive, what gets managed is the file directories, so it only applies to file sources (we can also call them path-based data sources). Note that this doesn't mean file sources can only be used as MANAGED TABLEs: Hive can still create an EXTERNAL TABLE pointing to a file directory.

To support a use case like Hive MANAGED TABLE, we need a variant of `TableProvider` indicating that it's a file source (see the mixin sketch at the end of this comment). For `CREATE TABLE t(...) USING file_source`, Spark creates the directory for the table. When reading/writing the table, Spark passes the directory path to the underlying file source. When the table is dropped, Spark removes the directory.

Back to `DataFrameReader`, some requirements are the same:
1. `DataFrameReader.load()` needs `TableProvider` to report schema/partitioning.
2. `DataFrameReader.schema(...).load()` needs `TableProvider` to return a table with the given schema.

However, when it comes to `SaveMode`, things get complicated: it needs to check table existence and create tables, and IIUC @rdblue was against this idea in a previous discussion. I think we can still support `SaveMode` for file sources (path-based data sources), since each mode maps directly onto a directory operation (see the second sketch at the end of this comment):
1. `ErrorIfExists`: fail if the path exists.
2. `Append`: create the path if it doesn't exist.
3. `Overwrite`: create the path if it doesn't exist, truncate it if it does.
4. `Ignore`: skip the write if the path exists.

So, on top of the change in #25651, we need to add one mixin trait to `TableProvider` to indicate that a source is a file source.

cc @rdblue @gengliangwang
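To make the proposed shape concrete, here is a minimal Scala sketch. The `Table` and `TableProvider` traits below are simplified stand-ins for the DSv2 interfaces (not the real signatures), and the `FileSourceProvider` mixin, including its name and the `pathOptionKey` method, is a hypothetical illustration of the proposal rather than an actual Spark API:

```scala
import org.apache.spark.sql.types.StructType

// Simplified stand-in for the DSv2 Table abstraction.
trait Table {
  def schema(): StructType
}

// Simplified stand-in for TableProvider, covering the two requirements above:
// 1. infer schema/partitioning from options alone (e.g. CREATE TABLE t USING format);
// 2. build a table with a schema Spark already has (e.g. loaded from the metastore).
trait TableProvider {
  def getTable(options: Map[String, String]): Table
  def getTable(schema: StructType, options: Map[String, String]): Table
}

// The proposed mixin: marks the provider as path-based, so Spark itself can
// manage the directory lifecycle: create it on CREATE TABLE, pass it down on
// read/write, remove it on DROP TABLE, and apply SaveMode semantics to it.
trait FileSourceProvider extends TableProvider {
  // Hypothetical: the option key under which Spark passes the table directory.
  def pathOptionKey: String = "path"
}
```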
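And here is a minimal sketch of how the four `SaveMode`s would map onto directory operations for a path-based source. It uses `java.nio.file` for brevity (a real file source would go through Hadoop's `FileSystem` API), and the `preparePath` helper is hypothetical:

```scala
import java.nio.file.{Files, Path}
import java.util.Comparator
import org.apache.spark.sql.SaveMode

object PathSaveModes {
  // Hypothetical helper: prepares the table directory for a write and
  // returns true if the write should proceed.
  def preparePath(path: Path, mode: SaveMode): Boolean = mode match {
    case SaveMode.ErrorIfExists =>
      if (Files.exists(path)) {
        throw new IllegalStateException(s"Path $path already exists")
      }
      Files.createDirectories(path)
      true
    case SaveMode.Append =>
      // Create the path if it doesn't exist; no-op if it already does.
      Files.createDirectories(path)
      true
    case SaveMode.Overwrite =>
      if (Files.exists(path)) {
        // Truncate: recursively delete the existing contents, children first.
        Files.walk(path).sorted(Comparator.reverseOrder[Path]()).forEach(p => Files.delete(p))
      }
      Files.createDirectories(path)
      true
    case SaveMode.Ignore =>
      if (Files.exists(path)) {
        false // skip the write entirely
      } else {
        Files.createDirectories(path)
        true
      }
  }
}
```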
