brkyvz commented on issue #25822: [SPARK-29127][SQL] Support partitioning and bucketing through DataFrameWriter.save for V2 Tables
URL: https://github.com/apache/spark/pull/25822#issuecomment-532772153

@cloud-fan I guess my main confusion comes from the lack of usefulness of `TableProvider` thus far. When a catalog is defined, its only use is to indicate whether a data source has a V2 definition. It seems like a combination of `RelationProvider` and `SchemaRelationProvider` from V1 land.

Let's look at the interfaces we have thus far:

1. `Table`: Pretty great interface that describes all the properties and capabilities of a table
2. `TableCatalog`: An interface that checks for the existence of tables and also handles their creation and alteration
3. `TableProvider`: An interface that creates a `Table` through data source options without a catalog, but still doesn't have the complete set of APIs to fully define a `Table`.

`TableProvider` is currently missing a way to pass partitioning info. This can be passed as part of `DataFrameWriter`, but unfortunately not as part of `DataFrameReader`. This means that for file-based sources, where there is no catalog to store the partitioning info, Spark cannot initialize a complete and correct `Table` definition from user input.

I had a more radical idea, and I've started working on it here: https://github.com/apache/spark/pull/25833

Why don't we make `TableProvider` also extend `TableCatalog`? On top of that, it will also need a layer to go between data source options and an `Identifier`. This way, a lot of the V2 code paths can be reused, you get path-based table support out of the box in `DataFrameWriterV2`, and you don't need to fill in incorrect information in `DataFrameReader.load` for file-based data sources.
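
To make the shape of that proposal concrete, here is a minimal sketch. It does not use the real Spark interfaces: `Ident`, `SimpleTable`, `SimpleTableCatalog`, `CatalogBackedProvider`, `PathBasedProvider`, and the `extractIdentifier` method are all hypothetical, simplified stand-ins invented for illustration, and the actual signatures in Spark (and in the linked PR) may look quite different. It only shows the idea of a provider that is also a catalog, with a small layer mapping data source options to an identifier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an identifier (namespace omitted for brevity). */
final class Ident {
    final String name;
    Ident(String name) { this.name = name; }
}

/** Hypothetical stand-in for Table: only a name and its partition columns. */
final class SimpleTable {
    final String name;
    final List<String> partitionColumns;
    SimpleTable(String name, List<String> partitionColumns) {
        this.name = name;
        this.partitionColumns =
            Collections.unmodifiableList(new ArrayList<>(partitionColumns));
    }
}

/** Hypothetical stand-in for TableCatalog: existence check, load, create. */
interface SimpleTableCatalog {
    boolean tableExists(Ident ident);
    SimpleTable loadTable(Ident ident);
    SimpleTable createTable(Ident ident, List<String> partitionColumns);
}

/** The proposed shape: a provider that is also a catalog. */
interface CatalogBackedProvider extends SimpleTableCatalog {
    /** The extra layer: map data source options to an identifier. */
    Ident extractIdentifier(Map<String, String> options);

    /** Option-based lookup reuses the catalog code path. */
    default SimpleTable getTable(Map<String, String> options) {
        return loadTable(extractIdentifier(options));
    }
}

/** Toy path-based source that keeps table metadata in memory. */
final class PathBasedProvider implements CatalogBackedProvider {
    private final Map<String, SimpleTable> tables = new HashMap<>();

    @Override
    public Ident extractIdentifier(Map<String, String> options) {
        String path = options.get("path");
        if (path == null) {
            throw new IllegalArgumentException("missing 'path' option");
        }
        return new Ident(path); // the path doubles as the table name
    }

    @Override
    public boolean tableExists(Ident ident) {
        return tables.containsKey(ident.name);
    }

    @Override
    public SimpleTable loadTable(Ident ident) {
        SimpleTable table = tables.get(ident.name);
        if (table == null) {
            throw new IllegalStateException("no such table: " + ident.name);
        }
        return table;
    }

    @Override
    public SimpleTable createTable(Ident ident, List<String> partitionColumns) {
        SimpleTable table = new SimpleTable(ident.name, partitionColumns);
        tables.put(ident.name, table);
        return table;
    }
}

final class ProviderSketch {
    public static void main(String[] args) {
        PathBasedProvider provider = new PathBasedProvider();
        Map<String, String> options = new HashMap<>();
        options.put("path", "/data/events");

        // Write side: partitioning is recorded through the catalog API.
        provider.createTable(provider.extractIdentifier(options), Arrays.asList("date"));

        // Read side: the same options resolve to the full table definition.
        System.out.println(provider.getTable(options).partitionColumns);
    }
}
```

In this toy version the read path never has to guess the partitioning: it resolves the options to an identifier and loads whatever the catalog side recorded at creation time.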
