[GitHub] [spark] brkyvz edited a comment on issue #25822: [SPARK-29127][SQL] Support partitioning and bucketing through DataFrameWriter.save for V2 Tables

GitBox Wed, 18 Sep 2019 09:54:40 -0700

brkyvz edited a comment on issue #25822: [SPARK-29127][SQL] Support 
partitioning and bucketing through DataFrameWriter.save for V2 Tables
URL: https://github.com/apache/spark/pull/25822#issuecomment-532772153
 
 
   @cloud-fan 
   I guess my main confusion comes from the lack of usefulness of TableProvider 
thus far. When a catalog is defined, its only usefulness comes from defining 
whether a data source has a V2 definition. It seems like a combination of 
`RelationProvider` and `SchemaRelationProvider` from V1 land.
   
   Let's look at the interfaces we have thus far:
    1. `Table`: Pretty great interface that describes all the properties and 
capabilities of a table
    2. `TableCatalog`: An interface that checks for the existence of tables, and 
also handles the creation/alteration of these tables
    3. `TableProvider`: An interface that creates a `Table` through data source 
options without a catalog, but still doesn't have the complete set of APIs to 
fully define a Table.
   
   TableProvider currently has no way to receive partitioning info. This 
can be passed as part of DataFrameWriter, but unfortunately not as part of 
DataFrameReader. This means that for file-based sources, where there is no 
catalog to store the partitioning info, Spark cannot initialize a complete and 
correct Table definition through user input.
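
   To make the gap concrete, here is a rough Java sketch of the three roles. 
Everything in it is a simplified stand-in that I made up for illustration 
(`SketchTable`, `SketchTableCatalog`, `SketchTableProvider`, `InMemoryCatalog`, 
`PathOnlyProvider`); the real Spark interfaces have richer signatures. The 
point is only that the catalog path receives partitioning while the provider 
path sees nothing but options.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for illustration; these are NOT the real Spark signatures.
interface SketchTable {
    String name();
    List<String> columns();
    List<String> partitioning();
}

interface SketchTableCatalog {
    boolean tableExists(String ident);
    SketchTable createTable(String ident, List<String> columns, List<String> partitioning);
}

interface SketchTableProvider {
    // Only data source options are passed in; there is no partitioning argument.
    SketchTable getTable(Map<String, String> options);
}

final class BasicTable implements SketchTable {
    private final String name;
    private final List<String> columns;
    private final List<String> partitioning;

    BasicTable(String name, List<String> columns, List<String> partitioning) {
        this.name = name;
        this.columns = columns;
        this.partitioning = partitioning;
    }

    public String name() { return name; }
    public List<String> columns() { return columns; }
    public List<String> partitioning() { return partitioning; }
}

final class InMemoryCatalog implements SketchTableCatalog {
    private final Map<String, SketchTable> tables = new HashMap<>();

    public boolean tableExists(String ident) {
        return tables.containsKey(ident);
    }

    public SketchTable createTable(String ident, List<String> columns, List<String> partitioning) {
        SketchTable table = new BasicTable(ident, columns, partitioning);
        tables.put(ident, table);
        return table;
    }
}

final class PathOnlyProvider implements SketchTableProvider {
    public SketchTable getTable(Map<String, String> options) {
        // Assumption: a real file source would infer the columns from the files at "path".
        // With nothing but options to go on, the partitioning has to be left empty.
        return new BasicTable(options.get("path"), Arrays.asList("id", "date"),
                Collections.<String>emptyList());
    }
}
```

   In this sketch a table created through the catalog keeps its `date` 
partitioning, while the same data loaded through the provider reports none.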
   
   I had a more radical idea, and I've started working on it here: 
https://github.com/apache/spark/pull/25833
   
   Why don't we make `TableProvider` also extend `TableCatalog`? On top of 
that, it will also need a layer to translate between DataSource options and an 
Identifier. This way, a lot of the V2 code paths can be re-used (you even get 
partition column normalization!), you get path-based table support out of the 
box in DataFrameWriterV2, and you don't need to fill in incorrect information 
in DataFrameReader.load for file-based data sources.
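
   Roughly, the shape of the idea could look like the Java sketch below. The 
names (`MiniCatalog`, `CatalogProvider`, `extractIdentifier`, 
`PathCatalogProvider`) and the string identifier are hypothetical and are not 
the API in the linked PR; the in-memory map is just a stand-in for whatever 
metadata a file source could recover on its own.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposal; names and signatures are assumptions,
// not the API in the linked pull request.
final class PathTable {
    final String ident;
    final List<String> partitioning;

    PathTable(String ident, List<String> partitioning) {
        this.ident = ident;
        this.partitioning = partitioning;
    }
}

interface MiniCatalog {
    PathTable createTable(String ident, List<String> partitioning);
    PathTable loadTable(String ident);
}

interface CatalogProvider extends MiniCatalog {
    // The extra layer: translate data source options into a table identifier.
    String extractIdentifier(Map<String, String> options);
}

final class PathCatalogProvider implements CatalogProvider {
    // Stand-in for metadata that a file source could recover by itself.
    private final Map<String, PathTable> tables = new HashMap<>();

    public String extractIdentifier(Map<String, String> options) {
        String path = options.get("path");
        if (path == null) {
            throw new IllegalArgumentException("missing 'path' option");
        }
        // Strip one trailing slash so "/data/events/" and "/data/events" match.
        if (path.length() > 1 && path.endsWith("/")) {
            return path.substring(0, path.length() - 1);
        }
        return path;
    }

    public PathTable createTable(String ident, List<String> partitioning) {
        PathTable table = new PathTable(ident, partitioning);
        tables.put(ident, table);
        return table;
    }

    public PathTable loadTable(String ident) {
        PathTable table = tables.get(ident);
        if (table == null) {
            throw new IllegalStateException("no such table: " + ident);
        }
        return table;
    }
}
```

   With this shape, the write side calls `createTable` with the partitioning 
the user supplied and the read side calls `loadTable` on the identifier derived 
from the same options, so the reader never has to guess at the partitioning.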


