cloud-fan commented on issue #25822: [SPARK-29127][SQL] Support partitioning and bucketing through DataFrameWriter.save for V2 Tables
URL: https://github.com/apache/spark/pull/25822#issuecomment-532758371

This PR is related to #25651 but targets a different use case: `DataFrameWriter.save` with a save mode. Both PRs need to update/extend `TableProvider`, so it's better to think through our requirements for `TableProvider` together.

In #25651, what we need is:
1. Report schema/partitioning (usually by inference), e.g. for `CREATE TABLE t USING format`. Spark needs to ask the `TableProvider` to report the schema/partitioning of a table first, then store it in the metastore.
2. Read/write data with a given schema/partitioning. It's too expensive to do schema/partitioning inference every time, so Spark gets the schema/partitioning from the metastore and passes it to the `TableProvider` when reading/writing data.

This use case is very similar to Hive's [EXTERNAL TABLE](https://cwiki.apache.org/confluence/display/Hive/Managed+vs.+External+Tables): the table metadata is stored in Spark's metastore, and the table data is stored outside of Spark (i.e. external data). So in this case, `TableProvider` only needs to expose the external data as tables, and we don't need to ask `TableProvider` to create/drop/... tables.

However, people may ask about Hive MANAGED TABLE. What's the corresponding concept in Spark? In Hive, what gets managed is the file directories, so it only applies to file sources (we can also call them path-based data sources). Note that this doesn't mean file sources can only be used as MANAGED TABLEs: Hive can still create an EXTERNAL TABLE pointing to a file directory.

To support a use case like Hive MANAGED TABLE, we need a variant of `TableProvider` indicating that it's a file source (see the mixin sketch at the end of this comment). For `CREATE TABLE t(...) USING file_source`, Spark creates the directory for the table. When reading/writing the table, Spark passes the directory path to the underlying file source. When the table is dropped, Spark removes the directory.

Back to `DataFrameReader`, some requirements are the same:
1. `DataFrameReader.load()` needs `TableProvider` to report schema/partitioning.
2. `DataFrameReader.schema(...).load()` needs `TableProvider` to return a table with the given schema.

However, when it comes to `SaveMode`, things get complicated: it needs to check table existence and create tables, and IIUC @rdblue was against this idea in a previous discussion. I think we can still support `SaveMode` for file sources (path-based data sources), since each mode maps directly onto a directory operation (see the second sketch at the end of this comment):
1. `ErrorIfExists`: fail if the path exists.
2. `Append`: create the path if it doesn't exist.
3. `Overwrite`: create the path if it doesn't exist, truncate it if it does.
4. `Ignore`: skip the write if the path exists.

So, on top of the change in #25651, we need to add one mixin trait to `TableProvider` to indicate that a source is a file source.

cc @rdblue @gengliangwang
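To make the proposed shape concrete, here is a minimal Scala sketch. The `Table` and `TableProvider` traits below are simplified stand-ins for the DSv2 interfaces (not the real signatures), and the `FileSourceProvider` mixin, including its name and the `pathOptionKey` method, is a hypothetical illustration of the proposal rather than an actual Spark API:

```scala
import org.apache.spark.sql.types.StructType

// Simplified stand-in for the DSv2 Table abstraction.
trait Table {
  def schema(): StructType
}

// Simplified stand-in for TableProvider, covering the two requirements above:
// 1. infer schema/partitioning from options alone (e.g. CREATE TABLE t USING format);
// 2. build a table with a schema Spark already has (e.g. loaded from the metastore).
trait TableProvider {
  def getTable(options: Map[String, String]): Table
  def getTable(schema: StructType, options: Map[String, String]): Table
}

// The proposed mixin: marks the provider as path-based, so Spark itself can
// manage the directory lifecycle: create it on CREATE TABLE, pass it down on
// read/write, remove it on DROP TABLE, and apply SaveMode semantics to it.
trait FileSourceProvider extends TableProvider {
  // Hypothetical: the option key under which Spark passes the table directory.
  def pathOptionKey: String = "path"
}
```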
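And here is a minimal sketch of how the four `SaveMode`s would map onto directory operations for a path-based source. It uses `java.nio.file` for brevity (a real file source would go through Hadoop's `FileSystem` API), and the `preparePath` helper is hypothetical:

```scala
import java.nio.file.{Files, Path}
import java.util.Comparator
import org.apache.spark.sql.SaveMode

object PathSaveModes {
  // Hypothetical helper: prepares the table directory for a write and
  // returns true if the write should proceed.
  def preparePath(path: Path, mode: SaveMode): Boolean = mode match {
    case SaveMode.ErrorIfExists =>
      if (Files.exists(path)) {
        throw new IllegalStateException(s"Path $path already exists")
      }
      Files.createDirectories(path)
      true
    case SaveMode.Append =>
      // Create the path if it doesn't exist; no-op if it already does.
      Files.createDirectories(path)
      true
    case SaveMode.Overwrite =>
      if (Files.exists(path)) {
        // Truncate: recursively delete the existing contents, children first.
        Files.walk(path).sorted(Comparator.reverseOrder[Path]()).forEach(p => Files.delete(p))
      }
      Files.createDirectories(path)
      true
    case SaveMode.Ignore =>
      if (Files.exists(path)) {
        false // skip the write entirely
      } else {
        Files.createDirectories(path)
        true
      }
  }
}
```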
