Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21306
  
    There are several things we need to discuss here:
    
    - Which catalog operations do we want to forward to the data source catalog? Currently it's create/drop/alter table, which I think is good enough for now.
    - How does Spark forward these catalog operations? IMO there are two ways:
      - Spark provides an API so that end-users can do it directly, e.g. `spark.catalog("iceberg").createTable(...)`, or via the SQL API: `CREATE TABLE iceberg.db1.tbl1 ...`.
      - When creating/dropping/altering Spark tables, Spark also forwards the operation to the data source catalog. For example, when a user creates a table in Spark via `CREATE TABLE t(...) USING iceberg`, this creates a table entry in the Hive metastore as well as an iceberg meta file. When dropping this table, Spark should notify iceberg to remove the meta file. It's arguable whether we need this feature: if users are willing to always add the catalog prefix, they can just write `CREATE TABLE iceberg.db1.tbl1 ...` and `SELECT ... FROM iceberg.db1.tbl1`, and totally bypass the Spark catalog.
    - How do we look up table metadata from the data source catalog? Database name + table name is a common way (e.g. `iceberg.db1.tbl1`), but we should also consider other ways, like a path (e.g. `` delta.`/a/path` ``). Maybe we can treat the path as a table name without a database, and leave it to the data source to interpret.
    - How do we define the table metadata? It seems that Spark only needs to know the table schema for analysis. Maybe we can forward `DESC TABLE` to the data source so that Spark doesn't need to standardize the table metadata.
    - How is the table metadata involved in data reading/writing? When reading data without a catalog, e.g. `spark.read.format("my_data_source").option("table", "my_table").load()`, the data source needs to get the metadata of the given table. When reading data with a catalog, e.g. `spark.table("my_data_source.my_table")`, the data source also needs to get the metadata of the given table, but has to implement it via a different API (`CatalogSupport`). It's OK to say that data source implementations are responsible for eliminating the code duplication themselves.
    
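    The forwarding discussed above could be sketched as follows (plain Scala, all names are toy stand-ins, not proposed API): Spark maintains its own metastore entry but forwards create/drop to the source, so `DROP TABLE` also removes the source's meta file.

```scala
import scala.collection.mutable

// Hypothetical sketch: the Spark-side catalog forwards create/drop
// to a data source catalog.
trait DataSourceCatalog {
  def createTable(name: String, schema: Seq[(String, String)]): Unit
  def dropTable(name: String): Unit
}

// Toy stand-in for a source like iceberg that keeps a meta file per table.
class ToyIcebergCatalog extends DataSourceCatalog {
  val metaFiles = mutable.Set.empty[String]
  def createTable(name: String, schema: Seq[(String, String)]): Unit =
    metaFiles += s"$name.metadata.json"
  def dropTable(name: String): Unit =
    metaFiles -= s"$name.metadata.json"
}

// Toy stand-in for the entry Spark keeps in the Hive metastore.
class SparkCatalog(source: DataSourceCatalog) {
  val entries = mutable.Set.empty[String]
  def createTable(name: String, schema: Seq[(String, String)]): Unit = {
    entries += name                   // metastore entry
    source.createTable(name, schema)  // forwarded: source writes its meta file
  }
  def dropTable(name: String): Unit = {
    entries -= name
    source.dropTable(name)            // forwarded: source removes its meta file
  }
}
```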

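    On the last point, the deduplication a data source would own might look like this sketch (plain Scala; `TableMetadata`, `readWithOptions`, and the `CatalogSupport` method shown are assumptions, only the `CatalogSupport` name comes from the discussion): both the option-based read path and the catalog lookup funnel into one internal method.

```scala
// Hypothetical sketch: one shared lookup behind both entry points.
final case class TableMetadata(name: String, schema: Seq[(String, String)])

trait CatalogSupport {
  def loadTable(name: String): TableMetadata
}

class MyDataSource extends CatalogSupport {
  private val tables =
    Map("my_table" -> TableMetadata("my_table", Seq("id" -> "int")))

  // Shared implementation both entry points funnel into.
  private def lookup(name: String): TableMetadata =
    tables.getOrElse(name, sys.error(s"unknown table: $name"))

  // Entry point for spark.read.format(...).option("table", "my_table").load()
  def readWithOptions(options: Map[String, String]): TableMetadata =
    lookup(options("table"))

  // Entry point for spark.table("my_data_source.my_table") via CatalogSupport
  def loadTable(name: String): TableMetadata = lookup(name)
}
```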

