[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

rdblue Tue, 03 Jul 2018 15:50:37 -0700

Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21306
  
    @cloud-fan, thanks for the thorough feedback!
    
    > What catalog operations we want to forward to the data source catalog? 
Currently it's create/drop/alter table, I think it's good enough for now.
    
    This PR introduces create, drop, and alter. We can always add more later. 
These are the ones that we need to implement DataSourceV2 operations and DDL 
support.
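    
    As a rough illustration only, the three operations could be sketched as 
below. Every name here (`SketchTableCatalog`, `InMemorySketchCatalog`, the 
string identifiers and property maps) is hypothetical and is not the interface 
this PR adds; the sketch only shows the shape of a create/alter/drop contract.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface for illustration; not the API introduced by this PR.
interface SketchTableCatalog {
    // Creates a table and returns its stored properties; fails if it already exists.
    Map<String, String> createTable(String ident, Map<String, String> properties);

    // Merges property changes into an existing table and returns the result.
    Map<String, String> alterTable(String ident, Map<String, String> changes);

    // Returns true if a table was removed, false if none existed.
    boolean dropTable(String ident);
}

// Toy in-memory implementation (an assumption made only to keep the sketch runnable).
class InMemorySketchCatalog implements SketchTableCatalog {
    private final Map<String, Map<String, String>> tables = new HashMap<>();

    @Override
    public Map<String, String> createTable(String ident, Map<String, String> properties) {
        if (tables.containsKey(ident)) {
            throw new IllegalStateException("Table already exists: " + ident);
        }
        Map<String, String> stored = new HashMap<>(properties);
        tables.put(ident, stored);
        return new HashMap<>(stored); // defensive copy for the caller
    }

    @Override
    public Map<String, String> alterTable(String ident, Map<String, String> changes) {
        Map<String, String> current = tables.get(ident);
        if (current == null) {
            throw new IllegalStateException("No such table: " + ident);
        }
        current.putAll(changes);
        return new HashMap<>(current);
    }

    @Override
    public boolean dropTable(String ident) {
        return tables.remove(ident) != null;
    }
}
```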
    
    > Spark provides an API so that end-users can do it directly. e.g. 
`spark.catalog("iceberge").createTable(...)`, or SQL API `CREATE TABLE 
iceberge.db1.tbl1 . . .`
    
    These two are the easiest and least intrusive ways to start because the data 
source catalog interaction is explicitly tied to a catalog. This also matches the 
behavior other systems use for multiple catalogs. I think this is what we 
should start with, and then tackle ideas like your second point.
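    
    To make "explicitly tied to a catalog" concrete, here is a minimal sketch 
of how a name like `iceberge.db1.tbl1` might be split into a catalog name and 
the remaining identifier. The classes `CatalogResolver` and 
`ResolvedIdentifier` are hypothetical, the fallback to a default catalog when 
the first part is not a registered catalog is my assumption, and splitting on 
`.` ignores quoted identifiers.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical result type: which catalog to use, plus the rest of the name.
class ResolvedIdentifier {
    final String catalog;    // null means "use the default catalog" (assumption)
    final List<String> rest; // remaining name parts, e.g. [db1, tbl1]

    ResolvedIdentifier(String catalog, List<String> rest) {
        this.catalog = catalog;
        this.rest = rest;
    }
}

// Hypothetical resolver; not part of Spark or of this PR.
class CatalogResolver {
    private final Set<String> catalogNames;

    CatalogResolver(Set<String> catalogNames) {
        this.catalogNames = new HashSet<>(catalogNames);
    }

    ResolvedIdentifier resolve(String identifier) {
        List<String> parts = Arrays.asList(identifier.split("\\."));
        // Treat the first part as a catalog only if it is registered as one.
        if (parts.size() > 1 && catalogNames.contains(parts.get(0))) {
            return new ResolvedIdentifier(parts.get(0), parts.subList(1, parts.size()));
        }
        return new ResolvedIdentifier(null, parts);
    }
}
```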
    
    > When creating/dropping/altering Spark tables, also forward it to the data 
source catalog. . .
    
    For this and a couple of other questions, I don't think we need to decide 
right now. This PR is about getting the interface for other sources into Spark. 
We don't necessarily need to know all of the ways that users will call it or 
interact with it, like how `DESC TABLE` will work.
    
    To your question here, I'm not sure whether the `CREATE TABLE ... USING 
source` syntax should use the default catalog, defer to the catalog for 
`source`, or forward to both, but that doesn't need to block adding this API 
because we can decide it later. In addition, we should probably discuss 
this on the dev list to make sure we get the behavior right.
    
    > How to lookup the table metadata from data source catalog?
    
    The SPIP proposes two catalog interfaces that return `Table`: one that uses 
table identifiers and one that uses paths. Data sources can implement support 
for both or just one. This PR includes only the support for table identifiers. 
We would add a similar API for path-based tables in another PR.
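    
    A minimal sketch of that split, with hypothetical names throughout 
(`SketchTable`, `IdentifierCatalog`, `PathCatalog`, `DualLookupSource`); the 
real interfaces in the SPIP and in this PR differ in their signatures. It only 
shows that a source can opt into one lookup style or both.

```java
// Hypothetical stand-in for the `Table` returned by a catalog.
class SketchTable {
    final String name;

    SketchTable(String name) {
        this.name = name;
    }
}

// Hypothetical identifier-based lookup (the style this PR covers).
interface IdentifierCatalog {
    SketchTable loadTable(String database, String table);
}

// Hypothetical path-based lookup (the style left for a later PR).
interface PathCatalog {
    SketchTable loadTable(String path);
}

// A source may implement both interfaces, or just one of them.
class DualLookupSource implements IdentifierCatalog, PathCatalog {
    @Override
    public SketchTable loadTable(String database, String table) {
        return new SketchTable(database + "." + table);
    }

    @Override
    public SketchTable loadTable(String path) {
        return new SketchTable(path);
    }
}
```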
    
    > How to define table metadata? Maybe we can forward `DESC TABLE` . . .
    
    That sounds like a reasonable idea to me. Like the behavior of `USING`, I 
don't think this is something that we have to decide right now. We can add 
support later as we implement table DDL. Maybe `Table` should return a 
DataFrame that contains its `DESCRIBE` output.
    
    > How does the table metadata involve in data reading/writing?
    
    This is another example of something we don't need to decide yet. We have a 
couple of different options for the behavior and will want to think them through 
and discuss them on the dev list. But I don't think the behavior needs to be 
decided before we add this API to sources.


