cloud-fan commented on issue #26868: [SPARK-29665][SQL] refine the 
TableProvider interface
URL: https://github.com/apache/spark/pull/26868#issuecomment-568356522
 
 
   > what is your rationale for saying "For [sources like Kafka], a simple 
getTable(properties) is the best."?
   
   We can tell by looking at the code, but let me explain it here as well.
   ```
   class KafkaProvider implements TableProvider {
     Table getTable(properties) {
       return new KafkaTable(properties)
     }
   }
   
   class KafkaTable implements Table {
     StructType schema() {
       return the_fixed_schema;
     }
   
     Transform[] partitioning() {
       return new Transform[0];
     }
   
     ScanBuilder ...
     WriteBuilder ...
   }
   ```
   This is simpler than the version below, as we don't need to worry about whether the 
passed-in schema and partitioning are wrong.
   ```
   class KafkaProvider implements TableProvider {
     StructType inferSchema() {
       return the_fixed_schema;
     }
   
     Transform[] inferPartitioning() {
       return new Transform[0];
     }
   
     Table getTable(schema, partitioning, properties) {
       assert(schema == the_fixed_schema)
       assert(partitioning.isEmpty)
       return new KafkaTable(schema, properties)
     }
   }
   
   class KafkaTable(schema) implements Table {
     StructType schema() {
       return this.schema;
     }
   
     Transform[] partitioning() {
       return new Transform[0];
     }
   
     ScanBuilder ...
     WriteBuilder ...
   }
   ```
   
   > If I want to store a Kafka stream in the built-in generic catalog, we 
agree that catalog should pass the schema and partitioning to 
TableProvider.getTable (Your point 2.). That means that both 
getTable(properties) and getTable(schema, partitioning, properties) must be 
implemented. 
   
   In the last sync, I think we agreed that we should have a "flag" that lets Spark 
skip storing the schema/partitioning in the built-in generic catalog. 
`SupportsExternalMetadata` is that flag. If a source doesn't implement 
`SupportsExternalMetadata`, then Spark won't store the schema/partitioning in 
the built-in catalog. When we scan this table, Spark just calls 
`getTable(properties)` and asks the source to report the schema/partitioning.
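   To make the contrast concrete, a source that does support external metadata (e.g. a 
file source) would implement the flag and accept the schema/partitioning from outside. 
This is only a rough sketch in the same pseudocode as above (the class name 
`CsvProvider` and the exact signatures are illustrative, not final):
   ```
   class CsvProvider implements TableProvider, SupportsExternalMetadata {
     // Called only when the user or the catalog doesn't provide the metadata.
     StructType inferSchema(properties) {
       return schema_inferred_from_files;
     }
   
     Transform[] inferPartitioning(properties) {
       return partitioning_discovered_from_directories;
     }
   
     // schema/partitioning may come from the user, the built-in catalog,
     // or the infer* methods above.
     Table getTable(schema, partitioning, properties) {
       return new CsvTable(schema, partitioning, properties);
     }
   }
   ```
   For such a source, the built-in generic catalog can store the schema/partitioning and 
pass them back via `getTable(schema, partitioning, properties)` at scan time, while a 
source like Kafka only ever sees `getTable(properties)`.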
