cloud-fan commented on issue #26868: [SPARK-29665][SQL] refine the TableProvider interface
URL: https://github.com/apache/spark/pull/26868#issuecomment-568356522

> what is your rationale for saying "For [sources like Kafka], a simple getTable(properties) is the best."?

The code makes this clear, but let me explain it here as well.

```
class KafkaProvider implements TableProvider {
  Table getTable(properties) {
    return new KafkaTable(properties)
  }
}

class KafkaTable implements Table {
  StructType schema() { return the_fixed_schema; }
  Transform[] partitioning() { return new Transform[0]; }
  ScanBuilder ...
  WriteBuilder ...
}
```

This is simpler than the version below, because we don't need to worry about whether the passed-in schema and partitioning are wrong.

```
class KafkaProvider implements TableProvider {
  StructType inferSchema() { return the_fixed_schema; }
  Transform[] inferPartitioning() { return new Transform[0]; }
  Table getTable(schema, partitioning, properties) {
    assert(schema == the_fixed_schema)
    assert(partitioning.isEmpty)
    return new KafkaTable(schema, properties)
  }
}

class KafkaTable(schema) implements Table {
  StructType schema() { return this.schema; }
  Transform[] partitioning() { return new Transform[0]; }
  ScanBuilder ...
  WriteBuilder ...
}
```

> If I want to store a Kafka stream in the built-in generic catalog, we agree that catalog should pass the schema and partitioning to TableProvider.getTable (Your point 2.). That means that both getTable(properties) and getTable(schema, partitioning, properties) must be implemented.

In the last sync, I think we agreed that we should have a "flag" that tells Spark not to store the schema/partitioning in the built-in generic catalog. `SupportsExternalMetadata` is that flag. If a source doesn't implement `SupportsExternalMetadata`, then Spark won't store the schema/partitioning in the built-in catalog. When we scan this table, Spark just calls `getTable(properties)` and asks the source to report its schema/partitioning.
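
To make the "flag" idea concrete, here is a minimal, self-contained Java sketch of how scan-time dispatch could look. It is an illustration under assumptions, not the code in this PR: the interfaces are simplified stand-ins rather than the real Spark ones, the schema is modeled as a plain `List<String>`, partitioning is left out, and `FixedSchemaProvider`, `FileLikeProvider`, `CatalogSketch`, and the column names are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Simplified stand-ins for illustration only; NOT the real Spark interfaces.
interface Table {
    List<String> schema();
}

interface TableProvider {
    // The source reports its own schema; nothing is passed in.
    Table getTable(Map<String, String> properties);
}

// Assumed shape of the opt-in "flag": only sources implementing it
// are handed metadata that was stored in the catalog.
interface SupportsExternalMetadata extends TableProvider {
    List<String> inferSchema(Map<String, String> properties);

    Table getTable(List<String> schema, Map<String, String> properties);
}

// Kafka-like source: fixed schema, does not opt in, so there is nothing to validate.
class FixedSchemaProvider implements TableProvider {
    // Placeholder column names, not the real Kafka source schema.
    static final List<String> FIXED_SCHEMA = List.of("key", "value", "topic");

    @Override
    public Table getTable(Map<String, String> properties) {
        return () -> FIXED_SCHEMA;
    }
}

// File-like source: opts in and accepts the schema the catalog stored.
class FileLikeProvider implements SupportsExternalMetadata {
    @Override
    public List<String> inferSchema(Map<String, String> properties) {
        return List.of("inferred_col");
    }

    @Override
    public Table getTable(Map<String, String> properties) {
        return getTable(inferSchema(properties), properties);
    }

    @Override
    public Table getTable(List<String> schema, Map<String, String> properties) {
        return () -> schema;
    }
}

class CatalogSketch {
    // Hypothetical dispatch: pass stored metadata only to sources that opted in.
    static Table load(TableProvider provider,
                      List<String> storedSchema,
                      Map<String, String> properties) {
        if (provider instanceof SupportsExternalMetadata && storedSchema != null) {
            return ((SupportsExternalMetadata) provider).getTable(storedSchema, properties);
        }
        // No flag: ignore any stored metadata and let the source report it.
        return provider.getTable(properties);
    }

    public static void main(String[] args) {
        Map<String, String> props = Map.of();
        System.out.println(load(new FixedSchemaProvider(), null, props).schema());
        System.out.println(load(new FileLikeProvider(), List.of("a", "b"), props).schema());
    }
}
```

In this sketch the Kafka-like provider never sees a catalog-supplied schema, so it needs no assertions against its fixed schema, which is the simplification argued for above.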
