rdblue commented on a change in pull request #26297: [SPARK-29665][SQL] refine
the TableProvider interface
URL: https://github.com/apache/spark/pull/26297#discussion_r346077428
##########
File path:
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableProvider.java
##########
@@ -36,26 +39,34 @@
 public interface TableProvider {

   /**
-   * Return a {@link Table} instance to do read/write with user-specified options.
+   * Infer the schema of the table that is identified by the given options.
+   *
+   * @param options The options that can identify a table, e.g. file path, Kafka topic name, etc.
+   *                It's an immutable case-insensitive string-to-string map.
    *
-   * @param options the user-specified options that can identify a table, e.g. file path, Kafka
-   *                topic name, etc. It's an immutable case-insensitive string-to-string map.
    */
-  Table getTable(CaseInsensitiveStringMap options);
+  StructType inferSchema(CaseInsensitiveStringMap options);

   /**
-   * Return a {@link Table} instance to do read/write with user-specified schema and options.
-   * <p>
-   * By default this method throws {@link UnsupportedOperationException}, implementations should
-   * override this method to handle user-specified schema.
-   * </p>
-   * @param options the user-specified options that can identify a table, e.g. file path, Kafka
-   *                topic name, etc. It's an immutable case-insensitive string-to-string map.
-   * @param schema the user-specified schema.
-   * @throws UnsupportedOperationException
+   * Infer the partitioning of the table that is identified by the given options.
+   *
+   * @param schema The schema of the table.
+   * @param options The options that can identify a table, e.g. file path, Kafka topic name, etc.
+   *                It's an immutable case-insensitive string-to-string map.
+   */
+  Transform[] inferPartitioning(StructType schema, CaseInsensitiveStringMap options);
Review comment:
> It seems very weird if we allow users to specify partitioning and infer
schema.
This isn't what I'm suggesting. We can set up rules for when schema and
partition inference are called that restrict to just those 3 cases.
What I'm suggesting is that schema inference and partition inference are
independent, so we don't need to pass a schema in to `inferPartitioning`. The
schema isn't actually used by file sources, and file sources are why we are
making these changes.
> you can't pick a non-existing column as partition column
There's no reason why this must be the case. Another partition column could
be added to the schema. Data files don't usually store partition columns, so
the schema is usually the union of all file schemas plus whatever is inferred
for the partition schema. That means schema depends on partitioning, not the
other way around.
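To illustrate the point that schema depends on partitioning and not the other way around, here is a hypothetical sketch (plain JDK types standing in for Spark's `StructType` and `Transform`; class and method names are invented for illustration). Partition columns are inferred from directory names like `year=2019`, and the table schema is the union of the data-file columns plus those inferred partition columns:

```java
import java.util.*;

// Hypothetical sketch, not Spark's actual file-source code: shows why
// partition inference needs no schema, while schema inference needs the
// partitioning result.
public class SchemaInference {

    // Data files don't usually store partition columns; they live in the
    // directory structure, e.g. "data/year=2019/part-0.parquet".
    static List<String> inferPartitioning(List<String> filePaths) {
        Set<String> partitionCols = new LinkedHashSet<>();
        for (String path : filePaths) {
            for (String segment : path.split("/")) {
                int eq = segment.indexOf('=');
                if (eq > 0) partitionCols.add(segment.substring(0, eq));
            }
        }
        return new ArrayList<>(partitionCols);
    }

    // Table schema = union of all file schemas, plus whatever is inferred
    // for the partition columns. Schema depends on partitioning.
    static List<String> inferSchema(List<String> fileColumns, List<String> partitionCols) {
        Set<String> schema = new LinkedHashSet<>(fileColumns);
        schema.addAll(partitionCols);
        return new ArrayList<>(schema);
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
            "data/year=2019/part-0.parquet",
            "data/year=2020/part-0.parquet");
        List<String> parts = inferPartitioning(paths);
        List<String> schema = inferSchema(Arrays.asList("id", "value"), parts);
        System.out.println(parts);   // [year]
        System.out.println(schema);  // [id, value, year]
    }
}
```

Note how `inferPartitioning` consumes only the file paths, while `inferSchema` cannot produce the final schema without the partitioning result.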
> We've also had bugs in the past where the inference picks a different data
type than what you want.
I see what you mean here, but I think it is better to reconcile the differences
in Spark instead of in the source. If the source infers that a partition is a
string, but the user supplies a schema with an integer type, then all the
source would do is throw an exception. Spark can do that once the partitioning
is passed back, can't it?
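As a rough sketch of that reconciliation step (hypothetical names and types, not actual Spark code), once the source hands its inferred partitioning back, Spark itself could compare each inferred partition type against the user-specified schema and apply one consistent policy, instead of each source throwing its own exception:

```java
import java.util.*;

// Hypothetical sketch: Spark-side reconciliation of an inferred partition
// type against a user-specified schema. The policy here (user type wins,
// inferred type is the fallback) is an assumption for illustration.
public class PartitionTypeReconciliation {

    static String reconcile(String column, String inferredType, Map<String, String> userSchema) {
        String userType = userSchema.get(column);
        if (userType == null) {
            // Column absent from the user schema: keep the inferred type.
            return inferredType;
        }
        // Spark could validate the conversion here (e.g. string -> int) and
        // raise one consistent error across all sources if it's invalid.
        return userType;
    }

    public static void main(String[] args) {
        Map<String, String> userSchema = new HashMap<>();
        userSchema.put("year", "int");
        System.out.println(reconcile("year", "string", userSchema));  // int
        System.out.println(reconcile("month", "string", userSchema)); // string
    }
}
```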
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]