GitHub user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21306#discussion_r200709816
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/catalog/TableCatalog.java ---
    @@ -0,0 +1,123 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.sources.v2.catalog;
    +
    +import org.apache.spark.sql.catalyst.TableIdentifier;
    +import org.apache.spark.sql.catalyst.analysis.NoSuchTableException;
    +import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException;
    +import org.apache.spark.sql.catalyst.expressions.Expression;
    +import org.apache.spark.sql.types.StructType;
    +
    +import java.util.Arrays;
    +import java.util.Collections;
    +import java.util.List;
    +import java.util.Map;
    +
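    +/**
    + * Catalog methods for working with tables: loading table metadata by
    + * identifier and creating new tables.
    + */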
    +public interface TableCatalog {
    +  /**
    +   * Load table metadata by {@link TableIdentifier identifier} from the catalog.
    +   *
    +   * @param ident a table identifier
    +   * @return the table's metadata
    +   * @throws NoSuchTableException If the table doesn't exist.
    +   */
    +  Table loadTable(TableIdentifier ident) throws NoSuchTableException;
    +
    +  /**
    +   * Create a table in the catalog.
    +   *
    +   * @param ident a table identifier
    +   * @param schema the schema of the new table, as a struct type
    +   * @return metadata for the new table
    +   * @throws TableAlreadyExistsException If a table already exists for the identifier.
    +   */
    +  default Table createTable(TableIdentifier ident,
    +                            StructType schema) throws TableAlreadyExistsException {
    +    return createTable(ident, schema, Collections.emptyList(), Collections.emptyMap());
    +  }
    +
    +  /**
    +   * Create a table in the catalog.
    +   *
    +   * @param ident a table identifier
    +   * @param schema the schema of the new table, as a struct type
    +   * @param properties a string map of table properties
    +   * @return metadata for the new table
    +   * @throws TableAlreadyExistsException If a table already exists for the identifier.
    +   */
    +  default Table createTable(TableIdentifier ident,
    +                            StructType schema,
    +                            Map<String, String> properties) throws TableAlreadyExistsException {
    +    return createTable(ident, schema, Collections.emptyList(), properties);
    +  }
    +
    +  /**
    +   * Create a table in the catalog.
    +   *
    +   * @param ident a table identifier
    +   * @param schema the schema of the new table, as a struct type
    +   * @param partitions a list of expressions to use for partitioning data in the table
    +   * @param properties a string map of table properties
    +   * @return metadata for the new table
    +   * @throws TableAlreadyExistsException If a table already exists for the identifier.
    +   */
    +  Table createTable(TableIdentifier ident,
    +                    StructType schema,
    +                    List<Expression> partitions,
    --- End diff ---
    
    > The current end-user API only allows users to specify partition columns.
    
    I think an example would help explain the use of expressions here. Right now, I can create a table partitioned by day like this:
    ```
    CREATE TABLE t (ts timestamp, data string, day string) PARTITIONED BY (day)
    ```
    
    Then it's up to queries to supply the right values for `day`. I'm proposing we change that to something like the following, which allows expressions in the PARTITIONED BY clause instead of only column names:
    ```
    CREATE TABLE t (ts timestamp, data string) PARTITIONED BY (date(ts));
    ```
    
    This can handle all of the identity partitioning in Hive tables today, and it can also handle bucketing.
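    
    To make the mapping concrete, here's a rough sketch of how that DDL could translate into a call on this interface. It's illustrative only: the wrapper class and method are made up, and going through `functions.expr(...).expr()` is just one convenient way to build the unresolved `date(ts)` expression; any parser or expression builder would do.
    ```
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;

    import org.apache.spark.sql.catalyst.TableIdentifier;
    import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException;
    import org.apache.spark.sql.catalyst.expressions.Expression;
    import org.apache.spark.sql.functions;
    import org.apache.spark.sql.types.StructType;

    class CreateTableExample {
      // Hypothetical translation of:
      //   CREATE TABLE t (ts timestamp, data string) PARTITIONED BY (date(ts))
      static Table createPartitionedByDay(TableCatalog catalog)
          throws TableAlreadyExistsException {
        StructType schema = new StructType()
            .add("ts", "timestamp")
            .add("data", "string");

        // Build the unresolved date(ts) expression from a SQL string.
        List<Expression> partitions =
            Arrays.asList(functions.expr("date(ts)").expr());

        return catalog.createTable(
            TableIdentifier.apply("t"),
            schema,
            partitions,
            Collections.emptyMap());
      }
    }
    ```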
    
    > And why does the "partition transform" belong to a table definition?
    
    Transforms should be passed to the table so the source can use them for the physical layout. In DataSourceV2, the source could be anything, so it needs to be the component that handles the physical layout. Because we want distributed data sources, we need some way of telling them how to distribute data.
    
    For example, I could use a partitioning expression to tell a source how to shard across PostgreSQL instances. I could also use it to define the keys in an HBase connector. Those are uses of partitioning that Spark can't handle internally.
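    
    As a purely hypothetical sketch of that source-side use (the `ShardedWriter` class is made up for illustration and is not part of this PR or of Spark), a sharded writer could evaluate the table's partition expression against each row and hash the result to pick an instance:
    ```
    import java.util.List;
    import java.util.Objects;

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.catalyst.expressions.Expression;

    // Hypothetical connector class: the point is that the source, not Spark,
    // interprets the partition expressions to decide the physical layout.
    class ShardedWriter {
      private final List<Expression> partitions; // passed in from createTable
      private final int numShards;

      ShardedWriter(List<Expression> partitions, int numShards) {
        this.partitions = partitions;
        this.numShards = numShards;
      }

      int shardFor(InternalRow row) {
        // Assumes the expression has been resolved and bound to the table
        // schema so that eval works; e.g. evaluates date(ts) for this row.
        Object key = partitions.get(0).eval(row);
        return Math.floorMod(Objects.hashCode(key), numShards);
      }
    }
    ```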
    
    Like Hive, Spark has only supported a limited definition of partitioning up to now, but I'd like to be able to put tables using Hive's layout behind this API eventually. I think this way of configuring partitioning is a good way to do that, while supporting what Iceberg and other sources will need.

