Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21306#discussion_r200709816
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/catalog/TableCatalog.java ---
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.catalog;
+
+import org.apache.spark.sql.catalyst.TableIdentifier;
+import org.apache.spark.sql.catalyst.analysis.NoSuchTableException;
+import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException;
+import org.apache.spark.sql.catalyst.expressions.Expression;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;
+
+public interface TableCatalog {
+ /**
+ * Load table metadata by {@link TableIdentifier identifier} from the catalog.
+ *
+ * @param ident a table identifier
+ * @return the table's metadata
+ * @throws NoSuchTableException If the table doesn't exist.
+ */
+ Table loadTable(TableIdentifier ident) throws NoSuchTableException;
+
+ /**
+ * Create a table in the catalog.
+ *
+ * @param ident a table identifier
+ * @param schema the schema of the new table, as a struct type
+ * @return metadata for the new table
+ * @throws TableAlreadyExistsException If a table already exists for the identifier
+ */
+ default Table createTable(TableIdentifier ident,
+ StructType schema) throws TableAlreadyExistsException {
+ return createTable(ident, schema, Collections.emptyList(), Collections.emptyMap());
+ }
+
+ /**
+ * Create a table in the catalog.
+ *
+ * @param ident a table identifier
+ * @param schema the schema of the new table, as a struct type
+ * @param properties a string map of table properties
+ * @return metadata for the new table
+ * @throws TableAlreadyExistsException If a table already exists for the identifier
+ */
+ default Table createTable(TableIdentifier ident,
+ StructType schema,
+ Map<String, String> properties) throws TableAlreadyExistsException {
+ return createTable(ident, schema, Collections.emptyList(), properties);
+ }
+
+ /**
+ * Create a table in the catalog.
+ *
+ * @param ident a table identifier
+ * @param schema the schema of the new table, as a struct type
+ * @param partitions a list of expressions to use for partitioning data in the table
+ * @param properties a string map of table properties
+ * @return metadata for the new table
+ * @throws TableAlreadyExistsException If a table already exists for the identifier
+ */
+ Table createTable(TableIdentifier ident,
+ StructType schema,
+ List<Expression> partitions,
--- End diff --
> The current end-user API only allows users to specify partition columns.
I think an example will help show how expressions are used here. Right
now, I can create a table partitioned by day like this:
```
CREATE TABLE t (ts timestamp, data string, day string) PARTITIONED BY (day)
```
Then it's up to each query to supply the right values for `day`. I'm
proposing we change that to something like the following, which uses
expressions in the PARTITIONED BY clause instead of only allowing column names:
```
CREATE TABLE t (ts timestamp, data string) PARTITIONED BY (date(ts));
```
This can handle all of the identity partitioning in Hive tables today, and it
can also handle bucketing.
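To make that concrete at the API level, here is a rough sketch (not part of this PR) of what creating that table might look like through the proposed `TableCatalog` interface. How the catalog instance is obtained and how the `date(ts)` expression is built (via `functions.expr` below) are assumptions purely for illustration:
```
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.spark.sql.catalyst.TableIdentifier;
import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException;
import org.apache.spark.sql.catalyst.expressions.Expression;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.sources.v2.catalog.Table;
import org.apache.spark.sql.sources.v2.catalog.TableCatalog;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

class CreateTableExample {
  // Equivalent of: CREATE TABLE t (ts timestamp, data string) PARTITIONED BY (date(ts))
  Table createDayPartitionedTable(TableCatalog catalog) throws TableAlreadyExistsException {
    TableIdentifier ident = new TableIdentifier("t");
    StructType schema = new StructType()
        .add("ts", DataTypes.TimestampType)
        .add("data", DataTypes.StringType);

    // The partition transform is an ordinary catalyst Expression; building it
    // with functions.expr here is just one convenient way to get one.
    List<Expression> partitions = Arrays.asList(functions.expr("date(ts)").expr());

    return catalog.createTable(ident, schema, partitions,
        Collections.<String, String>emptyMap());
  }
}
```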
> And why does the "partition transform" belong to a table definition?
Transforms should be passed to the table so that the source can use them for
the physical layout. In DataSourceV2, the source could be anything, so it needs
to be the component that handles the physical layout. Because we want distributed
data sources, we need some way of telling them how to distribute data.
For example, I could use a partitioning expression to tell a source how to
shard across PostgreSQL instances. I could also use it to define the keys in an
HBase connector. Those are uses of partitioning that Spark can't handle
internally.
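As a purely hypothetical fragment of that idea, a sharded source's catalog could consume the transforms in its `createTable` implementation. `ShardLayout`, `store`, and the surrounding names are invented for illustration and are not part of this PR; only the one method is sketched:
```
// Inside a hypothetical TableCatalog implementation for a sharded store.
@Override
public Table createTable(TableIdentifier ident,
                         StructType schema,
                         List<Expression> partitions,
                         Map<String, String> properties) throws TableAlreadyExistsException {
  // Spark passes the transforms through as-is; the source decides what they
  // mean physically: shard keys here, bucket functions or HBase row keys elsewhere.
  ShardLayout layout = ShardLayout.fromTransforms(partitions);  // hypothetical helper
  return store.create(ident, schema, layout, properties);       // hypothetical store client
}
```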
Like Hive, Spark has only supported a limited definition of partitioning up
to now, but I'd like to be able to put tables using Hive's layout behind this
API eventually. I think this approach to configuring partitioning is a good way
to do that, while supporting what Iceberg and other sources will need.
---