The recently adopted SPIP to standardize logical plans requires a way to
plug in providers for table metadata operations, so that the new plans
can create and drop tables. I proposed an API to do this in a follow-up SPIP
on APIs for Table Metadata Operations
<https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#>.
This thread is to discuss that proposal.

There are two main parts:

   - A public facing API for creating, altering, and dropping tables
   - An API for catalog implementations to provide the underlying table
   operations

The main need is for the plug-in API, but I included the public one because
there isn’t currently a friendly public API to create tables and I think it
helps to see how both would work together.

Here’s a sample of the proposed public API:

catalog.createTable("db.table")
    .addColumn("id", LongType)
    .addColumn("data", StringType, nullable=true)
    .addColumn("ts", TimestampType)
    .partitionBy(day($"ts"))
    .config("prop", "val")
    .commit()
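To make the fluent style concrete, here is a hedged, self-contained sketch of how such a builder could be structured. All names here (TableBuilder, the string-typed columns, and the commit rendering) are illustrative stand-ins, not the actual proposed classes; the real commit() would hand the accumulated definition to a catalog.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative builder: each method records part of the table definition and
// returns `this` so calls can be chained, as in the sample above.
public class TableBuilder {
    private final String name;
    private final List<String[]> columns = new ArrayList<>();   // {name, type, nullable}
    private final List<String> partitionExprs = new ArrayList<>();
    private final Map<String, String> properties = new LinkedHashMap<>();

    public TableBuilder(String name) { this.name = name; }

    public TableBuilder addColumn(String col, String type) {
        return addColumn(col, type, false);
    }

    public TableBuilder addColumn(String col, String type, boolean nullable) {
        columns.add(new String[] {col, type, Boolean.toString(nullable)});
        return this;
    }

    public TableBuilder partitionBy(String expr) {
        partitionExprs.add(expr);
        return this;
    }

    public TableBuilder config(String key, String value) {
        properties.put(key, value);
        return this;
    }

    public String commit() {
        // A real implementation would call the catalog's createTable here;
        // this sketch just renders the collected definition.
        return name + " columns=" + columns.size()
                + " partitions=" + partitionExprs
                + " props=" + properties;
    }
}
```

The point of the builder is that the user never assembles a StructType or property map by hand; the final commit() call is the only side-effecting step.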

And here’s a sample of the catalog plug-in API:

Table createTable(
    TableIdentifier ident,
    StructType schema,
    List<Expression> partitions,
    Optional<List<SortOrder>> sortOrder,
    Map<String, String> properties)
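As a rough sketch of the provider side, an in-memory catalog implementing a simplified form of this hook might look like the following. Spark's actual types (TableIdentifier, StructType, Expression, SortOrder) are replaced with plain strings and maps purely for illustration; only the shape of createTable/dropTable mirrors the signature above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory catalog: createTable registers a table definition,
// dropTable removes it. Real implementations would talk to a metastore.
public class InMemoryCatalog {
    public static final class Table {
        public final String ident;
        public final Map<String, String> schema;   // column name -> type
        public final List<String> partitions;      // partition expressions, e.g. "day(ts)"
        public final Map<String, String> properties;

        Table(String ident, Map<String, String> schema,
              List<String> partitions, Map<String, String> properties) {
            this.ident = ident;
            this.schema = schema;
            this.partitions = partitions;
            this.properties = properties;
        }
    }

    private final Map<String, Table> tables = new HashMap<>();

    public Table createTable(String ident, Map<String, String> schema,
                             List<String> partitions, Map<String, String> properties) {
        if (tables.containsKey(ident)) {
            throw new IllegalStateException("table already exists: " + ident);
        }
        Table t = new Table(ident, schema, partitions, properties);
        tables.put(ident, t);
        return t;
    }

    public boolean dropTable(String ident) {
        return tables.remove(ident) != null;
    }
}
```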

Note that this API passes both bucketing and column-based partitioning as
Expressions. This is a generalization that makes it possible for the table
to use the relationship between columns and partitions. In the example
above, data is partitioned by the day of the timestamp field. Because the
expression is passed to the table, the table can use predicates on the
timestamp to filter out partitions without an explicit partition predicate.
There’s more detail in the proposal on this.
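To illustrate why passing the expression matters, here is a hedged sketch of the pruning it enables, with a concrete day() transform. Because the table knows partitions were derived as day(ts), a predicate like ts >= lowerBound can be translated into day(ts) >= day(lowerBound) and used to skip partitions. The class and method names are illustrative, not from the proposal.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.List;
import java.util.stream.Collectors;

public class DayPartitionPruning {
    // The partition transform: day(ts) -> days since the epoch (UTC).
    static long day(long tsMillis) {
        return Instant.ofEpochMilli(tsMillis)
                .atZone(ZoneOffset.UTC)
                .toLocalDate()
                .toEpochDay();
    }

    // Given the row-level predicate ts >= tsLowerBoundMillis, keep only
    // partitions whose day value could still contain matching rows.
    static List<Long> prune(List<Long> partitionDays, long tsLowerBoundMillis) {
        long minDay = day(tsLowerBoundMillis);
        return partitionDays.stream()
                .filter(d -> d >= minDay)
                .collect(Collectors.toList());
    }
}
```

Note that the partition on the boundary day must be kept, since part of that day may still satisfy the timestamp predicate; without knowing the day(ts) relationship, the engine could not derive this partition filter at all.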

The SPIP is for the APIs and does not cover how multiple catalogs would be
exposed. I started a separate discussion thread on how to access multiple
catalogs and maintain compatibility with Spark’s current behavior (how to
get the catalog instance in the above example).

Please use this thread to discuss the proposed APIs. Thanks, everyone!

rb
-- 
Ryan Blue
Software Engineer
Netflix
