There are two main ways to load tables in Spark: by name (db.table) and by
path. Unfortunately, the DataSourceV2 integration has no support for
identifying tables by name.

I propose supporting the use of TableIdentifier, which is the standard way
to pass around table names.

The reason I think we should do this is to easily support more ways of
working with DataSourceV2 tables. SQL statements and parts of the
DataFrameReader and DataFrameWriter APIs that use table names create
UnresolvedRelation instances that wrap an unresolved TableIdentifier.

If we add support for passing TableIdentifier to a DataSourceV2Relation,
then about all we need to enable these code paths is a resolution
rule. For that rule, we could easily identify a default data source that
handles named tables.
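
To make the resolution step concrete, here is a rough sketch in Python. It
is only a model of the idea: the classes are simplified stand-ins rather
than Spark's actual ones, and DEFAULT_SOURCE and resolve_named_table are
names made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableIdentifier:
    """Stand-in for Spark's TableIdentifier: a table name plus optional database."""
    table: str
    database: Optional[str] = None


@dataclass(frozen=True)
class UnresolvedRelation:
    """Stand-in for the plan node created when a table is referenced by name."""
    identifier: TableIdentifier


@dataclass(frozen=True)
class DataSourceV2Relation:
    """Stand-in for the v2 relation, here accepting an identifier directly."""
    source: str
    identifier: TableIdentifier


# Hypothetical setting: the source that handles tables referenced by name.
DEFAULT_SOURCE = "example_v2_source"


def resolve_named_table(plan, default_source=DEFAULT_SOURCE):
    """Swap an unresolved named table for a v2 relation; pass other nodes through."""
    if isinstance(plan, UnresolvedRelation):
        return DataSourceV2Relation(default_source, plan.identifier)
    return plan


plan = UnresolvedRelation(TableIdentifier("events", "db"))
print(resolve_named_table(plan))
```

The point is that the rule only has to hand the identifier over. It does not
need to know how the source will interpret it.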

This is what we’re doing in our Spark build, and we have DataSourceV2
tables working great through SQL. (Part of this depends on the logical plan
changes from my previous email to ensure inserts are properly resolved.)

In the long term, I think we should update how we parse tables so that
TableIdentifier can contain a source in addition to a database/context and
a table name. That would allow us to integrate new sources fairly
seamlessly, without needing a rather redundant SQL create statement like


Also, I think we should pass TableIdentifier to DataSourceV2Relation,
rather than going with Wenchen’s suggestion that we pass the table name as
a string property, “table”. My rationale is that the new API shouldn’t leak
its internal details to other parts of the planner.

If we were to convert TableIdentifier to a “table” property wherever
DataSourceV2Relation is created, we would create several places that need to be
in sync with the same convention. On the other hand, passing TableIdentifier
to DataSourceV2Relation and relying on the relation to correctly set the
options passed to readers and writers minimizes the number of places that
conversion needs to happen.
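
A minimal sketch of what I mean, again in Python with stand-in classes
rather than the real Spark ones. The source_options method and the
"database" key are assumptions for illustration; only the "table" key comes
from the suggestion discussed above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TableIdentifier:
    """Stand-in for Spark's TableIdentifier."""
    table: str
    database: Optional[str] = None


@dataclass(frozen=True)
class DataSourceV2Relation:
    """Stand-in relation that owns the identifier-to-options convention."""
    source: str
    options: Dict[str, str]
    identifier: Optional[TableIdentifier] = None

    def source_options(self) -> Dict[str, str]:
        """Build the options passed to readers and writers, in one place."""
        merged = dict(self.options)  # copy so the caller's options are untouched
        if self.identifier is not None:
            merged["table"] = self.identifier.table
            if self.identifier.database is not None:
                # Hypothetical key name, chosen only for this sketch.
                merged["database"] = self.identifier.database
        return merged


relation = DataSourceV2Relation(
    "example_v2_source", {"path": "/tmp/x"}, TableIdentifier("events", "db"))
print(relation.source_options())
```

Callers that create the relation never see the key names, so changing the
convention later means touching one method instead of every call site.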

Ryan Blue
Software Engineer
