yaooqinn opened a new pull request #28527: URL: https://github.com/apache/spark/pull/28527
### What changes were proposed in this pull request?

Currently, the user home directory is used as the base path for database and table locations when those locations are specified with relative paths, e.g.

```sql
> set spark.sql.warehouse.dir;
spark.sql.warehouse.dir	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/spark-warehouse/
spark-sql> create database loctest location 'loctestdbdir';
spark-sql> desc database loctest;
Database Name	loctest
Comment
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir
Owner	kentyao
spark-sql> create table loctest(id int) location 'loctestdbdir';
spark-sql> desc formatted loctest;
id	int	NULL
# Detailed Table Information
Database	default
Table	loctest
Owner	kentyao
Created Time	Thu May 14 16:29:05 CST 2020
Last Access	UNKNOWN
Created By	Spark 3.1.0-SNAPSHOT
Type	EXTERNAL
Provider	parquet
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
```

The user home directory is not always warehouse-related, cannot be changed at runtime, and is shared by both databases and tables as the parent directory. Meanwhile, we use the table path as the parent directory for relative partition locations.

The config `spark.sql.warehouse.dir` represents `the default location for managed databases and tables`. For databases, the case above does not seem to follow these semantics, because it should use `spark.sql.warehouse.dir` as the base path instead. For tables, it seems to be right, but here I suggest enriching its meaning so that it is also the base path for external tables whose locations are given as relative paths.

With the changes in this PR, the location of a database will be `warehouseDir/dbpath` when `dbpath` is relative.
The location of a table will be `dbpath/tblpath` when `tblpath` is relative.

### Why are the changes needed?

Bugfix and improvement.

Firstly, databases with relative locations should be created under the default location specified by `spark.sql.warehouse.dir`.

Secondly, external tables with relative paths may also follow this behavior for consistency.

Finally, databases, tables, and partitions with relative paths should all choose their base paths in the same way.

### Does this PR introduce _any_ user-facing change?

Yes, this PR changes the `createDatabase`, `alterDatabase`, `createTable` and `alterTable` APIs and the related DDLs. If the LOCATION clause is followed by a relative path, the root path will be `spark.sql.warehouse.dir` for databases, and `spark.sql.warehouse.dir` / `dbPath` for tables.

### How was this patch tested?

Added unit tests.
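
The resolution rule described above (relative database paths resolve against `spark.sql.warehouse.dir`, relative table paths against the database path) can be sketched as follows. This is only an illustrative Python model, not the code in this PR: the function names `is_absolute`, `resolve`, `database_location` and `table_location` are hypothetical, and the sketch assumes a location is absolute when it has a URI scheme or starts with `/`. It does not normalize `.` or `..` segments.

```python
from urllib.parse import urlparse


def is_absolute(location: str) -> bool:
    """Assumption for this sketch: absolute means a URI scheme or a leading '/'."""
    parsed = urlparse(location)
    return bool(parsed.scheme) or parsed.path.startswith("/")


def resolve(base: str, location: str) -> str:
    """Return `location` unchanged if absolute, otherwise append it to `base`."""
    if is_absolute(location):
        return location
    return base.rstrip("/") + "/" + location


def database_location(warehouse_dir: str, db_path: str) -> str:
    """Hypothetical helper: a relative database path resolves under the warehouse dir."""
    return resolve(warehouse_dir, db_path)


def table_location(warehouse_dir: str, db_path: str, tbl_path: str) -> str:
    """Hypothetical helper: a relative table path resolves under the database location."""
    return resolve(database_location(warehouse_dir, db_path), tbl_path)
```

With a hypothetical warehouse directory of `file:/tmp/spark-warehouse/`, `database_location` maps `loctestdbdir` to `file:/tmp/spark-warehouse/loctestdbdir`, while an absolute table path such as `/data/tbl` is returned as-is by `table_location`.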
