yaooqinn opened a new pull request #28527:
URL: https://github.com/apache/spark/pull/28527


   ### What changes were proposed in this pull request?
   
   Currently, the user home directory is used as the base path for database 
and table locations when those locations are specified as relative paths, 
e.g.
   ```sql
   > set spark.sql.warehouse.dir;
   spark.sql.warehouse.dir      file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/spark-warehouse/
   spark-sql> create database loctest location 'loctestdbdir';
   
   spark-sql> desc database loctest;
   Database Name        loctest
   Comment
   Location     file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir
   Owner        kentyao
   
   spark-sql> create table loctest(id int) location 'loctestdbdir';
   spark-sql> desc formatted loctest;
   id   int     NULL
   
   # Detailed Table Information
   Database     default
   Table        loctest
   Owner        kentyao
   Created Time Thu May 14 16:29:05 CST 2020
   Last Access  UNKNOWN
   Created By   Spark 3.1.0-SNAPSHOT
   Type EXTERNAL
   Provider     parquet
   Location     file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir
   Serde Library        org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
   InputFormat  org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
   OutputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
   ```
   The user home directory is not necessarily related to the warehouse, cannot 
be changed at runtime, and is shared by databases and tables as their parent 
directory. By contrast, we use the table path as the parent directory for 
relative partition locations.
   
   The config `spark.sql.warehouse.dir` represents `the default location for 
managed databases and tables`.
   For databases, the case above does not seem to follow these semantics, 
because it should use `spark.sql.warehouse.dir` as the base path instead.
   
   For tables, the current behavior seems right, but I suggest extending the 
meaning of the config so that it also serves as the base for external tables 
whose locations are relative paths.
   
   With changes in this PR,
   
   The location of a database will be `warehouseDir/dbpath` when `dbpath` is 
relative.
   The location of a table will be `dbpath/tblpath` when `tblpath` is relative. 
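
The two rules above can be sketched outside Spark as follows. This is only an illustration of the intended resolution order, not the code in this PR: the function names `resolve_location`, `database_location` and `table_location` are hypothetical, and treating "has a URI scheme or a leading `/`" as "absolute" is a simplifying assumption.

```python
from urllib.parse import urlparse


def resolve_location(base: str, location: str) -> str:
    """Return `location` unchanged if it looks absolute, else place it under `base`.

    Assumption for this sketch: a location is absolute when it carries a
    URI scheme (e.g. file:, hdfs:) or starts with '/'.
    """
    if urlparse(location).scheme or location.startswith("/"):
        return location
    # Make sure exactly one '/' separates the base from the relative part.
    prefix = base if base.endswith("/") else base + "/"
    return prefix + location


def database_location(warehouse_dir: str, db_path: str) -> str:
    # Rule 1: a relative database path is resolved against the warehouse dir.
    return resolve_location(warehouse_dir, db_path)


def table_location(db_location: str, tbl_path: str) -> str:
    # Rule 2: a relative table path is resolved against the database location.
    return resolve_location(db_location, tbl_path)


if __name__ == "__main__":
    # Hypothetical warehouse dir, chosen only for the illustration.
    warehouse = "file:/tmp/spark-warehouse/"
    db = database_location(warehouse, "loctestdbdir")
    print(db)
    print(table_location(db, "loctesttbl"))
```

Absolute locations pass through untouched in this sketch, so only relative paths are affected by the choice of base.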
   
   ### Why are the changes needed?
   
   bugfix and improvement
   
   Firstly, the databases with relative locations should be created under the 
default location specified by `spark.sql.warehouse.dir`.
   
   Secondly, the external tables with relative paths may also follow this 
behavior for consistency.
   
   Finally, databases, tables, and partitions with relative paths should all 
choose their base paths in the same way.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, this PR changes the `createDatabase`, `alterDatabase`, `createTable` 
and `alterTable` APIs and related DDLs. If the LOCATION clause is followed by a 
relative path, the root path will be `spark.sql.warehouse.dir` for databases, 
and `spark.sql.warehouse.dir` / `dbPath` for tables.
   
   
   ### How was this patch tested?
   
   
   Added unit tests.
   


