[GitHub] [iceberg] anigos commented on a diff in pull request #8194: Core: check location for conflict before creating table

via GitHub Thu, 03 Aug 2023 20:21:54 -0700


anigos commented on code in PR #8194:
URL: https://github.com/apache/iceberg/pull/8194#discussion_r1283935592



##########
core/src/main/java/org/apache/iceberg/TableProperties.java:
##########
@@ -365,4 +365,7 @@ private TableProperties() {}
 
   public static final String UPSERT_ENABLED = "write.upsert.enabled";
   public static final boolean UPSERT_ENABLED_DEFAULT = false;
+
+  public static final String UNIQUE_LOCATION = "location.unique";

Review Comment:
   I have thought this through, and two rules came to mind that we could 
enforce along this route:
   
   1. No database creation should be allowed under an existing database path. 
This would address the common problem of people creating databases under an 
existing database path. 
   2. No table creation should be allowed under an existing table path. 
   
   **Case 1**
   
   We already have the information we need: the existing table and its 
location. A table that was created in the past definitely has a valid path, so 
do we really need to check FileIO, or is a simple string comparison or regex 
match enough?
   
   Say a table's location is `s3://somerandompath/my_database/my_table`.
    
   Instead of looking into FileIO, why not leverage our own metadata? There 
are various ways of creating an Iceberg table: just via database.tableName, 
with an explicit location, and so on. By practice, the database path is always 
a constant path. If someone tries to create a table under the same location 
with the same name, we can throw an exception saying that 
s3://somerandompath/my_database/my_table already exists just by looking at its 
database reference: the database path should be one level up, and only a path 
exactly one level under a database path should be a permissible table path. 
The uniqueness does not necessarily have to come from the storage file 
location; it can come from our metadata information. 
   
   ```
   CREATE TABLE prod.db.sample
   USING iceberg
   PARTITIONED BY (part)
   TBLPROPERTIES ('key'='value')
   AS SELECT ...
   ```
   
   OR 
   
   ```
   CREATE TABLE IF NOT EXISTS prod.db.sample (
     id integer,
     ......
   )
   USING ICEBERG
   LOCATION
   TBLPROPERTIES (
     'type' 'hive',.......
   )
   ```
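
To make the metadata-only idea concrete, here is a minimal sketch of the string comparison described above. It is not Iceberg code: the class `LocationConflictCheck` and all of its methods are hypothetical, and it assumes the catalog can hand us the list of locations of already registered tables.

```java
import java.util.List;

/** Hypothetical helper (not part of Iceberg): location conflict check using metadata only. */
final class LocationConflictCheck {

  /** Trims whitespace and trailing slashes so "s3://a/b/" and "s3://a/b" compare equal. */
  static String normalize(String location) {
    String trimmed = location.trim();
    while (trimmed.endsWith("/")) {
      trimmed = trimmed.substring(0, trimmed.length() - 1);
    }
    return trimmed;
  }

  /** True if candidate is the same path as existing or lies somewhere underneath it. */
  static boolean isSameOrNested(String candidate, String existing) {
    String c = normalize(candidate);
    String e = normalize(existing);
    // Appending "/" avoids treating ".../my_table2" as nested under ".../my_table".
    return c.equals(e) || c.startsWith(e + "/");
  }

  /** True if candidate is exactly one path segment below the database location. */
  static boolean isDirectChild(String candidate, String databaseLocation) {
    String c = normalize(candidate);
    String prefix = normalize(databaseLocation) + "/";
    if (!c.startsWith(prefix)) {
      return false;
    }
    String rest = c.substring(prefix.length());
    return !rest.isEmpty() && !rest.contains("/");
  }

  /** Returns the first known table location that conflicts with candidate, or null if none. */
  static String findConflict(String candidate, List<String> knownTableLocations) {
    for (String existing : knownTableLocations) {
      if (isSameOrNested(candidate, existing)) {
        return existing;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // Assumed input: locations of tables the catalog already knows about.
    List<String> known = List.of("s3://somerandompath/my_database/my_table");
    System.out.println(findConflict("s3://somerandompath/my_database/my_table/", known));
    System.out.println(findConflict("s3://somerandompath/my_database/other_table", known));
    System.out.println(
        isDirectChild("s3://somerandompath/my_database/other_table", "s3://somerandompath/my_database"));
  }
}
```

In this sketch a create request would be rejected when `findConflict` returns a non-null location, or when `isDirectChild` is false for the database path. Whether the catalog can list all table locations cheaply is an open question that the sketch does not answer.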
   
   
   **Case 2**
   
   Rename table: when we rename a table we don't move files; it is a metadata 
operation. The base path remains the same but the table name gets updated, so 
in this case there is no impact. For the unique-location check we can still 
look up the metadata and get all unique paths under the database reference. 
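
The rename case can be illustrated with a small in-memory stand-in for the catalog metadata. This is only a sketch under assumptions: `InMemoryCatalogSketch` and its methods are hypothetical and do not correspond to any Iceberg class. It keeps one location per table name and checks uniqueness against the recorded locations only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for catalog metadata (not an Iceberg class). */
final class InMemoryCatalogSketch {
  // Table name -> base location, as recorded in metadata.
  private final Map<String, String> locationByTable = new HashMap<>();

  /** Registers a table; rejects a duplicate name or a location already recorded. */
  void create(String tableName, String location) {
    if (locationByTable.containsKey(tableName)) {
      throw new IllegalStateException("Table already exists: " + tableName);
    }
    if (locationByTable.containsValue(location)) {
      throw new IllegalStateException("Table location already in use: " + location);
    }
    locationByTable.put(tableName, location);
  }

  /** Renames a table; only the key changes, the recorded location stays the same. */
  void rename(String from, String to) {
    if (locationByTable.containsKey(to)) {
      throw new IllegalStateException("Table already exists: " + to);
    }
    String location = locationByTable.remove(from);
    if (location == null) {
      throw new IllegalArgumentException("No such table: " + from);
    }
    locationByTable.put(to, location);
  }

  /** Location recorded for the given table name, or null if the table is unknown. */
  String locationOf(String tableName) {
    return locationByTable.get(tableName);
  }

  /** All unique locations currently recorded. */
  Set<String> locations() {
    return new HashSet<>(locationByTable.values());
  }
}
```

After `rename("db.sample", "db.renamed")` the old name is free again, but a later `create("db.sample", ...)` pointing at the original location is still rejected, because the location lookup goes through the metadata rather than the table name.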
   
   



##########
core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java:
##########
@@ -192,6 +192,14 @@ public Table create() {
 
       String baseLocation = location != null ? location : defaultWarehouseLocation(identifier);
       tableProperties.putAll(tableOverrideProperties());
+
+      if (Boolean.parseBoolean(tableProperties.get(TableProperties.UNIQUE_LOCATION))) {
+        boolean alreadyExists = ops.io().newInputFile(baseLocation).exists();
+        if (alreadyExists) {
+          throw new AlreadyExistsException("Table location already in use: %s", baseLocation);



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

