sshkvar commented on pull request #2850:
URL: https://github.com/apache/iceberg/pull/2850#issuecomment-886514901
> In testing etc, I very often use a similar pattern (possibly using a
timestamp as the table suffix).
>
> However, I'm not sure if the best place to be doing this is in the Iceberg
code.
>
> What other tools are you using to create these tables that have UUID
suffixes? Usually, when I encounter this need, I'm doing it in one of two
places:
> (1) Directly from shell scripts or small Spark / Trino jobs when testing on S3 (and wanting to ensure a brand new table). The solution for me there is simply to place a timestamp in the table name in the code. Here's a sample from some code I have elsewhere:
>
> ```scala
> import java.util.Date
>
> // Suffix the table name with the current epoch millis so each run
> // gets a brand new table
> val currentTime = new Date().getTime
> val tableName = s"table_$currentTime"
> spark.sql(s"CREATE TABLE IF NOT EXISTS my_catalog.default.${tableName} (name string, age int) USING iceberg")
> ```
>
> (2) From some sort of scheduling tool, such as Airflow or Azkaban. In this case, it's very easy to create a UUID when passing in the "new table name" to the Spark job.
>
> Effectively, for me, I'm not sure if this is something that makes sense to place in Iceberg.
>
> Can you elaborate further on why this isn't something that you can pass as an argument to your jobs, etc.? It feels very use-case-specific, with possible ways for you to deal with it using existing tools, but maybe I'm not fully understanding the scope of your problem. 🙂
@kbendick Thanks for the quick reply!
Let me provide additional details.
Actually, we do not need to change the table name (and we don't); this PR just adds a UUID suffix to the table location. We need this to store tables with the same name in different "folders" on S3.
Our use case:
1. We created a table named `test_table` and inserted some data into it.
2. Then we dropped this table from the metastore only, because we need the ability to restore it later.
3. Then we created a new table with the same name, `test_table`.
4. And we dropped this table again.
With this PR we will be able to restore any of these tables, because their data and metadata are placed in different folders; we only need to restore the table location information in the metastore, which we can easily do via the Iceberg API (see the sketch below).
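A minimal sketch of that restore step, assuming a catalog that implements `registerTable` from the Iceberg `Catalog` API; the catalog configuration and the metadata file path below are hypothetical:

```scala
import java.util.Collections
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hive.HiveCatalog

// Re-register a dropped table by pointing the metastore back at the last
// metadata.json written under the table's own UUID-suffixed location.
// Real deployments would pass metastore URI, warehouse path, etc. here.
val catalog = new HiveCatalog()
catalog.initialize("hive", Collections.emptyMap[String, String]())

val metadataLocation = // hypothetical path to the dropped table's metadata
  "s3://my-bucket/warehouse/default/test_table-<uuid>/metadata/v2.metadata.json"
catalog.registerTable(TableIdentifier.of("default", "test_table"), metadataLocation)
```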
We also have scheduled compaction and orphan-file cleanup processes. If the data and metadata files of both tables were kept in the same folder, the orphan-file cleanup process would delete the data and metadata of the table dropped in step 2.
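For context, that cleanup looks roughly like this (a sketch using Iceberg's Spark actions API; the table name reuses the earlier placeholder and the three-day retention window is just an example):

```scala
import java.util.concurrent.TimeUnit
import org.apache.iceberg.Table
import org.apache.iceberg.spark.Spark3Util
import org.apache.iceberg.spark.actions.SparkActions

// deleteOrphanFiles scans the table location for files that no snapshot
// references. If two tables shared one folder, the dropped table's files
// would be classified as orphans of the live table and removed.
val table: Table = Spark3Util.loadIcebergTable(spark, "my_catalog.default.test_table")
SparkActions.get()
  .deleteOrphanFiles(table)
  .olderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3)) // example retention
  .execute()
```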
Based on what is described above, an `EXTERNAL` table is not an option for us.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.