rustyconover commented on PR #8260:
URL: https://github.com/apache/iceberg/pull/8260#issuecomment-1687313178

   Hi all, I love the idea of this spec. I'm especially interested in using 
Iceberg with S3 replication for disaster recovery. 
   
   There will be challenges with the Glue catalog specifically, but I have 
ideas for how to overcome them.  Additionally, with S3 replication there can be 
periods when not all of the contents of a table snapshot have been replicated 
yet, so it would be useful to track which "snapshot" is available at each 
location where the table is replicated.  Again, this is possible to do for S3, 
but I'm unsure about other storage systems.
   
   From reading the Google doc, it is not especially clear what is being 
proposed.  There are many "alternatives" listed, and I couldn't work out the 
actual proposal's TL;DR action points.
   
   Personally, I was thinking of solving this challenge with a map from a 
"location" string to a regular expression substitution to apply to all table 
URIs.  Each catalog instance could specify its "location" using a URI parameter 
added to the `metadata.json` current_metadata.  An example for a table in 
us-east-2 replicated from us-east-1:
   
   
`s3://bucket-us-east-2/database/table_name/data/[uuid].metadata.json?location=aws:us-east-2`
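
   As a rough sketch of how a client might read that parameter, the snippet 
below splits the `location` value off a metadata URI using only the Python 
standard library. The function name `split_location` and the file name in the 
usage line are hypothetical, chosen for illustration; nothing here is an 
existing Iceberg API.

```python
from urllib.parse import parse_qs, urlsplit


def split_location(metadata_uri: str):
    """Return (uri_without_query, location), where location may be None.

    Hypothetical helper: assumes the catalog appends '?location=<name>'
    to the metadata.json URI, as suggested in the comment above.
    """
    parts = urlsplit(metadata_uri)
    # parse_qs maps each key to a list of values; take the first, if any.
    location = parse_qs(parts.query).get("location", [None])[0]
    # Rebuild the URI without the query string so it can be fetched as-is.
    bare_uri = parts._replace(query="").geturl()
    return bare_uri, location


bare_uri, location = split_location(
    "s3://bucket-us-east-2/database/table_name/data/"
    "example.metadata.json?location=aws:us-east-2"
)
```

   A URI without the parameter yields `None` for the location, which a client 
could treat as "apply no replacements".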
   
   The Iceberg table's metadata would have a property called 
"location_replacements", which would be a JSON string holding a serialized map 
from each "location" to a regular expression substitution to apply to all URI 
references.
   
   ```
   metadata.table.location_replacements = {
     "aws:us-east-2": "s/^s3:\/\/bucket-us-east-1\//s3:\/\/bucket-us-east-2\//"
   }
   ```
   
   The regular expression substitution would mean that paths that normally 
reference `bucket-us-east-1` would reference `bucket-us-east-2` when the catalog 
is read from that "location".
   
   The location replacements map would contain a "stringified" regular 
expression that can be applied to all URIs contained in the table.  The 
use of regular expressions would allow a lot of flexibility, but would bring a 
moderate amount of complexity.  Regular expressions could allow tables to be at 
different relative paths depending on storage media.
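
   To make the idea concrete, here is a minimal sketch of how a reader could 
apply such a map. It assumes the sed-style `s/pattern/replacement/flags` form 
with `/` escaped as `\/`, and that the map is stored as a JSON string. The 
names `parse_sed` and `relocate` are hypothetical. The replacement here ends 
with an escaped slash so that the separator after the bucket name is kept. 
Note that in strict JSON the backslashes have to be doubled, which 
`json.dumps` takes care of.

```python
import json
import re


def parse_sed(expr: str):
    """Parse 's/pattern/replacement/flags' into (compiled_pattern, replacement, count).

    Sketch only: splits on '/' not preceded by a backslash, so a pattern
    ending in an escaped backslash is not handled.
    """
    fields = re.split(r"(?<!\\)/", expr)
    if len(fields) != 4 or fields[0] != "s":
        raise ValueError(f"not a sed-style substitution: {expr!r}")
    _, pattern, replacement, flags = fields
    pattern = pattern.replace("\\/", "/")
    replacement = replacement.replace("\\/", "/")
    # Like sed, replace only the first match unless the 'g' flag is present.
    count = 0 if "g" in flags else 1
    return re.compile(pattern), replacement, count


def relocate(uri: str, location, replacements: dict) -> str:
    """Rewrite uri for the given location; unknown locations leave it unchanged."""
    expr = replacements.get(location)
    if expr is None:
        return uri
    pattern, replacement, count = parse_sed(expr)
    return pattern.sub(replacement, uri, count=count)


# The value a writer would store in the table property (assumed JSON string).
serialized = json.dumps(
    {"aws:us-east-2": r"s/^s3:\/\/bucket-us-east-1\//s3:\/\/bucket-us-east-2\//"}
)
replacements = json.loads(serialized)

original = "s3://bucket-us-east-1/database/table_name/data/file.parquet"
rewritten = relocate(original, "aws:us-east-2", replacements)
```

   A simpler alternative would be a plain prefix-to-prefix map, which gives up 
the flexibility of regular expressions but avoids parsing and escaping rules 
that every client would have to implement identically.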
   
   Since this will change almost every client of Iceberg, this spec should 
likely have more eyes on it.
   
   @fokko your thoughts from the Python perspective?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

