rustyconover commented on PR #8260:
URL: https://github.com/apache/iceberg/pull/8260#issuecomment-1687313178
Hi All, I love the idea of this spec. I'm especially interested in using
Iceberg with S3 replication for disaster recovery.
There will be challenges with the Glue catalog specifically, but I have
ideas for how to overcome them. Additionally, with S3 replication there can be
times when not all of the contents of a table snapshot have been replicated
yet, so it would be useful to track which "snapshot" is available at each
location where the table is replicated. Again, this is possible to do for S3,
but I'm unsure about other storage systems.
From reading the Google doc, it is not especially clear what is proposed.
There are many "alternatives" listed, and I couldn't divine the actual
proposal's TL;DR action points.
Personally, I was thinking of solving this challenge with a map from a
"location" string to a regular expression to apply to all table URIs. Each
catalog instance could specify its "location" using a URI parameter added to
the `metadata.json` current_metadata. An example for a table in us-east-2
replicated from us-east-1 could be:
`s3://bucket-us-east-2/database/table_name/data/[uuid].metadata.json?location=aws:us-east-2`
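As a rough sketch of how a client might read that parameter, assuming it is carried as an ordinary query string: the helper name `catalog_location` and the file name `example.metadata.json` are made up for illustration and are not part of any Iceberg API.

```python
from urllib.parse import urlsplit, parse_qs


def catalog_location(metadata_uri):
    """Return the "location" query parameter of a metadata URI, or None.

    Hypothetical helper: it assumes the parameter is a plain query string
    appended to the metadata.json URI, as in the example above.
    """
    query = urlsplit(metadata_uri).query
    values = parse_qs(query).get("location")
    return values[0] if values else None


# "example.metadata.json" stands in for the real [uuid].metadata.json name.
uri = (
    "s3://bucket-us-east-2/database/table_name/data/"
    "example.metadata.json?location=aws:us-east-2"
)
print(catalog_location(uri))
```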
The Iceberg table's metadata would have a property called
"location_replacements", which would be a JSON string of a serialized map from
each "location" to a regular expression substitution to apply to all URI
references.
```
metadata.table.location_replacements = {
"aws:us-east-2": "s/^s3:\/\/bucket-us-east-1\//s3:\/\/bucket-us-east-2\//"
}
```
The regular expression substitution would mean paths that normally reference
`bucket-us-east-1` would reference `bucket-us-east-2` when reading the catalog
from that "location".
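To make the idea concrete, here is a minimal sketch of how a reader could apply such a map. It assumes the sed-style `s/pattern/replacement/` form with `\/` as the escaped delimiter, and that the string reaches the client with its backslashes intact. The function names `parse_substitution` and `relocate` are hypothetical, and the replacement used here ends in `\/` so that the path separator is kept.

```python
import re


def parse_substitution(expr):
    """Split a sed-style 's/pattern/replacement/' string into its two parts."""
    # Split only on '/' characters that are not preceded by a backslash.
    parts = re.split(r"(?<!\\)/", expr)
    if len(parts) != 4 or parts[0] != "s" or parts[3] != "":
        raise ValueError(f"unsupported substitution: {expr!r}")
    pattern, replacement = parts[1], parts[2]
    # '\/' is a valid escape inside the pattern; unescape it in the replacement.
    return pattern, replacement.replace("\\/", "/")


def relocate(uri, location, location_replacements):
    """Rewrite a URI for the given location; unknown locations pass through."""
    expr = location_replacements.get(location)
    if expr is None:
        return uri
    pattern, replacement = parse_substitution(expr)
    return re.sub(pattern, replacement, uri, count=1)


# Assumed in-memory form of the proposed "location_replacements" property.
location_replacements = {
    "aws:us-east-2": r"s/^s3:\/\/bucket-us-east-1\//s3:\/\/bucket-us-east-2\//",
}

original = "s3://bucket-us-east-1/database/table_name/data/file.parquet"
print(relocate(original, "aws:us-east-2", location_replacements))
print(relocate(original, "aws:us-east-1", location_replacements))
```

A reader in the primary region, which has no entry in the map, would see the URIs unchanged.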
The location replacements map would contain a "stringified" regular
expression that can be applied to all URIs contained in the table. The
use of regular expressions would allow a lot of flexibility, but would bring a
moderate amount of complexity. Regular expressions could allow tables to be at
different relative paths depending on the storage medium.
Since this will change almost every client of Iceberg, this spec should
likely have more eyes on it.
@fokko your thoughts from the Python perspective?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]