[GitHub] [iceberg] rdblue commented on issue #1617: Support relative paths in Table Metadata

GitBox Fri, 25 Jun 2021 18:12:46 -0700


rdblue commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-868901758

I looked through the design doc. It seems to be much better than the last
version, so thank you for the update.

Overall, I think it is still incorrect on a few points and is quite a bit
longer than it will need to be in the end. The main confusion seems to be where
the table location comes from. Iceberg [table metadata tracks a
location](https://iceberg.apache.org/spec/#table-metadata-fields) string that
is written into all metadata.json files. This is the table's location. Iceberg
does not require any location other than this one and there isn't a way to pass
a different location in the `TableOperations` API; the only way to pass a table
location is through `TableMetadata`.

There are also a few of table properties that can affect locations:
* Metadata can be redirected by setting `write.metadata.path`, which is
defaulted to the `metadata` folder under the table location
* The default location provider writes data underneath
`write.folder-storage.path`, which is defaulted to the `data` folder under the
table location
* The object storage location provider shards data underneath
`write.object-storage.path`

There are two broad classes of tables:
1. Hadoop tables identify the table itself by location and enforce that the
location in `TableMetadata` is identical to the location used to identify the
table when it is created. For Hadoop tables, it makes sense that the table
location used for relative paths is the location that gets passed to create
[`HadoopTableOperations`](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java#L71).
But then there could be a problem that table metadata actually points to a
different location unless it is modified.
2. Metastore tables do not have a location that is external to table
metadata. However, when a metastore table is tracked by Hive, there is a
location that Hive tracks and we set that to the table's location.

For table metadata files, there is only an external location for Hadoop
tables. Otherwise, the table location must come from the table's metadata.json
file. That makes it difficult to make `metadata_location` and
`previous_metadata_location` relative paths. I would probably not attempt to
make these paths relative, but I'm open to ideas.

We could update the spec to allow relative metadata.json paths for Hadoop
tables, but will need to state how to handle a different `location` in the
metadata file. (Ignore?)

Metadata locations aside, I think that it is a good plan to make the
relative paths transparent. Most of the library should continue to operate on
absolute paths and relative paths should be made while writing and made
absolute while reading. That makes it so the changes are fairly easily tested.
If you do this, then I think there is a lot less to cover in the design doc.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] rdblue commented on issue #1617: Support relative paths in Table Metadata

Reply via email to