rdblue commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-868901758


   I looked through the design doc. It seems to be much better than the last 
version, so thank you for the update.
   
   Overall, I think it is still incorrect on a few points and is quite a bit 
longer than it will need to be in the end. The main confusion seems to be where 
the table location comes from. Iceberg [table metadata tracks a 
location](https://iceberg.apache.org/spec/#table-metadata-fields) string that 
is written into all metadata.json files. This is the table's location. Iceberg 
does not require any location other than this one and there isn't a way to pass 
a different location in the `TableOperations` API; the only way to pass a table 
location is through `TableMetadata`.
   
   There are also a few of table properties that can affect locations:
   * Metadata can be redirected by setting `write.metadata.path`, which is 
defaulted to the `metadata` folder under the table location
   * The default location provider writes data underneath 
`write.folder-storage.path`, which is defaulted to the `data` folder under the 
table location
   * The object storage location provider shards data underneath 
`write.object-storage.path`
   
   There are two broad classes of tables:
   1. Hadoop tables identify the table itself by location and enforce that the 
location in `TableMetadata` is identical to the location used to identify the 
table when it is created. For Hadoop tables, it makes sense that the table 
location used for relative paths is the location that gets passed to create 
[`HadoopTableOperations`](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java#L71).
 But then there could be a problem that table metadata actually points to a 
different location unless it is modified.
   2. Metastore tables do not have a location that is external to table 
metadata. However, when a metastore table is tracked by Hive, there is a 
location that Hive tracks and we set that to the table's location.
   
   For table metadata files, there is only an external location for Hadoop 
tables. Otherwise, the table location must come from the table's metadata.json 
file. That makes it difficult to make `metadata_location` and 
`previous_metadata_location` relative paths. I would probably not attempt to 
make these paths relative, but I'm open to ideas.
   
   We could update the spec to allow relative metadata.json paths for Hadoop 
tables, but will need to state how to handle a different `location` in the 
metadata file. (Ignore?)
   
   Metadata locations aside, I think that it is a good plan to make the 
relative paths transparent. Most of the library should continue to operate on 
absolute paths and relative paths should be made while writing and made 
absolute while reading. That makes it so the changes are fairly easily tested. 
If you do this, then I think there is a lot less to cover in the design doc.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to