jackye1995 opened a new issue #1908:
URL: https://github.com/apache/iceberg/issues/1908
In Iceberg, `FileIO` is part of the `Table` and `TableOperations`
interfaces and is used for both data and metadata. This works fine when people
store both data and metadata using the same IO. However, if people want
customized FileIO behavior for data based on metadata information, it creates a
circular dependency. For example, in `TableOperations`:
1. user calls `TableOperations.io()` to get a FileIO
2. that calls `TableOperations.current()` to get table properties
3. that calls `TableOperations.refresh()` to get latest metadata
4. that calls `TableOperations.io()` to get a FileIO to read the metadata
file
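To make the cycle concrete, here is a minimal sketch. It does not use the real Iceberg classes: `DemoFileIO`, `DemoMetadata`, and `CircularTableOperations` are hypothetical stand-ins, and the depth guard exists only so the sketch fails with a readable message instead of a stack overflow.

```java
import java.util.Map;

// Hypothetical stand-in for FileIO; not the real Iceberg interface.
interface DemoFileIO {
  String read(String path);
}

// Hypothetical stand-in for table metadata that carries table properties.
class DemoMetadata {
  final Map<String, String> properties;

  DemoMetadata(Map<String, String> properties) {
    this.properties = properties;
  }
}

class CircularTableOperations {
  private DemoMetadata current;
  private int depth = 0; // guard added for the sketch only

  // Steps 1 and 2: building the FileIO needs table properties.
  DemoFileIO io() {
    Map<String, String> props = current().properties;
    return path -> "contents of " + path + " (configured from " + props.size() + " properties)";
  }

  // Step 3: no metadata is loaded yet, so a refresh is needed.
  DemoMetadata current() {
    if (current == null) {
      refresh();
    }
    return current;
  }

  // Step 4: reading the metadata file needs a FileIO, which closes the loop.
  void refresh() {
    if (++depth > 3) {
      throw new IllegalStateException("circular dependency: io() -> current() -> refresh() -> io()");
    }
    io().read("metadata/v1.metadata.json");
    current = new DemoMetadata(Map.of());
  }
}
```

Calling `new CircularTableOperations().io()` never produces a FileIO; it fails once the guard trips.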
When implementing the dynamic loading of FileIO (#1618), there was some
discussion around this, and we decided to load FileIO through catalog
properties and use the same FileIO for all table operations by default.
Although users can have a customized FileIO for different tables if
they want, the metadata and data aspects are still not decoupled. So far, I
have heard multiple customer use cases around this, for example:
1. use a different encryption mechanism for metadata and data, with the
encryption key stored as a table property
2. check permissions for read and write access based on an access control
list stored in table properties
Typically, users now let FileIO internally check the file path to determine
the right mechanism to read the data, such as checking whether the keyword
`metadata` is in the path to know if it is metadata. There might also be
a layer of delegation added to pass calls to multiple storage-specific
FileIOs. But I would consider this a hack because it is basically
reverse engineering who is calling FileIO.
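The workaround described above might look roughly like the following sketch. `PathIO` and `PathSniffingIO` are hypothetical names used for illustration and are not part of Iceberg; the delegates simply tag the path so the routing decision is visible.

```java
// Hypothetical stand-in for FileIO; not the real Iceberg interface.
interface PathIO {
  String open(String path);
}

// Delegating IO that guesses the caller's intent from the file path.
class PathSniffingIO implements PathIO {
  private final PathIO metadataIO;
  private final PathIO dataIO;

  PathSniffingIO(PathIO metadataIO, PathIO dataIO) {
    this.metadataIO = metadataIO;
    this.dataIO = dataIO;
  }

  @Override
  public String open(String path) {
    // The "hack": any path containing the keyword is treated as metadata.
    PathIO delegate = path.contains("metadata") ? metadataIO : dataIO;
    return delegate.open(path);
  }
}
```

The check is fragile: a data file stored under a location such as `s3://bucket/metadata_exports/data/f.parquet` (a made-up path) would be routed to the metadata delegate, because the IO only sees the path and not who is calling it.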
I imagine a few different potential approaches (I have not thought through the
details):
1. use a different read and write mechanism for table properties, so that
this circular dependency no longer exists.
2. add a new method `default FileIO metaIO() { return io(); }` and use it
for all metadata operations instead of `io()`, because the calling code
always knows whether it is writing data or metadata.
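A minimal sketch of the second approach, assuming simplified stand-in types: `MiniFileIO`, `MiniTableOperations`, and `SplitIOOperations` are hypothetical and only mirror the shape of the proposed `metaIO()` default method.

```java
// Hypothetical stand-in for FileIO; not the real Iceberg interface.
interface MiniFileIO {
  String name();
}

// Hypothetical stand-in for TableOperations with the proposed default method.
interface MiniTableOperations {
  MiniFileIO io();

  // Existing implementations keep working: metadata uses the same IO as data.
  default MiniFileIO metaIO() {
    return io();
  }
}

// An implementation that opts in to separate IOs for data and metadata.
class SplitIOOperations implements MiniTableOperations {
  private final MiniFileIO dataIO;
  private final MiniFileIO metadataIO;

  SplitIOOperations(MiniFileIO dataIO, MiniFileIO metadataIO) {
    this.dataIO = dataIO;
    this.metadataIO = metadataIO;
  }

  @Override
  public MiniFileIO io() {
    return dataIO;
  }

  @Override
  public MiniFileIO metaIO() {
    return metadataIO;
  }
}
```

With this shape, metadata reads and writes would go through `metaIO()`, so `io()` could be configured from table properties without needing itself to load them.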
Has anyone thought about this problem before? Is this a situation that we
think Iceberg should handle by design? Any suggestion would be appreciated,
thanks!