jackye1995 opened a new issue #1908:
URL: https://github.com/apache/iceberg/issues/1908


   In Iceberg, `FileIO` is part of the `Table` and `TableOperations` 
interfaces and is used for both data and metadata. This works fine when people 
store both data and metadata using the same IO. However, if people want 
customized FileIO behavior for data that depends on metadata information, such 
as table properties, it creates a circular dependency. For example, in 
`TableOperations`:
   1. user calls `TableOperations.io()` to get a FileIO
   2. that calls `TableOperations.current()` to get table properties
   3. that calls `TableOperations.refresh()` to get latest metadata
   4. that calls `TableOperations.io()` to get a FileIO to read the metadata 
file
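
   The four steps above can be sketched as follows. This is a minimal 
illustration, not Iceberg code: `SimpleFileIO`, `PropertyDrivenOps`, the 
`encryption.key` property, and the metadata path are hypothetical stand-ins, 
and `current()` is simplified to refresh only when nothing has been loaded yet. 
A re-entrance guard is added so the cycle surfaces as an exception rather than 
unbounded recursion.

```java
import java.util.Map;

// Hypothetical, simplified stand-in for Iceberg's FileIO.
interface SimpleFileIO {
  String read(String path);
}

// Hypothetical, simplified stand-in for a TableOperations implementation
// whose FileIO depends on table properties.
final class PropertyDrivenOps {
  private Map<String, String> properties; // null until metadata has been loaded
  private boolean resolvingIo = false;

  // Step 1: io() wants table properties to decide how to build the FileIO.
  SimpleFileIO io() {
    if (resolvingIo) {
      // Re-entered from refresh(): without this guard the calls would recurse.
      throw new IllegalStateException(
          "circular dependency: io() -> current() -> refresh() -> io()");
    }
    resolvingIo = true;
    try {
      String key = current().get("encryption.key"); // assumed property name
      return path -> "read " + path + " with key " + key;
    } finally {
      resolvingIo = false;
    }
  }

  // Step 2: current() needs metadata, so it triggers a refresh.
  Map<String, String> current() {
    if (properties == null) {
      refresh();
    }
    return properties;
  }

  // Steps 3 and 4: refresh() needs a FileIO to read the metadata file.
  void refresh() {
    String content = io().read("metadata/v1.metadata.json"); // assumed path
    properties = Map.of("encryption.key", content);
  }
}
```

   Calling `new PropertyDrivenOps().io()` on a table whose metadata has not 
been loaded yet hits the guard, because `refresh()` asks for the same FileIO 
that is still being constructed.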
   
   When implementing the dynamic loading of FileIO (#1618), there was some 
discussion around this, and we decided to load FileIO through catalog 
properties and use the same FileIO for all table operations as the default 
behavior. Although users can have a customized FileIO for different tables if 
they want, metadata access and data access are still not decoupled. So far, I 
have heard multiple customer use cases around this, for example:
   
   1. use different encryption mechanisms for metadata and data, with the 
encryption key stored as a table property
   2. check permissions for read and write access based on an access control 
list stored in table properties
   
   Typically, users now let FileIO internally check the file path to determine 
the right mechanism for reading the file, such as checking whether the keyword 
`metadata` is in the path to decide if it is a metadata file. There might also 
be a layer of delegation added to pass calls to multiple storage-specific 
FileIOs. But I would consider this a hack, because it is basically reverse 
engineering who is calling FileIO.
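
   To make the workaround concrete, here is a minimal sketch of such a 
path-sniffing delegate. `PathFileIO` and `PathSniffingFileIO` are hypothetical 
names, and the single `describeRead` method stands in for the real read and 
write calls; none of this is Iceberg's actual API.

```java
// Hypothetical, simplified stand-in for a FileIO-like interface.
interface PathFileIO {
  String describeRead(String path);
}

// Delegating IO that guesses the caller's intent from the file path.
final class PathSniffingFileIO implements PathFileIO {
  private final PathFileIO metadataIO;
  private final PathFileIO dataIO;

  PathSniffingFileIO(PathFileIO metadataIO, PathFileIO dataIO) {
    this.metadataIO = metadataIO;
    this.dataIO = dataIO;
  }

  @Override
  public String describeRead(String path) {
    // The hack: treat any path containing "metadata" as a metadata file.
    if (path.contains("metadata")) {
      return metadataIO.describeRead(path);
    }
    return dataIO.describeRead(path);
  }
}
```

   The fragility is easy to see: a data file stored under a location such as 
`warehouse/metadata_exports/data/file.parquet` (a made-up path) would be routed 
to the metadata IO, because the delegate only sees the path and not who is 
calling it.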
   
   I imagine a few different potential approaches (I have not thought through 
the details):
   1. use a different read and write mechanism for table properties, so that 
this circular dependency no longer exists.
   2. a new method `default FileIO metaIO() { return io(); }` could potentially 
be added and used for all metadata operations instead of `io()`, because the 
calling code always knows whether it is writing data or metadata.
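
   A rough sketch of what approach 2 might look like, assuming simplified 
stand-in types (`SketchFileIO`, `SketchTableOperations`, and both 
implementations are hypothetical, not the real Iceberg interfaces). The default 
method keeps existing implementations working unchanged, while an 
implementation that needs the split can override it.

```java
// Hypothetical, simplified stand-in for FileIO.
interface SketchFileIO {
  String name();
}

// Hypothetical, simplified stand-in for TableOperations.
interface SketchTableOperations {
  SketchFileIO io();

  // Proposed addition: metadata reads and writes would go through metaIO().
  // By default it falls back to io(), so existing implementations are unaffected.
  default SketchFileIO metaIO() {
    return io();
  }
}

// Existing-style implementation: one IO for everything.
final class SingleIoOps implements SketchTableOperations {
  private final SketchFileIO io = () -> "shared";

  @Override
  public SketchFileIO io() {
    return io;
  }
}

// Implementation that decouples the two: metaIO() does not depend on table
// properties, so io() is free to consult metadata loaded through metaIO().
final class SplitIoOps implements SketchTableOperations {
  private final SketchFileIO dataIO = () -> "data";
  private final SketchFileIO metadataIO = () -> "metadata";

  @Override
  public SketchFileIO io() {
    return dataIO;
  }

  @Override
  public SketchFileIO metaIO() {
    return metadataIO;
  }
}
```

   With this shape, `refresh()` would call `metaIO()` rather than `io()`, which 
removes the cycle described above as long as `metaIO()` itself does not need 
table properties.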
   
   Has anyone thought about this problem before? Is this a situation that we 
think Iceberg should handle by design? Any suggestions would be appreciated, 
thanks!




