jackye1995 commented on a change in pull request #1618:
URL: https://github.com/apache/iceberg/pull/1618#discussion_r510543328
##########
File path:
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -108,7 +108,8 @@ protected String tableName() {
@Override
public FileIO io() {
if (fileIO == null) {
- fileIO = new HadoopFileIO(conf);
+ // avoid refresh metadata because refresh() calls io() again
+ fileIO = CatalogUtil.loadFileIO(getCurrentMetadataNoRefresh(), conf);
Review comment:
Because the Iceberg catalog does not enforce how table properties are stored, an
implementation can potentially write the table properties using `io()` itself.
Therefore, if we try to determine which `io()` to use based on a table property,
there is always a chance of a cyclic dependency in the logic; `HadoopCatalog`
has exactly this problem. A minimal sketch of the cycle follows.
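
To make the cycle concrete, here is a minimal, self-contained sketch (all names
are stand-ins rather than Iceberg's real classes; the diff above avoids the
recursion by reading already-cached metadata through `getCurrentMetadataNoRefresh()`):

```java
// Stand-alone illustration of the cycle: io() wants refreshed metadata to
// pick a FileIO, but refresh() needs io() to read the metadata file.
public class CyclicIODemo {
  private Object fileIO;  // stand-in for a FileIO instance

  Object io() {
    if (fileIO == null) {
      refresh();              // refresh() calls io() again ...
      fileIO = new Object();  // ... so this line is never reached
    }
    return fileIO;
  }

  void refresh() {
    io();  // reading the metadata file requires a FileIO
  }

  public static void main(String[] args) {
    new CyclicIODemo().io();  // throws StackOverflowError
  }
}
```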
Here are 3 potential ways to solve this problem:
1. Create a new constructor for each `TableOperations` implementation that
accepts a `FileIO` as an argument. If that constructor is used, the `FileIO` is
set at construction time instead of the first time `io()` is called (see the
option-1 sketch after this list). A default implementation can be loaded based
on namespace properties, and engines like Spark and Flink can read a key from
the Hadoop config to load the `FileIO` outside the core logic. This requires
code changes in multiple places: (1) new constructors in the existing table
operations, and (2) separate support in each engine.
2. Determine the `FileIO` based on the warehouse path scheme: for example,
`s3://` always uses `S3FileIO` and `hdfs://` always uses `HadoopFileIO`.
However, this easily creates a cyclic module dependency: the `iceberg-aws`
module depends on `iceberg-core`, so `HadoopCatalog` in `iceberg-core` cannot
import `S3FileIO` from `iceberg-aws`. This means we need a registry that allows
people to register custom IO mappings at runtime (see the option-2 sketch after
this list). This approach requires a similar amount of work to approach 1,
because we need code changes in each existing catalog, and each engine needs a
way to register its `FileIO`-to-scheme mapping at JVM start time.
3. Load the `FileIO` using Hadoop configuration. Because `HadoopFileIO` is
always the default implementation, a Hadoop configuration object is always
passed in, so users can simply define a custom implementation in
`core-site.xml` (see the option-3 sketch after this list). This is a simple
solution with no dependency concern. However, it is not an elegant option,
because ideally we would like to load all Iceberg classes using Iceberg-based
configs such as table properties.
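
Option-1 sketch, assuming constructor injection (all types below are
stand-ins, not the current API):

```java
// Self-contained sketch for option 1: the catalog picks the FileIO and
// injects it through the constructor, so io() never consults table metadata
// and the cycle cannot occur.
public class Option1Demo {
  interface FileIO { }
  static class HadoopFileIO implements FileIO { }
  static class S3FileIO implements FileIO { }

  static class InjectedTableOperations {
    private final FileIO fileIO;

    // Chosen by the catalog, e.g. from namespace properties or an
    // engine-provided Hadoop config key; null means "use the default".
    InjectedTableOperations(FileIO io) {
      this.fileIO = io != null ? io : new HadoopFileIO();
    }

    FileIO io() {
      return fileIO;  // fixed at construction time
    }
  }

  public static void main(String[] args) {
    InjectedTableOperations ops = new InjectedTableOperations(new S3FileIO());
    System.out.println(ops.io().getClass().getSimpleName());  // prints S3FileIO
  }
}
```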
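Option-2 sketch, assuming a hypothetical runtime registry (strings stand in
for `FileIO` instances):

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Self-contained sketch for option 2: implementations are registered by
// scheme at JVM start, so iceberg-core needs no compile-time dependency on
// modules like iceberg-aws.
public class Option2Demo {
  private static final Map<String, Supplier<String>> BY_SCHEME = new ConcurrentHashMap<>();

  static void register(String scheme, Supplier<String> factory) {
    BY_SCHEME.put(scheme, factory);
  }

  // Pick an implementation from the warehouse path scheme, else fall back.
  static String forLocation(String location, Supplier<String> fallback) {
    String scheme = URI.create(location).getScheme();
    Supplier<String> factory = scheme == null ? null : BY_SCHEME.get(scheme);
    return (factory != null ? factory : fallback).get();
  }

  public static void main(String[] args) {
    register("s3", () -> "S3FileIO");  // an engine would do this at startup
    System.out.println(forLocation("s3://bucket/warehouse", () -> "HadoopFileIO")); // S3FileIO
    System.out.println(forLocation("hdfs://nn/warehouse", () -> "HadoopFileIO"));   // HadoopFileIO
  }
}
```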
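Option-3 sketch, assuming a hypothetical `iceberg.io.impl` Hadoop config key
and a no-arg constructor on the implementation class:

```java
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.io.FileIO;

// Sketch for option 3: resolve the FileIO class name from the Hadoop
// configuration (e.g. set in core-site.xml), defaulting to HadoopFileIO.
public class HadoopConfFileIOLoader {
  public static FileIO load(Configuration conf) throws ReflectiveOperationException {
    String impl = conf.get("iceberg.io.impl", "org.apache.iceberg.hadoop.HadoopFileIO");
    FileIO io = (FileIO) Class.forName(impl).getDeclaredConstructor().newInstance();
    if (io instanceof Configurable) {
      ((Configurable) io).setConf(conf);  // hand over the Hadoop conf if supported
    }
    return io;
  }
}
```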
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]