jackye1995 commented on a change in pull request #1618:
URL: https://github.com/apache/iceberg/pull/1618#discussion_r510543328



##########
File path: 
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -108,7 +108,8 @@ protected String tableName() {
   @Override
   public FileIO io() {
     if (fileIO == null) {
-      fileIO = new HadoopFileIO(conf);
+      // avoid refresh metadata because refresh() calls io() again
+      fileIO = CatalogUtil.loadFileIO(getCurrentMetadataNoRefresh(), conf);

Review comment:
   Because the Iceberg catalog does not enforce the implementation details of how table properties are stored, an implementation can potentially write the table properties using `io()` itself. Therefore, if we try to determine which `io()` to use based on a table property, there is always a chance of a cyclic dependency in the logic. `HadoopCatalog` has exactly this problem.
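
   Here is a minimal sketch of the cycle, using hypothetical helper names (`loadFileIO`, `readProperties`, and the `io-impl` property are made up) rather than the real `HiveTableOperations` code:

```java
import java.util.Map;
import org.apache.iceberg.io.FileIO;

// Hypothetical illustration of the cyclic dependency: io() needs a table
// property, reading the property needs refresh(), and refresh() needs io()
// to open the metadata file.
class CyclicIoSketch {
  private FileIO fileIO;

  public FileIO io() {
    if (fileIO == null) {
      Map<String, String> props = refresh();     // (a) needs current metadata...
      fileIO = loadFileIO(props.get("io-impl")); // "io-impl" is a made-up property
    }
    return fileIO;
  }

  private Map<String, String> refresh() {
    FileIO io = io(); // (b) ...which needs io() again while fileIO is still
                      //     null, so this recurses until StackOverflowError
    return readProperties(io);
  }

  private Map<String, String> readProperties(FileIO io) {
    throw new UnsupportedOperationException("sketch only");
  }

  private FileIO loadFileIO(String impl) {
    throw new UnsupportedOperationException("sketch only");
  }
}
```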
   
   Here are 3 potential ways to solve this problem:
   
   1. Create a new constructor for each `TableOperations` implementation that accepts a `FileIO` as an argument. If that constructor is used, the `FileIO` is set at construction time instead of on the first call to `io()`. A default implementation can be loaded based on namespace properties. Engines like Spark and Flink can read a configuration key from the Hadoop config to load the `FileIO` outside the core logic. This requires code changes in multiple places, including (1) adding new constructors to the existing table operations, and (2) adding support in each engine separately. (A sketch of this approach follows the list.)
   
   2. Determine the `FileIO` based on the warehouse path scheme. For example, `s3://` always uses `S3FileIO` and `hdfs://` always uses `HadoopFileIO`. However, it is easy to create a circular module dependency this way: the `iceberg-aws` module depends on `iceberg-core`, so `HadoopCatalog` in `iceberg-core` cannot import `S3FileIO` from `iceberg-aws`. This means we need a registry that allows people to register custom IO mappings at runtime. This approach requires a similar amount of work to approach 1, because we need code changes in each existing catalog, and each engine needs a way to register its `FileIO`-to-scheme mappings at JVM start time. (A sketch of such a registry follows the list.)
   
   3. Load the `FileIO` using the Hadoop configuration. Because `HadoopFileIO` is always the default implementation, a Hadoop configuration object is always passed in, so users can simply define a custom implementation in `core-site.xml`. This is a simple solution with no dependency concerns. However, it is not an elegant option, because ideally we would like to load all Iceberg classes using Iceberg-based configs such as table properties. (A sketch of this approach also follows the list.)
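
   For approach 1, constructor injection could look roughly like the sketch below. The class name and the two-argument constructor are hypothetical illustrations, not the actual `HiveTableOperations` API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.hadoop.HadoopFileIO;
import org.apache.iceberg.io.FileIO;

// Hypothetical operations class showing FileIO injection at construction time.
class InjectedIoTableOperations {
  private final Configuration conf;
  private FileIO fileIO;

  // Existing pattern: FileIO is resolved lazily on the first io() call.
  InjectedIoTableOperations(Configuration conf) {
    this(conf, null);
  }

  // New pattern: the catalog or engine supplies the FileIO up front, so
  // io() never has to read table metadata and the cycle cannot occur.
  InjectedIoTableOperations(Configuration conf, FileIO io) {
    this.conf = conf;
    this.fileIO = io;
  }

  public FileIO io() {
    if (fileIO == null) {
      fileIO = new HadoopFileIO(conf); // fall back to the default
    }
    return fileIO;
  }
}
```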
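   For approach 2, a runtime scheme registry could look roughly like this. `FileIORegistry` and all of its methods are invented for illustration; nothing like it exists in `iceberg-core` today:

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;
import org.apache.iceberg.io.FileIO;

// Hypothetical registry mapping URI schemes to FileIO factories. Modules
// like iceberg-aws would register themselves at JVM start, so iceberg-core
// never needs a compile-time dependency on them.
final class FileIORegistry {
  private static final Map<String, Supplier<FileIO>> FACTORIES = new ConcurrentHashMap<>();

  private FileIORegistry() {
  }

  // e.g. FileIORegistry.register("s3", S3FileIO::new) from engine startup code
  static void register(String scheme, Supplier<FileIO> factory) {
    FACTORIES.put(scheme, factory);
  }

  static FileIO forLocation(String location) {
    String scheme = URI.create(location).getScheme();
    Supplier<FileIO> factory = FACTORIES.get(scheme);
    if (factory == null) {
      throw new IllegalArgumentException("No FileIO registered for scheme: " + scheme);
    }
    return factory.get();
  }
}
```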
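   For approach 3, resolution from the Hadoop configuration could be as simple as the sketch below. The config key `iceberg.io.impl` is made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.hadoop.HadoopFileIO;
import org.apache.iceberg.io.FileIO;

// Hypothetical resolution of the FileIO implementation from the Hadoop config.
// Users would set the (made-up) key "iceberg.io.impl" in core-site.xml.
final class HadoopConfFileIOLoader {
  private HadoopConfFileIOLoader() {
  }

  static FileIO load(Configuration conf) {
    String impl = conf.get("iceberg.io.impl"); // hypothetical config key
    if (impl == null) {
      return new HadoopFileIO(conf); // keep HadoopFileIO as the default
    }
    try {
      return (FileIO) Class.forName(impl).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot instantiate FileIO: " + impl, e);
    }
  }
}
```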



