rdblue commented on a change in pull request #1618:
URL: https://github.com/apache/iceberg/pull/1618#discussion_r511094857



##########
File path: 
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -108,7 +108,8 @@ protected String tableName() {
   @Override
   public FileIO io() {
     if (fileIO == null) {
-      fileIO = new HadoopFileIO(conf);
+      // avoid refreshing metadata, because refresh() calls io() again
+      fileIO = CatalogUtil.loadFileIO(getCurrentMetadataNoRefresh(), conf);

Review comment:
       One quick thing to note: we want to avoid increasing dependence on a 
Hadoop configuration. It's fine to pass one where it is required, but we should 
always make sure there is an alternative and should generally avoid using 
config from it.
   
   Thanks for writing up the options. Sounds like we have three ways to 
configure a `FileIO`:
   1. Pass `FileIO` in from the catalog and use catalog config to initialize it
   2. Instantiate `FileIO` based on table location or metadata path just before 
using it
   3. Use config from the environment, like a Hadoop FileSystem
   
   I think the best choice is #1, catalog-level configuration.
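   To make #1 concrete, here is a minimal sketch in plain Java. These are simplified, hypothetical stand-ins for Iceberg's `FileIO` and catalog types (the real interfaces live in `org.apache.iceberg.io` and `org.apache.iceberg.catalog`), and the `io-impl` property key is made up for illustration. The point is that the catalog constructs one `FileIO` from its own config and hands the same instance to every table it loads:

```java
import java.util.Map;

// Simplified stand-ins for illustration only; not Iceberg's real interfaces.
interface FileIO {}

class HadoopFileIO implements FileIO {}

class S3FileIO implements FileIO {}

class SketchCatalog {
  private final FileIO io;

  SketchCatalog(Map<String, String> properties) {
    // The FileIO implementation is chosen once, from catalog-level properties,
    // so it is available before any table metadata has been read.
    // ("io-impl" is a hypothetical property key.)
    this.io = "s3".equals(properties.get("io-impl")) ? new S3FileIO() : new HadoopFileIO();
  }

  // Every table loaded through this catalog shares the catalog's FileIO,
  // continuing the chain: application -> catalog -> table.
  FileIO io() {
    return io;
  }
}
```

   This keeps the configuration chain intact and avoids any dependence on table state when constructing the `FileIO`.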
   
   We can't use table-level config because it creates a chicken-and-egg problem: 
the config needed to construct a `FileIO` lives in table metadata that the 
`FileIO` would be used to load. Working around this requires choosing which 
`FileIO` to load based on different information at different times (table 
location for create, metadata location for load), and would also make supplying 
a custom `FileIO` implementation through configuration difficult.
   
   Environment-level config doesn't fit with the current model for customizing 
behavior, where everything is injected through `Catalog` and `Table`. Plus, 
Iceberg has no environment config like Hadoop `Configuration` and I don't think 
it is a good idea to add it. I think the most sensible thing is to maintain the 
chain of configuration: application passes config to catalogs, catalogs pass 
config to tables.
   
   I don't think it would be too difficult to change over to `FileIO` passed in 
from the catalog, but this would mean not basing the implementation on table 
path. We wouldn't know to choose between `S3FileIO` or `HadoopFileIO` for a 
table and would have to use one for all tables from a catalog, or build a 
`FileIO` that knows when to choose between the two. I was already thinking that 
we would need a delegating implementation, since `S3FileIO` can't handle non-S3 
paths. That should be okay.
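   A minimal sketch of what that delegating implementation could look like, again with a simplified, hypothetical single-method `FileIO` rather than the real interface: route each path to an underlying implementation by URI scheme, falling back to a default for everything the S3 delegate can't handle.

```java
import java.net.URI;
import java.util.Map;

// Simplified, hypothetical single-method FileIO for illustration; the real
// interface in org.apache.iceberg.io has more methods.
interface FileIO {
  String open(String path);
}

// Delegating FileIO: picks an underlying implementation per path, by URI
// scheme, so one catalog-level FileIO can serve both S3 and non-S3 paths.
class DelegatingFileIO implements FileIO {
  private final Map<String, FileIO> byScheme;
  private final FileIO fallback;

  DelegatingFileIO(Map<String, FileIO> byScheme, FileIO fallback) {
    this.byScheme = byScheme;
    this.fallback = fallback;
  }

  private FileIO delegate(String path) {
    String scheme = URI.create(path).getScheme();
    if (scheme == null) {
      return fallback; // scheme-less local paths go to the fallback
    }
    return byScheme.getOrDefault(scheme, fallback);
  }

  @Override
  public String open(String path) {
    return delegate(path).open(path);
  }
}
```

   In a real version the delegates would be `S3FileIO` and `HadoopFileIO` instances built from catalog config, with `s3`/`s3a` schemes mapped to the S3 delegate and everything else falling through to Hadoop.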




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
