rdblue commented on a change in pull request #1618:
URL: https://github.com/apache/iceberg/pull/1618#discussion_r511094857
##########
File path:
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -108,7 +108,8 @@ protected String tableName() {
@Override
public FileIO io() {
if (fileIO == null) {
- fileIO = new HadoopFileIO(conf);
+ // avoid refresh metadata because refresh() calls io() again
+ fileIO = CatalogUtil.loadFileIO(getCurrentMetadataNoRefresh(), conf);
Review comment:
One quick thing to note: we want to avoid increasing dependence on a
Hadoop configuration. It's fine to pass one where it is required, but we should
always make sure there is an alternative and should generally avoid keeping
properties there.
Thanks for writing up the options. Sounds like we have options for
configuring a `FileIO`:
1. Pass `FileIO` in from the catalog and use catalog config to initialize it
2. Instantiate `FileIO` based on table location or metadata path just before
using it
3. Use config from the environment, like a Hadoop FileSystem
I think the best choice is #1, catalog-level configuration.
We can't use table-level config because that creates a situation where it
isn't available to load a table. Working around this requires deciding which
FileIO to load based on different information at different times (location for
create, metadata location for load), and would also make supplying a custom
`FileIO` implementation in configuration difficult.
Environment-level config doesn't fit with the current model for customizing
behavior, where everything is injected through `Catalog` and `Table`. Plus,
Iceberg has no environment config like Hadoop `Configuration` and I don't think
it is a good idea to add it. I think the most sensible thing is to maintain the
chain of configuration: application passes config to catalogs, catalogs pass
config to tables.
I don't think it would be too difficult to change over to `FileIO` passed in
from the catalog, but this would mean not basing the implementation on table
path. We wouldn't know to choose between `S3FileIO` or `HadoopFileIO` for a
table and would have to use one for all tables from a catalog, or build a
`FileIO` that knows when to choose between the two. I was already thinking that
we would need a delegating implementation, since `S3FileIO` can't handle non-S3
paths. That should be okay.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]