jackye1995 commented on issue #3044:
URL: https://github.com/apache/iceberg/issues/3044#issuecomment-910015259


   
   > At present, I think adding to the known catalog-types might be the best 
path forward to resolve the issue more immediately. 
   
   That is actually something we did not want to do, which is why the aws module is not a part of the flink module's dependencies. Adding that dependency is probably fine for AWS, but as more catalog implementations are added, it becomes unmanageable to carry that many dependencies.
   
   > I believe that one potential root issue is that FileIO has leaked out from TableOperations into catalog implementations like GlueCatalog. 
   
   This was also discussed when implementing the catalog. Having a default FileIO definition at the catalog level is a feature; that's why `CatalogProperties.FILE_IO_IMPL` was created. Initializing the default FileIO in the catalog allows reusing the same FileIO instance across tables instead of creating many different ones. 
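
For context, this is roughly how that catalog-level default is configured (a minimal sketch; the property keys and class names are the real ones, the warehouse location is just a placeholder):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.aws.glue.GlueCatalog;

public class GlueCatalogFileIoExample {
  public static void main(String[] args) {
    // Catalog-level default FileIO: the catalog instantiates it once and reuses
    // it for every table it loads, instead of each table creating its own.
    Map<String, String> properties = new HashMap<>();
    properties.put(CatalogProperties.FILE_IO_IMPL, "org.apache.iceberg.aws.s3.S3FileIO");
    properties.put(CatalogProperties.WAREHOUSE_LOCATION, "s3://my-bucket/warehouse");

    GlueCatalog catalog = new GlueCatalog();
    catalog.initialize("glue", properties);
  }
}
```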
   
   I don't think the problem is solved even if you hide the `FileIO` creation in `TableOperations`: the loading path also checks for the `Configurable` interface, so moving the creation does not make much of a difference.
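
To illustrate why, this is roughly the shape of that loading pattern (an illustrative sketch, not the actual Iceberg code; the `loadFileIO` helper here is hypothetical):

```java
import java.util.Map;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.io.FileIO;

public class FileIoLoadingSketch {
  // Sketch: dynamically load a FileIO implementation and wire in a Hadoop
  // Configuration when the implementation asks for one.
  static FileIO loadFileIO(String impl, Map<String, String> properties, Configuration conf)
      throws ReflectiveOperationException {
    FileIO fileIO = (FileIO) Class.forName(impl).getDeclaredConstructor().newInstance();

    // This instanceof check is what ties the loading path to Hadoop, regardless
    // of whether it lives in the catalog or in TableOperations.
    if (fileIO instanceof Configurable) {
      ((Configurable) fileIO).setConf(conf);
    }

    fileIO.initialize(properties);
    return fileIO;
  }
}
```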
   
   > Additional updates to the FlinkCatalogFactory are still needed on top of 
these changes in order to fully remove the hadoop dependency
   
   Yes, you are right that we can fully remove the dependency in `GlueCatalog`, but the issue is more on the engine side, which basically requires that dependency. The Flink catalog entry point `CatalogFactory.createCatalog(String name, Map<String, String> properties)` directly calls `createCatalog(name, properties, clusterHadoopConf())`, which initializes the Hadoop configuration, and the serialized catalog loader `CustomCatalogLoader` has a `SerializableConfiguration` field, so you are guaranteed to get a serialization exception in Flink if the Hadoop configuration classes are not available. This looks like a deeper issue than just a fix on the catalog side.
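
To make the coupling concrete, here is a simplified sketch of those two touch points (paraphrasing the call paths described above, not the actual source):

```java
import java.io.Serializable;
import java.util.Map;
import org.apache.flink.table.catalog.Catalog;
import org.apache.hadoop.conf.Configuration;

// Simplified sketch (not the actual Iceberg source) of the two places where the
// Flink integration currently reaches for Hadoop.
public class FlinkHadoopCouplingSketch {

  // 1. The factory entry point without a Hadoop argument still loads the cluster
  //    Hadoop configuration before delegating, so the Configuration class must be
  //    on the classpath even for a Hadoop-free catalog like GlueCatalog.
  Catalog createCatalog(String name, Map<String, String> properties) {
    return createCatalog(name, properties, clusterHadoopConf());
  }

  Catalog createCatalog(String name, Map<String, String> properties, Configuration hadoopConf) {
    // ... builds the catalog loader and Flink catalog using the Hadoop conf ...
    throw new UnsupportedOperationException("sketch only");
  }

  Configuration clusterHadoopConf() {
    // In the real factory this is resolved from the Flink cluster configuration.
    return new Configuration();
  }

  // 2. The serialized catalog loader carries a Hadoop Configuration wrapper, so
  //    shipping it to Flink tasks fails when the Hadoop classes are missing.
  static class CustomCatalogLoaderSketch implements Serializable {
    // Stand-in for the SerializableConfiguration wrapper; the real class
    // serializes the wrapped Configuration manually.
    private transient Configuration hadoopConf;

    CustomCatalogLoaderSketch(Configuration hadoopConf) {
      this.hadoopConf = hadoopConf;
    }
  }
}
```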
   
   I think we should first tackle this on the engine side, and then see what the best way forward is for catalog implementations. This seems like a valid ask for improving the Flink catalog factory.
   
   Meanwhile, although it is a bit hacky, why not just add two empty classes, `Configuration` and `Configurable`, to your classpath? That removes the need for the entire Hadoop jar.
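
For reference, a minimal sketch of what those stubs could look like, assuming the code paths you hit only need the types to load and the `Configurable` methods to link (worth verifying against your actual call sites):

```java
// File: org/apache/hadoop/conf/Configuration.java
package org.apache.hadoop.conf;

// Empty stand-in so classes that reference org.apache.hadoop.conf.Configuration
// can load without hadoop-common on the classpath.
public class Configuration {
}
```

```java
// File: org/apache/hadoop/conf/Configurable.java
package org.apache.hadoop.conf;

// Stand-in that mirrors the real interface's two methods, so instanceof checks
// and setConf/getConf calls still link.
public interface Configurable {
  void setConf(Configuration conf);

  Configuration getConf();
}
```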
   

