jackye1995 opened a new issue #1887:
URL: https://github.com/apache/iceberg/issues/1887


   This issue is opened to discuss a standardized way to ship new 
catalogs, as we plan to ship `GlueCatalog` and `NessieCatalog` in the upcoming 
release, and new catalogs such as JDBC are also in progress.
   
   In the last community sync meeting, we discussed 2 ways:
   1. for each new catalog (if in a new module), add a new runtime module that 
bundles all additional dependencies.
   2. directly add the module in existing runtimes, such as 
`iceberg-spark3-runtime` as long as the jar size increase is reasonable.
   
   For approach 1, compiling a runtime jar for shared usage does not seem 
ideal. It can cause duplicate-class conflicts on the user's classpath, since the 
AWS client version bundled here may differ from the one in the user's 
application. Another issue with this approach is that we would add a large 
number of runtime modules to Iceberg as more catalogs are added, which is not 
desirable.
   
   For approach 2, I did some experiments based on the current `aws` module, 
and the result was not good. When added to the spark3 runtime, the jar size 
increased from 18.9MB to 34.7MB, almost doubled. I checked all AWS 
dependencies, and even with all non-aws dependencies excluded, the added size 
was still over 10MB. I would expect a very similar situation if support 
for GCS and Azure is added in the future.
   
   So the best way to go in open source seems to be not to bundle these 
dependencies in any runtime jar. Instead, a user can start the Spark session by 
specifying the additional dependencies in the `--packages` flag:
   
   ```
   spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,org.apache.iceberg:iceberg-aws:0.11.0,software.amazon.awssdk:bundle:2.15.40 \
       --conf spark.sql.catalog.test=org.apache.iceberg.spark.SparkCatalog \
       --conf spark.sql.catalog.test.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
       --conf spark.sql.catalog.test.warehouse=s3://some-bucket
   ```
   By doing so, the user can freely choose the version of the AWS client 
(`2.15.40` in this example). Users who are sensitive to jar size can cherry-pick 
the AWS client packages they need instead of using the 250MB 
bundle.
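
As a rough sketch of the cherry-picking option, the `--packages` value could be 
assembled from individual AWS SDK artifacts instead of the full bundle. The 
module list below (`glue`, `s3`, `sts`, `url-connection-client`) is an assumption 
for illustration only; which artifacts are really required depends on the 
features in use and has not been verified here.

```shell
# Versions taken from the example command above; adjust as needed.
ICEBERG_VERSION=0.11.0
AWS_SDK_VERSION=2.15.40

# Assumed module selection (not confirmed in this issue).
AWS_MODULES="glue s3 sts url-connection-client"

# Start with the Iceberg artifacts, then append one coordinate per AWS module.
PACKAGES="org.apache.iceberg:iceberg-spark3-runtime:${ICEBERG_VERSION}"
PACKAGES="${PACKAGES},org.apache.iceberg:iceberg-aws:${ICEBERG_VERSION}"
for module in $AWS_MODULES; do
    PACKAGES="${PACKAGES},software.amazon.awssdk:${module}:${AWS_SDK_VERSION}"
done

# Print the coordinate list; pass it to spark-sql as: --packages "$PACKAGES"
echo "$PACKAGES"
```

The resulting string would replace the `--packages` value in the command above, 
with the `--conf` options left unchanged.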
   
   Any thoughts? @rdblue @rymurr @yyanyy @giovannifumarola

