jackye1995 opened a new issue #1887:
URL: https://github.com/apache/iceberg/issues/1887
This issue is opened to discuss a standardized way to ship new
catalogs, as we plan to ship `GlueCatalog` and `NessieCatalog` in the upcoming
release, and new catalogs such as JDBC are also in progress.
In the last community sync meeting, we discussed 2 ways:
1. for each new catalog (if in a new module), add a new runtime module that
bundles all additional dependencies.
2. directly add the module to existing runtimes, such as
`iceberg-spark3-runtime`, as long as the jar size increase is reasonable.
For approach 1, building a runtime jar for shared use does not seem
ideal. It introduces duplicate classpath issues on the user side, since the
AWS client version used here may differ from the one in the user's
application. Another issue is that we would introduce a ton of runtime
modules to Iceberg as more catalogs are added, which is not desirable.
For approach 2, I did some experiments based on the current `aws` module,
and the result was not good. When added to the spark3 runtime, the jar size
increased from 18.9MB to 34.7MB, almost double. I checked all AWS
dependencies, and even with all non-AWS dependencies excluded, the added size
was still over 10MB. I would imagine the situation to be very similar if
support for GCS and Azure is added in the future.
So the best way to go in open source seems to be not bundling any runtime
jar; instead, a user can start the Spark session by specifying the additional
dependencies in the `--packages` flag:
```
spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,org.apache.iceberg:iceberg-aws:0.11.0,software.amazon.awssdk:bundle:2.15.40 \
    --conf spark.sql.catalog.test=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.test.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
    --conf spark.sql.catalog.test.warehouse=s3://some-bucket
```
By doing so, the user can freely choose the version of the AWS client,
`2.15.40` here for example. Users sensitive to jar size can cherry-pick the
AWS client packages they need themselves, instead of using the 250MB
bundle.
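To illustrate the cherry-picking option, here is a rough sketch that builds the `--packages` value from individual AWS SDK v2 modules instead of the full bundle. The module list (`glue`, `s3`, `kms`, `sts`, `url-connection-client`) is an assumption on my side, not a verified minimal set; which modules are actually required depends on the features in use.

```shell
# Versions taken from the command above; adjust as needed.
ICEBERG_VERSION=0.11.0
AWS_SDK_VERSION=2.15.40

# ASSUMPTION: this module list is illustrative only. Verify which AWS SDK v2
# modules your catalog and file IO configuration really need.
AWS_MODULES="glue s3 kms sts url-connection-client"

# Start with the Iceberg artifacts, then append one coordinate per AWS module.
PACKAGES="org.apache.iceberg:iceberg-spark3-runtime:${ICEBERG_VERSION}"
PACKAGES="${PACKAGES},org.apache.iceberg:iceberg-aws:${ICEBERG_VERSION}"
for module in ${AWS_MODULES}; do
    PACKAGES="${PACKAGES},software.amazon.awssdk:${module}:${AWS_SDK_VERSION}"
done

# Print the comma-separated coordinates for inspection.
echo "${PACKAGES}"
```

The resulting string can then be passed as `spark-sql --packages "${PACKAGES}"` together with the same `--conf` options as in the command above.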
Any thoughts? @rdblue @rymurr @yyanyy @giovannifumarola