kevinjqliu commented on code in PR #15124:
URL: https://github.com/apache/iceberg/pull/15124#discussion_r2813452953
########## docker/iceberg-flink-quickstart/Dockerfile: ##########
@@ -0,0 +1,57 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+# Version arguments - can be overridden at build time
+ARG FLINK_VERSION=2.0
+
+FROM apache/flink:${FLINK_VERSION}-java21
+
+SHELL ["/bin/bash", "-c"]
+
+# Redeclare ARG variables after FROM to make them available in subsequent layers
+ARG ICEBERG_FLINK_RUNTIME_VERSION=2.0
+ARG ICEBERG_VERSION=1.10.1
+ARG ICEBERG_AWS_BUNDLE_VERSION=1.10.1
+ARG HADOOP_VERSION=3.4.2
+
+# Switch to flink user for installation
+USER flink
+
+WORKDIR /opt/flink
+
+# Install Iceberg Flink runtime and AWS bundle JARs
+RUN echo "-> Install JARs: Dependencies for Iceberg" && \
+    mkdir -p ./lib/iceberg && pushd $_ && \
+    curl -fO https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-flink-runtime-${ICEBERG_FLINK_RUNTIME_VERSION}/${ICEBERG_VERSION}/iceberg-flink-runtime-${ICEBERG_FLINK_RUNTIME_VERSION}-${ICEBERG_VERSION}.jar && \
+    curl -fO https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-aws-bundle/${ICEBERG_AWS_BUNDLE_VERSION}/iceberg-aws-bundle-${ICEBERG_AWS_BUNDLE_VERSION}.jar && \
+    popd
+
+# Install Hadoop dependencies required for filesystem operations
+RUN echo "-> Install JARs: Hadoop" && \
+    mkdir -p ./lib/hadoop && pushd $_ && \
+    curl -fO https://repo.maven.apache.org/maven2/org/apache/commons/commons-configuration2/2.1.1/commons-configuration2-2.1.1.jar && \

Review Comment:
Here's the summary from Claude:

Hadoop JAR Replacement Analysis

The 9 individual Hadoop-related JARs in the Dockerfile can be replaced by just 2 JARs:

- `hadoop-client-api-${HADOOP_VERSION}.jar`
- `hadoop-client-runtime-${HADOOP_VERSION}.jar`

| JAR in current Dockerfile | Version in Dockerfile | Included in | Version in Hadoop 3.4.2 POM | Notes |
|---|---|---|---|---|
| `hadoop-common` | 3.4.2 | **hadoop-client-api** | 3.4.2 | Hadoop classes included un-relocated |
| `hadoop-auth` | 3.4.2 | **hadoop-client-api** | 3.4.2 | Transitive dep of hadoop-common, shaded into API jar |
| `hadoop-hdfs-client` | 3.4.2 | **hadoop-client-api** | 3.4.2 | Direct dep of `hadoop-client` aggregator |
| `hadoop-mapreduce-client-core` | 3.4.2 | **hadoop-client-api** | 3.4.2 | Direct dep of `hadoop-client` aggregator |
| `commons-configuration2` | **2.1.1** | **hadoop-client-runtime** | **2.10.1** | Shaded/relocated; Dockerfile version is outdated |
| `commons-collections4` | 4.4 | **hadoop-client-runtime** | 4.4 | Shaded/relocated; versions match |
| `hadoop-shaded-guava` | **1.1.1** | **hadoop-client-runtime** | **1.4.0** | Shaded/relocated; Dockerfile version is outdated |
| `stax2-api` | 4.2.1 | **hadoop-client-runtime** | 4.2.1 | Shaded/relocated; versions match |
| `woodstox-core` | **5.3.0** | **hadoop-client-runtime** | **5.4.0** | Shaded/relocated; Dockerfile version is outdated |

- **`hadoop-client-api`** shades all `org.apache.hadoop:*` modules (hadoop-common, hadoop-auth, hadoop-hdfs-client, hadoop-mapreduce-client-core, hadoop-yarn-api/client, etc.). Internal references to third-party libs are relocated under `org.apache.hadoop.shaded.*`.
- **`hadoop-client-runtime`** shades all third-party dependencies (commons-configuration2, commons-collections4, woodstox, stax2-api, hadoop-shaded-guava, plus many more) under the same `org.apache.hadoop.shaded.*` namespace, matching what hadoop-client-api expects.

1. **All 9 JARs are fully covered**: 4 Hadoop modules go into the API jar, 5 third-party libs go into the runtime jar.
2. **3 version mismatches fixed**: The current Dockerfile has outdated versions for commons-configuration2 (2.1.1 vs 2.10.1), hadoop-shaded-guava (1.1.1 vs 1.4.0), and woodstox-core (5.3.0 vs 5.4.0). The client JARs bundle the correct versions.
3. **This is the officially recommended approach**: `hadoop-client-api` + `hadoop-client-runtime` is Hadoop's supported way for downstream consumers to depend on Hadoop, introduced specifically to replace ad-hoc collections of individual JARs.
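For illustration, here is a rough, untested sketch of how the two download URLs could be derived from `HADOOP_VERSION`. The `hadoop_jar_url` helper and the `MAVEN_BASE` variable are hypothetical names, not part of this PR, and the sketch assumes both artifacts are published under the same `org/apache/hadoop` path layout on Maven Central that the existing `curl` lines rely on.

```shell
#!/usr/bin/env bash
# Sketch only: prints the Maven Central URLs for the two shaded Hadoop client JARs.
# hadoop_jar_url and MAVEN_BASE are hypothetical names used for illustration.

# Same default as the Dockerfile's ARG HADOOP_VERSION.
HADOOP_VERSION="${HADOOP_VERSION:-3.4.2}"
# Assumed base path for org.apache.hadoop artifacts on Maven Central.
MAVEN_BASE="https://repo.maven.apache.org/maven2/org/apache/hadoop"

# Build the URL for one artifact: <base>/<artifact>/<version>/<artifact>-<version>.jar
hadoop_jar_url() {
  local artifact="$1"
  printf '%s/%s/%s/%s-%s.jar\n' \
    "$MAVEN_BASE" "$artifact" "$HADOOP_VERSION" "$artifact" "$HADOOP_VERSION"
}

# Only the two client JARs are listed, replacing the nine individual downloads.
for artifact in hadoop-client-api hadoop-client-runtime; do
  hadoop_jar_url "$artifact"
done
```

In the Dockerfile's Hadoop `RUN` step, each printed URL would take the place of the existing per-JAR `curl -fO` lines, so a bump of `HADOOP_VERSION` would move both JARs together.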
