This is an automated email from the ASF dual-hosted git repository.

yufei pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/polaris.git
The following commit(s) were added to refs/heads/main by this push:
     new 33d99409f  Improve the bundle jar license and notice remove using exclude (#1991)
33d99409f is described below

commit 33d99409fd3d72852d66f6347155ec39e39089af
Author: Yun Zou <yunzou.colost...@gmail.com>
AuthorDate: Wed Jul 9 17:36:33 2025 -0700

    Improve the bundle jar license and notice remove using exclude (#1991)
---
 plugins/spark/README.md                            |  74 ++++++++++-----
 .../getting-started/notebooks/SparkPolaris.ipynb   |   2 +-
 plugins/spark/v3.5/spark/{LICENSE => BUNDLE-LICENSE} |   0
 plugins/spark/v3.5/spark/{NOTICE => BUNDLE-NOTICE}   |   0
 plugins/spark/v3.5/spark/build.gradle.kts          | 104 +++------------------
 5 files changed, 65 insertions(+), 115 deletions(-)

diff --git a/plugins/spark/README.md b/plugins/spark/README.md
index c7d6bc876..9764fc8d1 100644
--- a/plugins/spark/README.md
+++ b/plugins/spark/README.md
@@ -28,32 +28,29 @@ REST endpoints, and provides implementations for Apache Spark's
 Right now, the plugin only provides support for Spark 3.5, Scala version 2.12 and 2.13, and depends on iceberg-spark-runtime 1.9.0.
 
-# Build Plugin Jar
-A task createPolarisSparkJar is added to build a jar for the Polaris Spark plugin, the jar is named as:
-`polaris-spark-<sparkVersion>_<scalaVersion>-<polarisVersion>-bundle.jar`. For example:
-`polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar`.
-
-- `./gradlew :polaris-spark-3.5_2.12:createPolarisSparkJar` -- build jar for Spark 3.5 with Scala version 2.12.
-- `./gradlew :polaris-spark-3.5_2.13:createPolarisSparkJar` -- build jar for Spark 3.5 with Scala version 2.13.
-
-The result jar is located at plugins/spark/v3.5/build/<scala_version>/libs after the build.
-
-# Start Spark with Local Polaris Service using built Jar
-Once the jar is built, we can manually test it with Spark and a local Polaris service.
-
+# Start Spark with a local Polaris service using the Polaris Spark plugin
 The following command starts a Polaris server for local testing, it runs on localhost:8181 with default
-realm `POLARIS` and root credentials `root:secret`:
+realm `POLARIS` and root credentials `root:s3cr3t`:
 
 ```shell
 ./gradlew run
 ```
 
-Once the local server is running, the following command can be used to start the spark-shell with the built Spark client
-jar, and to use the local Polaris server as a Catalog.
+Once the local server is running, you can start Spark with the Polaris Spark plugin using either the `--packages`
+option with the Polaris Spark package or the `--jars` option with the Polaris Spark bundle JAR.
+
+The following sections explain how to build and run Spark with both the Polaris package and the bundle JAR.
+
+# Build and run with the Polaris Spark package locally
+The Polaris Spark client source code is located in plugins/spark/v3.5/spark. To use the Polaris Spark package
+with Spark, you first need to publish the source JAR to your local Maven repository.
+
+Run the following commands to build the Polaris Spark project and publish the source JAR to your local Maven repository:
+- `./gradlew assemble` -- build the whole Polaris project without running tests
+- `./gradlew publishToMavenLocal` -- publish the Polaris project source JAR to the local Maven repository
 
 ```shell
 bin/spark-shell \
---jars <path-to-spark-client-jar> \
---packages org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
+--packages org.apache.polaris:polaris-spark-<spark_version>_<scala_version>:<polaris_version>,org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
 --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
 --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
 --conf spark.sql.catalog.<catalog-name>.warehouse=<catalog-name> \
@@ -66,17 +63,20 @@ bin/spark-shell \
 --conf spark.sql.sources.useV1SourceList=''
 ```
 
-Assume the path to the built Spark client jar is
-`/polaris/plugins/spark/v3.5/spark/build/2.12/libs/polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar`
-and the name of the catalog is `polaris`. The cli command will look like following:
+The Polaris version is defined in the `versions.txt` file located in the root directory of the Polaris project.
+Assume the following values:
+- `spark_version`: 3.5
+- `scala_version`: 2.12
+- `polaris_version`: 1.1.0-incubating-SNAPSHOT
+- `catalog-name`: `polaris`
+
+The Spark command would then look like the following:
 
 ```shell
 bin/spark-shell \
---jars /polaris/plugins/spark/v3.5/spark/build/2.12/libs/polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar \
---packages org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
+--packages org.apache.polaris:polaris-spark-3.5_2.12:1.1.0-incubating-SNAPSHOT,org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
 --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
 --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
---conf spark.sql.catalog.polaris.warehouse=<catalog-name> \
+--conf spark.sql.catalog.polaris.warehouse=polaris \
 --conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials \
 --conf spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog \
 --conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
@@ -86,6 +86,32 @@ bin/spark-shell \
 --conf spark.sql.sources.useV1SourceList=''
 ```
 
+# Build and run with the Polaris Spark bundle JAR
+The polaris-spark project also provides a Spark bundle JAR for the `--jars` use case. The resulting JAR follows this naming format:
+polaris-spark-<spark_version>_<scala_version>-<polaris_version>-bundle.jar
+For example:
+polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar
+
+Run `./gradlew assemble` to build the entire Polaris project without running tests. After the build completes,
+the bundle JAR can be found under: plugins/spark/v3.5/spark/build/<scala_version>/libs/.
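+
+For example, assuming a Scala 2.12 build and the Polaris version used in the examples above
+(1.1.0-incubating-SNAPSHOT), you can confirm the bundle JAR exists with:
+
+```shell
+# List the built bundle JAR; the exact file name depends on the Spark/Scala
+# versions and on the Polaris version defined in versions.txt.
+ls plugins/spark/v3.5/spark/build/2.12/libs/polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar
+```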
+To start Spark using the bundle JAR, specify it with the `--jars` option as shown below:
+
+```shell
+bin/spark-shell \
+--jars <path-to-spark-client-jar> \
+--packages org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.3.1 \
+--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
+--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
+--conf spark.sql.catalog.<catalog-name>.warehouse=<catalog-name> \
+--conf spark.sql.catalog.<catalog-name>.header.X-Iceberg-Access-Delegation=vended-credentials \
+--conf spark.sql.catalog.<catalog-name>=org.apache.polaris.spark.SparkCatalog \
+--conf spark.sql.catalog.<catalog-name>.uri=http://localhost:8181/api/catalog \
+--conf spark.sql.catalog.<catalog-name>.credential="root:s3cr3t" \
+--conf spark.sql.catalog.<catalog-name>.scope='PRINCIPAL_ROLE:ALL' \
+--conf spark.sql.catalog.<catalog-name>.token-refresh-enabled=true \
+--conf spark.sql.sources.useV1SourceList=''
+```
+
 # Limitations
 The Polaris Spark client supports catalog management for both Iceberg and Delta tables, it routes all Iceberg table
 requests to the Iceberg REST endpoints, and routes all Delta table requests to the Generic Table REST endpoints.

diff --git a/plugins/spark/v3.5/getting-started/notebooks/SparkPolaris.ipynb b/plugins/spark/v3.5/getting-started/notebooks/SparkPolaris.ipynb
index 1c3803d7b..8974a81e2 100644
--- a/plugins/spark/v3.5/getting-started/notebooks/SparkPolaris.ipynb
+++ b/plugins/spark/v3.5/getting-started/notebooks/SparkPolaris.ipynb
@@ -265,7 +265,7 @@
 "from pyspark.sql import SparkSession\n",
 "\n",
 "spark = (SparkSession.builder\n",
-" .config(\"spark.jars\", \"../polaris_libs/polaris-spark-3.5_2.12-0.11.0-beta-incubating-SNAPSHOT-bundle.jar\")\n",
+" .config(\"spark.jars\", \"../polaris_libs/polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar\") # TODO: add a way to automatically discover the Jar\n",
 " .config(\"spark.jars.packages\", \"org.apache.iceberg:iceberg-aws-bundle:1.9.0,io.delta:delta-spark_2.12:3.2.1\")\n",
 " .config(\"spark.sql.catalog.spark_catalog\", \"org.apache.spark.sql.delta.catalog.DeltaCatalog\")\n",
 " .config('spark.sql.iceberg.vectorization.enabled', 'false')\n",
diff --git a/plugins/spark/v3.5/spark/LICENSE b/plugins/spark/v3.5/spark/BUNDLE-LICENSE
similarity index 100%
rename from plugins/spark/v3.5/spark/LICENSE
rename to plugins/spark/v3.5/spark/BUNDLE-LICENSE
diff --git a/plugins/spark/v3.5/spark/NOTICE b/plugins/spark/v3.5/spark/BUNDLE-NOTICE
similarity index 100%
rename from plugins/spark/v3.5/spark/NOTICE
rename to plugins/spark/v3.5/spark/BUNDLE-NOTICE
diff --git a/plugins/spark/v3.5/spark/build.gradle.kts b/plugins/spark/v3.5/spark/build.gradle.kts
index 797b27f7d..45af3b6f9 100644
--- a/plugins/spark/v3.5/spark/build.gradle.kts
+++ b/plugins/spark/v3.5/spark/build.gradle.kts
@@ -89,96 +89,20 @@ tasks.register<ShadowJar>("createPolarisSparkJar") {
   from(sourceSets.main.get().output)
   configurations = listOf(project.configurations.runtimeClasspath.get())
 
-  // Optimization: Minimize the JAR (remove unused classes from dependencies)
-  // The iceberg-spark-runtime plugin is always packaged along with our polaris-spark plugin,
-  // therefore excluded from the optimization.
-  minimize { exclude(dependency("org.apache.iceberg:iceberg-spark-runtime-*.*")) }
-
-  // Always run the license file addition after this task completes
-  finalizedBy("addLicenseFilesToJar")
-}
-
-// Post-processing task to add our project's LICENSE and NOTICE files to the jar and remove any
-// other LICENSE or NOTICE files that were shaded in.
-tasks.register("addLicenseFilesToJar") {
-  dependsOn("createPolarisSparkJar")
-
-  doLast {
-    val shadowTask = tasks.named("createPolarisSparkJar", ShadowJar::class.java).get()
-    val jarFile = shadowTask.archiveFile.get().asFile
-    val tempDir =
-      File(
-        "${project.layout.buildDirectory.get().asFile}/tmp/jar-cleanup-${shadowTask.archiveBaseName.get()}-${shadowTask.archiveClassifier.get()}"
-      )
-    val projectLicenseFile = File(projectDir, "LICENSE")
-    val projectNoticeFile = File(projectDir, "NOTICE")
-
-    // Validate that required license files exist
-    if (!projectLicenseFile.exists()) {
-      throw GradleException("Project LICENSE file not found at: ${projectLicenseFile.absolutePath}")
-    }
-    if (!projectNoticeFile.exists()) {
-      throw GradleException("Project NOTICE file not found at: ${projectNoticeFile.absolutePath}")
-    }
-
-    logger.info("Processing jar: ${jarFile.absolutePath}")
-    logger.info("Using temp directory: ${tempDir.absolutePath}")
-
-    // Clean up temp directory
-    if (tempDir.exists()) {
-      tempDir.deleteRecursively()
-    }
-    tempDir.mkdirs()
-
-    // Extract the jar
-    copy {
-      from(zipTree(jarFile))
-      into(tempDir)
-    }
-
-    fileTree(tempDir)
-      .matching {
-        include("**/*LICENSE*")
-        include("**/*NOTICE*")
-      }
-      .forEach { file ->
-        logger.info("Removing license file: ${file.relativeTo(tempDir)}")
-        file.delete()
-      }
-
-    // Remove META-INF/licenses directory if it exists
-    val licensesDir = File(tempDir, "META-INF/licenses")
-    if (licensesDir.exists()) {
-      licensesDir.deleteRecursively()
-      logger.info("Removed META-INF/licenses directory")
-    }
-
-    // Copy our project's license files to root
-    copy {
-      from(projectLicenseFile)
-      into(tempDir)
-    }
-    logger.info("Added project LICENSE file")
-
-    copy {
-      from(projectNoticeFile)
-      into(tempDir)
-    }
-    logger.info("Added project NOTICE file")
-
-    // Delete the original jar
-    jarFile.delete()
-
-    // Create new jar with only project LICENSE and NOTICE files
-    ant.withGroovyBuilder {
-      "jar"("destfile" to jarFile.absolutePath) { "fileset"("dir" to tempDir.absolutePath) }
-    }
-
-    logger.info("Recreated jar with only project LICENSE and NOTICE files")
-
-    // Clean up temp directory
-    tempDir.deleteRecursively()
-  }
+  // Recursively remove all LICENSE and NOTICE files under META-INF, including
+  // directories containing 'license' in the name.
+  exclude("META-INF/**/*LICENSE*")
+  exclude("META-INF/**/*NOTICE*")
+  // Exclude the top-level LICENSE, LICENSE-*.txt and NOTICE files.
+  exclude("LICENSE*")
+  exclude("NOTICE*")
+
+  // Add the Polaris customized LICENSE and NOTICE for the bundle jar at the top level. The
+  // customized files are named BUNDLE-LICENSE and BUNDLE-NOTICE and are renamed to LICENSE
+  // and NOTICE after being included, so that they are not caught by the exclude patterns above.
+  from("${projectDir}/BUNDLE-LICENSE") { rename { "LICENSE" } }
+  from("${projectDir}/BUNDLE-NOTICE") { rename { "NOTICE" } }
 }
 
 // ensure the shadow jar job (which will automatically run license addition) is run for both
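
A quick sanity check of the new exclude/rename approach (a local verification sketch, not part
of this commit; the task name comes from build.gradle.kts above, and the path and version assume
a Scala 2.12 build of 1.1.0-incubating-SNAPSHOT) is to list the license-related entries left in
the rebuilt bundle jar:

```shell
# Rebuild the bundle jar, then list any LICENSE/NOTICE entries still inside it.
./gradlew :polaris-spark-3.5_2.12:createPolarisSparkJar
jar tf plugins/spark/v3.5/spark/build/2.12/libs/polaris-spark-3.5_2.12-1.1.0-incubating-SNAPSHOT-bundle.jar \
  | grep -i -E 'license|notice'
# Expected output: only the top-level LICENSE and NOTICE entries, copied in from
# BUNDLE-LICENSE and BUNDLE-NOTICE by the from(...) { rename { ... } } blocks above.
```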