This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/main by this push:
     new 73a713a5c  Update Iceberg docs to switch ordering of two implementations and add a tuning config to native reader. (#3691)
73a713a5c is described below
commit 73a713a5c2a9ed6c9ff8148a3f7144050d7813d6
Author: Matt Butrovich <[email protected]>
AuthorDate: Fri Mar 13 16:11:45 2026 -0400
    Update Iceberg docs to switch ordering of two implementations and add a tuning config to native reader. (#3691)
---
docs/source/user-guide/latest/iceberg.md | 258 ++++++++++++++++---------------
1 file changed, 131 insertions(+), 127 deletions(-)
diff --git a/docs/source/user-guide/latest/iceberg.md b/docs/source/user-guide/latest/iceberg.md
index ad6e5b243..b8a0c424f 100644
--- a/docs/source/user-guide/latest/iceberg.md
+++ b/docs/source/user-guide/latest/iceberg.md
@@ -20,11 +20,139 @@
# Accelerating Apache Iceberg Parquet Scans using Comet (Experimental)
**Note: Iceberg integration is a work-in-progress. Comet currently has two distinct Iceberg
-code paths: 1) a hybrid reader (native Parquet decoding, JVM otherwise) that requires
-building Iceberg from source rather than using available artifacts in Maven, and 2) fully-native
-reader (based on [iceberg-rust](https://github.com/apache/iceberg-rust)). Directions for both
+code paths: 1) a fully-native
+reader (based on [iceberg-rust](https://github.com/apache/iceberg-rust)), and 2) a hybrid reader (native Parquet decoding, JVM otherwise) that requires
+building Iceberg from source rather than using available artifacts in Maven. Directions for both
designs are provided below.**
+## Native Reader
+
+Comet's fully-native Iceberg integration does not require modifying Iceberg source
+code. Instead, Comet relies on reflection to extract `FileScanTask`s from Iceberg, which are
+then serialized to Comet's native execution engine (see
+[PR #2528](https://github.com/apache/datafusion-comet/pull/2528)).
+
+The example below uses Spark's package downloader to retrieve Comet 0.14.0 and Iceberg
+1.8.1, but Comet has been tested with Iceberg 1.5, 1.7, 1.8, 1.9, and 1.10. The key configuration
+to enable fully-native Iceberg is `spark.comet.scan.icebergNative.enabled=true`. This
+configuration should **not** be used with the hybrid Iceberg configuration
+`spark.sql.iceberg.parquet.reader-type=COMET` from below.
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+  --packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.14.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
+  --repositories https://repo1.maven.org/maven2/ \
+  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog \
+  --conf spark.sql.catalog.spark_catalog.type=hadoop \
+  --conf spark.sql.catalog.spark_catalog.warehouse=/tmp/warehouse \
+  --conf spark.plugins=org.apache.spark.CometPlugin \
+  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+ --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
+ --conf spark.comet.scan.icebergNative.enabled=true \
+ --conf spark.comet.explainFallback.enabled=true \
+ --conf spark.memory.offHeap.enabled=true \
+ --conf spark.memory.offHeap.size=2g
+```
+
+### Tuning
+
+Comet's native Iceberg reader supports fetching multiple files in parallel to hide I/O latency with the
+config `spark.comet.scan.icebergNative.dataFileConcurrencyLimit`. This value defaults to 1 to
+maintain test behavior on Iceberg Java tests without `ORDER BY` clauses, but we suggest increasing it to
+values between 2 and 8 based on your workload.
+
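+As a sketch (the value `4` here is illustrative, not a recommendation for every workload), the
+concurrency limit can be raised alongside the other Comet settings when launching `spark-shell`:
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+  --conf spark.plugins=org.apache.spark.CometPlugin \
+  --conf spark.comet.scan.icebergNative.enabled=true \
+  --conf spark.comet.scan.icebergNative.dataFileConcurrencyLimit=4 \
+  --conf spark.memory.offHeap.enabled=true \
+  --conf spark.memory.offHeap.size=2g
+```
+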
+### Supported features
+
+The native Iceberg reader supports the following features:
+
+**Table specifications:**
+
+- Iceberg table spec v1 and v2 (v3 will fall back to Spark)
+
+**Schema and data types:**
+
+- All primitive types including UUID
+- Complex types: arrays, maps, and structs
+- Schema evolution (adding and dropping columns)
+
+**Time travel and branching:**
+
+- `VERSION AS OF` queries to read historical snapshots
+- Branch reads for accessing named branches
+
+**Delete handling (Merge-On-Read tables):**
+
+- Positional deletes
+- Equality deletes
+- Mixed delete types
+
+**Filter pushdown:**
+
+- Equality and comparison predicates (`=`, `!=`, `>`, `>=`, `<`, `<=`)
+- Logical operators (`AND`, `OR`)
+- NULL checks (`IS NULL`, `IS NOT NULL`)
+- `IN` and `NOT IN` list operations
+- `BETWEEN` operations
+
+**Partitioning:**
+
+- Standard partitioning with partition pruning
+- Date partitioning with `days()` transform
+- Bucket partitioning
+- Truncate transform
+- Hour transform
+
+**Storage:**
+
+- Local filesystem
+- Hadoop Distributed File System (HDFS)
+- S3-compatible storage (AWS S3, MinIO)
+
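+For example, the time travel and branch reads listed above use standard Iceberg Spark SQL syntax
+(the table name and snapshot ID below are hypothetical, following the style of the sample queries
+elsewhere in this guide):
+
+```scala
+scala> spark.sql("SELECT * FROM spark_catalog.db.events VERSION AS OF 4358109269898").show()
+scala> spark.sql("SELECT * FROM spark_catalog.db.events VERSION AS OF 'audit-branch'").show()
+```
+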
+### REST Catalog
+
+Comet's native Iceberg reader also supports REST catalogs. The following example shows how to
+configure Spark to use a REST catalog with Comet's native Iceberg scan:
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+  --packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.14.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
+  --repositories https://repo1.maven.org/maven2/ \
+  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+  --conf spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog \
+  --conf spark.sql.catalog.rest_cat.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
+  --conf spark.sql.catalog.rest_cat.uri=http://localhost:8181 \
+  --conf spark.sql.catalog.rest_cat.warehouse=/tmp/warehouse \
+  --conf spark.plugins=org.apache.spark.CometPlugin \
+  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+ --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
+ --conf spark.comet.scan.icebergNative.enabled=true \
+ --conf spark.comet.explainFallback.enabled=true \
+ --conf spark.memory.offHeap.enabled=true \
+ --conf spark.memory.offHeap.size=2g
+```
+
+Note that REST catalogs require explicit namespace creation before creating tables:
+
+```scala
+scala> spark.sql("CREATE NAMESPACE rest_cat.db")
+scala> spark.sql("CREATE TABLE rest_cat.db.test_table (id INT, name STRING) USING iceberg")
+scala> spark.sql("INSERT INTO rest_cat.db.test_table VALUES (1, 'Alice'), (2, 'Bob')")
+scala> spark.sql("SELECT * FROM rest_cat.db.test_table").show()
+```
+
+### Current limitations
+
+The following scenarios will fall back to Spark's native Iceberg reader:
+
+- Iceberg table spec v3 scans
+- Iceberg writes (reads are accelerated, writes use Spark)
+- Tables backed by Avro or ORC data files (only Parquet is accelerated)
+- Tables partitioned on `BINARY` or `DECIMAL` (with precision >28) columns
+- Scans with residual filters using `truncate`, `bucket`, `year`, `month`, `day`, or `hour`
+  transform functions (partition pruning still works, but row-level filtering of these
+  transforms falls back)
+
## Hybrid Reader
### Build Comet
@@ -149,127 +277,3 @@ scala> spark.sql(s"SELECT * from t1").explain()
- Spark Runtime Filtering isn't [working](https://github.com/apache/datafusion-comet/issues/2116)
  - You can bypass the issue by either setting `spark.sql.adaptive.enabled=false` or `spark.comet.exec.broadcastExchange.enabled=false`
-
-## Native Reader
-
-Comet's fully-native Iceberg integration does not require modifying Iceberg source
-code. Instead, Comet relies on reflection to extract `FileScanTask`s from Iceberg, which are
-then serialized to Comet's native execution engine (see
-[PR #2528](https://github.com/apache/datafusion-comet/pull/2528)).
-
-The example below uses Spark's package downloader to retrieve Comet 0.12.0 and Iceberg
-1.8.1, but Comet has been tested with Iceberg 1.5, 1.7, 1.8, and 1.10. The key configuration
-to enable fully-native Iceberg is `spark.comet.scan.icebergNative.enabled=true`. This
-configuration should **not** be used with the hybrid Iceberg configuration
-`spark.sql.iceberg.parquet.reader-type=COMET` from above.
-
-```shell
-$SPARK_HOME/bin/spark-shell \
-  --packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.12.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
-  --repositories https://repo1.maven.org/maven2/ \
-  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
-  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog \
-  --conf spark.sql.catalog.spark_catalog.type=hadoop \
-  --conf spark.sql.catalog.spark_catalog.warehouse=/tmp/warehouse \
-  --conf spark.plugins=org.apache.spark.CometPlugin \
-  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
- --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
- --conf spark.comet.scan.icebergNative.enabled=true \
- --conf spark.comet.explainFallback.enabled=true \
- --conf spark.memory.offHeap.enabled=true \
- --conf spark.memory.offHeap.size=2g
-```
-
-The same sample queries from above can be used to test Comet's fully-native Iceberg integration,
-however the scan node to look for is `CometIcebergNativeScan`.
-
-### Supported features
-
-The native Iceberg reader supports the following features:
-
-**Table specifications:**
-
-- Iceberg table spec v1 and v2 (v3 will fall back to Spark)
-
-**Schema and data types:**
-
-- All primitive types including UUID
-- Complex types: arrays, maps, and structs
-- Schema evolution (adding and dropping columns)
-
-**Time travel and branching:**
-
-- `VERSION AS OF` queries to read historical snapshots
-- Branch reads for accessing named branches
-
-**Delete handling (Merge-On-Read tables):**
-
-- Positional deletes
-- Equality deletes
-- Mixed delete types
-
-**Filter pushdown:**
-
-- Equality and comparison predicates (`=`, `!=`, `>`, `>=`, `<`, `<=`)
-- Logical operators (`AND`, `OR`)
-- NULL checks (`IS NULL`, `IS NOT NULL`)
-- `IN` and `NOT IN` list operations
-- `BETWEEN` operations
-
-**Partitioning:**
-
-- Standard partitioning with partition pruning
-- Date partitioning with `days()` transform
-- Bucket partitioning
-- Truncate transform
-- Hour transform
-
-**Storage:**
-
-- Local filesystem
-- Hadoop Distributed File System (HDFS)
-- S3-compatible storage (AWS S3, MinIO)
-
-### REST Catalog
-
-Comet's native Iceberg reader also supports REST catalogs. The following example shows how to
-configure Spark to use a REST catalog with Comet's native Iceberg scan:
-
-```shell
-$SPARK_HOME/bin/spark-shell \
-  --packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.12.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
-  --repositories https://repo1.maven.org/maven2/ \
-  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
-  --conf spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog \
-  --conf spark.sql.catalog.rest_cat.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
-  --conf spark.sql.catalog.rest_cat.uri=http://localhost:8181 \
-  --conf spark.sql.catalog.rest_cat.warehouse=/tmp/warehouse \
-  --conf spark.plugins=org.apache.spark.CometPlugin \
-  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
- --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
- --conf spark.comet.scan.icebergNative.enabled=true \
- --conf spark.comet.explainFallback.enabled=true \
- --conf spark.memory.offHeap.enabled=true \
- --conf spark.memory.offHeap.size=2g
-```
-
-Note that REST catalogs require explicit namespace creation before creating tables:
-
-```scala
-scala> spark.sql("CREATE NAMESPACE rest_cat.db")
-scala> spark.sql("CREATE TABLE rest_cat.db.test_table (id INT, name STRING) USING iceberg")
-scala> spark.sql("INSERT INTO rest_cat.db.test_table VALUES (1, 'Alice'), (2, 'Bob')")
-scala> spark.sql("SELECT * FROM rest_cat.db.test_table").show()
-```
-
-### Current limitations
-
-The following scenarios will fall back to Spark's native Iceberg reader:
-
-- Iceberg table spec v3 scans
-- Iceberg writes (reads are accelerated, writes use Spark)
-- Tables backed by Avro or ORC data files (only Parquet is accelerated)
-- Tables partitioned on `BINARY` or `DECIMAL` (with precision >28) columns
-- Scans with residual filters using `truncate`, `bucket`, `year`, `month`, `day`, or `hour`
-  transform functions (partition pruning still works, but row-level filtering of these
-  transforms falls back)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]