This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 7339c4b23 Publish built docs triggered by
dba523d994f3f8336d2c5ca469c61672768611a1
7339c4b23 is described below
commit 7339c4b23e22db6c1d30d1bc758e2a93df2f234c
Author: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
AuthorDate: Mon Nov 3 20:53:09 2025 +0000
Publish built docs triggered by dba523d994f3f8336d2c5ca469c61672768611a1
---
_sources/contributor-guide/index.md.txt | 1 +
_sources/contributor-guide/parquet_scans.md.txt | 137 ++++++++++++++++
_sources/user-guide/latest/compatibility.md.txt | 68 +-------
_sources/user-guide/latest/datasources.md.txt | 72 +--------
contributor-guide/adding_a_new_expression.html | 1 +
contributor-guide/benchmarking.html | 1 +
contributor-guide/contributing.html | 1 +
contributor-guide/debugging.html | 1 +
contributor-guide/development.html | 7 +-
contributor-guide/ffi.html | 7 +-
contributor-guide/index.html | 5 +
.../{tracing.html => parquet_scans.html} | 177 +++++++++++++++------
contributor-guide/plugin_overview.html | 1 +
contributor-guide/profiling_native_code.html | 1 +
contributor-guide/roadmap.html | 1 +
contributor-guide/spark-sql-tests.html | 1 +
contributor-guide/tracing.html | 1 +
objects.inv | Bin 1486 -> 1509 bytes
searchindex.js | 2 +-
user-guide/latest/compatibility.html | 80 +---------
user-guide/latest/datasources.html | 81 +---------
21 files changed, 315 insertions(+), 331 deletions(-)
diff --git a/_sources/contributor-guide/index.md.txt
b/_sources/contributor-guide/index.md.txt
index ba4692a97..eb79f7ab5 100644
--- a/_sources/contributor-guide/index.md.txt
+++ b/_sources/contributor-guide/index.md.txt
@@ -26,6 +26,7 @@ under the License.
Getting Started <contributing>
Comet Plugin Overview <plugin_overview>
Arrow FFI <ffi>
+Parquet Scans <parquet_scans>
Development Guide <development>
Debugging Guide <debugging>
Benchmarking Guide <benchmarking>
diff --git a/_sources/contributor-guide/parquet_scans.md.txt
b/_sources/contributor-guide/parquet_scans.md.txt
new file mode 100644
index 000000000..4aec9f347
--- /dev/null
+++ b/_sources/contributor-guide/parquet_scans.md.txt
@@ -0,0 +1,137 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Comet Parquet Scan Implementations
+
+Comet currently has three distinct implementations of the Parquet scan
operator. The configuration property
+`spark.comet.scan.impl` is used to select an implementation. The default
setting is `spark.comet.scan.impl=auto`, and
+Comet will choose the most appropriate implementation based on the Parquet
schema and other Comet configuration
+settings. Most users should not need to change this setting. However, it is
possible to force Comet to try to use
+a particular implementation for all scan operations by setting this
configuration property to one of the following
+implementations.
+
+| Implementation          | Description |
+| ----------------------- | ----------- |
+| `native_comet`          | This implementation provides strong compatibility with Spark but does not support complex types. This is the original scan implementation in Comet and may eventually be removed. |
+| `native_iceberg_compat` | This implementation delegates to DataFusion's `DataSourceExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future. |
+| `native_datafusion`     | This experimental implementation delegates to DataFusion's `DataSourceExec` for full native execution. There are known compatibility issues when using this scan. |
+
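As a minimal sketch of forcing one implementation, the snippet below assembles the `--conf` flag for `spark.comet.scan.impl`. The variable names `SCAN_IMPL` and `COMET_CONF` are illustrative assumptions and not part of Comet; all other Spark and Comet options are omitted.

```shell
# Illustrative only: SCAN_IMPL and COMET_CONF are hypothetical variable names.
# Values named in this guide: auto, native_comet,
# native_iceberg_compat, native_datafusion.
SCAN_IMPL="native_iceberg_compat"
COMET_CONF="--conf spark.comet.scan.impl=${SCAN_IMPL}"
echo "${COMET_CONF}"
# Pass the flag alongside your usual options, for example:
#   $SPARK_HOME/bin/spark-shell ${COMET_CONF} ...
```

Leaving the property unset keeps the `auto` behavior described above, which is what most users want.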
+The `native_datafusion` and `native_iceberg_compat` scans provide the
following benefits over the `native_comet`
+implementation:
+
+- Leverages the DataFusion community's ongoing improvements to `DataSourceExec`
+- Provides support for reading complex types (structs, arrays, and maps)
+- Removes the use of reusable mutable-buffers in Comet, which is complex to
maintain
+- Improves performance
+
+The `native_datafusion` and `native_iceberg_compat` scans share the following
limitations:
+
+- When reading Parquet files written by systems other than Spark that contain
columns with the logical types `UINT_8`
+ or `UINT_16`, Comet will produce different results than Spark because Spark
does not preserve or understand these
+ logical types. Arrow-based readers, such as DataFusion and Comet, do respect
these types and read the data as unsigned
+ rather than signed. By default, Comet will fall back to `native_comet` when
scanning Parquet files containing `byte` or `short`
+ types (regardless of the logical type). This behavior can be disabled by
setting
+ `spark.comet.scan.allowIncompatible=true`.
+- No support for default values that are nested types (e.g., maps, arrays,
structs). Literal default values are supported.
+
+The `native_datafusion` scan has some additional limitations:
+
+- Bucketed scans are not supported
+- No support for row indexes
+- `PARQUET_FIELD_ID_READ_ENABLED` is not respected [#1758]
+- There are failures in the Spark SQL test suite [#1545]
+- Setting Spark configs `ignoreMissingFiles` or `ignoreCorruptFiles` to `true`
is not compatible with Spark
+
+## S3 Support
+
+The scan implementations use different approaches to read data from S3.
+
+### `native_comet`
+
+The default `native_comet` Parquet scan implementation reads data from S3
using the [Hadoop-AWS
module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html),
which
+is identical to the approach commonly used with vanilla Spark. AWS credential
configuration and other Hadoop S3A
+configurations work the same way as in vanilla Spark.
+
+### `native_datafusion` and `native_iceberg_compat`
+
+The `native_datafusion` and `native_iceberg_compat` Parquet scan
implementations completely offload data loading
+to native code. They use the [`object_store`
crate](https://crates.io/crates/object_store) to read data from S3 and
+support configuring S3 access using standard [Hadoop S3A
configurations](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#General_S3A_Client_configuration)
by translating them to
+the `object_store` crate's format.
+
+These implementations maintain compatibility with existing Hadoop S3A
configurations, so existing code will
+continue to work as long as the configurations are supported and can be
translated without loss of functionality.
+
+#### Additional S3 Configuration Options
+
+Beyond credential providers, the `native_datafusion` implementation supports
additional S3 configuration options:
+
+| Option | Description |
+|--------|-------------|
+| `fs.s3a.endpoint` | The endpoint of the S3 service |
+| `fs.s3a.endpoint.region` | The AWS region for the S3 service. If not specified, the region will be auto-detected. |
+| `fs.s3a.path.style.access` | Whether to use path style access for the S3 service (true/false, defaults to virtual hosted style) |
+| `fs.s3a.requester.pays.enabled` | Whether to enable requester pays for S3 requests (true/false) |
+
+All configuration options support bucket-specific overrides using the pattern
`fs.s3a.bucket.{bucket-name}.{option}`.
+
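The bucket-specific override pattern can be sketched as follows. The bucket name `my-bucket` and the region `us-west-2` are hypothetical values chosen for illustration, and the `spark.hadoop.` prefix is assumed to apply in the same way as in the examples in this guide.

```shell
# Hypothetical bucket name and option, for illustration only.
BUCKET="my-bucket"
OPTION="endpoint.region"
# Pattern from the text: fs.s3a.bucket.{bucket-name}.{option},
# passed to Spark with the spark.hadoop. prefix.
OVERRIDE_KEY="spark.hadoop.fs.s3a.bucket.${BUCKET}.${OPTION}"
echo "--conf ${OVERRIDE_KEY}=us-west-2"
```

With this flag, only `my-bucket` would use the overridden region; other buckets keep the global `fs.s3a.endpoint.region` setting or auto-detection.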
+#### Examples
+
+The following examples demonstrate how to configure S3 access with the
`native_datafusion` Parquet scan implementation using different authentication
methods.
+
+**Example 1: Simple Credentials**
+
+This example shows how to access a private S3 bucket using an access key and
secret key. The `fs.s3a.aws.credentials.provider` configuration can be omitted
since `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` is included in
Hadoop S3A's default credential provider chain.
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+...
+--conf spark.comet.scan.impl=native_datafusion \
+--conf spark.hadoop.fs.s3a.access.key=my-access-key \
+--conf spark.hadoop.fs.s3a.secret.key=my-secret-key
+...
+```
+
+**Example 2: Assume Role with Web Identity Token**
+
+This example demonstrates using an assumed role credential to access a private
S3 bucket, where the base credential for assuming the role is provided by a web
identity token credentials provider.
+
+```shell
+$SPARK_HOME/bin/spark-shell \
+...
+--conf spark.comet.scan.impl=native_datafusion \
+--conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider \
+--conf spark.hadoop.fs.s3a.assumed.role.arn=arn:aws:iam::123456789012:role/my-role \
+--conf spark.hadoop.fs.s3a.assumed.role.session.name=my-session \
+--conf spark.hadoop.fs.s3a.assumed.role.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider
+...
+```
+
+#### Limitations
+
+The S3 support of `native_datafusion` has the following limitations:
+
+1. **Partial Hadoop S3A configuration support**: Not all Hadoop S3A
configurations are currently supported. Only the configurations listed in the
tables above are translated and applied to the underlying `object_store` crate.
+
+2. **Custom credential providers**: Custom implementations of AWS credential
providers are not supported. The implementation only supports the standard
credential providers listed in the table above. We are planning to add support
for custom credential providers through a JNI-based adapter that will allow
calling Java credential providers from native code. See [issue
#1829](https://github.com/apache/datafusion-comet/issues/1829) for more details.
+
+
+
+[#1545]: https://github.com/apache/datafusion-comet/issues/1545
+[#1758]: https://github.com/apache/datafusion-comet/issues/1758
diff --git a/_sources/user-guide/latest/compatibility.md.txt
b/_sources/user-guide/latest/compatibility.md.txt
index ac2be802d..908693ff5 100644
--- a/_sources/user-guide/latest/compatibility.md.txt
+++ b/_sources/user-guide/latest/compatibility.md.txt
@@ -25,59 +25,11 @@ This guide offers information about areas of functionality
where there are known
## Parquet
-### Data Type Support
+Comet has the following limitations when reading Parquet files:
-Comet does not support reading decimals encoded in binary format.
-
-### Parquet Scans
-
-Comet currently has three distinct implementations of the Parquet scan
operator. The configuration property
-`spark.comet.scan.impl` is used to select an implementation. The default
setting is `spark.comet.scan.impl=auto`, and
-Comet will choose the most appropriate implementation based on the Parquet
schema and other Comet configuration
-settings. Most users should not need to change this setting. However, it is
possible to force Comet to try and use
-a particular implementation for all scan operations by setting this
configuration property to one of the following
-implementations.
-
-| Implementation          | Description |
-| ----------------------- | ----------- |
-| `native_comet`          | This implementation provides strong compatibility with Spark but does not support complex types. This is the original scan implementation in Comet and may eventually be removed. |
-| `native_iceberg_compat` | This implementation delegates to DataFusion's `DataSourceExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future. |
-| `native_datafusion`     | This experimental implementation delegates to DataFusion's `DataSourceExec` for full native execution. There are known compatibility issues when using this scan. |
-
-The `native_datafusion` and `native_iceberg_compat` scans provide the
following benefits over the `native_comet`
-implementation:
-
-- Leverages the DataFusion community's ongoing improvements to `DataSourceExec`
-- Provides support for reading complex types (structs, arrays, and maps)
-- Removes the use of reusable mutable-buffers in Comet, which is complex to
maintain
-- Improves performance
-
-The `native_datafusion` and `native_iceberg_compat` scans share the following
limitations:
-
-- When reading Parquet files written by systems other than Spark that contain
columns with the logical types `UINT_8`
- or `UINT_16`, Comet will produce different results than Spark because Spark
does not preserve or understand these
- logical types. Arrow-based readers, such as DataFusion and Comet do respect
these types and read the data as unsigned
- rather than signed. By default, Comet will fall back to `native_comet` when
scanning Parquet files containing `byte` or `short`
- types (regardless of the logical type). This behavior can be disabled by
setting
- `spark.comet.scan.allowIncompatible=true`.
+- Comet does not support reading decimals encoded in binary format.
- No support for default values that are nested types (e.g., maps, arrays,
structs). Literal default values are supported.
-The `native_datafusion` scan has some additional limitations:
-
-- Bucketed scans are not supported
-- No support for row indexes
-- `PARQUET_FIELD_ID_READ_ENABLED` is not respected [#1758]
-- There are failures in the Spark SQL test suite [#1545]
-- Setting Spark configs `ignoreMissingFiles` or `ignoreCorruptFiles` to `true`
is not compatible with Spark
-
-[#1545]: https://github.com/apache/datafusion-comet/issues/1545
-[#1758]: https://github.com/apache/datafusion-comet/issues/1758
-
-### S3 Support with `native_iceberg_compat`
-
-- When using the default AWS S3 endpoint (no custom endpoint configured), a
valid region is required. Comet
- will attempt to resolve the region if it is not provided.
-
## ANSI Mode
Comet will fall back to Spark for the following expressions when ANSI mode is
enabled, unless
@@ -101,18 +53,14 @@ Sorting on floating-point data types (or complex types
containing floating-point
Spark if the data contains both zero and negative zero. This is likely an edge
case that is not of concern for many users
and sorting on floating-point data can be enabled by setting
`spark.comet.expression.SortOrder.allowIncompatible=true`.
-There is a known bug with using count(distinct) within aggregate queries,
where each NaN value will be counted
-separately [#1824](https://github.com/apache/datafusion-comet/issues/1824).
-
## Incompatible Expressions
-Some Comet native expressions are not 100% compatible with Spark and are
disabled by default. These expressions
-will fall back to Spark but can be enabled by setting
`spark.comet.expression.allowIncompatible=true`.
-
-## Array Expressions
+Expressions that are not 100% Spark-compatible will fall back to Spark by
default and can be enabled by setting
+`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is
the Spark expression class name. See
+the [Comet Supported Expressions Guide](expressions.md) for more information
on this configuration setting.
-Comet has experimental support for a number of array expressions. These are
experimental and currently marked
-as incompatible and can be enabled by setting
`spark.comet.expression.allowIncompatible=true`.
+It is also possible to specify `spark.comet.expression.allowIncompatible=true`
to enable all
+incompatible expressions.
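
As a minimal sketch of the per-expression setting, the snippet below builds the flag for the `Cast` expression class, which this guide names in the Cast section. The variable names `EXPR_NAME` and `EXPR_CONF` are illustrative assumptions.

```shell
# Illustrative only: EXPR_NAME and EXPR_CONF are hypothetical variable names.
# EXPR_NAME is the Spark expression class name.
EXPR_NAME="Cast"
EXPR_CONF="--conf spark.comet.expression.${EXPR_NAME}.allowIncompatible=true"
echo "${EXPR_CONF}"
```

Enabling a single expression this way is narrower than setting `spark.comet.expression.allowIncompatible=true`, which enables all incompatible expressions at once.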
## Regular Expressions
@@ -127,7 +75,7 @@ Cast operations in Comet fall into three levels of support:
- **Compatible**: The results match Apache Spark
- **Incompatible**: The results may match Apache Spark for some inputs, but
there are known issues where some inputs
will result in incorrect results or exceptions. The query stage will fall
back to Spark by default. Setting
- `spark.comet.expression.allowIncompatible=true` will allow all incompatible
casts to run natively in Comet, but this is not
+ `spark.comet.expression.Cast.allowIncompatible=true` will allow all
incompatible casts to run natively in Comet, but this is not
recommended for production use.
- **Unsupported**: Comet does not provide a native version of this cast
expression and the query stage will fall back to
Spark.
diff --git a/_sources/user-guide/latest/datasources.md.txt
b/_sources/user-guide/latest/datasources.md.txt
index 98bd61f71..14d0ecc15 100644
--- a/_sources/user-guide/latest/datasources.md.txt
+++ b/_sources/user-guide/latest/datasources.md.txt
@@ -163,23 +163,11 @@ Or use `spark-shell` with HDFS support as described
[above](#using-experimental-
## S3
-DataFusion Comet has [multiple Parquet scan
implementations](./compatibility.md#parquet-scans) that use different
approaches to read data from S3.
-
-### `native_comet`
-
-The default `native_comet` Parquet scan implementation reads data from S3
using the [Hadoop-AWS
module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html),
which is identical to the approach commonly used with vanilla Spark. AWS
credential configuration and other Hadoop S3A configurations works the same way
as in vanilla Spark.
-
-### `native_datafusion` and `native_iceberg_compat`
-
-The `native_datafusion` and `native_iceberg_compat` Parquet scan
implementations completely offload data loading to native code. They use the
[`object_store` crate](https://crates.io/crates/object_store) to read data from
S3 and support configuring S3 access using standard [Hadoop S3A
configurations](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#General_S3A_Client_configuration)
by translating them to the `object_store` crate's format.
-
-This implementation maintains compatibility with existing Hadoop S3A
configurations, so existing code will continue to work as long as the
configurations are supported and can be translated without loss of
functionality.
-
#### Root CA Certificates
-One major difference between `native_comet` and the other scan implementations
is the mechanism for discovering Root
-CA Certificates. The `native_comet` scan uses the JVM to read CA Certificates
from the Java Trust Store, but the native
-scan implementations `native_datafusion` and `native_iceberg_compat` use
system Root CA Certificates (typically stored
+One major difference between Spark and Comet is the mechanism for discovering
Root
+CA Certificates. Spark uses the JVM to read CA Certificates from the Java
Trust Store, but native Comet
+scans use system Root CA Certificates (typically stored
in `/etc/ssl/certs` on Linux). These scans will not be able to interact with
S3 if the Root CA Certificates are not
installed.
@@ -200,57 +188,3 @@ AWS credential providers can be configured using the
`fs.s3a.aws.credentials.pro
|
`com.amazonaws.auth.WebIdentityTokenCredentialsProvider`<br/>`software.amazon.awssdk.auth.credentials.WebIdentityTokenFileCredentialsProvider`
| Authenticate using web identity token file | None |
Multiple credential providers can be specified in a comma-separated list using
the `fs.s3a.aws.credentials.provider` configuration, just as Hadoop AWS
supports. If `fs.s3a.aws.credentials.provider` is not configured, Hadoop S3A's
default credential provider chain will be used. All configuration options also
support bucket-specific overrides using the pattern
`fs.s3a.bucket.{bucket-name}.{option}`.
-
-#### Additional S3 Configuration Options
-
-Beyond credential providers, the `native_datafusion` implementation supports
additional S3 configuration options:
-
-| Option | Description |
-|--------|-------------|
-| `fs.s3a.endpoint` | The endpoint of the S3 service |
-| `fs.s3a.endpoint.region` | The AWS region for the S3 service. If not specified, the region will be auto-detected. |
-| `fs.s3a.path.style.access` | Whether to use path style access for the S3 service (true/false, defaults to virtual hosted style) |
-| `fs.s3a.requester.pays.enabled` | Whether to enable requester pays for S3 requests (true/false) |
-
-All configuration options support bucket-specific overrides using the pattern
`fs.s3a.bucket.{bucket-name}.{option}`.
-
-#### Examples
-
-The following examples demonstrate how to configure S3 access with the
`native_datafusion` Parquet scan implementation using different authentication
methods.
-
-**Example 1: Simple Credentials**
-
-This example shows how to access a private S3 bucket using an access key and
secret key. The `fs.s3a.aws.credentials.provider` configuration can be omitted
since `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` is included in
Hadoop S3A's default credential provider chain.
-
-```shell
-$SPARK_HOME/bin/spark-shell \
-...
---conf spark.comet.scan.impl=native_datafusion \
---conf spark.hadoop.fs.s3a.access.key=my-access-key \
---conf spark.hadoop.fs.s3a.secret.key=my-secret-key
-...
-```
-
-**Example 2: Assume Role with Web Identity Token**
-
-This example demonstrates using an assumed role credential to access a private
S3 bucket, where the base credential for assuming the role is provided by a web
identity token credentials provider.
-
-```shell
-$SPARK_HOME/bin/spark-shell \
-...
---conf spark.comet.scan.impl=native_datafusion \
---conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider \
---conf spark.hadoop.fs.s3a.assumed.role.arn=arn:aws:iam::123456789012:role/my-role \
---conf spark.hadoop.fs.s3a.assumed.role.session.name=my-session \
---conf spark.hadoop.fs.s3a.assumed.role.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider
-...
-```
-
-#### Limitations
-
-The S3 support of `native_datafusion` has the following limitations:
-
-1. **Partial Hadoop S3A configuration support**: Not all Hadoop S3A
configurations are currently supported. Only the configurations listed in the
tables above are translated and applied to the underlying `object_store` crate.
-
-2. **Custom credential providers**: Custom implementations of AWS credential
providers are not supported. The implementation only supports the standard
credential providers listed in the table above. We are planning to add support
for custom credential providers through a JNI-based adapter that will allow
calling Java credential providers from native code. See [issue
#1829](https://github.com/apache/datafusion-comet/issues/1829) for more details.
-
diff --git a/contributor-guide/adding_a_new_expression.html
b/contributor-guide/adding_a_new_expression.html
index d749ae49b..28c2a9c48 100644
--- a/contributor-guide/adding_a_new_expression.html
+++ b/contributor-guide/adding_a_new_expression.html
@@ -357,6 +357,7 @@ under the License.
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html">Comet Plugin Architecture</a></li>
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html#plugin-components">Plugin Components</a></li>
<li class="toctree-l2"><a class="reference internal" href="ffi.html">Arrow
FFI</a></li>
+<li class="toctree-l2"><a class="reference internal"
href="parquet_scans.html">Parquet Scans</a></li>
<li class="toctree-l2"><a class="reference internal"
href="development.html">Development Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="debugging.html">Debugging Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="benchmarking.html">Benchmarking Guide</a></li>
diff --git a/contributor-guide/benchmarking.html
b/contributor-guide/benchmarking.html
index 3d65c3ff5..6723a56bf 100644
--- a/contributor-guide/benchmarking.html
+++ b/contributor-guide/benchmarking.html
@@ -357,6 +357,7 @@ under the License.
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html">Comet Plugin Architecture</a></li>
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html#plugin-components">Plugin Components</a></li>
<li class="toctree-l2"><a class="reference internal" href="ffi.html">Arrow
FFI</a></li>
+<li class="toctree-l2"><a class="reference internal"
href="parquet_scans.html">Parquet Scans</a></li>
<li class="toctree-l2"><a class="reference internal"
href="development.html">Development Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="debugging.html">Debugging Guide</a></li>
<li class="toctree-l2 current"><a class="current reference internal"
href="#">Benchmarking Guide</a></li>
diff --git a/contributor-guide/contributing.html
b/contributor-guide/contributing.html
index 4c91dfb07..31fba8be2 100644
--- a/contributor-guide/contributing.html
+++ b/contributor-guide/contributing.html
@@ -357,6 +357,7 @@ under the License.
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html">Comet Plugin Architecture</a></li>
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html#plugin-components">Plugin Components</a></li>
<li class="toctree-l2"><a class="reference internal" href="ffi.html">Arrow
FFI</a></li>
+<li class="toctree-l2"><a class="reference internal"
href="parquet_scans.html">Parquet Scans</a></li>
<li class="toctree-l2"><a class="reference internal"
href="development.html">Development Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="debugging.html">Debugging Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="benchmarking.html">Benchmarking Guide</a></li>
diff --git a/contributor-guide/debugging.html b/contributor-guide/debugging.html
index 197b3a3e5..445e001bd 100644
--- a/contributor-guide/debugging.html
+++ b/contributor-guide/debugging.html
@@ -357,6 +357,7 @@ under the License.
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html">Comet Plugin Architecture</a></li>
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html#plugin-components">Plugin Components</a></li>
<li class="toctree-l2"><a class="reference internal" href="ffi.html">Arrow
FFI</a></li>
+<li class="toctree-l2"><a class="reference internal"
href="parquet_scans.html">Parquet Scans</a></li>
<li class="toctree-l2"><a class="reference internal"
href="development.html">Development Guide</a></li>
<li class="toctree-l2 current"><a class="current reference internal"
href="#">Debugging Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="benchmarking.html">Benchmarking Guide</a></li>
diff --git a/contributor-guide/development.html
b/contributor-guide/development.html
index c60c4dd6b..a034984d1 100644
--- a/contributor-guide/development.html
+++ b/contributor-guide/development.html
@@ -66,7 +66,7 @@ under the License.
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Comet Debugging Guide" href="debugging.html" />
- <link rel="prev" title="Arrow FFI Usage in Comet" href="ffi.html" />
+ <link rel="prev" title="Comet Parquet Scan Implementations"
href="parquet_scans.html" />
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<meta name="docsearch:language" content="en"/>
<meta name="docsearch:version" content="" />
@@ -357,6 +357,7 @@ under the License.
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html">Comet Plugin Architecture</a></li>
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html#plugin-components">Plugin Components</a></li>
<li class="toctree-l2"><a class="reference internal" href="ffi.html">Arrow
FFI</a></li>
+<li class="toctree-l2"><a class="reference internal"
href="parquet_scans.html">Parquet Scans</a></li>
<li class="toctree-l2 current"><a class="current reference internal"
href="#">Development Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="debugging.html">Debugging Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="benchmarking.html">Benchmarking Guide</a></li>
@@ -601,12 +602,12 @@ cargo<span class="w"> </span>clippy<span class="w">
</span>--color<span class="o
<div class="prev-next-area">
<a class="left-prev"
- href="ffi.html"
+ href="parquet_scans.html"
title="previous page">
<i class="fa-solid fa-angle-left"></i>
<div class="prev-next-info">
<p class="prev-next-subtitle">previous</p>
- <p class="prev-next-title">Arrow FFI Usage in Comet</p>
+ <p class="prev-next-title">Comet Parquet Scan Implementations</p>
</div>
</a>
<a class="right-next"
diff --git a/contributor-guide/ffi.html b/contributor-guide/ffi.html
index 93dfc9e12..c8585787a 100644
--- a/contributor-guide/ffi.html
+++ b/contributor-guide/ffi.html
@@ -65,7 +65,7 @@ under the License.
<script async="true" defer="true"
src="https://buttons.github.io/buttons.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
- <link rel="next" title="Comet Development Guide" href="development.html" />
+ <link rel="next" title="Comet Parquet Scan Implementations"
href="parquet_scans.html" />
<link rel="prev" title="Comet Plugin Architecture"
href="plugin_overview.html" />
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<meta name="docsearch:language" content="en"/>
@@ -357,6 +357,7 @@ under the License.
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html">Comet Plugin Architecture</a></li>
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html#plugin-components">Plugin Components</a></li>
<li class="toctree-l2 current"><a class="current reference internal"
href="#">Arrow FFI</a></li>
+<li class="toctree-l2"><a class="reference internal"
href="parquet_scans.html">Parquet Scans</a></li>
<li class="toctree-l2"><a class="reference internal"
href="development.html">Development Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="debugging.html">Debugging Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="benchmarking.html">Benchmarking Guide</a></li>
@@ -832,11 +833,11 @@ t4 Batch handle released ArrowBuf freed
Data freed
</div>
</a>
<a class="right-next"
- href="development.html"
+ href="parquet_scans.html"
title="next page">
<div class="prev-next-info">
<p class="prev-next-subtitle">next</p>
- <p class="prev-next-title">Comet Development Guide</p>
+ <p class="prev-next-title">Comet Parquet Scan Implementations</p>
</div>
<i class="fa-solid fa-angle-right"></i>
</a>
diff --git a/contributor-guide/index.html b/contributor-guide/index.html
index 40f181483..52f0039f6 100644
--- a/contributor-guide/index.html
+++ b/contributor-guide/index.html
@@ -357,6 +357,7 @@ under the License.
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html">Comet Plugin Architecture</a></li>
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html#plugin-components">Plugin Components</a></li>
<li class="toctree-l2"><a class="reference internal" href="ffi.html">Arrow
FFI</a></li>
+<li class="toctree-l2"><a class="reference internal"
href="parquet_scans.html">Parquet Scans</a></li>
<li class="toctree-l2"><a class="reference internal"
href="development.html">Development Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="debugging.html">Debugging Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="benchmarking.html">Benchmarking Guide</a></li>
@@ -478,6 +479,10 @@ under the License.
<li class="toctree-l2"><a class="reference internal"
href="ffi.html#further-reading">Further Reading</a></li>
</ul>
</li>
+<li class="toctree-l1"><a class="reference internal"
href="parquet_scans.html">Parquet Scans</a><ul>
+<li class="toctree-l2"><a class="reference internal"
href="parquet_scans.html#s3-support">S3 Support</a></li>
+</ul>
+</li>
<li class="toctree-l1"><a class="reference internal"
href="development.html">Development Guide</a><ul>
<li class="toctree-l2"><a class="reference internal"
href="development.html#project-layout">Project Layout</a></li>
<li class="toctree-l2"><a class="reference internal"
href="development.html#development-setup">Development Setup</a></li>
diff --git a/contributor-guide/tracing.html
b/contributor-guide/parquet_scans.html
similarity index 55%
copy from contributor-guide/tracing.html
copy to contributor-guide/parquet_scans.html
index 74cffacaa..e0589b79b 100644
--- a/contributor-guide/tracing.html
+++ b/contributor-guide/parquet_scans.html
@@ -27,7 +27,7 @@ under the License.
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0"
/><meta name="viewport" content="width=device-width, initial-scale=1" />
- <title>Tracing — Apache DataFusion Comet documentation</title>
+ <title>Comet Parquet Scan Implementations — Apache DataFusion Comet
documentation</title>
@@ -61,12 +61,12 @@ under the License.
<script src="../_static/documentation_options.js?v=5929fcd5"></script>
<script src="../_static/doctools.js?v=9a2dae69"></script>
<script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
- <script>DOCUMENTATION_OPTIONS.pagename =
'contributor-guide/tracing';</script>
+ <script>DOCUMENTATION_OPTIONS.pagename =
'contributor-guide/parquet_scans';</script>
<script async="true" defer="true"
src="https://buttons.github.io/buttons.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
- <link rel="next" title="Profiling Native Code"
href="profiling_native_code.html" />
- <link rel="prev" title="Adding a New Expression"
href="adding_a_new_expression.html" />
+ <link rel="next" title="Comet Development Guide" href="development.html" />
+ <link rel="prev" title="Arrow FFI Usage in Comet" href="ffi.html" />
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<meta name="docsearch:language" content="en"/>
<meta name="docsearch:version" content="" />
@@ -357,11 +357,12 @@ under the License.
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html">Comet Plugin Architecture</a></li>
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html#plugin-components">Plugin Components</a></li>
<li class="toctree-l2"><a class="reference internal" href="ffi.html">Arrow
FFI</a></li>
+<li class="toctree-l2 current"><a class="current reference internal"
href="#">Parquet Scans</a></li>
<li class="toctree-l2"><a class="reference internal"
href="development.html">Development Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="debugging.html">Debugging Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="benchmarking.html">Benchmarking Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="adding_a_new_expression.html">Adding a New Expression</a></li>
-<li class="toctree-l2 current"><a class="current reference internal"
href="#">Tracing</a></li>
+<li class="toctree-l2"><a class="reference internal"
href="tracing.html">Tracing</a></li>
<li class="toctree-l2"><a class="reference internal"
href="profiling_native_code.html">Profiling Native Code</a></li>
<li class="toctree-l2"><a class="reference internal"
href="spark-sql-tests.html">Spark SQL Tests</a></li>
<li class="toctree-l2"><a class="reference internal"
href="roadmap.html">Roadmap</a></li>
@@ -414,7 +415,7 @@ under the License.
<li class="breadcrumb-item"><a href="index.html" class="nav-link">Comet
Contributor Guide</a></li>
- <li class="breadcrumb-item active" aria-current="page"><span
class="ellipsis">Tracing</span></li>
+ <li class="breadcrumb-item active" aria-current="page"><span
class="ellipsis">Comet Parquet Scan Implementations</span></li>
</ul>
</nav>
</div>
@@ -449,56 +450,138 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
-<section id="tracing">
-<h1>Tracing<a class="headerlink" href="#tracing" title="Link to this
heading">#</a></h1>
-<p>Tracing can be enabled by setting <code class="docutils literal
notranslate"><span
class="pre">spark.comet.tracing.enabled=true</span></code>.</p>
-<p>With this feature enabled, each Spark executor will write a JSON event log
file in
-Chrome’s <a class="reference external"
href="https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview?tab=t.0#heading=h.yr4qxyxotyw">Trace
Event Format</a>. The file will be written to the executor’s current working
-directory with the filename <code class="docutils literal notranslate"><span
class="pre">comet-event-trace.json</span></code>.</p>
-<p>Additionally, enabling the <code class="docutils literal notranslate"><span
class="pre">jemalloc</span></code> feature will enable tracing of native memory
allocations.</p>
-<div class="highlight-shell notranslate"><div
class="highlight"><pre><span></span>make<span class="w"> </span>release<span
class="w"> </span><span class="nv">COMET_FEATURES</span><span
class="o">=</span><span class="s2">"jemalloc"</span>
-</pre></div>
-</div>
-<p>Example output:</p>
-<div class="highlight-json notranslate"><div
class="highlight"><pre><span></span><span class="p">{</span><span class="w">
</span><span class="nt">"name"</span><span class="p">:</span><span
class="w"> </span><span class="s2">"decodeShuffleBlock"</span><span
class="p">,</span><span class="w"> </span><span
class="nt">"cat"</span><span class="p">:</span><span class="w">
</span><span class="s2">"PERF"</span><span class="p">,</span><span
class="w"> </spa [...]
-<span class="p">{</span><span class="w"> </span><span
class="nt">"name"</span><span class="p">:</span><span class="w">
</span><span class="s2">"decodeShuffleBlock"</span><span
class="p">,</span><span class="w"> </span><span
class="nt">"cat"</span><span class="p">:</span><span class="w">
</span><span class="s2">"PERF"</span><span class="p">,</span><span
class="w"> </span><span class="nt">"ph"</span><span
class="p">:</span><span class="w"> [...]
-<span class="p">{</span><span class="w"> </span><span
class="nt">"name"</span><span class="p">:</span><span class="w">
</span><span class="s2">"decodeShuffleBlock"</span><span
class="p">,</span><span class="w"> </span><span
class="nt">"cat"</span><span class="p">:</span><span class="w">
</span><span class="s2">"PERF"</span><span class="p">,</span><span
class="w"> </span><span class="nt">"ph"</span><span
class="p">:</span><span class="w"> [...]
-<span class="p">{</span><span class="w"> </span><span
class="nt">"name"</span><span class="p">:</span><span class="w">
</span><span class="s2">"decodeShuffleBlock"</span><span
class="p">,</span><span class="w"> </span><span
class="nt">"cat"</span><span class="p">:</span><span class="w">
</span><span class="s2">"PERF"</span><span class="p">,</span><span
class="w"> </span><span class="nt">"ph"</span><span
class="p">:</span><span class="w"> [...]
-<span class="p">{</span><span class="w"> </span><span
class="nt">"name"</span><span class="p">:</span><span class="w">
</span><span class="s2">"execute_plan"</span><span
class="p">,</span><span class="w"> </span><span
class="nt">"cat"</span><span class="p">:</span><span class="w">
</span><span class="s2">"PERF"</span><span class="p">,</span><span
class="w"> </span><span class="nt">"ph"</span><span
class="p">:</span><span class="w"> </span [...]
-<span class="p">{</span><span class="w"> </span><span
class="nt">"name"</span><span class="p">:</span><span class="w">
</span><span class="s2">"CometExecIterator_getNextBatch"</span><span
class="p">,</span><span class="w"> </span><span
class="nt">"cat"</span><span class="p">:</span><span class="w">
</span><span class="s2">"PERF"</span><span class="p">,</span><span
class="w"> </span><span class="nt">"ph"</span><span
class="p">:</span><span [...]
-<span class="p">{</span><span class="w"> </span><span
class="nt">"name"</span><span class="p">:</span><span class="w">
</span><span class="s2">"CometExecIterator_getNextBatch"</span><span
class="p">,</span><span class="w"> </span><span
class="nt">"cat"</span><span class="p">:</span><span class="w">
</span><span class="s2">"PERF"</span><span class="p">,</span><span
class="w"> </span><span class="nt">"ph"</span><span
class="p">:</span><span [...]
-</pre></div>
-</div>
-<p>Traces can be viewed with <a class="reference external"
href="https://github.com/catapult-project/catapult/blob/main/tracing/README.md">Trace
Viewer</a>.</p>
-<p>Example trace visualization:</p>
-<p><img alt="tracing" src="../_images/tracing.png" /></p>
-<section id="definition-of-labels">
-<h2>Definition of Labels<a class="headerlink" href="#definition-of-labels"
title="Link to this heading">#</a></h2>
+<section id="comet-parquet-scan-implementations">
+<h1>Comet Parquet Scan Implementations<a class="headerlink"
href="#comet-parquet-scan-implementations" title="Link to this
heading">#</a></h1>
+<p>Comet currently has three distinct implementations of the Parquet scan
operator. The configuration property
+<code class="docutils literal notranslate"><span
class="pre">spark.comet.scan.impl</span></code> is used to select an
implementation. The default setting is <code class="docutils literal
notranslate"><span class="pre">spark.comet.scan.impl=auto</span></code>, and
+Comet will choose the most appropriate implementation based on the Parquet
schema and other Comet configuration
+settings. Most users should not need to change this setting. However, it is
possible to force Comet to try to use
+a particular implementation for all scan operations by setting this
configuration property to one of the following
+implementations.</p>
<div class="pst-scrollable-table-container"><table class="table">
<thead>
-<tr class="row-odd"><th class="head"><p>Label</p></th>
-<th class="head"><p>Meaning</p></th>
+<tr class="row-odd"><th class="head"><p>Implementation</p></th>
+<th class="head"><p>Description</p></th>
</tr>
</thead>
<tbody>
-<tr class="row-even"><td><p>jvm_heapUsed</p></td>
-<td><p>JVM heap memory usage of live objects for the executor process</p></td>
+<tr class="row-even"><td><p><code class="docutils literal notranslate"><span
class="pre">native_comet</span></code></p></td>
+<td><p>This implementation provides strong compatibility with Spark but does
not support complex types. This is the original scan implementation in Comet
and may eventually be removed.</p></td>
+</tr>
+<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span
class="pre">native_iceberg_compat</span></code></p></td>
+<td><p>This implementation delegates to DataFusion’s <code class="docutils
literal notranslate"><span class="pre">DataSourceExec</span></code> but uses a
hybrid approach of JVM and native code. This scan is designed to be integrated
with Iceberg in the future.</p></td>
+</tr>
+<tr class="row-even"><td><p><code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code></p></td>
+<td><p>This experimental implementation delegates to DataFusion’s <code
class="docutils literal notranslate"><span
class="pre">DataSourceExec</span></code> for full native execution. There are
known compatibility issues when using this scan.</p></td>
+</tr>
+</tbody>
+</table>
+</div>
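+<p>As a minimal sketch of forcing one of the implementations in the table above, the fragment below passes <code>spark.comet.scan.impl</code> on the command line. The choice of <code>native_iceberg_compat</code> is only illustrative, and the other options needed to enable the Comet plugin are assumed to be set elsewhere and are omitted here.</p>

```shell
# Illustrative only: force a specific scan implementation.
# Leave spark.comet.scan.impl unset to keep the default (auto).
$SPARK_HOME/bin/spark-shell \
  --conf spark.comet.scan.impl=native_iceberg_compat
```

+<p>Because the property only asks Comet to try the chosen implementation, it is worth checking the query plan to confirm which scan was actually used.</p>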
+<p>The <code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code> and <code class="docutils literal
notranslate"><span class="pre">native_iceberg_compat</span></code> scans
provide the following benefits over the <code class="docutils literal
notranslate"><span class="pre">native_comet</span></code>
+implementation:</p>
+<ul class="simple">
+<li><p>Leverages the DataFusion community’s ongoing improvements to <code
class="docutils literal notranslate"><span
class="pre">DataSourceExec</span></code></p></li>
+<li><p>Provides support for reading complex types (structs, arrays, and
maps)</p></li>
+<li><p>Removes the use of reusable mutable-buffers in Comet, which is complex
to maintain</p></li>
+<li><p>Improves performance</p></li>
+</ul>
+<p>The <code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code> and <code class="docutils literal
notranslate"><span class="pre">native_iceberg_compat</span></code> scans share
the following limitations:</p>
+<ul class="simple">
+<li><p>When reading Parquet files written by systems other than Spark that
contain columns with the logical types <code class="docutils literal
notranslate"><span class="pre">UINT_8</span></code>
+or <code class="docutils literal notranslate"><span
class="pre">UINT_16</span></code>, Comet will produce different results than
Spark because Spark does not preserve or understand these
+logical types. Arrow-based readers, such as DataFusion and Comet, do respect
these types and read the data as unsigned
+rather than signed. By default, Comet will fall back to <code class="docutils
literal notranslate"><span class="pre">native_comet</span></code> when scanning
Parquet files containing <code class="docutils literal notranslate"><span
class="pre">byte</span></code> or <code class="docutils literal
notranslate"><span class="pre">short</span></code>
+types (regardless of the logical type). This behavior can be disabled by
setting
+<code class="docutils literal notranslate"><span
class="pre">spark.comet.scan.allowIncompatible=true</span></code>.</p></li>
+<li><p>No support for default values that are nested types (e.g., maps,
arrays, structs). Literal default values are supported.</p></li>
+</ul>
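+<p>The fallback described in the first limitation above can be turned off as sketched below. This assumes the remaining Comet plugin options are configured elsewhere. The scan implementation shown is just one of the two affected scans. With this setting, results for <code>UINT_8</code> and <code>UINT_16</code> columns may differ from Spark.</p>

```shell
# Sketch: opt out of the fallback to native_comet for byte/short columns.
# Results for unsigned logical types may then differ from Spark.
$SPARK_HOME/bin/spark-shell \
  --conf spark.comet.scan.impl=native_iceberg_compat \
  --conf spark.comet.scan.allowIncompatible=true
```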
+<p>The <code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code> scan has some additional
limitations:</p>
+<ul class="simple">
+<li><p>Bucketed scans are not supported</p></li>
+<li><p>No support for row indexes</p></li>
+<li><p><code class="docutils literal notranslate"><span
class="pre">PARQUET_FIELD_ID_READ_ENABLED</span></code> is not respected <a
class="reference external"
href="https://github.com/apache/datafusion-comet/issues/1758">#1758</a></p></li>
+<li><p>There are failures in the Spark SQL test suite <a class="reference
external"
href="https://github.com/apache/datafusion-comet/issues/1545">#1545</a></p></li>
+<li><p>Setting Spark configs <code class="docutils literal notranslate"><span
class="pre">ignoreMissingFiles</span></code> or <code class="docutils literal
notranslate"><span class="pre">ignoreCorruptFiles</span></code> to <code
class="docutils literal notranslate"><span class="pre">true</span></code> is
not compatible with Spark</p></li>
+</ul>
+<section id="s3-support">
+<h2>S3 Support<a class="headerlink" href="#s3-support" title="Link to this
heading">#</a></h2>
+<p>There are some differences in S3 support between the scan implementations.</p>
+<section id="native-comet">
+<h3><code class="docutils literal notranslate"><span
class="pre">native_comet</span></code><a class="headerlink"
href="#native-comet" title="Link to this heading">#</a></h3>
+<p>The <code class="docutils literal notranslate"><span
class="pre">native_comet</span></code> Parquet scan implementation reads data
from S3 using the <a class="reference external"
href="https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html">Hadoop-AWS
module</a>, which
+is identical to the approach commonly used with vanilla Spark. AWS credential
configuration and other Hadoop S3A
+configurations work the same way as in vanilla Spark.</p>
+</section>
+<section id="native-datafusion-and-native-iceberg-compat">
+<h3><code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code> and <code class="docutils literal
notranslate"><span class="pre">native_iceberg_compat</span></code><a
class="headerlink" href="#native-datafusion-and-native-iceberg-compat"
title="Link to this heading">#</a></h3>
+<p>The <code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code> and <code class="docutils literal
notranslate"><span class="pre">native_iceberg_compat</span></code> Parquet scan
implementations completely offload data loading
+to native code. They use the <a class="reference external"
href="https://crates.io/crates/object_store"><code class="docutils literal
notranslate"><span class="pre">object_store</span></code> crate</a> to read
data from S3 and
+support configuring S3 access using standard <a class="reference external"
href="https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#General_S3A_Client_configuration">Hadoop
S3A configurations</a> by translating them to
+the <code class="docutils literal notranslate"><span
class="pre">object_store</span></code> crate’s format.</p>
+<p>This implementation maintains compatibility with existing Hadoop S3A
configurations, so existing code will
+continue to work as long as the configurations are supported and can be
translated without loss of functionality.</p>
+<section id="additional-s3-configuration-options">
+<h4>Additional S3 Configuration Options<a class="headerlink"
href="#additional-s3-configuration-options" title="Link to this
heading">#</a></h4>
+<p>Beyond credential providers, the <code class="docutils literal
notranslate"><span class="pre">native_datafusion</span></code> implementation
supports additional S3 configuration options:</p>
+<div class="pst-scrollable-table-container"><table class="table">
+<thead>
+<tr class="row-odd"><th class="head"><p>Option</p></th>
+<th class="head"><p>Description</p></th>
</tr>
-<tr class="row-odd"><td><p>jemalloc_allocated</p></td>
-<td><p>Native memory usage for the executor process</p></td>
+</thead>
+<tbody>
+<tr class="row-even"><td><p><code class="docutils literal notranslate"><span
class="pre">fs.s3a.endpoint</span></code></p></td>
+<td><p>The endpoint of the S3 service</p></td>
</tr>
-<tr class="row-even"><td><p>task_memory_comet_NNN</p></td>
-<td><p>Off-heap memory allocated by Comet for query execution</p></td>
+<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span
class="pre">fs.s3a.endpoint.region</span></code></p></td>
+<td><p>The AWS region for the S3 service. If not specified, the region will be
auto-detected.</p></td>
</tr>
-<tr class="row-odd"><td><p>task_memory_spark_NNN</p></td>
-<td><p>On-heap & Off-heap memory allocated by Spark</p></td>
+<tr class="row-even"><td><p><code class="docutils literal notranslate"><span
class="pre">fs.s3a.path.style.access</span></code></p></td>
+<td><p>Whether to use path style access for the S3 service (true/false,
defaults to virtual hosted style)</p></td>
</tr>
-<tr class="row-even"><td><p>comet_shuffle_NNN</p></td>
-<td><p>Off-heap memory allocated by Comet for columnar shuffle</p></td>
+<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span
class="pre">fs.s3a.requester.pays.enabled</span></code></p></td>
+<td><p>Whether to enable requester pays for S3 requests (true/false)</p></td>
</tr>
</tbody>
</table>
</div>
+<p>All configuration options support bucket-specific overrides using the
pattern <code class="docutils literal notranslate"><span
class="pre">fs.s3a.bucket.{bucket-name}.{option}</span></code>.</p>
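+<p>As a hedged sketch of the bucket-specific override pattern, the fragment below sets the region and path-style access for a single bucket only. The bucket name <code>my-bucket</code> and the region value are placeholders rather than values from this guide. The <code>spark.hadoop.</code> prefix is the usual way to pass Hadoop options through Spark, and the other Comet plugin options are omitted.</p>

```shell
# Hypothetical bucket name and region; substitute your own values.
# Options without the bucket.<name> segment still apply to all other buckets.
$SPARK_HOME/bin/spark-shell \
  --conf spark.comet.scan.impl=native_datafusion \
  --conf spark.hadoop.fs.s3a.bucket.my-bucket.endpoint.region=us-west-2 \
  --conf spark.hadoop.fs.s3a.bucket.my-bucket.path.style.access=true
```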
+</section>
+<section id="examples">
+<h4>Examples<a class="headerlink" href="#examples" title="Link to this
heading">#</a></h4>
+<p>The following examples demonstrate how to configure S3 access with the
<code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code> Parquet scan implementation using
different authentication methods.</p>
+<p><strong>Example 1: Simple Credentials</strong></p>
+<p>This example shows how to access a private S3 bucket using an access key
and secret key. The <code class="docutils literal notranslate"><span
class="pre">fs.s3a.aws.credentials.provider</span></code> configuration can be
omitted since <code class="docutils literal notranslate"><span
class="pre">org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</span></code>
is included in Hadoop S3A’s default credential provider chain.</p>
+<div class="highlight-shell notranslate"><div
class="highlight"><pre><span></span><span
class="nv">$SPARK_HOME</span>/bin/spark-shell<span class="w"> </span><span
class="se">\</span>
+...
+--conf<span class="w"> </span>spark.comet.scan.impl<span
class="o">=</span>native_datafusion<span class="w"> </span><span
class="se">\</span>
+--conf<span class="w"> </span>spark.hadoop.fs.s3a.access.key<span
class="o">=</span>my-access-key<span class="w"> </span><span class="se">\</span>
+--conf<span class="w"> </span>spark.hadoop.fs.s3a.secret.key<span
class="o">=</span>my-secret-key
+...
+</pre></div>
+</div>
+<p><strong>Example 2: Assume Role with Web Identity Token</strong></p>
+<p>This example demonstrates using an assumed role credential to access a
private S3 bucket, where the base credential for assuming the role is provided
by a web identity token credentials provider.</p>
+<div class="highlight-shell notranslate"><div
class="highlight"><pre><span></span><span
class="nv">$SPARK_HOME</span>/bin/spark-shell<span class="w"> </span><span
class="se">\</span>
+...
+--conf<span class="w"> </span>spark.comet.scan.impl<span
class="o">=</span>native_datafusion<span class="w"> </span><span
class="se">\</span>
+--conf<span class="w">
</span>spark.hadoop.fs.s3a.aws.credentials.provider<span
class="o">=</span>org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider<span
class="w"> </span><span class="se">\</span>
+--conf<span class="w"> </span>spark.hadoop.fs.s3a.assumed.role.arn<span
class="o">=</span>arn:aws:iam::123456789012:role/my-role<span class="w">
</span><span class="se">\</span>
+--conf<span class="w">
</span>spark.hadoop.fs.s3a.assumed.role.session.name<span
class="o">=</span>my-session<span class="w"> </span><span class="se">\</span>
+--conf<span class="w">
</span>spark.hadoop.fs.s3a.assumed.role.credentials.provider<span
class="o">=</span>com.amazonaws.auth.WebIdentityTokenCredentialsProvider
+...
+</pre></div>
+</div>
+</section>
+<section id="limitations">
+<h4>Limitations<a class="headerlink" href="#limitations" title="Link to this
heading">#</a></h4>
+<p>The S3 support of <code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code> has the following limitations:</p>
+<ol class="arabic simple">
+<li><p><strong>Partial Hadoop S3A configuration support</strong>: Not all
Hadoop S3A configurations are currently supported. Only the configurations
listed in the tables above are translated and applied to the underlying <code
class="docutils literal notranslate"><span
class="pre">object_store</span></code> crate.</p></li>
+<li><p><strong>Custom credential providers</strong>: Custom implementations of
AWS credential providers are not supported. The implementation only supports
the standard credential providers listed in the table above. We are planning to
add support for custom credential providers through a JNI-based adapter that
will allow calling Java credential providers from native code. See <a
class="reference external"
href="https://github.com/apache/datafusion-comet/issues/1829">issue #1829</a>
for [...]
+</ol>
+</section>
+</section>
</section>
</section>
@@ -513,20 +596,20 @@ directory with the filename <code class="docutils literal
notranslate"><span cla
<div class="prev-next-area">
<a class="left-prev"
- href="adding_a_new_expression.html"
+ href="ffi.html"
title="previous page">
<i class="fa-solid fa-angle-left"></i>
<div class="prev-next-info">
<p class="prev-next-subtitle">previous</p>
- <p class="prev-next-title">Adding a New Expression</p>
+ <p class="prev-next-title">Arrow FFI Usage in Comet</p>
</div>
</a>
<a class="right-next"
- href="profiling_native_code.html"
+ href="development.html"
title="next page">
<div class="prev-next-info">
<p class="prev-next-subtitle">next</p>
- <p class="prev-next-title">Profiling Native Code</p>
+ <p class="prev-next-title">Comet Development Guide</p>
</div>
<i class="fa-solid fa-angle-right"></i>
</a>
diff --git a/contributor-guide/plugin_overview.html
b/contributor-guide/plugin_overview.html
index 43df5e530..cd632fef3 100644
--- a/contributor-guide/plugin_overview.html
+++ b/contributor-guide/plugin_overview.html
@@ -357,6 +357,7 @@ under the License.
<li class="toctree-l2 current"><a class="current reference internal"
href="#">Comet Plugin Architecture</a></li>
<li class="toctree-l2"><a class="reference internal"
href="#plugin-components">Plugin Components</a></li>
<li class="toctree-l2"><a class="reference internal" href="ffi.html">Arrow
FFI</a></li>
+<li class="toctree-l2"><a class="reference internal"
href="parquet_scans.html">Parquet Scans</a></li>
<li class="toctree-l2"><a class="reference internal"
href="development.html">Development Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="debugging.html">Debugging Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="benchmarking.html">Benchmarking Guide</a></li>
diff --git a/contributor-guide/profiling_native_code.html
b/contributor-guide/profiling_native_code.html
index eef562b10..12afcbc1c 100644
--- a/contributor-guide/profiling_native_code.html
+++ b/contributor-guide/profiling_native_code.html
@@ -357,6 +357,7 @@ under the License.
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html">Comet Plugin Architecture</a></li>
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html#plugin-components">Plugin Components</a></li>
<li class="toctree-l2"><a class="reference internal" href="ffi.html">Arrow
FFI</a></li>
+<li class="toctree-l2"><a class="reference internal"
href="parquet_scans.html">Parquet Scans</a></li>
<li class="toctree-l2"><a class="reference internal"
href="development.html">Development Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="debugging.html">Debugging Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="benchmarking.html">Benchmarking Guide</a></li>
diff --git a/contributor-guide/roadmap.html b/contributor-guide/roadmap.html
index ce9b33aa3..8bd2dd75d 100644
--- a/contributor-guide/roadmap.html
+++ b/contributor-guide/roadmap.html
@@ -357,6 +357,7 @@ under the License.
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html">Comet Plugin Architecture</a></li>
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html#plugin-components">Plugin Components</a></li>
<li class="toctree-l2"><a class="reference internal" href="ffi.html">Arrow
FFI</a></li>
+<li class="toctree-l2"><a class="reference internal"
href="parquet_scans.html">Parquet Scans</a></li>
<li class="toctree-l2"><a class="reference internal"
href="development.html">Development Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="debugging.html">Debugging Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="benchmarking.html">Benchmarking Guide</a></li>
diff --git a/contributor-guide/spark-sql-tests.html
b/contributor-guide/spark-sql-tests.html
index 6652284c8..33b687156 100644
--- a/contributor-guide/spark-sql-tests.html
+++ b/contributor-guide/spark-sql-tests.html
@@ -357,6 +357,7 @@ under the License.
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html">Comet Plugin Architecture</a></li>
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html#plugin-components">Plugin Components</a></li>
<li class="toctree-l2"><a class="reference internal" href="ffi.html">Arrow
FFI</a></li>
+<li class="toctree-l2"><a class="reference internal"
href="parquet_scans.html">Parquet Scans</a></li>
<li class="toctree-l2"><a class="reference internal"
href="development.html">Development Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="debugging.html">Debugging Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="benchmarking.html">Benchmarking Guide</a></li>
diff --git a/contributor-guide/tracing.html b/contributor-guide/tracing.html
index 74cffacaa..2245a3ce0 100644
--- a/contributor-guide/tracing.html
+++ b/contributor-guide/tracing.html
@@ -357,6 +357,7 @@ under the License.
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html">Comet Plugin Architecture</a></li>
<li class="toctree-l2"><a class="reference internal"
href="plugin_overview.html#plugin-components">Plugin Components</a></li>
<li class="toctree-l2"><a class="reference internal" href="ffi.html">Arrow
FFI</a></li>
+<li class="toctree-l2"><a class="reference internal"
href="parquet_scans.html">Parquet Scans</a></li>
<li class="toctree-l2"><a class="reference internal"
href="development.html">Development Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="debugging.html">Debugging Guide</a></li>
<li class="toctree-l2"><a class="reference internal"
href="benchmarking.html">Benchmarking Guide</a></li>
diff --git a/objects.inv b/objects.inv
index 6771033be..cc6ac2420 100644
Binary files a/objects.inv and b/objects.inv differ
diff --git a/searchindex.js b/searchindex.js
index 60ce38c92..32384baf1 100644
--- a/searchindex.js
+++ b/searchindex.js
@@ -1 +1 @@
-Search.setIndex({"alltitles": {"1. Install Comet": [[17, "install-comet"]],
"2. Clone Spark and Apply Diff": [[17, "clone-spark-and-apply-diff"]], "3. Run
Spark SQL Tests": [[17, "run-spark-sql-tests"]], "ANSI Mode": [[20,
"ansi-mode"], [33, "ansi-mode"], [73, "ansi-mode"]], "ANSI mode": [[46,
"ansi-mode"], [59, "ansi-mode"]], "API Differences Between Spark Versions":
[[3, "api-differences-between-spark-versions"]], "ASF Links": [[2, null], [2,
null]], "Accelerating Apache Iceberg Parque [...]
\ No newline at end of file
+Search.setIndex({"alltitles": {"1. Install Comet": [[18, "install-comet"]],
"2. Clone Spark and Apply Diff": [[18, "clone-spark-and-apply-diff"]], "3. Run
Spark SQL Tests": [[18, "run-spark-sql-tests"]], "ANSI Mode": [[21,
"ansi-mode"], [34, "ansi-mode"], [74, "ansi-mode"]], "ANSI mode": [[47,
"ansi-mode"], [60, "ansi-mode"]], "API Differences Between Spark Versions":
[[3, "api-differences-between-spark-versions"]], "ASF Links": [[2, null], [2,
null]], "Accelerating Apache Iceberg Parque [...]
\ No newline at end of file
diff --git a/user-guide/latest/compatibility.html
b/user-guide/latest/compatibility.html
index b55c99b35..5903394bd 100644
--- a/user-guide/latest/compatibility.html
+++ b/user-guide/latest/compatibility.html
@@ -464,71 +464,11 @@ under the License.
<p>This guide offers information about areas of functionality where there are
known differences.</p>
<section id="parquet">
<h2>Parquet<a class="headerlink" href="#parquet" title="Link to this
heading">#</a></h2>
-<section id="data-type-support">
-<h3>Data Type Support<a class="headerlink" href="#data-type-support"
title="Link to this heading">#</a></h3>
-<p>Comet does not support reading decimals encoded in binary format.</p>
-</section>
-<section id="parquet-scans">
-<h3>Parquet Scans<a class="headerlink" href="#parquet-scans" title="Link to
this heading">#</a></h3>
-<p>Comet currently has three distinct implementations of the Parquet scan
operator. The configuration property
-<code class="docutils literal notranslate"><span
class="pre">spark.comet.scan.impl</span></code> is used to select an
implementation. The default setting is <code class="docutils literal
notranslate"><span class="pre">spark.comet.scan.impl=auto</span></code>, and
-Comet will choose the most appropriate implementation based on the Parquet
schema and other Comet configuration
-settings. Most users should not need to change this setting. However, it is
possible to force Comet to try and use
-a particular implementation for all scan operations by setting this
configuration property to one of the following
-implementations.</p>
-<div class="pst-scrollable-table-container"><table class="table">
-<thead>
-<tr class="row-odd"><th class="head"><p>Implementation</p></th>
-<th class="head"><p>Description</p></th>
-</tr>
-</thead>
-<tbody>
-<tr class="row-even"><td><p><code class="docutils literal notranslate"><span
class="pre">native_comet</span></code></p></td>
-<td><p>This implementation provides strong compatibility with Spark but does
not support complex types. This is the original scan implementation in Comet
and may eventually be removed.</p></td>
-</tr>
-<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span
class="pre">native_iceberg_compat</span></code></p></td>
-<td><p>This implementation delegates to DataFusion’s <code class="docutils
literal notranslate"><span class="pre">DataSourceExec</span></code> but uses a
hybrid approach of JVM and native code. This scan is designed to be integrated
with Iceberg in the future.</p></td>
-</tr>
-<tr class="row-even"><td><p><code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code></p></td>
-<td><p>This experimental implementation delegates to DataFusion’s <code
class="docutils literal notranslate"><span
class="pre">DataSourceExec</span></code> for full native execution. There are
known compatibility issues when using this scan.</p></td>
-</tr>
-</tbody>
-</table>
-</div>
-<p>The <code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code> and <code class="docutils literal
notranslate"><span class="pre">native_iceberg_compat</span></code> scans
provide the following benefits over the <code class="docutils literal
notranslate"><span class="pre">native_comet</span></code>
-implementation:</p>
-<ul class="simple">
-<li><p>Leverages the DataFusion community’s ongoing improvements to <code
class="docutils literal notranslate"><span
class="pre">DataSourceExec</span></code></p></li>
-<li><p>Provides support for reading complex types (structs, arrays, and
maps)</p></li>
-<li><p>Removes the use of reusable mutable-buffers in Comet, which is complex
to maintain</p></li>
-<li><p>Improves performance</p></li>
-</ul>
-<p>The <code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code> and <code class="docutils literal
notranslate"><span class="pre">native_iceberg_compat</span></code> scans share
the following limitations:</p>
+<p>Comet has the following limitations when reading Parquet files:</p>
<ul class="simple">
-<li><p>When reading Parquet files written by systems other than Spark that
contain columns with the logical types <code class="docutils literal
notranslate"><span class="pre">UINT_8</span></code>
-or <code class="docutils literal notranslate"><span
class="pre">UINT_16</span></code>, Comet will produce different results than
Spark because Spark does not preserve or understand these
-logical types. Arrow-based readers, such as DataFusion and Comet do respect
these types and read the data as unsigned
-rather than signed. By default, Comet will fall back to <code class="docutils
literal notranslate"><span class="pre">native_comet</span></code> when scanning
Parquet files containing <code class="docutils literal notranslate"><span
class="pre">byte</span></code> or <code class="docutils literal
notranslate"><span class="pre">short</span></code>
-types (regardless of the logical type). This behavior can be disabled by
setting
-<code class="docutils literal notranslate"><span
class="pre">spark.comet.scan.allowIncompatible=true</span></code>.</p></li>
+<li><p>Comet does not support reading decimals encoded in binary
format.</p></li>
<li><p>No support for default values that are nested types (e.g., maps,
arrays, structs). Literal default values are supported.</p></li>
</ul>
-<p>The <code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code> scan has some additional
limitations:</p>
-<ul class="simple">
-<li><p>Bucketed scans are not supported</p></li>
-<li><p>No support for row indexes</p></li>
-<li><p><code class="docutils literal notranslate"><span
class="pre">PARQUET_FIELD_ID_READ_ENABLED</span></code> is not respected <a
class="reference external"
href="https://github.com/apache/datafusion-comet/issues/1758">#1758</a></p></li>
-<li><p>There are failures in the Spark SQL test suite <a class="reference
external"
href="https://github.com/apache/datafusion-comet/issues/1545">#1545</a></p></li>
-<li><p>Setting Spark configs <code class="docutils literal notranslate"><span
class="pre">ignoreMissingFiles</span></code> or <code class="docutils literal
notranslate"><span class="pre">ignoreCorruptFiles</span></code> to <code
class="docutils literal notranslate"><span class="pre">true</span></code> is
not compatible with Spark</p></li>
-</ul>
-</section>
-<section id="s3-support-with-native-iceberg-compat">
-<h3>S3 Support with <code class="docutils literal notranslate"><span
class="pre">native_iceberg_compat</span></code><a class="headerlink"
href="#s3-support-with-native-iceberg-compat" title="Link to this
heading">#</a></h3>
-<ul class="simple">
-<li><p>When using the default AWS S3 endpoint (no custom endpoint configured),
a valid region is required. Comet
-will attempt to resolve the region if it is not provided.</p></li>
-</ul>
-</section>
</section>
<section id="ansi-mode">
<h2>ANSI Mode<a class="headerlink" href="#ansi-mode" title="Link to this
heading">#</a></h2>
@@ -551,18 +491,14 @@ So Comet will add additional normalization expression of
NaN and zero for compar
<p>Sorting on floating-point data types (or complex types containing
floating-point values) is not compatible with
Spark if the data contains both zero and negative zero. This is likely an edge
case that is not of concern for many users
and sorting on floating-point data can be enabled by setting <code
class="docutils literal notranslate"><span
class="pre">spark.comet.expression.SortOrder.allowIncompatible=true</span></code>.</p>
-<p>There is a known bug with using count(distinct) within aggregate queries,
where each NaN value will be counted
-separately <a class="reference external"
href="https://github.com/apache/datafusion-comet/issues/1824">#1824</a>.</p>
</section>
<section id="incompatible-expressions">
<h2>Incompatible Expressions<a class="headerlink"
href="#incompatible-expressions" title="Link to this heading">#</a></h2>
-<p>Some Comet native expressions are not 100% compatible with Spark and are
disabled by default. These expressions
-will fall back to Spark but can be enabled by setting <code class="docutils
literal notranslate"><span
class="pre">spark.comet.expression.allowIncompatible=true</span></code>.</p>
-</section>
-<section id="array-expressions">
-<h2>Array Expressions<a class="headerlink" href="#array-expressions"
title="Link to this heading">#</a></h2>
-<p>Comet has experimental support for a number of array expressions. These are
experimental and currently marked
-as incompatible and can be enabled by setting <code class="docutils literal
notranslate"><span
class="pre">spark.comet.expression.allowIncompatible=true</span></code>.</p>
+<p>Expressions that are not 100% Spark-compatible will fall back to Spark by
default and can be enabled by setting
+<code class="docutils literal notranslate"><span
class="pre">spark.comet.expression.EXPRNAME.allowIncompatible=true</span></code>,
where <code class="docutils literal notranslate"><span
class="pre">EXPRNAME</span></code> is the Spark expression class name. See
+the <a class="reference internal" href="expressions.html"><span class="std
std-doc">Comet Supported Expressions Guide</span></a> for more information on
this configuration setting.</p>
+<p>It is also possible to specify <code class="docutils literal
notranslate"><span
class="pre">spark.comet.expression.allowIncompatible=true</span></code> to
enable all
+incompatible expressions.</p>
</section>
<section id="regular-expressions">
<h2>Regular Expressions<a class="headerlink" href="#regular-expressions"
title="Link to this heading">#</a></h2>
@@ -577,7 +513,7 @@ this can be overridden by setting <code class="docutils
literal notranslate"><sp
<li><p><strong>Compatible</strong>: The results match Apache Spark</p></li>
<li><p><strong>Incompatible</strong>: The results may match Apache Spark for
some inputs, but there are known issues where some inputs
will result in incorrect results or exceptions. The query stage will fall back
to Spark by default. Setting
-<code class="docutils literal notranslate"><span
class="pre">spark.comet.expression.allowIncompatible=true</span></code> will
allow all incompatible casts to run natively in Comet, but this is not
+<code class="docutils literal notranslate"><span
class="pre">spark.comet.expression.Cast.allowIncompatible=true</span></code>
will allow all incompatible casts to run natively in Comet, but this is not
recommended for production use.</p></li>
<li><p><strong>Unsupported</strong>: Comet does not provide a native version
of this cast expression and the query stage will fall back to
Spark.</p></li>
diff --git a/user-guide/latest/datasources.html
b/user-guide/latest/datasources.html
index 0cc56d422..0b951336e 100644
--- a/user-guide/latest/datasources.html
+++ b/user-guide/latest/datasources.html
@@ -598,25 +598,16 @@ Input<span class="w"> </span><span
class="o">[</span><span class="m">3</span><sp
</section>
<section id="s3">
<h2>S3<a class="headerlink" href="#s3" title="Link to this heading">#</a></h2>
-<p>DataFusion Comet has <a class="reference internal"
href="compatibility.html#parquet-scans"><span class="std std-ref">multiple
Parquet scan implementations</span></a> that use different approaches to read
data from S3.</p>
-<section id="native-comet">
-<h3><code class="docutils literal notranslate"><span
class="pre">native_comet</span></code><a class="headerlink"
href="#native-comet" title="Link to this heading">#</a></h3>
-<p>The default <code class="docutils literal notranslate"><span
class="pre">native_comet</span></code> Parquet scan implementation reads data
from S3 using the <a class="reference external"
href="https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html">Hadoop-AWS
module</a>, which is identical to the approach commonly used with vanilla
Spark. AWS credential configuration and other Hadoop S3A configurations works
the same way as in vanilla Spark.</p>
-</section>
-<section id="native-datafusion-and-native-iceberg-compat">
-<h3><code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code> and <code class="docutils literal
notranslate"><span class="pre">native_iceberg_compat</span></code><a
class="headerlink" href="#native-datafusion-and-native-iceberg-compat"
title="Link to this heading">#</a></h3>
-<p>The <code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code> and <code class="docutils literal
notranslate"><span class="pre">native_iceberg_compat</span></code> Parquet scan
implementations completely offload data loading to native code. They use the <a
class="reference external" href="https://crates.io/crates/object_store"><code
class="docutils literal notranslate"><span
class="pre">object_store</span></code> crate</a> to read data from S3 and sup
[...]
-<p>This implementation maintains compatibility with existing Hadoop S3A
configurations, so existing code will continue to work as long as the
configurations are supported and can be translated without loss of
functionality.</p>
<section id="root-ca-certificates">
-<h4>Root CA Certificates<a class="headerlink" href="#root-ca-certificates"
title="Link to this heading">#</a></h4>
-<p>One major difference between <code class="docutils literal
notranslate"><span class="pre">native_comet</span></code> and the other scan
implementations is the mechanism for discovering Root
-CA Certificates. The <code class="docutils literal notranslate"><span
class="pre">native_comet</span></code> scan uses the JVM to read CA
Certificates from the Java Trust Store, but the native
-scan implementations <code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code> and <code class="docutils literal
notranslate"><span class="pre">native_iceberg_compat</span></code> use system
Root CA Certificates (typically stored
+<h3>Root CA Certificates<a class="headerlink" href="#root-ca-certificates"
title="Link to this heading">#</a></h3>
+<p>One major difference between Spark and Comet is the mechanism for
discovering Root
+CA Certificates. Spark uses the JVM to read CA Certificates from the Java
Trust Store, but native Comet
+scans use system Root CA Certificates (typically stored
in <code class="docutils literal notranslate"><span
class="pre">/etc/ssl/certs</span></code> on Linux). These scans will not be
able to interact with S3 if the Root CA Certificates are not
installed.</p>
</section>
<section id="supported-credential-providers">
-<h4>Supported Credential Providers<a class="headerlink"
href="#supported-credential-providers" title="Link to this heading">#</a></h4>
+<h3>Supported Credential Providers<a class="headerlink"
href="#supported-credential-providers" title="Link to this heading">#</a></h3>
<p>AWS credential providers can be configured using the <code class="docutils
literal notranslate"><span
class="pre">fs.s3a.aws.credentials.provider</span></code> configuration. The
following table shows the supported credential providers and their
configuration options:</p>
<div class="pst-scrollable-table-container"><table class="table">
<thead>
@@ -667,68 +658,6 @@ installed.</p>
</div>
<p>Multiple credential providers can be specified in a comma-separated list
using the <code class="docutils literal notranslate"><span
class="pre">fs.s3a.aws.credentials.provider</span></code> configuration, just
as Hadoop AWS supports. If <code class="docutils literal notranslate"><span
class="pre">fs.s3a.aws.credentials.provider</span></code> is not configured,
Hadoop S3A’s default credential provider chain will be used. All configuration
options also support bucket-specific overrides [...]
</section>
-<section id="additional-s3-configuration-options">
-<h4>Additional S3 Configuration Options<a class="headerlink"
href="#additional-s3-configuration-options" title="Link to this
heading">#</a></h4>
-<p>Beyond credential providers, the <code class="docutils literal
notranslate"><span class="pre">native_datafusion</span></code> implementation
supports additional S3 configuration options:</p>
-<div class="pst-scrollable-table-container"><table class="table">
-<thead>
-<tr class="row-odd"><th class="head"><p>Option</p></th>
-<th class="head"><p>Description</p></th>
-</tr>
-</thead>
-<tbody>
-<tr class="row-even"><td><p><code class="docutils literal notranslate"><span
class="pre">fs.s3a.endpoint</span></code></p></td>
-<td><p>The endpoint of the S3 service</p></td>
-</tr>
-<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span
class="pre">fs.s3a.endpoint.region</span></code></p></td>
-<td><p>The AWS region for the S3 service. If not specified, the region will be
auto-detected.</p></td>
-</tr>
-<tr class="row-even"><td><p><code class="docutils literal notranslate"><span
class="pre">fs.s3a.path.style.access</span></code></p></td>
-<td><p>Whether to use path style access for the S3 service (true/false,
defaults to virtual hosted style)</p></td>
-</tr>
-<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span
class="pre">fs.s3a.requester.pays.enabled</span></code></p></td>
-<td><p>Whether to enable requester pays for S3 requests (true/false)</p></td>
-</tr>
-</tbody>
-</table>
-</div>
-<p>All configuration options support bucket-specific overrides using the
pattern <code class="docutils literal notranslate"><span
class="pre">fs.s3a.bucket.{bucket-name}.{option}</span></code>.</p>
-</section>
-<section id="examples">
-<h4>Examples<a class="headerlink" href="#examples" title="Link to this
heading">#</a></h4>
-<p>The following examples demonstrate how to configure S3 access with the
<code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code> Parquet scan implementation using
different authentication methods.</p>
-<p><strong>Example 1: Simple Credentials</strong></p>
-<p>This example shows how to access a private S3 bucket using an access key
and secret key. The <code class="docutils literal notranslate"><span
class="pre">fs.s3a.aws.credentials.provider</span></code> configuration can be
omitted since <code class="docutils literal notranslate"><span
class="pre">org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</span></code>
is included in Hadoop S3A’s default credential provider chain.</p>
-<div class="highlight-shell notranslate"><div
class="highlight"><pre><span></span><span
class="nv">$SPARK_HOME</span>/bin/spark-shell<span class="w"> </span><span
class="se">\</span>
-...
---conf<span class="w"> </span>spark.comet.scan.impl<span
class="o">=</span>native_datafusion<span class="w"> </span><span
class="se">\</span>
---conf<span class="w"> </span>spark.hadoop.fs.s3a.access.key<span
class="o">=</span>my-access-key<span class="w"> </span><span class="se">\</span>
---conf<span class="w"> </span>spark.hadoop.fs.s3a.secret.key<span
class="o">=</span>my-secret-key
-...
-</pre></div>
-</div>
-<p><strong>Example 2: Assume Role with Web Identity Token</strong></p>
-<p>This example demonstrates using an assumed role credential to access a
private S3 bucket, where the base credential for assuming the role is provided
by a web identity token credentials provider.</p>
-<div class="highlight-shell notranslate"><div
class="highlight"><pre><span></span><span
class="nv">$SPARK_HOME</span>/bin/spark-shell<span class="w"> </span><span
class="se">\</span>
-...
---conf<span class="w"> </span>spark.comet.scan.impl<span
class="o">=</span>native_datafusion<span class="w"> </span><span
class="se">\</span>
---conf<span class="w">
</span>spark.hadoop.fs.s3a.aws.credentials.provider<span
class="o">=</span>org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider<span
class="w"> </span><span class="se">\</span>
---conf<span class="w"> </span>spark.hadoop.fs.s3a.assumed.role.arn<span
class="o">=</span>arn:aws:iam::123456789012:role/my-role<span class="w">
</span><span class="se">\</span>
---conf<span class="w">
</span>spark.hadoop.fs.s3a.assumed.role.session.name<span
class="o">=</span>my-session<span class="w"> </span><span class="se">\</span>
---conf<span class="w">
</span>spark.hadoop.fs.s3a.assumed.role.credentials.provider<span
class="o">=</span>com.amazonaws.auth.WebIdentityTokenCredentialsProvider
-...
-</pre></div>
-</div>
-</section>
-<section id="limitations">
-<h4>Limitations<a class="headerlink" href="#limitations" title="Link to this
heading">#</a></h4>
-<p>The S3 support of <code class="docutils literal notranslate"><span
class="pre">native_datafusion</span></code> has the following limitations:</p>
-<ol class="arabic simple">
-<li><p><strong>Partial Hadoop S3A configuration support</strong>: Not all
Hadoop S3A configurations are currently supported. Only the configurations
listed in the tables above are translated and applied to the underlying <code
class="docutils literal notranslate"><span
class="pre">object_store</span></code> crate.</p></li>
-<li><p><strong>Custom credential providers</strong>: Custom implementations of
AWS credential providers are not supported. The implementation only supports
the standard credential providers listed in the table above. We are planning to
add support for custom credential providers through a JNI-based adapter that
will allow calling Java credential providers from native code. See <a
class="reference external"
href="https://github.com/apache/datafusion-comet/issues/1829">issue #1829</a>
for [...]
-</ol>
-</section>
-</section>
</section>
</section>