mbutrovich commented on code in PR #3433:
URL: https://github.com/apache/datafusion-comet/pull/3433#discussion_r2818264888
##########
docs/source/contributor-guide/parquet_scans.md:
##########
@@ -19,71 +19,59 @@ under the License.
# Comet Parquet Scan Implementations
-Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
-`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, and
-Comet will choose the most appropriate implementation based on the Parquet schema and other Comet configuration
-settings. Most users should not need to change this setting. However, it is possible to force Comet to try and use
-a particular implementation for all scan operations by setting this configuration property to one of the following
-implementations.
-
-| Implementation          | Description                                                                                                                                                                                                  |
-| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `native_comet`          | **Deprecated.** This implementation provides strong compatibility with Spark but does not support complex types. This is the original scan implementation in Comet and will be removed in a future release. |
-| `native_iceberg_compat` | This implementation delegates to DataFusion's `DataSourceExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future.                        |
-| `native_datafusion`     | This experimental implementation delegates to DataFusion's `DataSourceExec` for full native execution. There are known compatibility issues when using this scan.                                           |
-
-The `native_datafusion` and `native_iceberg_compat` scans provide the following benefits over the `native_comet`
-implementation:
-
-- Leverages the DataFusion community's ongoing improvements to `DataSourceExec`
-- Provides support for reading complex types (structs, arrays, and maps)
-- Delegates Parquet decoding to native Rust code rather than JVM-side decoding
-- Improves performance
-
-> **Note on mutable buffers:** Both `native_comet` and `native_iceberg_compat` use reusable mutable buffers
-> when transferring data from JVM to native code via Arrow FFI. The `native_iceberg_compat` implementation uses DataFusion's native Parquet reader for data columns, bypassing Comet's mutable buffer infrastructure entirely. However, partition columns still use `ConstantColumnReader`, which relies on Comet's mutable buffers that are reused across batches. This means native operators that buffer data (such as `SortExec` or `ShuffleWriterExec`) must perform deep copies to avoid data corruption.
-> See the [FFI documentation](ffi.md) for details on the `arrow_ffi_safe` flag and ownership semantics.
-
-The `native_datafusion` and `native_iceberg_compat` scans share the following limitations:
-
-- When reading Parquet files written by systems other than Spark that contain columns with the logical type `UINT_8`
-  (unsigned 8-bit integers), Comet may produce different results than Spark. Spark maps `UINT_8` to `ShortType`, but
-  Comet's Arrow-based readers respect the unsigned type and read the data as unsigned rather than signed. Since Comet
-  cannot distinguish `ShortType` columns that came from `UINT_8` versus signed `INT16`, by default Comet falls back to
-  Spark when scanning Parquet files containing `ShortType` columns. This behavior can be disabled by setting
-  `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType` columns are always safe because they can
-  only come from signed `INT8`, where truncation preserves the signed value.
-- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
-- No support for datetime rebasing detection or the `spark.comet.exceptionOnDatetimeRebase` configuration. When reading
-  Parquet files containing dates or timestamps written before Spark 3.0 (which used a hybrid Julian/Gregorian calendar),
-  the `native_comet` implementation can detect these legacy values and either throw an exception or read them without
-  rebasing. The DataFusion-based implementations do not have this detection capability and will read all dates/timestamps
-  as if they were written using the Proleptic Gregorian calendar. This may produce incorrect results for dates before
-  October 15, 1582.
-- No support for Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`,
-  Spark uses the V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API, so Comet
-  will fall back to `native_comet` when V2 is enabled.
-
-The `native_datafusion` scan has some additional limitations:
+Comet currently has two distinct implementations of the Parquet scan operator.
+
+The two implementations are `native_datafusion` and `native_iceberg_compat`. They both delegate to DataFusion's
+`DataSourceExec`. The main difference between these implementations is that `native_datafusion` runs fully natively, and
+`native_iceberg_compat` is a hybrid JVM/Rust implementation that can support some Spark features that
+`native_datafusion` can not, but has some performance overhead due to crossing the JVM/Rust boundary.
+
+The configuration property
+`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, which
+currently always uses the `native_iceberg_compat` implementation. Most users should not need to change this setting.
+However, it is possible to force Comet to try and use a particular implementation for all scan operations by setting
Review Comment:
"to try and" is not doing anything here. Suggest removing.
##########
docs/source/contributor-guide/parquet_scans.md:
##########
@@ -19,71 +19,59 @@ under the License.
# Comet Parquet Scan Implementations
-Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
-`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, and
-Comet will choose the most appropriate implementation based on the Parquet schema and other Comet configuration
-settings. Most users should not need to change this setting. However, it is possible to force Comet to try and use
-a particular implementation for all scan operations by setting this configuration property to one of the following
-implementations.
-
-| Implementation          | Description                                                                                                                                                                                                  |
-| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `native_comet`          | **Deprecated.** This implementation provides strong compatibility with Spark but does not support complex types. This is the original scan implementation in Comet and will be removed in a future release. |
-| `native_iceberg_compat` | This implementation delegates to DataFusion's `DataSourceExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future.                        |
-| `native_datafusion`     | This experimental implementation delegates to DataFusion's `DataSourceExec` for full native execution. There are known compatibility issues when using this scan.                                           |
-
-The `native_datafusion` and `native_iceberg_compat` scans provide the following benefits over the `native_comet`
-implementation:
-
-- Leverages the DataFusion community's ongoing improvements to `DataSourceExec`
-- Provides support for reading complex types (structs, arrays, and maps)
-- Delegates Parquet decoding to native Rust code rather than JVM-side decoding
-- Improves performance
-
-> **Note on mutable buffers:** Both `native_comet` and `native_iceberg_compat` use reusable mutable buffers
-> when transferring data from JVM to native code via Arrow FFI. The `native_iceberg_compat` implementation uses DataFusion's native Parquet reader for data columns, bypassing Comet's mutable buffer infrastructure entirely. However, partition columns still use `ConstantColumnReader`, which relies on Comet's mutable buffers that are reused across batches. This means native operators that buffer data (such as `SortExec` or `ShuffleWriterExec`) must perform deep copies to avoid data corruption.
-> See the [FFI documentation](ffi.md) for details on the `arrow_ffi_safe` flag and ownership semantics.
-
-The `native_datafusion` and `native_iceberg_compat` scans share the following limitations:
-
-- When reading Parquet files written by systems other than Spark that contain columns with the logical type `UINT_8`
-  (unsigned 8-bit integers), Comet may produce different results than Spark. Spark maps `UINT_8` to `ShortType`, but
-  Comet's Arrow-based readers respect the unsigned type and read the data as unsigned rather than signed. Since Comet
-  cannot distinguish `ShortType` columns that came from `UINT_8` versus signed `INT16`, by default Comet falls back to
-  Spark when scanning Parquet files containing `ShortType` columns. This behavior can be disabled by setting
-  `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType` columns are always safe because they can
-  only come from signed `INT8`, where truncation preserves the signed value.
-- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
-- No support for datetime rebasing detection or the `spark.comet.exceptionOnDatetimeRebase` configuration. When reading
-  Parquet files containing dates or timestamps written before Spark 3.0 (which used a hybrid Julian/Gregorian calendar),
-  the `native_comet` implementation can detect these legacy values and either throw an exception or read them without
-  rebasing. The DataFusion-based implementations do not have this detection capability and will read all dates/timestamps
-  as if they were written using the Proleptic Gregorian calendar. This may produce incorrect results for dates before
-  October 15, 1582.
-- No support for Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`,
-  Spark uses the V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API, so Comet
-  will fall back to `native_comet` when V2 is enabled.
-
-The `native_datafusion` scan has some additional limitations:
+Comet currently has two distinct implementations of the Parquet scan operator.
+
+The two implementations are `native_datafusion` and `native_iceberg_compat`. They both delegate to DataFusion's
+`DataSourceExec`. The main difference between these implementations is that `native_datafusion` runs fully natively, and
+`native_iceberg_compat` is a hybrid JVM/Rust implementation that can support some Spark features that
Review Comment:
This sentence is hard to follow with the subject switching back and forth, making it unclear what "but has some performance overhead due to crossing the JVM/Rust boundary" is actually referring to. Suggest breaking it up.
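
As context for this section, the benefits list in the hunk above mentions support for complex types (structs, arrays, and maps). Here is a hedged Scala sketch of what that covers, round-tripping a Parquet file with those types; the file path and session setup are hypothetical, and the scan would only go through Comet when Comet is enabled and the scan is supported.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch: round-trip a Parquet file containing struct, array,
// and map columns, the complex types mentioned in the benefits list above.
val spark = SparkSession.builder()
  .appName("comet-complex-types-example") // hypothetical
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  (1, ("alice", 30), Seq(1, 2, 3), Map("k" -> "v")),
  (2, ("bob", 40), Seq(4, 5), Map("x" -> "y"))
).toDF("id", "person", "scores", "attrs") // "person" becomes a struct column

df.write.mode("overwrite").parquet("/tmp/comet_complex_types") // hypothetical path

// With Comet enabled, this scan would be handled by one of the two
// implementations described in the doc above.
spark.read.parquet("/tmp/comet_complex_types").show(truncate = false)
```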
##########
docs/source/contributor-guide/parquet_scans.md:
##########
@@ -19,71 +19,59 @@ under the License.
# Comet Parquet Scan Implementations
-Comet currently has three distinct implementations of the Parquet scan
operator. The configuration property
-`spark.comet.scan.impl` is used to select an implementation. The default
setting is `spark.comet.scan.impl=auto`, and
-Comet will choose the most appropriate implementation based on the Parquet
schema and other Comet configuration
-settings. Most users should not need to change this setting. However, it is
possible to force Comet to try and use
-a particular implementation for all scan operations by setting this
configuration property to one of the following
-implementations.
-
-| Implementation | Description
|
-| ----------------------- |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
-| `native_comet` | **Deprecated.** This implementation provides
strong compatibility with Spark but does not support complex types. This is the
original scan implementation in Comet and will be removed in a future release. |
-| `native_iceberg_compat` | This implementation delegates to DataFusion's
`DataSourceExec` but uses a hybrid approach of JVM and native code. This scan
is designed to be integrated with Iceberg in the future.
|
-| `native_datafusion` | This experimental implementation delegates to
DataFusion's `DataSourceExec` for full native execution. There are known
compatibility issues when using this scan.
|
-
-The `native_datafusion` and `native_iceberg_compat` scans provide the
following benefits over the `native_comet`
-implementation:
-
-- Leverages the DataFusion community's ongoing improvements to `DataSourceExec`
-- Provides support for reading complex types (structs, arrays, and maps)
-- Delegates Parquet decoding to native Rust code rather than JVM-side decoding
-- Improves performance
-
-> **Note on mutable buffers:** Both `native_comet` and `native_iceberg_compat`
use reusable mutable buffers
-> when transferring data from JVM to native code via Arrow FFI. The
`native_iceberg_compat` implementation uses DataFusion's native Parquet reader
for data columns, bypassing Comet's mutable buffer infrastructure entirely.
However, partition columns still use `ConstantColumnReader`, which relies on
Comet's mutable buffers that are reused across batches. This means native
operators that buffer data (such as `SortExec` or `ShuffleWriterExec`) must
perform deep copies to avoid data corruption.
-> See the [FFI documentation](ffi.md) for details on the `arrow_ffi_safe` flag
and ownership semantics.
-
-The `native_datafusion` and `native_iceberg_compat` scans share the following
limitations:
-
-- When reading Parquet files written by systems other than Spark that contain
columns with the logical type `UINT_8`
- (unsigned 8-bit integers), Comet may produce different results than Spark.
Spark maps `UINT_8` to `ShortType`, but
- Comet's Arrow-based readers respect the unsigned type and read the data as
unsigned rather than signed. Since Comet
- cannot distinguish `ShortType` columns that came from `UINT_8` versus signed
`INT16`, by default Comet falls back to
- Spark when scanning Parquet files containing `ShortType` columns. This
behavior can be disabled by setting
- `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType`
columns are always safe because they can
- only come from signed `INT8`, where truncation preserves the signed value.
-- No support for default values that are nested types (e.g., maps, arrays,
structs). Literal default values are supported.
-- No support for datetime rebasing detection or the
`spark.comet.exceptionOnDatetimeRebase` configuration. When reading
- Parquet files containing dates or timestamps written before Spark 3.0 (which
used a hybrid Julian/Gregorian calendar),
- the `native_comet` implementation can detect these legacy values and either
throw an exception or read them without
- rebasing. The DataFusion-based implementations do not have this detection
capability and will read all dates/timestamps
- as if they were written using the Proleptic Gregorian calendar. This may
produce incorrect results for dates before
- October 15, 1582.
-- No support for Spark's Datasource V2 API. When
`spark.sql.sources.useV1SourceList` does not include `parquet`,
- Spark uses the V2 API for Parquet scans. The DataFusion-based
implementations only support the V1 API, so Comet
- will fall back to `native_comet` when V2 is enabled.
-
-The `native_datafusion` scan has some additional limitations:
+Comet currently has two distinct implementations of the Parquet scan operator.
+
+The two implementations are `native_datafusion` and `native_iceberg_compat`.
They both delegate to DataFusion's
+`DataSourceExec`. The main difference between these implementations is that
`native_datafusion` runs fully natively, and
+`native_iceberg_compat` is a hybrid JVM/Rust implementation that can support
some Spark features that
+`native_datafusion` can not, but has some performance overhead due to crossing
the JVM/Rust boundary.
+
+The configuration property
+`spark.comet.scan.impl` is used to select an implementation. The default
setting is `spark.comet.scan.impl=auto`, which
+currently always uses the `native_iceberg_compat` implementation. Most users
should not need to change this setting.
+However, it is possible to force Comet to try and use a particular
implementation for all scan operations by setting
+this configuration property to one of the following implementations.
+
+The following unsupported features are shared by both scans and cause Comet to
fall back to Spark:
+
+- `ShortType` columns, by default. When reading Parquet files written by
systems other than Spark that contain
+ columns with the logical type `UINT_8` (unsigned 8-bit integers), Comet may
produce different results than Spark.
+ Spark maps `UINT_8` to `ShortType`, but Comet's Arrow-based readers respect
the unsigned type and read the data as
+ unsigned rather than signed. Since Comet cannot distinguish `ShortType`
columns that came from `UINT_8` versus
+ signed `INT16`, by default Comet falls back to Spark when scanning Parquet
files containing `ShortType` columns.
+ This behavior can be disabled by setting
`spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType`
+ columns are always safe because they can only come from signed `INT8`, where
truncation preserves the signed value.
+- Default values that are nested types (e.g., maps, arrays, structs). Literal
default values are supported.
+- Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not
include `parquet`, Spark uses the
+ V2 API for Parquet scans. The DataFusion-based implementations only support
the V1 API.
+- Spark metadata columns (e.g., `_metadata.file_path`)
+- No support for Dynamic Partition Pruning (DPP)
+
+The following shared limitation may produce incorrect results without falling
back to Spark:
+
+- No support for datetime rebasing detection or the
`spark.comet.exceptionOnDatetimeRebase` configuration. When
+ reading Parquet files containing dates or timestamps written before Spark
3.0 (which used a hybrid
+ Julian/Gregorian calendar), dates/timestamps will be read as if they were
written using the Proleptic Gregorian
+ calendar. This may produce incorrect results for dates before October 15,
1582.
+
+The `native_datafusion` scan has some additional limitations. All of these
cause Comet to fall back to Spark.
Review Comment:
These additional limitations are mostly related to Parquet metadata columns.
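
Since the new limitations list above names two configuration properties, here is a hedged Scala sketch of how they might be set. The key names come from the doc text; the values shown are illustrative choices, and the rest of the session setup is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch of the two properties named in the limitations above.
val spark = SparkSession.builder()
  .appName("comet-scan-config-example") // hypothetical
  .master("local[*]")
  // Spark's default useV1SourceList already includes parquet; the override is
  // shown only to make explicit that the DataFusion-based scans need the V1 API.
  .config("spark.sql.sources.useV1SourceList", "parquet")
  // Disable the ShortType safety check only if the files being read are known
  // not to contain UINT_8 columns written by non-Spark systems.
  .config("spark.comet.scan.unsignedSmallIntSafetyCheck", "false")
  .getOrCreate()
```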
##########
docs/source/contributor-guide/ffi.md:
##########
@@ -177,9 +177,8 @@ message Scan {
#### When ownership is NOT transferred to native:
-If the data originates from `native_comet` scan (deprecated, will be removed in a future release) or from
-`native_iceberg_compat` in some cases, then ownership is not transferred to native and the JVM may re-use the
-underlying buffers in the future.
+If the data originates from a scan that uses mutable buffers (such Iceberg scans using the Iceberg Java integration path),
Review Comment:
"Iceberg Java" is a bit ambiguous, since the native reader still needs Iceberg Java for planning. We could link to the hybrid reader here:
https://datafusion.apache.org/comet/user-guide/latest/iceberg.html#hybrid-reader
I am also interested in standardizing terminology by referring to this codepath as a legacy path.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]