comphead commented on code in PR #3433:
URL: https://github.com/apache/datafusion-comet/pull/3433#discussion_r2818181835
##########
docs/source/contributor-guide/parquet_scans.md:
##########
@@ -19,71 +19,59 @@ under the License.
# Comet Parquet Scan Implementations
-Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
-`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, and
-Comet will choose the most appropriate implementation based on the Parquet schema and other Comet configuration
-settings. Most users should not need to change this setting. However, it is possible to force Comet to try and use
-a particular implementation for all scan operations by setting this configuration property to one of the following
-implementations.
-
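Editorial aside (not part of the diff above): selecting a scan implementation is an ordinary Spark configuration setting. A minimal sketch follows; the plugin class `org.apache.spark.CometPlugin` on the classpath and the example path are assumptions for illustration, not taken from this PR.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: force a specific Comet Parquet scan implementation
// instead of letting the default `auto` setting choose one.
// Assumes the Comet plugin jar is already on the classpath.
val spark = SparkSession.builder()
  .appName("comet-scan-impl-demo")
  .config("spark.plugins", "org.apache.spark.CometPlugin")
  .config("spark.comet.scan.impl", "native_iceberg_compat")
  .getOrCreate()

// Every Parquet scan issued from this session now tries the selected
// implementation rather than the automatic choice.
val df = spark.read.parquet("/tmp/example.parquet") // hypothetical path
df.show()
```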
-| Implementation          | Description                                                                                                                                                                                                  |
-| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `native_comet`          | **Deprecated.** This implementation provides strong compatibility with Spark but does not support complex types. This is the original scan implementation in Comet and will be removed in a future release. |
-| `native_iceberg_compat` | This implementation delegates to DataFusion's `DataSourceExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future.                         |
-| `native_datafusion`     | This experimental implementation delegates to DataFusion's `DataSourceExec` for full native execution. There are known compatibility issues when using this scan.                                           |
-
-The `native_datafusion` and `native_iceberg_compat` scans provide the following benefits over the `native_comet`
-implementation:
-
-- Leverages the DataFusion community's ongoing improvements to `DataSourceExec`
-- Provides support for reading complex types (structs, arrays, and maps)
-- Delegates Parquet decoding to native Rust code rather than JVM-side decoding
-- Improves performance
-
-> **Note on mutable buffers:** Both `native_comet` and `native_iceberg_compat` use reusable mutable buffers
-> when transferring data from JVM to native code via Arrow FFI. The `native_iceberg_compat` implementation uses
-> DataFusion's native Parquet reader for data columns, bypassing Comet's mutable buffer infrastructure entirely.
-> However, partition columns still use `ConstantColumnReader`, which relies on Comet's mutable buffers that are
-> reused across batches. This means native operators that buffer data (such as `SortExec` or `ShuffleWriterExec`)
-> must perform deep copies to avoid data corruption.
-> See the [FFI documentation](ffi.md) for details on the `arrow_ffi_safe` flag and ownership semantics.
-
-The `native_datafusion` and `native_iceberg_compat` scans share the following limitations:
-
-- When reading Parquet files written by systems other than Spark that contain columns with the logical type `UINT_8`
-  (unsigned 8-bit integers), Comet may produce different results than Spark. Spark maps `UINT_8` to `ShortType`, but
-  Comet's Arrow-based readers respect the unsigned type and read the data as unsigned rather than signed. Since Comet
-  cannot distinguish `ShortType` columns that came from `UINT_8` versus signed `INT16`, by default Comet falls back to
-  Spark when scanning Parquet files containing `ShortType` columns. This behavior can be disabled by setting
-  `spark.comet.scan.unsignedSmallIntSafetyCheck=false` (see the configuration sketch after this list). Note that
-  `ByteType` columns are always safe because they can only come from signed `INT8`, where truncation preserves the
-  signed value.
-- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are
-  supported.
-- No support for datetime rebasing detection or the `spark.comet.exceptionOnDatetimeRebase` configuration. When reading
-  Parquet files containing dates or timestamps written before Spark 3.0 (which used a hybrid Julian/Gregorian calendar),
-  the `native_comet` implementation can detect these legacy values and either throw an exception or read them without
-  rebasing. The DataFusion-based implementations do not have this detection capability and will read all
-  dates/timestamps as if they were written using the Proleptic Gregorian calendar. This may produce incorrect results
-  for dates before October 15, 1582.
-- No support for Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`,
-  Spark uses the V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API, so Comet
-  will fall back to `native_comet` when V2 is enabled.
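Editorial aside: a minimal sketch of how the two fallback-related settings named in this list might be configured, assuming a session with Comet already enabled. Note that shrinking `spark.sql.sources.useV1SourceList` to just `parquet` is an illustrative simplification, since the default value of that property covers other sources as well.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the fallback-related settings mentioned above.
// Assumes Comet is already enabled in this session.
val spark = SparkSession.builder()
  .appName("comet-scan-fallback-demo")
  // Keep Parquet on the V1 DataSource path; the DataFusion-based scans
  // do not support the V2 API. (Listing only "parquet" here is an
  // illustrative simplification of the default list.)
  .config("spark.sql.sources.useV1SourceList", "parquet")
  // Disable the ShortType safety check only if the files are known to
  // have been written by Spark, so no UINT_8 columns can be present.
  .config("spark.comet.scan.unsignedSmallIntSafetyCheck", "false")
  .getOrCreate()
```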
-
-The `native_datafusion` scan has some additional limitations:
+Comet currently has two distinct implementations of the Parquet scan operator.
+
+The two implementations are `native_datafusion` and `native_iceberg_compat`. They both delegate to DataFusion's
Review Comment:
```suggestion
Comet currently has the following distinct implementations of the Parquet scan operator:

- `native_datafusion`
- `native_iceberg_compat`
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]