comphead commented on code in PR #3433:
URL: https://github.com/apache/datafusion-comet/pull/3433#discussion_r2818155829


##########
docs/source/contributor-guide/parquet_scans.md:
##########
@@ -19,71 +19,59 @@ under the License.
 
 # Comet Parquet Scan Implementations
 
-Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
-`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, and
-Comet will choose the most appropriate implementation based on the Parquet schema and other Comet configuration
-settings. Most users should not need to change this setting. However, it is possible to force Comet to try and use
-a particular implementation for all scan operations by setting this configuration property to one of the following
-implementations.
-
-| Implementation          | Description |
-| ----------------------- | ----------- |
-| `native_comet`          | **Deprecated.** This implementation provides strong compatibility with Spark but does not support complex types. This is the original scan implementation in Comet and will be removed in a future release. |
-| `native_iceberg_compat` | This implementation delegates to DataFusion's `DataSourceExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future. |
-| `native_datafusion`     | This experimental implementation delegates to DataFusion's `DataSourceExec` for full native execution. There are known compatibility issues when using this scan. |
-
-The `native_datafusion` and `native_iceberg_compat` scans provide the following benefits over the `native_comet`
-implementation:
-
-- Leverages the DataFusion community's ongoing improvements to `DataSourceExec`
-- Provides support for reading complex types (structs, arrays, and maps)
-- Delegates Parquet decoding to native Rust code rather than JVM-side decoding
-- Improves performance
-
-> **Note on mutable buffers:** Both `native_comet` and `native_iceberg_compat` use reusable mutable buffers
-> when transferring data from JVM to native code via Arrow FFI. The `native_iceberg_compat` implementation uses DataFusion's native Parquet reader for data columns, bypassing Comet's mutable buffer infrastructure entirely. However, partition columns still use `ConstantColumnReader`, which relies on Comet's mutable buffers that are reused across batches. This means native operators that buffer data (such as `SortExec` or `ShuffleWriterExec`) must perform deep copies to avoid data corruption.
-> See the [FFI documentation](ffi.md) for details on the `arrow_ffi_safe` flag and ownership semantics.
-
-The `native_datafusion` and `native_iceberg_compat` scans share the following limitations:
-
-- When reading Parquet files written by systems other than Spark that contain columns with the logical type `UINT_8`
-  (unsigned 8-bit integers), Comet may produce different results than Spark. Spark maps `UINT_8` to `ShortType`, but
-  Comet's Arrow-based readers respect the unsigned type and read the data as unsigned rather than signed. Since Comet
-  cannot distinguish `ShortType` columns that came from `UINT_8` versus signed `INT16`, by default Comet falls back to
-  Spark when scanning Parquet files containing `ShortType` columns. This behavior can be disabled by setting
-  `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType` columns are always safe because they can
-  only come from signed `INT8`, where truncation preserves the signed value.
-- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
-- No support for datetime rebasing detection or the `spark.comet.exceptionOnDatetimeRebase` configuration. When reading
-  Parquet files containing dates or timestamps written before Spark 3.0 (which used a hybrid Julian/Gregorian calendar),
-  the `native_comet` implementation can detect these legacy values and either throw an exception or read them without
-  rebasing. The DataFusion-based implementations do not have this detection capability and will read all dates/timestamps
-  as if they were written using the Proleptic Gregorian calendar. This may produce incorrect results for dates before
-  October 15, 1582.
-- No support for Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`,
-  Spark uses the V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API, so Comet
-  will fall back to `native_comet` when V2 is enabled.
-
-The `native_datafusion` scan has some additional limitations:
+Comet currently has two distinct implementations of the Parquet scan operator.
+
+The two implementations are `native_datafusion` and `native_iceberg_compat`. They both delegate to DataFusion's
+`DataSourceExec`. The main difference between these implementations is that `native_datafusion` runs fully natively, and
+`native_iceberg_compat` is a hybrid JVM/Rust implementation that can support some Spark features that
+`native_datafusion` cannot, but has some performance overhead due to crossing the JVM/Rust boundary.
+
+The configuration property `spark.comet.scan.impl` is used to select an implementation. The default setting is
+`spark.comet.scan.impl=auto`, which currently always uses the `native_iceberg_compat` implementation. Most users
+should not need to change this setting. However, it is possible to force Comet to try to use a particular
+implementation for all scan operations by setting this configuration property to one of the following implementations.

Review Comment:
   ```suggestion
   this configuration property to one of the following implementations. For example: `--conf spark.comet.scan.impl=native_datafusion`
   ```
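
For context, a minimal sketch of how such a `--conf` flag is passed at launch time. This is an illustration, not text from the PR: the jar filename is a placeholder, and `spark.plugins=org.apache.spark.CometPlugin` is assumed from Comet's standard setup instructions.

```shell
# Launch a Spark shell with Comet enabled and force a specific Parquet scan
# implementation. Valid values for spark.comet.scan.impl are listed in the
# table in this doc (the default is auto).
# NOTE: the --jars path is a placeholder for a real Comet build artifact.
spark-shell \
  --jars comet-spark-SNAPSHOT.jar \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.scan.impl=native_datafusion
```

The same `--conf` pair works with `spark-submit`; it only overrides the scan implementation choice and leaves all other Comet defaults in place.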



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

