mbutrovich commented on code in PR #3433:
URL: https://github.com/apache/datafusion-comet/pull/3433#discussion_r2818264888
##########
docs/source/contributor-guide/parquet_scans.md:
##########
@@ -19,71 +19,59 @@ under the License.
# Comet Parquet Scan Implementations
-Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
-`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, and
-Comet will choose the most appropriate implementation based on the Parquet schema and other Comet configuration
-settings. Most users should not need to change this setting. However, it is possible to force Comet to try and use
-a particular implementation for all scan operations by setting this configuration property to one of the following
-implementations.
-
-| Implementation          | Description                                                                                                                                                                                                  |
-| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `native_comet`          | **Deprecated.** This implementation provides strong compatibility with Spark but does not support complex types. This is the original scan implementation in Comet and will be removed in a future release. |
-| `native_iceberg_compat` | This implementation delegates to DataFusion's `DataSourceExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future.                        |
-| `native_datafusion`     | This experimental implementation delegates to DataFusion's `DataSourceExec` for full native execution. There are known compatibility issues when using this scan.                                           |
-
-The `native_datafusion` and `native_iceberg_compat` scans provide the following benefits over the `native_comet`
-implementation:
-
-- Leverages the DataFusion community's ongoing improvements to `DataSourceExec`
-- Provides support for reading complex types (structs, arrays, and maps)
-- Delegates Parquet decoding to native Rust code rather than JVM-side decoding
-- Improves performance
-
-> **Note on mutable buffers:** Both `native_comet` and `native_iceberg_compat` use reusable mutable buffers
-> when transferring data from JVM to native code via Arrow FFI. The `native_iceberg_compat` implementation uses DataFusion's native Parquet reader for data columns, bypassing Comet's mutable buffer infrastructure entirely. However, partition columns still use `ConstantColumnReader`, which relies on Comet's mutable buffers that are reused across batches. This means native operators that buffer data (such as `SortExec` or `ShuffleWriterExec`) must perform deep copies to avoid data corruption.
-> See the [FFI documentation](ffi.md) for details on the `arrow_ffi_safe` flag and ownership semantics.
-
-The `native_datafusion` and `native_iceberg_compat` scans share the following limitations:
-
-- When reading Parquet files written by systems other than Spark that contain columns with the logical type `UINT_8`
-  (unsigned 8-bit integers), Comet may produce different results than Spark. Spark maps `UINT_8` to `ShortType`, but
-  Comet's Arrow-based readers respect the unsigned type and read the data as unsigned rather than signed. Since Comet
-  cannot distinguish `ShortType` columns that came from `UINT_8` versus signed `INT16`, by default Comet falls back to
-  Spark when scanning Parquet files containing `ShortType` columns. This behavior can be disabled by setting
-  `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType` columns are always safe because they can
-  only come from signed `INT8`, where truncation preserves the signed value.
-- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
-- No support for datetime rebasing detection or the `spark.comet.exceptionOnDatetimeRebase` configuration. When reading
-  Parquet files containing dates or timestamps written before Spark 3.0 (which used a hybrid Julian/Gregorian calendar),
-  the `native_comet` implementation can detect these legacy values and either throw an exception or read them without
-  rebasing. The DataFusion-based implementations do not have this detection capability and will read all dates/timestamps
-  as if they were written using the Proleptic Gregorian calendar. This may produce incorrect results for dates before
-  October 15, 1582.
-- No support for Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`,
-  Spark uses the V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API, so Comet
-  will fall back to `native_comet` when V2 is enabled.
-
-The `native_datafusion` scan has some additional limitations:
+Comet currently has two distinct implementations of the Parquet scan operator.
+
+The two implementations are `native_datafusion` and `native_iceberg_compat`. They both delegate to DataFusion's
+`DataSourceExec`. The main difference between these implementations is that `native_datafusion` runs fully natively, and
+`native_iceberg_compat` is a hybrid JVM/Rust implementation that can support some Spark features that
+`native_datafusion` can not, but has some performance overhead due to crossing the JVM/Rust boundary.
+
+The configuration property
+`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, which
+currently always uses the `native_iceberg_compat` implementation. Most users should not need to change this setting.
+However, it is possible to force Comet to try and use a particular implementation for all scan operations by setting
Review Comment:
"to try and" is not doing anything here. Suggest removing.
##########
docs/source/contributor-guide/parquet_scans.md:
##########
@@ -19,71 +19,59 @@ under the License.
# Comet Parquet Scan Implementations
-Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
-`spark.comet.scan.impl` is used to select an implementation. The default setting is `spark.comet.scan.impl=auto`, and
-Comet will choose the most appropriate implementation based on the Parquet schema and other Comet configuration
-settings. Most users should not need to change this setting. However, it is possible to force Comet to try and use
-a particular implementation for all scan operations by setting this configuration property to one of the following
-implementations.
-
-| Implementation          | Description                                                                                                                                                                                                  |
-| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `native_comet`          | **Deprecated.** This implementation provides strong compatibility with Spark but does not support complex types. This is the original scan implementation in Comet and will be removed in a future release. |
-| `native_iceberg_compat` | This implementation delegates to DataFusion's `DataSourceExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future.                        |
-| `native_datafusion`     | This experimental implementation delegates to DataFusion's `DataSourceExec` for full native execution. There are known compatibility issues when using this scan.                                           |
-
-The `native_datafusion` and `native_iceberg_compat` scans provide the following benefits over the `native_comet`
-implementation:
-
-- Leverages the DataFusion community's ongoing improvements to `DataSourceExec`
-- Provides support for reading complex types (structs, arrays, and maps)
-- Delegates Parquet decoding to native Rust code rather than JVM-side decoding
-- Improves performance
-
-> **Note on mutable buffers:** Both `native_comet` and `native_iceberg_compat` use reusable mutable buffers
-> when transferring data from JVM to native code via Arrow FFI. The `native_iceberg_compat` implementation uses DataFusion's native Parquet reader for data columns, bypassing Comet's mutable buffer infrastructure entirely. However, partition columns still use `ConstantColumnReader`, which relies on Comet's mutable buffers that are reused across batches. This means native operators that buffer data (such as `SortExec` or `ShuffleWriterExec`) must perform deep copies to avoid data corruption.
-> See the [FFI documentation](ffi.md) for details on the `arrow_ffi_safe` flag and ownership semantics.
-
-The `native_datafusion` and `native_iceberg_compat` scans share the following limitations:
-
-- When reading Parquet files written by systems other than Spark that contain columns with the logical type `UINT_8`
-  (unsigned 8-bit integers), Comet may produce different results than Spark. Spark maps `UINT_8` to `ShortType`, but
-  Comet's Arrow-based readers respect the unsigned type and read the data as unsigned rather than signed. Since Comet
-  cannot distinguish `ShortType` columns that came from `UINT_8` versus signed `INT16`, by default Comet falls back to
-  Spark when scanning Parquet files containing `ShortType` columns. This behavior can be disabled by setting
-  `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType` columns are always safe because they can
-  only come from signed `INT8`, where truncation preserves the signed value.
-- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
-- No support for datetime rebasing detection or the `spark.comet.exceptionOnDatetimeRebase` configuration. When reading
-  Parquet files containing dates or timestamps written before Spark 3.0 (which used a hybrid Julian/Gregorian calendar),
-  the `native_comet` implementation can detect these legacy values and either throw an exception or read them without
-  rebasing. The DataFusion-based implementations do not have this detection capability and will read all dates/timestamps
-  as if they were written using the Proleptic Gregorian calendar. This may produce incorrect results for dates before
-  October 15, 1582.
-- No support for Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`,
-  Spark uses the V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API, so Comet
-  will fall back to `native_comet` when V2 is enabled.
-
-The `native_datafusion` scan has some additional limitations:
+Comet currently has two distinct implementations of the Parquet scan operator.
+
+The two implementations are `native_datafusion` and `native_iceberg_compat`. They both delegate to DataFusion's
+`DataSourceExec`. The main difference between these implementations is that `native_datafusion` runs fully natively, and
+`native_iceberg_compat` is a hybrid JVM/Rust implementation that can support some Spark features that
Review Comment:
This sentence is hard to follow with the subject switching back and forth, making it unclear what "but has some performance overhead due to crossing the JVM/Rust boundary" is actually referring to. Suggest breaking it up.
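
As context for this section, the benefits list in the hunk above mentions support for complex types (structs, arrays, and maps). Here is a hedged Scala sketch of what that covers, round-tripping a Parquet file with those types; the file path and session setup are hypothetical, and the scan would only go through Comet when Comet is enabled and the scan is supported.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch: round-trip a Parquet file containing struct, array,
// and map columns, the complex types mentioned in the benefits list above.
val spark = SparkSession.builder()
  .appName("comet-complex-types-example") // hypothetical
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  (1, ("alice", 30), Seq(1, 2, 3), Map("k" -> "v")),
  (2, ("bob", 40), Seq(4, 5), Map("x" -> "y"))
).toDF("id", "person", "scores", "attrs") // "person" becomes a struct column

df.write.mode("overwrite").parquet("/tmp/comet_complex_types") // hypothetical path

// With Comet enabled, this scan would be handled by one of the two
// implementations described in the doc above.
spark.read.parquet("/tmp/comet_complex_types").show(truncate = false)
```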
##########
docs/source/contributor-guide/parquet_scans.md:
##########
@@ -19,71 +19,59 @@ under the License.
# Comet Parquet Scan Implementations
-Comet currently has three distinct implementations of the Parquet scan
operator. The configuration property
-`spark.comet.scan.impl` is used to select an implementation. The default
setting is `spark.comet.scan.impl=auto`, and
-Comet will choose the most appropriate implementation based on the Parquet
schema and other Comet configuration
-settings. Most users should not need to change this setting. However, it is
possible to force Comet to try and use
-a particular implementation for all scan operations by setting this
configuration property to one of the following
-implementations.
-
-| Implementation | Description
|
-| ----------------------- |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
-| `native_comet` | **Deprecated.** This implementation provides
strong compatibility with Spark but does not support complex types. This is the
original scan implementation in Comet and will be removed in a future release. |
-| `native_iceberg_compat` | This implementation delegates to DataFusion's
`DataSourceExec` but uses a hybrid approach of JVM and native code. This scan
is designed to be integrated with Iceberg in the future.
|
-| `native_datafusion` | This experimental implementation delegates to
DataFusion's `DataSourceExec` for full native execution. There are known
compatibility issues when using this scan.
|
-
-The `native_datafusion` and `native_iceberg_compat` scans provide the
following benefits over the `native_comet`
-implementation:
-
-- Leverages the DataFusion community's ongoing improvements to `DataSourceExec`
-- Provides support for reading complex types (structs, arrays, and maps)
-- Delegates Parquet decoding to native Rust code rather than JVM-side decoding
-- Improves performance
-
-> **Note on mutable buffers:** Both `native_comet` and `native_iceberg_compat`
use reusable mutable buffers
-> when transferring data from JVM to native code via Arrow FFI. The
`native_iceberg_compat` implementation uses DataFusion's native Parquet reader
for data columns, bypassing Comet's mutable buffer infrastructure entirely.
However, partition columns still use `ConstantColumnReader`, which relies on
Comet's mutable buffers that are reused across batches. This means native
operators that buffer data (such as `SortExec` or `ShuffleWriterExec`) must
perform deep copies to avoid data corruption.
-> See the [FFI documentation](ffi.md) for details on the `arrow_ffi_safe` flag
and ownership semantics.
-
-The `native_datafusion` and `native_iceberg_compat` scans share the following
limitations:
-
-- When reading Parquet files written by systems other than Spark that contain
columns with the logical type `UINT_8`
- (unsigned 8-bit integers), Comet may produce different results than Spark.
Spark maps `UINT_8` to `ShortType`, but
- Comet's Arrow-based readers respect the unsigned type and read the data as
unsigned rather than signed. Since Comet
- cannot distinguish `ShortType` columns that came from `UINT_8` versus signed
`INT16`, by default Comet falls back to
- Spark when scanning Parquet files containing `ShortType` columns. This
behavior can be disabled by setting
- `spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType`
columns are always safe because they can
- only come from signed `INT8`, where truncation preserves the signed value.
-- No support for default values that are nested types (e.g., maps, arrays,
structs). Literal default values are supported.
-- No support for datetime rebasing detection or the
`spark.comet.exceptionOnDatetimeRebase` configuration. When reading
- Parquet files containing dates or timestamps written before Spark 3.0 (which
used a hybrid Julian/Gregorian calendar),
- the `native_comet` implementation can detect these legacy values and either
throw an exception or read them without
- rebasing. The DataFusion-based implementations do not have this detection
capability and will read all dates/timestamps
- as if they were written using the Proleptic Gregorian calendar. This may
produce incorrect results for dates before
- October 15, 1582.
-- No support for Spark's Datasource V2 API. When
`spark.sql.sources.useV1SourceList` does not include `parquet`,
- Spark uses the V2 API for Parquet scans. The DataFusion-based
implementations only support the V1 API, so Comet
- will fall back to `native_comet` when V2 is enabled.
-
-The `native_datafusion` scan has some additional limitations:
+Comet currently has two distinct implementations of the Parquet scan operator.
+
+The two implementations are `native_datafusion` and `native_iceberg_compat`.
They both delegate to DataFusion's
+`DataSourceExec`. The main difference between these implementations is that
`native_datafusion` runs fully natively, and
+`native_iceberg_compat` is a hybrid JVM/Rust implementation that can support
some Spark features that
+`native_datafusion` can not, but has some performance overhead due to crossing
the JVM/Rust boundary.
+
+The configuration property
+`spark.comet.scan.impl` is used to select an implementation. The default
setting is `spark.comet.scan.impl=auto`, which
+currently always uses the `native_iceberg_compat` implementation. Most users
should not need to change this setting.
+However, it is possible to force Comet to try and use a particular
implementation for all scan operations by setting
+this configuration property to one of the following implementations.
+
+The following unsupported features are shared by both scans and cause Comet to
fall back to Spark:
+
+- `ShortType` columns, by default. When reading Parquet files written by
systems other than Spark that contain
+ columns with the logical type `UINT_8` (unsigned 8-bit integers), Comet may
produce different results than Spark.
+ Spark maps `UINT_8` to `ShortType`, but Comet's Arrow-based readers respect
the unsigned type and read the data as
+ unsigned rather than signed. Since Comet cannot distinguish `ShortType`
columns that came from `UINT_8` versus
+ signed `INT16`, by default Comet falls back to Spark when scanning Parquet
files containing `ShortType` columns.
+ This behavior can be disabled by setting
`spark.comet.scan.unsignedSmallIntSafetyCheck=false`. Note that `ByteType`
+ columns are always safe because they can only come from signed `INT8`, where
truncation preserves the signed value.
+- Default values that are nested types (e.g., maps, arrays, structs). Literal
default values are supported.
+- Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not
include `parquet`, Spark uses the
+ V2 API for Parquet scans. The DataFusion-based implementations only support
the V1 API.
+- Spark metadata columns (e.g., `_metadata.file_path`)
+- No support for Dynamic Partition Pruning (DPP)
+
+The following shared limitation may produce incorrect results without falling
back to Spark:
+
+- No support for datetime rebasing detection or the
`spark.comet.exceptionOnDatetimeRebase` configuration. When
+ reading Parquet files containing dates or timestamps written before Spark
3.0 (which used a hybrid
+ Julian/Gregorian calendar), dates/timestamps will be read as if they were
written using the Proleptic Gregorian
+ calendar. This may produce incorrect results for dates before October 15,
1582.
+
+The `native_datafusion` scan has some additional limitations. All of these
cause Comet to fall back to Spark.
Review Comment:
These additional limitations are mostly related to Parquet metadata columns.
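
Since the new limitations list above names two configuration properties, here is a hedged Scala sketch of how they might be set. The key names come from the doc text; the values shown are illustrative choices, and the rest of the session setup is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch of the two properties named in the limitations above.
val spark = SparkSession.builder()
  .appName("comet-scan-config-example") // hypothetical
  .master("local[*]")
  // Spark's default useV1SourceList already includes parquet; the override is
  // shown only to make explicit that the DataFusion-based scans need the V1 API.
  .config("spark.sql.sources.useV1SourceList", "parquet")
  // Disable the ShortType safety check only if the files being read are known
  // not to contain UINT_8 columns written by non-Spark systems.
  .config("spark.comet.scan.unsignedSmallIntSafetyCheck", "false")
  .getOrCreate()
```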
##########
docs/source/contributor-guide/ffi.md:
##########
@@ -177,9 +177,8 @@ message Scan {
#### When ownership is NOT transferred to native:
-If the data originates from `native_comet` scan (deprecated, will be removed in a future release) or from
-`native_iceberg_compat` in some cases, then ownership is not transferred to native and the JVM may re-use the
-underlying buffers in the future.
+If the data originates from a scan that uses mutable buffers (such Iceberg scans using the Iceberg Java integration path),
Review Comment:
"Iceberg Java" is a bit ambiguous, since the native reader still needs Iceberg Java for planning. We could link to the hybrid reader here:
https://datafusion.apache.org/comet/user-guide/latest/iceberg.html#hybrid-reader
I am also interested in standardizing terminology by referring to this codepath as a legacy path.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]