comphead commented on code in PR #1657: URL: https://github.com/apache/datafusion-comet/pull/1657#discussion_r2050792116
##########
docs/source/user-guide/compatibility.md:
##########

```diff
@@ -34,25 +34,36 @@ This guide offers information about areas of functionality where there are known

 Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
 `spark.comet.scan.impl` is used to select an implementation.

-| Implementation          | Description |
-| ----------------------- | ----------- |
-| `native_comet`          | This is the default implementation. It provides strong compatibility with Spark but does not support complex types. |
-| `native_datafusion`     | This implementation delegates to DataFusion's `ParquetExec`. |
-| `native_iceberg_compat` | This implementation also delegates to DataFusion's `ParquetExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future. |
-
-The new (and currently experimental) `native_datafusion` and `native_iceberg_compat` scans are being added to
-provide the following benefits over the `native_comet` implementation:
-
-- Leverage the DataFusion community's ongoing improvements to `ParquetExec`
-- Provide support for reading complex types (structs, arrays, and maps)
-- Remove the use of reusable mutable-buffers in Comet, which is complex to maintain
-
-These new implementations are not fully implemented. Some of the current limitations are:
-
-- Scanning Parquet files containing unsigned 8 or 16-bit integers can produce results that don't match Spark. By default, Comet
-will fall back to Spark when using these scan implementations to read Parquet files containing 8 or 16-bit integers.
-This behavior can be disabled by setting `spark.comet.scan.allowIncompatible=true`.
-- These implementations do not yet fully support timestamps, decimals, or complex types.
+| Implementation          | Description |
+| ----------------------- | ----------- |
+| `native_comet`          | This is the default implementation. It provides strong compatibility with Spark but does not support complex types. |
+| `native_datafusion`     | This implementation delegates to DataFusion's `DataSourceExec`. |
+| `native_iceberg_compat` | This implementation also delegates to DataFusion's `DataSourceExec` but uses a hybrid approach of JVM and native code. This scan is designed to be integrated with Iceberg in the future. |
+
+The new (and currently experimental) `native_datafusion` and `native_iceberg_compat` scans provide the following benefits over the `native_comet`
+implementation:
+
+- Leverages the DataFusion community's ongoing improvements to `DataSourceExec`
+- Provides support for reading complex types (structs, arrays, and maps)
+- Removes the use of reusable mutable-buffers in Comet, which is complex to maintain
+- Improves performance
+
+The new scans currently have the following limitations:
+
+- When reading Parquet files written by systems other than Spark that contain columns with the logical types `UINT_8`
+  or `UINT_16`, Comet will produce different results than Spark because Spark does not preserve or understand these
+  logical types. Arrow-based readers, such as DataFusion and Comet do respect these types and read the data as unsigned
+  rather than signed. By default, Comet will fall back to Spark when scanning Parquet files containing byte or short
```

Review Comment (on the line `rather than signed. By default, Comet will fall back to Spark when scanning Parquet files containing byte or short`):

```suggestion
  rather than signed.
  By default, Comet will fall back to Spark when scanning Parquet files containing `byte` or `short`
```

Review Comment (on the line `- When reading Parquet files written by systems other than Spark that contain columns with the logical types UINT_8`):

is it types in Parquet? I cannot find them in https://parquet.apache.org/docs/file-format/types/

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
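The documentation changes under review describe selecting a scan implementation via `spark.comet.scan.impl` and opting in to known incompatibilities via `spark.comet.scan.allowIncompatible`. A minimal PySpark sketch of such a session setup follows; the property names come from the document above, while the app name is illustrative and the plugin class and overall session wiring are assumptions about a typical Comet deployment (requires a Spark installation with the Comet jar, so it is shown as a configuration fragment only):

```python
from pyspark.sql import SparkSession

# Configuration sketch (assumed deployment details, not from the review thread):
spark = (
    SparkSession.builder
    .appName("comet-scan-demo")  # hypothetical app name
    .config("spark.plugins", "org.apache.spark.CometPlugin")  # assumed Comet plugin class
    # Select the experimental DataFusion-backed scan discussed above:
    .config("spark.comet.scan.impl", "native_datafusion")
    # Accept the documented unsigned-integer incompatibility instead of
    # falling back to Spark's scan for files containing byte/short columns:
    .config("spark.comet.scan.allowIncompatible", "true")
    .getOrCreate()
)
```

The same properties can equally be passed as `--conf` flags to `spark-submit`; nothing in the sketch depends on setting them programmatically.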
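The `UINT_8`/`UINT_16` limitation discussed in the diff boils down to the same physical bits being interpreted as unsigned by Arrow-based readers but as signed `byte`/`short` by Spark, which ignores the unsigned annotation. A small self-contained Python sketch of that reinterpretation, independent of Spark, Comet, or Parquet itself (the helper names are invented for illustration):

```python
import struct

def as_signed_byte(u: int) -> int:
    """Reinterpret an unsigned 8-bit value as a signed byte, the way a
    reader that ignores the unsigned annotation would see it."""
    return struct.unpack("b", struct.pack("B", u))[0]

def as_signed_short(u: int) -> int:
    """Reinterpret an unsigned 16-bit value as a signed short."""
    return struct.unpack("h", struct.pack("H", u))[0]

# An Arrow-based reader returns 200 for a UINT_8 value of 200,
# while a signed interpretation of the same byte yields -56:
print(as_signed_byte(200))     # -56
print(as_signed_short(40000))  # -25536
print(as_signed_byte(100))     # 100 (values below 128 agree either way)
```

Values that fit in the signed range round-trip identically, which is why the discrepancy only shows up for large unsigned values and why a blanket fallback to Spark is the conservative default.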