andygrove commented on PR #3500:
URL: 
https://github.com/apache/datafusion-comet/pull/3500#issuecomment-3892756243

   ## Root Cause Analysis: `ParquetFieldIdIOSuite` Failure
   
   ### Summary
   
   The new `ParquetFieldIdIOSuite` failure is caused by the projection 
computation change in `parquet_exec.rs`. This affects **all scan modes** (auto, 
native_iceberg_compat, native_datafusion), not just native_datafusion.
   
   ### Key finding: `NativeBatchReader` calls `init_datasource_exec`
   
   The `NativeBatchReader` (used by auto/iceberg_compat mode) initializes via 
JNI at 
[`mod.rs:764`](https://github.com/apache/datafusion-comet/blob/dc2b9a459/native/core/src/parquet/mod.rs#L764):
   
   ```rust
   let scan = init_datasource_exec(
       required_schema,
       Some(data_schema),   // always Some
       None,                // partition_schema
       None,                // partition_fields
       object_store_url,
       file_groups,
       None,                // projection_vector is always None
       data_filters,
       ...
   );
   ```
   
   This means `data_schema = Some(...)` and `projection_vector = None` — which 
is the **exact new code path** added by this PR.
   
   ### What changed
   
   **Before (base df52):**
   ```rust
   let base_schema = match (&data_schema, &projection_vector) {
       (Some(schema), Some(_)) => Arc::clone(schema),
       _ => Arc::clone(&required_schema),  // ← (Some, None) falls here: uses 
required_schema, no projection
   };
   ```
   
   **After (this PR):**
   ```rust
   let (base_schema, projection) = match (&data_schema, &projection_vector) {
       (Some(schema), Some(proj)) => (Arc::clone(schema), Some(proj.clone())),
       (Some(schema), None) => {
           // ← NEW: computes projection by NAME matching required_schema → 
data_schema
           let projection: Vec<usize> = required_schema.fields().iter()
               .filter_map(|req_field| {
                   schema.fields().iter().position(|data_field| {
                       data_field.name() == req_field.name()  // name-based 
lookup
                   })
               })
               .collect();
           (Arc::clone(schema), Some(projection))
       }
       _ => (Arc::clone(&required_schema), None),
   };
   ```
   
   ### Why it fails
   
   When Parquet **field IDs** are used, the `required_schema` field names (from 
Spark's read schema) can differ from the `data_schema` field names (from the 
Parquet file schema). The name-based `position()` lookup fails to find matches, 
producing a projection vector that is shorter than expected (or empty). This 
causes the downstream `index out of bounds` panic in `currentColumnBatch` when 
`NativeBatchReader` tries to access columns that don't exist in the batch.
   
   The specific errors observed:
   - `CometNativeException: index out of bounds: the len is 0 but the index is 
0`
   - `CometNativeException: index out of bounds: the len is 1 but the index is 
2`
   
   ### Verification
   
   - Run at base commit 
[`dc2b9a459`](https://github.com/apache/datafusion-comet/actions/runs/21949974183)
 (without PR): `ParquetFieldIdIOSuite` **passes** (919ms)
   - Run at PR commit 
[`b08c3a00b`](https://github.com/apache/datafusion-comet/actions/runs/21954911628)
 (with PR): `ParquetFieldIdIOSuite` **fails** (144ms)
   
   The "absence of field ids" tests also fail because the name-based projection 
can produce incorrect column ordering even without field IDs, when the 
`required_schema` and `data_schema` have columns in different orders.
   
   ### Suggested fix
   
   The `(Some(schema), None)` case needs to account for the fact that column 
correspondence between `required_schema` and `data_schema` may not be 
name-based (e.g., field ID mapping). Consider either:
   1. Falling back to the original behavior (`required_schema` as base, no 
projection) when names don't fully match
   2. Using the same column mapping logic that the Java `NativeBatchReader` 
uses (which handles field IDs)
   3. Only applying the new projection logic when called from 
`native_datafusion` (not from `NativeBatchReader`)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to