[I] Incorrect results when reading parquet files with different schemas and statistics [arrow-datafusion]

via GitHub Wed, 13 Dec 2023 11:48:37 -0800


alamb opened a new issue, #8532:
URL: https://github.com/apache/arrow-datafusion/issues/8532


   ### Describe the bug
   
   There is a problem where the statistics for the wrong columns can be used to 
prune parquet row groups, leading to potentially incorrect answers (missing 
data)
   
   It affects cases where there are predicates that can be used to prune row 
groups
   
   ### To Reproduce
   
   I am working on a self contained reproducer
   
   Here is what I found in IOx: for the file that is being pruned incorrectly, 
the state column is a string with the value `MA` but the max value from the 
statistics that is returned is as follows (it turns out it is the statistics 
for the `time` column):
   ```
   +-------+
   | state |
   +-------+
   | 50    |
   +-------+
   ```
   
   I dug into the issue some more, and I found that It appears to be a bug in 
how the column index names are looked up by the `parquet_column` function (that 
we sadly added to handle another bug 🤦 )
   
   Here is the arrow schema:
   
   ```text
   Schema {
     fields: [
       Field { name: "area", data_type: UInt64, nullable: true },
       Field { name: "city", data_type: Dictionary(Int32, Utf8) },
       Field { name: "max_temp", data_type: Float64, nullable: true },
       Field { name: "min_temp", data_type: Float64, nullable: true },
       Field { name: "no_overlap", data_type: Float64, nullable: true },
       Field { name: "state", data_type: Dictionary(Int32, Utf8), nullable: 
true },  <--- index 5
       Field { name: "time", data_type: Timestamp(Nanosecond, None), nullable: 
false },
     ]
   }
   ```
   Note the `state` column is index 5
   
   The code in parquet_column resolves column `5` to index 5 in the parquet 
schema
   
   The paruqet_file in question has this schema
   
   ```
   GroupType { basic_info: BasicTypeInfo { name: "arrow_schema", repetition: 
None, converted_type: NONE, logical_type: None, id: None },
     fields: [
       PrimitiveType { basic_info: BasicTypeInfo { name: "area", repetition: 
Some(OPTIONAL), converted_type: UINT_64, logical_type: Some(Integer { 
bit_width: 64, is_signed: false }), id: None }, physical_type: INT64, 
type_length: -1, scale: -1, precision: -1 },
       PrimitiveType { basic_info: BasicTypeInfo { name: "city", repetition: 
Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, 
physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 },
       PrimitiveType { basic_info: BasicTypeInfo { name: "max_temp", 
repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None 
}, physical_type: DOUBLE, type_length: -1, scale: -1, precision: -1 },
       PrimitiveType { basic_info: BasicTypeInfo { name: "min_temp", 
repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None 
}, physical_type: DOUBLE, type_length: -1, scale: -1, precision: -1 },
       PrimitiveType { basic_info: BasicTypeInfo { name: "no_overlap", 
repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None 
}, physical_type: DOUBLE, type_length: -1, scale: -1, precision: -1 },
       PrimitiveType { basic_info: BasicTypeInfo { name: "state", repetition: 
Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, 
physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 },
       PrimitiveType { basic_info: BasicTypeInfo { name: "time", repetition: 
Some(REQUIRED), converted_type: NONE, logical_type: Some(Timestamp { 
is_adjusted_to_u_t_c: false, unit: NANOS(NanoSeconds) }), id: None }, 
physical_type: INT64, type_length: -1, scale: -1, precision: -1 }
    ]
   }
   ```
   Note that `state` is actually at offset `6` in this file, which leads to the 
statistics being incorrectly read and therefore pruned.
   
   
   ### Expected behavior
   
   The correct statistics should be used
   
   ### Additional context
   
   Kudos to @appletreeisyellow for helping track down this issue. 
   
   This was introduced in https://github.com/apache/arrow-datafusion/pull/8294 
/ 
https://github.com/apache/arrow-datafusion/commit/06bbe1298fa8aa042b6a6462e55b2890969d884a
 by myself and @tustvold 
   
   At the core, the problem is that previously parquet statistics were looked 
up by name using the parquet schema. We added a lookup on arrow schema first to 
make sure the right parquet field was found, but we used the wrong arrow 
scheam. 
   
   @tustvold pointed out that this is incorrect for nested types 
(https://github.com/apache/arrow-datafusion/issues/8335), and thus we fixed the 
problem by doing the lookup by name in the arrow schema, and then mapping to 
the parquet schema by column index and checking that it was not part of a 
nested type. 
   
   However, this mapping failed to account for the fact that each individual 
parquet file may have a different schema than the overall table (and 
ParquetExec handles this mapping)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] Incorrect results when reading parquet files with different schemas and statistics [arrow-datafusion]

Reply via email to