mapleFU commented on PR #36866:
URL: https://github.com/apache/arrow/pull/36866#issuecomment-1651711260
Oops, I found `HasCorrectStatistics` here:
```c++
// Reference:
// parquet-mr/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java
// PARQUET-686 has more discussion on statistics
bool ApplicationVersion::HasCorrectStatistics(Type::type col_type,
                                              EncodedStatistics& statistics,
                                              SortOrder::type sort_order) const {
  // parquet-cpp version 1.3.0 and parquet-mr 1.10.0 onwards stats are computed
  // correctly for all types
  if ((application_ == "parquet-cpp" &&
       VersionLt(PARQUET_CPP_FIXED_STATS_VERSION())) ||
      (application_ == "parquet-mr" &&
       VersionLt(PARQUET_MR_FIXED_STATS_VERSION()))) {
    // Only SIGNED are valid unless max and min are the same
    // (in which case the sort order does not matter)
    bool max_equals_min = statistics.has_min && statistics.has_max
                              ? statistics.min() == statistics.max()
                              : false;
    if (SortOrder::SIGNED != sort_order && !max_equals_min) {
      return false;
    }
    // Statistics of other types are OK
    if (col_type != Type::FIXED_LEN_BYTE_ARRAY && col_type != Type::BYTE_ARRAY) {
      return true;
    }
  }
  // created_by is not populated, which could have been caused by
  // parquet-mr during the same time as PARQUET-251, see PARQUET-297
  if (application_ == "unknown") {
    return true;
  }
  // Unknown sort order has incorrect stats
  if (SortOrder::UNKNOWN == sort_order) {
    return false;
  }
  // PARQUET-251
  if (VersionLt(PARQUET_251_FIXED_VERSION())) {
    return false;
  }
  return true;
}
```
So although the `Statistics` parsed from `ColumnMetadata` may be wrong,
`ColumnChunkMetadata::statistics()` will detect that the file was written by
an old writer and discard them.
Should I keep this code to avoid generating `Statistics` with an ambiguous
min/max? Or just leave the code as it is? @pitrou @wgtmac
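
To make the "discard on read" behaviour concrete, here is a rough, standalone sketch. It is not the Arrow code: `WriterInfo`, `EncodedStats`, `StatsLookTrustworthy` and `UsableStatistics` are hypothetical names, `before_stats_fix` stands in for the `VersionLt(...)` checks, and the `col_type`, `"unknown"` application and PARQUET-251 branches are deliberately left out.

```c++
#include <optional>
#include <string>

// Hypothetical, simplified stand-ins; these are NOT the real Parquet types.
enum class SortOrder { SIGNED, UNSIGNED, UNKNOWN };

struct EncodedStats {
  std::optional<std::string> min;
  std::optional<std::string> max;
};

struct WriterInfo {
  std::string application;  // e.g. "parquet-mr" or "parquet-cpp"
  bool before_stats_fix;    // assumption: replaces the VersionLt(...) checks
};

// Simplified trust check: for old writers, only SIGNED ordering is trusted
// unless min == max (in which case the sort order does not matter).
bool StatsLookTrustworthy(const WriterInfo& writer, const EncodedStats& stats,
                          SortOrder order) {
  const bool known_writer = writer.application == "parquet-cpp" ||
                            writer.application == "parquet-mr";
  if (known_writer && writer.before_stats_fix) {
    const bool max_equals_min =
        stats.min && stats.max && *stats.min == *stats.max;
    if (order != SortOrder::SIGNED && !max_equals_min) {
      return false;
    }
  }
  // An unknown sort order never yields usable min/max.
  return order != SortOrder::UNKNOWN;
}

// Reader-side accessor: hands back statistics only when they pass the check,
// so ambiguous min/max values never reach the caller.
std::optional<EncodedStats> UsableStatistics(const WriterInfo& writer,
                                             const EncodedStats& stats,
                                             SortOrder order) {
  if (!StatsLookTrustworthy(writer, stats, order)) {
    return std::nullopt;
  }
  return stats;
}
```

In this sketch the filtering happens at the accessor, which mirrors the point above: the parsed values may still be wrong internally, but the caller only ever sees them when the writer and sort order make them trustworthy.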