mapleFU commented on PR #36866:
URL: https://github.com/apache/arrow/pull/36866#issuecomment-1651711260

   Oops, I found a `HasCorrectStatistics` check here:
   
   ```c++
   // Reference:
   // parquet-mr/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java
   // PARQUET-686 has more discussion on statistics
   bool ApplicationVersion::HasCorrectStatistics(Type::type col_type,
                                                 EncodedStatistics& statistics,
                                                 SortOrder::type sort_order) const {
     // parquet-cpp version 1.3.0 and parquet-mr 1.10.0 onwards stats are computed
     // correctly for all types
     if ((application_ == "parquet-cpp" && VersionLt(PARQUET_CPP_FIXED_STATS_VERSION())) ||
         (application_ == "parquet-mr" && VersionLt(PARQUET_MR_FIXED_STATS_VERSION()))) {
       // Only SIGNED are valid unless max and min are the same
       // (in which case the sort order does not matter)
       bool max_equals_min = statistics.has_min && statistics.has_max
                                 ? statistics.min() == statistics.max()
                                 : false;
       if (SortOrder::SIGNED != sort_order && !max_equals_min) {
         return false;
       }
   
       // Statistics of other types are OK
       if (col_type != Type::FIXED_LEN_BYTE_ARRAY && col_type != Type::BYTE_ARRAY) {
         return true;
       }
     }
     // created_by is not populated, which could have been caused by
     // parquet-mr during the same time as PARQUET-251, see PARQUET-297
     if (application_ == "unknown") {
       return true;
     }
   
     // Unknown sort order has incorrect stats
     if (SortOrder::UNKNOWN == sort_order) {
       return false;
     }
   
     // PARQUET-251
     if (VersionLt(PARQUET_251_FIXED_VERSION())) {
       return false;
     }
   
     return true;
   }
   ```
   
   So although the `Statistics` parsed from `ColumnMetadata` may be wrong,
   calling `ColumnChunkMetadata::statistics()` will find that the file was
   written by an old writer and discard them.
   
   Should I keep this code to avoid generating `Statistics` with an ambiguous
   min/max, or just leave the code as it is? @pitrou @wgtmac
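
To make the question concrete, here is a rough, self-contained sketch of the "parse first, discard later" behavior described above. It is only an illustration under assumptions: `Writer`, `RawStats`, `StatsLookCorrect`, and `UsableStats` are hypothetical stand-ins rather than Arrow/Parquet APIs, and the check is deliberately simplified (it omits the `created_by`, column-type, and PARQUET-251 branches of the real function).

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical, simplified stand-ins for illustration only; these are NOT
// the Arrow/Parquet types.
enum class SortOrder { kSigned, kUnsigned, kUnknown };

struct RawStats {
  std::optional<std::string> min;  // encoded min, if present
  std::optional<std::string> max;  // encoded max, if present
};

struct Writer {
  bool predates_stats_fix;  // stands in for the VersionLt(...) checks
};

// Simplified gate: an old writer is trusted only for SIGNED sort order, or
// when min == max, in which case the ordering cannot matter.
bool StatsLookCorrect(const Writer& writer, const RawStats& stats,
                      SortOrder order) {
  if (writer.predates_stats_fix) {
    const bool max_equals_min =
        stats.min && stats.max && *stats.min == *stats.max;
    if (order != SortOrder::kSigned && !max_equals_min) return false;
  }
  // An unknown sort order never yields usable statistics in this sketch.
  if (order == SortOrder::kUnknown) return false;
  return true;
}

// The raw statistics are always parsed; they are dropped here if the gate
// rejects them, so callers never see an ambiguous min/max.
std::optional<RawStats> UsableStats(const Writer& writer,
                                    const RawStats& stats, SortOrder order) {
  if (!StatsLookCorrect(writer, stats, order)) return std::nullopt;
  return stats;
}

// Small usage example covering the main cases.
void Demo() {
  const RawStats range{std::string("a"), std::string("z")};
  const RawStats single{std::string("a"), std::string("a")};
  const Writer old_writer{true};
  const Writer new_writer{false};

  assert(!UsableStats(old_writer, range, SortOrder::kUnsigned).has_value());
  assert(UsableStats(old_writer, single, SortOrder::kUnsigned).has_value());
  assert(UsableStats(old_writer, range, SortOrder::kSigned).has_value());
  assert(UsableStats(new_writer, range, SortOrder::kUnsigned).has_value());
  assert(!UsableStats(new_writer, range, SortOrder::kUnknown).has_value());
}
```

In this sketch the suspect min/max is never visible past `UsableStats`, which is the situation the question is about: if the accessor already filters, an extra guard at parse time would be redundant but harmless.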
